Similar literature
 20 similar documents found (search time: 78 ms)
1.
Novelty detection aims at reducing redundant information in a chronologically ordered list of documents or sentences. Other studies of novelty detection have been conducted on the English language, but few papers have addressed the problem of multilingual novelty detection. Likewise, research in multilingual information retrieval has rarely been applied to novelty detection. This paper attempts to bridge the two disciplines by first describing the preprocessing steps for English, Malay and Chinese, then applying document-level and sentence-level novelty detection for the three languages on APWSJ and TREC 2004 Novelty Track data. Experiments on sentence-level novelty detection show similar results for all three languages, which indicates that our algorithm is suitable for multilingual novelty detection at the sentence level. However, results for document-level novelty detection show a disparity across the different languages, with English and Malay outperforming Chinese. After applying sentence-level novelty detection to detect novel documents, we observe substantial improvements on all three languages. This demonstrates that segmenting documents into sentences improves document-level novelty detection in multiple languages, and has practical benefits for a real-time multilingual novelty detection system.
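The abstract does not state which novelty measure is used, so the following is only a minimal sketch of one common approach: a sentence counts as novel when its cosine similarity to every earlier sentence stays below a threshold. The bag-of-words representation, the function names, and the threshold value of 0.6 are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def novel_sentences(sentences, threshold=0.6):
    """Return the sentences that are not too similar to any earlier one.

    `threshold` is a hypothetical value, not taken from the paper.
    Whitespace tokenization is assumed; Chinese text would need a
    word segmenter in the preprocessing step.
    """
    history = []  # term vectors of every sentence seen so far
    novel = []
    for sentence in sentences:
        vec = Counter(sentence.lower().split())
        if all(cosine(vec, seen) < threshold for seen in history):
            novel.append(sentence)
        history.append(vec)
    return novel
```

Sentences are processed in chronological order, so an exact repeat of an earlier sentence is filtered out while a sentence that shares no terms with the history is kept.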

2.
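This work studies how to make computers better at intelligently correcting the pronunciation errors of English learners. Using the HMM (hidden Markov model) modeling approach from speech recognition, the Viterbi algorithm and an improved posterior probability algorithm are applied to automatically recognize the English pronunciation of Chinese learners. The basic units are segmented and scored, and learners finally receive fairly reliable pronunciation feedback that helps them correct their errors.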
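As a rough illustration of the Viterbi decoding step mentioned above, the sketch below finds the most probable hidden state sequence of a small discrete HMM. It is a generic textbook formulation, not the paper's acoustic model: the states `S1`/`S2`, the observation symbols, and all probabilities are toy values assumed for the example.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observation sequence."""
    # trellis[t][s] = (probability of the best path ending in s, previous state)
    trellis = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for symbol in obs[1:]:
        prev = trellis[-1]
        column = {}
        for s in states:
            # Best predecessor r for state s at this time step.
            prob, best_prev = max((prev[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (prob * emit_p[s][symbol], best_prev)
        trellis.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: trellis[-1][s][0])
    path = [state]
    for column in reversed(trellis[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model (hypothetical numbers, chosen only to exercise the code).
STATES = ["S1", "S2"]
START = {"S1": 0.6, "S2": 0.4}
TRANS = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
EMIT = {
    "S1": {"x": 0.5, "y": 0.4, "z": 0.1},
    "S2": {"x": 0.1, "y": 0.3, "z": 0.6},
}
```

In a real pronunciation scorer the states would be sub-phone HMM states and the emission terms would come from acoustic likelihoods, usually in the log domain to avoid underflow.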

3.
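Feature extraction and classification of Chinese Web text   Total citations: 16, self-citations: 0, citations by others: 16
许建潮  胡明 《计算机工程》2005,31(8):24-25,39
Many methods exist for extracting features from English web pages, but comparatively few are suited to Chinese web pages. This paper designs a formula for computing word weights in Chinese Web text that jointly considers three factors: position, frequency, and word length. It also proposes a method for extracting Web text features with a variable-length-chromosome genetic algorithm. Experiments show that the method is effective at reducing the number of feature vectors.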

4.
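Chinese field matching algorithms   Total citations: 6, self-citations: 0, citations by others: 6
陈挺  郭颖  刘云超 《计算机工程》2003,29(13):118-119,124
The paper first introduces several English field matching algorithms, then presents a framework for the field matching process, and finally focuses on several matching algorithms for Chinese character-type fields.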

5.
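Automatic Chinese word segmentation based on multiple knowledge sources   Total citations: 5, self-citations: 0, citations by others: 5
A Chinese word segmentation method is proposed. Unlike statistical methods that rely on a single statistical property, or purely rule-based methods, this method draws on multiple knowledge sources (characters, words, context, syntax, and semantics) to assess the likelihood of a split at each gap in a string of Chinese characters. When ambiguity cannot be fully resolved, it derives the most probable segmentation through fuzzy synthesis. Users can modify the system as needed to suit the characteristics of different texts, and the system can accept feedback from the preceding and following lexical, syntactic, and semantic analysis stages. The method is therefore accurate, flexible, robust, and fast at backtracking.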

6.
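Combining empirical knowledge from image analysis practice with the basic ideas of granular computing, this work formulates feature discrete point computing and applies it to text line segmentation of natural handwritten Chinese text. Within the structured problem-solving framework of feature discrete point computing, a feedback-based, column-wise line-projection method for text line segmentation is proposed. It consists of four stages: feature discrete point selection, feature discrete point sampling and optimization, feature discrete point grouping and feedback, and line edge optimization. On the Harbin Institute of Technology multi-writer handwriting database, the method achieves better experimental results than earlier algorithms, and segmentation is also fast.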

7.
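A study of fuzzy kNN applied to text classification   Total citations: 1, self-citations: 0, citations by others: 1
Automatic text classification assigns class labels to new documents based on a training set of documents that already carry class labels. A series of experiments and analyses examines the performance of the fuzzy kNN algorithm for text classification. On two corpora, one Chinese and one English, four well-known text feature selection methods are used, and the improved fuzzy kNN method is compared experimentally with classical kNN and with the currently widely used similarity-weighted kNN. The results show that, under every feature selection method, the algorithm weakens the effect of unevenly distributed training samples on classification performance, improves classification accuracy, and to some extent reduces sensitivity to the value of k.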
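For orientation, here is a minimal sketch of the similarity-weighted kNN baseline named in the abstract, in which each of the k nearest training documents votes with its similarity score. It is not the paper's improved fuzzy kNN, whose membership formula the abstract does not give; the function name and the input format are assumptions.

```python
from collections import defaultdict


def weighted_knn(neighbors, k):
    """Classify by similarity-weighted voting.

    `neighbors` is a list of (similarity, label) pairs, one per training
    document, where similarity is measured against the new document.
    """
    top_k = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    scores = defaultdict(float)
    for similarity, label in top_k:
        scores[label] += similarity  # each neighbor votes with its similarity
    return max(scores, key=scores.get)
```

With the neighbors `[(0.9, "sports"), (0.45, "finance"), (0.4, "finance")]` and k = 3, a plain majority vote would pick "finance", whereas the weighted vote picks "sports" (0.9 against 0.85). This also shows how the outcome depends on k and on the class distribution among the neighbors.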

8.
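Using cross-lingual text data in Chinese, English, and Korean, this paper studies how to compute the similarity between documents in different languages. First, co-occurring word mapping is used to represent a document vector from one language space as a document vector in another language space. Second, latent semantic analysis is used to compensate for the vector gaps caused by polysemy across languages. Finally, the cosine similarity between two documents is computed in a single language space that carries equivalent semantic information. The work avoids external dictionaries and knowledge bases and instead uses an aligned corpus of the three languages to establish correspondences between words of different languages. The results show that co-occurring word mapping has a considerable effect on computing the similarity between documents in different languages. Retrieval accuracy for documents with the same meaning in different languages (that is, translations) reaches 95%, which confirms the effectiveness of the method.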

9.
Imaged document text retrieval without OCR   Total citations: 6, self-citations: 0, citations by others: 6
We propose a method for text retrieval from document images without the use of OCR. Documents are segmented into character objects. Image features, namely the vertical traverse density (VTD) and horizontal traverse density (HTD), are extracted. An n-gram-based document vector is constructed for each document based on these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese as well as images from the UW1 (University of Washington 1) database confirms the validity of the proposed method.

10.
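Commonly used XML compression algorithms do not take the characteristics of Chinese into account. Combining the characteristics of Chinese and of XML, this paper proposes COX, a high-compression-ratio algorithm suited to Chinese XML documents. Chinese word segmentation is applied to the XML document, word frequencies are counted to obtain a sorted dictionary, and high-frequency and long words are compressed using the idea of Huffman coding. After the XML document is parsed, its elements are classified and elements of the same type are placed in the same container. The algorithm also gives special treatment to numeric data. Experimental results show that COX compresses better than general-purpose compression software, although compression and decompression are somewhat slower.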
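The Huffman-coding idea behind the dictionary step can be sketched as follows: words that occur more often receive shorter bit strings. This is a generic Huffman construction over an already segmented word list, not the COX encoder itself; the function name and the string-of-bits output are assumptions for illustration.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(words):
    """Map each distinct word to a prefix-free bit string."""
    freq = Counter(words)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}  # a single symbol still needs one bit
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(f, next(tie), {w: ""}) for w, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)  # two least frequent subtrees
        f2, _, codes2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in codes1.items()}
        merged.update({w: "1" + c for w, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]
```

For a word list in which one word appears five times and two others appear twice and once, the frequent word gets a one-bit code and the other two get two-bit codes.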

11.
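A segmentation method for unconstrained handwritten numeral strings   Total citations: 11, self-citations: 1, citations by others: 11
To handle connected characters in unconstrained handwritten numeral strings, this paper proposes a numeral string segmentation method that relies mainly on recognition-based segmentation and combines several other strategies, including dissection and holistic recognition. The method targets numeral string segmentation directly, can also be applied to non-numeric character strings, and its segmentation approach offers some guidance for segmenting connected Chinese characters.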

12.
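Because Chinese and English differ in grammar and syntax, ontology learning from Chinese text still faces certain difficulties. This work studies an ontology learning method for maize diseases and pests based on Chinese text. A single-character merging method is proposed and combined with TFIDF for concept extraction. Concept similarity is computed as a weighted average of Euclidean distance and cosine distance, and is used for concept relation extraction. Fifty domain documents were selected from 中国玉米网 (China Maize Net), and the above method was applied to build a maize disease and pest ontology.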
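The weighted average of the two distances can be sketched as below. The abstract gives neither the weights nor any normalization, so the mixing weight `alpha`, its default of 0.5, and the use of raw (unnormalized) Euclidean distance are assumptions for illustration only.

```python
import math


def combined_distance(u, v, alpha=0.5):
    """Weighted average of Euclidean distance and cosine distance.

    `alpha` weights the Euclidean term and (1 - alpha) the cosine term;
    the value is hypothetical, not taken from the paper.
    """
    euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Cosine distance is 1 - cosine similarity; treat zero vectors as maximally distant.
    cosine_dist = 1.0 - dot / (norm_u * norm_v) if norm_u and norm_v else 1.0
    return alpha * euclidean + (1 - alpha) * cosine_dist
```

A smaller combined distance means the two concept vectors are more similar. Because Euclidean distance is unbounded while cosine distance lies in [0, 2], a practical system would probably rescale the two terms before averaging.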

13.
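Ambiguity resolution is an important factor in the segmentation accuracy of a word segmentation system and a core problem in the design of automatic word segmentation systems. This paper introduces a new word segmentation algorithm that uses two statistics, the mutual information between adjacent characters within a Chinese sentence and the t-information difference, to resolve the segmentation of ambiguous fields in automatic Chinese word segmentation. Experimental results show that the method effectively improves the accuracy of ambiguity resolution.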
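To make the first statistic concrete, the sketch below estimates the pointwise mutual information of adjacent characters, log2(p(xy) / (p(x) p(y))), from a small corpus and splits a sentence wherever that value falls at or below a threshold. It covers only the mutual information part and omits the t-information difference; the function names, the probability estimates, and the threshold are assumptions.

```python
import math
from collections import Counter


def adjacent_mi(corpus):
    """Build a function mi(x, y) from character and adjacent-pair counts."""
    chars, pairs = Counter(), Counter()
    for sentence in corpus:
        chars.update(sentence)
        pairs.update(zip(sentence, sentence[1:]))
    n_chars, n_pairs = sum(chars.values()), sum(pairs.values())

    def mi(x, y):
        if pairs[(x, y)] == 0:
            return float("-inf")  # never seen together: always a boundary
        p_xy = pairs[(x, y)] / n_pairs
        return math.log2(p_xy / ((chars[x] / n_chars) * (chars[y] / n_chars)))

    return mi


def segment(sentence, mi, threshold):
    """Join adjacent characters whose mutual information exceeds the threshold."""
    if not sentence:
        return []
    words = [sentence[0]]
    for x, y in zip(sentence, sentence[1:]):
        if mi(x, y) > threshold:
            words[-1] += y      # strongly associated: same word
        else:
            words.append(y)     # weakly associated: word boundary
    return words
```

On the toy corpus `["中国人民", "中国银行", "人民银行"]`, the pair 中国 scores higher than the pair 国人, so a threshold between the two values splits 中国人民 into 中国 and 人民.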

14.
This paper presents an approach for extracting and segmenting tables from Chinese ink documents based on a matrix model. An ink document is first modeled as a matrix containing ink rows, including writing and drawing ones. Each row consists of collinear ink lines containing ink characters. Together with their associated drawing rows, adjacent writing rows having an identical distribution of writing lines and/or the same associated drawing rows, if available, are extracted to form a table. Row and column headers, nested sub-headers and cells are identified. Experiments demonstrate that the proposed approach is more effective and robust.

15.
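English-Chinese cross-language information retrieval based on Lucene   Total citations: 8, self-citations: 0, citations by others: 8
This paper describes the design and implementation of an English-Chinese cross-language retrieval system. Its main research goals are to find more effective methods for English-Chinese query translation and to improve the performance of the Chinese retrieval system. For query translation, a query translation algorithm is built on an English-Chinese bilingual dictionary. For Chinese retrieval, the effect of different indexing units on retrieval performance is analyzed, and a search engine is built on the Lucene full-text indexing toolkit. For system evaluation, a method is proposed for quickly constructing evaluation data from topics.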

16.
Motivated by the need for the automatic indexing and analysis of a huge number of documents in Ottoman divan poetry, and for discovering new knowledge to preserve this heritage and keep it alive, in this study we propose a novel method for segmenting and retrieving words in Ottoman divans. Documents in Ottoman are difficult to segment into words without prior knowledge of the word. Divans exist in multiple copies (versions) by different writers in different writing styles, and word segmentation may be relatively easier to achieve in some of those versions than in others. Using this idea, segmentation of the harder versions (which is difficult, if not impossible, with traditional techniques) is performed using information carried over from the simpler version. One version of a document is used as the source dataset and the other version of the same document is used as the target dataset. Words in the source dataset are automatically extracted and used as queries to be spotted in the target dataset for detecting word boundaries. We present the idea of cross-document word matching for a novel task of segmenting historical documents into words. We propose a matching scheme based on possible combinations of sequences of sub-words. We improve the performance of simple features by considering the words in context. The method is applied to two versions of the Layla and Majnun divan by Fuzuli. The results show that the proposed word-matching-based segmentation method is promising in finding the word boundaries and in retrieving the words across documents.

17.
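An automatic Chinese word segmentation system based on adjacency knowledge   Total citations: 5, self-citations: 0, citations by others: 5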

18.
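To avoid the subjectivity and arbitrariness of "absolute" initial-final segmentation strategies, a new time-frequency method for overlapping initial-final segmentation of isolated Chinese characters is implemented by combining the spectrogram with the matching pursuit algorithm. The voicing onset determined from the spectrogram is taken as the start of the initial-final transition segment, and the position of the first extremum reached by the matching pursuit atom parameters after the voicing onset is taken as the end of the transition segment. Simulation results show that the segmentation accuracy of the method reaches 87.5%. When the segmented initial and final units are fed separately into a speech recognition system, the recognition rate improves by 1.33% compared with using the whole syllable as the recognition unit.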

19.
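Based on an analysis of existing word segmentation methods and the characteristics of Chinese, this paper proposes SSK, a word segmentation method supported by multiple kinds of knowledge. SSK uses a hierarchically structured dictionary so that each word can automatically generate all of its possible re-segmentations during matching, which makes ambiguity handling simple and effective when segmentation fails. SSK is supported not only by knowledge at the character and word levels but also by syntactic and semantic knowledge. Through syntactic and semantic checks, the method can promptly rule out some segmentation errors and reduce ambiguous segmentations. SSK also has a simple vocabulary learning capability, which improves segmentation accuracy.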

20.
The touching character segmentation problem becomes complex when touching strings are multi-oriented. Moreover, in graphical documents, characters in a single touching string sometimes have different orientations. Segmentation of such complex touching strings is more challenging. In this paper, we present a scheme for segmenting English multi-oriented touching strings into individual characters. When two or more characters touch, they generate a big cavity region in the background portion. Based on convex hull information, we first use this background information to find some initial points for segmenting a touching string into possible primitives (a primitive consists of a single character or part of a character). Next, the primitives are merged to get the optimum segmentation. A dynamic programming algorithm is applied for this purpose using the total likelihood of characters as the objective function. An SVM classifier is used to find the likelihood of a character. To handle multi-oriented touching strings, the features used in the SVM are invariant to character orientation. Experiments were performed on different databases of real and synthetic touching characters, and the results show that the method is efficient in segmenting touching characters of arbitrary orientations and sizes.
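The merging step can be illustrated with a small dynamic program that groups consecutive primitives so that the summed log-likelihood of the groups is maximal. This is a sketch under stated assumptions rather than the authors' implementation: the `score` callback stands in for the SVM character likelihood, `max_merge` is a hypothetical cap on how many primitives one character may span, and `toy_score` exists only to exercise the code.

```python
import math


def best_segmentation(primitives, score, max_merge=3):
    """Group consecutive primitives to maximize the total log-likelihood."""
    n = len(primitives)
    best = [float("-inf")] * (n + 1)  # best[i]: best total for primitives[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last group
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_merge), end):
            candidate = best[start] + score(primitives[start:end])
            if candidate > best[end]:
                best[end], back[end] = candidate, start
    # Recover the groups by walking the back-pointers from the end.
    groups, end = [], n
    while end > 0:
        groups.append(primitives[back[end]:end])
        end = back[end]
    return groups[::-1]


def toy_score(group):
    """Hypothetical stand-in for a classifier's log-likelihood of one character."""
    known = {("a1", "a2"), ("b",)}
    return math.log(0.9) if tuple(group) in known else math.log(0.1)
```

With the primitives `["a1", "a2", "b"]` and the toy scorer, merging the first two pieces into one character and keeping the third separate gives the highest total, so the result is `[["a1", "a2"], ["b"]]`.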
