首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
This paper describes our work on developing a language-independent technique for discovery of implicit knowledge from multilingual information sources. Text mining has been gaining popularity in the knowledge discovery field, particularity with the increasing availability of digital documents in various languages from all around the world. However, currently most text mining tools mainly focus only on processing monolingual documents (particularly English documents): little attention has been paid to apply the techniques to handle the documents in Asian languages, and further extend the mining algorithms to support the aspects of multilingual information sources. In this work, we attempt to develop a language-neutral method to tackle the linguistics difficulties in the text mining process. Using a variation of automatic clustering techniques, which apply a neural net approach, namely the Self-Organizing Maps (SOM), we have conducted several experiments to uncover associated documents based on a Chinese corpus, Chinese-English bilingual parallel corpora, and a hybrid Chinese-English corpus. The experiments show some interesting results and a couple of potential paths for future work in the field of multilingual information discovery. Besides, this work is expected to act as a starting point for exploring the impacts on linguistics issues with the machine-learning approach to mining sensible linguistics elements from multilingual text collections.  相似文献   

2.
Text Retrieval from Document Images Based on Word Shape Analysis   总被引:2,自引:1,他引:2  
In this paper, we propose a method of text retrieval from document images using a similarity measure based on word shape analysis. We directly extract image features instead of using optical character recognition. Document images are segmented into word units and then features called vertical bar patterns are extracted from these word units through local extrema points detection. All vertical bar patterns are used to build document vectors. Lastly, we obtain the pair-wise similarity of document images by means of the scalar product of the document vectors. Four corpora of news articles were used to test the validity of our method. During the test, the similarity of document images using this method was compared with the result of ASCII version of those documents based on the N-gram algorithm for text documents.  相似文献   

3.
文档检索是自然语言处理的研究热点,相对于短文本文档具有信息丰富且冗长的特征。在长文本检索中,查询语句与长文本中的句子往往不是全部相关,可能会出现某些高相似片段的强干扰,因此查询语句与文档之间的相关性评分不能简单采用基于词语或字符串之间的相似度计算。提出了一种文本片段化机制(TSM)进行文档检索,首先将每个候选文档划分成片段,再计算查询语句与文档片段之间的相关度,所使用的相关度匹配方案考虑了语义和词频等因素,筛选出关键的文本片段并得出相关片段比率,综合这些片段信息计算查询与文档之间的相关性得分,从而获取Top-K文档集。针对Glasgow信息检索专用数据集的实验结果表明,利用文本片段化机制进行文本匹配可以提高信息检索的性能。  相似文献   

4.
基于链接描述文本及其上下文的Web信息检索   总被引:20,自引:0,他引:20  
文档之间的超链接结构是Web信息检索和传统信息检索的最大区别之一,由此产生了基于超链接结构的检索技术。描述了链接描述文档的概念,并在此基础上研究链接文本(anchor text)及其上下文信息在检索中的作用。通过使用超过169万篇网页的大规模真实数据集以及TREC 2001提供的相关文档及评价方法进行测试,得到如下结论:首先,链接描述文档对网页主题的概括有高度的精确性,但是对网页内容的描述有极大的不完全性;其次,与传统检索方法相比,使用链接文本在已知网页定位的任务上能够使系统性能提高96%,但是链接文本及其上下文信息无法在未知信息查询任务上改善检索性能;最后,把基于链接描述文本的方法与传统方法相结合,能够在检索性能上提高近16%。  相似文献   

5.
Word sense disambiguation automatically determines the appropriate senses of a word in context. We have previously shown that self-organized document maps have properties similar to a large-scale semantic structure that is useful for word sense disambiguation. This work evaluates the impact of different linguistic features on self-organized document maps for word sense disambiguation. The features evaluated are various qualitative features, e.g. part-of-speech and syntactic labels, and quantitative features, e.g. cut-off levels for word frequency. It is shown that linguistic features help make contextual information explicit. If the training corpus is large even contextually weak features, such as base forms, will act in concert to produce sense distinctions in a statistically significant way. However, the most important features are syntactic dependency relations and base forms annotated with part of speech or syntactic labels. We achieve 62.9% ± 0.73% correct results on the fine grained lexical task of the English SENSEVAL-2 data. On the 96.7% of the test cases which need no back-off to the most frequent sense we achieve 65.7% correct results.  相似文献   

6.
检索一篇文档在其他语言中的译文对于双语平行语料库的建立是一件很有意义的工作。本文提出一种改进的跨语言相似文档检索算法,该算法使用双语词典或统计翻译模型作为双语知识库,查找两篇文档的共同翻译词对,把翻译词对的权重作为一种特征来进行相似度计算,用Dice方法的改进算法计算双语文档的相似度。在实验中,统计检索文档的译文排在检索结果前 N位的总次数来评价算法的性能,并使用了两个噪音数据集来评价算法的有效性。实验表明,在噪音数据干扰比较大的情况下,译文排在检索结果前5位的译文结果接近90%。实验证明,翻译词对的权重对于相似度计算有很大帮助,本算法可以有效地发现一种语言书写的文档在另一种语言中的译稿。  相似文献   

7.
The Cambridge University Multimedia Document Retrieval (CU-MDR) Demo System is a web-based application that allows the user to query a database of radio broadcasts that are available on the Internet. The audio from several radio stations is downloaded and transcribed automatically. This gives a collection of text and audio documents that can be searched by a user. The paper describes how speech recognition and information retrieval techniques are combined in the CU-MDR Demo System and shows how the user can interact with it.  相似文献   

8.
一种基于锚文本的并行检索策略   总被引:1,自引:0,他引:1       下载免费PDF全文
高珊  何婷婷  胡文敏 《计算机工程》2008,34(19):30-31,3
进行Web信息检索时,页面中的锚文本与正文存在较大相关性,多数检索系统忽视了锚文本对页面正文的贡献。该文提出一种提高检索精度的方法,为文档集建立一个基于页面正文的索引和一个基于锚文本的索引,对其采取并行检索策略。实验结果表明,该方法可以有效处理特定结构的网页集。  相似文献   

9.
刘树安  于大鹏 《控制与决策》2001,16(Z1):805-807
在研究现有文本信息检索技术的基础上,设计了基于推理网络的文本检索模型.提出一种改进的推理算法,以实现从文档观察事件到索引词出现事件的推理,使新模型可以更全面地利用文本数据信息.最后通过一个推理网络实例来说明实现推理的数学过程.  相似文献   

10.
基于本体的Web文本挖掘与信息检索   总被引:1,自引:0,他引:1       下载免费PDF全文
艾伟  孙四明  张峰 《计算机工程》2010,36(22):75-77
针对传统Web文本挖掘技术缺少语义理解能力的不足,提出并实现一种基于本体的Web文本挖掘模型,即利用基于本体概念体系的向量空间模型替代传统的向量空间模型来表示文档,在此基础上进行Web文本挖掘,并给出一种集成语义信息检索的设计。实验结果初步验证了本体模型在Web文本挖掘技术上应用的可行性。  相似文献   

11.
基于文本的信息隐藏技术研究   总被引:2,自引:0,他引:2  
信息隐藏技术是近几年来信息安全领域出现的一种新技术,不同于传统的密码学技术。它主要研究如何将机密信息隐藏于另一公开的信息中,然后通过公开信息的传输来传递机密信息。作为信息隐藏载体的公开信息可以是一般的文本文件、数字图像、数字视频和数字音频等等。而该文主要介绍了现有的几种基于文本的信息隐藏技术。  相似文献   

12.
随着计算机软硬件技术的进步以及Hypertext模型的出现,使全文检索技术应用普及的可能性变为现实。本文分析了全文检索技术应用于图书情报领域信息管理的意义,并给出了实施全文检索机制的主要步骤和方法。  相似文献   

13.
李向辉  钟诚 《微机发展》2006,16(9):97-99
介绍文本文件信息隐藏的几种典型编码方法,并比较各种方法的信息隐藏量;分析Word文档的文件结构,提出一种通过字符缩放编码、字体RGB灰度编码、改变Word文本文档中字符下划线RGB灰度值来实现隐藏秘密信息的方法。理论分析和实验结果表明该方法能提高信息隐藏量。  相似文献   

14.
分析结构化文档的表示方法及检索特点,对一种用于结构化文档检索的贝叶斯网络进行研究。讨论该贝叶斯网络的构造方法、概率估计及推理过程。用网络节点表示文档索引术语和结构单元,用弧表示术语和结构单元的隶属关系,根据TF-IDF方法估计各节点的先验概率,当给定一个查询时,通过计算每个结构单元的条件概率得到该结构单元的相关值。实例验证了该贝叶斯网络的有效性。  相似文献   

15.
面向XML文档的概念检索技术   总被引:11,自引:1,他引:11  
孙登峰 《计算机应用》2003,23(1):110-112
面向XML文档的信息检索是一个重要的研究课题,文中介绍了结构化文档的结构索引以及语义检索中的“上下文共现分析”技术,并在此基础上提出了一个面向XML文档的概念检索原型系统,并对系统设计及实现中应注意考虑的几个主要问题进行了分析。  相似文献   

16.
The Self Organizing Map (SOM) algorithm has been utilized, with much success, in a variety of applications for the automatic organization of full-text document collections. A great advantage of the SOM method is that document collections can be ordered in such a way so that documents with similar content are positioned at nearby locations of the 2-dimensional SOM lattice. The resulting ordered map thus presents a general view of the document collection which helps the exploration of information contained in the whole document space. The most notable example of such an application is the WEBSOM method where the document collection is ordered onto a map by utilizing word category histograms for representing the documents data vectors. In this paper, we introduce the LSISOM method which resembles WEBSOM in the sense that the document maps are generated from word category histograms rather than simple histograms of the words. However, a major difference between the two methods is that in WEBSOM the word category histograms are formed using statistical information of short word contexts whereas in LSISOM these histograms are obtained from the SOM clustering of the Latent Semantic Indexing representation of document terms.  相似文献   

17.
中文Web文档库全文检索技术研究与实现   总被引:13,自引:0,他引:13  
全文检索是一种非常有效的信息检索技术,本文结合国家863项目《WWW文档协同写作系统》的设计与开发,研究对中文Web文档库实现全文检索的主要技术,着重讨论了字表法全文检索技术细节,最后介绍了一个实用的全文检索系统的实现。  相似文献   

18.
张刚  谭建龙 《软件学报》2008,19(1):136-143
分布式信息检索的文档集合划分方案的评价是一个困难的问题,目前还没有良好的评价标准.从文档集合划分问题本身出发,给出了两个划分模型来刻画文档集合划分问题,从而使这两个模型可以作为文档集合划分的有效评价指标.在此基础上,提出了一种类Huffman编码的模型快速求解算法,可以求出在给定查询测试集情况下的最优文档划分方案,该方案可以作为其他文档划分方案的参考.实验表明,两个文档划分模型可以成为有效的文档集合划分评价标准.  相似文献   

19.
目前对于查询相似度的计算通常是从比对检索结果与查询式的相似度来考虑。本文提出一种基于贝叶斯分类的算法来计算XML查询结果相似度。在计算出每个检索结果文档与查询式相似度的基础上,使用贝叶斯分类器将XML检索文档分类成相关与不相关两个集合,再由计算相关文档与不相关文档的相似度来决定最终的相似度值。最后,通过实验分析表明,在不影响查全率的前提下,这样得到的相似度计算精度比传统方法高15%左右,有效地提高了检索性能。  相似文献   

20.
In this paper we introduce a dynamic programming algorithm which performs linear text segmentation by global minimization of a segmentation cost function which incorporates two factors: (a) within-segment word similarity and (b) prior information about segment length. We evaluate segmentation accuracy of the algorithm by precision, recall and Beeferman's segmentation metric. On a segmentation task which involves Choi's text collection, the algorithm achieves the best segmentation accuracy so far reported in the literature. The algorithm also achieves high accuracy on a second task which involves previously unused texts.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号