
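Research on Biological Literature Information Retrieval Based on Cluster Language Model (基于聚类语言模型的生物文献检索技术研究)

Cite this article:	文健 (WEN Jian), 李舟军 (LI Zhou-jun). Research on Biological Literature Information Retrieval Based on Cluster Language Model [J]. 中文信息学报 (Journal of Chinese Information Processing), 2008, 22(1): 61-66, 122.

Authors:	文健 (WEN Jian), 李舟军 (LI Zhou-jun)

Affiliations:	1. Computer School, National University of Defence Technology, Changsha, Hunan 410073, China; 2. School of Computer Science & Engineering, Beihang University, Beijing 100083, China

Funding:	Supported by the National Natural Science Foundation of China (60573057)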

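Abstract:	Recent studies show that topic language models improve information retrieval performance, but they still do not resolve several hard problems in information retrieval: data sparseness, synonymy, polysemy, and the smoothing of terms that are seen and unseen in a document. These problems are especially important in domain-specific literature retrieval, such as large-scale biological literature retrieval. This paper proposes a new cluster-based topic language model for biological literature retrieval. The work has two main parts. First, documents are represented by concepts from an ontology and fuzzy clustering is performed on that representation; the resulting clusters serve as the topics of the collection, and the probability that a document belongs to a topic is determined by the fuzzy similarity between the document and the cluster. Second, the EM algorithm is used to estimate the probability that a topic generates a term. Integrating these two components into a language model yields the proposed model. The model can accurately describe the distribution of terms across topics and the probability that a document belongs to each topic. The use of ontology concepts partly solves the synonymy problem, and because a term can be generated by different topics, the polysemy problem is partly solved as well. The method was tested on the TREC 2004/05 Genomics Track data sets, where it improved retrieval performance to some extent over both a simple language model and existing topic language models.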
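Keywords:	computer application; Chinese information processing; topic language model; information retrieval; clustering
Article ID:	1003-0077(2008)01-00061-06
Received:	2007-05-26
Revised:	2007-12-05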
Research on Biological Literature Information Retrieval Based on Cluster Language Model

WEN Jian, LI Zhou-jun. Research on Biological Literature Information Retrieval Based on Cluster Language Model [J]. Journal of Chinese Information Processing, 2008, 22(1): 61-66, 122.

Authors:	WEN Jian, LI Zhou-jun

Affiliation:	1. Computer School, National University of Defence Technology, Changsha, Hunan 410073, China; 2. School of Computer Science & Engineering, Beihang University, Beijing 100083, China

Abstract:	Recent research shows that topic language models improve the performance of information retrieval (IR), but several problems remain unsolved, including data sparseness, synonymy, polysemy, and the smoothing of seen and unseen terms. These problems matter throughout IR and especially in domain-specific literature retrieval, for example over biological literature. This paper proposes a new cluster-based topic language model. The work has two main parts. First, documents are represented by ontology concepts and clustered on that representation with Fuzzy C-Means; the resulting clusters are treated as the topics of the document collection, and the probability of a document generating a topic is estimated from the similarity between the document and each cluster. Second, the probability of a topic generating a word is estimated with the Expectation Maximization algorithm. Integrating these two estimates into the aspect model yields the proposed topic language model. The model accurately describes the distribution of words over topics and the probability of a document generating a topic, and it partly solves the synonymy and polysemy problems. The method was evaluated on the TREC 2004/05 Genomics Track collections. Experiments show that it improves retrieval performance over a simple language model.
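
For orientation, the aspect-model mixture that the abstract refers to can be written out as below. The notation is an assumption made here, not taken from the paper: t_1, ..., t_K are the cluster topics, n(d, w) is the count of word w in document d, and p(t_k | d) is held fixed at the value derived from the fuzzy clustering, so that EM only re-estimates p(w | t_k). These are the standard aspect-model updates under that assumption; the paper's exact formulation may differ.

```latex
\begin{align}
% Mixture over cluster topics
p(w \mid d) &= \sum_{k=1}^{K} p(w \mid t_k)\, p(t_k \mid d) \\
% E-step: posterior of topic t_k for an occurrence of w in d
p(t_k \mid d, w) &= \frac{p(w \mid t_k)\, p(t_k \mid d)}
                        {\sum_{j=1}^{K} p(w \mid t_j)\, p(t_j \mid d)} \\
% M-step: re-estimate the topic-word distribution
p(w \mid t_k) &= \frac{\sum_{d} n(d, w)\, p(t_k \mid d, w)}
                     {\sum_{w'} \sum_{d} n(d, w')\, p(t_k \mid d, w')}
\end{align}
```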

Keywords:	computer application; Chinese information processing; topic language model; information retrieval; cluster
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.
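
The two estimation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the function names are hypothetical, the cluster centroids are taken as already computed (the Fuzzy C-Means centroid update loop is omitted), raw count vectors stand in for ontology concept vectors, and the interpolation weight `lam` between the document model and the topic mixture is an illustrative choice that the abstract does not specify.

```python
import math


def fcm_memberships(doc_vec, centroids, m=2.0):
    """Fuzzy C-Means membership of one document in each cluster.

    Centroids are assumed given; m > 1 is the usual fuzzifier.
    The memberships sum to 1 and are used here as p(topic | doc).
    """
    dists = [math.dist(doc_vec, c) for c in centroids]
    if any(d == 0.0 for d in dists):
        # A document lying exactly on a centroid belongs to it fully.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    inv = [d ** (-power) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def em_topic_word(counts, p_t_d, vocab, iters=30):
    """Estimate p(word | topic) by EM while p(topic | doc) stays fixed.

    counts: one {word: count} dict per document.
    p_t_d:  one membership list per document, as from fcm_memberships.
    """
    k_topics = len(p_t_d[0])
    p_w_t = [{w: 1.0 / len(vocab) for w in vocab} for _ in range(k_topics)]
    for _ in range(iters):
        acc = [{w: 0.0 for w in vocab} for _ in range(k_topics)]
        for d, doc in enumerate(counts):
            for w, n in doc.items():
                # E-step: posterior over topics for word w in document d.
                joint = [p_w_t[k][w] * p_t_d[d][k] for k in range(k_topics)]
                z = sum(joint)
                if z == 0.0:
                    continue
                for k in range(k_topics):
                    acc[k][w] += n * joint[k] / z
        # M-step: normalise expected counts into p(word | topic).
        for k in range(k_topics):
            total = sum(acc[k].values())
            if total > 0.0:
                p_w_t[k] = {w: v / total for w, v in acc[k].items()}
    return p_w_t


def query_log_likelihood(query, doc_counts, p_t_d_row, p_w_t, lam=0.5):
    """Log-likelihood of a query under a document model smoothed by topics.

    lam is a hypothetical interpolation weight, not a value from the paper.
    """
    length = sum(doc_counts.values())
    score = 0.0
    for w in query:
        p_ml = doc_counts.get(w, 0) / length
        p_topic = sum(p_w_t[k].get(w, 0.0) * p_t_d_row[k]
                      for k in range(len(p_t_d_row)))
        # Floor avoids log(0) for words unseen in the whole collection.
        score += math.log(max(lam * p_ml + (1.0 - lam) * p_topic, 1e-12))
    return score


if __name__ == "__main__":
    vocab = ["dna", "enzyme", "gene", "protein", "rna"]
    counts = [{"gene": 3, "dna": 2}, {"protein": 3, "enzyme": 2}]
    vecs = [[c.get(w, 0) for w in vocab] for c in counts]
    centroids = [[1.5, 0, 2.5, 0, 0.5], [0, 2, 0, 2, 0]]  # toy values
    p_t_d = [fcm_memberships(v, centroids) for v in vecs]
    p_w_t = em_topic_word(counts, p_t_d, vocab)
    for d, doc in enumerate(counts):
        print(d, query_log_likelihood(["gene"], doc, p_t_d[d], p_w_t))
```

Because a query word that is absent from a document can still receive probability mass through the topics that document belongs to, the topic mixture acts as the smoothing term for unseen words; this is one plausible reading of how the model addresses data sparseness.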