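Improvement of Web search result clustering performance based on Word2Vec model feature extension
Cite as: YANG Nan, LI Yaping. Improvement of Web search result clustering performance based on Word2Vec model feature extension[J]. Journal of Computer Applications (计算机应用), 2019, 39(6): 1701-1706.
Authors: YANG Nan  LI Yaping
Affiliation: School of Information, Renmin University of China, Beijing 100872, China (both authors)
Funding: National Natural Science Foundation of China (61773385).
Abstract: For broad or ambiguous user queries, clustering the result list returned by a Web search engine helps users locate the content they are interested in. The returned list consists of short texts called snippets, and the traditional Term Frequency-Inverse Document Frequency (TF-IDF) feature selection model is ill-suited to such sparse short texts, which degrades clustering performance. One effective remedy is to expand the short texts using an external knowledge base. Inspired by neural-network-based word representation methods, this work proposes expanding short texts with the Word2Vec word embedding model: the TopN most similar words under the Word2Vec model are used to expand each snippet, so that TF-IDF feature selection on the expanded documents yields better clustering performance. To account for the noise introduced by general-purpose words, the term frequency weights in the TF-IDF matrix of the expanded documents are also corrected. Experiments were carried out on two public datasets, ODP239 and SearchSnippets, comparing the proposed method with a snippet-only method without expansion, a WordNet-based feature extension method, and a Wikipedia-based feature extension method. The results show that the proposed method outperforms the compared methods in clustering performance.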

Keywords: feature extension; snippet; word embedding technology; search result clustering
Received: 2018-10-19
Revised: 2018-12-13
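
The snippet-expansion step described in the abstract (append the TopN most similar words for each snippet term) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy 3-dimensional vectors in `TOY_VECTORS` stand in for a trained Word2Vec model, and the names `most_similar` and `expand_snippet` are hypothetical helpers introduced here.

```python
import math

# Toy vectors standing in for a trained Word2Vec model.
# ASSUMPTION: the values are invented purely for illustration.
TOY_VECTORS = {
    "jaguar": [0.9, 0.1, 0.0],
    "car": [0.8, 0.2, 0.1],
    "engine": [0.7, 0.3, 0.0],
    "cat": [0.1, 0.9, 0.1],
    "animal": [0.0, 0.8, 0.2],
    "price": [0.1, 0.0, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, top_n):
    """Return the top_n vocabulary words closest to `word` (hypothetical helper)."""
    if word not in vectors:
        return []  # out-of-vocabulary terms contribute no expansion words
    scored = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:top_n]]


def expand_snippet(tokens, vectors, top_n=2):
    """Append the top_n most similar words of every token to the snippet."""
    expanded = list(tokens)
    for token in tokens:
        expanded.extend(most_similar(token, vectors, top_n))
    return expanded


expanded_cat = expand_snippet(["cat"], TOY_VECTORS, top_n=1)
expanded_unknown = expand_snippet(["xyz"], TOY_VECTORS, top_n=2)
```

With a real model, the lookup inside `most_similar` would be replaced by a nearest-neighbour query against the trained embeddings; the expansion loop itself would stay the same.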

Improvement of Web search result clustering performance based on Word2Vec model feature extension
YANG Nan, LI Yaping. Improvement of Web search result clustering performance based on Word2Vec model feature extension[J]. Journal of Computer Applications, 2019, 39(6): 1701-1706.
Authors:YANG Nan  LI Yaping
Affiliation:School of Information, Renmin University of China, Beijing 100872, China
Abstract: For broad or ambiguous queries, the result list returned by a Web search engine is clustered to help users find the desired information quickly. The returned list generally consists of short texts called snippets, which carry little information and for which the traditional Term Frequency-Inverse Document Frequency (TF-IDF) feature selection model is not suitable, so clustering performance is poor. An effective way to solve this problem is to extend the snippets using an external knowledge base. Inspired by neural-network-based word representation methods, a new snippet extension approach based on the Word2Vec model is proposed. In this approach, the TopN most similar words in the Word2Vec model are used to extend each snippet, and the extended text improves the clustering performance of TF-IDF feature selection. Meanwhile, to reduce the impact of noise caused by commonly used terms, the term frequency weights in the TF-IDF matrix of the extended text are modified. Experiments were conducted on two open datasets, ODP239 and SearchSnippets, to compare the proposed method with pure snippets and with WordNet-based and Wikipedia-based feature extensions. The experimental results show that the proposed method significantly outperforms the compared methods in terms of clustering performance.
Keywords: feature extension; snippet; word embedding technology; search result clustering
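
The abstract states that the term frequency weights of the extended text are modified to limit noise, but it does not give the correction formula. The sketch below shows one plausible form under stated assumptions: each expansion term counts as `alpha` occurrences instead of 1, where `alpha` is a hypothetical damping factor and `weighted_tfidf` is an illustrative helper, not the paper's method. The two example snippets are invented.

```python
import math
from collections import Counter


def weighted_tfidf(docs, alpha=0.5):
    """Build sparse TF-IDF vectors from (original_tokens, expansion_tokens) pairs.

    ASSUMPTION: expansion terms are counted with weight `alpha` rather than 1.
    This is one possible term-frequency correction, not the paper's formula.
    """
    tf_rows = []
    for original, expansion in docs:
        tf = Counter()
        for term in original:
            tf[term] += 1.0
        for term in expansion:
            tf[term] += alpha
        tf_rows.append(tf)

    n_docs = len(docs)
    df = Counter()
    for tf in tf_rows:
        df.update(tf.keys())  # each document counts once per term

    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: w * idf[t] for t, w in tf.items()} for tf in tf_rows]


def sparse_cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two invented snippets that share no original words but share one expansion word.
plain = [(["jaguar", "speed"], []), (["sedan", "price"], [])]
extended = [
    (["jaguar", "speed"], ["car", "engine"]),
    (["sedan", "price"], ["car", "vehicle"]),
]

sim_plain = sparse_cosine(*weighted_tfidf(plain))
sim_damped = sparse_cosine(*weighted_tfidf(extended, alpha=0.5))
sim_full = sparse_cosine(*weighted_tfidf(extended, alpha=1.0))
```

In this toy case the unexpanded snippets have zero similarity, while the expanded ones overlap on the shared expansion term; lowering `alpha` shrinks that overlap, which is the kind of trade-off a term-frequency correction has to balance. The resulting vectors could then be passed to any standard clustering algorithm.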
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).