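Text feature extraction based on hierarchical clustering with multiple semantic factors
Cite this article: Wang Jing. Text feature extraction based on hierarchical clustering with multiple semantic factors[J]. Application Research of Computers, 2020, 37(10): 2951-2955, 2960
Author: Wang Jing
Affiliation: School of Software, Yunnan University, Kunming 650000; School of Information, Yunnan University, Kunming 650000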
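Abstract: Keywords extracted from texts of the same class take many forms and are linked by fuzzy degrees of similarity and relatedness. To address this, the paper proposes a text feature extraction method that clusters words hierarchically. While still letting identical words shared between texts contribute to text similarity, the method combines word similarity and word relatedness into a semantic distance, applies hierarchical clustering according to that distance, and assigns a different weight to each clustering level. The result is a vector space model with cluster weights whose feature units are both words and clusters. word2vec is used to train word vectors from which text similarity is obtained, and, based on the algorithmic characteristics of the Skip-Gram+Huffman Softmax model, the pointwise mutual information formula is used to obtain the relatedness between words. Text classification experiments show that the proposed method extracts text features more accurately than the commonly used approach of single-level clustering on similarity alone followed by counting.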

Keywords: semantics; text feature; hierarchical clustering; word vector
Received: 2019-05-20
Revised: 2019-08-07
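
The abstract describes a vector space model whose feature units are both words and weighted clusters. The following is only a minimal sketch of that general idea, not the paper's algorithm: the function names, the two thresholds, the level weights (0.8 and 0.4), and the toy vectors are all hypothetical, and plain cosine similarity stands in for the paper's combined similarity-and-relatedness distance, which the abstract does not specify.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_words(words, score, threshold):
    """Single-link grouping: words joined by a chain of pairs scoring >= threshold
    receive the same cluster label (the label is one member word)."""
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    ordered = sorted(words)
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if score(a, b) >= threshold:
                parent[find(a)] = find(b)
    return {w: find(w) for w in words}

def featurize(tokens, levels):
    """Count each word once as a word feature, then add one weighted
    cluster feature per clustering level (assumed weighting scheme)."""
    feats = Counter()
    for t in tokens:
        feats[("word", t)] += 1.0
        for name, labels, weight in levels:
            if t in labels:
                feats[(name, labels[t])] += weight
    return feats

def dot(f, g):
    """Inner product of two sparse feature vectors."""
    return sum(v * g[k] for k, v in f.items() if k in g)

# Toy 2-d "word vectors" (hypothetical values, for illustration only).
vectors = {
    "car": (1.0, 0.0),
    "automobile": (0.99, 0.1),
    "road": (0.7, 0.7),
    "banana": (0.0, 1.0),
}
score = lambda a, b: cosine(vectors[a], vectors[b])

tight = cluster_words(vectors, score, threshold=0.95)  # near-synonyms
loose = cluster_words(vectors, score, threshold=0.75)  # broader grouping
levels = [("tight", tight, 0.8), ("loose", loose, 0.4)]

doc_a = featurize(["car", "road"], levels)
doc_b = featurize(["automobile"], levels)
overlap = dot(doc_a, doc_b)
```

In this toy setup the two documents share no word, so a word-only model would score them as unrelated; the cluster features are what give them a nonzero inner product.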

Text feature extraction based on hierarchical clustering with multiple semantic factors
Wang Jing. Text feature extraction based on hierarchical clustering with multiple semantic factors[J]. Application Research of Computers, 2020, 37(10): 2951-2955, 2960
Author: Wang Jing
Affiliation: Yunnan University
Abstract: Keywords extracted from similar texts are diverse in form, and their similarity and relevance relations are fuzzy. To address this, this paper proposes a text feature extraction method based on hierarchical clustering of words. On the premise that identical words shared between texts contribute to text similarity, the method takes the similarity and relevance of words together as the semantic distance and, according to differences in that distance, introduces hierarchical clustering with a different weight for each clustering level. It finally obtains a vector space model with clustering weights that takes both words and clusters as feature units. The paper uses word2vec to train word vectors to obtain text similarity and, according to the algorithmic characteristics of the Skip-Gram+Huffman Softmax model, uses the pointwise mutual information formula to obtain the correlation between words. Text categorization experiments show that the proposed method improves the accuracy of text feature extraction more effectively than the currently common approach of single-layer clustering on similarity alone followed by counting.
Keywords: semantics; text feature; hierarchical clustering; word vector
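
The abstract mentions pointwise mutual information (PMI) as the measure of correlation between words. For reference, PMI is defined as PMI(x, y) = log( p(x, y) / (p(x) p(y)) ). The sketch below estimates it from sentence-level co-occurrence counts; this counting scheme, the function name `pmi_table`, and the toy sentences are assumptions for illustration, and the paper's own procedure, which is tied to the Skip-Gram model, may differ.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(sentences):
    """Estimate PMI for every word pair that co-occurs in at least one sentence.

    Probabilities are relative frequencies over sentences: p(x) is the share
    of sentences containing x, p(x, y) the share containing both x and y.
    Pair keys are alphabetically ordered tuples.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for sent in sentences:
        unique = sorted(set(sent))  # count each word once per sentence
        word_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    n = len(sentences)
    return {
        (x, y): math.log((c / n) / ((word_counts[x] / n) * (word_counts[y] / n)))
        for (x, y), c in pair_counts.items()
    }

# Toy corpus (hypothetical data).
sentences = [
    ["doctor", "hospital"],
    ["doctor", "hospital"],
    ["doctor", "bank"],
    ["bank", "money"],
]
pmi = pmi_table(sentences)
```

A positive value means the pair co-occurs more often than independence would predict, and a negative value means less often. Pairs that never co-occur are absent from the table, since their PMI would be negative infinity.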
This article is indexed in Wanfang Data and other databases.