Document Representation Fused with Term Contribution and Word2Vec Word Vector
Citation: PENG Junli, GU Yu, ZHANG Zhen, GENG Xiaohang. Document Representation Fused with Term Contribution and Word2Vec Word Vector[J]. Computer Engineering, 2021, 47(4): 62-67.
Authors: PENG Junli  GU Yu  ZHANG Zhen  GENG Xiaohang
Affiliation: National Defense Key Discipline Laboratory of Communication Information Transmission and Fusion Technology, Hangzhou Dianzi University, Hangzhou 310000, China
Abstract: Existing document vector representation methods are affected by noise words and capture the semantics of important words incompletely. To address these problems, this paper proposes a new document representation method that fuses Term Contribution (TC) with Word2Vec word vectors. A Word2Vec model is trained on the dataset, and the TC of each word in the dataset is computed. A contribution threshold is then set, and the words whose TC is greater than the threshold are extracted to construct a word set. On this basis, the words that occur in both the document and the word set are identified, and their word vectors are fused with their TC values to generate the document vector. Experimental results show that on the Sogou Chinese text corpus and the Fudan University Chinese text classification corpus, the average accuracy, recall, and F1 score of the proposed method are better than those of traditional methods such as TF-IDF, mean Word2Vec, and PTF-IDF weighted Word2Vec. The method can also classify English texts effectively.

Keywords: Term Contribution (TC)  Word2Vec word vector  word embedding  document representation  text classification
Received: 2019-10-22
Revised: 2020-01-02
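The pipeline described in the abstract (threshold the words by contribution, then fuse the vectors of the surviving words) can be illustrated with a minimal sketch. The abstract gives neither the TC formula nor the exact fusion rule, so the code makes two assumptions: contribution scores are supplied as precomputed numbers, and "fusion" is taken to be a contribution-weighted average of word vectors. The function names `build_word_set` and `document_vector`, the toy vectors, and the threshold value are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Set


def build_word_set(contribution: Dict[str, float], threshold: float) -> Set[str]:
    """Keep only the words whose contribution is greater than the threshold."""
    return {word for word, score in contribution.items() if score > threshold}


def document_vector(
    tokens: Sequence[str],
    word_vectors: Dict[str, List[float]],
    contribution: Dict[str, float],
    word_set: Set[str],
    dim: int,
) -> List[float]:
    """Contribution-weighted average of the vectors of tokens found in word_set.

    ASSUMPTION: weighted averaging stands in for the paper's fusion step,
    which the abstract does not spell out. Repeated tokens count once per
    occurrence.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        # Only words shared by the document and the word set contribute.
        if token in word_set and token in word_vectors:
            weight = contribution[token]
            vector = word_vectors[token]
            for i in range(dim):
                total[i] += weight * vector[i]
            weight_sum += weight
    if weight_sum == 0.0:
        # No retained word occurs in the document: return the zero vector.
        return total
    return [value / weight_sum for value in total]


# Toy data standing in for trained Word2Vec vectors and computed TC scores.
toy_vectors = {"stock": [1.0, 0.0], "market": [0.0, 1.0], "the": [1.0, 1.0]}
toy_contribution = {"stock": 3.0, "market": 1.0, "the": 0.1}

toy_word_set = build_word_set(toy_contribution, threshold=0.5)
doc = document_vector(
    ["the", "stock", "market"], toy_vectors, toy_contribution, toy_word_set, dim=2
)
print(doc)  # [0.75, 0.25]
```

In this toy run the low-contribution word "the" falls below the threshold and is ignored, so the document vector is pulled toward "stock", the word with the highest contribution. In practice the vectors would come from a Word2Vec model trained on the corpus and the scores from the paper's TC computation.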
This article is indexed in databases including VIP and Wanfang Data.