
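Research on Chinese Short Text Classification Based on Word2Vec
Cite this article:WANG Jing, LUO Lang, WANG De-Qiang. Research on Chinese Short Text Classification Based on Word2Vec[J]. Computer Systems & Applications, 2018, 27(5):209-215.
Authors:WANG Jing, LUO Lang, WANG De-Qiang
Affiliation:School of Computer Science, South-Central University for Nationalities, Wuhan 430074, China (all three authors)
Funding:CERNET Next Generation Internet Technology Innovation Project (NGII20150106)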
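Abstract:To address the inherent feature sparsity of short texts and the "lexical gap" of traditional classification models, we use the Word2Vec model, which effectively alleviates the data feature sparsity of short texts and introduces semantic relations that traditional text classification models lack. We further find, however, that using Word2Vec alone ignores the influence that words of different parts of speech have on a short text. We therefore introduce part of speech to improve the feature weighting method: the contribution of part of speech to text classification is embedded into the traditional TF-IDF algorithm to compute the weight of each word in a short text, these weights are combined with the Word2Vec word vectors to generate the short-text vector, and finally an SVM performs the classification. Experimental results on the Fudan University Chinese text classification corpus verify the effectiveness of the method.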

Keywords:Word2Vec  TF-IDF  text representation  short text classification
Received:2017/8/18
Revised:2017/9/5

Research on Chinese Short Text Classification Based on Word2Vec
WANG Jing, LUO Lang and WANG De-Qiang. Research on Chinese Short Text Classification Based on Word2Vec[J]. Computer Systems & Applications, 2018, 27(5):209-215.
Authors:WANG Jing, LUO Lang and WANG De-Qiang
Affiliation:School of Computer Science, South-Central University for Nationalities, Wuhan 430074, China (all three authors)
Abstract:Short texts are inherently feature-sparse, and traditional classification models suffer from a "lexical gap". The Word2Vec model maps words to low-dimensional real-valued vectors according to their contextual semantic relations, which effectively eases the feature sparsity of short texts. However, further study found that using Word2Vec alone ignores the differing influence that words of different parts of speech have on a short text. We therefore introduce part of speech into the feature weighting approach: the contribution of each part of speech is embedded into the traditional TF-IDF algorithm to calculate the weight of each word in the short text, and the short-text vector is generated by combining these weights with the Word2Vec word vectors. Finally, an SVM performs the short text classification. Experimental results on the Fudan University Chinese text classification corpus validate the effectiveness of the proposed method.
Keywords:Word2Vec  TF-IDF  text representation  short text classification
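
The abstract describes the text representation only in words, so the following is a minimal sketch of one way it could look, not the authors' implementation. It assumes each word's weight is TF × IDF × a part-of-speech coefficient and that the short-text vector is the weight-normalized sum of the word vectors. The coefficient values in `POS_WEIGHT`, the smoothed IDF form, the normalization, the toy two-dimensional `embeddings`, and all function names are hypothetical and chosen only for illustration. In practice the word vectors would come from a trained Word2Vec model, and the resulting text vectors would be passed to an SVM classifier.

```python
import math
from collections import Counter

# Hypothetical POS contribution coefficients; the abstract gives no values.
POS_WEIGHT = {"n": 1.0, "v": 0.8, "a": 0.6}
DEFAULT_POS_WEIGHT = 0.2  # assumed weight for all other parts of speech


def idf_table(corpus):
    """Smoothed inverse document frequency for every word in the corpus.

    Each document is a list of (word, pos_tag) pairs.
    """
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in {w for w, _ in doc})
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0
            for w, df in doc_freq.items()}


def text_vector(doc, idf, embeddings, dim):
    """Weighted average of word vectors; weight = TF * IDF * POS coefficient."""
    counts = Counter(doc)          # counts each (word, pos_tag) pair
    total = sum(counts.values())   # number of tokens in the short text
    vec = [0.0] * dim
    weight_sum = 0.0
    for (word, pos), count in counts.items():
        if word not in embeddings or word not in idf:
            continue               # skip words without a vector or an IDF value
        pos_coeff = POS_WEIGHT.get(pos, DEFAULT_POS_WEIGHT)
        weight = (count / total) * idf[word] * pos_coeff
        for i, x in enumerate(embeddings[word]):
            vec[i] += weight * x
        weight_sum += weight
    if weight_sum == 0.0:
        return vec                 # no usable words: return the zero vector
    return [x / weight_sum for x in vec]


# Toy data: segmented, POS-tagged short texts and made-up 2-d word vectors.
corpus = [
    [("股票", "n"), ("上涨", "v")],
    [("球队", "n"), ("获胜", "v")],
    [("股票", "n"), ("的", "u")],
]
embeddings = {
    "股票": [1.0, 0.0],
    "的": [0.0, 1.0],
    "上涨": [0.8, 0.2],
    "球队": [-1.0, 0.0],
    "获胜": [-0.8, 0.2],
}
idf = idf_table(corpus)
print(text_vector(corpus[2], idf, embeddings, dim=2))
```

In the third toy text the noun receives a larger weight than the particle, so the resulting vector lies closer to the noun's word vector even though the particle has the higher IDF; that is the effect the part-of-speech coefficient is meant to produce.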