
Research on line text classification based on TF-IDF and word2vec
Citation: DAN Yuhao, HUANG Jifeng, YANG Lin, GAO Hai. Research on line text classification based on TF-IDF and word2vec[J]. Journal of Shanghai Normal University (Natural Sciences), 2020, 49(1): 89-95.
Authors: DAN Yuhao, HUANG Jifeng, YANG Lin, GAO Hai
Affiliations: College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 201418, China; Shanghai Development Center of Computer Software Technology, Shanghai 201112, China; Shanghai Gaochuang Computer Technology Co., Ltd., Shanghai 200030, China
Funding: Shanghai Municipal Scientific Research Program project (17DZ2292100)
Abstract: To improve text classification accuracy, this paper proposes a classification method based on the word2vec averaging algorithm and an improved term frequency-inverse document frequency (TF-IDF) algorithm. The method targets the script lines of health TV programs, a corpus in which the number of samples is unbalanced across categories and the number of words is unbalanced across samples. By introducing information entropy and correction factors, the method alleviates the adverse effect of data imbalance on classification accuracy and recall. Experimental results show that, compared with the word2vec averaging model, the proposed method improves accuracy and recall by 7.3% and 10.5%, respectively.
Keywords: term frequency-inverse document frequency (TF-IDF); word2vec; information entropy; text classification; machine learning; weighting
Received: 2019-11-13
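
The abstract contrasts a plain word2vec averaging baseline with a TF-IDF-weighted variant that also uses information entropy. It does not give the improved TF-IDF formula, the entropy term, or the correction factors, so the Python sketch below only illustrates the general idea: it uses standard TF-IDF to weight the word-vector average and computes class entropy as a separate quantity. The toy corpus, the two-dimensional vectors, and every function name are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document (standard TF-IDF)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def class_entropy(term, docs, labels):
    """Shannon entropy (in bits) of a term's occurrences across class labels."""
    per_class = Counter()
    for doc, label in zip(docs, labels):
        per_class[label] += doc.count(term)
    total = sum(per_class.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total)
               for c in per_class.values() if c > 0)


def mean_vector(doc, vectors):
    """Baseline: unweighted mean of the vectors of all known tokens."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in doc if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def weighted_vector(doc, vectors, weights):
    """Mean of word vectors weighted by each distinct term's TF-IDF weight."""
    dim = len(next(iter(vectors.values())))
    terms = [t for t in set(doc) if t in vectors]
    total = sum(weights.get(t, 0.0) for t in terms)
    if total == 0:
        return mean_vector(doc, vectors)  # fall back to the baseline
    return [sum(weights.get(t, 0.0) * vectors[t][i] for t in terms) / total
            for i in range(dim)]


# Hypothetical toy data standing in for tokenized script lines and
# pretrained word2vec embeddings.
docs = [["diet", "sugar", "diet"], ["sleep", "sugar"], ["sleep", "rest"]]
labels = ["nutrition", "sleep", "sleep"]
vectors = {
    "diet": [1.0, 0.0],
    "sugar": [0.0, 1.0],
    "sleep": [-1.0, 0.0],
    "rest": [0.0, -1.0],
}

weights = tfidf_weights(docs)
baseline = mean_vector(docs[0], vectors)
weighted = weighted_vector(docs[0], vectors, weights[0])
print(baseline, weighted)
```

In this toy corpus, "sugar" occurs in two of the three documents and is split evenly between the two classes, so it gets a low TF-IDF weight and a class entropy of 1 bit, while "diet" occurs in a single class and has entropy 0. The weighted document vector therefore leans further toward "diet" than the plain average does. How the paper actually folds entropy and the correction factors into the weight is not stated in the abstract.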
This article is indexed in CNKI, Wanfang Data, and other databases.