Chinese Text Feature Classification Based on Distributed Framework (基于分布式框架下的中文文本特征分类)


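Cite as:	ZHANG Hui-fang, ZONG Cai-le, ZHANG Xiao-lin. Chinese Text Feature Classification Based on Distributed Framework [J]. Computer & Telecommunication (广东电脑与电讯), 2019, 1(5): 1-7.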

Authors:	ZHANG Hui-fang (张慧芳), ZONG Cai-le (宗彩乐), ZHANG Xiao-lin (张晓琳)

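Affiliations:	School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou 014010, Inner Mongolia; Operating Branch, Qingdao Metro Group Co., Ltd., Qingdao 266000, Shandong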

Funding:	Supported by the National Natural Science Foundation of China, project No. 61562065.

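Abstract:	Using the Fudan Chinese text corpus and the Sogou Chinese document corpus as research data, this study improves the accuracy and recall of Chinese text classification and determines the best contribution value of feature words. It combines the naive Bayes classification method with an improved TFIDF method for keyword extraction and weight calculation to propose the TNBIF classification model, which is implemented as parallel classification on the Spark platform. Experimental results show that Chinese text classification with the TNBIF model reaches an accuracy of 95.49%, which is 5.41% higher than traditional text classification methods, and raises recall by 6.64%. The best contribution value found in this study is 0.95.
Keywords:	TNBIF model; massive data sets; Spark; feature classification; parallel classification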
Chinese Text Feature Classification Based on Distributed Framework

ZHANG Hui-fang, ZONG Cai-le, ZHANG Xiao-lin. Chinese Text Feature Classification Based on Distributed Framework [J]. Computer & Telecommunication, 2019, 1(5): 1-7.

Authors:	ZHANG Hui-fang, ZONG Cai-le, ZHANG Xiao-lin

Affiliation:	(Inner Mongolia University of Science and Technology, Baotou 014010, Inner Mongolia;Qingdao Metro Group Co., Ltd. Operating Branch, Qingdao 266000, Shandong)

Abstract:	This study uses the Fudan Chinese text corpus and the Sogou Chinese document corpus as its data sets. It improves the accuracy and recall of Chinese text classification and determines the best contribution value for feature words. Combining the naive Bayes classification method with an improved TFIDF keyword extraction and weight calculation, the authors propose the TNBIF classification model and implement it in parallel on the Spark platform. Experimental results show that Chinese text classification with the TNBIF model reaches an accuracy of 95.49%, which is 5.41% higher than the traditional text classification method, and that recall increases by 6.64%. The study finds an optimal contribution value of 0.95.

Keywords:	TNBIF model; massive data set; Spark; feature classification; parallel classification
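
The abstract names the two ingredients of the method (TFIDF term weighting and naive Bayes classification) but does not give the TNBIF formulas, the improved TFIDF variant, or how the contribution value is applied. As a rough illustration only, the sketch below shows a plain, single-process baseline: standard smoothed TF-IDF weights fed into a multinomial naive Bayes classifier with Laplace smoothing. It is not the authors' TNBIF model and does not use Spark. The function names (`idf_weights`, `tfidf`, `train`, `predict`), the smoothing constant `alpha`, and the toy documents are all assumptions made for the example, and the input is assumed to be already segmented into words.

```python
import math
from collections import Counter, defaultdict


def idf_weights(docs):
    """Smoothed inverse document frequency for every term in the training set."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf(doc, idf):
    """TF-IDF vector of one tokenized document; terms unseen in training are dropped."""
    counts = Counter(t for t in doc if t in idf)
    total = sum(counts.values())
    return {t: (c / total) * idf[t] for t, c in counts.items()} if total else {}


def train(docs, labels, alpha=1.0):
    """Fit multinomial naive Bayes on TF-IDF weights (alpha = Laplace smoothing)."""
    idf = idf_weights(docs)
    class_docs = Counter(labels)
    weight_sums = defaultdict(Counter)  # label -> term -> summed TF-IDF weight
    for doc, label in zip(docs, labels):
        weight_sums[label].update(tfidf(doc, idf))
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        denom = sum(weight_sums[c].values()) + alpha * len(idf)
        log_like[c] = {t: math.log((weight_sums[c][t] + alpha) / denom) for t in idf}
    return idf, log_prior, log_like


def predict(doc, model):
    """Return the label with the highest posterior log-score for a tokenized document."""
    idf, log_prior, log_like = model
    vec = tfidf(doc, idf)
    scores = {
        c: log_prior[c] + sum(w * log_like[c][t] for t, w in vec.items())
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy, hand-made data (not from the Fudan or Sogou corpora), already word-segmented.
train_docs = [
    ["股票", "市场", "上涨"],
    ["银行", "利率", "市场"],
    ["球队", "比赛", "进球"],
    ["比赛", "冠军", "球员"],
]
train_labels = ["finance", "finance", "sports", "sports"]

model = train(train_docs, train_labels)
print(predict(["市场", "利率"], model))
print(predict(["球员", "进球"], model))
```

In a Spark setting the same two stages would typically be distributed across the cluster, for instance by computing document frequencies and per-class weight sums as aggregations over partitions. How the paper actually partitions the work is not stated in the abstract.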
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).