
Parallel Algorithms for Text Sentiment Analysis Based on Deep Learning
Cite this article: ZHAI Donghai, HOU Jialin, LIU Yue. Parallel Algorithms for Text Sentiment Analysis Based on Deep Learning[J]. Journal of Southwest Jiaotong University, 2019, 54(3): 647-654.
Authors: ZHAI Donghai, HOU Jialin, LIU Yue
Affiliation: School of Information Science and Technology, Southwest Jiaotong University
Funding: National Natural Science Foundation of China (61540060); National Soft Science Research Program of the Ministry of Science and Technology (2013GXS4D150); Key Science and Technology Research Project of the Ministry of Education (212167)
Abstract: When the training and test sets are large, the semi-supervised recursive autoencoder (Semi-Supervised RAE) model for text sentiment analysis suffers from slow network training and slow output of test results. A parallel processing framework is therefore proposed. For a large training set, a divide-and-conquer approach is used: the data set is first partitioned into blocks, each block is fed to a Map node that computes the block's error, and a buffer collects all block errors; a Reduce node then reads these block errors from the buffer to compute the optimization objective function. Next, the L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno) algorithm is called to adjust the parameters, the updated parameter set is reloaded into the model, and the above training steps are repeated to optimize the objective function until convergence, yielding the optimal parameter set. For a large test set, the model is initialized with the parameter set obtained above; the Map nodes encode each sentence to obtain its vector representation, which is temporarily stored in the buffer; finally, in the Reduce node, the classifier uses each sentence's vector representation to compute its sentiment label. Experiments show that on the standard MR (movie review) corpus the accuracy of the proposed algorithm is 77.0%, almost the same as that of the original algorithm (77.3%), and that on large training sets the training time decreases substantially as the number of compute nodes increases.
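To make the training procedure described above concrete, the following is a minimal Python sketch of such a MapReduce-style optimization loop, not the authors' implementation: each Map task computes the error and gradient of one data block, a Reduce step sums them into the global objective, and SciPy's L-BFGS routine updates the shared parameter set until convergence. The per-block loss below (block_error) is a placeholder linear reconstruction error standing in for the full Semi-Supervised RAE, and all dimensions, block counts, and names are assumed toy values.

# Minimal sketch of a MapReduce-style training loop with L-BFGS.
# block_error() is a placeholder loss (tiny linear autoencoder
# reconstruction error), NOT the actual Semi-Supervised RAE.
import numpy as np
from multiprocessing import Pool
from scipy.optimize import minimize

DIM = 8          # toy word-vector dimensionality (assumed)
N_BLOCKS = 4     # number of data blocks / Map tasks (assumed)

def block_error(args):
    """Map step: error and gradient of one data block under parameters theta."""
    theta, block = args
    W = theta.reshape(DIM, DIM)        # unpack the shared parameter set
    recon = block @ W.T                # placeholder "encode-decode" pass
    diff = recon - block
    err = 0.5 * np.sum(diff ** 2)      # reconstruction error of this block
    grad = (diff.T @ block).ravel()    # gradient w.r.t. W, flattened
    return err, grad

def objective(theta, blocks, pool):
    """Reduce step: sum the block errors/gradients collected from the Map tasks."""
    results = pool.map(block_error, [(theta, b) for b in blocks])  # Map phase
    total_err = sum(r[0] for r in results)
    total_grad = np.sum([r[1] for r in results], axis=0)
    return total_err, total_grad

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Partition a toy training set into blocks, one per Map task.
    blocks = [rng.normal(size=(100, DIM)) for _ in range(N_BLOCKS)]
    theta0 = rng.normal(scale=0.1, size=DIM * DIM)
    with Pool(N_BLOCKS) as pool:
        # L-BFGS repeatedly evaluates the distributed objective until it
        # converges, mirroring the iterative parameter updates in the abstract.
        result = minimize(objective, theta0, args=(blocks, pool),
                          jac=True, method="L-BFGS-B")
    print("converged:", result.success, "final objective:", result.fun)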

Keywords: semi-supervised recursive autoencoder; text sentiment analysis; parallel computing
Received: 2016-11-26

Parallel Algorithms for Text Sentiment Analysis Based on Deep Learning
ZHAI Donghai, HOU Jialin, LIU Yue. Parallel Algorithms for Text Sentiment Analysis Based on Deep Learning[J]. Journal of Southwest Jiaotong University, 2019, 54(3): 647-654.
Authors: ZHAI Donghai, HOU Jialin, LIU Yue
Abstract: When the training and test sets are large, the text sentiment analysis model based on the semi-supervised recursive autoencoder (Semi-Supervised RAE) suffers from a slow training rate and a slow output rate of test results. To solve these problems, corresponding parallel algorithms are proposed in this paper. For a large training set, a divide-and-conquer strategy is adopted: the data set is partitioned into blocks, each block is fed to a Map node that computes its error, and the errors of all blocks are stored in a buffer. Reduce nodes read the block errors from the buffer to compute the optimization objective function. Then the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm is called to update the parameter set, the updated parameter set is reloaded into the cluster, and the above process is iterated until the objective function converges, yielding an optimal parameter set. For a large test set, the cluster is initialized with the parameter set obtained above; the vector representation of each sentence is computed in the Map nodes and temporarily stored in the buffer, and the sentiment label of each sentence is then computed from its vector representation by the classifier in the Reduce node. Experimental results on the standard MR (movie review) corpus show that the accuracy of the proposed algorithm is 77.0%, almost the same as that of the original algorithm (77.3%), while on large training sets the training time decreases greatly as the number of compute nodes increases.
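The test phase described above can be sketched in the same way. The fragment below is an illustrative approximation rather than the paper's code: Map workers encode each test sentence into a fixed-length vector (a placeholder mean-pooling encoder stands in for the trained RAE), and the classification performed in the Reduce node is modelled by an assumed logistic unit that assigns a sentiment label to each vector.

# Minimal sketch of the parallel test phase under assumed placeholder
# components: mean pooling stands in for the RAE sentence encoding done
# on the Map nodes, and a logistic classifier stands in for the trained
# sentiment classifier applied in the Reduce node.
import numpy as np
from multiprocessing import Pool

DIM = 8  # toy word-vector dimensionality (assumed)

def encode_sentence(word_vectors):
    """Map step: produce a fixed-length vector for one sentence."""
    return np.mean(word_vectors, axis=0)      # placeholder for RAE encoding

def classify(sentence_vec, w, b):
    """Reduce step: sentiment label from a sentence vector."""
    score = 1.0 / (1.0 + np.exp(-(sentence_vec @ w + b)))
    return int(score > 0.5)                   # 1 = positive, 0 = negative

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy "test set": each sentence is an array of word vectors.
    sentences = [rng.normal(size=(rng.integers(3, 10), DIM)) for _ in range(20)]
    w, b = rng.normal(size=DIM), 0.0          # parameters from the training phase
    with Pool(4) as pool:
        vectors = pool.map(encode_sentence, sentences)   # Map: encode in parallel
    labels = [classify(v, w, b) for v in vectors]        # Reduce: label each sentence
    print(labels)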
Keywords: semi-supervised recursive autoencoder; text sentiment analysis; parallel computing