
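Chinese Word Segmentation Based on Representation Learning
Cite this article:LIU Chunli, LI Xiaoge, LIU Rui, FAN Xian, DU Liping. Chinese word segmentation based on representation learning[J]. Journal of Computer Applications, 2016, 36(10): 2794-2798.
Authors:LIU Chunli  LI Xiaoge  LIU Rui  FAN Xian  DU Liping
Affiliation:College of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an 710121, China
Funding:National Natural Science Foundation of China (61373116); Special Fund for Key Disciplines of Ordinary Universities of Shaanxi Province (112-1602); Graduate Innovation Fund of Xi'an University of Posts and Telecommunications (ZL2013-30).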
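Abstract:To improve the accuracy of Chinese word segmentation and the recognition rate of out-of-vocabulary (OOV) words, a Chinese word segmentation system based on character representation learning is proposed. First, the Skip-gram model maps each word in the text to a vector in a high-dimensional vector space. Next, the K-means clustering algorithm clusters the word vectors, and the clustering results are used as features for training a conditional random field (CRF) model. Finally, the trained model is used for word segmentation and OOV word recognition. The effects of the word-vector dimensionality, the number of clusters, and different clustering algorithms on segmentation are analyzed. Tests on the microblog evaluation corpus provided by the 4th Conference on Natural Language Processing and Chinese Computing (NLPCC2015) show that, without using external knowledge, the segmentation F-value and the OOV recognition rate reach 95.67% and 94.78%, respectively, demonstrating that adding character cluster features to the CRF model effectively improves segmentation performance on Chinese short texts.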
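The cluster-feature step described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the two-dimensional toy vectors stand in for trained Skip-gram vectors, the K-means routine is a plain Lloyd's iteration seeded with the first k points, and the feature names (`cluster`, `cluster-1`, `cluster+1`) are hypothetical, since the abstract does not give the CRF feature templates.

```python
import math

# Toy 2-D "character vectors" (made-up values). In the described system these
# would come from a trained Skip-gram model and have many more dimensions.
CHAR_VECTORS = {
    "我": (0.9, 0.1),
    "吃": (0.1, 0.9),
    "你": (0.8, 0.2),
    "喝": (0.2, 0.8),
}


def kmeans(vectors, k, iters=10):
    """Plain Lloyd's K-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Map every character to its cluster ID.
cluster_of = dict(zip(CHAR_VECTORS, kmeans(list(CHAR_VECTORS.values()), k=2)))


def char_features(sentence, i, cluster_of):
    """Hypothetical CRF feature dict for position i: the character itself
    plus the cluster IDs of the character and its two neighbours."""
    def cid(j):
        if 0 <= j < len(sentence):
            return str(cluster_of.get(sentence[j], "UNK"))  # unseen character
        return "PAD"  # position outside the sentence
    return {
        "char": sentence[i],
        "cluster": cid(i),
        "cluster-1": cid(i - 1),
        "cluster+1": cid(i + 1),
    }


if __name__ == "__main__":
    for i in range(len("我吃饭")):
        print(char_features("我吃饭", i, cluster_of))
```

Feature dictionaries of this shape, one per character, are the kind of input a CRF sequence labeller is typically trained on together with per-character segmentation tags; the tagging scheme is likewise not stated in the abstract.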

Keywords:representation learning  word vector  clustering  conditional random field  Chinese word segmentation
Received:2016-03-24
Revised:2016-06-21

Chinese word segment based on character representation learning
LIU Chunli, LI Xiaoge, LIU Rui, FAN Xian, DU Liping. Chinese word segment based on character representation learning[J]. Journal of Computer Applications, 2016, 36(10): 2794-2798.
Authors:LIU Chunli  LI Xiaoge  LIU Rui  FAN Xian  DU Liping
Affiliation:College of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an Shaanxi 710121, China
Abstract:To improve the accuracy and the Out Of Vocabulary (OOV) recognition rate of Chinese word segmentation, a Chinese word segmentation system based on a character representation learning method was proposed. Firstly, each word in the text was mapped to a vector in a high-dimensional vector space using the Skip-gram model. Then the K-means clustering algorithm was used to cluster the word vectors, and the clustering results were used as features for training a Conditional Random Field (CRF) model. Finally, the CRF model was used for word segmentation and OOV recognition. The influences of the word vector dimension, the number of clusters and different clustering algorithms on word segmentation were analyzed. Experiments were conducted on the microblog corpus of the 4th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC2015). Experimental results show that the proposed system can effectively improve Chinese short text segmentation performance without using external knowledge: the F-value and the OOV recognition rate reach 95.67% and 94.78% respectively.
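For readers unfamiliar with the two reported metrics, the following sketch shows one common way to compute a segmentation F-value and an OOV rate. It assumes the usual span-matching convention (a predicted word counts as correct only if both of its boundaries match the gold segmentation) and treats the OOV recognition rate as recall over gold words absent from the training vocabulary. The abstract does not state the exact evaluation script, and the example sentence and vocabulary are invented.

```python
def to_spans(words):
    """Convert a list of words into (start, end) character offsets."""
    spans, start = [], 0
    for w in words:
        spans.append((start, start + len(w)))
        start += len(w)
    return spans


def evaluate(gold, pred, train_vocab):
    """Return (precision, recall, F-value, OOV recall) for one sentence."""
    gold_spans = to_spans(gold)
    pred_spans = set(to_spans(pred))
    # A gold word is recovered only if the same span appears in the prediction.
    correct = sum(1 for s in gold_spans if s in pred_spans)
    precision = correct / len(pred)
    recall = correct / len(gold)
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    # OOV words: gold words never seen in the training vocabulary.
    oov_spans = [s for w, s in zip(gold, gold_spans) if w not in train_vocab]
    oov_hits = sum(1 for s in oov_spans if s in pred_spans)
    oov_recall = oov_hits / len(oov_spans) if oov_spans else 0.0
    return precision, recall, f_value, oov_recall


if __name__ == "__main__":
    gold = ["我", "喜欢", "微博"]          # invented reference segmentation
    pred = ["我", "喜", "欢", "微博"]      # invented system output
    vocab = {"我", "喜欢"}                 # invented training vocabulary
    print(evaluate(gold, pred, vocab))
```

In practice the counts would be accumulated over the whole test corpus before the ratios are taken, rather than averaged per sentence.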
Keywords:representation learning  word vector  clustering  Conditional Random Field (CRF)  Chinese word segmentation