A sequence labeling Chinese word segmentation method based on LSTM networks
Cite this article: Ren Zhihui, Xu Haoyu, Feng Songlin, Zhou Han, Shi Jun. A sequence labeling Chinese word segmentation method based on LSTM networks[J]. Application Research of Computers, 2017, 34(5).
Authors: Ren Zhihui, Xu Haoyu, Feng Songlin, Zhou Han, Shi Jun
Affiliations: Shanghai Advanced Research Institute, Chinese Academy of Sciences; Shanghai University, Shanghai Advanced Research Institute, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Shanghai Advanced Research Institute, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Shanghai Advanced Research Institute, Chinese Academy of Sciences, School of Communication and Information Engineering, Shanghai University
Funding: National Natural Science Foundation of China (61471231); Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA06010301)
Abstract: Currently, the dominant methods for Chinese word segmentation are character-tagging methods built on traditional machine learning. These methods have three disadvantages: (1) features must be manually configured and extracted from Chinese text; (2) the dictionary is high-dimensional; and (3) training takes a long time when only CPUs are used. To address these problems, this paper proposes an improved method based on the Long Short-Term Memory (LSTM) network model, which uses different tag sets and adds pre-trained character embeddings to perform Chinese word segmentation. Experiments were conducted on corpora commonly used in Chinese word segmentation evaluation, with comparisons against the best Bakeoff results and state-of-the-art methods. The results show that the LSTM-based methods outperform traditional machine learning methods, and that the six-tag set combined with pre-trained character embeddings achieves the best segmentation performance among the settings compared. In addition, using GPUs greatly reduces the training time of the deep neural network model. The LSTM-based method can also be readily applied to other sequence labeling tasks in natural language processing (NLP).

Keywords: Chinese word segmentation; LSTM; character embedding
Received: 2016/3/25
Revised: 2017/3/12
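
The abstract treats segmentation as per-character tagging under different tag sets but does not list the tags themselves. The sketch below is an illustration only, not the authors' code: it assumes the widely used four-tag set {B, M, E, S} and the six-tag set {B, B2, B3, M, E, S}, and the function names `tag_word`, `tag_sentence`, and `tags_to_words` are hypothetical. It shows how a segmented sentence becomes a tag sequence (the training target for a sequence labeler such as an LSTM) and how predicted tags are turned back into words.

```python
from typing import List, Sequence, Tuple

# Assumed tag inventories; the abstract names "six-tag set" but not its members.
FOUR_TAGS = ("B", "M", "E", "S")
SIX_TAGS = ("B", "B2", "B3", "M", "E", "S")


def tag_word(word: str, scheme: str = "six") -> List[str]:
    """Return one word-position tag per character of a single word."""
    n = len(word)
    if n == 0:
        return []
    if n == 1:
        return ["S"]  # single-character word
    if scheme == "four":
        return ["B"] + ["M"] * (n - 2) + ["E"]
    if scheme == "six":
        # The 2nd and 3rd characters get their own tags unless they end the word.
        prefix = ["B", "B2", "B3"][: min(n - 1, 3)]
        return prefix + ["M"] * (n - 1 - len(prefix)) + ["E"]
    raise ValueError(f"unknown scheme: {scheme!r}")


def tag_sentence(words: Sequence[str], scheme: str = "six") -> Tuple[List[str], List[str]]:
    """Flatten segmented words into parallel character and tag lists."""
    chars: List[str] = []
    tags: List[str] = []
    for word in words:
        chars.extend(word)
        tags.extend(tag_word(word, scheme))
    return chars, tags


def tags_to_words(chars: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Recover words from tags by cutting after every E or S tag."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


if __name__ == "__main__":
    chars, tags = tag_sentence(["中国科学院", "研究", "分词", "法"], scheme="six")
    print(list(zip(chars, tags)))
    print(tags_to_words(chars, tags))
```

Under these assumptions, the five-character word 中国科学院 is tagged B, B2, B3, M, E in the six-tag scheme and B, M, M, M, E in the four-tag scheme. The extra B2 and B3 tags give the model finer information about character positions near the start of longer words.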