
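An Exploration of Chinese Word Segmentation Algorithms Based on Representation Learning (基于表示学习的中文分词算法探索)
Cite this article: LAI Siwei, XU Liheng, CHEN Yubo, LIU Kang, ZHAO Jun. 基于表示学习的中文分词算法探索 [J]. 中文信息学报 (Journal of Chinese Information Processing), 2013, 27(5): 8-15.
Authors: LAI Siwei, XU Liheng, CHEN Yubo, LIU Kang, ZHAO Jun
Affiliation: National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190
Funding: National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program); National Basic Research Program of China (973 Program); Open Project of the Beijing Key Laboratory of Internet Culture and Digital Dissemination Research
Abstract: Word segmentation is a key foundational technology in Chinese natural language processing. The current mainstream approach is to use character-based statistical machine learning to learn where word boundaries fall. However, traditional machine learning methods rely heavily on manually designed features, and verifying the effectiveness of those features requires repeated trial and revision, which is time-consuming and laborious. The rise of neural-network-based representation learning has made it possible to learn features automatically. This paper explores a Chinese word segmentation method based on representation learning. It first learns semantic vectors for Chinese characters from a large-scale corpus in an unsupervised manner, and then applies these character vectors to supervised, neural-network-based Chinese word segmentation. Experiments show that representation learning is an effective approach to Chinese word segmentation; however, owing to limitations such as corpus size, it cannot yet fully replace traditional supervised machine learning methods based on manually designed features.

Keywords: representation learning; Chinese word segmentation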

Chinese Word Segment Based on Character Representation Learning
LAI Siwei, XU Liheng, CHEN Yubo, LIU Kang, ZHAO Jun. Chinese Word Segment Based on Character Representation Learning [J]. Journal of Chinese Information Processing, 2013, 27(5): 8-15.
Authors: LAI Siwei, XU Liheng, CHEN Yubo, LIU Kang, ZHAO Jun
Affiliation: National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Abstract: Word segmentation is a fundamental technology in Chinese natural language processing. Character-based statistical machine learning is currently the mainstream approach to Chinese word segmentation. However, conventional machine learning methods rely heavily on manually designed features, and modifying those features and verifying their effectiveness is labor-intensive. With the rapid development of neural-network-based representation learning, learning features automatically has become feasible. This paper investigates a Chinese word segmentation method based on representation learning. We first learn embedding vectors for Chinese characters from a large corpus in an unsupervised manner, and then apply them to supervised, neural-network-based Chinese word segmentation. Experimental results show that representation learning is an effective method for Chinese word segmentation. However, owing to limitations such as corpus size, it still cannot replace conventional machine learning methods based on manually designed features.
Keywords: representation learning; Chinese word segmentation
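
Note: the abstract describes character-based segmentation as learning to decide word boundaries, but it does not say how boundaries are encoded. A minimal sketch of one common encoding follows, assuming a B/M/E/S tag per character (begin, middle, end, single). The tag set and the helper names `words_to_tags` and `tags_to_words` are illustrative assumptions, not taken from the paper, and the sketch covers only the label encoding, not the neural network or the character embeddings.

```python
def words_to_tags(words):
    """Turn a segmented sentence into one boundary tag per character.

    Assumed tag set (not stated in the abstract):
    B = word begin, M = word middle, E = word end, S = single-character word.
    """
    tags = []
    for word in words:
        if not word:
            continue  # skip empty strings rather than emit bogus tags
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their predicted boundary tags."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


if __name__ == "__main__":
    print(words_to_tags(["中文", "分词"]))                     # ['B', 'E', 'B', 'E']
    print(tags_to_words("中文分词", ["B", "E", "B", "E"]))      # ['中文', '分词']
```

Under this framing, a supervised tagger of any kind (feature-based or neural) only has to predict one tag per character; the segmentation is then recovered deterministically from the tag sequence.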
This article is indexed in 万方数据 (Wanfang Data) and other databases.