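An Integrated Study of Automatic Sentence Segmentation and Lexical Analysis for Ancient Chinese Based on BiLSTM-CRF
Cite this article: CHENG Ning, LI Bin, GE Sijia, HAO Xingyue, FENG Minxuan. An Integrated Study of Automatic Sentence Segmentation and Lexical Analysis for Ancient Chinese Based on BiLSTM-CRF[J]. Journal of Chinese Information Processing, 2020, 34(4): 1-9.
Authors: CHENG Ning  LI Bin  GE Sijia  HAO Xingyue  FENG Minxuan
Affiliation: 1. School of Chinese Language and Literature, Nanjing Normal University, Nanjing, Jiangsu 210097, China;
2. Institute for Quantitative Social Science, Harvard University, Cambridge, MA 02138, USA
Funding: National Natural Science Foundation of China (71673143); Research Projects of the State Language Commission (WT135-24, YB135-61); Jiangsu Provincial Program for Outstanding Innovation Teams in Philosophy and Social Sciences at Universities (2017STD006)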
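Abstract: The basic tasks of ancient Chinese information processing include automatic sentence segmentation, word segmentation, part-of-speech tagging, and named entity recognition. A large body of ancient Chinese text has no punctuation or sentence breaks, so lexical analysis and related tasks must first be built on sentence segmentation. Processing the text in separate steps, however, tends to propagate errors from one stage to the next. This paper designs and implements an integrated annotation method for sentence segmentation and lexical analysis of ancient Chinese. Using a BiLSTM-CRF neural network model on four test sets from different historical periods, it evaluates how the model performs on sentence segmentation and lexical analysis at different annotation levels, and how well it generalizes to texts from different eras. The results show that the integrated annotation method improves F1 on sentence segmentation, word segmentation, and part-of-speech tagging. Across all test sets, F1 reaches 78.95% for sentence segmentation (an average gain of 3.5%), 85.73% for word segmentation (an average gain of 0.18%), and 72.65% for part-of-speech tagging (an average gain of 0.35%).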

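Keywords: sentence segmentation of ancient Chinese  word segmentation  part-of-speech tagging  BiLSTM-CRF  ancient Chinese information processing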

A Joint Model of Automatic Sentence Segmentation and Lexical Analysis for Ancient Chinese Based on BiLSTM-CRF Model
CHENG Ning, LI Bin, GE Sijia, HAO Xingyue, FENG Minxuan. A Joint Model of Automatic Sentence Segmentation and Lexical Analysis for Ancient Chinese Based on BiLSTM-CRF Model[J]. Journal of Chinese Information Processing, 2020, 34(4): 1-9.
Authors: CHENG Ning  LI Bin  GE Sijia  HAO Xingyue  FENG Minxuan
Affiliation: 1. School of Chinese Language and Literature, Nanjing Normal University, Nanjing, Jiangsu 210097, China;
2. Institute for Quantitative Social Science, Harvard University, Cambridge, MA 02138, USA
Abstract: The basic tasks of ancient Chinese information processing include automatic sentence segmentation, word segmentation, part-of-speech tagging, and named entity recognition. To avoid the error accumulation of pipeline processing, this paper proposes a joint approach to sentence segmentation and lexical analysis. A BiLSTM-CRF neural network model is evaluated on four test sets from different historical periods to measure its sentence segmentation and lexical analysis performance at different annotation levels, as well as its ability to generalize across eras. Experiments show that the joint model improves the F1-score on all three tasks: 78.95% for sentence segmentation (an average increase of 3.5%), 85.73% for word segmentation (an average increase of 0.18%), and 72.65% for part-of-speech tagging (an average increase of 0.35%).
Keywords: sentence segmentation of ancient Chinese  word segmentation  part-of-speech tagging  BiLSTM-CRF  ancient Chinese information processing
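
The abstract does not spell out the tag set used for joint annotation, so the following Python sketch only illustrates the general idea of folding sentence boundaries, word boundaries, and part-of-speech into a single per-character label that a sequence tagger such as BiLSTM-CRF could predict. The tag layout (`position-POS-flag`), the function names `encode` and `decode`, and the POS tags in the example are all assumptions made for illustration, not the authors' scheme.

```python
from typing import List, Tuple

Word = Tuple[str, str]        # (word, POS tag); the POS tag must not contain "-"
Sentence = List[Word]

def encode(sentences: List[Sentence]) -> Tuple[List[str], List[str]]:
    """Flatten annotated sentences into characters and hypothetical joint tags.

    Each tag is "<position>-<POS>-<flag>":
      position: B/I/E/S (begin, inside, end of a word, or single-character word)
      flag:     "L" if a sentence break follows this character, otherwise "O"
    """
    chars: List[str] = []
    tags: List[str] = []
    for sent in sentences:
        for w_idx, (word, pos) in enumerate(sent):
            for c_idx, ch in enumerate(word):
                if len(word) == 1:
                    position = "S"
                elif c_idx == 0:
                    position = "B"
                elif c_idx == len(word) - 1:
                    position = "E"
                else:
                    position = "I"
                is_last = w_idx == len(sent) - 1 and c_idx == len(word) - 1
                chars.append(ch)
                tags.append(f"{position}-{pos}-{'L' if is_last else 'O'}")
    return chars, tags

def decode(chars: List[str], tags: List[str]) -> List[Sentence]:
    """Rebuild sentences and (word, POS) pairs from joint tags.

    Assumes a well-formed tag sequence; real model output may be
    inconsistent and would need extra repair logic.
    """
    sentences: List[Sentence] = []
    current_sentence: Sentence = []
    current_word = ""
    for ch, tag in zip(chars, tags):
        position, pos, flag = tag.split("-")
        current_word += ch
        if position in ("E", "S"):        # the word ends at this character
            current_sentence.append((current_word, pos))
            current_word = ""
        if flag == "L":                   # the sentence ends at this character
            sentences.append(current_sentence)
            current_sentence = []
    if current_sentence:                  # trailing words without a final break
        sentences.append(current_sentence)
    return sentences

# Illustrative input only; the POS tags here are made up for the example.
example: List[Sentence] = [
    [("學", "v"), ("而", "c"), ("時", "n"), ("習", "v"), ("之", "r")],
    [("不", "d"), ("亦", "d"), ("說", "v"), ("乎", "y")],
]
chars, tags = encode(example)
restored = decode(chars, tags)
```

With one label per character, a single tagging pass yields all three layers at once, which is the property the abstract credits for avoiding error propagation between separate steps.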