A Joint Model of Automatic Sentence Segmentation and Lexical Analysis for Ancient Chinese Based on BiLSTM-CRF Model (基于BiLSTM-CRF的古汉语自动断句与词法分析一体化研究)


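Cite as:	CHENG Ning, LI Bin, GE Sijia, HAO Xingyue, FENG Minxuan. A Joint Model of Automatic Sentence Segmentation and Lexical Analysis for Ancient Chinese Based on BiLSTM-CRF Model [J]. Journal of Chinese Information Processing (中文信息学报), 2020, 34(4): 1-9.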

Authors:	CHENG Ning (程宁), LI Bin (李斌), GE Sijia (葛四嘉), HAO Xingyue (郝星月), FENG Minxuan (冯敏萱)

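Affiliations:	1. School of Chinese Language and Literature, Nanjing Normal University (南京师范大学文学院), Nanjing, Jiangsu 210097, China; 2. Institute for Quantitative Social Science, Harvard University (哈佛大学计量社会科学研究所), Cambridge, MA 02138, USA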

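Funding:	National Natural Science Foundation of China (71673143); State Language Commission research projects (WT135-24, YB135-61); Jiangsu Provincial Program for Outstanding Innovation Teams in Philosophy and Social Sciences at Universities (2017STD006)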

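Abstract:	The basic tasks of ancient Chinese information processing include automatic sentence segmentation, word segmentation, part-of-speech tagging, and named entity recognition. A large body of ancient Chinese text has no punctuation or sentence breaks, so lexical analysis must first be built on sentence segmentation. Step-by-step processing, however, tends to propagate errors from one stage to the next. This paper designs and implements a joint tagging method for sentence segmentation and lexical analysis of ancient Chinese. Using a BiLSTM-CRF neural network model on four test sets drawn from different historical periods, it evaluates how well the model performs sentence segmentation and lexical analysis at different tagging levels, and how well it generalizes to texts of different periods. The results show that the joint tagging method improves the F₁ score on sentence segmentation, word segmentation, and part-of-speech tagging. Across the test sets, F₁ reaches 78.95% for sentence segmentation (an average gain of 3.5%), 85.73% for word segmentation (an average gain of 0.18%), and 72.65% for part-of-speech tagging (an average gain of 0.35%).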
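Keywords:	sentence segmentation of ancient Chinese; word segmentation; part-of-speech tagging; BiLSTM-CRF; ancient Chinese information processing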
A Joint Model of Automatic Sentence Segmentation and Lexical Analysis for Ancient Chinese Based on BiLSTM-CRF Model

CHENG Ning, LI Bin, GE Sijia, HAO Xingyue, FENG Minxuan. A Joint Model of Automatic Sentence Segmentation and Lexical Analysis for Ancient Chinese Based on BiLSTM-CRF Model [J]. Journal of Chinese Information Processing, 2020, 34(4): 1-9.

Authors:	CHENG Ning, LI Bin, GE Sijia, HAO Xingyue, FENG Minxuan

Affiliation:	1.School of Chinese Language and Literature, Nanjing Normal University, Nanjing, Jiangsu 210097, China; 2.Institute for Quantitative Social Science, Harvard University, Cambridge, MA 02138, USA

Abstract:	The basic tasks of ancient Chinese information processing include automatic sentence segmentation, word segmentation, part-of-speech tagging, and named entity recognition. To avoid the error propagation of pipeline processing, this paper proposes a joint approach to sentence segmentation and lexical analysis. A BiLSTM-CRF neural network model is evaluated on four test sets from different historical periods to measure its performance on sentence segmentation and lexical analysis at different tagging levels, as well as its ability to generalize across periods. Experiments show that the joint model improves the F₁ score on all three tasks: 78.95% for sentence segmentation (an average increase of 3.5%), 85.73% for word segmentation (an average increase of 0.18%), and 72.65% for part-of-speech tagging (an average increase of 0.35%).

Keywords:	sentence segmentation of ancient Chinese; word segmentation; part-of-speech tagging; BiLSTM-CRF; ancient Chinese information processing
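
The abstract describes tagging sentence breaks, word boundaries, and parts of speech jointly instead of in a pipeline. The abstract does not give the paper's actual tag set, so the sketch below is only one plausible way to set up such a scheme, under stated assumptions: each character receives a single hypothetical tag of the form `position-POS-break`, where `position` is B/I/E/S for the character's place in its word, `POS` is the word's part of speech, and `break` is `L` if a sentence break follows the character and `O` otherwise. The function names `encode` and `decode`, the tag format, and the POS labels in the example are all illustrative and not taken from the paper. A sequence labeler such as a BiLSTM-CRF would be trained to predict these tags; the model itself is omitted here.

```python
from typing import List, Tuple

Word = Tuple[str, str]  # (word, part-of-speech tag); POS must not contain "-"


def encode(clauses: List[List[Word]]) -> Tuple[str, List[str]]:
    """Flatten segmented, POS-tagged clauses into an unpunctuated string
    plus one joint tag per character (hypothetical scheme, see lead-in)."""
    chars: List[str] = []
    tags: List[str] = []
    for clause in clauses:
        for w_idx, (word, pos) in enumerate(clause):
            is_last_word = w_idx == len(clause) - 1
            for c_idx, ch in enumerate(word):
                is_last_char = c_idx == len(word) - 1
                if len(word) == 1:
                    position = "S"   # single-character word
                elif c_idx == 0:
                    position = "B"   # word-initial character
                elif is_last_char:
                    position = "E"   # word-final character
                else:
                    position = "I"   # word-internal character
                # "L" marks a sentence break after this character.
                brk = "L" if (is_last_word and is_last_char) else "O"
                chars.append(ch)
                tags.append(f"{position}-{pos}-{brk}")
    return "".join(chars), tags


def decode(text: str, tags: List[str]) -> List[List[Word]]:
    """Recover clauses, words, and POS tags from characters and joint tags."""
    clauses: List[List[Word]] = []
    current: List[Word] = []
    buffer = ""
    for ch, tag in zip(text, tags):
        position, pos, brk = tag.split("-")
        buffer += ch
        if position in ("E", "S"):   # a word ends at this character
            current.append((buffer, pos))
            buffer = ""
        if brk == "L":               # a clause ends at this character
            clauses.append(current)
            current = []
    if current:                      # trailing words without a final break
        clauses.append(current)
    return clauses


# Illustrative input: 君子食无求饱，居无求安 (Analects), with made-up POS labels.
example = [
    [("君子", "n"), ("食", "v"), ("无", "v"), ("求", "v"), ("饱", "a")],
    [("居", "v"), ("无", "v"), ("求", "v"), ("安", "a")],
]
text, tags = encode(example)
print(text)
print(tags)
```

Because every character carries all three kinds of information in a single label, one decoding pass yields sentence breaks, word boundaries, and POS tags together, which is the sense in which a joint scheme avoids passing errors between separate stages. The cost is a larger label set than any single task would need.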

