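Knowledge Representation and Automatic Sentence Segmentation of Ancient Chinese Based on Deep Language Models
Cite this article: HU Renfen, LI Shen, ZHU Yuchen. Knowledge Representation and Automatic Sentence Segmentation of Ancient Chinese Based on Deep Language Models[J]. Journal of Chinese Information Processing (中文信息学报), 2021, 35(4): 8-15.
Authors: HU Renfen  LI Shen  ZHU Yuchen
Affiliations: 1. Institute of Chinese Information Processing, Beijing Normal University, Beijing 100875;
2. College of Chinese Language and Culture, Beijing Normal University, Beijing 100875;
3. School of Chinese Language and Literature, Beijing Normal University, Beijing 100875
Funding: National Natural Science Foundation of China (62006021); Ministry of Education Humanities and Social Sciences Youth Fund (18YJC751073); National Social Science Fund of China (18ZDA238)
Abstract: Segmenting classical Chinese texts into sentences requires not only the semantics and context of the text at hand but also historical and cultural common knowledge, so it places high demands on expert knowledge. This paper proposes a knowledge representation method for ancient Chinese based on a deep language model (BERT) and, on that basis, builds a high-accuracy automatic sentence segmentation model using conditional random fields and convolutional neural networks. On three genres, poems (shi), lyrics (ci) and prose, the model's segmentation F1 exceeds 99%, 95% and 92%, respectively. On lyrics and prose, where expression is more flexible, the model improves F1 by more than 10% over the traditional bidirectional recurrent neural network method. The experimental data show that the model captures the rhythm and prosody of poetic expression well and makes full use of contextual information to encode word order, syntax, semantics and context. In a further case application, the method also performs well on difficult, erroneously segmented passages in published editions of ancient books.

Keywords: ancient Chinese  automatic sentence segmentation  deep language model
Received: 2019-09-09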

Knowledge Representation and Sentence Segmentation of Ancient Chinese Based on Deep Language Models
HU Renfen,LI Shen,ZHU Yuchen.Knowledge Representation and Sentence Segmentation of Ancient Chinese Based on Deep Language Models[J].Journal of Chinese Information Processing,2021,35(4):8-15.
Authors:HU Renfen  LI Shen  ZHU Yuchen
Affiliation:1.Institute of Chinese Information Processing, Beijing Normal University, Beijing 100875, China;2.College of Chinese Language and Culture, Beijing Normal University, Beijing 100875, China;3.School of Chinese Language and Literature, Beijing Normal University, Beijing 100875, China
Abstract:Sentence segmentation of ancient Chinese texts is difficult even for experts, since it relies not only on sentence meaning and contextual information but also on historical and cultural knowledge. This paper proposes building a knowledge representation of ancient Chinese with BERT, a deep language model, and then constructing a sentence segmentation model with a Conditional Random Field and Convolutional Neural Networks. The model achieves significant improvements in all three ancient text styles. It reaches F1 scores above 99%, 95% and 92% for poems, lyrics and prose, respectively, outperforming Bi-GRU by more than 10% on lyrics and prose, which are more difficult to segment. In further case studies, the method also achieves good results on difficult cases in published ancient books.
Keywords:ancient Chinese  automatic sentence segmentation  deep language model  
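
The abstract reports segmentation quality as F1 but does not spell out the labeling scheme. One common way to frame the task, assumed here purely for illustration, is per-character tagging: each character is labeled 1 if a break follows it and 0 otherwise, and F1 is computed over the predicted break positions. The helper names `to_labels` and `f1`, the punctuation set, and the binary tag scheme are all hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: the binary tagging scheme, the punctuation set,
# and both helper names are assumptions, not the authors' implementation.
PUNCT = set(",。!?;:、")


def to_labels(punctuated):
    """Strip punctuation and label each remaining character:
    1 if a break follows it in the punctuated text, else 0."""
    chars, labels = [], []
    for ch in punctuated:
        if ch in PUNCT:
            if labels:              # ignore punctuation before any character
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels


def f1(gold, pred):
    """F1 over break positions (label 1) for two equal-length label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    if tp == 0:
        return 0.0
    precision = tp / sum(pred)
    recall = tp / sum(gold)
    return 2 * precision * recall / (precision + recall)


text, gold = to_labels("学而时习之,不亦说乎。")
# A hypothetical model output with one spurious break after the 7th character.
pred = [0, 0, 0, 0, 1, 0, 1, 0, 1]
score = f1(gold, pred)
```

In this toy case both gold breaks are found and one extra break is predicted, so precision is 2/3, recall is 1, and F1 is 0.8. A real system would replace `pred` with the output of a tagger such as the BERT-based model the abstract describes.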