Automatic Recognition of Sentence Boundary Based on Clause Complex
Cite this article: HE Xiaowen, LUO Zhiyong, HU Zijuan, WANG Ruiqi. Automatic Recognition of Sentence Boundary Based on Clause Complex[J]. Journal of Chinese Information Processing, 2021, 35(5): 1-8.
Authors: HE Xiaowen  LUO Zhiyong  HU Zijuan  WANG Ruiqi
Affiliation: School of Information Science, Beijing Language and Culture University, Beijing 100083, China
Funding: Beijing Language and Culture University Graduate Innovation Fund (Fundamental Research Funds for the Central Universities) (19YCX124); National Natural Science Foundation of China (62076037)
Abstract: The grammatical structure of natural language text spans several levels: morphemes, words, phrases, clauses, clause complexes, and discourse. Processing techniques for morphemes, words, and phrases are relatively mature, but the sentence still lacks a generally accepted definition suited to language information processing. This paper re-examines the definition of the sentence in linguistics and the problem of sentence segmentation in natural language processing, and proposes the task of Chinese sentence segmentation. Drawing on clause complex theory, it defines a sentence as the smallest sequence of punctuation clauses that is self-sufficient in its topics, that is, a self-sufficient topic structure, and it designs and implements a BERT-based boundary recognition model. Experimental results show that the model reaches an accuracy of 88.37% and an F1 score of 83.73% on automatic sentence boundary recognition, outperforming mechanical segmentation by different punctuation marks.

Keywords: sentence  clause complex  sentence boundary recognition
Received: 2019-09-18
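
The abstract compares the model against mechanical segmentation by punctuation marks and reports accuracy and F1. As a rough illustration of what such a baseline and its scoring could look like, the sketch below splits text into punctuation clauses, marks a clause as a sentence boundary when it ends with a sentence-final mark, and scores the labels against a gold annotation. The mark sets, the 0/1 labeling scheme, the function names, and the gold labels in the usage lines are all assumptions made for illustration; none of them come from the paper, and the BERT-based model itself is not shown.

```python
import re

# Assumption: these marks delimit punctuation clauses, and the second set
# ends a sentence under the mechanical baseline. Neither set is from the paper.
CLAUSE_MARKS = "，；。！？"
SENTENCE_MARKS = "。！？"


def split_punctuation_clauses(text):
    """Split text into clauses, each keeping its trailing punctuation mark."""
    pattern = "[^{0}]+[{0}]?".format(CLAUSE_MARKS)
    return re.findall(pattern, text)


def mechanical_boundaries(clauses, sentence_marks=SENTENCE_MARKS):
    """Label a clause 1 if it ends with a sentence-final mark, else 0."""
    return [1 if c[-1] in sentence_marks else 0 for c in clauses]


def accuracy_and_f1(gold, pred):
    """Return (accuracy, F1), treating label 1 (boundary) as the positive class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    accuracy = correct / len(gold) if gold else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Usage with a made-up example; the gold labels are hypothetical.
clauses = split_punctuation_clauses("他走进房间，看了看四周。然后坐下，开始读书。")
predicted = mechanical_boundaries(clauses)
hypothetical_gold = [0, 0, 0, 1]
print(clauses, predicted, accuracy_and_f1(hypothetical_gold, predicted))
```

A baseline of this kind looks only at the mark that ends each clause, so it cannot tell whether a following clause still depends on an earlier topic; that is the kind of case the topic-based definition of the sentence is meant to capture.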