
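A Fusion Model for Chinese Electronic Medical Record Parsing
Cite this article: Jiang Zhipeng, Guan Yi. A fusion model for Chinese electronic medical record parsing [J]. Acta Automatica Sinica (自动化学报), 2019, 45(2): 276-288.
Authors: Jiang Zhipeng (蒋志鹏), Guan Yi (关毅)
Affiliation: 1. Web Intelligence Laboratory, Language Technology Center, Harbin Institute of Technology, Harbin 150001
Funding: National Natural Science Foundation of China (71531007)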
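Abstract: Full parsing is an important structuring step in natural language processing (NLP). Because syntactically annotated corpora for Chinese electronic medical records (CEMR) are scarce, there has so far been no research on full parsing for CEMR. Targeting the strongly patterned sublanguage characteristics of CEMR, this paper is the first to formalize the patterns reused in CEMR as tree fragments, and it proposes a fusion model that combines data-oriented parsing (DOP) with hierarchical parsing. In the tree fragment extraction stage, more efficient extraction algorithms for standard tree fragments and partial tree fragments are proposed. They respectively solve the repeated-comparison problem for standard tree fragments and the inefficiency of the quadratic tree kernel (QTK), and they yield a standard tree fragment set and a partial tree fragment set. Based on these two sets, a mixed word and part-of-speech matching strategy and a maximal tree fragment combination algorithm are proposed to improve the DOP model, alleviating the noise introduced by invalid tree fragments. Experimental results show that the fusion model effectively improves parsing of CEMR: with a small amount of annotated data its F1 score reaches 80.87%, currently the highest, and in cross-department parsing it outperforms the Stanford parser and the Berkeley parser by more than 2%.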

Keywords: Chinese electronic medical record; full parsing; data-oriented parsing; hierarchical parsing
Received: 2017-04-25

A Fusion Model for Chinese Electronic Medical Record Parsing
Affiliation: 1. Web Intelligence Laboratory, Language Technology Center, Harbin Institute of Technology, Harbin 150001; 2. Changan Communication Technology Co., LTD, Beijing 102209
Abstract: Full parsing is an important structuring step in natural language processing (NLP). However, no full-parsing research exists for Chinese electronic medical records (CEMR), because syntactically annotated CEMR corpora are lacking. To exploit the strongly patterned sublanguage of CEMR, we formalize its reused patterns as tree fragments for the first time and propose a model that integrates data-oriented parsing (DOP) with hierarchical parsing. In the tree fragment extraction stage, we propose a more efficient standard tree fragment extraction algorithm that avoids repeated comparison of standard tree fragments, and a partial tree fragment extraction algorithm that replaces the inefficient quadratic tree kernel (QTK) algorithm. Together they produce a standard tree fragment set and a partial tree fragment set. Based on these two sets, we propose a strategy that matches words and part-of-speech (POS) tags jointly and a maximal tree fragment combination algorithm; both improve DOP and alleviate the noise caused by invalid tree fragments. Experimental results show that the fusion model effectively improves parsing of CEMR. Trained on a small annotated corpus, it reaches an F1 score of 80.87%, the highest reported so far, and in cross-department parsing it outperforms the Stanford and Berkeley parsers by more than 2%.
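Keywords: Chinese electronic medical record (CEMR); full parsing; data-oriented parsing (DOP); hierarchical parsing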
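
The abstract does not spell out the extraction algorithms, so the following is only a rough illustration of what "reused patterns as tree fragments" can mean in practice. It uses the textbook Collins-Duffy subset-tree recursion to count matching fragment occurrences between two parse trees by comparing every pair of nodes, which is the kind of pairwise cost a faster extraction algorithm would try to avoid. It is not the paper's algorithm. The tuple encoding of trees, the function names, and the two example sentences are all assumptions made for illustration.

```python
# Illustrative sketch only: NOT the extraction algorithm from the paper.
# A tree is a nested tuple (label, child, child, ...); a word is a plain str.
# The encoding, names, and example sentences below are hypothetical.

def child_signature(child):
    # Tag words and nonterminal labels differently so that a word spelled
    # "NP" can never match a phrase labelled NP.
    return ("word", child) if isinstance(child, str) else ("label", child[0])

def production(node):
    # The grammar rule applied at this node, e.g. VP -> VV NP.
    label, *children = node
    return (label, tuple(child_signature(c) for c in children))

def nodes(tree):
    # Yield every nonterminal node of the tree, root first.
    yield tree
    for child in tree[1:]:
        if not isinstance(child, str):
            yield from nodes(child)

def common_rooted(n1, n2):
    # Number of fragments rooted at n1 that also occur rooted at n2.
    if production(n1) != production(n2):
        return 0
    count = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        if not isinstance(c1, str):
            # Each nonterminal child is either left unexpanded (the 1)
            # or expanded by any fragment the two children share.
            count *= 1 + common_rooted(c1, c2)
    return count

def common_fragments(t1, t2):
    # Sum over all node pairs: counts matching fragment occurrences.
    return sum(common_rooted(a, b) for a in nodes(t1) for b in nodes(t2))

# Two hypothetical record sentences that differ only in the symptom noun.
fever = ("S", ("NP", ("NN", "patient")),
              ("VP", ("VV", "reports"), ("NP", ("NN", "fever"))))
cough = ("S", ("NP", ("NN", "patient")),
              ("VP", ("VV", "reports"), ("NP", ("NN", "cough"))))

print(common_fragments(fever, cough))  # 26
```

In this toy pair, every fragment except those containing the final noun is shared, which mirrors the idea that templated record sentences reuse large pieces of structure. The count is over pairs of occurrences, so a fragment that appears twice in each tree contributes four times.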