Research on the Missing-Label Problem in Medical Record Text under Distant Supervision (基于远程监督的病历文本漏标问题研究)

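Cite this article:	YANG Yifan, SHI Miaoyuan, MIAO Qingliang, LI Maolong. Research on the Missing-Label Problem in Medical Record Text under Distant Supervision[J]. Journal of Chinese Information Processing (中文信息学报), 2022, 36(8): 73-80.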

Authors:	YANG Yifan (杨一帆), SHI Miaoyuan (施淼元), MIAO Qingliang (缪庆亮), LI Maolong (李茂龙)

Affiliation:	AI Speech Co., Ltd. (思必驰科技股份有限公司), Suzhou, Jiangsu 215000, China

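Abstract:	Healthcare has long been a topic of wide public interest, and techniques for automatically extracting information from medical record text are becoming increasingly important. Manual annotation of medical data is costly, so large-scale labeled corpora are hard to obtain. One way to address the lack of labeled corpora is dictionary-based distant supervision. However, the labeled data produced by distant supervision is of low quality, which severely degrades model performance. This paper studies how to alleviate the missing-label (unlabeled entity) problem introduced by distant supervision. It improves model generalization by augmenting the data and by combining a span-based named entity recognition model with negative sampling, and it resolves conflicts between recognized entities by selecting a globally optimal node set. Experiments show that data augmentation and global optimal node set selection each yield a stable improvement of about 0.5%, while negative sampling yields improvements ranging from 5% to 10%.
Keywords:	named entity recognition; distant supervision; missing labels; data augmentation; negative sampling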
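As a rough illustration of why dictionary-based distant supervision leaves entities unlabeled, and how negative sampling over spans can limit the damage, here is a minimal sketch. It is not the authors' implementation: the lexicon, the example sentence, the function names `distant_label` and `sample_negatives`, the maximum span length, and the sampling ratio are all hypothetical choices made for this example.

```python
import random

def distant_label(tokens, lexicon, max_len=4):
    """Label every span whose text appears in the lexicon.

    Returns {(start, end): entity_type} with `end` exclusive.
    Entities missing from the lexicon are silently left unlabeled.
    """
    labels = {}
    n = len(tokens)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            text = " ".join(tokens[i:j])
            if text in lexicon:
                labels[(i, j)] = lexicon[text]
    return labels

def sample_negatives(n_tokens, positives, ratio=0.35, max_len=4, rng=None):
    """Draw a small random subset of unlabeled spans to train on as negatives.

    `ratio` is an assumed hyperparameter: about ratio * n_tokens spans are drawn.
    """
    rng = rng or random.Random(0)
    candidates = [
        (i, j)
        for i in range(n_tokens)
        for j in range(i + 1, min(n_tokens, i + max_len) + 1)
        if (i, j) not in positives
    ]
    k = min(len(candidates), max(1, int(ratio * n_tokens)))
    return rng.sample(candidates, k)

# Hypothetical example: "fever" is a real symptom but is absent from the lexicon.
tokens = ["patient", "reports", "chest", "pain", "and", "mild", "fever"]
lexicon = {"chest pain": "SYMPTOM"}

labels = distant_label(tokens, lexicon)
negatives = sample_negatives(len(tokens), labels)
```

Treating every unlabeled span as a negative would always push the model to reject "fever". Sampling only a few of the unlabeled spans makes it much less likely that a missed entity ends up in the negative set for any given training step.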
Conquering Unlabeled Entity in Medical Record Text under Distant Supervision Framework

YANG Yifan, SHI Miaoyuan, MIAO Qingliang, LI Maolong. Conquering Unlabeled Entity in Medical Record Text under Distant Supervision Framework[J]. Journal of Chinese Information Processing, 2022, 36(8): 73-80.

Authors:	YANG Yifan, SHI Miaoyuan, MIAO Qingliang, LI Maolong

Affiliation:	AI Speech Co., Ltd., Suzhou, Jiangsu 215000, China

Abstract:	Automatic extraction of information from medical record text is becoming increasingly important. Distant supervision is currently a popular solution to the lack of labeled corpora. To alleviate the unlabeled entity problem caused by distant supervision, this paper proposes a strategy that combines data augmentation, negative sampling, and global optimal node set selection for a span-based named entity recognition model. Experiments show that data augmentation and global optimal node set selection each yield a stable improvement of about 0.5%, and that negative sampling yields an improvement of 5% to 10%.

Keywords:	named entity recognition; distant supervision; data omission; data augmentation; negative sampling
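The abstract does not spell out how the global optimal node set is chosen. One plausible reading, offered here only as an assumption, is that each candidate span is a scored node and the decoder keeps the non-overlapping subset with the highest total score. The sketch below does this with the standard weighted interval scheduling dynamic program; the function name `select_spans`, the candidate spans, and their scores are hypothetical.

```python
from bisect import bisect_right

def select_spans(spans):
    """Pick the non-overlapping subset of spans with the maximum total score.

    spans: list of (start, end, score) with `end` exclusive.
    This is an assumed interpretation, not the paper's confirmed algorithm.
    """
    spans = sorted(spans, key=lambda s: s[1])  # order by end position
    ends = [s[1] for s in spans]
    best = [0.0] * (len(spans) + 1)  # best[k]: optimum over the first k spans
    take = [False] * len(spans)      # whether span k is used in best[k + 1]
    prev = [0] * len(spans)          # number of earlier spans compatible with span k
    for k, (start, _end, score) in enumerate(spans):
        # Earlier spans ending at or before `start` do not overlap span k.
        p = bisect_right(ends, start, 0, k)
        prev[k] = p
        if best[p] + score > best[k]:
            best[k + 1] = best[p] + score
            take[k] = True
        else:
            best[k + 1] = best[k]
    # Walk back through the decisions to recover the chosen spans.
    chosen, k = [], len(spans)
    while k > 0:
        if take[k - 1]:
            chosen.append(spans[k - 1])
            k = prev[k - 1]
        else:
            k -= 1
    return chosen[::-1]

# Hypothetical candidates: (1, 4) has the highest single score but blocks two others.
candidates = [(0, 2, 0.6), (1, 4, 0.9), (2, 4, 0.5), (5, 6, 0.7)]
chosen = select_spans(candidates)
```

A greedy decoder that always keeps the highest-scoring span first would pick (1, 4) and (5, 6). The global selection instead keeps (0, 2), (2, 4), and (5, 6), whose combined score is higher.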
