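Word Segmentation of Chinese Electronic Medical Records Based on Unsupervised Learning
Cite this article: ZHANG Libang, GUAN Yi, YANG Jinfeng. Word Segmentation of Chinese Electronic Medical Records Based on Unsupervised Learning [J]. Intelligent Computer and Applications, 2014(2): 68-71.
Authors: ZHANG Libang  GUAN Yi  YANG Jinfeng
Affiliation: School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001
Abstract: Electronic medical records contain a large amount of useful medical knowledge, and extracting it is important for building clinical decision support systems and personalized healthcare information services. Automatic word segmentation is the key foundation for analyzing and mining Chinese electronic medical records. To overcome the difficulty of obtaining annotated corpora, this paper proposes an unsupervised word segmentation method for Chinese electronic medical records. First, a general-domain lexicon is used to produce an initial segmentation of the records; to resolve ambiguity more effectively, a probabilistic model is introduced, and the occurrence probabilities of words are estimated from raw (unannotated) text with the EM algorithm. Then, a goodness measure is built from the left and right branching entropy of character strings, so that unknown-word recognition becomes an optimization problem, which is solved by dynamic programming. Finally, experiments on 3,000 Chinese electronic medical records from a neurology department demonstrate the effectiveness of the method.

Keywords: Chinese electronic medical records  unsupervised word segmentation  EM algorithm  branching entropy  dynamic programming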
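The first step of the abstract (lexicon-based segmentation whose word probabilities are estimated from raw text by EM) can be illustrated with a minimal sketch. The abstract does not specify the probabilistic model, so a unigram word model is assumed here; the names `candidates`, `expected_counts`, `train`, `viterbi`, `MAX_WORD_LEN`, and `FLOOR`, as well as the toy lexicon and corpus, are hypothetical and not taken from the paper.

```python
import math
from collections import defaultdict

MAX_WORD_LEN = 4   # assumed cap on candidate word length
FLOOR = 1e-12      # assumed floor for words unseen during training


def candidates(sentence, start, lexicon):
    """Yield (word, end) for lexicon words at `start`; single characters are always allowed."""
    for end in range(start + 1, min(len(sentence), start + MAX_WORD_LEN) + 1):
        word = sentence[start:end]
        if end == start + 1 or word in lexicon:
            yield word, end


def expected_counts(sentence, prob, lexicon):
    """E-step for one sentence: forward-backward over every lexicon segmentation."""
    n = len(sentence)
    forward = [0.0] * (n + 1)
    backward = [0.0] * (n + 1)
    forward[0] = backward[n] = 1.0
    for i in range(n):
        for word, j in candidates(sentence, i, lexicon):
            forward[j] += forward[i] * prob[word]
    for i in range(n - 1, -1, -1):
        for word, j in candidates(sentence, i, lexicon):
            backward[i] += prob[word] * backward[j]
    counts = defaultdict(float)
    for i in range(n):
        for word, j in candidates(sentence, i, lexicon):
            # posterior probability that `word` spans positions i..j
            counts[word] += forward[i] * prob[word] * backward[j] / forward[n]
    return counts


def train(corpus, lexicon, iterations=10):
    """Estimate unigram word probabilities from unlabeled sentences with EM."""
    vocab = {w for s in corpus for i in range(len(s))
             for w, _ in candidates(s, i, lexicon)}
    if not vocab:
        return {}
    prob = {w: 1.0 / len(vocab) for w in vocab}   # uniform start
    for _ in range(iterations):
        total = defaultdict(float)
        for s in corpus:
            for w, c in expected_counts(s, prob, lexicon).items():
                total[w] += c
        norm = sum(total.values())
        prob = {w: total[w] / norm for w in vocab}  # M-step: renormalize
    return prob


def viterbi(sentence, prob, lexicon):
    """Return the most probable segmentation under the unigram model."""
    n = len(sentence)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        for word, j in candidates(sentence, i, lexicon):
            score = best[i] + math.log(max(prob.get(word, FLOOR), FLOOR))
            if score > best[j]:
                best[j], back[j] = score, i
    words, j = [], n
    while j > 0:
        words.append(sentence[back[j]:j])
        j = back[j]
    return words[::-1]


# Toy data, invented for illustration only.
lexicon = {"患者", "头痛", "发热"}
corpus = ["患者头痛", "头痛三天", "患者发热"]
prob = train(corpus, lexicon)
print(viterbi("患者头痛", prob, lexicon))  # ['患者', '头痛']
```

Plain probabilities are used for readability; on long sentences the forward and backward values can underflow, so a practical version would likely scale them or work in log space.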

An Unsupervised Approach to Word Segmentation in Chinese EMRs
ZHANG Libang, GUAN Yi, YANG Jinfeng. An Unsupervised Approach to Word Segmentation in Chinese EMRs [J]. INTELLIGENT COMPUTER AND APPLICATIONS, 2014(2): 68-71.
Authors:ZHANG Libang  GUAN Yi  YANG Jinfeng
Affiliation: School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Abstract: Electronic medical records (EMRs) contain a large amount of useful medical knowledge. Extracting this knowledge is important for building clinical decision support systems and personalized healthcare information services. Automatic word segmentation is a key prerequisite for analyzing and mining Chinese EMRs. To overcome the difficulty of obtaining labeled corpora, the paper proposes an unsupervised approach to word segmentation in Chinese EMRs. First, a general-domain lexicon is used to generate an initial segmentation. To handle ambiguity, a probabilistic model is introduced, and the word probabilities are estimated from raw text by an EM procedure. Then, the left and right branching entropy of character strings is used to build a goodness measure, and the recognition of unknown words is cast as an optimization problem that is solved by dynamic programming. Finally, experiments on 3,000 Chinese EMRs from the Department of Neurology demonstrate the effectiveness of the approach.
Keywords:Chinese EMRs  Unsupervised Segmentation  EM Algorithm  Branching Entropy  Dynamic Programming
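The second step of the abstract (a goodness measure built from left and right branching entropy, optimized by dynamic programming) can be sketched as follows. The abstract gives neither the exact goodness formula nor the objective, so this sketch assumes goodness = min(left entropy, right entropy) and an objective that sums length times goodness over the segments; sentence boundaries are counted as neighbouring symbols. The names `entropy`, `branching_entropies`, `goodness_table`, `segment`, and `MAX_LEN`, and the toy corpus, are hypothetical.

```python
import math
from collections import Counter, defaultdict

MAX_LEN = 4  # assumed cap on candidate string length


def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by a Counter."""
    total = sum(counter.values())
    return sum(c / total * math.log2(total / c) for c in counter.values())


def branching_entropies(corpus):
    """Map each substring (up to MAX_LEN) to (left entropy, right entropy)."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for sentence in corpus:
        padded = "^" + sentence + "$"  # boundary symbols act as neighbours
        last = len(padded) - 1         # index of the closing "$"
        for i in range(1, last):
            for j in range(i + 1, min(i + MAX_LEN, last) + 1):
                s = padded[i:j]
                left[s][padded[i - 1]] += 1   # character just before s
                right[s][padded[j]] += 1      # character just after s
    return {s: (entropy(left[s]), entropy(right[s])) for s in left}


def goodness_table(corpus):
    """Assumed goodness: the smaller of the left and right branching entropy."""
    return {s: min(l, r) for s, (l, r) in branching_entropies(corpus).items()}


def segment(sentence, goodness):
    """Dynamic programming: maximize the sum of len(word) * goodness(word)."""
    n = len(sentence)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - MAX_LEN), j):
            s = sentence[i:j]
            score = best[i] + len(s) * goodness.get(s, 0.0)
            if score > best[j]:
                best[j], back[j] = score, i
    words, j = [], n
    while j > 0:
        words.append(sentence[back[j]:j])
        j = back[j]
    return words[::-1]


# Toy corpus, invented for illustration only.
corpus = ["头痛三天", "头痛加重", "无头痛", "偶有头痛"]
g = goodness_table(corpus)
print(g["头痛"])              # 1.5
print(segment("无头痛", g))   # ['无', '头痛']
```

In this toy corpus "头痛" is preceded and followed by varied contexts, so both of its branching entropies are 1.5 bits, while "头" is always followed by "痛" and therefore scores 0. This sketch runs on its own; how the paper combines the goodness measure with the lexicon-based segmentation is not stated in the abstract.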
This article is indexed in CNKI, VIP (维普), and other databases.