
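Study of Unknown Word Processing Method in Automatic Tibetan Word Segmentation
Cite this article: Yangmo Droma, GAO Ding-guo. Study of Unknown Word Processing Method in Automatic Tibetan Word Segmentation[J]. Computer Engineering, 2012, 38(17): 46-48.
Authors: Yangmo Droma, GAO Ding-guo
Affiliations: School of Engineering, Tibet University; Nationalities Normal College, Qinghai Normal University
Funding: National Natural Science Foundation of China project "Research on the Formatting of Basic Tibetan Sentence Patterns Based on Function Words" (6106315)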
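Abstract: Suffix components occur with high frequency in Tibetan, and when the suffix of an unknown word is split off as a separate token during segmentation, segmentation accuracy suffers. To address this, a word (morpheme) + suffix merging method is adopted, which merges each Tibetan suffix component with the preceding word (morpheme) and outputs the pair as a single segmentation unit. To address the fragmented segmentation of the many unknown words in Tibetan, such as personal names, place names, and organization names, a segmentation-fragment integration method is used, which integrates fragments that occur repeatedly into a single segmentation unit for output. Experimental results show that the two methods improve the recognition accuracy of automatic Tibetan word segmentation.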

Keywords: Tibetan information processing; affix merging; unknown word; word segmentation fragment integration
Received: 2011-10-28
Revised: 2011-12-20
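
The word (morpheme) + suffix merging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `merge_suffixes`, the suffix inventory `SUFFIXES`, the use of Wylie transliteration instead of Tibetan script, and the space joiner are all assumptions made for illustration, since the abstract does not list the suffix components or the merging rules in detail.

```python
from typing import Iterable, List, Set

# Hypothetical suffix inventory in Wylie transliteration (an assumption);
# the abstract does not give the paper's actual list of suffix components.
SUFFIXES: Set[str] = {"pa", "po", "ba", "bo", "ma", "mo"}


def merge_suffixes(
    tokens: Iterable[str],
    suffixes: Set[str] = SUFFIXES,
    joiner: str = " ",
) -> List[str]:
    """Attach every stand-alone suffix token to the unit that precedes it."""
    merged: List[str] = []
    for tok in tokens:
        if tok in suffixes and merged:
            # The suffix was cut off on its own: merge it into the previous unit.
            merged[-1] = merged[-1] + joiner + tok
        else:
            # Ordinary token, or a suffix with nothing before it: keep as is.
            merged.append(tok)
    return merged


if __name__ == "__main__":
    # "slob" + "ma" were split by the segmenter; they are output as one unit.
    print(merge_suffixes(["slob", "ma", "yin"]))
```

A suffix at the very start of the sequence has no preceding unit, so this sketch leaves it untouched rather than guessing an attachment.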

Study of Unknown Word Processing Method in Automatic Tibetan Word Segmentation
Yangmo Droma, GAO Ding-guo. Study of Unknown Word Processing Method in Automatic Tibetan Word Segmentation[J]. Computer Engineering, 2012, 38(17): 46-48.
Authors:Yangmo Droma  GAO Ding-guo
Affiliation: 1. School of Engineering, Tibet University, Lhasa 850000, China; 2. Nationalities Normal College, Qinghai Normal University, Hainan 813000, China
Abstract: In Tibetan, suffix components occur with high frequency. When the suffix of an unknown word is cut off as a separate token during segmentation, segmentation accuracy drops. To address this, a word (morpheme) + suffix merging method is applied: each Tibetan suffix is merged with the preceding word (morpheme) and the pair is output as a single segmentation unit. Tibetan text also contains many personal names, place names, organization names, and similar words that are not in the dictionary, and these unknown words are split into fragments during segmentation. To address this, a segmentation-fragment integration method is used: fragments that occur multiple times are merged and output as a single segmentation unit. Experimental results show that the two methods improve the accuracy of automatic Tibetan word segmentation.
Keywords:Tibetan information processing  affix merging  unknown word  word segmentation fragment integration
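
The fragment integration idea in the abstract (merging fragments that occur multiple times into one unit) might be sketched as follows. This is an illustrative reading only, under stated assumptions: a "fragment" is taken to be any token missing from a lexicon, a candidate unit is a maximal run of two or more adjacent fragments, and a run is merged when it appears at least `min_count` times in the text. The names `fragment_runs`, `consolidate_fragments`, `min_count`, and the Wylie-transliterated sample data are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Iterator, List, Set, Tuple


def fragment_runs(tokens: List[str], lexicon: Set[str]) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs of maximal runs of out-of-lexicon tokens."""
    start = None
    for i, tok in enumerate(tokens):
        if tok not in lexicon:
            if start is None:
                start = i
        elif start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, len(tokens)


def consolidate_fragments(
    tokens: List[str],
    lexicon: Set[str],
    min_count: int = 2,  # assumed threshold for "occurs multiple times"
    joiner: str = " ",
) -> List[str]:
    """Merge fragment runs that recur at least min_count times into one unit."""
    runs = [(s, e) for s, e in fragment_runs(tokens, lexicon) if e - s >= 2]
    counts = Counter(tuple(tokens[s:e]) for s, e in runs)
    out: List[str] = []
    pos = 0
    for s, e in runs:
        out.extend(tokens[pos:s])  # copy everything before this run unchanged
        if counts[tuple(tokens[s:e])] >= min_count:
            out.append(joiner.join(tokens[s:e]))  # recurring run: one unit
        else:
            out.extend(tokens[s:e])  # seen only once: leave the fragments as is
        pos = e
    out.extend(tokens[pos:])
    return out


if __name__ == "__main__":
    lexicon = {"yin", "red"}  # toy lexicon (assumption)
    text = ["bkra", "shis", "yin", "bkra", "shis", "red", "nyi", "ma"]
    print(consolidate_fragments(text, lexicon))
```

Counting only maximal runs keeps the sketch short; a fuller treatment would likely also count sub-sequences, so that a name adjacent to another unknown token is still recognized.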
This article is indexed in CNKI, VIP, and other databases.