基于统计的汉语词性标注方法的分析与改进
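Cite as: 魏欧, 吴健, 孙玉芳. 基于统计的汉语词性标注方法的分析与改进[J]. 软件学报, 2000, 11(4): 473-480.
Authors: 魏欧 (WEI Ou), 吴健 (WU Jian), 孙玉芳 (SUN Yu-fang)
Affiliation: Institute of Software, Chinese Academy of Sciences, Beijing 100080
Funding: This research was supported by the National "Ninth Five-Year Plan" Key Science and Technology Project fund (Nos. 96-B08-1-3, 98-779-01-02).
Abstract: This paper analyzes the nonlinear relationship between training-corpus size and tagging accuracy in the statistics-based Chinese part-of-speech tagging methods now in common use, in terms of the structure and numerical variation of the part-of-speech probability matrix and the lexical probability matrix. To make full use of the training corpus and raise tagging accuracy, the method is improved in two respects: exploiting the grammatical attributes associated with words, and strengthening the handling of unknown words. These changes improve tagging performance: accuracy reaches 96.5% in the closed test and 96% in the open test.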

Keywords: part-of-speech tagging; n-gram; corpus; grammatical attribute
Received: 1998-11-23
Revised: 1999-04-21

Analysis and Improvement of Statistics-Based Chinese Part-of-Speech Tagging
WEI Ou, WU Jian, SUN Yu-fang. Analysis and Improvement of Statistics-Based Chinese Part-of-Speech Tagging[J]. Journal of Software, 2000, 11(4): 473-480.
Authors: WEI Ou, WU Jian, SUN Yu-fang
Affiliation: Institute of Software, The Chinese Academy of Sciences, Beijing 100080
Abstract: This paper studies a popular statistics-based training and tagging method for Chinese text. It analyzes the nonlinear relation between training-set size and tagging accuracy in terms of the structure and numerical values of the matrix of transition probabilities and the matrix of symbol probabilities. To make fuller use of the training corpus and achieve higher tagging accuracy, the training and tagging method is improved in two respects: using other grammatical attributes of words, and strengthening the processing of unknown words. With the improved method, the closed test and the open test give overall accuracies of about 96.5% and 96% respectively.
Keywords: part-of-speech tagging; n-gram; corpus; grammatical attribute
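
The two matrices named in the abstract can be illustrated with a minimal bigram tagger. This is only a sketch of the general statistics-based approach, not the authors' implementation: the toy corpus, the tag names (`r`, `v`, `n`, `ns`), the add-one smoothing, and the way unknown words are given a smoothed probability under every tag are all assumptions made for illustration, and the paper's own improvements (grammatical attributes, unknown-word handling) are not reproduced here.

```python
import math
from collections import defaultdict

# Toy tagged corpus (hypothetical data, for illustration only).
CORPUS = [
    [("我", "r"), ("爱", "v"), ("北京", "ns")],
    [("他", "r"), ("爱", "v"), ("书", "n")],
    [("我", "r"), ("读", "v"), ("书", "n")],
]

def train(corpus):
    """Count tag bigrams (transition matrix) and tag/word pairs (lexical matrix)."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sent in corpus:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab

def log_prob(counts, key, n_outcomes):
    """Add-one smoothed log probability of `key` within one row of counts."""
    total = sum(counts.values())
    return math.log((counts.get(key, 0) + 1) / (total + n_outcomes))

def tag(words, model):
    """Viterbi search for the most probable tag sequence under the bigram model."""
    trans, emit, tags, vocab = model
    n_words = len(vocab) + 1  # one extra slot so unknown words keep some mass
    # best[t] = (log score of the best path ending in tag t, that path)
    best = {"<s>": (0.0, [])}
    for word in words:
        nxt = {}
        for t in tags:
            e = log_prob(emit[t], word, n_words)
            score, path = max(
                (s + log_prob(trans[p], t, len(tags)) + e, pth)
                for p, (s, pth) in best.items()
            )
            nxt[t] = (score, path + [t])
        best = nxt
    return max(best.values())[1]

model = train(CORPUS)
print(tag(["我", "爱", "书"], model))  # ['r', 'v', 'n']
```

With so little training data most cells of both matrices are empty and the smoothing constant dominates the estimates, which gives a small-scale feel for why corpus size and accuracy are not related linearly.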
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.