Analysis and Improvement of Statistics-Based Chinese Part-of-Speech Tagging

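Cite this article:	WEI Ou, WU Jian, SUN Yu-fang. Analysis and Improvement of Statistics-Based Chinese Part-of-Speech Tagging[J]. Journal of Software (软件学报), 2000, 11(4): 473-480.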

Authors:	WEI Ou, WU Jian, SUN Yu-fang

Affiliation:	Institute of Software, Chinese Academy of Sciences, Beijing 100080

Funding:	This research was supported by the National "Ninth Five-Year Plan" Key Science and Technology Project fund (Nos. 96-B08-1-3, 98-779-01-02).

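Abstract:	This paper analyzes the nonlinear relationship between training-corpus size and tagging accuracy in the statistics-based Chinese part-of-speech tagging methods in common use, looking at the structure and numerical changes of the part-of-speech probability matrix and the lexical probability matrix. To make full use of the training corpus and raise tagging accuracy, the method is improved in two respects: exploiting word-related grammatical attributes and strengthening the handling of unknown words. Both changes improve tagging performance. Accuracy reaches 96.5% on the closed test and 96% on the open test.
Keywords:	part-of-speech tagging; n-gram; corpus; grammatical attribute
Received:	1998/11/23
Revised:	1999/4/21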
Analysis and Improvement of Statistics-Based Chinese Part-of-Speech Tagging

WEI Ou, WU Jian and SUN Yu-fang. Analysis and Improvement of Statistics-Based Chinese Part-of-Speech Tagging[J]. Journal of Software, 2000, 11(4): 473-480.

Authors:	WEI Ou, WU Jian and SUN Yu-fang

Affiliation:	Institute of Software, The Chinese Academy of Sciences, Beijing 100080

Abstract:	This paper studies a popular statistics-based training and tagging method for Chinese texts and analyzes the nonlinear relation between training-set size and tagging accuracy in terms of the structure and numerical values of the transition-probability matrix and the symbol-probability matrix. To make fuller use of the training corpus and achieve higher tagging accuracy, the method is improved in two respects: using other grammatical attributes of words, and strengthening the processing of unknown words. With the improved method, closed and open tests showed overall accuracies of about 96.5% and 96%, respectively.

Keywords:	part-of-speech tagging; n-gram; corpus; grammatical attribute
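
The abstract refers to a transition-probability matrix (tag to tag) and a symbol-probability matrix (tag to word). As a rough illustration of how those two tables drive a statistics-based tagger, the sketch below trains a generic bigram hidden Markov model on a tiny made-up corpus and decodes with the Viterbi algorithm. It is not the authors' method: the add-one smoothing, the single reserved slot for unknown words, the tag labels (`r`, `v`, `n`, `ns`), the toy sentences, and all function names are assumptions chosen for the example, and the paper's use of grammatical attributes is not reproduced.

```python
import math
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Count tag bigrams and (tag, word) pairs from a tagged corpus."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]: transition counts
    emit = defaultdict(Counter)   # emit[tag][word]: symbol (lexical) counts
    vocab = set()
    for sent in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability, so unseen events keep nonzero mass."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, tags, vocab):
    """Return the most probable tag sequence under the bigram model."""
    if not words:
        return []
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot reserved for unknown words
    # Each column maps tag -> (best log score, backpointer to previous tag).
    columns = [{
        t: (log_prob(trans["<s>"], t, n_tags)
            + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for word in words[1:]:
        column = {}
        for t in tags:
            e = log_prob(emit[t], word, n_words)
            column[t] = max(
                (columns[-1][p][0] + log_prob(trans[p], t, n_tags) + e, p)
                for p in tags
            )
        columns.append(column)
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Hypothetical toy corpus; the tag labels are illustrative only.
corpus = [
    [("我", "r"), ("爱", "v"), ("北京", "ns")],
    [("他", "r"), ("爱", "v"), ("学习", "v")],
    [("学习", "v"), ("汉语", "n")],
]
model = train(corpus)
print(viterbi(["我", "爱", "汉语"], *model))
```

With a corpus this small almost every count is zero, so the smoothing term dominates the estimates. That sparsity is what ties tagging accuracy to training-corpus size, and the relation between the two is the abstract's subject.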
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.