
Increasing Accuracy of Chinese Segmentation with Strategy of Multi-step Processing
Original title: 提高汉语自动分词精度的多步处理策略
Citation: ZHAO Tie-jun, LV Ya-juan, YU Hao, YANG Mu-yun, LIU Fang. Increasing Accuracy of Chinese Segmentation with Strategy of Multi-step Processing[J]. Journal of Chinese Information Processing (中文信息学报), 2001, 15(1): 13-18.
Authors: ZHAO Tie-jun (赵铁军), LV Ya-juan (吕雅娟), YU Hao (于浩), YANG Mu-yun (杨沐昀), LIU Fang (刘芳)
Affiliation: School of Computer Science and Technology, Harbin Institute of Technology
Funding: National Natural Science Foundation of China (69775017)
Abstract: Automatic Chinese word segmentation remains difficult when applied to large-scale real text. The two key problems are the identification of unknown (out-of-vocabulary) words and the resolution of segmentation ambiguities. This paper describes a multi-step processing strategy intended to reduce the difficulty of segmentation and improve its accuracy. The procedure has seven parts: elimination of pseudo-ambiguities, full segmentation of the sentence, deterministic segmentation of some words, numeral-string processing, reduplicated-word processing, statistical identification of unknown words, and final resolution of segmentation ambiguities using part-of-speech information, which is integrated with the tagger. In an open test, segmentation precision exceeds 98%.
Keywords: Chinese automatic word segmentation; ambiguity; multi-step processing
Revised manuscript received: May 23, 2000
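
To make the second step listed in the abstract (full segmentation of a sentence) concrete, the sketch below enumerates every way to split a sentence into words from a lexicon. It is only an illustration and not the authors' implementation: the function name `full_segmentation`, the `max_len` cap, the single-character fallback, and the toy lexicon are all assumptions introduced for this example.

```python
from typing import List, Set


def full_segmentation(sentence: str, lexicon: Set[str], max_len: int = 4) -> List[List[str]]:
    """Enumerate every split of `sentence` into lexicon words.

    Assumption: a single character is always accepted as a word, so that
    characters missing from the lexicon do not block the enumeration.
    """
    results: List[List[str]] = []

    def extend(start: int, path: List[str]) -> None:
        # A complete segmentation has been built once the whole sentence is consumed.
        if start == len(sentence):
            results.append(path.copy())
            return
        # Try every candidate word of length 1..max_len starting at `start`.
        for end in range(start + 1, min(start + max_len, len(sentence)) + 1):
            word = sentence[start:end]
            if end - start == 1 or word in lexicon:
                path.append(word)
                extend(end, path)
                path.pop()

    extend(0, [])
    return results


# Hypothetical toy lexicon; a real system would load a large dictionary.
TOY_LEXICON = {"研究", "研究生", "生命", "起源"}
candidates = full_segmentation("研究生命起源", TOY_LEXICON)
for candidate in candidates:
    print(" / ".join(candidate))
```

The candidates include both 研究 / 生命 / 起源 and 研究生 / 命 / 起源, which is the kind of overlapping ambiguity that the later steps (statistical identification of unknown words and part-of-speech-based disambiguation) would have to resolve. Exhaustive enumeration grows quickly with sentence length, so a practical system would more likely store the candidates in a word lattice than list them explicitly.
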
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.