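Research on a Clustering-Based Method for Evaluating Corpus Word Segmentation
Cite this article: SONG Li-Peng, ZHENG Jia-Heng. Research on a Clustering-Based Method for Evaluating Corpus Word Segmentation [J]. Chinese Journal of Computers, 2004, 27(2): 192-196.
Authors: SONG Li-Peng, ZHENG Jia-Heng
Affiliation: Department of Computer Science, Shanxi University, Taiyuan 030006
Funding: Supported by the National High Technology Research and Development Program of China ("863" Program) (2001AA114031)
Abstract: This paper proposes a new approach to evaluating the word-segmentation accuracy of large-scale Chinese text corpora: text samples are clustered on top of stratified sampling. Clustering can either raise the precision of the test or reduce the sample size. The method uses a new sample similarity measure that jointly considers the distance between sample vectors and the linear correlation between the components of the sample vectors. The clustering results are evaluated dynamically, and the number of clusters and the similarity factor are adjusted accordingly, which improves the efficiency and quality of clustering. Experiments show that the method performs well when evaluating the segmentation accuracy of large-scale corpora.

Keywords: Chinese; corpus; segmentation evaluation; similarity factor; sample clustering; linguistics; stratified sampling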

Evaluation Method of the Corpus Segmentation Based on Clustering
SONG Li-Peng, ZHENG Jia-Heng. Evaluation Method of the Corpus Segmentation Based on Clustering [J]. Chinese Journal of Computers, 2004, 27(2): 192-196.
Authors: SONG Li-Peng, ZHENG Jia-Heng
Abstract: This paper proposes a model for testing the segmentation accuracy of large-scale corpora. The model clusters samples drawn by stratified sampling. Clustering is driven by a new sample similarity measure that takes into account both the distance between sample vectors and the linear correlation between their components. The clustering results are evaluated dynamically and the clustering parameters are adjusted accordingly, which improves both the efficiency and the quality of clustering. Compared with random sampling, the sample clustering method reduces the number of samples required by 63.3% for large-scale corpora. The experiments also show that the method improves testing precision by 60%.
Keywords: stratified sampling; similarity factor; sample clustering; evaluation function
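
The abstract describes a similarity measure that combines the distance between sample vectors with the linear correlation between their components, but it does not state the formula. The sketch below is only one plausible way to combine the two ingredients and is not the authors' formula. The function names, the weighting parameter `alpha` (a stand-in for the "similarity factor"), and the mapping of distance and correlation onto the range [0, 1] are all assumptions made for illustration.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson linear correlation between the components of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant vector; treat it as 0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def combined_similarity(x, y, alpha=0.5):
    """Hypothetical similarity in [0, 1] mixing closeness and correlation.

    alpha is an assumed weighting factor: alpha=1 uses distance only,
    alpha=0 uses correlation only.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    closeness = 1.0 / (1.0 + euclidean_distance(x, y))      # in (0, 1]
    correlation = (pearson_correlation(x, y) + 1.0) / 2.0   # in [0, 1]
    return alpha * closeness + (1.0 - alpha) * correlation


# Example: two hypothetical sample feature vectors.
same = combined_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
opposite = combined_similarity([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

In this sketch identical vectors score highest, and vectors that are both far apart and negatively correlated score lowest. A clustering procedure could then tune `alpha` and the number of clusters against an evaluation of the resulting clusters, in the spirit of the dynamic adjustment the abstract mentions.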
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.