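Study on Automatic Categorizing Method of Chinese Information for World Wide Web (WWW中文信息自动分类方法研究)
Cite this article: Zheng Jiaheng, Song Wenzhong. WWW中文信息自动分类方法研究[J]. 情报学报 (Journal of the China Society for Scientific and Technical Information), 2002, 21(5): 532-536.
Authors: Zheng Jiaheng (郑家恒), Song Wenzhong (宋文中)
Affiliation: Department of Computer Science, Shanxi University, Taiyuan 030006
Abstract: This paper adopts a word-based classification technique. In computing the specificity of category words, it considers factors such as each category word's frequency, concentration, and distribution in the corpus. Exploiting the markup features of HTML, it applies a three-dimensional weighted classification algorithm to compute category weights. A variant of Bayes' formula is used to compute the classification confidence of each Chinese WWW document, and the document is assigned to the category with the highest confidence. In tests on 108 test documents, classification accuracy was 98.1% in the closed test and 83.3% in the open test.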

Keywords: automatic classification of Chinese WWW information; automatic text classification; category words
Revised: September 3, 2001

Study on Automatic Categorizing Method of Chinese Information for World Wide Web
Zheng Jiaheng and Song Wenzhong. Study on Automatic Categorizing Method of Chinese Information for World Wide Web[J]. Journal of the China Society for Scientific and Technical Information, 2002, 21(5): 532-536.
Authors:Zheng Jiaheng and Song Wenzhong
Abstract: This paper adopts word-based categorization. To determine the specialty of a category-word, it uses not only the word's frequency, degree of concentration, and distribution, but also the size of each corpus. The paper analyses HTML tags and discusses a three-dimensional weighted algorithm for calculating the classification weight; the algorithm uses frequency, location, and specialty. Reliability is calculated with a Bayes algorithm, and each document is assigned to the category whose reliability is highest. Closed and open tests were run on the experimental system. Accuracy is 98.1% in the closed test and 83.3% in the open test.
Keywords: WWW; Chinese information automatic categorization; text automatic categorization; category-word
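
The abstract names the ingredients of the method (frequency, HTML location, category-word specialty, a Bayes-style reliability, and assignment to the maximum) but gives no formulas. The following is only a minimal Python sketch of how such a pipeline could be wired together, not the authors' algorithm. The tag weights in `TAG_WEIGHTS`, the multiplicative combination of the three factors, the normalization used as a stand-in for the Bayes variant, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical location weights per HTML tag (assumed values, not from the paper).
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "body": 1.0}


def category_weights(occurrences, specialty):
    """Accumulate a weight per category from category-word occurrences.

    occurrences: list of (word, tag) pairs, one per occurrence in the document,
                 so frequency enters through repeated pairs.
    specialty:   dict mapping word -> {category: specialty score}.
    Assumption: the three factors combine multiplicatively and sum per category.
    """
    weights = defaultdict(float)
    for word, tag in occurrences:
        location = TAG_WEIGHTS.get(tag, 1.0)  # unknown tags count as body text
        for category, score in specialty.get(word, {}).items():
            weights[category] += location * score
    return dict(weights)


def reliabilities(weights, priors=None):
    """Normalize prior * weight over categories (an assumed Bayes-like form)."""
    priors = priors or {category: 1.0 for category in weights}
    scored = {c: priors.get(c, 0.0) * w for c, w in weights.items()}
    total = sum(scored.values())
    if total == 0:
        return {}
    return {c: s / total for c, s in scored.items()}


def classify(occurrences, specialty, priors=None):
    """Return (best category, reliability), or (None, 0.0) if nothing matched."""
    rel = reliabilities(category_weights(occurrences, specialty), priors)
    if not rel:
        return None, 0.0
    best = max(rel, key=rel.get)
    return best, rel[best]
```

With a toy specialty table such as `{"football": {"sports": 0.9, "news": 0.1}, "stock": {"finance": 0.8}}`, a page that mentions "football" in its title and body and "stock" once in the body would be assigned to `sports`, because the title occurrence is weighted more heavily than body text.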
This article is indexed in databases including CNKI and Wanfang Data.