A Chinese Text Subject Extraction Method Based on Character Co-occurrence Frequency
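Cite this article: MA Ying-Hua, WANG Yong-Cheng, SU Gui-Yang, ZHANG Yu-Meng. A Chinese Text Subject Extraction Method Based on Character Co-occurrence Frequency [J]. Journal of Computer Research and Development, 2003, 40(6): 874-878.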
Authors: MA Ying-Hua, WANG Yong-Cheng, SU Gui-Yang, ZHANG Yu-Meng
Affiliation: Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200030
Funding: National Natural Science Foundation of China (60082003)
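Abstract: Subject extraction is one of the basic tasks in automatic text processing, and it has always taken word segmentation or word extraction as its first step. Because Chinese lacks explicit delimiters between words, segmentation and word extraction often perform poorly, which to some extent degrades the quality of subject extraction. This paper proposes a new method for automatic subject extraction from Chinese text that uses the character as the processing unit and is based on character co-occurrence frequency. The method is fast, adapts to many text genres, and avoids word segmentation and word extraction entirely. It can be applied widely at several levels of subject extraction, such as subject sentences and subject paragraphs, and it is equally applicable to text in other languages. Subject sentence extraction experiments show that the method extracts subject sentences from news text with an accuracy of 77.19%. Comparative experiments on Chinese text subject extraction further show that omitting the segmentation step does not lower the accuracy of the extraction algorithm.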

Keywords: natural language processing; subject extraction; co-occurrence frequency

A Novel Chinese Text Subject Extraction Method Based on Character Co-occurrence
MA Ying-Hua, WANG Yong-Cheng, SU Gui-Yang, and ZHANG Yu-Meng. A Novel Chinese Text Subject Extraction Method Based on Character Co-occurrence [J]. Journal of Computer Research and Development, 2003, 40(6): 874-878.
Authors:MA Ying-Hua  WANG Yong-Cheng  SU Gui-Yang  and ZHANG Yu-Meng
Abstract: Subject extraction is one of the fundamental tasks of natural language processing, and word segmentation or word extraction has traditionally been its first step. Because Chinese text has no explicit delimiters between words, both word segmentation and word extraction are difficult. This paper proposes a novel Chinese text subject extraction method based on character co-occurrence that requires neither word segmentation nor word extraction. The method is fast and can be used for both subject sentence extraction and subject paragraph extraction. It can also process text in other languages, and even multi-language text. Experiments show that the approach reaches an accuracy of 77.19% on news text of multiple styles, and that omitting word segmentation does not reduce accuracy.
Keywords:natural language processing  subject extraction  co-occurrence
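
The abstract does not spell out the scoring procedure, so the following is only a minimal sketch of the general idea of ranking sentences by character co-occurrence without any word segmentation. It is not the authors' algorithm. It assumes that "co-occurrence" means two characters appearing next to each other, that sentences end at the listed punctuation marks, and that a sentence's score is the average document-wide frequency of its adjacent character pairs. The names `split_sentences`, `char_pairs`, and `subject_sentence` are hypothetical.

```python
import re
from collections import Counter

# Assumption: sentences end at these Chinese or Western marks.
# The ASCII period is left out so decimals such as 77.19 are not split.
SENTENCE_END = re.compile(r"[。!?!?]+")
# Assumption: only CJK ideographs and ASCII letters/digits count as characters.
CHAR = re.compile(r"[\u4e00-\u9fffA-Za-z0-9]")


def split_sentences(text):
    """Split text into non-empty sentences at sentence-ending punctuation."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def char_pairs(sentence):
    """Return adjacent character pairs, with no word segmentation step."""
    chars = CHAR.findall(sentence)
    return list(zip(chars, chars[1:]))


def subject_sentence(text):
    """Return the sentence whose character pairs are most frequent
    across the whole document, or None if there are no sentences."""
    sentences = split_sentences(text)
    if not sentences:
        return None
    # Document-wide frequency of each adjacent character pair.
    pair_counts = Counter(p for s in sentences for p in char_pairs(s))

    def score(sentence):
        pairs = char_pairs(sentence)
        if not pairs:
            return 0.0
        # Average frequency, so long sentences are not favoured by length alone.
        return sum(pair_counts[p] for p in pairs) / len(pairs)

    # max() keeps the earliest sentence when scores tie.
    return max(sentences, key=score)


if __name__ == "__main__":
    sample = "主题抽取很重要。主题抽取方法。今天天气好。"
    print(subject_sentence(sample))
```

Averaging over pairs is one design choice among several. Summing instead would favour long sentences, and counting co-occurrence within a whole sentence rather than between neighbouring characters would give a different ranking.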
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.