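Automatic Extraction of Classical Chinese Information: An Attempt Based on Statistics and Rules
Cite this article:虞宁翌, 高琦, 恩东. Automatic Extraction of Classical Chinese Information: An Attempt Based on Statistics and Rules[J]. Journal of Chinese Information Processing (中文信息学报), 2015, 29(6): 127-134
Authors:虞宁翌  高琦  恩东
Affiliation:1. Faculty of Language Sciences, Beijing Language and Culture University, Beijing 100083, China;
2. College of Information Sciences, Beijing Language and Culture University, Beijing 100083, China
Funding:National Natural Science Foundation of China (61300081, 61170162); National High Technology Research and Development Program of China (2015AA015409); Major Program of the National Social Science Foundation of China (12&ZD173)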
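Abstract:Automatically extracting classical Chinese (文言) text benefits language monitoring and corpus construction. The computational study in this paper also supports the introspective conclusion, held in linguistics, that the classical and vernacular systems of Chinese form a continuum. The paper treats the tagging of classical Chinese in a mixed corpus as a short-text classification problem, and applies both rule-based and statistical methods to separate classical from vernacular text. The rule-based methods consider the influence of common classical function words and constructions; the statistical methods compare the performance of N-gram, Naive Bayes, Maximum Entropy, and Decision Tree models. The results show that a unigram language model monitoring the function-word system reaches an F-value of 0.98.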

Keywords:classical Chinese tagging  text classification  rule-based model  statistical model
  

A Tentative Study on Statistical and Rule Based Information Extraction from Ancient Chinese
YU Ningyi,AO Gaoqi,UN Endong. A Tentative Study on Statistical and Rule Based Information Extraction from Ancient Chinese[J]. Journal of Chinese Information Processing, 2015, 29(6): 127-134
Authors:YU Ningyi  AO Gaoqi  UN Endong
Affiliation:1. Faculty of Language Sciences, Beijing Language and Culture University, Beijing 100083, China;
2. College of Information Sciences, Beijing Language and Culture University, Beijing 100083, China
Abstract:Information extraction from ancient Chinese benefits language monitoring and corpus construction. This paper treats the tagging of ancient Chinese in a mixed corpus as a short-text classification task and applies both rule-based and statistical methods. For the rule-based methods, the paper considers the effect of function words and constructions in ancient Chinese. For the statistical methods, we conduct experiments on N-gram, Naive Bayes, Maximum Entropy, and Decision Tree models. The experiments indicate that the unigram model outperforms the others, with an F-value of 0.98. The research also provides evidence for the conclusion that the evolution of Chinese is a continuum.
Keywords:ancient Chinese tagging   text classification   rule based model   statistic based model  
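
The abstract describes classifying short texts as classical or vernacular with a unigram language model and scoring the result by F-value. The abstract does not give the model's details, so the following is only a minimal sketch under stated assumptions: it uses character-level unigrams with add-one smoothing, the labels `classical` and `vernacular`, and a tiny toy training set (a few well-known Analects lines and some made-up modern sentences). None of the names, data, or smoothing choices come from the paper.

```python
import math
from collections import Counter

# Toy training data, illustrative only (NOT the paper's corpus).
CLASSICAL = [
    "学而时习之，不亦说乎",
    "有朋自远方来，不亦乐乎",
    "三人行，必有我师焉",
    "知之为知之，不知为不知，是知也",
]
VERNACULAR = [
    "我们今天一起去学校上课",
    "他的朋友从很远的地方来了",
    "这个问题我们已经知道了",
    "你们是不是也很高兴",
]


def train_unigram(texts):
    """Count character unigrams; return (counts, total number of characters)."""
    counts = Counter(ch for text in texts for ch in text)
    return counts, sum(counts.values())


def log_prob(text, model, vocab_size):
    """Log-probability of text under an add-one-smoothed unigram model."""
    counts, total = model
    return sum(math.log((counts[ch] + 1) / (total + vocab_size)) for ch in text)


def classify(text, classical_model, vernacular_model, vocab_size):
    """Pick the class whose unigram model assigns the higher log-probability."""
    lp_classical = log_prob(text, classical_model, vocab_size)
    lp_vernacular = log_prob(text, vernacular_model, vocab_size)
    return "classical" if lp_classical > lp_vernacular else "vernacular"


def f_value(gold, pred, positive="classical"):
    """F1 score (harmonic mean of precision and recall) for the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


classical_model = train_unigram(CLASSICAL)
vernacular_model = train_unigram(VERNACULAR)
# Shared vocabulary size, plus one slot reserved for unseen characters.
vocab_size = len(set(classical_model[0]) | set(vernacular_model[0])) + 1

print(classify("不亦君子乎", classical_model, vernacular_model, vocab_size))
print(classify("我们也知道了", classical_model, vernacular_model, vocab_size))
```

In this sketch, function words such as 之, 乎, and 亦 carry most of the signal because they are frequent in the classical sample and absent from the vernacular one, which is in the spirit of the abstract's remark about monitoring the function-word system. A real experiment would need a far larger corpus and held-out test data.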