
W-POS language model and its selecting and matching algorithms
Cite this article: QIU Yunfei, LIU Shixing, WEI Haichao, SHAO Liangshan. W-POS language model and its selecting and matching algorithms[J]. Journal of Computer Applications, 2015, 35(8): 2210-2214.
Authors: QIU Yunfei  LIU Shixing  WEI Haichao  SHAO Liangshan
Affiliation: 1. School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China; 2. System Engineering Institute, Liaoning Technical University, Huludao, Liaoning 125105, China
Funding: National Natural Science Foundation of China (70971059); Liaoning Province Innovation Team Project (2009T045); Liaoning Province Program for the Growth of Outstanding Young Scholars in Higher Education Institutions (LJQ2012027).
Received: 2015-03-16
Revised: 2015-04-29

Abstract: The n-grams language model uses combinations of several words as text features and trains a classifier on them to categorize text. However, n-grams contain redundant words, and a large amount of sparse data is produced when they are matched against and quantified on a corpus, which seriously degrades classification accuracy and limits the model's range of application. To address this, an improved n-grams language model named W-POS (Word-Parts of Speech) is proposed. After word segmentation, words that occur with low probability and redundant words are replaced by their parts of speech, so that the W-POS language model consists of irregular arrangements of words and parts of speech. Selection rules, a selecting algorithm, and an algorithm for matching the model against the test set are also proposed. Experimental results on the Fudan University Chinese corpus and the English corpus 20Newsgroups show that the W-POS language model inherits the advantages of n-grams (fewer features, partial semantic content, and improved precision) while overcoming their drawbacks of producing large amounts of sparse data and containing redundant words. The experiments also verify the effectiveness and feasibility of the selecting and matching algorithms.
Keywords: n-grams language model  parts of speech  redundancy  sparse data  feature selection
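
The abstract gives only the outline of W-POS, not the paper's actual selection rules or matching algorithm, so the following Python is a minimal sketch of the general idea under stated assumptions rather than the authors' method. It assumes that input text is already segmented and tagged as `(word, pos)` pairs, that a "low-probability" word is one whose corpus count falls below a hypothetical `min_count` threshold, and that "redundant" words are identified by a hypothetical set of part-of-speech tags, `redundant_pos`. The function names `build_wpos_ngrams` and `count_matches` are likewise illustrative.

```python
from collections import Counter


def build_wpos_ngrams(tagged_docs, n=2, min_count=2, redundant_pos=frozenset()):
    """Count n-grams in which rare or redundant words are replaced by POS tags.

    tagged_docs: list of documents, each a list of (word, pos) pairs.
    min_count and redundant_pos are illustrative assumptions, not the
    selection rules defined in the paper.
    """
    word_counts = Counter(word for doc in tagged_docs for word, _ in doc)

    def slot(word, pos):
        # Keep the word only if it is frequent enough and not redundant;
        # otherwise fall back to its part of speech.
        if word_counts[word] < min_count or pos in redundant_pos:
            return ("POS", pos)
        return ("W", word)

    features = Counter()
    for doc in tagged_docs:
        slots = [slot(word, pos) for word, pos in doc]
        for i in range(len(slots) - n + 1):
            features[tuple(slots[i:i + n])] += 1
    return features


def count_matches(feature, tagged_doc):
    """Count the windows of tagged_doc that a mixed word/POS feature matches."""
    n = len(feature)
    hits = 0
    for i in range(len(tagged_doc) - n + 1):
        window = tagged_doc[i:i + n]
        if all(
            (kind == "W" and value == word) or (kind == "POS" and value == pos)
            for (kind, value), (word, pos) in zip(feature, window)
        ):
            hits += 1
    return hits


# Toy English data with made-up tags, for illustration only.
train = [
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
    [("the", "DT"), ("dog", "NN"), ("sleeps", "VB")],
]
features = build_wpos_ngrams(train, n=2, min_count=2, redundant_pos={"DT"})
for gram, count in features.items():
    print(gram, count)

test_doc = [("a", "DT"), ("bird", "NN"), ("sleeps", "VB")]
print(count_matches((("POS", "NN"), ("W", "sleeps")), test_doc))
```

On this toy input, the four distinct word bigrams of the training data collapse into two mixed features, and the feature `(NN, "sleeps")` still matches a test sentence containing the unseen word "bird". This is the kind of sparsity reduction the abstract attributes to W-POS.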
This article is indexed in Wanfang Data and other databases.