Discriminative features for text document classification
Authors: K. Torkkola
Affiliation: (1) Motorola Labs, 2900 South Diablo Way, MD DW286, Tempe, AZ 85282, USA
Abstract: The bag-of-words approach to text document representation typically results in document vectors with on the order of 5000–20,000 components. To make effective use of various statistical classifiers, it may be necessary to reduce the dimensionality of this representation. We point out deficiencies in class discrimination of two popular dimension reduction methods: Latent Semantic Indexing (LSI) and sequential feature selection according to some relevant criterion. As a remedy, we suggest feature transforms based on Linear Discriminant Analysis (LDA). Since LDA requires operating on matrices that are both large and dense, we propose an efficient intermediate dimension reduction step using either a random transform or LSI. We report good classification results with the combined feature transform on a subset of the Reuters-21578 database. Drastic reduction of the feature vector dimensionality from 5000 to 12 actually improves the classification performance. An erratum to this article has been published.
Keywords: Dimension reduction; Linear discriminant analysis; Random transforms; Text classification
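
The two-stage transform described in the abstract (a random transform as a cheap intermediate reduction, followed by LDA) can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration and is not taken from the article: the synthetic Poisson "corpus", the sizes (2000 terms, 100 intermediate dimensions, 3 classes), the Gaussian random matrix, the regularization constant, the nearest-class-mean classifier, and the helper names `make_corpus`, `fit_lda`, and `demo`. The article itself reports results on a Reuters-21578 subset, which this toy example does not reproduce.

```python
import numpy as np


def make_corpus(n_per_class, n_terms, n_classes, rng):
    """Hypothetical synthetic term-count matrix: each class over-uses its own block of 30 terms."""
    rates = np.full((n_classes, n_terms), 0.1)
    for c in range(n_classes):
        rates[c, c * 30:(c + 1) * 30] = 2.0
    y = np.repeat(np.arange(n_classes), n_per_class)
    X = rng.poisson(rates[y]).astype(float)
    return X, y


def fit_lda(Z, y, reg=1e-3):
    """Return W (k x (C-1)) maximizing between-class relative to within-class scatter."""
    classes = np.unique(y)
    k = Z.shape[1]
    mean_all = Z.mean(axis=0)
    Sw = np.zeros((k, k))  # within-class scatter
    Sb = np.zeros((k, k))  # between-class scatter
    for c in classes:
        Zc = Z[y == c]
        mc = Zc.mean(axis=0)
        D = Zc - mc
        Sw += D.T @ D
        d = (mc - mean_all)[:, None]
        Sb += len(Zc) * (d @ d.T)
    # Small ridge term (an assumption of this sketch) keeps Sw positive definite.
    Sw += reg * np.trace(Sw) / k * np.eye(k)
    # Whiten with the Cholesky factor of Sw, then solve a symmetric eigenproblem.
    L = np.linalg.cholesky(Sw)
    A = np.linalg.solve(L, Sb)      # L^{-1} Sb
    M = np.linalg.solve(L, A.T)     # L^{-1} Sb L^{-T}
    M = (M + M.T) / 2.0
    evals, V = np.linalg.eigh(M)
    top = np.argsort(evals)[::-1][: len(classes) - 1]
    return np.linalg.solve(L.T, V[:, top])  # map eigenvectors back: W = L^{-T} V


def demo(seed=0, n_terms=2000, k=100, n_classes=3):
    rng = np.random.default_rng(seed)
    X_train, y_train = make_corpus(100, n_terms, n_classes, rng)
    X_test, y_test = make_corpus(50, n_terms, n_classes, rng)

    # Stage 1: random transform from n_terms down to k dimensions.
    R = rng.standard_normal((n_terms, k)) / np.sqrt(k)
    Z_train, Z_test = X_train @ R, X_test @ R

    # Stage 2: LDA on the now small, dense k-dimensional vectors.
    W = fit_lda(Z_train, y_train)
    F_train, F_test = Z_train @ W, Z_test @ W

    # Nearest class mean in the discriminant space (classifier chosen for brevity).
    means = np.stack([F_train[y_train == c].mean(axis=0) for c in range(n_classes)])
    dists = ((F_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    accuracy = (dists.argmin(axis=1) == y_test).mean()
    return F_test, accuracy


if __name__ == "__main__":
    features, acc = demo()
    print(features.shape, acc)
```

The point of the intermediate step in this sketch is that the scatter matrices are formed in the 100-dimensional projected space rather than over all 2000 terms, so the dense LDA computation stays small. LDA yields at most one fewer dimension than the number of classes, so the three-class toy data ends up in two dimensions.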
This article is indexed in SpringerLink and other databases.