Clustering web documents using hierarchical representation with multi-granularity
Authors: Faliang Huang, Shichao Zhang, Minghua He, Xindong Wu
Affiliation: 1. Faculty of Software, Fujian Normal University, Cangshan District, 8 Shangsan Road, Fuzhou, 350007, China
2. College of Computer Science and IT, Guangxi Normal University, Guilin, 541004, PR China
3. Computer Science, Aston University, Birmingham, Aston Triangle, B4 7ET, United Kingdom
4. Department of Computer Science, University of Vermont, 33 Colchester Avenue, Burlington, VT, 05405, USA
5. Faculty of Engineering and Information Technology, UTS, PO Box 123, Broadway, NSW, 2007, Australia
Abstract: Web document cluster analysis plays an important role in information retrieval by organizing large numbers of documents into a small number of meaningful clusters. Traditional web document clustering is based on the Vector Space Model (VSM), which considers only two levels of knowledge granularity (document and term) and ignores the paragraph granularity that bridges them. This two-level granularity may lead to unsatisfactory clustering results with “false correlation”. To address the problem, a Hierarchical Representation Model with Multi-granularity (HRMM), which consists of a five-layer representation of data and a two-phase clustering process, is proposed based on granular computing and article structure theory. To deal with the zero-valued similarity problem resulting from the sparse term-paragraph matrix, an ontology-based strategy and a tolerance-rough-set-based strategy are introduced into HRMM. By using granular computing, HRMM captures the structural knowledge hidden in documents more efficiently and effectively, and thus generates web document clusters of higher quality. Extensive experiments show that HRMM, HRMM with the tolerance-rough-set strategy, and HRMM with the ontology strategy all significantly outperform VSM and WFP, a representative non-VSM-based algorithm, in terms of F-score.
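The “false correlation” that the abstract attributes to two-level VSM can be illustrated with a small sketch. This is not the authors' HRMM: the function names, the toy documents, and the paragraph-level score (a symmetric average of best-matching paragraph cosines over raw term counts) are all assumptions chosen only to show why paragraph granularity can change a similarity judgment.

```python
import math
from collections import Counter

def tf_vector(text):
    # Raw term-frequency vector; whitespace tokenization is an assumption.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def doc_level_similarity(doc_a, doc_b):
    # Two-level (document, term) view: paragraph boundaries are discarded.
    return cosine(tf_vector(" ".join(doc_a)), tf_vector(" ".join(doc_b)))

def paragraph_level_similarity(doc_a, doc_b):
    # Hypothetical paragraph-aware score: match each paragraph to its most
    # similar paragraph in the other document, then average both directions.
    def directed(src, dst):
        if not src or not dst:
            return 0.0
        best = [max(cosine(tf_vector(p), tf_vector(q)) for q in dst) for p in src]
        return sum(best) / len(src)
    return 0.5 * (directed(doc_a, doc_b) + directed(doc_b, doc_a))

# Toy documents (lists of paragraphs) that share every term, but never
# group the same terms together inside a paragraph.
doc_a = ["apple fruit orchard", "bank river water"]
doc_b = ["apple bank", "fruit river", "orchard water"]

print(doc_level_similarity(doc_a, doc_b))
print(paragraph_level_similarity(doc_a, doc_b))
```

On these toy documents the document-level cosine is essentially 1 because the two term bags are identical, while the paragraph-aware score is well below 0.5 because no pair of paragraphs shares more than one term. The example shows only the motivation for a paragraph layer; the five-layer representation, the two-phase clustering, and the ontology and tolerance-rough-set strategies described in the abstract are not modeled here.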
Keywords:
This article is indexed in SpringerLink and other databases.