
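Suffix-Tree Word Sequence Kernel for Web Document Mining
Cite this article: FU Peng, ZHANG De-yun, CHEN Hai-quan, DONG Hao. Suffix-Tree Word Sequence Kernel for Web Document Mining[J]. Microelectronics & Computer, 2005, 22(12): 4-7.
Authors: FU Peng  ZHANG De-yun  CHEN Hai-quan  DONG Hao
Affiliation: School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, Shaanxi
Abstract: By representing each document as a suffix tree, this paper proposes a word sequence kernel that computes document similarity from a suffix-tree index. First, a suffix tree is built from the document's word sequence; next, the similarity between documents is computed with the suffix-tree word sequence kernel; finally, a support vector machine classifies the documents. Theoretical analysis shows that the cost of computing the suffix-tree word sequence kernel is linear in the length of the documents being compared, which greatly reduces the computation time of the sequence kernel. On the Reuters-21578 collection, the suffix-tree word sequence kernel is compared with the word sequence kernel and the polynomial kernel. The experiments show that, while improving speed, the suffix-tree word sequence kernel achieves performance comparable to the word sequence kernel and better than the polynomial kernel, which makes it better suited to applications such as Web document mining.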

Keywords: Kernel learning methods  Word sequence kernel  String kernel  Suffix tree  Web mining
Article ID: 1000-7180(2005)12-004-04
Received: 2005-03-18
Revised: 2005-03-18
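
The first two steps named in the abstract (index a document's word sequence, then compare two indexes) can be illustrated with a small sketch. It is not the paper's algorithm: the abstract does not give the kernel definition, so the sketch assumes a simplified kernel that counts shared contiguous word n-grams up to `max_len`, weighted by a hypothetical `decay` factor, and it uses a depth-bounded suffix trie rather than a true linear-time suffix tree. All function and parameter names are illustrative.

```python
from math import sqrt


def build_suffix_trie(words, max_len=3):
    """Depth-bounded suffix trie (assumption: stands in for a real suffix tree).

    Each node counts how often one contiguous word n-gram occurs.
    """
    root = {"count": 0, "children": {}}
    for start in range(len(words)):
        node = root
        # Insert the suffix starting at `start`, truncated to max_len words.
        for word in words[start:start + max_len]:
            node = node["children"].setdefault(
                word, {"count": 0, "children": {}}
            )
            node["count"] += 1
    return root


def trie_kernel(trie_a, trie_b, decay=0.5, depth=0):
    """Sum decay**length * count_a * count_b over n-grams found in both tries."""
    total = 0.0
    for word, child_a in trie_a["children"].items():
        child_b = trie_b["children"].get(word)
        if child_b is None:
            continue  # this n-gram, and all its extensions, are not shared
        length = depth + 1
        total += decay ** length * child_a["count"] * child_b["count"]
        total += trie_kernel(child_a, child_b, decay, length)
    return total


def similarity(doc_a, doc_b, max_len=3, decay=0.5):
    """Normalised kernel value: 1.0 for identical word sequences, 0.0 if disjoint."""
    trie_a = build_suffix_trie(doc_a.lower().split(), max_len)
    trie_b = build_suffix_trie(doc_b.lower().split(), max_len)
    k_ab = trie_kernel(trie_a, trie_b, decay)
    k_aa = trie_kernel(trie_a, trie_a, decay)
    k_bb = trie_kernel(trie_b, trie_b, decay)
    if k_aa == 0 or k_bb == 0:
        return 0.0
    return k_ab / sqrt(k_aa * k_bb)


if __name__ == "__main__":
    print(similarity("the cat sat", "the cat ran"))
    print(similarity("the cat sat", "cat the ran"))
```

Because only branches present in both tries are followed, the comparison never visits n-grams that one document lacks. With `max_len` fixed, building each trie takes time proportional to the document length, which is the flavour of saving the abstract attributes to the suffix-tree index. The resulting similarity values could then fill a Gram matrix for a support vector machine that accepts precomputed kernels.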

Suffix-Tree Word Sequence Kernel for Web Document Mining
FU Peng, ZHANG De-yun, CHEN Hai-quan, DONG Hao. Suffix-Tree Word Sequence Kernel for Web Document Mining[J]. Microelectronics & Computer, 2005, 22(12): 4-7.
Authors: FU Peng  ZHANG De-yun  CHEN Hai-quan  DONG Hao
Abstract: The string kernel (SK) and the word sequence kernel (WSK) are novel ways of computing document similarity by matching non-consecutive subsequences of characters or words, but both kernels are expensive to compute. This paper presents the suffix-tree word sequence kernel (STWSK), a modified word sequence kernel for computing document similarity. To compute the new kernel, suffix trees of the documents are first built with a suffix-tree construction algorithm, and the word sequence kernel is then computed from those suffix trees. With STWSK, documents can be categorized quickly and efficiently using a Support Vector Machine. Theoretical analysis shows that the computing time of STWSK is linear in the length of the compared documents, which is clearly less than that of SK and WSK. We compare the classification performance of STWSK with that of WSK and the polynomial kernel (PK) on the Reuters-21578 text dataset. The experimental results show that STWSK performs better than PK and no worse than WSK, so STWSK is better suited to real-world Web document mining tasks.
Keywords:Kernel methods  Word sequence kernel  String kernel  Suffix tree  Web mining
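
For contrast with the sequence kernels, the polynomial kernel baseline can be sketched as follows. The abstract does not say how documents were vectorised for PK, so this sketch assumes plain bag-of-words counts; `degree` and `coef0` are illustrative parameters, not values taken from the paper.

```python
from collections import Counter


def bag_of_words(text):
    """Word-count vector; word order is discarded (assumed representation)."""
    return Counter(text.lower().split())


def polynomial_kernel(text_a, text_b, degree=2, coef0=1.0):
    """Polynomial kernel (<x, y> + coef0) ** degree on word-count vectors."""
    x, y = bag_of_words(text_a), bag_of_words(text_b)
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * y[word] for word, count in x.items())
    return (dot + coef0) ** degree


if __name__ == "__main__":
    # Same words in a different order give the same kernel value.
    print(polynomial_kernel("dog bites man", "man bites dog"))
    print(polynomial_kernel("dog bites man", "dog bites man"))
```

Under this representation the two sentences are indistinguishable, whereas a kernel defined on word sequences can tell them apart.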
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.