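Research and Application of Text Deduplication Based on the DRPKP Algorithm
Cite this article: Yu Feng (俞桝), Wang Yinna (王引娜). Research and Application of Text Deduplication Based on the DRPKP Algorithm [J]. Microcomputer Applications (微型电脑应用), 2014(1): 58-60.
Authors: Yu Feng, Wang Yinna
Affiliations: [1] Guotai Junan Securities Co., Ltd., Shanghai 200120; [2] Huacun Data Information Technology Co., Ltd. (华存数据信息技术有限公司), Shanghai 200120
Funding: Supported by the National Science and Technology Support Program (国家科技支撑计划) project "Demonstration of Integrated Services for Securities and Financial Product Trading" (No. 2012BAH13F03)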
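Abstract: SimHash is currently the mainstream text deduplication algorithm, but it gives no special consideration to the natural topical similarity of text data within a specific industry. Drawing on years of experience in information management and data integration in the financial securities industry, this paper analyzes the problems of current text deduplication methods, in particular the shortcomings of SimHash for industry-specific text, and proposes a text deduplication method based on paragraph topics (the DRPKP algorithm). In comparative tests on three metrics (deduplication precision, coverage, and deduplication time), DRPKP improves precision by up to 24.5% and coverage by up to 16.34% over SimHash, with a shorter deduplication time.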

Keywords: text deduplication; paragraph topic; SimHash; similarity; MapReduce

Research and Application on Text Duplication Removal Based on DRPKP Algorithm
Yu Feng, Wang Yinna. Research and Application on Text Duplication Removal Based on DRPKP Algorithm [J]. Microcomputer Applications, 2014(1): 58-60.
Authors: Yu Feng, Wang Yinna
Affiliation: 1. Guotai Junan Securities Co., Ltd., Shanghai 200120, China; 2. Huacun Data Information Technology Co., Ltd., Shanghai 200120, China
Abstract: SimHash is one of the best algorithms for text duplication detection and removal. However, it gives little consideration to the natural similarity of text within a specific field. Based on our experience in information management and data integration in the finance and securities industry, we analyze current text duplication removal algorithms, focusing on SimHash, and propose a new algorithm for text duplication detection and removal based on paragraph key phrases (DRPKP). We applied our algorithm to detect and remove duplicate text in a real data set on Guotai Junan's Financial Information and Unified Information Retrieval Platform. Compared with SimHash, DRPKP increases the precision of duplication removal by 24.5% and the recall by 16.34%, and it also takes less operating time.
Keywords: text duplication removal; paragraph key phrase; SimHash; similarity; MapReduce
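
The abstract names SimHash as the baseline, but this page gives no details of either algorithm. As background only, here is a minimal Python sketch of the generic SimHash fingerprinting idea. It is not the authors' implementation and it is not DRPKP. The function names, the MD5-based token hash, the regex tokenizer, and the Hamming-distance threshold of 3 are all assumptions chosen for illustration. Chinese text would also need a word segmenter, which is omitted here.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumption; 64 is a common choice)


def simhash(text: str) -> int:
    """Return a 64-bit SimHash fingerprint of the text (unweighted tokens)."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # Hash each token to a stable 64-bit integer; MD5 is only a mixer here.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    """Treat two texts as duplicates when their fingerprints are close.

    The threshold is a hypothetical default, not a value from the paper.
    """
    return hamming_distance(simhash(a), simhash(b)) <= threshold
```

Because every token votes with equal weight and paragraph structure is ignored, documents from one industry that share much of their vocabulary can end up with close fingerprints. That is the kind of limitation the abstract attributes to SimHash.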
This article is indexed in CNKI, VIP (维普), and other databases.