
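A Duplicate Record Elimination Algorithm Using Two Thresholds in a Data Warehouse Environment
Cite this article: Hong Yuan, Sun Weiwei, Shi Baile. A Duplicate Record Elimination Algorithm Using Two Thresholds in a Data Warehouse Environment[J]. Computer Engineering and Applications, 2005, 41(1): 168-170, 216.
Authors: Hong Yuan, Sun Weiwei, Shi Baile
Affiliation: Department of Computing and Information Technology, Fudan University, Shanghai 200433
Funding: National 863 High-Tech Research and Development Program of China (Grant No. 2002AA4Z3430)
Abstract: Duplicate record elimination is an important aspect of data cleaning research. Its goal is to detect and eliminate redundant data that could adversely affect subsequent OLAP and data mining. Existing approaches all decide whether two records are duplicates by setting a single similarity threshold. A threshold that is too high lowers recall, while a threshold that is too low raises the false-detection rate. This paper proposes a duplicate record elimination method with two thresholds, which uses the foreign key relationships between database tables in a data warehouse environment to make a further judgment. This effectively improves the quality of the decision and reduces the false-detection rate.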

Keywords: duplicate record elimination; data warehouse; foreign key reference; two thresholds
Article ID: 1002-8331-(2005)01-0168-03

Duplicate Records Elimination in Data Warehouse with Two Thresholds
Hong Yuan, Sun Weiwei, Shi Baile. Duplicate Records Elimination in Data Warehouse with Two Thresholds[J]. Computer Engineering and Applications, 2005, 41(1): 168-170, 216.
Authors: Hong Yuan, Sun Weiwei, Shi Baile
Abstract: A data warehouse integrates data from several data sources. Data integrated from external sources may contain errors, such as typographical errors or duplicate records, so it should be cleaned before OLAP or data mining. Duplicate record elimination is an important issue in data cleaning: it detects records that are not identical but may represent the same object. This paper introduces a new algorithm that uses two thresholds to detect duplicate records in a data warehouse. When traditional algorithms cannot make an explicit decision, the algorithm uses relationships between tables, such as foreign key references, to help decide whether two records are duplicates.
Keywords: duplicate elimination; data warehouse; foreign key reference; two thresholds
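
The two-threshold idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm, since the abstract does not give its details. The similarity measure (`difflib.SequenceMatcher`), the threshold values `low` and `high`, the `fk_min` cutoff, and the rule of comparing the sets of rows that reference each record through a foreign key are all assumptions made here for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; SequenceMatcher is a stand-in measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(x: set, y: set) -> float:
    """Overlap of two sets; 0.0 when both are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def is_duplicate(rec_a, rec_b, children, low=0.6, high=0.9, fk_min=0.5):
    """Classify a record pair with two thresholds (all values hypothetical).

    rec_a, rec_b: (primary_key, text) tuples from the same table.
    children: dict mapping a primary key to the set of values found in
              rows of another table that reference it via a foreign key.
    """
    sim = similarity(rec_a[1], rec_b[1])
    if sim >= high:   # clearly similar: treat as duplicates
        return True
    if sim < low:     # clearly different: treat as distinct
        return False
    # Uncertain zone between the thresholds: fall back on foreign-key
    # evidence (assumed rule: the referencing rows overlap enough).
    refs_a = children.get(rec_a[0], set())
    refs_b = children.get(rec_b[0], set())
    return jaccard(refs_a, refs_b) >= fk_min
```

Under these assumptions, a pair whose similarity falls between `low` and `high` is accepted as a duplicate only when the rows referencing the two records overlap sufficiently, which is one way the extra foreign-key check could reduce false detections without lowering the upper threshold.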
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.