A Taxonomy of Dirty Data
Authors: Won Kim, Byoung-Ju Choi, Eui-Kyeong Hong, Soo-Kyung Kim, Doheon Lee
Affiliation: (1) Cyber Database Solutions, Inc., Austin, Texas, USA; (2) Department of Computer Science, Ewha Institute of Science and Technology, Seoul, Korea; (3) University of Seoul, AITrc, Seoul, Korea; (4) Lucent Technologies, Seoul, Korea; (5) Department of Biosystems, Korea Advanced Institute of Science and Technology, Daejon, Korea
Abstract: Today, large corporations are constructing enterprise data warehouses from disparate data sources in order to run enterprise-wide data analysis applications, including decision support systems, multidimensional online analytical applications, data mining, and customer relationship management systems. A major problem that is only beginning to be recognized is that the data in data sources are often "dirty". Broadly, dirty data include missing data, wrong data, and non-standard representations of the same data. The results of analyzing a database or data warehouse of dirty data can be damaging and are at best unreliable. In this paper, a comprehensive classification of dirty data is developed for use as a framework for understanding how dirty data arise, manifest themselves, and may be cleansed to ensure proper construction of data warehouses and accurate data analysis. The impact of dirty data on data mining is also explored.
Keywords: dirty data, data quality, data mining, data cleansing, data warehousing
This paper is indexed in SpringerLink and other databases.