Similar literature
 Found 20 similar documents; search took 15 ms
1.
We describe a grid-based approach for enterprise-scale data mining that leverages parallel database technology for data storage and on-demand compute servers for parallelism in the statistical computations. This approach is targeted at the use of data mining in highly automated vertical business applications, where the data is stored on one or more relational database systems, and an independent set of high-performance compute servers or a network of low-cost, commodity processors is used to improve the application performance and overall workload management. The goal of this paper is to describe an algorithmic decomposition of data mining kernels between the data storage and compute grids, which makes it possible to exploit the parallelism on the respective grids in a simple way, while minimizing the data transfer between these grids. This approach is compatible with existing standards for data mining task specification and results reporting, so that larger applications using these data mining algorithms do not have to be modified to benefit from this grid-based approach.

2.
Data mining has attracted considerable research effort during the past decade. However, little work has been reported on efficiently supporting a large number of users who issue different data mining queries periodically, as new needs arise and as data is updated. Our work is motivated by the fact that the pattern-growth method, which constructs an initial tree and mines frequent patterns on top of the tree, is one of the most efficient methods for frequent pattern mining. In this paper, we present a data mining proxy approach that can reduce the I/O cost of constructing an initial tree by utilizing trees that are already resident in memory. The tree we construct is the smallest for a given data mining query. In addition, our proxy approach can also reduce the CPU cost of mining patterns, because the cost of mining depends on the sizes of the trees. The focus of the work is to construct an initial tree efficiently. We propose three tree operations to construct a tree. With a unique coding scheme, we can efficiently project subtrees from on-disk trees or in-memory trees. Our performance study indicated that the data mining proxy significantly reduces the I/O cost to construct trees and the CPU cost to mine patterns over the trees constructed.
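
The initial tree that pattern-growth methods mine can be illustrated with a small sketch. This is a generic FP-tree style prefix tree, not the proxy approach or the three tree operations of the paper; the names `Node` and `build_fp_tree` and the sample transactions are assumptions made for illustration.

```python
from collections import Counter


class Node:
    """One tree node: an item, its count, and its children keyed by item."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    """Build a prefix tree over frequent items, most frequent item first."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in counts.items() if c >= min_support}
    root = Node(None)
    for t in transactions:
        # Keep frequent items only, ordered by descending support (ties by name).
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root


# Hypothetical sample data.
transactions = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["b"]]
tree = build_fp_tree(transactions, min_support=2)
```

Because shared prefixes are stored once, a tree built for one query can in principle be reused for another, which is the kind of saving the abstract attributes to trees already resident in memory.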

3.
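A data mining method for database intrusion detection   Total citations: 3, self-citations: 0, citations by others: 3
A data mining method is proposed for detecting malicious transactions in database systems. The method mines data association rules among the transactions on data items in the database; the association rule miner designed for this purpose mainly mines data related to database log records. Transactions that do not conform to the association rules are treated as malicious. Experiments show that the method can detect malicious transactions effectively.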

4.
A hybrid multi-group approach for privacy-preserving data mining   Total citations: 6, self-citations: 6, citations by others: 0
In this paper, we propose a hybrid multi-group approach for privacy-preserving data mining. We make two contributions in this paper. First, we propose a hybrid approach. Previous work has used either the randomization approach or the secure multi-party computation (SMC) approach. However, these two approaches have complementary features: the randomization approach is much more efficient but less accurate, while the SMC approach is less efficient but more accurate. We propose a novel hybrid approach, which takes advantage of the strengths of both approaches to balance the accuracy and efficiency constraints. Compared to the two existing approaches, our proposed approach can achieve much better accuracy than the randomization approach and much lower computation cost than the SMC approach. Second, we propose a multi-group scheme that makes it flexible for the data miner to control the balance between data mining accuracy and privacy. This scheme is motivated by the fact that existing randomization schemes that randomize data at the individual attribute level can produce insufficient accuracy when the number of dimensions is high. We partition attributes into groups, and develop a scheme to conduct group-based randomization to achieve better data mining accuracy. To demonstrate the effectiveness of the proposed general schemes, we have implemented them for the ID3 decision tree algorithm and the association rule mining problem, and we also present experimental results.
Wenliang Du
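
The accuracy cost of randomization can be made concrete with a small sketch. It uses a simple randomized-response scheme on one binary attribute, which is an assumption chosen for illustration; the abstract does not say which randomization scheme the paper uses, and the hybrid and multi-group schemes are not shown. The names `randomize` and `estimate_proportion` and the parameter `theta` are hypothetical.

```python
import random


def randomize(bits, theta, rng):
    """Report each true bit with probability theta, otherwise report its flip."""
    return [b if rng.random() < theta else 1 - b for b in bits]


def estimate_proportion(reported, theta):
    """Estimate the true share of 1s from randomized reports.

    Inverts E[observed] = theta * p + (1 - theta) * (1 - p) for p.
    """
    if theta == 0.5:
        raise ValueError("theta = 0.5 carries no information about the data")
    observed = sum(reported) / len(reported)
    return (observed + theta - 1) / (2 * theta - 1)


rng = random.Random(0)
true_bits = [1] * 300 + [0] * 700            # hypothetical data, true share 0.3
reported = randomize(true_bits, theta=0.8, rng=rng)
estimate = estimate_proportion(reported, theta=0.8)
```

The estimate is only approximately equal to the true share, and the noise grows as `theta` approaches 0.5, which is the efficiency-versus-accuracy trade-off the abstract describes for randomization.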

5.
Although the integration of engineering data within the framework of product data management systems has been successful in recent years, the holistic analysis (from a systems engineering perspective) of multi-disciplinary data or data based on different representations and tools is still not realized in practice. At the same time, the application of advanced data mining techniques to complete designs is very promising and bears a high potential for synergy between different teams in the development process. In this paper, we propose shape mining as a framework to combine and analyze data from engineering design across different tools and disciplines. In the first part of the paper, we introduce unstructured surface meshes as meta-design representations that enable us to apply sensitivity analysis, design concept retrieval and learning as well as methods for interaction analysis to heterogeneous engineering design data. We propose a new measure of relevance to evaluate the utility of a design concept. In the second part of the paper, we apply the formal methods to passenger car design. We combine data from different representations, design tools and methods for a holistic analysis of the resulting shapes. We visualize sensitivities and sensitive cluster centers (after feature reduction) on the car shape. Furthermore, we are able to identify conceptual design rules using tree induction and to create interaction graphs that illustrate the interrelation between spatially decoupled surface areas. Shape data mining in this paper is studied for a multi-criteria aerodynamic problem, i.e. drag force and rear lift; however, the extension to quality criteria from different disciplines is straightforward as long as the meta-design representation is still applicable.

6.
A decision-theoretic approach to data mining   Total citations: 1, self-citations: 0, citations by others: 1
In this paper, we develop a decision-theoretic framework for evaluating data mining systems, which employ classification methods, in terms of their utility in decision-making. The decision-theoretic model provides an economic perspective on the value of "extracted knowledge", in terms of its payoff to the organization, and suggests a wide range of decision problems that arise from this point of view. The relation between the quality of a data mining system and the amount of investment that the decision maker is willing to make is formalized. We propose two ways by which independent data mining systems can be combined and show that the combined data mining system can be used in the decision-making process of the organization to increase payoff. Examples are provided to illustrate the various concepts, and several ways by which the proposed framework can be extended are discussed.

7.
Data mining can extract valuable information from databases to help a business with knowledge discovery and business intelligence. Databases store large amounts of structured data, and the amount of data keeps growing with advances in database technology and the extensive use of information systems. Despite the falling price of storage devices, it is still important to develop efficient techniques for database compression. This paper develops a database compression method that eliminates redundant data, which often exist in transaction databases. The proposed approach uses a data mining structure to extract association rules from a database. Redundant data are then replaced by means of compression rules. A heuristic method is designed to resolve conflicts among the compression rules. To demonstrate its efficiency and effectiveness, the proposed approach is compared with two other database compression methods.
Chin-Feng Lee is an associate professor with the Department of Information Management at Chaoyang University of Technology, Taiwan, R.O.C. She received her M.S. and Ph.D. degrees in 1994 and 1998, respectively, from the Department of Computer Science and Information Engineering at National Chung Cheng University. Her current research interests include database design, image processing and data mining techniques.
S. Wesley Changchien is a professor with the Institute of Electronic Commerce at National Chung-Hsing University, Taiwan, R.O.C. He received a BS degree in Mechanical Engineering (1989) and completed his MS (1993) and Ph.D. (1996) degrees in Industrial Engineering at State University of New York at Buffalo, USA. His current research interests include electronic commerce, internet/database marketing, knowledge management, data mining, and decision support systems.
Jau-Ji Shen received his Ph.D. degree in Information Engineering and Computer Science from National Taiwan University at Taipei, Taiwan in 1988. From 1988 to 1994, he was the leader of the software group in Institute of Aeronautic, Chung-Sung Institute of Science and Technology. He is currently an associate professor in the information management department of National Chung Hsing University at Taichung. His research areas focus on digital multimedia, databases and information security. His current research areas focus on data engineering, database techniques and information security.
Wei-Tse Wang received the B.A. (2001) and M.B.A. (2003) degrees in Information Management at Chaoyang University of Technology, Taiwan, R.O.C. His research interests include data mining, XML, and database compression.

8.
9.
In this paper, we propose a novel face detection method based on the MAFIA algorithm. Our proposed method consists of two phases, namely, training and detection. In the training phase, we first apply Sobel's edge detection operator, a morphological operator, and thresholding to each training image, and transform it into an edge image. Next, we use the MAFIA algorithm to mine the maximal frequent patterns from those edge images and obtain the positive feature pattern. Similarly, we can obtain the negative feature pattern from the complements of the edge images. Based on the feature patterns mined, we construct a face detector to prune non-face candidates. In the detection phase, we apply a sliding window to the testing image at different scales. If a sliding window passes the face detector, it is considered a human face. The proposed method can automatically find the feature patterns that capture most facial features. By using the feature patterns to construct a face detector, the proposed method is robust to race, illumination, and facial expression. The experimental results show that the proposed method has outstanding performance on the MIT-CMU dataset and comparable performance on the BioID dataset in terms of false positives and detection rate.
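
The first step of the training phase, turning a grey-level image into a binary edge image, might look like the following sketch. It applies the standard Sobel kernels and a fixed threshold only; the morphological operator and the MAFIA mining step are omitted, and the function name `sobel_edges`, the threshold value and the tiny test image are assumptions.

```python
def sobel_edges(image, threshold):
    """Return a binary edge map for a 2-D list of grey levels.

    Border pixels are left at 0 because the 3x3 kernels do not fit there.
    """
    h, w = len(image), len(image[0])
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(kx[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(ky[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            # Approximate the gradient magnitude, then threshold to 0/1.
            edges[y][x] = 1 if abs(gx) + abs(gy) >= threshold else 0
    return edges


# Hypothetical 4x4 image with a vertical step from dark to bright.
image = [[0, 0, 10, 10] for _ in range(4)]
edge_map = sobel_edges(image, threshold=20)
```

Each binary edge image can then be read as a set of "on" pixel positions, which is one plausible way to feed it to a frequent-pattern miner.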

10.
From visual data exploration to visual data mining: a survey   Total citations: 8, self-citations: 0, citations by others: 8
We survey work on the different uses of graphical mapping and interaction techniques for visual data mining of large data sets represented as table data. Basic terminology related to data mining, data sets, and visualization is introduced. Previous work on information visualization is reviewed in light of different categorizations of techniques and systems. The role of interaction techniques is discussed, in addition to work addressing the question of selecting and evaluating visualization techniques. We review some representative work on the use of information visualization techniques in the context of mining data. This includes both visual data exploration and visually expressing the outcome of specific mining algorithms. We also review recent innovative approaches that attempt to integrate visualization into the DM/KDD process, using it to enhance user interaction and comprehension.

11.
Information visualization and visual data mining   Total citations: 12, self-citations: 0, citations by others: 12
Never before in history has data been generated at such high volumes as it is today. Exploring and analyzing the vast volumes of data is becoming increasingly difficult. Information visualization and visual data mining can help to deal with the flood of information. The advantage of visual data exploration is that the user is directly involved in the data mining process. A large number of information visualization techniques have been developed over the last decade to support the exploration of large data sets. In this paper, we propose a classification of information visualization and visual data mining techniques which is based on the data type to be visualized, the visualization technique, and the interaction and distortion technique. We illustrate the classification with a few examples, most of them referring to techniques and systems presented in this special section.

12.
Digital Library support for textual and certain types of non-textual documents has advanced significantly in recent years. While Digital Library support involves many aspects along the whole library workflow model, interactive and visual retrieval allowing effective query formulation and result presentation are important functions. Recently, new kinds of non-textual documents which merit Digital Library support, but cannot yet be fully accommodated by existing Digital Library technology, have come into focus. Scientific data, as produced, for example, by scientific experimentation, simulation or observation, is such a document type. In this article we report on a concept and first implementation of Digital Library functionality for supporting visual retrieval and exploration in a specific important class of scientific primary data, namely, time-oriented research data. The approach is developed in an interdisciplinary effort by experts from the library, natural sciences, and visual analytics communities. In addition to presenting the concept and discussing relevant challenges, we present results from a first implementation of our approach as applied to a real-world scientific primary data set. We also report on initial user feedback obtained during discussions with domain experts from the earth observation sciences, indicating the usefulness of our approach.

13.
This paper proposes a flexible sequence alignment approach for pattern mining and matching in the recognition of human activities. During pattern mining, the proposed sequence alignment algorithm is invoked to extract from the training patterns the representative patterns that denote specific activities of a person. It features high performance and robustness to pattern diversity. In addition, the algorithm uses the appearance probability of each pattern as a weight and allows the pattern length to adapt to various human activities. Both features improve the accuracy of activity recognition. In pattern matching, the proposed algorithm adopts a dynamic programming based strategy to evaluate the degree of correlation between each representative activity pattern and the observed activity sequence. This avoids the need to segment the observed sequence. Moreover, recognition results can be obtained continuously. The proposed matching algorithm also supports recognition of concurrent human activities through parallel matching. The experimental results confirm the high accuracy of human activity recognition by the proposed approach.
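
A dynamic programming match that needs no prior segmentation of the observed sequence can be sketched as follows. This is a generic semi-global alignment (the whole pattern against any stretch of the observation), not the algorithm of the paper; the scoring values, the function name `best_match_score` and the activity labels are assumptions.

```python
def best_match_score(pattern, observed, match=1, mismatch=-1, gap=-1):
    """Best alignment score of the whole pattern against any part of observed.

    Skipping observed symbols before and after the match is free, so the
    observed sequence does not have to be cut into segments first.
    """
    n = len(observed)
    prev = [0] * (n + 1)                      # empty pattern fits anywhere
    for i in range(1, len(pattern) + 1):
        curr = [i * gap] + [0] * n            # pattern symbols with nothing to match
        for j in range(1, n + 1):
            s = match if pattern[i - 1] == observed[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s,    # align the two symbols
                          prev[j] + gap,      # pattern symbol left unmatched
                          curr[j - 1] + gap)  # extra observed symbol inside match
        prev = curr
    return max(prev)                          # the match may end anywhere


# Hypothetical activity labels.
pattern = ["sit", "stand", "walk"]
observed = ["idle", "sit", "stand", "pause", "walk", "idle"]
score = best_match_score(pattern, observed)
```

Running the same function for every representative pattern and comparing the scores gives a simple stand-in for the "degree of correlation" mentioned in the abstract.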

14.
HD-Eye: visual mining of high-dimensional data   Total citations: 3, self-citations: 0, citations by others: 3
Clustering in high-dimensional databases poses an important problem. However, we can apply a number of different clustering algorithms to high-dimensional data. The authors consider how an advanced clustering algorithm combined with new visualization methods interactively clusters data more effectively. Experiments show that these techniques improve the data mining process.

15.
Since sport marketing is a commercial activity, customer and market segmentation must be investigated precisely and frequently, and it helps to understand the sport market once a specific customer profile, segment, or pattern associated with marketing activities has been found. Such knowledge would not only help sport firms, but would also contribute to the broader field of sport customer behavior and marketing. This paper proposes using the Apriori algorithm for association rules, together with clustering analysis, in an ontology-based data mining approach for mining customer knowledge from the database. Knowledge extracted from the data mining results is illustrated as knowledge patterns, rules, and maps in order to propose suggestions and solutions to the case firm, Taiwan Adidas, for possible product promotion and sport marketing.
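
The frequent-itemset stage of Apriori, from which association rules are later derived, can be sketched in a few lines. The ontology and clustering parts of the approach are not shown, and the function name `apriori`, the support threshold and the product names are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    baskets = [frozenset(t) for t in transactions]
    candidates = {frozenset([i]) for b in baskets for i in b}
    frequent, k = {}, 1
    while candidates:
        # Count how many baskets contain each candidate itemset.
        counts = {c: sum(c <= b for b in baskets) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unite frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
    return frequent


# Hypothetical purchase records.
transactions = [["shoes", "socks"], ["shoes", "socks", "cap"],
                ["shoes", "cap"], ["socks"]]
frequent_itemsets = apriori(transactions, min_support=2)
```

A rule such as "shoes implies socks" would then be scored by dividing the support of the pair by the support of "shoes" alone.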

16.
Data mining is most commonly used in attempts to induce association rules from transaction data. In the past, we used fuzzy and GA concepts to discover both useful fuzzy association rules and suitable membership functions from quantitative values. The evaluation of fitness values was, however, quite time-consuming. Due to dramatic increases in available computing power and concomitant decreases in computing costs over the last decade, learning or mining by applying parallel processing techniques has become a feasible way to overcome the slow-learning problem. In this paper, we thus propose a parallel genetic-fuzzy mining algorithm based on the master–slave architecture to extract both association rules and membership functions from quantitative transactions. The master processor uses a single population as a simple genetic algorithm does, and distributes the tasks of fitness evaluation to slave processors. The evolutionary processes, such as crossover, mutation and production, are performed by the master processor. It is very natural and efficient to run the proposed algorithm on the master–slave architecture. The time complexities of both the sequential and parallel genetic-fuzzy mining algorithms have also been analyzed, with results showing the benefit of the proposed one. When the number of generations is large, the speed-up can be nearly linear. The experimental results also show this point. Applying the master–slave parallel architecture to speed up the genetic-fuzzy data mining algorithm is thus a feasible way to overcome the low-speed fitness evaluation problem of the original algorithm.
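
The division of labour described above, where the master holds one population and only fitness evaluation is farmed out, can be sketched as follows. The fitness function is a placeholder (a count of 1-bits), not the genetic-fuzzy evaluation of the paper, and a thread pool stands in for the slave processors; for CPU-bound fitness in CPython a process pool would be the more realistic choice. All names and parameter values are assumptions.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fitness(chromosome):
    """Placeholder objective standing in for the costly evaluation."""
    return sum(chromosome)


def evolve(pop_size=8, length=10, generations=20, workers=4, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # Workers evaluate fitness; the master does everything else.
            scores = list(pool.map(fitness, population))
            ranked = [c for _, c in
                      sorted(zip(scores, population), reverse=True)]
            parents = ranked[: pop_size // 2]
            children = []
            while len(children) < pop_size - len(parents):
                a, b = rng.sample(parents, 2)
                cut = rng.randrange(1, length)      # one-point crossover
                child = a[:cut] + b[cut:]
                if rng.random() < 0.1:              # occasional bit-flip mutation
                    i = rng.randrange(length)
                    child[i] = 1 - child[i]
                children.append(child)
            population = parents + children
    return max(population, key=fitness)


best = evolve()
```

Since only the `pool.map` call runs in parallel, any speed-up depends on fitness evaluation dominating the cost of each generation, which matches the motivation given in the abstract.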

17.
The number, variety and complexity of projects involving data mining or knowledge discovery in databases have recently increased at such a pace that aspects related to their development process need to be standardized for results to be integrated, reused and interchanged in the future. Data mining projects are quickly becoming engineering projects, and current standard processes, like CRISP-DM, need to be revisited to incorporate this engineering viewpoint. This is the central motivation of this paper, which argues that the experience gained about the software development process over almost 40 years could be reused and integrated to improve data mining processes. Consequently, this paper proposes to reuse ideas and concepts underlying the IEEE Std 1074 and ISO 12207 software engineering model processes to redefine and add to the CRISP-DM process and make it a data mining engineering standard.

18.
In this work we study how a user can be guided to find a relevant visualization in the context of visual data mining. We review the state of the art in user assistance for visual and interactive methods. We propose a user assistant called VizAssist, which aims at improving the existing approaches along three directions. First, it uses simpler computational models of the visualizations and the visual perception guidelines, in order to facilitate the integration of new visualizations and the definition of a mapping heuristic. Second, VizAssist allows the user to provide feedback in a visual and interactive way, with the aim of improving the data-to-visualization mapping. This step is performed with an interactive genetic algorithm. Finally, VizAssist aims at providing a free on-line tool (www.vizassist.fr) that respects the privacy of the user data. This assistant can be viewed as a global interface between the user and some of the many visualizations that are implemented with D3js.

19.
Recent trends in collecting huge and diverse datasets have created a great challenge in data analysis. One of the characteristics of these gigantic datasets is that they often have significant amounts of redundancy. The use of very large multi-dimensional data will result in more noise, redundant data, and the possibility of unconnected data entities. To efficiently manipulate data represented in a high-dimensional space and to address the impact of redundant dimensions on the final results, we propose a new technique for dimensionality reduction using copulas and the LU-decomposition (forward substitution) method. The proposed method compares favorably with existing approaches on real-world datasets taken from a machine learning repository (Diabetes, Waveform, two versions of Human Activity Recognition based on Smartphone, and Thyroid) in terms of dimensionality reduction and efficiency, which are assessed with statistical and classification measures.
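
Forward substitution, the step named in parentheses above, solves a lower-triangular system row by row. The sketch below shows only that step; how the paper combines it with copulas is not stated in the abstract and is not attempted here. The function name and the sample matrix are assumptions.

```python
def forward_substitution(L, b):
    """Solve L y = b for a lower-triangular L with a nonzero diagonal."""
    y = []
    for i, row in enumerate(L):
        # Subtract the contribution of the already-solved unknowns.
        partial = sum(row[j] * y[j] for j in range(i))
        y.append((b[i] - partial) / row[i])
    return y


# Hypothetical lower-triangular factor and right-hand side.
L = [[2, 0, 0],
     [1, 3, 0],
     [4, 5, 6]]
b = [2, 7, 32]
y = forward_substitution(L, b)
```

In an LU-based solver this would be followed by back substitution on the upper-triangular factor to recover the final solution.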

20.
In privacy-preserving data mining (PPDM), a widely used method for achieving data mining goals while preserving privacy is based on k-anonymity. This method, which protects subject-specific sensitive data by anonymizing it before it is released for data mining, demands that every tuple in the released table should be indistinguishable from no fewer than k subjects. The most common approach for achieving compliance with k-anonymity is to replace certain values with less specific but semantically consistent values. In this paper we propose a different approach for achieving k-anonymity by partitioning the original dataset into several projections such that each one of them adheres to k-anonymity. Moreover, any attempt to rejoin the projections results in a table that still complies with k-anonymity. A classifier is trained on each projection and, subsequently, an unlabelled instance is classified by combining the classifications of all classifiers. Guided by classification accuracy and k-anonymity constraints, the proposed data mining privacy by decomposition (DMPD) algorithm uses a genetic algorithm to search for optimal feature set partitioning. Ten separate datasets were evaluated with DMPD in order to compare its classification performance with other k-anonymity-based methods. The results suggest that DMPD performs better than existing k-anonymity-based algorithms and that there is no need to apply domain-dependent knowledge. Using multiobjective optimization methods, we also examine the tradeoff between the two conflicting objectives in PPDM: privacy and predictive performance.
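
The k-anonymity condition itself is easy to check for a given set of attributes, as in the sketch below. It is only a checker applied to one projection at a time, not the DMPD algorithm or its genetic search; the function name `is_k_anonymous`, the attribute names and the sample rows are assumptions.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


# Hypothetical table with two quasi-identifier attributes.
rows = [
    {"age": "30-39", "zip": "100**"},
    {"age": "30-39", "zip": "200**"},
    {"age": "40-49", "zip": "100**"},
    {"age": "40-49", "zip": "200**"},
]
age_only_ok = is_k_anonymous(rows, ["age"], k=2)
zip_only_ok = is_k_anonymous(rows, ["zip"], k=2)
both_ok = is_k_anonymous(rows, ["age", "zip"], k=2)
```

In this toy table each single-attribute projection is 2-anonymous while the two attributes together are not, which shows why a partitioning approach has to make sure that rejoined projections still satisfy the constraint.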
