Similar Documents
20 similar documents found.
1.
Constraint-Based Rule Mining in Large, Dense Databases
Constraint-based rule miners find all rules in a given data-set meeting user-specified constraints such as minimum support and confidence. We describe a new algorithm that directly exploits all user-specified constraints including minimum support, minimum confidence, and a new constraint that ensures every mined rule offers a predictive advantage over any of its simplifications. Our algorithm maintains efficiency even at low supports on data that is dense (e.g., relational tables). Previous approaches such as Apriori and its variants exploit only the minimum support constraint, and as a result are ineffective on dense data due to a combinatorial explosion of “frequent itemsets”.
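For concreteness, a naive sketch of the support/confidence-constrained mining loop that the paper's algorithm improves on for dense data; the transaction encoding and names are illustrative, and the paper's predictive-advantage constraint is omitted:

```python
from itertools import combinations

def mine_rules(transactions, min_sup=0.1, min_conf=0.8, max_len=4):
    """Naive constraint-based rule miner: enforces minimum support and
    minimum confidence only. On dense data, the itemset enumeration
    below explodes combinatorially, which is the problem the paper
    addresses."""
    n = len(transactions)
    support = {}
    # count supports of all itemsets up to max_len
    for t in transactions:
        for k in range(1, min(len(t), max_len) + 1):
            for itemset in combinations(sorted(t), k):
                support[itemset] = support.get(itemset, 0) + 1
    frequent = {s: c / n for s, c in support.items() if c / n >= min_sup}
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for lhs in combinations(itemset, k):
                # lhs is frequent by anti-monotonicity of support
                conf = sup / frequent[lhs]
                if conf >= min_conf:
                    rhs = tuple(i for i in itemset if i not in lhs)
                    rules.append((lhs, rhs, sup, conf))
    return rules

rules = mine_rules([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}])
```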

2.
We consider an ε-optimal model reduction problem for a linear discrete time-invariant system, where the anisotropic norm of the reduction error transfer function is used as a performance criterion. For solving the main problem, we state and solve an auxiliary problem of H2 ε-optimal reduction of a weighted linear discrete time system. A sufficient optimality condition defining a solution to the anisotropic ε-optimal model reduction problem has the form of a system of cross-coupled nonlinear matrix algebraic equations including a Riccati equation, four Lyapunov equations, and five special-type nonlinear equations. The proposed approach to solving the problem ensures stability of the reduced model without any additional technical assumptions. The reduced-order model approximates the steady-state behavior of the full-order system.
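The optimality system couples a Riccati equation with four Lyapunov equations; solving each individual piece is standard numerical linear algebra. A minimal sketch (not the paper's coupled anisotropic system) of solving a discrete Lyapunov and a discrete Riccati equation with SciPy, for a hypothetical stable third-order system:

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

# Hypothetical stable discrete-time system (A, B, C), for illustration only.
A = np.array([[0.5, 0.1, 0.0],
              [0.0, 0.4, 0.2],
              [0.0, 0.0, 0.3]])
B = np.array([[1.0], [0.5], [0.2]])
C = np.array([[1.0, 0.0, 1.0]])

# Controllability Gramian: solves A P A' - P + B B' = 0
P = solve_discrete_lyapunov(A, B @ B.T)
# A discrete algebraic Riccati equation of the kind appearing
# in such optimality systems
X = solve_discrete_are(A, B, C.T @ C, np.eye(1))
```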

3.
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
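One well-established practice in this spirit is holdout evaluation with a Bonferroni-corrected significance level; a minimal sketch, assuming candidate rules arrive as (LHS, RHS) predicate pairs and using an exact binomial tail test:

```python
from math import comb

def binom_sf(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p), by exact tail summation."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def holdout_evaluate(candidates, holdout, alpha=0.05):
    """Keep only rules whose confidence on a holdout set significantly
    exceeds the RHS base rate, at a Bonferroni-corrected level that
    bounds the experimentwise error."""
    level = alpha / len(candidates)
    accepted = []
    for lhs, rhs in candidates:
        covered = [r for r in holdout if lhs(r)]
        if not covered:
            continue
        hits = sum(rhs(r) for r in covered)
        base = sum(rhs(r) for r in holdout) / len(holdout)
        # is the rule's confidence above the base rate by chance alone?
        if binom_sf(hits, len(covered), base) <= level:
            accepted.append((lhs, rhs))
    return accepted

rules = [(lambda r: r["age"] > 30, lambda r: r["buys"] == 1)]
kept = holdout_evaluate(
    rules, holdout=[{"age": 35, "buys": 1}] * 40 + [{"age": 20, "buys": 0}] * 60)
```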

4.
Most methods that generate decision trees for a specific problem use the examples of data instances in the tree-generation process. This article proposes a method called RBDT-1 (rule-based decision tree) for learning a decision tree from a set of decision rules that cover the data instances, rather than from the data instances themselves. The goal is to create on demand a short and accurate decision tree from a stable or dynamically changing set of rules. The rules may be generated by an expert, induced from examples of decision instances by an inductive rule-learning program such as the AQ-type rule induction programs, or extracted from a tree generated by another method such as ID3 or C4.5. In terms of tree complexity (the number of nodes and leaves in the decision tree), RBDT-1 compares favorably with AQDT-1 and AQDT-2, two methods that create decision trees from rules. RBDT-1 also compares favorably with ID3 and is as effective as C4.5, both well-known methods that generate decision trees from data examples. Experiments show that the classification accuracies of the decision trees produced by all methods under comparison are indistinguishable.
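A minimal sketch of growing a tree from rules rather than instances; the most-constrained-attribute heuristic below is a simple stand-in for RBDT-1's actual attribute-selection criteria:

```python
from collections import Counter

def build_tree(rules, attributes):
    """Grow a decision tree from rules (condition-dict -> class) rather
    than from data instances. Attribute selection here picks the
    attribute constrained by the most rules, an illustrative heuristic."""
    classes = [c for _, c in rules]
    if len(set(classes)) <= 1 or not attributes:
        return Counter(classes).most_common(1)[0][0] if classes else None
    if not any(a in cond for cond, _ in rules for a in attributes):
        return Counter(classes).most_common(1)[0][0]
    attr = max(attributes, key=lambda a: sum(a in cond for cond, _ in rules))
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for v in {cond[attr] for cond, _ in rules if attr in cond}:
        # a rule that does not mention `attr` applies to every branch
        subset = [(cond, c) for cond, c in rules if cond.get(attr, v) == v]
        branches[v] = build_tree(subset, remaining)
    return (attr, branches)

rules = [({"outlook": "sunny", "humid": "high"}, "no"),
         ({"outlook": "sunny", "humid": "normal"}, "yes"),
         ({"outlook": "rain"}, "yes")]
print(build_tree(rules, ["outlook", "humid"]))
```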

5.
We consider the problem of allocating n tasks of a distributed program to m processors of a distributed system in order to minimize total communication and processing costs. If the intertask communication can be represented by a tree and if the communication costs are uniform, it is known that an optimal allocation can be determined in O(nm) time. A K-optimal solution set Ω = {A1, ..., AK} of a given task allocation problem is a set of allocations such that no allocation which is not contained in Ω is better than any Ai, i = 1, ..., K. In this paper, an algorithm is presented which computes a K-optimal set for the considered task allocation problem in O(Knm) time.
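The paper builds on the known O(nm) dynamic program for a single optimal allocation on trees. A minimal sketch of that building block, assuming uniform communication costs and a hypothetical cost table (the K-optimal enumeration itself is not shown):

```python
def min_cost_allocation(children, exec_cost, comm, root=0):
    """Tree DP for task allocation. children[v]: child tasks of v;
    exec_cost[v][p]: cost of running task v on processor p; comm:
    uniform cost paid whenever a task and its parent run on different
    processors. dp[v][p] = exec_cost[v][p] + sum over children c of
    min(dp[c][p], min_q dp[c][q] + comm); sharing the inner min over
    q across all p gives O(n*m) overall."""
    m = len(exec_cost[root])

    def dp(v):
        cost = list(exec_cost[v])
        for c in children.get(v, []):
            child = dp(c)
            best = min(child) + comm            # child on a different processor
            for p in range(m):
                cost[p] += min(child[p], best)  # or keep it on the same one
        return cost

    return min(dp(root))

# Hypothetical 4-task tree on 2 processors:
children = {0: [1], 1: [2, 3]}
exec_cost = [[3, 1], [2, 5], [4, 1], [1, 1]]
print(min_cost_allocation(children, exec_cost, comm=2))
```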

6.
Multivariate Discretization for Set Mining
Many algorithms in data mining can be formulated as a set-mining problem where the goal is to find conjunctions (or disjunctions) of terms that meet user-specified constraints. Set-mining techniques have largely been designed for categorical or discrete data, where variables can take on only a fixed number of values. However, many datasets also contain continuous variables, and a common way of handling these is to discretize them by breaking them into ranges. Most discretization methods are univariate and consider only a single feature at a time (sometimes in conjunction with a class variable). We argue that this is a suboptimal approach for knowledge discovery, since univariate discretization can destroy hidden patterns in the data. Discretization should consider the effects on all variables in the analysis: two regions X and Y should be placed in the same interval after discretization only if the instances in those regions have similar multivariate distributions (F_X ≈ F_Y) across all variables and combinations of variables. We present a bottom-up merging algorithm that discretizes continuous variables based on this rule. Our experiments indicate that the approach is feasible, that it does not destroy hidden patterns, and that it generates meaningful intervals.
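A minimal sketch of the bottom-up merging rule, using a two-sample Kolmogorov-Smirnov test per remaining variable as a stand-in for the paper's multivariate similarity criterion; the bin count and significance level are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

def merge_discretize(X, col, init_bins=20, alpha=0.05):
    """Start from fine quantile bins on X[:, col] and merge adjacent
    intervals whenever the rows they contain look similar on every
    *other* variable, so that only cuts supporting a multivariate
    pattern survive."""
    edges = list(np.quantile(X[:, col], np.linspace(0, 1, init_bins + 1)))
    merged = True
    while merged and len(edges) > 2:
        merged = False
        for i in range(1, len(edges) - 1):
            left = X[(X[:, col] >= edges[i - 1]) & (X[:, col] < edges[i])]
            right = X[(X[:, col] >= edges[i]) & (X[:, col] < edges[i + 1])]
            if len(left) == 0 or len(right) == 0:
                similar = True
            else:
                similar = all(
                    ks_2samp(left[:, j], right[:, j]).pvalue > alpha
                    for j in range(X.shape[1]) if j != col)
            if similar:            # regions indistinguishable: merge them
                edges.pop(i)
                merged = True
                break
    return edges

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[X[:, 0] > 0, 1] += 2.0            # hidden pattern tied to the sign of col 0
print(merge_discretize(X, col=0))   # should retain a cut near 0
```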

7.
The information content of rules and rule sets and its application
The information content of rules is categorized into inner mutual information content and outer impartation information content. The conventional objective interestingness measures based on information theory all capture inner mutual information, representing the confidence of a rule and the mutual information between its antecedent and consequent. Moreover, almost all of these measures lose sight of the outer impartation information, which is what is conveyed to the user and helps the user make decisions. We put forward the viewpoint that the outer impartation information content of rules and rule sets can be represented by relations from an input universe to an output universe. With binary relations, the interaction of rules in a rule set can be easily represented by the union and intersection operators. Based on the entropy of relations, the outer impartation information content of rules and rule sets is then measured. We further define and measure the conditional information content of rules and rule sets, the independence of rules and rule sets, and the inconsistent knowledge of rule sets. The properties of these new measures are discussed and some interesting results are proven, for example that the information content of a rule set may exceed the sum of the information content of the rules in the set, and that the conditional information content of rules may be negative. Finally, applications of these new measures are discussed: a new method for appraising rule mining algorithms, and two rule pruning algorithms, λ-choice and RPClC. These methods and algorithms are advantageous in delivering decision information more efficiently.
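A minimal sketch of measuring imparted information through a relation from the input universe to the output universe: the more a rule set narrows the admissible outputs for an input, the more information it conveys. The aggregation below is one simple instantiation of that idea, not the paper's exact definitions:

```python
from math import log2

def impartation_information(relation, output_universe):
    """`relation` maps each input to the set of outputs the rule set
    allows for it. For each input, the information imparted is taken
    as the drop in uncertainty from the whole output universe down to
    the allowed set (an illustrative simplification)."""
    total = 0.0
    for x, allowed in relation.items():
        if allowed:  # the rules say something about input x
            total += log2(len(output_universe)) - log2(len(allowed))
    return total

# A rule set acting as a relation over inputs {1, 2, 3}:
r = {1: {"yes"}, 2: {"yes", "no"}, 3: {"no"}}
print(impartation_information(r, {"yes", "no", "maybe"}))
```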

8.
GDT-Based Rule Discovery in Incomplete Information Systems
A GDT-based method for rule discovery from incomplete information systems is proposed. Drawing on the idea of the GDT, the method uses the concepts of generalization strength, rule confidence, and rule strength to fully account for the uncertainty caused by incomplete data and noise, and obtains concise, practical rules directly from the incomplete information system without changing the size of the original system. Finally, an example illustrates how the method is carried out, and the method is compared analytically with several other cited approaches that do not discover rules from incomplete information systems directly. The analysis shows that the method is a new and effective way of extracting rules from incomplete information systems.

9.
We present a goal-directed E-unification procedure with eager Variable Elimination and a new rule, Cycle, for the case of collapsing equations, that is, equations of the type x ≈ v where x ∈ Var(v). Cycle replaces Variable Decomposition (the so-called Root Imitation) and thus removes the possibility of some obviously unnecessary infinite inference paths in the E-unification procedure. We prove that, as in other approaches, such inferences into variable positions are not needed in our goal-directed procedure. Our system is independent of the selection rule and complete for any E-unification problem.

10.
The quality of discovered association rules is commonly evaluated by interestingness measures (typically support and confidence) with the purpose of supplying indicators to the user in the understanding and use of the newly discovered knowledge. Low-quality datasets have a very bad impact on the quality of the discovered association rules, and one might legitimately wonder whether a so-called “interesting” rule LHS → RHS is meaningful when 30% of the LHS data are no longer up to date, 20% of the RHS data are not accurate, and 15% of the LHS data come from a data source well known for its bad credibility. This paper presents an overview of data quality characterization and management techniques that can be advantageously employed for improving the quality awareness of the knowledge discovery and data mining processes. We propose to integrate data quality indicators for quality-aware association rule mining, and present a cost-based probabilistic model for selecting legitimately interesting rules. Experiments on the challenging KDD-Cup-98 datasets show that variations in data quality have a great impact on the cost and quality of discovered association rules, and confirm our approach of integrating data quality indicators into the KDD process to ensure the quality of data mining results.
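A minimal sketch of discounting a rule's interestingness by data-quality indicators attached to its attributes; the indicator names, weights, and multiplicative aggregation are illustrative assumptions, not the paper's exact cost-based probabilistic model:

```python
def quality_aware_score(rule, indicators, weights=None):
    """Discount support * confidence by per-attribute quality indicators
    (freshness, accuracy, source credibility, each in [0, 1]); a rule
    built on stale or unreliable attributes scores lower."""
    weights = weights or {"freshness": 1.0, "accuracy": 1.0, "credibility": 1.0}
    penalty = 1.0
    for attr in rule["lhs"] + rule["rhs"]:
        q = indicators[attr]
        for name, w in weights.items():
            penalty *= q[name] ** w
    return rule["support"] * rule["confidence"] * penalty

# Hypothetical rule and quality indicators:
rule = {"lhs": ["age"], "rhs": ["donates"], "support": 0.12, "confidence": 0.81}
indicators = {"age": {"freshness": 0.7, "accuracy": 0.8, "credibility": 0.85},
              "donates": {"freshness": 0.9, "accuracy": 0.8, "credibility": 1.0}}
print(quality_aware_score(rule, indicators))
```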

11.
The parity space approach and the H2 approach are two important approaches to fault detection. This paper studies the relationship between the two, which reveals the frequency-domain characteristics of the optimal solution of the parity space approach on the one hand, and provides a numerical solution of the H2-optimal design of residual generators on the other.

12.
Top-Rank-K Frequent Itemset (or Pattern) Mining (FPM) is an important data mining task in which the user decides on the number of top frequency ranks of patterns (itemsets) they want to mine from a transactional dataset. This problem does not require the minimum support threshold parameter typically used in FPM; rather, algorithms solving the Top-Rank-K FPM problem are fed K, the number of frequency ranks of itemsets required, and compute the threshold internally. This paper presents two declarative approaches to the Top-Rank-K Closed FPM problem. The first is Boolean Satisfiability-based (SAT-based): we propose an effective encoding for the problem along with an efficient algorithm employing this encoding. The second is CP-based, that is, it utilizes Constraint Programming: a simple CP model is exploited in an innovative manner to mine the Top-Rank-K Closed FPM itemsets from transactional datasets. Both approaches are evaluated experimentally against other declarative and imperative algorithms. The proposed SAT-based approach significantly outperforms IM, another SAT-based approach, and outperforms the proposed CP approach on sparse and moderate datasets, whereas the latter excels on dense datasets. An extensive study assesses the proposed approaches in terms of feasibility, performance factors, and practicality of use.
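A brute-force sketch of the Top-Rank-K idea: the support threshold is derived internally as the K-th highest distinct support, with plain enumeration standing in for the paper's SAT and CP encodings, and the closedness check omitted:

```python
from collections import Counter
from itertools import combinations

def topk_rank_threshold(transactions, k, max_len=3):
    """Derive the internal support threshold as the k-th highest
    *distinct* support value among itemsets, then keep every itemset
    at or above it. No user-supplied minimum support is needed."""
    counts = Counter()
    for t in transactions:
        for n in range(1, min(len(t), max_len) + 1):
            counts.update(combinations(sorted(t), n))
    ranks = sorted(set(counts.values()), reverse=True)[:k]
    threshold = ranks[-1]
    return threshold, {s: c for s, c in counts.items() if c >= threshold}

threshold, itemsets = topk_rank_threshold(
    [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}], k=2)
print(threshold, itemsets)
```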

13.
As today’s manufacturing domain becomes more and more knowledge-intensive, knowledge-based systems (KBS) are widely applied in the predictive maintenance domain to detect and predict anomalies in machines and machine components. Within a KBS, decision rules are a comprehensive and interpretable tool for classification and knowledge discovery from data. However, when the decision rules incorporated in a KBS are extracted from heterogeneous sources, they may suffer from several rule quality issues, which weakens the performance of the KBS. To address this issue, we propose a rule base refinement approach that takes rule quality measures into account. The approach is based on a rule integration method for integrating expert rules with rules obtained from data mining. Within the integration process, rule accuracy, coverage, redundancy, conflict, and subsumption are the quality measures used to refine the rule base. A case study on a real-world data set shows the approach in detail.
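A minimal sketch of two of the named quality checks, assuming rules are (condition-dict, conclusion) pairs: subsumption-based redundancy removal and conflict detection. Accuracy and coverage checks would need the data and are omitted:

```python
def subsumes(r1, r2):
    """r1 subsumes r2 if r1's condition is more general (a subset of
    r2's attribute-value conditions) and both draw the same conclusion;
    one common formalization, standing in for the paper's definitions."""
    cond1, concl1 = r1
    cond2, concl2 = r2
    return concl1 == concl2 and all(cond2.get(a) == v for a, v in cond1.items())

def refine(rule_base):
    """Drop rules subsumed by a more general rule (redundancy), and flag
    pairs with identical conditions but different conclusions (conflict)."""
    kept, conflicts = [], []
    for i, r in enumerate(rule_base):
        if any(i != j and subsumes(other, r) for j, other in enumerate(rule_base)):
            continue                       # redundant: a more general rule exists
        kept.append(r)
    for i, (c1, k1) in enumerate(rule_base):
        for c2, k2 in rule_base[i + 1:]:
            if c1 == c2 and k1 != k2:
                conflicts.append((c1, k1, k2))
    return kept, conflicts

rules = [({"temp": "high"}, "alarm"),
         ({"temp": "high", "vib": "high"}, "alarm"),   # subsumed by the first
         ({"temp": "high"}, "ok")]                     # conflicts with the first
print(refine(rules))
```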

14.
For the solution of the problem of L-optimal experiment design, use is made of a skeleton algorithm [1]. This algorithm is modified for the solution of the multiparameter linear programming problem to which the initial problem reduces. The algorithm avoids almost-degenerate iterations and does not require matrix inversion.

15.
The synthesis procedure for an H2-optimal system is based on the idea of coercive splitting of the plant motion and on a controller construction that provides partial mutual compensation of these motions, in order to increase robustness to plant model uncertainty.

16.
Effective information systems require the existence of explicit process models. A completely specified process design needs to be developed in order to enact a given business process. This development is time consuming and often subjective and incomplete. We propose a method that constructs the process model from process log data by determining the relations between process tasks. To predict these relations, we employ machine learning techniques to induce rule sets. These rule sets are induced from simulated process log data generated by varying process characteristics such as noise and log size. Tests reveal that the induced rule sets have a high predictive accuracy on new data. The effects of noise and of imbalance in execution priorities during the discovery of the relations between process tasks are also discussed. Knowing the causal, exclusive, and parallel relations, a process model expressed in the Petri net formalism can be built. We illustrate our approach with real-world data in a case study.
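A minimal sketch of deriving causal, exclusive, and parallel relations from a log using fixed alpha-algorithm-style definitions; the paper instead predicts these relations with induced rule sets, which is more robust to noise:

```python
def log_relations(traces):
    """Derive task relations from a process log (a list of traces, each
    a sequence of task names), using the direct-succession relation:
    a -> b (causal), a || b (parallel), a # b (exclusive)."""
    follows = set()
    for trace in traces:
        follows.update(zip(trace, trace[1:]))   # direct succession: a > b
    tasks = {t for trace in traces for t in trace}
    causal, parallel, exclusive = set(), set(), set()
    for a in tasks:
        for b in tasks:
            if a == b:
                continue
            ab, ba = (a, b) in follows, (b, a) in follows
            if ab and not ba:
                causal.add((a, b))
            elif ab and ba:
                parallel.add((a, b))
            else:
                exclusive.add((a, b))
    return causal, parallel, exclusive

causal, parallel, exclusive = log_relations(
    [["register", "check", "pay"], ["register", "pay", "check"]])
print(causal, parallel, sep="\n")
```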

17.
Rules have been used in a database context for several purposes: deductive database queries, active database triggers, and production system programs. Exploring the search space for non-deterministic rule programs, however, has generally been available only in large monolithic systems intended for artificial intelligence applications. The goal of this research is to provide multi-path reasoning for non-deterministic rule programs in a database environment. The motivation for this work includes the increasing use of database-style triggers to assist data-intensive applications, e.g., the human genome project and the Intelligent Integration of Information (I3) Program. A non-deterministic rule program is one where there is more than one terminal state. Such programs are generally not considered appropriate for database queries, where the focus is on rule programs or techniques that guarantee a unique fixed point. But non-deterministic programs are commonly used for various heuristic search problems such as the traveling salesperson problem. With a non-deterministic program, it is particularly important to be flexible about the order in which the search space is explored, because exhaustive search is generally not feasible. The contributions of this research are: the representation of the multi-path search tree within a database and the ability for the problem solver to control the search process for a non-deterministic rule program.

18.
On account of the enormous number of rules that can be produced by data mining algorithms, knowledge post-processing is a difficult stage in an association rule discovery process. In order to find relevant knowledge for decision making, the user (a decision maker specialized in the data studied) needs to rummage through the rules. To assist in this task, we propose the rule-focusing methodology, an interactive methodology for the visual post-processing of association rules. It allows the user to explore large sets of rules freely by focusing his/her attention on limited subsets. This new approach relies on rule interestingness measures, on a visual representation, and on interactive navigation among the rules. We have implemented the rule-focusing methodology in a prototype system called ARVis. It exploits the user's focus to guide the generation of the rules by means of a specific constraint-based rule-mining algorithm. Julien Blanchard earned his Ph.D. in 2005 from Nantes University (France) and is currently an assistant professor at the Polytechnic School of Nantes University. He is the author of a book chapter and seven journal and international conference papers in the field of visualization and interestingness measures for data mining. Fabrice Guillet is a member of the LINA laboratory (CNRS 2729) at the Polytechnic Graduate School of Nantes University (France). He received his Ph.D. in computer science in 1995 from the Ecole Nationale Supérieure des Télécommunications de Bretagne. He is the author of 35 international publications in data mining and knowledge management. He is a founder and a permanent member of the Steering Committee of the annual EGC French-speaking conference. Henri Briand received his Ph.D. in 1983 from Paul Sabatier University in Toulouse (France) and has over 100 publications in database systems and database mining. He was the head of the Computer Engineering Department at the Polytechnic School of Nantes University, where he was in charge of a research team in the data mining domain. He is responsible for the organization of the Data Mining Master at Nantes University.

19.
Partial match queries arise frequently in the context of large databases, where each record contains a distinct multidimensional key, that is, the key of each record is a K-tuple of values. The components of a key are called the coordinates or attributes of the key. In a partial match query we specify the value of s attributes, 0 < s < K, and leave the remaining K − s attributes unspecified. The goal is to retrieve all the records in the database that match the specified attributes. In this paper we present several results about the average performance and variance of partial matches in relaxed K-dimensional trees (search trees and digital tries). These data structures are variants of the well-known K-d trees and K-d tries. In relaxed trees the sequence of attributes used to guide a query is explicitly stored at the nodes of the tree and randomly generated and, in general, will be different for different search paths. In the standard variants, the sequence of attributes that guides a query examines the attributes in a cyclic fashion, fixed and identical for all search paths. We show that the probabilistic analysis of the relaxed multidimensional trees is very similar to that of standard K-d trees and K-d tries, and also to the analysis of quadtrees. In fact, besides the average cost and variance of partial match in relaxed K-d trees and K-d tries, we also obtain the variance of partial matches in two-dimensional quadtrees. We also compute the average cost of partial matches in other relaxed multidimensional digital tries, namely, relaxed K-d Patricia and relaxed K-d digital search trees. This research was supported by Acciones Integradas Hispano-Austríacas HU1997-0016 (Austrian-Spanish Scientific Exchange Program). The first author was also supported by ESPRIT LTR 20244 (ALCOM IT), CICYT TIC97-1475-CE, DGES PB95-0787 (KOALA), and CIRIT 1997SGR-00366 (SGR). The second author was also supported by the Austrian Research Society (FWF) under Project P12599-MAT. Online publication October 13, 2000.
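A minimal sketch of the query being analyzed: a standard K-d tree with cyclic attribute choice and a partial-match search in which unspecified attributes (None) force descent into both subtrees; the relaxed variants would instead store a randomly chosen axis at each node:

```python
def build_kdtree(points, depth=0):
    """Standard K-d tree: the discriminating axis cycles with depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def partial_match(node, query, out):
    """`query` fixes s coordinates and leaves K - s as None. Each
    unspecified coordinate at a node forces descent into both subtrees,
    which drives the costs analyzed in the paper."""
    if node is None:
        return
    p, axis = node["point"], node["axis"]
    if all(q is None or q == v for q, v in zip(query, p)):
        out.append(p)
    q = query[axis]
    if q is None or q <= p[axis]:
        partial_match(node["left"], query, out)
    if q is None or q >= p[axis]:
        partial_match(node["right"], query, out)

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
hits = []
partial_match(tree, (None, 4), hits)   # all records with second attribute 4
print(hits)                            # [(5, 4)]
```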

20.
In this paper, we propose two sampling theories of rule discovery based on generality and accuracy. The first theory concerns the worst case: it extends a preliminary version of PAC learning, which represents a worst-case analysis for classification. In our analysis, a rule is defined as a probabilistic constraint of true assignment to the class attribute for the corresponding examples, and we mainly analyze the case in which we try to avoid finding a bad rule. The effectiveness of our approach is demonstrated through examples of conjunction-rule discovery. The second theory concerns a distribution-based case: it gives the conditions under which a rule exceeds pre-specified thresholds for generality and accuracy with high reliability. The idea is to assume a 2-dimensional normal distribution for the two probabilistic variables and obtain the conditions from their confidence region. This approach has been validated experimentally on 21 benchmark data sets from the machine learning community against conventional methods, each of which evaluates the reliability of generality only. Discussions of related work are provided for PAC learning, multiple comparison, and the analysis of association-rule discovery.
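For the distribution-based theory, a minimal sketch of the acceptance test: estimate generality and accuracy from counts and require their lower confidence bounds to clear the user's thresholds. Independent normal-approximation bounds stand in for the paper's joint 2-D confidence region, and all thresholds are illustrative:

```python
from math import sqrt

def rule_reliable(n_lhs, n_both, n_total, min_gen, min_acc, z=1.96):
    """Accept a rule when lower confidence bounds on its generality
    P(LHS) and accuracy P(RHS | LHS) both exceed the pre-specified
    thresholds with high reliability."""
    gen = n_lhs / n_total            # estimated generality
    acc = n_both / n_lhs             # estimated accuracy (confidence)
    gen_lo = gen - z * sqrt(gen * (1 - gen) / n_total)
    acc_lo = acc - z * sqrt(acc * (1 - acc) / n_lhs)
    return gen_lo >= min_gen and acc_lo >= min_acc

# 120 of 1000 examples match the LHS; 102 of those also match the RHS:
print(rule_reliable(120, 102, 1000, min_gen=0.08, min_acc=0.75))
```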
