首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
The growth of machine-generated relational databases, both in the sciences and in industry, is rapidly outpacing our ability to extract useful information from them by manual means. This has brought into focus machine learning techniques like Inductive Logic Programming (ILP) that are able to extract human-comprehensible models for complex relational data. The price to pay is that ILP techniques are not efficient: they can be seen as performing a form of discrete optimisation, which is known to be computationally hard; and the complexity is usually some super-linear function of the number of examples. While little can be done to alter the theoretical bounds on the worst-case complexity of ILP systems, some practical gains may follow from the use of multiple processors. In this paper we survey the state-of-the-art on parallel ILP. We implement several parallel algorithms and study their performance using some standard benchmarks. The principal findings of interest are these: (1) of the techniques investigated, one that simply constructs models in parallel on each processor using a subset of data and then combines the models into a single one, yields the best results; and (2) sequential (approximate) ILP algorithms based on randomized searches have lower execution times than (exact) parallel algorithms, without sacrificing the quality of the solutions found. This is an extended version of the paper entitled Strategies to Parallelize ILP Systems, published in the Proceedings of the 15th International Conference on Inductive Logic Programming (ILP 2005), vol. 3625 of LNAI, pp. 136–153, Springer-Verlag.  相似文献   

2.
Inductive logic programming (ILP) is a sub‐field of machine learning that provides an excellent framework for multi‐relational data mining applications. The advantages of ILP have been successfully demonstrated in complex and relevant industrial and scientific problems. However, to produce valuable models, ILP systems often require long running times and large amounts of memory. In this paper we address fundamental issues that have direct impact on the efficiency of ILP systems. Namely, we discuss how improvements in the indexing mechanisms of an underlying logic programming system benefit ILP performance. Furthermore, we propose novel data structures to reduce memory requirements and we suggest a new lazy evaluation technique to search the hypothesis space more efficiently. These proposals have been implemented in the April ILP system and evaluated using several well‐known data sets. The results observed show significant improvements in running time without compromising the accuracy of the models generated. Indeed, the combined techniques achieve several order of magnitudes speedup in some data sets. Moreover, memory requirements are reduced in nearly half of the data sets. Copyright © 2008 John Wiley & Sons, Ltd.  相似文献   

3.
It is well known that many hard tasks considered in machine learning and data mining can be solved in a rather simple and robust way with an instance- and distance-based approach. In this work we present another difficult task: learning, from large numbers of complex performances by concert pianists, to play music expressively. We model the problem as a multi-level decomposition and prediction task. We show that this is a fundamentally relational learning problem and propose a new similarity measure for structured objects, which is built into a relational instance-based learning algorithm named DISTALL. Experiments with data derived from a substantial number of Mozart piano sonata recordings by a skilled concert pianist demonstrate that the approach is viable. We show that the instance-based learner operating on structured, relational data outperforms a propositional k-NN algorithm. In qualitative terms, some of the piano performances produced by DISTALL after learning from the human artist are of substantial musical quality; one even won a prize in an international ‘computer music performance’ contest. The experiments thus provide evidence of the capabilities of ILP in a highly complex domain such as music. Editors: Tamás Horváth and Akihiro Yamamoto  相似文献   

4.
Recently, computer programs developed within the field of Inductive Logic Programming (ILP) have received some attention for their ability to construct restricted first-order logic solutions using problem-specific background knowledge. Prominent applications of such programs have been concerned with determining structure-activity relationships in the areas of molecular biology and chemistry. Typically the task here is to predict the activity of a compound (for example, toxicity), from its chemical structure. A summary of the research in the area is: (a) ILP programs have largely been restricted to qualitative predictions of activity (high, low etc.); (b) When appropriate attributes are available, ILP programs have equivalent predictivity to standard quantitative analysis techniques like linear regression. However ILP programs usually perform better when such attributes are unavailable; and (c) By using structural information as background knowledge, an ILP program can provide comprehensible explanations for biological activity. This paper examines the use of ILP programs as a method of discovering new attributes. These attributes could then be used by methods like linear regression, thus allowing for quantitative predictions while retaining the ability to use structural information as background knowledge. Using structure-activity tasks as a test-bed, the utility of ILP programs in constructing new features was evaluated by examining the prediction of biological activity using linear regression, with and without the aid of ILP learnt logical attributes. In three out of the five data sets examined the addition of ILP attributes produced statistically better results. In addition six important structural features that have escaped the attention of the expert chemists were discovered. The method used here to construct new attributes is not specific to the problem of predicting biological activity, and the results obtained suggest a wider role for ILP programs in aiding the process of scientific discovery.  相似文献   

5.
Nearly two decades of research in the area of Inductive Logic Programming (ILP) have seen steady progress in clarifying its theoretical foundations and regular demonstrations of its applicability to complex problems in very diverse domains. These results are necessary, but not sufficient, for ILP to be adopted as a tool for data analysis in an era of very large machine-generated scientific and industrial datasets, accompanied by programs that provide ready access to complex relational information in machine-readable forms (ontologies, parsers, and so on). Besides the usual issues about the ease of use, ILP is now confronted with questions of implementation. We are concerned here with two of these, namely: can an ILP system construct models efficiently when (a) Dataset sizes are too large to fit in the memory of a single machine; and (b) Search space sizes becomes prohibitively large to explore using a single machine. In this paper, we examine the applicability to ILP of a popular distributed computing approach that provides a uniform way for performing data and task parallel computations in ILP. The MapReduce programming model allows, in principle, very large numbers of processors to be used without any special understanding of the underlying hardware or software involved. Specifically, we show how the MapReduce approach can be used to perform the coverage-test that is at the heart of many ILP systems, and to perform multiple searches required by a greedy set-covering algorithm used by some popular ILP systems. Our principal findings with synthetic and real-world datasets for both data and task parallelism are these: (a) Ignoring overheads, the time to perform the computations concurrently increases with the size of the dataset for data parallelism and with the size of the search space for task parallelism. For data parallelism this increase is roughly in proportion to increases in dataset size; (b) If a MapReduce implementation is used as part of an ILP system, then benefits for data parallelism can only be expected above some minimal dataset size, and for task parallelism can only be expected above some minimal search-space size; and (c) The MapReduce approach appears better suited to exploit data-parallelism in ILP.  相似文献   

6.
To date, Inductive Logic Programming (ILP) systems have largely assumed that all data needed for learning have been provided at the onset of model construction. Increasingly, for application areas like telecommunications, astronomy, text processing, financial markets and biology, machine-generated data are being generated continuously and on a vast scale. We see at least four kinds of problems that this presents for ILP: (1) it may not be possible to store all of the data, even in secondary memory; (2) even if it were possible to store the data, it may be impractical to construct an acceptable model using partitioning techniques that repeatedly perform expensive coverage or subsumption-tests on the data; (3) models constructed at some point may become less effective, or even invalid, as more data become available (exemplified by the “drift” problem when identifying concepts); and (4) the representation of the data instances may need to change as more data become available (a kind of “language drift” problem). In this paper, we investigate the adoption of a stream-based on-line learning approach to relational data. Specifically, we examine the representation of relational data in both an infinite-attribute setting, and in the usual fixed-attribute setting, and develop implementations that use ILP engines in combination with on-line model-constructors. The behaviour of each program is investigated using a set of controlled experiments, and performance in practical settings is demonstrated by constructing complete theories for some of the largest biochemical datasets examined by ILP systems to date, including one with a million examples; to the best of our knowledge, the first time this has been empirically demonstrated with ILP on a real-world data set.  相似文献   

7.
8.
Problems concerned with learning the relationships between molecular structure and activity have been important test-beds for Inductive Logic programming (ILP) systems. In this paper we examine these applications and empirically evaluate the extent to which a first-order representation was required. We compared ILP theories with those constructed using standard linear regression and a decision-tree learner on a series of progressively more difficult problems. When a propositional encoding is feasible for the feature-based algorithms, we show that such algorithms are capable of matching the predictive accuracies of an ILP theory. However, as the complexity of the compounds considered increased, propositional encodings becomes intractable. In such cases, our results show that ILP programs can still continue to construct accurate, understandable theories. Based on this evidence, we propose future work to realise fully the potential of ILP in structure-activity problem.  相似文献   

9.
Relational learning can be described as the task of learning first-order logic rules from examples. It has enabled a number of new machine learning applications, e.g. graph mining and link analysis. Inductive Logic Programming (ILP) performs relational learning either directly by manipulating first-order rules or through propositionalization, which translates the relational task into an attribute-value learning task by representing subsets of relations as features. In this paper, we introduce a fast method and system for relational learning based on a novel propositionalization called Bottom Clause Propositionalization (BCP). Bottom clauses are boundaries in the hypothesis search space used by ILP systems Progol and Aleph. Bottom clauses carry semantic meaning and can be mapped directly onto numerical vectors, simplifying the feature extraction process. We have integrated BCP with a well-known neural-symbolic system, C-IL2P, to perform learning from numerical vectors. C-IL2P uses background knowledge in the form of propositional logic programs to build a neural network. The integrated system, which we call CILP++, handles first-order logic knowledge and is available for download from Sourceforge. We have evaluated CILP++ on seven ILP datasets, comparing results with Aleph and a well-known propositionalization method, RSD. The results show that CILP++ can achieve accuracy comparable to Aleph, while being generally faster, BCP achieved statistically significant improvement in accuracy in comparison with RSD when running with a neural network, but BCP and RSD perform similarly when running with C4.5. We have also extended CILP++ to include a statistical feature selection method, mRMR, with preliminary results indicating that a reduction of more than 90 % of features can be achieved with a small loss of accuracy.  相似文献   

10.
A common step in pharmaceutical development is the formation of a quantitative structure-activity relationship *(QSAR) to model an exploratory series of compounds. A QSAR generalizes how the structure (shape) of a compound relates to its biological activity. A comparative study was carried out of six artificial intelligence and traditional algorithms for modeling QSAR's: GOLEM, CART, and MS from symbolic machine learning; back-propagation from neural networks; and linear regression and nearest-neighbor from traditional statistics. Two test case problems were studied: the inhibition of Escherichia coli dihydrofolate reductase (DHFR) by pyrimidines, and the inhibition of ratlmouse tumor DHFR by triazines. It was found that there was no significant statistical difference between the methods in terms of their ability to rank unseen compounds by activity. However, symbolic machine learning methods, in particular relational ones, were found to generate rules that provided insight into the stereochemistry of compound receptor interactions.  相似文献   

11.
A particularly successful role for Inductive Logic Programming (ILP) is as a tool for discovering useful relational features for subsequent use in a predictive model. Conceptually, the case for using ILP to construct relational features rests on treating these features as functions, the automated discovery of which necessarily requires some form of first-order learning. Practically, there are now several reports in the literature that suggest that augmenting any existing feature with ILP-discovered relational features can substantially improve the predictive power of a model. While the approach is straightforward enough, much still needs to be done to scale it up to explore more fully the space of possible features that can be constructed by an ILP system. This is in principle, infinite and in practice, extremely large. Applications have been confined to heuristic or random selections from this space. In this paper, we address this computational difficulty by allowing features and models to be constructed in a distributed manner. That is, there is a network of computational units, each of which employs an ILP engine to construct some small number of features and then builds a (local) model. We then employ an asynchronous consensus-based algorithm, in which neighboring nodes share information and update local models. This gossip-based information exchange results in the formation of non-stationary Markov chains. For a category of models (those with convex loss functions), it can be shown (using the Supermartingale Convergence Theorem) that the algorithm will result in all nodes converging to a consensus model. In practice, it may be slow to achieve this convergence. Nevertheless, our results on synthetic and real datasets suggest that in relatively short time the “best” node in the network reaches a model whose predictive accuracy is comparable to that obtained using more computational effort in a non-distributed setting (the best node is identified as the one whose weights converge first).  相似文献   

12.
Inductive logic programming (ILP) is concerned with the induction of logic programs from examples and background knowledge. In ILP, the shift of attention from program synthesis to knowledge discovery resulted in advanced techniques that are practically applicable for discovering knowledge in relational databases. This paper gives a brief introduction to ILP, presents selected ILP techniques for relational knowledge discovery and reviews selected ILP applications. Nada Lavrač, Ph.D.: She is a senior research associate at the Department of Intelligent Systems, J. Stefan Institute, Ljubljana, Slovenia (since 1978) and a visiting professor at the Klagenfurt University, Austria (since 1987). Her main research interest is in machine learning, in particular inductive logic programming and intelligent data analysis in medicine. She received a BSc in Technical Mathematics and MSc in Computer Science from Ljubljana University, and a PhD in Technical Sciences from Maribor University, Slovenia. She is coauthor of KARDIO: A Study in Deep and Qualitative Knowledge for Expert Systems, The MIT Press 1989, and Inductive Logic Programming: Techniques and Applications, Ellis Horwood 1994, and coeditor of Intelligent Data Analysis in Medicine and Pharmacology, Kluwer 1997. She was the coordinator of the European Scientific Network in Inductive Logic Programming ILPNET (1993–1996) and program cochair of the 8th European Machine Learning Conference ECML’95, and 7th International Workshop on Inductive Logic Programming ILP’97. Sašo Džeroski, Ph.D.: He is a research associate at the Department of Intelligent Systems, J. Stefan Institute, Ljubljana, Slovenia (since 1989). He has held visiting researcher positions at the Turing Institute, Glasgow (UK), Katholieke Universiteit Leuven (Belgium), German National Research Center for Computer Science (GMD), Sankt Augustin (Germany) and the Foundation for Research and Technology-Hellas (FORTH), Heraklion (Greece). His research interest is in machine learning and knowledge discovery in databases, in particular inductive logic programming and its applications and knowledge discovery in environmental databases. He is co-author of Inductive Logic Programming: Techniques and Applications, Ellis Horwood 1994. He is the scientific coordinator of ILPnet2, The Network of Excellence in Inductive Logic Programming. He was program co-chair of the 7th International Workshop on Inductive Logic Programming ILP’97 and will be program co-chair of the 16th International Conference on Machine Learning ICML’99. Masayuki Numao, Ph.D.: He is an associate professor at the Department of Computer Science, Tokyo Institute of Technology. He received a bachelor of engineering in electrical and electronics engineering in 1982 and his Ph.D. in computer science in 1987 from Tokyo Institute of Technology. He was a visiting scholar at CSLI, Stanford University from 1989 to 1990. His research interests include Artificial Intelligence, Global Intelligence and Machine Learning. Numao is a member of Information Processing Society of Japan, Japanese Society for Artificial Intelligence, Japanese Cognitive Science Society, Japan Society for Software Science and Technology and AAAI.  相似文献   

13.
Multivariate classification models play an increasingly important role in human factors research. In the past, these models have been based primarily on discriminant analysis and logistic regression. Models developed from machine learning research offer the human factors professional a viable alternative to these traditional statistical classification methods. To illustrate this point, two machine learning approaches--genetic programming and decision tree induction--were used to construct classification models designed to predict whether or not a student truck driver would pass his or her commercial driver license (CDL) examination. The models were developed and validated using the curriculum scores and CDL exam performances of 37 student truck drivers who had completed a 320-hr driver training course. Results indicated that the machine learning classification models were superior to discriminant analysis and logistic regression in terms of predictive accuracy. Actual or potential applications of this research include the creation of models that more accurately predict human performance outcomes.  相似文献   

14.
15.
We revisit an application developed originally using abductive Inductive Logic Programming (ILP) for modeling inhibition in metabolic networks. The example data was derived from studies of the effects of toxins on rats using Nuclear Magnetic Resonance (NMR) time-trace analysis of their biofluids together with background knowledge representing a subset of the Kyoto Encyclopedia of Genes and Genomes (KEGG). We now apply two Probabilistic ILP (PILP) approaches—abductive Stochastic Logic Programs (SLPs) and PRogramming In Statistical modeling (PRISM) to the application. Both approaches support abductive learning and probability predictions. Abductive SLPs are a PILP framework that provides possible worlds semantics to SLPs through abduction. Instead of learning logic models from non-probabilistic examples as done in ILP, the PILP approach applied in this paper is based on a general technique for introducing probability labels within a standard scientific experimental setting involving control and treated data. Our results demonstrate that the PILP approach provides a way of learning probabilistic logic models from probabilistic examples, and the PILP models learned from probabilistic examples lead to a significant decrease in error accompanied by improved insight from the learned results compared with the PILP models learned from non-probabilistic examples.  相似文献   

16.
Many domains in the field of Inductive Logic Programming (ILP) involve highly unbalanced data. A common way to measure performance in these domains is to use precision and recall instead of simply using accuracy. The goal of our research is to find new approaches within ILP particularly suited for large, highly-skewed domains. We propose Gleaner, a randomized search method that collects good clauses from a broad spectrum of points along the recall dimension in recall-precision curves and employs an “at least L of these K clauses” thresholding method to combine sets of selected clauses. Our research focuses on Multi-Slot Information Extraction (IE), a task that typically involves many more negative examples than positive examples. We formulate this problem into a relational domain, using two large testbeds involving the extraction of important relations from the abstracts of biomedical journal articles. We compare Gleaner to ensembles of standard theories learned by Aleph, finding that Gleaner produces comparable testset results in a fraction of the training time. Editor: Rui Camacho  相似文献   

17.
Randomised restarted search in ILP   总被引:1,自引:0,他引:1  
Recent statistical performance studies of search algorithms in difficult combinatorial problems have demonstrated the benefits of randomising and restarting the search procedure. Specifically, it has been found that if the search cost distribution of the non-restarted randomised search exhibits a slower-than-exponential decay (that is, a “heavy tail”), restarts can reduce the search cost expectation. We report on an empirical study of randomised restarted search in ILP. Our experiments conducted on a high-performance distributed computing platform provide an extensive statistical performance sample of five search algorithms operating on two principally different classes of ILP problems, one represented by an artificially generated graph problem and the other by three traditional classification benchmarks (mutagenicity, carcinogenicity, finite element mesh design). The sample allows us to (1) estimate the conditional expected value of the search cost (measured by the total number of clauses explored) given the minimum clause score required and a “cutoff” value (the number of clauses examined before the search is restarted), (2) estimate the conditional expected clause score given the cutoff value and the invested search cost, and (3) compare the performance of randomised restarted search strategies to a deterministic non-restarted search. Our findings indicate striking similarities across the five search algorithms and the four domains, in terms of the basic trends of both the statistics (1) and (2). Also, we observe that the cutoff value is critical for the performance of the search algorithm, and using its optimal value in a randomised restarted search may decrease the mean search cost (by several orders of magnitude) or increase the mean achieved score significantly with respect to that obtained with a deterministic non-restarted search. Editors: Rui Camacho  相似文献   

18.
19.
Statistical relational learning (SRL) is a subarea in machine learning which addresses the problem of performing statistical inference on data that is correlated and not independently and identically distributed (i.i.d.)—as is generally assumed. For the traditional i.i.d. setting, distribution-free bounds exist, such as the Hoeffding bound, which are used to provide confidence bounds on the generalization error of a classification algorithm given its hold-out error on a sample size of N. Bounds of this form are currently not present for the type of interactions that are considered in the data by relational classification algorithms. In this paper, we extend the Hoeffding bounds to the relational setting. In particular, we derive distribution-free bounds for certain classes of data generation models that do not produce i.i.d. data and are based on the type of interactions that are considered by relational classification algorithms that have been developed in SRL. We conduct empirical studies on synthetic and real data which show that these data generation models are indeed realistic and the derived bounds are tight enough for practical use.  相似文献   

20.
In this paper, we explore the issue of static routing and spectrum/IT resource assignment (RSIA) of elastic all-optical switched intra-datacenter networks (intra-DCNs) by proposing anycast- and manycast-based integer linear programming (ILP) models. The objective is to jointly optimize the DCN resources, i.e., network transmission bandwidth and IT resources, under different situations. First, for given service-request matrices with unknown network transmission bandwidth and IT resources, we propose anycast and manycast ILP models to minimize the maximum numbers of required network and IT resources to accommodate all the service requests. For anycast RSIA issue, we proposed two different ILP models that are based on node-arc and link-path methods, respectively. Node-arc based manycast ILP model is also proposed for the first time to our knowledge. Second, for given network transmission bandwidth and IT resources and known service-request matrices, we propose node-arc based anycast ILP models to maximize the total number of successfully served service requests. To evaluate the efficiency of anycast and manycast models, all proposed ILP models are evaluated and compared with unicast ILP models. Simulation results show that anycast and manycast ILP models perform much better in efficiently using DCN resources and successfully accommodating more service requests when compared to unicast ILP models under the same network conditions.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号