Current approaches to data mining usually address specific user requests, while no general design criteria for the extraction of association rules are available to the end-user. In this paper, we propose a classification of association rule types, which provides a general framework for the design of association rule mining applications. Based on the identified association rule types, we introduce predefined templates as a means to capture the user specification of mining applications. Furthermore, we propose a general language to design templates for the extraction of arbitrary association rule types.
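As a rough illustration of the kind of extraction such templates constrain, the sketch below mines association rules from toy transactions under a fixed-consequent "template" (the data, thresholds, and template shape are invented and do not reproduce the paper's template language):

```python
from itertools import combinations

# Toy transaction database (hypothetical data, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def mine_rules(min_sup=0.5, min_conf=0.6, consequent="milk"):
    """Extract rules X -> consequent matching a simple 'template':
    minimum support, minimum confidence, and a fixed consequent item."""
    items = set().union(*transactions) - {consequent}
    rules = []
    for r in range(1, 3):
        for body in combinations(sorted(items), r):
            body = set(body)
            sup = support(body | {consequent})
            if sup >= min_sup and sup / support(body) >= min_conf:
                rules.append((frozenset(body), consequent, sup))
    return rules
```

A template language such as the one the paper proposes would let the end-user state constraints of this kind declaratively instead of hard-coding them.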
Internet data collected via passive measurement are analyzed to obtain localization information on nodes by clustering (i.e., grouping together) nodes that exhibit similar network path properties. Since traditional clustering algorithms fail to correctly identify clusters of homogeneous nodes, we propose NetCluster, a novel framework suited to the analysis of Internet measurement datasets. We show that the proposed framework correctly analyzes synthetically generated traces. Finally, we apply it to real traces collected at the access link of the Politecnico di Torino campus LAN and discuss the network characteristics as seen from the vantage point.
The lack of tools for rule generation, analysis, and run-time monitoring appears to be one of the main obstacles to the widespread adoption of active database applications. This paper describes a complete tool environment for assisting the design of active rule applications; the tools were developed at Politecnico di Milano in the context of the IDEA Project, a four-year Esprit project sponsored by the European Commission which was launched in June 1992. We describe tools for active rule generation, analysis, debugging, and browsing; rules are defined in Chimera, a conceptual design model and language for the specification of active rule applications. We also introduce a tool for mapping from Chimera into Oracle, a relational product supporting triggers. Most of the tools described in this paper are fully implemented and currently in operation (beta-testing) within the companies participating in the IDEA Project, with the exception of two of them (called Argonaut-V and Pandora), which will be completed by the end of 1996. Research presented in this paper is supported by Esprit project P6333 IDEA, and by ENEL contract VDS 1/94: Integrity Constraint Management.
Identifying the most relevant scientific publications on a given topic is a well-known research problem. The Author-Topic Model (ATM) is a generative model that represents the relationships between research topics and publication authors. It allows us to identify the most influential authors on a particular topic. However, since most research works are co-authored by many researchers, the information provided by ATM can be complemented by the study of the most fruitful collaborations among multiple authors. This paper addresses the discovery of research collaborations among multiple authors on single or multiple topics. Specifically, it exploits an exploratory data mining technique, i.e., weighted association rule mining, to analyze publication data and to discover correlations between ATM topics and combinations of authors. The mined rules characterize groups of researchers with fairly high scientific productivity by indicating (1) the research topics covered by their most cited publications and the relevance of their scientific production separately for each topic, (2) the nature of the collaboration (topic-specific or cross-topic), (3) the names of the external authors who have (occasionally) collaborated with the group either on a specific topic or on multiple topics, and (4) the underlying correlations between the addressed topics. The applicability of the proposed approach was validated on real data acquired from the Online Mendelian Inheritance in Man catalog of genetic disorders and from the PubMed digital library. The results confirm the effectiveness of the proposed strategy.
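As a rough illustration of the weighted association rule mining step, the sketch below scales plain support by author weights, one common weighted-ARM formulation (the author names, weights, and paper author sets are all invented, and this is not claimed to be the paper's exact measure):

```python
# Hypothetical per-author relevance weights (e.g., citation-based scores
# derived from ATM output); all names and values are invented.
author_weight = {"A": 0.9, "B": 0.6, "C": 0.2}

# Each "transaction" is the author set of one publication on a topic.
papers = [
    {"A", "B"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"},
]

def weighted_support(authors):
    """Plain support of an author combination, scaled by the mean weight
    of the authors involved, so that rules about highly relevant authors
    rank above equally frequent but less relevant ones."""
    plain = sum(authors <= p for p in papers) / len(papers)
    mean_w = sum(author_weight[a] for a in authors) / len(authors)
    return plain * mean_w
```

Here the pair {A, B} co-authors 3 of 5 papers (support 0.6) and has mean weight 0.75, giving a weighted support of 0.45.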
Nowadays, wireless sensor networks are being used in a fast-growing number of application fields (e.g., habitat monitoring, highway traffic monitoring, remote surveillance). Monitoring (i.e., querying) the sensor network entails the frequent acquisition of measurements from all sensors. Since sensor data acquisition and communication are the main sources of power consumption and sensors are battery-powered, an important issue in this context is energy saving during data collection. Hence, the challenge is to extend sensor lifetime by reducing communication cost and computation energy. This paper thoroughly describes the complete design, implementation, and validation of the SeReNe framework. Given historical sensor readings, SeReNe discovers energy-saving models to efficiently acquire sensor network data. It exploits different clustering algorithms to discover the spatial and temporal correlations that allow the identification of sets of correlated sensors and sensor data streams. Given clusters of correlated sensors, a subset of representative sensors is selected. Rather than directly querying all network nodes, only the representative sensors are queried, thereby reducing communication, computation, and power costs. Experiments performed both on a real sensor network deployed at the Politecnico di Torino labs and on a publicly available dataset from the Intel Berkeley Research lab demonstrate the adaptability and effectiveness of the SeReNe framework in providing energy-saving sensor network models.
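The cluster-then-query idea can be sketched minimally as follows (the sensor readings, the correlation threshold, and the greedy grouping are invented stand-ins for SeReNe's actual clustering algorithms):

```python
# Toy per-sensor reading streams (invented values); in a setting like
# SeReNe these would be historical measurements from the deployed network.
readings = {
    "s1": [20.0, 21.0, 22.0, 23.0],
    "s2": [20.1, 21.1, 22.0, 23.2],  # tracks s1 closely
    "s3": [5.0, 4.0, 3.0, 2.0],      # opposite trend, separate cluster
}

def pearson(x, y):
    """Pearson correlation between two equal-length reading streams."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def representatives(threshold=0.95):
    """Greedy single-pass grouping: a sensor joins the first cluster whose
    representative (first member) it correlates with above `threshold`.
    Only one representative per cluster is then queried, instead of
    every node in the network."""
    clusters = []
    for s, vals in readings.items():
        for c in clusters:
            if pearson(readings[c[0]], vals) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return [c[0] for c in clusters]
```

With these toy streams, s2 is absorbed into s1's cluster, so only two of the three sensors need to be queried.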
This paper presents the IMine index, a general and compact structure which provides tight integration of itemset extraction in a relational DBMS. Since no constraint is enforced during the index creation phase, IMine provides a complete representation of the original database. To reduce the I/O cost, data accessed together during the same extraction phase are clustered on the same disk block. The IMine index structure can be efficiently exploited by different itemset extraction algorithms. In particular, IMine data access methods currently support the FP-growth and LCM v.2 algorithms, but they can straightforwardly support the enforcement of various constraint categories. The IMine index has been integrated into the PostgreSQL DBMS and exploits its physical-level access methods. Experiments, run for both sparse and dense data distributions, show the efficiency of the proposed index and its linear scalability even for large datasets. Itemset mining supported by the IMine index shows performance always comparable with, and sometimes better than, state-of-the-art algorithms accessing data from flat files.
Sentence-based multi-document summarization is the task of generating a succinct summary of a document collection, which consists of the most salient document sentences. In recent years, the increasing availability of semantics-based models (e.g., ontologies and taxonomies) has prompted researchers to investigate their usefulness for improving summarizer performance. However, semantics-based document analysis is often applied as a preprocessing step, rather than integrating the discovered knowledge into the summarization process. This paper proposes a novel summarizer, namely Yago-based Summarizer, that relies on an ontology-based evaluation and selection of the document sentences. To capture the actual meaning and context of the document sentences and generate sound document summaries, an established entity recognition and disambiguation step based on the Yago ontology is integrated into the summarization process. The experimental results, which were achieved on the DUC'04 benchmark collections, demonstrate the effectiveness of the proposed approach compared to a large number of competitors as well as the qualitative soundness of the generated summaries.
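The extractive selection step common to sentence-based summarizers can be sketched as a greedy coverage heuristic (the toy sentences are invented, and plain words stand in for the Yago-based entity recognition and disambiguation the paper actually integrates):

```python
# Toy "document collection": each string is one candidate sentence.
docs = [
    "solar power adoption grows quickly",
    "wind and solar power cut emissions",
    "emissions fall as wind power grows",
]

def summarize(k=2):
    """Greedily pick k sentences, each maximizing the number of not-yet-
    covered terms; this rewards salience while penalizing redundancy.
    A semantics-aware summarizer would count disambiguated entities
    instead of raw words."""
    covered, summary = set(), []
    remaining = list(docs)
    for _ in range(k):
        best = max(remaining, key=lambda d: len(set(d.split()) - covered))
        summary.append(best)
        covered |= set(best.split())
        remaining.remove(best)
    return summary
```

The ontology-based step described in the abstract would replace `d.split()` with entities resolved against Yago, so that "solar power" and "photovoltaics", for example, count as the same concept.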
The analysis of medical data is a challenging task for health care systems since a huge amount of interesting knowledge can be automatically mined to effectively support both physicians and health care organizations. This paper proposes a data analysis framework based on a multiple-level clustering technique to identify the examination pathways commonly followed by patients with a given disease. This knowledge can support health care organizations in evaluating the medical treatments usually adopted, and thus the incurred costs. The proposed multiple-level strategy allows clustering patient examination datasets with a variable distribution. To measure the relevance of specific examinations for a given disease complication, patient examination data has been represented in the Vector Space Model using the TF-IDF method. As a case study, the proposed approach has been applied to the diabetic care scenario. The experimental validation, performed on a real collection of diabetic patients, demonstrates the effectiveness of the approach in identifying groups of patients with a similar examination history and increasing severity in diabetes complications. 相似文献
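The TF-IDF representation described above can be sketched as follows (patient identifiers and examination names are invented, and the standard TF-IDF weighting shown here may differ in detail from the paper's variant):

```python
from math import log

# Hypothetical examination histories: one "document" per patient listing
# the examination types performed (names invented for illustration).
patients = {
    "p1": ["glycemia", "glycemia", "eye-exam"],
    "p2": ["glycemia", "creatinine"],
    "p3": ["creatinine", "eye-exam", "eye-exam"],
}

def tfidf(patient, exam):
    """TF-IDF weight of an examination in a patient's history: frequent
    within the patient, rare across the patient population."""
    history = patients[patient]
    tf = history.count(exam) / len(history)          # term frequency
    df = sum(exam in h for h in patients.values())   # document frequency
    return tf * log(len(patients) / df)              # tf * idf
```

Examinations prescribed to nearly every patient thus get a low weight, while examinations specific to a disease complication stand out in the resulting vectors, which the multiple-level clustering can then group.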