Similar Documents
 20 similar documents found (search time: 109 ms)
1.
Businesses are naturally interested in detecting anomalies in their internal processes, because these can be indicators of fraud and inefficiencies. Within the domain of business intelligence, however, classical anomaly detection has received relatively little research attention. In this paper, we propose a method, using autoencoders, for detecting and analyzing anomalies occurring in the execution of a business process. Our method does not rely on any prior knowledge about the process and can be trained on a noisy dataset already containing the anomalies. We demonstrate its effectiveness by evaluating it on 700 different datasets and testing its performance against three state-of-the-art anomaly detection methods. This paper is an extension of our previous work from 2016 (Nolle et al. in Unsupervised anomaly detection in noisy business process event logs using denoising autoencoders. In: International conference on discovery science, Springer, pp 442–456, 2016). Compared to the original publication, we have further refined the approach in terms of performance and conducted an elaborate evaluation on more sophisticated datasets, including real-life event logs from the Business Process Intelligence Challenges of 2012 and 2017. In our experiments, our approach reached an \(F_1\) score of 0.87, whereas the best unaltered state-of-the-art approach reached an \(F_1\) score of 0.72. Furthermore, our approach can be used to analyze the detected anomalies in terms of which event within one execution of the process causes the anomaly.
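The core idea behind autoencoder-based anomaly detection is that traces the model reconstructs poorly are likely anomalous. A minimal sketch of the final thresholding step, assuming reconstruction errors have already been produced by a trained autoencoder (the mean-plus-two-sigma rule and the sample values are illustrative assumptions, not the paper's exact procedure):

```python
import math

def anomaly_flags(reconstruction_errors, num_sigmas=2.0):
    """Flag traces whose reconstruction error exceeds mean + num_sigmas * std."""
    n = len(reconstruction_errors)
    mean = sum(reconstruction_errors) / n
    var = sum((e - mean) ** 2 for e in reconstruction_errors) / n
    threshold = mean + num_sigmas * math.sqrt(var)
    return [e > threshold for e in reconstruction_errors]

# Nine "normal" traces and one with a conspicuously high error:
errors = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.08, 0.11, 0.10, 0.95]
flags = anomaly_flags(errors)
```

Only the last trace is flagged; the threshold adapts to the noise level of the training data, which is what lets the method be trained on a dataset that already contains anomalies.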

2.
In the era of big data, with massive sets of digital information of unprecedented volumes being collected and/or produced in several application domains, it becomes more and more difficult to manage and query large data repositories. In the framework of the PetaSky project (http://com.isima.fr/Petasky), we focus on the problem of managing scientific data in the field of cosmology. The data we consider are those of the LSST project (http://www.lsst.org/). The overall size of the database that will be produced is expected to exceed 60 PB (LSST data challenge handbook, 2012). In order to evaluate the performance of existing SQL-on-MapReduce data management systems, we conducted extensive experiments using data and queries from the area of cosmology. The goal of this work is to report on the ability of such systems to support large-scale declarative queries. We mainly investigated the impact of data partitioning, indexing and compression on query execution performance.

3.
Nowadays, business processes are increasingly supported by IT services that produce massive amounts of event data during the execution of a process. These event data can be used to analyze the process with process mining techniques: to discover the real process, measure conformance to a given process model, or enhance existing models with performance information. Mapping the produced events to activities of a given process model is essential for conformance checking, annotation, and understanding of process mining results. To accomplish this mapping with low manual effort, we developed a semi-automatic approach that maps events to activities using insights from behavioral analysis and label analysis. The approach extracts Declare constraints from both the log and the model to build matching constraints that efficiently reduce the number of possible mappings. These mappings are further reduced using techniques from natural language processing, which allow for a matching based on labels and external knowledge sources. The evaluation with synthetic and real-life data demonstrates the effectiveness of the approach and its robustness toward non-conforming execution logs.

4.
In this paper, we present Para Miner, a generic and parallel algorithm for closed pattern mining. Para Miner is built on the principles of pattern enumeration in strongly accessible set systems. Its efficiency is due to a novel dataset reduction technique (which we call EL-reduction), combined with a novel technique for performing dataset reduction in a parallel execution on a multi-core architecture. We illustrate Para Miner's genericity by using the algorithm to solve three different pattern mining problems: frequent itemset mining, frequent connected relational graph mining, and gradual itemset mining. In this paper, we prove the soundness and completeness of Para Miner. Furthermore, our experiments show that despite being a generic algorithm, Para Miner can compete with specialized state-of-the-art algorithms designed for the pattern mining problems mentioned above. Moreover, for the particular problem of gradual itemset mining, Para Miner outperforms the state-of-the-art algorithm by two orders of magnitude.
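As a point of reference for what closed pattern mining computes: a closed frequent itemset is a frequent itemset with no proper superset of equal support. The brute-force sketch below illustrates only the problem definition on a toy transaction database; it has none of Para Miner's enumeration strategy, EL-reduction, or parallelism:

```python
from itertools import combinations

def closed_frequent_itemsets(transactions, min_support):
    """Enumerate all closed frequent itemsets by exhaustive search."""
    items = sorted({i for t in transactions for i in t})
    # Support of every frequent candidate itemset.
    support = {}
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            s = sum(1 for t in transactions if set(cand) <= t)
            if s >= min_support:
                support[cand] = s
    # Keep only itemsets with no proper superset of equal support.
    return {iset: s for iset, s in support.items()
            if not any(set(iset) < set(other) and s == support[other]
                       for other in support)}

result = closed_frequent_itemsets([{'a', 'b'}, {'a', 'b'}, {'a', 'c'}], 2)
```

Here {b} is frequent but not closed, because its superset {a, b} has the same support; real miners like Para Miner avoid ever materializing such redundant patterns.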

5.
The discovery of informative itemsets is a fundamental building block in data analytics and information retrieval. While the problem has been widely studied, only a few solutions scale. This is particularly the case when (1) the data set is massive, calling for large-scale distribution, and/or (2) the length k of the informative itemset to be discovered is high. In this paper, we address the problem of parallel mining of maximally informative k-itemsets (miki) based on joint entropy. We propose PHIKS (Parallel Highly Informative \(\underline{K}\)-ItemSet), a highly scalable, parallel miki mining algorithm. PHIKS renders the mining process of large-scale databases (up to terabytes of data) succinct and effective. Its mining process consists of only two efficient parallel jobs. With PHIKS, we provide a set of significant optimizations for calculating the joint entropies of miki of different sizes, which drastically reduces execution time, communication cost, and energy consumption on a distributed computing platform. PHIKS has been extensively evaluated using massive real-world data sets. Our experimental results confirm the effectiveness of our proposal through the significant scale-up obtained with long itemsets and over very large databases.
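The objective being maximized here, the joint entropy of a k-itemset, can be computed directly from the binary projection of the database. The sketch below pairs that computation with a simple greedy forward selection; the greedy heuristic is an illustrative assumption, not the PHIKS algorithm, which distributes this work across two parallel jobs:

```python
from collections import Counter
from math import log2

def joint_entropy(transactions, itemset):
    """Joint entropy (in bits) of the binary presence/absence
    projection of `itemset` over the transaction database."""
    n = len(transactions)
    patterns = Counter(tuple(i in t for i in itemset) for t in transactions)
    return -sum((c / n) * log2(c / n) for c in patterns.values())

def greedy_miki(transactions, k):
    """Greedily grow an (approximately) maximally informative k-itemset."""
    items = sorted({i for t in transactions for i in t})
    chosen = []
    for _ in range(k):
        best = max((i for i in items if i not in chosen),
                   key=lambda i: joint_entropy(transactions, chosen + [i]))
        chosen.append(best)
    return chosen
```

In a database where item 'a' appears in half the transactions and 'b' in all of them, 'a' carries one bit of information and 'b' zero, so the greedy pick is 'a': an item that is always present cannot discriminate between transactions.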

6.
As the volume of data generated each day continues to increase, more and more interest is directed at Big Data algorithms and the insight they provide. Since these analyses require a substantial amount of resources, including physical machines, power, and time, reliable execution of the algorithms becomes critical. This paper analyzes the error resilience of a select group of popular Big Data algorithms and shows how they can effectively be made more fault-tolerant. Using KULFI (http://github.com/quadpixels/kulfi, 2013) and the LLVM compiler (Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO 2004), San Jose, CA, USA, 2004) for compilation allows injection of artificial soft faults throughout these algorithms, giving a thorough analysis of how faults in different locations can affect the outcome of the program. This information is then used to help guide the incorporation of fault tolerance mechanisms into the program, making it as impervious as possible to soft faults.

7.
In this paper we extend the study of algorithms for monitoring distributed data streams from whole data streams to a time-based sliding window. The concern is how to minimize the communication between individual streams and the root, while allowing the root, at any time, to report the global statistics of all streams within a given error bound. This paper presents communication-efficient algorithms for three classical statistics, namely, basic counting, frequent items and quantiles. The worst-case communication cost over a window is $O(\frac{k}{\varepsilon} \log\frac{\varepsilon N}{k})$ bits for basic counting, $O(\frac{k}{\varepsilon} \log\frac{N}{k})$ words for frequent items and $O(\frac{k}{\varepsilon^{2}} \log\frac{N}{k})$ words for quantiles, where k is the number of distributed data streams, N is the total number of items in the streams that arrive or expire in the window, and ε<1 is the given error bound. The performance of our algorithms matches or nearly matches the corresponding lower bounds. We also show how to generalize these results to streams with out-of-order data.
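To give a concrete flavor of the per-stream summaries that such protocols maintain locally before communicating with the root, here is the classic Misra-Gries frequent-items summary, which tracks every item of frequency above N/k using at most k-1 counters. This standard building block is included as an illustration; it is not the paper's windowed, communication-optimal algorithm:

```python
def misra_gries(stream, k):
    """Misra-Gries summary: after processing N items, every item with
    true frequency > N/k is guaranteed to hold a counter."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

summary = misra_gries(['a', 'a', 'b', 'a', 'c', 'a', 'b'], 2)
```

With k=2 the summary holds a single counter, yet 'a' (frequency 4 out of 7, above N/k = 3.5) is guaranteed to survive; the stored count is a lower bound on the true count.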

8.
We give matching upper and lower bounds of \(\varTheta(\min(\frac{\log m}{\log \log m},\, n))\) for the individual step complexity of a wait-free m-valued adopt-commit object implemented using multi-writer registers for n anonymous processes. While the upper bound is deterministic, the lower bound holds for randomized adopt-commit objects as well. Our results are based on showing that adopt-commit objects are equivalent, up to small additive constants, to a simpler class of objects that we call conflict detectors. Our anonymous lower bound also applies to the individual step complexity of m-valued wait-free anonymous consensus, even for randomized algorithms with global coins against an oblivious adversary. The upper bound can be used to slightly improve the cost of randomized consensus against an oblivious adversary. For deterministic non-anonymous implementations of adopt-commit objects, we show a lower bound of \(\varOmega(\min(\frac{\log m}{\log \log m}, \frac{\sqrt{\log n}}{\log \log n}))\) and an upper bound of \(O(\min(\frac{\log m}{\log \log m},\, \log n))\) on the worst-case individual step complexity. For randomized non-anonymous implementations, we show that any execution contains at least one process whose steps exceed the deterministic lower bound.

9.
Business process redesign is one of the most powerful ways to boost business performance and to improve customer satisfaction (Limam Mansar & Reijers, 2005). A possible approach to business process redesign is using redesign best practices. A previous study identified a set of 29 different redesign best practices (Reijers, 2003). However, little is known about the exact impact of these redesign best practices on workflow performance. This study proposes an approach that can be used to quantify the impact of a business process redesign project on all dimensions of workflow performance. The approach consists of a large set of performance measures and a simulation toolkit. It supports the quantification of the impact of the implementation of redesign best practices, in order to determine which best practice or combination of best practices leads to the most favorable effect in a specific business process. The approach is developed based on a quantification project for the parallel best practice and is validated with two other quantification projects, namely for the knockout and triage best practices.

10.
Today, a large volume of hotel reviews is available on many websites, such as TripAdvisor (http://www.tripadvisor.com) and Orbitz (http://www.orbitz.com). A typical review contains an overall rating, several aspect ratings, and review text. A rating is an abstract of the review in terms of numerical points. The task of aspect-based opinion summarization is to extract aspect-specific opinions hidden in reviews that do not have aspect ratings, so that users can quickly digest them without actually reading through them. The task consists of aspect identification and aspect rating inference. Most existing studies cannot utilize aspect ratings, which are becoming increasingly abundant on review hosts. In this paper, we propose two topic models that explicitly model aspect ratings as observed variables to improve the performance of aspect rating inference on unrated reviews. The experimental results show that our approaches outperform existing methods on a data set crawled from the TripAdvisor website.

11.
Computer system security relies on different aspects of a computer system such as security policies, security mechanisms, threat analysis, and countermeasures. This paper provides an ontological approach to capturing and utilizing the fundamental attributes of those key components to determine the effects of vulnerabilities on a system's security. Our ontology for vulnerability management (OVM) has been populated with all vulnerabilities in the NVD (see http://nvd.nist.gov/scap.cfm), together with additional inference rules and knowledge discovery mechanisms, so that it may provide a promising pathway to making the Information Security Automation Program (NIST, 2007) more effective and reliable.

12.
How can we discover interesting patterns from time-evolving high-speed data streams? How can we analyze the data streams quickly and accurately, with little space overhead? How can we guarantee that the discovered patterns are self-consistent? High-speed data streams have been receiving increasing attention due to their wide applications such as sensors, network traffic, and social networks. The most fundamental task on the data stream is frequent pattern mining; especially, focusing on recentness is important in real applications. In this paper, we develop two algorithms for discovering recently frequent patterns in data streams. First, we propose TwMinSwap to find top-k recently frequent items in data streams, a deterministic version of our motivating algorithm TwSample, which provides theoretical guarantees based on item sampling. TwMinSwap improves on TwSample in terms of speed, accuracy, and memory usage. Both require only O(k) memory space and do not require any prior knowledge about the stream, such as its length or the number of distinct items in it. Second, we propose TwMinSwap-Is to find top-k recently frequent itemsets in data streams. We especially focus on keeping self-consistency of the discovered itemsets, which is the most important property for reliable results, while using O(k) memory space under the assumption of a constant itemset size. Through extensive experiments, we demonstrate that TwMinSwap outperforms all competitors in terms of accuracy and memory usage, with fast running time. We also show that TwMinSwap-Is is more accurate than the competitor and discovers recently frequent itemsets with reasonably large sizes (at most 5–7, depending on the dataset). Thanks to TwMinSwap and TwMinSwap-Is, we report interesting discoveries in real-world data streams, including differing trends between the winning and losing U.S. presidential candidates, and temporal human contact patterns.
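A simple way to see what "recently frequent with O(k) memory" means is an exponentially time-decayed counter set capped at k entries. The sketch below is a generic stand-in chosen for illustration; it is not TwMinSwap, whose specific replacement rule and accuracy guarantees come from the paper:

```python
def decayed_topk(stream, k, decay=0.9):
    """Track top-k 'recently frequent' items with exponentially
    time-decayed counts and at most k counters (O(k) memory)."""
    counts = {}
    for x in stream:
        # Age every counter, then credit the arriving item.
        counts = {i: c * decay for i, c in counts.items()}
        counts[x] = counts.get(x, 0.0) + 1.0
        if len(counts) > k:
            # Evict the smallest counter to stay within the memory budget.
            del counts[min(counts, key=counts.get)]
    return sorted(counts, key=counts.get, reverse=True)

# Five old 'a' arrivals followed by five recent 'b' arrivals:
top = decayed_topk(['a'] * 5 + ['b'] * 5, 2)
```

Although 'a' and 'b' occur equally often overall, the decay discounts the older 'a' burst, so 'b' ranks first: recentness, not raw frequency, drives the ordering.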

13.
In our studies of global software engineering (GSE) teams, we found that informal, non-work-related conversations are positively associated with trust. Seeking to use novel analytical techniques to more carefully investigate this phenomenon, we described these non-work-related conversations by adapting the economics literature concept of “cheap talk,” and studied it using Evolutionary Game Theory (EGT). More specifically, we modified the classic Stag-hunt game and analyzed the dynamics in a fixed population setting (an abstraction of a GSE team). Doing so, we were able to demonstrate how cheap talk over the Internet (e-cheap talk) was powerful enough to facilitate the emergence of trust and improve the probability of cooperation where the punishment for uncooperative behavior is comparable to the cost of the cheap talk. To validate the results of our theoretical approach, we conducted two empirical case studies that analyzed the logged IRC development discussions of Apache Lucene (http://lucene.apache.org/) and Chromium OS (http://www.chromium.org/chromium-os) using both quantitative and qualitative methods. The results provide general support to the theoretical propositions. We discuss our findings and the theoretical and practical implications to GSE collaborations and research.
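The population dynamics studied here can be illustrated with textbook replicator dynamics for a Stag-hunt game. The payoff matrix below is a hypothetical example (the paper's modified game adds cheap-talk costs and punishment, which this sketch omits); with these numbers the unstable interior equilibrium sits at a 2/3 Stag fraction, so populations starting above it converge to all-Stag (cooperation) and those below to all-Hare:

```python
def replicator_stag_hunt(x0, steps=200, dt=0.1,
                         payoff=((4, 0), (3, 2))):
    """Discrete-time replicator dynamics for a two-strategy Stag hunt.
    payoff[i][j] is the row player's payoff; strategy 0 = Stag, 1 = Hare.
    x is the fraction of Stag players in the population."""
    x = x0
    for _ in range(steps):
        f_stag = payoff[0][0] * x + payoff[0][1] * (1 - x)
        f_hare = payoff[1][0] * x + payoff[1][1] * (1 - x)
        f_avg = x * f_stag + (1 - x) * f_hare
        x += dt * x * (f_stag - f_avg)  # replicator update
        x = min(max(x, 0.0), 1.0)
    return x
```

Solving 4x = 3x + 2(1 - x) gives the threshold x* = 2/3: a start at 0.8 flows to full cooperation, a start at 0.5 to full defection. Cheap talk, in the paper's argument, effectively shifts populations across such a threshold.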

14.
15.
16.
The notions of a grammar form and its interpretations were introduced to describe families of structurally related grammars. Basically, a grammar form G is a (context-free) grammar serving as a master grammar, and the interpretation operator defines a family of grammars, each structurally related to G. In this paper, a new operator yielding a family of grammars is introduced as a variant of the interpretation operator. There are two major results. The first is that the two operators commute. The second is that for each grammar form G, the collection of all families of grammars obtained by applying the new operator to grammar forms G′ in the family of G is finite. Expressed otherwise, the second result is that for each grammar form G there is only a bounded number of grammar forms in the family of G, no two of which are strongly equivalent.

17.
This article is a slightly abridged, edited version of a final report detailing the background and implementation of a project that introduced electronic book (e-book) collections to Essex Public Libraries during 2004. The research considered e-book collections available for borrowing on a PDA (HP iPAQ) and collections downloadable onto the borrower's PDA or PC (OverDrive, ebrary). The project was sponsored by The Laser (London and South-Eastern Library Region) Foundation, a grant-making body founded in 2002 that has made a number of grants to public and academic libraries since 2003 (http://www.bl.uk/concord/laser-about.html). It was a partnership between the Department of Information Science at Loughborough University (http://www.lboro.ac.uk/dis/); Essex Public Libraries, which operate 73 public libraries and mobile library services in south-eastern England (http://194.129.26.30/vip8/ecc/ECCWebsite/display/channels/libraries_channel_134343_Enjoying/index.jsp); and Co-East, which manages the acquisition and provision of electronic resources for public libraries in eastern England, including the popular ‘Ask a Librarian’ service (http://www.co-east.net/). In addition to a discussion of the findings of the research, guidelines are offered to other public library authorities considering the adoption of e-book collections and mobile technology. Two articles based on this research have been published elsewhere, considering the evaluation of the iPAQ trials (Dearnley et al., ) and the provision and uptake of the OverDrive and ebrary collections (Dearnley et al., ).

18.
The stable revivals model provides a new semantic framework for the process algebra Csp. The model has recently been added to the realm of established Csp models. Within the Csp context, it enhances the analysis of systems with regard to properties such as responsiveness and stuckness. These properties are essential in component based system design. In this paper we report on the implementation of different variants of the model within Csp-Prover. Based on Isabelle/HOL, Csp-Prover is an interactive proof tool for Csp refinement, which is generic in the underlying Csp model. On the practical side, our encoding of the model provides semi-automatic proof support for reasoning on responsiveness and stuckness. On the theoretical side, our implementation also yields a machine verification of the model's soundness as well as of its expected properties.

19.
In this paper we study decentralized routing in small-world networks that combine a wide variation in node degrees with a notion of spatial embedding. Specifically, we consider a variant of J. Kleinberg’s grid-based small-world model in which (1) the number of long-range edges of each node is not fixed, but is drawn from a power-law probability distribution with exponent parameter \(\alpha \ge 0\) and constant mean, and (2) the long-range edges are considered to be bidirectional for the purposes of routing. This model is motivated by empirical observations indicating that several real networks have degrees that follow a power-law distribution. The measured power-law exponent \(\alpha\) for these networks is often in the range between 2 and 3. For the small-world model we consider, we show that when \(2 < \alpha < 3\) the standard greedy routing algorithm, in which a node forwards the message to its neighbor that is closest to the target in the grid, finishes in an expected number of \(O(\log^{\alpha-1} n\cdot \log \log n)\) steps, for any source–target pair. This is asymptotically smaller than the \(O(\log^2 n)\) steps needed in Kleinberg’s original model with the same average degree, and approaches \(O(\log n)\) as \(\alpha\) approaches 2. Further, we show that when \(0\le \alpha < 2\) or \(\alpha \ge 3\) the expected number of steps is \(O(\log^2 n)\), while for \(\alpha = 2\) it is \(O(\log^{4/3} n)\). We complement these results with lower bounds that match the upper bounds within at most a \(\log \log n\) factor.
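The greedy rule itself is simple: forward the message to whichever neighbor, grid or long-range, is closest to the target in Manhattan distance. A small self-contained simulation, with a hypothetical hand-picked long-range edge table (the model's power-law degree distribution is not generated here):

```python
def greedy_route(source, target, long_range, n):
    """Greedy routing on an n x n grid augmented with long-range edges.
    long_range maps a node to a list of its extra (bidirectional) contacts."""
    def dist(u, v):  # Manhattan (grid) distance
        return abs(u[0] - v[0]) + abs(u[1] - v[1])

    path, current = [source], source
    while current != target:
        x, y = current
        neighbors = [(x + dx, y + dy)
                     for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 0 <= x + dx < n and 0 <= y + dy < n]
        neighbors += long_range.get(current, [])
        current = min(neighbors, key=lambda v: dist(v, target))
        path.append(current)
    return path

# One long-range shortcut near the source cuts the route dramatically:
path = greedy_route((0, 0), (7, 7), {(0, 0): [(6, 6)]}, 8)
```

Because a grid neighbor always reduces the Manhattan distance by one, greedy routing is guaranteed to terminate; the analysis in the paper is about how much faster the long-range edges make it.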

20.
Twitter (http://twitter.com) is one of the most popular social networking platforms. Twitter users can easily broadcast disaster-specific information, which, if effectively mined, can assist in relief operations. However, the brevity and informal nature of tweets pose a challenge to Information Retrieval (IR) researchers. In this paper, we successfully use word embedding techniques to improve ranking for ad hoc queries on microblog data. Our experiments with the ‘Social Media for Emergency Relief and Preparedness’ (SMERP) dataset provided at an ECIR 2017 workshop show that these techniques outperform conventional term-matching based IR models. In addition, we show that, for the SMERP task, our word embedding based method is more effective if the embeddings are generated from the disaster-specific SMERP data than when they are trained on the large social media collection provided for the TREC (http://trec.nist.gov/) 2011 Microblog track dataset.
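The basic retrieval mechanism behind embedding-based ranking can be sketched as averaging word vectors and scoring documents by cosine similarity to the query centroid. The two-dimensional toy vectors below are invented for illustration; in the paper's setting they would be embeddings trained on the SMERP or TREC Microblog collections:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(query_tokens, docs, embeddings):
    """Rank document indices by cosine similarity between the averaged
    query embedding and each document's averaged embedding."""
    def centroid(tokens):
        vecs = [embeddings[t] for t in tokens if t in embeddings]
        if not vecs:
            return [0.0] * len(next(iter(embeddings.values())))
        return [sum(c) / len(vecs) for c in zip(*vecs)]

    q = centroid(query_tokens)
    scored = [(cosine(q, centroid(d)), i) for i, d in enumerate(docs)]
    return [i for _, i in sorted(scored, reverse=True)]

emb = {'flood': [1.0, 0.0], 'water': [0.9, 0.1],
       'music': [0.0, 1.0], 'concert': [0.1, 0.9]}
docs = [['music', 'concert'], ['flood', 'water']]
ranking = rank_documents(['flood'], docs, emb)
```

Note that the 'water' document ranks first for the query 'flood' despite sharing no term with it; this term-mismatch tolerance is exactly what embedding-based ranking adds over term-matching models.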
