20 similar documents found (search time: 0 ms)
1.
Vicenç Gómez Hilbert J. Kappen Nelly Litvak Andreas Kaltenbrunner 《World Wide Web》2013,16(5-6):645-675
Online discussion threads are conversational cascades in the form of posted messages, generally found in social systems with many-to-many interaction such as blogs, news aggregators or bulletin board systems. We propose a framework based on generative models of growing trees to analyse the structure and evolution of discussion threads. We consider the growth of a discussion to be determined by an interplay between popularity, novelty and a trend (or bias) to reply to the thread originator. The relevance of these features is estimated using a full likelihood approach and allows us to characterise the habits and communication patterns of a given platform and/or community. We apply the proposed framework to four popular websites: Slashdot, Barrapunto (a Spanish version of Slashdot), Meneame (a Spanish Digg clone) and the article discussion pages of the English Wikipedia. Our results provide significant insight into how discussion cascades grow and have potential applications in broader contexts such as community management or the design of communication platforms.
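As an illustration of the kind of generative tree model this abstract describes (not the authors' estimated model), the sketch below grows a thread in which each new post replies to an existing one with probability proportional to a popularity term, a novelty (recency) term, and a bias toward the originator; `alpha`, `beta`, `gamma` and `tau` are hypothetical parameters.

```python
import math
import random

def grow_thread(n_posts, alpha=1.0, beta=1.0, gamma=2.0, tau=3.0, seed=0):
    """Simulate a discussion tree; node 0 is the thread originator.

    Each new post t replies to an existing node v with probability
    proportional to alpha * (1 + replies(v))        (popularity)
                 + beta * exp(-(t - v) / tau)       (novelty / recency)
                 + gamma if v is the originator     (root bias).
    """
    rng = random.Random(seed)
    parents = [None]   # parent of each node; None for the originator
    replies = [0]      # number of replies each node has received
    for t in range(1, n_posts):
        weights = []
        for v in range(t):
            w = alpha * (1 + replies[v])
            w += beta * math.exp(-(t - v) / tau)
            if v == 0:
                w += gamma
            weights.append(w)
        parent = rng.choices(range(t), weights=weights)[0]
        parents.append(parent)
        replies[parent] += 1
        replies.append(0)
    return parents

parents = grow_thread(50)
```

Fitting such a model, as the paper does, would amount to maximizing the likelihood of observed reply choices under these weights.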
2.
Teran J Sifakis E Blemker SS Ng-Thow-Hing V Lau C Fedkiw R 《IEEE transactions on visualization and computer graphics》2005,11(3):317-328
Simulation of the musculoskeletal system has important applications in biomechanics, biomedical engineering, surgery simulation, and computer graphics. The accuracy of the muscle, bone, and tendon geometry, as well as the accuracy of muscle and tendon dynamic deformation, is of paramount importance in all these applications. We present a framework for extracting and simulating high resolution musculoskeletal geometry from the segmented visible human data set. We simulate 30 contact/collision coupled muscles in the upper limb and describe a computationally tractable implementation using an embedded mesh framework. Muscle geometry is embedded in a nonmanifold, connectivity preserving simulation mesh molded out of a lower resolution BCC lattice containing identical, well-shaped elements, leading to a relaxed time step restriction for stability and, thus, reduced computational cost. The muscles are endowed with a transversely isotropic, quasi-incompressible constitutive model that incorporates muscle fiber fields as well as passive and active components. The simulation takes advantage of a new robust finite element technique that handles both degenerate and inverted tetrahedra.
3.
Carlos Ordonez Naveen Mohanam Carlos Garcia-Alvarado 《Distributed and Parallel Databases》2014,32(3):377-403
Parallel processing is essential for large-scale analytics. Principal Component Analysis (PCA) is a well-known model for dimensionality reduction in statistical analysis, and computing it requires a demanding number of I/O and CPU operations. In this paper, we study how to compute PCA in parallel. We extend a previous sequential method to a highly parallel algorithm that can compute PCA in one pass on a large data set based on summarization matrices. We also study how to integrate our algorithm with a DBMS; our solution is based on a combination of parallel data set summarization via user-defined aggregations and a call to the MKL parallel variant of the LAPACK library to solve Singular Value Decomposition (SVD) in RAM. Our algorithm is theoretically shown to achieve linear speedup and linear scalability on data size, with quadratic time on dimensionality (but in RAM), spending most of the time on data set summarization, despite the fact that SVD has cubic time complexity on dimensionality. Experiments with large data sets on multicore CPUs show that our solution is much faster than both the R statistical package and solving PCA with SQL queries. Benchmarking on multicore CPUs and a parallel DBMS running on multiple nodes confirms linear speedup and linear scalability.
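The one-pass idea can be sketched as follows: a single scan accumulates the summarization matrices n, L = sum(x) and Q = sum(x x^T), after which PCA reduces to a small d x d problem solved in RAM. This is an illustrative sketch, not the paper's DBMS-integrated implementation; NumPy's SVD stands in for the MKL/LAPACK call.

```python
import numpy as np

def summarize(chunks):
    """One pass over the data: accumulate n, L = sum(x), Q = sum(x x^T).
    Each chunk could be processed by a different worker and the partial
    summaries added together."""
    n, L, Q = 0, None, None
    for X in chunks:                      # X is one (rows, d) block
        if L is None:
            d = X.shape[1]
            L, Q = np.zeros(d), np.zeros((d, d))
        n += X.shape[0]
        L += X.sum(axis=0)
        Q += X.T @ X
    return n, L, Q

def pca_from_summaries(n, L, Q, k):
    """PCA of the covariance matrix rebuilt from the summarization matrices."""
    mean = L / n
    cov = Q / n - np.outer(mean, mean)
    U, s, _ = np.linalg.svd(cov)          # small d x d problem, solved in RAM
    return U[:, :k], s[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
n, L, Q = summarize(np.array_split(X, 10))
components, variances = pca_from_summaries(n, L, Q, 2)
```

Note that Q/n - mean outer mean reproduces the population covariance exactly, so the summaries lose no information needed for PCA.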
4.
The paper presents an automatic video summarization technique based on graph-theoretic methodology and the dominant sets clustering algorithm. The large size of the video data set is handled by exploiting the connectivity information of prototype frames extracted from a down-sampled version of the original video sequence. The connectivity information for the prototypes, obtained from the whole set of data, improves video representation and reveals its structure. Automatic selection of the optimal number of clusters, and thereby of keyframes, is then accomplished through the dominant set clustering algorithm. The method is free of user-specified modeling parameters and is evaluated in terms of several metrics that quantify its content representational ability. A comparison of the proposed summarization technique with the Open Video storyboard, the Adaptive clustering algorithm and the Delaunay clustering approach is provided.
D. Besiris
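Dominant-set clustering, as used above, is commonly formulated via discrete replicator dynamics on a similarity matrix: iterating x <- x * (A x) / (x' A x) concentrates mass on a maximally cohesive group, whose support is the cluster. The sketch below illustrates this on a toy frame-similarity matrix; it is a minimal illustration, not the paper's pipeline.

```python
import numpy as np

def dominant_set(A, iters=2000, cutoff=1e-4):
    """Extract one dominant set from similarity matrix A (zero diagonal)
    via replicator dynamics: x <- x * (A @ x) / (x @ A @ x)."""
    x = np.full(A.shape[0], 1.0 / A.shape[0])
    for _ in range(iters):
        Ax = A @ x
        x = x * Ax / (x @ Ax)
    return np.where(x > cutoff)[0]        # support = members of the cluster

# toy similarity between 5 frames: frames 0-2 are near-duplicates,
# frames 3-4 form a second, weakly related shot
A = np.array([[0.0, 0.9, 0.9, 0.1, 0.1],
              [0.9, 0.0, 0.9, 0.1, 0.1],
              [0.9, 0.9, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.1, 0.0, 0.9],
              [0.1, 0.1, 0.1, 0.9, 0.0]])
members = dominant_set(A)
```

A full summarizer would peel off one dominant set at a time, removing its members and re-running the dynamics, which is how the number of clusters (and thus keyframes) emerges without user parameters.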
5.
One-class learning and concept summarization for data streams (total citations: 2; self-citations: 2; citations by others: 0)
Xingquan Zhu Wei Ding Philip S. Yu Chengqi Zhang 《Knowledge and Information Systems》2011,28(3):523-553
In this paper, we formulate a new research problem of concept learning and summarization for one-class data streams. The main objectives are to (1) allow users to label instance groups, instead of single instances, as positive samples for learning, and (2) summarize concepts labeled by users over the whole stream. Batch labeling raises serious issues for stream-oriented concept learning and summarization, because a labeled instance group may contain non-positive samples and users may change their labeling interests at any time. As a result, the positive samples labeled by users over the whole stream may be inconsistent and contain multiple concepts. To resolve these issues, we propose a one-class learning and summarization (OCLS) framework with two major components. In the first component, we propose a vague one-class learning (VOCL) module for concept learning from data streams using an ensemble of classifiers with instance-level and classifier-level weighting strategies. In the second component, we propose a one-class concept summarization (OCCS) module that uses clustering techniques and a Markov model to summarize concepts labeled by users, with only one scan of the stream data. Experimental results on synthetic and real-world data streams demonstrate that the proposed VOCL module outperforms its peers at learning concepts from vaguely labeled stream data. The OCCS module is also able to rebuild a high-level summary of concepts marked by users over the stream.
6.
As course management systems (CMS) gain popularity in facilitating teaching, the forum has become a key component for supporting interactions among students and teachers. Content analysis is the most popular way to study a discussion forum, but it is a labor-intensive process: the coding, for example, relies heavily on manual interpretation, which is time- and energy-consuming. In an asynchronous virtual learning environment, an instructor needs to keep monitoring the discussion forum in order to maintain its quality. However, doing so is time-consuming and difficult for instructors, especially K12 teachers. This research proposes a genre classification system, called GCS, to facilitate the automatic coding process. We treat coding as a document classification task via modern data mining techniques. The genre of a posting can be perceived as an announcement, a question, a clarification, an interpretation, a conflict, an assertion, etc. This research examines the coding coherence between GCS and experts' judgment in terms of recall and precision, and discusses how we adjust the parameters of GCS to improve the coherence. Based on the empirical results, GCS adopts a cascade classification model to achieve the automatic coding process. The empirical evaluation of the classified genres, from a repository of postings in an online earth-science course in a senior high school, shows that GCS can effectively facilitate the coding process and that the proposed cascade model can deal with the imbalanced distribution inherent in discussion postings. These results imply that GCS, based on the cascade model, can serve as an automatic posting coding system.
7.
Edith Cohen Nick Duffield Haim Kaplan Carsten Lund Mikkel Thorup 《Journal of Computer and System Sciences》2014
Statistical summaries of IP traffic are at the heart of network operation and are used to recover aggregate information on subpopulations of flows. It is therefore of great importance to collect the most accurate and informative summaries given the router's resource constraints. A summarization algorithm, such as Cisco's sampled NetFlow, is applied to IP packet streams that consist of multiple interleaving IP flows. We develop sampling algorithms and unbiased estimators that address sources of inefficiency in current methods. First, we design tunable algorithms, whereas currently a single parameter (the sampling rate) controls utilization of both memory and processing/access speed, and therefore has to be set according to the bottleneck resource. Second, we make better use of the memory hierarchy by exporting partial summaries to slower storage during the measurement period.
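A minimal sketch of the sampling-and-estimation idea behind sampled NetFlow: keep each packet independently with probability p, then scale the sampled counts by 1/p to obtain unbiased per-flow estimates (a Horvitz-Thompson estimator). The stream and rate below are illustrative, not the paper's tunable algorithms.

```python
import random

def sample_and_estimate(packets, p, seed=1):
    """Keep each packet independently with probability p, then estimate the
    true per-flow packet counts unbiasedly as sampled_count / p."""
    rng = random.Random(seed)
    counts = {}
    for flow_id in packets:              # one interleaved stream of flow ids
        if rng.random() < p:
            counts[flow_id] = counts.get(flow_id, 0) + 1
    return {f: c / p for f, c in counts.items()}

# flow 'a' sends 10000 packets and flow 'b' 2000, interleaved
stream = (['a'] * 5 + ['b']) * 2000
estimates = sample_and_estimate(stream, p=0.1)
```

Unbiasedness follows because each packet survives with probability exactly p, so the expected sampled count is p times the true count.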
8.
Ronald R. Yager 《Information Sciences》1982,28(1):69-86
We introduce a new approach to the summarization of data based upon the theory of fuzzy subsets. This new summarization allows for a linguistic summary of the data and is useful for both numeric and nonnumeric data. It summarizes the data in terms of three values: a summarizer, a quantity in agreement, and a truth value. We also discuss a procedure for investigating the informativeness of a summary.
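Yager-style linguistic summaries compute the truth of a statement like "most of the employees are young" as T = mu_Q((1/n) * sum of mu_S(x_i)), where mu_S is the summarizer's membership function and mu_Q a fuzzy quantifier. The membership functions below are illustrative choices, not taken from the paper.

```python
def mu_young(age):
    """Membership of 'young': 1 up to 25, falling linearly to 0 at 40."""
    if age <= 25:
        return 1.0
    if age >= 40:
        return 0.0
    return (40 - age) / 15

def mu_most(r):
    """Fuzzy quantifier 'most': 0 below 0.3, 1 above 0.8, linear between."""
    if r <= 0.3:
        return 0.0
    if r >= 0.8:
        return 1.0
    return (r - 0.3) / 0.5

def summary_truth(data, mu_summarizer, mu_quantifier):
    """Truth value of the linguistic summary 'Q of the data are S'."""
    r = sum(mu_summarizer(x) for x in data) / len(data)
    return mu_quantifier(r)

ages = [22, 24, 28, 31, 45, 23, 26]
truth = summary_truth(ages, mu_young, mu_most)   # "most employees are young"
```

The three values of the abstract map directly onto the code: the summarizer is `mu_young`, the quantity in agreement is `mu_most`, and `truth` is the resulting truth value.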
9.
10.
Umit Y. Ogras Hakan Ferhatosmanoglu 《The VLDB Journal The International Journal on Very Large Data Bases》2006,15(1):84-98
Managing large-scale time series databases has recently attracted significant attention in the database community. Related fundamental problems such as dimensionality reduction, transformation, pattern mining, and similarity search have been studied extensively. Although time series data are dynamic by nature, as in data streams, current solutions to these fundamental problems have mostly targeted static time series databases. In this paper, we first propose a framework for online summary generation for large-scale and dynamic time series data, such as data streams. Then, we propose online transform-based summarization techniques over data streams that can be updated in constant time and space. We present both exact and approximate versions of the proposed techniques and provide error bounds for the approximate case. One of our main contributions in this paper is the extensive performance analysis. Our experiments carefully evaluate the quality of the online summaries for point, range, and k-NN queries using real-life dynamic data sets of substantial size.
Edited by W. Aref
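A classic example of a transform-based summary that can be maintained in constant time per arriving sample is the sliding-window DFT: when the window slides by one, each retained coefficient obeys X_k <- (X_k - x_old + x_new) * exp(2j*pi*k/N). The sketch below illustrates this general idea and is not the paper's specific technique.

```python
import cmath

def dft_coef(window, k):
    """Direct k-th DFT coefficient of a window (O(N); used for checking)."""
    N = len(window)
    return sum(x * cmath.exp(-2j * cmath.pi * k * n / N)
               for n, x in enumerate(window))

def slide(Xk, k, N, x_old, x_new):
    """O(1) update of the k-th DFT coefficient when the window of size N
    slides by one sample: X_k <- (X_k - x_old + x_new) * exp(2j*pi*k/N)."""
    return (Xk - x_old + x_new) * cmath.exp(2j * cmath.pi * k / N)

stream = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
N, k = 4, 1
Xk = dft_coef(stream[:N], k)                 # coefficient of stream[0:4]
for i in range(N, len(stream)):              # slide one sample at a time
    Xk = slide(Xk, k, N, stream[i - N], stream[i])
# Xk is now the k-th coefficient of the last window, stream[4:8]
```

Keeping only the first few coefficients of each window yields a constant-space summary that supports approximate point and range queries with bounded error.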
11.
Emilio Corchado 《Neurocomputing》2012,75(1):171-184
This study presents a novel version of the Visualization Induced Self-Organizing Map based on the application of a new fusion algorithm for summarizing the results of an ensemble of topology-preserving mapping models. The algorithm is referred to as Weighted Voting Superposition (WeVoS). Its main feature is the preservation of the topology of the map, in order to obtain the most accurate possible visualization of the data sets under study. To do so, a weighted voting process takes place between the units of the maps in the ensemble, in order to determine the characteristics of the units of the resulting map. Several different quality measures are applied to this novel neural architecture, known as WeVoS-ViSOM, and the results are analyzed so as to present a thorough study of its capabilities. To complete the study, it has also been compared with the well-known SOM, with its fusion version WeVoS-SOM, and with two other previously devised fusion methods (Fusion by Euclidean Distance and Fusion by Voronoi Polygon Similarity), based on the analysis of the same quality measures. All three summarization methods were applied to three widely used data sets from the UCI Repository. A rigorous performance analysis clearly demonstrates that the novel fusion algorithm outperforms the other single and summarization methods in terms of data set visualization.
12.
13.
The purpose of this study was to evaluate the effectiveness of voluntary discussion forums in a higher education setting. Specifically, we examined intrinsic forum participation and investigated its relation to course performance across two experiments. In Experiment 1 (N = 1284) an online discussion forum was implemented at the beginning of an undergraduate introductory psychology course, and measures of course performance (i.e., writing assignment grades, exam grades, and extra-credits obtained) were compared with measures of forum participation. In Experiment 2 (N = 1334) an online discussion forum was implemented halfway through a second undergraduate introductory psychology course, after an initial measure of course performance was obtained, to control for the potential confound of student engagement (e.g., students who perform better in the course use the forum more). Overall, the results showed that students who participated in the forum tended to have better performance in the course, and furthermore that participating in the discussion forum, particularly reading posts on the forum, slightly improved exam performance. This study provides empirical support for the theoretical proposition that there is a facilitation effect of discussion forum participation on course performance. The results also suggest that implementation of an online discussion forum is beneficial even if a teacher only invests minimal time on the forum.
14.
Jin H Wong ML Leung KS 《IEEE transactions on pattern analysis and machine intelligence》2005,27(11):1710-1719
The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources such as memory and computation time. In this paper, two scalable clustering algorithms, bEMADS and gEMADS, are presented based on the Gaussian mixture model. Both summarize data into subclusters and then generate Gaussian mixtures from their data summaries. Their core algorithm, EMADS, is defined on data summaries and approximates the aggregate behavior of each subcluster of data under the Gaussian mixture model. EMADS is provably convergent. Experimental results substantiate that both algorithms can run several orders of magnitude faster than expectation-maximization with little loss of accuracy.
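EMADS defines EM directly on richer data summaries; as a simplified illustration of the same idea, the sketch below fits a 1-D two-component Gaussian mixture to subcluster summaries of the form (count, mean), treating each summary as a point weighted by its count and ignoring within-subcluster spread. All values and parameter choices are illustrative.

```python
import numpy as np

def weighted_em_1d(counts, centers, iters=50):
    """EM for a two-component 1-D Gaussian mixture over subcluster
    summaries (count, mean), treating each summary as a point weighted
    by its count. Ignoring within-subcluster spread is a simplification
    of the EMADS idea."""
    w = np.asarray(counts, float)
    x = np.asarray(centers, float)
    mu = np.array([x.min(), x.max()])      # deterministic spread-out init
    var = np.full(2, x.var())
    pi = np.full(2, 0.5)
    for _ in range(iters):
        # E-step: responsibility of each component for each summary
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates using the summary counts as weights
        nk = (w[:, None] * r).sum(axis=0)
        mu = (w[:, None] * r * x[:, None]).sum(axis=0) / nk
        var = (w[:, None] * r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / w.sum()
    return pi, mu, var

# four subcluster summaries drawn from two well-separated groups
pi, mu, var = weighted_em_1d([100, 120, 90, 110], [0.1, -0.2, 9.8, 10.1])
```

Because the E- and M-steps touch only the fixed-size summaries rather than the raw data, each iteration costs time proportional to the number of subclusters, which is the source of the claimed speedup.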
15.
《Expert systems with applications》2014,41(6):2619-2629
In semiconductor manufacturing processes, sensor data are segmented and summarized in order to reduce storage space. This is conventionally done by segmenting the data based on predefined chamber step information and calculating statistics within the segments. However, segmentation via chamber steps often does not coincide with actual change points in the data, which results in suboptimal summarization. This paper proposes a novel framework, using abnormal differences and free-knot splines with knot removal, to detect actual data change points and summarize around them. Preliminary experiments demonstrate that the proposed algorithm handles arbitrarily shaped data in a robust fashion and outperforms chamber-step-based segmentation and summarization. An evaluation metric based on linearity and parsimony is also proposed.
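A much-simplified stand-in for the change-point step: flag points where the first difference of the signal is abnormally large relative to a robust scale estimate, then summarize each resulting segment by its mean. The threshold rule and data are illustrative, not the paper's free-knot-spline method.

```python
import numpy as np

def segment_on_changes(y, z=5.0):
    """Flag change points where the first difference is abnormally large
    (beyond z robust standard deviations, via the MAD), then summarize
    each segment by its mean."""
    d = np.diff(y)
    med = np.median(d)
    mad = np.median(np.abs(d - med))
    scale = 1.4826 * mad if mad > 0 else np.std(d)
    cps = (np.where(np.abs(d - med) > z * scale)[0] + 1).tolist()
    bounds = [0] + cps + [len(y)]
    means = [float(np.mean(y[a:b])) for a, b in zip(bounds, bounds[1:])]
    return cps, means

# synthetic sensor trace with level shifts at samples 50 and 100
rng = np.random.default_rng(7)
y = np.concatenate([rng.normal(0, 0.05, 50),
                    rng.normal(5, 0.05, 50),
                    rng.normal(2, 0.05, 50)])
cps, seg_means = segment_on_changes(y)
```

Storing only the change points and per-segment statistics gives the kind of compact summary the abstract describes, with segments aligned to the data rather than to chamber steps.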
16.
Korhonen P. Karaivanova J. 《IEEE transactions on systems, man, and cybernetics. Part A, Systems and humans : a publication of the IEEE Systems, Man, and Cybernetics Society》1999,29(5):429-435
We consider the problem of searching nondominated alternatives in a discrete multiple criteria problem. The search procedure is based on the use of a reference direction, which reflects the desire of the decision maker (DM) to specify a search direction. To find a set of given alternatives related to the reference direction specified by the DM, the reference direction has to be projected onto the set of nondominated alternatives. Our purpose is to develop an efficient algorithm for making this projection. The projection of each given reference direction determines a nondominated ordered subset, which is provided to the decision maker for evaluation. The decision maker chooses the most preferred alternative from this subset and continues the search from this alternative with a new reference direction. The search ends when no direction of improvement is found. A critical point in the procedure is the efficiency of the projection operation, which we evaluate both theoretically and numerically. The projection is made by parametrizing an achievement scalarizing function originally proposed by Wierzbicki (1980) to project any single point onto the nondominated set.
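The projection step can be illustrated with an achievement scalarizing function of the form s(q, z) = max_i (q_i - z_i) / w_i: among the nondominated alternatives, pick the one minimizing s for a given reference (aspiration) point q. A full reference-direction search would repeat this as q moves along the direction; the alternatives and weights below are illustrative.

```python
import numpy as np

def nondominated(Z):
    """Indices of Pareto-nondominated rows (all criteria maximized)."""
    keep = []
    for i in range(len(Z)):
        dominated = any(np.all(Z[j] >= Z[i]) and np.any(Z[j] > Z[i])
                        for j in range(len(Z)) if j != i)
        if not dominated:
            keep.append(i)
    return keep

def project(Z, q, w):
    """Pick the nondominated alternative minimizing the achievement
    scalarizing function s(q, z) = max_i (q_i - z_i) / w_i."""
    nd = nondominated(Z)
    scores = [float(np.max((q - Z[i]) / w)) for i in nd]
    return nd[int(np.argmin(scores))]

# five alternatives evaluated on two criteria (both maximized)
Z = np.array([[9., 1.], [7., 5.], [4., 8.], [1., 9.], [5., 5.]])
q = np.array([8., 6.])              # aspiration point chosen by the DM
best = project(Z, q, w=np.array([1., 1.]))
```

The minimizer of s is always nondominated when the weights are positive, which is why the achievement function is a suitable projection device.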
17.
A global NDVI and EVI reference data set for land-surface phenology using 13 years of daily SPOT-VEGETATION observations (total citations: 1; self-citations: 0; citations by others: 1)
Astrid Verhegghen Sophie Bontemps Pierre Defourny 《International journal of remote sensing》2014,35(7):2440-2471
Time series of vegetation indices (VIs) obtained by remote sensing are widely used to study phenology on regional and global scales. The aim of this study is to design a method and to produce a reference data set describing the seasonal and inter-annual variability of land-surface phenology on a global scale. Specific constraints are inherent in the design of such a global reference data set: (1) the high diversity of vegetation types and the heterogeneous conditions of observation, (2) a near-daily resolution is needed to follow the rapid changes in phenology, (3) the time series used to depict the baseline vegetation cycle must be long enough to be representative of the current vegetation dynamic and encompass anomalies, and (4) a spatial resolution consistent with a land-cover-specific analysis should be preferred. This study focuses on the SPOT (Satellite Pour l'Observation de la Terre)-VEGETATION sensor and its 13-year time series of reflectance values. Five steps addressing the noise and the missing data in the reflectance time series were selected to process the daily multispectral reflectance observations. The final product provides, for every pixel, three profiles over 52 × 7-day periods: a mean, a median, and a standard deviation profile. The mean and median profiles represent the reference seasonal pattern of vegetation variation at a specific location, whereas the standard deviation profile expresses the inter-annual variability of VIs. A quality flag at the pixel level demonstrated that the reference data set can be considered a reliable representation of vegetation phenology in most parts of the Earth.
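The profile construction can be sketched directly: fold a multi-year daily series into years of 52 × 7-day periods, average within each period, and take the mean, median and standard deviation across years. The synthetic series below is illustrative (years truncated to 364 days for simplicity; the paper's compositing and gap-filling steps are omitted).

```python
import numpy as np

def reference_profiles(daily, period=7, periods_per_year=52):
    """Build mean / median / std profiles over 52 x 7-day periods from a
    multi-year daily VI series (years assumed truncated to 364 days)."""
    year_len = period * periods_per_year
    n_full = len(daily) // year_len * year_len
    years = np.asarray(daily[:n_full]).reshape(-1, periods_per_year, period)
    weekly = years.mean(axis=2)          # 7-day composites, one row per year
    return weekly.mean(axis=0), np.median(weekly, axis=0), weekly.std(axis=0)

# synthetic 13-year NDVI-like series: seasonal cycle plus observation noise
rng = np.random.default_rng(5)
t = np.arange(13 * 364)
ndvi = 0.5 + 0.3 * np.sin(2 * np.pi * t / 364) + rng.normal(0, 0.02, t.size)
mean_p, median_p, std_p = reference_profiles(ndvi)
```

The mean and median profiles capture the seasonal baseline, while the std profile measures how much each 7-day period varies across the 13 years.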
18.
Boris Otto Kai M. Hüner Hubert Österle 《Information Systems and E-Business Management》2012,10(3):395-425
The quality of master data has become an issue of increasing prominence in companies. One reason for this is the growing number of regulatory and legal provisions companies need to comply with. Another is the growing importance of information systems supporting decision-making, which require master data that is up-to-date, accurate and complete. While improving and maintaining master data quality is an organizational task that cannot be accomplished simply by implementing a suitable software system, system support is mandatory in order to meet the challenges efficiently and achieve good results. This paper describes the design process toward a functional reference model for master data quality management (MDQM). The model design process spanned several iterations comprising multiple design and evaluation cycles, including the model's application in a participative case study at consumer goods manufacturer Beiersdorf. Practitioners may use the reference model as an instrument for the analysis, design and implementation of a company's MDQM system landscape. Moreover, the reference model facilitates evaluation of software systems and supports company-internal and external communication. From a scientific perspective, the reference model is a design artifact; hence it represents a theory for designing information systems in the area of MDQM.
19.
Angélica Caro Coral Calero Ismael Caballero Mario Piattini 《Software Quality Journal》2008,16(4):513-542
Data quality is a critical issue in today's interconnected society. Advances in technology are making the use of the Internet an ever-growing phenomenon, and we are witnessing the creation of a great variety of applications such as Web portals. These applications are important data sources and/or means of accessing information which many people use to make decisions or to carry out tasks. Quality is a very important factor in any software product and also in data. As quality is a wide concept, quality models are usually used to assess the quality of a software product. From the software point of view there is a widely accepted standard proposed by ISO/IEC (ISO/IEC 9126) which defines a quality model for software products. However, until now no similar proposal has existed for data quality. Although we have found some proposals of data quality models, some of them serving as de facto standards, none of them focuses specifically on Web portal data quality and the user's perspective. In this paper, we propose a set of 33 attributes which are relevant for portal data quality. These have been obtained from a literature review and a validation process carried out by means of a survey. Although these attributes do not yet constitute a usable model, we believe they are a good starting point for constructing one.
Angélica Caro has a PhD in Computer Science and is Assistant Professor at the Department of Computer Science and Information Technologies of the Bio Bio University in Chillán, Chile. Her research interests include data quality, Web portals, data quality in Web portals, and data quality measures. She is the author of papers in national and international conferences on this subject. Coral Calero has a PhD in Computer Science and is Associate Professor at the Escuela Superior de Informática of the Castilla-La Mancha University in Ciudad Real. She is a member of the Alarcos Research Group at the same university, specialized in information systems, databases and software engineering. Her research interests include advanced database design, database quality, software metrics and database metrics. She has published in Information Systems Journal, Software Quality Journal, Information and Software Technology Journal and SIGMOD Record Journal, and has organized the web services quality workshop (WISE Conference, Rome 2003) and the Database Maintenance and Reengineering workshop (ICSM Conference, Montreal 2002). Ismael Caballero has an MSc and PhD in Computer Science from the Escuela Superior de Informática of the Castilla-La Mancha University in Ciudad Real. He currently works as an assistant professor in the Department of Information Systems and Technologies at the University of Castilla-La Mancha, and he has also been working in the R&D Department of Indra Sistemas since 2006. His research interests are focused on information quality management, information quality in SOA, and global software development. Mario Piattini has an MSc and a PhD in Computer Science (Politechnical University of Madrid) and an MSc in Psychology (UNED).
He is also a Certified Information System Auditor and a Certified Information System Manager by ISACA (Information System Audit and Control Association), as well as a Full Professor in the Department of Computer Science at the University of Castilla-La Mancha, in Ciudad Real, Spain. He is the author of several books and papers on databases, software engineering and information systems, and a coeditor of several international books: "Advanced Databases Technology and Design" (Artech House, 2000), "Information and Database Quality" (Kluwer Academic Publishers, 2002), "Component-Based Software Quality: Methods and Techniques" (Springer, 2004), and "Conceptual Software Metrics" (Imperial College Press, 2005). He leads the ALARCOS research group of the Department of Computer Science at the University of Castilla-La Mancha. His research interests are advanced databases, database quality, software metrics, security and audit, and software maintenance.
20.
Ramón A. Carrasco Pedro Villar 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2012,16(1):135-151
In this paper we present the problem of aggregating heterogeneous data, from various websites carrying opinions about high-end hotels, into a database. We present a fuzzy model based on semantic translation as a tool to obtain a linguistic summarization. The characteristics of this model that are necessary to solve the problem are not found together in any of the existing linguistic models: the management of heterogeneous input data (natural language included); the production of linguistic results with high precision and good interpretability; and the use of unbalanced linguistic term sets described by trapezoidal membership functions for defining the initial linguistic terms. We applied the model to aggregate data from several high-end hotel websites, and we show a case study using the high-end hotels located in Granada (Spain) over the course of a year. With this aggregated information, a data analyst can perform several analyses that benefit from easy linguistic interpretability and high precision. The solution proposed here can also be applied to similar aggregation problems.