Similar Documents
20 similar documents found.
1.
In this paper, we describe a novel unsupervised approach for detecting, classifying, and tracing non-functional software requirements (NFRs). The proposed approach exploits the textual semantics of software functional requirements (FRs) to infer potential quality constraints enforced in the system. In particular, we conduct a systematic analysis of a series of word similarity methods and clustering techniques to generate semantically cohesive clusters of FR words. These clusters are classified into various categories of NFRs based on their semantic similarity to basic NFR labels. Discovered NFRs are then traced to their implementation in the solution space based on their textual semantic similarity to source code artifacts. Three software systems are used to conduct the experimental analysis in this paper. The results show that methods that exploit massive sources of textual human knowledge are more accurate in capturing and modeling the notion of similarity between FR words in a software system. Results also show that hierarchical clustering algorithms are more capable of generating thematic word clusters than partitioning clustering techniques. In terms of performance, our analysis indicates that the proposed approach can discover, classify, and trace NFRs with accuracy levels that can be adequate for practical applications.
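A minimal sketch of the label-similarity classification step described above, assuming a pretrained word-embedding lookup; the toy random vectors, the `classify_cluster` helper, and the label list are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

NFR_LABELS = ["security", "performance", "usability", "reliability"]

def classify_cluster(cluster_words, embed, labels=NFR_LABELS):
    """Assign a cluster of FR words to the NFR label whose vector lies
    closest (by cosine similarity) to the cluster centroid."""
    centroid = np.mean([embed[w] for w in cluster_words if w in embed], axis=0)
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(labels, key=lambda lb: cos(centroid, embed[lb]))

# Toy embeddings; in practice these would come from a model trained on a
# massive text corpus, as the abstract suggests.
rng = np.random.default_rng(0)
embed = {w: rng.normal(size=50)
         for w in NFR_LABELS + ["encrypt", "password", "latency"]}
print(classify_cluster(["encrypt", "password"], embed))
```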

2.
This paper presents a study of using ellipsoidal decision regions for motif-based patterned fabric defect detection, which is found to improve on the detection success of the original max-min decision region applied to energy-variance values. In our previous research, the max-min decision region was effective in clear-cut cases but handled ambiguous false-positive and false-negative cases poorly. To alleviate this problem, we first assume that the energy-variance values can be described by a Gaussian mixture model. Second, we apply k-means clustering to roughly identify the clusters that make up the data population. Third, the convex hull of each cluster is employed as the basis for fitting an ellipsoidal decision region over it. Defect detection is then based on these ellipsoidal regions. To validate the method, three wallpaper groups are evaluated using the new ellipsoidal regions and compared with the results obtained using the max-min decision region. For the p2 group, the success rate improves from 93.43% to 100%; for the pmm group, from 95.9% to 96.72%; the p4m group records the same success rate of 90.77%. This demonstrates the superiority of ellipsoidal decision regions in motif-based defect detection.
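A sketch of the ellipsoid-fitting idea, with the caveat that it substitutes a covariance-based Mahalanobis ellipsoid for the paper's convex-hull-driven fit; cluster count, coverage level, and thresholding rule are assumptions:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.cluster import KMeans

def fit_ellipsoids(X, n_clusters=3, coverage=0.99):
    """k-means to find the clusters, then one ellipsoid per cluster from
    its mean and inverse covariance (a Mahalanobis ball covering roughly
    `coverage` of a Gaussian cluster)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    thresh = chi2.ppf(coverage, df=X.shape[1])
    regions = [(X[labels == k].mean(axis=0),
                np.linalg.inv(np.cov(X[labels == k], rowvar=False)))
               for k in range(n_clusters)]
    return regions, thresh

def is_defect_free(x, regions, thresh):
    # A sample passes if it falls inside at least one ellipsoidal region.
    return any((x - mu) @ ic @ (x - mu) <= thresh for mu, ic in regions)
```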

3.
In this paper, we present CatViz (Temporally-Sliced Correspondence Analysis Visualization), a novel method that visualizes relationships through time and is suitable for large-scale temporal multivariate data. We couple CatViz with clustering methods and introduce the concept of final centroid transfer, which establishes the correspondence of clusters over time. Although CatViz can be used on any type of temporal data, we show how it can be applied to the exploratory visual analysis of text collections. We present a successful concept of employing feature-type filtering to present different aspects of textual data. We performed case studies on large collections of French and English news articles, and we conducted a user study that confirms the usefulness of our method. We present typical tasks of exploratory text analysis and discuss application procedures that an analyst might perform. We believe that CatViz is general and highly applicable to large data sets because of its intuitiveness, effectiveness, and robustness, and we expect it to enable a better understanding of texts in huge historical archives.
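For reference, plain correspondence analysis of a contingency table (say, term-by-time-slice counts) reduces to an SVD of standardized residuals; the sketch below shows only that core computation, not CatViz's temporal slicing or final centroid transfer:

```python
import numpy as np

def correspondence_analysis(N, n_dims=2):
    """Row/column principal coordinates of a contingency table N."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    rows = U[:, :n_dims] * s[:n_dims] / np.sqrt(r)[:, None]
    cols = Vt.T[:, :n_dims] * s[:n_dims] / np.sqrt(c)[:, None]
    return rows, cols

N = np.array([[12, 3, 1], [2, 10, 4], [1, 2, 9]])  # e.g. terms x time slices
rows, cols = correspondence_analysis(N)
```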

4.
Usually, when analyzing data that have not yet been processed or filtered, it can be observed that not all data have equal importance, and it is common to find relevant data surrounded by non-relevant data. This occurs when analyzing textual information because of its intrinsic nature: texts contain words that provide a lot of information about the subject matter alongside other words with little meaning or relevance. We believe that although the non-relevant words are in principle not as important as the relevant ones, the former constitute the substrate that supports the latter. Since this substrate is the context that surrounds the relevant information, we call it the contextual information. In this paper, we analyze the relevance of contextual information in textual data in a clustering-by-compression scenario. We generate the contextual information by applying a distortion technique previously developed by the authors, one of whose main characteristics is that it preserves the contextual information. We compare this technique with three new distortion techniques that destroy the contextual information in different ways. The experimental results support our hypothesis that contextual information is relevant, at least in the area of text clustering by compression.
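The standard distance in clustering by compression is the normalized compression distance; a minimal version with zlib is sketched below (the authors' contextual-information-preserving distortion itself is not reproduced here):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small when x and y share structure
    a compressor can exploit, close to 1 when they do not."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"the cat sat on the mat", b"the cat sat on a mat"))
```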

5.
Cluster analysis of software defect data partitions defect data objects into classes according to given criteria, so that defects within a class are similar and defects across classes are dissimilar. Its value lies in revealing the distribution patterns of software defects, so that test plans can be designed in a targeted way and the testing process optimized. To address the problem that the clustering results of the traditional K-means method depend on the initial spatial distribution of the samples, this paper proposes DRPS, a data dimensionality-reduction method based on the PSO algorithm. Simulation experiments show that, after the data are reduced with this method, both clustering accuracy and clustering quality improve to a certain extent.
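The abstract does not spell out how DRPS works, so the sketch below is only a generic illustration of PSO-driven dimensionality reduction: binary particles encode feature masks, and fitness is the silhouette of K-means on the retained dimensions (all names, coefficients, and the fitness choice are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pso_reduce(X, k, n_particles=8, iters=20, seed=0):
    """Binary PSO over feature masks; fitness = silhouette of k-means
    on the selected dimensions."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pos = (rng.random((n_particles, d)) > 0.5).astype(float)
    vel = rng.normal(size=(n_particles, d))

    def fitness(p):
        m = p.astype(bool)
        if m.sum() < 2:
            return -1.0
        lab = KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(X[:, m])
        return silhouette_score(X[:, m], lab)

    pbest, pfit = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[pfit.argmax()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, d))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = (rng.random((n_particles, d)) < 1 / (1 + np.exp(-vel))).astype(float)
        fit = np.array([fitness(p) for p in pos])
        better = fit > pfit
        pbest[better], pfit[better] = pos[better], fit[better]
        gbest = pbest[pfit.argmax()].copy()
    return gbest.astype(bool)  # mask of retained dimensions
```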

6.
Online reviews are often consulted by users deciding whether to buy a product, see a movie, or go to a restaurant. However, most reviews are written in a free-text format, usually with very scant structured metadata, and are therefore difficult for computers to understand, analyze, and aggregate. Users then face the daunting task of reading a large quantity of reviews to discover potentially useful information. We identify topical and sentiment information in free-form text reviews and use this knowledge to improve the user experience of accessing reviews. Specifically, we focus on improving recommendation accuracy in a restaurant review scenario. We propose methods to derive a text-based rating from the body of the reviews, and we then group similar users together using soft clustering techniques based on the topics and sentiments that appear in the reviews. Our results show that using textual information yields better review score predictions than the coarse numerical star ratings given by the users. In addition, we use our techniques to make fine-grained predictions of user sentiment towards the individual topics covered in reviews, with good accuracy.
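One way to realize the soft user grouping described above is to take posterior responsibilities from a Gaussian mixture over latent topic features; a toy sketch in which LSA features stand in for the paper's topic-sentiment representation:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

reviews = ["great pasta, slow service", "friendly staff, bland food",
           "amazing pizza and quick service", "rude waiter, tasty dessert"]
X = TfidfVectorizer().fit_transform(reviews)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(Z)
print(gmm.predict_proba(Z))  # soft memberships: one user may span groups
```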

7.
In this paper, we develop a genetic algorithm method based on a latent semantic model (GAL) for text clustering. The main difficulty in applying genetic algorithms (GAs) to document clustering is the thousands, or even tens of thousands, of dimensions in the feature space typical of textual data, because the most straightforward and popular representation is the vector space model (VSM), in which each unique term in the vocabulary is one dimension. Latent semantic indexing (LSI) is a successful technique in information retrieval that explores the latent semantics implied by a query or a document by representing it in a dimension-reduced space; LSI also takes into account the effects of synonymy and polysemy, constructing a semantic structure over the textual data. GAs are search techniques that can efficiently evolve the optimal solution in the reduced space. We propose a variable-string-length genetic algorithm that automatically evolves the proper number of clusters while providing a near-optimal clustering of the data set. Used in conjunction with the reduced latent semantic structure, the GA improves clustering efficiency and accuracy. The superiority of the GAL approach over a conventional GA applied to the VSM is demonstrated by good clustering results on Reuters documents.
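A sketch of the LSI reduction that the GA then searches in; the documents and the choice of two latent dimensions are illustrative, and the GA itself is only noted in comments:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["oil prices rise on supply fears", "crude oil output cut",
        "team wins championship final", "coach praises winning team"]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)  # VSM: one dim per term
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)  # LSI space
# A variable-string-length GA would now evolve chromosomes of candidate
# centroids in Z (so the number of clusters is evolved too), scoring each
# chromosome with a cluster-validity fitness function.
print(Z.round(2))
```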

8.
A Replicated Experiment to Assess Requirements Inspection Techniques   (total citations: 4; self-citations: 2; citations by others: 2)
This paper presents the independent replication of a controlled experiment which compared three defect detection techniques (Ad Hoc, Checklist, and Defect-based Scenario) for software requirements inspections, and evaluated the benefits of collection meetings after individual reviews. The results of our replication were partially different from those of the original experiment. Unlike the original experiment, we did not find any empirical evidence of better performance when using scenarios. To explain these negative findings we provide a list of hypotheses. On the other hand, the replication confirmed one result of the original experiment: the defect detection rate is not improved by collection meetings. The independent replication was made possible by the existence of an experimental kit provided by the original investigators. We discuss what difficulties we encountered in applying the package to our environment, as a result of different cultures and skills. Using our results, experience, and suggestions, other researchers will be able to improve the original experimental design before attempting further replications.

9.
The popularity of GPS-equipped gadgets and mapping mashup applications has driven the growth of geotagged Web resources and georeferenced multimedia applications. More and more research attention has been paid to mining collaborative knowledge from mass user-contributed geotagged content, yet little attention has been paid to generating high-quality geographical clusters, an important preliminary data-cleaning step for most geographical mining work. Previous work mainly uses geotags to derive geographical clusters, but a single channel of information is not sufficient to generate distinguishable clusters, especially when location ambiguity occurs. In this paper, we propose a two-level clustering framework that utilizes both the spatial and the semantic features of photographs. In the first-level geo-clustering phase, we cluster geotagged photographs according to their spatial ties, roughly partitioning the dataset in an efficient way. We then leverage the textual semantics in the photographs' annotations to refine the grouping results in the second-level semantic clustering phase. To effectively measure the semantic correlation between photographs, we propose a semantic enhancement method and a new term weighting function, along with a method for automatically determining the parameters of the second-level spectral clustering process. Evaluation of our implementation on a real georeferenced photograph dataset shows that our algorithm performs well, producing distinguishable geographical clusters with high accuracy and mutual information.
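A compact sketch of the two levels under assumed parameters: haversine DBSCAN for the geo pass, then spectral clustering on a tag-similarity affinity inside one geo cluster (the paper's semantic enhancement and term weighting are not reproduced):

```python
import numpy as np
from sklearn.cluster import DBSCAN, SpectralClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Level 1: spatial ties (haversine metric expects radians; eps ~ 500 m).
coords = np.radians([[48.8584, 2.2945], [48.8589, 2.2950], [48.8606, 2.3376]])
geo = DBSCAN(eps=0.5 / 6371.0, min_samples=1, metric="haversine").fit_predict(coords)

# Level 2: refine one geo cluster by annotation semantics.
tags = ["eiffel tower night lights", "tour eiffel sunset", "louvre museum paintings"]
A = cosine_similarity(TfidfVectorizer().fit_transform(tags))
sem = SpectralClustering(n_clusters=2, affinity="precomputed",
                         random_state=0).fit_predict(A)
print(geo, sem)
```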

10.
A possibilistic approach was initially proposed for c-means clustering. Although the possibilistic approach is sound, the algorithm tends to find identical clusters. To overcome this shortcoming, the possibilistic fuzzy c-means algorithm (PFCM) was proposed, which produces memberships and possibilities simultaneously, along with the cluster centers. PFCM addresses the noise sensitivity of fuzzy c-means (FCM) and overcomes the coincident-cluster problem of possibilistic c-means (PCM). Here we propose a new model, kernel-based hybrid c-means clustering (KPFCM), which extends PFCM by adopting a kernel-induced metric in the data space in place of the original Euclidean norm. The kernel function makes it possible to cluster data that are linearly non-separable in the original space into homogeneous groups in the transformed high-dimensional space. In our experiments, we found that different kernels and kernel widths lead to different clustering results, so a key point is to choose an appropriate kernel width; we also propose a simple approach to determine appropriate values for it. The performance of the proposed method is extensively compared with several state-of-the-art clustering techniques over a test suite of artificial and real-life data sets. Based on computer simulations, we show that our model gives better results than the previous models.
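The kernel-induced metric has a simple closed form for a Gaussian kernel; a sketch follows, where the width heuristic shown is one common choice and not necessarily the paper's:

```python
import numpy as np

def kernel_dist_sq(x, v, sigma):
    """||phi(x) - phi(v)||^2 = K(x,x) - 2K(x,v) + K(v,v) = 2(1 - K(x,v))
    for a Gaussian kernel, since K(z,z) = 1."""
    k_xv = np.exp(-np.sum((x - v) ** 2) / (2 * sigma ** 2))
    return 2.0 * (1.0 - k_xv)

def width_heuristic(X):
    # One common default: average distance of the samples to the data mean.
    return np.mean(np.linalg.norm(X - X.mean(axis=0), axis=1))
```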

11.
Short-text clustering faces three main challenges: sparse feature keywords, the cost of operating in a high-dimensional space, and the interpretability of the resulting clusters. To address them, this paper proposes a semantically enhanced K-means algorithm for short-text clustering. The algorithm represents each short text as a set of words, which alleviates the keyword-sparsity problem; it obtains initial cluster centers by mining the maximal frequent word sets of the short-text collection, which overcomes the sensitivity of K-means to initial centers and makes the clusters interpretable; and it measures inter-document similarity by a semantic similarity combined with TF-IDF weights, which avoids computation in the high-dimensional space. Experimental results show that this semantics-based short-text clustering algorithm outperforms traditional short-text clustering algorithms.
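A toy sketch of the seeding idea: instead of mining maximal frequent word sets (not reproduced here), the documents with the greatest TF-IDF mass stand in as initial centers for K-means:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["cheap flight deals today", "discount flight tickets",
         "nba finals score tonight", "basketball playoff score"]
X = TfidfVectorizer().fit_transform(texts).toarray()
seeds = np.argsort(-X.sum(axis=1))[:2]   # stand-in for frequent-word-set centers
km = KMeans(n_clusters=2, init=X[seeds], n_init=1).fit(X)
print(km.labels_)
```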

12.
Ground glass opacity (GGO) pulmonary nodules are hard to segment accurately because of their low boundary contrast, varied sizes, and inhomogeneous gray levels. This paper proposes a segmentation algorithm that combines edge-sensitive SLIC with secondary density clustering. Edge-detection results are combined with the SLIC superpixel algorithm, and for superpixel blocks that contain edges, the original cluster center is replaced by the region centroid, which improves SLIC's otherwise poor boundary adherence. Because density clustering alone cannot segment a GGO completely, a secondary density clustering step is proposed: density clustering is applied a second time to the cluster located by the first pass and to its neighboring clusters. Experimental results show that the algorithm segments GGOs with an average accuracy of 90.17% and a sensitivity of 84%.

13.
The event detection problem, which is closely related to clustering, has received much attention for textual documents. However, although image clustering has been treated extensively in both Content-Based Image Retrieval (CBIR) and Text-Based Image Retrieval (TBIR) systems, event detection within image management is a relatively new area. With this in mind, we propose a novel approach for event extraction and clustering of images that takes into account textual annotations, time, and geographical position. Our goal is to develop a clustering method based on the fact that an image may belong to an event cluster, and we stress the necessity of event clustering and cluster extraction algorithms that are both scalable and suitable for online applications. To achieve this, we extend a well-known clustering algorithm called Suffix Tree Clustering (STC), originally developed to cluster text documents using document snippets. The idea is to treat an image along with its annotation as a document, further extended with time and geographical position so that the contextual information of each image is captured during clustering. This has proved particularly useful for images gathered from online photo-sharing applications such as Flickr. Hence, our STC-based approach is aimed at the challenges of capturing contextual information from Flickr images and extracting related events. We evaluate our algorithm on several annotated datasets, mainly gathered from Flickr; as part of this evaluation we investigate and compare the effects of different parameters, such as time and space granularities, and we evaluate the performance of our algorithm in mining events from image collections. Our experimental results clearly demonstrate the effectiveness of our STC-based algorithm in extracting and clustering events.
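The contextual extension amounts to folding discretized time and position into each image's pseudo-document so an STC-style algorithm can match on them like ordinary words; a sketch with assumed granularities:

```python
from datetime import datetime

def context_tokens(tags, taken, lat, lon, t_gran="day", cell_deg=0.1):
    """Append time and space tokens to the annotation tokens; the two
    granularities are tunable, as in the evaluation described above."""
    fmt = {"day": "%Y-%m-%d", "month": "%Y-%m"}[t_gran]
    geo = f"cell_{round(lat / cell_deg)}_{round(lon / cell_deg)}"
    return tags + [f"t_{taken.strftime(fmt)}", geo]

print(context_tokens(["eiffel", "tower"], datetime(2009, 7, 14, 21), 48.86, 2.29))
```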

14.
王靖 《计算机应用研究》2020,37(10):2951-2955,2960
Keywords extracted from texts of the same class take many different forms, and their similarity and relatedness relations are fuzzy. To address this, a text feature extraction method based on hierarchical clustering of words is proposed. Considering that words shared between texts contribute to text similarity, the method combines word similarity and word relatedness into a semantic distance, applies hierarchical clustering driven by that distance, and assigns different weights to the resulting clusters, finally producing a vector space model with cluster weights in which both words and clusters serve as feature units. Word vectors trained with word2vec provide the text similarity, and, exploiting the algorithmic characteristics of the Skip-Gram + Huffman Softmax model, pointwise mutual information is used to measure word relatedness accurately. Text classification experiments show that, compared with the commonly used approach of single-level clustering by similarity alone followed by counting, the proposed method improves the accuracy of text feature extraction.
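The relatedness half of the semantic distance is plain pointwise mutual information; a sketch from a word-word co-occurrence matrix (the word2vec similarity half and the cluster weighting are not reproduced):

```python
import numpy as np

def pmi(cooc, i, j):
    """PMI(wi, wj) = log( p(wi, wj) / (p(wi) * p(wj)) ), estimated from
    a symmetric co-occurrence count matrix."""
    total = cooc.sum()
    p_ij = cooc[i, j] / total
    p_i, p_j = cooc[i].sum() / total, cooc[:, j].sum() / total
    return np.log(p_ij / (p_i * p_j))

cooc = np.array([[0, 8, 1], [8, 0, 2], [1, 2, 0]])  # toy counts
print(pmi(cooc, 0, 1))
```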

15.
As the use of Open Source Software (OSS) systems increases in the corporate environment, it is important to examine the maintenance process of these projects. OSS projects allow end users to directly submit reports in case of any operational issues, and timely resolution of these defect reports requires effective management of maintenance resources. This study analyzes the usefulness of the textual content of defect reports as an early indicator of their resolution time. Text mining techniques are used to categorize the defect reports of five OSS projects. Significant variation in defect resolution time among the resulting categories, for each of the sample projects, indicates that a text-based classification of defect reports can be useful for early assessment of resolution time, before source-code-level analysis. Such a technique can assist in allocating sufficient maintenance resources to targeted defects and enable project teams to manage customer expectations regarding defect resolution times.
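A toy sketch of the idea: cluster defect reports on their text and compare resolution times across the resulting categories (the data, category count, and features are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

reports = ["crash on startup, null pointer", "button label misaligned",
           "crash when saving file", "typo in settings dialog"]
days_to_resolve = np.array([30, 4, 27, 2])
X = TfidfVectorizer().fit_transform(reports)
cat = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for c in np.unique(cat):  # per-category mean resolution time
    print(c, days_to_resolve[cat == c].mean())
```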

16.
A multiparadigm approach is developed and demonstrated for exploiting knowledge about structure to extract information from noisy textual data. A motivating example of a potential application is an address encoding system for a delivery service such as UPS, Federal Express, or the United States Post Office. The approach combines aspects of database organization and clustering of records, fuzzy parsing, fuzzy retrieval, an aggregation algebra, and measures of both performance and accuracy. Fuzzy retrieval, in the form of set and fuzzy operators, is accomplished by considering each symbol of the input text to be imperfect and retrieving inexactly matching records from the database that clear a particular threshold value. The set of low-level database operators constrains the cardinality and accuracy of retrievals. A hierarchical method of clustering the database is defined, whereby the records are partitioned such that similar records fall in the same cluster; this clustering strategy is guaranteed to be mutually exclusive and to completely cover the data records. Associated with these clusters is an algebra that combines clusters of data into one window of ranked data, and a set of fuzzy measures is defined to aggregate and rank sets of records.
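A minimal stand-in for the threshold-based fuzzy retrieval step, using difflib's ratio as the similarity measure (the paper's own fuzzy operators, clustering, and algebra are not reproduced):

```python
from difflib import SequenceMatcher

def fuzzy_retrieve(query, records, threshold=0.85):
    """Return records whose similarity to the noisy query clears the
    threshold, treating every input symbol as potentially imperfect."""
    sim = lambda a, b: SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return [r for r in records if sim(query, r) >= threshold]

print(fuzzy_retrieve("123 Main Stret", ["123 Main Street", "456 Oak Avenue"]))
```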

17.
Most current information retrieval systems rely solely on lexical item repetition, which is notorious for its vulnerability. In this research, we propose a novel method for the extraction of salient textual patterns, identifying how individual sentences in a text fit together to be perceived as a salient pattern. One of our major objectives is to move away from keywords and their associated limitations in textual information retrieval. We describe a text network, arising from a connectionist model, that exhibits textual continuity and facilitates the dynamic extraction of salient textual segments by capturing semantics from two categories of natural language: lexical cohesion and contextual coherence. We also present the results of an empirical study designed to compare our model with the performance of human judges in identifying salient textual patterns. The preliminary results show that our model has the potential for automatic discovery of salient patterns in text.
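A crude lexical-cohesion-only approximation of the sentence network: connect sentences by TF-IDF cosine similarity and score salience by connectivity (the model's contextual-coherence component is not captured here):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["The reactor output dropped sharply.",
             "Engineers traced the drop to a coolant leak.",
             "The leak was sealed and output recovered.",
             "Lunch was served in the cafeteria."]
S = cosine_similarity(TfidfVectorizer(stop_words="english").fit_transform(sentences))
np.fill_diagonal(S, 0.0)
salience = S.sum(axis=1)      # how strongly each sentence ties into the text
print(np.argsort(-salience))  # most connected sentences first
```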

18.
张群  王红军  王伦文 《计算机科学》2016,43(Z11):443-446, 450
Short texts have insufficient feature information and are high-dimensional and sparse, so traditional text clustering algorithms perform poorly on short-text clustering tasks. This paper proposes a short-text clustering algorithm that incorporates contextual semantics. First, borrowing the notions of centrality and authority from social network analysis, a feature-word weighting scheme incorporating contextual semantics is designed, and a term-document matrix is built on this basis. The matrix is then factored by singular value decomposition, mapping the original feature space into a low-dimensional latent semantic space. Finally, an improved K-means algorithm clusters the short texts in that low-dimensional space. Experimental results show that, compared with traditional text clustering based on term-frequency and inverse-document-frequency weights, the algorithm effectively mitigates the feature sparseness and high dimensionality of short texts and improves clustering quality.

19.
State-of-the-art text clustering methods suffer from the huge size of document collections with high-dimensional features. In this paper, we study fast SOM clustering technology for text, focusing on how to improve the efficiency of a text clustering system while keeping clustering quality high. To achieve this, we separate the system into two stages, offline and online. To make the system more efficient, feature extraction and semantic quantization are done offline. Although neurons are represented as numerical vectors in a high-dimensional space, documents are represented as collections of important keywords, unlike in many related works, so the time and space requirements of the offline stage are reduced. Based on this scenario, we propose fast clustering techniques for the online stage, including a way to project documents onto the SOM output layer, a fast similarity computation method, and an incremental clustering scheme for real-time processing. We tested the system on different datasets; the results demonstrate that our approach is much more efficient while achieving clustering quality comparable to traditional methods.
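For reference, the numeric core of an online SOM is sketched below; the paper's offline keyword representation and projection tricks are not reproduced, and the grid size and decay schedules are assumptions:

```python
import numpy as np

def train_som(X, rows=5, cols=5, iters=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Online SOM: pull the best-matching unit and its grid neighbours
    toward each sample, shrinking the radius and rate over time."""
    rng = np.random.default_rng(seed)
    W = rng.random((rows, cols, X.shape[1]))
    grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"))
    for t in range(iters):
        x = X[rng.integers(len(X))]
        bmu = np.unravel_index(np.linalg.norm(W - x, axis=2).argmin(), (rows, cols))
        frac = 1.0 - t / iters
        lr, sigma = lr0 * frac, sigma0 * frac + 1e-3
        h = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=2) / (2 * sigma ** 2))
        W += lr * h[..., None] * (x - W)
    return W  # documents are then projected onto their nearest unit
```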

20.
The goal of text clustering is to group documents with similar content together while separating documents with different content. Many clustering algorithms have been developed to meet the needs of different domains. However, the inherent complexity of text data (its volume, high dimensionality, and sparsity) makes clustering massive text collections a hard problem. This paper proposes a hierarchical non-negative matrix factorization (NMF) clustering method. It retains the advantages of NMF, such as identifying document categories and their essential features simultaneously, while also exposing the hierarchical structure among categories. Such a category hierarchy is useful in applications such as web page preview. Experimental results on the real-world datasets 20Newsgroups and Reuters-RCV1 show that hierarchical NMF is more effective than existing methods.
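A sketch of the recursive idea: rank-2 NMF splits the collection by dominant factor and then recurses, yielding a binary category hierarchy (the depth, stopping rule, and initialization are assumptions, not the paper's exact algorithm):

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def hier_nmf(X, ids, depth=2, min_docs=2):
    """Return a nested list of document ids: each level is one rank-2 split."""
    if depth == 0 or len(ids) < 2 * min_docs:
        return ids.tolist()
    W = NMF(n_components=2, init="nndsvd", random_state=0).fit_transform(X[ids])
    side = W.argmax(axis=1)  # dominant factor decides the branch
    return [hier_nmf(X, ids[side == 0], depth - 1, min_docs),
            hier_nmf(X, ids[side == 1], depth - 1, min_docs)]

docs = ["oil prices rise", "crude oil supply", "team wins final",
        "coach praises team", "oil demand grows", "playoff final score"]
X = TfidfVectorizer().fit_transform(docs)
print(hier_nmf(X, np.arange(len(docs))))
```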
