Similar literature
1.
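郭一村, 陈华辉. 《计算机应用》, 2021, 41(4): 1106-1112
In today's large-scale data retrieval tasks, learning-to-hash methods learn compact binary codes that save storage space and allow similarity in Hamming space to be computed quickly, so approximate nearest neighbor retrieval often relies on hashing to speed up nearest neighbor search. Most current hashing methods train offline in batch mode and cannot adapt to the data changes that may arise in large-scale streaming environments, which lowers retrieval efficiency. To address this, online hashing methods have been proposed that learn adaptive hash functions, so that learning continues as data arrive and the result can be applied to similarity retrieval in real time. This survey first explains the basic principles of learning to hash and the intrinsic requirements for online hashing. It then introduces the different learning schemes of online hashing from the perspectives of how streaming data are read, how learning proceeds, and how the model is updated. Next, it divides online hashing algorithms into six categories: those based on passive-aggressive algorithms, matrix factorization, unsupervised clustering, similarity supervision, mutual information measures, and codebook supervision, and analyzes the strengths, weaknesses, and characteristics of each. Finally, it summarizes and discusses development directions for online hashing.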
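The abstract above relies on similarity computed in Hamming space between binary codes. The following is a minimal sketch of that one step only, assuming each code is stored as a Python integer; the function names and sample codes are hypothetical and are not taken from the surveyed paper.

```python
from typing import Sequence


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")


def nearest_by_hamming(query: int, codes: Sequence[int]) -> int:
    """Return the index of the stored code closest to the query."""
    return min(range(len(codes)), key=lambda i: hamming_distance(query, codes[i]))


# Hypothetical 8-bit codes for three stored items.
codes = [0b10110010, 0b10110011, 0b01001100]
closest = nearest_by_hamming(0b10110111, codes)
```

A linear scan like this is only illustrative; how the hash functions that produce the codes are learned and updated online is the subject of the survey and is not shown here.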

2.
Self-organizing maps (SOM) have been applied to numerous data clustering and visualization tasks and have received much attention for their success. One major shortcoming of the classical SOM learning algorithm is the need for a predefined map topology. Furthermore, hierarchical relationships among data are difficult to find. Several approaches have been devised to overcome these deficiencies. In this work, we propose a novel SOM learning algorithm which incorporates several text mining techniques to expand the map both laterally and hierarchically. When training on a set of text documents, the proposed algorithm first clusters them using the classical SOM algorithm. We then identify the topics of each cluster. These topics are then used to evaluate the criteria for expanding the map. The major characteristic of the proposed approach is that it combines the learning process with the text mining process, which makes it suitable for automatic organization of text documents. We applied the algorithm to the Reuters-21578 dataset in text clustering and categorization tasks. Our method outperforms two comparison models in hierarchy quality according to users' evaluation. It also achieves better F1-scores than two other models in the text categorization task.

3.
This paper describes an intelligent information system for effectively managing huge amounts of online text documents (such as Web documents) in a hierarchical manner. The organizational capabilities of this system are able to evolve semi-automatically with minimal human input. The system starts with an initial taxonomy in which documents are automatically categorized, and then evolves so as to provide a good indexing service as the document collection grows or its usage changes. To this end, we propose a series of algorithms that utilize text-mining technologies such as document clustering, document categorization, and hierarchy reorganization. In particular, clustering and categorization algorithms have been intensively studied in order to provide evolving facilities for hierarchical structures and categorization criteria. Through experiments using the Reuters-21578 document collection, we evaluate the performance of the proposed clustering and categorization methods by comparing them to those of well-known conventional methods.

4.
String alignment for automated document versioning   Total citations: 2, self-citations: 2, citations by others: 0
The automated analysis of documents is an important task given the rapid increase in availability of digital texts. Automatic text processing systems often encode documents as vectors of term occurrence frequencies, a representation which facilitates the classification and clustering of documents. Historically, this approach derives from the related field of data mining, where database entries are commonly represented as points in a vector space. While this lineage has certainly contributed to the development of text processing, there are situations where document collections do not conform to this clustered structure, and where the vector representation may be unsuitable for text analysis. As a proof-of-concept, we had previously presented a framework where the optimal alignments of documents could be used for visualising the relationships within small sets of documents. In this paper we develop this approach further by using it to automatically generate the version histories of various document collections. For comparison, version histories generated using conventional methods of document representation are also produced. To facilitate this comparison, a simple procedure for evaluating the accuracy of the version histories thus generated is proposed.

5.
The self-organizing maps (SOM) introduced by Kohonen implement two important operations: vector quantization (VQ) and a topology-preserving mapping. In this paper, an online self-organizing topological tree (SOTT) with faster learning is proposed. A new learning rule delivers efficiency and topology preservation superior to those of other SOM structures. The computational complexity of the proposed SOTT is O(log N) rather than O(N) as for the basic SOM. The experimental results demonstrate that the reconstruction performance of SOTT is comparable to that of the full-search SOM, and that its computation time is much shorter than that of the full-search SOM and other vector quantizers. In addition, SOTT delivers hierarchical mapping of codevectors together with progressive transmission and decoding, properties that other vector quantizers rarely support at the same time. To circumvent the shortcomings in clustering performance of classical partition clustering algorithms, a hybrid clustering algorithm that fully exploits the online learning and multiresolution characteristics of SOTT is devised. A new linkage metric is proposed which can be updated online to accelerate the time-consuming agglomerative hierarchical clustering stage. Besides the enhanced clustering performance, due to the online learning capability, the memory requirement of the proposed SOTT hybrid clustering algorithm is independent of the size of the data set, making it attractive for large databases.

6.
The Self-Organizing Map (SOM) algorithm has been utilized, with much success, in a variety of applications for the automatic organization of full-text document collections. A great advantage of the SOM method is that document collections can be ordered in such a way that documents with similar content are positioned at nearby locations of the 2-dimensional SOM lattice. The resulting ordered map thus presents a general view of the document collection which helps the exploration of information contained in the whole document space. The most notable example of such an application is the WEBSOM method, where the document collection is ordered onto a map by utilizing word category histograms to represent the documents' data vectors. In this paper, we introduce the LSISOM method, which resembles WEBSOM in the sense that the document maps are generated from word category histograms rather than simple histograms of the words. However, a major difference between the two methods is that in WEBSOM the word category histograms are formed using statistical information of short word contexts, whereas in LSISOM these histograms are obtained from the SOM clustering of the Latent Semantic Indexing representation of document terms.

7.
A map of text documents arranged using the Self-Organizing Map (SOM) algorithm (1) is organized in a meaningful manner so that items with similar content appear at nearby locations of the 2-dimensional map display, and (2) clusters the data, resulting in an approximate model of the data distribution in the high-dimensional document space. This article describes how a document map that is automatically organized for browsing and visualization can also be successfully utilized in speeding up document retrieval. Furthermore, experiments on the well-known CISI collection [3] show significantly improved performance compared to Salton's vector space model, measured by average precision (AP) when retrieving a small, fixed number of best documents. Regarding comparison with Latent Semantic Indexing, the results are inconclusive. This revised version was published online in August 2006 with corrections to the Cover Date.

8.
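Clustering algorithms, as unsupervised learning methods for discovering the intrinsic structure and distribution of data, are widely applied in many fields. With the rapid development of the Internet and the sharp increase in the number of online documents, text clustering has become an important task. This paper discusses the basic concepts and application scenarios of text clustering algorithms and surveys text clustering algorithms and their evaluation methods.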

9.
This paper proposes a projection-based symmetrical factorisation method for extracting semantic features from collections of text documents stored in a Latent Semantic space. Preliminary experimental results demonstrate that this yields a representation comparable to that provided by a novel probabilistic approach which reconsiders the entire indexing problem of text documents and works directly in the original high-dimensional vector-space representation of text. The projection index employed is derived here from the a priori constraints on the problem. The principal advantage of this approach is computational efficiency, which is obtained by exploiting Latent Semantic Indexing as a preprocessing stage. Simulation results on subsets of the 20-Newsgroups text corpus in various settings are provided. This revised version was published online in August 2006 with corrections to the Cover Date.

10.
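郑毅. 《信息安全与技术》, 2012, 3(10): 56-58, 62
Once a user obtains access rights to digital documents in an enterprise information system, whether by legitimate or illegitimate means, the user can pass the documents on to others without restriction by downloading, copying, or network transfer, leaking confidential enterprise information. Sharing and confidentiality are therefore in sharp conflict. Integrating a document security management system built on DRM technology is a fast and effective way to protect the confidentiality and integrity of information stored by networked enterprises. This paper analyzes the current state and problems of information security for unstructured digital documents in enterprises, studies the technical principles of using DRM to protect digital documents against information leakage in both online and offline use, and proposes an embedded architecture design that integrates, through secondary development, a specialized DRM-based confidential document protection product into an enterprise's existing information system.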

11.
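李钊, 李晓, 王春梅, 李诚, 杨春. 《计算机科学》, 2016, 43(1): 246-250, 269
In text clustering, the similarity measure is an important factor affecting clustering quality. Commonly used similarity measures, such as Euclidean distance and the correlation coefficient, capture only low-order correlations between texts. Because relationships between texts are very complex, clustering based on low-order correlation measures performs poorly. Some text clustering methods based on more complex measures have been proposed, but as data scale grows the computational cost of text clustering keeps increasing, and traditional clustering methods are no longer suitable for large-scale text clustering. To address these problems, a distributed clustering method based on MapReduce is proposed. The method improves the traditional K-means algorithm by adopting a similarity measure based on information loss. To further improve clustering efficiency, the method is combined with MapReduce-based principal component analysis to reduce the dimensionality of the text feature vectors. Case analysis shows that the proposed large-scale text clustering method achieves better clustering performance than existing clustering methods.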

12.
Pattern Recognition, 2014, 47(2): 758-768
Sentiment analysis, which detects the subjectivity or polarity of documents, is one of the fundamental tasks in text data analytics. Recently, the number of documents available online and offline has been increasing dramatically, and preprocessed text data have more features. This development makes effective analysis more complex. This paper proposes a novel semi-supervised Laplacian eigenmap (SS-LE). The SS-LE removes redundant features effectively by decreasing detection errors of sentiments. Moreover, it enables visualization of documents in a perceptible low-dimensional embedded space, providing a useful tool for text analytics. The proposed method is evaluated on a multi-domain review data set in sentiment visualization and classification, in comparison with other dimensionality reduction methods. SS-LE provides a better similarity measure in the visualization result by separating positive and negative documents properly. Sentiment classification models trained on data reduced by SS-LE show higher accuracy. Overall, experimental results suggest that SS-LE has the potential to be used to visualize documents for ease of analysis and to train a predictive model in sentiment analysis. SS-LE can also be applied to any other partially annotated text data sets.

13.
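Fuzzy clustering of Web documents based on hierarchical neural networks   Total citations: 2, self-citations: 1, citations by others: 1
A multi-layer vector space model is presented that logically divides the information in a document into several relatively independent text segments and sets the index term weights according to the position of each segment. A concise and effective fuzzy clustering algorithm based on a hierarchical neural network is then proposed. Unlike existing methods, this fuzzy clustering method is implemented as a three-layer neural network composed of two parts, a self-organizing neural network and a fuzzy clustering network. The self-organizing neural network first produces an initial clustering from the raw data, and the FCM method is then used to optimize the number of initial clusters. Experimental results show that the proposed Web document clustering algorithm clusters well: it groups the Web documents related to a topic into one class fairly completely and accurately.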

14.
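A method for self-generating concept spaces   Total citations: 5, self-citations: 2, citations by others: 5
This paper proposes a method for automatically generating a concept space. Texts are first clustered with a SOM neural network. Concepts reflecting the content of each class of texts are then extracted from the result and used to label the text classes. Fuzzy clustering then abstracts and generalizes the concepts automatically to form a concept space, which is used for text management. SOM is itself an unsupervised learning method: once its parameters are set, training automatically produces a map between the text space and the concept space. Experiments and their results show that the concept space classifies and manages texts well and facilitates text retrieval.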

15.
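A query expansion algorithm for text-clustering-based search engines   Total citations: 2, self-citations: 0, citations by others: 2
Most current research on search engines based on text clustering offers no in-depth solution for queries that fall into the small clusters produced by clustering. To address this problem, a clustering-based query expansion algorithm is proposed. The algorithm uses a cluster relationship tree structure to improve the similarity formula, extracts topic words from the target cluster and runs a second query, and then clusters the query results with the K-medians clustering algorithm in order to expand them. The entire procedure runs offline so that online computation does not slow query response, and experiments verify the effectiveness of the algorithm.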

16.
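许伟佳. 《数字社区&智能家居》, 2009, 5(9): 7281-7283, 7286
Document clustering plays an important role in Web text mining and is an application of cluster analysis to text processing. This paper introduces text representation based on the vector space model, and analyzes and optimizes the evaluation function for feature term weights in the vector space model so that distance-based similarity measures become more accurate. It focuses on the partition-based k-means algorithm that is widely used in Web document clustering. To address the weakness that k-means selects its initial cluster centers at random, it describes in detail an approach that applies the max-min distance principle, combined with sampling, to stabilize the choice of initial cluster centers and improve the clustering results.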
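The max-min distance principle named in the abstract above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the paper's procedure: it assumes Euclidean distance, takes the first point as the starting center, and omits the sampling step; the names `euclidean` and `maximin_centers` and the sample points are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def maximin_centers(points, k):
    """Pick k initial centers by the max-min distance principle."""
    centers = [points[0]]  # assumption: start from the first point
    while len(centers) < k:
        # Choose the point whose distance to its nearest chosen
        # center is largest, i.e. the point farthest from all centers.
        farthest = max(
            points,
            key=lambda p: min(euclidean(p, c) for c in centers),
        )
        centers.append(farthest)
    return centers


# Hypothetical 2-D document vectors.
points = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 5)]
initial = maximin_centers(points, 3)
```

Because the choice is deterministic given the starting point, repeated runs of k-means begin from the same centers, which is the stability the abstract refers to.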

17.
We present a method for the classification of multi-labeled text documents explicitly designed for data stream applications that must process a virtually infinite sequence of data using constant memory and constant processing time. Our method is composed of an online procedure used to efficiently map text into a low-dimensional feature space and a partition of this space into a set of regions for which the system extracts and keeps statistics used to predict multi-label text annotations. Documents are fed into the system as a sequence of words, mapped to a region of the partition, and annotated using the statistics computed from the labeled instances colliding in the same region. This approach is referred to as clashing. We illustrate the method on real-world text data, comparing the results with those obtained using other text classifiers. In addition, we provide an analysis of the effect of the representation space dimensionality on the predictive performance of the system. Our results show that the online embedding indeed approximates the geometry of the full corpus-wise TF and TF-IDF space. The model obtains competitive F measures with respect to the most accurate methods, using significantly fewer computational resources. In addition, the method achieves a higher macro-averaged F measure than methods with similar running time. Furthermore, the system is able to learn faster than the other methods from partially labeled streams.

18.
Document clustering has been recognized as a central problem in text data management. Such a problem becomes particularly challenging when document contents are characterized by subtopical discussions that are not necessarily relevant to each other. Existing methods for document clustering have traditionally assumed that a document is an indivisible unit for text representation and similarity computation, which may not be appropriate to handle documents with multiple topics. In this paper, we address the problem of multi-topic document clustering by leveraging the natural composition of documents in text segments that are coherent with respect to the underlying subtopics. We propose a novel document clustering framework that is designed to induce a document organization from the identification of cohesive groups of segment-based portions of the original documents. We empirically give evidence of the significance of our segment-based approach on large collections of multi-topic documents, and we compare it to conventional methods for document clustering.

19.

Text document clustering is used to separate a collection of documents into several clusters such that the documents in a cluster are substantially similar to one another and distinct from documents in other clusters. The high-dimensional sparse document term matrix reduces the efficiency of the clustering process. This study proposes a new way of clustering documents using domain ontology and WordNet ontology. The main objective of this work is to increase cluster output quality. This work aims to investigate and examine the method of selecting feature dimensions to minimize the features of the document term matrix. The sports documents are clustered using conventional K-Means with the dimension reduction features selection process and density-based clustering. A novel approach named ontology-based document clustering is proposed for grouping the text documents. Three critical steps were used in order to develop this technique. The initial step for an ontology-based clustering approach starts with data pre-processing, and the characteristics of the DR method are reduced with the Info-Gain collection. The documents are clustered using two clustering methods: K-Means and Density-Based clustering with DR Feature Selection Process. These methods validate the findings of ontology-based clustering, and this study compared them using the measurement metrics. The second step of this study examines the sports field ontology development and describes the principles and relationship of the terms using sports-related documents. The semantic web rational process is used to test the ontology for validation purposes. An algorithm for the synonym retrieval of the sports domain ontology terms has been proposed and implemented. The retrieved terms from the documents and sport ontology concepts are mapped to the retrieved synonym set words from the WordNet ontology. The suggested technique is based on synonyms of mapped concepts. The proposed ontology approach employs the reduced feature set in order to cluster the text documents. The results are compared with two traditional approaches on two datasets. The proposed ontology-based clustering approach is found to be effective in clustering the documents with high precision, recall, and accuracy. In addition, this study also compared the different RDF serialization formats for sports ontology.

20.
Strategy-based interactive cluster visualization for information retrieval   Total citations: 1, self-citations: 0, citations by others: 1
In this paper we investigate a general purpose interactive information organization system. The system organizes documents by placing them into 1-, 2-, or 3-dimensional space based on their similarity and a spring-embedding algorithm. We begin by developing a method for estimating the quality of the organization when it is applied to a set of documents returned in response to a query. We show how the relevant documents tend to clump together in space. We proceed by presenting a method for measuring the amount of structure in the organization and explain how this knowledge can be used to refine the system. We also show that increasing the dimensionality of the organization generally improves its quality, albeit only by a small amount. We introduce two methods for modifying the organization based on information obtained from the user and show how such feedback improves the organization. All the analysis is done offline without direct user intervention. Received: 21 December 1998 / Revised: 30 May 1999
