Similar Documents
1.
Research in text categorization has been confined to whole-document-level classification, probably due to a lack of full-text test collections. However, the full-length documents available today in large quantities pose renewed interest in text classification. A document is usually written in an organized structure to present its main topic(s). This structure can be expressed as a sequence of subtopic text blocks, or passages. In order to reflect the subtopic structure of a document, we propose a new passage-level or passage-based text categorization model, which segments a test document into several passages, assigns categories to each passage, and merges the passage categories into the document categories. Compared with traditional document-level categorization, two additional steps, passage splitting and category merging, are required in this model. Using four subsets of the Reuters text categorization test collection and a full-text test collection whose documents vary from tens to hundreds of kilobytes, we evaluate the proposed model, especially the effectiveness of various passage types and the importance of passage location in category merging. Our results show that simple windows are best for all test collections tested in these experiments. We also found that passages contribute to the main topic(s) to different degrees, depending on their location in the test document.
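A minimal sketch of the passage-based pipeline described in the abstract, assuming a pre-trained per-passage classifier; the `classify` callable, window sizes, and location weights are illustrative placeholders, not the paper's tuned settings:

```python
# Passage-level categorization: split into window passages, classify
# each, then merge passage categories into document categories.
from collections import defaultdict

def split_into_windows(words, size=50, step=25):
    """Segment a document into overlapping fixed-size word windows."""
    return [words[i:i + size] for i in range(0, max(len(words), 1), step)]

def categorize(words, classify, threshold=1.0):
    """classify(passage) -> {category: score}; merge with location decay."""
    merged = defaultdict(float)
    for pos, passage in enumerate(split_into_windows(words)):
        # Earlier passages count more, reflecting the finding that a
        # passage's contribution to the main topic depends on its location.
        weight = 1.0 / (1 + pos)
        for cat, score in classify(passage).items():
            merged[cat] += weight * score
    return {c for c, s in merged.items() if s >= threshold}
```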

2.
In this paper, we present a query-driven indexing/retrieval strategy for efficient full-text retrieval from large document collections distributed within a structured P2P network. Our indexing strategy is based on two important properties: (1) the generated distributed index stores posting lists for carefully chosen indexing term combinations that are frequently present in user queries, and (2) posting lists containing too many document references are truncated to a bounded number of their top-ranked elements. These two properties guarantee acceptable latency and bandwidth requirements, essentially because the number of indexing term combinations remains scalable and the posting lists transmitted during retrieval never exceed a constant size. A novel index update mechanism efficiently handles the addition of new documents to the collection. Thus, the generated distributed index corresponds to a constantly evolving query-driven indexing structure that efficiently follows the current information needs of the users and changes in the document collection. We show that the size of the index and the generated indexing/retrieval traffic remain manageable even for Web-size document collections, at the price of a marginal loss in precision for rare queries. Our theoretical analysis and experimental results provide convincing evidence of the feasibility of the query-driven indexing strategy for large-scale P2P text retrieval.
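A sketch of the two properties in code, under stated assumptions: posting lists are created only for term combinations observed in queries and are truncated to their top-k elements. The class name, `score_fn`, and the retrieval-key heuristic are invented for illustration:

```python
# Query-driven index: posting lists keyed by query term combinations,
# each truncated to TOP_K entries to bound transfer size.
import heapq
from itertools import combinations

TOP_K = 100          # bound on posting-list size
MAX_COMBO = 2        # index single terms and term pairs

class QueryDrivenIndex:
    def __init__(self):
        self.postings = {}   # term combination -> [(score, doc_id), ...]

    def observe_query(self, terms, score_fn, corpus):
        """Build truncated posting lists for combinations seen in a query."""
        for r in range(1, MAX_COMBO + 1):
            for combo in combinations(sorted(set(terms)), r):
                if combo in self.postings:
                    continue
                scored = ((score_fn(combo, d), d) for d in corpus)
                # keep only the top-k ranked documents
                self.postings[combo] = heapq.nlargest(TOP_K, scored)

    def retrieve(self, terms):
        # a real system would pick the best indexed combination; this
        # simply keys on the first MAX_COMBO sorted query terms
        combo = tuple(sorted(set(terms)))[:MAX_COMBO]
        return self.postings.get(combo, [])
```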

3.
Carr, L.A., De Roure, D., Hall, W., Hill, G. World Wide Web, 1998, 1(2): 61–71
Links are the key element for turning a text into a hypertext, and yet the WWW provides limited linking facilities. Modeled on Open Hypermedia research, the Distributed Link Service provides an independent system of link services for the World Wide Web and allows authors to create configurable navigation pathways for collections of WWW resources. This is achieved by adding links to documents as they are delivered from a WWW server, and by allowing users to choose the sets of links that they will see according to their interests. This paper describes the development of the link service, the facilities it adds for users of the WWW, and its specific use in an Electronic Libraries project.

4.
5.
A Related-Link Extraction Method Based on Link Blocks
Every web page contains a large number of hyperlinks, including both related links and many noise links. This paper proposes a related-link extraction method based on link blocks. First, the page is divided into blocks according to HTML tags, and the links extracted from each block form a number of link blocks. Second, exploiting features such as related links appearing together in blocks and the anchor text of related links sharing words with the page title, a method combining rules and statistics extracts the related-link blocks from all link blocks. Test results show a precision above 85% and a recall of about 70%, demonstrating that the method is effective.
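A rough sketch of the two steps under stated assumptions: pages are split into blocks on structural HTML tags, and a block is kept as a related-link block if it is link-dense and its anchor texts share words with the page title. The tag list, regexes, and thresholds are illustrative, not the paper's exact rules:

```python
# Link-block extraction: block the page on HTML tags, then keep blocks
# whose grouped links overlap lexically with the page title.
import re

BLOCK_TAGS = re.compile(r"</?(?:div|table|ul|td|tr|p)[^>]*>", re.I)
ANCHOR = re.compile(r'<a[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.I | re.S)

def extract_related_links(html, title, min_links=3, min_overlap=0.2):
    title_words = set(title.lower().split())
    related = []
    for block in BLOCK_TAGS.split(html):
        links = ANCHOR.findall(block)
        if len(links) < min_links:          # related links occur in groups
            continue
        overlap = sum(1 for _, text in links
                      if title_words & set(re.sub(r"<[^>]+>", "", text).lower().split()))
        if overlap / len(links) >= min_overlap:  # anchor text shares title words
            related.extend(links)
    return related
```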

6.
We present a new approach based on anagram hashing to globally handle lexical variation in large and noisy text collections. The lexical variation addressed by spelling correction systems is primarily typographical variation. This is typically handled in a local fashion: given one particular text string, some system of retrieving near-neighbors is applied, where near-neighbors are other text strings that differ from the particular string by a given number of characters. The difference in characters between the original string and one of its retrieved near-neighbors constitutes a particular character confusion. We present a global way of performing this action: for all possible character confusions given a particular edit distance, we sequentially identify all the pairs of text strings in the text collection that display a particular confusion. We work on large digitized corpora, which contain lexical variation due to both the OCR process and typographical or typesetting errors, and show that all these types of variation can be handled equally well in the framework we present. The character confusion-based prototype of Text-Induced Corpus Clean-up (ticcl) is compared to its focus word-based counterpart and evaluated on 6 years' worth of digitized Dutch Parliamentary documents. The character confusion approach is shown to gain an order of magnitude in speed over its word-based counterpart on large corpora. Insights gained about the useful contribution of global corpus variation statistics are shown to also benefit the more traditional word-based approach to spelling correction. Final tests on a held-out set comprising the 1918 edition of the Dutch daily newspaper 'Het Volk' show that the system is not sensitive to domain variation.
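A sketch of the anagram-hashing idea, assuming the commonly described key function key(w) = Σ ord(c)^N over the characters of w; ticcl's exact key and exponent may differ. Because the key is order-independent and additive, two strings related by one character confusion differ by a constant key delta, so all pairs displaying that confusion can be found globally in one pass:

```python
# Global character-confusion detection via anagram keys.
from collections import defaultdict

N = 5

def anagram_key(word):
    return sum(ord(c) ** N for c in word)

def confusion_pairs(vocabulary, wrong, right):
    """Find all word pairs in the corpus vocabulary displaying one
    particular confusion, e.g. OCR reading 'in' where 'm' was printed."""
    delta = anagram_key(right) - anagram_key(wrong)
    by_key = defaultdict(list)
    for w in vocabulary:
        by_key[anagram_key(w)].append(w)
    pairs = []
    for key, words in by_key.items():
        for candidate in by_key.get(key + delta, []):
            # verify the confusion actually applies character-wise
            pairs.extend((w, candidate) for w in words
                         if wrong in w and w.replace(wrong, right, 1) == candidate)
    return pairs
```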

7.
This paper analyzes the latent semantic model, studies the representation of texts in latent semantic space, and proposes a retrieval strategy for large text collections. The retrieval process consists of two steps: coarse-grained elimination of non-relevant texts, followed by precise retrieval over the relevant ones. A latent semantic space model first screens the collection and discards non-relevant texts; a large-scale text retrieval method then performs precise retrieval over the relevant texts at the passage level, with a genetic algorithm introduced into the retrieval algorithm to improve execution efficiency; finally, the candidate passage numbers are output. Experimental results demonstrate the effectiveness and efficiency of this approach.
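A minimal sketch of the coarse filtering step, assuming a standard truncated-SVD latent semantic space with query fold-in; the rank `k` and similarity `cutoff` are illustrative, and the genetic-algorithm precise-retrieval stage is not shown:

```python
# Coarse non-relevant elimination in latent semantic space: project the
# collection and the query via truncated SVD and keep only documents
# above a similarity cutoff, handing survivors to precise retrieval.
import numpy as np

def lsa_filter(term_doc, query_vec, k=100, cutoff=0.1):
    """term_doc: terms x docs matrix; query_vec: raw query term vector."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k]
    q_hat = (query_vec @ Uk) / sk           # standard LSA query fold-in
    docs = Vk.T                             # one row per document
    sims = docs @ q_hat / (np.linalg.norm(docs, axis=1)
                           * np.linalg.norm(q_hat) + 1e-12)
    return np.flatnonzero(sims >= cutoff)   # indices of surviving documents
```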

8.
This paper is concerned with the control of a manipulator with n degrees of freedom by one actuator and n − 1 brakes using fuzzy inference. All the links of the manipulator are excited by a motor placed on the base link, and the excitation torque is transmitted successively through each link, from the base link to the final link. The brakes on the joints act at an identical frequency with different phases, and the timing phases of the brakes are controlled by fuzzy inference such that all the joint angles follow the desired trajectories. The effectiveness of the proposed method is illustrated by a simple example.

9.
Developing a comprehensive explanation of complex social phenomena is a difficult task that analysts often have to perform using vast collections of text documents. On the one hand, solutions exist to assist analysts in creating causal maps from text documents, but these can only articulate the relationships at work in a problem. On the other hand, Fuzzy Cognitive Maps (FCMs) can articulate these relationships and perform simulations, but no environment exists to help analysts iteratively develop FCMs from text. In this paper, we detail the design and implementation of the first tool that allows analysts to develop FCMs from text collections, using interactive visualizations. We make three contributions: (i) we combine text mining and FCMs, (ii) we implement the first visual analytics environment built on FCMs, and (iii) we promote a strong feedback loop between interactive data exploration and model building. We provide two case studies exemplifying how to create a model from the ground up or improve an existing one. Limitations include the increase in display complexity when working with large collections of files, and the reliance on KL-divergence for ad-hoc retrieval. Several improvements are discussed to further support analysts in creating high-quality models through interactive visualizations.
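For readers unfamiliar with the simulation side, a minimal sketch of an FCM update step using the common sigmoid formulation (the squashing function, convergence test, and any example map are generic FCM conventions, not this tool's specific implementation):

```python
# Fuzzy Cognitive Map simulation: concepts hold activation in [0, 1];
# W[i, j] is the causal weight from concept i to concept j.
import numpy as np

def fcm_step(state, W):
    return 1.0 / (1.0 + np.exp(-(state @ W)))   # sigmoid squashing

def simulate(state, W, steps=50, tol=1e-4):
    """Iterate until the map reaches a fixed point or the step budget."""
    for _ in range(steps):
        new = fcm_step(state, W)
        if np.max(np.abs(new - state)) < tol:    # converged
            break
        state = new
    return state
```

Running `simulate` from different initial states is what lets an analyst test "what if" scenarios against the causal structure extracted from the text.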

10.
11.
Recent advances in the field of kernel-based machine learning allow fast processing of text using string kernels based on suffix arrays. kernlab provides both infrastructure for kernel methods and a large collection of already implemented algorithms, including an implementation of suffix-array-based string kernels. Together with the text mining infrastructure provided by tm, these packages give R functionality for processing, visualizing, and grouping large collections of text data using kernel methods. The emphasis is on the performance of various types of string kernels at these tasks.
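To make the kernel idea concrete, here is a naive sketch of the p-spectrum string kernel, the simplest of the string-kernel family; note this is a plain Python illustration of the mathematics, not kernlab's suffix-array-based R implementation, which computes the same quantity far more efficiently:

```python
# p-spectrum kernel: the inner product of two strings in the feature
# space indexed by all length-p substrings (counts of shared p-grams).
from collections import Counter

def spectrum(s, p):
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def spectrum_kernel(s, t, p=3):
    cs, ct = spectrum(s, p), spectrum(t, p)
    return sum(cs[k] * ct[k] for k in cs.keys() & ct.keys())
```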

12.
Very large hypermedia collections, say many thousands of documents and links, are often unmanageable. Users can have considerable difficulty finding particular information and comprehending its extent and scope. Support for explicit hierarchical structure of the collection, such as the Yahoo classification scheme, helps users by suggesting specific paths through the information space. Such hierarchical structure can also be visualized, either independently from, or in association with, the associative hyperlinks between individual documents. Furthermore, the availability of rich metadata fields for collections and individual documents, including attributes such as author, title, size, keywords, and creation date, permits enhancing visualizations by mapping attribute values to aspects of the visual presentation. The author discusses visualization techniques that were developed for the Hyperwave information server.

13.
Abstract

The current generation of hypertext systems suffers from two limitations: the systems are static in nature, and they do not support the automated creation of links very well. Because of the effort involved in manually creating links, the hyperbases created with these systems are seldom modified, even when they are found not to fully support the requirements of the intended users. This paper studies the development of automated tools to aid in the process of link creation, browsing, and link refinement. Only relation links are considered in this study. The automated tools help in three major stages of developing and using hypertext applications: (a) during authoring, to generate a set of relation links between pairs of nodes; (b) during browsing, to recommend an optimal set of starting nodes for users to begin browsing, and to guide users at each stage of browsing by suggesting a set of “next” nodes to traverse; and (c) during training, to modify, remove, and add links based on collected user feedback data. The training results in long-term changes to the hypertext structure.

To test the effectiveness of the training process objectively, a navigator is built to simulate the browsing activities of users. The effects of training have been evaluated on two text collections using a variety of objective measures. The results indicate that the training process improves the effectiveness of the hyperbase in supporting browsing.
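A sketch of the authoring and training stages under stated assumptions: relation links are generated between node pairs whose term vectors are similar enough, and link weights are adjusted from user feedback. The similarity measure, thresholds, and update rule are illustrative stand-ins, not the paper's exact scheme:

```python
# Authoring: generate relation links by node-pair similarity.
# Training: nudge link weights from feedback and prune weak links.
import math

def cosine(u, v):
    num = sum(u[t] * v[t] for t in set(u) & set(v))
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def generate_links(nodes, threshold=0.3):
    """nodes: {node_id: {term: weight}} -> {(a, b): strength}"""
    ids = sorted(nodes)
    return {(a, b): sim
            for i, a in enumerate(ids) for b in ids[i + 1:]
            if (sim := cosine(nodes[a], nodes[b])) >= threshold}

def train(links, feedback, rate=0.1, drop=0.05):
    """feedback: {(a, b): +1 traversed and useful, -1 not useful}."""
    for pair, signal in feedback.items():
        links[pair] = links.get(pair, 0.0) + rate * signal
    return {p: w for p, w in links.items() if w > drop}  # long-term pruning
```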

14.
This article presents a new method for the kinematics of hyper-redundant spatial robots, called the virtual link method. The method is based on dividing the robot into sub-systems consisting of three or four links. Virtual links are established between the first and last nodes of each sub-system, yielding a hypothetical robot with reduced degrees of freedom. The method then solves the kinematics of the virtual link system, followed by the sub-systems. The virtual link method also provides a singularity-free solution by predicting the occurrence of singularities and taking appropriate action. An approaching singularity is reflected by a drop in the determinant of the system Jacobian below a threshold value. In this situation, the method either chooses a different virtual link configuration or switches to a displacement distribution scheme. © 1996 John Wiley & Sons, Inc.
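A minimal sketch of the singularity guard described above, assuming a square sub-system Jacobian; the determinant threshold and the `fallback` callable (standing in for re-choosing the virtual-link configuration or the displacement distribution scheme) are illustrative:

```python
# Singularity guard: solve the virtual-link system normally, but switch
# strategy when |det J| drops below a threshold.
import numpy as np

DET_THRESHOLD = 1e-3

def solve_virtual_links(jacobian, task_velocity, fallback):
    J = np.asarray(jacobian, dtype=float)
    if abs(np.linalg.det(J)) < DET_THRESHOLD:
        # approaching singularity: choose another virtual-link
        # configuration or redistribute displacement among sub-systems
        return fallback(task_velocity)
    return np.linalg.solve(J, task_velocity)
```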

15.
Here, an approach to describe the dynamics of 2D and 3D mechanisms that leads to a simple and fast analysis is presented. Any position of an elastic mechanism may be defined using only rotations. Because finite angles in space are not vectors, the Euler–Rodrigues parameters were adopted to describe the 3D rotations. The assembling process of the links into the whole mechanism is natural and follows the standard finite element scheme. Lagrange multipliers are used only to impose the closed-loop conditions. The approach developed in this work can be applied to 3D mechanisms composed of beams and having internal hinges as kinematic pairs, but it can be generalized to other link shapes. The method is used either for mechanisms with rigid links or with elastic ones, even for very deformable links that completely change the initial configuration. Both the floating and the absolute reference frame approach may be used, depending on the problem; passing from one formulation to the other is quite natural. If the links may be considered beams, the method starts from the exact equations written for the deformed shape of each link, and this provides good accuracy. In this paper, a special finite element is presented, the unknowns of the problem being the nodal rotations or nodal Euler–Rodrigues parameters. Few nodes are required for good accuracy. In general, as the number of degrees of freedom per node is smaller than in the classical finite element approach, an important reduction of the total number of nodal unknowns is obtained, leading to an important reduction of the computing time. The Euler–Bernoulli beam model was adopted, but the implementation of the Timoshenko beam model, which takes shear efforts into account, is not difficult.
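For reference, a sketch of how the Euler–Rodrigues parameters (e0, e1, e2, e3) represent a 3D rotation: e0 = cos(θ/2) and (e1, e2, e3) = sin(θ/2)·axis, with e0² + e1² + e2² + e3² = 1. This is the standard parameterization the abstract relies on, not code from the paper:

```python
# Rotation via Euler–Rodrigues parameters, avoiding the non-vector
# nature of finite 3D angles.
import numpy as np

def rotation_matrix(e0, e1, e2, e3):
    """Rotation matrix from unit Euler–Rodrigues parameters."""
    return np.array([
        [e0*e0 + e1*e1 - e2*e2 - e3*e3, 2*(e1*e2 - e0*e3), 2*(e1*e3 + e0*e2)],
        [2*(e1*e2 + e0*e3), e0*e0 - e1*e1 + e2*e2 - e3*e3, 2*(e2*e3 - e0*e1)],
        [2*(e1*e3 - e0*e2), 2*(e2*e3 + e0*e1), e0*e0 - e1*e1 - e2*e2 + e3*e3],
    ])

def from_axis_angle(axis, theta):
    axis = np.asarray(axis, dtype=float)
    axis /= np.linalg.norm(axis)
    e0 = np.cos(theta / 2.0)
    e1, e2, e3 = np.sin(theta / 2.0) * axis
    return rotation_matrix(e0, e1, e2, e3)
```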

16.
In the context of information retrieval (IR) from text documents, the term weighting scheme (TWS) is a key component of the matching mechanism when using the vector space model. In this paper, we propose a new TWS that is based on computing the average term occurrences of terms in documents and that also uses a discriminative approach based on the document centroid vector to remove less significant weights from the documents. We call our approach Term Frequency With Average Term Occurrence (TF-ATO). An analysis of commonly used document collections shows that test collections are not fully judged, as achieving that is expensive and may be infeasible for large collections. A document collection being fully judged means that every document in the collection acts as a relevant document to a specific query or a group of queries. The discriminative approach used in our proposal is a heuristic method for improving IR effectiveness and performance, and it has the advantage of not requiring previous knowledge about relevance judgements. We compare the performance of the proposed TF-ATO to the well-known TF-IDF approach and show that using TF-ATO results in better effectiveness in both static and dynamic document collections. In addition, this paper investigates the impact that stop-word removal and our discriminative approach have on TF-IDF and TF-ATO. The results show that both stop-word removal and the discriminative approach have a positive effect on both term-weighting schemes. More importantly, it is shown that using the proposed discriminative approach is beneficial for improving IR effectiveness and performance with no information on the relevance judgements for the collection.
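A sketch of the scheme as the abstract describes it, under stated assumptions: a term's weight is its frequency normalized by the term's average occurrence across the documents containing it, and the discriminative step zeroes weights not exceeding the corresponding centroid component. The exact normalization and cutoff in the paper may differ:

```python
# TF-ATO sketch: term frequency / average term occurrence, followed by
# centroid-based removal of less significant weights.
from collections import Counter

def tf_ato_weights(docs):
    """docs: list of token lists -> list of {term: weight} per document."""
    tfs = [Counter(d) for d in docs]
    totals, dfs = Counter(), Counter()
    for tf in tfs:
        for t, c in tf.items():
            totals[t] += c
            dfs[t] += 1
    # average occurrence of each term over the documents containing it
    ato = {t: totals[t] / dfs[t] for t in totals}
    weights = [{t: c / ato[t] for t, c in tf.items()} for tf in tfs]
    # discriminative step: drop weights at or below the centroid value
    centroid = {t: sum(w.get(t, 0.0) for w in weights) / len(weights)
                for t in totals}
    return [{t: v for t, v in w.items() if v > centroid[t]} for w in weights]
```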

17.
Collections of information resources are one of the major directions in the development of digital libraries, which have seen significant progress in recent years. Collections are the most common form in which information resources are organized in such systems. Given the large potential of existing technologies and the diversity of information resources, the characteristics of collections vary greatly. Nevertheless, there are characteristics common to all collections, and they must be taken into account when developing a collection. In this paper, methodological issues of collection development are considered. The paper explores the most important common properties of information resource collections; systematization techniques involved in creating collections; problems of the genesis of collections; the role of metadata; peculiarities of scientific collections; as well as promising information technologies and standards for the creation, support, and usage of collections.

18.
19.
20.
Clustering of related or similar objects has long been regarded as a potentially useful way of helping users navigate an information space such as a document collection. Many clustering algorithms and techniques have been developed and implemented, but as the sizes of document collections have grown, these techniques have not scaled to large collections because of their computational overhead. To solve this problem, the proposed system concentrates on an interactive text clustering methodology: probability-based, topic-oriented, and semi-supervised document clustering. Since web pages and many other documents contain both text and large numbers of images, the proposed system also applies content-based image retrieval (CBIR) for image clustering to complement the document clustering approach. It suggests two kinds of indexing keys, major colour sets (MCS) and distribution block signatures (DBS), to prune away images irrelevant to a given query image. Major colour sets carry colour information, while distribution block signatures carry spatial information. After successively applying these filters to a large database, only a small number of high-potential candidates somewhat similar to the query image remain. The system then uses the quad modelling (QM) method to set the initial weights of two-dimensional cells in the query image according to each major colour, and retrieves further similar images through a similarity association function over those weights. The proposed system evaluates efficiency by implementing and testing the clustering results with the DBSCAN and K-means clustering algorithms. Experiments show that the proposed document clustering algorithm performs with an average efficiency of 94.4% across various document categories.
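A sketch of the first pruning stage under stated assumptions: a major colour set filter keeps only database images sharing enough dominant colour bins with the query, before any spatial (DBS) or quad-model (QM) comparison. The quantization granularity, set size, and overlap threshold are illustrative:

```python
# MCS pruning: quantize colours, take each image's dominant colour
# bins, and keep only images overlapping enough with the query's MCS.
from collections import Counter

def major_colour_set(pixels, levels=4, top=8):
    """Quantize RGB pixels and return the set of dominant colour bins."""
    step = 256 // levels
    bins = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    return {colour for colour, _ in bins.most_common(top)}

def mcs_filter(query_pixels, database, min_shared=0.5):
    """database: {image_id: pixel list}; keep images sharing enough MCS."""
    q = major_colour_set(query_pixels)
    return [img for img, px in database.items()
            if len(q & major_colour_set(px)) >= min_shared * len(q)]
```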
