Similar literature
 20 similar documents found
1.
A Knowledge-Based Approach to Effective Document Retrieval   Cited by: 3 (self-citations: 0, citations by others: 3)
This paper presents a knowledge-based approach to effective document retrieval. This approach is based on a dual document model that consists of a document type hierarchy and a folder organization. A predicate-based document query language is proposed to enable users to precisely and accurately specify the search criteria and their knowledge about the documents to be retrieved. A guided search tool is developed as an intelligent natural language oriented user interface to assist users in formulating queries. Supported by an intelligent question generator, an inference engine, a question base, and a predicate-based query composer, the guided search collects the most important information known to the user to retrieve the documents that satisfy the user's particular interests. A knowledge-based query processing and search engine is devised as the core component in this approach. Algorithms are developed for the search engine to effectively and efficiently retrieve the documents that match the query.

2.
TEXPROS (TEXt PROcessing System) is an automatic document processing system which supports text-based information representation and manipulation, conveying meanings from stored information within office document texts. A dual modeling approach is employed to describe office documents and support document search and retrieval. The frame templates for representing document classes are organized to form a document type hierarchy. Based on its document type, the synopsis of a document is extracted to form its corresponding frame instance. According to user-predefined criteria, these frame instances are stored in different folders, which are organized as a folder organization (i.e., a repository of frame instances associated with their documents). The concept of linking folders establishes filing paths for automatically filing documents in the folder organization. By integrating the document type hierarchy and the folder organization, the dual modeling approach provides efficient frame instance access by limiting the searches to those frame instances of a document type within those folders which appear to be the most similar to the corresponding queries. This paper presents an agent-based document filing system using folder organization. A storage architecture is presented to incorporate the document type hierarchy, folder organization and original document storage into a three-level storage system. This folder organization supports an effective filing strategy and allows rapid frame instance searches by confining the search to the actual predicate-driven retrieval method. A predicate specification is proposed for specifying criteria on filing paths in terms of user-predefined predicates for governing the document filing. A method for evaluating whether a given frame instance satisfies the criteria of a filing path is presented. The basic operations for constructing and reorganizing a folder organization are proposed.

3.
This paper presents a predicate-driven document filing system for organizing and automatically filing documents. A document model consists of two basic elements: frame templates representing document classes, and folders which are repositories of frame instances. The frame templates can be organized to form a document type hierarchy, which helps classify and file documents. Frame instances are grouped into a folder on the basis of user-defined criteria, specified as predicates which determine whether a frame instance belongs to a folder. Folders can naturally be organized into a folder organization which represents the user's real-world document filing system. The predicate consistency problem is discussed to eliminate two abnormalities from a folder organization: inapplicable edges (filing paths) and redundant folders. An evaluating net (including an association dictionary, an instantiation component and a production system) is then proposed for evaluating whether a frame instance satisfies the predicate of a folder during document filing. The concept of consistency of a rule base is also discussed. This work was supported by the Separately Budgeted Research (SBR) grant (No. 421190) from New Jersey Institute of Technology and the Systems Integration Program grant from AT&T Foundation.
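The filing idea in this abstract can be illustrated with a minimal Python sketch. It is not the paper's evaluating net: the `Folder` class, the `file_instance` function, the dictionary representation of a frame instance, and all folder names and predicates are hypothetical, and the folder organization is assumed to be a tree (in a general digraph a folder reachable by two filing paths would need a visited check).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: a frame instance is modeled as a plain attribute dictionary.
Frame = Dict[str, object]


@dataclass
class Folder:
    name: str
    predicate: Callable[[Frame], bool]  # decides whether a frame belongs here
    children: List["Folder"] = field(default_factory=list)
    instances: List[Frame] = field(default_factory=list)


def file_instance(folder: Folder, frame: Frame) -> List[str]:
    """File `frame` into `folder` and into every descendant whose predicate holds.

    Returns the names of the folders that received the frame instance.
    """
    if not folder.predicate(frame):
        return []
    folder.instances.append(frame)
    filed = [folder.name]
    for child in folder.children:  # follow each outgoing filing path
        filed.extend(file_instance(child, frame))
    return filed


# Hypothetical folder organization (a tree rooted at "all").
root = Folder("all", lambda f: True)
memos = Folder("memos", lambda f: f.get("type") == "memo")
urgent = Folder("urgent_memos", lambda f: f.get("priority") == "high")
letters = Folder("letters", lambda f: f.get("type") == "letter")
memos.children.append(urgent)
root.children.extend([memos, letters])

filed_in = file_instance(root, {"type": "memo", "priority": "high"})
print(filed_in)  # ['all', 'memos', 'urgent_memos']
```

Because a child folder is only tested after its parent accepts the frame, the effective membership condition of a folder is the conjunction of the predicates along its filing path.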

4.
This paper formally specifies a document model for office information systems, including formal definitions of document types (frame templates), a document type hierarchy, folders, and folder organizations. Folder Organizations are defined using predicates and directed graphs. A Reconstruction Problem for folder organizations is then formulated; viz., under what circumstances it is possible to reconstruct a folder organization from its folder level predicates. The Reconstruction Problem is solved in terms of such graph-theoretic concepts as Associated Digraphs, transitive closure, and redundant/nonredundant filing paths. A Transitive Closure Inversion algorithm is then presented which efficiently recovers a Folder Organization digraph from its Associated Digraph. This work was supported in part by the National Science Foundation under Grant No. IRI-9224602, by the New Jersey Institute of Technology under Grant No. 421280 and by a grant from AT&T Foundation.
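The abstract does not define the Associated Digraph or spell out the Transitive Closure Inversion algorithm, so the following Python sketch substitutes a standard technique: the transitive reduction of a transitively closed acyclic digraph. It assumes, purely for illustration, that the input behaves like the transitive closure of an acyclic folder organization; the function name and the example graph are hypothetical.

```python
from typing import Dict, Set


def transitive_reduction(closure: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Drop redundant edges from a transitively closed DAG.

    `closure[u]` is the set of all nodes reachable from `u`. An edge u -> v
    is redundant when some other descendant w of u also reaches v.
    """
    reduced = {}
    for u, descendants in closure.items():
        redundant = set()
        for w in descendants:
            redundant |= closure.get(w, set())
        reduced[u] = descendants - redundant
    return reduced


# Hypothetical closed digraph: root reaches a, b and c; a and b both reach c.
closure = {
    "root": {"a", "b", "c"},
    "a": {"c"},
    "b": {"c"},
    "c": set(),
}
organization = transitive_reduction(closure)
print(organization["root"] == {"a", "b"})  # True: root -> c was redundant
```

The reduction is unique for acyclic digraphs, which is one reason a reconstruction question of this kind can have a well-defined answer.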

5.
Much information is nowadays stored electronically in document bases. Users retrieve information from these document bases by browsing and querying. While a large number of tools are available, not much work has been done on tools that support queries involving all the characteristics of documents as well as the use of domain knowledge during the search for information. In this paper we propose a query language that allows for querying documents using content information, information about the logical structure of the documents as well as information about properties of the documents. Domain knowledge is taken into account during the search as well. We also present an architecture for a system supporting such a language and we describe a prototype implementation together with test results.

6.
The creation and deployment of knowledge repositories for managing, sharing, and reusing tacit knowledge within an organization has emerged as a prevalent approach in current knowledge management practices. A knowledge repository typically contains vast amounts of formal knowledge elements, which generally are available as documents. To facilitate users' navigation of documents within a knowledge repository, knowledge maps, often created by document clustering techniques, represent an appealing and promising approach. Various document clustering techniques have been proposed in the literature, but most deal with monolingual documents (i.e., written in the same language). However, as a result of increased globalization and advances in Internet technology, an organization often maintains documents in different languages in its knowledge repositories, which necessitates multilingual document clustering (MLDC) to create organizational knowledge maps. Motivated by the significance of this demand, this study designs a Latent Semantic Indexing (LSI)-based MLDC technique capable of generating knowledge maps (i.e., document clusters) from multilingual documents. The empirical evaluation results show that the proposed LSI-based MLDC technique achieves satisfactory clustering effectiveness, measured by both cluster recall and cluster precision, and is capable of maintaining a good balance between monolingual and cross-lingual clustering effectiveness when clustering a multilingual document corpus.

7.
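Research on XML Search Engines   Cited by: 31 (self-citations: 3, citations by others: 28)
The flood of information on the WWW poses a serious challenge for information querying. XML, as an extensible markup language, has many advantages that HTML lacks, which makes deeper applications on the WWW possible. This paper studies the key technologies involved in XML-based search engines and proposes data structures and algorithms for indexing and querying semi-structured XML documents. Without losing the structural information in a document, the approach makes full use of the contextual information carried by XML tags and can substantially improve query precision.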

8.
A new architecture for information retrieval systems is presented. If implemented, this architecture would allow the system to process retrieval statements that are equivalent to fuzzily defined queries. The philosophy underlying the centerpiece of this system, the document search module, is fully explained in this paper. The emphasis is placed on the quick elimination of irrelevant references. A new technique was developed that takes into account the user's knowledge to discriminate between documents before they are actually retrieved from the database. The search technique uses simple computations to select or eliminate potential candidates for retrieval. Qualitatively, this technique avoids the shortcomings both of conventional retrieval techniques and of retrieval systems that accept relevance feedback from the user in order to refine the search process. No implementation details have been included in this article and system performance figures are not discussed.

9.
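Liu Tong, Ni Weijian. Computer Science (《计算机科学》), 2015, 42(10): 275-280, 286
Documents in specialized domains often have pronounced structural features: a document typically consists of several relatively fixed text fields with different expressive functions, and these fields embody relevant domain knowledge. Targeting the structural and domain-specific features of specialized documents, an information retrieval model for structured domain documents is designed. In this model, the domain document collection is first mined to build a structured model that reflects domain knowledge; on this basis, a structured document retrieval algorithm is designed to return relevant domain documents for user queries. An application study was carried out on a typical class of domain documents, agricultural technology prescriptions. The proposed method was compared experimentally with traditional information retrieval methods on a real-world dataset of agricultural technology prescription documents, and a prototype retrieval system for agricultural technology prescriptions was developed.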

10.
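Chinese Information Retrieval Based on Document Instances   Cited by: 2 (self-citations: 0, citations by others: 2)
Traditional information retrieval systems build indexes on keywords and retrieve information with them. These systems suffer from large returned document sets, low precision, and queries that are hard for ordinary users to construct. This paper therefore proposes information retrieval based on document instances: an existing document is used as a sample, and all documents in the document base that are similar to the sample are retrieved. A solution and implementation techniques for document-instance-based Chinese information retrieval are given. Preliminary experimental results indicate that the method is effective.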

11.
With the advance of technology, business offices and organizations together with their clients create a massive amount of administrative documents every day. Administrative documents commonly contain some salient entities such as logos, stamps or seals as the means of their authentication and proprietorship. These salient entities provide quite discriminative information, which can effectively be used for different tasks of document image retrieval, classification and recognition in document-based applications. Thus, proper detection/recognition of these entities in document images increases the performance of such applications in terms of document retrieval, classification, and recognition. To present the state-of-the-art research on the retrieval of administrative document images, this paper surveys administrative document image retrieval in relation to seals and logos. All the available datasets, feature extraction and classification techniques for logo and seal detection/recognition are discussed systematically. The shortcomings of the present technologies for logo- and seal-based document processing are also highlighted. Avenues for future work are also given for the benefit of readers. To the best of the authors' knowledge, there is no survey on administrative document image retrieval and hence the authors hope that this work will be helpful to the researchers of the document analysis community.

12.
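Semantic retrieval is a very promising way to meet the demands for accuracy and user-friendliness in information retrieval. Knowledge documents are annotated with subject terms, and a two-level index structure from tokens → subject terms → knowledge documents is then built. For a user search, the query terms are converted to subject terms, semantic similarity is computed, and the documents are ranked by that similarity. A semantic retrieval system based on knowledge documents has been deployed and used in a corporate group, where the top five results satisfy 90% of all user queries, which indicates that this method is an effective route to semantic retrieval.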
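The two-level index described in this abstract (tokens → subject terms → knowledge documents) can be sketched in Python as follows. This is a minimal illustration, not the deployed system: the document annotations are made up, tokens are obtained by whitespace splitting, and Jaccard overlap between subject-term sets stands in for the semantic similarity measure, which the abstract does not specify.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical annotation: each knowledge document is tagged with subject terms.
doc_topics: Dict[str, Set[str]] = {
    "doc1": {"pump maintenance", "safety"},
    "doc2": {"pump maintenance"},
    "doc3": {"budget planning"},
}

# Level 1: token -> subject terms. Level 2: subject term -> documents.
token_to_topics: Dict[str, Set[str]] = defaultdict(set)
topic_to_docs: Dict[str, Set[str]] = defaultdict(set)
for doc, topics in doc_topics.items():
    for topic in topics:
        topic_to_docs[topic].add(doc)
        for token in topic.split():
            token_to_topics[token].add(topic)


def search(query: str) -> List[Tuple[str, float]]:
    """Map query tokens to subject terms, then rank the documents they reach."""
    query_topics: Set[str] = set()
    for token in query.lower().split():
        query_topics |= token_to_topics.get(token, set())
    candidates: Set[str] = set()
    for topic in query_topics:
        candidates |= topic_to_docs[topic]
    scored = []
    for doc in candidates:
        overlap = len(query_topics & doc_topics[doc])
        union = len(query_topics | doc_topics[doc])
        scored.append((doc, overlap / union))  # Jaccard as a stand-in score
    return sorted(scored, key=lambda item: (-item[1], item[0]))


results = search("pump safety")
print(results)  # [('doc1', 1.0), ('doc2', 0.5)]
```

Routing the query through subject terms means a document can match even when it shares no literal token with the query, as long as it carries a subject term that the query maps to.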

13.
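A Method for Querying Document Databases by Content and Structure   Cited by: 4 (self-citations: 0, citations by others: 4)
Documents have a logical structure: concepts such as titles, chapters, sections, and paragraphs form a document's inherent logic. Different users have different needs when retrieving documents, and how a retrieval system can provide meaningful information has long been a central research task. Combining document structure and content, a new similarity computation method is proposed for retrieving structured documents. The method supports retrieval of document content at multiple granularities, from words and phrases to paragraphs or chapters. A question answering system was implemented on the basis of this method, with Microsoft's Encarta encyclopedia as the test set. Experimental comparison with traditional methods shows that the passages retrieved by this method are more reasonable and more effective.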

14.
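Emergency decision makers often cannot formulate decisions quickly and effectively because knowledge in emergency documents is hard to find and use efficiently. To address this problem, emergency document knowledge is modeled in a structured way from the perspective of knowledge systems engineering, using knowledge element theory, which gives decision makers a new way to use emergency document knowledge quickly and effectively. Metadata are extracted and documents are structured through analysis of the physical structure; a knowledge element extraction method is proposed through analysis of the logical structure; knowledge element navigation links establish associations between knowledge and structured documents to support knowledge reasoning and retrieval; and fine-grained knowledge mining patterns for emergency documents are discussed in depth. Finally, a prototype emergency decision knowledge support system was developed and validated. The results show that the modeling method can effectively address the low efficiency of finding and using emergency document knowledge.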

15.
In this paper, we study the problem of extracting variable-depth "logical document hierarchy" from long documents, namely organizing the recognized "physical document objects" into hierarchical structures. The discovery of logical document hierarchy is the vital step to support many downstream applications (e.g., passage-based retrieval and high-quality information extraction). However, long documents, containing hundreds or even thousands of pages and a variable-depth hierarchy, challenge the existing methods. To address these challenges, we develop a framework, namely Hierarchy Extraction from Long Document (HELD), where we "sequentially" insert each physical object at the proper position on the current tree. Determining whether each possible position is proper or not can be formulated as a binary classification problem. To further improve its effectiveness and efficiency, we study the design variants in HELD, including traversal orders of the insertion positions, heading extraction explicitly or implicitly, tolerance to insertion errors in predecessor steps, and so on. As for evaluations, we find that previous studies ignore the error that the depth of a node is correct while its path to the root is wrong. Since such mistakes may seriously worsen the downstream applications, a new measure is developed for a more careful evaluation. The empirical experiments based on thousands of long documents from the Chinese financial market, the English financial market and English scientific publications show that the HELD model with the "root-to-leaf" traversal order and explicit heading extraction is the best choice to achieve the tradeoff between effectiveness and efficiency, with accuracies of 0.9726, 0.7291 and 0.9578 on the Chinese financial, English financial and arXiv datasets, respectively. Finally, we show that the logical document hierarchy can be employed to significantly improve the performance of the downstream passage retrieval task. In summary, we conduct a systematic study on this task in terms of methods, evaluations, and applications.
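The sequential insertion idea can be sketched in Python under strong assumptions. This is not the HELD model: the candidate positions are assumed to be "new last child of a node on the right-most path", the binary classifier is replaced by a hand-written rule on a hypothetical `level` feature, and the root-to-leaf traversal is assumed to stop at the first rejected position. All names below are illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

BODY = 99  # hypothetical level assigned to non-heading body text


@dataclass
class Node:
    text: str
    level: int  # assumed feature: 0 = root, 1 = top heading, BODY = paragraph
    children: List["Node"] = field(default_factory=list)


def rightmost_path(root: Node) -> List[Node]:
    """Nodes from the root down to the most recently inserted leaf."""
    path = [root]
    while path[-1].children:
        path.append(path[-1].children[-1])
    return path


def build_tree(objects: List[Tuple[str, int]],
               is_proper_parent: Callable[[Node, Node], bool]) -> Node:
    """Insert objects one by one; the classifier decides each position."""
    root = Node("ROOT", 0)
    for text, level in objects:
        obj = Node(text, level)
        parent = root
        # Root-to-leaf traversal: keep the deepest position accepted so far
        # and stop at the first rejection (an assumption of this sketch).
        for candidate in rightmost_path(root):
            if not is_proper_parent(candidate, obj):
                break
            parent = candidate
        parent.children.append(obj)
    return root


def outline(node: Node, depth: int = 0) -> List[Tuple[int, str]]:
    rows = []
    for child in node.children:
        rows.append((depth + 1, child.text))
        rows.extend(outline(child, depth + 1))
    return rows


# Stand-in for the learned binary classifier: a parent must rank higher.
rule = lambda parent, child: parent.level < child.level

tree = build_tree(
    [("1 Intro", 1), ("text a", BODY), ("1.1 Scope", 2),
     ("text b", BODY), ("2 Methods", 1)],
    rule,
)
print(outline(tree))
```

With this rule, "text a" and "1.1 Scope" both end up under "1 Intro", "text b" under "1.1 Scope", and "2 Methods" back at the top level. In a learned setting the rule would be replaced by a model that scores each (candidate parent, object) pair.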

16.
Machine Learning for Intelligent Processing of Printed Documents   Cited by: 1 (self-citations: 0, citations by others: 1)
A paper document processing system is an information system component which transforms information on printed or handwritten documents into a computer-revisable form. In intelligent systems for paper document processing this information capture process is based on knowledge of the specific layout and logical structures of the documents. This article proposes the application of machine learning techniques to acquire the specific knowledge required by an intelligent document processing system, named WISDOM++, that manages printed documents, such as letters and journals. Knowledge is represented by means of decision trees and first-order rules automatically generated from a set of training documents. In particular, an incremental decision tree learning system is applied for the acquisition of decision trees used for the classification of segmented blocks, while a first-order learning system is applied for the induction of rules used for the layout-based classification and understanding of documents. Issues concerning the incremental induction of decision trees and the handling of both numeric and symbolic data in first-order rule learning are discussed, and the validity of the proposed solutions is empirically evaluated by processing a set of real printed documents.

17.
We describe a prototype system, IKARUS, with which we investigated the potential of integrating web-based documents, shared knowledge bases, and information retrieval for improving knowledge storage and retrieval. As an example, we discuss how to implement both a user manual and an online help system as one system. The following technologies are combined: a web-based design, a frame-based knowledge engine, use of an advanced full-text search engine, and simple techniques to control terminology. We have combined graphical browsing with several unusual forms of text retrieval, for example, to the sentence and paragraph level.

18.
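Finding similar documents quickly and effectively in massive document collections is an important and time-consuming problem. Existing document similarity search algorithms first identify a candidate document set and then rank the candidates by relevance to find the most relevant documents. This paper proposes Hub-N, a similarity search algorithm based on document topology, which converts the document similarity search problem into a graph search problem and applies corresponding pruning techniques to narrow the range of documents scanned and improve search efficiency. Experiments verify the effectiveness and feasibility of the algorithm.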

19.
ACIRD: intelligent Internet document organization and retrieval   Cited by: 6 (self-citations: 0, citations by others: 6)
This paper presents an intelligent Internet information system, Automatic Classifier for the Internet Resource Discovery (ACIRD), which uses machine learning techniques to organize and retrieve Internet documents. ACIRD consists of a knowledge acquisition process, document classifier, and two-phase search engine. The knowledge acquisition process of ACIRD automatically learns classification knowledge from classified Internet documents. The document classifier applies learned classification knowledge to classify newly collected Internet documents into one or more classes. Experimental results indicate that ACIRD performs as well as or better than human experts in both knowledge acquisition and document classification. By using the learned classification knowledge and the given class lattice, the ACIRD two-phase search engine responds to user queries with hierarchically structured navigable results (instead of a conventional flat ranked document list), which greatly aids users in locating information from numerous, diversified Internet documents.

20.