Similar documents
Found 20 similar documents (search time: 31 ms)
1.
This paper presents a predicate-driven document filing system for organizing and automatically filing documents. A document model consists of two basic elements: frame templates representing document classes, and folders which are repositories of frame instances. The frame templates can be organized to form a document type hierarchy, which helps classify and file documents. Frame instances are grouped into a folder on the basis of user-defined criteria, specified as predicates which determine whether a frame instance belongs to a folder. Folders can naturally be organized into a folder organization which represents the user's real-world document filing system. The predicate consistency problem is discussed to eliminate two abnormalities from a folder organization: inapplicable edges (filing paths) and redundant folders. An evaluating net (including an association dictionary, an instantiation component and a production system) is then proposed for evaluating whether a frame instance satisfies the predicate of a folder during document filing. The concept of consistency of a rule base is also discussed. This work was supported by the Separately Budgeted Research (SBR) grant (No. 421190) from New Jersey Institute of Technology and the Systems Integration Program grant from AT&T Foundation.
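The predicate-driven filing idea in this abstract can be illustrated with a minimal sketch. It assumes a frame instance is a plain attribute dictionary and a folder's filing criterion is a Boolean function over it; the folder names, attributes, and the `file_instance` helper are hypothetical and are not taken from the paper, whose evaluating net is considerably richer.

```python
from typing import Callable, Dict, List

Frame = Dict[str, object]            # a frame instance: attribute -> value
Predicate = Callable[[Frame], bool]  # a folder's filing criterion


def file_instance(frame: Frame, folders: Dict[str, Predicate]) -> List[str]:
    """Return the names of all folders whose predicate the frame satisfies."""
    return [name for name, pred in folders.items() if pred(frame)]


# Hypothetical folders and predicates, for illustration only.
folders: Dict[str, Predicate] = {
    "Memos": lambda f: f.get("type") == "memo",
    "From Dean": lambda f: f.get("sender") == "Dean",
    "Budget 1994": lambda f: f.get("topic") == "budget" and f.get("year") == 1994,
}

memo = {"type": "memo", "sender": "Dean", "topic": "budget", "year": 1994}
letter = {"type": "letter", "sender": "Chair", "topic": "hiring", "year": 1994}

print(file_instance(memo, folders))    # folders that accept the memo
print(file_instance(letter, folders))  # folders that accept the letter
```

Note that a frame instance may land in several folders at once, since grouping depends only on the predicates and not on the document type.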

2.
This paper presents a knowledge-based approach to managing and retrieving personal documents. The dual document models consist of a document type hierarchy and a folder organization. The document type hierarchy is used to capture the layout, logical and conceptual structures of documents. The folder organization mimics the user's real-world document filing system for organizing and storing documents in an office environment. Predicate-based representation of documents is formalized for specifying knowledge about documents. Document filing and retrieval are predicate-driven. The filing criteria for the folders, which are specified in terms of predicates, govern the grouping of frame instances, regardless of their document types. We incorporated the notions of document type hierarchy and folder organization into the multilevel architecture of document storage. This architecture supports various text-based information retrieval techniques and content-based multimedia information retrieval techniques. The paper also proposes a knowledge-based query-preprocessing algorithm, which reduces the search space. For automating the document filing and retrieval, a predicate evaluation engine with a knowledge base is proposed. The learning agent is responsible for acquiring the knowledge needed by the evaluation engine.

3.
This paper formally specifies a document model for office information systems, including formal definitions of document types (frame templates), a document type hierarchy, folders, and folder organizations. Folder Organizations are defined using predicates and directed graphs. A Reconstruction Problem for folder organizations is then formulated; viz., under what circumstances it is possible to reconstruct a folder organization from its folder level predicates. The Reconstruction Problem is solved in terms of such graph-theoretic concepts as Associated Digraphs, transitive closure, and redundant/nonredundant filing paths. A Transitive Closure Inversion algorithm is then presented which efficiently recovers a Folder Organization digraph from its Associated Digraph. This work was supported in part by the National Science Foundation under Grant No. IRI-9224602, by the New Jersey Institute of Technology under Grant No. 421280 and by a grant from AT&T Foundation.
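As a rough illustration of removing redundant filing paths, the sketch below computes the standard transitive reduction of an acyclic digraph that is given in transitively closed form. This is a textbook technique swapped in for illustration; it is not the paper's Transitive Closure Inversion algorithm, and treating the input as a plain transitive closure is an assumption. The folder names are hypothetical.

```python
from typing import Dict, Set


def reduce_closure(closure: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Drop every edge u -> v that is implied by a longer path u -> w -> v.

    Assumes `closure` is the transitive closure of a DAG, so an edge is
    redundant exactly when some other successor w of u also reaches v.
    """
    reduced: Dict[str, Set[str]] = {}
    for u, succs in closure.items():
        reduced[u] = {
            v for v in succs
            if not any(v in closure.get(w, set()) for w in succs if w != v)
        }
    return reduced


# Hypothetical folder organization in closed form: "All" reaches everything.
closure = {
    "All": {"A", "B", "C"},
    "A": {"C"},
    "B": {"C"},
    "C": set(),
}
organization = reduce_closure(closure)
print(organization)
```

Here the edge from "All" to "C" is dropped because "C" is already reachable through "A" (and through "B"), leaving only the nonredundant filing paths.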

4.
A Knowledge-Based Approach to Effective Document Retrieval (total citations: 3; self-citations: 0; citations by others: 3)
This paper presents a knowledge-based approach to effective document retrieval. This approach is based on a dual document model that consists of a document type hierarchy and a folder organization. A predicate-based document query language is proposed to enable users to precisely and accurately specify the search criteria and their knowledge about the documents to be retrieved. A guided search tool is developed as an intelligent natural language oriented user interface to assist users in formulating queries. Supported by an intelligent question generator, an inference engine, a question base, and a predicate-based query composer, the guided search collects the most important information known to the user to retrieve the documents that satisfy the user's particular interests. A knowledge-based query processing and search engine is devised as the core component in this approach. Algorithms are developed for the search engine to effectively and efficiently retrieve the documents that match the query.

5.
Sharing sustainable and valuable knowledge among knowledge workers is a fundamental aspect of knowledge management. In organizations, knowledge workers usually have personal folders in which they organize and store needed codified knowledge (textual documents) in categories. In such personal folder environments, providing knowledge workers with needed knowledge from other workers’ folders is important because it increases the workers’ productivity and the possibility of reusing and sharing knowledge. Conventional recommendation methods can be used to recommend relevant documents to workers; however, those methods recommend knowledge items without considering whether the items are assigned to the appropriate category in the target user’s personal folders. In this paper, we propose novel document recommendation methods, including content-based filtering and categorization, collaborative filtering and categorization, and hybrid methods, which integrate text categorization techniques, to recommend documents to the target worker’s personalized categories. Our experimental results show that the hybrid methods outperform the pure content-based and the collaborative filtering and categorization methods. The proposed methods not only proactively notify knowledge workers about relevant documents held by their peers, but also facilitate push-mode knowledge sharing.

6.
This paper proposes an automatic folder allocation system for text documents through the implementation of a hybrid classification method which combines the Bayesian (Bayes) approach and Support Vector Machines (SVMs). Folder allocation for text documents on a computer is typically executed manually by the user. Every time the user creates text documents by using text editors or downloads documents from the internet, and wishes to store these documents on the computer, the user needs to determine and allocate the appropriate folder in which to store these new documents. This situation is inconvenient, as repeating the folder allocation each time a text document is stored becomes tedious, especially when the number and depth of folders are large and the structure is complex and continuously growing. This problem can be overcome by applying machine learning methods to classify the new text documents and allocate the most appropriate folder as the storage for them. In this paper we propose the Bayes-SVMs hybrid classification framework to perform the tedious task of automatically allocating the right folder for text documents on computers.

7.
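With the development of XML technology, how to use existing database technology to store and query XML documents has become a hot research topic in XML data management. This paper introduces a new document encoding method and, based on this encoding, proposes a new XML document storage method. The method decomposes the tree structure of an XML document into nodes according to node type and stores them in the corresponding relational tables; in this way, documents of arbitrary structure can be stored in a fixed relational schema. To make querying easier, the simple path patterns that occur in the documents are also stored as a table. The new storage method effectively supports document query operations and can correctly restore the original XML document from the nodes' encoding information. Finally, the proposed storage method and restoration algorithm are verified experimentally.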

8.
Since documents on the Web are naturally partitioned into many text databases, an efficient document retrieval process requires identifying the text databases that are most likely to provide relevant documents to the query and then searching the identified text databases. In this paper, we propose a neural net based approach to such efficient document retrieval. First, we present a neural net agent that learns about underlying text databases from the user's relevance feedback. For a given query, the neural net agent, which is sufficiently trained on the basis of the BPN learning mechanism, discovers the text databases associated with the relevant documents and retrieves those documents effectively. In order to scale our approach to a large number of text databases, we also propose a hierarchical organization of neural net agents which reduces the total training cost to an acceptable level. Finally, we evaluate the performance of our approach by comparing it to those of the conventional well-known approaches. Received 5 March 1999 / Revised 7 March 2000 / Accepted in revised form 2 November 2000

9.
In this paper, we describe the representation and organization of the knowledge about the infrastructure of storing documents and about the document base itself, which support fast retrieval of documents and information from various documents. Numerous components of the knowledge base of TEXPROS, such as the system catalog, the frame template base and the frame instance base are discussed.

10.
Businesses and people often organize their information of interest (IOI) into a hierarchy of folders (or categories). The personalized folder hierarchy provides a natural way for each of the users to manage and utilize his/her IOI (a folder corresponds to an interest type). Since the interest is relatively long-term, continuous web scanning is essential. It should be directed by precise and comprehensible specifications of the interest. A precise specification may direct the scanner to those spaces that deserve scanning, while a specification comprehensible to the user may facilitate manual refinement, and a specification comprehensible to information providers (e.g. Internet search engines) may facilitate the identification of proper seed sites to start scanning. However, expressing such specifications is quite difficult (and even implausible) for the user, since each interest type is often implicitly and collectively defined by the content (i.e. documents) of the corresponding folder, which may even evolve over time. In this paper, we present an incremental text mining technique to efficiently identify the user's current interest by mining the user's information folders. The specification mined for each interest type specifies the context of the interest type in conjunctive normal form, which is comprehensible to general users and information providers. The specification is also shown to be more precise in directing the scanner to those sites that are more likely to provide IOI. The user may thus maintain his/her folders and then constantly get IOI, without paying much attention to the difficult tasks of interest specification and seed identification.

11.
When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective for document image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page contents area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. Further tests were run on a dataset of magazine pages and on a set of camera-captured document images. To demonstrate the benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page frame, the OCR error rate is reduced from 4.3% to 1.7% on the UW-III dataset. The use of page frame detection in a layout-based document image retrieval application decreases the retrieval error rates by 30%.

12.
NoSQL document stores are well-tailored to efficiently load and manage massive collections of heterogeneous documents without any prior structural validation. However, this flexibility becomes a serious challenge when querying heterogeneous documents, and hence the user has to build complex queries or reformulate existing queries whenever new schemas are introduced in a collection. In this paper we propose a novel approach, based on formal foundations, for building schema-independent queries which are designed to query multi-structured documents. We present a query enrichment mechanism that consults a pre-constructed dictionary. This dictionary binds each possible path in the documents to all its corresponding absolute paths in all the documents. We automate the process of query reformulation via a set of rules that reformulate most document store operators, such as select, project, unnest, aggregate and lookup. We then produce queries across multi-structured documents which are compatible with the native query engine of the underlying document store. To evaluate our approach, we conducted experiments on synthetic datasets. Our results show that the induced overhead can be acceptable when compared to the efforts needed to restructure the data or the time required to execute several queries corresponding to the different schemas inside the collection.
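The path dictionary described in this abstract can be sketched as follows. The sketch assumes documents are nested Python dictionaries and that a "possible path" is any dot-separated suffix of an absolute path; both are simplifying assumptions, and the helper names `absolute_paths`, `build_dictionary`, and `expand` are hypothetical rather than the paper's own.

```python
from typing import Any, Dict, List, Set


def absolute_paths(doc: Dict[str, Any], prefix: str = "") -> List[str]:
    """List every dot-separated path from the document root to a field."""
    paths: List[str] = []
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        paths.append(path)
        if isinstance(value, dict):
            paths.extend(absolute_paths(value, path))
    return paths


def build_dictionary(docs: List[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Bind each path suffix to all absolute paths that end with it."""
    dictionary: Dict[str, Set[str]] = {}
    for doc in docs:
        for path in absolute_paths(doc):
            parts = path.split(".")
            for i in range(len(parts)):
                dictionary.setdefault(".".join(parts[i:]), set()).add(path)
    return dictionary


def expand(path: str, dictionary: Dict[str, Set[str]]) -> List[str]:
    """Return the absolute paths a query on `path` should be rewritten to."""
    return sorted(dictionary.get(path, {path}))


# Two hypothetical documents that store the title at different depths.
docs = [
    {"title": "A", "year": 2000},
    {"info": {"title": "B"}, "year": 2001},
]
dictionary = build_dictionary(docs)
print(expand("title", dictionary))
```

A selection on `title` would then be rewritten as a disjunction over every absolute path returned by `expand`, so that both document structures are matched by one query.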

13.
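A Web Document Classification Method Based on Fuzzy Correlation (total citations: 2; self-citations: 1; citations by others: 1)
雷景生 (Lei Jingsheng), 《计算机工程》 (Computer Engineering), 2005, 31(24): 13-14, 17
Faced with the huge and ever-growing volume of information on the Internet, how to help users obtain interesting and useful information has become an urgent problem in information retrieval. Because Web documents often have uncertain characteristics, fuzzy set theory can be used to model the uncertainty of the information retrieval process. This paper proposes a Web document classification method based on fuzzy correlation techniques. Experimental results show that the method achieves higher classification accuracy than Web classification methods based on the vector space model.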

14.
This paper describes an intelligent information system for effectively managing huge amounts of online text documents (such as Web documents) in a hierarchical manner. The organizational capabilities of this system are able to evolve semi-automatically with minimal human input. The system starts with an initial taxonomy in which documents are automatically categorized, and then evolves so as to provide a good indexing service as the document collection grows or its usage changes. To this end, we propose a series of algorithms that utilize text-mining technologies such as document clustering, document categorization, and hierarchy reorganization. In particular, clustering and categorization algorithms have been intensively studied in order to provide evolving facilities for hierarchical structures and categorization criteria. Through experiments using the Reuters-21578 document collection, we evaluate the performance of the proposed clustering and categorization methods by comparing them to those of well-known conventional methods.

15.
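A Multi-Level Text Classification Method Based on the Vector Space Model (total citations: 37; self-citations: 2; citations by others: 37)
This paper studies and improves the term weighting method of the classic vector space model (VSM) and, on that basis, proposes a multi-level text classification method based on the VSM. The classes are organized into a tree according to their hierarchical relationships, and all training documents of a class are merged into a single class document; when the class models are extracted, comparison is performed only among class documents under the same node at the same level. When a document is classified automatically, the corresponding top-level class is first found starting from the root, and the process then recurses downward until the corresponding leaf subclass is found. Experiments and a deployed system show that the method achieves high precision and recall.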
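The top-down descent through the class tree can be sketched as below. This is a minimal illustration that assumes raw term frequencies as weights and cosine similarity as the comparison measure; the paper's improved term weighting is not reproduced, and the class names, vocabularies, and the `ClassNode`, `cosine`, and `classify` names are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class ClassNode:
    name: str
    vector: Dict[str, float] = field(default_factory=dict)  # merged class document
    children: List["ClassNode"] = field(default_factory=list)


def classify(doc_vec: Dict[str, float], root: ClassNode) -> str:
    """Descend from the root, comparing only sibling classes at each level."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: cosine(doc_vec, c.vector))
    return node.name


def vec(text: str) -> Dict[str, float]:
    """Raw term-frequency vector (an assumed, simplified weighting)."""
    return Counter(text.split())


# Hypothetical two-level class tree built from tiny class documents.
root = ClassNode("root", children=[
    ClassNode("sports", vec("football match team goal tennis racket"), [
        ClassNode("football", vec("football team goal match")),
        ClassNode("tennis", vec("tennis racket serve match")),
    ]),
    ClassNode("science", vec("physics atom chemistry molecule"), [
        ClassNode("physics", vec("physics atom energy")),
        ClassNode("chemistry", vec("chemistry molecule reaction")),
    ]),
])

print(classify(vec("the team scored a goal in the football match"), root))
print(classify(vec("atom energy"), root))
```

Because each step compares a document only with the children of the current node, the number of comparisons grows with the depth and branching of the tree rather than with the total number of leaf classes.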

16.
Designers usually begin with a database to look for historical design solutions, available experience and techniques through design documents when initiating a new design. This database is a collection of labeled design documents under a few predefined categories. However, little work has been done on labeling a relatively small number of design documents for information organization, so that most of the design documents in this database can be automatically categorized. This paper initiates a study on this topic and proposes a methodology in four steps: design document collection, document labeling, finalization of document labeling and categorization of the design database. Our discussion in this paper focuses on the first three steps. The key to this method is to collect a relatively small number of design documents for manual labeling, and to unify the effective labeling results as the final labels by means of labeling agreement analysis and a text classification experiment. These labeled documents are then used as training samples to construct classifiers, which can automatically give appropriate labels to each design document. With this method, design documents are labeled in terms of the consensus of operators’ understanding, and design information can be organized in a comprehensive and universally accessible way. A case study of labeling robotic design documents is used to demonstrate the proposed methodology. Experimental results show that this method can significantly benefit efficient design information search.

17.
M W Lansdale, Ergonomics, 1991, 34(8): 1161-1178
If we remember the visual appearance of documents, and other attributes such as location, then a number of new information management strategies become possible candidates for application in the design of filing systems. This paper describes a number of experiments aimed at investigating aspects of memory for documents in office settings. There is no evidence, as has previously been suggested, that automatic encoding for appearance or location of documents occurs at significant levels. The results of these experiments are more consistent with the view that visual and spatial attributes of documents are remembered in proportion to the attention paid to them when the documents are handled. The experiments also illustrate the sensitivity of this principle to the context in which subjects use documents. It is apparent that office tasks vary considerably in the extent to which subjects must pay attention to the visual and locational attributes of the documents handled. The consequences for the design of filing systems are discussed in terms of what methods for storage and retrieval can usefully be built into the design of systems.

18.
Document clustering is an intentional act that reflects individual preferences with regard to the semantic coherency and relevant categorization of documents. Hence, effective document clustering must consider individual preferences and needs to support personalization in document categorization. Most existing document-clustering techniques, generally anchoring in pure content-based analysis, generate a single set of clusters for all individuals without tailoring to individuals' preferences and thus are unable to support personalization. The partial-clustering-based personalized document-clustering approach, incorporating a target individual's partial clustering into the document-clustering process, has been proposed to facilitate personalized document clustering. However, given a collection of documents to be clustered, the individual might have categorized only a small subset of the collection into his or her personal folders. In this case, the small partial clustering would degrade the effectiveness of the existing personalized document-clustering approach for this particular individual. In response, we extend this approach and propose the collaborative-filtering-based personalized document-clustering (CFC) technique that expands the size of an individual's partial clustering by considering those of other users with similar categorization preferences. Our empirical evaluation results suggest that when given a small-sized partial clustering established by an individual, the proposed CFC technique generally achieves better clustering effectiveness for the individual than does the partial-clustering-based personalized document-clustering technique.

19.
The creation and deployment of knowledge repositories for managing, sharing, and reusing tacit knowledge within an organization has emerged as a prevalent approach in current knowledge management practices. A knowledge repository typically contains vast amounts of formal knowledge elements, which generally are available as documents. To facilitate users' navigation of documents within a knowledge repository, knowledge maps, often created by document clustering techniques, represent an appealing and promising approach. Various document clustering techniques have been proposed in the literature, but most deal with monolingual documents (i.e., written in the same language). However, as a result of increased globalization and advances in Internet technology, an organization often maintains documents in different languages in its knowledge repositories, which necessitates multilingual document clustering (MLDC) to create organizational knowledge maps. Motivated by the significance of this demand, this study designs a Latent Semantic Indexing (LSI)-based MLDC technique capable of generating knowledge maps (i.e., document clusters) from multilingual documents. The empirical evaluation results show that the proposed LSI-based MLDC technique achieves satisfactory clustering effectiveness, measured by both cluster recall and cluster precision, and is capable of maintaining a good balance between monolingual and cross-lingual clustering effectiveness when clustering a multilingual document corpus.

20.
This paper investigates hierarchy extraction from results of multi-label classification (MC). MC deals with instances labeled by multiple classes rather than just one, and the classes are often hierarchically organized. Usually multi-label classifiers rely on a predefined class hierarchy. A much less investigated approach is to suppose that the hierarchy is unknown and to infer it automatically. In this setting, the proposed system classifies multi-label data and extracts a class hierarchy from multi-label predictions. It is based on a combination of a novel multi-label extension of the fuzzy Adaptive Resonance Associative Map (ARAM) neural network with an association rule learner.
