Similar literature
 20 similar documents found; search took 46 ms
1.
Document images often suffer from different types of degradation that render document image binarization a challenging task. This paper presents a document image binarization technique that accurately segments the text from badly degraded document images. The proposed technique is based on the observations that text documents usually have a background of uniform color and texture and that the document text has a different intensity level from the surrounding background. Given a document image, the technique first estimates a document background surface through an iterative polynomial smoothing procedure. Different types of document degradation are then compensated by using the estimated document background surface. The text stroke edges are further detected from the compensated document image by using the L1-norm image gradient. Finally, the document text is segmented by a local threshold that is estimated from the detected text stroke edges. The technique was submitted to the recent document image binarization contest (DIBCO) held under the framework of ICDAR 2009 and achieved the top performance among 43 algorithms submitted by 35 international research groups.
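The background-estimation and compensation steps can be illustrated on a single scanline. This is only a rough sketch under loud assumptions: it fits a degree-1 polynomial (a straight line) to one row of gray levels, the `tolerance` and `ratio` parameters are hypothetical, and a fixed ratio to the background stands in for the stroke-edge-based local threshold described in the abstract. None of the function names come from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_background(row, iterations=5, tolerance=10):
    """Iteratively refit the line, ignoring pixels much darker than the fit."""
    xs = list(range(len(row)))
    keep = xs[:]
    for _ in range(iterations):
        slope, intercept = fit_line(keep, [row[x] for x in keep])
        # Pixels well below the current fit are treated as likely text.
        new_keep = [x for x in xs if row[x] >= slope * x + intercept - tolerance]
        if new_keep == keep:
            break
        keep = new_keep
    return [slope * x + intercept for x in xs]


def binarize_row(row, ratio=0.7):
    """Mark a pixel as text (1) when it is clearly darker than the background.

    The fixed ratio is an assumption standing in for the paper's local threshold.
    """
    background = estimate_background(row)
    return [1 if value < ratio * bg else 0 for value, bg in zip(row, background)]


# Hypothetical scanline: a brightening background with three dark text pixels.
row = [200 + x for x in range(20)]
for x in (5, 6, 12):
    row[x] = 40
text_positions = [i for i, v in enumerate(binarize_row(row)) if v]
```

A real implementation would work on the whole image and with higher-degree polynomials; the sketch only shows why refitting after discarding dark pixels pulls the estimate toward the true background.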

2.
TEXPROS (TEXt PROcessing System) is an automatic document processing system which supports text-based information representation and manipulation, conveying meanings from stored information within office document texts. A dual modeling approach is employed to describe office documents and support document search and retrieval. The frame templates for representing document classes are organized to form a document type hierarchy. Based on its document type, the synopsis of a document is extracted to form its corresponding frame instance. According to user-predefined criteria, these frame instances are stored in different folders, which are organized as a folder organization (i.e., a repository of frame instances associated with their documents). The concept of linking folders establishes filing paths for automatically filing documents in the folder organization. By integrating the document type hierarchy and the folder organization, the dual modeling approach provides efficient frame instance access by limiting searches to those frame instances of a document type within those folders which appear to be the most similar to the corresponding queries. This paper presents an agent-based document filing system using folder organization. A storage architecture is presented to incorporate the document type hierarchy, folder organization and original document storage into a three-level storage system. This folder organization supports an effective filing strategy and allows rapid frame instance searches by confining the search to the actual predicate-driven retrieval method. A predicate specification is proposed for specifying criteria on filing paths in terms of user-predefined predicates for governing the document filing. A method for evaluating whether a given frame instance satisfies the criteria of a filing path is presented. The basic operations for constructing and reorganizing a folder organization are proposed.

3.
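A New Replacement Policy for Proxy Caches   Total citations: 7, self-citations: 0, citations by others: 7
A proxy cache replacement policy can be viewed as a ranking problem, and many ranking criteria are possible. Finding a good replacement policy amounts to finding a criterion that reflects real Web access characteristics. Size-based replacement is simple and practical, but it does not exploit all of the access characteristics of the WWW. Based on an analysis of access characteristics in proxy cache logs, a new replacement policy is proposed that computes a document's value from its size, its access frequency, and its remaining access lifetime. The policy achieves both a high document hit ratio and a high byte hit ratio. A log-based simulation is presented.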
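The ranking view of replacement can be sketched in a few lines. The abstract names the three factors (size, access frequency, remaining lifetime) but not how they are combined, so the value function below, `frequency * remaining_life / size`, is purely an assumption for illustration, as are the class and function names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CachedDoc:
    url: str
    size: int              # bytes
    frequency: int         # accesses observed so far
    remaining_life: float  # estimated seconds of remaining access lifetime


def doc_value(doc: CachedDoc) -> float:
    # Hypothetical value function: not the formula from the paper.
    return doc.frequency * doc.remaining_life / doc.size


def evict(cache: List[CachedDoc], needed: int, capacity: int) -> List[CachedDoc]:
    """Evict lowest-value documents until `needed` more bytes fit in `capacity`."""
    used = sum(d.size for d in cache)
    evicted = []
    # Sorting by value is the "replacement as ranking" idea.
    for doc in sorted(cache, key=doc_value):
        if used + needed <= capacity:
            break
        cache.remove(doc)
        used -= doc.size
        evicted.append(doc)
    return evicted
```

Swapping in a different `doc_value` changes the policy without touching the eviction loop, which is the practical benefit of treating replacement as a ranking problem.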

4.
This paper presents a knowledge-based approach to managing and retrieving personal documents. The dual document models consist of a document type hierarchy and a folder organization. The document type hierarchy is used to capture the layout, logical and conceptual structures of documents. The folder organization mimics the user's real-world document filing system for organizing and storing documents in an office environment. Predicate-based representation of documents is formalized for specifying knowledge about documents. Document filing and retrieval are predicate-driven. The filing criteria for the folders, which are specified in terms of predicates, govern the grouping of frame instances, regardless of their document types. We incorporated the notions of document type hierarchy and folder organization into the multilevel architecture of document storage. This architecture supports various text-based information retrieval techniques and content-based multimedia information retrieval techniques. The paper also proposes a knowledge-based query-preprocessing algorithm, which reduces the search space. For automating the document filing and retrieval, a predicate evaluation engine with a knowledge base is proposed. The learning agent is responsible for acquiring the knowledge needed by the evaluation engine.
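Predicate-driven filing can be pictured as follows. This is a minimal sketch, assuming frame instances are plain dictionaries and each folder's filing criterion is a Python callable; the folder names, fields, and `file_instances` function are hypothetical and do not come from the paper.

```python
def file_instances(frame_instances, folders):
    """Place each frame instance in every folder whose predicate it satisfies."""
    filed = {name: [] for name in folders}
    for instance in frame_instances:
        for name, predicate in folders.items():
            if predicate(instance):
                filed[name].append(instance["id"])
    return filed


# Hypothetical folders: criteria look at content, not at the document type.
folders = {
    "from_smith": lambda f: f.get("sender") == "Smith",
    "year_2000": lambda f: f.get("year") == 2000,
}

# Hypothetical frame instances of two different document types.
instances = [
    {"id": 1, "type": "memo", "sender": "Smith", "year": 1999},
    {"id": 2, "type": "letter", "sender": "Jones", "year": 2000},
    {"id": 3, "type": "memo", "sender": "Smith", "year": 2000},
]

filed = file_instances(instances, folders)
```

Instance 3 lands in both folders, and memos and letters share the `year_2000` folder, which mirrors the statement that predicates group frame instances regardless of document type.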

5.
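Form documents are widely used in everyday life, in areas such as censuses, bank bills, and reports of all kinds, so automatic computer processing of such documents is of great practical importance. A form document information processing system consists mainly of raw image acquisition, document structure extraction, and recognition of the filled-in information. After analyzing the strengths and weaknesses of existing automatic form data entry systems in China and abroad, the authors capture the raw image signal of a form with a contact image sensor (CIS), obtaining a high-quality image signal in hardware. Optical character recognition (OCR) is used to recognize the filled-in information. The system places few demands on the paper and on how the form is filled in, and achieves high recognition accuracy.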

6.
Efficient phrase-based document indexing for Web document clustering   Total citations: 4, self-citations: 0, citations by others: 4
Document clustering techniques mostly rely on single term analysis of the document data set, such as the vector space model. To achieve more accurate document clustering, more informative features including phrases and their weights are particularly important in such scenarios. Document clustering is particularly useful in many applications such as automatic categorization of documents, grouping search engine results, building a taxonomy of documents, and others. This article presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the document index graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
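The difference between phrase-based and single-term indexing can be shown with a much simplified stand-in for the document index graph. In this sketch, which is an assumption and not the article's data structure, an edge is a pair of consecutive words, the index is built incrementally one document at a time, and similarity is the Jaccard overlap of edges; the class and method names are hypothetical.

```python
from collections import defaultdict


class PhraseIndex:
    """Simplified phrase index: edges are consecutive word pairs."""

    def __init__(self):
        self.edges = defaultdict(set)  # (word, next_word) -> ids of documents containing it
        self.doc_edges = {}            # doc id -> set of that document's edges

    def add(self, doc_id, text):
        """Index one document; can be called incrementally as documents arrive."""
        words = text.lower().split()
        edges = set(zip(words, words[1:]))
        for edge in edges:
            self.edges[edge].add(doc_id)
        self.doc_edges[doc_id] = edges

    def similarity(self, a, b):
        """Jaccard overlap of word-pair edges (an assumed similarity measure)."""
        ea, eb = self.doc_edges[a], self.doc_edges[b]
        if not ea and not eb:
            return 0.0
        return len(ea & eb) / len(ea | eb)


index = PhraseIndex()
index.add(1, "river rafting trips")
index.add(2, "river rafting vacation")
index.add(3, "trips rafting river")
```

Documents 1 and 3 contain exactly the same words, so a single-term model would rate them identical, yet they share no word-order edges and score zero here. That is the kind of distinction phrase matching is meant to capture.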

7.
Information Systems, 2000, 25(6-7): 453-463
The paper discusses how recurrent organizational activities such as document preparation can be supported by a knowledge-based document preparation tool. REGENT (REport GENeration Tool) is a software environment, which generates documents from reusable document pieces by planning, executing and monitoring the document preparation process in an organizational setting. The documents are constructed from stored document pieces using artificial intelligence methods. A system architecture is developed to enable the document generation process within a broader office automation setting. The report preparation process knowledge is captured in a knowledge representation scheme. A two-phased artificial intelligence problem solving strategy is developed to carry out the reasoning steps when configuring reports from document pieces. The REGENT environment is especially effective when preparing recurrent report types such as the preparation of annual reports. The approach is illustrated with examples gathered during the partial implementation of REGENT at FAW (Artificial Intelligence Research Institute).

8.
An active document framework is a self-representable, self-explainable, and self-executable document mechanism. A document’s content is reflected in four aspects: granularity hierarchy, template hierarchy, background knowledge, and semantic links between fragments. An active document has a set of built-in engines for browsing, retrieving, and reasoning, which can work in a way best suited to the document’s content. Besides browsing and retrieval services, the active document supports intelligent information services such as complex question answering, online teaching, and assistant problem solving. The client side service provider is only responsible for the retrieval of the required active document. The detailed information services are provided by the document mechanism. This improves the current Web information retrieval approach by raising the efficiency of information retrieval, enhancing the preciseness and mobility of information services, and enabling intelligent information services. A tool for making semantic links in a document and an intelligent browser have been developed to support the proposed approach, which provides a new type of web information service.

9.
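Health status management documents for satellite constellations involve querying and computing many telemetry parameters, have strict formatting requirements, and are laborious and time-consuming to prepare by hand. To address these problems, a method for automatically generating such documents is proposed. The basic data types contained in the documents are classified and analyzed, storage rules for configuration files are defined, document templates are customized, and an automatic generation algorithm produces data summary documents from the templates and the associated parameters. The method enables knowledge reuse and generation of common content during document preparation and establishes a standardized, effective preparation workflow.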

10.
We present a novel method for detecting near-duplicates from a large collection of documents. Three major parts are involved in our method: feature selection, similarity measure, and discriminant derivation. To find near-duplicates to an input document, each sentence of the input document is fetched and preprocessed, the weight of each term is calculated, and the heavily weighted terms are selected to be the feature of the sentence. As a result, the input document is turned into a set of such features. A similarity measure is then applied and the similarity degree between the input document and each document in the given collection is computed. A support vector machine (SVM) is adopted to learn a discriminant function from a training pattern set, which is then employed to determine whether a document is a near-duplicate to the input document based on the similarity degree between them. The sentence-level features we adopt can better reveal the characteristics of a document. Besides, learning the discriminant function by SVM can avoid trial-and-error efforts required in conventional methods. Experimental results show that our method is effective in near-duplicate document detection.
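The first two parts, sentence-level feature selection and the similarity measure, can be sketched as below. Several choices here are assumptions rather than the paper's definitions: a term's weight is its frequency in the whole document, each sentence keeps its top `k` terms, and similarity is the Jaccard overlap of the resulting feature sets. The SVM-based discriminant is not shown, and stop-word removal is omitted for brevity.

```python
import re
from collections import Counter


def sentence_features(text, k=2):
    """Turn a document into a set of sentence features (top-k weighted terms each)."""
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    tokenized = [re.findall(r"[a-z0-9]+", s) for s in sentences]
    # Assumed weighting: how often the term occurs in the whole document.
    weights = Counter(term for tokens in tokenized for term in tokens)
    features = set()
    for tokens in tokenized:
        # Heaviest terms first; ties are broken alphabetically for determinism.
        ranked = sorted(set(tokens), key=lambda t: (-weights[t], t))
        features.add(frozenset(ranked[:k]))
    return features


def similarity(doc_a, doc_b, k=2):
    """Jaccard overlap of sentence features (an assumed similarity degree)."""
    fa, fb = sentence_features(doc_a, k), sentence_features(doc_b, k)
    if not fa and not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

In the method described above, the similarity degree would be passed to a learned discriminant function instead of being compared against a hand-tuned threshold.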

11.
Character groundtruth for real, scanned document images is crucial for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not practical because (i) accuracy in delineating groundtruth character bounding boxes is not high enough, (ii) it is extremely laborious and time consuming, and (iii) the manual labor required for this task is prohibitively expensive. We describe a closed-loop methodology for collecting very accurate groundtruth for scanned documents. We first create ideal documents using a typesetting language. Next we create the groundtruth for the ideal document. The ideal document is then printed, photocopied and then scanned. A registration algorithm estimates the global geometric transformation and then performs a robust local bitmap match to register the ideal document image to the scanned document image. Finally, groundtruth associated with the ideal document image is transformed using the estimated geometric transformation to create the groundtruth for the scanned document image. This methodology is very general and can be used for creating groundtruth for documents typeset in any language, layout, font, and style. We have demonstrated the method by generating groundtruth for English, Hindi, and FAX document images. The cost of creating groundtruth using our methodology is minimal. If character, word or zone groundtruth is available for any real document, the registration algorithm can be used to generate the corresponding groundtruth for a rescanned version of the document.
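The final step, carrying groundtruth from the ideal image to the scanned image, amounts to pushing each bounding box through the estimated transformation. The sketch below assumes the transformation is affine and given as a 2x3 matrix; the abstract only says "global geometric transformation", so the affine form and the function name `transform_box` are assumptions.

```python
def transform_box(box, matrix):
    """Map an axis-aligned box (x0, y0, x1, y1) through a 2x3 affine matrix.

    The result is the axis-aligned box that encloses the four mapped corners.
    """
    (a, b, tx), (c, d, ty) = matrix
    x0, y0, x1, y1 = box
    corners = [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]
    mapped = [(a * x + b * y + tx, c * x + d * y + ty) for x, y in corners]
    xs = [x for x, _ in mapped]
    ys = [y for _, y in mapped]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical estimate: scale by 2, then shift by (10, 5).
estimated = ((2, 0, 10), (0, 2, 5))
scanned_box = transform_box((1, 1, 3, 4), estimated)
```

Mapping all four corners, rather than only two, keeps the enclosing box correct when the transformation includes rotation or shear.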

12.
In the traditional document authentication environment, the required document authentication time is determined by document applicants on the basis of the historic log, and the variation of document authentication time due to document contents is almost ignored. In addition to the authentication time estimation, once a document is rejected by an authenticator after authentication, the applicant might sometimes fail to exactly recognize the major rejection reasons and modification guides. Focusing on the above issues related to document authentication, this research proposes algorithms, namely the determination of document authentication time and the recommendation of document paragraph modification, to determine the document authentication information to support electronic document authentication. In addition to the proposed algorithms, a web-based document authentication information management system is developed and used to demonstrate feasibility of the proposed algorithms applied in CMP patent document authentication. Whether in providing authentication information to users or in reusing the information generated from the authentication processes, the document authentication information management system developed in this study can enhance the performance of the traditional approaches for document authentication.

13.
Maintaining, customizing, sharing and reusing ISO9000 quality documents are essential for many organizations, especially those that work as virtual enterprises (VE). In a VE, the documents must be shared among organizations to take full advantage of recent Internet advances. XML is a new browser-based language standard. The purpose of this research is to explore the capabilities of XML and Internet technologies in electronic document management environments to comply with the ISO9000 requirements. This research has demonstrated several XML-enabled examples beneficial for the main functions of ISO9000 document management such as document creation, document change, document control and document access. The implemented examples demonstrate the effectiveness and efficiency of document customizing, querying, hierarchical linking, tracking and reusing. The research results solve the ISO9000 document-related problems among working partners and facilitate document flow and information integration across the value chain.

14.
This paper presents a document retrieval technique that is capable of searching document images without OCR (optical character recognition). The proposed technique retrieves document images by a new word shape coding scheme, which captures the document content through annotating each word image by a word shape code. In particular, we annotate word images by using a set of topological shape features including character ascenders/descenders, character holes, and character water reservoirs. With the annotated word shape codes, document images can be retrieved by either query keywords or a query document image. Experimental results show that the proposed document image retrieval technique is fast, efficient, and tolerant to various types of document degradation.
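The idea of a word shape code can be illustrated on text, with one loud caveat: the technique above extracts shape features from word images without recognizing characters, whereas this toy sketch starts from known letters purely to show what such a code might look like. The letter sets, the code alphabet (`A`, `D`, `x`, `o`), and the function name are all assumptions, and water reservoirs are not modeled.

```python
ASCENDERS = set("bdfhklt")   # letters rising above the x-height
DESCENDERS = set("gjpqy")    # letters dropping below the baseline
HOLES = set("abdegopq")      # letters containing an enclosed region


def shape_code(word):
    """Encode a word by coarse per-letter shape: ascender/descender plus hole."""
    parts = []
    for ch in word.lower():
        if ch in ASCENDERS:
            base = "A"
        elif ch in DESCENDERS:
            base = "D"
        else:
            base = "x"
        parts.append(base + ("o" if ch in HOLES else ""))
    return "".join(parts)
```

Under this toy scheme "dog" and "bag" receive the same code, which shows that shape codes are lossy: retrieval works on shape similarity rather than exact spelling, and a query keyword is first converted to its code before lookup.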

15.
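To strengthen the security of Word documents, data-level information security for Word documents is studied. An object-oriented program obtains the character information in a Word document, derives its encoding, and encrypts it with a "super-long key stream" algorithm, preserving the document's formatting information losslessly throughout encryption and decryption. The configuration of the encryption and decryption functions is also described. The authors state that Word documents encrypted in this way cannot be cracked illegitimately, and that the approach gives Word document security a more solid safeguard at the lowest level.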

16.
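This paper studies RC5 encryption and decryption of selected nodes in standard-format XML documents on the .NET platform. The work covers parsing the chosen nodes of an XML document, encrypting them, and embedding the ciphertext back into the document; decrypting the corresponding nodes of an encrypted XML document and reassembling them into well-formed XML; and providing API interfaces that other programs can call. Finally, a software program is built on a concrete RC5 implementation, and a simulated implementation is completed for a specific XML document format.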

17.
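An Improved Skew Detection Method for Chinese Document Images   Total citations: 4, self-citations: 0, citations by others: 4
Sun Nan, Liu Zhiwen. Computer Simulation (《计算机仿真》), 2006, 23(9): 184-187
When an image acquisition device converts a paper document into a document image, the image is often skewed to some degree, which can cause subsequent layout analysis and OCR algorithms to fail. This paper proposes a nearest-neighbor-based method for detecting the skew angle of Chinese document images and uses least squares to reduce the error of the skew estimate, which greatly improves computation speed and strengthens robustness. Compared with existing methods, it is faster and more accurate. The algorithm was implemented in Visual C++, and tests on 100 skewed Chinese document images from a test set show that it is accurate and adaptable.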
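The least-squares part of skew estimation can be sketched as a line fit through character centroids that belong to one text line. This is an assumption-laden illustration: the nearest-neighbor grouping that selects those centroids is not shown, the input is taken to be a list of `(x, y)` points, and the function name is hypothetical.

```python
import math


def skew_angle_degrees(points):
    """Fit y = slope * x + intercept by least squares; return atan(slope) in degrees."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("points are vertically aligned; slope is undefined")
    return math.degrees(math.atan(sxy / sxx))


# Hypothetical centroids lying on a line tilted by 5 degrees.
centroids = [(x, x * math.tan(math.radians(5.0))) for x in range(0, 100, 10)]
angle = skew_angle_degrees(centroids)
```

Fitting one line through many centroids averages out the jitter of individual characters, which is why a least-squares step can reduce the error of a raw pairwise angle estimate. Note that in image coordinates the y axis usually points down, so the sign of the angle depends on the convention in use.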

18.
19.
20.
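Features from different parts of speech contribute differently to text clustering. Using three clustering algorithms on four representative Chinese and English data sets, this paper examines how the four main parts of speech and their combinations affect the clustering of Chinese and English text. The experiments show that, in both languages, nouns are the most important part of speech for representing text content, and verbs, adjectives, and adverbs all help clustering. Clustering on nouns alone gives results close to those obtained with all parts of speech while greatly reducing dimensionality, but nouns alone do not give the best clustering; compared with other combinations and with single parts of speech, the combination of nouns, verbs, adjectives, and adverbs usually clusters better. The same part of speech behaves quite differently in Chinese and English clustering, both in its share of the vocabulary and in single-part-of-speech clustering results. Compared with English, the differences among part-of-speech features and their combinations are more stable in Chinese text clustering.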
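Selecting features by part of speech before clustering can be sketched as a filter over tagged tokens. The sketch assumes the text has already been tagged by some external tagger; the tag names (`NOUN`, `VERB`, `ADJ`, `ADV`) and the function name are hypothetical, and the clustering step itself is not shown.

```python
from collections import Counter


def pos_filtered_vector(tagged_tokens, keep=("NOUN",)):
    """Build a term-frequency vector from tokens whose tag is in `keep`."""
    return Counter(word for word, tag in tagged_tokens if tag in keep)


# Hypothetical tagged sentence.
tagged = [
    ("cat", "NOUN"), ("quickly", "ADV"), ("chases", "VERB"),
    ("small", "ADJ"), ("mouse", "NOUN"), ("cat", "NOUN"),
]

nouns_only = pos_filtered_vector(tagged)
content_words = pos_filtered_vector(tagged, keep=("NOUN", "VERB", "ADJ", "ADV"))
```

The noun-only vector has fewer dimensions than the four-tag vector, which reflects the trade-off reported above between lower dimensionality and the better clustering obtained with the combined features.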
