首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到10条相似文献,搜索用时 135 毫秒
1.
Current approaches to index weighting for information retrieval from texts are based on statistical analysis of the texts' contents. A key shortcoming of these indexing schemes, which consider only the terms in a document, is that they cannot extract semantically exact indexes that represent the semantic content of a document. To address this issue, we proposed a new indexing formalism that considers not only the terms in a document, but also the concepts. In the proposed method, concepts are extracted by exploiting clusters of terms that are semantically related, referred to as concept clusters. Through experiments on the TREC-2 collection of Wall Street Journal documents, we show that the proposed method outperforms an indexing method based on term frequency (TF), especially in regard to the highest-ranked documents. Moreover, the index term dimension was 53.3% lower for the proposed method than for the TF-based method, which is expected to significantly reduce the document search time in a real environment.  相似文献   

2.
3.
The Indexing and Retrieval of Document Images: A Survey   总被引:2,自引:0,他引:2  
The economic feasibility of maintaining large data bases of document images has created a tremendous demand for robust ways to access and manipulate the information these images contain. In an attempt to move toward a paperless office, large quantities of printed documents are often scanned and archived as images, without adequate index information. One way to provide traditional data-base indexing and retrieval capabilities is to fully convert the document to an electronic representation which can be indexed automatically. Unfortunately, there are many factors which prohibit complete conversion including high cost, low document quality, and the fact that many nontext components cannot be adequately represented in a converted form. In such cases, it can be advantageous to maintain a copy of and use the document in image form. In this paper, we provide a survey of methods developed by researchers to access and manipulate document images without the need for complete and accurate conversion. We briefly discuss traditional text indexing techniques on imperfect data and the retrieval of partially converted documents. This is followed by a more comprehensive review of techniques for the direct characterization, manipulation, and retrieval, of images of documents containing text, graphics, and scene images.  相似文献   

4.
Visual (image and video) database systems require efficient indexing to enable fast access to the images in a database. In addition, the large memory capacity and channel bandwidth requirements for the storage and transmission of visual data necessitate the use of compression techniques. We note that image/video indexing and compression are typically pursued independently. This reduces the storage efficiency and may degrade the system performance. In this paper, we present novel algorithms based on vector quantization (VQ) for indexing of compressed images and video. To start with, the images are compressed using VQ. In the first technique, for each codeword in the codebook, a histogram is generated and stored along with the codeword. We note that the superposition of the histograms of the codewords, which are used to represent an image, is a close approximation of the histogram of the image. This histogram is used as an index to store and retrieve the image. In the second technique, the histogram of the labels of an image is used as an index to access the image. We also propose an algorithm for indexing compressed video sequences. Here, each frame is encoded in the intraframe mode using VQ. The labels are used for the segmentation of a video sequence into shots, and for indexing the representative frame of each shot. The proposed techniques not only provide fast access to stored visual data, but also combine compression and indexing. The average retrieval rates are 95% and 94% at compression ratios of 16:1 and 64:1, respectively. The corresponding cut detection rates are 97% and 90%, respectively.  相似文献   

5.
6.
Image database systems must effectively and efficiently handle and retrieve images from a large collection of images. A serious problem faced by these systems is the requirement to deal with the nonstationary database. In an image database system, image features are typically organized into an indexing structure, and updating the indexing structure involves many computations. In this paper, this difficult problem is converted into a constrained optimization problem, and the iteration-free clustering (IFC) algorithm based on the Lagrangian function, is presented for adapting the existing indexing structure for a nonstationary database. Experimental results concerning recall and precision indicate that the proposed method provides a binary tree that is almost optimal. Simulation results further demonstrate that the proposed algorithm can maintain 94% precision in seven-dimensional feature space, even when the number of new-coming images is one-half the number of images in the original database. Finally, our IFC algorithm outperforms other methods usually applied to image databases.  相似文献   

7.
在Lucene的全文检索中,直接对PDF文档进行全文检索几乎是不可能的。在实际应用中又需要对大量的PDF文档进行检索,通过Xpdf工具先对PDF文档转换为TXT文本,然后对TXT文本建立索引,在进行检索时通过文件名实现和原始PDF文档的一一对应,最终实现PDF文档的全文检索功能,同时还能实现对PDF文档所检索的包含关键词的内容进行高亮显示,实现全文检索的功能,通过实际项目应用,检索效果能够达到很好的效果。  相似文献   

8.
搜索引擎的诞生,给信息搜集带来了极大的方便与好处。一套完备、成熟的搜索引擎的开发需要耗费大量资源,本文围绕如何快速搭建一个简易的搜索引擎展开。基于各开源组织独立研发并对外提供的搜索引擎组件与框架,本文在JBuilder开发平台上调用各组件对外提供的Java API,快速地搭建起由数据抓取、建立索引及执行搜索3大部分组成的简易的全文搜索引擎,实现网页文档类数据的抓取与保存、文本提取、索引文档及索引库的建立、基本关键词的检索等功能,并描述搜索引擎实现及运行的一般过程。  相似文献   

9.
XML is a new standard for exchanging and representing information on the Internet. Documents can be hierarchically represented by XML-elements. In this paper, we propose that an XML document collection be represented and indexed using a bitmap indexing technique. We define the similarity and popularity operations suitable for bitmap indexes. We also define statistical measurements in the BitCube: center, and radius. Based on these measurements, we describe a new bitmap indexing based technique to cluster XML documents. The techniques for clustering are motivated by the fact that the bitmap indexes are expected to be very sparse.Furthermore, a 2-dimensional bitmap index is extended to a 3-dimensional bitmap index, called the BitCube. Sophisticated querying of XML document collections can be performed using primitive operations such as slice, project, and dice. Experiments show that the BitCube can be created efficiently and the primitive operations can be performed more efficiently with the BitCube than with other alternatives.  相似文献   

10.
The Digital Library Initiative (DLI) project at the University of Illinois at Urbana-Champaign is developing the information infrastructure to effectively search technical documents on the Internet. The authors are constructing a large testbed of scientific literature, evaluating its effectiveness under significant use, and researching enhanced search technology. They are building repositories (organized collections) of indexed multiple-source collections and federating (merging and mapping) them by searching the material via multiple views of a single virtual collection. Developing widely usable Web technology is also a key goal. Improving Web search beyond full-text retrieval will require using document structure in the short term and document semantics in the long term. Their testbed efforts concentrate on journal articles from the scientific literature, with structure specified by the Standard Generalized Markup Language (SGML). Research efforts extract semantics from documents using the scalable technology of concept spaces based on context frequency. They then merge these efforts with traditional library indexing to provide a single Internet interface to indexes of multiple repositories  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号