Similar Documents
20 similar documents found.
1.
Maps are one of the most valuable documents for gathering geospatial information about a region. Yet finding a collection of diverse, high-quality maps is a significant challenge because there is a dearth of content-specific metadata available to identify them among other images on the Web. For this reason, it is desirable to analyze the content of each image. The problem is further complicated by the variations between different types of maps, such as street maps and contour maps, and by the fact that many high-quality maps are embedded within other documents such as PDF reports. In this paper, we present an automatic method to find high-quality maps for a given geographic region. Our method finds not only stand-alone map documents but also maps embedded within other documents. We have developed a Content-Based Image Retrieval (CBIR) approach that uses a new set of classification features to capture the defining characteristics of a map. This approach identifies all types of maps irrespective of their subject, scale, and color in a highly scalable and accurate way. Our classifier achieves an F1-measure of 74%, an 18% improvement over previous work in the area.
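As a rough illustration of the classification stage described above, the following sketch trains an SVM on precomputed per-image feature vectors and reports an F1-measure. The features and labels here are random placeholders standing in for the paper's CBIR features; only the pipeline shape is meant to carry over.

```python
# Minimal sketch of the map/non-map classification stage, assuming
# feature extraction has already produced one vector per image.
# The random features and labels below are placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))      # hypothetical CBIR feature vectors
y = rng.integers(0, 2, size=200)    # 1 = map, 0 = non-map

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```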

2.
3.
The paper discusses issues of rule-based data transformation from arbitrary spreadsheet tables to a canonical (relational) form. We present a novel table object model and a rule-based language for table analysis and interpretation. The model represents the physical (cellular) and logical (semantic) structure of an arbitrary table during the transformation process. The language allows this process to be drawn up as consecutive steps of table understanding, i.e., recovering implicit semantics. Both are implemented in our tool for spreadsheet data canonicalization. The presented case study demonstrates the use of the tool for developing a task-specific rule set that converts data from arbitrary tables of the same genre (government statistical websites) to flat-file databases. The performance evaluation confirms the applicability of the implemented rule set to the stated objectives of the application.
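The canonical (relational) target form can be pictured with a single unpivoting rule: a cross-tab table is flattened into (entity, attribute, value) records. This sketch assumes the simplest one-header-row layout and stands in for the paper's much richer rule language.

```python
# Hypothetical canonicalization rule: unpivot a cross-tab table
# (row labels x column headers) into flat relational records.
table = [
    ["Region", "2020", "2021"],
    ["North",  10,     12],
    ["South",  7,      9],
]

header, *rows = table
canonical = [
    (row[0], header[j], row[j])   # (entity, attribute, value)
    for row in rows
    for j in range(1, len(header))
]
for record in canonical:
    print(record)                 # e.g. ('North', '2020', 10)
```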

4.
On how to merge sorted lists coming from different web search tools
Different web search tools often complement each other, so if we want good coverage of all relevant web items, a reasonable strategy is to use several search tools and then merge the resulting lists. How should they be merged? In this paper, we formulate reasonable axioms for the merging procedure and characterize all mergings that satisfy them.
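One simple merging that preserves each tool's internal ordering, a natural candidate axiom, is round-robin interleaving with de-duplication. The sketch below is purely illustrative; it is not claimed to be among the mergings the paper characterizes.

```python
# Round-robin merge of ranked result lists: take rank 0 from every
# list, then rank 1, and so on, skipping items already emitted.
def round_robin_merge(*ranked_lists):
    seen, merged = set(), []
    for rank in range(max(map(len, ranked_lists))):
        for lst in ranked_lists:
            if rank < len(lst) and lst[rank] not in seen:
                seen.add(lst[rank])
                merged.append(lst[rank])
    return merged

print(round_robin_merge(["a", "b", "c"], ["b", "d", "a"]))
# ['a', 'b', 'd', 'c']
```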

5.
Much of the world's quantitative data reside in scattered web tables. To play a meaningful role in Big Data analytics, the facts reported in these tables must be brought into a uniform framework. Based on a formalization of header-indexed tables, we offer an algorithmic solution to end-to-end table processing for a large class of human-readable tables. The proposed algorithms transform header-indexed tables into a category-table format that maps easily to a variety of industry-standard data stores for query processing. The algorithms segment table regions based on the unique indexing of the data region by header paths, classify table cells, and factor the header category structures of two-dimensional as well as the less common multidimensional tables. Experimental evaluations substantiate the algorithmic approach to processing heterogeneous tables: the algorithms generate queryable relational database tables and Semantic Web triple stores. Applying our algorithms to 400 web tables randomly selected from diverse sources shows that the algorithmic solution automates end-to-end table processing.
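The end product, a queryable relational store, can be pictured with a minimal sketch that loads (row header path, column header path, value) triples into SQLite. The path encoding and table layout below are assumptions made for illustration.

```python
# Load a hypothetical "category table" (one record per data cell,
# indexed by its header paths) into a queryable SQLite store.
import sqlite3

category_rows = [
    ("Population|Urban", "Year|2020", 3.1),
    ("Population|Rural", "Year|2020", 1.4),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE facts (row_path TEXT, col_path TEXT, value REAL)")
con.executemany("INSERT INTO facts VALUES (?, ?, ?)", category_rows)
for row in con.execute("SELECT * FROM facts WHERE row_path LIKE 'Population%'"):
    print(row)
```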

6.
7.
8.
The objective of this work is to automatically generate a large number of images for a specified object class. A multimodal approach employing text, metadata, and visual features is used to gather many high-quality images from the Web. Candidate images are obtained by a text-based Web search querying the object identifier (e.g., the word penguin). The webpages and the images they contain are downloaded. The task is then to remove irrelevant images and rerank the remainder. First, the images are reranked based on the text surrounding each image and on metadata features; a number of methods are compared for this reranking. Second, the top-ranked images are used as (noisy) training data and an SVM visual classifier is learned to improve the ranking further. We investigate the sensitivity of the cross-validation procedure to this noisy training data. The principal novelty of the overall method is in combining text/metadata and visual features to achieve a completely automatic ranking of the images. Examples are given for a selection of animals, vehicles, and other classes, 18 classes in total. The results are assessed by precision/recall curves on ground-truth annotated data and by comparison to previous approaches, including those of Berg and Forsyth and of Fergus et al.
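The two-stage reranking idea can be sketched as follows: rank candidates by a text/metadata score, treat the extremes of that ranking as noisy positives and negatives, train a visual classifier on them, and rerank by its decision values. All features and scores below are synthetic placeholders.

```python
# Text score ranks candidates; its extremes supply noisy labels for a
# visual classifier that produces the final ranking.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
n = 100
visual = rng.normal(size=(n, 16))       # hypothetical visual features
text_score = rng.random(n)              # hypothetical text/metadata score

order = np.argsort(-text_score)
pos, neg = order[:20], order[-20:]      # noisy positives / likely negatives
X = np.vstack([visual[pos], visual[neg]])
y = np.r_[np.ones(20), np.zeros(20)]

clf = LinearSVC().fit(X, y)
reranked = np.argsort(-clf.decision_function(visual))
print(reranked[:10])                    # indices of the top-ranked images
```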

9.
This article argues that a relational view of innovation opens up new perspectives for examining and explaining how novelty develops in creative industries. Although many researchers have given time to this topic, a theoretically grounded concept of relational innovation remains undeveloped within the literature. To address this issue, I offer a framework informed by Gabriel Tarde's relational sociology, re-interpreting this sociology with regard to practice theory. By applying the framework in an empirical study of haute cuisine, I identify three processes of innovating at varying degrees of novelty (repeating, adapting, and differentiating). By relating these processes in the form of practice-nets, I show that innovating is not a linear development process; rather, a culinary innovation emerges in between the relations of everyday practices that define and transform its value. In this way, I hope to contribute to a more complex and subtle understanding of culinary innovation as relational.

10.
Data extraction from the web based on pre-defined schema
With the development of the Internet, the World Wide Web has become an invaluable information source for most organizations. However, most documents available on the Web are in HTML, a form originally designed for document formatting with little consideration of content. Effectively extracting data from such documents remains a non-trivial task. In this paper, we present a schema-guided approach to extracting data from HTML pages. Under this approach, the user defines a schema specifying what is to be extracted and provides sample mappings between the schema and the HTML page. The system induces the mapping rules and generates a wrapper that takes the HTML page as input and produces the required data as XML conforming to the user-defined schema. A prototype system implementing the approach has been developed. Preliminary experiments indicate that the proposed semi-automatic approach is not only easy to use but also able to extract the required data from input pages with high accuracy.
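A toy version of such a wrapper pairs each schema field with an extraction rule and emits XML conforming to the schema. The hand-written regexes below merely stand in for the mapping rules the system induces from sample mappings.

```python
# Schema-guided extraction in miniature: rules map schema fields to
# HTML fragments; the output is XML conforming to the schema.
import re
import xml.etree.ElementTree as ET

html = '<li><b>Title:</b> Web Wrappers <i>Price:</i> $12</li>'
rules = {                      # hypothetical induced mapping rules
    "title": r"<b>Title:</b>\s*([^<]+)",
    "price": r"<i>Price:</i>\s*([^<]+)",
}

record = ET.Element("record")
for field, pattern in rules.items():
    m = re.search(pattern, html)
    ET.SubElement(record, field).text = m.group(1).strip() if m else ""
print(ET.tostring(record, encoding="unicode"))
# <record><title>Web Wrappers</title><price>$12</price></record>
```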

11.
12.
Research on the Transformation of XML Schemas into Relational Schemas
This paper analyzes the inlining algorithms that map DTD schemas to relational schemas and proposes a mapping method that carries constraints and functional dependencies. The method simplifies the XML DTD according to a set of given rules, constructs a DTD graph annotated with constraints, and derives the final set of relations from the functional dependencies in the graph. The mapping method is introduced together with a worked example, yielding a more complete relational schema.
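The core inlining idea can be caricatured in a few lines: a single-occurrence child becomes a column of the parent relation, while a repeated child becomes its own relation with a foreign key. The dict-based DTD model and column names are illustrative assumptions; the constraint and functional-dependency handling that the paper adds is omitted.

```python
# Toy DTD-to-relational inlining: '1' children are inlined as columns,
# '*' children become separate relations with a foreign key.
dtd = {"book": [("title", "1"), ("author", "*")]}

def inline(dtd):
    schemas = {}
    for parent, children in dtd.items():
        cols = [parent + "_id"]
        for child, cardinality in children:
            if cardinality == "*":
                schemas[child] = [child + "_id", parent + "_id", "value"]
            else:
                cols.append(child)
        schemas[parent] = cols
    return schemas

print(inline(dtd))
# {'author': ['author_id', 'book_id', 'value'], 'book': ['book_id', 'title']}
```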

13.
Many applications in inventory control, reliability engineering, and preventive maintenance involve frequent calculations of probabilities and partial expectations. In the design of high-volume computer-based applications, recourse to internal tables may therefore be preferable to importing statistical packages. While interpolation in tabulated cdfs will often prove sufficiently accurate from the point of view of statistical representation of the underlying problem, tables of compatible (partial) expectations need to be constructed with regard to the method of interpolation employed. The mathematics for establishing such tables differ from standard textbook procedures. This paper develops appropriate expressions in general terms and gives explicit results for the Gamma family of distributions, which is of particular interest in the areas of application mentioned.
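For the Gamma family, one identity of the kind such tables can exploit is E[X; X <= t] = a*theta*F_{a+1}(t), where F_{a+1} is the cdf of a Gamma distribution with shape a+1 and the same scale theta. The following is a quick numerical check of that identity, not the paper's derivation.

```python
# Partial expectation of a Gamma(a, scale=theta) variable over [0, t]
# via the shifted-shape cdf identity, checked against direct quadrature.
from scipy.stats import gamma
from scipy.integrate import quad

a, theta, t = 2.0, 3.0, 4.0
closed_form = a * theta * gamma.cdf(t, a + 1, scale=theta)
numeric, _ = quad(lambda x: x * gamma.pdf(x, a, scale=theta), 0, t)
print(closed_form, numeric)   # the two values agree
```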

14.
Ontologies are increasingly recognized as a critical component in making networked knowledge accessible. Software architectures that can assemble knowledge from networked sources coherently, according to the requirements of a particular task or perspective, will be at a premium in the next generation of web services. We argue that the ability to generate task-relevant ontologies efficiently and relate them to web resources will be essential for creating a machine-inferable "semantic web". The Internet-based multi-agent problem solving (IMPS) architecture described here is designed to facilitate the retrieval, restructuring, integration, and formalization of task-relevant ontological knowledge from the web. Rich structured and semi-structured sources of knowledge available on the web present implicit or explicit ontologies of domains. Knowledge-level models of tasks have an important role to play in extracting and structuring useful, focused problem-solving knowledge from these web sources. IMPS uses a multi-agent architecture to combine these models with a selection of web knowledge-extraction heuristics, providing clean syntactic integration of ontological knowledge from diverse sources and supporting a range of ontology-merging operations at the semantic level. Whilst our specific aim is to enable on-line knowledge acquisition from web sources to support knowledge-based problem solving by a community of software agents encapsulating problem-solving inferences, the techniques described here can be applied to more general task-based integration of knowledge from diverse web sources, and to services such as the critical comparison, fusion, maintenance, and update of both formal and informal ontologies.
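The "clean syntactic integration" step can be caricatured as a union of concept hierarchies. The sketch below merges two (parent -> children) maps and is only a stand-in for the agent-based machinery IMPS actually uses.

```python
# Naive syntactic merge of ontologies modeled as parent -> children maps.
from collections import defaultdict

def merge(*ontologies):
    merged = defaultdict(set)
    for onto in ontologies:
        for parent, children in onto.items():
            merged[parent] |= set(children)
    return dict(merged)

o1 = {"animal": {"bird", "fish"}}
o2 = {"animal": {"mammal"}, "bird": {"penguin"}}
print(merge(o1, o2))
# {'animal': {'bird', 'fish', 'mammal'}, 'bird': {'penguin'}}
```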

15.
16.
Web video categorization is a fundamental task for web video search. In this paper, we explore web video categorization from a new perspective, integrating model-based and data-driven approaches to boost performance. The boost comes from two aspects. The first is improved performance of the text classifiers through query expansion from related videos and user videos: the model-based classifiers are built on text features extracted from titles and tags, while related videos and user videos act as external resources that compensate for these limited and noisy text features. The second improvement derives from integrating model-based classification with data-driven majority voting over related videos and user videos, which are treated as voting sources from the perspectives of video relevance and user interest, respectively. Semantic meaning from text, video relevance from related videos, and user interest induced from user videos are combined to robustly determine the video category, and their combination further improves the performance of web video categorization. Experiments on YouTube videos demonstrate significant improvement of the proposed approach over traditional text-based classifiers.
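The integration step can be sketched as a weighted combination of classifier probabilities and normalized vote counts. The weights, categories, and inputs below are illustrative placeholders, not the paper's actual scheme.

```python
# Combine model-based text probabilities with majority votes from
# related videos and the uploader's other videos (hypothetical weights).
from collections import Counter

def categorize(text_probs, related_cats, user_cats, w=(0.5, 0.3, 0.2)):
    votes_r, votes_u = Counter(related_cats), Counter(user_cats)
    cats = set(text_probs) | set(votes_r) | set(votes_u)
    score = {
        c: w[0] * text_probs.get(c, 0.0)
           + w[1] * votes_r[c] / max(1, len(related_cats))
           + w[2] * votes_u[c] / max(1, len(user_cats))
        for c in cats
    }
    return max(score, key=score.get)

print(categorize({"music": 0.4, "sports": 0.6},
                 ["music", "music", "sports"], ["music"]))   # -> 'music'
```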

17.
18.
Research on Extracting Features of Web Users' Browsing Behavior
Web log mining has become a research focus for delivering personalized website services. Using Markov models to predict users' browsing patterns, and thereby to raise site visit rates and inform site reorganization, is one of the widely adopted methods in this field. However, Markov models built by traditional methods suffer from redundant, complex data and an unwieldy model size. To address these problems, this paper introduces an improved Markov model: building on the original model, it removes factors that need not be considered during data cleaning and user-session identification, which greatly simplifies the resulting Markov model and improves the efficiency of web log mining.
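A first-order Markov model of browsing can be estimated from cleaned sessions in a few lines; the sessions below are synthetic, and the cleaning and session-identification steps that the paper streamlines are assumed to have been done already.

```python
# Estimate first-order page-transition counts from user sessions and
# predict the most likely next page.
from collections import Counter, defaultdict

sessions = [["home", "news"], ["home", "sports"], ["home", "sports"]]

counts = defaultdict(Counter)
for s in sessions:
    for a, b in zip(s, s[1:]):
        counts[a][b] += 1

def predict_next(page):
    nxt = counts[page]
    return nxt.most_common(1)[0][0] if nxt else None

print(predict_next("home"))   # 'sports' (2 of 3 observed transitions)
```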

19.
While HTML is mainly designed for the visual rendering of Web documents, XML is widely accepted as a standard format for processing and managing information; in particular, it can embed the information of logical structures. To utilize XML, however, the logical structures of HTML tables must first be extracted and transformed into XML representations. This paper presents an efficient method for this process, which consists of two phases: area segmentation and structure analysis. Area segmentation cleans up tables and segments them into attribute and value areas by checking visual and semantic coherency. The hierarchical structure between the attribute and value areas is then analyzed and transformed into an XML representation using a proposed table model. Experimental results with 1,180 HTML tables show that the proposed method performs better than conventional methods, with an average accuracy of 86.7%.
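Once the attribute and value areas are identified, conversion to XML is straightforward. The sketch below assumes the simplest segmentation (first row is the attribute area) and omits the visual and semantic coherency checks that make real segmentation hard.

```python
# Emit an XML representation of a table whose first row holds the
# attribute area and whose remaining rows hold the value area.
import xml.etree.ElementTree as ET

cells = [["Name", "Age"], ["Ann", "34"], ["Bob", "29"]]
attrs, values = cells[0], cells[1:]

root = ET.Element("table")
for row in values:
    rec = ET.SubElement(root, "record")
    for attr, val in zip(attrs, row):
        ET.SubElement(rec, attr.lower()).text = val
print(ET.tostring(root, encoding="unicode"))
```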

20.
In this paper, we deal with the problem of determining the scores of all matches in a football tournament. The final table of the tournament, which shows the standings of the teams, is taken as the initial data of the problem. This is a kind of combinatorial problem that requires constructing valid initial states from a given final state. We use a rule-based method to solve the problem, analyzing the search space and introducing the notion of black&white graphs. The table data is first used to compute the possible results of all played matches; based on these results, together with the total numbers of scored and conceded goals, the possible scores of the matches are then computed. The solution strategy is tested on several final tables from previous World Cup tournaments. Further experiments are conducted for various team standings, measuring the time required to process specific data for up to 10 teams.
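The first stage, recovering the possible results of the played matches from the points column, can be sketched as a brute-force search over win/draw/loss outcomes (3/1/0 scoring assumed); reconstructing the actual goal scores would proceed similarly over the goal totals.

```python
# Enumerate win/draw/loss outcomes of a single round-robin that
# reproduce a given points column (hypothetical 3-team example).
from itertools import product, combinations

teams = ["A", "B", "C"]
points = {"A": 6, "B": 3, "C": 0}
matches = list(combinations(teams, 2))

for outcome in product(["home", "draw", "away"], repeat=len(matches)):
    tally = dict.fromkeys(teams, 0)
    for (h, a), res in zip(matches, outcome):
        if res == "home":
            tally[h] += 3
        elif res == "away":
            tally[a] += 3
        else:
            tally[h] += 1
            tally[a] += 1
    if tally == points:
        print(list(zip(matches, outcome)))
# [(('A', 'B'), 'home'), (('A', 'C'), 'home'), (('B', 'C'), 'home')]
```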
