Similar Documents
20 similar documents found.
1.
Data extraction from the web based on pre-defined schema   (Cited by: 7; self-citations: 1; other citations: 7)
With the development of the Internet, the World Wide Web has become an invaluable information source for most organizations. However, most documents available from the Web are in HTML, a form originally designed for document formatting with little consideration of content. Effectively extracting data from such documents remains a non-trivial task. In this paper, we present a schema-guided approach to extracting data from HTML pages. Under this approach, the user defines a schema specifying what is to be extracted and provides sample mappings between the schema and the HTML page. The system then induces the mapping rules and generates a wrapper that takes the HTML page as input and produces the required data as XML conforming to the user-defined schema. A prototype system implementing the approach has been developed. Preliminary experiments indicate that the proposed semi-automatic approach is not only easy to use but also able to produce wrappers that extract the required data from input pages with high accuracy.
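The abstract does not give the rule language of the induced wrapper, so the following is only a minimal Python sketch, assuming the mapping rules can be expressed as regular expressions; the schema fields, patterns, and sample HTML are hypothetical.

```python
# Minimal sketch of a schema-guided wrapper (assumption: induced mapping
# rules can be expressed as regular expressions; all names are hypothetical).
import re
import xml.etree.ElementTree as ET

# User-defined schema: target field -> extraction rule (regex with one group).
SCHEMA_RULES = {
    "title": r"<h2[^>]*>(.*?)</h2>",
    "price": r'<span class="price">(.*?)</span>',
}

def wrap(html: str, record_tag: str = "record") -> ET.Element:
    """Apply the mapping rules to an HTML page and emit XML
    conforming to the user-defined schema."""
    record = ET.Element(record_tag)
    for field, pattern in SCHEMA_RULES.items():
        match = re.search(pattern, html, re.S)
        child = ET.SubElement(record, field)
        child.text = match.group(1).strip() if match else ""
    return record

if __name__ == "__main__":
    page = '<h2>Sample item</h2><span class="price">12.50</span>'
    print(ET.tostring(wrap(page), encoding="unicode"))
    # <record><title>Sample item</title><price>12.50</price></record>
```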

2.
A Web-Services-Based Data Integration Framework   (Cited by: 1; self-citations: 0; other citations: 1)
To give users and third-party applications more convenient interface support and to extend the functionality of data integration products, this paper proposes a Web Services Based Data Integration (WSDI) framework and describes its architecture and key technologies in detail. WSDI improves the internal management and extensibility of data integration products; it has been applied in the data integration product OnceDI with good results.
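The abstract describes WSDI only at the architectural level; the sketch below merely illustrates the general idea of wrapping heterogeneous data sources behind one uniform query interface that a web service endpoint could expose. The source classes and query format are hypothetical and are not the OnceDI API.

```python
# Illustrative adapter layer for service-based data integration
# (hypothetical source classes and query format; not the OnceDI API).
from abc import ABC, abstractmethod
from typing import Iterable

class DataSource(ABC):
    """Uniform interface that every wrapped source must expose."""
    @abstractmethod
    def query(self, field: str, value: str) -> Iterable[dict]: ...

class ListSource(DataSource):
    def __init__(self, rows: list[dict]):
        self.rows = rows
    def query(self, field, value):
        return [r for r in self.rows if r.get(field) == value]

class IntegrationService:
    """Single entry point that a web service endpoint could expose."""
    def __init__(self, sources: list[DataSource]):
        self.sources = sources
    def query(self, field, value):
        results = []
        for src in self.sources:
            results.extend(src.query(field, value))
        return results

service = IntegrationService([ListSource([{"id": "1", "name": "a"}]),
                              ListSource([{"id": "1", "name": "b"}])])
print(service.query("id", "1"))   # merged results from both sources
```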

3.
Socially aware networking (SAN) exploits the social characteristics of mobile users to streamline data dissemination protocols in opportunistic environments. Existing protocols in this area utilize various social features, such as user interests, social similarity, and community structure, to improve the performance of data dissemination. However, the interrelationship between user interests and its impact on the efficiency of data dissemination has not been explored sufficiently. In this paper, we analyze various kinds of relationships between user interests and model them using a layer-based structure in order to form social communities in the SAN paradigm. We propose Int-Tree, an interest-tree based scheme that uses the relationships between user interests to improve the performance of data dissemination. The core of Int-Tree is the interest tree, a tree-based community structure that combines two social features, i.e., the density of a community and social ties, to support data dissemination. Simulation results show that Int-Tree achieves a higher delivery ratio and lower overhead in comparison to two benchmark protocols, PROPHET and Epidemic routing. In addition, Int-Tree performs with an average of 1.36 hops and tolerable latency with respect to buffer size, time to live (TTL), and simulation duration. Finally, Int-Tree maintains stable performance across various parameters.
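The abstract does not specify how interest relationships are quantified; the sketch below groups users into communities by Jaccard similarity of their interest sets, which is only one plausible reading of the layered-interest idea, not the Int-Tree algorithm itself. User names and the threshold are hypothetical.

```python
# Sketch: forming interest-based communities by Jaccard similarity
# (hypothetical users and threshold; not the Int-Tree algorithm itself).
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def build_communities(interests: dict, threshold: float = 0.4):
    """Greedily place each user into the first community whose members
    are all sufficiently similar to it, else start a new community."""
    communities: list[list[str]] = []
    for user, ints in interests.items():
        for group in communities:
            if all(jaccard(ints, interests[m]) >= threshold for m in group):
                group.append(user)
                break
        else:
            communities.append([user])
    return communities

users = {
    "u1": {"sports", "news"},
    "u2": {"sports", "music"},
    "u3": {"news", "sports"},
}
print(build_communities(users))   # [['u1', 'u3'], ['u2']]
```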

4.
A Web Service Model for Urban Air Quality Data Processing and Analysis   (Cited by: 2; self-citations: 1; other citations: 1)
This paper examines the problems that current ambient air quality data processing and analysis techniques face when applied in large-scale, distributed, heterogeneous environments, and proposes a Web-service-based data processing and analysis model that integrates real-time data acquisition, preprocessing, storage, publication, and analysis into a single framework, satisfying requirements for openness, extensibility, and maintainability.

5.
A Markov-Chain-Based Algorithm for Mining Web Access Sequences   (Cited by: 2; self-citations: 0; other citations: 2)
Mining Web access sequences helps improve the quality of Web access, but sequence mining is a relatively difficult problem in data mining, and classical sequence algorithms generally suffer from excessive time and storage overhead. This paper proposes a Markov-chain-based algorithm for mining Web access sequences that can discover correlations among requested pages with relatively little computation. The algorithm was applied to the design of a training center's website with good results.
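The abstract gives no formal detail, so here is a minimal first-order Markov sketch: transition probabilities are estimated from logged sessions and used to predict the most likely next page. The session log is hypothetical.

```python
# First-order Markov chain over Web access sequences
# (hypothetical session log; a minimal reading of the approach).
from collections import defaultdict

sessions = [
    ["home", "news", "sports"],
    ["home", "news", "weather"],
    ["home", "sports"],
]

# Count transitions page_i -> page_{i+1}.
counts = defaultdict(lambda: defaultdict(int))
for s in sessions:
    for cur, nxt in zip(s, s[1:]):
        counts[cur][nxt] += 1

# Normalize counts into transition probabilities.
transitions = {
    cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
    for cur, nxts in counts.items()
}

def predict_next(page: str):
    """Return the most probable next page after `page`, if any."""
    nxts = transitions.get(page)
    return max(nxts, key=nxts.get) if nxts else None

print(transitions["home"])   # {'news': 0.666..., 'sports': 0.333...}
print(predict_next("news"))  # 'sports' or 'weather' (tied at 0.5)
```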

6.
In a bidirectional-lane road network, the content disseminated on different lanes is usually different. To mitigate the impact of packet loss and exploit the store-and-forward capability of moving vehicles, this paper proposes a mechanism that combines directional-antenna network coding with store-and-forward to increase the content dissemination rate. Under this mechanism, when a vehicle passes a dissemination node in the opposite lane, it stores the packets it receives from that lane; once it becomes a dissemination node itself, it forwards the packets needed by vehicles in the opposite lane while disseminating the content needed by vehicles in its own lane, thereby increasing the transmission rate of the bidirectional network. Simulation results show that, for dissemination of different content, the rate gain under this mechanism approaches that of same-content dissemination without forwarding; for dissemination of the same content, the rate gain can exceed twice that of the no-forwarding mechanism.
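The abstract does not describe the coding scheme itself; the sketch below shows the textbook XOR form of network coding that such a relay mechanism typically builds on: a relay broadcasts the XOR of two packets headed in opposite directions, and each side recovers the packet it is missing from the one it already holds. Packet contents are hypothetical.

```python
# Textbook XOR network coding at a relay vehicle
# (hypothetical packets; illustrates the coding principle, not the full protocol).
def xor_bytes(a: bytes, b: bytes) -> bytes:
    # Pad the shorter packet so both operands have equal length.
    n = max(len(a), len(b))
    a, b = a.ljust(n, b"\0"), b.ljust(n, b"\0")
    return bytes(x ^ y for x, y in zip(a, b))

pkt_eastbound = b"content for eastbound lane"
pkt_westbound = b"content for westbound lane"

# The relay broadcasts one coded packet instead of two plain ones.
coded = xor_bytes(pkt_eastbound, pkt_westbound)

# A westbound vehicle already holds the eastbound packet (overheard earlier),
# so it can recover the packet it actually needs from the single broadcast.
recovered = xor_bytes(coded, pkt_eastbound)
assert recovered.rstrip(b"\0") == pkt_westbound
print(recovered.rstrip(b"\0"))
```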

7.
8.
9.
The storage and retrieval of multimedia have become a requirement for many information systems. This paper presents a comprehensive survey of image search engines, with clarifying comments throughout. We first examine image search engine architecture, followed by the role of the crawler in detecting images. We then review the common World Wide Web based systems for image retrieval developed in research institutions and in commercial businesses. A comparative performance study of the existing engines is also presented.

10.
Visualizing processes on the web   (Cited by: 1; self-citations: 0; other citations: 0)
In this paper, we describe 3WPS, a framework for building distributed systems that can monitor and interact with a process through a 3D interface accessible via the World Wide Web (WWW). 3WPS is easily configurable and easily adaptable to different processes with high reuse of its software components, and its distributed architecture leverages off-the-shelf components of the WWW infrastructure such as Java applets and Virtual Reality Modeling Language (VRML) browsers. We describe the characteristics of the 3WPS framework by focusing mainly on the issue of programmability and by providing an example tour of its usage.

11.
12.
This is not about hackers who might steal your credit card number as it passes over the overcrowded connections of the Web, nor about stumbling over a false virus alert on your favorite newsgroup, nor even about a wily hacker penetrating your system and reading the e-mail you've saved. Instead, it is about cookies, privacy invasion through electronic product registration, the new push technology that will clog your electronic mailbox, "user-friendly" updates of programs and drivers, denial of service, and other technological goodies.

13.
Maps are one of the most valuable documents for gathering geospatial information about a region. Yet finding a collection of diverse, high-quality maps is a significant challenge because there is a dearth of content-specific metadata available to identify them among the other images on the Web. For this reason, it is desirable to analyze the content of each image. The problem is further complicated by the variations between different types of maps, such as street maps and contour maps, and by the fact that many high-quality maps are embedded within other documents such as PDF reports. In this paper, we present an automatic method to find high-quality maps for a given geographic region. Our method finds not only documents that are maps but also maps embedded within other documents. We have developed a Content-Based Image Retrieval (CBIR) approach that uses a new set of classification features designed to capture the defining characteristics of a map. This approach identifies all types of maps irrespective of their subject, scale, and color in a highly scalable and accurate way. Our classifier achieves an F1-measure of 74%, an 18% improvement over previous work in the area.

14.
Code dissemination protocols are a key technology for updating software in wireless sensor networks (WSNs) after field deployment. Existing code dissemination protocols transmit redundant code images when disseminating to specific target nodes. To address this problem, this paper proposes a Multicast-Tree-based Code Dissemination (MTCD) protocol. MTCD builds dissemination-tree paths from the base station to the target nodes, reducing the number of nodes that participate in code dissemination and thereby reducing redundant data transmission and network energy consumption. TOSSIM simulation results show that, compared with Deluge, the standard TinyOS code dissemination protocol, MTCD achieves better performance in both dissemination time and the number of packets transmitted.
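The abstract does not give the tree-construction rule; the sketch below shows one straightforward reading: compute shortest paths from the base station to each target node with BFS and take the union of those paths as the dissemination tree, so that only nodes on the tree forward the code image. The topology is hypothetical.

```python
# Sketch: a multicast dissemination tree as the union of BFS shortest paths
# from the base station to the target nodes
# (hypothetical topology; one plausible reading of the MTCD idea).
from collections import deque

def bfs_path(graph: dict, src: str, dst: str) -> list:
    """Shortest path from src to dst; assumes dst is reachable."""
    parents, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nb in graph[node]:
            if nb not in parents:
                parents[nb] = node
                queue.append(nb)
    # Walk back from dst to src to recover the path.
    path, cur = [], dst
    while cur is not None:
        path.append(cur)
        cur = parents[cur]
    return path[::-1]

topology = {
    "BS": ["A", "B"], "A": ["BS", "C"], "B": ["BS", "D"],
    "C": ["A", "T1"], "D": ["B"], "T1": ["C"],
}
targets = ["T1"]

# Only nodes on the union of the paths participate in dissemination.
tree_nodes = set()
for t in targets:
    tree_nodes.update(bfs_path(topology, "BS", t))
print(tree_nodes)   # {'BS', 'A', 'C', 'T1'} -- B and D do not forward
```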

15.
External information search behaviour has long been of interest to consumer researchers. Experimental and post hoc survey research methodologies have typically used a large number of variables to record search activity. However, as these are usually considered in aggregate, the researcher has little opportunity to gain an overview of a consumer's search style. To date, diagrammatic illustration of search behaviour has been limited to experimental environments in which the available information was strictly bounded, for example, within databases or when information display boards have been used. This paper, which focuses largely on inter-site World Wide Web (WWW) search behaviour, discusses web search paradigms and the variables used to capture WWW search. It also provides a conceptual framework for representing external information search behaviour in diagrammatic form. The technique offers researchers an opportunity to interpret information search data and search styles holistically. The benefits include the identification of particular search styles, more precise interpretation of numeric web search activity data, and potential application in training web users to improve their search effectiveness.

16.
In the emerging field of the Internet of Things (IoT), Wireless Sensor Networks (WSNs) have a key role to play in sensing and collecting measurements of the surrounding environment. When deploying large-scale observation systems in remote areas without a permanent connection to the Internet, WSNs call for replication and distributed storage techniques that increase the amount of data stored within the WSN and reduce the probability of data loss. Unlike conventional network data storage, WSN-based distributed storage is constrained by the limited resources of the sensors. In this paper, we propose a low-complexity distributed data replication mechanism to increase the resilience of WSN-based distributed storage at large scale. In particular, we propose a simple yet accurate analytical modeling framework and an extensive simulation campaign, which complement experimental results on the SensLab testbed. The impact of several key parameters on the system performance is investigated.
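The paper's actual replication mechanism and analytical model are not given in the abstract; the Monte Carlo sketch below only illustrates the underlying trade-off: replicating each datum on k randomly chosen nodes reduces the probability that all copies are lost when nodes fail independently. All parameters are hypothetical.

```python
# Monte Carlo sketch of the replication/resilience trade-off
# (hypothetical parameters; not the paper's analytical model).
import random

def data_loss_probability(n_nodes=100, k_replicas=3,
                          p_node_failure=0.2, trials=10_000):
    """Estimate the probability that every replica of a datum is lost
    when each node fails independently with probability p_node_failure."""
    losses = 0
    for _ in range(trials):
        holders = random.sample(range(n_nodes), k_replicas)
        if all(random.random() < p_node_failure for _ in holders):
            losses += 1
    return losses / trials

for k in (1, 2, 3, 4):
    print(k, data_loss_probability(k_replicas=k))
# Loss probability drops as p_node_failure ** k (0.2, 0.04, 0.008, ...).
```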

17.
We present the web application 'cplint on SWISH', which allows the user to write Probabilistic Logic Programs and submit the computation of the probability of queries from a web browser. The application is based on SWISH (SWI-Prolog for SHaring), a web framework for Logic Programming. SWISH builds on various features and packages of SWI-Prolog, in particular its web server and its Pengines library, which allow the creation of remote Prolog engines and the posing of queries to them. In order to develop the web application, we started from the PITA system, which is included in cplint, a suite of programs for reasoning over Logic Programs with Annotated Disjunctions, and ported PITA to SWI-Prolog. Moreover, we modified the PITA library so that it can be executed in a multi-threaded environment. Developing 'cplint on SWISH' also required modifying the JavaScript SWISH code that creates and queries Pengines. 'cplint on SWISH' includes a number of examples that cover a wide range of domains and provide interesting applications of Probabilistic Logic Programming. By providing a web interface to cplint, we allow users to experiment with Probabilistic Logic Programming without having to install a system, a procedure that is often complex, error prone, and limited mainly to the Linux platform. In this way, we aim to reach a wider audience and popularize Probabilistic Logic Programming. Copyright © 2015 John Wiley & Sons, Ltd.
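cplint itself is a Prolog system; to keep this page's sketches in one language, the Python snippet below only illustrates the distribution semantics that cplint computes: the probability of a query is the sum of the probabilities of the possible worlds (truth assignments to independent probabilistic facts) in which the query holds. The example facts and query are hypothetical, not taken from the cplint distribution.

```python
# Sketch of the distribution semantics behind Probabilistic Logic Programming:
# enumerate the worlds of independent probabilistic facts and sum the
# probabilities of those in which the query succeeds.
# (Hypothetical program; cplint/PITA compute this far more efficiently.)
from itertools import product

# Probabilistic facts: name -> probability of being true.
facts = {"heads(coin1)": 0.5, "burglary": 0.1}

def query_holds(world: dict) -> bool:
    # Hypothetical rule: alarm :- burglary.   Query: alarm.
    return world["burglary"]

prob = 0.0
for assignment in product([True, False], repeat=len(facts)):
    world = dict(zip(facts, assignment))
    w_prob = 1.0
    for name, truth in world.items():
        w_prob *= facts[name] if truth else 1 - facts[name]
    if query_holds(world):
        prob += w_prob

print(prob)   # 0.1 -- P(alarm) equals P(burglary) in this toy program
```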

18.
To date, most of the focus regarding digital preservation has been on replicating copies of the resources to be preserved from the "living web" and placing them in an archive for controlled curation. Once inside an archive, the resources are subject to careful processes of refreshing (making additional copies on new media) and migrating (conversion to new formats and applications). For small numbers of resources of known value, this is a practical and worthwhile approach to digital preservation. However, due to the infrastructure costs (storage, networks, machines) and, more importantly, the human management costs, this approach is unsuitable for web-scale preservation. The result is that difficult decisions need to be made as to what is saved and what is not. We provide an overview of our ongoing research projects that focus on using the "web infrastructure" to provide preservation capabilities for web pages, and we examine the overlap these approaches have with the field of information retrieval. The common characteristic of the projects is that they creatively employ the web infrastructure to provide shallow but broad preservation capability for all web pages. These approaches are not intended to replace conventional archiving approaches; rather, they focus on providing at least some form of archival capability for the mass of web pages that may prove to have value in the future. We characterize the preservation approaches by the level of effort required of the web administrator: web sites are reconstructed from the caches of search engines ("lazy preservation"); lexical signatures are used to find the same or similar pages elsewhere on the web ("just-in-time preservation"); resources are pushed to other sites using NNTP newsgroups and SMTP email attachments ("shared infrastructure preservation"); and an Apache module is used to provide OAI-PMH access to MPEG-21 DIDL representations of web pages ("web server enhanced preservation").
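The abstract mentions lexical signatures for "just-in-time preservation" without defining them; a common formulation in this line of work is the handful of highest-scoring TF-IDF terms of a page, which can then be fed to a search engine to relocate the page. The sketch below computes such a signature over a hypothetical toy corpus.

```python
# Sketch: a lexical signature as the top TF-IDF terms of a page
# (toy corpus and page text are hypothetical).
import math
import re
from collections import Counter

corpus = [
    "digital preservation of web pages using the web infrastructure",
    "search engines cache copies of web pages",
    "lexical signatures help relocate missing web pages",
]

def tokens(text: str):
    return re.findall(r"[a-z]+", text.lower())

def lexical_signature(page: str, docs: list, k: int = 5):
    tf = Counter(tokens(page))
    n_docs = len(docs)
    def idf(term):
        df = sum(term in tokens(doc) for doc in docs)
        return math.log((n_docs + 1) / (df + 1)) + 1   # smoothed IDF
    scores = {t: c * idf(t) for t, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(lexical_signature(corpus[2], corpus))
# ['lexical', 'signatures', 'help', 'relocate', 'missing']
```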

19.
20.
To implement dynamic Web functionality in an embedded Web server, this paper discusses the CGI and SSI techniques suited to embedded environments and, after analyzing their respective strengths and weaknesses, proposes an "extended CGI" solution that combines CGI with SSI. The method overcomes the poor maintainability of the plain CGI approach and reduces the server's overhead in generating dynamic pages. Based on this extended CGI design, a concrete implementation of the embedded Web server is presented. Experimental results demonstrate the feasibility and effectiveness of the design, which features a clear structure, reasonable use of embedded system resources, and ease of development and maintenance.
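The paper's concrete "extended CGI" interface is not given in the abstract; the sketch below only illustrates the combined idea in plain Python: an SSI-style template whose placeholders are filled by small CGI-like handler functions, so the static layout stays in the page and only the dynamic values are computed per request. The placeholder syntax and handlers are hypothetical.

```python
# Sketch of an SSI-plus-CGI style dynamic page: a static template with
# placeholders resolved by small handler functions per request
# (hypothetical placeholder syntax and handlers, not the paper's design).
import re
import time

TEMPLATE = """<html><body>
<h1>Device status</h1>
<p>Time: <!--#exec handler="now" --></p>
<p>Uptime: <!--#exec handler="uptime" --> s</p>
</body></html>"""

HANDLERS = {
    "now": lambda: time.strftime("%Y-%m-%d %H:%M:%S"),
    "uptime": lambda: f"{time.monotonic():.0f}",
}

def render(template: str) -> str:
    """Replace each SSI-like directive with its handler's output."""
    def substitute(match):
        handler = HANDLERS.get(match.group(1), lambda: "")
        return handler()
    return re.sub(r'<!--#exec handler="(\w+)" -->', substitute, template)

if __name__ == "__main__":
    print(render(TEMPLATE))
```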
