20 similar documents found (search time: 0 ms)
To date, most of the focus regarding digital preservation has been on replicating copies of the resources to be preserved from the “living web” and placing them in an archive for controlled curation. Once inside an archive, the resources are subject to careful processes of refreshing (making additional copies to new media) and migrating (conversion to new formats and applications). For small numbers of resources of known value, this is a practical and worthwhile approach to digital preservation. However, due to the infrastructure costs (storage, networks, machines) and, more importantly, the human management costs, this approach is unsuitable for web-scale preservation. The result is that difficult decisions need to be made as to what is saved and what is not. We provide an overview of our ongoing research projects that focus on using the “web infrastructure” to provide preservation capabilities for web pages, and we examine the overlap these approaches have with the field of information retrieval. The common characteristic of the projects is that they creatively employ the web infrastructure to provide shallow but broad preservation capability for all web pages. These approaches are not intended to replace conventional archiving approaches; rather, they focus on providing at least some form of archival capability for the mass of web pages that may prove to have value in the future. We characterize the preservation approaches by the level of effort required of the web administrator: web sites are reconstructed from the caches of search engines (“lazy preservation”); lexical signatures are used to find the same or similar pages elsewhere on the web (“just-in-time preservation”); resources are pushed to other sites using NNTP newsgroups and SMTP email attachments (“shared infrastructure preservation”); and an Apache module is used to provide OAI-PMH access to MPEG-21 DIDL representations of web pages (“web server enhanced preservation”).
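
The lexical signatures mentioned in this abstract can be illustrated with a small sketch. It assumes a signature is simply the handful of terms in a page with the highest TF-IDF weight, which is a common choice but is not confirmed by the abstract. The function names, the tokenizer and the default signature length `k` are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page, corpus, k=5):
    """Return the k terms of `page` with the highest TF-IDF weight.

    `corpus` is a list of other documents, used only to estimate how
    common each term is. TF-IDF is an assumed weighting, not taken
    from the abstract.
    """
    docs = [set(tokenize(doc)) for doc in corpus]
    term_counts = Counter(tokenize(page))
    n_docs = len(docs) + 1  # the page itself counts as one document
    scores = {}
    for term, count in term_counts.items():
        doc_freq = 1 + sum(term in doc for doc in docs)
        scores[term] = count * math.log(n_docs / doc_freq)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]
```

The resulting terms could then be submitted as a search engine query to look for the same or a similar page elsewhere on the web. Terms that occur in every document get a weight of zero and so fall to the bottom of the ranking.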


Aesthetics and preferences of web pages   Cited by: 3 (self-citations: 0, citations by others: 3)
The first impressions of web pages presented to users were investigated using 13 different web pages, three types of scales and 18 participants. Multidimensional analysis of similarity and preference judgements found four important dimensions: beauty, mostly illustrations versus mostly text, overview and structure. Category scales indicated the existence of two factors related to formal aspects and to appeal of the objects, respectively. The best predictor for the overall judgement of the category scales was beauty. Property vector fitting of the multidimensional solutions with the category scales further indicated the importance of beauty for the preference space. Aspects of usability, product design and aesthetics are discussed.

Although caching has been shown to be an efficient technique for reducing the delay in generating web pages to meet page requests from web users, it becomes less effective if the pages are dynamic and contain dynamic content. In this paper, instead of using caching, we study the effectiveness of pre-fetching for resolving the problems in handling dynamic web pages. Pre-fetching is a proactive caching scheme, since a page is cached before any request for it is received. In addition to the problem of which pages to pre-fetch, an equally important question is when to perform the pre-fetching. To resolve the prediction and timing problems, we explore the temporal properties of dynamic web pages and the timing issues in accessing them to determine which pages to pre-fetch and the best time to pre-fetch them so as to maximize the cache hit probability of the pre-fetched pages. If valid copies of the required pages can be found in the cache, the response times of the requests can be greatly reduced. The proposed scheme is called temporal pre-fetching (TPF), in which we prioritize pre-fetching requests based on the predicted usability of the pages to be pre-fetched. To minimize the impact of incorrect predictions on the processing of on-demand page requests, a qualifying examination is performed to remove unnecessary and low-usability pre-fetching requests while they are waiting to be processed and just before their processing. We have implemented the proposed TPF scheme in a web server system, and experiments have been performed to study its performance characteristics compared with a conventional cache-only scheme using a benchmark auction application under different system and application settings. As shown in the experimental results, the overall system performance, i.e., response time, is improved as more page requests can be served immediately using pre-fetched pages.
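
The two ideas in this abstract, prioritizing pre-fetch requests by predicted usability and re-checking them before processing, can be sketched as a priority queue. This is an illustration only, not the TPF implementation. The class name, the single `min_usability` threshold and the way usability is re-estimated are all assumptions, since the abstract does not define them.

```python
import heapq


class PrefetchQueue:
    """Pending pre-fetch requests, served in order of predicted usability."""

    def __init__(self, min_usability):
        # Hypothetical cut-off below which a request is not worth processing.
        self.min_usability = min_usability
        self._heap = []  # entries are (-usability, page); heapq is a min-heap

    def submit(self, page, usability):
        """Queue a request unless its predicted usability is already too low."""
        if usability >= self.min_usability:
            heapq.heappush(self._heap, (-usability, page))

    def next_page(self, current_usability):
        """Pop the best request that still qualifies just before processing.

        `current_usability` is a callable mapping a page to its
        re-estimated usability; stale requests are silently dropped.
        """
        while self._heap:
            _, page = heapq.heappop(self._heap)
            if current_usability(page) >= self.min_usability:
                return page
        return None
```

Re-checking at pop time stands in for the qualifying examination "just before processing". A fuller version would also need to purge requests periodically while they wait in the queue.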

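Online real-time web page recommendation is an important part of personalized web services. Real-time recommendation based on Web logs helps improve retrieval efficiency, relieve network congestion and make a website more attractive. The algorithm converts user sessions into numeric sequences that reflect the order of each user's access path, and uses a dynamic matrix and a sliding window to simplify the computation of path similarity; it scans the log only once, which further improves real-time response speed. Taking into account the order of user access paths and a simplified recommendation mechanism, it introduces the concepts of session encoding and the dynamic matrix, which allow recommendation to balance satisfaction and real-time performance, while real-time performance is unaffected by log growth. Experimental results show that the algorithm greatly improves the real-time performance of recommendation while maintaining satisfaction, and that it has practical application value.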

Research on web page deduplication methods   Cited by: 2 (self-citations: 1, citations by others: 1)
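Duplicate web pages returned by search engines not only waste storage resources but also increase the burden on users browsing the results. Based on the characteristics of duplicate web pages, a semantics-based deduplication method is proposed. The method uses the position of sentences in the text and the importance of chunks to extract a topic-sentence vector from the main text of each web page, then computes the semantic similarity between topic-sentence vectors and removes duplicate pages. Experiments show that the method detects both fully and partially duplicated web pages fairly accurately.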

In this paper we present a graphical software system that provides automatic support for the extraction of information from web pages. The underlying extraction technique exploits the visual appearance of the information in the document, and is driven by the spatial relations occurring among the elements in the page. However, the usual information extraction modalities based on the web page structure can be used in our framework, too. The technique has been integrated within the Spatial Relation Query (SRQ) tool. The tool is provided with a graphical front-end which allows one to define and manage a library of spatial relations, and to use an SQL-like language for composing queries driven by these relations and by further semantic and graphical attributes.

Users' visual attention, measured by eye-tracking fixations, was investigated in web pages with different designs. Browsing and search conditions were tested. Layout structure influenced attention, with fixation densities being concentrated in upper parts of pages according to the layout structure. In sites with open graphical layouts, animations and images dominated attention. In the search condition, attention patterns focused on salient objects and information-scent components leading to the targets. Based on the results, a model of structured directed visual attention was proposed and implemented in the Web Page Analyser tool (WPA) to predict heat maps of visual attention. Validation of the tool demonstrated good accuracy in browse and search modes.

In the era of ubiquitous computing, applications are emerging that benefit from using devices of different users and different capabilities together. This paper focuses on user-centric web browsing using multiple devices, where the content of a web page is partitioned, adapted and allocated to devices in the vicinity. We contribute two novel web page partitioning algorithms. They differ from existing approaches by allowing for both automatic and semi-automatic partitioning. On the one hand, this provides good automatic, web page independent results by utilizing sophisticated structural pre- and postprocessing of the web page. On the other hand, these results can be improved by considering additional semantic information provided through user-generated web page annotations. We further present a performance evaluation of our algorithms. Moreover, we contribute the results of a user study. These clearly show that (1) our algorithms provide good automatic results and (2) the application of user-centric, annotation-based semantic information leads to significantly higher user satisfaction.

Ranking web pages so as to present the pages most relevant to a user's query is one of the main issues in any search engine. In this paper, two new ranking algorithms are proposed using Reinforcement Learning (RL) concepts. RL is a powerful technique of modern artificial intelligence that tunes an agent's parameters interactively. In the first step, by formulating ranking as an RL problem, a new connectivity-based ranking algorithm, called RL_Rank, is proposed. In RL_Rank, the agent is considered as a surfer who travels between web pages by clicking randomly on a link in the current page. Each web page is considered as a state, and the value function of a state is used to determine the score of that state (page). The reward corresponds to the number of out-links from the current page. Rank scores in RL_Rank are computed recursively, and the convergence of these scores is proved. In the next step, we introduce a new hybrid approach that combines BM25, a content-based algorithm, with RL_Rank. Both proposed algorithms are evaluated on well-known benchmark datasets and analyzed according to the relevant criteria. Experimental results show that using RL concepts leads to significant improvements in ranking algorithms.
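
The state-value view of ranking described in this abstract can be illustrated with ordinary iterative policy evaluation for a surfer who follows out-links uniformly at random. This is a sketch under stated assumptions, not the RL_Rank update rule, which the abstract does not give. In particular, the reward of 1 / out-degree, the discount factor `gamma` and the function name are hypothetical choices made only so the example runs.

```python
def evaluate_random_surfer(out_links, gamma=0.85, tol=1e-10, max_iter=10_000):
    """Iterative policy evaluation for a uniformly random surfer.

    `out_links` maps each page to the list of pages it links to.
    Assumed reward: 1 / out-degree of the current page; pages without
    out-links are treated as terminal states with value 0.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    value = {page: 0.0 for page in pages}
    for _ in range(max_iter):
        new_value = {}
        for page in pages:
            targets = out_links.get(page, [])
            if not targets:
                new_value[page] = 0.0
                continue
            reward = 1.0 / len(targets)
            # Expected value of the next page under a uniform random click.
            expected_next = sum(value[t] for t in targets) / len(targets)
            new_value[page] = reward + gamma * expected_next
        delta = max(abs(new_value[p] - value[p]) for p in pages)
        value = new_value
        if delta < tol:  # the update is a contraction for gamma < 1
            break
    return value
```

Because each sweep is a contraction when `gamma` is below 1, the scores converge to a unique fixed point regardless of the starting values. The abstract reports a convergence proof for RL_Rank; this sketch shows only the generic reason such recursive scores can settle.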

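Search engine optimization for English-language websites and overseas promotion strategies   Cited by: 2 (self-citations: 0, citations by others: 2)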
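This article introduces the basics of search engines, optimization techniques and strategies for promoting websites overseas, including the definition of a search engine, how search engines work, several main methods of website optimization, and overseas promotion strategies. Important search engine technologies are also described in detail.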

When users need to find something on the Web that is related to a place, chances are place names will be submitted along with some other keywords to a search engine. However, automatic recognition of geographic characteristics embedded in Web documents, which would allow for a better connection between documents and places, remains a difficult task. We propose an ontology-driven approach to facilitate the process of recognizing, extracting, and geocoding partial or complete references to places embedded in text. Our approach combines an extraction ontology with urban gazetteers and geocoding techniques. This ontology, called OnLocus, is used to guide the discovery of geospatial evidence from the contents of Web pages. We show that addresses and positioning expressions, along with fragments such as postal codes or telephone area codes, provide satisfactory support for local search applications, since they are able to determine approximations to the physical location of services and activities named within Web pages. Our experiments show the feasibility of performing automated address extraction and geocoding to identify locations associated with Web pages. Combining location identifiers with basic addresses improved the precision of extractions and reduced the number of false positive results.

A clustering method based on Web page links and tags   Cited by: 1 (self-citations: 0, citations by others: 1)
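To address the low efficiency and accuracy of current Web clustering, a clustering method based on Web page link structure and tag information, CWPBLT (clustering web pages based on their links and tags), is proposed. It compares the similarity between pages by analyzing the link structure and important tag information in Web pages, and thereby clusters the Web pages of a Web site; the clustering process takes into account both the Web page structure and the content information provided by page tags. Experimental results show that the method effectively improves the time efficiency and accuracy of clustering, and that it improves on earlier clustering methods based only on page topic content or page structure.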

This paper presents a novel watermarking scheme for tamper-proofing web pages. It improves on existing watermarking and hash-based methods in that it does not increase the file size. Experimental results are promising.

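With the rapid growth of online information, information processing has become an indispensable tool for obtaining useful information, and automatic text classification systems are an important research direction in information processing. Feature selection algorithms, a key technology in text classification, are examined, and the feature weighting algorithm and the mutual information algorithm are improved by taking the characteristics of web pages into account. Experimental results show that the improved algorithms are feasible.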

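The extreme learning machine (ELM), unlike traditional neural network learning algorithms such as BP, is an efficient learning algorithm for single-hidden-layer feedforward neural networks (SLFNs). ELM is applied to the task of Chinese web page classification. Chinese web pages are preprocessed and their feature information is extracted to form a web page feature tree, from which fixed-length codes are generated as input data for the ELM. Experimental results show that the method classifies web pages effectively.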

We present in this paper a model for indexing and querying web pages, based on the hierarchical decomposition of pages into blocks. Splitting up a page into blocks has several advantages in terms of page design, indexing and querying: (i) the blocks of a page most similar to a query may be returned instead of the page as a whole; (ii) the importance of a block can be taken into account; and (iii) so can the permeability of the blocks to neighbor blocks: a block b is said to be permeable to a block b′ in the same page if the content of b′ (text, image, etc.) can be (partially) inherited by b upon indexing. An engine implementing this model is described, including the transformation of web pages into block hierarchies, the definition of a dedicated language to express indexing rules, and the storage of indexed blocks in an XML repository. The model is assessed on a dataset of electronic news and on a dataset drawn from web pages of the ImagEval campaign, where it improves the mean average precision of the baseline by 16%.
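
The notion of permeability in this abstract can be made concrete with a small sketch in which a block inherits a fraction of a neighbor's terms when it is indexed. The linear weighting, the `permeability` value in [0, 1] and the function name are assumptions made for illustration. The abstract does not specify how inherited content is weighted, and the paper's indexing-rule language is not reproduced here.

```python
from collections import Counter


def index_block(own_terms, neighbor_terms, permeability):
    """Return term weights for a block that partially inherits a neighbor's content.

    Each of the block's own terms contributes 1.0 per occurrence; each
    neighbor term contributes `permeability` per occurrence (assumed rule).
    """
    if not 0.0 <= permeability <= 1.0:
        raise ValueError("permeability must be between 0 and 1")
    weights = Counter()
    for term in own_terms:
        weights[term] += 1.0
    if permeability > 0.0:  # an impermeable block inherits nothing
        for term in neighbor_terms:
            weights[term] += permeability
    return dict(weights)
```

For instance, an image block with few terms of its own could inherit part of the text of an adjacent caption block, so that a textual query can still match the image block.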

As technology advances, people seek relevant and optimal results from the web through search engines. Retrieval performance can often be improved using several algorithms and methods. The abundance of content on the web has driven the need for better search systems, and categorizing web pages helps considerably in addressing this issue. The anatomy of web pages, links, text categorization and their relations have received increasing emphasis over time. Search engines perform critical analysis using several inputs for one or more keywords to obtain quality results in the shortest possible time. Categorization is mostly done by separating the content using the web link structure. We estimated two different page weights for a web page, (a) Page Retaining Weight (PRW) and (b) Page Forwarding Weight (PFW), and grouped pages by these weights for categorization. Using these experimental results, we classified the web pages into four groups: (A) simple, (B) axis-shifted, (C) fluctuated and (D) oscillating types. Such categorization can improve the performance of search engines and also contributes to web modeling studies.
