Similar Documents
20 similar documents found (search took 15 ms)
1.
2.
This paper investigates feature selection for genre-based classification of Chinese web pages. Lexical features are obtained by combining automatic extraction with manual induction. By improving the PAT-tree storage structure, sequence mining is used to extract frequent-string features, freeing the text classification system from its dependence on word segmentation and dictionaries; a fuzzy string-pattern feature representation is also proposed. In addition, the feature set incorporates formal (layout) features of the text and, reflecting the characteristics of web pages, link-information features. A genre-based Chinese web page classification system was implemented, and the results show that classification performance is effectively improved.
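The abstract above mines frequent character strings with a modified PAT tree so that classification needs neither word segmentation nor a dictionary. The PAT-tree mining and the fuzzy string patterns are not reproduced here; as a rough, hypothetical illustration of segmentation-free features, the sketch below simply collects frequent character n-grams (all names and thresholds are invented).

```python
from collections import Counter

def frequent_char_ngrams(docs, n_range=(2, 4), min_df=3):
    """Collect character n-grams that occur in at least `min_df` documents.

    A hypothetical stand-in for the PAT-tree frequent-string mining described
    in the abstract: no word segmentation or dictionary is required, because
    features are raw character substrings.
    """
    doc_freq = Counter()
    for text in docs:
        seen = set()
        for n in range(n_range[0], n_range[1] + 1):
            for i in range(len(text) - n + 1):
                seen.add(text[i:i + n])
        doc_freq.update(seen)
    return {g for g, df in doc_freq.items() if df >= min_df}

def vectorize(text, features):
    """Bag-of-substrings representation over the mined feature set."""
    return {g: text.count(g) for g in features if g in text}
```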

3.
Recently, genre collection and automatic genre identification for the web have attracted much attention. However, currently there is no genre-annotated corpus of web pages where inter-annotator reliability has been established, i.e. the corpora are either not tested for inter-annotator reliability or exhibit low inter-coder agreement. Annotation has also mostly been carried out by a small number of experts, leading to concerns with regard to scalability of these annotation efforts and transferability of the schemes to annotators outside these small expert groups. In this paper, we tackle these problems by using crowd-sourcing for genre annotation, leading to the Leeds Web Genre Corpus, the first web corpus which is demonstrably reliably annotated for genre and which can be easily and cost-effectively expanded using naive annotators. We also show that the corpus is source and topic diverse.

4.
An entropy-based link analysis method in Web structure mining
Wang Yong, Yang Huaqian, Li Jianfu. Computer Engineering and Design, 2006, 27(9): 1622-1624, 1688
In Web structure mining, the traditional HITS (Hyperlink-Induced Topic Search) algorithm is widely used to find the authority and hub pages among the pages returned by a search engine. However, besides valuable page content, websites contain many links that are unrelated to the page content, such as advertisements and navigation links. Because of these links, applying HITS can give some advertising or otherwise irrelevant pages high authority and hub values. To address this problem, the concept of Shannon information entropy is introduced on top of the original HITS algorithm, and an entropy-based link analysis method is proposed for mining web page structure. The core idea of the algorithm is to use information entropy to represent the knowledge implicit in anchor (link) text.
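The abstract does not give the exact weighting formulas, so the following is only a hedged sketch of the general idea: weight each link by the Shannon entropy of its anchor-text terms and run the usual HITS hub/authority iteration over the weighted graph. The data structures (`links`, `anchor_terms`) and the L2 normalization are assumptions, not the paper's specification.

```python
import math
from collections import Counter

def anchor_entropy(terms):
    """Shannon entropy H = -sum(p * log2 p) over the anchor-text term distribution."""
    total = len(terms)
    if total == 0:
        return 0.0
    counts = Counter(terms)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def weighted_hits(links, anchor_terms, iterations=50):
    """HITS where each edge (u, v) carries an entropy weight instead of 1.

    links: dict u -> list of target pages v
    anchor_terms: dict (u, v) -> list of words in the link's anchor text
    """
    pages = set(links) | {v for vs in links.values() for v in vs}
    weight = {(u, v): anchor_entropy(anchor_terms.get((u, v), []))
              for u in links for v in links[u]}
    auth = dict.fromkeys(pages, 1.0)
    hub = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        auth = {p: sum(hub[u] * weight[(u, p)]
                       for u in links if p in links[u]) for p in pages}
        hub = {p: sum(auth[v] * weight[(p, v)]
                      for v in links.get(p, [])) for p in pages}
        for scores in (auth, hub):          # L2-normalize to keep values bounded
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub
```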

5.
The limited display size of current small Internet devices is becoming a serious obstacle to information access. In this paper, we introduce a Document REpresentation for Scalable Structures (DRESS) to help information providers make composite documents, typically web pages, scalable in both logic and layout structure to support effective information acquisition in heterogeneous environments. Through this novel document representation structure based on binary slicing trees, the document can dynamically adapt its presentation according to display sizes by maximizing the information throughput to users. We discuss the details of this structure with its key attributes. An automatic approach for generating this structure for existing web pages is also presented. A branch-and-bound algorithm and a capacity ratio-based slicing method are proposed to select proper content representation and aesthetic document layouts respectively. A set of user study experiments has been carried out and the results show that compared with the thumbnail-based approach, the DRESS-based interface can reduce browsing time by 23.5%. This work was performed when the second and the third authors were visiting students at Microsoft Research Asia.
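The DRESS representation is considerably richer than this, but the binary slicing tree at its core can be illustrated with a toy sketch: each internal node cuts its rectangle horizontally or vertically at some ratio, so a layout for any display size follows by recursive subdivision. The node fields and the example page below are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SliceNode:
    # Leaf: a content block identified by name. Internal: a cut direction and ratio.
    name: Optional[str] = None
    direction: Optional[str] = None      # 'h' or 'v'
    ratio: float = 0.5                   # share given to the first child
    left: Optional["SliceNode"] = None
    right: Optional["SliceNode"] = None

def layout(node, x, y, w, h, out):
    """Recursively assign a rectangle (x, y, w, h) to every leaf block."""
    if node.name is not None:
        out[node.name] = (x, y, w, h)
        return out
    if node.direction == 'v':            # vertical cut: split the width
        lw = w * node.ratio
        layout(node.left, x, y, lw, h, out)
        layout(node.right, x + lw, y, w - lw, h, out)
    else:                                # horizontal cut: split the height
        lh = h * node.ratio
        layout(node.left, x, y, w, lh, out)
        layout(node.right, x, y + lh, w, h - lh, out)
    return out

# A page with a banner above two columns, laid out for a 320x240 screen.
tree = SliceNode(direction='h', ratio=0.2,
                 left=SliceNode(name='banner'),
                 right=SliceNode(direction='v', ratio=0.6,
                                 left=SliceNode(name='main'),
                                 right=SliceNode(name='sidebar')))
print(layout(tree, 0, 0, 320, 240, {}))
```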

6.
Web prefetching is a technique aimed at reducing user-perceived latencies in the World Wide Web. The spatial locality shown by user accesses makes it possible to predict future accesses from the previous ones. A prefetching engine uses these predictions to prefetch web objects before the user demands them. The existing prediction algorithms achieved an acceptable performance when they were proposed, but the high increase in the number of embedded objects per page has reduced their effectiveness in the current web. In this paper, we show that most of the predictions made by the existing algorithms are not useful to reduce the user-perceived latency because these algorithms do not take into account the structure of the current web pages, i.e., an HTML object with several embedded objects. Thus, they predict the accesses to the embedded objects in an HTML after reading the HTML itself. For this reason, the prediction is not made early enough to prefetch the objects and, therefore, there is no latency reduction. In this paper we present the double dependency graph (DDG) algorithm that distinguishes between container objects (HTML) and embedded objects to create a new prediction model according to the structure of the current web. Results show that, for the same number of extra requests to the server, DDG reduces the perceived latency, on average, 40% more than the existing algorithms. Moreover, DDG distributes latency reductions more homogeneously among users.
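As a hedged sketch of the container/embedded distinction (the arc types and counters below are assumptions, not the paper's exact DDG construction): learn arcs from one container page to the next container seen in the same session and to that container's embedded objects, so that after a container access both the likely next HTML and its embedded objects can be predicted, and prefetched, at once.

```python
from collections import defaultdict

class DDGPredictor:
    """Rough sketch of a double-dependency-graph style predictor.

    Primary arcs:   container -> next container seen in the same user session.
    Secondary arcs: container -> embedded objects of that next container.
    """
    def __init__(self):
        self.primary = defaultdict(lambda: defaultdict(int))
        self.secondary = defaultdict(lambda: defaultdict(int))

    def train(self, sessions):
        # Each session is a list of (html_page, [embedded_objects]) accesses.
        for session in sessions:
            for (prev, _), (nxt, embedded) in zip(session, session[1:]):
                self.primary[prev][nxt] += 1
                for obj in embedded:
                    self.secondary[prev][obj] += 1

    def predict(self, page, top=5):
        # Predict the next containers and their embedded objects right away,
        # so prefetching can start before the next HTML is even parsed.
        candidates = dict(self.primary[page])
        candidates.update(self.secondary[page])
        return sorted(candidates, key=candidates.get, reverse=True)[:top]
```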

7.
The objective of this analysis is to describe the characteristics of a sample of fifty web sites in order to define the basic graphic components they used, the frequency of their use and some general trends in web design. The comparison between sites also gives information on the different structures that are being used to present information. The characteristics analysed were the graphic organisation of the page, the elements to support navigation and the structure of information. The analysis identified graphic patterns (e.g. the use of specific text settings), navigation patterns (e.g. specific methods for manipulating documents) and information structures of sites, suggesting that some conventions are beginning to emerge.

8.
Tamper-proofing of web pages is of great importance. Some watermarking schemes have been reported to solve this problem. However, both these watermarking schemes and the traditional hash methods have the drawback of increasing file size. In this paper, we propose a novel watermarking scheme for tamper-proofing web pages that avoids this drawback. For a web page, the proposed scheme generates watermarks based on the principal component analysis (PCA) technique. The watermarks are then embedded into the web page through the upper and lower cases of letters in HTML tags. When a watermarked web page is tampered with, the extracted watermarks can detect the modifications to the web page, so the tampered page can be kept from being published. Extensive experiments are performed on the proposed scheme and the results show that it can be a feasible and efficient tool for tamper-proofing web pages.
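The case-flipping embedding can be illustrated directly, since HTML tag names are case-insensitive and a change of case does not affect rendering. The PCA-based watermark generation is not reproduced below; the sketch embeds an arbitrary bit string and is purely illustrative.

```python
import re

TAG_NAME = re.compile(r'(</?)([a-zA-Z][a-zA-Z0-9]*)')

def embed_bits(html, bits):
    """Hide one bit per tag-name letter: '1' -> upper case, '0' -> lower case.

    The watermark `bits` here is an arbitrary bit string; the paper derives it
    from the page content with PCA, which this sketch does not reproduce.
    """
    it = iter(bits)

    def recase(match):
        prefix, name = match.group(1), match.group(2)
        out = []
        for ch in name:
            bit = next(it, None)
            if bit is None:
                out.append(ch)
            else:
                out.append(ch.upper() if bit == '1' else ch.lower())
        return prefix + ''.join(out)

    return TAG_NAME.sub(recase, html)

def extract_bits(html, length):
    """Recover the first `length` embedded bits from tag-name letter cases."""
    bits = []
    for m in TAG_NAME.finditer(html):
        for ch in m.group(2):
            bits.append('1' if ch.isupper() else '0')
            if len(bits) == length:
                return ''.join(bits)
    return ''.join(bits)

marked = embed_bits("<html><body><p>hello</p></body></html>", "101100")
print(marked, extract_bits(marked, 6))
```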

9.
Traditional collaborative filtering (CF) recommender systems based on user similarity often suffer from low accuracy because of the difficulty in finding similar users. Incorporating a trust network into a CF-based recommender system is an attractive approach to resolve the neighbor selection problem. Most existing trust-based CF methods assume that underlying relationships (whether inferred or pre-existing) can be described and reasoned about in a web of trust. However, in online sharing communities or e-commerce sites, a web of trust is not always available and is typically sparse. The limited and sparse web of trust strongly affects the quality of recommendation. In this paper, we propose a novel method that establishes and exploits a two-faceted web of trust on the basis of users’ personal activities and relationship networks in online sharing communities or e-commerce sites, to provide enhanced-quality recommendations. The developed web of trust consists of interest similarity graphs and directed trust graphs and mitigates the sparsity of the web of trust. Moreover, the proposed method captures the temporal nature of trust and interest by dynamically updating the two-faceted web of trust. Furthermore, this method adapts to the differences in user rating scales by using a modified Resnick’s prediction formula. As enabled by the Pareto principle and graph theory, new users benefit greatly from the aggregated global interest similarity (popularity) in the interest similarity graph and the global trust (reputation) in the directed trust graph. The experiments on two datasets with different sparsity levels (i.e., the Jester and MovieLens datasets) show that the proposed approach can significantly improve the predictive accuracy and decision-support accuracy of the trust-based CF recommender system.
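The abstract mentions a modified Resnick prediction formula without spelling out the modification; for reference, a minimal sketch of the standard Resnick formula (the target user's mean rating plus the similarity-weighted, mean-centred deviations of the neighbours) is given below. The data layout is an assumption made for illustration.

```python
def resnick_predict(target_mean, neighbours):
    """Standard Resnick prediction for one (user, item) pair.

    neighbours: list of (similarity, neighbour_rating, neighbour_mean) for
    users who rated the item. The mean-centring adapts to different rating
    scales, which is the aspect the paper's modified formula also addresses.
    """
    num = sum(sim * (r - mean) for sim, r, mean in neighbours)
    den = sum(abs(sim) for sim, _, _ in neighbours)
    return target_mean if den == 0 else target_mean + num / den

# Example: a user with mean rating 3.5 and two similar/trusted neighbours.
print(resnick_predict(3.5, [(0.8, 4.0, 3.0), (0.4, 2.0, 2.5)]))   # -> 4.0
```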

10.
This paper looks at different ways of personalising web page presentation to alleviate functional impairments in older people. The paper considers how impairments may be addressed by web design and through various personalisation instruments: accessibility features of standard browsers, proxy servers, assistive technology, application adaptors, and special purpose browsers. A pilot study of five older web users indicated that the most favoured personalisation technique was overriding the CSS (cascading style sheet) with a readily available one using a standard browser. The least favoured one was using assistive technology. In a follow-up study with 16 older web users performing goal-directed browsing tasks, overriding CSS remained the most favoured; assistive technology remained the least favoured and the slowest. Based on user comments, one take-home message for web personalisation instrument developers is that the best instrument for older persons is one that most faithfully preserves the original layout while requiring the least effort.

11.
A path-based approach for web page retrieval
Use of links to enhance page ranking has been widely studied. The underlying assumption is that links convey recommendations. Although this technique has been used successfully in global web search, it produces poor results for website search, because the majority of the links in a website are used to organize information and convey no recommendations. By distinguishing these two kinds of links, respectively for recommendation and information organization, this paper describes a path-based method for web page ranking. We define the Hierarchical Navigation Path (HNP) as a new resource for improving web search. HNP is composed of multi-step navigation information in visitors’ website browsing. It provides indications of the content of the destination page. We first classify the links inside a website. Then, the links for web page organization are exploited to construct the HNPs for each page. Finally, the PathRank algorithm is described for web page retrieval. The experiments show that our approach results in significant improvements over existing solutions.
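The PathRank details are not given in the abstract; as a rough, hypothetical sketch, hierarchical navigation paths can be built by walking the organizational links out from the site entry page and concatenating the anchor texts along the shortest path, after which the path text can be indexed alongside the page content. All names below are invented.

```python
from collections import deque

def build_hnps(entry, org_links, anchors):
    """Build one hierarchical navigation path (HNP) per page by BFS over
    organizational links from the site entry page.

    org_links: dict page -> list of pages it links to for organization
    anchors:   dict (src, dst) -> anchor text of that organizational link
    Returns a dict page -> list of anchor texts along the shortest path.
    """
    paths = {entry: []}
    queue = deque([entry])
    while queue:
        page = queue.popleft()
        for nxt in org_links.get(page, []):
            if nxt not in paths:
                paths[nxt] = paths[page] + [anchors.get((page, nxt), "")]
                queue.append(nxt)
    return paths

site = {"home": ["products", "about"], "products": ["p1"]}
texts = {("home", "products"): "Products", ("home", "about"): "About us",
         ("products", "p1"): "Model X specs"}
print(build_hnps("home", site, texts))
# {'home': [], 'products': ['Products'], 'about': ['About us'],
#  'p1': ['Products', 'Model X specs']}
```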

12.
13.
To meet the growing qualitative and quantitative demands for information from the WWW, efficient automatic Web page classifiers are urgently needed. However, a classifier applied to the WWW faces a huge-scale dimensionality problem since it must handle millions of Web pages, tens of thousands of features, and hundreds of categories. When it comes to practical implementation, reducing the dimensionality is a critically important challenge. In this paper, we propose a fuzzy ranking analysis paradigm together with a novel relevance measure, the discriminating power measure (DPM), to effectively reduce the input dimensionality from tens of thousands to a few hundred with zero rejection rate and a small decrease in accuracy. A two-level promotion method based on fuzzy ranking analysis is proposed to improve the behavior of each relevance measure and to combine those measures to produce a better evaluation of features. Additionally, the DPM has low computation cost and emphasizes both positive and negative discriminating features; it also favors classification in parallel rather than in serial order. In our experimental results, the fuzzy ranking analysis is useful for validating the uncertain behavior of each relevance measure. Moreover, the DPM reduces input dimensionality from 10,427 to 200 with zero rejection rate and with less than a 5% decline (from 84.5% to 80.4%) in test accuracy. Furthermore, regarding the impact of the proposed DPM on classification accuracy, experimental results on the China Time and Reuters-21578 datasets demonstrate that the DPM substantially improves document classification accuracy. The results also show that the DPM can indeed remove both redundant and noisy features to build a better classifier.

14.
Device-aware desktop web page transformation for rendering on handhelds
This paper illustrates a new approach to automatic re-authoring of web pages for rendering on small-screen devices. The approach is based on automatic detection of the device type and screen size from the HTTP request header to render a desktop web page or a transformed one for display on small-screen devices, for example, PDAs. Known algorithms (transforms) are employed to reduce the size of page elements, to hide parts of the text, and to transform tables into text while preserving the structural format of the web page. The system comprises a preprocessor that works offline and a just-in-time handler that responds to HTTP requests. The preprocessor employs Cascading Style Sheets (CSS) to set default attributes for the page and prepares it for the handler. The latter is responsible for downsizing graphical elements in the page, converting tables to text, and inserting visibility attributes and JavaScript code to allow the user of the client device to interact with the page and cause parts of the text to disappear or reappear. A system was developed that implements the approach and was used to collect performance results and conduct usability testing. The importance of the approach lies in its ability to display hidden parts of the web page without having to revisit the server, thus reducing user wait times considerably, saving battery power, and cutting down on wireless network traffic.
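The detection step amounts to inspecting the HTTP request headers. A minimal WSGI-style sketch is shown below; the User-Agent keyword list and the two render functions are placeholders, not the system described in the paper.

```python
HANDHELD_MARKERS = ("windows ce", "palm", "symbian", "blackberry", "pocket pc", "pda")

def is_handheld(environ):
    """Guess the device class from the User-Agent header (WSGI name: HTTP_USER_AGENT)."""
    ua = environ.get("HTTP_USER_AGENT", "").lower()
    return any(marker in ua for marker in HANDHELD_MARKERS)

def render_desktop():
    return "<html><body>full desktop page</body></html>"

def render_small_screen():
    return "<html><body>transformed page for small screens</body></html>"

def application(environ, start_response):
    # Serve either the original desktop page or the transformed one,
    # depending on the detected device type.
    body = render_small_screen() if is_handheld(environ) else render_desktop()
    start_response("200 OK", [("Content-Type", "text/html")])
    return [body.encode("utf-8")]
```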

15.
Link-based classification of Web pages
Based on the characteristics of links, a model for acquiring link information is proposed; the obtained link information is combined with the attributes of the objects themselves to jointly train classification rules. To account for the peculiarities of web page links, the link digraph is remodeled. Experiments show that adding link information effectively improves classification results, and remodeling the link digraph likewise improves classification accuracy.

16.
Donald B. Innes. Software, 1977, 7(2): 271-273
Many implementations of paged virtual memory systems employ demand fetching with least recently used (LRU) replacement. The stack characteristic of LRU replacement implies that a reference string which repeatedly accesses a number of pages in sequence will cause a page fault for each successive page referenced when the number of pages is greater than the number of page frames allocated to the program's LRU stack. In certain circumstances when the individual operations being performed on the referenced string are independent, or more precisely are commutative, the order of alternate page reference sequences can be reversed. This paper considers sequences which cannot be reversed and shows how placement of information on pages can achieve a similar effect if at least half the pages can be held in the LRU stack.
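The property described above, that a cyclic scan over more pages than available frames faults on every reference under LRU, is easy to reproduce with a short simulation; a minimal sketch:

```python
from collections import OrderedDict

def lru_faults(reference_string, frames):
    """Count page faults for demand fetching with LRU replacement."""
    stack = OrderedDict()              # most recently used page is last
    faults = 0
    for page in reference_string:
        if page in stack:
            stack.move_to_end(page)
        else:
            faults += 1
            if len(stack) == frames:
                stack.popitem(last=False)   # evict the least recently used page
            stack[page] = True
    return faults

# Repeatedly scanning 4 pages with only 3 frames: every reference faults.
refs = [0, 1, 2, 3] * 5
print(lru_faults(refs, 3))   # 20 faults for 20 references
print(lru_faults(refs, 4))   # only the 4 initial (cold) faults
```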

17.
The number of Internet users and the number of web pages being added to the WWW increase dramatically every day. It is therefore necessary to classify web pages into web directories automatically and efficiently. This helps search engines provide users with relevant results quickly. As web pages are represented by thousands of features, feature selection helps web page classifiers resolve this large-scale dimensionality problem. This paper proposes a new feature selection method using Ward's minimum variance measure. This measure is first used to identify clusters of redundant features in a web page. In each cluster, the best representative features are retained and the others are eliminated. Removing such redundant features helps in minimizing the resource utilization during classification. The proposed method of feature selection is compared with other common feature selection methods. Experiments done on a benchmark data set, namely WebKB, show that the proposed method performs better than most of the other feature selection methods in terms of reducing the number of features and the classifier modeling time.
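A minimal sketch of this idea, assuming a documents-by-features term matrix and using SciPy's Ward linkage to cluster the feature columns; the representative chosen per cluster (highest document frequency) is an assumption, not necessarily the paper's criterion.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def select_features_ward(X, n_clusters):
    """Cluster the features (columns of X) with Ward's minimum-variance linkage
    and keep one representative column per cluster.

    X: documents x features matrix (e.g. term frequencies).
    Returns the indices of the retained feature columns.
    """
    Z = linkage(X.T, method="ward")                  # cluster the feature columns
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    keep = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        # Assumed representative: the feature occurring in the most documents.
        df = (X[:, members] > 0).sum(axis=0)
        keep.append(int(members[int(np.argmax(df))]))
    return sorted(keep)

X = np.random.default_rng(0).poisson(1.0, size=(50, 200)).astype(float)
print(select_features_ward(X, n_clusters=40)[:10])
```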

18.
Liu Jinhong, Lu Yuliang. Computer Engineering and Design, 2007, 28(13): 3213-3215, 3219
Automatic text classification offers a powerful solution to the increasingly serious problem of "information overload" on the Internet. For Chinese text classification, an ontology is introduced into the N-gram statistical text model, and a multi-index strategy based on "domain concepts and effective word chains" is proposed, together with corresponding weight computation and parameter smoothing methods. Experiments on real data sets show that the ontology-enhanced N-gram Chinese text classification model not only reduces the number of index terms but also improves classification accuracy.

19.
The complexity of web information environments and multiple-topic web pages are negative factors significantly affecting the performance of focused crawling. A highly relevant region in a web page may be obscured because of the low overall relevance of that page. Segmenting web pages into smaller units will significantly improve the performance. Traversing irrelevant pages to reach a relevant one (tunneling) can improve the effectiveness of focused crawling by expanding its reach. This paper presents a heuristic-based method to enhance focused crawling performance. The method uses a Document Object Model (DOM)-based page partition algorithm to segment a web page into content blocks with a hierarchical structure and investigates how to take advantage of block-level evidence to enhance focused crawling by tunneling. Page segmentation can transform an uninteresting multi-topic web page into several single-topic context blocks, some of which may be interesting. Accordingly, the focused crawler can pursue the interesting content blocks to retrieve relevant pages. Experimental results indicate that this approach outperforms the Breadth-First, Best-First and Link-context algorithms in harvest rate, target recall and target length. Copyright © 2007 John Wiley & Sons, Ltd.
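The DOM partition algorithm itself is not reproduced here. Assuming a page has already been segmented into text blocks, each with its outgoing links, the hedged sketch below scores each block against the topic and queues links from relevant blocks, so a good block inside an otherwise low-relevance page still gets its links followed (the basis of tunneling). The scoring function and threshold are invented.

```python
def block_relevance(block_text, topic_terms):
    """Fraction of topic terms that occur in the block (a crude relevance proxy)."""
    words = set(block_text.lower().split())
    return sum(t in words for t in topic_terms) / len(topic_terms)

def enqueue_links(blocks, topic_terms, frontier, threshold=0.3):
    """blocks: list of (block_text, [outgoing_urls]) produced by page segmentation.

    Links are scored by the relevance of the block they appear in, not by the
    relevance of the whole page, so a good block inside a multi-topic page
    still gets its links followed.
    """
    for text, urls in blocks:
        score = block_relevance(text, topic_terms)
        if score >= threshold:
            for url in urls:
                frontier.append((score, url))
    frontier.sort(key=lambda item: item[0], reverse=True)   # best-first order
    return frontier

topic = ["solar", "photovoltaic", "energy"]
page_blocks = [("site navigation home contact", ["/about"]),
               ("photovoltaic panels convert solar energy", ["/pv-basics", "/cells"])]
print(enqueue_links(page_blocks, topic, []))
```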

20.
Multi-label problems are challenging because each instance may be associated with an unknown number of categories, and the relationship among the categories is not always known. A large amount of data is necessary to infer the required information regarding the categories, but these data are normally available only in small batches and distributed over a period of time. In this work, multi-label problems are tackled using an incremental neural network known as the evolving Probabilistic Neural Network (ePNN). This neural network is capable of continuous learning while maintaining a reduced architecture, so that it can always receive training data when available with no drastic growth of its structure. We carried out a series of experiments on web page data sets and compared the performance of ePNN to that of other multi-label categorizers. On average, ePNN outperformed the other categorizers in four out of five metrics used for evaluation, and the structure of ePNN was less complex than that of the other algorithms evaluated.
