Similar literature (20 results)
1.
Although personalized search has been under way for many years and many personalization algorithms have been investigated, it is still unclear whether personalization is consistently effective on different queries for different users and under different search contexts. In this paper, we study this problem and provide some findings. We present a large-scale evaluation framework for personalized search based on query logs and then evaluate five personalized search algorithms (including two click-based ones and three topical-interest-based ones) using 12-day query logs of Windows Live Search. By analyzing the results, we reveal that personalized Web search does not work equally well under various situations. It represents a significant improvement over generic Web search for some queries, while it has little effect and even harms query performance under some situations. We propose click entropy as a simple measurement on whether a query should be personalized. We further propose several features to automatically predict when a query will benefit from a specific personalization algorithm. Experimental results show that using a personalization algorithm for queries selected by our prediction model is better than using it simply for all queries.
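
The abstract names click entropy but does not define it. A minimal sketch, assuming the common definition as the Shannon entropy of the distribution of clicks over result URLs for one query; the function name `click_entropy` and the input format (a flat list of clicked URLs) are illustrative assumptions, not taken from the paper:

```python
import math
from collections import Counter


def click_entropy(clicked_urls):
    """Shannon entropy (in bits) of the click distribution for one query.

    clicked_urls: list of URLs clicked for this query across all users
    (hypothetical input format). Low entropy means most users click the
    same result, so personalization has little room to help.
    """
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under this assumed definition, a navigational query where every user clicks the same page scores 0, while a query whose clicks spread evenly over many pages scores high and is a more plausible candidate for personalization.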

2.
Identifying ambiguous queries is crucial to research on personalized Web search and search result diversity. Intuitively, query logs contain valuable information on how many intentions users have when issuing a query. However, previous work showed that user clicks alone are misleading in judging whether a query is ambiguous. In this paper, we address the problem of learning a query ambiguity model by using search logs. First, we propose enriching a query by mining the documents clicked by users and the relevant follow-up queries in a session. Second, we use a text classifier to map the documents and the queries into predefined categories. Third, we propose extracting features from the processed data. Finally, we apply a state-of-the-art algorithm, Support Vector Machine (SVM), to learn a query ambiguity classifier. Experimental results verify that using click-based features alone or session-based features alone performs worse than the previous work based on top retrieved documents. When we combine the two sets of features, our proposed approach achieves the best effectiveness, specifically 86% in terms of accuracy. It significantly improves on the click-based method by 5.6% and on the session-based method by 4.6%.

3.
Traditional search engines have become the most useful tools to search the World Wide Web. Even though they are good for certain search tasks, they may be less effective for others, such as satisfying ambiguous or synonym queries. In this paper, we propose an algorithm that, with the help of Wikipedia and collaborative semantic annotations, improves the quality of web search engines in the ranking of returned results. Our work is supported by (1) the logs generated after query searching, (2) semantic annotations of queries and (3) semantic annotations of web pages. The algorithm makes use of this information to elaborate an appropriate ranking. To validate our approach we have implemented a system that can apply the algorithm to a particular search engine. Evaluation results show that the number of relevant web resources obtained after executing a query with the algorithm is higher than the one obtained without it.

4.
Hundreds of millions of users submit queries to Web search engines each day. User queries are typically very short, which makes query understanding a challenging problem. In this paper, we propose a novel approach for query representation and classification. By submitting the query to a web search engine, the query can be represented as a set of terms found on the web pages returned by the search engine. In this way, each query can be considered as a point in high-dimensional space and standard classification algorithms such as regression can be applied. However, traditional regression is too flexible in situations with large numbers of highly correlated predictor variables. It may suffer from the overfitting problem. By using search click information, the semantic relationship between queries can be incorporated into the learning system as a regularizer. Specifically, from all the functions which minimize the empirical loss on the labeled queries, we select the one which best preserves the semantic relationship between queries. We present experimental evidence suggesting that the regularized regression algorithm is able to use search click information effectively for query classification.

5.
Thousands of users issue keyword queries to the Web search engines to find information on a number of topics. Since the users may have diverse backgrounds and may have different expectations for a given query, some search engines try to personalize their results to better match the overall interests of an individual user. This task involves two great challenges. First the search engines need to be able to effectively identify the user interests and build a profile for every individual user. Second, once such a profile is available, the search engines need to rank the results in a way that matches the interests of a given user. In this article, we present our work towards a personalized Web search engine and we discuss how we addressed each of these challenges. Since users are typically not willing to provide information on their personal preferences, for the first challenge, we attempt to determine such preferences by examining the click history of each user. In particular, we leverage a topical ontology for estimating a user’s topic preferences based on her past searches, i.e. previously issued queries and pages visited for those queries. We then explore the semantic similarity between the user’s current query and the query-matching pages, in order to identify the user’s current topic preference. For the second challenge, we have developed a ranking function that uses the learned past and current topic preferences in order to rank the search results to better match the preferences of a given user. Our experimental evaluation on the Google query-stream of human subjects over a period of 1 month shows that user preferences can be learned accurately through the use of our topical ontology and that our ranking function which takes into account the learned user preferences yields significant improvements in the quality of the search results.

6.
Search engine query log mining has evolved over time to resemble data stream mining, owing to the endless and continuous sequence of queries known as the query stream. In this paper, we propose an online frequent sequence discovery (OFSD) algorithm to extract frequent phrases from within query streams, based on a new frequency rate metric that is suitable for query stream mining. OFSD is an online, single-pass, real-time frequent sequence miner appropriate for data streams. The frequent phrases extracted by the OFSD algorithm are used to guide novice Web search engine users to complete their search queries more efficiently. YourEye, our online phrase recommender, is then introduced. The advantages of YourEye compared with Google Suggest, a service powered by Google for phrase suggestion, are also described. Various characteristics of two specific Web search engine query logs are analyzed, and the query logs are then used to evaluate YourEye. The experimental results confirm the significant benefit of monitoring frequent phrases within the queries instead of monitoring whole queries as non-separable items. The number of monitored elements decreases substantially, which results in smaller memory consumption as well as better performance. Re-ranking the retrieved pages based on past users' clicks for each frequent phrase extracted by OFSD is also introduced. The preliminary results show the advantages of the proposed method compared to the similar work reported in Smyth et al.

7.
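Wang Jimin (王继民), Gong Bihong (龚笔宏), Meng Tao (孟涛). Computer Engineering (《计算机工程》), 2006, 32(14): 25-26, 6
When users search for information with a Web search engine, a session may cover a single topic or several. This paper studies and analyzes multitasking Web queries on Peking University's Tianwang (北大天网), a large-scale Chinese search engine system. The results show that more than 1/3 of users perform multitasking Web queries; more than 1/2 of multitasking sessions contain 2 distinct topics and 2 to 7 queries; the mean duration of multitasking sessions is twice that of ordinary sessions; Tianwang users' multitasking queries center on 3 topics: computers, entertainment, and education; and nearly 1/4 of multitasking sessions contain uncertain information. The paper also uses association analysis to uncover some relationships among the topics of user queries.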

8.
Modern search engines record user interactions and use them to improve search quality. In particular, user click-through has been successfully used to improve click-through rate (CTR), Web search ranking, and query recommendations and suggestions. Although click-through logs can provide implicit feedback on users’ click preferences, deriving accurate absolute relevance judgments is difficult because of click noise and behavior biases. Previous studies showed that user clicking behaviors are biased by many factors such as “position” (user’s attention decreases from top to bottom) and “trust” (Web site reputations will affect user’s judgment). To address these problems, researchers have proposed several behavior models (usually referred to as click models) to describe users’ practical browsing behaviors and to obtain an unbiased estimation of result relevance. In this study, we review recent efforts to construct click models for better search ranking and propose a novel convolutional neural network architecture for building click models. Compared to traditional click models, our model not only considers user behavior assumptions as input signals but also uses the content and context information of search engine result pages. In addition, our model uses parameters from traditional click models to restrict the meaning of some outputs in our model’s hidden layer. Experimental results show that the proposed model can achieve considerable improvement over state-of-the-art click models based on the evaluation metric of click perplexity.
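
The abstract reports results in click perplexity without stating the formula. A minimal sketch, assuming the definition commonly used in the click-model literature (two raised to the negative mean log-likelihood, in bits, of the observed clicks at one rank position); the function name, argument names, and clamping constant are illustrative assumptions:

```python
import math


def click_perplexity(clicks, predicted_probs, eps=1e-12):
    """Perplexity of predicted click probabilities at one rank position.

    clicks: observed clicks (0 or 1), one per session.
    predicted_probs: the model's click probability for the same sessions.
    Returns a value >= 1; 1.0 is a perfect prediction, 2.0 matches a
    coin-flip predictor. The formula is an assumed, commonly used one.
    """
    if not clicks or len(clicks) != len(predicted_probs):
        raise ValueError("clicks and predicted_probs must be non-empty and equal length")
    log_sum = 0.0
    for c, q in zip(clicks, predicted_probs):
        q = min(max(q, eps), 1.0 - eps)  # clamp to avoid log2(0)
        log_sum += c * math.log2(q) + (1 - c) * math.log2(1.0 - q)
    return 2.0 ** (-log_sum / len(clicks))
```

Lower is better under this definition, and the overall figure is usually reported as the average of the per-position perplexities.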

9.
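In text search, building ranking models with learning to rank is increasingly common. The performance of a ranking model depends heavily on its training set. Each training sample requires a human to label how relevant a document is to a given query. For text search, the space of queries is practically unlimited while manual labeling is time-consuming and laborious, so it is worthwhile to select a subset of informative queries to label. This paper proposes a greedy algorithm that jointly considers query difficulty, density, and diversity to select informative queries for labeling from a massive pool of queries. Experimental results on LETOR and on data from a Web search engine show that the proposed method can construct a small yet effective training set.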

10.
Nowadays, searches for webpages of a person with a given name constitute a notable fraction of queries to web search engines. Such a query would normally return webpages related to several namesakes, who happened to have the queried name, leaving the burden of disambiguating and collecting pages relevant to a particular person (from among the namesakes) on the user. In this article we develop a Web People Search approach that clusters webpages based on their association to different people. Our method exploits a variety of semantic information extracted from Web pages, such as named entities and hyperlinks, to disambiguate among namesakes referred to on the Web pages. We demonstrate the effectiveness of our approach by testing the efficacy of the disambiguation algorithms and its impact on person search.

11.
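Web search analysis plays a pivotal role in optimizing search engines, and analyzing the search characteristics of individual users can improve search engine precision. Most existing models, such as the click graph model and its variants, focus on the common characteristics of user populations. Very little work, however, addresses how to capture both population-level and individual characteristics. This paper studies the new problem of Web search analysis for individual users: obtaining the topic distribution of an individual user's search queries by studying the burstiness of user search. Two search topic models are proposed, the Search Burstiness Model (SBM) and the Coupling-Sensitive Search Burstiness Model (CS-SBM). SBM assumes that the topics of query terms and URLs are independent, whereas CS-SBM assumes that query terms and URLs are topically coupled. The resulting topic distribution information is stored in skewed Dirichlet priors, and a Beta distribution characterizes the temporal properties of user search. Experimental results show that every user's Web search trail exhibits multiple user-specific characteristics. On a large volume of real user query logs, the proposed models show a clear generalization advantage over LDA, DCMLDA, and TOT, and they effectively depict how the topics of user search queries change over time.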

12.
One of the useful tools offered by existing web search engines is query suggestion (QS), which assists users in formulating keyword queries by suggesting keywords that are unfamiliar to users, offering alternative queries that deviate from the original ones, and even correcting spelling errors. The design goal of QS is to enrich the web search experience of users and spare them the frustrating process of choosing controlled keywords to specify their special information needs, which eases the burden of creating web queries. Unfortunately, the algorithms and design methodologies of the QS module developed by Google, the most popular web search engine these days, are not publicly available, which means that software developers cannot duplicate them to build the tool for specifically designed software systems for enterprise search, desktop search, or vertical search, to name a few. Keywords suggested by Yahoo! and Bing, two other well-known web search engines, are mostly popular, currently searched words, which might not meet the specific information needs of the users. These problems can be solved by WebQS, our proposed web QS approach, which provides the same mechanism offered by Google, Yahoo!, and Bing to support users in formulating keyword queries that improve the precision and recall of search results. WebQS relies on frequency of occurrence, keyword similarity measures, and modification patterns of queries in user query logs, which capture information on millions of searches conducted by millions of users, to suggest useful queries/query keywords during the user query construction process and achieve the design goal of QS. Experimental results show that WebQS performs as well as Yahoo! and Bing in terms of effectiveness and efficiency and is comparable to Google in terms of query suggestion time.

13.
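To address the caching and prefetching of search engine query results, this paper proposes a caching and prefetching method based on query characteristics. The method comprises a result-page-number prediction model that guides prefetching and a caching and prefetching algorithm framework, and it aims to improve search engine system performance. Analysis of user query logs collected over a period of time from a well-known Chinese commercial search engine shows that the number of result pages users browse is markedly uneven across queries; the page-number prediction model is designed around this property and drives prefetching and partitioned caching. Experiments on two months of large-scale real user query logs from the same search engine show that the method improves the cache hit rate by 3.5% to 8.45% over traditional methods.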

14.
Topic-sensitive PageRank: a context-sensitive ranking algorithm for Web search   Total citations: 14, self-citations: 0, citations by others: 14
The original PageRank algorithm for improving the ranking of search-query results computes a single vector, using the link structure of the Web, to capture the relative "importance" of Web pages, independent of any particular search query. To yield more accurate search results, we propose computing a set of PageRank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. For ordinary keyword search queries, we compute the topic-sensitive PageRank scores for pages satisfying the query using the topic of the query keywords. For searches done in context (e.g., when the search query is performed by highlighting words in a Web page), we compute the topic-sensitive PageRank scores using the topic of the context in which the query appeared. By using linear combinations of these (precomputed) biased PageRank vectors to generate context-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic PageRank vector. We describe techniques for efficiently implementing a large-scale search system based on the topic-sensitive PageRank scheme.
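
The query-time step described above, a linear combination of precomputed topic-biased PageRank vectors, can be sketched as follows. This is only an illustration under stated assumptions: the topic weights are taken as given (the abstract does not say how they are estimated), the PageRank vectors are stored as plain dictionaries, and all names and numbers are hypothetical:

```python
def topic_sensitive_score(page, topic_weights, topic_pageranks):
    """Weighted sum of a page's topic-biased PageRank values.

    topic_weights: {topic: weight of that topic for the query or context}.
    topic_pageranks: {topic: {page: precomputed biased PageRank}}.
    Both structures are assumed formats for this sketch.
    """
    return sum(
        weight * topic_pageranks[topic].get(page, 0.0)
        for topic, weight in topic_weights.items()
    )


def rank_pages(pages, topic_weights, topic_pageranks):
    """Order candidate pages by their combined topic-sensitive score."""
    return sorted(
        pages,
        key=lambda p: topic_sensitive_score(p, topic_weights, topic_pageranks),
        reverse=True,
    )


# Made-up illustrative values: a query judged mostly about "health".
topic_pageranks = {
    "sports": {"p1": 0.6, "p2": 0.1},
    "health": {"p1": 0.1, "p2": 0.7},
}
topic_weights = {"sports": 0.25, "health": 0.75}
ranking = rank_pages(["p1", "p2"], topic_weights, topic_pageranks)
```

Because the biased vectors are computed offline, the only work at query time in this sketch is the weighted sum, which is what makes per-query or per-context importance scores affordable.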

15.
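A search result ranking algorithm based on user tags   Total citations: 1, self-citations: 0, citations by others: 1
With the rapid development of computer networks, the information on the Web grows ever larger and more complex. Helping people obtain the information they need from massive Web data accurately and quickly is the primary problem search engines must solve today, and a variety of search ranking algorithms have emerged in response. At present, however, Web page content is expressed in very simple forms and users describe their queries in even simpler ones, which makes it difficult to judge the relevance between page content and a user query. The paper first classifies and summarizes existing search engine ranking algorithms and analyzes their strengths and weaknesses. It then proposes a new method based on semantic tags derived from user feedback, and finally compares its results with Google search results using several evaluation methods. Experimental results show that the rankings produced by the method are closer to user needs than Google's rankings.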

16.
The popularity of Web Search Engines (WSEs) enables them to generate a lot of data in the form of query logs. These files contain all search queries submitted by users. Economic benefits could be earned by selling or releasing those logs to third parties. Nevertheless, these data potentially expose sensitive user information. Removing direct identifiers is not sufficient to preserve the privacy of the users. Some existing privacy-preserving approaches use log batch processing but, as logs are generated and consumed in a real-time environment, a continuous anonymization process would be more convenient. In this paper we therefore propose: (i) a new method to anonymize query logs, based on k-anonymity; and (ii) some de-anonymization tools to determine possible privacy problems in case an attacker gains access to the anonymized query logs. This approach preserves the original user interests, but spreads possible semi-identifier information over many users, preventing linkage attacks. To assess its performance, all the proposed algorithms are implemented and an extensive set of experiments is conducted using real data.

17.
This paper describes and evaluates a unified approach to phrasal query suggestions in the context of a high-precision search engine. The search engine performs ranked extended-Boolean searches with the proximity operator near being the default operation. Suggestions are offered to the searcher when the length of the result list falls outside predefined bounds. If the list is too long, the engine specializes the query through the use of super phrases; if the list is too short, the engine generalizes the query through the use of proximal subphrases. We describe methods for generating both types of suggestions and present algorithms for ranking the suggestions. Specifically, we present the problem of counting proximal subphrases for specialization and the problem of counting unordered super phrases for generalization. The uptake of our approach was evaluated by analyzing search log data from before and after the suggestion feature was added to a commercial version of the search engine. We looked at approximately 1.5 million queries and found that, after they were added, suggestions represented nearly 30% of the total queries. Efficacy was evaluated through a controlled study of 24 participants performing nine searches using three different search engines. We found that the engine with phrasal query suggestions had better high-precision recall than both the same search engine without suggestions and a search engine with a similar interface but using an Okapi BM25 ranking algorithm.

18.
Keyword-based Web search is a widely used approach for locating information on the Web. However, Web users often have difficulty organizing and formulating appropriate input queries because they lack sufficient domain knowledge, which greatly affects search performance. An effective way to meet the information needs of a search engine user is to suggest Web queries that are topically related to their initial inquiry. Accurately computing query-to-query similarity scores is key to improving the quality of these suggestions. Because queries are short, traditional pseudo-relevance or implicit-relevance based approaches expand the expression of the queries for the similarity computation. They explicitly use a search engine as a complementary source and directly extract additional features (such as terms or URLs) from the top-listed or clicked search results. In this paper, we propose a novel approach that uses the hidden topic as an expandable feature. This has two steps. In the offline model-learning step, a hidden topic model is trained, and for each candidate query, its posterior distribution over the hidden topic space is determined to re-express the query instead of the lexical expression. In the online query suggestion step, after inferring the topic distribution for an input query in a similar way, we calculate the similarity between candidate queries and the input query in terms of their corresponding topic distributions, and produce a suggestion list of candidate queries based on the similarity scores. Our experimental results on two real data sets show that the hidden topic based suggestion is much more efficient than the traditional term or URL based approach, and is effective in finding topically related queries for suggestion.
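
The online suggestion step above can be sketched as a nearest-neighbor lookup over topic distributions. The abstract does not name the similarity measure, so cosine similarity is an assumption here, as are the function names, the dictionary format for candidates, and the example distributions; the offline topic-model training is not shown:

```python
import math


def cosine(p, q):
    """Cosine similarity between two equal-length topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def suggest(input_topics, candidate_topics, k=3):
    """Return the k candidate queries most similar to the input query.

    input_topics: inferred topic distribution of the input query.
    candidate_topics: {candidate query: precomputed topic distribution}
    (an assumed storage format for the offline step's output).
    """
    ranked = sorted(
        candidate_topics.items(),
        key=lambda item: cosine(input_topics, item[1]),
        reverse=True,
    )
    return [query for query, _ in ranked[:k]]


# Hypothetical two-topic example: topic 0 ~ cars, topic 1 ~ animals.
candidates = {
    "jaguar car price": [0.8, 0.2],
    "jaguar habitat": [0.1, 0.9],
}
top = suggest([0.9, 0.1], candidates, k=1)
```

Comparing distributions rather than terms lets two queries with no words in common still match, which is the point of re-expressing queries in the hidden topic space.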

19.
Search engines retrieve and rank Web pages which are not only relevant to a query but also important or popular for the users. This popularity has been studied by analysis of the links between Web resources. Link-based page ranking models such as PageRank and HITS assign a global weight to each page regardless of its location. This popularity measurement has proven successful on general search engines. However, unlike general search engines, location-based search engines should retrieve and rank higher the pages which are more popular locally. The best results for a location-based query are those which are not only relevant to the topic but also popular with or cited by local users. Current ranking models are often less effective for these queries since they are unable to estimate the local popularity. We offer a model for calculating the local popularity of Web resources using back link locations. Our model automatically assigns correct locations to the links and content and uses them to calculate new geo-rank scores for each page. The experiments show more accurate geo-ranking of search engine results when this model is used for processing location-based queries.

20.
Most Web pages contain location information, which is usually neglected by traditional search engines. Queries combining location and textual terms are called spatial textual Web queries. Because traditional search engines pay little attention to the location information in Web pages, in this paper we study a framework to utilize location information for Web search. The proposed framework consists of an offline stage to extract focused locations for crawled Web pages, as well as an online ranking stage to perform location-aware ranking for search results. The focused locations of a Web page refer to the most appropriate locations associated with the Web page. In the offline stage, we extract the focused locations and keywords from Web pages and map each keyword to specific focused locations, which forms a set of <keyword, location> pairs. In the online query processing stage, we extract keywords from the query and compute the ranking scores based on location relevance and the location-constrained scores for each querying keyword. The experiments on various real datasets crawled from nj.gov, BBC and New York Times show that the performance of our algorithm on focused location extraction is superior to previous methods and that the proposed ranking algorithm has the best performance w.r.t. different spatial textual queries.
