Building Minority Language Corpora by Learning to Generate Web Search Queries
Authors: Rayid Ghani, Rosie Jones, Dunja Mladenic
Affiliation: (1) Accenture Technology Labs, 161 N. Clark St., Chicago, IL 60601, USA; (2) Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA; (3) J. Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
Abstract: The Web is a source of valuable information, but the process of collecting, organizing, and effectively utilizing the resources it contains is difficult. We describe CorpusBuilder, an approach for automatically generating Web search queries for collecting documents matching a minority concept. The concept used for this paper is that of text documents belonging to a minority natural language on the Web. Individual documents are automatically labeled as relevant or nonrelevant using a language filter, and the feedback is used to learn what query lengths and inclusion/exclusion term-selection methods are helpful for finding previously unseen documents in the target language. Our system learns to select good query terms using a variety of term-scoring methods. Selecting terms by odds ratio scores calculated over the acquired documents was one of the most consistently accurate query-generation methods. To reduce the number of estimated parameters, we parameterize the query length using a Gamma distribution and present empirical results with learning methods that vary the time horizon used when learning from the results of past queries. We find that our system performs well whether we initialize it with a whole document or with a handful of words elicited from a user. We also present experiments applying the same approach to multiple languages, which show that it generalizes well across several languages regardless of the initial conditions.
Keywords: Web mining; Online learning; Query generation; Corpus construction
This article is indexed in SpringerLink and other databases.
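
Illustrative sketch (not from the paper): the abstract describes scoring candidate query terms by odds ratio over documents labeled relevant or nonrelevant, and drawing query lengths from a Gamma distribution. The Python below is one minimal way this could look. The function names, the 0.5 smoothing constant, the "+term"/"-term" query syntax, and the rounding of the Gamma sample are all assumptions made for illustration; the authors' actual implementation may differ.

```python
import math
import random
from collections import Counter


def odds_ratio_scores(relevant_docs, nonrelevant_docs, smoothing=0.5):
    """Log odds ratio of each term's document frequency (smoothed; assumed form)."""
    n_rel, n_non = len(relevant_docs), len(nonrelevant_docs)
    # Count the number of documents containing each term, per class.
    df_rel = Counter(t for doc in relevant_docs for t in set(doc.split()))
    df_non = Counter(t for doc in nonrelevant_docs for t in set(doc.split()))
    scores = {}
    for term in set(df_rel) | set(df_non):
        # Smoothing keeps both probabilities strictly between 0 and 1.
        p_rel = (df_rel[term] + smoothing) / (n_rel + 2 * smoothing)
        p_non = (df_non[term] + smoothing) / (n_non + 2 * smoothing)
        scores[term] = math.log(p_rel * (1 - p_non) / ((1 - p_rel) * p_non))
    return scores


def sample_query_length(shape, scale, rng=random):
    """Draw a non-negative integer length from Gamma(shape, scale), rounded."""
    return round(rng.gammavariate(shape, scale))


def build_query(scores, n_include, n_exclude):
    """Include the highest-scoring terms and exclude the lowest-scoring ones."""
    best = sorted(scores, key=lambda t: (-scores[t], t))[:n_include]
    worst = sorted(scores, key=lambda t: (scores[t], t))[:n_exclude]
    # "+term" / "-term" is a hypothetical search-engine syntax.
    return " ".join(["+" + t for t in best] + ["-" + t for t in worst])


# Toy data: Slovenian text as the target language, English as nonrelevant.
relevant = ["dober dan svet", "dober večer"]
nonrelevant = ["good day world", "good evening"]
scores = odds_ratio_scores(relevant, nonrelevant)
query = build_query(scores, n_include=1, n_exclude=1)
```

In this toy run "dober" appears in every relevant document and "good" in every nonrelevant one, so they become the inclusion and exclusion terms respectively. In a feedback loop of the kind the abstract describes, the documents returned for each query would be passed through the language filter and added to the two sets before rescoring.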