Building Minority Language Corpora by Learning to Generate Web Search Queries

Authors:	Rayid Ghani, Rosie Jones, Dunja Mladenic

Affiliation:	(1) Accenture Technology Labs, 161 N. Clark St., Chicago, IL 60601, USA; (2) Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA; (3) J. Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia

Abstract:	The Web is a source of valuable information, but collecting, organizing, and effectively using the resources it contains is difficult. We describe CorpusBuilder, an approach for automatically generating Web search queries that collect documents matching a minority concept. The concept used in this paper is that of text documents written in a minority natural language on the Web. Individual documents are automatically labeled as relevant or nonrelevant using a language filter, and this feedback is used to learn which query lengths and which inclusion/exclusion term-selection methods help find previously unseen documents in the target language. Our system learns to select good query terms using a variety of term-scoring methods. Scoring terms by odds ratio, calculated over the documents acquired so far, was one of the most consistently accurate query-generation methods. To reduce the number of estimated parameters, we parameterize the query length using a Gamma distribution, and we present empirical results with learning methods that vary the time horizon used when learning from the results of past queries. We find that our system performs well whether it is initialized with a whole document or with a handful of words elicited from a user. Experiments applying the same approach to multiple languages show that it generalizes well across several languages regardless of the initial conditions.

Keywords:	Web mining; Online learning; Query generation; Corpus construction
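
The query-generation idea summarized in the abstract (score terms by odds ratio over documents already labeled by the language filter, then build a query from inclusion and exclusion terms, with the query length drawn from a Gamma distribution) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the additive smoothing constant, the `+term`/`-term` query syntax, and the Gamma shape and scale values are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def odds_ratio_scores(relevant_docs, nonrelevant_docs, smoothing=0.5):
    """Log odds ratio of each term occurring in relevant vs. nonrelevant docs.

    Documents are plain strings split on whitespace. The smoothing constant
    is an illustrative assumption that keeps probabilities away from 0 and 1.
    """
    rel_df = Counter(t for d in relevant_docs for t in set(d.split()))
    non_df = Counter(t for d in nonrelevant_docs for t in set(d.split()))
    n_rel, n_non = len(relevant_docs), len(nonrelevant_docs)
    scores = {}
    for term in set(rel_df) | set(non_df):
        p_rel = (rel_df[term] + smoothing) / (n_rel + 2 * smoothing)
        p_non = (non_df[term] + smoothing) / (n_non + 2 * smoothing)
        scores[term] = math.log(p_rel * (1 - p_non) / ((1 - p_rel) * p_non))
    return scores


def sample_query_length(rng, shape=2.0, scale=1.5):
    """Draw a non-negative integer query length from a Gamma distribution.

    The shape and scale values are placeholders, not learned parameters.
    """
    return max(0, round(rng.gammavariate(shape, scale)))


def build_query(scores, n_include, n_exclude):
    """Build a query from the highest-scoring (inclusion) and
    lowest-scoring (exclusion) terms. The +/- syntax is hypothetical."""
    by_high = sorted(scores, key=lambda t: (-scores[t], t))
    include = by_high[:n_include]
    by_low = sorted((t for t in scores if t not in include),
                    key=lambda t: (scores[t], t))
    exclude = by_low[:n_exclude]
    return " ".join(["+" + t for t in include] + ["-" + t for t in exclude])


# Toy documents standing in for filter-labeled pages.
relevant = ["dober dan vsem", "dober tek vsem"]
nonrelevant = ["good day world", "good evening world"]

scores = odds_ratio_scores(relevant, nonrelevant)
query = build_query(scores, n_include=2, n_exclude=2)
print(query)

rng = random.Random(0)
print(sample_query_length(rng), sample_query_length(rng))
```

In a loop of the kind the abstract describes, the documents returned for each query would be labeled by the language filter and added to the relevant or nonrelevant pool before the scores are recomputed; that feedback step is omitted here.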
This article is indexed in SpringerLink and other databases.
