Unsupervised Question Template Extraction Based on Improved Apriori Algorithm (基于改进Apriori算法的问题模板无监督抽取方法)

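Funding:	National Key Research and Development Program of China (2016QY03D0504); National Natural Science Foundation of China (61425016, 61902380); Taishan Scholars Program special fund (ts201511082)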

Cite this article:	KE Wenjun, GAO Jinhua, SHEN Huawei, LIU Yue, CHENG Xueqi. Unsupervised Question Template Extraction Based on Improved Apriori Algorithm[J]. Journal of Chinese Information Processing, 2020, 34(10): 76-84.

Authors:	KE Wenjun (柯文俊), GAO Jinhua (高金华), SHEN Huawei (沈华伟), LIU Yue (刘悦), CHENG Xueqi (程学旗)

Affiliation:	1.CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 2.University of Chinese Academy of Sciences, Beijing 100049, China; 3.Beijing Institute of Computer Technology and Applications, Beijing 100854, China; 4.Institute of Network Technology, ICT(YANTAI), CAS, Yantai, Shandong 264005, China

Abstract:	For domain-specific factoid question answering (QA) systems, answering questions via template matching is effective and stable. However, existing methods for building question templates are usually supervised, so they depend heavily on manually annotated data and extend poorly across domains. To address this issue, this paper proposes an unsupervised template extraction method based on an improved Apriori algorithm. Given question samples from a specific domain, frequent itemsets are mined with phrase order taken into account, and the frequent items serve as the frame words of candidate templates. The information content of each candidate template is measured via TF-IDF, and candidates with low information content are filtered out. In particular, to obtain templates with more items, an adaptive update mechanism for the support threshold is introduced into the Apriori algorithm. Finally, named entity recognition (NER) is used to identify slots, and question templates are obtained by combining frame words with the corresponding slots. Experimental results show that the method can effectively extract question templates on domain-specific QA datasets and outperforms the baseline models.

Keywords:	question answering systems; template extraction; Apriori algorithm
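
The mining step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes whitespace-tokenized questions, treats a pattern as an ordered (possibly gapped) word sequence, and models the adaptive support update as a simple multiplicative decay with a lower bound, which is only one plausible reading of the abstract. The names `mine_ordered_patterns`, `decay`, `floor`, and `max_len` are hypothetical, and the TF-IDF filtering and NER-based slot identification are omitted.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]


def is_subsequence(pattern: Pattern, tokens: List[str]) -> bool:
    """True if the words of `pattern` occur in `tokens` in order (gaps allowed)."""
    it = iter(tokens)
    return all(word in it for word in pattern)


def support(pattern: Pattern, questions: List[List[str]]) -> float:
    """Fraction of questions that contain `pattern` as an ordered subsequence."""
    return sum(is_subsequence(pattern, q) for q in questions) / len(questions)


def mine_ordered_patterns(
    questions: List[List[str]],
    min_support: float = 0.5,
    decay: float = 0.9,   # assumption: threshold is multiplied by this per level
    floor: float = 0.2,   # assumption: threshold never drops below this value
    max_len: int = 8,     # assumption: hard cap on pattern length
) -> Dict[Pattern, float]:
    """Level-wise (Apriori-style) mining of frequent ordered word patterns."""
    threshold = min_support
    vocab = sorted({w for q in questions for w in q})

    # Level 1: frequent single words.
    level: Dict[Pattern, float] = {}
    for w in vocab:
        s = support((w,), questions)
        if s >= threshold:
            level[(w,)] = s
    frequent = dict(level)

    length = 1
    while level and length < max_len:
        length += 1
        # Relax the threshold so that longer patterns can survive.
        threshold = max(floor, threshold * decay)
        # Join step: extend p with the last word of q when they overlap.
        candidates = {
            p + (q[-1],) for p in level for q in level if p[1:] == q[:-1]
        }
        level = {}
        for c in sorted(candidates):
            s = support(c, questions)
            if s > 0 and s >= threshold:
                level[c] = s
        frequent.update(level)
    return frequent


questions = [
    "what is the capital of france".split(),
    "what is the capital of japan".split(),
    "what is the population of france".split(),
    "who is the president of japan".split(),
]
patterns = mine_ordered_patterns(questions, min_support=0.75, decay=0.8, floor=0.5)
longest = max(patterns, key=len)
print(longest, patterns[longest])
```

Because the threshold drops as patterns grow, candidates are still built only from patterns that passed the previous, stricter level, so this sketch can miss some long patterns; it keeps the usual Apriori pruning at the cost of completeness. The surviving ordered words would play the role of frame words, with slots filled in between them by a separate NER step.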
