Directional Data Augmentation for Question Paraphrase Identification

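Citation:	ZHU Hongyu, JIN Zhiling, HONG Yu, SU Yulan, ZHANG Min. Directional Data Augmentation for Question Paraphrase Identification[J]. Journal of Chinese Information Processing, 2022, 36(9): 38-45

Authors:	ZHU Hongyu, JIN Zhiling, HONG Yu, SU Yulan, ZHANG Min

Affiliation:	School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China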

Funding:	Major Special Project of the Ministry of Science and Technology (2020YFB1313601); National Natural Science Foundation of China (62076174, 61773276)

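Abstract:	Question paraphrase identification aims to recall question pairs that are "the same in meaning but different in form" (questions with identical semantics but very different wording) and to discard semantically unrelated noise questions. For each input question pair it makes a binary decision: "paraphrase" or "non-paraphrase". Existing pre-trained language models (such as BERT, RoBERTa, and MacBERT) are widely used for semantic encoding of natural language and have achieved significant performance gains. However, these gains have not been fully realized in question paraphrase identification, for two reasons: (1) pre-trained language models are not sensitive to the fine-grained semantic representations that a specific task requires; (2) whether a sample is a paraphrase often hinges on very subtle semantic differences. Fine-tuning the pre-trained language model is therefore the key step in improving its task adaptability, but fine-tuning depends heavily on the quantity (diversity) and quality (reliability) of the training data. To address this, the paper proposes a directional data augmentation (DDA) method based on a generative model. The method uses induced labels to guide a neural generation network, so that it automatically generates diverse paraphrase and non-paraphrase augmented samples (that is, highly confusable heterogeneous samples) and thereby expands the training data automatically. In addition, the paper designs a multi-model ensemble label voting mechanism and uses it to correct potential label errors in the augmented samples, which improves the reliability of the expanded data. Experimental results on the Chinese question paraphrase dataset LCQMC show that, compared with traditional data augmentation methods, the proposed method generates higher-quality samples with more diverse semantic expressions.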
Keywords:	paraphrase identification; pre-training; fine-tuning; data augmentation
Received:	2021-11-09
Directional Data Augmentation for Question Paraphrase Identification

ZHU Hongyu, JIN Zhiling, HONG Yu, SU Yulan, ZHANG Min. Directional Data Augmentation for Question Paraphrase Identification[J]. Journal of Chinese Information Processing, 2022, 36(9): 38-45

Authors:	ZHU Hongyu, JIN Zhiling, HONG Yu, SU Yulan, ZHANG Min

Affiliation:	School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China

Abstract:	Question paraphrase identification aims to find question pairs that are the same in meaning but different in form (questions with the same semantics but different wording) and to discard semantically unrelated noise questions. Existing pre-trained language models are widely used for semantic encoding of natural-language text, but they do not perform as well as expected on question paraphrase identification. We propose a Directional Data Augmentation (DDA) method, based on a generative model, to supply training data for fine-tuning the pre-trained language model. DDA uses a directional label to guide the neural generation network, so that it automatically generates a variety of paraphrase and non-paraphrase samples to augment the training set. In addition, we design a model-ensemble voting mechanism to correct potential label errors in the augmented samples. Results on LCQMC show that, compared with traditional data augmentation methods, DDA produces higher-quality samples with more diverse semantic expressions.

Keywords:	paraphrase identification; pre-training; fine-tuning; data augmentation
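
The abstract mentions a model-ensemble voting mechanism for correcting labels of augmented samples but does not give its exact rule. The following is only a minimal sketch of one plausible reading, assuming simple majority voting over binary labels (1 = paraphrase, 0 = non-paraphrase). The names `vote_label`, `correct_labels`, and `min_agreement` are hypothetical and do not come from the paper, and the classifiers are stand-ins for fine-tuned models.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Assumption: each classifier maps a question pair to 1 (paraphrase) or 0 (non-paraphrase).
Classifier = Callable[[str, str], int]
Sample = Tuple[str, str, int]  # (question_1, question_2, induced_label)


def vote_label(q1: str, q2: str, classifiers: Sequence[Classifier]) -> Tuple[int, float]:
    """Return the most frequent predicted label and the fraction of models that chose it."""
    votes = Counter(clf(q1, q2) for clf in classifiers)
    label, count = votes.most_common(1)[0]
    return label, count / len(classifiers)


def correct_labels(
    samples: Sequence[Sample],
    classifiers: Sequence[Classifier],
    min_agreement: float = 0.5,  # hypothetical threshold, not taken from the paper
) -> List[Sample]:
    """Replace the induced label when more than `min_agreement` of the models agree."""
    corrected = []
    for q1, q2, induced in samples:
        label, agreement = vote_label(q1, q2, classifiers)
        # On a tie or weak agreement, keep the label used to guide generation.
        final = label if agreement > min_agreement else induced
        corrected.append((q1, q2, final))
    return corrected


if __name__ == "__main__":
    # Toy stand-ins for fine-tuned models; real classifiers would be neural encoders.
    always_yes = lambda a, b: 1
    always_no = lambda a, b: 0
    exact_match = lambda a, b: int(a == b)
    augmented = [("a", "a", 0), ("a", "b", 1)]
    print(correct_labels(augmented, [always_yes, always_no, exact_match]))
```

With an odd number of binary classifiers there is always a strict majority, so the induced label is kept only when the ensemble is evenly split or when `min_agreement` is raised above one half.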

