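Multi-domain Text Classification Based on Domain Feature Refinement
Cite this article:MA Shikun,TENG Chong,LI Fei,JI Donghong.Multi-domain Text Classification Based on Domain Feature Refinement[J].Journal of Chinese Information Processing,2022,36(8):92-100.
Authors:MA Shikun  TENG Chong  LI Fei  JI Donghong
Affiliation:Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, Hubei 430072, China
Funding:National Natural Science Foundation of China (62176187); National Key Research and Development Program of China (2017YFC1200500); Ministry of Education Fund (18JZD015); Ministry of Education Humanities and Social Sciences Youth Fund (22YJCZH064); Natural Science Foundation of Hubei Province (2021CFB385)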
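Abstract:Text classification is a fundamental task in natural language processing, but current text classification is usually carried out separately for each domain and requires abundant annotated data. This paper exploits the similar information contained in data from different domains to alleviate, to some extent, the shortage of labeled training data. It proposes a multi-task learning model for cross-domain text classification in which a private encoder for each domain and a shared encoder for all domains extract private features and shared features, respectively, so that text is represented with domain knowledge at different levels, which helps classification. In addition, the paper uses orthogonal projection to further separate the shared features from the domain-private features, thereby strengthening the purity of the shared features, and uses a gating mechanism to recombine and fuse the shared and private features. The model is validated on two commonly used multi-domain text classification datasets (Amazon and FDU-MTL). Experimental results show that its average classification accuracy reaches 86.04% on Amazon and 89.2% on FDU-MTL, a clear improvement over several previous baseline models.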

Keywords:text classification  multi-domain  feature purification  multi-task learning  

Multi-domain Text Classification Based on Domain Feature Refinement
MA Shikun,TENG Chong,LI Fei,JI Donghong.Multi-domain Text Classification Based on Domain Feature Refinement[J].Journal of Chinese Information Processing,2022,36(8):92-100.
Authors:MA Shikun  TENG Chong  LI Fei  JI Donghong
Affiliation:Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, Hubei 430072, China
Abstract:Text classification is a fundamental task in the natural language processing community. However, current text classification is usually handled separately for each domain and suffers from insufficient annotated training data. We address the limited labeled data issue by leveraging the similar information contained in data from different domains. Within the multi-task learning framework proposed in this paper, we extract domain-invariant and domain-specific features using a shared encoder and multiple private encoders, respectively. Latent information from different domains can thus be captured, which benefits multi-domain text classification. In addition, we apply an orthogonal projection operation to separate the shared and private feature spaces and thereby refine the shared features, and then design a gate mechanism to fuse the shared and private features. Experiments on the Amazon review and FDU-MTL datasets show that the average accuracy of the proposed model is 86.04% and 89.2%, respectively, significantly better than multiple baseline models.
Keywords:text classification  multi-domain  feature refinement  multi-task learning  
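
The abstract names two operations without giving their formulas: an orthogonal projection that separates shared from private features, and a gate that fuses them. The following is a minimal NumPy sketch of one common way to realize both ideas, not the authors' implementation. The function names (`project_onto`, `purify_shared`, `gated_fusion`), the row-wise vector projection, the sigmoid gate over the concatenated features, and the parameters `W` and `b` are all assumptions made for illustration.

```python
import numpy as np


def project_onto(x, y, eps=1e-12):
    """Project each row of x onto the corresponding row of y (assumed form)."""
    scale = np.sum(x * y, axis=-1, keepdims=True) / (
        np.sum(y * y, axis=-1, keepdims=True) + eps
    )
    return scale * y


def purify_shared(shared, private):
    """Subtract from the shared feature its component along the private feature.

    The result is orthogonal to `private` row by row, which is one way to
    read "strengthening the purity of the shared features".
    """
    return shared - project_onto(shared, private)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_fusion(shared, private, W, b):
    """Fuse two features with an element-wise gate (hypothetical W and b).

    g = sigmoid([shared; private] W + b), output = g * shared + (1 - g) * private.
    """
    g = sigmoid(np.concatenate([shared, private], axis=-1) @ W + b)
    return g * shared + (1.0 - g) * private


# Usage on random stand-ins for encoder outputs (batch of 4, feature size 8).
rng = np.random.default_rng(0)
shared = rng.normal(size=(4, 8))
private = rng.normal(size=(4, 8))
pure = purify_shared(shared, private)
fused = gated_fusion(pure, private, 0.1 * rng.normal(size=(16, 8)), np.zeros(8))
```

In a trained model `W` and `b` would be learned jointly with the encoders; here they are random placeholders so the sketch runs on its own.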