Research on Domain-Specific Word Vector Generation Based on BERT-CRF
Original title (Chinese): 基于BERT-CRF的领域词向量生成研究
Cite this article: GUO Zhendong, LIN Min, LI Chengcheng, ZHAO Jiapeng. Research on Domain-Specific Word Vector Generation Based on BERT-CRF[J]. Computer Engineering and Applications (计算机工程与应用), 2022, 58(21): 156-162.
Authors: GUO Zhendong, LIN Min, LI Chengcheng, ZHAO Jiapeng
Affiliation: 1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China; 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100089, China; 3. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100089, China
Abstract: Many text analysis tasks are built on domain word segmentation, so obtaining high-quality domain-specific word vectors from Chinese BERT character vectors is a pressing problem. This paper proposes a BERT-based method for generating domain-specific word vectors. A BERT-CRF domain word segmenter is built: starting from the pre-trained BERT character vectors, it is fine-tuned on domain text and learns domain word segmentation. Domain-specific word vector representations are then derived from the segmenter's decoding results. Experiments show that the method can learn a segmenter that meets the requirements of the domain task from only a small amount of domain text, and that it yields higher-quality domain-specific word vectors than the original BERT.

Keywords: bidirectional encoder representations from transformers (BERT); domain word segmenter; domain-specific word vector; conditional random field (CRF); word vector visualization
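
The pipeline summarized in the abstract (CRF decoding of a segmentation, then word vectors built from that segmentation) can be illustrated with a small sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the BMES tag scheme, the hand-written emission scores standing in for BERT outputs, the use of mean pooling to combine character vectors, and the hypothetical function names `viterbi` and `pool_words`.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed tag scheme: Begin / Middle / End of a word, Single-character word.
TAGS = ["B", "M", "E", "S"]


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the highest-scoring tag sequence (linear-chain CRF decoding).

    emissions[i][k] is the score of TAGS[k] at character i. In a BERT-CRF
    model these would come from a layer on top of BERT; here they are inputs.
    transitions[(a, b)] scores moving from tag a to tag b; pairs that are
    missing from the dict are treated as forbidden.
    """
    neg_inf = float("-inf")
    n = len(TAGS)
    # A sentence can only start with B or S.
    score = [emissions[0][k] if TAGS[k] in ("B", "S") else neg_inf for k in range(n)]
    back: List[List[int]] = []
    for row in emissions[1:]:
        new_score, pointers = [], []
        for k, tag in enumerate(TAGS):
            cands = [score[j] + transitions.get((TAGS[j], tag), neg_inf) for j in range(n)]
            best = max(range(n), key=lambda j: cands[j])
            new_score.append(cands[best] + row[k])
            pointers.append(best)
        score = new_score
        back.append(pointers)
    # A sentence can only end with E or S.
    last = max((k for k in range(n) if TAGS[k] in ("E", "S")), key=lambda k: score[k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return [TAGS[k] for k in reversed(path)]


def pool_words(chars: str, tags: Sequence[str],
               char_vectors: Sequence[Sequence[float]]) -> List[Tuple[str, List[float]]]:
    """Group characters into words by BMES tags and mean-pool their vectors.

    Mean pooling is an assumption of this sketch, not the paper's stated method.
    """
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in ("E", "S"):  # a word closes at this character
            span = char_vectors[start:i + 1]
            mean = [sum(col) / len(span) for col in zip(*span)]
            words.append((chars[start:i + 1], mean))
            start = i + 1
    return words


# Transitions allowed under BMES, all with score 0.0 (toy values).
ALLOWED = {(a, b): 0.0 for a, b in [("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
                                    ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")]}

# Toy inputs for the five characters of "词向量生成" ("word vector generation").
EMISSIONS = [[2, 0, 0, 1], [0, 2, 0, 0], [0, 0, 2, 0], [2, 0, 0, 1], [0, 0, 2, 1]]
CHAR_VECTORS = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0], [0.0, 4.0]]

tags = viterbi(EMISSIONS, ALLOWED)
words = pool_words("词向量生成", tags, CHAR_VECTORS)
print(tags)
print(words)
```

With these toy scores the decoder splits the string into 词向量 and 生成, and each word receives the average of its character vectors. In a real system the emission scores and character vectors would both come from the fine-tuned BERT model, and the transition scores would be learned CRF parameters.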