
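Research on Chinese Terminology Extraction Based on a BERT-Embedding BiLSTM-CRF Model
Cite this article: Wu Jun, Cheng Yao, Hao Han, Ailiyaer·Aizezi, Liu Feixue, Su Yipo. Research on Chinese Terminology Extraction Based on a BERT-Embedding BiLSTM-CRF Model[J]. Journal of the China Society for Scientific and Technical Information, 2020, 39(4): 409-418.
Authors: Wu Jun  Cheng Yao  Hao Han  Ailiyaer·Aizezi  Liu Feixue  Su Yipo
Affiliations: School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing 100876; Shenzhen Storm Intelligent Technology Co., Ltd., Beijing 100191
Funding: National Key R&D Program of China project "Research, Development, and Application Demonstration of a Science and Technology Consulting Service Platform Based on Model Innovation" (2018YFB1403600); Beijing Social Science Fund general project "Big-Data-Based Monitoring and Development Trend Research on Beijing's Bike-Sharing Industry" (17YJB018).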
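Abstract: The recognition and automatic extraction of professional terminology plays an important foundational role in improving the precision of domain-specific information retrieval and in building domain knowledge graphs. To further improve the precision and recall of Chinese terminology recognition, this paper proposes an end-to-end Chinese terminology extraction model that does not depend on manual feature selection or domain knowledge. The model is based on Google's BERT pre-trained language model and Chinese pre-trained character embeddings, and combines BiLSTM and CRF. On a self-built deep learning corpus of 1,278 entries, the model achieves an F1 of 92.96% for term extraction, a fairly significant improvement over traditional shallow machine learning models (such as the left-right entropy and mutual information algorithm and the word2vec similar-word algorithm) and over the BiLSTM-CRF deep neural network model. The paper also gives the concrete workflow for applying the model, which can serve as a practical guide for building Chinese terminology databases.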

Keywords: BERT  BiLSTM  CRF  terminology extraction

Automatic Extraction of Chinese Terminology Based on BERT Embedding and BiLSTM-CRF Model
Wu Jun, Cheng Yao, Hao Han, Ailiyaer·Aizezi, Liu Feixue, Su Yipo. Automatic Extraction of Chinese Terminology Based on BERT Embedding and BiLSTM-CRF Model[J]. Journal of the China Society for Scientific and Technical Information, 2020, 39(4): 409-418.
Authors: Wu Jun  Cheng Yao  Hao Han  Ailiyaer·Aizezi  Liu Feixue  Su Yipo
Affiliation: School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing 100876; Shenzhen Storm Intelligent Technology Co., Ltd., Beijing 100191
Abstract: High-quality recognition and extraction of professional terms plays an important role in domain information retrieval and knowledge graph construction. To improve the precision and recall of terminology recognition, we propose a Chinese terminology recognition and extraction approach that does not rely on domain-specific knowledge or hand-crafted features. Drawing on recent developments in representation learning, this study uses BERT embeddings as a character-based pre-trained model and combines them with a bidirectional long short-term memory (BiLSTM) network and a conditional random field (CRF) to extract deep learning terminology from 1,278 annotated corpus entries collected from domain e-books. The experimental results show that the proposed model reaches an F-score of 92.96% and outperforms competing algorithms, such as left and right entropy, mutual information, a word2vec-based similar-term recognition algorithm, and a BiLSTM-CRF model. Best practices and procedures for implementing the proposed model are also provided to guide users in applying and further improving it.
Keywords: BERT  BiLSTM  CRF  terminology recognition and extraction
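
The abstract reports precision, recall, and an F-score for term extraction but, as a summary, does not spell out how those figures are computed. The following is a minimal sketch of one common way to score a sequence labeller such as a BiLSTM-CRF: decode per-character tags into term spans and compare predicted spans with gold spans by exact match. The `B`/`I`/`O` tag set, the exact-match criterion, and the function names `bio_to_spans` and `span_prf` are assumptions made for illustration; the paper's own tagging scheme and evaluation protocol may differ.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Decode a per-character B/I/O tag sequence into a set of term spans.

    Assumed scheme (not confirmed by the paper): "B" opens a term,
    "I" continues it, anything else is outside a term.
    """
    spans: Set[Span] = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # close the term that was still open
                spans.add((start, i))
            start = i
        elif tag == "I":
            if start is None:          # tolerate an "I" with no preceding "B"
                start = i
        else:
            if start is not None:
                spans.add((start, i))
                start = None
    if start is not None:              # a term that runs to the end of the sentence
        spans.add((start, len(tags)))
    return spans


def span_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over term spans for one sentence."""
    gold_spans = bio_to_spans(gold)
    pred_spans = bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical 7-character sentence with two gold terms.
    gold = ["B", "I", "O", "B", "I", "I", "O"]
    pred = ["B", "I", "O", "B", "I", "O", "O"]  # second term cut one character short
    print(span_prf(gold, pred))
```

Under exact matching, a predicted term that misses a single character counts as both a false positive and a false negative, so boundary errors are penalised twice. Corpus-level scores would normally pool the span counts over all sentences before dividing, rather than averaging per-sentence values.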
This article is indexed in VIP, Wanfang Data, and other databases.