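Research on a Transformer Network-Based Method for Detecting Chinese Single-Character Word Errors (基于Transformer网络的中文单字词检错方法研究)
Cite this article: Cao Yang, Cao Cungen, Wang Shi. 基于Transformer网络的中文单字词检错方法研究 [J]. Journal of Chinese Information Processing (中文信息学报), 2021, 35(1): 135-142.
Authors: Cao Yang  Cao Cungen  Wang Shi
Affiliations: 1. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190;
2. University of Chinese Academy of Sciences, Beijing 100049
Funding: National Key Research and Development Program of China (2017YFC1700300; 2017YFB1002300)
Abstract: Automatic detection of wrongly written characters is an important research task in natural language processing and is valuable in applications such as search engines and automatic question answering. Although traditional methods achieve relatively high precision in detecting multi-character word errors in text, their precision on Chinese single-character word errors is low because of the particular nature of such errors. This paper proposes a Transformer network-based method for detecting Chinese single-character word errors. First, a training corpus of Chinese single-character word errors is built by making full use of a Chinese character confusion set and web pages. Second, at test time, a sliding window is applied to each sentence to be checked; single-character word error detection is performed on the sentence fragment in each window, and the results from the different windows are considered jointly. Experiments show that the method is practical. On an automatically generated test set, precision and recall reach 83.6% and 65.7%, respectively; on a real test set, they reach 82.8% and 61.4%.

Keywords: single-character word error detection  Transformer network  sliding window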
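
The abstract states that the training corpus is built from a character confusion set and web pages, but gives no further detail. The following is a minimal sketch of one plausible way to inject a single-character error into a clean sentence and produce per-character labels. The toy `CONFUSION_SET`, the function name `inject_error`, and the labeling scheme are assumptions for illustration, not the authors' procedure; in particular, the sketch works at the character level and skips the word segmentation that identifying single-character words would require.

```python
import random

# Hypothetical toy confusion set: each character maps to characters
# it is commonly confused with. A real confusion set is far larger.
CONFUSION_SET = {
    "在": ["再"],
    "的": ["地", "得"],
    "做": ["作"],
}

def inject_error(sentence, confusion_set, rng):
    """Replace one confusable character in `sentence`.

    Returns (corrupted_sentence, labels), where labels[i] is 1 at the
    corrupted position and 0 elsewhere. If no character is confusable,
    the sentence is returned unchanged with all-zero labels.
    """
    positions = [i for i, ch in enumerate(sentence) if ch in confusion_set]
    labels = [0] * len(sentence)
    if not positions:
        return sentence, labels
    pos = rng.choice(positions)
    replacement = rng.choice(confusion_set[sentence[pos]])
    corrupted = sentence[:pos] + replacement + sentence[pos + 1:]
    labels[pos] = 1
    return corrupted, labels

rng = random.Random(0)  # fixed seed so the example is repeatable
clean = "我在家做饭"
corrupted, labels = inject_error(clean, CONFUSION_SET, rng)
print(corrupted, labels)
```

Each clean sentence harvested from web text would yield one (corrupted sentence, labels) training pair under this scheme; sentences with no confusable character pass through as negative examples.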

A Transformer Approach to Error Detection of Chinese Single-character Word
CAO Yang,CAO Cungen,WANG Shi.A Transformer Approach to Error Detection of Chinese Single-character Word[J].Journal of Chinese Information Processing,2021,35(1):135-142.
Authors:CAO Yang  CAO Cungen  WANG Shi
Affiliation:1.Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
2.University of Chinese Academy of Sciences, Beijing 100049, China
Abstract:Automatic typo detection is an important research task in natural language processing, with significant value in applications such as search engines and automated question answering. Although traditional methods are relatively accurate at recognizing multi-character word typos in Chinese text, they generally have low accuracy on Chinese single-character word errors because of the particular nature of such errors. This paper proposes a method for identifying Chinese single-character word errors using a Transformer network. First, we make full use of a Chinese character confusion set and web pages to build a training corpus of Chinese single-character word errors. Second, at test time, a sliding window is applied to each sentence to be checked: single-character word error detection is performed on the sentence segment in each window, and the recognition results of the different windows are considered jointly. Experiments show that the method is practical: it achieves a precision of 83.6% and a recall of 65.7% on an artificial test set, and a precision of 82.8% and a recall of 61.4% on a real test set.
Keywords:single word error detection  Transformer network  sliding window  
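
The sliding-window step described in the abstract can be sketched as follows. The abstract does not say how results from different windows are combined, so the majority-style vote below, the `threshold` parameter, the window size, and the stand-in `toy_detector` (which replaces the Transformer model) are all assumptions made for illustration.

```python
from collections import defaultdict

def sliding_windows(sentence, size):
    """Yield (start, fragment) pairs with stride 1.

    A sentence no longer than `size` is returned as a single window.
    """
    if len(sentence) <= size:
        yield 0, sentence
        return
    for start in range(len(sentence) - size + 1):
        yield start, sentence[start:start + size]

def detect_errors(sentence, detect_fragment, size=5, threshold=0.5):
    """Return sentence positions flagged as errors.

    `detect_fragment(fragment)` must return the offsets it considers
    erroneous within the fragment. A position is reported when the
    fraction of covering windows that flag it is at least `threshold`
    (an assumed aggregation rule, not the paper's).
    """
    votes = defaultdict(int)  # windows that flagged each position
    seen = defaultdict(int)   # windows that covered each position
    for start, fragment in sliding_windows(sentence, size):
        flagged = set(detect_fragment(fragment))
        for offset in range(len(fragment)):
            seen[start + offset] += 1
            if offset in flagged:
                votes[start + offset] += 1
    return sorted(p for p in seen if votes[p] / seen[p] >= threshold)

def toy_detector(fragment):
    """Hypothetical stand-in for the model: flags every '再'."""
    return [i for i, ch in enumerate(fragment) if ch == "再"]

print(detect_errors("我再家里做饭吃", toy_detector, size=4))
```

Because every position is covered by several overlapping windows, a spurious flag from one window can be outvoted by the others, which is one reason to consider the window results jointly rather than trusting any single fragment.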
This article is indexed in VIP (维普) and other databases.