
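Unsupervised New Word Extraction from Chinese Social Media Data
Cite this article:ZHANG Jing, HUANG Kaiyu, LIANG Chen, HUANG Degen. Unsupervised New Word Extraction from Chinese Social Media Data[J]. Journal of Chinese Information Processing, 2018, 32(3):17.
Authors:ZHANG Jing  HUANG Kaiyu  LIANG Chen  HUANG Degen
Affiliation:School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
Funding:National Natural Science Foundation of China (61672127, 61672126)
Abstract:This paper combines word embeddings with traditional statistical measures to propose a new unsupervised method for new word identification. The method first uses traditional statistical measures to obtain candidate new words. It then trains word embeddings with several strategies, uses the embeddings to build a set of weak word-forming strings (strings that are unlikely to form words), and uses this set to filter the candidates with respect to both their internal composition and their external context. In addition, the authors manually annotated a word-segmented corpus of 10,000 microblog posts as a development corpus, which is used to analyze the traditional statistical measures and to tune the thresholds. The experiments use the training corpus of the NLPCC2015 microblog-oriented Chinese word segmentation evaluation task as the final test corpus. The results show that, for bigram new words, the F-score of the proposed method is 6.75% higher than that of the baseline system and 4.9% higher than that of the Overlap Variety method, one of the best current methods in new word identification. On the test corpus, the F-score for bigram and trigram new word identification reaches 56.2%.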

Keywords:unknown word recognition  social media corpus  word embedding  unsupervised method

Unsupervised New Word Extraction from Chinese Social Media Data
ZHANG Jing,HUANG Kaiyu,LIANG Chen,HUANG Degen.Unsupervised New Word Extraction from Chinese Social Media Data[J].Journal of Chinese Information Processing,2018,32(3):17.
Authors:ZHANG Jing  HUANG Kaiyu  LIANG Chen  HUANG Degen
Affiliation:School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
Abstract:To extract new words from Chinese social media data, this paper proposes a novel unsupervised method that combines traditional statistical measures with word embeddings. Traditional statistical measures are applied to extract a list of new word candidates from a segmented social media corpus. Word embeddings are then trained with multiple strategies and used to construct an anti-word set, which contains segments that are unlikely to combine with other segments to form a new word; this set is used to filter noise out of the candidate list. In addition, to analyze the traditional statistical measures and tune the thresholds, we annotated 10,000 tweets as a development corpus, which the experimental results show to be reliable. To assess the proposed method, the training corpus released for the NLPCC2015 microblog-oriented Chinese word segmentation evaluation is used as the test corpus. The results show that our method significantly improves new word extraction performance compared with the baseline systems. Bigram and trigram new word extraction on the test corpus reaches an F1-measure of 56.2%.
Keywords:unknown word recognition  social media data  word embedding  unsupervised method
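
The two-stage pipeline described in the abstract (statistical candidate extraction, then filtering against a set of segments that rarely form words) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration: the abstract does not name the statistical measures, so pointwise mutual information (PMI) with a frequency cutoff stands in for them; the function names, the thresholds, and the toy corpus are hypothetical; and the weak set is written by hand, whereas the paper derives it from word embeddings. Only an internal-composition filter is shown, not the external-context filter.

```python
import math
from collections import Counter


def bigram_candidates(sentences, min_count=2, min_pmi=1.0):
    """Return adjacent segment pairs that pass a frequency and PMI threshold.

    `sentences` is a list of already-segmented sentences (lists of strings).
    PMI is an assumed stand-in for the paper's "traditional statistical measures".
    """
    unigrams = Counter()
    bigrams = Counter()
    for segs in sentences:
        unigrams.update(segs)
        bigrams.update(zip(segs, segs[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    candidates = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi = math.log2(p_ab / (p_a * p_b))
        if pmi >= min_pmi:
            candidates[a + b] = ((a, b), pmi)
    return candidates


def filter_candidates(candidates, weak_set):
    """Drop candidates that contain any segment from the weak word-forming set."""
    return {
        word: info
        for word, info in candidates.items()
        if not any(seg in weak_set for seg in info[0])
    }


# Hypothetical toy corpus of segmented microblog-style sentences.
toy_corpus = [
    ["我", "喜欢", "蓝", "瘦", "香菇"],
    ["蓝", "瘦", "的", "一天"],
    ["我", "的", "蓝", "瘦"],
    ["我", "的", "一天"],
]
# Hand-written stand-in for the embedding-derived weak word-forming set.
toy_weak_set = {"的", "我"}

raw = bigram_candidates(toy_corpus)
kept = filter_candidates(raw, toy_weak_set)
print(sorted(raw), sorted(kept))
```

In this toy run, frequent pairs built around function words pass the statistical thresholds but are removed by the weak-set filter, which is the kind of noise the filtering stage is meant to handle.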