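Research on word segmentation of geological and mineral texts based on conditional random fields (基于条件随机场的地质矿产文本分词研究)
Cite this article: CHEN Jingwen (陈婧汶), CHEN Jianguo (陈建国), WANG Chengbin (王成彬), ZHU Yueqin (朱月琴). 基于条件随机场的地质矿产文本分词研究[J]. 中国矿业 (China Mining Magazine), 2018, 27(9).
Authors: CHEN Jingwen (陈婧汶), CHEN Jianguo (陈建国), WANG Chengbin (王成彬), ZHU Yueqin (朱月琴)
Affiliations: State Key Laboratory of Geological Processes and Mineral Resources, China University of Geosciences; Collaborative Innovation Center for Exploration of Scarce Mineral Resources, China University of Geosciences (Wuhan); School of Resources, China University of Geosciences (Wuhan); Key Laboratory of Geological Information Technology, Ministry of Natural Resources; Development and Research Center, China Geological Survey
Funding: Special Fund for Public Welfare Industry Research of the Ministry of Land and Resources, "Research and Pilot Application of Geological Big Data Technology" (No. 201511079-02); National Key R&D Program of China, project on deep-prospecting knowledge mining based on the "Geological Cloud" platform (No. 2016YFC0600510)
Abstract: Unlike English, Chinese has no natural delimiter such as a space between words, which makes word segmentation a hard problem in Chinese information processing. Geological and mineral texts contain many out-of-vocabulary geological terms, and no segmentation method currently handles them well. This paper explores a method that segments geological and mineral text with a conditional random field (CRF) model trained on two corpora, and compares it experimentally with a general-domain segmentation method and with a CRF model trained on a single corpus. In open testing, the proposed method clearly outperforms the other methods, with a precision of 94.80%, a recall of 92.68%, and an F-score of 93.73%. The method identifies out-of-vocabulary geological terms well while preserving the recognition rate for ordinary words, and so lays a foundation for natural language processing in the geological domain.

Keywords: Chinese word segmentation; geological and mineral text; conditional random fields; corpus; geological dictionary
Received: 2018/8/16
Revised: 2018/8/21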
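
CRF segmenters are commonly trained as per-character sequence labelling problems. The abstract does not state which tag set the authors used, so the sketch below assumes the widely used B/M/E/S scheme (begin, middle, end, single-character word). The function names and the example sentence are illustrative and do not come from the paper.

```python
def words_to_bmes(words):
    """Convert a list of segmented words into (character, tag) pairs.

    Assumed tag set (not confirmed by the paper):
    B = first character of a multi-character word, M = middle,
    E = last character, S = single-character word.
    """
    pairs = []
    for word in words:
        if not word:
            continue  # ignore empty strings
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            pairs.extend((ch, "M") for ch in word[1:-1])
            pairs.append((word[-1], "E"))
    return pairs


def bmes_to_words(pairs):
    """Rebuild words from (character, tag) pairs; a word ends at E or S."""
    words, current = [], ""
    for ch, tag in pairs:
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # flush a trailing word left open by an incomplete tag sequence
        words.append(current)
    return words


# Hypothetical example sentence: "granite / in / contains / quartz"
example_words = ["花岗岩", "中", "含有", "石英"]
example_pairs = words_to_bmes(example_words)
example_tags = "".join(tag for _, tag in example_pairs)
```

Under this formulation, a trained model predicts one tag per character, and `bmes_to_words` turns the predicted tags back into a segmentation.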

Research on segmentation of geological mineral text using conditional random fields
CHEN Jingwen, CHEN Jianguo, WANG Chengbin and ZHU Yueqin. Research on segmentation of geological mineral text using conditional random fields[J]. China Mining Magazine, 2018, 27(9).
Authors: CHEN Jingwen, CHEN Jianguo, WANG Chengbin and ZHU Yueqin
Affiliation: State Key Laboratory of Geological Processes and Mineral Resources, China University of Geosciences; Key Laboratory of Geological Information Technology, Ministry of Natural Resources
Abstract: Unlike English, Chinese has no spaces between words, which makes it difficult for machines to detect word boundaries. Geological mineral texts contain a large number of unknown geological terms, for which there is still no effective Chinese word segmentation method. This motivated us to develop a segmenter specifically for geological mineral text that combines a dictionary with a conditional random fields (CRF) model. We compare it experimentally with a generic segmentation method and with a CRF model that uses only a single corpus. The results show that the proposed method goes far towards solving the Chinese word segmentation problem for this kind of text, achieving 94.80% precision, 92.68% recall, and an F-score of 93.73%. The CRF-based segmenter identifies unknown geological terms well while maintaining the recognition rate for ordinary words. This work lays a foundation for natural language processing in the field of geology.
Keywords: Chinese word segmentation; geological mineral text; conditional random fields; corpus; geological dictionary
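
The precision, recall, and F-score reported in the abstract are the standard word-level measures for segmentation. The paper's own evaluation script is not shown here, so the sketch below assumes the usual convention: a predicted word counts as correct only when its character span exactly matches a gold-standard word. The function names and the toy sentence are illustrative.

```python
def word_spans(words):
    """Map a segmentation to the set of (start, end) character spans of its words."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def segmentation_prf(gold_words, predicted_words):
    """Word-level precision, recall, and F-score for one sentence."""
    gold = word_spans(gold_words)
    pred = word_spans(predicted_words)
    correct = len(gold & pred)  # words whose boundaries match exactly
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Toy example (not from the paper): the term 花岗岩 is wrongly split in two.
gold = ["花岗岩", "中", "含有", "石英"]
predicted = ["花岗", "岩", "中", "含有", "石英"]
p, r, f = segmentation_prf(gold, predicted)

# Consistency check on the figures reported in the abstract.
reported_f = round(100 * f_score(0.9480, 0.9268), 2)
```

In the toy example three of the five predicted words match the gold standard, so splitting a single domain term lowers both precision and recall. The reported figures are internally consistent: the harmonic mean of 94.80% and 92.68% rounds to 93.73%.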
This article is indexed in CNKI and other databases.