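Research and realization of Dictionary-Based Chinese-Tibetan Sentence Alignment
Cite this article: YU Xin, WU Jian, HONG Jinling. Research and realization of Dictionary-Based Chinese-Tibetan Sentence Alignment [J]. Journal of Chinese Information Processing, 2011, 25(4): 57-63.
Authors: YU Xin, WU Jian, HONG Jinling
Affiliations: 1. Institute of Software, Chinese Academy of Sciences, Beijing 100190; 2. Graduate University of the Chinese Academy of Sciences, Beijing 100190
Funding: High-Tech Project of the Western Action Plan, Chinese Academy of Sciences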
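Abstract: Alignment is one of the key technologies in bilingual corpus processing, and building a sentence-aligned corpus is the most basic task in corpus construction. Drawing on mature sentence alignment methods for other languages and taking into account the particular characteristics of Tibetan, this paper proposes a dictionary-based method for Chinese-Tibetan sentence alignment. The bilingual dictionary used for alignment was compiled and its word coverage was evaluated. During alignment, Chinese and Tibetan word segmentation were found to differ in granularity. To handle this, words are looked up further in the Tibetan-Chinese dictionary and compared against the Chinese sentence, which raises the score of correct sentence pairs and thus improves alignment accuracy. The method achieves an accuracy of 81.11%.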

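Keywords: Chinese-Tibetan sentence alignment; dictionary; word segmentation granularity; parallel corpus; Tibetan information processing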

Research and realization of Dictionary-Based Chinese-Tibetan Sentence Alignment
YU Xin, WU Jian, HONG Jinling. Research and realization of Dictionary-Based Chinese-Tibetan Sentence Alignment [J]. Journal of Chinese Information Processing, 2011, 25(4): 57-63.
Authors: YU Xin, WU Jian, HONG Jinling
Affiliations: 1. Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; 2. Graduate University of the Chinese Academy of Sciences, Beijing 100190, China
Abstract: Sentence-level alignment is a basic task in constructing a bilingual parallel corpus. Considering the specific characteristics of the Tibetan language, this paper proposes a dictionary-based method for automatic Chinese-Tibetan sentence alignment. It compiles a bilingual dictionary for alignment and evaluates its word coverage. To address the difference in granularity between Chinese and Tibetan word segmentation, the method further looks up the remaining larger-grained Tibetan words in a Tibetan-Chinese dictionary and matches their translations against the original Chinese sentence, which increases precision. Experiments show an average precision of 81.11% for this approach.
Keywords: Chinese-Tibetan sentence alignment; dictionary; word segmentation granularity; parallel corpus; Tibetan information processing
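
The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The toy dictionary (keyed by Wylie transliteration), the score defined as the fraction of matched Tibetan words, and the names `pair_score` and `best_match` are all assumptions made for illustration. The sketch first matches dictionary translations against the segmented Chinese tokens, then, as a granularity fallback, against the unsegmented Chinese sentence.

```python
from typing import Dict, List, Sequence

# Toy Tibetan-Chinese dictionary. Keys are Wylie transliterations; the entries
# are illustrative assumptions, not data from the paper.
BO2ZH: Dict[str, List[str]] = {
    "nga": ["我"],
    "slob grwa chen mo": ["大学"],
    "deb": ["书"],
    "klog": ["读", "阅读"],
}


def pair_score(bo_words: Sequence[str], zh_tokens: Sequence[str],
               bo2zh: Dict[str, List[str]] = BO2ZH,
               use_fallback: bool = True) -> float:
    """Return the fraction of Tibetan words whose translation occurs in the Chinese side."""
    if not bo_words:
        return 0.0
    token_set = set(zh_tokens)
    zh_sentence = "".join(zh_tokens)  # unsegmented Chinese sentence
    hits = 0
    for word in bo_words:
        translations = bo2zh.get(word, [])
        if any(t in token_set for t in translations):
            hits += 1  # exact match against a segmented Chinese token
        elif use_fallback and any(t in zh_sentence for t in translations):
            hits += 1  # granularity fallback: match inside the raw sentence
    return hits / len(bo_words)


def best_match(bo_words: Sequence[str],
               candidates: Sequence[Sequence[str]]) -> int:
    """Return the index of the highest-scoring Chinese candidate (hypothetical helper)."""
    return max(range(len(candidates)),
               key=lambda i: pair_score(bo_words, candidates[i]))


bo = ["nga", "slob grwa chen mo", "deb", "klog"]
zh_good = ["我", "在", "北京大学", "读书"]  # coarser Chinese segmentation
zh_bad = ["今天", "天气", "很", "好"]

print(pair_score(bo, zh_good, use_fallback=False))  # token matching only
print(pair_score(bo, zh_good))                      # with the fallback
print(best_match(bo, [zh_bad, zh_good]))
```

In this toy case the Chinese segmenter keeps "北京大学" and "读书" as single tokens, so token-level matching finds only one of the four Tibetan words. Matching inside the raw sentence recovers the other three, which raises the score of the correct pair.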
This article is indexed in Wanfang Data and other databases.