Similar literature
 20 similar documents found (search time: 640 ms)
1.
HMM Word and Phrase Alignment for Statistical Machine Translation   Cited by: 1 (self-citations: 0, citations by others: 1)
Estimation and alignment procedures for word and phrase alignment hidden Markov models (HMMs) are developed for the alignment of parallel text. The development of these models is motivated by an analysis of the desirable features of IBM Model 4, one of the original and most effective models for word alignment. These models are formulated to capture the desirable aspects of Model 4 in an HMM alignment formalism. Alignment behavior is analyzed and compared to human-generated reference alignments, and the ability of these models to capture different types of alignment phenomena is evaluated. In analyzing alignment performance, Chinese-English word alignments are shown to be comparable to those of IBM Model 4 even when models are trained over large parallel texts. In translation performance, phrase-based statistical machine translation systems based on these HMM alignments can equal and exceed systems based on Model 4 alignments, and this is shown in Arabic-English and Chinese-English translation. These alignment models can also be used to generate posterior statistics over collections of parallel text, and this is used to refine and extend phrase translation tables with a resulting improvement in translation quality.

2.
We describe methods for improving the performance of statistical machine translation (SMT) between four linguistically different languages, i.e., Chinese, English, Japanese, and Korean, by using morphosyntactic knowledge. To reduce translation ambiguities and generate grammatically correct and fluent translation output, we address the use of shallow linguistic knowledge, that is: (1) enriching a word with its morphosyntactic features, (2) obtaining shallow linguistically-motivated phrase pairs, (3) iteratively refining word alignment using filtered phrase pairs, and (4) building a language model from morphosyntactically enriched words. Previous studies reported that introducing syntactic features into SMT models resulted in only a slight improvement in performance in spite of the heavy computational expense; this study, however, demonstrates the effectiveness of morphosyntactic features when reliable, discriminative features are used. Our experimental results show that word representations that incorporate morphosyntactic features significantly improve the performance of the translation model and language model. Moreover, we show that refining the word alignment using fine-grained phrase pairs is effective in improving system performance.

3.
In most statistical machine translation (SMT) systems, bilingual segments are extracted via word alignment. However, there is a need for systematic study as to what alignment characteristics can benefit MT under specific experimental settings such as the type of MT system, the language pair or the type or size of the corpus. In this paper we perform, in each of these experimental settings, a statistical analysis of the data and study the sample correlation coefficients between a number of alignment or phrase table characteristics and variables such as the phrase table size, the number of untranslated words or the BLEU score. We report results for two different SMT systems (a phrase-based and an n-gram-based system) on Chinese-to-English FBIS and BTEC data, and Spanish-to-English European Parliament data. We find that the alignment characteristics which help in translation greatly depend on the MT system and on the corpus size. We give alignment hints to improve BLEU score, depending on the SMT system used and the type of corpus. For example, for phrase-based SMT, dense alignments are required with larger corpora, especially on the target side, while with smaller corpora, more precise, sparser alignments are better, especially on the source side. Avoiding some long-distance crossing links may also improve BLEU score with small corpora. We take these conclusions into account to modify two types of alignment systems, and get 1 to 1.6% relative improvements in BLEU score on two held-out corpora, although the improved system is different in each corpus.

4.
We study challenges raised by the order of Arabic verbs and their subjects in statistical machine translation (SMT). We show that the boundaries of post-verbal subjects (VS) are hard to detect accurately, even with a state-of-the-art Arabic dependency parser. In addition, VS constructions have highly ambiguous reordering patterns when translated to English, and these patterns are very different for matrix (main clause) VS and non-matrix (subordinate clause) VS. Based on this analysis, we propose a novel method for leveraging VS information in SMT: we reorder VS constructions into pre-verbal (SV) order for word alignment. Unlike previous approaches to source-side reordering, phrase extraction and decoding are performed using the original Arabic word order. This strategy significantly improves BLEU and TER scores, even on a strong large-scale baseline. Limiting reordering to matrix VS yields further improvements.

5.
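Research on a Re-alignment Method for Statistical Machine Translation   Cited by: 3 (self-citations: 0, citations by others: 3)
Word alignment is one of the key techniques in statistical machine translation. This paper proposes a re-alignment method that, starting from the forward and backward word alignments produced by the IBM models, identifies the parts where the two directions disagree. It then re-aligns those inconsistent parts to obtain a better symmetrized word alignment. The method can also use large-scale monolingual corpora to strengthen the alignment. Experimental results show that, compared with the heuristic-based symmetrization methods widely used in statistical machine translation, the proposed method enables a statistical machine translation system to achieve higher translation accuracy.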
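The first step described above, locating where the two directional alignments disagree, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `split_bidirectional_alignment` and the representation of an alignment as a set of index pairs are assumptions made for the example, and the re-alignment of the disputed links is not shown.

```python
def split_bidirectional_alignment(src2tgt, tgt2src):
    """Split two directional word alignments into agreed and disputed links.

    src2tgt: set of (i, j) links from the source-to-target model.
    tgt2src: set of (j, i) links from the target-to-source model.
    (Hypothetical representation chosen for this sketch.)
    """
    forward = set(src2tgt)
    # Flip the reverse direction so both sets use (source, target) order.
    backward = {(i, j) for (j, i) in tgt2src}
    agreed = forward & backward    # links proposed by both directions
    disputed = forward ^ backward  # links proposed by only one direction
    return agreed, disputed


agreed, disputed = split_bidirectional_alignment(
    {(0, 0), (1, 2), (2, 1)},  # source-to-target links
    {(0, 0), (2, 1), (2, 2)},  # target-to-source links, as (j, i)
)
```

In this toy case the links (0, 0) and (1, 2) are agreed, while (2, 1) and (2, 2) are the disputed part that a re-alignment step would revisit.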

6.
Statistical machine translation (SMT) is based on alignment models which learn from bilingual corpora the word correspondences between source and target language. These models are assumed to be capable of learning reorderings. However, the difference in word order between two languages is one of the most important sources of errors in SMT. In this paper, we show that SMT can take advantage of inductive learning in order to solve reordering problems. Given a word alignment, we identify those pairs of consecutive source blocks (sequences of words) whose translation is swapped, i.e. those blocks which, if swapped, generate a correct monotonic translation. Afterwards, we classify these pairs into groups, following recursively a co-occurrence block criterion, in order to infer reorderings. Inside the same group, we allow new internal combinations in order to generalize the reordering to unseen pairs of blocks. Then, we identify the pairs of blocks in the source corpora (both training and test) which belong to the same group. We swap them and we use the modified source training corpora to realign and to build the final translation system. We have evaluated our reordering approach in both alignment and translation quality. In addition, we have used two state-of-the-art SMT systems: a phrase-based and an n-gram-based system. Experiments are reported on the EuroParl task, showing improvements of around 1 point in the standard MT evaluation metrics (mWER and BLEU).

7.
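Phrase pair extraction is a key technique in phrase-based statistical machine translation. The widely used extraction method proposed by Och relies too heavily on the word alignment and can therefore extract only phrase pairs that are fully consistent with it. This paper presents a phrase extraction method based on a "relaxed scale": for phrase pairs that are not fully consistent, it uses part-of-speech tagging and dictionary information to decide whether to extract them. Relaxing the "full consistency" constraint ensures that target phrases are found for more source phrases. Experiments show that the method clearly improves on Och's method.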
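For context, the strict "full consistency" criterion that the abstract sets out to relax can be sketched as below. This is a simplified version of the standard consistency check, not the relaxed method of the paper: it omits the usual extension over unaligned target words, and the function name, the inclusive span encoding, and the `max_len` limit are assumptions for illustration.

```python
def extract_consistent_phrases(n_src, links, max_len=4):
    """Return phrase pairs (i1, i2, j1, j2), inclusive spans, consistent with links.

    n_src: number of source words.
    links: set of (i, j) word alignment links (source index, target index).
    """
    pairs = []
    for i1 in range(n_src):
        for i2 in range(i1, min(i1 + max_len, n_src)):
            # Target positions linked to any source word in the span.
            tgt = [j for (i, j) in links if i1 <= i <= i2]
            if not tgt:
                continue  # an unaligned source span yields no phrase pair here
            j1, j2 = min(tgt), max(tgt)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no target word in [j1, j2] links outside [i1, i2].
            if all(i1 <= i <= i2 for (i, j) in links if j1 <= j <= j2):
                pairs.append((i1, i2, j1, j2))
    return pairs


pairs = extract_consistent_phrases(3, {(0, 0), (1, 2), (2, 1)})
```

With these links, the source span 0..1 is rejected because its target span 0..2 contains a word linked to source word 2. A relaxed criterion would reconsider exactly such rejected spans using additional evidence.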

8.
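Research on Methods for Obtaining Phrase Translations in Chinese-Tibetan Phrase Pair Extraction   Cited by: 1 (self-citations: 0, citations by others: 1)
This paper extracts Tibetan translations for Chinese phrases to be translated from a Chinese-Tibetan corpus in the domain of laws, regulations, and official documents. Commonly used phrase pair extraction methods depend on resources such as part-of-speech or syntactic analysis, or on word alignment techniques. Given that Tibetan resources are currently scarce and the related lexical and syntactic techniques are immature, the paper proposes two methods for obtaining Tibetan translations: a Tibetan word-string frequency statistics method (TSM) and a Tibetan word-sequence intersection algorithm (TIA). TSM extracts 1-1 contiguous and non-contiguous phrases with an accuracy of about 90% but misses 1-n cases. TIA can extract 1-n contiguous and non-contiguous Tibetan chunks with an accuracy of 81%.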

9.
Long-range word order differences are a well-known problem for machine translation. Unlike the standard phrase-based models which work with sequential and local phrase reordering, the hierarchical phrase-based model (Hiero) embeds the reordering of phrases within pairs of lexicalized context-free rules. This allows the model to handle long range reordering recursively. However, the Hiero grammar works with a single nonterminal label, which means that the rules are combined together into derivations independently and without reference to context outside the rules themselves. Follow-up work explored remedies involving nonterminal labels obtained from monolingual parsers and taggers. As of yet, no labeling mechanisms exist for the many languages for which there are no good quality parsers or taggers. In this paper we contribute a novel approach for acquiring reordering labels for Hiero grammars directly from the word-aligned parallel training corpus, without use of any taggers or parsers. The new labels represent types of alignment patterns in which a phrase pair is embedded within larger phrase pairs. In order to obtain alignment patterns that generalize well, we propose to decompose word alignments into trees over phrase pairs. Besides this labeling approach, we contribute coarse and sparse features for learning soft, weighted label-substitution as opposed to standard substitution. We report extensive experiments comparing our model to two baselines: Hiero and the known syntax augmented machine translation (SAMT) variant, which labels Hiero rules with nonterminals extracted from monolingual syntactic parses. We also test a simplified labeling scheme based on inversion transduction grammar (ITG). For the Chinese–English task we obtain performance improvement up to 1 BLEU point, whereas for the German–English task, where morphology is an issue, a minor (but statistically significant) improvement of 0.2 BLEU points is reported over SAMT. While ITG labeling does give a performance improvement, it remains sometimes suboptimal relative to our proposed labeling scheme.

10.
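This paper studies Mongolian-Chinese machine translation models based on traditional statistical models and on neural networks. The neural translation models are CNN-based and RNN-based, and a fused model is obtained by combining the outputs of all translation models at the sentence level. To address the scarce resources and complex Mongolian morphology that Mongolian-Chinese translation faces, the paper proposes several translation techniques, improves each model, and performs morphological analysis and processing of Mongolian. For the CNN model, which gives the best translation results, a training method that fuses characters and phrases is used; the RNN-based model uses the same method and additionally adjusts the RNN attention mechanism with Giza++-guided alignment; for SMT, the re-alignment technique proposed by the authors' laboratory is used. The paper compares and analyzes the experimental results: these three techniques significantly improve the translation quality of the corresponding systems. In addition, Mongolian morphological analysis and processing plays an important role in alleviating data sparsity and improving translation quality.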

11.
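Gong Huimin, Duan Xiangyu, Zhang Min. 《计算机科学》 (Computer Science), 2017, 44(12): 216-220, 238
Word alignment is an important part of a statistical machine translation system, but it is usually computed with sequence models that ignore the structural information and linguistic characteristics of language, which leads to some alignment results that do not conform to those characteristics. This paper proposes a self-correction mechanism for word alignment to correct the erroneous parts. The mechanism uses linguistic prior knowledge to correct the alignment from coarse to fine granularity. First, a punctuation-based method performs coarse-grained correction on sentence pairs; then a method based on statistical features performs fine-grained correction on clause pairs. The self-correction process requires no other word alignment tool and no new corpus. Experimental results show that self-corrected word alignment significantly improves alignment accuracy and improves machine translation quality. The coarse-grained correction contributes the largest improvement in translation quality, the fine-grained correction also improves it, and combining the two yields a significant improvement over the baseline system.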

12.
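Chinese-Tibetan Phrase Extraction   Cited by: 1 (self-citations: 1, citations by others: 0)
This paper extracts bilingual phrase pairs from a Chinese-Tibetan parallel corpus in the domain of laws, regulations, and official documents. Given the current scarcity of Tibetan resources, it proposes a two-step Chinese-Tibetan phrase extraction method. The first step extracts valid Chinese chunks; this part is not the focus of the paper. The second step obtains the translations of the Chinese phrases to be translated; for this module, a Tibetan word-sequence intersection algorithm is proposed to extract Tibetan phrases. The algorithm extracts 1-1 and 1-n contiguous and non-contiguous Tibetan phrases well.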

13.
We propose a novel approach to cross-lingual language model and translation lexicon adaptation for statistical machine translation (SMT) based on bilingual latent semantic analysis. Bilingual LSA enables latent topic distributions to be efficiently transferred across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bilingual LSA framework, model adaptation can be performed by, first, inferring the topic posterior distribution of the source text and then applying the inferred distribution to an n-gram language model of the target language and translation lexicon via marginal adaptation. The background phrase table is enhanced with the additional phrase scores computed using the adapted translation lexicon. The proposed framework also features rapid bootstrapping of LSA models for new languages based on a source LSA model of another language. Our approach is evaluated on the Chinese–English MT06 test set using the medium-scale SMT system and the GALE SMT system measured in BLEU and NIST scores. Improvement in both scores is observed on both systems when the adapted language model and the adapted translation lexicon are applied individually. When the adapted language model and the adapted translation lexicon are applied simultaneously, the gain is additive. At the 95% confidence interval of the unadapted baseline system, the gain in both scores is statistically significant using the medium-scale SMT system, while the gain in the NIST score is statistically significant using the GALE SMT system.

14.
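In statistical machine translation, the phrase translation probability features have a major influence on the final translation. The traditional estimation method counts only the cases in which a bilingual phrase pair co-occurs and satisfies alignment consistency, and does not count other cases, so the estimated phrase translation probabilities are not accurate enough. In this paper, we modify the traditional phrase probability formula so that the estimation takes full account of the various ways in which phrases occur. Experimental results on multiple test sets demonstrate the effectiveness of our method.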
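The traditional estimate referred to above is commonly the relative frequency of extracted phrase pairs. The sketch below shows that baseline only, assuming the input is a list of (source phrase, target phrase) pairs already extracted under the consistency constraint; the modified formula proposed in the paper is not given in the abstract and is not reproduced here. The function name and toy data are hypothetical.

```python
from collections import Counter


def relative_frequency(phrase_pairs):
    """Estimate p(target | source) by relative frequency over extracted pairs."""
    pair_counts = Counter(phrase_pairs)
    src_counts = Counter(src for src, _ in phrase_pairs)
    return {
        (src, tgt): count / src_counts[src]
        for (src, tgt), count in pair_counts.items()
    }


# Toy extracted pairs (made up for illustration only).
probs = relative_frequency([("a", "x"), ("a", "x"), ("a", "y"), ("b", "z")])
```

Because only extracted pairs enter the counts, occurrences of a source phrase that never yield a consistent pair are invisible to this estimate, which is the gap the abstract points to.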

15.
Unknown words are one of the key factors that greatly affect translation quality. Traditionally, nearly all related research focuses on obtaining the translation of the unknown words. However, these approaches have two disadvantages. On the one hand, they usually rely on many additional resources such as bilingual web data; on the other hand, they cannot guarantee good reordering and lexical selection of surrounding words. This paper gives a new perspective on handling unknown words in statistical machine translation (SMT). Instead of making great efforts to find the translation of unknown words, we focus on determining the semantic function of the unknown word in the test sentence and keeping the semantic function unchanged in the translation process. In this way, unknown words can help the phrase reordering and lexical selection of their surrounding words even though they still remain untranslated. In order to determine the semantic function of an unknown word, we employ the distributional semantic model and the bidirectional language model. Extensive experiments on both phrase-based and linguistically syntax-based SMT models in Chinese-to-English translation show that our method can substantially improve the translation quality.

16.
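Morphological differences between Chinese and Mongolian and the small size of parallel corpora limit the performance of Chinese-Mongolian statistical machine translation. This paper introduces Mongolian morphological information into Chinese-Mongolian SMT. By segmenting Mongolian into morphemes, it constructs mappings between Chinese words and Mongolian morphemes and between Mongolian morphemes and Mongolian, compensating for the asymmetry in morphological structure between the two languages. Using morphemes as a pivot language, it trains Chinese-to-Mongolian-morpheme and Mongolian-morpheme-to-Mongolian SMT systems, builds a new phrase translation table and reordering model, and integrates them into Chinese-Mongolian SMT through multi-path decoding and multiple features. Experimental results show that introducing the morpheme-pivot phrase translation table and reordering model into the existing SMT approach clearly improves the BLEU score over the baseline system and, to some extent, mitigates the effects of data sparsity and morphological differences on Chinese-Mongolian SMT. The method is general: by combining information at the morpheme and phrase levels, it makes the two languages symmetric in morphological structure, and it applies not only to Chinese-Mongolian SMT but also to other morphologically asymmetric, low-resource language pairs.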

17.
In this paper, we describe a first version of a system for statistical translation and present experimental results. The statistical translation approach uses two types of information: a translation model and a language model. The language model used is a standard bigram model. The translation model is decomposed into lexical and alignment models. After presenting the details of the alignment model, we describe the search problem and present a dynamic programming-based solution for the special case of monotone alignments. So far, the system has been tested on two limited-domain tasks for which a bilingual corpus is available: the EuTrans traveller task (Spanish–English, 500-word vocabulary) and the Verbmobil task (German–English, 3000-word vocabulary). We present experimental results on these tasks. In addition to the translation of text input, we also address the problem of speech translation and suitable integration of the acoustic recognition process and the translation process.
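As a reminder of what a bigram language model computes, here is a minimal maximum-likelihood sketch. It is an illustration under stated assumptions, not the system described in the abstract: no smoothing is applied, and the boundary tokens `<s>` and `</s>`, the function name `train_bigram`, and the toy sentences are all choices made for the example.

```python
from collections import Counter


def train_bigram(sentences):
    """Return a function prob(prev, word) = count(prev, word) / count(prev)."""
    history_counts, bigram_counts = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        history_counts.update(tokens[:-1])  # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))

    def prob(prev, word):
        if history_counts[prev] == 0:
            return 0.0  # unseen history; a real model would smooth instead
        return bigram_counts[(prev, word)] / history_counts[prev]

    return prob


prob = train_bigram([["a", "b"], ["a", "c"]])
```

In a noisy-channel system of this kind, such language model scores are combined with translation model scores during search; practical models add smoothing so that unseen bigrams do not receive zero probability.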

18.
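Phrase Alignment Based on Central Chunk Expansion   Cited by: 1 (self-citations: 0, citations by others: 1)
Phrase equivalence pairs are widely used in dictionary compilation, machine translation, and cross-language information retrieval. A new phrase alignment method is proposed: it uses highly reliable dictionary alignment results to extract the central chunk of the translation of a source-language phrase, and determines the statistical boundary of the translation according to the translation expansion confidence. Starting from the central chunk and combining it with the statistical boundary, it generates all candidate translations of the source-language phrase, evaluates them, and selects the most reliable one. A greedy algorithm is also used to resolve crossing conflicts between the translation boundaries of source-language phrases. Experimental results show that the proposed method achieves an accuracy of 82.76% in open tests, outperforming other methods.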

19.
This paper presents an extended, harmonised account of our previous work on combining subsentential alignments from phrase-based statistical machine translation (SMT) and example-based MT (EBMT) systems to create novel hybrid data-driven systems capable of outperforming the baseline SMT and EBMT systems from which they were derived. In previous work, we demonstrated that while an EBMT system is capable of outperforming a phrase-based SMT (PBSMT) system constructed from freely available resources, a hybrid ‘example-based’ SMT system incorporating marker chunks and SMT subsentential alignments is capable of outperforming both baseline translation models for French–English translation. In this paper, we show that similar gains are to be had from constructing a hybrid ‘statistical’ EBMT system. Unlike the previous research, here we use the Europarl training and test sets, which are fast becoming the standard data in the field. On these data sets, while all hybrid ‘statistical’ EBMT variants still fall short of the quality achieved by the baseline PBSMT system, we show that adding the marker chunks to create a hybrid ‘example-based’ SMT system outperforms the two baseline systems from which it is derived. Furthermore, we provide further evidence in favour of hybrid systems by adding an SMT target-language model to the EBMT system, and demonstrate that this too has a positive effect on translation quality. We also show that many of the subsentential alignments derived from the Europarl corpus are created by either the PBSMT or the EBMT system, but not by both. In sum, therefore, despite the obvious convergence of the two paradigms, the crucial differences between SMT and EBMT contribute positively to the overall translation quality. 
The central thesis of this paper is that any researcher who continues to develop an MT system using either of these approaches will benefit further from integrating the advantages of the other model; dogged adherence to one approach will lead to inferior systems being developed.

20.
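The syntactic divergence between source and target languages has an important effect on the performance of statistical machine translation (SMT). Building on phrase-based Chinese-English SMT, a source-language pre-reordering method enhanced with N-best syntactic knowledge is proposed. First, N-best syntactic parsing is performed on the source input sentence and statistical probabilities are computed to obtain highly reliable subtree structures; an initial set of reordering rules is then extracted from these reliable subtrees according to word alignment information. Two strategies are used to optimize the initial rule set: rule filtering by derivation based on Chinese and English syntactic knowledge, and a rule probability threshold mechanism. Next, to reduce reordering inside phrases and preserve local phrase fluency, the source-language phrase translation table is used as a constraint so that reordering takes place only between phrase blocks. Finally, the syntactic parse tree of the source sentence is pre-reordered according to the optimized rule set and the phrase table constraint. Chinese-English SMT experiments on the NIST 2005 and 2008 test sets show that the proposed method improves the BLEU score over the baseline system by 0.68 and 0.83, respectively.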
