1.
We propose a novel approach to cross-lingual language model and translation lexicon adaptation for statistical machine translation
(SMT) based on bilingual latent semantic analysis. Bilingual LSA enables latent topic distributions to be efficiently transferred
across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bilingual LSA framework,
model adaptation can be performed by, first, inferring the topic posterior distribution of the source text and then applying
the inferred distribution to an n-gram language model of the target language and translation lexicon via marginal adaptation. The background phrase table is
enhanced with the additional phrase scores computed using the adapted translation lexicon. The proposed framework also features
rapid bootstrapping of LSA models for new languages based on a source LSA model of another language. Our approach is evaluated
on the Chinese–English MT06 test set using the medium-scale SMT system and the GALE SMT system measured in BLEU and NIST scores.
Improvement in both scores is observed on both systems when the adapted language model and the adapted translation lexicon
are applied individually. When the adapted language model and the adapted translation lexicon are applied simultaneously,
the gain is additive. At the 95% confidence interval of the unadapted baseline system, the gain in both scores is statistically
significant using the medium-scale SMT system, while the gain in the NIST score is statistically significant using the GALE
SMT system.
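The abstract names marginal adaptation but does not give its formula. As a sketch only, and an assumption rather than the authors' exact equation, one commonly used form of unigram marginal adaptation rescales the background n-gram probability by the ratio of the topic-adapted unigram marginal to the background unigram marginal. The symbols $P_{\mathrm{bg}}$, $P_{\mathrm{lsa}}$, $\beta$ and $Z(h)$ are illustrative names not taken from the abstract:

```latex
\[
P_{a}(w \mid h) \;=\; \frac{1}{Z(h)}
  \left( \frac{P_{\mathrm{lsa}}(w)}{P_{\mathrm{bg}}(w)} \right)^{\beta}
  P_{\mathrm{bg}}(w \mid h),
\qquad
Z(h) \;=\; \sum_{w'}
  \left( \frac{P_{\mathrm{lsa}}(w')}{P_{\mathrm{bg}}(w')} \right)^{\beta}
  P_{\mathrm{bg}}(w' \mid h)
\]
```

Here $h$ is the n-gram history, $P_{\mathrm{lsa}}(w)$ is the unigram marginal induced by the inferred topic posterior, and $\beta$ is a tuning exponent that controls how strongly the topic marginal reshapes the background model.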
2.
Word reordering is one of the challenging problems of machine translation. It is an important factor in the quality and efficiency of machine translation systems. In this paper, we introduce a novel reordering model based on an innovative structure named the phrasal dependency tree. The phrasal dependency tree is a modern syntactic structure which is based on dependency relationships between contiguous non-syntactic phrases. The proposed model integrates syntactic and statistical information within a log-linear model to deal with reordering problems. It benefits from phrase dependencies, translation directions (orientations) and translation discontinuity between translated phrases. In comparison with well-known and popular reordering models such as distortion, lexicalised and hierarchical models, the experimental study demonstrates the superiority of our model in terms of translation quality. Performance is evaluated for Persian → English and English → German translation tasks using Tehran parallel corpus and WMT07 benchmarks, respectively. The results report 1.54/1.7 and 1.98/3.01 point improvements over the baseline in terms of BLEU/TER metrics on Persian → English and German → English translation tasks, respectively. On average our model had a significant impact on precision, with a comparable recall value, with respect to the lexicalised and distortion models.
3.
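Aligned phrases are the core module that determines the quality of a statistical machine translation system. This paper proposes a hierarchical phrase model based on phrase-structure trees, which extends the hierarchical phrase model using the idea of the string-to-tree model. The proposed model extracts translation rules by combining English phrase-structure trees with bilingual aligned phrases, and uses a heuristic strategy to obtain extended syntactic labels for the translation rules. A statistical machine translation system using these translation rules produces stable translation results on different data sets, and its average BLEU score on the training and test sets is higher than the BLEU scores of the phrase-based model and the hierarchical phrase model.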
4.
John Hutchins 《Machine Translation》2005,19(3-4):197-211
In the last decade the dominant models of MT have been data-driven or corpus-based. Of the two main trends, statistical machine
translation and example-based machine translation (EBMT), the latter is much less clearly defined. In a review of the recently
published collection edited by Michael Carl and Andy Way, this essay surveys the basic processes, methods, main problems and
tasks of EBMT, and attempts to provide a definition of the essence of EBMT in comparison with statistical MT and traditional
rule-based MT.
Recent Advances in Example-based Machine Translation. Edited by Michael Carl and Andy Way. Dordrecht: Kluwer Academic Publishers, 2003. xxxi, 482pp. (Text, Speech and Language
Technology, vol. 21) ISBN: 1-4020-1400-7 (hardback), 1-4020-1401-5 (paperback).
5.
This paper proposes a novel method for phrase-based statistical machine translation based on the use of a pivot language.
To translate between languages $L_s$ and $L_t$ with limited bilingual resources, we bring in a third language, $L_p$, called the pivot language. For the language pairs $L_s$–$L_p$ and $L_p$–$L_t$, there exist large bilingual corpora. Using only $L_s$–$L_p$ and $L_p$–$L_t$ bilingual corpora, we can build a translation model for $L_s$–$L_t$. The advantage of this method lies in the fact that we can perform translation between $L_s$ and $L_t$ even if there is no bilingual corpus available for this language pair. Using BLEU as a metric, our pivot language approach
significantly outperforms the standard model trained on a small bilingual corpus. Moreover, with a small $L_s$–$L_t$ bilingual corpus available, our method can further improve translation quality by using the additional $L_s$–$L_p$ and $L_p$–$L_t$ bilingual corpora.
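The abstract does not say how the source-to-target model is derived from the two pivot-side models. One plausible reading, given here as an assumption and not as the authors' method, is to marginalise over pivot phrases: p(t | s) ≈ Σ_p p(p | s) · p(t | p). The minimal Python sketch below uses a hypothetical function name (`triangulate`) and invented toy tables:

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Combine p(pivot | src) and p(tgt | pivot) into p(tgt | src).

    Both arguments are nested dicts: outer key is the conditioning phrase,
    inner key is the translated phrase, value is a probability.
    """
    src_to_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # Pivot phrases with no entry in the second table contribute nothing.
            for tgt, p_tgt_given_pivot in pivot_to_tgt.get(pivot, {}).items():
                src_to_tgt[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in src_to_tgt.items()}


# Toy tables (invented values, for illustration only).
src_to_pivot = {"maison": {"house": 0.8, "home": 0.2}}
pivot_to_tgt = {
    "house": {"Haus": 0.9, "Gebaeude": 0.1},
    "home": {"Haus": 0.5, "Heim": 0.5},
}

pivot_table = triangulate(src_to_pivot, pivot_to_tgt)
```

In this toy case the induced table assigns "maison" → "Haus" the mass 0.8 · 0.9 + 0.2 · 0.5. A real system would also have to handle pivot phrases missing from one of the tables and would usually prune low-probability entries.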
6.
7.
This paper describes an example-based machine translation (EBMT) method based on tree–string correspondence (TSC) and statistical
generation. In this method, the translation example is represented as a TSC, which is a triple consisting of a parse tree
in the source language, a string in the target language, and the correspondence between the leaf node of the source-language
tree and the substring of the target-language string. For an input sentence to be translated, it is first parsed into a tree.
Then the TSC forest which best matches the input tree is searched for. Finally the translation is generated using a statistical
generation model to combine the target-language strings of the TSCs. The generation model consists of three features: the
semantic similarity between the tree in the TSC and the input tree, the translation probability of translating the source
word into the target word, and the language-model probability for the target-language string. Based on the above method, we
build an English-to-Chinese MT system. Experimental results indicate that the performance of our system is comparable with
phrase-based statistical MT systems.
8.
Some authors have recently devised adaptations of spectral grouping algorithms to integrate prior knowledge, as constrained eigenvalue problems. In this paper, we improve and adapt a recent statistical region merging approach to this task, as a non-parametric mixture model estimation problem. The approach appears to be attractive both for its theoretical benefits and its experimental results, as slight bias brings dramatic improvements over unbiased approaches on challenging digital pictures.
9.
J. Andrés-Ferrer, D. Ortiz-Martínez, I. García-Varea, F. Casacuberta 《Pattern recognition letters》2008,29(8):1072
In pattern recognition, an elegant and powerful way to deal with classification problems is based on the minimisation of the classification risk. The risk function is defined in terms of loss functions that measure the penalty for wrong decisions. However, in practice a trivial loss function is usually adopted (the so-called 0–1 loss function) that does not make the most of this framework. This work is focused on the study of different loss functions, and especially on those loss functions that do not depend on the class proposed by the system. Loss functions of this kind have allowed us to theoretically explain heuristics that are successfully used with very complex pattern recognition problems, such as (statistical) machine translation. A comparative experimental study has also been carried out to compare different proposals of loss functions in the practical scenario of machine translation.
10.
We describe a novel approach to MT that combines the strengths of the two leading corpus-based approaches: Phrasal SMT and
EBMT. We use a syntactically informed decoder and reordering model based on the source dependency tree, in combination with
conventional SMT models to incorporate the power of phrasal SMT with the linguistic generality available in a parser. We show
that this approach significantly outperforms a leading string-based Phrasal SMT decoder and an EBMT system. We present results
from two radically different language pairs, and investigate the sensitivity of this approach to parse quality by using two
distinct parsers and oracle experiments. We also validate our automated BLEU scores with a small human evaluation.
11.
D. Ortiz-Martínez, I. García-Varea, F. Casacuberta 《Pattern recognition letters》2008,29(8):1145
Statistical machine translation (SMT) has proven to be an interesting pattern recognition framework for automatically building machine translation systems from available parallel corpora. In the last few years, research in SMT has been characterized by two significant advances. First, the popularization of the so-called phrase-based statistical translation models, which allow local contextual information to be incorporated into the translation models. Second, the availability of larger and larger parallel corpora, which are composed of millions of sentence pairs and tens of millions of running words. Since phrase-based models basically consist of statistical dictionaries of phrase pairs, their estimation from very large corpora is a very costly task that yields a huge number of parameters which are to be stored in memory. The handling of millions of model parameters and a similar number of training samples has become a bottleneck in the field of SMT, as well as in other well-known pattern recognition tasks such as speech recognition or handwriting recognition, just to name a few. In this paper, we propose a general framework that deals with the scaling problem in SMT without introducing significant time overhead by means of the combination of different scaling techniques. This new framework is based on the use of counts instead of probabilities, and on the concept of cache memory.
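The abstract gives no implementation detail for "counts instead of probabilities" or for the cache. The sketch below is one illustrative reading under stated assumptions, not the authors' framework: raw phrase-pair counts are stored, relative-frequency probabilities are derived on demand, and a small in-memory cache holds recently computed values. The class name `CountPhraseTable` and its methods are hypothetical.

```python
from collections import Counter


class CountPhraseTable:
    """Toy phrase table that stores raw counts and derives probabilities lazily."""

    def __init__(self):
        self.pair_counts = Counter()  # c(src, tgt)
        self.src_counts = Counter()   # c(src)
        self._cache = {}              # (src, tgt) -> cached probability

    def add(self, src, tgt, n=1):
        """Record n more observations of the phrase pair (src, tgt)."""
        self.pair_counts[(src, tgt)] += n
        self.src_counts[src] += n
        # Counts changed, so previously cached probabilities may be stale.
        self._cache.clear()

    def prob(self, src, tgt):
        """Relative-frequency estimate p(tgt | src) = c(src, tgt) / c(src)."""
        key = (src, tgt)
        if key not in self._cache:
            total = self.src_counts[src]
            self._cache[key] = self.pair_counts[key] / total if total else 0.0
        return self._cache[key]


phrase_table = CountPhraseTable()
phrase_table.add("la casa", "the house", 3)
phrase_table.add("la casa", "the home", 1)
p_before = phrase_table.prob("la casa", "the house")  # 3 of 4 observations

phrase_table.add("la casa", "the home", 4)            # new training data arrives
p_after = phrase_table.prob("la casa", "the house")   # 3 of 8 observations
```

Storing counts means new training samples only increment integers, and no stored probability has to be renormalised. The cache here is a plain dictionary that is cleared on every update; a system at the scale the abstract describes would need a bounded cache and on-disk storage, which this sketch leaves out.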
12.
Dekai Wu 《Machine Translation》2005,19(3-4):213-227
We offer a perspective on EBMT from a statistical MT standpoint, by developing a three-dimensional MT model space based on
three pairs of definitions: (1) logical versus statistical MT, (2) schema-based versus example-based MT, and (3) lexical versus
compositional MT. Within this space we consider the interplay of three key ideas in the evolution of transfer, example-based,
and statistical approaches to MT. We depict how all translation models face these issues in one way or another, regardless
of the school of thought, and suggest where the real questions for the future may lie.
13.
14.
Sergei Nirenburg 《Machine Translation》1989,4(1):5-24
This paper provides an overview of the KBMT-89 project at Carnegie Mellon University's Center for Machine Translation, and therefore also of the special issue of this journal, which reports on the project. The knowledge-based approach to machine translation is presented and defended in a historical context. Various components of the system, key parts of which are described in subsequent papers of the issue, are introduced and paired with their computational motivations.
15.
Constructive machine translation evaluation
Stephen Minnis 《Machine Translation》1993,8(1-2):67-75
When surveying the many methods currently employed in MT evaluation, it is not immediately obvious that the methods used serve to increase the knowledge of the properties being measured. This report describes a constructive machine translation evaluation method, aimed at addressing this issue.
Edited version of a presentation given to the International Working Group on the Evaluation of Machine Translation Systems, Vaud, Switzerland, April 1991.
16.
17.
18.
Training recognizers for handwritten characters is still a very time consuming task involving tremendous amounts of manual annotations by experts. In this paper we present semi-supervised labeling strategies that are able to considerably reduce the human effort. We propose two different methods to label and later recognize characters in collections of historical archive documents. The first one is based on clustering of different feature representations and the second one incorporates a simultaneous retrieval on different representations. Hence, both approaches are based on multi-view learning and later apply a voting procedure for reliably propagating annotations to unlabeled data. We evaluate our methods on the MNIST database of handwritten digits and introduce a realistic application in form of a database of handwritten historical weather reports. The experiments show that our method is able to significantly reduce the human effort that is required to build a character recognizer for the data collection considered while still achieving recognition rates that are close to a supervised classification experiment.
19.
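An example-based Chinese–English machine translation strategy
This paper introduces an example-based Chinese–English machine translation strategy, focusing on the design of a Chinese–English bilingual corpus and on an algorithm for matching Chinese sentences against that corpus. When matching Chinese sentences, the method exploits the characteristics of Chinese and matches directly on Chinese characters, without segmenting the sentences into words. Determining the boundaries of matched fragments is another of the difficulties of example-based machine translation, and a corresponding solution is adopted. The joining and assembly of translated sentences is not studied in depth, because the strategy is intended for a multi-engine translation system, where it is used together with other translation strategies to improve the accuracy of the translation results. Example-based machine translation needs a large bilingual corpus as the basis for translation, and building a large corpus by hand is time-consuming and laborious, so the paper attempts to build the Chinese–English bilingual corpus automatically by computer, including text-level alignment and word-level alignment.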
20.
In Gaussian mixture modeling, it is crucial to select the number of Gaussians for a sample set, which becomes much more difficult
when the overlap in the mixture is larger. Under regularization theory, we aim to solve this problem using a semi-supervised
learning algorithm through incorporating pairwise constraints into entropy regularized likelihood (ERL) learning which can
make automatic model selection for Gaussian mixture. The simulation experiments further demonstrate that the presented semi-supervised
learning algorithm (i.e., the constrained ERL learning algorithm) can automatically detect the number of Gaussians with a
good parameter estimation, even when two or more actual Gaussians in the mixture are overlapped at a high degree. Moreover,
the constrained ERL learning algorithm leads to some promising results when applied to iris data classification and image
database categorization.