A general method which combines formant synthesis by rule and time-domain concatenation is proposed. This method utilizes the advantages of both techniques by maintaining naturalness while minimizing difficulties such as prosodic modification and spectral discontinuities at the point of concatenation. An integrated sampled natural glottal source (Matsui et al., 1991) and sampled voiceless consonants were incorporated into a real-time text-to-speech formant synthesizer. In special cases, voicing amplitude envelopes and formant transitions derived from natural speech were also utilized. Several listening tests were performed to evaluate these methods. We obtained a significant overall improvement in intelligibility over our previous formant synthesizer. Such improvements in intelligibility were previously obtained with a Japanese text-to-speech system using a related hybrid system (Kamai and Matsui, 1993), indicating the applicability of this method for multi-lingual synthesis. The results of subjective analyses showed that these methods can also improve naturalness and listenability factors.

This paper presents a prosodic phrasing model for Korean to be used in a text-to-speech synthesis (TTS) system. Read text corpora were morpho-syntactically parsed and prosodically labeled following the Penn Korean Treebank (Han, Chunghye, Ko, Eon-Suk, Yi, Heejong, Palmer, M., 2002. Penn Korean Treebank: development and evaluation. In: Proceedings of the 16th Pacific Asian Conference on Language and Computation. Korean Society for Language and Information.) and K-ToBI prosodic labeling conventions (Sun-Ah, J., 2000. K-ToBI (Korean ToBI) labelling conventions. Version 3.1. Available from: URL <http://www.linguistics.ucla.edu/people/jun/ktobi/K-tobi.html>.), respectively. Decision trees were trained with morpho-syntactic and textual distance features to predict locations of accentual and intonational phrase breaks. Our phrasing model cross-validated on a 300-sentence corpus (6936 words or 21,436 syllables, with an average of 72 syllables or 23 words per sentence) predicted non-breaks with F = 92.4% and breaks with F = 88.0% (F = 72.8% for accentual phrase breaks and F = 71.3% for intonational phrase breaks).
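
The F values quoted in the abstract above are per-class scores. As a minimal sketch, assuming that F denotes the balanced F-measure (the harmonic mean of precision and recall), which the abstract does not state explicitly, a per-label score could be computed as follows. The function name `f_measure` and the labels "B" (break) and "N" (non-break) are hypothetical and chosen only for illustration.

```python
def f_measure(predicted, reference, label):
    """Balanced F-measure for one label (assumed definition, not taken from the paper).

    predicted, reference: equal-length sequences of labels, one per word boundary.
    label: the class to score, e.g. "B" for a break (hypothetical label name).
    """
    pairs = list(zip(predicted, reference))
    true_pos = sum(1 for p, r in pairs if p == label and r == label)
    false_pos = sum(1 for p, r in pairs if p == label and r != label)
    false_neg = sum(1 for p, r in pairs if p != label and r == label)

    # Guard against empty denominators when a label is never predicted or never occurs.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

Scoring breaks and non-breaks separately, as the abstract does, guards against a model that looks accurate only because non-breaks dominate the corpus.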

The level of quality that can be achieved by modern concatenative text-to-speech synthesis heavily depends on the optimization criteria used in the unit selection process. While effective cost functions arise naturally for prosody assessment, the criteria typically selected to quantify discontinuities in the speech signal do not closely reflect users' perception of the resulting acoustic waveform. This paper introduces an alternative feature extraction paradigm, which eschews general purpose Fourier analysis in favor of a modal decomposition separately optimized for each boundary region. The ensuing transform framework preserves, by construction, those properties of the waveform which are globally relevant to each concatenation considered. In addition, it leads to a novel discontinuity measure which jointly, albeit implicitly, accounts for both interframe incoherence and discrepancies in formant frequencies/bandwidths. Experimental evaluations are conducted to characterize the behavior of this new metric, first on a contiguity prediction task, and then via a systematic listening comparison using a conventional metric as baseline. The results underscore the viability of the proposed framework in quantifying the perception of discontinuity between acoustic units.

Consonants in written Hindi often carry annotations indicating the nature of the following vowel, which is not written separately. When there is no explicit marking, schwa is the default vowel, but this vowel does not always emerge in a word’s pronunciation. In addition, morphological boundaries can block the deletion of inherent schwas. Previous implementations of schwa deletion in the domain of text-to-speech synthesis (Narasimhan et al., International Journal of Speech Technology, 7(4):319–333, 2004; Choudhury and Basu, Proceedings of the International Conference on Knowledge-Based Computer Systems, 343–353, 2002) delete schwa in phonetic environments that obey the phonotactic constraints of Hindi within word boundaries. Instead of using segmental contexts, in conjunction with a morphological analysis, to predict schwa deletion, we used an account of syllable structure and stress assignment for two- and three-syllable words (Beckman and Pierrehumbert, forthcoming) to predict the presence and absence of schwa in a corpus of phonetically transcribed Hindi. Our algorithm scored as high as 95% accuracy on the deletion of schwa from a small corpus of Hindi words.

Expressive text-to-speech (TTS) synthesis should contribute to the pleasantness, intelligibility, and speed of speech-based human-machine interactions which use TTS. We describe a TTS engine which can be directed, via text markup, to use a variety of expressive styles: here, questioning, contrastive emphasis, and conveying good and bad news. Differences in these styles lead us to investigate two approaches for expressive TTS, a "corpus-driven" and a "prosodic-phonology" approach. Each speaker records 11 h (excluding silences) of "neutral" sentences. In the corpus-driven approach, the speaker also records 1-h corpora in each expressive style; these segments are tagged by style for use during search, and decision trees for determining f0 contours and timing are trained separately for each of the neutral and expressive corpora. In the prosodic-phonology approach, rules translating certain expressive markup elements to tones and break indices (ToBI) are manually determined, and the ToBI elements are used in single f0 and duration trees for all expressions. Tests show that listeners identify synthesis in particular styles ranging from 70% correctly for "conveying bad news" to 85% for "yes-no questions". Further improvements are demonstrated through the use of speaker-pooled f0 and duration models.

Multimedia Tools and Applications - This article focuses on developing a system for high-quality synthesized and converted speech by addressing three fundamental principles. Although the noise-like...

In this paper, we present syllable-based duration modelling in the context of a prosody model for Standard Yorùbá (SY) text-to-speech (TTS) synthesis applications. Our prosody model is conceptualised around a modular holistic framework. This framework is implemented using the Relational Tree (R-Tree) techniques. An important feature of our R-Tree framework is its flexibility in that it facilitates the independent implementation of the different dimensions of prosody, i.e. duration, intonation, and intensity, using different techniques and their subsequent integration. We applied the Fuzzy Decision Tree (FDT) technique to model the duration dimension. In order to evaluate the effectiveness of FDT in duration modelling, we have also developed a Classification And Regression Tree (CART) based duration model using the same speech data. Each of these models was integrated into our R-Tree based prosody model. We performed both quantitative (i.e. Root Mean Square Error (RMSE) and Correlation (Corr)) and qualitative (i.e. intelligibility and naturalness) evaluations on the two duration models. The results show that CART models the training data more accurately than FDT. The FDT model, however, shows a better ability to extrapolate from the training data since it achieved a better accuracy for the test data set. Our qualitative evaluation results show that our FDT model produces synthesised speech that is perceived to be more natural than our CART model. In addition, we also observed that the expressiveness of FDT is much better than that of CART. That is because the representation in FDT is not restricted to a set of piece-wise or discrete constant approximation. We, therefore, conclude that the FDT approach is a practical approach for duration modelling in SY TTS applications.
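
The two quantitative measures named in the abstract above, RMSE and correlation, can be sketched in a few lines. This is only an illustration under stated assumptions: the correlation is taken to be the Pearson coefficient, which the abstract does not specify, and the function names `rmse` and `pearson` are hypothetical.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between predicted and observed durations."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


def pearson(predicted, observed):
    """Pearson correlation coefficient (assumed meaning of 'Corr')."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0.0 or var_o == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_o)
```

Computing both measures separately on training and test data is what exposes the contrast the abstract reports: a model can fit the training set closely yet generalise less well to unseen syllables.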

In this paper, speech coding techniques are integrated into a Mandarin text-to-speech system. By exploiting the intrinsic properties of Mandarin, we encode the acoustic features of 408 syllabic utterances into templates, each containing modeling parameters for speech synthesis. As a result, the developed TTS system demands merely 36 Kbytes to store all syllabic templates. In the synthesis stage, modeling parameters retrieved from the templates are modified according to the prosody estimated from a hierarchically layered model. To render a general view of the performance of this TTS system, we conduct listening tests and end up with 86.4% intelligibility and 97% comprehensibility. A simplified Mandarin TTS system is also implemented on an FPGA development board. The realization on an FPGA leads us to believe that such a TTS synthesizer can easily be incorporated into other portable devices as a voice interface.

This paper reports on a cooperative international evaluation of grapheme-to-phoneme (GP) conversion for text-to-speech synthesis in French. Test methodology and test corpora are described. The results for eight systems are provided and analysed in some detail. The contribution of this paper is twofold: on the one hand, it gives an accurate picture of the state-of-the-art in the domain of GP conversion for French, and points out the problems still to be solved. On the other hand, much room is devoted to a discussion of methodological issues for this task. We hope this could help future evaluations of similar systems in other languages.

This paper proposes a two-stage feedforward neural network (FFNN) based approach for modeling fundamental frequency (F0) values of a sequence of syllables. In this study, (i) linguistic constraints represented by positional, contextual and phonological features, (ii) production constraints represented by articulatory features and (iii) linguistic relevance tilt parameters are proposed for predicting intonation patterns. In the first stage, tilt parameters are predicted using linguistic and production constraints. In the second stage, F0 values of the syllables are predicted using the tilt parameters predicted from the first stage, and basic linguistic and production constraints. The prediction performance of the neural network models is evaluated using objective measures such as average prediction error (μ), standard deviation (σ) and linear correlation coefficient (γX,Y). The prediction accuracy of the proposed two-stage FFNN model is compared with other statistical models such as Classification and Regression Tree (CART) and Linear Regression (LR) models. The prediction accuracy of the intonation models is also analyzed by conducting listening tests to evaluate the quality of synthesized speech obtained after incorporation of intonation models into the baseline system. From the evaluation, it is observed that prediction accuracy is better for two-stage FFNN models, compared to the other models.

This paper proposes a method for tuning the weights of unit selection cost functions in a syllable-based text-to-speech (TTS) synthesis system. In this work, unit selection cost functions, namely target cost and concatenation cost, are designed appropriate to syllables. The method tunes the weights in such a way that perceptual preference patterns are appropriately considered while selecting the units. The method uses a genetic algorithm to derive the optimal weights. A fitness function is designed to map perceptual preference patterns into weights of unit selection cost functions. The effectiveness of the proposed method is evaluated by both subjective and objective measures. From the results, it is observed that the derived optimal weights yield better-quality synthesized speech than manually tuned weights.
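
To make the role of the tuned weights concrete, the sketch below shows a generic weighted unit-selection search: a dynamic-programming (Viterbi-style) pass that minimises the weighted sum of target and concatenation costs. It is a simplified illustration, not the paper's implementation; the genetic-algorithm tuning itself is not shown, and the names `select_units`, `target_cost`, `concat_cost`, `w_target` and `w_concat` are hypothetical.

```python
def select_units(candidates, target_cost, concat_cost, w_target, w_concat):
    """Pick one unit per position, minimising the weighted total cost.

    candidates:  list of non-empty lists; candidates[i] holds the units
                 available for position i.
    target_cost: function (position, unit) -> float.
    concat_cost: function (previous_unit, unit) -> float.
    Returns (total_cost, chosen_units).
    """
    # best[i][j] = (cheapest cost of a path ending at candidates[i][j], backpointer)
    best = [[(w_target * target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            # Cheapest way to arrive at this unit from any unit at position i - 1.
            arrive, back = min(
                (best[i - 1][k][0] + w_concat * concat_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((arrive + w_target * target_cost(i, unit), back))
        best.append(row)

    # Trace the cheapest path back from the last position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return total, path
```

Changing `w_target` and `w_concat` changes which path wins, which is why a weight-tuning procedure driven by listener preferences can alter the selected units without touching the cost functions themselves.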

This paper presents a novel intonation modelling approach and demonstrates its applicability using the Standard Yorùbá language. Our approach is motivated by the theory that abstract and realised forms of intonation and other dimensions of prosody should be modelled within a modular and unified framework. In our model, this framework is implemented using the Relational Tree (R-Tree) technique. The R-Tree is a sophisticated data structure for representing a multi-dimensional waveform in the form of a tree. Our R-Tree for an utterance is generated in two steps. First, the abstract structure of the waveform, called the Skeletal Tree (S-Tree), is generated using tone phonological rules for the target language. Second, the numerical values of the perceptually significant peaks and valleys on the S-Tree are computed using a fuzzy logic based model. The resulting points are then joined by applying interpolation techniques. The actual intonation contour is synthesised by the Pitch Synchronous Overlap and Add (PSOLA) technique using the Praat software. We performed both quantitative and qualitative evaluations of our model. The preliminary results suggest that, although the model does not predict the numerical speech data as accurately as contemporary data-driven approaches, it produces synthetic speech with comparable intelligibility and naturalness. Furthermore, our model is easy to implement, interpret and adapt to other tone languages.

This paper describes a comprehensive usability evaluation of an automated telephone banking system which employs text-to-speech (TTS) synthesis in offering additional detail on customers’ account transactions. The paper describes a series of four experiments in which TTS was employed to offer an extra level of detail to recent transactions listings within an established banking service which otherwise uses recorded speech from a professional recording artist. Results from the experiments show that participants welcome the added value of TTS in being able to provide additional detail on their account transactions, but that TTS should be used minimally in the service.

The optical virtual concatenation (OVC) function of the Terabit LAN was demonstrated for the first time at the iGrid 2005 workshop in San Diego, California. The Terabit LAN establishes a lambda group path (LGP) for an application where the number of lambdas/L2 connections in an LGP can be specified by the application. Each LGP is logically treated as one end-to-end optical path, so during parallel transport, the LGP channels have no relative latency deviation. However, optical path diversity (e.g. restoration) can cause LGP relative latency deviations and negatively affect quality of service. OVC hardware developed by NTT compensates for relative latency deviations to achieve a virtual bulk transport for the Electronic Visualization Laboratory’s (EVL) Scalable Adaptive Graphics Environment application.

Various techniques for achieving waveform synthesis are presented, with particular attention directed toward acoustic waves. The numerical method of conjugate gradient direction is proposed as an efficient tool for extracting the modal properties of an “ideal violin” as it undergoes a sweep-type excitation. The feasibility of the method is first established by demonstrating its effectiveness in synthesizing various waveforms and comparing it to other existing methodologies. In the ensuing analysis, the “ideal violin” is assumed to be an input–output system that can be effectively modeled as a set of independent linear second-order systems. Characteristics of the instrument are then extracted from its response to the forcing functions.

We give a counterexample to the conjecture which was originally formulated by Straubing in 1986 concerning a certain algebraic characterization of regular languages of level 2 in the Straubing-Thérien concatenation hierarchy of star-free languages.

Detailed anthropometric data are valuable in making well-informed and responsible design decisions. However, such data are available only for a few user populations around the world. More widely-available information is in the form of summary statistics (e.g., means and standard deviations) and the values of body measures at certain key percentiles (e.g., 5th, 50th, 95th). Such information, while useful, is not suitable for in-depth analyses of a population's variability, since it does not allow for the consideration of correlations between different body measures, does not describe irregular distributions of body dimensions, etc. This paper presents a new method that utilizes values of body measures at different percentiles in synthesizing a detailed anthropometric database for a virtual population of users. The procedure is demonstrated in the context of Japanese civilian youth and U.S. military, and is shown to be simple, accurate, easy to use, and applicable across these two anthropometrically dissimilar populations. The case study shows that the virtual population is statistically equivalent to the actual target population in a number of ways. In addition to achieving statistical equivalence with the actual population's body dimensions, the method also ensures that the synthesized individuals are composed of appropriate and realistic body proportions and combinations of anthropometry.

The general Petri net (GPN) is useful for modeling flexible manufacturing systems with multiple robots and workstations [15] and for parallel programs [8]. A problem with using reachability analysis for analyzing Petri nets (PN) is the large number of states generated. Most of the existing synthesis techniques do not deal with GPN. Koh et al. [15] invented a synthesis technique for GPN. We propose to improve their achievement by adding the simple Arc-ratio rules to Yaw's knitting technique [37, 38, 39] based on the notion of structure relationship together with new path generations, which mark the most distinct feature compared with other approaches. The synthesis rules and procedures of how to update the temporal matrix and structure synchronic distance are presented. The Arc-ratio rules for GPN are also presented. One can successfully synthesize complicated Petri nets using these rules. An example to synthesize a Petri net in [15] is illustrated. The correctness of each synthesis rule with an appropriate Arc-ratio rule for GPN is proved.
