Similar Literature
 20 similar documents found (search time: 15 ms)
1.
This paper describes a method of modeling the characteristics of a singing voice from polyphonic musical audio signals that include the sounds of various musical instruments. Because singing voices play an important role in musical pieces with vocals, such a representation is useful for music information retrieval (MIR) systems. The main problem in modeling the characteristics of a singing voice is the negative influence of accompaniment sounds. To solve this problem, we developed two methods: accompaniment sound reduction and reliable frame selection. The former makes it possible to calculate feature vectors that represent the spectral envelope of a singing voice after reducing accompaniment sounds. It first extracts the harmonic components of the predominant melody from sound mixtures and then resynthesizes the melody by using a sinusoidal model driven by these components. The latter method then estimates the reliability of each frame of the obtained melody (i.e., the influence of accompaniment sound) by using two Gaussian mixture models (GMMs), one for vocal and one for nonvocal frames, to select the reliable vocal portions of musical pieces. Finally, each song is represented by a GMM built from its reliable frames. This new representation of the singing voice is demonstrated to improve the performance of an automatic singer identification system and to enable an MIR system based on vocal timbre similarity.

2.
This paper presents a technique to transform high-effort voices into breathy voices using adaptive pre-emphasis linear prediction (APLP). The primary benefit of this technique is that it estimates a spectral emphasis filter that can be used to manipulate the perceived vocal effort. The other benefit of APLP is that it estimates a formant filter that is more consistent across varying voice qualities. This paper describes how constant pre-emphasis linear prediction (LP) estimates a voice source with a constant spectral envelope even though the spectral envelope of the true voice source varies over time. A listening experiment demonstrates how differences in vocal effort and breathiness are audible in the formant filter estimated by constant pre-emphasis LP. APLP is presented as a technique to estimate a spectral emphasis filter that captures the combined influence of the glottal source and the vocal tract upon the spectral envelope of the voice. A final listening experiment demonstrates how APLP can be used to effectively transform high-effort voices into breathy voices. The techniques presented here are relevant to researchers in voice conversion, voice quality, singing, and emotion.
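
As a point of reference for the constant pre-emphasis LP baseline mentioned in this abstract, the following is a minimal sketch of a fixed first-order pre-emphasis filter followed by autocorrelation-method linear prediction (Levinson-Durbin recursion). It does not implement APLP itself. The function names and the default coefficient `alpha=0.97` are assumptions chosen for illustration and do not come from the paper.

```python
def pre_emphasize(x, alpha=0.97):
    """Constant first-order pre-emphasis: y[n] = x[n] - alpha * x[n-1].

    alpha=0.97 is a commonly used value, assumed here for illustration.
    """
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sequence."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations from autocorrelation values r.

    Returns (a, err), where a = [1, a1, ..., ap] defines the inverse filter
    A(z) = 1 + a1 z^-1 + ... + ap z^-p and err is the final prediction
    error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

A typical use would be `levinson_durbin(autocorr(pre_emphasize(frame), p), p)` on a windowed speech frame with a hypothetical order `p`. In this constant scheme `alpha` stays the same for every frame, which is the limitation the abstract attributes to constant pre-emphasis LP.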

3.
A major challenge in identifying singers from monaural popular music recordings is to remove or alleviate the influence of accompaniments. Our system is realized in two stages. In the first stage, we exploit computational auditory scene analysis (CASA) to segregate the singing voice units from a mixture signal. First, the pitch of the singing voice is estimated to extract the pitch-based features of each unit in an acoustic vector. These features are then exploited to estimate the binary time-frequency (T-F) masks, where 1 indicates that the corresponding T-F unit is dominated by the singing voice, and 0 indicates otherwise. The regions dominated by the singing voice are considered reliable, and the other units are unreliable or missing. Thus the acoustic vector is incomplete. In the second stage, two missing feature methods, reconstruction of the acoustic vector and marginalization, are used to identify the singer from the incomplete acoustic vectors. For reconstruction, the complete acoustic vector is first reconstructed and then converted to obtain the Gammatone frequency cepstral coefficients (GFCCs), which are further used to identify the singer. For marginalization, the probabilities that the voice belongs to a certain singer are computed on the basis of only the reliable components. We find that the reconstruction method outperforms the marginalization method, while both methods perform well compared with another system, especially at signal-to-accompaniment ratios (SARs) of 0 dB and -3 dB.
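
The marginalization idea in this abstract, scoring each singer using only the components marked reliable by the binary mask, can be sketched as follows. This is a simplified illustration that assumes one diagonal-covariance Gaussian per singer; the classifier actually used in the paper is not specified in the abstract, and the function names and the toy models are hypothetical.

```python
import math


def masked_log_likelihood(x, mask, mean, var):
    """Log-likelihood of x under a diagonal Gaussian, using only the
    components whose mask value is 1 (reliable). Unreliable components are
    marginalized out, which for a diagonal Gaussian means skipping them."""
    ll = 0.0
    for xi, mi, mu, v in zip(x, mask, mean, var):
        if mi == 1:
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v)
    return ll


def identify_singer(x, mask, models):
    """Return the singer whose model gives the highest masked likelihood.

    models maps a singer name to a (mean, var) pair of equal-length lists.
    """
    return max(models,
               key=lambda name: masked_log_likelihood(x, mask, *models[name]))


# Toy example with two hypothetical singers and a 2-dimensional feature.
models = {"A": ([0.0, 0.0], [1.0, 1.0]), "B": ([5.0, 5.0], [1.0, 1.0])}
print(identify_singer([0.1, 5.0], [1, 0], models))  # only dimension 0 is used
```

In the toy example the second component strongly resembles singer B, but the mask marks it as unreliable, so the decision rests on the first component alone.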

4.
We propose a pitch-synchronous approach to designing a voice conversion system that takes into account the correlation between the excitation signal and the vocal tract system characteristics of the speech production mechanism. The glottal closure instants (GCIs), also known as epochs, are used as anchor points for analysis and synthesis of the speech signal. The Gaussian mixture model (GMM) is considered to be the state-of-the-art method for vocal tract modification in a voice conversion framework. However, GMM-based models generate overly smooth utterances and need to be tuned according to the amount of available training data. In this paper, we propose a support vector machine multi-regressor (M-SVR) based model that requires fewer tuning parameters to capture a mapping function between the vocal tract characteristics of the source and the target speaker. The prosodic features are modified using an epoch-based method and compared with the baseline pitch synchronous overlap and add (PSOLA) based method for pitch and time scale modification. The linear prediction residual (LP residual) signal corresponding to each frame of the converted vocal tract transfer function is selected from the target residual codebook using a modified cost function. The cost function is calculated based on the mapped vocal tract transfer function and its dynamics, along with minimum residual phase, pitch period, and energy differences with the codebook entries. The LP residual signal corresponding to the target speaker is generated by concatenating the selected frame and its previous frame so as to retain the maximum information around the GCIs. The proposed system is also tested using a GMM-based model for vocal tract modification. The average mean opinion score (MOS) and ABX test results are 3.95 and 85 for the GMM-based system and 3.98 and 86 for the M-SVR based system, respectively. The subjective and objective evaluation results suggest that the proposed M-SVR based model for vocal tract modification, combined with modified residual selection and an epoch-based model for prosody modification, can provide good-quality synthesized target output. The results also suggest that the proposed integrated system performs slightly better than the GMM-based baseline system designed using either an epoch-based or a PSOLA-based model for prosody modification.

5.
《Advanced Robotics》2013,27(1-2):105-120
We developed a three-dimensional mechanical vocal cord model for Waseda Talker No. 7 (WT-7), an anthropomorphic talking robot, for generating speech sounds with various voice qualities. The vocal cord model is a cover model that has two thin folds made of thermoplastic material. The model self-oscillates by airflow exhausted from the lung model and generates the glottal sound source, which is fed into the vocal tract for generating the speech sound. Using the vocal cord model, breathy and creaky voices, as well as the modal (normal) voice, were produced in a manner similar to the human laryngeal control. The breathy voice is characterized by a noisy component mixed with the periodic glottal sound source and the creaky voice is characterized by an extremely low-pitch vibration. The breathy voice was produced by adjusting the glottal opening and generating the turbulence noise by the airflow just above the glottis. The creaky voice was produced by adjusting the vocal cord tension, the sub-glottal pressure and the vibration mass so as to generate a double-pitch vibration with a long pitch interval. The vocal cord model used to produce these voice qualities was evaluated in terms of the vibration pattern as measured by a high-speed camera, the glottal airflow and the acoustic characteristics of the glottal sound source, as compared to the data for a human.

6.
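To provide a basis for selecting feature parameters in pathological voice recognition, an asymmetric mechanical model of the vocal folds is proposed to simulate and analyze diseased vocal folds. Based on the layered structure and tissue properties of the vocal folds, a mechanical vocal fold model is built and coupled with the glottal airflow to obtain the glottal source excitation waveform output by the model. A combined genetic particle swarm and quasi-Newton optimization algorithm (Genetic particle swarm optimization based on quasi-Newton method, GPSO-QN) is used to match the glottal source output by the model to an actual target glottal waveform and to extract the optimized model parameters. Simulation results show that the vocal fold model can produce glottal waveforms consistent with the actual glottal source, and they also confirm that asymmetry between the physiological tissues of the left and right vocal folds is an important cause of pathological voice.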

7.
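To increase the variety of the reading voice of the reading robot (JoyT0n), a voice transformation system based on a single speech database is designed. The initial voice synthesized by the reading robot's TTS (text to speech) module is decomposed into a voice excitation signal and a vocal tract filter signal, which are converted to the frequency domain for modification. Speaking rate, pitch, and timbre are changed by reconstructing the excitation signal from the short-time Fourier magnitude spectrum and by modifying the vocal tract filter parameters. The modified excitation signal and vocal tract filter signal are then resynthesized to produce a new voice signal. The system allows the reading robot to read aloud with a rich variety of emotions and tones without increasing the size of the speech database.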

8.
Acoustic analysis is a noninvasive technique based on digital processing of the speech signal. Techniques based on acoustic analysis are an effective tool to support screening for vocal and voice diseases, especially their early detection and diagnosis. Modern lifestyles have increased the risk of pathological voice problems. This work focuses on a robust, rapid, and accurate system for automatically distinguishing normal from pathological speech and for detecting the type of pathology. The system employs noninvasive, inexpensive, and fully automated measures of vocal tract characteristics and excitation information. Mel-frequency cepstral coefficients and linear prediction cepstral coefficients are used as acoustic features. The system uses Gaussian mixture model and hidden Markov model classifiers. The pathologies considered in our experiments are cerebral palsy, dysarthria, hearing impairments, laryngectomy, mental retardation, left side paralysis, quadriparesis, stammering, stroke, and tumour in the vocal tract. The experimental results show that, for classifying normal and pathological voice, a hidden Markov model with mel-frequency cepstral coefficients plus delta and acceleration coefficients gives 94.44% efficiency. Likewise, for identifying the type of pathology, a Gaussian mixture model with mel-frequency cepstral coefficients plus delta and acceleration coefficients gives 95.74% efficiency.

9.
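For articulation abnormalities, this paper proposes a vocal tract simulation method to assist therapy. Because the vocal tract is a curved, three-dimensional acoustic tube with slowly time-varying characteristics, and sound waves propagate in it as plane waves, the vocal tract can be treated as equivalent to a cylindrical or elliptical tube with varying cross-sections. Formants are obtained in pole form on the basis of Newton interpolation. The vocal tract is divided into 60 sections, and the area at each location is obtained from empirical formulas. Nine parameters describing the vocal tract characteristics are defined and then optimized using the Corana algorithm. A radiation model describes the behavior of the sound after it radiates from the lips. Finally, the sound is synthesized and can be used for feedback therapy. Experiments show that this vocal tract simulation model can provide a reference for formulating suitable treatment methods.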
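
To make the sectioned-tube picture concrete, here is a minimal sketch of how a list of cross-sectional areas can be turned into junction reflection coefficients, in the style of a standard lossless concatenated-tube (Kelly-Lochbaum) model. The abstract does not say that the paper uses this formulation, so this is an assumed illustration. The function name is hypothetical, and the sign convention (areas ordered from glottis to lips, positive when the tube widens) is one of several in use.

```python
def reflection_coefficients(areas):
    """Reflection coefficient at each junction between adjacent sections.

    areas: cross-sectional areas of the tube sections, ordered from glottis
    to lips (assumed ordering). For N sections there are N - 1 junctions.
    The value is 0 where two neighbouring sections have the same area.
    """
    return [(a_next - a) / (a_next + a)
            for a, a_next in zip(areas, areas[1:])]
```

For a 60-section tube, this yields 59 coefficients, each strictly between -1 and 1 as long as all areas are positive.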

10.
In this paper, we present a comparative analysis of artificial neural networks (ANNs) and Gaussian mixture models (GMMs) for the design of a voice conversion system using line spectral frequencies (LSFs) as feature vectors. Both the ANN- and GMM-based models are explored to capture nonlinear mapping functions for modifying the vocal tract characteristics of a source speaker according to a desired target speaker. The LSFs are used to represent the vocal tract transfer function of a particular speaker. Mapping of the intonation patterns (pitch contour) is carried out using a codebook-based model at the segmental level. The energy profile of the signal is modified using a fixed scaling factor defined between the source and target speakers at the segmental level. Two different methods for residual modification, residual copying and residual selection, are used to generate the target residual signal. The performance of the ANN- and GMM-based voice conversion (VC) systems is evaluated using subjective and objective measures. The results indicate that the proposed ANN-based model using the LSF feature set may be used as an alternative to the state-of-the-art GMM-based models used to design a voice conversion system.

11.
Primary voice production occurs in the larynx through vibrational movements carried out by vocal folds. However, many problems can affect this complex system resulting in voice disorders. In this context, time–frequency–shape analysis based on embedding phase space plots and nonlinear dynamics methods have been used to evaluate the vocal fold dynamics during phonation. For this purpose, the present work used high-speed video to record the vocal fold movements of three subjects and extract the glottal area time series using an image segmentation algorithm. This signal is used for an optimization method which combines genetic algorithms and a quasi-Newton method to optimize the parameters of a biomechanical model of vocal folds based on lumped elements (masses, springs and dampers). After optimization, this model is capable of simulating the dynamics of recorded vocal folds and their glottal pulse. Bifurcation diagrams and phase space analysis were used to evaluate the behavior of this deterministic system in different circumstances. The results showed that this methodology can be used to extract some physiological parameters of vocal folds and reproduce some complex behaviors of these structures contributing to the scientific and clinical evaluation of voice production.

12.
The human larynx is an important organ for voice production and respiratory mechanisms. The vocal cords are approximated for voice production and open for breathing. The videolaryngoscope is widely used for vocal cord examination. At present, physicians usually diagnose vocal cord diseases by manually selecting the image in which the vocal cords are open to the largest extent (abduction), thus maximally exposing the vocal cord lesion. On the other hand, the severity of diseases such as vocal palsy and atrophic vocal cord depends largely on the image in which the vocal cords are closed to the smallest extent (adduction). Therefore, diseases can be assessed from the image of the vocal cords open to the largest extent, and the seriousness of breathy voice is closely correlated with the gap between the vocal cords when closed to the smallest extent. The aim of the study was to design an automatic vocal cord image selection system to improve on the conventional selection process used by physicians and to enhance diagnostic efficiency. In addition, because the examination process produces unwanted blurred images caused by human factors as well as non-vocal cord images, texture analysis is added in this study to measure image entropy and establish a screening and elimination system that effectively enhances the accuracy of selecting the image of the vocal cords closed to the smallest extent.

13.
The objective of voice conversion system is to formulate the mapping function which can transform the source speaker characteristics to that of the target speaker. In this paper, we propose the General Regression Neural Network (GRNN) based model for voice conversion. It is a single pass learning network that makes the training procedure fast and comparatively less time consuming. The proposed system uses the shape of the vocal tract, the shape of the glottal pulse (excitation signal) and long term prosodic features to carry out the voice conversion task. In this paper, the shape of the vocal tract and the shape of source excitation of a particular speaker are represented using Line Spectral Frequencies (LSFs) and Linear Prediction (LP) residual respectively. GRNN is used to obtain the mapping function between the source and target speakers. The direct transformation of the time domain residual using Artificial Neural Network (ANN) causes phase change and generates artifacts in consecutive frames. In order to alleviate it, wavelet packet decomposed coefficients are used to characterize the excitation of the speech signal. The long term prosodic parameters namely, pitch contour (intonation) and the energy profile of the test signal are also modified in relation to that of the target (desired) speaker using the baseline method. The relative performances of the proposed model are compared to voice conversion system based on the state of the art RBF and GMM models using objective and subjective evaluation measures. The evaluation measures show that the proposed GRNN based voice conversion system performs slightly better than the state of the art models.

14.
We propose a Mandarin Chinese singing voice synthesis system that uses a hidden Markov model (HMM)-based speech synthesis technique. A Mandarin Chinese singing voice corpus is recorded, and musical contextual features are designed for training. The F0 and spectrum of the singing voice are modeled simultaneously with context-dependent HMMs. This raises a new problem: F0 data for the singing voice are always sparse because of the large number of contexts, i.e., tempo and pitch of the note, key, time signature, etc. As a result, features that hardly ever appear in the training data cannot be modeled well. To address this problem, the difference between the F0 of the singing voice and that of the musical score (DF0) is modeled by a single Viterbi training. To overcome over-smoothing of the generated F0 contour, a syllable-level F0 model based on the discrete cosine transform (DCT) is applied, and the F0 contour is generated by integrating the two-level statistical models. The experimental results demonstrate that the proposed system outperforms the baseline system in both objective and subjective evaluations. The proposed system can generate a more natural F0 contour. Furthermore, the syllable-level F0 model can make the singing voice more expressive.

15.
This paper describes a robust glottal source estimation method based on a joint source-filter separation technique. In this method, the Liljencrants-Fant (LF) model, which models the glottal flow derivative, is integrated into a time-varying ARX speech production model. These two models are estimated in a joint optimization procedure, in which a Kalman filtering process is embedded for adaptively identifying the vocal tract parameters. Since the formulated joint estimation problem is a multiparameter nonlinear optimization procedure, we separate the optimization procedure into two passes. The first pass initializes the glottal source and vocal tract models by solving a quasi-convex approximate optimization problem. Having robust initial values, the joint estimation procedure determines the accuracy of model estimation implemented with a trust-region descent optimization algorithm. Experiments with synthetic and real voice signals show that the proposed method is a robust glottal source parameter estimation method with a high degree of accuracy.

16.
The authors present new results in solving problems of concatenative segment synthesis of voice information with prosody and vocal utterance, computer modeling of human voice signals based on joint models of human voice source and vocal tract, and speech signal preprocessing for automated documenting systems. The experiments show the efficiency of the proposed approaches.

17.
Vocal cord diseases can cause irregular vibration of the vocal cords, resulting in abnormalities. Therefore, it is necessary to study abnormal vocal cords in a vocal cord model. Research that focuses on vocal cord diseases mainly combines acoustic parameters and pattern recognition. However, it is also important to study the causes of vocal abnormalities in vocal cord diseases. In this paper, a bionic vocal system is modeled, and the influence of pulmonary airflow changes on glottic vibration excitation is analyzed. The effects of asymmetric vocal polyps on changes to the vocal airflow and flow field are studied, showing that the proposed model can assist in the detection of abnormal voice.

18.
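To improve the accuracy of dysarthria recognition, a dysarthric speech recognition method based on a combination of multiple features is proposed. A genetic algorithm is used for feature selection: from a combined feature set made up of five feature types (prosodic, spectral, auditory perception, voice quality, and vocal tract model features), the feature subset with the highest classification accuracy is selected, and the selected features are then classified with an SVM classifier. On the Torgo acoustic and articulatory database, different types of speech stimuli are ...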

19.
This paper reports an original technique for accurately estimating the parameters of human vocal tract filters for vowels in English for speech processing applications such as voice recognition. In this paper, the vocal tract filter design problem is reformulated as a general nonlinear optimization problem and solved using a hybrid Genetic Algorithm (GA). The hybrid GA computes a rough estimate of the global minimum using GA and refines using computationally cheap local search. Issues that are of concern in digital filtering such as achieving stability and overcoming finite precision effects are addressed. The objective function for optimization used in this paper is formulated in terms of poles and zeros of the filters to avoid ill-conditioning and to take advantage of symmetries in the location of poles and zeros. Simulation results indicate that the approach presented in this paper provides a close fit in terms of mean square error between the experimental and designed filters.
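
One reason a pole-zero formulation helps with stability can be illustrated with a short sketch: if each complex-conjugate pole pair is parameterized by a radius and an angle, keeping every radius below 1 guarantees a stable all-pole section, and the conjugate symmetry keeps the coefficients real. This is an assumed illustration of the general idea, not the paper's objective function or its hybrid GA. All function names are hypothetical.

```python
import math


def pole_pair_to_denominator(radius, angle):
    """Real second-order denominator [1, a1, a2] for the conjugate pole
    pair radius * exp(+/- j * angle), i.e. 1 - 2 r cos(angle) z^-1 + r^2 z^-2."""
    return [1.0, -2.0 * radius * math.cos(angle), radius * radius]


def multiply_polynomials(p, q):
    """Coefficient-wise product of two polynomials in z^-1."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def denominator_from_pole_pairs(pairs):
    """Cascade several pole pairs into one denominator polynomial.

    pairs: iterable of (radius, angle). Radii outside [0, 1) are rejected,
    so every filter that can be represented is stable by construction.
    """
    coeffs = [1.0]
    for radius, angle in pairs:
        if not 0.0 <= radius < 1.0:
            raise ValueError("pole radius must lie in [0, 1) for stability")
        coeffs = multiply_polynomials(
            coeffs, pole_pair_to_denominator(radius, angle))
    return coeffs
```

An optimizer that searches over `(radius, angle)` pairs instead of raw filter coefficients therefore never has to check stability after the fact. That is one plausible reading of the stability remark in the abstract.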

20.
O'Malley  M.H. 《Computer》1990,23(8):17-23
The historical and theoretical bases of contemporary high-performance text-to-speech (TTS) systems and their current design are discussed. The major elements of a TTS system are described, with particular reference to vocal tract models. The stages involved in the process of converting text into speech parameters are examined, covering text normalization, word pronunciation, prosodies, phonetic rules, voice tables, and hardware implementation. Examples are drawn mainly from Berkeley Speech Technologies' proprietary text-to-speech system, T-T-S, but other approaches are indicated briefly.
