期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

High quality voice conversion using prosodic and high-resolution spectral features

Hy Quy Nguyen Siu Wa Lee Xiaohai Tian Minghui Dong Eng Siong Chng 《Multimedia Tools and Applications》2016,75(9):5265-5285

Voice conversion methods have advanced rapidly over the last decade. Studies have shown that speaker characteristics are captured by spectral feature as well as various prosodic features. Most existing conversion methods focus on the spectral feature as it directly represents the timbre characteristics, while some conversion methods have focused only on the prosodic feature represented by the fundamental frequency. In this paper, a comprehensive framework using deep neural networks to convert both timbre and prosodic features is proposed. The timbre feature is represented by a high-resolution spectral feature. The prosodic features include F0, intensity and duration. It is well known that DNN is useful as a tool to model high-dimensional features. In this work, we show that DNN initialized by our proposed autoencoder pretraining yields good quality DNN conversion models. This pretraining is tailor-made for voice conversion and leverages on autoencoder to capture the generic spectral shape of source speech. Additionally, our framework uses segmental DNN models to capture the evolution of the prosodic features over time. To reconstruct the converted speech, the spectral feature produced by the DNN model is combined with the three prosodic features produced by the DNN segmental models. Our experimental results show that the application of both prosodic and high-resolution spectral features leads to quality converted speech as measured by objective evaluation and subjective listening tests. 相似文献

2.

Chia-Hung Yeh Wen-Yu Tseng Chia-Yen Chen Yu-Dun Lin Yi-Ren Tsai Hsuan-I Bi Yu-Ching Lin Ho-Yi Lin 《Multimedia Tools and Applications》2014,73(3):2103-2128

相似文献

3.

Modeling timbre distance with temporal statistics from polyphonic music

Morchen F. Ultsch A. Thies M. Lohken I. 《IEEE transactions on audio, speech, and language processing》2006,14(1):81-90

Timbre distance and similarity are expressions of the phenomenon that some music appears similar while other songs sound very different to us. The notion of genre is often used to categorize music, but songs from a single genre do not necessarily sound similar and vice versa. In this work, we analyze and compare a large amount of different audio features and psychoacoustic variants thereof for the purpose of modeling timbre distance. The sound of polyphonic music is commonly described by extracting audio features on short time windows during which the sound is assumed to be stationary. The resulting down sampled time series are aggregated to form a high-level feature vector describing the music. We generated high-level features by systematically applying static and temporal statistics for aggregation. The temporal structure of features in particular has previously been largely neglected. A novel supervised feature selection method is applied to the huge set of possible features. The distances of the selected feature correspond to timbre differences in music. The features show few redundancies and have high potential for explaining possible clusters. They outperform seven other previously proposed feature sets on several datasets with respect to the separation of the known groups of timbrally different music. 相似文献

4.

Dancing-to-Music Character Animation 总被引：1，自引：0，他引：1

Takaaki Shiratori Atsushi Nakazawa Katsushi Ikeuchi 《Computer Graphics Forum》2006,25(3):449-458

相似文献

5.

OMR metrics and evaluation: a systematic review

Mengarelli Luciano Kostiuk Bruno Vitório João G. Tibola Maicon A. Wolff William Silla Carlos N. 《Multimedia Tools and Applications》2020,79(9-10):6383-6408

Multimedia Tools and Applications - Music is rhythm, timbre, tones, intensity and performance. Conventional Western Music Notation (CWMN) is used to generate Music Scores in order to register music... 相似文献

6.

基于改进投票机制的音乐流派分类方法研究

下载免费PDF全文

杨翠丽郭昭辉武港山《计算机工程》2008,34(9):213-215

在音乐流派分类过程中,音乐流派局部特征与整体特征不一致时,通常采用的局部特征投票取最大的方法(MaxVote)在音频片段流派分类精度不高,而流派特征分布比较均衡时分类结果不合理。针对以上问题,该文提出基于音乐片段流派分布特征的神经网络投票机制(NNVote)和结合高层音乐节奏特征的RhythmNNVote投票方法。实验结果表明,NNVote方法在7个流派上的分类总精度达到68.9%,较MaxVote提高将近10%。相似文献

7.

"The way it Sounds": timbre models for analysis and retrieval of music signals

Aucouturier J.-J. Pachet F. Sandler M. 《Multimedia, IEEE Transactions on》2005,7(6):1028-1035

相似文献

8.

Music emotion classification and context-based music recommendation 总被引：1，自引：0，他引：1

Byeong-jun Han Seungmin Rho Sanghoon Jun Eenjun Hwang 《Multimedia Tools and Applications》2010,47(3):433-460

相似文献

9.

交互式音乐类比生成

黄润泽郑茜颖周海芳《计算机应用研究》2021,38(9):2609-2613

类比生成是计算机生成自然和创造性音乐作品的一种关键方法.使用类比生成能够将高层次的音乐特征从一个作品转移到另一个.为了在进行高效类比的同时也能够控制音乐的特征属性,提出了一种新型的显式特征解耦的编解码模型,由编码器解开以和弦为条件的音乐片段的音高和节奏表示,并用解码器还原成原始的音乐.在进行音乐类比生成时,该模型能够使一个作品借用其他作品的表现形式,用不同的音高轮廓、节奏模式进行创作.另外,得益于可视化的特征编码方式,该模型可以对不同的特征属性进行直观控制. 相似文献

10.

Towards Timbre-Invariant Audio Features for Harmony-Based Music

Muller M. Ewert S. 《IEEE transactions on audio, speech, and language processing》2010,18(3):649-662

Chroma-based audio features are a well-established tool for analyzing and comparing harmony-based Western music that is based on the equal-tempered scale. By identifying spectral components that differ by a musical octave, chroma features possess a considerable amount of robustness to changes in timbre and instrumentation. In this paper, we describe a novel procedure that further enhances chroma features by significantly boosting the degree of timbre invariance without degrading the features' discriminative power. Our idea is based on the generally accepted observation that the lower mel-frequency cepstral coefficients (MFCCs) are closely related to timbre. Now, instead of keeping the lower coefficients, we discard them and only keep the upper coefficients. Furthermore, using a pitch scale instead of a mel scale allows us to project the remaining coefficients onto the 12 chroma bins. We present a series of experiments to demonstrate that the resulting chroma features outperform various state-of-the art features in the context of music matching and retrieval applications. As a final contribution, we give a detailed analysis of our enhancement procedure revealing the musical meaning of certain pitch-frequency cepstral coefficients. 相似文献

11.

Precise pitch profile feature extraction from musical audio for key detection 总被引：1，自引：0，他引：1

《Multimedia, IEEE Transactions on》2006,8(3):575-584

The majority of pieces of music, including classical and popular music,are composed using music scales, such as keys. The key or the scale information of a piece provides important clues on its high level musical content, like harmonic and melodic context. Automatic key detection from music data can be useful for music classification, retrieval or further content analysis. Many researchers have addressed key finding from symbolically encoded music(MIDI); however, works for key detection in musical audio is still limited. Techniques for key detection from musical audio mainly consist of two steps:pitch extraction and key detection. The pitch feature typically characterizes the weights of presence of particular pitch classes in the music audio. In the existing approaches to pitch extraction, little consideration has been taken on pitch mistuning and interference of noisy percussion sounds in the audio signals, which inevitably affects the accuracy of key detection. In this paper, we present a novel technique of precise pitch profile feature extraction, which deals with pitch mistuning and noisy percussive sounds. The extracted pitch profile feature can characterize the pitch content in the signal more accurately than the previous techniques, thus lead to a higher key detection accuracy. Experiments based on classical and popular music data were conducted. The results showed that the proposed method has higher key detection accuracy than previous methods, especially for popular music with a lot of noisy drum sounds. 相似文献

12.

基于拓扑独立成分分析和高斯混合模型的视频语义概念检测

孔玮婷詹永照《计算机应用》2016,36(3):770-773

针对目前词袋模型(BoW)视频语义概念检测方法中的量化误差问题,为了更有效地自动提取视频的底层特征,提出一种基于拓扑独立成分分析(TICA)和高斯混合模型(GMM)的视频语义概念检测算法。首先,通过TICA算法进行视频片段的特征提取,该特征提取算法能够学习到视频片段复杂不变性特征;其次利用GMM方法对视频视觉特征进行建模,描述视频特征的分布情况;最后构造视频片段的GMM超向量,采用支持向量机(SVM)进行视频语义概念检测。GMM是BoW概率框架下的拓展,能够减少量化误差,具有良好的鲁棒性。在TRECVID 2012和OV两个视频库上,将所提方法与传统的BoW、SIFT-GMM方法进行了对比实验,结果表明,基于TICA和GMM的视频语义概念检测方法能够提高视频语义概念检测的准确率。相似文献

13.

细粒度分层时空特征描述符的微表情识别方法

下载免费PDF全文

张力为王甦菁段先华《计算机工程与应用》2021,57(13):154-160

由于微表情持续时间小于0.5?s、非自愿性和低强度等特点,微表情识别仍然是具有挑战性的任务。对分层时空特征描述符进行改进,提出一种新的细粒度分层时空特征的微表情识别方法。提取微表情视频片段中的各层次时空特征,利用投影矩阵建立时空特征和微表情之间的联系,进而选择对识别任务有贡献的区域。然后统计具有整体最大贡献度的层次,将该层次下选中的区域块和前一层选中的区域块进行交集操作,达到去除分层时空特征的空间冗余性和提升微表情特征区分度的目的。在CASME[Ⅱ]上的实验表明,提出的方法能够细粒度化微表情发生区域,获得了更好的识别结果。相似文献

14.

Multi-objective evolutionary feature selection for instrument recognition in polyphonic audio mixtures

Igor Vatolkin Mike Preu? Günter Rudolph Markus Eichhoff Claus Weihs 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2012,16(12):2027-2047

Instrument recognition is one of the music information retrieval research topics. This task becomes very challenging if several instruments are played simultaneously because of their varying physical characteristics: inharmonic attack noise, energy development during attack–decay–sustain–release envelope or overtone distribution. In our framework, we treat instrument detection as a machine-learning task based on a large amount of preprocessed audio features with target to build classification models. Since classification algorithms are very sensitive to feature input and the optimal feature set differs from instrument to instrument, we propose to run a multi-objective feature selection procedure before building of classification models. Two objectives are considered for evaluation: classification mean-squared error and feature rate (smaller amount of features stands for reduced costs and decreased risk of overfitting). The analysis of the extensive experimental study confirms that application of an evolutionary multi-objective algorithm is a good choice to optimize feature selection for music instrument identification. 相似文献

15.

面向中文歌词的音乐情感分类方法

王洁朱贝贝《计算机系统应用》2019,28(8):24-29

情感是音乐最重要的语义信息,音乐情感分类广泛应用于音乐检索,音乐推荐和音乐治疗等领域.传统的音乐情感分类大都是基于音频的,但基于现在的技术水平,很难从音频中提取出语义相关的音频特征.歌词文本中蕴含着一些情感信息,结合歌词进行音乐情感分类可以进一步提高分类性能.本文将面向中文歌词进行研究,构建一部合理的音乐情感词典是歌词情感分析的前提和基础,因此基于Word2Vec构建音乐领域的中文情感词典,并基于情感词加权和词性进行中文音乐情感分析.本文首先以VA情感模型为基础构建情感词表,采用Word2Vec中词语相似度计算的思想扩展情感词表,构建中文音乐情感词典,词典中包含每个词的情感类别和情感权值.然后,依照该词典获取情感词权值,构建基于TF-IDF （Term Frequency-Inverse Document Frequency）和词性的歌词文本的特征向量,最终实现音乐情感分类.实验结果表明所构建的音乐情感词典更适用于音乐领域,同时在构造特征向量时考虑词性的影响也可以提高准确率. 相似文献

16.

以声谱图相似度为度量的波形音乐检索 总被引：1，自引：1，他引：0

下载免费PDF全文

孔旭关佶红《计算机工程与应用》2009,45(13):136-141

近年来,基于内容的音乐检索研究日益受到重视,不少检索方法被提出来。其中,大部分方法主要集中在精确地表征音乐的某一两个特征,以反映出音乐某一两个突出方面的性质。论文采取完全不同的思路,使用从声谱图中提取的特征矩阵来表示音乐,查询片断与数据库中候选乐曲的相似度从而转化成两个特征矩阵间的相似度。实验结果表明：该方法不仅过程与计算简单,而且能够取得良好的检索效果。相似文献

17.

Unified View of Prediction and Repetition Structure in Audio Signals With Application to Interest Point Detection

Dubnov S. 《IEEE transactions on audio, speech, and language processing》2008,16(2):327-337

In this paper, we present a new method for analysis of musical structure that captures local prediction and global repetition properties of audio signals in one information processing framework. The method is motivated by a recent work in music perception where machine features were shown to correspond to human judgments of familiarity and emotional force when listening to music. Using a notion of information rate in a model-based framework, we develop a measure of mutual information between past and present in a time signal and show that it consist of two factors - prediction property related to data statistics within an individual block of signal features, and repetition property based on differences in model likelihood across blocks. The first factor, when applied to spectral representation of audio signals, is known as spectral anticipation, and the second factor is known as recurrence analysis. We present algorithms for estimation of these measures and create a visualization that displays their temporal structure in musical recordings. Considering these features as a measure of the amount of information processing that a listening system performs on a signal, information rate is used to detect interest points in music. Several musical works with different performances are analyzed in this paper, and their structure and interest points are displayed and discussed. Extensions of this approach towards a general framework of characterizing machine listening experience are suggested. 相似文献

18.

New speech/music discrimination approach based on fundamental frequency estimation

N. Ruiz-Reyes P. Vera-Candeas J. E. Muñoz S. García-Galán F. J. Cañadas 《Multimedia Tools and Applications》2009,41(2):253-286

Automatic discrimination of speech and music is an important tool in many multimedia applications. The paper presents a robust and effective approach for speech/music discrimination, which relies on a set of features derived from fundamental frequency (F0) estimation. Comparison between the proposed set of features and some commonly used timbral features is performed, aiming to assess the good discriminatory power of the proposed F0-based feature set. The classification scheme is composed of a classical Statistical Pattern Recognition classifier followed by a Fuzzy Rules Based System. Comparison with other well-proven classification schemes is also performed. Experimental results reveal that our speech/music discriminator is robust enough, making it suitable for a wide variety of multimedia applications. 相似文献

19.

Hierarchical audio content classification system using an optimal feature selection algorithm

P. Krishnamoorthy Sarvesh Kumar 《Multimedia Tools and Applications》2011,54(2):415-444

This paper proposes a hierarchical time-efficient method for audio classification and also presents an automatic procedure to select the best set of features for audio classification using Kolmogorov-Smirnov test (KS-test). The main motivation for our study is to propose a framework of general genre (e.g., action, comedy, drama, documentary, musical, etc...) movie video abstraction scheme for embedded devices-based only on the audio component. Accordingly simple audio features are extracted to ensure the feasibility of real-time processing. Five audio classes are considered in this paper: pure speech, pure music or songs, speech with background music, environmental noise and silence. Audio classification is processed in three stages, (i) silence or environmental noise detection, (ii) speech and non-speech classification and (iii) pure music or songs and speech with background music classification. The proposed system has been tested on various real time audio sources extracted from movies and TV programs. Our experiments in the context of real time processing have shown the algorithms produce very satisfactory results. 相似文献

20.

Pattern classification models for classifying and indexing audio signals

P. Dhanalakshmi S. Palanivel V. Ramalingam 《Engineering Applications of Artificial Intelligence》2011,24(2):350-357

In the age of digital information, audio data has become an important part in many modern computer applications. Audio classification and indexing has been becoming a focus in the research of audio processing and pattern recognition. In this paper, we propose effective algorithms to automatically classify audio clips into one of six classes: music, news, sports, advertisement, cartoon and movie. For these categories a number of acoustic features that include linear predictive coefficients, linear predictive cepstral coefficients and mel-frequency cepstral coefficients are extracted to characterize the audio content. The autoassociative neural network model (AANN) is used to capture the distribution of the acoustic feature vectors. Then the proposed method uses a Gaussian mixture model (GMM)-based classifier where the feature vectors from each class were used to train the GMM models for those classes. During testing, the likelihood of a test sample belonging to each model is computed and the sample is assigned to the class whose model produces the highest likelihood. Audio clip extraction, feature extraction, creation of index, and retrieval of the query clip are the major issues in automatic audio indexing and retrieval. A method for indexing the classified audio using LPCC features and k-means clustering algorithm is proposed. 相似文献