WAVENET: A GENERATIVE MODEL FOR RAW AUDIO
ABSTRACT
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
1 INTRODUCTION
This work explores raw audio generation techniques, inspired by recent advances in neural autoregressive generative models that model complex distributions such as images (van den Oord et al., 2016a;b) and text (Jozefowicz et al., 2016). Modeling joint probabilities over pixels or words using neural architectures as products of conditional distributions yields state-of-the-art generation.
Remarkably, these architectures are able to model distributions over thousands of random variables (e.g. 64×64 pixels as in PixelRNN (van den Oord et al., 2016a)). The question this paper addresses is whether similar approaches can succeed in generating wideband raw audio waveforms, which are signals with very high temporal resolution, at least 16,000 samples per second (see Fig. 1).
This paper introduces WaveNet, an audio generative model based on the PixelCNN (van den Oord et al., 2016a;b) architecture. The main contributions of this work are as follows:
• We show that WaveNets can generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech (TTS), as assessed by human raters.
• In order to deal with long-range temporal dependencies needed for raw audio generation, we develop new architectures based on dilated causal convolutions, which exhibit very large receptive fields.
• We show that a single model can be used to generate different voices, conditioned on a speaker identity.
• The same architecture shows strong results when tested on a small speech recognition dataset, and is promising when used to generate other audio modalities such as music.
We believe that WaveNets provide a generic and flexible framework for tackling many applications that rely on audio generation (e.g. TTS, music, speech enhancement, voice conversion, source separation).
2 WAVENET
In this paper audio signals are modelled with a generative model operating directly on the raw audio waveform. The joint probability of a waveform x = {x1, . . . , xT } is factorised as a product of conditional probabilities as follows:

p(x) = ∏_{t=1}^{T} p(xt | x1, . . . , xt−1)
Each audio sample xt is therefore conditioned on the samples at all previous timesteps. Similarly to PixelCNNs (van den Oord et al., 2016a;b), the conditional probability distribution is modelled by a stack of convolutional layers. There are no pooling layers in the network, and the output of the model has the same time dimensionality as the input. The model outputs a categorical distribution over the next value xt with a softmax layer. The model is optimized to maximize the log-likelihood of the data w.r.t. the parameters. Because log-likelihoods are tractable, we tune hyper-parameters on a validation set and can easily measure overfitting/underfitting.
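For concreteness, the following is a minimal sketch (not from the paper; it assumes PyTorch and a hypothetical network output) of how the factorised log-likelihood is evaluated: each timestep contributes a categorical log-probability from a softmax over quantised sample values, and the per-timestep terms are summed.

```python
# Minimal sketch of the factorised log-likelihood, assuming PyTorch.
# `logits` is a hypothetical network output of shape (T, Q): unnormalised
# scores over Q quantisation levels, where logits[t] is assumed to depend
# only on samples[:t] (causality is enforced by the network, not shown here).
import torch
import torch.nn.functional as F

def waveform_log_likelihood(logits: torch.Tensor, samples: torch.Tensor) -> torch.Tensor:
    # samples: (T,) integer-quantised audio values in [0, Q).
    log_probs = F.log_softmax(logits, dim=-1)             # softmax per timestep
    per_step = log_probs.gather(1, samples.unsqueeze(1))  # log p(x_t | x_1..x_{t-1})
    return per_step.sum()                                 # sum over all timesteps
```

Maximising this quantity over the training data corresponds to the maximum-likelihood objective described above.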
2.1 DILATED CAUSAL CONVOLUTIONS
The main ingredient of WaveNet is the causal convolution. By using causal convolutions, we make sure the model cannot violate the ordering in which we model the data: the prediction p(xt+1 | x1, . . . , xt) emitted by the model at timestep t cannot depend on any of the future timesteps xt+1, xt+2, . . . , xT. This is visualized in Fig. 2. For images, the equivalent of a causal convolution is a masked convolution (van den Oord et al., 2016a), which can be implemented by constructing a mask tensor and multiplying it elementwise with the convolution kernel before applying it. For 1-D data such as audio, one can implement this more easily by shifting the output of a normal convolution by a few timesteps.
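As an illustration of this shifting trick for 1-D data, here is a minimal sketch (assuming PyTorch; not the authors' implementation) of a causal convolution: the input is padded on the left by (kernel_size − 1) × dilation samples so that the output at time t never sees inputs beyond time t. The dilation argument anticipates the dilated variant discussed in this section.

```python
# Sketch of a 1-D causal convolution via left padding, assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, channels: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation      # pad only on the left
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); output has the same time length,
        # and output[..., t] depends only on x[..., :t + 1].
        x = F.pad(x, (self.pad, 0))                  # shift so no future samples leak in
        return self.conv(x)
```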
At training time, the conditional predictions for all timesteps can be made in parallel because all timesteps of the ground truth x are known. When generating with the model, the predictions are sequential: after each sample is predicted, it is fed back into the network to predict the next sample.
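This sequential sampling procedure can be sketched as follows (again a hypothetical illustration, assuming PyTorch and a `model` that maps the waveform generated so far to logits for the next sample value).

```python
# Sketch of sequential (autoregressive) generation, assuming PyTorch and a
# hypothetical `model` that returns per-timestep logits of shape (T, Q).
import torch

@torch.no_grad()
def generate(model, n_samples: int) -> torch.Tensor:
    samples = torch.zeros(1, dtype=torch.long)        # seed with a silent sample
    for _ in range(n_samples):
        logits = model(samples)                       # logits for every position so far
        probs = torch.softmax(logits[-1], dim=-1)     # distribution over the next value
        nxt = torch.multinomial(probs, num_samples=1) # draw one sample
        samples = torch.cat([samples, nxt])           # feed it back for the next step
    return samples
```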