Neural Style Transfer on Audio Signals

Published: 11/01/2018  

Last Updated: 11/01/2018


"Style transfer" on images is popular and an active research topic, and it shows how convolutional neural networks (CNNs) can be adapted to a wide variety of tasks. This article extends and modifies the algorithm for audio signals, using the power of CNNs to generate new audio from a style audio. The style can be a tune or a beat, or even simply someone speaking the lyrics of a song.

Neural Style Transfer on Images

Originally created for images, neural style transfer uses a CNN to generate a new image rendered in the style of one image while containing the content of another. The two images are encoded with the CNN, and a white noise image is then optimized to minimize its loss with respect to both the content and style images. The result is, for example, a cityscape of Tübingen, Germany re-imagined in the style of Van Gogh, Picasso, or Munch.


Figure 1. Cityscape of Tübingen, Germany re-imagined using a CNN model.

Model Selection

A deep CNN is chosen to extract image features, since its layers provide a useful encoding of them. A model such as VGG-19, with a large number of convolutional layers, is typically used, and pre-trained weights are preferred because they already encode image features well.


Figure 2. Image representations and reconstructions in a convolution neural network.

Loss Function

Let p be the content image, a the style image, and x the white noise image (i.e., the generated image) that will become the final image. The total loss is the weighted sum of the content loss and the style loss:

L_total(p, a, x) = α · L_content(p, x) + β · L_style(a, x)

Content Loss

The content loss is the mean squared error between the encodings of the white noise image and the content image.

For a layer l and an input image, let the number of filters be N_l. The output (encoded) image then has N_l feature maps, each of size M_l, where M_l is the height times the width. The encoding of layer l can therefore be stored in a matrix F^l of shape N_l × M_l, where F^l_ij is the activation of the ith filter at position j in layer l. Letting P^l be the corresponding encoding of the content image, the content loss is:

L_content(p, x, l) = (1/2) Σ_ij (F^l_ij - P^l_ij)²
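As a concrete sketch, the content loss reduces to a few lines of NumPy. The feature maps below are random stand-ins for real CNN encodings, not output from an actual model:

```python
import numpy as np

def content_loss(F, P):
    """Half the sum of squared differences between the layer-l encodings
    F (generated image) and P (content image), each of shape (N_l, M_l)."""
    return 0.5 * np.sum((F - P) ** 2)

# Random stand-ins for encodings with N_l = 64 filters, M_l = 1024 positions
rng = np.random.default_rng(0)
F = rng.standard_normal((64, 1024))
P = rng.standard_normal((64, 1024))

loss = content_loss(F, P)   # non-negative scalar; zero only when F == P
```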

Style Loss

To capture the style of an artist, a style representation is used: the correlations between the different filter responses, where the expectation is taken over the spatial extent of the input image. These feature correlations are given by the Gram matrix G^l of shape N_l × N_l, where G^l_ij = Σ_k F^l_ik F^l_jk is the inner product between the vectorized feature maps i and j in layer l, and N_l is the number of feature maps.

The style loss is the mean squared error between the Gram matrices of the style image and the white noise image. Let a be the style image and x the white noise image, and let A^l and X^l be their style representations (Gram matrices) in layer l. The style loss of a layer l is then:

E_l = (1 / (4 N_l² M_l²)) Σ_ij (X^l_ij - A^l_ij)²

The total style loss is:

L_style(a, x) = Σ_l w_l E_l

where w_l is the weighting factor of each layer.
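Putting the style-loss pieces together, here is a minimal NumPy sketch of the per-layer loss and its weighted sum. The encodings are again random stand-ins rather than real CNN features:

```python
import numpy as np

def gram(F):
    """Gram matrix G = F F^T for an encoding F of shape (N_l, M_l)."""
    return F @ F.T

def layer_style_loss(X_enc, A_enc):
    """E_l = sum((G_x - G_a)^2) / (4 N_l^2 M_l^2)."""
    N, M = X_enc.shape
    return np.sum((gram(X_enc) - gram(A_enc)) ** 2) / (4.0 * N**2 * M**2)

# Total style loss over three layers, each weighted by w_l
rng = np.random.default_rng(1)
pairs = [(rng.standard_normal((16, 256)), rng.standard_normal((16, 256)))
         for _ in range(3)]
w = [1.0 / 3] * 3
L_style = sum(wl * layer_style_loss(X, A) for wl, (X, A) in zip(w, pairs))
```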

Hyperparameter Tuning

To minimize the total loss, standard error back-propagation is performed with respect to the generated image. In L_total, L_content is weighted by α and L_style by β. The ratio α/β is generally kept at 10⁻³ or 10⁻⁴ to prevent the style from dominating and thereby destroying the content.


Figure 3. Neural style transfer algorithm.

Neural Style Transfer on Audio Signals

The basic idea of the neural style algorithm for audio signals is the same as for images: the extracted style of the style audio is applied to the generated audio. Here, however, the content audio itself (rather than noise) is used as the starting point for generation, which removes the need to compute a content loss and keeps noise out of the generated audio.


Figure 4. Spectrogram of content audio, style audio and result audio.

Model Selection

For audio signals, one-dimensional convolutions are used, since audio has different spatial characteristics than images; models with 1-D CNN layers are therefore chosen. We found that shallow models perform better than deep ones, so we used a model with only one layer but a large number of filters. The model is not pre-trained and has random weights, which we found makes no difference, since only the encoding is needed.
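To make the encoder concrete, here is a hedged NumPy sketch of a single-layer, randomly initialized 1-D convolution with ReLU. The filter count and kernel size are illustrative choices, not the article's exact configuration:

```python
import numpy as np

def random_conv1d_encoder(spectrogram, n_filters=128, kernel=11, seed=0):
    """Single-layer 1-D convolution with random weights followed by ReLU.
    spectrogram: (C channels, S samples); output: (n_filters, S - kernel + 1)."""
    C, S = spectrogram.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_filters, C, kernel)) * np.sqrt(2.0 / (C * kernel))
    out = np.zeros((n_filters, S - kernel + 1))
    for t in range(S - kernel + 1):
        patch = spectrogram[:, t:t + kernel]           # (C, kernel) window
        out[:, t] = np.einsum('fck,ck->f', W, patch)   # "valid" convolution
    return np.maximum(out, 0.0)                        # ReLU

# Stand-in magnitude spectrogram with 257 channels and 100 samples
spec = np.abs(np.random.default_rng(2).standard_normal((257, 100)))
features = random_conv1d_encoder(spec)   # encoding used for the Gram matrix
```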


An audio signal must be converted from the time domain to the frequency domain, because the frequencies carry the spatial features of the signal. The raw audio is converted to a spectrogram via the Short-Time Fourier Transform (STFT). A spectrogram is a 2-D representation of a 1-D signal: it has C channels with S samples per channel, and can therefore be treated as a 1×S image with C channels.
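In practice LibROSA's STFT is used for this step; the following plain-NumPy sketch shows the same idea of turning a 1-D signal into a C-channel, S-sample magnitude spectrogram (window and hop sizes here are illustrative):

```python
import numpy as np

def stft_magnitude(x, win=512, hop=256):
    """Magnitude spectrogram of a 1-D signal via a plain-NumPy STFT.
    Returns shape (C, S): C = win//2 + 1 frequency channels, S frames."""
    w = np.hanning(win)
    frames = np.array([x[i:i + win] * w
                       for i in range(0, len(x) - win + 1, hop)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (channels, frames)

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)   # one second of a 440 Hz tone
S = stft_magnitude(x)             # energy concentrates near bin 440/sr * win
```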

Content Loss

Because the content audio itself is used to generate the new audio (i.e., the generated audio), the content loss is not taken into consideration. In this case, the total loss is simply the style loss.

Style Loss

For style extraction, Gram matrices are used, just as for images: G_ij = Σ_k F_ik F_jk is the inner product between the vectorized feature maps i and j, and N is the number of feature maps. The difference is that the feature maps here are one-dimensional, whereas for images they are 2-D. Also, since the model has only one layer, there is no layer index l. Let F_ij be the encoding of the audio, i.e., the activation of the ith filter at the jth position.

The style loss is the mean squared error between the Gram matrices of the style audio and the generated audio (which is initialized from the content audio). Let a be the style audio and x the generated audio, and let A and X be their style representations, with N channels (i.e., filters) and M samples. The total style loss is:

L_style = (1 / (4 N² M²)) Σ_ij (X_ij - A_ij)²
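This single-layer loss is a direct simplification of the image case; a small NumPy sketch (with random stand-in encodings) makes the computation explicit:

```python
import numpy as np

def audio_style_loss(X, A):
    """Style loss for the single-layer audio case. X and A are encodings
    of the generated and style audio, shape (N filters, M samples)."""
    N, M = X.shape
    D = X @ X.T - A @ A.T          # difference of Gram matrices
    return np.sum(D ** 2) / (4.0 * N**2 * M**2)

rng = np.random.default_rng(3)
A = rng.standard_normal((32, 500))   # style-audio encoding (stand-in)
X = rng.standard_normal((32, 500))   # generated-audio encoding (stand-in)
loss = audio_style_loss(X, A)
```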

Here, only one layer is present, so weighting the layer is not significant.


After the audio has been generated, phase reconstruction converts it back from the frequency domain to the time domain; the Griffin-Lim algorithm is used for this step. Also, because the content audio (rather than white noise) is used to initialize the final audio, no content loss calculation is needed: only the style loss is used.
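LibROSA provides `librosa.griffinlim` for this step; purely as an illustration of the algorithm itself, here is a compact NumPy sketch that alternates between imposing the target magnitudes and re-estimating phase (the STFT helpers are simplified, with a Hann window and 50% overlap):

```python
import numpy as np

def stft(x, win=512, hop=256):
    w = np.hanning(win)
    frames = np.array([x[i:i + win] * w
                       for i in range(0, len(x) - win + 1, hop)])
    return np.fft.rfft(frames, axis=1)            # (frames, bins)

def istft(S, win=512, hop=256):
    """Overlap-add inverse STFT with window-sum normalization."""
    w = np.hanning(win)
    n = hop * (S.shape[0] - 1) + win
    x, norm = np.zeros(n), np.zeros(n)
    for i, f in enumerate(np.fft.irfft(S, n=win, axis=1)):
        x[i * hop:i * hop + win] += f * w
        norm[i * hop:i * hop + win] += w ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=60, win=512, hop=256):
    """Estimate a time-domain signal from magnitudes alone: start with
    random phase, then repeatedly keep the phase implied by the current
    estimate while restoring the target magnitudes."""
    rng = np.random.default_rng(0)
    S = mag * np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        phase = np.angle(stft(istft(S, win, hop), win, hop))
        S = mag * np.exp(1j * phase)
    return istft(S, win, hop)

sr = 8000
x = np.sin(2 * np.pi * 220 * np.arange(2048) / sr)   # short test tone
mag = np.abs(stft(x))                                # magnitudes only
y = griffin_lim(mag, n_iter=40)                      # reconstructed signal
```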

Hyperparameter Tuning

To minimize the style loss, standard error back-propagation is performed with respect to the generated audio.
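The article relies on PyTorch's autograd for this. To make the optimization step concrete without a framework, note that if the encoder is taken to be the identity (F = X, a deliberate simplification; real pipelines backpropagate through the CNN), the style-loss gradient has the closed form (1/(N²M²))(G_x - G_a)X, and plain gradient descent on the input already drives the loss down:

```python
import numpy as np

def style_loss_and_grad(X, A):
    """Style loss and its gradient w.r.t. X, assuming an identity encoder
    (a simplification made for this illustration)."""
    N, M = X.shape
    D = X @ X.T - A @ A.T
    loss = np.sum(D ** 2) / (4.0 * N**2 * M**2)
    grad = (D @ X) / (N**2 * M**2)
    return loss, grad

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 64))    # "style" encoding (stand-in)
X = rng.standard_normal((8, 64))    # initialized from the "content" encoding

lr = 500.0                          # step size tuned for this toy scale
losses = []
for _ in range(200):
    loss, grad = style_loss_and_grad(X, A)
    losses.append(loss)
    X -= lr * grad                  # gradient-descent update on the input
```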


Figure 5. Audio style transfer framework.


Neural style transfer on audio has applications in the music industry, where AI-generated music is increasingly popular. The algorithm can be used to generate new music by enthusiasts as well as industry professionals: new songs can be produced simply by recording vocals as the content and a musical tone as the style. It could also be made into an Android application (like Prisma*) in which singers record their vocals and apply any musical style to them.

Future Work

Different signal processing techniques can be applied during post-processing to reduce noise. With the emergence of new generative models such as Generative Adversarial Networks, the neural style transfer algorithm can be modified and used for better results.

Technology Overview

Python* 2.7 is used for the model implementation, with the deep learning library PyTorch*. LibROSA* is used for audio analysis. The model is executed on the Intel® AI DevCloud, which we found to be 3x to 4x faster than our workstation. The Intel AI DevCloud runs the models and processes on high-performance Intel® Xeon® processors.


This article showed how to mix audio signals to create new music using neural style transfer: new music is created by applying the style of one audio signal to the content of another. This gives insight into how the algorithm can be applied to different kinds of signals and how new audio can be synthesized. Check out the project on Developer Mesh. Sample code can be downloaded from GitHub*.

About the Author

Alish Dipani is a Computer Science student at BITS Pilani, K.K. Birla Goa Campus in India and an Intel® Student Ambassador for Artificial Intelligence. He is using his experience in Artificial Intelligence to develop various Computer Vision, Signal Processing and Neuroscience applications.


