
Do Androids Dream of Electropunk?

Posted on Jan 2, 2019

While studying existing models, I found an article with an overview of the six best-known options. First, a few words about digital music formats: according to the article, there are two main approaches to music generation – one based on the digitized audio stream (the sound we hear from the speakers – raw audio, wav files), the other based on MIDI (musical notation).

I ruled out the raw-audio options, and here’s why:

1. The results are not impressive – using such models on polyphonic music gives a very specific result. It’s unusual, and you can create interesting soundscapes with it, but it doesn’t suit my purposes: it sounds weird, and I wanted to hear something similar to the original.

A good example of piano music:

And it sounds even weirder with orchestral music or rock:

Here the guys tried to process black metal as raw audio:

2. The tracks of my favourite bands use different instruments – vocals, drums, bass, guitars, synthesizers – each sounding together with the rest. I was looking for a model that works the same way: one that handles not only individual instruments but also takes their combined sound into account.

When a musician needs to learn an instrument’s part by ear, he tries to pick out that instrument from the whole sound stream, then repeats its part until he achieves a similar result. This is not an easy task even for a person with a good ear – the music can be complex, and the instruments “merge”.


I came across software that tries to solve a similar problem; there are several projects that do this with machine learning. For example, while I was writing this article, Magenta released the new Wave2Midi2Wave tool, which can transcribe piano notes and play them back realistically. There are other tools too, although, in general, this problem has not been solved yet.

Thus, the easiest way to learn a part from a piece of music is to take ready-made notes. It is logical to assume that a neural network will also find it easier to work with musical notation, where each instrument is represented by a separate track.

3. With raw audio, the result is a mix of all instruments: the parts can’t be loaded into a sequencer (audio editor) individually, so it’s impossible to correct them, change the sound, and so on. I’d be quite satisfied if the neural network composed a hit with mistakes in a couple of notes – I can easily fix those when working with notes, but with raw audio this is almost impossible.

Musical notation has its drawbacks too. It does not capture many performance nuances. And with MIDI, it is not always known who created the files or how close they are to the original – maybe the creator simply made a mistake, since accurately transcribing a part is not an easy task.

When working with polyphonic notation, you have to ensure that the instruments are consonant at each moment in time, and that the sequence of those moments sounds like logical music to a human listener.

It turns out there are not many solutions that work with notes, and fewer still that handle several instruments sounding at once. I initially overlooked the Magenta project from Google TensorFlow because it was described as “non-polyphonic”. The MusicVAE library had not been published yet at that time, so I chose the BachBot project.




It turns out that a solution to my problem already exists. Listen to the Happy Birthday melody processed by BachBot – it sounds like a Bach chorale.

A chorale is a specific kind of music consisting of four voices: soprano, alto, tenor, and bass. Each voice performs one note at a time. We’ll have to dig a little into music theory here; I’ll describe music in 4/4 time.

In musical notation, a note has two properties – pitch (do, re, mi...) and duration (whole, half, quarter, eighth, sixteenth, thirty-second). Accordingly, a whole note lasts the whole measure, two halves fill the whole measure, and sixteen sixteenths fill the whole measure.
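This duration arithmetic is easy to sanity-check. In the toy sketch below (my own, nothing BachBot-specific), durations are represented as fractions of a 4/4 measure, and any combination summing to one fills the bar:

```python
from fractions import Fraction

# Durations as fractions of a 4/4 measure
WHOLE = Fraction(1, 1)
HALF = Fraction(1, 2)
EIGHTH = Fraction(1, 8)
SIXTEENTH = Fraction(1, 16)

def fills_measure(durations):
    """True if the given note durations exactly fill one 4/4 measure."""
    return sum(durations) == Fraction(1, 1)

print(fills_measure([WHOLE]))           # True
print(fills_measure([HALF, HALF]))      # True
print(fills_measure([SIXTEENTH] * 16))  # True
print(fills_measure([HALF, EIGHTH]))    # False: 5/8 of a measure
```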

In preparing data for training the neural network, the creators of BachBot considered the following:

  1. To avoid confusing the model with chords in different keys, which would not sound harmonious together, all the chorales were transposed to the same key.

  2. The neural network needs discrete values at its input, while music is a continuous process, so discretization is necessary. One instrument may play a long whole note while another plays several sixteenths at the same time. To solve this, all notes were broken down into sixteenths: if the score contains a quarter note, it enters the input four times as the same sixteenth – the first time with a flag meaning the note was struck, and the next three times with a flag meaning it continues.

The data format looks as follows – (pitch, new note or continuation of an already sounding note):
(56, True)   # Soprano
(52, False)  # Alto
(47, False)  # Tenor
(38, False)  # Bass
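As a rough illustration of the discretization step (my own sketch, not BachBot’s actual code), expanding one note into sixteenth-note frames can be written like this:

```python
def encode_note(pitch, sixteenths):
    """Expand one note into sixteenth-note frames.

    Each frame is (pitch, is_new): True on the frame where the note
    is struck, False on the frames where it is merely held.
    """
    return [(pitch, i == 0) for i in range(sixteenths)]

# A quarter note spans four sixteenths: struck once, held three times
print(encode_note(56, 4))
# [(56, True), (56, False), (56, False), (56, False)]
```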

Having run all the chorales from the popular music21 dataset through this procedure, the authors of BachBot found that only 108 combinations of four notes occur in the chorales (once transposed to a common key), although potentially there could be 128 x 128 x 128 x 128 of them (MIDI uses 128 pitch levels). So the resulting dictionary is not that big. This is a curious observation, and we will come back to it when we talk about MusicVAE. Thus, we have Bach’s chorales recorded as sequences of such quadruples.
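This observation is easy to replicate in spirit: collect every simultaneous four-note frame into a set and count the distinct ones. A toy sketch with invented chords (the real count comes from the full music21 chorale corpus):

```python
# Toy corpus: each time step is a (soprano, alto, tenor, bass) pitch tuple.
# In the real experiment these come from all Bach chorales in music21,
# transposed to a common key.
corpus = [
    (56, 52, 47, 38),
    (57, 52, 49, 40),
    (56, 52, 47, 38),  # repeated chord: counted only once
    (59, 55, 50, 43),
]

# The "dictionary" is simply the set of distinct chords
vocabulary = set(corpus)
print(len(vocabulary))  # 3 distinct chords out of 4 time steps
```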

They often say that music is a language. It is therefore not surprising that the creators of BachBot applied a technology popular in NLP (Natural Language Processing) to music: they trained an LSTM network on the generated dataset and got a model able to complete one or more voices, or even create chorales from scratch. That is, you provide the alto, tenor, and bass, and BachBot writes a soprano melody for you – and it all sounds like Bach.

Here is another example: 

Sounds great!

For more details, have a look at this video. There are also interesting analytics collected from a user survey.

Users are asked to distinguish Bach’s original chorales from the music created by the neural network. The results mention that when the neural network creates a bass part with all the other voices given, only half of the users can tell the generated chorales from the originals – and, funnily enough, music experts are mostly the ones confused. With the other voices it looks a little better. As a bass player, I find this a bit insulting: violinists still seem to be needed, but it’s time for bass players to refresh their drywalling skills.

AI piano



Studying BachBot, I found out that it had been merged into the Magenta project (Google TensorFlow). I took a closer look and learned that several interesting models have been developed within Magenta, one of which is devoted to polyphonic musical pieces. Magenta has built wonderful tools and even launched a plugin for the Ableton audio editor, which is especially nice for musicians.

My favourite is beat blender, which creates variations based on a given drum pattern, and another tool that creates transitions between tunes.

The main idea of the MusicVAE tool, which I decided to use, is that its creators combined an LSTM model with a variational autoencoder (VAE).

If you remember, when talking about BachBot we noticed that the chord dictionary consists not of 128x128x128x128 elements but of only 108. The creators of MusicVAE noticed this as well and decided to use a compressed latent space.

By the way, it is typical that training MusicVAE does not require transposing the sources into one key. I believe that’s because the sources are converted by the autoencoder anyway, and the information about the key disappears.

VAE is designed in such a way that allows the decoder to efficiently recover data from the training dataset, while the latent space is a smooth distribution of the features of the input data.

This is a very important point. It allows creating similar objects and performing logical, meaningful interpolation. In the original space, we have 128x128x128x128 possible combinations of four notes, but in fact not all of them are used (sound pleasant to the human ear). A variational autoencoder maps them into a much smaller hidden space, where mathematical operations acquire meaning in terms of the original space – for example, neighbouring points correspond to similar musical fragments.
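The “neighbouring points are similar fragments” property is what makes interpolation work: walk a straight line between two latent vectors and decode each step. Here is a small numpy sketch of only the latent-space side – the encoder and decoder are the trained MusicVAE networks and are omitted; real implementations often prefer spherical interpolation, but linear is enough to show the idea:

```python
import numpy as np

def interpolate(z_a, z_b, steps):
    """Linearly interpolate between two latent vectors.

    In MusicVAE, each intermediate vector would be fed to the decoder,
    producing a musical fragment "between" the two originals.
    """
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1 - a) * z_a + a * z_b for a in alphas]

z_a = np.zeros(4)   # latent code of fragment A (toy dimension 4)
z_b = np.ones(4)    # latent code of fragment B
path = interpolate(z_a, z_b, 5)
print(path[2])      # the midpoint: [0.5 0.5 0.5 0.5]
```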

A good example of using an autoencoder to draw glasses on a photo can be found in this article. You can read more about how MusicVAE works on the official Magenta website in this article, which also links to arXiv.

So, the instrument is chosen; now to apply it to my original goal – create a new piece of music based on already recorded tracks and see whether it sounds similar to the original band. Magenta doesn’t run on my Windows laptop, and training a model without a GPU takes quite a long time. After struggling with virtual machines, a docker container, etc., I decided to use the cloud.

Google provides colab notebooks where you can play with Magenta models. However, in my case I couldn’t train the model: the process kept stopping due to various restrictions – the amount of available memory, shutdown by timeout, and the lack of a command line and root rights to install the necessary libraries. Hypothetically, there is even an option to use a GPU, but again I couldn’t install the model and run it.

I was considering buying a server when, luckily, I discovered that Google Cloud offers services with a GPU, and there is even a free trial period. It turned out that in Russia they are officially available only to legal entities, but I was allowed to use the free trial mode.

So, I created a virtual machine in Google Cloud with one GPU, found several midi files by one of my favourite bands on the Internet, and uploaded them to the cloud into the midi folder.

Let’s install Magenta
pip install magenta-gpu

It’s great that all this can be installed with just one command, I thought, but... errors. It seems I’ll have to use the command line after all.

Let’s have a look at the errors: the rtmidi library can’t be installed on the cloud machine, and therefore Magenta doesn’t work.

And it, in turn, fails because the libasound2-dev package is missing, and I have no root privileges.

Not so scary:
sudo su root
apt-get install libasound2-dev

Hurray, now pip install rtmidi runs without errors, as well as pip install magenta-gpu.

The source files are already in the midi folder; they look like this:

Let’s convert midi to a data format that the network can already work with:
convert_dir_to_note_sequences \
--input_dir=midi \
--output_file=notesequences_R2Midi.tfrecord \
--log=DEBUG
and we start training
music_vae_train \
--config=hier-multiperf_vel_1bar_med \
--run_dir=/home/RNCDtrain/ \
--checkpoints_to_keep=2 \
--hparams=sampling_rate=1000.0,batch_size=32,learning_rate=0.0005 \
--num_steps=5000 \
--mode=train

There’s a problem again: Tensorflow crashes with an error – it can’t find a library. Fortunately, someone had already described this error a few days earlier, and the Python source code can be fixed.

We go into the folder and replace the import line as described in the github bug.

Launch music_vae_train again and ... Hurray! Training has started!

hier-multiperf_vel_1bar_med – I use the polyphonic model (up to 8 instruments) that produces one bar at a time.

An important parameter is checkpoints_to_keep=2: disk space in the cloud is limited, and one recurring problem was that I kept having to stop training because the disk filled up – checkpoints are quite heavy, 0.6-1 GB each.

Around step 5000, the loss settles at about 40-70. I don’t know whether that is a good result or not, but with so little training data the network would only overfit if trained further, and there’s no point wasting so much of the free GPU time provided by Google’s data centers. Let’s move on to generation.

For some reason, the Magenta installation did not include the generation script itself; I had to drop it into the folder manually:
curl -o

Finally, let’s create fragments:
music_vae_generate --config=hier-multiperf_vel_1bar_med --checkpoint_file=/home/RNCDtrain/train/ --mode=sample --num_outputs=32 --output_dir=/home/andrey_shagal/  --temperature=0.3
config – the generation type, exactly the same as in training – multitrack, 1 bar
checkpoint_file – the folder from which to pick up the latest trained model file
mode=sample – create samples from scratch (the other option, interpolate, creates a transitional bar between two given bars)
num_outputs – how many fragments to generate
temperature – the randomization parameter for sampling, from 0 to 1: at 0 the result is more predictable and closer to the source material; at 1, it’s “I’m an artist, that’s how I see it”
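The temperature parameter works the same way as in text generation: it rescales the model’s output distribution before sampling. A numpy illustration of the standard formula (not Magenta’s internal code):

```python
import numpy as np

def apply_temperature(logits, temperature):
    """Rescale logits by temperature and return a probability distribution.

    Low temperature sharpens the distribution (predictable output);
    high temperature flattens it (more random output).
    """
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()           # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]
print(apply_temperature(logits, 0.3))  # sharply peaked on the first option
print(apply_temperature(logits, 1.0))  # noticeably flatter distribution
```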

Finally, I get 32 one-bar fragments. After running the generator several times, I listen to the fragments and assemble the best ones into one track: neurancid.mp3.

Of course, Maximum Radio is unlikely to put it on the playlist, but if you listen, it really does resemble the original Rancid. The sound, of course, differs from a studio recording, but we worked primarily with notes. From here there is room for action: process the midi with various VST plugins, re-record the parts with live musicians, or wait until the guys from Wave2Midi2Wave get to overdriven guitars.

I have no complaints about the notes themselves. Ideally, I’d like the neural network to create a masterpiece, or at least a Billboard-top hit, but for now it seems to have learned from the partying rock stars to play one note per bar in eighths (in fact, I am proud of its transition from 20 to 22 seconds). There are reasons for this:

1. Small amount of data.
2. The model I used produces fragments one bar long. In punk rock, usually not that many events happen within one bar.
3. Interesting transitions and melodies only work against the background of great riffs and chord changes, and the autoencoder, together with the small amount of data, seems to have lost most of the melodies and reduced all the riffs to two consonant and several atonal chords. I need to try the model that works with 16 bars; it’s a pity that only three voices are available there.

I talked with the developers, and they recommended trying to reduce the dimensionality of the latent space, because they trained their network on 200,000 tracks while I used just 15. I didn’t get a visible effect from reducing the z-space, but there is still something to work on.

By the way, a lack of variety and monotony is not always a minus. As we know, it is just one step from shamanic rituals to a techno party. I should try teaching the model on something like that – techno, dub, reggae, hip-hop. Surely there is a chance to create something zombifying. I found 20 of Bob Marley’s songs in midi and, voila, a very nice loop:

The midi parts were re-recorded with live bass and guitars and processed with VST synthesizers to make the piece sound juicier. In the original, the network produced just notes; played with a standard midi player, it sounds like this:

Surely, if you create a number of basic thematic drum canvases, run them through beat blender, and add basic bass and synth parts with a latent loop (mentioned above), you could easily launch an algorithm for a techno radio that continuously creates new tracks, or even one endless track. Everlasting thrill!

MusicVAE also provides the ability to train the network to generate 16-bar trio fragments – drums, bass, and lead. Also quite interesting. The input data is multitrack midi files: the system splits them into trios in all possible combinations and then trains the model on that basis. Such a network requires much more resources, but the result is a whole 16 bars! Impossible to resist. I tried to imagine how a band might sound that played something between Rancid and NOFX, loading an equal number of tracks from each band for training:

Here, too, the midi parts were re-recorded with live guitars. With the standard midi player:

Interesting! This is definitely better than my first band! And by the way, the same model produces nice free jazz:

A few problems I ran into:

1. The lack of a good, convenient setup that would reduce the waiting time for training. The model only works on Linux, training is long and very slow without a GPU, and all the while I wanted to try changing the parameters and see what happens. For example, a cloud server with a single GPU took 8 hours to compute 100 epochs for the 16-bar trio model.

2. A typical machine learning problem: lack of data. Just 15 midi files are not enough to understand music. The neural network, unlike me in my youth, did not listen to 6 Rancid albums countless times and did not go to concerts; this result was obtained from 15 midi tracks, transcribed by someone and far from the original. Now, if you could stick sensors all over the guitarist and record every overtone of every note... We’ll see how the Wave2Midi2Wave idea develops. Maybe in a few years it will be possible to give up notes for this task altogether.

3. A musician plays tightly in rhythm, but not perfectly. The output midi notes have no dynamics (in the drums, for example): they are all played at the same volume, exactly on the click (as musicians say, exactly on the grid). If they are even randomly diversified, the music starts to sound livelier and nicer. Again, Wave2Midi2Wave is already working on this problem.
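That “random diversification” (humanizing) takes only a couple of lines: jitter each note’s onset and velocity slightly. A hedged sketch on plain tuples – in real code the notes would come from a midi library such as pretty_midi; here a note is just (onset_seconds, velocity):

```python
import random

def humanize(notes, timing_jitter=0.01, velocity_jitter=8, seed=None):
    """Add small random offsets to note onsets and velocities.

    notes: list of (onset_seconds, velocity) tuples standing in for
    real MIDI events. Velocity is clamped to the MIDI range 1-127,
    onsets never go below zero.
    """
    rng = random.Random(seed)
    out = []
    for onset, velocity in notes:
        onset += rng.uniform(-timing_jitter, timing_jitter)
        velocity += rng.randint(-velocity_jitter, velocity_jitter)
        out.append((max(0.0, onset), min(127, max(1, velocity))))
    return out

# Four quarter notes at 120 bpm, all at the same robotic velocity
robotic = [(i * 0.5, 100) for i in range(4)]
print(humanize(robotic, seed=42))
```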

Now you have some idea of the possibilities of AI in creating music – and of my musical preferences. What do you think: what role awaits AI in the creative process of the future? Can a machine create music like a human, or even better than a human, or be an assistant in the creative process? Or will AI become famous in music only for its primitive pieces?