Generating Music from Raw MIDI
Aug 2023 ~ Network Institute VU
Length: 11mo
Programming language: Python (NumPy, Random, Math, PyYAML, PyTorch, PyTorch Lightning, W&B,
TensorBoard, Python Fire)
Data: Raw MIDI representations composed of the following features: type (7 event
types), note (128 notes), velocity (128 velocity levels), channel (16 MIDI channels),
instrument (128 MIDI instruments), and tick (the timing of note-on and note-off events).
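For illustration, a single Raw MIDI event can be viewed as one tuple of these six features; the
sketch below is hypothetical and the concrete values are made up.

```python
# Illustrative only: one Raw MIDI event as a dictionary of the six features above.
event = {
    "type": 1,        # one of 7 event types (e.g. note-on)
    "note": 60,       # MIDI note number, 0-127 (60 = middle C)
    "velocity": 90,   # 0-127
    "channel": 0,     # 0-15
    "instrument": 0,  # MIDI program number, 0-127 (0 = acoustic grand piano)
    "tick": 480,      # timing in MIDI ticks
}
```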
Problem description:
Develop and train GPT-like transformers on HPC to generate music from Raw MIDI
Approach & Results:
Starting from the decoder-only skeleton architecture displayed below, several structural
adjustments were implemented one at a time; each variant was trained on HPC and evaluated by
comparing its losses in W&B. The conducted experiments included weight scaling,
T-Fixup initialization,
Stochastic Weight Averaging (SWA),
ScaleNorm, and FixNorm; a minimal ScaleNorm sketch is shown below.
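The following is illustrative rather than the project's exact implementation: ScaleNorm replaces
LayerNorm's per-feature affine parameters with a single learned scale g applied to the
L2-normalized activations, with g initialized to the square root of the model dimension.

```python
import torch
import torch.nn as nn

class ScaleNorm(nn.Module):
    """Minimal ScaleNorm: y = g * x / ||x||, with g initialized to sqrt(d_model)."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(float(d_model) ** 0.5))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return self.g * x / norm
```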
Additionally, the training setup was extended to support multi-node training and resuming from
checkpoints, as sketched after this paragraph. Lastly, the embedding dimension, batch throughput,
learning rate, and dropout rate were tuned, and installation and usage instructions were written.
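A rough sketch of how multi-node training and checkpoint resumption can be wired up with
PyTorch Lightning; the device counts, strategy, and checkpoint path below are assumptions, not the
project's actual configuration.

```python
from pytorch_lightning import Trainer

# Hypothetical Trainer setup; the real values would live in the project's YAML config.
trainer = Trainer(
    accelerator="gpu",
    devices=4,        # GPUs per node (assumption)
    num_nodes=2,      # number of HPC nodes (assumption)
    strategy="ddp",   # DistributedDataParallel across processes and nodes
)

# Resuming from a saved checkpoint (model and datamodule are placeholders):
# trainer.fit(model, datamodule=datamodule, ckpt_path="checkpoints/last.ckpt")
```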
The performance of the transformers was quantified by how compactly they encode the data,
measured in bits per event: the fewer bits needed per MIDI event, the better.
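If the underlying training loss is an average per-event cross-entropy reported in nats (an
assumption, since the exact loss definition is not given here), the conversion to bits per event
is a division by ln 2:

```python
import math

def bits_per_event(cross_entropy_nats: float) -> float:
    """Convert an average per-event cross-entropy in nats to bits per event."""
    return cross_entropy_nats / math.log(2)

print(bits_per_event(2.0))  # ~2.89 bits per event
```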
Among the incorporated enhancements, mixed precision produced particularly notable results.
The image below shows the validation losses of two models, one of which uses mixed precision.
Although both transformers were trained for 120 hours, the mixed-precision one trained roughly
three times faster without affecting the loss.
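Mixed precision is typically enabled through a single Trainer flag in PyTorch Lightning; the
snippet below is a generic sketch, not the project's exact launch configuration.

```python
from pytorch_lightning import Trainer

# Automatic mixed precision: "16-mixed" in recent Lightning versions
# (older releases use precision=16).
trainer = Trainer(accelerator="gpu", devices=1, precision="16-mixed")
```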
The next successful experiment changed the embedding initialization to the default used by
PyTorch's nn.Embedding(), which improved the validation loss by 2%, as
displayed in the next chart.
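By default, nn.Embedding() draws its weights from a standard normal distribution N(0, 1); the
sketch below contrasts that default with a manually scaled initialization, using made-up sizes.

```python
import torch.nn as nn

vocab_size, d_model = 128, 512  # illustrative sizes

# PyTorch default: weights sampled from N(0, 1)
emb_default = nn.Embedding(vocab_size, d_model)

# A manually scaled alternative often seen in GPT-style models (std = 0.02)
emb_scaled = nn.Embedding(vocab_size, d_model)
nn.init.normal_(emb_scaled.weight, mean=0.0, std=0.02)
```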
To reduce overfitting, various data augmentation techniques were applied, yielding another
significant drop in the validation loss, as the figure below suggests.
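The specific augmentations are not listed here; an event-level sketch could include pitch
transposition and velocity jitter, as below (illustrative only, not the project's pipeline).

```python
import random

def augment_event(event: dict) -> dict:
    """Illustrative MIDI augmentation: random transposition and velocity jitter."""
    out = dict(event)
    shift = random.randint(-3, 3)                        # transpose by up to 3 semitones
    out["note"] = max(0, min(127, event["note"] + shift))
    scale = random.uniform(0.9, 1.1)                     # scale velocity by +/-10%
    out["velocity"] = max(0, min(127, int(event["velocity"] * scale)))
    return out
```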
Finally, the best model was used to generate the following samples. Each sample starts with a
seed extracted from an existing song, followed by a whistle that marks the point where the
model's own generation begins. One can notice that the transformer reliably produces chords
and picks up the timing from the seed.