Deep Music Genre Classification

2024-05-08

Content warning: this blog post involves the use of song lyric data, some of which may be obscene or offensive.

1 Introduction

The code below downloads a data set and loads it into memory as a pandas data frame containing information on roughly 28,000 musical tracks produced between 1950 and 2019.

import pandas as pd

url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/tcc_ceds_music.csv"
df = pd.read_csv(url)
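
To verify that the download worked, it can help to glance at the shape of the data frame and the distribution of genres:

# quick sanity checks on the loaded data
print(df.shape)                   # roughly (28000, 31)
print(df["genre"].value_counts()) # number of tracks per genre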

I accessed the data on Kaggle. It was originally collected from Spotify by the researchers behind the following data publication:

Moura, Luan; Fontelles, Emanuel; Sampaio, Vinicius; França, Mardônio (2020), “Music Dataset: Lyrics and Metadata from 1950 to 2019”, Mendeley Data, V3, doi: 10.17632/3t9vbwxgr5.3

Here’s an excerpt of the data:

df.head()
Unnamed: 0 artist_name track_name release_date genre lyrics len dating violence world/life ... sadness feelings danceability loudness acousticness instrumentalness valence energy topic age
0 0 mukesh mohabbat bhi jhoothi 1950 pop hold time feel break feel untrue convince spea... 95 0.000598 0.063746 0.000598 ... 0.380299 0.117175 0.357739 0.454119 0.997992 0.901822 0.339448 0.137110 sadness 1.0
1 4 frankie laine i believe 1950 pop believe drop rain fall grow believe darkest ni... 51 0.035537 0.096777 0.443435 ... 0.001284 0.001284 0.331745 0.647540 0.954819 0.000002 0.325021 0.263240 world/life 1.0
2 6 johnnie ray cry 1950 pop sweetheart send letter goodbye secret feel bet... 24 0.002770 0.002770 0.002770 ... 0.002770 0.225422 0.456298 0.585288 0.840361 0.000000 0.351814 0.139112 music 1.0
3 10 pérez prado patricia 1950 pop kiss lips want stroll charm mambo chacha merin... 54 0.048249 0.001548 0.001548 ... 0.225889 0.001548 0.686992 0.744404 0.083935 0.199393 0.775350 0.743736 romantic 1.0
4 12 giorgos papadopoulos apopse eida oneiro 1950 pop till darling till matter know till dream live ... 48 0.001350 0.001350 0.417772 ... 0.068800 0.001350 0.291671 0.646489 0.975904 0.000246 0.597073 0.394375 romantic 1.0

5 rows × 31 columns

In this blog post, your task is to use Torch to predict the genre of a track from its lyrics and from a set of engineered features. The lyrics are contained in the lyrics column. Here is a list of the engineered features that you may additionally find useful:

engineered_features = [
    'dating', 'violence', 'world/life', 'night/time', 'shake the audience',
    'family/gospel', 'romantic', 'communication', 'obscene', 'music',
    'movement/places', 'light/visual perceptions', 'family/spiritual',
    'like/girls', 'sadness', 'feelings', 'danceability', 'loudness',
    'acousticness', 'instrumentalness', 'valence', 'energy'
]

Some of these features (such as danceability, loudness, and energy) were computed by Spotify to describe audio attributes of the tracks; the lyrical theme features (such as dating, violence, and sadness) were engineered by the dataset authors from the lyrics themselves.
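
Before training, you'll also need the genre labels in numeric form. Here's one minimal way to set up the targets and the engineered feature matrix (a sketch; the variable names are my own):

import torch

# encode each genre string as an integer class label
genres = sorted(df["genre"].unique())
genre_to_idx = {g: i for i, g in enumerate(genres)}

y = torch.tensor(df["genre"].map(genre_to_idx).values)       # integer class labels
X_eng = torch.tensor(df[engineered_features].values).float() # engineered features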

2 What You Should Do

Create at least three neural networks with Torch and train them.

  1. Your first network should use only the lyrics to perform the classification task. You are welcome to use any technique for this, and it’s ok for your solution to closely resemble the methods from our lecture on text classification.
  2. Your second network should use only the engineered features to perform the classification task. Don’t overthink this one: a few fully-connected layers are fine.
  3. Your third network should use both the lyrics and the engineered features to perform the classification task.
  4. Finally, visualize the word embedding learned by your model and comment on any interesting results you notice; a sketch of one possible approach follows this list.
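
For the embedding visualization in item 4, here's a rough sketch of one approach using principal component analysis. It assumes your trained model stores its embedding layer as model.embedding and that vocab is a torchtext vocabulary; both names are placeholders for whatever you actually use:

from sklearn.decomposition import PCA
from matplotlib import pyplot as plt

# grab the learned embedding matrix and the token list
weights = model.embedding.weight.detach().cpu().numpy()
tokens = vocab.get_itos()  # index-to-string list

# project the embedding down to 2 dimensions for plotting
pca = PCA(n_components=2)
coords = pca.fit_transform(weights)

fig, ax = plt.subplots(figsize=(10, 10))
ax.scatter(coords[:, 0], coords[:, 1], alpha=0.1)
for i in range(0, len(tokens), 100):  # annotate a subsample of tokens
    ax.annotate(tokens[i], (coords[i, 0], coords[i, 1]))

Words that the model treats similarly for genre prediction should land near each other in this plot.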

3 Code Specifications

  1. Please implement exactly one data loader class for your networks. This class should return batches of data containing both the text features and the engineered features; a sketch of one possible design follows this list.
  2. Please implement exactly one function for your training loop. You can pass this function arguments that state whether the model being trained should use only the text features, only the engineered features, or both.
    • You can use simple array slicing to access only one set of features from a batch containing both features.
  3. Please perform a train-validation split and compare each of your three models on validation data. Again, please implement only one function for your evaluation loop, which can use the same mechanism as the training loop to determine which part of the data should be passed to the model.
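
Here's a rough sketch of how these specifications might fit together. All of the names (LyricsData, train, evaluate, n_text) are placeholders, and it assumes you've already converted the lyrics into a tensor of integer token indices padded to a common length:

import torch
from torch.utils.data import Dataset

class LyricsData(Dataset):
    # assumes X_text is a tensor of integer token indices padded to a common
    # length, X_eng a float tensor of engineered features, y integer labels

    def __init__(self, X_text, X_eng, y):
        # store both feature sets side by side so each batch contains both;
        # the token indices are cast to float so they can share one tensor
        self.X = torch.cat((X_text.float(), X_eng), dim=1)
        self.y = y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, i):
        return self.X[i], self.y[i]

def train(model, loader, optimizer, loss_fn, n_text, features="both", epochs=10):
    # n_text is the number of token columns; `features` selects which slice
    # of each batch reaches the model
    for epoch in range(epochs):
        for X, y in loader:
            optimizer.zero_grad()
            if features == "text":
                y_hat = model(X[:, :n_text].long())  # token indices only
            elif features == "engineered":
                y_hat = model(X[:, n_text:])         # engineered features only
            else:
                y_hat = model(X)                     # model separates internally
            loss = loss_fn(y_hat, y)
            loss.backward()
            optimizer.step()

def evaluate(model, loader, n_text, features="both"):
    # same slicing mechanism as train(), but no gradient updates
    correct, total = 0, 0
    with torch.no_grad():
        for X, y in loader:
            if features == "text":
                y_hat = model(X[:, :n_text].long())
            elif features == "engineered":
                y_hat = model(X[:, n_text:])
            else:
                y_hat = model(X)
            correct += (y_hat.argmax(dim=1) == y).sum().item()
            total += len(y)
    return correct / total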

4 Writing Specifications

Please include:

  1. An introductory paragraph describing the purpose of the post.
  2. A closing paragraph summarizing your results and what you learned from the process.
  3. Commentary throughout describing the design of your models, how you implemented utility functions, and how to interpret the visualizations and numerical results you produce.

5 Hints

Architecture for Third Network

For the third network, I recommend that you:

  1. Separate the input data into the text features and engineered features (this can be done in either the model or the data loader).
  2. In the forward method of your model, process the text features and the engineered features through separate pipelines. Then, use the torch.cat function to combine them, and finally pass the result through one or two fully-connected layers before the output. Here’s a rough outline:
import torch
from torch import nn

class CombinedNet(nn.Module):

    def __init__(self):
        super().__init__()
        # ...

    def forward(self, x):
        # separate x into x_1 (text features) and x_2 (engineered features)

        # text pipeline: try embedding!
        # x_1 = ...

        # engineered features: fully-connected Linear layers are fine
        # x_2 = ...

        # ensure that both x_1 and x_2 are 2-d tensors, flattening if necessary
        # then, combine them with:
        x = torch.cat((x_1, x_2), dim=1)
        # pass x through a couple more fully-connected layers and return output
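
As a quick sanity check on the concatenation step, here's a standalone snippet illustrating how torch.cat combines two 2-d tensors along dimension 1:

import torch

x_1 = torch.rand(32, 10)  # e.g. a batch of 32 pooled text embeddings
x_2 = torch.rand(32, 22)  # the 22 engineered features for the same batch
x = torch.cat((x_1, x_2), dim=1)
print(x.shape)  # torch.Size([32, 32])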

Working with Lyrics

There are a few things I found helpful to do when working with the song lyrics in order to make the task manageable.

That said, I strongly suspect that it is possible to significantly improve on my solution.

First, I restricted the vocabulary to the most common tokens in the data, keeping only tokens that appeared at least 50 times. I did this using the min_freq argument of build_vocab_from_iterator:

from torchtext.vocab import build_vocab_from_iterator

vocab = build_vocab_from_iterator(yield_tokens(train_data), specials=["<unk>"], min_freq=50)
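
The yield_tokens function above isn't defined in this post. Here's a minimal sketch of what it might look like, assuming train_data iterates over (label, lyrics) pairs and using torchtext's built-in basic English tokenizer:

from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("basic_english")

def yield_tokens(data_iter):
    # generate the list of tokens for each lyric string
    for _, text in data_iter:
        yield tokenizer(text)

After building the vocabulary, you'll likely also want to call vocab.set_default_index(vocab["<unk>"]) so that out-of-vocabulary tokens map to the <unk> token.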

Second, I incorporated some nn.Dropout layers in between several stages of my model, with the dropout probability equal to 0.2.
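
As a standalone illustration of what nn.Dropout does (not part of any particular model):

import torch
from torch import nn

dropout = nn.Dropout(p=0.2)  # zeroes each entry with probability 0.2 in training mode
x = torch.rand(4, 8)
print(dropout(x))            # surviving entries are scaled up by 1 / (1 - 0.2)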

Finally, I took the average across tokens for each embedding dimension, using tensor.mean():


    def forward(self, x):
        # ...
        x = self.embedding(x)  # shape: (batch, tokens, embedding_dim)
        x = self.dropout(x)
        x = x.mean(dim=1)      # average over tokens: (batch, embedding_dim)
        # ...

6 Expected Accuracy

Please compute the base rate for your problem. While it's possible to achieve relatively high classification accuracy on this data set, you should consider your three models successful if they all consistently score above the base rate after training, even if the improvement is not large. Please make sure to comment on the performance of each model, especially on how the third model compares to the other two.
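
Computing the base rate only takes a line: it's the accuracy you'd get by always guessing the most common genre.

# base rate: accuracy of always predicting the most common genre
base_rate = df["genre"].value_counts(normalize=True).max()
print(base_rate)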

7 If It’s Not Working / Too Slow

  1. Come chat with me in OH!
  2. Work in Colab and make sure that you are using the GPU.
  3. It’s ok to reduce the size of your data by restricting to a smaller number of genres, e.g. 3; a sketch follows this list.
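
Here's one minimal way to do that restriction (the genre names are just examples):

# keep only a few genres to speed up experimentation
keep = ["pop", "rock", "blues"]
df_small = df[df["genre"].isin(keep)].copy()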

8 Optional Extras

Optionally, I encourage you to create some interesting visualizations that might highlight differences between genres in terms of some of the engineered features, perhaps over time. You’re welcome to pose and address your own questions here. A few that I am wondering about are listed below, followed by a sketch for getting started on the first one:

  1. Has pop music gotten more danceable over time in this sample, according to Spotify’s definition of danceability?
  2. Does blues music tend to have more sadness than other genres? Does pop or rock have more energy?
  3. Are acousticness and instrumentalness similar features? Can you find any patterns in when they disagree?
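
As a starting point for the first question, here's a rough sketch using pandas and matplotlib (column names from the data frame above):

from matplotlib import pyplot as plt

# mean danceability of pop tracks by release year
pop = df[df["genre"] == "pop"]
trend = pop.groupby("release_date")["danceability"].mean()

fig, ax = plt.subplots()
trend.plot(ax=ax)
ax.set(xlabel="release year", ylabel="mean danceability",
       title="danceability of pop tracks over time")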



© Phil Chodrow, 2024