Unveiling Music's Hidden Connections

Welcome to GenreLink: a Radiant Ravens Project

About

GenreLink is a program created by the Radiant Ravens team to further analyze genres and their correlation with different musical qualities. It helps users narrow down genres that may interest them and explore Spotify's vast catalog of music.

Dataset

This dataset includes information on top songs from Spotify in over 35 countries around the world, covering 2017-2020. For each song, the data includes its Spotify genre, positivity (valence), danceability, liveness, tempo, and the tone of its lyrics as scored with the Syuzhet lyric library.

Spotify database

Cleaning Our Data

To clean the data, we first decided which columns to use for our machine learning problem. After pulling just those columns into a separate data frame, we cleaned the data further by converting the number columns to proper numeric data types, getting rid of a few problematic cells in the process. Finally, we checked that every column had the correct data type and began creating relevant visualizations.
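
As a rough illustration of that cleaning step, here is a minimal pandas sketch; the file name and column names are assumptions rather than the project's actual ones.

    import pandas as pd

    # Hypothetical file and column names for the Spotify top-songs dataset.
    df = pd.read_csv("spotify_top_songs.csv")

    # Keep only the columns chosen for the machine learning problem.
    cols = ["genre", "valence", "tempo", "energy", "danceability", "liveness"]
    df = df[cols]

    # Coerce the numeric columns to numeric dtypes; problematic cells become NaN and are dropped.
    numeric_cols = [c for c in cols if c != "genre"]
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
    df = df.dropna(subset=numeric_cols)

    # Confirm the data types before moving on to visualizations.
    print(df.dtypes)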

Results

We found that in most genres there is a positive correlation between energy and valence in our linear regression model, and our KNN model managed to average around a 47% success rate.

Our EDA

Genre and Valence Heatmap

This visualization shows a heatmap comparing genres to their valence on a scale of 0.0 to 1.0. The brightly colored areas indicate where more songs within a genre fall in a given valence range.
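
A minimal sketch of this kind of heatmap with Plotly Express, assuming the cleaned data frame and hypothetical column names from above:

    import plotly.express as px

    # Bin valence (0.0-1.0) per genre; brighter cells mean more songs in that bin.
    fig = px.density_heatmap(
        df,
        x="valence",
        y="genre",
        nbinsx=20,
        color_continuous_scale="Viridis",
    )
    fig.show()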

Valence vs Tempo with Genre

In this visualization, we compare the valence and tempo of songs, with color indicating each song's genre in our Spotify dataset, using Plotly. This demonstrates how tempo and valence relate to the 21 genres favored by Spotify's listeners. Songs can have wildly varying ratios of tempo to valence, though the ratio does show at least a small correlation with genre. X-axis: Tempo, Y-axis: Valence, Color: Genre
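
A minimal sketch of this scatter plot, again assuming the hypothetical column names used earlier:

    import plotly.express as px

    # Tempo on the x-axis, valence on the y-axis, colored by genre.
    fig = px.scatter(
        df,
        x="tempo",
        y="valence",
        color="genre",
        title="Valence vs Tempo by Genre",
    )
    fig.show()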

Multigraph Scatterplot

This visualization shows multiple graphs comparing the negative and positive attributes of a song, its genre, and its energy.

Correlation Heatmap

This heatmap compares multiple song attributes, including liveness and energy, to each other.
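
A sketch of such a correlation heatmap; the list of numeric columns is an assumption based on the attributes described above:

    import plotly.express as px

    # Pairwise correlations between the numeric song attributes.
    numeric_cols = ["valence", "tempo", "energy", "liveness", "danceability", "loudness"]
    corr = df[numeric_cols].corr()
    fig = px.imshow(corr, text_auto=".2f", color_continuous_scale="RdBu_r", zmin=-1, zmax=1)
    fig.show()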

Our Models

Linear Regression

Our goal with linear regression is to predict the value of valence based on energy. It also lets us analyze and understand the relationship between these two variables. By incorporating a trendline, we can further enhance our ability to predict future data. Key findings are listed below, followed by a brief sketch of the per-genre fit.

  • Studies the relationship between energy and valence in each genre
  • Bolero exhibits the strongest positive correlation
  • House demonstrates the weakest correlation between energy and valence
  • All of the listed genres have a positive correlation
  • Findings suggest that higher energy levels in music are generally associated with increased positivity, regardless of the specific genre.
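
The per-genre fit might look roughly like this sketch, assuming scikit-learn and the hypothetical column names used earlier:

    from sklearn.linear_model import LinearRegression

    # Fit a separate energy -> valence regression for each genre and report its slope.
    for genre, group in df.groupby("genre"):
        model = LinearRegression()
        model.fit(group[["energy"]], group["valence"])
        print(f"{genre}: slope = {model.coef_[0]:.3f}")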

Genres implemented in model include...

r&b/soul, k-pop, reggaeton, trap, jazz, hip hop, boy band, pop, indie, country, reggae, metal, rap, dance/electronic, latin, funk, opm, rock, house, else, bolero

Graphs are displayed for the following genres...

hip hop, rap, house, bolero

KNN Genre Prediction Model

Idea

The idea behind this model is to take in a set of numerical values from the dataset and see if it can predict the genre of a song when given those values.

Creation

As can be seen from some of the EDA above, the following parameters were chosen to predict the genre:

  • valence
  • tempo
  • energy
  • instrumentalness
  • speechiness
  • acousticness
  • liveness
  • loudness
  • key
  • danceability

From there, we used the k-nearest-neighbors classifier from the sklearn library. To put it simply, KNN (k-nearest neighbors) is a classification algorithm that assigns a label to a new data point based on the majority class of its k nearest neighbors in a labeled training dataset.
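
For intuition, here is a tiny, self-contained example of that fit/predict pattern; the numbers and genre labels are made up purely for illustration.

    from sklearn.neighbors import KNeighborsClassifier

    # Toy data: two numeric features per "song" and a made-up genre label.
    X_train = [[0.8, 120], [0.7, 125], [0.2, 80], [0.1, 75]]
    y_train = ["pop", "pop", "jazz", "jazz"]

    # With k=3, a new point gets the majority genre of its 3 nearest neighbors.
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)
    print(knn.predict([[0.75, 118]]))  # -> ['pop']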

Training and Fine-tuning

The EDA cleaned up our data, so all that remained was prepping it for the model. To do this, we took the aforementioned song statistics, along with each song's genre (encoded as a number, since the model only accepts numerical values in this scenario), and shuffled the order of the songs on every run. This ensured the model had to work on different splits of the data rather than memorize a few values, and it let us measure how well the model does on average.
Afterward, we split the data into four sets: two holding the inputs (one for training, one for testing) and two holding the corresponding 'targets.' Baseline testing yielded around a 44-45% success rate, which is a good start.
When setting up the model, you can set various hyperparameters that affect how it sorts the data. The following are the hyperparameters we used in the final version of our model (a sketch of this setup follows the list):

  • n_neighbors (the number of 'neighbors', i.e. training points with the most similar values, the model consults): 134 (so it checks the 134 most similar data points in the training set to decide which genre a song is)
  • algorithm (which of the available nearest-neighbor search methods, ball_tree, kd_tree, or brute force, it uses): ball_tree (which organizes the training points into nested 'balls' of nearby points so the nearest neighbors can be looked up quickly)
  • weights (how the 'neighbors' are weighted in the algorithm's decision): distance (meaning neighbors with more similar data values, i.e. closer neighbors, have more influence on which genre a data point is assigned)
  • p (the power parameter of the Minkowski distance metric, which decides whether the distance calculation emphasizes many small differences or a few large ones): 1 (the Manhattan distance, which weights small and large differences linearly rather than emphasizing the largest ones)
  • leaf_size (the number of data points at which the ball tree stops splitting and stores points in a single 'leaf'; this affects the speed and memory of building and querying the tree, not which neighbors are found): 1
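
Under those settings, the model setup might look roughly like the sketch below. The feature column names and the genre-encoding step are assumptions for illustration, not taken verbatim from the project.

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical feature columns; the numeric genre encoding is also assumed.
    features = ["valence", "tempo", "energy", "instrumentalness", "speechiness",
                "acousticness", "liveness", "loudness", "key", "danceability"]
    X = df[features]
    y = df["genre"].astype("category").cat.codes  # genres as numeric labels

    # Shuffle and split into training/testing inputs and their targets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

    # Hyperparameters listed above.
    knn = KNeighborsClassifier(
        n_neighbors=134,
        algorithm="ball_tree",
        weights="distance",
        p=1,
        leaf_size=1,
    )
    knn.fit(X_train, y_train)
    print(f"Success rate: {knn.score(X_test, y_test):.2%}")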

Model Results

The finalized model averages an upper-46% to mid-47% success rate, with this particular run managing about 47.65%. For comparison, a naive model, the baseline for whether a classification model is at least okay, simply has a 1/n chance of success, where n is the number of possible classifications (in this case, genres). With 21 genres that works out to a 1/21, or roughly 4.76%, chance of being correct, so our model is about 10x better on average.
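
As a quick check of that comparison (the KNN success rate below is the figure reported above):

    # Naive baseline vs. the KNN model's reported success rate.
    n_genres = 21
    naive_rate = 1 / n_genres      # ~0.0476, i.e. about a 4.76% chance of guessing correctly
    knn_rate = 0.4765              # this run's success rate
    print(round(knn_rate / naive_rate, 1))  # ~10.0, i.e. roughly 10x better than guessing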

Conclusion

Exploring the correlations between different aspects of songs holds significant value, as it provides insight into the nature of music composition and the emotional impact it carries. By understanding how elements like melody, lyrics, and rhythm interact, we can enhance the creation or recommendation process. Furthermore, this project's potential extends to products like music recommendation apps and sites, or music advertising.

The Team

Nancy Hernandez

Vincent Chen

Jada Jackson

Kamden Evans

Stephen Yang

Katie Guzman

Honorable mention

Spencer Wueste (instructor)