Spencer Wueste (instructor)
GenreLink is a program created by the Radiant Ravens team in order to further analyze genres and their correlation to different musical qualities. This helps users narrow down genres that may be more interesting to them and explore Spotify's vast catalog of music.
This database includes information on top songs from Spotify in over 35 countries around the world. For songs dated between 2017 and 2020, the data includes Spotify genre, positivity levels, danceability, valence, liveness, tempo, and tone of lyrics based on the Syuzhet lyric library.
Spotify database
To clean the data, we first decided what columns to use for our machine learning problem. Once we took only those columns and created a separate data frame, we proceeded to clean the data further by making all number columns have the string data type, getting rid of a few problematic cells in the process. Finally, we checked to make sure we used the proper data types and began creating relevant visualizations.
We found that in most genres there is a positive correlation between energy and valence in our linear regression model, and our KNN model averaged around a 47% success rate.
This visualization shows a heatmap comparing genres to their valence on a scale of 0.0 - 1.0. The brightly colored areas indicate more songs that fall into a certain valence within a genre.
In this visualization, built with Plotly, we compare the valence and the tempo of songs in our Spotify dataset, with each point colored by the song's genre. This demonstrates how tempo and valence correlate to the 21 genres liked by Spotify’s listeners. It shows that songs can have wildly varying ratios of tempo to valence, though the ratio does have at least a small correlation with genre. X-axis: Tempo, Y-axis: Valence, Color: Genre
This visualization shows multiple graphs comparing negative and positive attributes of a song, its genre, and energy.
This heatmap compares multiple song attributes, including liveness and energy, to each other.
Our goal with linear regression is to predict the value of valence based on energy. Additionally, it enables us to analyze and comprehend the relationship and effects between these variables. By incorporating a trendline, we can further enhance our ability to predict future data.
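The trendline idea above can be sketched as an ordinary least-squares fit on a single feature. This is a minimal, standard-library-only illustration and not the team's actual code: the function name `fit_line` and the sample energy and valence values are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (energy, valence) pairs, both on a 0.0-1.0 scale.
energy = [0.20, 0.40, 0.60, 0.80]
valence = [0.25, 0.35, 0.50, 0.60]

slope, intercept = fit_line(energy, valence)
# The trendline predicts valence for an energy value not in the data.
predicted_valence = slope * 0.70 + intercept
```

A positive `slope` corresponds to the positive energy-valence correlation reported for most genres; fitting one line per genre would mean calling `fit_line` on each genre's songs separately.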
Genres implemented in model include...
r&b/soul, k-pop, reggaeton, trap, jazz, hip hop, boy band, pop, indie, country, reggae, metal, rap, dance/electronic, latin, funk, opm, rock, house, else, bolero
Graphs of genres in display...
hip hop, rap, house, bolero
The idea behind this model is to take in a set of numerical values from the dataset and see whether it can predict the genre of a song from those values.
As can be seen from some of the EDA above, the following parameters were chosen to predict the genre:
From there, we used the KNN section of the sklearn library. To put it simply, KNN (k-nearest neighbors) is a classification algorithm that assigns a label to a new data point based on the majority class of its k nearest neighbors in a labeled training dataset.
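To make the majority-vote idea concrete, here is a from-scratch sketch of k-nearest neighbors using only the standard library. It is not the sklearn classifier the project used, and the helper name `knn_predict`, the two features, and the sample labels are all assumptions made for illustration.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` with the majority class among its k nearest training points."""
    # Euclidean distance from the query to every labeled training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors and take a majority vote over their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical (energy, valence) features with made-up genre labels.
points = [(0.90, 0.80), (0.85, 0.75), (0.80, 0.90), (0.20, 0.30), (0.25, 0.20)]
labels = ["pop", "pop", "pop", "jazz", "jazz"]

guess = knn_predict(points, labels, query=(0.30, 0.25), k=3)
```

Here the two closest neighbors of the query are both "jazz" songs, so the vote goes to "jazz" even though the third neighbor is "pop".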
The EDA cleaned up our data, so all that remained was to prepare the data for the model. We took the aforementioned statistics on each song along with its genre, encoded as a number because the model only accepts numerical values in this scenario. We then shuffled the songs into a new random order on every run. This let us check that the model worked across more of the data rather than memorizing a few values, and it gave us a range for how well the model does on average.
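The preparation step described here might look roughly like the following sketch. The function name `encode_and_shuffle`, the choice to number genres in alphabetical order, and the sample rows are assumptions for illustration; the write-up does not say how the genre codes were assigned.

```python
import random


def encode_and_shuffle(features, genres, seed=None):
    """Map genre names to integer codes, then shuffle rows and codes together."""
    # Give each distinct genre an integer code (alphabetical, so it is repeatable).
    genre_to_code = {name: code for code, name in enumerate(sorted(set(genres)))}
    rows = [(row, genre_to_code[genre]) for row, genre in zip(features, genres)]
    # seed=None gives a different order on every run; pass an int to reproduce one.
    random.Random(seed).shuffle(rows)
    shuffled_features = [row for row, _ in rows]
    shuffled_targets = [code for _, code in rows]
    return shuffled_features, shuffled_targets, genre_to_code


# Hypothetical rows; the two columns stand in for whichever statistics were used.
features = [(0.9, 0.8), (0.2, 0.3), (0.7, 0.6), (0.4, 0.1)]
genres = ["pop", "jazz", "pop", "rock"]

X, y, mapping = encode_and_shuffle(features, genres, seed=0)
```

Shuffling features and targets as pairs keeps each song attached to its own genre code, which is what makes repeated runs comparable.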
Afterward, we split the data into four sets: training inputs, testing inputs, and the corresponding training and testing 'targets.' Base testing yielded around a 44-45% success rate, which is pretty good for a start.
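A four-way split and the success-rate calculation can be sketched as below. The 20% test fraction is an assumption (the write-up does not state the ratio), and `split_train_test` and `accuracy` are hypothetical helper names rather than the functions the team called.

```python
def split_train_test(inputs, targets, test_fraction=0.2):
    """Split already-shuffled inputs and targets into train and test portions."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    # Everything before the cut is for training; the rest is held out for testing.
    cut = len(inputs) - int(len(inputs) * test_fraction)
    return inputs[:cut], inputs[cut:], targets[:cut], targets[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels (the 'success rate')."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical data: ten songs with one feature each and integer genre codes.
inputs = [[i / 10] for i in range(10)]
targets = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = split_train_test(inputs, targets, test_fraction=0.2)
```

The model would be fit on `X_train` and `y_train` only, and `accuracy` would then be computed on its predictions for `X_test` against `y_test`.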
When setting up the model, you can set various hyperparameters that affect how the model goes about sorting data. The following are the hyperparameters we used in the final version of our model:
The finalized model averages a success rate between the upper 46% range and the mid 47% range, with this particular run managing about 47.65%. For comparison, consider a naive model, the baseline for judging whether a classification model is at the very least OK: it has a 1/n chance of success, where n is the number of possible classifications (in this case, genres). With 21 genres, that is a 1/21 (about 4.76%) chance of being correct, so our model is about 10x better on average.
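The baseline comparison is simple arithmetic, worked through here using only the figures quoted in this write-up (21 genres and the 47.65% run). The variable names are just for illustration.

```python
n_genres = 21            # number of genre classes in the model
model_accuracy = 0.4765  # success rate of the reported run

# A uniform random guess is right 1 time in n.
baseline_accuracy = 1 / n_genres
improvement = model_accuracy / baseline_accuracy

print(f"baseline: {baseline_accuracy:.2%}")  # prints "baseline: 4.76%"
print(f"improvement: {improvement:.1f}x")    # prints "improvement: 10.0x"
```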
Exploring the correlation between different aspects of songs holds significant value as it provides insights into the nature of music composition and the emotional impact it carries. By understanding how elements like melody, lyrics, and rhythm interact, we can enhance the creation or recommendation process. Furthermore, this project's potential extends to being able to develop products like music recommendation apps or sites, or music advertisement.