The 2021 Rolling Stone Top 500 Songs of All Time List has Become a Personality Trait

Jul 9, 2022 26 min read

In a previous post, I ranted about ranking systems and webscraped data on a fascinating list - Rolling Stone Magazine’s Top 500 Greatest Songs of All Time. In this post, I pull the data again using the Spotify API and perform more in depth analytics on it including supervised learning, dimensionality reduction, and unsupervised learning. I also use Streamlit to make a web application that allows you to explore the data and see the results in a more interactive way.

TL/DR because this article is long

For those who don’t want to read the article, I built an accompanying Streamlit application for further exploration of the Rolling Stone top 500 data. Try it out here. You can also download the dataset here for your own use.

Background

Released in September 2021, Rolling Stone attempted to rank what maybe shouldn’t be ranked. Across all genres over the last 100 years, they ordered from 1 to 500 what they considered to be the best songs of all time. It was the first time they made a list like this in 17 years. In constructing the list, they surveyed a bunch of artists and music critics about their 50 favorite songs and aggregated the results to create their ranking. Possibly because of who they surveyed and possibly because of how they scored the survey results, the emerging list makes the listener stop and think, “Really? Gasolina by Daddy Yankee is in the top 50?”. Yet for all of its spicy takes, it makes for an excellent ~34 hour playlist, and I’ve been obsessed with it for almost a year.

You can read more about how they came up with this list on their website, but I site it below for reference:

In 2004, Rolling Stone published its list of the 500 Greatest Songs of All Time. It’s one of the most widely read stories in our history, viewed hundreds of millions of times on this site. But a lot has changed since 2004; back then the iPod was relatively new, and Billie Eilish was three years old. So we’ve decided to give the list a total reboot. To create the new version of the RS 500 we convened a poll of more than 250 artists, musicians, and producers — from Angelique Kidjo to Zedd, Sam Smith to Megan Thee Stallion, M. Ward to Bill Ward — as well as figures from the music industry and leading critics and journalists. They each sent in a ranked list of their top 50 songs, and we tabulated the results.

Nearly 4,000 songs received votes. Where the 2004 version of the list was dominated by early rock and soul, the new edition contains more hip-hop, modern country, indie rock, Latin pop, reggae, and R&B. More than half the songs here — 254 in all — weren’t present on the old list, including a third of the Top 100. The result is a more expansive, inclusive vision of pop, music that keeps rewriting its history with every beat.

Why I scraped data from the Rolling Stone Top 500.. Twice

In my last post I described how and why I scraped data from the Rolling Stone top 500. In my own words, “I’ve been so into my playlist that I wanted to be more exact when I told people where certain songs ranked in the Rolling Stone top 500 list”. My methodology was to use the BeautifulSoup library to scrape the songs off of the website and then merge in other information like genre and track time using the now deprecated iTunes API. The result had a few misses in some information but overall was pretty thorough.

So I pulled this dataset and I sent it to my little brother to run some kind of analysis on. After waiting a few months, I saw that someone else made the same playlist on Spotify that I had made in Youtube Music. For me this was a game changer, because Spotify API is free to use and returns high quality results. Besides filling in some missing records I had from linking my webscraping to the iTunes API, it also provided a bunch of quantitative metrics for evaluating and comparing music, such as the danceability and liveness of a given song.

Below, I show just how easy this was to do. In a few lines I pull more data than I ever had before on all of the songs in the top 500 list as well as metadata about the corresponding artists and albums.

Pulling Data

Imports

import numpy as np
import pandas as pd
import spotipy
from spotipy import SpotifyClientCredentials
import tqdm
import yaml

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Authentication

with open('secrets.yaml') as file:
    secrets = yaml.safe_load(file)
    client_id = secrets.get('client_id')
    client_secret = secrets.get('client_secret')

client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

Pull track URIs and general info

playlist_link = "https://open.spotify.com/playlist/4EdmaJCUXvXsXJKXZGvj2r"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]
playlist_data = []
for i in range(5):
    playlist_data = playlist_data + sp.playlist_tracks(playlist_URI, offset=i*100)['items']
track_uris = [x["track"]["uri"] for x in playlist_data]

# Print information from Stronger by Kanye West
playlist_data[0]

{'added_at': '2021-09-15T18:48:41Z',
 'added_by': {'external_urls': {'spotify': 'https://open.spotify.com/user/117710003'},
  'href': 'https://api.spotify.com/v1/users/117710003',
  'id': '117710003',
  'type': 'user',
  'uri': 'spotify:user:117710003'},
 'is_local': False,
 'primary_color': None,
 'track': {'album': {'album_type': 'album',
   'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/5K4W6rqBFWDnAN6FQUkS6x'},
     'href': 'https://api.spotify.com/v1/artists/5K4W6rqBFWDnAN6FQUkS6x',
     'id': '5K4W6rqBFWDnAN6FQUkS6x',
     'name': 'Kanye West',
     'type': 'artist',
     'uri': 'spotify🧑‍🎨5K4W6rqBFWDnAN6FQUkS6x'}],
   'available_markets': [],
   'external_urls': {'spotify': 'https://open.spotify.com/album/5fPglEDz9YEwRgbLRvhCZy'},
   'href': 'https://api.spotify.com/v1/albums/5fPglEDz9YEwRgbLRvhCZy',
   'id': '5fPglEDz9YEwRgbLRvhCZy',
   'images': [{'height': 640,
     'url': 'https://i.scdn.co/image/ab67616d0000b2739bbd79106e510d13a9a5ec33',
     'width': 640},
    {'height': 300,
     'url': 'https://i.scdn.co/image/ab67616d00001e029bbd79106e510d13a9a5ec33',
     'width': 300},
    {'height': 64,
     'url': 'https://i.scdn.co/image/ab67616d000048519bbd79106e510d13a9a5ec33',
     'width': 64}],
   'name': 'Graduation',
   'release_date': '2007-09-11',
   'release_date_precision': 'day',
   'total_tracks': 13,
   'type': 'album',
   'uri': 'spotify:album:5fPglEDz9YEwRgbLRvhCZy'},
  'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/5K4W6rqBFWDnAN6FQUkS6x'},
    'href': 'https://api.spotify.com/v1/artists/5K4W6rqBFWDnAN6FQUkS6x',
    'id': '5K4W6rqBFWDnAN6FQUkS6x',
    'name': 'Kanye West',
    'type': 'artist',
    'uri': 'spotify🧑‍🎨5K4W6rqBFWDnAN6FQUkS6x'}],
  'available_markets': [],
  'disc_number': 1,
  'duration_ms': 311866,
  'episode': False,
  'explicit': True,
  'external_ids': {'isrc': 'USUM70741299'},
  'external_urls': {'spotify': 'https://open.spotify.com/track/4fzsfWzRhPawzqhX8Qt9F3'},
  'href': 'https://api.spotify.com/v1/tracks/4fzsfWzRhPawzqhX8Qt9F3',
  'id': '4fzsfWzRhPawzqhX8Qt9F3',
  'is_local': False,
  'name': 'Stronger',
  'popularity': 31,
  'preview_url': None,
  'track': True,
  'track_number': 3,
  'type': 'track',
  'uri': 'spotify:track:4fzsfWzRhPawzqhX8Qt9F3'},
 'video_thumbnail': {'url': None}}

Put Into a Neat Dataframe

playlist_dict = {
    
    'rank': [],
    'artist_name':[],
    'track_name':[],
    
    # Track data
    'track_popularity':[],
    'track_duration_ms':[],
    'track_is_explicit':[],
    'track_number':[],
    'track_danceability': [],
    'track_energy': [],
    'track_key': [],
    'track_loudness': [],
    'track_mode': [],
    'track_speechiness': [],
    'track_acousticness': [],
    'track_instrumentalness': [],
    'track_liveness': [],
    'track_valence': [],
    'track_tempo': [],
    'track_time_signature': [],
    'track_analysis_url': [],

    # Album data
    'album_name':[],
    'album_release_date':[],
    'album_release_year':[],
    'album_image':[],
    
    # Artist data
    'artist_genre':[],
    'artist_popularity':[],
    
}
for count, i in tqdm.tqdm(enumerate(playlist_data)):
    
    # Query for More Song Data
    track_info = sp.audio_features(i['track']['id'])[0]

    playlist_dict['rank'].append(500 - count)
    playlist_dict['artist_name'].append(i['track']['artists'][0]['name'])
    playlist_dict['track_name'].append(i['track']['name'])
    

    # Song Stuff
    playlist_dict['track_popularity'].append(i['track']['popularity'])
    playlist_dict['track_duration_ms'].append(i['track']['duration_ms'])
    playlist_dict['track_is_explicit'].append(i['track']['explicit'])
    playlist_dict['track_number'].append(i['track']['track_number'])
    playlist_dict['track_danceability'].append(track_info['danceability'])
    playlist_dict['track_energy'].append(track_info['energy'])
    playlist_dict['track_key'].append(track_info['key'])
    playlist_dict['track_loudness'].append(track_info['loudness'])
    playlist_dict['track_mode'].append(track_info['mode'])
    playlist_dict['track_speechiness'].append(track_info['speechiness'])
    playlist_dict['track_acousticness'].append(track_info['acousticness'])
    playlist_dict['track_instrumentalness'].append(track_info['instrumentalness'])
    playlist_dict['track_liveness'].append(track_info['liveness'])
    playlist_dict['track_valence'].append(track_info['valence'])
    playlist_dict['track_tempo'].append(track_info['tempo'])
    playlist_dict['track_time_signature'].append(track_info['time_signature'])
    playlist_dict['track_analysis_url'].append(track_info['analysis_url'])
    
    
    # Album Stuff
    playlist_dict['album_name'].append(i['track']['album']['name'])
    playlist_dict['album_release_date'].append(i['track']['album']['release_date'])
    playlist_dict['album_release_year'].append(np.nan)
    playlist_dict['album_image'].append(i['track']['album']['images'][0]['url'])
    
    # Artist Stuff
    # Query More Artist Data
    artist_uri = i["track"]["artists"][0]["uri"]
    artist_info = sp.artist(artist_uri)
    artist_pop = artist_info["popularity"]
    artist_genres = artist_info["genres"]
    
    playlist_dict['artist_genre'].append(', '.join(artist_info['genres']))
    playlist_dict['artist_popularity'].append(artist_info['popularity'])

500it [02:10,  3.84it/s]


df = (
    pd.DataFrame(data = playlist_dict)
    .assign(album_release_date = lambda x: pd.to_datetime(x.album_release_date))
    .assign(album_release_year = lambda x: x.album_release_date.dt.year)
)
# Check for null values
df.isna().any().sum()

df.head()
df.tail()
df.to_csv('data/spotify_music_output.csv',index=False)

	rank	artist_name	track_name	track_popularity	track_duration_ms	track_is_explicit	track_number	track_danceability	track_energy	track_key	...	track_valence	track_tempo	track_time_signature	track_analysis_url	album_name	album_release_date	album_release_year	album_image	artist_genre	artist_popularity
0	500	Kanye West	Stronger	31	311866	True	3	0.617	0.717	10	...	0.490	103.992	4	https://api.spotify.com/v1/audio-analysis/4fzs...	Graduation	2007-09-11	2007	https://i.scdn.co/image/ab67616d0000b2739bbd79...	chicago rap, rap	90
1	499	The Supremes	Baby Love	67	158040	False	3	0.595	0.643	5	...	0.730	135.633	4	https://api.spotify.com/v1/audio-analysis/5uES...	Where Did Our Love Go	1964-08-31	1964	https://i.scdn.co/image/ab67616d0000b273d5ea12...	brill building pop, classic girl group, disco,...	61
2	498	Townes Van Zandt	Poncho & Lefty	55	220573	False	8	0.636	0.277	1	...	0.685	133.363	4	https://api.spotify.com/v1/audio-analysis/6QXt...	The Late Great Townes Van Zandt	1972-01-01	1972	https://i.scdn.co/image/ab67616d0000b273cbf571...	alternative country, cosmic american, country ...	53
3	497	Lizzo	Truth Hurts	78	173325	True	13	0.715	0.624	4	...	0.412	158.087	4	https://api.spotify.com/v1/audio-analysis/3HWz...	Cuz I Love You (Super Deluxe)	2019-04-19	2019	https://i.scdn.co/image/ab67616d0000b2737bebcd...	dance pop, escape room, minnesota hip hop, pop...	80
4	496	Harry Nilsson	Without You	68	201000	False	6	0.381	0.186	4	...	0.142	65.058	4	https://api.spotify.com/v1/audio-analysis/6MrI...	Nilsson Schmilsson	1971-01-01	1971	https://i.scdn.co/image/ab67616d0000b2734df5b5...	art rock, brill building pop, classic rock, cl...	59

5 rows × 26 columns

	rank	artist_name	track_name	track_popularity	track_duration_ms	track_is_explicit	track_number	track_danceability	track_energy	track_key	...	track_valence	track_tempo	track_time_signature	track_analysis_url	album_name	album_release_date	album_release_year	album_image	artist_genre	artist_popularity
495	5	Nirvana	Smells Like Teen Spirit	14	301920	False	1	0.502	0.912	1	...	0.720	116.761	4	https://api.spotify.com/v1/audio-analysis/1f3y...	Nevermind (Deluxe Edition)	1991-09-26	1991	https://i.scdn.co/image/ab67616d0000b27328a90d...	alternative rock, grunge, permanent wave, rock	80
496	4	Bob Dylan	Like a Rolling Stone	71	369600	False	1	0.482	0.721	0	...	0.557	95.263	4	https://api.spotify.com/v1/audio-analysis/3AhX...	Highway 61 Revisited	1965-08-30	1965	https://i.scdn.co/image/ab67616d0000b27341720e...	classic rock, country rock, folk, folk rock, r...	70
497	3	Sam Cooke	A Change Is Gonna Come	66	191160	False	7	0.212	0.383	10	...	0.452	173.790	3	https://api.spotify.com/v1/audio-analysis/0KOE...	Ain't That Good News	1964-03-01	1964	https://i.scdn.co/image/ab67616d0000b2737329db...	adult standards, brill building pop, classic s...	64
498	2	Public Enemy	Fight The Power	57	282640	True	20	0.797	0.582	2	...	0.415	105.974	4	https://api.spotify.com/v1/audio-analysis/1yo1...	Fear Of A Black Planet	1990-04-10	1990	https://i.scdn.co/image/ab67616d0000b2732e3d1d...	conscious hip hop, east coast hip hop, gangste...	56
499	1	Aretha Franklin	Respect	73	147600	False	1	0.805	0.558	0	...	0.965	114.950	4	https://api.spotify.com/v1/audio-analysis/7s25...	I Never Loved a Man the Way I Love You	1967-03-10	1967	https://i.scdn.co/image/ab67616d0000b2736aa931...	classic soul, jazz blues, memphis soul, soul, ...	68

5 rows × 26 columns

The Measures We Have

In less than two minutes, we have complete data on all 500 songs in the list! I’m almost upset with how easy this was compared with scraping the data from Rolling Stone Magazine. Great job Spotify. Still, we might wonder why I bothered to gather this info given I already did it before even if it is cleaner and more reliable. But as I said above, the other reason I went through the trouble to pull this was for the beautiful quantitative measures we can get for each song on the list. I’ll show why these are so useful for my purposes in a little bit. As an example of what we have, below I show some new measures we have for the song “Stronger” by Kanye West.

sp.audio_features(track_uris[0])[0]

{'danceability': 0.617,
 'energy': 0.717,
 'key': 10,
 'loudness': -7.858,
 'mode': 0,
 'speechiness': 0.153,
 'acousticness': 0.00564,
 'instrumentalness': 0,
 'liveness': 0.408,
 'valence': 0.49,
 'tempo': 103.992,
 'type': 'audio_features',
 'id': '4fzsfWzRhPawzqhX8Qt9F3',
 'uri': 'spotify:track:4fzsfWzRhPawzqhX8Qt9F3',
 'track_href': 'https://api.spotify.com/v1/tracks/4fzsfWzRhPawzqhX8Qt9F3',
 'analysis_url': 'https://api.spotify.com/v1/audio-analysis/4fzsfWzRhPawzqhX8Qt9F3',
 'duration_ms': 311867,
 'time_signature': 4}

Some explanation of what some of these metrics mean is worthwhile. I pulled these definitions straight from the spotify documentation and include them here for completeness.

acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
key: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. >= -1, <= 11
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 db.
mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
time_signature: An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7 indicating time signatures of “3/4”, to “7/4”. >= 3, <= 7
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Some Quick EDA

Alright now that I have my dataset, I’d like to answer some important questions. I’m currently writing this piece on July 9, 2022 and I’m wondering which tracks and which artists are most and least popular right now.

(
    df[['track_name', 'artist_name', 'track_popularity', 'artist_popularity', 'rank']]
    .sort_values(by='track_popularity', ascending=False)
)

	track_name	artist_name	track_popularity	artist_popularity	rank
440	Running Up That Hill (A Deal With God)	Kate Bush	97	82	60
181	Everybody Wants To Rule The World	Tears For Fears	86	71	319
195	Every Breath You Take	The Police	86	73	305
143	Blank Space	Taylor Swift	85	92	357
122	Mr. Brightside	The Killers	85	76	378
...	...	...	...	...	...
486	Waterloo Sunset	The Kinks	3	65	14
259	The Humpty Dance	Digital Underground	3	54	241
50	Powderfinger - 2016 Remaster	Neil Young	2	60	450
105	Planet Rock	Afrika Bambaataa	1	42	395
380	Oh Bondage! Up Yours!	X-Ray Spex	0	36	120

500 rows × 5 columns

Man, Kate Bush’s Running up that hill as number 1. Given the recent release of season 4 of Stranger Things, that tracks.

As a Neil Young fan, I’m kind of sad to see Powder Finger so low although it’s not really one of my favorite songs by him anyway. The Humpty Dance is amusing. If you’ve never listened to it, I recommend checking it out.

Following on rank against popularity, I’m curious if more currently popular songs tend to be ranked higher on the list. Below I make a simple scatter plot showing the rank against the popularity of the top 500 songs.

import plotly.express as px

def plot_popularity_against_track_rank(df, outpath=None):
    fig = px.scatter(
        data_frame=df, 
        x='rank', 
        y='track_popularity', 
        hover_name='track_name', 
        title='Comparing Rank and Track Popularity',
        hover_data=['artist_name'],
        trendline="lowess"
    )
    fig = fig.update_coloraxes(showscale=False)
    fig = fig.update_xaxes(autorange="reversed")
    fig = fig.update_layout(
        plot_bgcolor="white", 
        margin=dict(t=50, l=10, b=10, r=10),
        xaxis_title='Song Ranking',
        yaxis_title="Current Song Popularity",
        width = 800, height = 400
    )
    if outpath:
        fig.write_image(outpath)
    return fig

plot_popularity_against_track_rank(df, outpath="images/track_pop.png")

Dammnnn, Kate Busch so far up there. And we see that there is no real link between current popularity and ranking. I mean, fair. Now I’d like to do something similar with artists in general. Below I explore the rank against popularity at the artist level by grabbing the best ranked song for each artist and compare it against the artist level popularity.

artist_df = (
    df
    .groupby('artist_name')
    .agg({'rank':'min', 'artist_popularity':'max'})
    .reset_index()
    .sort_values(by='artist_popularity', ascending=False)
)
artist_df

	artist_name	rank	artist_popularity
21	Bad Bunny	329	100
86	Drake	129	95
127	Harry Styles	428	93
279	Taylor Swift	69	92
19	BTS	346	92
...	...	...	...
354	Woody Guthrie	229	41
320	The Slits	381	39
258	Screamin' Jay Hawkins	299	38
356	X-Ray Spex	120	36
112	Funky 4 + 1	288	23

359 rows × 3 columns

Wow this definitely tracks. Props to Harry Styles and X-Ray Spex for lining up their artist popularity generally with their song popularity. Now I’ll make the same plot I built above at the artist level. It turns out Kanye West and Kendrick Lamar are the two most popular artists with songs in the top 50 with Runaway and Alright respectively.

def plot_popularity_against_artist_rank(df, outpath=None):

    fig = px.scatter(
        data_frame=artist_df, 
        x='rank', 
        y='artist_popularity', 
        hover_name='artist_name', 
        title='Best Track Rank and Artist Popularity',
        trendline= 'lowess'
    )
    fig = fig.update_coloraxes(showscale=False)
    fig = fig.update_xaxes(autorange="reversed")
    fig = fig.update_layout(
        plot_bgcolor="white", 
        margin=dict(t=50, l=10, b=10, r=10),
        xaxis_title='Best Song Ranking',
        yaxis_title="Current Artist Popularity",
        width = 800, height = 400
    )
    if outpath:
        fig.write_image(outpath)
    return fig

plot_popularity_against_artist_rank(artist_df, outpath="images/artist_pop.png")

Advanced Analytics Part 1 - Forecast the album release date

With all of these new quantitative measures, I started to wonder if I could actually use machine learning in some way on this list. For example, maybe in different years there was a trend for the top songs to have varying amounts of accousticenss or loudness. Or, if I wanted to, I could leverage the artist genres to see if there was a correlation between the top songs and their genres. Below, I implement a simple pipeline using scikit-learn on a random forest. First, I split my data and define my datatypes. Then I provide two helper functions for handling boolean variables and for running singular value decomposition on the combined results. Finally, I define my pipeline, fit it to the data, and predict release years.

## Packages
from sklearn import model_selection
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, FunctionTransformer

SEED = 42

column_type_mapping = {
    "track_danceability": "numeric",
    "track_energy": "numeric",
    "track_loudness": "numeric",
    "track_speechiness": "numeric",
    "track_acousticness": "numeric",
    "track_instrumentalness": "numeric",
    "track_liveness": "numeric",
    "track_valence": "numeric",
    "track_tempo": "numeric",
    "track_duration_ms": "numeric",
    "track_popularity": "numeric",
    "track_is_explicit": "categorical",
    "artist_popularity": "numeric",
    "track_time_signature": "categorical",
    "track_key": "categorical",
    "album_release_year": "numeric",
    "artist_genre": "text",
}


target = 'album_release_year'
features = list(column_type_mapping.keys())

numeric_cols = [i for i in features if column_type_mapping[i] == "numeric"]
categorical_cols = [i for i in features if column_type_mapping[i] == "categorical"]
text_cols = [i for i in features if column_type_mapping[i] == "text"]

X, y = df[features], df[target]
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, y, test_size=.2, random_state=42)

# Modified TruncatedSVD that doesn't fail if n_components > ncols
class MyTruncatedSVD(TruncatedSVD):
    def fit_transform(self, X, y=None):
        if X.shape[1] <= self.n_components:
            self.n_components = X.shape[1] - 1
        return TruncatedSVD.fit_transform(self, X=X, y=y)

def to_string(x):
    """Handle values as string.  They may be treated as booleans or numerics otherwise"""
    return x.astype(str)

#Define the hyperparameter grid
param_grid = {
    'svd__n_components' : [5, 10, 30],
    'rf__max_depth': [5, 10, 20],
    'rf__n_estimators': [30, 100, 200]
    }

# Build Pipeline
numeric_pipeline = Pipeline(
        steps=[
            ("scaler", MinMaxScaler()),
        ]
    )

# Handle Categorical Variables
categorical_pipeline = Pipeline(
    steps=[
        ("convert_to_string", FunctionTransformer(to_string)),
        ("onehot", OneHotEncoder(categories="auto", handle_unknown="ignore")),
    ]
)


# Handle Text features (just artist genre)
text_pipeline = Pipeline(
    steps=[
        ("ngrams", CountVectorizer(ngram_range=(1, 2))),
    ]
)

# Handle each variable type as overall preprocessing
preprocessing_pipeline = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_cols),
        ("cat", categorical_pipeline, categorical_cols),
        ("text", text_pipeline, text_cols[0])
        ]
)

# Run SVD on the results
steps = [
    ("preprocessing", preprocessing_pipeline),
    ("svd", MyTruncatedSVD(random_state=SEED),),
    ("rf", RandomForestRegressor())
    ]

pipe = Pipeline(
    steps=steps,
    verbose=False,
)

# Run grid search to tune hyperparameters
estimator = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1,) 
x = estimator.fit(X_train, Y_train)

Alright, we’ve fit our model, and found our hyperparameters. I also printed out the average cross validation score from the best hyperparameter settings. We have an in sample MAE of about 10. In other words, when the model predicts a release year, it’s off by 10 years on average. Considering the space of songs is about 100 years in total, the performance is actually a little better than I expected it to be. There are two easy reasons to I can use to explain this:

This estimate is overly optimistic as I explain in my article about nested cross validation. When I evaluate the model on the test set, performance will probably be a little lower.
This estimate and the holdout set have some leakage because I randomly split my dataset instead of partioning it by the artist name. The problem with random partitioning in this case is that I have artist level features like popularity and genre that are essentially duplicated in training validation and holdout sets. For exmaple, the list of artist genres affiliated with each song of the Beatles' is [beatlesque, british invasion, classic rock, merseybeat, psychedelic rock, rock]. No other artist will have ‘beatlesque’ in their genre so when the model sees this on the holdout data, it will know the release year of the album will be much lower than with other songs. This leakage will persist in the holdout set. I should repartition my data more thoughtfully, but I’m just kind of having fun here. My article, my rules!

print(f"Best parameters found: {estimator.best_params_}")
print(f"In sample MAE: {round(abs(estimator.best_score_),2)}")

Best parameters found: {'rf__max_depth': 20, 'rf__n_estimators': 200, 'svd__n_components': 30}
In sample MAE: 10.05

Surprisingly, the MAE actually dropped by nearly two years and even the Root Mean Squared Error, a metric that penalizes far off predictions, is under 12 years. As I listed before, there is a good amount of leakage in this model from the artist level features. Still, I do feel pretty impressed with this low error on the model.

preds = estimator.predict(X_test)
r2 = round((1 - np.sum(np.abs(Y_test - preds))/np.sum(np.abs(Y_test - preds.mean()))) * 100,2)
print(f'MAE: {mean_absolute_error(preds, Y_test)}')
print(f'RMSE: {round(np.sqrt(mean_squared_error(preds, Y_test)),2)}')
print(f"Fraction of deviance explained: {r2}%")

MAE: 8.301839416971916
RMSE: 11.88
Fraction of deviance explained: 41.82%

Advanced Analytics Part 2 - Dimensionality Reduction and Clustering

Now I want to disperse with machine learning in the predictive sense. Instead, I think I’m going to try and find tracks that are similar to each other. For example, I just moved in with this couple in Lawrenceville. One of my new roommates told me, “I think all of those 80’s rock bands sound the same. Like I never know which song is AC/DC or Kiss, or the Eagles”.

Now first of all, the Eagles became famous in the mid 70’s. Kiss, surprisingly, also became famous in the later 70’s rather than the 80’s. Regardless of timelines, I would secondly debate the claim that the Eagles sound at all similar to AC/DC or Kiss. Luckily, I now have data on at least one of each of these bands most famous songs. So, I’m going to do three things here:

Use UMAP (a popular dimensionality reduction algorithm) on some columns on the dataset to project it onto a two dimensional plane
Run DBSCAN on the dataset to assign our songs to distinct groups or clusters.
Plot the results and look at where we stand

Doing this might sound complicated, and from a mathematics perspective it is. Luckily, Python has many packages to help us and running UMAP and DBSCAN on a dataset like this will be pretty easy.

# Select the artists (Note that Don Henley is the lead singer of the eagles)
search_artists = ['kiss', 'ac/dc', 'eagles', 'don henley']

artist_indexes = df.loc[lambda x: x.artist_name.str.lower().isin(search_artists)].index

df.loc[lambda x: x.artist_name.str.lower().isin(search_artists)]

	rank	artist_name	track_name	track_popularity	track_duration_ms	track_is_explicit	track_number	track_danceability	track_energy	track_key	...	track_valence	track_tempo	track_time_signature	track_analysis_url	album_name	album_release_date	album_release_year	album_image	artist_genre	artist_popularity
96	404	KISS	Rock And Roll All Nite	76	168840	False	10	0.654	0.929	1	...	0.902	144.774	4	https://api.spotify.com/v1/audio-analysis/6KTv...	Dressed To Kill	1975-03-19	1975	https://i.scdn.co/image/ab67616d0000b27365a7b0...	album rock, glam rock, hard rock, metal, rock	72
189	311	Eagles	Hotel California - 2013 Remaster	84	391376	False	1	0.579	0.508	2	...	0.609	147.125	4	https://api.spotify.com/v1/audio-analysis/40ri...	Hotel California (2013 Remaster)	1976-12-08	1976	https://i.scdn.co/image/ab67616d0000b273463734...	album rock, classic rock, country rock, folk r...	74
213	287	AC/DC	You Shook Me All Night Long	81	210173	False	7	0.532	0.767	7	...	0.755	127.361	4	https://api.spotify.com/v1/audio-analysis/2SiX...	Back In Black	1980-07-25	1980	https://i.scdn.co/image/ab67616d0000b2730b51f8...	australian rock, hard rock, rock	79
291	209	Don Henley	The Boys Of Summer	78	288733	False	1	0.516	0.549	6	...	0.907	176.941	4	https://api.spotify.com/v1/audio-analysis/4gve...	Building The Perfect Beast	1984-01-01	1984	https://i.scdn.co/image/ab67616d0000b273e59d7f...	album rock, art rock, classic rock, country ro...	63

4 rows × 26 columns

Build Pipeline

The pipe I will use for this will be similar to the one I used for predicting the album release year. In fact, I can even reuse the preprocessing steps from the previous section and then just apply UMAP and DBSCAN on the results.

from sklearn.cluster import DBSCAN
import umap

preprocessing_pipeline = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_cols),
        ("cat", categorical_pipeline, categorical_cols),
        ("text", text_pipeline, text_cols[0])
        ]
)

pipeline = Pipeline(
    steps = [
        ("preprocessing", preprocessing_pipeline),
        ("svd", MyTruncatedSVD(n_components=10, random_state=SEED)),
        ("umap", umap.UMAP(random_state=42)),
        ("dbscan", DBSCAN())
    ],
    verbose=False
)

Now I’ll use the first two steps to decompose the dataset into a two dimensional plane. This will let me plot the results in a 2D. Then, I let the full pipeline run so I can get cluster assignments.


umap_embedding = (
    pipeline[:-1]
    .set_params(umap__n_components = 2)
    .fit_transform(df)
)

fit_clustering_params = {
    "umap__n_components": 2, 
    "umap__min_dist":0.0,
    "umap__n_neighbors": 30,
    "dbscan__eps":.5,
    "dbscan__min_samples":5
}

clusters_2d = (pipeline
            .set_params(**fit_clustering_params)
            .fit_predict(df, y=None)
           )

[Pipeline] ..... (step 1 of 3) Processing preprocessing, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing svd, total=   0.0s
[Pipeline] .............. (step 3 of 3) Processing umap, total=   2.3s
[Pipeline] ..... (step 1 of 4) Processing preprocessing, total=   0.0s
[Pipeline] ............... (step 2 of 4) Processing svd, total=   0.0s
[Pipeline] .............. (step 3 of 4) Processing umap, total=   2.9s
[Pipeline] ............ (step 4 of 4) Processing dbscan, total=   0.0s

Plotting the results

Below I plot my two dimensional embedding and clustering results, marking the artists I was comparing with a star. In this case, Boys of Summer by Don Henley was placed right near Hotel California by the Eagles while Shook Me All Night Long by AC/DC was placed near Rock and Roll All Night by Kiss. While these were the results I expected to prove. I actually see a problem with them. Far to the left are all of the Beatles songs. Why are they all so close together? Do they really all sound the same? No, they certainly do not. The issue here is that I’m using the artist genre in my dimensionality reduction process. Because I’m running ngrams on that text field, it creates a ton of extra features. Each of these has the same weight as the numeric features. I’ll need to run this again and remove the text features. I’ll also throw out the categorical features for the same reason.

import plotly.graph_objects as go

def plot_embedding(df, umap_embedding, clusters, artist_indexes, outpath=None):

    text = [
            f"Artist Name: {artist_name} </br> Song Name: {track_name}"
            for artist_name, track_name in zip(
                df.artist_name, df.track_name
            )
        ]
    hovertemplate = "</br>%{text}<extra></extra>"

        # Add in the listed bands from above
    filter_df = df.iloc[artist_indexes]

    text_record = [
            f"Artist Name: {artist_name} </br> Song Name: {track_name}"
            for artist_name, track_name in zip(
                filter_df.artist_name, filter_df.track_name
            )
        ]
    fig = go.Figure()

    if umap_embedding.shape[1] == 2:
        fig = fig.add_trace(
            go.Scatter(
                x=umap_embedding[:, 0],
                y=umap_embedding[:, 1],
                hovertemplate=hovertemplate,
                text=text,
                showlegend=False,
                mode="markers",
                marker_color=clusters,
                marker_line_color="black",
                marker_line_width=1,
            )
        )



        fig = fig.add_trace(
            go.Scatter(
                x=umap_embedding[artist_indexes, 0],
                y=umap_embedding[artist_indexes, 1],
                text=text_record,
                showlegend=True,
                mode="markers",
                marker_symbol="star",
                marker_line_color="midnightblue",
                marker_color="lightskyblue",
                marker_line_width=1,
                marker=dict(size=15),
            )
        )
    else:
        fig = go.Figure(data=[go.Scatter3d(
        x=umap_embedding[:, 0],
        y=umap_embedding[:, 1],
        z=umap_embedding[:, 2],
        mode='markers',
        text=text,
        marker_line_color="black",
        marker_line_width=1,
        marker=dict(
            size=10,
            color=clusters,  
            opacity=0.75
        )
        )])


        fig = fig.add_trace(
            go.Scatter3d(
                    x=umap_embedding[artist_indexes, 0],
                    y=umap_embedding[artist_indexes, 1],
                    z=umap_embedding[artist_indexes, 2],
                    text=text_record,
                    showlegend=True,
                    mode="markers",
                    marker_symbol="diamond",
                    marker_line_color="midnightblue",
                    marker_color="lightskyblue",
                    marker_line_width=1,
                    marker=dict(size=12),
            )
        )

    fig = fig.update_layout(
        title_font_size=20,
        height=400,
        width=800,
        plot_bgcolor="#ffffff",
        hoverlabel=dict(
            bgcolor="white", font_size=16, font_family="Rockwell", namelength=-1
        ),
        margin=dict(l=20, r=20, t=30, b=20),
    )
    if outpath:
        fig.write_image(outpath)
    return fig

plot_embedding(df, umap_embedding, clusters_2d, artist_indexes, outpath="images/umap_2d_first_take.png")

Dimensionality Reduction and Clustering - Take 2

Below I run the same pipeline, this time only taking my numeric features. Since there will be far fewer features to run UMAP on, I also remove the Singular Value Decomposition step.

from sklearn.cluster import DBSCAN
import umap

numeric_preprocessing_pipeline = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_cols),
        ]
)

pipeline = Pipeline(
    steps = [
        ("preprocessing", numeric_preprocessing_pipeline),
        ("umap", umap.UMAP(random_state=42)),
        ("dbscan", DBSCAN())
    ],
    verbose=False
)

Below I run the pipeline to get the UMAP embeddings and cluster assignments again. Because everyone loves 3D charts, I also extract the embeddings and cluster assignments in three dimensions. We’ll see how this looks in a sec.


umap_embedding = (
    pipeline[:-1]
    .set_params(umap__n_components = 2)
    .fit_transform(df)
)
umap_embedding_3d = (
    pipeline[:-1]
    .set_params(umap__n_components = 3)
    .fit_transform(df)
)

fit_clustering_params = {
    "umap__n_components": 2, 
    "umap__min_dist": 0.0,
    "umap__n_neighbors": 30,
    "dbscan__eps":.3,
    "dbscan__min_samples":4
}


clusters_2d = (pipeline
            .set_params(**fit_clustering_params)
            .fit_predict(df, y=None)
           )

fit_clustering_params['umap__n_components'] = 3
clusters_3d = (
    pipeline
    .set_params(**fit_clustering_params)
    .fit_predict(df, y=None)
)

[Pipeline] ..... (step 1 of 2) Processing preprocessing, total=   0.0s
[Pipeline] .............. (step 2 of 2) Processing umap, total=   2.8s
[Pipeline] ..... (step 1 of 2) Processing preprocessing, total=   0.0s
[Pipeline] .............. (step 2 of 2) Processing umap, total=   2.5s
[Pipeline] ..... (step 1 of 3) Processing preprocessing, total=   0.0s
[Pipeline] .............. (step 2 of 3) Processing umap, total=   3.2s
[Pipeline] ............ (step 3 of 3) Processing dbscan, total=   0.0s
[Pipeline] ..... (step 1 of 3) Processing preprocessing, total=   0.0s
[Pipeline] .............. (step 2 of 3) Processing umap, total=   2.8s
[Pipeline] ............ (step 3 of 3) Processing dbscan, total=   0.0s

plot_embedding(df, umap_embedding, clusters_2d, artist_indexes, outpath="images/umap_2d_second_take.png")

Interestingly, now all of the tracks are pretty far away from each other. Does this make this representation meaningless? I think the answer is no. Looking around the chart on the far left, I found ‘Crying’ by Roy Orbison next to ‘River’ by Joni Mitchell and ‘Yesterday’ by the Beatles. On the far right, I see a bunch of rap songs like ‘The Message’ by Grandmaster Flash and ‘Nuthin But a ‘G’ Thang’ by Dr Dre. These actually kind of track. What looks less good in this group are the actual cluster assignments, which don’t seem to contribute much knowledge at all. In a future implementation, I’d like to try this again using Kmeans or HDBSCAN. Then again, a friend of mine once told me that clustering is “Like reading chicken scratch”. It would also be nice to see what features have the most influence on the structure of this embedding. I know I could find this if I had used PCA instead of UMAP, but I’ll leave doing it to another post. In the meantime, let’s look at the 3D embedding.

plot_embedding(df, umap_embedding_3d, clusters_3d, artist_indexes, outpath="images/umap_3d_second_take.png")

Welp, it’s a mess but it’s my mess. There are a few conclusions I can draw from this exercise.

The features you include in dimensionality reduction are important, and without proper handling they will have outsized influence if they are categorical or especially text. One way we could potentially handle this is by running SVD earlier on in the pipeline, exclusively on the categorical and text variables.
Be careful with your clustering methodology. As we see from the chart above, the cluster assignments created from running DBSCAN on one set of parameters are not great.
Unsupervised learning can be a great way to find patterns in high dimensional data but it can also be like reading signs from chicken scratch. We can drastically impact cluster assignments and embeddings by varying features to include as well as parameters for UMAP and DBSCAN.

Of course, this wouldn’t be complete if I did not allow you as a reader to play around with this data yourself. As with many other projects I’ve undertaken, I built an accompanying Streamlit application for further exploration on UMAP and DBSCAN. Try it out here. You can also download the dataset here for your own use.

If you are interested in learning more or talking to me about this subject (especially about ranking music), feel free to look me up on Github or add me on Linkedin! The greatest reward I have from doing any of this is hearing from people who took something away from my post.

References

“2.3. Clustering.” Scikit-Learn, https://scikit-learn/stable/modules/clustering.html. Accessed 9 July 2022.
“6.1. Pipelines and Composite Estimators.” Scikit-Learn, https://scikit-learn/stable/modules/compose.html. Accessed 9 July 2022.
500 Best Songs of All Time - Rolling Stone. https://www.rollingstone.com/music/music-lists/best-songs-of-all-time-1224767/. Accessed 9 July 2022.

Credits

None of this has anything to do with DataRobot, the company I work at. Nonetheless, I thank all of my coworkers who came to that one deep dive session on Streamlit I did a couple weeks ago because it inspired me to build my app and make this post.