Clustify
Music is a driving force for me. I listen to it in the morning while making breakfast, during my commute to work and before I go to bed. Since Im usually to busy to make my own playlist, Spotify's "Daily Mixes" are perfect for me. They take my favorite songs, group similar ones together and create cohesive playlists for me.
I wanted to see if I could create my own system to do something similar. Thus, Clustify was born.
Clustify takes your top 100 most played tracks and creates four unique playlist using the K-Means clustering algorithm. This notebook will walk you step by step through the process.
%matplotlib inline
import os
import json
import pickle
import spotipy
import numpy as np
import pandas as pd
import spotipy.util as util
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.cluster import KMeans
The first thing you have to do is create your Spotify app through the Spotify for Developers site . Once created you will have access to your Client ID, Client Secret and redirect URI.
Now that we have those we will create a pkl file to hold our sensitive information. They will be empty but you can add your Client ID and Secert. Then we can setup a token that will allow us to login.
The scope atribute tells the Spotify API what premissions we need. You can read more about them here. For now all we need to do is read the users most played songs.
if not os.path.exists('secret_spotify_credentials.pkl'):
Spotify = {}
Spotify['Client ID'] = ''
Spotify['Client Secret'] = ''
Spotify['Redirect_URI'] = ''
with open('secret_spotify_credentials.pkl', 'wb') as f:
pickle.dump(Spotify, f)
else:
Spotify = pickle.load(open('secret_spotify_credentials.pkl', 'rb'))
SPOTIPY_CLIENT_ID = Spotify['Client ID']
SPOTIPY_CLIENT_SECRET = Spotify['Client Secret']
SPOTIPY_REDIRECT_URI = Spotify['Redirect_URI']
username = 'DevTheWrightOne'
scope = 'user-top-read'
# Prompt for user premissions
token = util.prompt_for_user_token(username, scope, client_id = SPOTIPY_CLIENT_ID,
client_secret = SPOTIPY_CLIENT_SECRET,redirect_uri=SPOTIPY_REDIRECT_URI)
Now that we have sucessfully logged in we can create a Spotify object using the Spotipy Python library.
The Spotify API only allows you to retrieve 50 songs at a time. You can also specify what time span you want to collect the songs from with the 'time_range' parameter. We are going to use 'short_term' which will give us the most played song from the past month.
# Create a spotify object
spotify = spotipy.Spotify(auth=token)
# Get currents user's top 50 songs.
results = spotify.current_user_top_tracks(offset=0, time_range='short_term', limit=50)
The results are returned to us in JSON format. Lets go ahead and display the top 50 songs along with the artist in a more readable format.
print('Top 50 Songs')
for i, item in enumerate(results['items']):
print(i, item['name'], 'by', item['artists'][0]['name'])
Now that we have the top 50 songs we can analysis them and create some cohesive playlists, but we can't do that with just song names. We need some type of features that provide us with information about the songs. Luckily, Spotify's 'get-audio-features' API request will give us that information.
To get these features we have to use the song IDs which were provided to use in the last request. Lets go ahead and store them in there own list along with the song names and artist names.
# List containing song names
top_song_names = [item['name'] for item in results['items']]
# List containing artist names
top_song_artists = [item['artists'][0]['name'] for item in results['items']]
# List containing song IDs
top_song_ids = [item['uri'] for item in results['items']]
Now we can pass the song IDs to the audio_features method and retrieve the song feature for all 50 songs. They will come in JSON format, but to make things simpler we'll store them in a Pandas Dataframe and take a look at the features names.
# Using song ids get song features for all the songs
tracks_features = spotify.audio_features(top_song_ids)
# Create a dataframe containing all the songs and their features
top_50_songs = pd.read_json(json.dumps(tracks_features))
# View the features present in the DataFrame
print(top_50_songs.columns)
For the purposes of this notebook we will be using the energy and valence features.
Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.
Valence is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
We can use a combination of the features to gauge a persons most common mood. If someone is listening to music with high Energy and high Valence we can consider that person to be Happy or Cheerful. On the other hand if someone listens to music with low Energy and Valence we can consider them to be sad or depressed. We'll get a better idea of this when we visualize our data.
For now lets get rid of the columns in the Dataframe we dont need and add the song names and artist names to the Dataframe.
# Get the features that we need (Energy and Valence)
columns = ['energy', 'valence', 'id']
top_50_songs = top_50_songs[columns]
# Add the song name and artist name to the DataFrame
top_50_songs['title'] = top_song_names
top_50_songs['artist'] = top_song_artists
# Rearrange columns
columns = ['title', 'artist', 'valence', 'energy', 'id']
top_50_songs = top_50_songs[columns]
# Veiw the Dataframe
top_50_songs
Lets graph our data on a scatter plot to get an idea of how dispersed the songs are from each other. A songs valence will be measured along the x-axis and its energy will be measured along the y-axis. We'll also be able to see what mood the songs fall under.
#Plot the songs
x = top_50_songs['valence']
y = top_50_songs['energy']
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(x, y)
ax.set_xlabel('Valence')
ax.set_ylabel('Energy')
ax.xaxis.set_major_locator(ticker.FixedLocator(np.arange(0,1.1, step=0.1)))
ax.yaxis.set_major_locator(ticker.FixedLocator(np.arange(0,1.1, step=0.1)))
ax.set_xlim(left=0, right=1.0)
ax.set_ylim(bottom=0, top=1.0)
plt.grid(b=True, which='major')
plt.minorticks_on()
plt.grid(b=True, which='minor', alpha=0.25)
plt.axhline(0.5, color='black')
plt.axvline(0.5, color='black')
plt.text(0.78, 0.94, s="Happy / Cheerful")
plt.text(0.8, 0.04, s="Chill / Relaxed")
plt.text(0.04, 0.94, s="Angry / Irate")
plt.text(0.04, 0.04, s="Sad / Depressed")
plt.show()
We can see that alot of the music I listen to has high energy and falls under the Angry to Happy spectrum. I assume this is due to the large amount of Rap and Hip Hop music I listen too. Where songs have fast paced beats that result in a high Energy value, but the songs convey a wide range of messages. Some my discuss the struggles of being raised in a bad enviorment while others discuss to the feeling of being happy with a lot of money, this would explain the wide range of Valence feature values.
Lets count the number of songs in each quadrants and determine my most frequently ocurring mood.
# Filters for each quadrant.
angry_filter = (top_50_songs['valence'] < 0.5) & (top_50_songs['energy'] > 0.5)
happy_filter = (top_50_songs['valence'] > 0.5) & (top_50_songs['energy'] > 0.5)
sad_filter = (top_50_songs['valence'] < 0.5) & (top_50_songs['energy'] < 0.5)
chill_filter = (top_50_songs['valence'] > 0.5) & (top_50_songs['energy'] < 0.5)
angry_songs = len(top_50_songs[angry_filter])
happy_songs = len(top_50_songs[happy_filter])
sad_songs = len(top_50_songs[sad_filter])
chill_songs = len(top_50_songs[chill_filter])
print("Angry/Irate songs:", angry_songs, "or", angry_songs/len(top_50_songs) * 100, "%")
print("Happy/Cheerful songs:", happy_songs, "or", happy_songs/len(top_50_songs) * 100, "%")
print("Sad/Depressed songs:", sad_songs, "or", sad_songs/len(top_50_songs) * 100, "%")
print("Chill/Relaxed songs:", chill_songs, "or", chill_songs/len(top_50_songs) * 100, "%")
We can see that based on the music I listen to my most frequently occuring mood is Happy/Cheerful with 46% of the songs I listen to fall under that category.
We are going to use the K-Means clustering algorithm from the Scikit-learn library to model are data. We will be group our songs into 4 clusters based on their energy and valence features. Each one of these clusters will represent a playlist.
Before we being we need to prepare our data by moving it into a numpy array so that we can feed it into the model.
# Prepare data for K-Means clustering algorithm by putting data in a numpy array
data = np.array([[i,j] for i,j in zip(x,y)])
Lets go ahead and check that we have all 50 data points.
# Check that there is 50 pairs of data points.
print(data.shape)
Now that the data is prepared lets go head and create our model, fit our data and plot the clusters.
# Now we can use KMeans to cluster the songs togther using 4 centroids.
kmeans = KMeans(n_clusters=4)
kmeans.fit(data)
# Plot the clusters
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(data[:,0],data[:,1], c=kmeans.labels_, cmap='rainbow')
ax.scatter(kmeans.cluster_centers_[:,0] ,kmeans.cluster_centers_[:,1], color='black')
ax.set_xlabel('Valence')
ax.set_ylabel('Energy')
ax.xaxis.set_major_locator(ticker.FixedLocator(np.arange(0,1.1, step=0.1)))
ax.yaxis.set_major_locator(ticker.FixedLocator(np.arange(0,1.1, step=0.1)))
ax.set_xlim(left=0, right=1.0)
ax.set_ylim(bottom=0, top=1.0)
plt.grid(b=True, which='major')
plt.minorticks_on()
plt.grid(b=True, which='minor', alpha=0.25)
plt.axhline(0.5, color='black') # x = 0
plt.axvline(0.5, color='black') # y = 0
plt.text(0.78, 0.94, s="Happy / Cheerful")
plt.text(0.8, 0.04, s="Chill / Relaxed")
plt.text(0.04, 0.94, s="Angry / Irate")
plt.text(0.04, 0.04, s="Sad / Depressed")
plt.show()
We have successfuly grouped and labeled our top 50 songs into 4 unique playlist/clusters. Lets create a list containing the labels because we will need them later.
# Lets save the labels in a list.
playlist_labels = kmeans.labels_
Now that we have a model of the data we can use it to predict which playlist a song belongs to. We will do this with the next 50 songs for the users top play tracks.
Lets collect the next 50 songs and store them with the rest for a total of 100 songs.
# Create new token and spotify object
token = util.prompt_for_user_token(username, scope, client_id = SPOTIPY_CLIENT_ID,
client_secret = SPOTIPY_CLIENT_SECRET,redirect_uri=SPOTIPY_REDIRECT_URI)
spotify = spotipy.Spotify(auth=token)
results = spotify.current_user_top_tracks(offset=49, time_range='short_term', limit=50)
top_song_names.extend([item['name'] for item in results['items']])
top_song_artists.extend([item['artists'][0]['name'] for item in results['items']])
top_song_ids.extend([item['uri'] for item in results['items']])
# Check that there are 100 songs in total.
print(len(top_song_names))
Now that we have the next 50 songs and their IDs we can get their features. We will then create a DataFrame containing all the information and feature for all 100 songs.
# Get song features for all 100 songs.
tracks_features = spotify.audio_features(top_song_ids)
# Create a new DataFrame containing all 100 songs.
top_100_songs = pd.read_json(json.dumps(tracks_features))
# Get the features that we need (Energy and Valence)
columns = ['energy', 'valence', 'id']
top_100_songs = top_100_songs[columns]
# Add the song name and artist name to the DataFrame
top_100_songs['title'] = top_song_names
top_100_songs['artist'] = top_song_artists
# Rearrange columns
columns = ['title', 'artist', 'valence', 'energy', 'id']
top_100_songs = top_100_songs[columns]
top_100_songs
We already have a labels for the first 50 songs, now we can use the model to predict the labels for the next 50 songs. We will add the labels to the 'playlist_labels' list.
data = np.array([[i,j] for i,j in zip(top_100_songs['valence'], top_100_songs['energy'])])
predictions = kmeans.predict(data[50:100])
playlist_labels = np.append(playlist_labels, predictions)
# Check that there are 100 label in total.
print(len(playlist_labels))
Now we can add the labels to the DataFrame.
top_100_songs['playlist'] = playlist_labels
top_100_songs
Great we successfully labeled our top 100 songs in 4 unique playlist. Now all we need to do is create the playlist and add the songs in the out Spotify library.
# Scope needed to create playlists and add songs to them.
scope = 'playlist-modify-public'
# Create a new token and spotify object.
token = util.prompt_for_user_token(username, scope, client_id = SPOTIPY_CLIENT_ID,
client_secret = SPOTIPY_CLIENT_SECRET,redirect_uri=SPOTIPY_REDIRECT_URI)
spotify = spotipy.Spotify(auth=token)
# Create the playlist and add the songs.
for i in range(1,5):
# Get the songs the belong in playlist 'i'
filter = top_100_songs['playlist'] == i-1
stage = top_100_songs[filter]
track_ids = stage['id']
# Create the playlist
name = "Clustify Daily Mix {}".format(i)
playlist = spotify.user_playlist_create(user=username, name=name, public=True)
playlist_id = playlist['id']
# Add songs to the playlist
results = spotify.user_playlist_add_tracks(user=username, playlist_id=playlist_id, tracks=track_ids)
We have successfully created our own Spotify Daily Mixes