Day 30 of 50 Days of Python: Advanced Pandas Data Manipulation and Aggregation
Part of Week 5: Data Analysis and Visualisation
Welcome to Day 30! Hope you’ve all been enjoying the series so far. Today, we are going to dive deeper into Pandas. We’ll explore advanced data manipulation and aggregation techniques. To make this more engaging I’ve decided to use the real-world dataset we collected from a previous week, Drake’s Spotify Albums. If you don’t have it, don’t worry. You can still follow along.
Python Setup
So, some prequites packages you’ll need to install are:
Pandas
matplotlib
seaborn
And if you’ve forgotten how to install these, it’s just a simple line of code:
pip install pandas matplotlib seaborn
The Dataset
As i’ve already mentioned we’ll use Drake’s Spotify Albums data. If you don’t have the data or you need to pull it again, I’ve got you covered:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
# Replace these with your actual credentials
client_id = 'YOUR_SPOTIFY_CLIENT_ID'
client_secret = 'YOUR_SPOTIFY_CLIENT_SECRET'
# Authenticate with Spotify
auth_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(auth_manager=auth_manager)
# Drake's Spotify Artist ID
drake_id = '3TVXtAsR1Inumwj472S9r4'
# Fetch Drake's albums
albums = []
results = sp.artist_albums(drake_id, album_type='album', country='GB', limit=50)
albums.extend(results['items'])
# Handle pagination
while results['next']:
results = sp.next(results)
albums.extend(results['items'])
# Create a DataFrame
df_albums = pd.DataFrame(albums)
Data Cleaning and Preparation
We’ll clean the above data, to make sure its ready to do some handy work with:
# Remove duplicate albums based on name
df_albums = df_albums.drop_duplicates(subset='name')
# Convert release_date to datetime
df_albums['release_date'] = pd.to_datetime(df_albums['release_date'])
# Extract year from release_date
df_albums['release_year'] = df_albums['release_date'].dt.year
# Select relevant columns
df_albums = df_albums[['name', 'release_date', 'release_year', 'total_tracks', 'external_urls']]
Advanced Data Manipulation and Aggregation
Aggregating total tracks by release year:
# Group by release_year and sum total_tracks tracks_by_year = df_albums.groupby('release_year')['total_tracks'].sum().reset_index() # Rename columns for clarity tracks_by_year.columns = ['Release Year', 'Total Tracks'] # Sort by Release Year tracks_by_year = tracks_by_year.sort_values('Release Year') print(tracks_by_year)
Calculating the average tracks per album, by year:
# Group by release_year and calculate mean of total_tracks avg_tracks_by_year = df_albums.groupby('release_year')['total_tracks'].mean().reset_index() # Rename columns for clarity avg_tracks_by_year.columns = ['Release Year', 'Average Tracks per Album'] # Sort by Release Year avg_tracks_by_year = avg_tracks_by_year.sort_values('Release Year') print(avg_tracks_by_year)
Filtering albums with more than 15 tracks:
# Filter albums with more than 15 tracks albums_gt_15_tracks = df_albums[df_albums['total_tracks'] > 15] print(albums_gt_15_tracks[['name', 'total_tracks']])
Data Visualisation
In most cases, visualising the data can help you find better insights to report on. This is where matplotlib and seaborn come into play:
import matplotlib.pyplot as plt
import seaborn as sns
# Set the style
sns.set(style="whitegrid")
# Bar plot for Total Tracks by Release Year
plt.figure(figsize=(10, 6))
sns.barplot(x='Release Year', y='Total Tracks', data=tracks_by_year, palette='Blues_d')
plt.title('Total Tracks by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Total Tracks')
plt.tight_layout()
plt.show()
# Bar plot for Average Tracks per Album by Release Year
plt.figure(figsize=(10, 6))
sns.barplot(x='Release Year', y='Average Tracks per Album', data=avg_tracks_by_year, palette='Blues_d')
plt.title('Average Tracks per Album by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Average Tracks per Album')
plt.tight_layout()
plt.show()
Insights from above:
Total Tracks by Release Year: This aggregation provides a clear view of Drake's productivity over the years, highlighting periods of increased output.
Average Tracks per Album by Release Year: This metric offers insights into the depth of each album, indicating whether Drake has been releasing more concise projects or expansive ones.
Albums with More Than 15 Tracks: Filtering albums with a higher track count can help identify projects that are more comprehensive in nature.
Next Up: Day 31 - Data Visualisation with Matplotlib
I’ve given you a bit of a taster of matplotlib in todays post, as outside of what I’ve covered with Pandas there’s not much else to deep dive in as most of that is reusable and applicable to most if not all datasets.
So the next post will cover all the cool things you can actually do with matplotlib and will include some visualisations just so its not just all text.
Thanks for reading and as always, happy coding.