What is the main goal of the Pandas Time Series case study?

The goal is to analyze a small learning platform and answer questions related to learner activity, course revenue, engagement channels, revenue changes, frequent learners, certificate delivery times, and data quality issues.

What datasets are used in this case study?

The case study uses original sample datasets including pandascasestudylearners.csv, pandascasestudycourses.csv, pandascasestudyeventsjan.csv, pandascasestudyeventsfeb.csv, pandascasestudyeventsmar.csv, and pandascasestudycertificatedeliveries.csv.

What are some of the key Pandas skills revised in this project?

The project revises skills such as readcsv, column selection, boolean filtering, datetime parsing, merging, groupby, resampling, rolling windows, and creating export-ready tables.

What are the final outputs expected from the analysis pipeline?

The final outputs include a cleaned learner table, a cleaned event table, a merged activity table, revenue and engagement reports, time series summaries, rolling trend metrics, certificate delivery SLA metrics, and data quality checks.

What is the scenario for this case study?

The scenario involves helping a learning platform understand Q1 activity by analyzing learner profiles, a course catalog, monthly event logs, and certificate delivery records.

Pandas Time Series Case Study: Analyze Learner

Pandas Time Series Case Study: Clean, Join, Resample, and Report

This is an end-to-end Pandas case study.

Instead of learning one method at a time, you will work through a realistic analytics problem from raw CSV files to final reporting tables.

The goal is to analyze a small learning platform and answer questions like:

How many learners are active each day?
Which courses generate the most paid revenue?
Which channels bring the best engagement?
How does weekly revenue change over time?
Which learners come back frequently?
How long does certificate delivery take?
Where are data quality problems hiding?
How do we build reports that a product or growth team can actually use?

This case study uses original sample datasets created for this lesson. It does not use copied stock, delivery, sports, passenger, finance, or public dataset examples.

Files used in this case study:

pandas_case_study_learners.csv
pandas_case_study_courses.csv
pandas_case_study_events_jan.csv
pandas_case_study_events_feb.csv
pandas_case_study_events_mar.csv
pandas_case_study_certificate_deliveries.csv

Place all CSV files in the same folder as this Markdown file before running the examples.

Case Study Scenario

You are helping a learning platform understand Q1 activity.

The platform has:

learner profiles
a course catalog
monthly event logs
certificate delivery records

Your job is to create a reliable analysis pipeline.

The final output should include:

a cleaned learner table
a cleaned event table
a merged activity table
revenue reports
engagement reports
time series summaries
rolling trend metrics
certificate delivery SLA metrics
data quality checks

What This Case Study Covers

This project revises many important Pandas skills:

read_csv
head, tail, shape, info, describe
column selection
boolean filtering
assign
query
sort_values
missing-value checks
duplicate checks
string cleaning with .str
datetime parsing with pd.to_datetime
Timestamp
Timedelta
DateOffset
concat
merge
indicator=True
validate
groupby
agg
named aggregation
transform
pivot_table
set_index
partial datetime slicing
.dt accessor
asfreq
resample
interpolation
rolling windows
exponentially weighted moving averages
shift
time gaps per user
pct_change
MultiIndex-style reports
final export-ready tables

1. Import Libraries

python

import pandas as pd
import numpy as np

Optional display settings:

python

pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)

2. Load The Raw Data

Load dimension tables:

python

learners = pd.read_csv("pandas_case_study_learners.csv")
courses = pd.read_csv("pandas_case_study_courses.csv")
deliveries = pd.read_csv("pandas_case_study_certificate_deliveries.csv")

Load monthly event logs:

python

jan = pd.read_csv("pandas_case_study_events_jan.csv")
feb = pd.read_csv("pandas_case_study_events_feb.csv")
mar = pd.read_csv("pandas_case_study_events_mar.csv")

Check shapes:

python

for name, df in {
    "learners": learners,
    "courses": courses,
    "deliveries": deliveries,
    "jan": jan,
    "feb": feb,
    "mar": mar,
}.items():
    print(name, df.shape)

Preview the tables:

python

learners.head()

python

courses.head()

python

jan.head()

3. Understand The Data Model

The raw data has two dimension tables:

learners: one row per learner
courses: one row per course

It has three event files:

jan: events in January
feb: events in February
mar: events in March

It also has an operations table:

deliveries: certificate request and delivery timestamps

Important keys:

Table	Key Columns
learners	`learner_id`
courses	`course_id`
events	`event_id`, `learner_id`, `course_id`
deliveries	`delivery_id`, `learner_id`, `course_id`

4. First Inspection

Use info() to check data types and missing values:

python

learners.info()

python

courses.info()

python

jan.info()

Check numeric summaries:

python

jan.describe(numeric_only=True)

Check text columns quickly:

python

learners.describe(include="object")

This inspection step tells you what needs cleaning before analysis.

5. Concatenate Monthly Event Logs

The three monthly event files have the same structure, so use pd.concat().

python

events = pd.concat([jan, feb, mar], ignore_index=True)
events.shape

If you want to remember the source month, use keys.

python

events_with_source = pd.concat(
    [jan, feb, mar],
    keys=["Jan", "Feb", "Mar"]
).reset_index(level=0).rename(columns={"level_0": "source_month"})

events_with_source.head()

For the rest of the case study, keep the clean sequential index:

python

events = pd.concat([jan, feb, mar], ignore_index=True)

6. Data Quality Checks

Check duplicate IDs:

python

learners["learner_id"].duplicated().sum()

python

courses["course_id"].duplicated().sum()

python

events["event_id"].duplicated().sum()

Check missing values:

python

learners.isna().sum()

python

events.isna().sum()

Check unexpected event types:

python

events["event_type"].value_counts(dropna=False)

Check amount values:

python

events.groupby("event_type")["amount"].describe()

Data quality is part of analysis. Do not skip it.

7. Clean Learner Text Columns

The learner table intentionally contains messy text.

Clean names:

python

learners["clean_name"] = learners["raw_name"].str.strip().str.title()

Clean email:

python

learners["email_clean"] = learners["email"].str.strip().str.lower()

Clean city and plan:

python

learners["city_clean"] = learners["city"].str.strip().str.title()
learners["plan_clean"] = learners["plan"].str.strip().str.lower()

Extract email domain:

python

learners["email_domain"] = learners["email_clean"].str.extract(r"@([^@\s]+)$")

Split campaign code:

python

learners[["campaign_track", "campaign_tier", "campaign_year"]] = (
    learners["campaign_code"].str.split("-", expand=True)
)

learners["campaign_year"] = learners["campaign_year"].astype(int)

Create a cleaned learner table:

python

learner_clean = learners[[
    "learner_id",
    "clean_name",
    "email_clean",
    "email_domain",
    "city_clean",
    "plan_clean",
    "campaign_track",
    "campaign_tier",
    "campaign_year",
    "signup_at",
    "referral_source"
]].copy()

learner_clean.head()

8. Parse Dates Safely

Convert learner signup dates:

python

learner_clean["signup_at"] = pd.to_datetime(
    learner_clean["signup_at"],
    errors="coerce"
)

Convert course publish dates:

python

courses["published_at"] = pd.to_datetime(
    courses["published_at"],
    errors="coerce"
)

Convert event timestamps:

python

events["event_at"] = pd.to_datetime(
    events["event_time"],
    errors="coerce"
)

Find invalid event timestamps:

python

events[events["event_at"].isna()]

The invalid rows are not automatically deleted. You inspect them first.

For time series analysis, create a valid-events table:

python

events_valid = events.dropna(subset=["event_at"]).copy()

9. Create Date Features

Use the .dt accessor.

python

events_valid["event_date"] = events_valid["event_at"].dt.date
events_valid["event_month"] = events_valid["event_at"].dt.month_name()
events_valid["event_week"] = events_valid["event_at"].dt.isocalendar().week.astype(int)
events_valid["event_day_name"] = events_valid["event_at"].dt.day_name()
events_valid["event_hour"] = events_valid["event_at"].dt.hour
events_valid["event_quarter"] = events_valid["event_at"].dt.quarter
events_valid["is_weekend"] = events_valid["event_at"].dt.dayofweek >= 5

Inspect:

python

events_valid[[
    "event_id",
    "event_at",
    "event_month",
    "event_day_name",
    "event_hour",
    "is_weekend"
]].head()

10. Convert Durations With Timedelta

Session duration is stored as seconds.

python

events_valid["duration"] = pd.to_timedelta(
    events_valid["duration_seconds"],
    unit="s"
)

Create minutes:

python

events_valid["duration_minutes"] = (
    events_valid["duration"].dt.total_seconds() / 60
).round(2)

Check duration by event type:

python

events_valid.groupby("event_type")["duration_minutes"].describe()

11. Audit Joins Before Merging

Events should match learner IDs and course IDs.

Audit learners:

python

learner_audit = events_valid.merge(
    learner_clean,
    how="left",
    on="learner_id",
    indicator=True
)

learner_audit["_merge"].value_counts()

Find event rows with missing learner profiles:

python

learner_audit[learner_audit["_merge"] == "left_only"][
    ["event_id", "learner_id", "event_time", "event_type"]
]

Audit courses:

python

course_audit = events_valid.merge(
    courses,
    how="left",
    on="course_id",
    indicator=True
)

course_audit["_merge"].value_counts()

Find event rows with missing course records:

python

course_audit[course_audit["_merge"] == "left_only"][
    ["event_id", "course_id", "event_time", "event_type"]
]

This is safer than blindly merging.

12. Build The Activity Fact Table

Merge events with learners and courses.

python

activity = (
    events_valid
    .merge(learner_clean, how="left", on="learner_id", validate="many_to_one")
    .merge(courses, how="left", on="course_id", validate="many_to_one")
)

activity.head()

Check the shape:

python

activity.shape

Select useful columns:

python

activity = activity[[
    "event_id",
    "event_at",
    "event_date",
    "event_month",
    "event_week",
    "event_day_name",
    "event_hour",
    "is_weekend",
    "learner_id",
    "clean_name",
    "city_clean",
    "plan_clean",
    "referral_source",
    "course_id",
    "course_title",
    "track",
    "difficulty",
    "mentor",
    "event_type",
    "device",
    "channel",
    "duration_minutes",
    "amount",
    "status"
]]

activity.head()

This is the main table for analysis.

13. Basic Filtering

Paid purchases:

python

paid_purchases = activity.query("event_type == 'purchase' and amount > 0")
paid_purchases.head()

Mobile events:

python

mobile_events = activity[activity["device"].eq("mobile")]
mobile_events.head()

Weekend events:

python

weekend_events = activity[activity["is_weekend"]]
weekend_events.head()

Long sessions:

python

long_sessions = activity[activity["duration_minutes"] >= 30]
long_sessions[["event_id", "clean_name", "event_type", "duration_minutes"]]

14. High-Level Metrics

Total events:

python

activity["event_id"].nunique()

Unique learners:

python

activity["learner_id"].nunique()

Total paid revenue:

python

activity["amount"].sum()

Average session duration:

python

activity["duration_minutes"].mean()

Events by type:

python

activity["event_type"].value_counts()

15. GroupBy Reports

Revenue by course:

python

course_revenue = (
    activity
    .groupby(["course_id", "course_title"], as_index=False)
    .agg(
        events=("event_id", "count"),
        learners=("learner_id", "nunique"),
        revenue=("amount", "sum"),
        avg_duration=("duration_minutes", "mean")
    )
    .sort_values("revenue", ascending=False)
)

course_revenue["avg_duration"] = course_revenue["avg_duration"].round(2)
course_revenue

Engagement by channel:

python

channel_report = (
    activity
    .groupby("channel", as_index=False)
    .agg(
        events=("event_id", "count"),
        learners=("learner_id", "nunique"),
        revenue=("amount", "sum"),
        avg_duration=("duration_minutes", "mean")
    )
)

channel_report["avg_duration"] = channel_report["avg_duration"].round(2)
channel_report.sort_values("revenue", ascending=False)

Track and plan summary:

python

track_plan_report = (
    activity
    .groupby(["track", "plan_clean"], as_index=False)
    .agg(
        learners=("learner_id", "nunique"),
        events=("event_id", "count"),
        revenue=("amount", "sum"),
        avg_duration=("duration_minutes", "mean")
    )
)

track_plan_report

16. Transform For Row-Level Context

Add each learner's total activity duration to every row.

python

activity["learner_total_minutes"] = (
    activity.groupby("learner_id")["duration_minutes"].transform("sum")
)

Add each course's total revenue to every row.

python

activity["course_total_revenue"] = (
    activity.groupby("course_id")["amount"].transform("sum")
)

Calculate each row's share of course revenue:

python

activity["course_revenue_share"] = np.where(
    activity["course_total_revenue"] > 0,
    activity["amount"] / activity["course_total_revenue"],
    0
)

Inspect:

python

activity[[
    "event_id",
    "course_title",
    "amount",
    "course_total_revenue",
    "course_revenue_share"
]].head()

17. Pivot Tables

Revenue by month and track:

python

activity.pivot_table(
    index="event_month",
    columns="track",
    values="amount",
    aggfunc="sum",
    fill_value=0
)

Events by device and event type:

python

activity.pivot_table(
    index="device",
    columns="event_type",
    values="event_id",
    aggfunc="count",
    fill_value=0
)

Average duration by track and device:

python

activity.pivot_table(
    index=["track", "difficulty"],
    columns="device",
    values="duration_minutes",
    aggfunc="mean",
    fill_value=0
).round(2)

Pivot tables are compact reporting tools.

18. Build A DatetimeIndex

Time series operations work best when the datetime column is the index.

python

activity_ts = activity.set_index("event_at").sort_index()
activity_ts.head()

Now you can use partial string selection.

Select all February events:

python

activity_ts.loc["2026-02"]

Select a date range:

python

activity_ts.loc["2026-02-10":"2026-03-05"]

Select one day:

python

activity_ts.loc["2026-03-01"]

Partial date indexing is one of the most convenient reasons to use a DatetimeIndex.

19. DateOffset For Calendar Jumps

Timedelta is for exact durations.

DateOffset understands calendar concepts like months and years.

python

base_date = pd.Timestamp("2026-01-31")
base_date + pd.Timedelta(days=30)

python

base_date + pd.DateOffset(months=1)

For business calendar logic, DateOffset is often clearer.

Find activity on the same day every month:

python

monthly_checkpoints = pd.date_range(
    start="2026-01-05",
    end="2026-03-05",
    freq=pd.DateOffset(months=1)
)

activity_ts[activity_ts.index.normalize().isin(monthly_checkpoints.normalize())]

20. Daily Time Series

Daily events:

python

daily_events = activity_ts["event_id"].resample("D").count()
daily_events.head()

Daily revenue:

python

daily_revenue = activity_ts["amount"].resample("D").sum()
daily_revenue.head()

Daily active learners:

python

daily_active_learners = activity_ts["learner_id"].resample("D").nunique()
daily_active_learners.head()

Combine them:

python

daily_summary = pd.DataFrame({
    "events": daily_events,
    "revenue": daily_revenue,
    "active_learners": daily_active_learners
})

daily_summary.head()

21. `asfreq()` For Frequency Conversion

asfreq() changes the displayed frequency without aggregation.

python

daily_summary.asfreq("D").head(10)

Business-day frequency:

python

daily_summary.asfreq("B").head(10)

Month-start snapshots:

python

daily_summary.asfreq("MS")

Quarter-end snapshots:

python

daily_summary.asfreq("QE")

Use asfreq() when you want observations at a specific frequency.

Use resample() when you want aggregation into new time buckets.

22. Downsampling With `resample()`

Weekly summary:

python

weekly_summary = activity_ts.resample("W").agg(
    events=("event_id", "count"),
    active_learners=("learner_id", "nunique"),
    revenue=("amount", "sum"),
    avg_duration=("duration_minutes", "mean")
)

weekly_summary["avg_duration"] = weekly_summary["avg_duration"].round(2)
weekly_summary

Monthly summary:

python

monthly_summary = activity_ts.resample("MS").agg(
    events=("event_id", "count"),
    active_learners=("learner_id", "nunique"),
    revenue=("amount", "sum"),
    avg_duration=("duration_minutes", "mean")
)

monthly_summary["avg_duration"] = monthly_summary["avg_duration"].round(2)
monthly_summary

Downsampling means going from a more detailed frequency to a less detailed frequency.

Examples:

hourly to daily
daily to weekly
daily to monthly

23. Upsampling And Filling Gaps

Suppose you only care about days with revenue.

python

revenue_days = daily_revenue[daily_revenue > 0]
revenue_days

Upsample to a daily frequency:

python

revenue_days.asfreq("D").head(15)

Missing days appear because no revenue happened on those dates.

Fill missing values with zero:

python

revenue_days.asfreq("D", fill_value=0).head(15)

Forward-fill the last known value:

python

revenue_days.asfreq("D").ffill().head(15)

Use zero fill for event counts and revenue when missing means nothing happened.

Use forward fill when the last known state should continue.

24. Interpolation

Interpolation estimates missing numeric values.

Create a sparse learner activity series:

python

sparse_active = daily_active_learners[daily_active_learners > 0]
sparse_active.head()

Convert it to daily frequency:

python

sparse_daily = sparse_active.asfreq("D")
sparse_daily.head(12)

Interpolate:

python

sparse_daily.interpolate(method="time").head(12)

Use interpolation carefully.

It can be useful for sensor-like data or smooth trend lines, but it may be misleading for counts where missing means zero.

25. Rolling Window Metrics

Rolling average smooths noisy daily values.

python

daily_summary["revenue_7d_avg"] = (
    daily_summary["revenue"].rolling(window=7, min_periods=1).mean()
)

daily_summary[["revenue", "revenue_7d_avg"]].head(12)

Rolling active learner average:

python

daily_summary["active_learners_7d_avg"] = (
    daily_summary["active_learners"].rolling(window=7, min_periods=1).mean()
)

Rolling event standard deviation:

python

daily_summary["events_7d_std"] = (
    daily_summary["events"].rolling(window=7, min_periods=2).std()
)

Rolling windows are useful when daily data is volatile.

26. Exponentially Weighted Moving Average

EWM gives more weight to recent values.

python

daily_summary["revenue_ewm_7"] = (
    daily_summary["revenue"].ewm(span=7, adjust=False).mean()
)

daily_summary[["revenue", "revenue_7d_avg", "revenue_ewm_7"]].head(15)

Rolling mean treats all values in the window equally.

EWM reacts faster to recent changes.

27. Shift For Previous Values

Shift daily revenue by one day:

python

daily_summary["previous_day_revenue"] = daily_summary["revenue"].shift(1)

Daily revenue change:

python

daily_summary["revenue_change"] = (
    daily_summary["revenue"] - daily_summary["previous_day_revenue"]
)

Percentage change:

python

daily_summary["revenue_pct_change"] = (
    daily_summary["revenue"].pct_change() * 100
)

Inspect:

python

daily_summary[[
    "revenue",
    "previous_day_revenue",
    "revenue_change",
    "revenue_pct_change"
]].head(12)

Use shift() for lag features, comparisons, and time gaps.

28. Time Gap Per Learner

Find time between consecutive events for each learner.

Sort by learner and time:

python

activity_gap = activity.sort_values(["learner_id", "event_at"]).copy()

Use group-wise shift:

python

activity_gap["previous_event_at"] = (
    activity_gap.groupby("learner_id")["event_at"].shift(1)
)

Calculate the gap:

python

activity_gap["time_since_previous_event"] = (
    activity_gap["event_at"] - activity_gap["previous_event_at"]
)

Average gap by learner:

python

avg_gap_by_learner = (
    activity_gap
    .groupby(["learner_id", "clean_name"], as_index=False)
    .agg(avg_gap=("time_since_previous_event", "mean"))
)

avg_gap_by_learner.sort_values("avg_gap")

Convert to hours:

python

avg_gap_by_learner["avg_gap_hours"] = (
    avg_gap_by_learner["avg_gap"].dt.total_seconds() / 3600
).round(2)

avg_gap_by_learner[["learner_id", "clean_name", "avg_gap_hours"]]

This is useful for retention and habit analysis.

29. Certificate Delivery SLA With Timedelta

Parse delivery dates:

python

deliveries["requested_at"] = pd.to_datetime(deliveries["requested_at"], errors="coerce")
deliveries["delivered_at"] = pd.to_datetime(deliveries["delivered_at"], errors="coerce")

Calculate delivery time:

python

deliveries["delivery_time"] = deliveries["delivered_at"] - deliveries["requested_at"]
deliveries["delivery_hours"] = (
    deliveries["delivery_time"].dt.total_seconds() / 3600
).round(2)

Flag late deliveries:

python

deliveries["is_late"] = deliveries["delivery_hours"] > 48

Merge delivery records with learner and course details:

python

delivery_report = (
    deliveries
    .merge(learner_clean, how="left", on="learner_id", validate="many_to_one")
    .merge(courses, how="left", on="course_id", validate="many_to_one")
)

delivery_report[[
    "delivery_id",
    "clean_name",
    "course_title",
    "requested_at",
    "delivered_at",
    "delivery_hours",
    "is_late"
]]

SLA summary:

python

delivery_report.groupby("priority", as_index=False).agg(
    requests=("delivery_id", "count"),
    delivered=("delivered_at", "count"),
    avg_delivery_hours=("delivery_hours", "mean"),
    late_count=("is_late", "sum")
)

Create signup month:

python

learner_clean["signup_month"] = learner_clean["signup_at"].dt.to_period("M").astype(str)

Merge signup month into activity:

python

activity = activity.merge(
    learner_clean[["learner_id", "signup_month"]],
    how="left",
    on="learner_id"
)

Create event month period:

python

activity["event_month_period"] = activity["event_at"].dt.to_period("M").astype(str)

Cohort table:

python

cohort = activity.pivot_table(
    index="signup_month",
    columns="event_month_period",
    values="learner_id",
    aggfunc="nunique",
    fill_value=0
)

cohort

This shows how many learners from each signup month were active in each event month.

31. MultiIndex Summary Report

Build a MultiIndex report by track, difficulty, and month.

python

track_month_report = (
    activity
    .groupby(["track", "difficulty", "event_month_period"])
    .agg(
        events=("event_id", "count"),
        learners=("learner_id", "nunique"),
        revenue=("amount", "sum"),
        avg_duration=("duration_minutes", "mean")
    )
)

track_month_report

Select one track:

python

track_month_report.loc["Pandas"]

Unstack month into columns:

python

track_month_report["revenue"].unstack("event_month_period")

Reset index for export:

python

track_month_report.reset_index().head()

This connects GroupBy, MultiIndex, and reporting.

32. Final Executive Summary

Create one compact table for leadership.

python

executive_summary = pd.DataFrame({
    "metric": [
        "total_events",
        "valid_event_days",
        "unique_learners",
        "total_revenue",
        "average_session_minutes",
        "certificate_requests",
        "average_delivery_hours",
        "late_certificates"
    ],
    "value": [
        activity["event_id"].nunique(),
        activity["event_date"].nunique(),
        activity["learner_id"].nunique(),
        activity["amount"].sum(),
        round(activity["duration_minutes"].mean(), 2),
        deliveries["delivery_id"].nunique(),
        round(deliveries["delivery_hours"].mean(), 2),
        int(deliveries["is_late"].sum())
    ]
})

executive_summary

Create final reports dictionary:

python

reports = {
    "executive_summary": executive_summary,
    "course_revenue": course_revenue,
    "channel_report": channel_report,
    "weekly_summary": weekly_summary,
    "monthly_summary": monthly_summary,
    "delivery_report": delivery_report,
    "cohort": cohort,
}

In a real project, you could export these:

python

# executive_summary.to_csv("executive_summary.csv", index=False)
# course_revenue.to_csv("course_revenue.csv", index=False)
# weekly_summary.to_csv("weekly_summary.csv")

33. Case Study Checklist

Use this checklist for future Pandas projects:

Load all raw files.
Inspect shapes, columns, and data types.
Check missing values and duplicates.
Clean text columns.
Parse datetime columns with errors="coerce".
Inspect invalid dates.
Concatenate same-schema files.
Audit joins with indicator=True.
Merge dimensions with validate.
Create date features with .dt.
Create duration features with Timedelta.
Build high-level metrics.
Build groupby summaries.
Build pivot tables.
Set a DatetimeIndex for time series.
Use partial date slicing.
Use resample() for time buckets.
Use asfreq() for frequency conversion.
Use rolling and EWM metrics for trend smoothing.
Use shift() for previous values and time gaps.
Create final export-ready tables.
Document assumptions and data quality issues.

34. Practice Tasks

Try these tasks after completing the walkthrough.

Practice Lab

Task 1. Find the top 5 learners by total duration.

python

activity.groupby(["learner_id", "clean_name"], as_index=False)["duration_minutes"].sum() \
    .sort_values("duration_minutes", ascending=False) \
    .head(5)

Explanation

Groups learning activity records by learner ID and course name, then sums the total duration minutes for each combination
Sorts the aggregated results in descending order by total duration to rank learners by activity level
Returns only the top 5 records showing the most active learners based on cumulative time spent
Uses as_index=False parameter to maintain group columns as regular columns rather than index levels
The chain of operations efficiently transforms raw activity data into a ranked summary report of learner engagement

Practice Lab

Task 2. Find revenue by city and track.

python

activity.pivot_table(
    index="city_clean",
    columns="track",
    values="amount",
    aggfunc="sum",
    fill_value=0
)

Explanation

The code transforms a dataset into a pivot table format that groups transactions by city and track category
It uses the city names as row identifiers and track categories as column headers
The values represent the sum of transaction amounts for each city-track combination
Missing combinations are filled with zero instead of NaN values for cleaner data presentation
This structure enables easy comparison of spending patterns across different cities and music tracks

Practice Lab

Task 3. Find the busiest hour of the day.

python

activity.groupby("event_hour")["event_id"].count().sort_values(ascending=False).head(1)

Explanation

Groups activity data by hour of the event using the event_hour column
Counts the number of events occurring in each hour group
Sorts the counts in descending order to identify the highest frequency hours
Returns only the top result showing the hour with the most events
Useful for identifying peak activity periods in time-series data analysis

Practice Lab

Task 4. Find the average gap between purchases.

python

purchases = activity[activity["event_type"] == "purchase"].sort_values("event_at").copy()
purchases["previous_purchase_at"] = purchases["event_at"].shift(1)
purchases["purchase_gap"] = purchases["event_at"] - purchases["previous_purchase_at"]
purchases["purchase_gap"].mean()

Explanation

Filters purchase events from activity data and sorts them chronologically to establish purchase sequence
Creates a new column showing the timestamp of the previous purchase for each transaction using shift() function
Computes time differences between consecutive purchases by subtracting previous timestamps from current ones
Calculates and returns the mean gap duration between purchases to measure customer buying frequency patterns

Practice Lab

Task 5. Find weekly revenue growth.

python

weekly_revenue = activity_ts["amount"].resample("W").sum()
weekly_revenue_growth = weekly_revenue.pct_change() * 100
weekly_revenue_growth

Explanation

Resamples the time series data to weekly intervals and calculates the sum of amounts for each week
Computes the percentage change between consecutive weeks to determine revenue growth rate
Multiplies the percentage change by 100 to convert to percentage format for easier interpretation
The resulting series shows the weekly revenue growth percentages with NaN for the first week since there's no previous week to compare against

Practice Lab

Task 6. Find courses with engagement but no revenue.

python

course_summary = activity.groupby(["course_id", "course_title"], as_index=False).agg(
    events=("event_id", "count"),
    revenue=("amount", "sum")
)

course_summary.query("events > 0 and revenue == 0")

Explanation

Groups activity data by course_id and course_title to aggregate event counts and total revenue per course
Uses agg() function to count events and sum revenue for each course grouping
Applies query() method to filter results showing only courses with at least one event but zero revenue
Returns a filtered dataframe containing courses that had user interactions but no monetary transactions
Useful for identifying potentially problematic courses or analyzing incomplete transaction data

Practice Lab

python

active_ids = activity["learner_id"].dropna().unique()
learner_clean[~learner_clean["learner_id"].isin(active_ids)]

Explanation

Retrieves all unique learner IDs from the activity dataset while removing any null values
Creates a boolean mask to identify rows in learner_clean that do not have matching learner IDs in the active_ids set
Returns only the records from learner_clean where the learner_id is NOT present in the list of active learner IDs
This operation effectively filters out any learner records that appear in the activity data, leaving only inactive or non-active learners
The result represents the subset of learners who haven't participated in any activities within the dataset

Practice Lab

Task 8. Find late certificate deliveries by course.

python

delivery_report.groupby("course_title", as_index=False).agg(
    requests=("delivery_id", "count"),
    late_count=("is_late", "sum"),
    avg_delivery_hours=("delivery_hours", "mean")
).sort_values("late_count", ascending=False)

Explanation

Groups delivery records by course title to analyze performance metrics per course
Calculates three key statistics: total delivery requests, count of late deliveries, and average delivery time in hours
Uses aggregation functions to efficiently compute multiple metrics in a single operation
Sorts results by late delivery count in descending order to identify courses with the most delayed deliveries
Returns a clean dataframe showing course performance metrics for quick analysis and decision making

Practice Lab

Task 9. Build a daily dashboard with rolling averages.

python

daily_dashboard = daily_summary.copy()
daily_dashboard["events_7d_avg"] = daily_dashboard["events"].rolling(7, min_periods=1).mean()
daily_dashboard["revenue_7d_avg"] = daily_dashboard["revenue"].rolling(7, min_periods=1).mean()
daily_dashboard["active_learners_7d_avg"] = daily_dashboard["active_learners"].rolling(7, min_periods=1).mean()
daily_dashboard.tail()

Explanation

Creates a copy of the daily summary dataframe to avoid modifying the original data while building the dashboard
Applies a 7-day rolling window calculation with minimum periods set to 1 to handle incomplete windows at the start of the dataset
Computes three separate 7-day moving averages for events, revenue, and active learners metrics to smooth out short-term fluctuations
Displays the last few rows of the updated dashboard to preview the calculated rolling averages alongside original data

Practice Lab

Task 10. Build a month and channel revenue report.

python

activity.pivot_table(
    index="event_month_period",
    columns="channel",
    values="amount",
    aggfunc="sum",
    fill_value=0,
    margins=True
)

Explanation

Generates a cross-tabulated view showing total amounts grouped by event month periods and marketing channels
Uses sum aggregation to combine transaction values across matching combinations of month and channel
Sets missing values to zero using fill_value parameter for clean data presentation
Includes marginal totals (row and column sums) through the margins=True parameter for comprehensive analysis
Organizes data with event months as rows, channels as columns, and summed amounts as cell values

Layer	Pandas Tools
Ingest	`read_csv`, `concat`
Inspect	`shape`, `info`, `describe`, `isna`, `duplicated`
Clean	`.str`, `replace`, `astype`, `to_datetime`
Model	`merge`, `validate`, `indicator`
Feature Engineering	`.dt`, `Timedelta`, `transform`, `assign`
Analysis	`groupby`, `agg`, `pivot_table`, `query`
Time Series	`DatetimeIndex`, `asfreq`, `resample`, `rolling`, `ewm`, `shift`
Reporting	sorted summaries, pivot tables, export-ready DataFrames

If you remember only one thing, remember this:

Pandas is not a collection of random methods. It is a workflow: load, inspect, clean, combine, transform, analyze, validate, and report.

Official References

Pandas time series user guide: https://pandas.pydata.org/docs/user_guide/timeseries.html
Pandas windowing operations user guide: https://pandas.pydata.org/docs/user_guide/window.html
Pandas merging, joining, and concatenating user guide: https://pandas.pydata.org/docs/user_guide/merging.html
Pandas GroupBy user guide: https://pandas.pydata.org/docs/user_guide/groupby.html
Pandas text data user guide: https://pandas.pydata.org/docs/user_guide/text.html
Pandas reshaping and pivot tables user guide: https://pandas.pydata.org/docs/user_guide/reshaping.html

Pandas Time Series Case Study: Analyze Learner Data

AI Insights

Task 1. Find the top 5 learners by total duration.

Task 2. Find revenue by city and track.

Task 3. Find the busiest hour of the day.

Task 4. Find the average gap between purchases.

Task 5. Find weekly revenue growth.

Task 6. Find courses with engagement but no revenue.

Task 7. Find learners with no activity after signup.

Task 8. Find late certificate deliveries by course.

Task 9. Build a daily dashboard with rolling averages.

Task 10. Build a month and channel revenue report.