Pandas Time Series Case Study: Analyze Learner Data

Jun 6, 2026
78 min read

AI Insights

Powered by GPT-4o-mini

Verified Context: pandas-time-series-case-study-analyze-learner-data
Quick Answer

Build a complete Pandas time series case study with original learning-platform data. Load multiple CSVs, clean text, parse dates, concatenate logs, merge dimensions, audit data quality, use datetime indexes, calculate timedeltas, resample, interpolate, use rolling and EWM averages, shift values, build pivots, and create final reporting tables.

Quick Summary

Explore a comprehensive Pandas case study to analyze learner engagement, revenue, and data quality in a learning platform.

Pandas Time Series Case Study: Clean, Join, Resample, and Report

This is an end-to-end Pandas case study.

Instead of learning one method at a time, you will work through a realistic analytics problem from raw CSV files to final reporting tables.

The goal is to analyze a small learning platform and answer questions like:

  • How many learners are active each day?
  • Which courses generate the most paid revenue?
  • Which channels bring the best engagement?
  • How does weekly revenue change over time?
  • Which learners come back frequently?
  • How long does certificate delivery take?
  • Where are data quality problems hiding?
  • How do we build reports that a product or growth team can actually use?

This case study uses original sample datasets created for this lesson. It does not use copied stock, delivery, sports, passenger, finance, or public dataset examples.

Files used in this case study:

  • pandas_case_study_learners.csv
  • pandas_case_study_courses.csv
  • pandas_case_study_events_jan.csv
  • pandas_case_study_events_feb.csv
  • pandas_case_study_events_mar.csv
  • pandas_case_study_certificate_deliveries.csv

Place all CSV files in the same folder as this Markdown file before running the examples.

Case Study Scenario

You are helping a learning platform understand Q1 activity.

The platform has:

  • learner profiles
  • a course catalog
  • monthly event logs
  • certificate delivery records

Your job is to create a reliable analysis pipeline.

The final output should include:

  • a cleaned learner table
  • a cleaned event table
  • a merged activity table
  • revenue reports
  • engagement reports
  • time series summaries
  • rolling trend metrics
  • certificate delivery SLA metrics
  • data quality checks

What This Case Study Covers

This project revises many important Pandas skills:

  • read_csv
  • head, tail, shape, info, describe
  • column selection
  • boolean filtering
  • assign
  • query
  • sort_values
  • missing-value checks
  • duplicate checks
  • string cleaning with .str
  • datetime parsing with pd.to_datetime
  • Timestamp
  • Timedelta
  • DateOffset
  • concat
  • merge
  • indicator=True
  • validate
  • groupby
  • agg
  • named aggregation
  • transform
  • pivot_table
  • set_index
  • partial datetime slicing
  • .dt accessor
  • asfreq
  • resample
  • interpolation
  • rolling windows
  • exponentially weighted moving averages
  • shift
  • time gaps per user
  • pct_change
  • MultiIndex-style reports
  • final export-ready tables

1. Import Libraries

python
import pandas as pd
import numpy as np

Optional display settings:

python
pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)

2. Load The Raw Data

Load dimension tables:

python
learners = pd.read_csv("pandas_case_study_learners.csv")
courses = pd.read_csv("pandas_case_study_courses.csv")
deliveries = pd.read_csv("pandas_case_study_certificate_deliveries.csv")

Load monthly event logs:

python
jan = pd.read_csv("pandas_case_study_events_jan.csv")
feb = pd.read_csv("pandas_case_study_events_feb.csv")
mar = pd.read_csv("pandas_case_study_events_mar.csv")

Check shapes:

python
for name, df in {
    "learners": learners,
    "courses": courses,
    "deliveries": deliveries,
    "jan": jan,
    "feb": feb,
    "mar": mar,
}.items():
    print(name, df.shape)

Preview the tables:

python
learners.head()
python
courses.head()
python
jan.head()

3. Understand The Data Model

The raw data has two dimension tables:

  • learners: one row per learner
  • courses: one row per course

It has three event files:

  • jan: events in January
  • feb: events in February
  • mar: events in March

It also has an operations table:

  • deliveries: certificate request and delivery timestamps

Important keys:

TableKey Columns
learnerslearner_id
coursescourse_id
eventsevent_id, learner_id, course_id
deliveriesdelivery_id, learner_id, course_id

4. First Inspection

Use info() to check data types and missing values:

python
learners.info()
python
courses.info()
python
jan.info()

Check numeric summaries:

python
jan.describe(numeric_only=True)

Check text columns quickly:

python
learners.describe(include="object")

This inspection step tells you what needs cleaning before analysis.

5. Concatenate Monthly Event Logs

The three monthly event files have the same structure, so use pd.concat().

python
events = pd.concat([jan, feb, mar], ignore_index=True)
events.shape

If you want to remember the source month, use keys.

python
events_with_source = pd.concat(
    [jan, feb, mar],
    keys=["Jan", "Feb", "Mar"]
).reset_index(level=0).rename(columns={"level_0": "source_month"})

events_with_source.head()

For the rest of the case study, keep the clean sequential index:

python
events = pd.concat([jan, feb, mar], ignore_index=True)

6. Data Quality Checks

Check duplicate IDs:

python
learners["learner_id"].duplicated().sum()
python
courses["course_id"].duplicated().sum()
python
events["event_id"].duplicated().sum()

Check missing values:

python
learners.isna().sum()
python
events.isna().sum()

Check unexpected event types:

python
events["event_type"].value_counts(dropna=False)

Check amount values:

python
events.groupby("event_type")["amount"].describe()

Data quality is part of analysis. Do not skip it.

7. Clean Learner Text Columns

The learner table intentionally contains messy text.

Clean names:

python
learners["clean_name"] = learners["raw_name"].str.strip().str.title()

Clean email:

python
learners["email_clean"] = learners["email"].str.strip().str.lower()

Clean city and plan:

python
learners["city_clean"] = learners["city"].str.strip().str.title()
learners["plan_clean"] = learners["plan"].str.strip().str.lower()

Extract email domain:

python
learners["email_domain"] = learners["email_clean"].str.extract(r"@([^@\s]+)$")

Split campaign code:

python
learners[["campaign_track", "campaign_tier", "campaign_year"]] = (
    learners["campaign_code"].str.split("-", expand=True)
)

learners["campaign_year"] = learners["campaign_year"].astype(int)

Create a cleaned learner table:

python
learner_clean = learners[[
    "learner_id",
    "clean_name",
    "email_clean",
    "email_domain",
    "city_clean",
    "plan_clean",
    "campaign_track",
    "campaign_tier",
    "campaign_year",
    "signup_at",
    "referral_source"
]].copy()

learner_clean.head()

8. Parse Dates Safely

Convert learner signup dates:

python
learner_clean["signup_at"] = pd.to_datetime(
    learner_clean["signup_at"],
    errors="coerce"
)

Convert course publish dates:

python
courses["published_at"] = pd.to_datetime(
    courses["published_at"],
    errors="coerce"
)

Convert event timestamps:

python
events["event_at"] = pd.to_datetime(
    events["event_time"],
    errors="coerce"
)

Find invalid event timestamps:

python
events[events["event_at"].isna()]

The invalid rows are not automatically deleted. You inspect them first.

For time series analysis, create a valid-events table:

python
events_valid = events.dropna(subset=["event_at"]).copy()

9. Create Date Features

Use the .dt accessor.

python
events_valid["event_date"] = events_valid["event_at"].dt.date
events_valid["event_month"] = events_valid["event_at"].dt.month_name()
events_valid["event_week"] = events_valid["event_at"].dt.isocalendar().week.astype(int)
events_valid["event_day_name"] = events_valid["event_at"].dt.day_name()
events_valid["event_hour"] = events_valid["event_at"].dt.hour
events_valid["event_quarter"] = events_valid["event_at"].dt.quarter
events_valid["is_weekend"] = events_valid["event_at"].dt.dayofweek >= 5

Inspect:

python
events_valid[[
    "event_id",
    "event_at",
    "event_month",
    "event_day_name",
    "event_hour",
    "is_weekend"
]].head()

10. Convert Durations With Timedelta

Session duration is stored as seconds.

python
events_valid["duration"] = pd.to_timedelta(
    events_valid["duration_seconds"],
    unit="s"
)

Create minutes:

python
events_valid["duration_minutes"] = (
    events_valid["duration"].dt.total_seconds() / 60
).round(2)

Check duration by event type:

python
events_valid.groupby("event_type")["duration_minutes"].describe()

11. Audit Joins Before Merging

Events should match learner IDs and course IDs.

Audit learners:

python
learner_audit = events_valid.merge(
    learner_clean,
    how="left",
    on="learner_id",
    indicator=True
)

learner_audit["_merge"].value_counts()

Find event rows with missing learner profiles:

python
learner_audit[learner_audit["_merge"] == "left_only"][
    ["event_id", "learner_id", "event_time", "event_type"]
]

Audit courses:

python
course_audit = events_valid.merge(
    courses,
    how="left",
    on="course_id",
    indicator=True
)

course_audit["_merge"].value_counts()

Find event rows with missing course records:

python
course_audit[course_audit["_merge"] == "left_only"][
    ["event_id", "course_id", "event_time", "event_type"]
]

This is safer than blindly merging.

12. Build The Activity Fact Table

Merge events with learners and courses.

python
activity = (
    events_valid
    .merge(learner_clean, how="left", on="learner_id", validate="many_to_one")
    .merge(courses, how="left", on="course_id", validate="many_to_one")
)

activity.head()

Check the shape:

python
activity.shape

Select useful columns:

python
activity = activity[[
    "event_id",
    "event_at",
    "event_date",
    "event_month",
    "event_week",
    "event_day_name",
    "event_hour",
    "is_weekend",
    "learner_id",
    "clean_name",
    "city_clean",
    "plan_clean",
    "referral_source",
    "course_id",
    "course_title",
    "track",
    "difficulty",
    "mentor",
    "event_type",
    "device",
    "channel",
    "duration_minutes",
    "amount",
    "status"
]]

activity.head()

This is the main table for analysis.

13. Basic Filtering

Paid purchases:

python
paid_purchases = activity.query("event_type == 'purchase' and amount > 0")
paid_purchases.head()

Mobile events:

python
mobile_events = activity[activity["device"].eq("mobile")]
mobile_events.head()

Weekend events:

python
weekend_events = activity[activity["is_weekend"]]
weekend_events.head()

Long sessions:

python
long_sessions = activity[activity["duration_minutes"] >= 30]
long_sessions[["event_id", "clean_name", "event_type", "duration_minutes"]]

14. High-Level Metrics

Total events:

python
activity["event_id"].nunique()

Unique learners:

python
activity["learner_id"].nunique()

Total paid revenue:

python
activity["amount"].sum()

Average session duration:

python
activity["duration_minutes"].mean()

Events by type:

python
activity["event_type"].value_counts()

15. GroupBy Reports

Revenue by course:

python
course_revenue = (
    activity
    .groupby(["course_id", "course_title"], as_index=False)
    .agg(
        events=("event_id", "count"),
        learners=("learner_id", "nunique"),
        revenue=("amount", "sum"),
        avg_duration=("duration_minutes", "mean")
    )
    .sort_values("revenue", ascending=False)
)

course_revenue["avg_duration"] = course_revenue["avg_duration"].round(2)
course_revenue

Engagement by channel:

python
channel_report = (
    activity
    .groupby("channel", as_index=False)
    .agg(
        events=("event_id", "count"),
        learners=("learner_id", "nunique"),
        revenue=("amount", "sum"),
        avg_duration=("duration_minutes", "mean")
    )
)

channel_report["avg_duration"] = channel_report["avg_duration"].round(2)
channel_report.sort_values("revenue", ascending=False)

Track and plan summary:

python
track_plan_report = (
    activity
    .groupby(["track", "plan_clean"], as_index=False)
    .agg(
        learners=("learner_id", "nunique"),
        events=("event_id", "count"),
        revenue=("amount", "sum"),
        avg_duration=("duration_minutes", "mean")
    )
)

track_plan_report

16. Transform For Row-Level Context

Add each learner's total activity duration to every row.

python
activity["learner_total_minutes"] = (
    activity.groupby("learner_id")["duration_minutes"].transform("sum")
)

Add each course's total revenue to every row.

python
activity["course_total_revenue"] = (
    activity.groupby("course_id")["amount"].transform("sum")
)

Calculate each row's share of course revenue:

python
activity["course_revenue_share"] = np.where(
    activity["course_total_revenue"] > 0,
    activity["amount"] / activity["course_total_revenue"],
    0
)

Inspect:

python
activity[[
    "event_id",
    "course_title",
    "amount",
    "course_total_revenue",
    "course_revenue_share"
]].head()

17. Pivot Tables

Revenue by month and track:

python
activity.pivot_table(
    index="event_month",
    columns="track",
    values="amount",
    aggfunc="sum",
    fill_value=0
)

Events by device and event type:

python
activity.pivot_table(
    index="device",
    columns="event_type",
    values="event_id",
    aggfunc="count",
    fill_value=0
)

Average duration by track and device:

python
activity.pivot_table(
    index=["track", "difficulty"],
    columns="device",
    values="duration_minutes",
    aggfunc="mean",
    fill_value=0
).round(2)

Pivot tables are compact reporting tools.

18. Build A DatetimeIndex

Time series operations work best when the datetime column is the index.

python
activity_ts = activity.set_index("event_at").sort_index()
activity_ts.head()

Now you can use partial string selection.

Select all February events:

python
activity_ts.loc["2026-02"]

Select a date range:

python
activity_ts.loc["2026-02-10":"2026-03-05"]

Select one day:

python
activity_ts.loc["2026-03-01"]

Partial date indexing is one of the most convenient reasons to use a DatetimeIndex.

19. DateOffset For Calendar Jumps

Timedelta is for exact durations.

DateOffset understands calendar concepts like months and years.

python
base_date = pd.Timestamp("2026-01-31")
base_date + pd.Timedelta(days=30)
python
base_date + pd.DateOffset(months=1)

For business calendar logic, DateOffset is often clearer.

Find activity on the same day every month:

python
monthly_checkpoints = pd.date_range(
    start="2026-01-05",
    end="2026-03-05",
    freq=pd.DateOffset(months=1)
)

activity_ts[activity_ts.index.normalize().isin(monthly_checkpoints.normalize())]

20. Daily Time Series

Daily events:

python
daily_events = activity_ts["event_id"].resample("D").count()
daily_events.head()

Daily revenue:

python
daily_revenue = activity_ts["amount"].resample("D").sum()
daily_revenue.head()

Daily active learners:

python
daily_active_learners = activity_ts["learner_id"].resample("D").nunique()
daily_active_learners.head()

Combine them:

python
daily_summary = pd.DataFrame({
    "events": daily_events,
    "revenue": daily_revenue,
    "active_learners": daily_active_learners
})

daily_summary.head()

21. asfreq() For Frequency Conversion

asfreq() changes the displayed frequency without aggregation.

python
daily_summary.asfreq("D").head(10)

Business-day frequency:

python
daily_summary.asfreq("B").head(10)

Month-start snapshots:

python
daily_summary.asfreq("MS")

Quarter-end snapshots:

python
daily_summary.asfreq("QE")

Use asfreq() when you want observations at a specific frequency.

Use resample() when you want aggregation into new time buckets.

22. Downsampling With resample()

Weekly summary:

python
weekly_summary = activity_ts.resample("W").agg(
    events=("event_id", "count"),
    active_learners=("learner_id", "nunique"),
    revenue=("amount", "sum"),
    avg_duration=("duration_minutes", "mean")
)

weekly_summary["avg_duration"] = weekly_summary["avg_duration"].round(2)
weekly_summary

Monthly summary:

python
monthly_summary = activity_ts.resample("MS").agg(
    events=("event_id", "count"),
    active_learners=("learner_id", "nunique"),
    revenue=("amount", "sum"),
    avg_duration=("duration_minutes", "mean")
)

monthly_summary["avg_duration"] = monthly_summary["avg_duration"].round(2)
monthly_summary

Downsampling means going from a more detailed frequency to a less detailed frequency.

Examples:

  • hourly to daily
  • daily to weekly
  • daily to monthly

23. Upsampling And Filling Gaps

Suppose you only care about days with revenue.

python
revenue_days = daily_revenue[daily_revenue > 0]
revenue_days

Upsample to a daily frequency:

python
revenue_days.asfreq("D").head(15)

Missing days appear because no revenue happened on those dates.

Fill missing values with zero:

python
revenue_days.asfreq("D", fill_value=0).head(15)

Forward-fill the last known value:

python
revenue_days.asfreq("D").ffill().head(15)

Use zero fill for event counts and revenue when missing means nothing happened.

Use forward fill when the last known state should continue.

24. Interpolation

Interpolation estimates missing numeric values.

Create a sparse learner activity series:

python
sparse_active = daily_active_learners[daily_active_learners > 0]
sparse_active.head()

Convert it to daily frequency:

python
sparse_daily = sparse_active.asfreq("D")
sparse_daily.head(12)

Interpolate:

python
sparse_daily.interpolate(method="time").head(12)

Use interpolation carefully.

It can be useful for sensor-like data or smooth trend lines, but it may be misleading for counts where missing means zero.

25. Rolling Window Metrics

Rolling average smooths noisy daily values.

python
daily_summary["revenue_7d_avg"] = (
    daily_summary["revenue"].rolling(window=7, min_periods=1).mean()
)

daily_summary[["revenue", "revenue_7d_avg"]].head(12)

Rolling active learner average:

python
daily_summary["active_learners_7d_avg"] = (
    daily_summary["active_learners"].rolling(window=7, min_periods=1).mean()
)

Rolling event standard deviation:

python
daily_summary["events_7d_std"] = (
    daily_summary["events"].rolling(window=7, min_periods=2).std()
)

Rolling windows are useful when daily data is volatile.

26. Exponentially Weighted Moving Average

EWM gives more weight to recent values.

python
daily_summary["revenue_ewm_7"] = (
    daily_summary["revenue"].ewm(span=7, adjust=False).mean()
)

daily_summary[["revenue", "revenue_7d_avg", "revenue_ewm_7"]].head(15)

Rolling mean treats all values in the window equally.

EWM reacts faster to recent changes.

27. Shift For Previous Values

Shift daily revenue by one day:

python
daily_summary["previous_day_revenue"] = daily_summary["revenue"].shift(1)

Daily revenue change:

python
daily_summary["revenue_change"] = (
    daily_summary["revenue"] - daily_summary["previous_day_revenue"]
)

Percentage change:

python
daily_summary["revenue_pct_change"] = (
    daily_summary["revenue"].pct_change() * 100
)

Inspect:

python
daily_summary[[
    "revenue",
    "previous_day_revenue",
    "revenue_change",
    "revenue_pct_change"
]].head(12)

Use shift() for lag features, comparisons, and time gaps.

28. Time Gap Per Learner

Find time between consecutive events for each learner.

Sort by learner and time:

python
activity_gap = activity.sort_values(["learner_id", "event_at"]).copy()

Use group-wise shift:

python
activity_gap["previous_event_at"] = (
    activity_gap.groupby("learner_id")["event_at"].shift(1)
)

Calculate the gap:

python
activity_gap["time_since_previous_event"] = (
    activity_gap["event_at"] - activity_gap["previous_event_at"]
)

Average gap by learner:

python
avg_gap_by_learner = (
    activity_gap
    .groupby(["learner_id", "clean_name"], as_index=False)
    .agg(avg_gap=("time_since_previous_event", "mean"))
)

avg_gap_by_learner.sort_values("avg_gap")

Convert to hours:

python
avg_gap_by_learner["avg_gap_hours"] = (
    avg_gap_by_learner["avg_gap"].dt.total_seconds() / 3600
).round(2)

avg_gap_by_learner[["learner_id", "clean_name", "avg_gap_hours"]]

This is useful for retention and habit analysis.

29. Certificate Delivery SLA With Timedelta

Parse delivery dates:

python
deliveries["requested_at"] = pd.to_datetime(deliveries["requested_at"], errors="coerce")
deliveries["delivered_at"] = pd.to_datetime(deliveries["delivered_at"], errors="coerce")

Calculate delivery time:

python
deliveries["delivery_time"] = deliveries["delivered_at"] - deliveries["requested_at"]
deliveries["delivery_hours"] = (
    deliveries["delivery_time"].dt.total_seconds() / 3600
).round(2)

Flag late deliveries:

python
deliveries["is_late"] = deliveries["delivery_hours"] > 48

Merge delivery records with learner and course details:

python
delivery_report = (
    deliveries
    .merge(learner_clean, how="left", on="learner_id", validate="many_to_one")
    .merge(courses, how="left", on="course_id", validate="many_to_one")
)

delivery_report[[
    "delivery_id",
    "clean_name",
    "course_title",
    "requested_at",
    "delivered_at",
    "delivery_hours",
    "is_late"
]]

SLA summary:

python
delivery_report.groupby("priority", as_index=False).agg(
    requests=("delivery_id", "count"),
    delivered=("delivered_at", "count"),
    avg_delivery_hours=("delivery_hours", "mean"),
    late_count=("is_late", "sum")
)

30. Cohort Analysis By Signup Month

Create signup month:

python
learner_clean["signup_month"] = learner_clean["signup_at"].dt.to_period("M").astype(str)

Merge signup month into activity:

python
activity = activity.merge(
    learner_clean[["learner_id", "signup_month"]],
    how="left",
    on="learner_id"
)

Create event month period:

python
activity["event_month_period"] = activity["event_at"].dt.to_period("M").astype(str)

Cohort table:

python
cohort = activity.pivot_table(
    index="signup_month",
    columns="event_month_period",
    values="learner_id",
    aggfunc="nunique",
    fill_value=0
)

cohort

This shows how many learners from each signup month were active in each event month.

31. MultiIndex Summary Report

Build a MultiIndex report by track, difficulty, and month.

python
track_month_report = (
    activity
    .groupby(["track", "difficulty", "event_month_period"])
    .agg(
        events=("event_id", "count"),
        learners=("learner_id", "nunique"),
        revenue=("amount", "sum"),
        avg_duration=("duration_minutes", "mean")
    )
)

track_month_report

Select one track:

python
track_month_report.loc["Pandas"]

Unstack month into columns:

python
track_month_report["revenue"].unstack("event_month_period")

Reset index for export:

python
track_month_report.reset_index().head()

This connects GroupBy, MultiIndex, and reporting.

32. Final Executive Summary

Create one compact table for leadership.

python
executive_summary = pd.DataFrame({
    "metric": [
        "total_events",
        "valid_event_days",
        "unique_learners",
        "total_revenue",
        "average_session_minutes",
        "certificate_requests",
        "average_delivery_hours",
        "late_certificates"
    ],
    "value": [
        activity["event_id"].nunique(),
        activity["event_date"].nunique(),
        activity["learner_id"].nunique(),
        activity["amount"].sum(),
        round(activity["duration_minutes"].mean(), 2),
        deliveries["delivery_id"].nunique(),
        round(deliveries["delivery_hours"].mean(), 2),
        int(deliveries["is_late"].sum())
    ]
})

executive_summary

Create final reports dictionary:

python
reports = {
    "executive_summary": executive_summary,
    "course_revenue": course_revenue,
    "channel_report": channel_report,
    "weekly_summary": weekly_summary,
    "monthly_summary": monthly_summary,
    "delivery_report": delivery_report,
    "cohort": cohort,
}

In a real project, you could export these:

python
# executive_summary.to_csv("executive_summary.csv", index=False)
# course_revenue.to_csv("course_revenue.csv", index=False)
# weekly_summary.to_csv("weekly_summary.csv")

33. Case Study Checklist

Use this checklist for future Pandas projects:

  1. Load all raw files.
  2. Inspect shapes, columns, and data types.
  3. Check missing values and duplicates.
  4. Clean text columns.
  5. Parse datetime columns with errors="coerce".
  6. Inspect invalid dates.
  7. Concatenate same-schema files.
  8. Audit joins with indicator=True.
  9. Merge dimensions with validate.
  10. Create date features with .dt.
  11. Create duration features with Timedelta.
  12. Build high-level metrics.
  13. Build groupby summaries.
  14. Build pivot tables.
  15. Set a DatetimeIndex for time series.
  16. Use partial date slicing.
  17. Use resample() for time buckets.
  18. Use asfreq() for frequency conversion.
  19. Use rolling and EWM metrics for trend smoothing.
  20. Use shift() for previous values and time gaps.
  21. Create final export-ready tables.
  22. Document assumptions and data quality issues.

34. Practice Tasks

Try these tasks after completing the walkthrough.

Practice Lab

Task 1. Find the top 5 learners by total duration.

python
activity.groupby(["learner_id", "clean_name"], as_index=False)["duration_minutes"].sum() \
    .sort_values("duration_minutes", ascending=False) \
    .head(5)

Explanation

  • Groups learning activity records by learner ID and course name, then sums the total duration minutes for each combination
  • Sorts the aggregated results in descending order by total duration to rank learners by activity level
  • Returns only the top 5 records showing the most active learners based on cumulative time spent
  • Uses as_index=False parameter to maintain group columns as regular columns rather than index levels
  • The chain of operations efficiently transforms raw activity data into a ranked summary report of learner engagement

Practice Lab

Task 2. Find revenue by city and track.

python
activity.pivot_table(
    index="city_clean",
    columns="track",
    values="amount",
    aggfunc="sum",
    fill_value=0
)

Explanation

  • The code transforms a dataset into a pivot table format that groups transactions by city and track category
  • It uses the city names as row identifiers and track categories as column headers
  • The values represent the sum of transaction amounts for each city-track combination
  • Missing combinations are filled with zero instead of NaN values for cleaner data presentation
  • This structure enables easy comparison of spending patterns across different cities and music tracks

Practice Lab

Task 3. Find the busiest hour of the day.

python
activity.groupby("event_hour")["event_id"].count().sort_values(ascending=False).head(1)

Explanation

  • Groups activity data by hour of the event using the event_hour column
  • Counts the number of events occurring in each hour group
  • Sorts the counts in descending order to identify the highest frequency hours
  • Returns only the top result showing the hour with the most events
  • Useful for identifying peak activity periods in time-series data analysis

Practice Lab

Task 4. Find the average gap between purchases.

python
purchases = activity[activity["event_type"] == "purchase"].sort_values("event_at").copy()
purchases["previous_purchase_at"] = purchases["event_at"].shift(1)
purchases["purchase_gap"] = purchases["event_at"] - purchases["previous_purchase_at"]
purchases["purchase_gap"].mean()

Explanation

  • Filters purchase events from activity data and sorts them chronologically to establish purchase sequence
  • Creates a new column showing the timestamp of the previous purchase for each transaction using shift() function
  • Computes time differences between consecutive purchases by subtracting previous timestamps from current ones
  • Calculates and returns the mean gap duration between purchases to measure customer buying frequency patterns

Practice Lab

Task 5. Find weekly revenue growth.

python
weekly_revenue = activity_ts["amount"].resample("W").sum()
weekly_revenue_growth = weekly_revenue.pct_change() * 100
weekly_revenue_growth

Explanation

  • Resamples the time series data to weekly intervals and calculates the sum of amounts for each week
  • Computes the percentage change between consecutive weeks to determine revenue growth rate
  • Multiplies the percentage change by 100 to convert to percentage format for easier interpretation
  • The resulting series shows the weekly revenue growth percentages with NaN for the first week since there's no previous week to compare against

Practice Lab

Task 6. Find courses with engagement but no revenue.

python
course_summary = activity.groupby(["course_id", "course_title"], as_index=False).agg(
    events=("event_id", "count"),
    revenue=("amount", "sum")
)

course_summary.query("events > 0 and revenue == 0")

Explanation

  • Groups activity data by course_id and course_title to aggregate event counts and total revenue per course
  • Uses agg() function to count events and sum revenue for each course grouping
  • Applies query() method to filter results showing only courses with at least one event but zero revenue
  • Returns a filtered dataframe containing courses that had user interactions but no monetary transactions
  • Useful for identifying potentially problematic courses or analyzing incomplete transaction data

Practice Lab

Task 7. Find learners with no activity after signup.

python
active_ids = activity["learner_id"].dropna().unique()
learner_clean[~learner_clean["learner_id"].isin(active_ids)]

Explanation

  • Retrieves all unique learner IDs from the activity dataset while removing any null values
  • Creates a boolean mask to identify rows in learner_clean that do not have matching learner IDs in the active_ids set
  • Returns only the records from learner_clean where the learner_id is NOT present in the list of active learner IDs
  • This operation effectively filters out any learner records that appear in the activity data, leaving only inactive or non-active learners
  • The result represents the subset of learners who haven't participated in any activities within the dataset

Practice Lab

Task 8. Find late certificate deliveries by course.

python
delivery_report.groupby("course_title", as_index=False).agg(
    requests=("delivery_id", "count"),
    late_count=("is_late", "sum"),
    avg_delivery_hours=("delivery_hours", "mean")
).sort_values("late_count", ascending=False)

Explanation

  • Groups delivery records by course title to analyze performance metrics per course
  • Calculates three key statistics: total delivery requests, count of late deliveries, and average delivery time in hours
  • Uses aggregation functions to efficiently compute multiple metrics in a single operation
  • Sorts results by late delivery count in descending order to identify courses with the most delayed deliveries
  • Returns a clean dataframe showing course performance metrics for quick analysis and decision making

Practice Lab

Task 9. Build a daily dashboard with rolling averages.

python
daily_dashboard = daily_summary.copy()
daily_dashboard["events_7d_avg"] = daily_dashboard["events"].rolling(7, min_periods=1).mean()
daily_dashboard["revenue_7d_avg"] = daily_dashboard["revenue"].rolling(7, min_periods=1).mean()
daily_dashboard["active_learners_7d_avg"] = daily_dashboard["active_learners"].rolling(7, min_periods=1).mean()
daily_dashboard.tail()

Explanation

  • Creates a copy of the daily summary dataframe to avoid modifying the original data while building the dashboard
  • Applies a 7-day rolling window calculation with minimum periods set to 1 to handle incomplete windows at the start of the dataset
  • Computes three separate 7-day moving averages for events, revenue, and active learners metrics to smooth out short-term fluctuations
  • Displays the last few rows of the updated dashboard to preview the calculated rolling averages alongside original data

Practice Lab

Task 10. Build a month and channel revenue report.

python
activity.pivot_table(
    index="event_month_period",
    columns="channel",
    values="amount",
    aggfunc="sum",
    fill_value=0,
    margins=True
)

Explanation

  • Generates a cross-tabulated view showing total amounts grouped by event month periods and marketing channels
  • Uses sum aggregation to combine transaction values across matching combinations of month and channel
  • Sets missing values to zero using fill_value parameter for clean data presentation
  • Includes marginal totals (row and column sums) through the margins=True parameter for comprehensive analysis
  • Organizes data with event months as rows, channels as columns, and summed amounts as cell values

35. Interview Questions From This Case Study

1. Why should you use errors="coerce" with pd.to_datetime()?

It prevents invalid date strings from crashing the pipeline and turns them into NaT, which can be inspected and handled.

2. What is the difference between asfreq() and resample()?

asfreq() changes the frequency view without aggregation. resample() groups time-indexed data into new time buckets and applies aggregations.

3. What is downsampling?

Downsampling means moving from a more detailed frequency to a less detailed one, such as daily data to weekly or monthly summaries.

4. What is upsampling?

Upsampling means moving from a less detailed frequency to a more detailed one, such as weekly data to daily data.

5. Why use a DatetimeIndex?

It enables partial date slicing, time-aware resampling, frequency conversion, rolling windows, and other time series operations.

6. What does shift() help with?

It helps compare each row with a previous or future row, such as previous-day revenue or previous login time.

7. Why use group-wise shift()?

It calculates previous values within each group, such as the previous event for each learner.

8. What is the purpose of rolling averages?

Rolling averages smooth noisy time series data by averaging values over a moving window.

9. How is EWM different from a rolling mean?

EWM gives more weight to recent observations, while a simple rolling mean gives equal weight to all observations in the window.

10. Why audit joins with indicator=True?

It reveals unmatched rows before or after a merge, helping catch missing dimension records or bad IDs.

11. Why use validate in merge()?

It checks that the join relationship matches your expectation, such as many events to one learner.

12. What is a Timedelta?

A Timedelta represents a duration between two moments, such as certificate delivery time or time between learner sessions.

13. Why use named aggregation?

It creates readable summary tables with clear output column names.

14. Why use pivot tables?

Pivot tables quickly summarize data across two or more dimensions, such as month by track or channel by event type.

15. What makes a case study different from isolated examples?

A case study forces you to connect loading, cleaning, joining, time handling, analysis, validation, and reporting into one workflow.

36. Final Mental Model

For a real Pandas analytics project, think in layers:

LayerPandas Tools
Ingestread_csv, concat
Inspectshape, info, describe, isna, duplicated
Clean.str, replace, astype, to_datetime
Modelmerge, validate, indicator
Feature Engineering.dt, Timedelta, transform, assign
Analysisgroupby, agg, pivot_table, query
Time SeriesDatetimeIndex, asfreq, resample, rolling, ewm, shift
Reportingsorted summaries, pivot tables, export-ready DataFrames

If you remember only one thing, remember this:

Pandas is not a collection of random methods. It is a workflow: load, inspect, clean, combine, transform, analyze, validate, and report.

Official References