Pandas Time Series Case Study: Clean, Join, Resample, and Report
This is an end-to-end Pandas case study.
Instead of learning one method at a time, you will work through a realistic analytics problem from raw CSV files to final reporting tables.
The goal is to analyze a small learning platform and answer questions like:
- How many learners are active each day?
- Which courses generate the most paid revenue?
- Which channels bring the best engagement?
- How does weekly revenue change over time?
- Which learners come back frequently?
- How long does certificate delivery take?
- Where are data quality problems hiding?
- How do we build reports that a product or growth team can actually use?
This case study uses original sample datasets created for this lesson. It does not use copied stock, delivery, sports, passenger, finance, or public dataset examples.
Files used in this case study:
pandas_case_study_learners.csvpandas_case_study_courses.csvpandas_case_study_events_jan.csvpandas_case_study_events_feb.csvpandas_case_study_events_mar.csvpandas_case_study_certificate_deliveries.csv
Place all CSV files in the same folder as this Markdown file before running the examples.
Case Study Scenario
You are helping a learning platform understand Q1 activity.
The platform has:
- learner profiles
- a course catalog
- monthly event logs
- certificate delivery records
Your job is to create a reliable analysis pipeline.
The final output should include:
- a cleaned learner table
- a cleaned event table
- a merged activity table
- revenue reports
- engagement reports
- time series summaries
- rolling trend metrics
- certificate delivery SLA metrics
- data quality checks
What This Case Study Covers
This project revises many important Pandas skills:
read_csvhead,tail,shape,info,describe- column selection
- boolean filtering
assignquerysort_values- missing-value checks
- duplicate checks
- string cleaning with
.str - datetime parsing with
pd.to_datetime TimestampTimedeltaDateOffsetconcatmergeindicator=Truevalidategroupbyagg- named aggregation
transformpivot_tableset_index- partial datetime slicing
.dtaccessorasfreqresample- interpolation
- rolling windows
- exponentially weighted moving averages
shift- time gaps per user
pct_change- MultiIndex-style reports
- final export-ready tables
1. Import Libraries
import pandas as pd
import numpy as npOptional display settings:
pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)2. Load The Raw Data
Load dimension tables:
learners = pd.read_csv("pandas_case_study_learners.csv")
courses = pd.read_csv("pandas_case_study_courses.csv")
deliveries = pd.read_csv("pandas_case_study_certificate_deliveries.csv")Load monthly event logs:
jan = pd.read_csv("pandas_case_study_events_jan.csv")
feb = pd.read_csv("pandas_case_study_events_feb.csv")
mar = pd.read_csv("pandas_case_study_events_mar.csv")Check shapes:
for name, df in {
"learners": learners,
"courses": courses,
"deliveries": deliveries,
"jan": jan,
"feb": feb,
"mar": mar,
}.items():
print(name, df.shape)Preview the tables:
learners.head()courses.head()jan.head()3. Understand The Data Model
The raw data has two dimension tables:
learners: one row per learnercourses: one row per course
It has three event files:
jan: events in Januaryfeb: events in Februarymar: events in March
It also has an operations table:
deliveries: certificate request and delivery timestamps
Important keys:
| Table | Key Columns |
|---|---|
| learners | learner_id |
| courses | course_id |
| events | event_id, learner_id, course_id |
| deliveries | delivery_id, learner_id, course_id |
4. First Inspection
Use info() to check data types and missing values:
learners.info()courses.info()jan.info()Check numeric summaries:
jan.describe(numeric_only=True)Check text columns quickly:
learners.describe(include="object")This inspection step tells you what needs cleaning before analysis.
5. Concatenate Monthly Event Logs
The three monthly event files have the same structure, so use pd.concat().
events = pd.concat([jan, feb, mar], ignore_index=True)
events.shapeIf you want to remember the source month, use keys.
events_with_source = pd.concat(
[jan, feb, mar],
keys=["Jan", "Feb", "Mar"]
).reset_index(level=0).rename(columns={"level_0": "source_month"})
events_with_source.head()For the rest of the case study, keep the clean sequential index:
events = pd.concat([jan, feb, mar], ignore_index=True)6. Data Quality Checks
Check duplicate IDs:
learners["learner_id"].duplicated().sum()courses["course_id"].duplicated().sum()events["event_id"].duplicated().sum()Check missing values:
learners.isna().sum()events.isna().sum()Check unexpected event types:
events["event_type"].value_counts(dropna=False)Check amount values:
events.groupby("event_type")["amount"].describe()Data quality is part of analysis. Do not skip it.
7. Clean Learner Text Columns
The learner table intentionally contains messy text.
Clean names:
learners["clean_name"] = learners["raw_name"].str.strip().str.title()Clean email:
learners["email_clean"] = learners["email"].str.strip().str.lower()Clean city and plan:
learners["city_clean"] = learners["city"].str.strip().str.title()
learners["plan_clean"] = learners["plan"].str.strip().str.lower()Extract email domain:
learners["email_domain"] = learners["email_clean"].str.extract(r"@([^@\s]+)$")Split campaign code:
learners[["campaign_track", "campaign_tier", "campaign_year"]] = (
learners["campaign_code"].str.split("-", expand=True)
)
learners["campaign_year"] = learners["campaign_year"].astype(int)Create a cleaned learner table:
learner_clean = learners[[
"learner_id",
"clean_name",
"email_clean",
"email_domain",
"city_clean",
"plan_clean",
"campaign_track",
"campaign_tier",
"campaign_year",
"signup_at",
"referral_source"
]].copy()
learner_clean.head()8. Parse Dates Safely
Convert learner signup dates:
learner_clean["signup_at"] = pd.to_datetime(
learner_clean["signup_at"],
errors="coerce"
)Convert course publish dates:
courses["published_at"] = pd.to_datetime(
courses["published_at"],
errors="coerce"
)Convert event timestamps:
events["event_at"] = pd.to_datetime(
events["event_time"],
errors="coerce"
)Find invalid event timestamps:
events[events["event_at"].isna()]The invalid rows are not automatically deleted. You inspect them first.
For time series analysis, create a valid-events table:
events_valid = events.dropna(subset=["event_at"]).copy()9. Create Date Features
Use the .dt accessor.
events_valid["event_date"] = events_valid["event_at"].dt.date
events_valid["event_month"] = events_valid["event_at"].dt.month_name()
events_valid["event_week"] = events_valid["event_at"].dt.isocalendar().week.astype(int)
events_valid["event_day_name"] = events_valid["event_at"].dt.day_name()
events_valid["event_hour"] = events_valid["event_at"].dt.hour
events_valid["event_quarter"] = events_valid["event_at"].dt.quarter
events_valid["is_weekend"] = events_valid["event_at"].dt.dayofweek >= 5Inspect:
events_valid[[
"event_id",
"event_at",
"event_month",
"event_day_name",
"event_hour",
"is_weekend"
]].head()10. Convert Durations With Timedelta
Session duration is stored as seconds.
events_valid["duration"] = pd.to_timedelta(
events_valid["duration_seconds"],
unit="s"
)Create minutes:
events_valid["duration_minutes"] = (
events_valid["duration"].dt.total_seconds() / 60
).round(2)Check duration by event type:
events_valid.groupby("event_type")["duration_minutes"].describe()11. Audit Joins Before Merging
Events should match learner IDs and course IDs.
Audit learners:
learner_audit = events_valid.merge(
learner_clean,
how="left",
on="learner_id",
indicator=True
)
learner_audit["_merge"].value_counts()Find event rows with missing learner profiles:
learner_audit[learner_audit["_merge"] == "left_only"][
["event_id", "learner_id", "event_time", "event_type"]
]Audit courses:
course_audit = events_valid.merge(
courses,
how="left",
on="course_id",
indicator=True
)
course_audit["_merge"].value_counts()Find event rows with missing course records:
course_audit[course_audit["_merge"] == "left_only"][
["event_id", "course_id", "event_time", "event_type"]
]This is safer than blindly merging.
12. Build The Activity Fact Table
Merge events with learners and courses.
activity = (
events_valid
.merge(learner_clean, how="left", on="learner_id", validate="many_to_one")
.merge(courses, how="left", on="course_id", validate="many_to_one")
)
activity.head()Check the shape:
activity.shapeSelect useful columns:
activity = activity[[
"event_id",
"event_at",
"event_date",
"event_month",
"event_week",
"event_day_name",
"event_hour",
"is_weekend",
"learner_id",
"clean_name",
"city_clean",
"plan_clean",
"referral_source",
"course_id",
"course_title",
"track",
"difficulty",
"mentor",
"event_type",
"device",
"channel",
"duration_minutes",
"amount",
"status"
]]
activity.head()This is the main table for analysis.
13. Basic Filtering
Paid purchases:
paid_purchases = activity.query("event_type == 'purchase' and amount > 0")
paid_purchases.head()Mobile events:
mobile_events = activity[activity["device"].eq("mobile")]
mobile_events.head()Weekend events:
weekend_events = activity[activity["is_weekend"]]
weekend_events.head()Long sessions:
long_sessions = activity[activity["duration_minutes"] >= 30]
long_sessions[["event_id", "clean_name", "event_type", "duration_minutes"]]14. High-Level Metrics
Total events:
activity["event_id"].nunique()Unique learners:
activity["learner_id"].nunique()Total paid revenue:
activity["amount"].sum()Average session duration:
activity["duration_minutes"].mean()Events by type:
activity["event_type"].value_counts()15. GroupBy Reports
Revenue by course:
course_revenue = (
activity
.groupby(["course_id", "course_title"], as_index=False)
.agg(
events=("event_id", "count"),
learners=("learner_id", "nunique"),
revenue=("amount", "sum"),
avg_duration=("duration_minutes", "mean")
)
.sort_values("revenue", ascending=False)
)
course_revenue["avg_duration"] = course_revenue["avg_duration"].round(2)
course_revenueEngagement by channel:
channel_report = (
activity
.groupby("channel", as_index=False)
.agg(
events=("event_id", "count"),
learners=("learner_id", "nunique"),
revenue=("amount", "sum"),
avg_duration=("duration_minutes", "mean")
)
)
channel_report["avg_duration"] = channel_report["avg_duration"].round(2)
channel_report.sort_values("revenue", ascending=False)Track and plan summary:
track_plan_report = (
activity
.groupby(["track", "plan_clean"], as_index=False)
.agg(
learners=("learner_id", "nunique"),
events=("event_id", "count"),
revenue=("amount", "sum"),
avg_duration=("duration_minutes", "mean")
)
)
track_plan_report16. Transform For Row-Level Context
Add each learner's total activity duration to every row.
activity["learner_total_minutes"] = (
activity.groupby("learner_id")["duration_minutes"].transform("sum")
)Add each course's total revenue to every row.
activity["course_total_revenue"] = (
activity.groupby("course_id")["amount"].transform("sum")
)Calculate each row's share of course revenue:
activity["course_revenue_share"] = np.where(
activity["course_total_revenue"] > 0,
activity["amount"] / activity["course_total_revenue"],
0
)Inspect:
activity[[
"event_id",
"course_title",
"amount",
"course_total_revenue",
"course_revenue_share"
]].head()17. Pivot Tables
Revenue by month and track:
activity.pivot_table(
index="event_month",
columns="track",
values="amount",
aggfunc="sum",
fill_value=0
)Events by device and event type:
activity.pivot_table(
index="device",
columns="event_type",
values="event_id",
aggfunc="count",
fill_value=0
)Average duration by track and device:
activity.pivot_table(
index=["track", "difficulty"],
columns="device",
values="duration_minutes",
aggfunc="mean",
fill_value=0
).round(2)Pivot tables are compact reporting tools.
18. Build A DatetimeIndex
Time series operations work best when the datetime column is the index.
activity_ts = activity.set_index("event_at").sort_index()
activity_ts.head()Now you can use partial string selection.
Select all February events:
activity_ts.loc["2026-02"]Select a date range:
activity_ts.loc["2026-02-10":"2026-03-05"]Select one day:
activity_ts.loc["2026-03-01"]Partial date indexing is one of the most convenient reasons to use a DatetimeIndex.
19. DateOffset For Calendar Jumps
Timedelta is for exact durations.
DateOffset understands calendar concepts like months and years.
base_date = pd.Timestamp("2026-01-31")
base_date + pd.Timedelta(days=30)base_date + pd.DateOffset(months=1)For business calendar logic, DateOffset is often clearer.
Find activity on the same day every month:
monthly_checkpoints = pd.date_range(
start="2026-01-05",
end="2026-03-05",
freq=pd.DateOffset(months=1)
)
activity_ts[activity_ts.index.normalize().isin(monthly_checkpoints.normalize())]20. Daily Time Series
Daily events:
daily_events = activity_ts["event_id"].resample("D").count()
daily_events.head()Daily revenue:
daily_revenue = activity_ts["amount"].resample("D").sum()
daily_revenue.head()Daily active learners:
daily_active_learners = activity_ts["learner_id"].resample("D").nunique()
daily_active_learners.head()Combine them:
daily_summary = pd.DataFrame({
"events": daily_events,
"revenue": daily_revenue,
"active_learners": daily_active_learners
})
daily_summary.head()21. asfreq() For Frequency Conversion
asfreq() changes the displayed frequency without aggregation.
daily_summary.asfreq("D").head(10)Business-day frequency:
daily_summary.asfreq("B").head(10)Month-start snapshots:
daily_summary.asfreq("MS")Quarter-end snapshots:
daily_summary.asfreq("QE")Use asfreq() when you want observations at a specific frequency.
Use resample() when you want aggregation into new time buckets.
22. Downsampling With resample()
Weekly summary:
weekly_summary = activity_ts.resample("W").agg(
events=("event_id", "count"),
active_learners=("learner_id", "nunique"),
revenue=("amount", "sum"),
avg_duration=("duration_minutes", "mean")
)
weekly_summary["avg_duration"] = weekly_summary["avg_duration"].round(2)
weekly_summaryMonthly summary:
monthly_summary = activity_ts.resample("MS").agg(
events=("event_id", "count"),
active_learners=("learner_id", "nunique"),
revenue=("amount", "sum"),
avg_duration=("duration_minutes", "mean")
)
monthly_summary["avg_duration"] = monthly_summary["avg_duration"].round(2)
monthly_summaryDownsampling means going from a more detailed frequency to a less detailed frequency.
Examples:
- hourly to daily
- daily to weekly
- daily to monthly
23. Upsampling And Filling Gaps
Suppose you only care about days with revenue.
revenue_days = daily_revenue[daily_revenue > 0]
revenue_daysUpsample to a daily frequency:
revenue_days.asfreq("D").head(15)Missing days appear because no revenue happened on those dates.
Fill missing values with zero:
revenue_days.asfreq("D", fill_value=0).head(15)Forward-fill the last known value:
revenue_days.asfreq("D").ffill().head(15)Use zero fill for event counts and revenue when missing means nothing happened.
Use forward fill when the last known state should continue.
24. Interpolation
Interpolation estimates missing numeric values.
Create a sparse learner activity series:
sparse_active = daily_active_learners[daily_active_learners > 0]
sparse_active.head()Convert it to daily frequency:
sparse_daily = sparse_active.asfreq("D")
sparse_daily.head(12)Interpolate:
sparse_daily.interpolate(method="time").head(12)Use interpolation carefully.
It can be useful for sensor-like data or smooth trend lines, but it may be misleading for counts where missing means zero.
25. Rolling Window Metrics
Rolling average smooths noisy daily values.
daily_summary["revenue_7d_avg"] = (
daily_summary["revenue"].rolling(window=7, min_periods=1).mean()
)
daily_summary[["revenue", "revenue_7d_avg"]].head(12)Rolling active learner average:
daily_summary["active_learners_7d_avg"] = (
daily_summary["active_learners"].rolling(window=7, min_periods=1).mean()
)Rolling event standard deviation:
daily_summary["events_7d_std"] = (
daily_summary["events"].rolling(window=7, min_periods=2).std()
)Rolling windows are useful when daily data is volatile.
26. Exponentially Weighted Moving Average
EWM gives more weight to recent values.
daily_summary["revenue_ewm_7"] = (
daily_summary["revenue"].ewm(span=7, adjust=False).mean()
)
daily_summary[["revenue", "revenue_7d_avg", "revenue_ewm_7"]].head(15)Rolling mean treats all values in the window equally.
EWM reacts faster to recent changes.
27. Shift For Previous Values
Shift daily revenue by one day:
daily_summary["previous_day_revenue"] = daily_summary["revenue"].shift(1)Daily revenue change:
daily_summary["revenue_change"] = (
daily_summary["revenue"] - daily_summary["previous_day_revenue"]
)Percentage change:
daily_summary["revenue_pct_change"] = (
daily_summary["revenue"].pct_change() * 100
)Inspect:
daily_summary[[
"revenue",
"previous_day_revenue",
"revenue_change",
"revenue_pct_change"
]].head(12)Use shift() for lag features, comparisons, and time gaps.
28. Time Gap Per Learner
Find time between consecutive events for each learner.
Sort by learner and time:
activity_gap = activity.sort_values(["learner_id", "event_at"]).copy()Use group-wise shift:
activity_gap["previous_event_at"] = (
activity_gap.groupby("learner_id")["event_at"].shift(1)
)Calculate the gap:
activity_gap["time_since_previous_event"] = (
activity_gap["event_at"] - activity_gap["previous_event_at"]
)Average gap by learner:
avg_gap_by_learner = (
activity_gap
.groupby(["learner_id", "clean_name"], as_index=False)
.agg(avg_gap=("time_since_previous_event", "mean"))
)
avg_gap_by_learner.sort_values("avg_gap")Convert to hours:
avg_gap_by_learner["avg_gap_hours"] = (
avg_gap_by_learner["avg_gap"].dt.total_seconds() / 3600
).round(2)
avg_gap_by_learner[["learner_id", "clean_name", "avg_gap_hours"]]This is useful for retention and habit analysis.
29. Certificate Delivery SLA With Timedelta
Parse delivery dates:
deliveries["requested_at"] = pd.to_datetime(deliveries["requested_at"], errors="coerce")
deliveries["delivered_at"] = pd.to_datetime(deliveries["delivered_at"], errors="coerce")Calculate delivery time:
deliveries["delivery_time"] = deliveries["delivered_at"] - deliveries["requested_at"]
deliveries["delivery_hours"] = (
deliveries["delivery_time"].dt.total_seconds() / 3600
).round(2)Flag late deliveries:
deliveries["is_late"] = deliveries["delivery_hours"] > 48Merge delivery records with learner and course details:
delivery_report = (
deliveries
.merge(learner_clean, how="left", on="learner_id", validate="many_to_one")
.merge(courses, how="left", on="course_id", validate="many_to_one")
)
delivery_report[[
"delivery_id",
"clean_name",
"course_title",
"requested_at",
"delivered_at",
"delivery_hours",
"is_late"
]]SLA summary:
delivery_report.groupby("priority", as_index=False).agg(
requests=("delivery_id", "count"),
delivered=("delivered_at", "count"),
avg_delivery_hours=("delivery_hours", "mean"),
late_count=("is_late", "sum")
)30. Cohort Analysis By Signup Month
Create signup month:
learner_clean["signup_month"] = learner_clean["signup_at"].dt.to_period("M").astype(str)Merge signup month into activity:
activity = activity.merge(
learner_clean[["learner_id", "signup_month"]],
how="left",
on="learner_id"
)Create event month period:
activity["event_month_period"] = activity["event_at"].dt.to_period("M").astype(str)Cohort table:
cohort = activity.pivot_table(
index="signup_month",
columns="event_month_period",
values="learner_id",
aggfunc="nunique",
fill_value=0
)
cohortThis shows how many learners from each signup month were active in each event month.
31. MultiIndex Summary Report
Build a MultiIndex report by track, difficulty, and month.
track_month_report = (
activity
.groupby(["track", "difficulty", "event_month_period"])
.agg(
events=("event_id", "count"),
learners=("learner_id", "nunique"),
revenue=("amount", "sum"),
avg_duration=("duration_minutes", "mean")
)
)
track_month_reportSelect one track:
track_month_report.loc["Pandas"]Unstack month into columns:
track_month_report["revenue"].unstack("event_month_period")Reset index for export:
track_month_report.reset_index().head()This connects GroupBy, MultiIndex, and reporting.
32. Final Executive Summary
Create one compact table for leadership.
executive_summary = pd.DataFrame({
"metric": [
"total_events",
"valid_event_days",
"unique_learners",
"total_revenue",
"average_session_minutes",
"certificate_requests",
"average_delivery_hours",
"late_certificates"
],
"value": [
activity["event_id"].nunique(),
activity["event_date"].nunique(),
activity["learner_id"].nunique(),
activity["amount"].sum(),
round(activity["duration_minutes"].mean(), 2),
deliveries["delivery_id"].nunique(),
round(deliveries["delivery_hours"].mean(), 2),
int(deliveries["is_late"].sum())
]
})
executive_summaryCreate final reports dictionary:
reports = {
"executive_summary": executive_summary,
"course_revenue": course_revenue,
"channel_report": channel_report,
"weekly_summary": weekly_summary,
"monthly_summary": monthly_summary,
"delivery_report": delivery_report,
"cohort": cohort,
}In a real project, you could export these:
# executive_summary.to_csv("executive_summary.csv", index=False)
# course_revenue.to_csv("course_revenue.csv", index=False)
# weekly_summary.to_csv("weekly_summary.csv")33. Case Study Checklist
Use this checklist for future Pandas projects:
- Load all raw files.
- Inspect shapes, columns, and data types.
- Check missing values and duplicates.
- Clean text columns.
- Parse datetime columns with
errors="coerce". - Inspect invalid dates.
- Concatenate same-schema files.
- Audit joins with
indicator=True. - Merge dimensions with
validate. - Create date features with
.dt. - Create duration features with
Timedelta. - Build high-level metrics.
- Build
groupbysummaries. - Build pivot tables.
- Set a
DatetimeIndexfor time series. - Use partial date slicing.
- Use
resample()for time buckets. - Use
asfreq()for frequency conversion. - Use rolling and EWM metrics for trend smoothing.
- Use
shift()for previous values and time gaps. - Create final export-ready tables.
- Document assumptions and data quality issues.
34. Practice Tasks
Try these tasks after completing the walkthrough.
Practice Lab
Task 1. Find the top 5 learners by total duration.
activity.groupby(["learner_id", "clean_name"], as_index=False)["duration_minutes"].sum() \
.sort_values("duration_minutes", ascending=False) \
.head(5)Explanation
- Groups learning activity records by learner ID and course name, then sums the total duration minutes for each combination
- Sorts the aggregated results in descending order by total duration to rank learners by activity level
- Returns only the top 5 records showing the most active learners based on cumulative time spent
- Uses as_index=False parameter to maintain group columns as regular columns rather than index levels
- The chain of operations efficiently transforms raw activity data into a ranked summary report of learner engagement
Practice Lab
Task 2. Find revenue by city and track.
activity.pivot_table(
index="city_clean",
columns="track",
values="amount",
aggfunc="sum",
fill_value=0
)Explanation
- The code transforms a dataset into a pivot table format that groups transactions by city and track category
- It uses the city names as row identifiers and track categories as column headers
- The values represent the sum of transaction amounts for each city-track combination
- Missing combinations are filled with zero instead of NaN values for cleaner data presentation
- This structure enables easy comparison of spending patterns across different cities and music tracks
Practice Lab
Task 3. Find the busiest hour of the day.
activity.groupby("event_hour")["event_id"].count().sort_values(ascending=False).head(1)Explanation
- Groups activity data by hour of the event using the event_hour column
- Counts the number of events occurring in each hour group
- Sorts the counts in descending order to identify the highest frequency hours
- Returns only the top result showing the hour with the most events
- Useful for identifying peak activity periods in time-series data analysis
Practice Lab
Task 4. Find the average gap between purchases.
purchases = activity[activity["event_type"] == "purchase"].sort_values("event_at").copy()
purchases["previous_purchase_at"] = purchases["event_at"].shift(1)
purchases["purchase_gap"] = purchases["event_at"] - purchases["previous_purchase_at"]
purchases["purchase_gap"].mean()Explanation
- Filters purchase events from activity data and sorts them chronologically to establish purchase sequence
- Creates a new column showing the timestamp of the previous purchase for each transaction using shift() function
- Computes time differences between consecutive purchases by subtracting previous timestamps from current ones
- Calculates and returns the mean gap duration between purchases to measure customer buying frequency patterns
Practice Lab
Task 5. Find weekly revenue growth.
weekly_revenue = activity_ts["amount"].resample("W").sum()
weekly_revenue_growth = weekly_revenue.pct_change() * 100
weekly_revenue_growthExplanation
- Resamples the time series data to weekly intervals and calculates the sum of amounts for each week
- Computes the percentage change between consecutive weeks to determine revenue growth rate
- Multiplies the percentage change by 100 to convert to percentage format for easier interpretation
- The resulting series shows the weekly revenue growth percentages with NaN for the first week since there's no previous week to compare against
Practice Lab
Task 6. Find courses with engagement but no revenue.
course_summary = activity.groupby(["course_id", "course_title"], as_index=False).agg(
events=("event_id", "count"),
revenue=("amount", "sum")
)
course_summary.query("events > 0 and revenue == 0")Explanation
- Groups activity data by course_id and course_title to aggregate event counts and total revenue per course
- Uses agg() function to count events and sum revenue for each course grouping
- Applies query() method to filter results showing only courses with at least one event but zero revenue
- Returns a filtered dataframe containing courses that had user interactions but no monetary transactions
- Useful for identifying potentially problematic courses or analyzing incomplete transaction data
Practice Lab
Task 7. Find learners with no activity after signup.
active_ids = activity["learner_id"].dropna().unique()
learner_clean[~learner_clean["learner_id"].isin(active_ids)]Explanation
- Retrieves all unique learner IDs from the activity dataset while removing any null values
- Creates a boolean mask to identify rows in learner_clean that do not have matching learner IDs in the active_ids set
- Returns only the records from learner_clean where the learner_id is NOT present in the list of active learner IDs
- This operation effectively filters out any learner records that appear in the activity data, leaving only inactive or non-active learners
- The result represents the subset of learners who haven't participated in any activities within the dataset
Practice Lab
Task 8. Find late certificate deliveries by course.
delivery_report.groupby("course_title", as_index=False).agg(
requests=("delivery_id", "count"),
late_count=("is_late", "sum"),
avg_delivery_hours=("delivery_hours", "mean")
).sort_values("late_count", ascending=False)Explanation
- Groups delivery records by course title to analyze performance metrics per course
- Calculates three key statistics: total delivery requests, count of late deliveries, and average delivery time in hours
- Uses aggregation functions to efficiently compute multiple metrics in a single operation
- Sorts results by late delivery count in descending order to identify courses with the most delayed deliveries
- Returns a clean dataframe showing course performance metrics for quick analysis and decision making
Practice Lab
Task 9. Build a daily dashboard with rolling averages.
daily_dashboard = daily_summary.copy()
daily_dashboard["events_7d_avg"] = daily_dashboard["events"].rolling(7, min_periods=1).mean()
daily_dashboard["revenue_7d_avg"] = daily_dashboard["revenue"].rolling(7, min_periods=1).mean()
daily_dashboard["active_learners_7d_avg"] = daily_dashboard["active_learners"].rolling(7, min_periods=1).mean()
daily_dashboard.tail()Explanation
- Creates a copy of the daily summary dataframe to avoid modifying the original data while building the dashboard
- Applies a 7-day rolling window calculation with minimum periods set to 1 to handle incomplete windows at the start of the dataset
- Computes three separate 7-day moving averages for events, revenue, and active learners metrics to smooth out short-term fluctuations
- Displays the last few rows of the updated dashboard to preview the calculated rolling averages alongside original data
Practice Lab
Task 10. Build a month and channel revenue report.
activity.pivot_table(
index="event_month_period",
columns="channel",
values="amount",
aggfunc="sum",
fill_value=0,
margins=True
)Explanation
- Generates a cross-tabulated view showing total amounts grouped by event month periods and marketing channels
- Uses sum aggregation to combine transaction values across matching combinations of month and channel
- Sets missing values to zero using fill_value parameter for clean data presentation
- Includes marginal totals (row and column sums) through the margins=True parameter for comprehensive analysis
- Organizes data with event months as rows, channels as columns, and summed amounts as cell values
35. Interview Questions From This Case Study
1. Why should you use errors="coerce" with pd.to_datetime()?
It prevents invalid date strings from crashing the pipeline and turns them into NaT, which can be inspected and handled.
2. What is the difference between asfreq() and resample()?
asfreq() changes the frequency view without aggregation. resample() groups time-indexed data into new time buckets and applies aggregations.
3. What is downsampling?
Downsampling means moving from a more detailed frequency to a less detailed one, such as daily data to weekly or monthly summaries.
4. What is upsampling?
Upsampling means moving from a less detailed frequency to a more detailed one, such as weekly data to daily data.
5. Why use a DatetimeIndex?
It enables partial date slicing, time-aware resampling, frequency conversion, rolling windows, and other time series operations.
6. What does shift() help with?
It helps compare each row with a previous or future row, such as previous-day revenue or previous login time.
7. Why use group-wise shift()?
It calculates previous values within each group, such as the previous event for each learner.
8. What is the purpose of rolling averages?
Rolling averages smooth noisy time series data by averaging values over a moving window.
9. How is EWM different from a rolling mean?
EWM gives more weight to recent observations, while a simple rolling mean gives equal weight to all observations in the window.
10. Why audit joins with indicator=True?
It reveals unmatched rows before or after a merge, helping catch missing dimension records or bad IDs.
11. Why use validate in merge()?
It checks that the join relationship matches your expectation, such as many events to one learner.
12. What is a Timedelta?
A Timedelta represents a duration between two moments, such as certificate delivery time or time between learner sessions.
13. Why use named aggregation?
It creates readable summary tables with clear output column names.
14. Why use pivot tables?
Pivot tables quickly summarize data across two or more dimensions, such as month by track or channel by event type.
15. What makes a case study different from isolated examples?
A case study forces you to connect loading, cleaning, joining, time handling, analysis, validation, and reporting into one workflow.
36. Final Mental Model
For a real Pandas analytics project, think in layers:
| Layer | Pandas Tools |
|---|---|
| Ingest | read_csv, concat |
| Inspect | shape, info, describe, isna, duplicated |
| Clean | .str, replace, astype, to_datetime |
| Model | merge, validate, indicator |
| Feature Engineering | .dt, Timedelta, transform, assign |
| Analysis | groupby, agg, pivot_table, query |
| Time Series | DatetimeIndex, asfreq, resample, rolling, ewm, shift |
| Reporting | sorted summaries, pivot tables, export-ready DataFrames |
If you remember only one thing, remember this:
Pandas is not a collection of random methods. It is a workflow: load, inspect, clean, combine, transform, analyze, validate, and report.
Official References
- Pandas time series user guide: https://pandas.pydata.org/docs/user_guide/timeseries.html
- Pandas windowing operations user guide: https://pandas.pydata.org/docs/user_guide/window.html
- Pandas merging, joining, and concatenating user guide: https://pandas.pydata.org/docs/user_guide/merging.html
- Pandas GroupBy user guide: https://pandas.pydata.org/docs/user_guide/groupby.html
- Pandas text data user guide: https://pandas.pydata.org/docs/user_guide/text.html
- Pandas reshaping and pivot tables user guide: https://pandas.pydata.org/docs/user_guide/reshaping.html
