# Pandas Time Series Case Study: Analyzing Learner Engagement URL: https://madhudadi.in/blog/posts/pandas-time-series-case-study-analyze-learner-data Published: 2026-06-06 Tags: Pandas, python Read time: 78 min Difficulty: advanced > Build a complete Pandas time series case study with original learning-platform data. Load multiple CSVs, clean text, parse dates, concatenate logs, merge dimensions, audit data quality, use datetime indexes, calculate timedeltas, resample, interpolate, use rolling and EWM averages, shift values, build pivots, and create final reporting tables.# Pandas Time Series Case Study: Clean, Join, Resample, and Report This is an end-to-end Pandas case study. Instead of learning one method at a time, you will work through a realistic analytics problem from raw CSV files to final reporting tables. The goal is to analyze a small learning platform and answer questions like: - How many learners are active each day? - Which courses generate the most paid revenue? - Which channels bring the best engagement? - How does weekly revenue change over time? - Which learners come back frequently? - How long does certificate delivery take? - Where are data quality problems hiding? - How do we build reports that a product or growth team can actually use? This case study uses original sample datasets created for this lesson. It does not use copied stock, delivery, sports, passenger, finance, or public dataset examples. Files used in this case study: - `pandas_case_study_learners.csv` - `pandas_case_study_courses.csv` - `pandas_case_study_events_jan.csv` - `pandas_case_study_events_feb.csv` - `pandas_case_study_events_mar.csv` - `pandas_case_study_certificate_deliveries.csv` Place all CSV files in the same folder as this Markdown file before running the examples. ## Case Study Scenario You are helping a learning platform understand Q1 activity. The platform has: - learner profiles - a course catalog - monthly event logs - certificate delivery records Your job is to create a reliable analysis pipeline. The final output should include: - a cleaned learner table - a cleaned event table - a merged activity table - revenue reports - engagement reports - time series summaries - rolling trend metrics - certificate delivery SLA metrics - data quality checks ## What This Case Study Covers This project revises many important Pandas skills: - `read_csv` - `head`, `tail`, `shape`, `info`, `describe` - column selection - boolean filtering - `assign` - `query` - `sort_values` - missing-value checks - duplicate checks - string cleaning with `.str` - datetime parsing with `pd.to_datetime` - `Timestamp` - `Timedelta` - `DateOffset` - `concat` - `merge` - `indicator=True` - `validate` - `groupby` - `agg` - named aggregation - `transform` - `pivot_table` - `set_index` - partial datetime slicing - `.dt` accessor - `asfreq` - `resample` - interpolation - rolling windows - exponentially weighted moving averages - `shift` - time gaps per user - `pct_change` - MultiIndex-style reports - final export-ready tables ## 1. Import Libraries ```python import pandas as pd import numpy as np ``` Optional display settings: ```python pd.set_option("display.max_columns", 100) pd.set_option("display.width", 120) ``` ## 2. Load The Raw Data Load dimension tables: ```python learners = pd.read_csv("pandas_case_study_learners.csv") courses = pd.read_csv("pandas_case_study_courses.csv") deliveries = pd.read_csv("pandas_case_study_certificate_deliveries.csv") ``` Load monthly event logs: ```python jan = pd.read_csv("pandas_case_study_events_jan.csv") feb = pd.read_csv("pandas_case_study_events_feb.csv") mar = pd.read_csv("pandas_case_study_events_mar.csv") ``` Check shapes: ```python for name, df in { "learners": learners, "courses": courses, "deliveries": deliveries, "jan": jan, "feb": feb, "mar": mar, }.items(): print(name, df.shape) ``` Preview the tables: ```python learners.head() ``` ```python courses.head() ``` ```python jan.head() ``` ## 3. Understand The Data Model The raw data has two dimension tables: - `learners`: one row per learner - `courses`: one row per course It has three event files: - `jan`: events in January - `feb`: events in February - `mar`: events in March It also has an operations table: - `deliveries`: certificate request and delivery timestamps Important keys: | Table | Key Columns | |---|---| | learners | `learner_id` | | courses | `course_id` | | events | `event_id`, `learner_id`, `course_id` | | deliveries | `delivery_id`, `learner_id`, `course_id` | ## 4. First Inspection Use `info()` to check data types and missing values: ```python learners.info() ``` ```python courses.info() ``` ```python jan.info() ``` Check numeric summaries: ```python jan.describe(numeric_only=True) ``` Check text columns quickly: ```python learners.describe(include="object") ``` This inspection step tells you what needs cleaning before analysis. ## 5. Concatenate Monthly Event Logs The three monthly event files have the same structure, so use `pd.concat()`. ```python events = pd.concat([jan, feb, mar], ignore_index=True) events.shape ``` If you want to remember the source month, use `keys`. ```python events_with_source = pd.concat( [jan, feb, mar], keys=["Jan", "Feb", "Mar"] ).reset_index(level=0).rename(columns={"level_0": "source_month"}) events_with_source.head() ``` For the rest of the case study, keep the clean sequential index: ```python events = pd.concat([jan, feb, mar], ignore_index=True) ``` ## 6. Data Quality Checks Check duplicate IDs: ```python learners["learner_id"].duplicated().sum() ``` ```python courses["course_id"].duplicated().sum() ``` ```python events["event_id"].duplicated().sum() ``` Check missing values: ```python learners.isna().sum() ``` ```python events.isna().sum() ``` Check unexpected event types: ```python events["event_type"].value_counts(dropna=False) ``` Check amount values: ```python events.groupby("event_type")["amount"].describe() ``` Data quality is part of analysis. Do not skip it. ## 7. Clean Learner Text Columns The learner table intentionally contains messy text. Clean names: ```python learners["clean_name"] = learners["raw_name"].str.strip().str.title() ``` Clean email: ```python learners["email_clean"] = learners["email"].str.strip().str.lower() ``` Clean city and plan: ```python learners["city_clean"] = learners["city"].str.strip().str.title() learners["plan_clean"] = learners["plan"].str.strip().str.lower() ``` Extract email domain: ```python learners["email_domain"] = learners["email_clean"].str.extract(r"@([^@\s]+)$") ``` Split campaign code: ```python learners[["campaign_track", "campaign_tier", "campaign_year"]] = ( learners["campaign_code"].str.split("-", expand=True) ) learners["campaign_year"] = learners["campaign_year"].astype(int) ``` Create a cleaned learner table: ```python learner_clean = learners[[ "learner_id", "clean_name", "email_clean", "email_domain", "city_clean", "plan_clean", "campaign_track", "campaign_tier", "campaign_year", "signup_at", "referral_source" ]].copy() learner_clean.head() ``` ## 8. Parse Dates Safely Convert learner signup dates: ```python learner_clean["signup_at"] = pd.to_datetime( learner_clean["signup_at"], errors="coerce" ) ``` Convert course publish dates: ```python courses["published_at"] = pd.to_datetime( courses["published_at"], errors="coerce" ) ``` Convert event timestamps: ```python events["event_at"] = pd.to_datetime( events["event_time"], errors="coerce" ) ``` Find invalid event timestamps: ```python events[events["event_at"].isna()] ``` The invalid rows are not automatically deleted. You inspect them first. For time series analysis, create a valid-events table: ```python events_valid = events.dropna(subset=["event_at"]).copy() ``` ## 9. Create Date Features Use the `.dt` accessor. ```python events_valid["event_date"] = events_valid["event_at"].dt.date events_valid["event_month"] = events_valid["event_at"].dt.month_name() events_valid["event_week"] = events_valid["event_at"].dt.isocalendar().week.astype(int) events_valid["event_day_name"] = events_valid["event_at"].dt.day_name() events_valid["event_hour"] = events_valid["event_at"].dt.hour events_valid["event_quarter"] = events_valid["event_at"].dt.quarter events_valid["is_weekend"] = events_valid["event_at"].dt.dayofweek >= 5 ``` Inspect: ```python events_valid[[ "event_id", "event_at", "event_month", "event_day_name", "event_hour", "is_weekend" ]].head() ``` ## 10. Convert Durations With Timedelta Session duration is stored as seconds. ```python events_valid["duration"] = pd.to_timedelta( events_valid["duration_seconds"], unit="s" ) ``` Create minutes: ```python events_valid["duration_minutes"] = ( events_valid["duration"].dt.total_seconds() / 60 ).round(2) ``` Check duration by event type: ```python events_valid.groupby("event_type")["duration_minutes"].describe() ``` ## 11. Audit Joins Before Merging Events should match learner IDs and course IDs. Audit learners: ```python learner_audit = events_valid.merge( learner_clean, how="left", on="learner_id", indicator=True ) learner_audit["_merge"].value_counts() ``` Find event rows with missing learner profiles: ```python learner_audit[learner_audit["_merge"] == "left_only"][ ["event_id", "learner_id", "event_time", "event_type"] ] ``` Audit courses: ```python course_audit = events_valid.merge( courses, how="left", on="course_id", indicator=True ) course_audit["_merge"].value_counts() ``` Find event rows with missing course records: ```python course_audit[course_audit["_merge"] == "left_only"][ ["event_id", "course_id", "event_time", "event_type"] ] ``` This is safer than blindly merging. ## 12. Build The Activity Fact Table Merge events with learners and courses. ```python activity = ( events_valid .merge(learner_clean, how="left", on="learner_id", validate="many_to_one") .merge(courses, how="left", on="course_id", validate="many_to_one") ) activity.head() ``` Check the shape: ```python activity.shape ``` Select useful columns: ```python activity = activity[[ "event_id", "event_at", "event_date", "event_month", "event_week", "event_day_name", "event_hour", "is_weekend", "learner_id", "clean_name", "city_clean", "plan_clean", "referral_source", "course_id", "course_title", "track", "difficulty", "mentor", "event_type", "device", "channel", "duration_minutes", "amount", "status" ]] activity.head() ``` This is the main table for analysis. ## 13. Basic Filtering Paid purchases: ```python paid_purchases = activity.query("event_type == 'purchase' and amount > 0") paid_purchases.head() ``` Mobile events: ```python mobile_events = activity[activity["device"].eq("mobile")] mobile_events.head() ``` Weekend events: ```python weekend_events = activity[activity["is_weekend"]] weekend_events.head() ``` Long sessions: ```python long_sessions = activity[activity["duration_minutes"] >= 30] long_sessions[["event_id", "clean_name", "event_type", "duration_minutes"]] ``` ## 14. High-Level Metrics Total events: ```python activity["event_id"].nunique() ``` Unique learners: ```python activity["learner_id"].nunique() ``` Total paid revenue: ```python activity["amount"].sum() ``` Average session duration: ```python activity["duration_minutes"].mean() ``` Events by type: ```python activity["event_type"].value_counts() ``` ## 15. GroupBy Reports Revenue by course: ```python course_revenue = ( activity .groupby(["course_id", "course_title"], as_index=False) .agg( events=("event_id", "count"), learners=("learner_id", "nunique"), revenue=("amount", "sum"), avg_duration=("duration_minutes", "mean") ) .sort_values("revenue", ascending=False) ) course_revenue["avg_duration"] = course_revenue["avg_duration"].round(2) course_revenue ``` Engagement by channel: ```python channel_report = ( activity .groupby("channel", as_index=False) .agg( events=("event_id", "count"), learners=("learner_id", "nunique"), revenue=("amount", "sum"), avg_duration=("duration_minutes", "mean") ) ) channel_report["avg_duration"] = channel_report["avg_duration"].round(2) channel_report.sort_values("revenue", ascending=False) ``` Track and plan summary: ```python track_plan_report = ( activity .groupby(["track", "plan_clean"], as_index=False) .agg( learners=("learner_id", "nunique"), events=("event_id", "count"), revenue=("amount", "sum"), avg_duration=("duration_minutes", "mean") ) ) track_plan_report ``` ## 16. Transform For Row-Level Context Add each learner's total activity duration to every row. ```python activity["learner_total_minutes"] = ( activity.groupby("learner_id")["duration_minutes"].transform("sum") ) ``` Add each course's total revenue to every row. ```python activity["course_total_revenue"] = ( activity.groupby("course_id")["amount"].transform("sum") ) ``` Calculate each row's share of course revenue: ```python activity["course_revenue_share"] = np.where( activity["course_total_revenue"] > 0, activity["amount"] / activity["course_total_revenue"], 0 ) ``` Inspect: ```python activity[[ "event_id", "course_title", "amount", "course_total_revenue", "course_revenue_share" ]].head() ``` ## 17. Pivot Tables Revenue by month and track: ```python activity.pivot_table( index="event_month", columns="track", values="amount", aggfunc="sum", fill_value=0 ) ``` Events by device and event type: ```python activity.pivot_table( index="device", columns="event_type", values="event_id", aggfunc="count", fill_value=0 ) ``` Average duration by track and device: ```python activity.pivot_table( index=["track", "difficulty"], columns="device", values="duration_minutes", aggfunc="mean", fill_value=0 ).round(2) ``` Pivot tables are compact reporting tools. ## 18. Build A DatetimeIndex Time series operations work best when the datetime column is the index. ```python activity_ts = activity.set_index("event_at").sort_index() activity_ts.head() ``` Now you can use partial string selection. Select all February events: ```python activity_ts.loc["2026-02"] ``` Select a date range: ```python activity_ts.loc["2026-02-10":"2026-03-05"] ``` Select one day: ```python activity_ts.loc["2026-03-01"] ``` Partial date indexing is one of the most convenient reasons to use a `DatetimeIndex`. ## 19. DateOffset For Calendar Jumps Timedelta is for exact durations. DateOffset understands calendar concepts like months and years. ```python base_date = pd.Timestamp("2026-01-31") base_date + pd.Timedelta(days=30) ``` ```python base_date + pd.DateOffset(months=1) ``` For business calendar logic, `DateOffset` is often clearer. Find activity on the same day every month: ```python monthly_checkpoints = pd.date_range( start="2026-01-05", end="2026-03-05", freq=pd.DateOffset(months=1) ) activity_ts[activity_ts.index.normalize().isin(monthly_checkpoints.normalize())] ``` ## 20. Daily Time Series Daily events: ```python daily_events = activity_ts["event_id"].resample("D").count() daily_events.head() ``` Daily revenue: ```python daily_revenue = activity_ts["amount"].resample("D").sum() daily_revenue.head() ``` Daily active learners: ```python daily_active_learners = activity_ts["learner_id"].resample("D").nunique() daily_active_learners.head() ``` Combine them: ```python daily_summary = pd.DataFrame({ "events": daily_events, "revenue": daily_revenue, "active_learners": daily_active_learners }) daily_summary.head() ``` ## 21. `asfreq()` For Frequency Conversion `asfreq()` changes the displayed frequency without aggregation. ```python daily_summary.asfreq("D").head(10) ``` Business-day frequency: ```python daily_summary.asfreq("B").head(10) ``` Month-start snapshots: ```python daily_summary.asfreq("MS") ``` Quarter-end snapshots: ```python daily_summary.asfreq("QE") ``` Use `asfreq()` when you want observations at a specific frequency. Use `resample()` when you want aggregation into new time buckets. ## 22. Downsampling With `resample()` Weekly summary: ```python weekly_summary = activity_ts.resample("W").agg( events=("event_id", "count"), active_learners=("learner_id", "nunique"), revenue=("amount", "sum"), avg_duration=("duration_minutes", "mean") ) weekly_summary["avg_duration"] = weekly_summary["avg_duration"].round(2) weekly_summary ``` Monthly summary: ```python monthly_summary = activity_ts.resample("MS").agg( events=("event_id", "count"), active_learners=("learner_id", "nunique"), revenue=("amount", "sum"), avg_duration=("duration_minutes", "mean") ) monthly_summary["avg_duration"] = monthly_summary["avg_duration"].round(2) monthly_summary ``` Downsampling means going from a more detailed frequency to a less detailed frequency. Examples: - hourly to daily - daily to weekly - daily to monthly ## 23. Upsampling And Filling Gaps Suppose you only care about days with revenue. ```python revenue_days = daily_revenue[daily_revenue > 0] revenue_days ``` Upsample to a daily frequency: ```python revenue_days.asfreq("D").head(15) ``` Missing days appear because no revenue happened on those dates. Fill missing values with zero: ```python revenue_days.asfreq("D", fill_value=0).head(15) ``` Forward-fill the last known value: ```python revenue_days.asfreq("D").ffill().head(15) ``` Use zero fill for event counts and revenue when missing means nothing happened. Use forward fill when the last known state should continue. ## 24. Interpolation Interpolation estimates missing numeric values. Create a sparse learner activity series: ```python sparse_active = daily_active_learners[daily_active_learners > 0] sparse_active.head() ``` Convert it to daily frequency: ```python sparse_daily = sparse_active.asfreq("D") sparse_daily.head(12) ``` Interpolate: ```python sparse_daily.interpolate(method="time").head(12) ``` Use interpolation carefully. It can be useful for sensor-like data or smooth trend lines, but it may be misleading for counts where missing means zero. ## 25. Rolling Window Metrics Rolling average smooths noisy daily values. ```python daily_summary["revenue_7d_avg"] = ( daily_summary["revenue"].rolling(window=7, min_periods=1).mean() ) daily_summary[["revenue", "revenue_7d_avg"]].head(12) ``` Rolling active learner average: ```python daily_summary["active_learners_7d_avg"] = ( daily_summary["active_learners"].rolling(window=7, min_periods=1).mean() ) ``` Rolling event standard deviation: ```python daily_summary["events_7d_std"] = ( daily_summary["events"].rolling(window=7, min_periods=2).std() ) ``` Rolling windows are useful when daily data is volatile. ## 26. Exponentially Weighted Moving Average EWM gives more weight to recent values. ```python daily_summary["revenue_ewm_7"] = ( daily_summary["revenue"].ewm(span=7, adjust=False).mean() ) daily_summary[["revenue", "revenue_7d_avg", "revenue_ewm_7"]].head(15) ``` Rolling mean treats all values in the window equally. EWM reacts faster to recent changes. ## 27. Shift For Previous Values Shift daily revenue by one day: ```python daily_summary["previous_day_revenue"] = daily_summary["revenue"].shift(1) ``` Daily revenue change: ```python daily_summary["revenue_change"] = ( daily_summary["revenue"] - daily_summary["previous_day_revenue"] ) ``` Percentage change: ```python daily_summary["revenue_pct_change"] = ( daily_summary["revenue"].pct_change() * 100 ) ``` Inspect: ```python daily_summary[[ "revenue", "previous_day_revenue", "revenue_change", "revenue_pct_change" ]].head(12) ``` Use `shift()` for lag features, comparisons, and time gaps. ## 28. Time Gap Per Learner Find time between consecutive events for each learner. Sort by learner and time: ```python activity_gap = activity.sort_values(["learner_id", "event_at"]).copy() ``` Use group-wise shift: ```python activity_gap["previous_event_at"] = ( activity_gap.groupby("learner_id")["event_at"].shift(1) ) ``` Calculate the gap: ```python activity_gap["time_since_previous_event"] = ( activity_gap["event_at"] - activity_gap["previous_event_at"] ) ``` Average gap by learner: ```python avg_gap_by_learner = ( activity_gap .groupby(["learner_id", "clean_name"], as_index=False) .agg(avg_gap=("time_since_previous_event", "mean")) ) avg_gap_by_learner.sort_values("avg_gap") ``` Convert to hours: ```python avg_gap_by_learner["avg_gap_hours"] = ( avg_gap_by_learner["avg_gap"].dt.total_seconds() / 3600 ).round(2) avg_gap_by_learner[["learner_id", "clean_name", "avg_gap_hours"]] ``` This is useful for retention and habit analysis. ## 29. Certificate Delivery SLA With Timedelta Parse delivery dates: ```python deliveries["requested_at"] = pd.to_datetime(deliveries["requested_at"], errors="coerce") deliveries["delivered_at"] = pd.to_datetime(deliveries["delivered_at"], errors="coerce") ``` Calculate delivery time: ```python deliveries["delivery_time"] = deliveries["delivered_at"] - deliveries["requested_at"] deliveries["delivery_hours"] = ( deliveries["delivery_time"].dt.total_seconds() / 3600 ).round(2) ``` Flag late deliveries: ```python deliveries["is_late"] = deliveries["delivery_hours"] > 48 ``` Merge delivery records with learner and course details: ```python delivery_report = ( deliveries .merge(learner_clean, how="left", on="learner_id", validate="many_to_one") .merge(courses, how="left", on="course_id", validate="many_to_one") ) delivery_report[[ "delivery_id", "clean_name", "course_title", "requested_at", "delivered_at", "delivery_hours", "is_late" ]] ``` SLA summary: ```python delivery_report.groupby("priority", as_index=False).agg( requests=("delivery_id", "count"), delivered=("delivered_at", "count"), avg_delivery_hours=("delivery_hours", "mean"), late_count=("is_late", "sum") ) ``` ## 30. Cohort Analysis By Signup Month Create signup month: ```python learner_clean["signup_month"] = learner_clean["signup_at"].dt.to_period("M").astype(str) ``` Merge signup month into activity: ```python activity = activity.merge( learner_clean[["learner_id", "signup_month"]], how="left", on="learner_id" ) ``` Create event month period: ```python activity["event_month_period"] = activity["event_at"].dt.to_period("M").astype(str) ``` Cohort table: ```python cohort = activity.pivot_table( index="signup_month", columns="event_month_period", values="learner_id", aggfunc="nunique", fill_value=0 ) cohort ``` This shows how many learners from each signup month were active in each event month. ## 31. MultiIndex Summary Report Build a MultiIndex report by track, difficulty, and month. ```python track_month_report = ( activity .groupby(["track", "difficulty", "event_month_period"]) .agg( events=("event_id", "count"), learners=("learner_id", "nunique"), revenue=("amount", "sum"), avg_duration=("duration_minutes", "mean") ) ) track_month_report ``` Select one track: ```python track_month_report.loc["Pandas"] ``` Unstack month into columns: ```python track_month_report["revenue"].unstack("event_month_period") ``` Reset index for export: ```python track_month_report.reset_index().head() ``` This connects GroupBy, MultiIndex, and reporting. ## 32. Final Executive Summary Create one compact table for leadership. ```python executive_summary = pd.DataFrame({ "metric": [ "total_events", "valid_event_days", "unique_learners", "total_revenue", "average_session_minutes", "certificate_requests", "average_delivery_hours", "late_certificates" ], "value": [ activity["event_id"].nunique(), activity["event_date"].nunique(), activity["learner_id"].nunique(), activity["amount"].sum(), round(activity["duration_minutes"].mean(), 2), deliveries["delivery_id"].nunique(), round(deliveries["delivery_hours"].mean(), 2), int(deliveries["is_late"].sum()) ] }) executive_summary ``` Create final reports dictionary: ```python reports = { "executive_summary": executive_summary, "course_revenue": course_revenue, "channel_report": channel_report, "weekly_summary": weekly_summary, "monthly_summary": monthly_summary, "delivery_report": delivery_report, "cohort": cohort, } ``` In a real project, you could export these: ```python # executive_summary.to_csv("executive_summary.csv", index=False) # course_revenue.to_csv("course_revenue.csv", index=False) # weekly_summary.to_csv("weekly_summary.csv") ``` ## 33. Case Study Checklist Use this checklist for future Pandas projects: 1. Load all raw files. 2. Inspect shapes, columns, and data types. 3. Check missing values and duplicates. 4. Clean text columns. 5. Parse datetime columns with `errors="coerce"`. 6. Inspect invalid dates. 7. Concatenate same-schema files. 8. Audit joins with `indicator=True`. 9. Merge dimensions with `validate`. 10. Create date features with `.dt`. 11. Create duration features with `Timedelta`. 12. Build high-level metrics. 13. Build `groupby` summaries. 14. Build pivot tables. 15. Set a `DatetimeIndex` for time series. 16. Use partial date slicing. 17. Use `resample()` for time buckets. 18. Use `asfreq()` for frequency conversion. 19. Use rolling and EWM metrics for trend smoothing. 20. Use `shift()` for previous values and time gaps. 21. Create final export-ready tables. 22. Document assumptions and data quality issues. ## 34. Practice Tasks Try these tasks after completing the walkthrough. ### Task 1. Find the top 5 learners by total duration. ```python activity.groupby(["learner_id", "clean_name"], as_index=False)["duration_minutes"].sum() \ .sort_values("duration_minutes", ascending=False) \ .head(5) ``` **Explanation** - Groups learning activity records by learner ID and course name, then sums the total duration minutes for each combination - Sorts the aggregated results in descending order by total duration to rank learners by activity level - Returns only the top 5 records showing the most active learners based on cumulative time spent - Uses as_index=False parameter to maintain group columns as regular columns rather than index levels - The chain of operations efficiently transforms raw activity data into a ranked summary report of learner engagement ### Task 2. Find revenue by city and track. ```python activity.pivot_table( index="city_clean", columns="track", values="amount", aggfunc="sum", fill_value=0 ) ``` **Explanation** - The code transforms a dataset into a pivot table format that groups transactions by city and track category - It uses the city names as row identifiers and track categories as column headers - The values represent the sum of transaction amounts for each city-track combination - Missing combinations are filled with zero instead of NaN values for cleaner data presentation - This structure enables easy comparison of spending patterns across different cities and music tracks ### Task 3. Find the busiest hour of the day. ```python activity.groupby("event_hour")["event_id"].count().sort_values(ascending=False).head(1) ``` **Explanation** - Groups activity data by hour of the event using the event_hour column - Counts the number of events occurring in each hour group - Sorts the counts in descending order to identify the highest frequency hours - Returns only the top result showing the hour with the most events - Useful for identifying peak activity periods in time-series data analysis ### Task 4. Find the average gap between purchases. ```python purchases = activity[activity["event_type"] == "purchase"].sort_values("event_at").copy() purchases["previous_purchase_at"] = purchases["event_at"].shift(1) purchases["purchase_gap"] = purchases["event_at"] - purchases["previous_purchase_at"] purchases["purchase_gap"].mean() ``` **Explanation** - Filters purchase events from activity data and sorts them chronologically to establish purchase sequence - Creates a new column showing the timestamp of the previous purchase for each transaction using shift() function - Computes time differences between consecutive purchases by subtracting previous timestamps from current ones - Calculates and returns the mean gap duration between purchases to measure customer buying frequency patterns ### Task 5. Find weekly revenue growth. ```python weekly_revenue = activity_ts["amount"].resample("W").sum() weekly_revenue_growth = weekly_revenue.pct_change() * 100 weekly_revenue_growth ``` **Explanation** - Resamples the time series data to weekly intervals and calculates the sum of amounts for each week - Computes the percentage change between consecutive weeks to determine revenue growth rate - Multiplies the percentage change by 100 to convert to percentage format for easier interpretation - The resulting series shows the weekly revenue growth percentages with NaN for the first week since there's no previous week to compare against ### Task 6. Find courses with engagement but no revenue. ```python course_summary = activity.groupby(["course_id", "course_title"], as_index=False).agg( events=("event_id", "count"), revenue=("amount", "sum") ) course_summary.query("events > 0 and revenue == 0") ``` **Explanation** - Groups activity data by course_id and course_title to aggregate event counts and total revenue per course - Uses agg() function to count events and sum revenue for each course grouping - Applies query() method to filter results showing only courses with at least one event but zero revenue - Returns a filtered dataframe containing courses that had user interactions but no monetary transactions - Useful for identifying potentially problematic courses or analyzing incomplete transaction data ### Task 7. Find learners with no activity after signup. ```python active_ids = activity["learner_id"].dropna().unique() learner_clean[~learner_clean["learner_id"].isin(active_ids)] ``` **Explanation** - Retrieves all unique learner IDs from the activity dataset while removing any null values - Creates a boolean mask to identify rows in learner_clean that do not have matching learner IDs in the active_ids set - Returns only the records from learner_clean where the learner_id is NOT present in the list of active learner IDs - This operation effectively filters out any learner records that appear in the activity data, leaving only inactive or non-active learners - The result represents the subset of learners who haven't participated in any activities within the dataset ### Task 8. Find late certificate deliveries by course. ```python delivery_report.groupby("course_title", as_index=False).agg( requests=("delivery_id", "count"), late_count=("is_late", "sum"), avg_delivery_hours=("delivery_hours", "mean") ).sort_values("late_count", ascending=False) ``` **Explanation** - Groups delivery records by course title to analyze performance metrics per course - Calculates three key statistics: total delivery requests, count of late deliveries, and average delivery time in hours - Uses aggregation functions to efficiently compute multiple metrics in a single operation - Sorts results by late delivery count in descending order to identify courses with the most delayed deliveries - Returns a clean dataframe showing course performance metrics for quick analysis and decision making ### Task 9. Build a daily dashboard with rolling averages. ```python daily_dashboard = daily_summary.copy() daily_dashboard["events_7d_avg"] = daily_dashboard["events"].rolling(7, min_periods=1).mean() daily_dashboard["revenue_7d_avg"] = daily_dashboard["revenue"].rolling(7, min_periods=1).mean() daily_dashboard["active_learners_7d_avg"] = daily_dashboard["active_learners"].rolling(7, min_periods=1).mean() daily_dashboard.tail() ``` **Explanation** - Creates a copy of the daily summary dataframe to avoid modifying the original data while building the dashboard - Applies a 7-day rolling window calculation with minimum periods set to 1 to handle incomplete windows at the start of the dataset - Computes three separate 7-day moving averages for events, revenue, and active learners metrics to smooth out short-term fluctuations - Displays the last few rows of the updated dashboard to preview the calculated rolling averages alongside original data ### Task 10. Build a month and channel revenue report. ```python activity.pivot_table( index="event_month_period", columns="channel", values="amount", aggfunc="sum", fill_value=0, margins=True ) ``` **Explanation** - Generates a cross-tabulated view showing total amounts grouped by event month periods and marketing channels - Uses sum aggregation to combine transaction values across matching combinations of month and channel - Sets missing values to zero using fill_value parameter for clean data presentation - Includes marginal totals (row and column sums) through the margins=True parameter for comprehensive analysis - Organizes data with event months as rows, channels as columns, and summed amounts as cell values ## 35. Interview Questions From This Case Study ### 1. Why should you use `errors="coerce"` with `pd.to_datetime()`? It prevents invalid date strings from crashing the pipeline and turns them into `NaT`, which can be inspected and handled. ### 2. What is the difference between `asfreq()` and `resample()`? `asfreq()` changes the frequency view without aggregation. `resample()` groups time-indexed data into new time buckets and applies aggregations. ### 3. What is downsampling? Downsampling means moving from a more detailed frequency to a less detailed one, such as daily data to weekly or monthly summaries. ### 4. What is upsampling? Upsampling means moving from a less detailed frequency to a more detailed one, such as weekly data to daily data. ### 5. Why use a `DatetimeIndex`? It enables partial date slicing, time-aware resampling, frequency conversion, rolling windows, and other time series operations. ### 6. What does `shift()` help with? It helps compare each row with a previous or future row, such as previous-day revenue or previous login time. ### 7. Why use group-wise `shift()`? It calculates previous values within each group, such as the previous event for each learner. ### 8. What is the purpose of rolling averages? Rolling averages smooth noisy time series data by averaging values over a moving window. ### 9. How is EWM different from a rolling mean? EWM gives more weight to recent observations, while a simple rolling mean gives equal weight to all observations in the window. ### 10. Why audit joins with `indicator=True`? It reveals unmatched rows before or after a merge, helping catch missing dimension records or bad IDs. ### 11. Why use `validate` in `merge()`? It checks that the join relationship matches your expectation, such as many events to one learner. ### 12. What is a Timedelta? A Timedelta represents a duration between two moments, such as certificate delivery time or time between learner sessions. ### 13. Why use named aggregation? It creates readable summary tables with clear output column names. ### 14. Why use pivot tables? Pivot tables quickly summarize data across two or more dimensions, such as month by track or channel by event type. ### 15. What makes a case study different from isolated examples? A case study forces you to connect loading, cleaning, joining, time handling, analysis, validation, and reporting into one workflow. ## 36. Final Mental Model For a real Pandas analytics project, think in layers: | Layer | Pandas Tools | |---|---| | Ingest | `read_csv`, `concat` | | Inspect | `shape`, `info`, `describe`, `isna`, `duplicated` | | Clean | `.str`, `replace`, `astype`, `to_datetime` | | Model | `merge`, `validate`, `indicator` | | Feature Engineering | `.dt`, `Timedelta`, `transform`, `assign` | | Analysis | `groupby`, `agg`, `pivot_table`, `query` | | Time Series | `DatetimeIndex`, `asfreq`, `resample`, `rolling`, `ewm`, `shift` | | Reporting | sorted summaries, pivot tables, export-ready DataFrames | If you remember only one thing, remember this: Pandas is not a collection of random methods. It is a workflow: load, inspect, clean, combine, transform, analyze, validate, and report. ## Official References - Pandas time series user guide: https://pandas.pydata.org/docs/user_guide/timeseries.html - Pandas windowing operations user guide: https://pandas.pydata.org/docs/user_guide/window.html - Pandas merging, joining, and concatenating user guide: https://pandas.pydata.org/docs/user_guide/merging.html - Pandas GroupBy user guide: https://pandas.pydata.org/docs/user_guide/groupby.html - Pandas text data user guide: https://pandas.pydata.org/docs/user_guide/text.html - Pandas reshaping and pivot tables user guide: https://pandas.pydata.org/docs/user_guide/reshaping.html