Advanced8hPremium

Pandas AI Learning Platform Analytics Capstone

Build a portfolio-grade Pandas project covering cleaning, joins, groupby, pivots, time series, text analysis, and ML-ready features.

Overview

This capstone turns Pandas from a collection of methods into one complete analytics workflow. You will work as the data owner for a fictional AI learning platform that teaches Python, Pandas, analytics, and AI skills. The product team wants to know which learners are succeeding, which courses need improvement, how recommendation models perform, and what learner-level features could support future AI experiments.

The project is intentionally designed as a portfolio project. By the end, you should be able to explain the business problem, defend your cleaning choices, validate joins, build reports at the right grain, and export a model-ready learner feature table without leaking future information into training data.

The data is original synthetic data created for this project. It is not copied from a course notebook, public dataset, or third-party platform.

What You Will Build

A data quality audit that catches missing values, duplicate natural events, invalid scores, outlier study sessions, and unresolved tickets.
A cleaned relational Pandas workspace with learners, courses, lessons, enrollments, events, assessments, feedback, certificates, support tickets, recommendations, and content plans.
A course performance report with enrollment, revenue, engagement, ratings, certificates, and recommendation metrics.
A learner feature table for Data Science and AI use cases.
Weekly time-series reports with resampling, rolling averages, lag features, and percent change.
Text feedback summaries using vectorized string methods and token-level analysis.
Recommendation performance reports by model version, course, category, rank position, click-through rate, and enrollment conversion.
A final portfolio summary that explains practical business actions.

Dataset Guide

The project includes 12 CSV files. Each file is small enough to inspect manually but connected enough to practice real Pandas workflows.

pandas_capstone_ai_recommendations.csv: 420 rows, 11 columns
pandas_capstone_assessments.csv: 320 rows, 9 columns
pandas_capstone_certificates.csv: 90 rows, 7 columns
pandas_capstone_content_calendar.csv: 80 rows, 10 columns
pandas_capstone_courses.csv: 14 rows, 9 columns
pandas_capstone_data_dictionary.csv: 102 rows, 5 columns
pandas_capstone_enrollments.csv: 260 rows, 9 columns
pandas_capstone_feedback.csv: 260 rows, 9 columns
pandas_capstone_learners.csv: 120 rows, 11 columns
pandas_capstone_learning_events.csv: 900 rows, 10 columns
pandas_capstone_lessons.csv: 56 rows, 7 columns
pandas_capstone_support_tickets.csv: 150 rows, 10 columns

Important relationships:

learners.learner_id joins to enrollments, events, assessments, feedback, tickets, and recommendations.
courses.course_id joins to lessons, enrollments, events, assessments, feedback, certificates, and content calendar.
lessons.lesson_id joins to learning events.
enrollments.enrollment_id joins to events, assessments, feedback, certificates, and support tickets.
recommendations.recommended_course_id joins back to courses.course_id.

Intentional issues to practice:

Missing country, rating, coupon_code, feedback_text, and resolved_at values.
Whitespace and casing inconsistencies in country, plan, and category.
Outliers in minutes_spent, score_pct, price_paid_usd, and response_hours.
Duplicate natural learning events with unique technical event IDs.
Completed enrollments without certificates.
Recommendation clicks that do not convert into enrollments.

Portfolio Submission Checklist

Your repository has a clear README with the business problem, data model, and final outputs.
Your code can run from a clean checkout.
Your cleaning choices are documented.
Your joins include validation checks.
Your exported reports are stored in outputs/.
Your learner feature table has one row per learner.
Your final write-up includes three business recommendations.
Your leakage notes explain which columns are safe for modeling and which are not.

FAQ

Is this project enough to practice Pandas for Data Science and AI?

Yes. It covers the Pandas workflows that show up repeatedly in DS and AI projects: ingestion, inspection, cleaning, joins, aggregation, reshaping, datetime analysis, text preprocessing, feature engineering, validation, and export.

Why is this not a machine learning project?

This capstone prepares the data that a machine learning project would need. Model training is intentionally out of scope so the focus stays on Pandas depth and data quality.

Can I use notebooks instead of scripts?

Yes. A notebook is fine for exploration. For portfolio quality, keep a script version of the final pipeline so another person can rerun it.

Why are some values messy on purpose?

Real datasets are rarely clean. The missing values, casing issues, outliers, and duplicate natural events force you to practice judgment instead of only memorizing syntax.

What should premium users compare against?

Compare your structure, outputs, join validation, feature grain, and leakage notes against the reference solution. The exact metric values may differ if your outlier policy differs.

Requirements

Milestone 1: Project setup

Download all starter files.
Keep CSVs in a data/ folder.
Keep your working script in starter/pandas_capstone_starter.py.
Create an outputs/ folder for exported reports.

Milestone 2: Load and audit

Load every CSV with pd.read_csv.
Print shape, dtypes, missing cells, duplicate rows, and key uniqueness.
Create a reusable audit_tables() function.
Identify which columns need type conversion before analysis.

Milestone 3: Clean data safely

Normalize casing and whitespace for categorical columns.
Parse all date and datetime columns with errors="coerce".
Convert boolean-like text columns to booleans.
Convert numeric columns with pd.to_numeric.
Add flags for invalid scores and outlier values before clipping or capping.
Fill missing text fields only when a placeholder is analytically meaningful.

Milestone 4: Validate relationships

Validate primary keys with is_unique.
Use isin coverage checks for foreign keys.
Use merge(validate="many_to_one") when joining dimensions to facts.
Keep row counts before and after joins so accidental row multiplication is visible.

Milestone 5: Core Pandas transformations

Use loc, boolean masks, query, assign, pipe, sort_values, and method chaining.
Use named aggregation in groupby.
Use transform for learner/course-level features that must return to the original row grain.
Use rank, cut, and qcut to create score bands, engagement bands, and course rankings.

Milestone 6: Reporting tables

Build one course performance table.
Build one learner feature table.
Build one recommendation performance table.
Build one feedback term report.
Build one relationship validation table.

Milestone 7: Time series

Set event_time as a datetime index.
Resample learning events weekly.
Calculate rolling 4-week engagement.
Add lag and percent-change columns.
Explain why sorted datetime indexes matter before rolling analysis.

Milestone 8: Text features

Clean feedback with vectorized string operations.
Count words and detect important phrases such as project, portfolio, stuck, trouble, fast, or hard.
Tokenize feedback with str.findall and explode.
Build top terms per course.

Milestone 9: DS and AI feature engineering

Create learner-level features such as total minutes, active days, completed courses, average score, average rating, recommendation CTR, recommendation enrollment rate, support burden, and engagement labels.
Add leakage notes for every label-like column.
Separate descriptive analytics columns from model-ready columns.

Milestone 10: Export and communicate

Export reports to CSV.
Write three business recommendations.
Add a short README explaining your assumptions, cleaning policy, and final outputs.

Starter Files

README.mdmarkdown

starter/pandas_capstone_starter.pypython

data/pandas_capstone_ai_recommendations.csvcsv

data/pandas_capstone_assessments.csvcsv

data/pandas_capstone_certificates.csvcsv

data/pandas_capstone_content_calendar.csvcsv

data/pandas_capstone_courses.csvcsv

data/pandas_capstone_data_dictionary.csvcsv

data/pandas_capstone_enrollments.csvcsv

data/pandas_capstone_feedback.csvcsv

data/pandas_capstone_learners.csvcsv

data/pandas_capstone_learning_events.csvcsv

data/pandas_capstone_lessons.csvcsv

data/pandas_capstone_support_tickets.csvcsv

Workspace

Project Sandbox

1# Pandas AI Learning Platform Analytics Capstone2 3You are the analytics owner for a fictional AI learning platform. Your job is to turn raw learner, course, event, feedback, ticket, certificate, and recommendation data into clean reports and a learner-level feature table that could feed downstream dashboards or ML experiments.4 5## Suggested workflow6 71. Create a virtual environment.82. Install pandas: `pip install pandas`.93. Run `python starter/pandas_capstone_starter.py`.104. Implement the TODO functions one by one.115. Export reports into the `outputs/` folder.126. Use the tests as a final quality gate.13 14The data is synthetic and intentionally messy. Treat this like a real analytics handoff: audit first, clean carefully, validate joins, then build features.15

Click inside code viewer to edit

Edits are preserved locally. This file type does not support in-browser runtime execution.