Pandas AI Learning Platform Analytics Capstone
Build a portfolio-grade Pandas project covering cleaning, joins, groupby, pivots, time series, text analysis, and ML-ready features.
Overview
This capstone turns Pandas from a collection of methods into one complete analytics workflow. You will work as the data owner for a fictional AI learning platform that teaches Python, Pandas, analytics, and AI skills. The product team wants to know which learners are succeeding, which courses need improvement, how recommendation models perform, and what learner-level features could support future AI experiments.
The project is intentionally designed as a portfolio project. By the end, you should be able to explain the business problem, defend your cleaning choices, validate joins, build reports at the right grain, and export a model-ready learner feature table without leaking future information into training data.
The data is original synthetic data created for this project. It is not copied from a course notebook, public dataset, or third-party platform.
What You Will Build
- A data quality audit that catches missing values, duplicate natural events, invalid scores, outlier study sessions, and unresolved tickets.
- A cleaned relational Pandas workspace with learners, courses, lessons, enrollments, events, assessments, feedback, certificates, support tickets, recommendations, and content plans.
- A course performance report with enrollment, revenue, engagement, ratings, certificates, and recommendation metrics.
- A learner feature table for Data Science and AI use cases.
- Weekly time-series reports with resampling, rolling averages, lag features, and percent change.
- Text feedback summaries using vectorized string methods and token-level analysis.
- Recommendation performance reports by model version, course, category, rank position, click-through rate, and enrollment conversion.
- A final portfolio summary that explains practical business actions.
Dataset Guide
The project includes 12 CSV files. Each file is small enough to inspect manually but connected enough to practice real Pandas workflows.
pandas_capstone_ai_recommendations.csv: 420 rows, 11 columnspandas_capstone_assessments.csv: 320 rows, 9 columnspandas_capstone_certificates.csv: 90 rows, 7 columnspandas_capstone_content_calendar.csv: 80 rows, 10 columnspandas_capstone_courses.csv: 14 rows, 9 columnspandas_capstone_data_dictionary.csv: 102 rows, 5 columnspandas_capstone_enrollments.csv: 260 rows, 9 columnspandas_capstone_feedback.csv: 260 rows, 9 columnspandas_capstone_learners.csv: 120 rows, 11 columnspandas_capstone_learning_events.csv: 900 rows, 10 columnspandas_capstone_lessons.csv: 56 rows, 7 columnspandas_capstone_support_tickets.csv: 150 rows, 10 columns
Important relationships:
learners.learner_idjoins to enrollments, events, assessments, feedback, tickets, and recommendations.courses.course_idjoins to lessons, enrollments, events, assessments, feedback, certificates, and content calendar.lessons.lesson_idjoins to learning events.enrollments.enrollment_idjoins to events, assessments, feedback, certificates, and support tickets.recommendations.recommended_course_idjoins back tocourses.course_id.
Intentional issues to practice:
- Missing
country,rating,coupon_code,feedback_text, andresolved_atvalues. - Whitespace and casing inconsistencies in
country,plan, andcategory. - Outliers in
minutes_spent,score_pct,price_paid_usd, andresponse_hours. - Duplicate natural learning events with unique technical event IDs.
- Completed enrollments without certificates.
- Recommendation clicks that do not convert into enrollments.
Portfolio Submission Checklist
- Your repository has a clear README with the business problem, data model, and final outputs.
- Your code can run from a clean checkout.
- Your cleaning choices are documented.
- Your joins include validation checks.
- Your exported reports are stored in
outputs/. - Your learner feature table has one row per learner.
- Your final write-up includes three business recommendations.
- Your leakage notes explain which columns are safe for modeling and which are not.
FAQ
Is this project enough to practice Pandas for Data Science and AI?
Yes. It covers the Pandas workflows that show up repeatedly in DS and AI projects: ingestion, inspection, cleaning, joins, aggregation, reshaping, datetime analysis, text preprocessing, feature engineering, validation, and export.
Why is this not a machine learning project?
This capstone prepares the data that a machine learning project would need. Model training is intentionally out of scope so the focus stays on Pandas depth and data quality.
Can I use notebooks instead of scripts?
Yes. A notebook is fine for exploration. For portfolio quality, keep a script version of the final pipeline so another person can rerun it.
Why are some values messy on purpose?
Real datasets are rarely clean. The missing values, casing issues, outliers, and duplicate natural events force you to practice judgment instead of only memorizing syntax.
What should premium users compare against?
Compare your structure, outputs, join validation, feature grain, and leakage notes against the reference solution. The exact metric values may differ if your outlier policy differs.
Requirements
Milestone 1: Project setup
- Download all starter files.
- Keep CSVs in a
data/folder. - Keep your working script in
starter/pandas_capstone_starter.py. - Create an
outputs/folder for exported reports.
Milestone 2: Load and audit
- Load every CSV with
pd.read_csv. - Print shape, dtypes, missing cells, duplicate rows, and key uniqueness.
- Create a reusable
audit_tables()function. - Identify which columns need type conversion before analysis.
Milestone 3: Clean data safely
- Normalize casing and whitespace for categorical columns.
- Parse all date and datetime columns with
errors="coerce". - Convert boolean-like text columns to booleans.
- Convert numeric columns with
pd.to_numeric. - Add flags for invalid scores and outlier values before clipping or capping.
- Fill missing text fields only when a placeholder is analytically meaningful.
Milestone 4: Validate relationships
- Validate primary keys with
is_unique. - Use
isincoverage checks for foreign keys. - Use
merge(validate="many_to_one")when joining dimensions to facts. - Keep row counts before and after joins so accidental row multiplication is visible.
Milestone 5: Core Pandas transformations
- Use
loc, boolean masks,query,assign,pipe,sort_values, and method chaining. - Use named aggregation in
groupby. - Use
transformfor learner/course-level features that must return to the original row grain. - Use
rank,cut, andqcutto create score bands, engagement bands, and course rankings.
Milestone 6: Reporting tables
- Build one course performance table.
- Build one learner feature table.
- Build one recommendation performance table.
- Build one feedback term report.
- Build one relationship validation table.
Milestone 7: Time series
- Set
event_timeas a datetime index. - Resample learning events weekly.
- Calculate rolling 4-week engagement.
- Add lag and percent-change columns.
- Explain why sorted datetime indexes matter before rolling analysis.
Milestone 8: Text features
- Clean feedback with vectorized string operations.
- Count words and detect important phrases such as project, portfolio, stuck, trouble, fast, or hard.
- Tokenize feedback with
str.findallandexplode. - Build top terms per course.
Milestone 9: DS and AI feature engineering
- Create learner-level features such as total minutes, active days, completed courses, average score, average rating, recommendation CTR, recommendation enrollment rate, support burden, and engagement labels.
- Add leakage notes for every label-like column.
- Separate descriptive analytics columns from model-ready columns.
Milestone 10: Export and communicate
- Export reports to CSV.
- Write three business recommendations.
- Add a short README explaining your assumptions, cleaning policy, and final outputs.
Starter Files
Workspace
1# Pandas AI Learning Platform Analytics Capstone2 3You are the analytics owner for a fictional AI learning platform. Your job is to turn raw learner, course, event, feedback, ticket, certificate, and recommendation data into clean reports and a learner-level feature table that could feed downstream dashboards or ML experiments.4 5## Suggested workflow6 71. Create a virtual environment.82. Install pandas: `pip install pandas`.93. Run `python starter/pandas_capstone_starter.py`.104. Implement the TODO functions one by one.115. Export reports into the `outputs/` folder.126. Use the tests as a final quality gate.13 14The data is synthetic and intentionally messy. Treat this like a real analytics handoff: audit first, clean carefully, validate joins, then build features.15 