What is the purpose of using Pandas GroupBy?

Pandas GroupBy is used to split a DataFrame into groups, apply calculations to each group, and combine the results into a new object, making datasets useful by answering specific questions about the data.

What are some of the key functions you can use with GroupBy?

Some key functions you can use with GroupBy include sum, mean, median, min, max, std, var, size, count, nunique, first, last, nth, and more.

How do you load the datasets used in this guide?

You load the datasets by placing the CSV files 'learnergroupbyactivity.csv' and 'courserevenuegroupby.csv' in the same folder as the Markdown file and using pandas' read_csv function.

What does the GroupBy object do?

The GroupBy object is lazy, meaning it holds the grouping instructions until an aggregation, transformation, filter, or custom operation is requested.

How can you check the number of groups in a GroupBy object?

You can check the number of groups in a GroupBy object using the ngroups attribute.

Pandas GroupBy: Split, Aggregate, Transform &

Pandas GroupBy: Split, Aggregate, Transform, Filter, and Practice

Most real datasets are not useful because of individual rows alone.

They become useful when you ask questions like:

Which track has the highest average quiz score?
Which city brings the most paid learners?
Which plan has the strongest completion rate?
Which course generated the highest revenue?
Which learner is above or below the average of their own track?
Which groups should be kept or removed based on business rules?

This is where groupby() becomes one of the most important Pandas tools.

This guide uses original sample datasets created for this lesson. It does not use copied course, movie, sports, or public dataset examples.

Files used in this lesson:

learner_groupby_activity.csv
course_revenue_groupby.csv

Place both CSV files in the same folder as this Markdown file before running the examples.

What You Will Learn

By the end, you should be able to use:

groupby()
sum, mean, median, min, max, std, and var
size() versus count()
nunique()
first(), last(), and nth()
groups, ngroups, and get_group()
multiple grouping columns
.agg() with lists and dictionaries
named aggregation
as_index=False and reset_index()
transform()
filter()
apply() for custom group logic
top-N records inside each group
group-wise percentages
group-wise missing-value handling

1. Setup

python

import pandas as pd
import numpy as np

Load the learner activity dataset:

python

activity = pd.read_csv("learner_groupby_activity.csv")
activity.head()

Load the course revenue dataset:

python

revenue = pd.read_csv("course_revenue_groupby.csv", parse_dates=["ordered_at"])
revenue.head()

Quickly inspect both datasets:

python

activity.shape, revenue.shape

python

activity.info()

python

revenue.info()

2. What GroupBy Means

groupby() follows a common pattern:

Split the DataFrame into groups.
Apply a calculation to each group.
Combine the results into a new object.

For example:

python

activity.groupby("track")["minutes_watched"].sum()

This means:

split rows by track
take the minutes_watched column inside each track
calculate total minutes for each track
combine the totals into one Series

The result is not one row per learner. It is one row per group.

3. Create a GroupBy Object

python

track_groups = activity.groupby("track")
track_groups

The output may not look exciting because a GroupBy object is lazy.

Pandas has not calculated the final result yet. It is holding the grouping instructions until you ask for an aggregation, transformation, filter, or custom operation.

Check the number of groups:

python

track_groups.ngroups

Check the group labels:

python

list(track_groups.groups.keys())

The groups attribute gives the row indexes that belong to each group:

python

track_groups.groups

Get only one group:

python

track_groups.get_group("Pandas")

This is useful while debugging because you can inspect what Pandas considers one group.

4. Basic Aggregation

Find total watch time for each track:

python

activity.groupby("track")["minutes_watched"].sum()

Find average quiz score for each track:

python

activity.groupby("track")["quiz_score"].mean()

Find highest quiz score in each track:

python

activity.groupby("track")["quiz_score"].max()

Find lowest quiz score in each track:

python

activity.groupby("track")["quiz_score"].min()

Find multiple statistics:

python

activity.groupby("track")["quiz_score"].agg(["min", "mean", "median", "max"])

When you aggregate one column, the result is usually a Series.

When you aggregate multiple columns or multiple functions, the result becomes a DataFrame.

5. `size()` Versus `count()`

This difference matters in interviews and real projects.

size() counts rows in each group:

python

activity.groupby("track").size()

count() counts non-missing values in each column:

python

activity.groupby("track").count()

Example:

python

activity.groupby("track")["quiz_score"].count()

If a group has 5 rows but one missing quiz score, size() returns 5, while count() on quiz_score returns 4.

Use size() when you want row count.

Use count() when you want non-null value count.

6. Count Unique Values Per Group

Find how many unique learners are in each track:

python

activity.groupby("track")["learner_id"].nunique()

Find how many cities each track has learners from:

python

activity.groupby("track")["city"].nunique()

Find how many unique lessons each track contains:

python

activity.groupby("track")["lesson"].nunique()

nunique() is useful for metrics like:

unique learners
unique buyers
unique cities
unique products
unique sessions

7. First, Last, and Nth Row in Every Group

Sort the data first so "first" and "last" mean something:

python

activity_sorted = activity.sort_values(["track", "completed_at"])

First row per track:

python

activity_sorted.groupby("track").first()

Last row per track:

python

activity_sorted.groupby("track").last()

Second row per track:

python

activity_sorted.groupby("track").nth(1)

nth(1) uses zero-based position, so it returns the second row in each group.

8. Group By Multiple Columns

You can group by more than one column by passing a list.

Average score by track and plan:

python

activity.groupby(["track", "plan"])["quiz_score"].mean()

Total minutes by track and status:

python

activity.groupby(["track", "status"])["minutes_watched"].sum()

Revenue by course and city:

python

revenue.groupby(["course", "city"])["amount"].sum()

The result uses a MultiIndex by default.

If you want normal columns, call reset_index():

python

revenue.groupby(["course", "city"])["amount"].sum().reset_index()

Or use as_index=False:

python

revenue.groupby(["course", "city"], as_index=False)["amount"].sum()

For beginner-friendly dashboards, as_index=False is often easier.

9. Sort GroupBy Results

Find the top tracks by total watch time:

python

activity.groupby("track")["minutes_watched"].sum().sort_values(ascending=False)

Find the top courses by paid revenue:

python

paid_revenue = revenue[revenue["paid"] == True]

paid_revenue.groupby("course")["amount"].sum().sort_values(ascending=False)

Find the top cities by number of paid orders:

python

paid_revenue.groupby("city").size().sort_values(ascending=False)

GroupBy results are often more useful after sorting.

10. Use `.agg()` With Multiple Columns

Create a track summary:

python

activity.groupby("track").agg({
    "learner_id": "nunique",
    "minutes_watched": "sum",
    "quiz_score": "mean",
    "lesson": "count"
})

This works, but the column names are not always friendly.

For cleaner output, use named aggregation.

11. Named Aggregation

Named aggregation lets you choose the output column names.

python

track_summary = activity.groupby("track", as_index=False).agg(
    learners=("learner_id", "nunique"),
    lessons_attempted=("lesson", "count"),
    total_minutes=("minutes_watched", "sum"),
    average_score=("quiz_score", "mean"),
    completed_lessons=("status", lambda s: (s == "completed").sum())
)

track_summary

Round the average:

python

track_summary["average_score"] = track_summary["average_score"].round(2)
track_summary

Create a course revenue summary:

python

course_summary = revenue.groupby("course", as_index=False).agg(
    orders=("order_id", "nunique"),
    buyers=("learner_id", "nunique"),
    gross_revenue=("amount", "sum"),
    average_discount=("discount", "mean"),
    paid_orders=("paid", lambda s: s.sum())
)

course_summary.sort_values("gross_revenue", ascending=False)

Named aggregation is usually the cleanest style for project code.

12. Grouped Value Counts

Find lesson status distribution inside each track:

python

activity.groupby("track")["status"].value_counts()

Convert it into a table:

python

activity.groupby("track")["status"].value_counts().unstack(fill_value=0)

Find plan mix inside each course:

python

revenue.groupby("course")["plan"].value_counts().unstack(fill_value=0)

This is useful for quick cross-tab style analysis.

13. Percentages Within Each Group

Suppose you want each order's share of total course revenue.

First calculate total revenue per course and attach it back to every row:

python

revenue["course_total_amount"] = revenue.groupby("course")["amount"].transform("sum")

Now calculate the row's percentage contribution:

python

revenue["course_revenue_share"] = revenue["amount"] / revenue["course_total_amount"]

revenue[["order_id", "course", "amount", "course_total_amount", "course_revenue_share"]].head()

Format as percentage:

python

revenue["course_revenue_share_pct"] = (revenue["course_revenue_share"] * 100).round(2)

The key idea is that transform() returns a result with the same number of rows as the original DataFrame.

That makes it perfect for adding group-level metrics back into row-level data.

14. `transform()` For Group-Wise Features

Add average score of each learner's track:

python

activity["track_average_score"] = activity.groupby("track")["quiz_score"].transform("mean")

Compare each row to its track average:

python

activity["score_vs_track_average"] = activity["quiz_score"] - activity["track_average_score"]

Show the comparison:

python

activity[[
    "learner",
    "track",
    "lesson",
    "quiz_score",
    "track_average_score",
    "score_vs_track_average"
]].head()

Add each learner's total minutes:

python

activity["learner_total_minutes"] = activity.groupby("learner_id")["minutes_watched"].transform("sum")

Find learners with high total effort:

python

activity[activity["learner_total_minutes"] >= 120][
    ["learner_id", "learner", "track", "learner_total_minutes"]
].drop_duplicates()

Use transform() when you need group results while keeping the original rows.

15. Filter Groups

filter() keeps or removes entire groups based on a condition.

Keep only tracks with at least 4 rows:

python

activity.groupby("track").filter(lambda group: len(group) >= 4)

Keep only tracks with at least 3 unique learners:

python

activity.groupby("track").filter(
    lambda group: group["learner_id"].nunique() >= 3
)

Keep only courses with total revenue of at least 4000:

python

revenue.groupby("course").filter(
    lambda group: group["amount"].sum() >= 4000
)

Important: filter() returns original rows, not a summary table.

16. Apply Custom Group Logic

Use apply() when built-in aggregation or transformation is not enough.

Example: create a custom summary for each track.

python

def track_health(group):
    completion_rate = (group["status"] == "completed").mean()
    avg_score = group["quiz_score"].mean()
    total_minutes = group["minutes_watched"].sum()

    return pd.Series({
        "completion_rate": completion_rate,
        "average_score": avg_score,
        "total_minutes": total_minutes,
        "health_score": (completion_rate * 50) + (avg_score / 100 * 50)
    })

activity.groupby("track").apply(track_health)

apply() is flexible, but it can be slower and easier to misuse.

Prefer these in most cases:

.agg() for summary tables
.transform() for row-level features
.filter() for keeping or removing full groups

Use .apply() when the logic does not fit those patterns.

17. Top Records Inside Each Group

Find top 2 quiz attempts inside each track.

python

activity.sort_values(["track", "quiz_score"], ascending=[True, False]) \
    .groupby("track") \
    .head(2)

Find the top order from each course:

python

revenue.sort_values(["course", "amount"], ascending=[True, False]) \
    .groupby("course") \
    .head(1)

Another approach is ranking:

python

activity["score_rank_in_track"] = activity.groupby("track")["quiz_score"].rank(
    method="dense",
    ascending=False
)

activity[activity["score_rank_in_track"] <= 2].sort_values(["track", "score_rank_in_track"])

Ranking is better when you want to keep the rank column for later analysis.

18. Handle Missing Values In Groups

Find missing quiz scores by track:

python

activity["quiz_score"].isna().groupby(activity["track"]).sum()

Fill missing quiz score with the track average:

python

activity["quiz_score_filled"] = activity["quiz_score"].fillna(
    activity.groupby("track")["quiz_score"].transform("mean")
)

Check again:

python

activity[["learner", "track", "quiz_score", "quiz_score_filled"]]

This is more thoughtful than filling every missing score with the global mean, because each track may have a different difficulty level.

19. Group Missing Categories With `dropna=False`

By default, Pandas may exclude missing group labels.

If you want missing values to appear as a group, use dropna=False.

python

activity.groupby("mentor", dropna=False)["learner_id"].nunique()

This can be useful when missing values have business meaning, such as learners who have not been assigned a mentor yet.

20. Time-Based Grouping

The revenue dataset has an order date.

Create a month column:

python

revenue["month"] = revenue["ordered_at"].dt.to_period("M").astype(str)

Revenue by month:

python

revenue.groupby("month", as_index=False)["amount"].sum()

Revenue by month and category:

python

revenue.groupby(["month", "category"], as_index=False)["amount"].sum()

You can also use pd.Grouper:

python

revenue.groupby(pd.Grouper(key="ordered_at", freq="M"))["amount"].sum()

Use a date column when the question is about time.

Use normal category columns when the question is about segments.

21. Mini Project 1: Learner Engagement Dashboard

Build a summary table with:

track name
unique learners
total minutes watched
average quiz score
completion rate
number of mentors

Solution:

python

learner_dashboard = activity.groupby("track", as_index=False).agg(
    unique_learners=("learner_id", "nunique"),
    total_minutes=("minutes_watched", "sum"),
    average_quiz_score=("quiz_score", "mean"),
    completion_rate=("status", lambda s: (s == "completed").mean()),
    mentors=("mentor", "nunique")
)

learner_dashboard["average_quiz_score"] = learner_dashboard["average_quiz_score"].round(2)
learner_dashboard["completion_rate"] = (learner_dashboard["completion_rate"] * 100).round(2)

learner_dashboard.sort_values("total_minutes", ascending=False)

22. Mini Project 2: Course Revenue Dashboard

Build a summary table with:

course
category
total orders
paid orders
total revenue
average order amount
total discount
paid conversion rate

Solution:

python

course_revenue_dashboard = revenue.groupby(["course", "category"], as_index=False).agg(
    total_orders=("order_id", "nunique"),
    paid_orders=("paid", lambda s: s.sum()),
    total_revenue=("amount", "sum"),
    average_order_amount=("amount", "mean"),
    total_discount=("discount", "sum")
)

course_revenue_dashboard["paid_conversion_rate"] = (
    course_revenue_dashboard["paid_orders"] / course_revenue_dashboard["total_orders"] * 100
).round(2)

course_revenue_dashboard["average_order_amount"] = course_revenue_dashboard["average_order_amount"].round(2)

course_revenue_dashboard.sort_values("total_revenue", ascending=False)

23. Mini Project 3: Find Learners Who Need Help

A learner may need support if:

their filled quiz score is below the track average
they have at least one in-progress lesson
their total minutes are below 100

Solution:

python

support_data = activity.copy()

support_data["quiz_score_filled"] = support_data["quiz_score"].fillna(
    support_data.groupby("track")["quiz_score"].transform("mean")
)

support_data["track_avg_score"] = support_data.groupby("track")["quiz_score_filled"].transform("mean")
support_data["learner_total_minutes"] = support_data.groupby("learner_id")["minutes_watched"].transform("sum")
support_data["has_in_progress"] = support_data.groupby("learner_id")["status"].transform(
    lambda s: (s == "in_progress").any()
)

needs_help = support_data[
    (support_data["quiz_score_filled"] < support_data["track_avg_score"]) &
    (support_data["has_in_progress"]) &
    (support_data["learner_total_minutes"] < 100)
]

needs_help[["learner_id", "learner", "track", "learner_total_minutes"]].drop_duplicates()

24. Common Mistakes

Mistake 1: Grouping the whole DataFrame when you need one column

This works:

python

activity.groupby("track").sum(numeric_only=True)

Explanation

Groups the activity DataFrame by the "track" column to aggregate records with the same track value
Calculates the sum of all numeric columns for each track group using the sum() method
The numeric_only=True parameter ensures only numerical data types are included in the summation operation
Returns a new DataFrame where each row represents a unique track with summed values for numeric fields
This approach efficiently aggregates quantitative metrics across different track categories

But this is often clearer:

python

activity.groupby("track")["minutes_watched"].sum()

Explanation

Groups the activity DataFrame by the "track" column to organize data by musical tracks
Calculates the sum of "minutes_watched" for each track group to determine total viewing time
Returns a Series with track names as index and aggregated minute totals as values
Useful for identifying most/least popular tracks based on cumulative watching duration
Common pattern for data aggregation in pandas when analyzing user engagement metrics

Select the column you actually need.

Mistake 2: Confusing `size()` and `count()`

python

activity.groupby("track").size()

Explanation

Groups the activity dataset by the "track" column to aggregate related records
Uses the size() method to count how many entries exist for each unique track value
Returns a pandas Series with track identifiers as the index and their respective counts as values
Commonly used for understanding distribution patterns across different tracks or categories
Provides quick insight into which tracks have the most activity or are most frequently occurring

counts rows.

python

activity.groupby("track")["quiz_score"].count()

Explanation

Groups data by the "track" column to organize records by track category
Calculates the count of quiz scores for each track group using the count() method
Returns a Series with track names as index and corresponding quiz submission counts as values
Useful for analyzing participation rates across different tracks or categories
Demonstrates pandas groupby aggregation pattern for counting observations within groups

counts non-missing quiz scores.

Mistake 3: Using `apply()` for everything

This is usually better:

python

activity.groupby("track", as_index=False).agg(
    total_minutes=("minutes_watched", "sum"),
    average_score=("quiz_score", "mean")
)

Explanation

Groups user activity records by the "track" column to analyze performance across different learning paths
Calculates the total minutes watched for each track using the sum aggregation function on the "minutes_watched" column
Computes the mean quiz score for each track using the mean aggregation function on the "quiz_score" column
Returns a new DataFrame with track names as the grouping key and two aggregated metrics as columns
Uses as_index=False to keep the track column as a regular column rather than making it the index

Use apply() only when aggregation, transformation, and filtering are not enough.

Mistake 4: Forgetting `reset_index()`

This returns a Series with track as the index:

python

activity.groupby("track")["minutes_watched"].sum()

Explanation

Groups the activity DataFrame by the "track" column to organize data by musical tracks
Calculates the sum of "minutes_watched" for each track group to determine total viewing time
Returns a Series with track names as index and aggregated minute totals as values
Useful for identifying most/least popular tracks based on cumulative watching duration
Common pattern for data aggregation in pandas when analyzing user engagement metrics

This returns a DataFrame with track as a normal column:

python

activity.groupby("track")["minutes_watched"].sum().reset_index()

Explanation

Groups the activity DataFrame by the "track" column to organize records by track category
Calculates the total minutes watched for each track group using the sum aggregation function
Resets the index to convert the grouped result back into a regular DataFrame format
Returns a new DataFrame with two columns: "track" and "minutes_watched" showing aggregated totals per track
Useful for analyzing viewing patterns and identifying popular content categories

For dashboards, APIs, and CSV exports, normal columns are often easier.

25. Practice Questions

Try these without looking at the solutions first.

Practice Task

Q1. Which track has the highest total minutes watched?

python

activity.groupby("track")["minutes_watched"].sum().sort_values(ascending=False).head(1)

Explanation

Groups activity data by the "track" column to aggregate viewing statistics per track
Calculates the sum of "minutes_watched" for each track group to determine total viewing time
Sorts the aggregated results in descending order to rank tracks by total minutes watched
Returns only the top result using head(1) to identify the track with the highest viewing time
This approach efficiently identifies the most popular track based on cumulative watching duration

Practice Task

Q2. Which plan has the highest average quiz score?

python

activity.groupby("plan")["quiz_score"].mean().sort_values(ascending=False)

Explanation

Groups activity data by the "plan" column to organize records by different plan types
Calculates the mean quiz score for each plan group using the "quiz_score" column
Sorts the resulting averages in descending order to show highest performing plans first
Returns a pandas Series with plan names as index and their corresponding average quiz scores as values
Useful for identifying which plan type correlates with better quiz performance among users

Practice Task

Q3. How many unique learners are there in each city?

python

activity.groupby("city")["learner_id"].nunique().sort_values(ascending=False)

Explanation

Groups the activity dataset by the "city" column to aggregate data per location
Counts the number of unique learner IDs within each city group using nunique()
Sorts the resulting counts in descending order to show cities with the most learners first
Provides a quick overview of learner distribution across different geographic locations
Useful for identifying which cities have the highest engagement levels in the activity program

Practice Task

Q4. What is the completion count for every track and plan?

python

activity[activity["status"] == "completed"] \
    .groupby(["track", "plan"]) \
    .size() \
    .reset_index(name="completed_count")

Explanation

Filters the activity DataFrame to include only rows where status equals "completed"
Groups the filtered data by both "track" and "plan" columns to create aggregated categories
Counts the number of records in each group using the size() method
Resets the index to convert the grouped result back to a regular DataFrame with proper column names
Creates a new column called "completed_count" to store the aggregated counts for each track-plan combination

Practice Task

Q5. Which course has the highest average paid order amount?

python

revenue[revenue["paid"] == True] \
    .groupby("course")["amount"] \
    .mean() \
    .sort_values(ascending=False)

Explanation

Filters the revenue dataset to include only rows where the "paid" column equals True, selecting only completed transactions
Groups the filtered data by the "course" column to organize payments by course category
Calculates the mean (average) payment amount for each course group using the "amount" column
Sorts the resulting averages in descending order to show courses with highest average payments first

Practice Task

Q6. Which city generated the highest total revenue?

python

revenue.groupby("city")["amount"].sum().sort_values(ascending=False).head(1)

Explanation

Groups revenue data by city column and calculates the sum of amounts for each city
Sorts the resulting sums in descending order to rank cities by total revenue
Returns only the top result using head(1) to identify the city with maximum revenue
This approach efficiently aggregates and ranks categorical data to find peak performance metrics
Commonly used for business analytics to identify best performing locations or segments

Practice Task

Q7. What is each course's discount percentage?

python

discount_summary = revenue.groupby("course", as_index=False).agg(
    total_amount=("amount", "sum"),
    total_discount=("discount", "sum")
)

discount_summary["discount_pct"] = (
    discount_summary["total_discount"] /
    (discount_summary["total_amount"] + discount_summary["total_discount"]) *
    100
).round(2)

discount_summary

Explanation

Groups revenue data by course category and calculates total amount and total discount for each course
Computes the discount percentage by dividing total discount by the sum of total amount and total discount, then multiplies by 100
Rounds the calculated discount percentage to two decimal places for better readability
Returns the final summary dataframe containing courses, their total amounts, total discounts, and corresponding discount percentages

Practice Task

Q8. Find the top learner by total minutes watched in each track.

python

learner_track_minutes = activity.groupby(
    ["track", "learner_id", "learner"],
    as_index=False
)["minutes_watched"].sum()

learner_track_minutes.sort_values(
    ["track", "minutes_watched"],
    ascending=[True, False]
).groupby("track").head(1)

Explanation

Groups activity data by track, learner_id, and learner name to aggregate total minutes watched per learner per track
Sorts the grouped data by track name ascending and minutes watched descending to rank learners within each track
Uses groupby with head(1) to select only the top learner (highest minutes watched) from each track group
The final result shows the learner who watched the most minutes for each content track
This approach efficiently finds the most engaged learner per track without requiring explicit ranking functions

Practice Task

Q9. Add a column showing each learner's percentage of total minutes inside their track.

python

activity["track_total_minutes"] = activity.groupby("track")["minutes_watched"].transform("sum")
activity["minutes_share_in_track"] = (
    activity["minutes_watched"] / activity["track_total_minutes"] * 100
).round(2)

activity[["learner", "track", "lesson", "minutes_watched", "minutes_share_in_track"]].head()

Explanation

Groups the activity data by "track" column and calculates the total minutes watched for each track using transform with sum aggregation
Creates a new column that represents the percentage share of minutes watched relative to the total minutes in each track
Rounds the percentage values to 2 decimal places for better readability
Displays selected columns including learner, track, lesson, minutes watched, and the calculated percentage share
Shows the first few rows of the resulting dataframe with the new percentage calculation included

Practice Task

Q10. Keep only courses that have at least 3 paid orders.

python

revenue[revenue["paid"] == True] \
    .groupby("course") \
    .filter(lambda group: len(group) >= 3)

Explanation

Filters the revenue DataFrame to include only rows where the "paid" column equals True, selecting only paid transactions
Groups the filtered data by the "course" column to aggregate information per course
Applies a filter function that retains only groups (courses) containing 3 or more records (students/enrollments)
Returns a subset of the original DataFrame containing only paid course entries that meet the minimum enrollment requirement of 3 participants
This operation helps identify popular paid courses with sufficient student participation for analysis or reporting purposes

df.groupby(["column_1", "column_2"])

Explanation

Groups rows in a pandas DataFrame based on unique combinations of values in column_1 and column_2
Creates separate groups for each distinct pair of values from the specified columns
Enables subsequent aggregation operations like sum, mean, or count within each group
Returns a GroupBy object that can be used with aggregation methods to compute statistics per group
Commonly used for categorical data analysis and creating summary tables by dimension combinations

10. How do you get one specific group?

Use get_group():

python

activity.groupby("track").get_group("Pandas")

Explanation

The code uses pandas groupby() function to partition data by the "track" column into separate groups
It then applies get_group() method to extract only the subset of data where track equals "Pandas"
This approach efficiently filters large datasets to focus on specific categorical values
The operation returns a DataFrame containing all rows that belong to the "Pandas" track category
This pattern is commonly used for analyzing specific segments within grouped data structures

For multiple keys, pass a tuple:

python

activity.groupby(["track", "plan"]).get_group(("Pandas", "premium"))

Explanation

The code uses pandas groupby() method to partition data into groups based on two categorical columns: "track" and "plan"
It applies the get_group() method to specifically extract the subset of data where track equals "Pandas" and plan equals "premium"
This approach enables efficient filtering and analysis of specific combinations within grouped datasets
The resulting output contains all rows that match both criteria simultaneously from the original activity dataframe
This pattern is commonly used for segmenting business data by product tiers and user categories

27. Final Mental Model

Use this simple decision table:

Need	Use
One summary row per group	`agg()`
Group result attached back to original rows	`transform()`
Keep or remove full groups	`filter()`
Custom logic that returns flexible output	`apply()`
Count rows	`size()`
Count non-missing values	`count()`
Count distinct values	`nunique()`
Make grouped keys normal columns	`as_index=False` or `reset_index()`

If you remember only one thing, remember this:

GroupBy is not just for totals. It is a complete pattern for turning row-level data into segment-level insight.

Official References

Pandas GroupBy user guide: https://pandas.pydata.org/docs/user_guide/groupby.html
Pandas DataFrame.groupby API reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
Pandas GroupBy API reference: https://pandas.pydata.org/docs/reference/groupby.html

Pandas GroupBy: Split, Aggregate, Transform & Filter

AI Insights

Q1. Which track has the highest total minutes watched?

Q2. Which plan has the highest average quiz score?

Q3. How many unique learners are there in each city?

Q4. What is the completion count for every track and plan?

Q5. Which course has the highest average paid order amount?

Q6. Which city generated the highest total revenue?

Q7. What is each course's discount percentage?

Q8. Find the top learner by total minutes watched in each track.

Q9. Add a column showing each learner's percentage of total minutes inside their track.

Q10. Keep only courses that have at least 3 paid orders.