#Pandas#pythonIntermediate

Pandas GroupBy: Split, Aggregate, Transform & Filter

Jun 2, 2026
44 min read

AI Insights

Powered by GPT-4o-mini

Verified Context: pandas-groupby-split-aggregate-transform-filter
Quick Answer

Learn Pandas GroupBy with original CSV datasets: split data into groups, aggregate metrics, compare size and count, group by multiple columns, use named aggregations, transform group metrics back to rows, filter groups, find top records per group, and solve practice dashboards.

Quick Summary

Learn how to effectively use Pandas GroupBy for data analysis. Master split, aggregate, transform, and filter techniques with practical examples.

Pandas GroupBy: Split, Aggregate, Transform, Filter, and Practice

Most real datasets are not useful because of individual rows alone.

They become useful when you ask questions like:

  • Which track has the highest average quiz score?
  • Which city brings the most paid learners?
  • Which plan has the strongest completion rate?
  • Which course generated the highest revenue?
  • Which learner is above or below the average of their own track?
  • Which groups should be kept or removed based on business rules?

This is where groupby() becomes one of the most important Pandas tools.

This guide uses original sample datasets created for this lesson. It does not use copied course, movie, sports, or public dataset examples.

Files used in this lesson:

  • learner_groupby_activity.csv
  • course_revenue_groupby.csv

Place both CSV files in the same folder as this Markdown file before running the examples.

What You Will Learn

By the end, you should be able to use:

  • groupby()
  • sum, mean, median, min, max, std, and var
  • size() versus count()
  • nunique()
  • first(), last(), and nth()
  • groups, ngroups, and get_group()
  • multiple grouping columns
  • .agg() with lists and dictionaries
  • named aggregation
  • as_index=False and reset_index()
  • transform()
  • filter()
  • apply() for custom group logic
  • top-N records inside each group
  • group-wise percentages
  • group-wise missing-value handling

1. Setup

python
import pandas as pd
import numpy as np

Load the learner activity dataset:

python
activity = pd.read_csv("learner_groupby_activity.csv")
activity.head()

Load the course revenue dataset:

python
revenue = pd.read_csv("course_revenue_groupby.csv", parse_dates=["ordered_at"])
revenue.head()

Quickly inspect both datasets:

python
activity.shape, revenue.shape
python
activity.info()
python
revenue.info()

2. What GroupBy Means

groupby() follows a common pattern:

  1. Split the DataFrame into groups.
  2. Apply a calculation to each group.
  3. Combine the results into a new object.

For example:

python
activity.groupby("track")["minutes_watched"].sum()

This means:

  • split rows by track
  • take the minutes_watched column inside each track
  • calculate total minutes for each track
  • combine the totals into one Series

The result is not one row per learner. It is one row per group.

3. Create a GroupBy Object

python
track_groups = activity.groupby("track")
track_groups

The output may not look exciting because a GroupBy object is lazy.

Pandas has not calculated the final result yet. It is holding the grouping instructions until you ask for an aggregation, transformation, filter, or custom operation.

Check the number of groups:

python
track_groups.ngroups

Check the group labels:

python
list(track_groups.groups.keys())

The groups attribute gives the row indexes that belong to each group:

python
track_groups.groups

Get only one group:

python
track_groups.get_group("Pandas")

This is useful while debugging because you can inspect what Pandas considers one group.

4. Basic Aggregation

Find total watch time for each track:

python
activity.groupby("track")["minutes_watched"].sum()

Find average quiz score for each track:

python
activity.groupby("track")["quiz_score"].mean()

Find highest quiz score in each track:

python
activity.groupby("track")["quiz_score"].max()

Find lowest quiz score in each track:

python
activity.groupby("track")["quiz_score"].min()

Find multiple statistics:

python
activity.groupby("track")["quiz_score"].agg(["min", "mean", "median", "max"])

When you aggregate one column, the result is usually a Series.

When you aggregate multiple columns or multiple functions, the result becomes a DataFrame.

5. size() Versus count()

This difference matters in interviews and real projects.

size() counts rows in each group:

python
activity.groupby("track").size()

count() counts non-missing values in each column:

python
activity.groupby("track").count()

Example:

python
activity.groupby("track")["quiz_score"].count()

If a group has 5 rows but one missing quiz score, size() returns 5, while count() on quiz_score returns 4.

Use size() when you want row count.

Use count() when you want non-null value count.

6. Count Unique Values Per Group

Find how many unique learners are in each track:

python
activity.groupby("track")["learner_id"].nunique()

Find how many cities each track has learners from:

python
activity.groupby("track")["city"].nunique()

Find how many unique lessons each track contains:

python
activity.groupby("track")["lesson"].nunique()

nunique() is useful for metrics like:

  • unique learners
  • unique buyers
  • unique cities
  • unique products
  • unique sessions

7. First, Last, and Nth Row in Every Group

Sort the data first so "first" and "last" mean something:

python
activity_sorted = activity.sort_values(["track", "completed_at"])

First row per track:

python
activity_sorted.groupby("track").first()

Last row per track:

python
activity_sorted.groupby("track").last()

Second row per track:

python
activity_sorted.groupby("track").nth(1)

nth(1) uses zero-based position, so it returns the second row in each group.

8. Group By Multiple Columns

You can group by more than one column by passing a list.

Average score by track and plan:

python
activity.groupby(["track", "plan"])["quiz_score"].mean()

Total minutes by track and status:

python
activity.groupby(["track", "status"])["minutes_watched"].sum()

Revenue by course and city:

python
revenue.groupby(["course", "city"])["amount"].sum()

The result uses a MultiIndex by default.

If you want normal columns, call reset_index():

python
revenue.groupby(["course", "city"])["amount"].sum().reset_index()

Or use as_index=False:

python
revenue.groupby(["course", "city"], as_index=False)["amount"].sum()

For beginner-friendly dashboards, as_index=False is often easier.

9. Sort GroupBy Results

Find the top tracks by total watch time:

python
activity.groupby("track")["minutes_watched"].sum().sort_values(ascending=False)

Find the top courses by paid revenue:

python
paid_revenue = revenue[revenue["paid"] == True]

paid_revenue.groupby("course")["amount"].sum().sort_values(ascending=False)

Find the top cities by number of paid orders:

python
paid_revenue.groupby("city").size().sort_values(ascending=False)

GroupBy results are often more useful after sorting.

10. Use .agg() With Multiple Columns

Create a track summary:

python
activity.groupby("track").agg({
    "learner_id": "nunique",
    "minutes_watched": "sum",
    "quiz_score": "mean",
    "lesson": "count"
})

This works, but the column names are not always friendly.

For cleaner output, use named aggregation.

11. Named Aggregation

Named aggregation lets you choose the output column names.

python
track_summary = activity.groupby("track", as_index=False).agg(
    learners=("learner_id", "nunique"),
    lessons_attempted=("lesson", "count"),
    total_minutes=("minutes_watched", "sum"),
    average_score=("quiz_score", "mean"),
    completed_lessons=("status", lambda s: (s == "completed").sum())
)

track_summary

Round the average:

python
track_summary["average_score"] = track_summary["average_score"].round(2)
track_summary

Create a course revenue summary:

python
course_summary = revenue.groupby("course", as_index=False).agg(
    orders=("order_id", "nunique"),
    buyers=("learner_id", "nunique"),
    gross_revenue=("amount", "sum"),
    average_discount=("discount", "mean"),
    paid_orders=("paid", lambda s: s.sum())
)

course_summary.sort_values("gross_revenue", ascending=False)

Named aggregation is usually the cleanest style for project code.

12. Grouped Value Counts

Find lesson status distribution inside each track:

python
activity.groupby("track")["status"].value_counts()

Convert it into a table:

python
activity.groupby("track")["status"].value_counts().unstack(fill_value=0)

Find plan mix inside each course:

python
revenue.groupby("course")["plan"].value_counts().unstack(fill_value=0)

This is useful for quick cross-tab style analysis.

13. Percentages Within Each Group

Suppose you want each order's share of total course revenue.

First calculate total revenue per course and attach it back to every row:

python
revenue["course_total_amount"] = revenue.groupby("course")["amount"].transform("sum")

Now calculate the row's percentage contribution:

python
revenue["course_revenue_share"] = revenue["amount"] / revenue["course_total_amount"]

revenue[["order_id", "course", "amount", "course_total_amount", "course_revenue_share"]].head()

Format as percentage:

python
revenue["course_revenue_share_pct"] = (revenue["course_revenue_share"] * 100).round(2)

The key idea is that transform() returns a result with the same number of rows as the original DataFrame.

That makes it perfect for adding group-level metrics back into row-level data.

14. transform() For Group-Wise Features

Add average score of each learner's track:

python
activity["track_average_score"] = activity.groupby("track")["quiz_score"].transform("mean")

Compare each row to its track average:

python
activity["score_vs_track_average"] = activity["quiz_score"] - activity["track_average_score"]

Show the comparison:

python
activity[[
    "learner",
    "track",
    "lesson",
    "quiz_score",
    "track_average_score",
    "score_vs_track_average"
]].head()

Add each learner's total minutes:

python
activity["learner_total_minutes"] = activity.groupby("learner_id")["minutes_watched"].transform("sum")

Find learners with high total effort:

python
activity[activity["learner_total_minutes"] >= 120][
    ["learner_id", "learner", "track", "learner_total_minutes"]
].drop_duplicates()

Use transform() when you need group results while keeping the original rows.

15. Filter Groups

filter() keeps or removes entire groups based on a condition.

Keep only tracks with at least 4 rows:

python
activity.groupby("track").filter(lambda group: len(group) >= 4)

Keep only tracks with at least 3 unique learners:

python
activity.groupby("track").filter(
    lambda group: group["learner_id"].nunique() >= 3
)

Keep only courses with total revenue of at least 4000:

python
revenue.groupby("course").filter(
    lambda group: group["amount"].sum() >= 4000
)

Important: filter() returns original rows, not a summary table.

16. Apply Custom Group Logic

Use apply() when built-in aggregation or transformation is not enough.

Example: create a custom summary for each track.

python
def track_health(group):
    completion_rate = (group["status"] == "completed").mean()
    avg_score = group["quiz_score"].mean()
    total_minutes = group["minutes_watched"].sum()

    return pd.Series({
        "completion_rate": completion_rate,
        "average_score": avg_score,
        "total_minutes": total_minutes,
        "health_score": (completion_rate * 50) + (avg_score / 100 * 50)
    })

activity.groupby("track").apply(track_health)

apply() is flexible, but it can be slower and easier to misuse.

Prefer these in most cases:

  • .agg() for summary tables
  • .transform() for row-level features
  • .filter() for keeping or removing full groups

Use .apply() when the logic does not fit those patterns.

17. Top Records Inside Each Group

Find top 2 quiz attempts inside each track.

python
activity.sort_values(["track", "quiz_score"], ascending=[True, False]) \
    .groupby("track") \
    .head(2)

Find the top order from each course:

python
revenue.sort_values(["course", "amount"], ascending=[True, False]) \
    .groupby("course") \
    .head(1)

Another approach is ranking:

python
activity["score_rank_in_track"] = activity.groupby("track")["quiz_score"].rank(
    method="dense",
    ascending=False
)

activity[activity["score_rank_in_track"] <= 2].sort_values(["track", "score_rank_in_track"])

Ranking is better when you want to keep the rank column for later analysis.

18. Handle Missing Values In Groups

Find missing quiz scores by track:

python
activity["quiz_score"].isna().groupby(activity["track"]).sum()

Fill missing quiz score with the track average:

python
activity["quiz_score_filled"] = activity["quiz_score"].fillna(
    activity.groupby("track")["quiz_score"].transform("mean")
)

Check again:

python
activity[["learner", "track", "quiz_score", "quiz_score_filled"]]

This is more thoughtful than filling every missing score with the global mean, because each track may have a different difficulty level.

19. Group Missing Categories With dropna=False

By default, Pandas may exclude missing group labels.

If you want missing values to appear as a group, use dropna=False.

python
activity.groupby("mentor", dropna=False)["learner_id"].nunique()

This can be useful when missing values have business meaning, such as learners who have not been assigned a mentor yet.

20. Time-Based Grouping

The revenue dataset has an order date.

Create a month column:

python
revenue["month"] = revenue["ordered_at"].dt.to_period("M").astype(str)

Revenue by month:

python
revenue.groupby("month", as_index=False)["amount"].sum()

Revenue by month and category:

python
revenue.groupby(["month", "category"], as_index=False)["amount"].sum()

You can also use pd.Grouper:

python
revenue.groupby(pd.Grouper(key="ordered_at", freq="M"))["amount"].sum()

Use a date column when the question is about time.

Use normal category columns when the question is about segments.

21. Mini Project 1: Learner Engagement Dashboard

Build a summary table with:

  • track name
  • unique learners
  • total minutes watched
  • average quiz score
  • completion rate
  • number of mentors

Solution:

python
learner_dashboard = activity.groupby("track", as_index=False).agg(
    unique_learners=("learner_id", "nunique"),
    total_minutes=("minutes_watched", "sum"),
    average_quiz_score=("quiz_score", "mean"),
    completion_rate=("status", lambda s: (s == "completed").mean()),
    mentors=("mentor", "nunique")
)

learner_dashboard["average_quiz_score"] = learner_dashboard["average_quiz_score"].round(2)
learner_dashboard["completion_rate"] = (learner_dashboard["completion_rate"] * 100).round(2)

learner_dashboard.sort_values("total_minutes", ascending=False)

22. Mini Project 2: Course Revenue Dashboard

Build a summary table with:

  • course
  • category
  • total orders
  • paid orders
  • total revenue
  • average order amount
  • total discount
  • paid conversion rate

Solution:

python
course_revenue_dashboard = revenue.groupby(["course", "category"], as_index=False).agg(
    total_orders=("order_id", "nunique"),
    paid_orders=("paid", lambda s: s.sum()),
    total_revenue=("amount", "sum"),
    average_order_amount=("amount", "mean"),
    total_discount=("discount", "sum")
)

course_revenue_dashboard["paid_conversion_rate"] = (
    course_revenue_dashboard["paid_orders"] / course_revenue_dashboard["total_orders"] * 100
).round(2)

course_revenue_dashboard["average_order_amount"] = course_revenue_dashboard["average_order_amount"].round(2)

course_revenue_dashboard.sort_values("total_revenue", ascending=False)

23. Mini Project 3: Find Learners Who Need Help

A learner may need support if:

  • their filled quiz score is below the track average
  • they have at least one in-progress lesson
  • their total minutes are below 100

Solution:

python
support_data = activity.copy()

support_data["quiz_score_filled"] = support_data["quiz_score"].fillna(
    support_data.groupby("track")["quiz_score"].transform("mean")
)

support_data["track_avg_score"] = support_data.groupby("track")["quiz_score_filled"].transform("mean")
support_data["learner_total_minutes"] = support_data.groupby("learner_id")["minutes_watched"].transform("sum")
support_data["has_in_progress"] = support_data.groupby("learner_id")["status"].transform(
    lambda s: (s == "in_progress").any()
)

needs_help = support_data[
    (support_data["quiz_score_filled"] < support_data["track_avg_score"]) &
    (support_data["has_in_progress"]) &
    (support_data["learner_total_minutes"] < 100)
]

needs_help[["learner_id", "learner", "track", "learner_total_minutes"]].drop_duplicates()

24. Common Mistakes

Mistake 1: Grouping the whole DataFrame when you need one column

This works:

python
activity.groupby("track").sum(numeric_only=True)

Explanation

  • Groups the activity DataFrame by the "track" column to aggregate records with the same track value
  • Calculates the sum of all numeric columns for each track group using the sum() method
  • The numeric_only=True parameter ensures only numerical data types are included in the summation operation
  • Returns a new DataFrame where each row represents a unique track with summed values for numeric fields
  • This approach efficiently aggregates quantitative metrics across different track categories

But this is often clearer:

python
activity.groupby("track")["minutes_watched"].sum()

Explanation

  • Groups the activity DataFrame by the "track" column to organize data by musical tracks
  • Calculates the sum of "minutes_watched" for each track group to determine total viewing time
  • Returns a Series with track names as index and aggregated minute totals as values
  • Useful for identifying most/least popular tracks based on cumulative watching duration
  • Common pattern for data aggregation in pandas when analyzing user engagement metrics

Select the column you actually need.

Mistake 2: Confusing size() and count()

python
activity.groupby("track").size()

Explanation

  • Groups the activity dataset by the "track" column to aggregate related records
  • Uses the size() method to count how many entries exist for each unique track value
  • Returns a pandas Series with track identifiers as the index and their respective counts as values
  • Commonly used for understanding distribution patterns across different tracks or categories
  • Provides quick insight into which tracks have the most activity or are most frequently occurring

counts rows.

python
activity.groupby("track")["quiz_score"].count()

Explanation

  • Groups data by the "track" column to organize records by track category
  • Calculates the count of quiz scores for each track group using the count() method
  • Returns a Series with track names as index and corresponding quiz submission counts as values
  • Useful for analyzing participation rates across different tracks or categories
  • Demonstrates pandas groupby aggregation pattern for counting observations within groups

counts non-missing quiz scores.

Mistake 3: Using apply() for everything

This is usually better:

python
activity.groupby("track", as_index=False).agg(
    total_minutes=("minutes_watched", "sum"),
    average_score=("quiz_score", "mean")
)

Explanation

  • Groups user activity records by the "track" column to analyze performance across different learning paths
  • Calculates the total minutes watched for each track using the sum aggregation function on the "minutes_watched" column
  • Computes the mean quiz score for each track using the mean aggregation function on the "quiz_score" column
  • Returns a new DataFrame with track names as the grouping key and two aggregated metrics as columns
  • Uses as_index=False to keep the track column as a regular column rather than making it the index

Use apply() only when aggregation, transformation, and filtering are not enough.

Mistake 4: Forgetting reset_index()

This returns a Series with track as the index:

python
activity.groupby("track")["minutes_watched"].sum()

Explanation

  • Groups the activity DataFrame by the "track" column to organize data by musical tracks
  • Calculates the sum of "minutes_watched" for each track group to determine total viewing time
  • Returns a Series with track names as index and aggregated minute totals as values
  • Useful for identifying most/least popular tracks based on cumulative watching duration
  • Common pattern for data aggregation in pandas when analyzing user engagement metrics

This returns a DataFrame with track as a normal column:

python
activity.groupby("track")["minutes_watched"].sum().reset_index()

Explanation

  • Groups the activity DataFrame by the "track" column to organize records by track category
  • Calculates the total minutes watched for each track group using the sum aggregation function
  • Resets the index to convert the grouped result back into a regular DataFrame format
  • Returns a new DataFrame with two columns: "track" and "minutes_watched" showing aggregated totals per track
  • Useful for analyzing viewing patterns and identifying popular content categories

For dashboards, APIs, and CSV exports, normal columns are often easier.

25. Practice Questions

Try these without looking at the solutions first.

Practice Task

Q1. Which track has the highest total minutes watched?

python
activity.groupby("track")["minutes_watched"].sum().sort_values(ascending=False).head(1)

Explanation

  • Groups activity data by the "track" column to aggregate viewing statistics per track
  • Calculates the sum of "minutes_watched" for each track group to determine total viewing time
  • Sorts the aggregated results in descending order to rank tracks by total minutes watched
  • Returns only the top result using head(1) to identify the track with the highest viewing time
  • This approach efficiently identifies the most popular track based on cumulative watching duration

Practice Task

Q2. Which plan has the highest average quiz score?

python
activity.groupby("plan")["quiz_score"].mean().sort_values(ascending=False)

Explanation

  • Groups activity data by the "plan" column to organize records by different plan types
  • Calculates the mean quiz score for each plan group using the "quiz_score" column
  • Sorts the resulting averages in descending order to show highest performing plans first
  • Returns a pandas Series with plan names as index and their corresponding average quiz scores as values
  • Useful for identifying which plan type correlates with better quiz performance among users

Practice Task

Q3. How many unique learners are there in each city?

python
activity.groupby("city")["learner_id"].nunique().sort_values(ascending=False)

Explanation

  • Groups the activity dataset by the "city" column to aggregate data per location
  • Counts the number of unique learner IDs within each city group using nunique()
  • Sorts the resulting counts in descending order to show cities with the most learners first
  • Provides a quick overview of learner distribution across different geographic locations
  • Useful for identifying which cities have the highest engagement levels in the activity program

Practice Task

Q4. What is the completion count for every track and plan?

python
activity[activity["status"] == "completed"] \
    .groupby(["track", "plan"]) \
    .size() \
    .reset_index(name="completed_count")

Explanation

  • Filters the activity DataFrame to include only rows where status equals "completed"
  • Groups the filtered data by both "track" and "plan" columns to create aggregated categories
  • Counts the number of records in each group using the size() method
  • Resets the index to convert the grouped result back to a regular DataFrame with proper column names
  • Creates a new column called "completed_count" to store the aggregated counts for each track-plan combination

Practice Task

Q5. Which course has the highest average paid order amount?

python
revenue[revenue["paid"] == True] \
    .groupby("course")["amount"] \
    .mean() \
    .sort_values(ascending=False)

Explanation

  • Filters the revenue dataset to include only rows where the "paid" column equals True, selecting only completed transactions
  • Groups the filtered data by the "course" column to organize payments by course category
  • Calculates the mean (average) payment amount for each course group using the "amount" column
  • Sorts the resulting averages in descending order to show courses with highest average payments first

Practice Task

Q6. Which city generated the highest total revenue?

python
revenue.groupby("city")["amount"].sum().sort_values(ascending=False).head(1)

Explanation

  • Groups revenue data by city column and calculates the sum of amounts for each city
  • Sorts the resulting sums in descending order to rank cities by total revenue
  • Returns only the top result using head(1) to identify the city with maximum revenue
  • This approach efficiently aggregates and ranks categorical data to find peak performance metrics
  • Commonly used for business analytics to identify best performing locations or segments

Practice Task

Q7. What is each course's discount percentage?

python
discount_summary = revenue.groupby("course", as_index=False).agg(
    total_amount=("amount", "sum"),
    total_discount=("discount", "sum")
)

discount_summary["discount_pct"] = (
    discount_summary["total_discount"] /
    (discount_summary["total_amount"] + discount_summary["total_discount"]) *
    100
).round(2)

discount_summary

Explanation

  • Groups revenue data by course category and calculates total amount and total discount for each course
  • Computes the discount percentage by dividing total discount by the sum of total amount and total discount, then multiplies by 100
  • Rounds the calculated discount percentage to two decimal places for better readability
  • Returns the final summary dataframe containing courses, their total amounts, total discounts, and corresponding discount percentages

Practice Task

Q8. Find the top learner by total minutes watched in each track.

python
learner_track_minutes = activity.groupby(
    ["track", "learner_id", "learner"],
    as_index=False
)["minutes_watched"].sum()

learner_track_minutes.sort_values(
    ["track", "minutes_watched"],
    ascending=[True, False]
).groupby("track").head(1)

Explanation

  • Groups activity data by track, learner_id, and learner name to aggregate total minutes watched per learner per track
  • Sorts the grouped data by track name ascending and minutes watched descending to rank learners within each track
  • Uses groupby with head(1) to select only the top learner (highest minutes watched) from each track group
  • The final result shows the learner who watched the most minutes for each content track
  • This approach efficiently finds the most engaged learner per track without requiring explicit ranking functions

Practice Task

Q9. Add a column showing each learner's percentage of total minutes inside their track.

python
activity["track_total_minutes"] = activity.groupby("track")["minutes_watched"].transform("sum")
activity["minutes_share_in_track"] = (
    activity["minutes_watched"] / activity["track_total_minutes"] * 100
).round(2)

activity[["learner", "track", "lesson", "minutes_watched", "minutes_share_in_track"]].head()

Explanation

  • Groups the activity data by "track" column and calculates the total minutes watched for each track using transform with sum aggregation
  • Creates a new column that represents the percentage share of minutes watched relative to the total minutes in each track
  • Rounds the percentage values to 2 decimal places for better readability
  • Displays selected columns including learner, track, lesson, minutes watched, and the calculated percentage share
  • Shows the first few rows of the resulting dataframe with the new percentage calculation included

Practice Task

Q10. Keep only courses that have at least 3 paid orders.

python
revenue[revenue["paid"] == True] \
    .groupby("course") \
    .filter(lambda group: len(group) >= 3)

Explanation

  • Filters the revenue DataFrame to include only rows where the "paid" column equals True, selecting only paid transactions
  • Groups the filtered data by the "course" column to aggregate information per course
  • Applies a filter function that retains only groups (courses) containing 3 or more records (students/enrollments)
  • Returns a subset of the original DataFrame containing only paid course entries that meet the minimum enrollment requirement of 3 participants
  • This operation helps identify popular paid courses with sufficient student participation for analysis or reporting purposes

26. Interview Questions

1. What does groupby() return?

It returns a GroupBy object. The object stores grouping rules and waits for an operation such as aggregation, transformation, filtering, or apply.

2. What is the split-apply-combine idea?

Split data into groups, apply a calculation to each group, and combine the results.

3. What is the difference between size() and count()?

size() counts rows in each group. count() counts non-null values in each column or selected Series.

4. When should you use transform()?

Use transform() when you need a group-level result repeated back to the original row shape, such as adding each row's group average.

5. When should you use filter()?

Use filter() when you want to keep or remove entire groups based on a condition.

6. When should you use apply()?

Use apply() for custom group logic that does not fit normal aggregation, transformation, or filtering.

7. Why is named aggregation useful?

It lets you create readable output column names while applying different calculations to different columns.

8. Why can as_index=False be helpful?

It keeps grouping keys as normal columns instead of making them the index. This is often easier for dashboards, APIs, and exports.

9. How do you group by multiple columns?

Pass a list of columns:

python
df.groupby(["column_1", "column_2"])

Explanation

  • Groups rows in a pandas DataFrame based on unique combinations of values in column_1 and column_2
  • Creates separate groups for each distinct pair of values from the specified columns
  • Enables subsequent aggregation operations like sum, mean, or count within each group
  • Returns a GroupBy object that can be used with aggregation methods to compute statistics per group
  • Commonly used for categorical data analysis and creating summary tables by dimension combinations

10. How do you get one specific group?

Use get_group():

python
activity.groupby("track").get_group("Pandas")

Explanation

  • The code uses pandas groupby() function to partition data by the "track" column into separate groups
  • It then applies get_group() method to extract only the subset of data where track equals "Pandas"
  • This approach efficiently filters large datasets to focus on specific categorical values
  • The operation returns a DataFrame containing all rows that belong to the "Pandas" track category
  • This pattern is commonly used for analyzing specific segments within grouped data structures

For multiple keys, pass a tuple:

python
activity.groupby(["track", "plan"]).get_group(("Pandas", "premium"))

Explanation

  • The code uses pandas groupby() method to partition data into groups based on two categorical columns: "track" and "plan"
  • It applies the get_group() method to specifically extract the subset of data where track equals "Pandas" and plan equals "premium"
  • This approach enables efficient filtering and analysis of specific combinations within grouped datasets
  • The resulting output contains all rows that match both criteria simultaneously from the original activity dataframe
  • This pattern is commonly used for segmenting business data by product tiers and user categories

27. Final Mental Model

Use this simple decision table:

NeedUse
One summary row per groupagg()
Group result attached back to original rowstransform()
Keep or remove full groupsfilter()
Custom logic that returns flexible outputapply()
Count rowssize()
Count non-missing valuescount()
Count distinct valuesnunique()
Make grouped keys normal columnsas_index=False or reset_index()

If you remember only one thing, remember this:

GroupBy is not just for totals. It is a complete pattern for turning row-level data into segment-level insight.

Official References