Pandas GroupBy: Split, Aggregate, Transform, Filter, and Practice
Most real datasets are not useful because of individual rows alone.
They become useful when you ask questions like:
- Which track has the highest average quiz score?
- Which city brings the most paid learners?
- Which plan has the strongest completion rate?
- Which course generated the highest revenue?
- Which learner is above or below the average of their own track?
- Which groups should be kept or removed based on business rules?
This is where groupby() becomes one of the most important Pandas tools.
This guide uses original sample datasets created for this lesson. It does not use copied course, movie, sports, or public dataset examples.
Files used in this lesson:
learner_groupby_activity.csvcourse_revenue_groupby.csv
Place both CSV files in the same folder as this Markdown file before running the examples.
What You Will Learn
By the end, you should be able to use:
groupby()sum,mean,median,min,max,std, andvarsize()versuscount()nunique()first(),last(), andnth()groups,ngroups, andget_group()- multiple grouping columns
.agg()with lists and dictionaries- named aggregation
as_index=Falseandreset_index()transform()filter()apply()for custom group logic- top-N records inside each group
- group-wise percentages
- group-wise missing-value handling
1. Setup
import pandas as pd
import numpy as npLoad the learner activity dataset:
activity = pd.read_csv("learner_groupby_activity.csv")
activity.head()Load the course revenue dataset:
revenue = pd.read_csv("course_revenue_groupby.csv", parse_dates=["ordered_at"])
revenue.head()Quickly inspect both datasets:
activity.shape, revenue.shapeactivity.info()revenue.info()2. What GroupBy Means
groupby() follows a common pattern:
- Split the DataFrame into groups.
- Apply a calculation to each group.
- Combine the results into a new object.
For example:
activity.groupby("track")["minutes_watched"].sum()This means:
- split rows by
track - take the
minutes_watchedcolumn inside each track - calculate total minutes for each track
- combine the totals into one Series
The result is not one row per learner. It is one row per group.
3. Create a GroupBy Object
track_groups = activity.groupby("track")
track_groupsThe output may not look exciting because a GroupBy object is lazy.
Pandas has not calculated the final result yet. It is holding the grouping instructions until you ask for an aggregation, transformation, filter, or custom operation.
Check the number of groups:
track_groups.ngroupsCheck the group labels:
list(track_groups.groups.keys())The groups attribute gives the row indexes that belong to each group:
track_groups.groupsGet only one group:
track_groups.get_group("Pandas")This is useful while debugging because you can inspect what Pandas considers one group.
4. Basic Aggregation
Find total watch time for each track:
activity.groupby("track")["minutes_watched"].sum()Find average quiz score for each track:
activity.groupby("track")["quiz_score"].mean()Find highest quiz score in each track:
activity.groupby("track")["quiz_score"].max()Find lowest quiz score in each track:
activity.groupby("track")["quiz_score"].min()Find multiple statistics:
activity.groupby("track")["quiz_score"].agg(["min", "mean", "median", "max"])When you aggregate one column, the result is usually a Series.
When you aggregate multiple columns or multiple functions, the result becomes a DataFrame.
5. size() Versus count()
This difference matters in interviews and real projects.
size() counts rows in each group:
activity.groupby("track").size()count() counts non-missing values in each column:
activity.groupby("track").count()Example:
activity.groupby("track")["quiz_score"].count()If a group has 5 rows but one missing quiz score, size() returns 5, while count() on quiz_score returns 4.
Use size() when you want row count.
Use count() when you want non-null value count.
6. Count Unique Values Per Group
Find how many unique learners are in each track:
activity.groupby("track")["learner_id"].nunique()Find how many cities each track has learners from:
activity.groupby("track")["city"].nunique()Find how many unique lessons each track contains:
activity.groupby("track")["lesson"].nunique()nunique() is useful for metrics like:
- unique learners
- unique buyers
- unique cities
- unique products
- unique sessions
7. First, Last, and Nth Row in Every Group
Sort the data first so "first" and "last" mean something:
activity_sorted = activity.sort_values(["track", "completed_at"])First row per track:
activity_sorted.groupby("track").first()Last row per track:
activity_sorted.groupby("track").last()Second row per track:
activity_sorted.groupby("track").nth(1)nth(1) uses zero-based position, so it returns the second row in each group.
8. Group By Multiple Columns
You can group by more than one column by passing a list.
Average score by track and plan:
activity.groupby(["track", "plan"])["quiz_score"].mean()Total minutes by track and status:
activity.groupby(["track", "status"])["minutes_watched"].sum()Revenue by course and city:
revenue.groupby(["course", "city"])["amount"].sum()The result uses a MultiIndex by default.
If you want normal columns, call reset_index():
revenue.groupby(["course", "city"])["amount"].sum().reset_index()Or use as_index=False:
revenue.groupby(["course", "city"], as_index=False)["amount"].sum()For beginner-friendly dashboards, as_index=False is often easier.
9. Sort GroupBy Results
Find the top tracks by total watch time:
activity.groupby("track")["minutes_watched"].sum().sort_values(ascending=False)Find the top courses by paid revenue:
paid_revenue = revenue[revenue["paid"] == True]
paid_revenue.groupby("course")["amount"].sum().sort_values(ascending=False)Find the top cities by number of paid orders:
paid_revenue.groupby("city").size().sort_values(ascending=False)GroupBy results are often more useful after sorting.
10. Use .agg() With Multiple Columns
Create a track summary:
activity.groupby("track").agg({
"learner_id": "nunique",
"minutes_watched": "sum",
"quiz_score": "mean",
"lesson": "count"
})This works, but the column names are not always friendly.
For cleaner output, use named aggregation.
11. Named Aggregation
Named aggregation lets you choose the output column names.
track_summary = activity.groupby("track", as_index=False).agg(
learners=("learner_id", "nunique"),
lessons_attempted=("lesson", "count"),
total_minutes=("minutes_watched", "sum"),
average_score=("quiz_score", "mean"),
completed_lessons=("status", lambda s: (s == "completed").sum())
)
track_summaryRound the average:
track_summary["average_score"] = track_summary["average_score"].round(2)
track_summaryCreate a course revenue summary:
course_summary = revenue.groupby("course", as_index=False).agg(
orders=("order_id", "nunique"),
buyers=("learner_id", "nunique"),
gross_revenue=("amount", "sum"),
average_discount=("discount", "mean"),
paid_orders=("paid", lambda s: s.sum())
)
course_summary.sort_values("gross_revenue", ascending=False)Named aggregation is usually the cleanest style for project code.
12. Grouped Value Counts
Find lesson status distribution inside each track:
activity.groupby("track")["status"].value_counts()Convert it into a table:
activity.groupby("track")["status"].value_counts().unstack(fill_value=0)Find plan mix inside each course:
revenue.groupby("course")["plan"].value_counts().unstack(fill_value=0)This is useful for quick cross-tab style analysis.
13. Percentages Within Each Group
Suppose you want each order's share of total course revenue.
First calculate total revenue per course and attach it back to every row:
revenue["course_total_amount"] = revenue.groupby("course")["amount"].transform("sum")Now calculate the row's percentage contribution:
revenue["course_revenue_share"] = revenue["amount"] / revenue["course_total_amount"]
revenue[["order_id", "course", "amount", "course_total_amount", "course_revenue_share"]].head()Format as percentage:
revenue["course_revenue_share_pct"] = (revenue["course_revenue_share"] * 100).round(2)The key idea is that transform() returns a result with the same number of rows as the original DataFrame.
That makes it perfect for adding group-level metrics back into row-level data.
14. transform() For Group-Wise Features
Add average score of each learner's track:
activity["track_average_score"] = activity.groupby("track")["quiz_score"].transform("mean")Compare each row to its track average:
activity["score_vs_track_average"] = activity["quiz_score"] - activity["track_average_score"]Show the comparison:
activity[[
"learner",
"track",
"lesson",
"quiz_score",
"track_average_score",
"score_vs_track_average"
]].head()Add each learner's total minutes:
activity["learner_total_minutes"] = activity.groupby("learner_id")["minutes_watched"].transform("sum")Find learners with high total effort:
activity[activity["learner_total_minutes"] >= 120][
["learner_id", "learner", "track", "learner_total_minutes"]
].drop_duplicates()Use transform() when you need group results while keeping the original rows.
15. Filter Groups
filter() keeps or removes entire groups based on a condition.
Keep only tracks with at least 4 rows:
activity.groupby("track").filter(lambda group: len(group) >= 4)Keep only tracks with at least 3 unique learners:
activity.groupby("track").filter(
lambda group: group["learner_id"].nunique() >= 3
)Keep only courses with total revenue of at least 4000:
revenue.groupby("course").filter(
lambda group: group["amount"].sum() >= 4000
)Important: filter() returns original rows, not a summary table.
16. Apply Custom Group Logic
Use apply() when built-in aggregation or transformation is not enough.
Example: create a custom summary for each track.
def track_health(group):
completion_rate = (group["status"] == "completed").mean()
avg_score = group["quiz_score"].mean()
total_minutes = group["minutes_watched"].sum()
return pd.Series({
"completion_rate": completion_rate,
"average_score": avg_score,
"total_minutes": total_minutes,
"health_score": (completion_rate * 50) + (avg_score / 100 * 50)
})
activity.groupby("track").apply(track_health)apply() is flexible, but it can be slower and easier to misuse.
Prefer these in most cases:
.agg()for summary tables.transform()for row-level features.filter()for keeping or removing full groups
Use .apply() when the logic does not fit those patterns.
17. Top Records Inside Each Group
Find top 2 quiz attempts inside each track.
activity.sort_values(["track", "quiz_score"], ascending=[True, False]) \
.groupby("track") \
.head(2)Find the top order from each course:
revenue.sort_values(["course", "amount"], ascending=[True, False]) \
.groupby("course") \
.head(1)Another approach is ranking:
activity["score_rank_in_track"] = activity.groupby("track")["quiz_score"].rank(
method="dense",
ascending=False
)
activity[activity["score_rank_in_track"] <= 2].sort_values(["track", "score_rank_in_track"])Ranking is better when you want to keep the rank column for later analysis.
18. Handle Missing Values In Groups
Find missing quiz scores by track:
activity["quiz_score"].isna().groupby(activity["track"]).sum()Fill missing quiz score with the track average:
activity["quiz_score_filled"] = activity["quiz_score"].fillna(
activity.groupby("track")["quiz_score"].transform("mean")
)Check again:
activity[["learner", "track", "quiz_score", "quiz_score_filled"]]This is more thoughtful than filling every missing score with the global mean, because each track may have a different difficulty level.
19. Group Missing Categories With dropna=False
By default, Pandas may exclude missing group labels.
If you want missing values to appear as a group, use dropna=False.
activity.groupby("mentor", dropna=False)["learner_id"].nunique()This can be useful when missing values have business meaning, such as learners who have not been assigned a mentor yet.
20. Time-Based Grouping
The revenue dataset has an order date.
Create a month column:
revenue["month"] = revenue["ordered_at"].dt.to_period("M").astype(str)Revenue by month:
revenue.groupby("month", as_index=False)["amount"].sum()Revenue by month and category:
revenue.groupby(["month", "category"], as_index=False)["amount"].sum()You can also use pd.Grouper:
revenue.groupby(pd.Grouper(key="ordered_at", freq="M"))["amount"].sum()Use a date column when the question is about time.
Use normal category columns when the question is about segments.
21. Mini Project 1: Learner Engagement Dashboard
Build a summary table with:
- track name
- unique learners
- total minutes watched
- average quiz score
- completion rate
- number of mentors
Solution:
learner_dashboard = activity.groupby("track", as_index=False).agg(
unique_learners=("learner_id", "nunique"),
total_minutes=("minutes_watched", "sum"),
average_quiz_score=("quiz_score", "mean"),
completion_rate=("status", lambda s: (s == "completed").mean()),
mentors=("mentor", "nunique")
)
learner_dashboard["average_quiz_score"] = learner_dashboard["average_quiz_score"].round(2)
learner_dashboard["completion_rate"] = (learner_dashboard["completion_rate"] * 100).round(2)
learner_dashboard.sort_values("total_minutes", ascending=False)22. Mini Project 2: Course Revenue Dashboard
Build a summary table with:
- course
- category
- total orders
- paid orders
- total revenue
- average order amount
- total discount
- paid conversion rate
Solution:
course_revenue_dashboard = revenue.groupby(["course", "category"], as_index=False).agg(
total_orders=("order_id", "nunique"),
paid_orders=("paid", lambda s: s.sum()),
total_revenue=("amount", "sum"),
average_order_amount=("amount", "mean"),
total_discount=("discount", "sum")
)
course_revenue_dashboard["paid_conversion_rate"] = (
course_revenue_dashboard["paid_orders"] / course_revenue_dashboard["total_orders"] * 100
).round(2)
course_revenue_dashboard["average_order_amount"] = course_revenue_dashboard["average_order_amount"].round(2)
course_revenue_dashboard.sort_values("total_revenue", ascending=False)23. Mini Project 3: Find Learners Who Need Help
A learner may need support if:
- their filled quiz score is below the track average
- they have at least one in-progress lesson
- their total minutes are below 100
Solution:
support_data = activity.copy()
support_data["quiz_score_filled"] = support_data["quiz_score"].fillna(
support_data.groupby("track")["quiz_score"].transform("mean")
)
support_data["track_avg_score"] = support_data.groupby("track")["quiz_score_filled"].transform("mean")
support_data["learner_total_minutes"] = support_data.groupby("learner_id")["minutes_watched"].transform("sum")
support_data["has_in_progress"] = support_data.groupby("learner_id")["status"].transform(
lambda s: (s == "in_progress").any()
)
needs_help = support_data[
(support_data["quiz_score_filled"] < support_data["track_avg_score"]) &
(support_data["has_in_progress"]) &
(support_data["learner_total_minutes"] < 100)
]
needs_help[["learner_id", "learner", "track", "learner_total_minutes"]].drop_duplicates()24. Common Mistakes
Mistake 1: Grouping the whole DataFrame when you need one column
This works:
activity.groupby("track").sum(numeric_only=True)Explanation
- Groups the activity DataFrame by the "track" column to aggregate records with the same track value
- Calculates the sum of all numeric columns for each track group using the sum() method
- The numeric_only=True parameter ensures only numerical data types are included in the summation operation
- Returns a new DataFrame where each row represents a unique track with summed values for numeric fields
- This approach efficiently aggregates quantitative metrics across different track categories
But this is often clearer:
activity.groupby("track")["minutes_watched"].sum()Explanation
- Groups the activity DataFrame by the "track" column to organize data by musical tracks
- Calculates the sum of "minutes_watched" for each track group to determine total viewing time
- Returns a Series with track names as index and aggregated minute totals as values
- Useful for identifying most/least popular tracks based on cumulative watching duration
- Common pattern for data aggregation in pandas when analyzing user engagement metrics
Select the column you actually need.
Mistake 2: Confusing size() and count()
activity.groupby("track").size()Explanation
- Groups the activity dataset by the "track" column to aggregate related records
- Uses the size() method to count how many entries exist for each unique track value
- Returns a pandas Series with track identifiers as the index and their respective counts as values
- Commonly used for understanding distribution patterns across different tracks or categories
- Provides quick insight into which tracks have the most activity or are most frequently occurring
counts rows.
activity.groupby("track")["quiz_score"].count()Explanation
- Groups data by the "track" column to organize records by track category
- Calculates the count of quiz scores for each track group using the count() method
- Returns a Series with track names as index and corresponding quiz submission counts as values
- Useful for analyzing participation rates across different tracks or categories
- Demonstrates pandas groupby aggregation pattern for counting observations within groups
counts non-missing quiz scores.
Mistake 3: Using apply() for everything
This is usually better:
activity.groupby("track", as_index=False).agg(
total_minutes=("minutes_watched", "sum"),
average_score=("quiz_score", "mean")
)Explanation
- Groups user activity records by the "track" column to analyze performance across different learning paths
- Calculates the total minutes watched for each track using the sum aggregation function on the "minutes_watched" column
- Computes the mean quiz score for each track using the mean aggregation function on the "quiz_score" column
- Returns a new DataFrame with track names as the grouping key and two aggregated metrics as columns
- Uses as_index=False to keep the track column as a regular column rather than making it the index
Use apply() only when aggregation, transformation, and filtering are not enough.
Mistake 4: Forgetting reset_index()
This returns a Series with track as the index:
activity.groupby("track")["minutes_watched"].sum()Explanation
- Groups the activity DataFrame by the "track" column to organize data by musical tracks
- Calculates the sum of "minutes_watched" for each track group to determine total viewing time
- Returns a Series with track names as index and aggregated minute totals as values
- Useful for identifying most/least popular tracks based on cumulative watching duration
- Common pattern for data aggregation in pandas when analyzing user engagement metrics
This returns a DataFrame with track as a normal column:
activity.groupby("track")["minutes_watched"].sum().reset_index()Explanation
- Groups the activity DataFrame by the "track" column to organize records by track category
- Calculates the total minutes watched for each track group using the sum aggregation function
- Resets the index to convert the grouped result back into a regular DataFrame format
- Returns a new DataFrame with two columns: "track" and "minutes_watched" showing aggregated totals per track
- Useful for analyzing viewing patterns and identifying popular content categories
For dashboards, APIs, and CSV exports, normal columns are often easier.
25. Practice Questions
Try these without looking at the solutions first.
Practice Task
Q1. Which track has the highest total minutes watched?
activity.groupby("track")["minutes_watched"].sum().sort_values(ascending=False).head(1)Explanation
- Groups activity data by the "track" column to aggregate viewing statistics per track
- Calculates the sum of "minutes_watched" for each track group to determine total viewing time
- Sorts the aggregated results in descending order to rank tracks by total minutes watched
- Returns only the top result using head(1) to identify the track with the highest viewing time
- This approach efficiently identifies the most popular track based on cumulative watching duration
Practice Task
Q2. Which plan has the highest average quiz score?
activity.groupby("plan")["quiz_score"].mean().sort_values(ascending=False)Explanation
- Groups activity data by the "plan" column to organize records by different plan types
- Calculates the mean quiz score for each plan group using the "quiz_score" column
- Sorts the resulting averages in descending order to show highest performing plans first
- Returns a pandas Series with plan names as index and their corresponding average quiz scores as values
- Useful for identifying which plan type correlates with better quiz performance among users
Practice Task
Q3. How many unique learners are there in each city?
activity.groupby("city")["learner_id"].nunique().sort_values(ascending=False)Explanation
- Groups the activity dataset by the "city" column to aggregate data per location
- Counts the number of unique learner IDs within each city group using nunique()
- Sorts the resulting counts in descending order to show cities with the most learners first
- Provides a quick overview of learner distribution across different geographic locations
- Useful for identifying which cities have the highest engagement levels in the activity program
Practice Task
Q4. What is the completion count for every track and plan?
activity[activity["status"] == "completed"] \
.groupby(["track", "plan"]) \
.size() \
.reset_index(name="completed_count")Explanation
- Filters the activity DataFrame to include only rows where status equals "completed"
- Groups the filtered data by both "track" and "plan" columns to create aggregated categories
- Counts the number of records in each group using the size() method
- Resets the index to convert the grouped result back to a regular DataFrame with proper column names
- Creates a new column called "completed_count" to store the aggregated counts for each track-plan combination
Practice Task
Q5. Which course has the highest average paid order amount?
revenue[revenue["paid"] == True] \
.groupby("course")["amount"] \
.mean() \
.sort_values(ascending=False)Explanation
- Filters the revenue dataset to include only rows where the "paid" column equals True, selecting only completed transactions
- Groups the filtered data by the "course" column to organize payments by course category
- Calculates the mean (average) payment amount for each course group using the "amount" column
- Sorts the resulting averages in descending order to show courses with highest average payments first
Practice Task
Q6. Which city generated the highest total revenue?
revenue.groupby("city")["amount"].sum().sort_values(ascending=False).head(1)Explanation
- Groups revenue data by city column and calculates the sum of amounts for each city
- Sorts the resulting sums in descending order to rank cities by total revenue
- Returns only the top result using head(1) to identify the city with maximum revenue
- This approach efficiently aggregates and ranks categorical data to find peak performance metrics
- Commonly used for business analytics to identify best performing locations or segments
Practice Task
Q7. What is each course's discount percentage?
discount_summary = revenue.groupby("course", as_index=False).agg(
total_amount=("amount", "sum"),
total_discount=("discount", "sum")
)
discount_summary["discount_pct"] = (
discount_summary["total_discount"] /
(discount_summary["total_amount"] + discount_summary["total_discount"]) *
100
).round(2)
discount_summaryExplanation
- Groups revenue data by course category and calculates total amount and total discount for each course
- Computes the discount percentage by dividing total discount by the sum of total amount and total discount, then multiplies by 100
- Rounds the calculated discount percentage to two decimal places for better readability
- Returns the final summary dataframe containing courses, their total amounts, total discounts, and corresponding discount percentages
Practice Task
Q8. Find the top learner by total minutes watched in each track.
learner_track_minutes = activity.groupby(
["track", "learner_id", "learner"],
as_index=False
)["minutes_watched"].sum()
learner_track_minutes.sort_values(
["track", "minutes_watched"],
ascending=[True, False]
).groupby("track").head(1)Explanation
- Groups activity data by track, learner_id, and learner name to aggregate total minutes watched per learner per track
- Sorts the grouped data by track name ascending and minutes watched descending to rank learners within each track
- Uses groupby with head(1) to select only the top learner (highest minutes watched) from each track group
- The final result shows the learner who watched the most minutes for each content track
- This approach efficiently finds the most engaged learner per track without requiring explicit ranking functions
Practice Task
Q9. Add a column showing each learner's percentage of total minutes inside their track.
activity["track_total_minutes"] = activity.groupby("track")["minutes_watched"].transform("sum")
activity["minutes_share_in_track"] = (
activity["minutes_watched"] / activity["track_total_minutes"] * 100
).round(2)
activity[["learner", "track", "lesson", "minutes_watched", "minutes_share_in_track"]].head()Explanation
- Groups the activity data by "track" column and calculates the total minutes watched for each track using transform with sum aggregation
- Creates a new column that represents the percentage share of minutes watched relative to the total minutes in each track
- Rounds the percentage values to 2 decimal places for better readability
- Displays selected columns including learner, track, lesson, minutes watched, and the calculated percentage share
- Shows the first few rows of the resulting dataframe with the new percentage calculation included
Practice Task
Q10. Keep only courses that have at least 3 paid orders.
revenue[revenue["paid"] == True] \
.groupby("course") \
.filter(lambda group: len(group) >= 3)Explanation
- Filters the revenue DataFrame to include only rows where the "paid" column equals True, selecting only paid transactions
- Groups the filtered data by the "course" column to aggregate information per course
- Applies a filter function that retains only groups (courses) containing 3 or more records (students/enrollments)
- Returns a subset of the original DataFrame containing only paid course entries that meet the minimum enrollment requirement of 3 participants
- This operation helps identify popular paid courses with sufficient student participation for analysis or reporting purposes
26. Interview Questions
1. What does groupby() return?
It returns a GroupBy object. The object stores grouping rules and waits for an operation such as aggregation, transformation, filtering, or apply.
2. What is the split-apply-combine idea?
Split data into groups, apply a calculation to each group, and combine the results.
3. What is the difference between size() and count()?
size() counts rows in each group. count() counts non-null values in each column or selected Series.
4. When should you use transform()?
Use transform() when you need a group-level result repeated back to the original row shape, such as adding each row's group average.
5. When should you use filter()?
Use filter() when you want to keep or remove entire groups based on a condition.
6. When should you use apply()?
Use apply() for custom group logic that does not fit normal aggregation, transformation, or filtering.
7. Why is named aggregation useful?
It lets you create readable output column names while applying different calculations to different columns.
8. Why can as_index=False be helpful?
It keeps grouping keys as normal columns instead of making them the index. This is often easier for dashboards, APIs, and exports.
9. How do you group by multiple columns?
Pass a list of columns:
df.groupby(["column_1", "column_2"])Explanation
- Groups rows in a pandas DataFrame based on unique combinations of values in column_1 and column_2
- Creates separate groups for each distinct pair of values from the specified columns
- Enables subsequent aggregation operations like sum, mean, or count within each group
- Returns a GroupBy object that can be used with aggregation methods to compute statistics per group
- Commonly used for categorical data analysis and creating summary tables by dimension combinations
10. How do you get one specific group?
Use get_group():
activity.groupby("track").get_group("Pandas")Explanation
- The code uses pandas groupby() function to partition data by the "track" column into separate groups
- It then applies get_group() method to extract only the subset of data where track equals "Pandas"
- This approach efficiently filters large datasets to focus on specific categorical values
- The operation returns a DataFrame containing all rows that belong to the "Pandas" track category
- This pattern is commonly used for analyzing specific segments within grouped data structures
For multiple keys, pass a tuple:
activity.groupby(["track", "plan"]).get_group(("Pandas", "premium"))Explanation
- The code uses pandas groupby() method to partition data into groups based on two categorical columns: "track" and "plan"
- It applies the get_group() method to specifically extract the subset of data where track equals "Pandas" and plan equals "premium"
- This approach enables efficient filtering and analysis of specific combinations within grouped datasets
- The resulting output contains all rows that match both criteria simultaneously from the original activity dataframe
- This pattern is commonly used for segmenting business data by product tiers and user categories
27. Final Mental Model
Use this simple decision table:
| Need | Use |
|---|---|
| One summary row per group | agg() |
| Group result attached back to original rows | transform() |
| Keep or remove full groups | filter() |
| Custom logic that returns flexible output | apply() |
| Count rows | size() |
| Count non-missing values | count() |
| Count distinct values | nunique() |
| Make grouped keys normal columns | as_index=False or reset_index() |
If you remember only one thing, remember this:
GroupBy is not just for totals. It is a complete pattern for turning row-level data into segment-level insight.
Official References
- Pandas GroupBy user guide: https://pandas.pydata.org/docs/user_guide/groupby.html
- Pandas
DataFrame.groupbyAPI reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html - Pandas GroupBy API reference: https://pandas.pydata.org/docs/reference/groupby.html
