How do you sort a DataFrame by more than one column?

You can use the sort_values method to sort a DataFrame by more than one column.

What method is used to count how often each value appears in a column?

The value_counts method is used to count how often each value appears in a column.

How can you check for missing values in a DataFrame?

You can use the isna or isnull methods to check for missing values in a DataFrame.

What is the purpose of the drop_duplicates method?

The drop_duplicates method is used to remove duplicate records from a DataFrame.

How can you apply custom logic to each row in a DataFrame?

You can use the apply method to apply custom logic to each row in a DataFrame.

Pandas DataFrame Methods: Sort, Rank, Clean,

Pandas Important DataFrame Methods: Sort, Rank, Index, Clean, Apply, and Practice

Once you understand how to create, inspect, select, and filter a DataFrame, the next step is learning the methods you will use again and again.

These methods help you answer questions like:

Which category appears most often?
Which learner has the highest score?
Which rows have missing values?
Which records are duplicates?
How do I sort by more than one column?
How do I make one column the index?
How do I reset the index after sorting?
How do I rename columns?
How do I apply custom logic row by row?

This guide uses original sample datasets, not copied third-party datasets.

Files used in this lesson:

learning_activity.csv
mini_league_matches_methods.csv

Place both files in the same folder as this Markdown file before running the examples.

What You Will Learn

By the end, you should be able to use:

value_counts
sort_values
rank
sort_index
set_index
reset_index
rename
unique
nunique
isna, isnull, notna, and notnull
hasnans
dropna
fillna, ffill, and bfill
duplicated
drop_duplicates
drop
apply

You will also solve realistic practice tasks using CSV files.

1. Setup

python

import pandas as pd
import numpy as np

Load the learner activity dataset:

python

activity = pd.read_csv("learning_activity.csv")
activity.head()

Load the match dataset:

python

matches = pd.read_csv("mini_league_matches_methods.csv")
matches.head()

2. Quick Dataset Check

Before using any advanced method, inspect the data.

python

activity.shape
activity.info()
activity.head()
activity.tail()

Check column names:

python

activity.columns

Check missing values:

python

activity.isna().sum()

Check duplicates:

python

activity.duplicated().sum()

This small routine catches many problems early.

3. `value_counts` On A Series

Use value_counts to count how often each value appears in one column.

python

activity["track"].value_counts()

Example questions:

Which course track has the most activity?
Which city has the most learners?
Which plan is most common?
Which completion status appears most often?

python

activity["city"].value_counts()
activity["plan"].value_counts()
activity["status"].value_counts()

By default, missing values are ignored.

To include missing values:

python

activity["mentor"].value_counts(dropna=False)

To get percentages:

python

activity["track"].value_counts(normalize=True) * 100

4. `value_counts` On A DataFrame

DataFrame.value_counts counts repeated combinations of columns.

python

activity[["track", "status"]].value_counts()

This answers:

text

How many rows exist for each track-status pair?

You can include missing combinations:

python

activity[["track", "mentor"]].value_counts(dropna=False)

You can normalize:

python

activity[["track", "status"]].value_counts(normalize=True) * 100

This is useful when a single column is not enough.

5. `sort_values` On A Series

Create a simple Series:

python

scores = pd.Series([88, 72, 95, 64, 91], index=["Aarav", "Meera", "Nila", "Kabir", "Tara"])

Sort ascending:

python

scores.sort_values()

Sort descending:

python

scores.sort_values(ascending=False)

This returns a new sorted Series. It does not change the original unless you assign it back.

python

scores = scores.sort_values(ascending=False)

6. `sort_values` On A DataFrame

Sort learners by quiz score:

python

activity.sort_values("quiz_score", ascending=False)

Sort by minutes watched:

python

activity.sort_values("minutes_watched", ascending=False)

Sort by multiple columns:

python

activity.sort_values(
    ["track", "quiz_score"],
    ascending=[True, False]
)

This means:

track in ascending order
quiz score in descending order inside each track

7. Sorting With Missing Values

If a column has missing values, choose where they should go.

python

activity.sort_values("quiz_score", na_position="first")

Or:

python

activity.sort_values("quiz_score", na_position="last")

For reports, na_position="last" is usually easier to read.

8. `rank`

Ranking converts values into positions.

python

activity["score_rank"] = activity["quiz_score"].rank(ascending=False)

Now the highest quiz score gets rank 1.

python

activity.sort_values("score_rank").head()

If two learners have the same score, Pandas uses average ranking by default.

For dense ranking:

python

activity["dense_rank"] = activity["quiz_score"].rank(
    ascending=False,
    method="dense"
)

Dense ranking does not leave gaps after ties.

Example:

text

Score: 95, 91, 91, 88
Dense rank: 1, 2, 2, 3

9. Ranking Within Groups

Rank learners inside each track.

python

activity["track_rank"] = activity.groupby("track")["quiz_score"].rank(
    ascending=False,
    method="dense"
)

Show top learners per track:

python

activity.sort_values(["track", "track_rank", "learner"])

This pattern is powerful for leaderboards.

10. `sort_index`

Sorting by values sorts the data inside columns.

Sorting by index sorts row labels.

python

activity_by_id = activity.set_index("learner_id")
activity_by_id.sort_index()

Descending:

python

activity_by_id.sort_index(ascending=False)

You use sort_index when the index itself has meaning, such as:

IDs
dates
alphabetical labels
ordered categories

11. `set_index`

Use set_index when one column should become the row label.

python

activity_by_id = activity.set_index("learner_id")

Now learner_id is the index.

You can select by ID:

python

activity_by_id.loc["L003"]

By default, set_index returns a new DataFrame.

If you want to modify the same variable:

python

activity = activity.set_index("learner_id")

In notebooks, returning a new DataFrame is often safer than using inplace=True.

12. `reset_index`

Use reset_index to convert the index back into a column.

python

activity_by_id.reset_index()

If you do not want the old index as a column:

python

activity_by_id.reset_index(drop=True)

Common pattern after sorting:

python

top_learners = (
    activity
    .sort_values("quiz_score", ascending=False)
    .reset_index(drop=True)
)

This gives clean row numbers after sorting.

13. Series To DataFrame With `reset_index`

value_counts returns a Series.

python

track_counts = activity["track"].value_counts()
type(track_counts)

Convert it into a DataFrame:

python

track_counts_df = track_counts.reset_index()
track_counts_df.columns = ["track", "row_count"]
track_counts_df

This is useful before exporting a summary.

14. `rename`

Rename columns:

python

activity = activity.rename(columns={
    "minutes_watched": "minutes",
    "quiz_score": "score"
})

Rename index labels:

python

activity_by_id = activity.set_index("learner_id")

activity_by_id.rename(index={
    "L001": "learner-001",
    "L002": "learner-002"
})

Clean column names after reading messy CSV files:

python

activity.columns = (
    activity.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

Clean column names reduce bugs.

15. `unique`

Use unique to see distinct values in a Series.

python

activity["track"].unique()

This returns an array of unique values.

For city names:

python

activity["city"].unique()

If you see values like "Pune" and "pune", you have a cleaning issue.

Normalize before analysis:

python

activity["city"] = activity["city"].str.title()

16. `nunique`

Use nunique to count unique values.

python

activity["learner_id"].nunique()

Count unique learners by track:

python

activity.groupby("track")["learner_id"].nunique()

By default, missing values are ignored.

Include missing values:

python

activity["mentor"].nunique(dropna=False)

17. `isna`, `isnull`, `notna`, And `notnull`

Check missing values:

python

activity.isna()

Count missing values per column:

python

activity.isna().sum()

isnull is an alias many people still use:

python

activity.isnull().sum()

Check non-missing values:

python

activity.notna()
activity.notnull()

Select rows with missing mentor:

python

activity[activity["mentor"].isna()]

Select rows with known mentor:

python

activity[activity["mentor"].notna()]

18. `hasnans`

hasnans is available on a Series.

python

activity["mentor"].hasnans

It returns True if the Series contains missing values.

It is a quick yes/no check.

19. `dropna`

dropna removes rows or columns with missing values.

Drop rows with any missing value:

python

activity.dropna()

Drop rows only if all values are missing:

python

activity.dropna(how="all")

Drop rows where a specific column is missing:

python

activity.dropna(subset=["mentor"])

Drop rows where either mentor or completed_at is missing:

python

activity.dropna(subset=["mentor", "completed_at"])

Drop rows only when both mentor and completed_at are missing:

python

activity.dropna(subset=["mentor", "completed_at"], how="all")

Do not use dropna blindly. In learning data, missing completed_at may simply mean the learner has not completed the lesson yet.

20. `fillna`

Use fillna to replace missing values.

Fill one column:

python

activity["mentor"] = activity["mentor"].fillna("Unassigned")

Fill numeric missing values with median:

python

activity["quiz_score"] = activity["quiz_score"].fillna(activity["quiz_score"].median())

Fill multiple columns with different values:

python

activity = activity.fillna({
    "mentor": "Unassigned",
    "completed_at": "Not completed",
    "status": "unknown"
})

Column-specific filling is usually better than filling the whole DataFrame with one value.

21. Forward Fill And Backward Fill

Forward fill uses the previous known value.

python

activity["mentor"].ffill()

Backward fill uses the next known value.

python

activity["mentor"].bfill()

These are useful for time series or ordered logs, but they can be misleading if the row order has no meaning.

22. `duplicated`

Check full duplicate rows:

python

activity.duplicated()

Count duplicates:

python

activity.duplicated().sum()

Check duplicates by specific columns:

python

activity.duplicated(subset=["learner_id", "lesson"])

Show duplicate lesson attempts:

python

activity[activity.duplicated(subset=["learner_id", "lesson"], keep=False)]

keep=False marks all duplicate rows in each duplicate group.

23. `drop_duplicates`

Remove exact duplicate rows:

python

activity.drop_duplicates()

Keep the last attempt for each learner and lesson:

python

latest_attempts = activity.drop_duplicates(
    subset=["learner_id", "lesson"],
    keep="last"
)

Drop all duplicated learner-lesson rows:

python

unique_only = activity.drop_duplicates(
    subset=["learner_id", "lesson"],
    keep=False
)

For event data, keep="last" is common when later rows represent updated attempts.

24. `drop`

Drop columns:

python

activity.drop(columns=["mentor"])

Drop rows by index:

python

activity.drop(index=[0, 1])

After setting an index, drop by label:

python

activity_by_id = activity.set_index("learner_id")
activity_by_id.drop(index=["L001", "L002"])

Drop with axis also works:

python

activity.drop(["mentor"], axis=1)

But columns=[...] is usually clearer.

25. `apply` On A Series

apply runs a function on each value in a Series.

python

def score_label(score):
    if pd.isna(score):
        return "missing"
    if score >= 85:
        return "strong"
    if score >= 70:
        return "steady"
    return "needs_practice"

activity["score_label"] = activity["quiz_score"].apply(score_label)

For many simple conditions, vectorized methods like np.where or pd.cut can be faster.

But apply is easy to understand and useful for custom logic.

26. `apply` Row By Row

Use axis=1 when your function needs more than one column from the same row.

python

def learner_priority(row):
    if row["status"] != "completed" and row["minutes_watched"] >= 250:
        return "follow_up"
    if row["quiz_score"] >= 90:
        return "celebrate"
    return "normal"

activity["priority"] = activity.apply(learner_priority, axis=1)

Row-wise apply is flexible, but it can be slower on very large datasets.

Use it when clarity matters or when the logic is too custom for simple vectorized expressions.

27. Mini Practice: Learner Activity

Reload a clean copy:

python

activity = pd.read_csv("learning_activity.csv")

Practice Task

Q1. Which track has the most rows?

python

activity["track"].value_counts()

Explanation

The code examines the distribution of different activity types in a dataset by counting how many times each unique activity appears
It uses the pandas value_counts() method to generate a frequency table of activity track values
The result shows the count of each distinct activity type, helping identify which activities occur most frequently
This analysis is useful for understanding user engagement patterns and popular activity categories
The output provides insights into data distribution that can inform business decisions and resource allocation

Practice Task

Q2. Which city-plan pair appears most often?

python

activity[["city", "plan"]].value_counts()

Explanation

This code performs a frequency count of unique combinations between city and plan columns in the activity DataFrame
The double bracket notation [["city", "plan"]] selects multiple columns simultaneously for analysis
value_counts() method returns a Series with MultiIndex where each index represents a unique city-plan pair
The output shows how many times each city-plan combination appears in the dataset, useful for identifying popular activity patterns
This technique is commonly used in data exploration to understand distribution patterns across categorical variables

Practice Task

Q3. Sort learners by score, highest first.

python

activity.sort_values("quiz_score", ascending=False)

Explanation

The code sorts a pandas DataFrame called "activity" based on the "quiz_score" column in descending order
Values with higher quiz scores appear first in the sorted result
The ascending=False parameter ensures the highest scores are displayed at the top
This operation modifies the DataFrame in place without creating a new object
Commonly used for ranking students or participants by performance metrics

Practice Task

Q4. Sort by track ascending and score descending.

python

activity.sort_values(["track", "quiz_score"], ascending=[True, False])

Explanation

The code sorts a pandas DataFrame called 'activity' by two columns: "track" in ascending order and "quiz_score" in descending order
The sort_values() method rearranges rows based on the specified column priorities, with tracks grouped together and quiz scores ordered from highest to lowest within each track
This creates a structured dataset where activities are organized first by track category, then by performance scores within each category
The ascending parameter controls sorting direction, with True for ascending (A-Z) and False for descending (Z-A)
This type of sorting is commonly used for creating performance reports or organizing educational content by difficulty level

Practice Task

Q5. Add a global score rank.

python

activity["global_rank"] = activity["quiz_score"].rank(
    ascending=False,
    method="dense"
)

Explanation

Assigns a global ranking to each activity entry based on quiz scores in descending order
Uses the dense ranking method which assigns the same rank to tied values without skipping subsequent ranks
Higher quiz scores receive lower rank numbers (1st place, 2nd place, etc.)
Modifies the activity DataFrame by adding a new "global_rank" column with calculated rankings
The ranking process ensures consistent ordering where ties share identical rank values

Practice Task

Q6. Add a rank inside each track.

python

activity["track_rank"] = activity.groupby("track")["quiz_score"].rank(
    ascending=False,
    method="dense"
)

Explanation

Groups the activity dataframe by the "track" column to organize data by track categories
Within each track group, ranks the "quiz_score" values in descending order (highest scores first)
Uses the "dense" ranking method which assigns the same rank to tied scores without skipping subsequent rank numbers
Creates a new "track_rank" column that contains the rank position for each quiz score within its respective track
The resulting ranking helps identify top performers within each track category efficiently

Practice Task

Q7. Set learner_id as the index and select one learner.

python

activity_by_id = activity.set_index("learner_id")
activity_by_id.loc["L003"]

Explanation

Creates a new DataFrame view with "learner_id" column as the index using set_index() method
Uses .loc accessor to retrieve all rows where learner_id equals "L003" from the indexed DataFrame
This approach enables efficient lookup of specific learner records by their unique identifier
The indexed structure provides faster access times when querying multiple records for the same learner ID
Returns all activity records associated with the specified learner ID in a structured format

Practice Task

Q8. Convert track counts into a DataFrame.

python

track_counts = activity["track"].value_counts().reset_index()
track_counts.columns = ["track", "rows"]
track_counts

Explanation

The code extracts unique track values from the activity DataFrame's "track" column and counts how many times each track appears
It uses value_counts() to generate a Series with track names as index and their occurrence counts as values
The reset_index() method converts the Series into a DataFrame format with tracks as a regular column instead of an index
Column names are renamed to "track" and "rows" for better readability and clarity
The final result displays a tabular view showing each track and its corresponding count of occurrences

Practice Task

Q9. Rename quiz_score to score.

python

activity.rename(columns={"quiz_score": "score"})

Explanation

The code uses the pandas rename() method to change the column name "quiz_score" to "score" in a DataFrame called activity
This operation creates a new DataFrame with the updated column names while preserving all original data
The rename() method accepts a dictionary where keys are current column names and values are new column names
This technique is commonly used to standardize column naming across datasets or make column names more readable
The operation is non-destructive when using the inplace parameter or assigning the result to a new variable

Practice Task

Q10. Count unique learners by track.

python

activity.groupby("track")["learner_id"].nunique()

Explanation

Groups data by the "track" column to organize records by track category
For each track group, counts the number of distinct learner IDs using nunique()
Returns a pandas Series with track names as index and unique learner counts as values
This operation efficiently summarizes learner diversity across different tracks
Commonly used for analytics to measure engagement breadth within educational pathways

Practice Task

Q11. Show rows where mentor is missing.

python

activity[activity["mentor"].isna()]

Explanation

Uses boolean indexing to filter rows from the activity DataFrame where the "mentor" column has null/NaN values
The isna() method returns a boolean Series identifying which entries are null in the mentor column
This creates a subset of the original DataFrame containing only records without assigned mentors
Commonly used for data cleaning to identify unassigned or missing mentorship records
Returns a new DataFrame with the same structure but filtered to show only null mentor entries

Practice Task

Q12. Fill missing mentors with Unassigned.

python

activity["mentor"] = activity["mentor"].fillna("Unassigned")

Explanation

Replaces all null/missing values in the "mentor" column with the string "Unassigned"
Uses pandas fillna() method to handle missing data gracefully in the activity dataset
Ensures consistent data representation by providing a default value for undefined mentor assignments
Maintains data integrity by preventing null values from propagating through subsequent operations
Common pattern for cleaning categorical data where missing values need explicit handling

Practice Task

Q13. Show duplicate learner-lesson attempts.

python

activity[
    activity.duplicated(subset=["learner_id", "lesson"], keep=False)
]

Explanation

This code identifies all rows in the activity DataFrame that have duplicate combinations of learner_id and lesson values
The duplicated() method returns a boolean mask indicating which rows are duplicates based on the specified subset columns
Setting keep=False ensures all duplicate instances are marked as True, not just the subsequent ones
The result filters the original DataFrame to show only rows where the same learner has completed the same lesson multiple times
This is useful for identifying repeated learning activities or potential data entry issues in educational analytics

Practice Task

Q14. Keep only the last attempt for every learner and lesson.

python

activity = activity.drop_duplicates(
    subset=["learner_id", "lesson"],
    keep="last"
)

Explanation

The code eliminates duplicate rows from the activity DataFrame based on combinations of learner_id and lesson columns
It keeps only the last occurrence of each duplicate combination, ensuring the most recent record is retained
The subset parameter specifies which columns to consider when identifying duplicates
This operation helps clean data by removing redundant learner progress entries for the same lesson

Practice Task

Q15. Create a priority label using row-wise apply.

python

def priority(row):
    if row["status"] != "completed" and row["minutes_watched"] >= 250:
        return "follow_up"
    if row["quiz_score"] >= 90:
        return "celebrate"
    return "normal"

activity["priority"] = activity.apply(priority, axis=1)

Explanation

The priority function evaluates student data rows to determine appropriate follow-up actions based on multiple criteria
Students who haven't completed their status but have watched at least 250 minutes receive "follow_up" priority status
High-performing students with quiz scores of 90 or above are flagged with "celebrate" priority level
All other students automatically receive the default "normal" priority classification
The apply method processes each row of the activity DataFrame through the priority function to create a new priority column

28. Mini Practice: Fictional League Matches

Reload the matches dataset:

python

matches = pd.read_csv("mini_league_matches_methods.csv")

Explanation

Reads match data from a CSV file named "mini_league_matches_methods.csv" into a pandas DataFrame called matches
The CSV file likely contains information about football matches including teams, scores, dates, and other statistical data
This operation creates a structured dataset that can be manipulated and analyzed using pandas methods
The resulting DataFrame provides a foundation for performing data analysis, filtering, grouping, and visualization tasks
Common use cases include generating match statistics, calculating team performance metrics, or creating visualizations of league standings

This dataset is fictional and created for Pandas practice.

Practice Task

Q1. Count toss decisions.

python

matches["toss_decision"].value_counts()

Explanation

The code examines the frequency distribution of toss decisions (bat or field) made by teams in cricket matches
It uses the value_counts() method to count how many times each toss decision appears in the toss_decision column
This provides insights into whether teams prefer batting or fielding after winning the toss
The output shows the absolute counts for each decision type, helping identify patterns in team strategy
This analysis is commonly used in sports analytics to understand tactical preferences and decision-making patterns

Practice Task

Q2. Count match stages.

python

matches["stage"].value_counts()

Explanation

This code performs a frequency count of unique values in the "stage" column of a pandas DataFrame called matches
The value_counts() method returns a Series with each unique stage value as the index and its corresponding count as the value
This is useful for understanding how many records exist for each stage category in your dataset
The output helps identify which stages are most common or rare in your data
This operation is commonly used for exploratory data analysis and data validation purposes

Practice Task

Q3. Find how many matches each team played.

Because a team can appear in either team_a or team_b, combine both columns.

python

teams_played = pd.concat([matches["team_a"], matches["team_b"]])
teams_played.value_counts()

Explanation

Concatenates two columns from a matches DataFrame containing team names (team_a and team_b) into a single Series
Creates a unified list of all teams that participated in matches by combining both home and away team columns
Uses value_counts() method to tally how many times each team appears in the concatenated data
Returns a sorted Series showing team names as index and their total match participation counts as values
Useful for identifying most active teams or analyzing team involvement patterns in sports analytics

Practice Task

Q4. Rank teams by number of wins.

python

team_wins = matches["winner"].value_counts().reset_index()
team_wins.columns = ["team", "wins"]
team_wins["rank"] = team_wins["wins"].rank(
    ascending=False,
    method="dense"
)
team_wins.sort_values(["rank", "team"])

Explanation

The code retrieves the count of wins for each team from the "winner" column in the matches DataFrame.
It resets the index to convert the Series into a DataFrame and renames the columns to "team" and "wins".
The rank method is applied to the "wins" column to assign a ranking based on the number of wins, with ties receiving the same rank.
Finally, the DataFrame is sorted by rank in ascending order and by team name to ensure a clear presentation of the results.

Practice Task

Q5. Sort matches by season, stage, and margin.

python

matches.sort_values(
    ["season", "stage", "margin"],
    ascending=[True, True, False]
)

Explanation

The sort_values method is used to sort the DataFrame named matches.
The sorting is performed based on three columns: season, stage, and margin.
The ascending parameter specifies the sort order: season and stage are sorted in ascending order, while margin is sorted in descending order.
This allows for a structured organization of match data, prioritizing seasons and stages while highlighting larger margins.

Practice Task

Q6. Set match_id as the index.

python

matches_by_id = matches.set_index("match_id")
matches_by_id.sort_index()

Explanation

The set_index("match_id") method transforms the DataFrame by using the "match_id" column as the new index, allowing for quicker lookups.
The sort_index() method sorts the DataFrame based on the new index, ensuring that the data is organized in ascending order of match IDs.
This approach enhances data manipulation and retrieval, particularly when working with large datasets.
The original DataFrame remains unchanged unless reassigned, so the sorted version needs to be stored if required.

Practice Task

Q7. Show unique stages.

python

matches["stage"].unique()

Explanation

Retrieves all distinct values present in the "stage" column of the matches DataFrame
Returns a numpy array containing only unique stage identifiers without duplicates
Useful for understanding the variety of stages available in the dataset
Commonly used for data exploration and validation before further analysis
Helps identify potential data inconsistencies or missing stage categories

Practice Task

Q8. Count unique cities.

python

matches["city"].nunique()

Explanation

The code uses the nunique() method to count distinct values in the "city" column of a pandas DataFrame called matches
This method efficiently handles missing values by excluding them from the count automatically
The result returns a single integer representing how many unique cities exist in that dataset column
This operation is useful for understanding data diversity and identifying potential data quality issues
The syntax follows pandas convention where nunique() is applied directly to a Series object

Practice Task

Q9. Drop rows where winner is missing.

python

matches.dropna(subset=["winner"])

Explanation

The dropna method removes any rows where the "winner" column contains null or NaN values
The subset parameter specifically targets only the "winner" column for null value detection
This operation returns a new DataFrame without the incomplete match records
Useful for data cleaning when you need to ensure all matches have a defined winner before analysis
The original DataFrame remains unchanged since dropna creates a copy by default

In this sample dataset, one row has no winner because the match had no result.

Practice Task

Q10. Fill missing winner with No Result.

python

matches["winner"] = matches["winner"].fillna("No Result")

Explanation

The code fills NaN (missing) values in the "winner" column of a DataFrame called matches
Missing values are replaced with the string "No Result" to indicate incomplete match outcomes
This approach ensures consistent data representation where every match has a defined winner value
The fillna() method operates in-place on the DataFrame column without requiring assignment to a new variable
Useful for data analysis workflows where missing results need to be handled explicitly rather than left as null values

Practice Task

Q11. Add a close_match column.

python

matches["close_match"] = matches["margin"].apply(
    lambda margin: "yes" if margin <= 8 else "no"
)

Explanation

The code creates a new column called "close_match" in the matches DataFrame
It applies a lambda function to each value in the "margin" column to evaluate whether the margin is less than or equal to 8
Values meeting the condition (margin ≤ 8) are labeled as "yes", otherwise "no"
This operation uses pandas apply method for vectorized conditional assignment across all rows
The resulting column helps categorize matches as either "close" or "not close" based on the defined threshold

Practice Task

Q12. Add a match summary using row-wise apply.

python

def match_summary(row):
    if pd.isna(row["winner"]):
        return f'{row["team_a"]} vs {row["team_b"]}: no result'
    return f'{row["winner"]} won by {row["margin"]}'

matches["summary"] = matches.apply(match_summary, axis=1)

Explanation

The function match_summary takes a row of match data and checks if the winner column contains missing values using pd.isna()
When there's no winner (missing value), it returns a string indicating the match ended without a result
When there is a winner, it formats a summary showing which team won and by what margin
The apply() method processes each row of the matches DataFrame through this function
The resulting summary strings are stored in a new column called "summary" in the matches DataFrame

29. Method Selection Guide

Use this as a quick decision table.

Task	Method
Count values in one column	`value_counts`
Count combinations of columns	`DataFrame.value_counts`
Sort rows by column	`sort_values`
Sort rows by index	`sort_index`
Create ranks	`rank`
Make a column the index	`set_index`
Convert index back to column	`reset_index`
Rename columns or index labels	`rename`
See unique values	`unique`
Count unique values	`nunique`
Detect missing values	`isna` or `isnull`
Detect non-missing values	`notna` or `notnull`
Remove missing rows	`dropna`
Fill missing values	`fillna`, `ffill`, `bfill`
Detect duplicates	`duplicated`
Remove duplicates	`drop_duplicates`
Remove rows or columns	`drop`
Run custom logic	`apply`

30. Common Mistakes

Mistake 1: Sorting Without Assigning

This returns a sorted DataFrame but does not change activity.

python

activity.sort_values("quiz_score")

Explanation

The code sorts a pandas DataFrame called "activity" based on the "quiz_score" column in ascending order
By default, sort_values() arranges values from lowest to highest
This operation returns a new sorted DataFrame without modifying the original data
Useful for analyzing student performance or ranking participants by their quiz results
The method can be modified with additional parameters like ascending=False for descending order

Assign it:

python

activity = activity.sort_values("quiz_score")

Explanation

The code sorts a pandas DataFrame called 'activity' based on the "quiz_score" column in ascending order
This operation rearranges all rows in the dataset so that records with lower quiz scores appear first
The sort_values() method modifies the DataFrame in-place by default, reordering the entire dataset
This sorting approach is commonly used for analyzing student performance or ranking participants by score
The resulting dataset maintains all original columns while organizing rows according to the specified sorting criteria

Mistake 2: Dropping Rows Too Early

Missing values are not always bad.

For example, a missing completed_at value can mean the learner has not completed the lesson yet.

Understand the meaning before using:

python

activity.dropna()

Explanation

The dropna() method eliminates rows or columns containing null values (NaN, None) from a pandas DataFrame
By default, it removes any row that contains at least one missing value, preserving only complete records
This operation returns a new DataFrame without modifying the original data structure
Commonly used in data cleaning processes to ensure analysis accuracy when missing values could skew results
Can be customized with parameters like subset= to target specific columns or how='all' to remove rows where all values are missing

Mistake 3: Using Row-Wise `apply` For Everything

This works:

python

activity.apply(priority, axis=1)

Explanation

The code applies a priority function to each row of the activity DataFrame using the pandas apply method
The axis=1 parameter specifies that the function should be applied horizontally across rows rather than vertically down columns
This approach enables vectorized processing of each row's data through the custom priority function
The result represents transformed activity data where each row has been processed according to the priority logic
This pattern is commonly used for data transformation tasks like scoring, categorization, or conditional calculations across dataset rows

But for simple comparisons, vectorized code is often cleaner:

python

activity["is_high_score"] = activity["quiz_score"] >= 85

Explanation

This code creates a new column called "is_high_score" in the activity DataFrame
It evaluates each row's quiz_score against the value 85 using a greater than or equal comparison
The result is a boolean True/False value indicating whether each student achieved a score of 85 or higher
This boolean flag can be used for filtering, grouping, or conditional operations on high-performing students
The operation is vectorized, meaning it efficiently processes all rows in the DataFrame simultaneously

Use apply when you really need custom logic.

Mistake 4: Forgetting `subset` In Duplicate Checks

This checks full-row duplicates:

python

activity.duplicated()

Explanation

The duplicated() method returns a boolean Series indicating which rows in a DataFrame are duplicates of previously occurring rows
By default, it marks all duplicates as True except for the first occurrence, which is marked as False
This method is useful for data cleaning and quality assessment to identify redundant records
The result can be used to filter or remove duplicate entries from datasets
The method supports various parameters like keep='first', keep='last', or keep=False to customize duplicate detection behavior

This checks duplicate attempts by learner and lesson:

python

activity.duplicated(subset=["learner_id", "lesson"])

Explanation

The code checks for duplicate rows in an activity dataset based on specific columns
It examines combinations of learner_id and lesson to find repeated entries
Returns a boolean Series indicating which rows are duplicates of previous rows
Useful for data cleaning to identify redundant activity records in educational analytics
The subset parameter specifies exactly which columns to consider when determining duplication criteria

They answer different questions.

Mistake 5: Resetting Index And Keeping Old Index Accidentally

This keeps the old index as a column:

python

df.reset_index()

Explanation

Resets the DataFrame's index to the default integer index starting from zero
Removes any existing index labels and replaces them with sequential integers
Useful when index values are no longer meaningful or need to be standardized
Commonly used after filtering or sorting operations that may have left irregular index values

This drops it:

python

df.reset_index(drop=True)

Explanation

Removes the current index labels and replaces them with a new sequential integer index starting from zero
The drop=True parameter ensures the old index is completely discarded rather than being added as a new column
Useful when you need to reorganize data after filtering, sorting, or concatenating DataFrames
Helps maintain consistent indexing when performing operations that might disrupt the original index structure

Choose intentionally.

31. Compact Cheat Sheet

python

# Counts
df["col"].value_counts()
df[["a", "b"]].value_counts()

# Sorting
df.sort_values("score", ascending=False)
df.sort_values(["track", "score"], ascending=[True, False])
df.sort_index()

# Ranking
df["rank"] = df["score"].rank(ascending=False, method="dense")
df["group_rank"] = df.groupby("track")["score"].rank(ascending=False)

# Index
df = df.set_index("id")
df = df.reset_index()
df = df.reset_index(drop=True)

# Rename
df = df.rename(columns={"old": "new"})

# Unique
df["city"].unique()
df["city"].nunique()

# Missing
df.isna().sum()
df[df["mentor"].notna()]
df.dropna(subset=["mentor"])
df["mentor"] = df["mentor"].fillna("Unassigned")
df["mentor"].ffill()
df["mentor"].bfill()

# Duplicates
df.duplicated().sum()
df.duplicated(subset=["id", "lesson"], keep=False)
df.drop_duplicates(subset=["id", "lesson"], keep="last")

# Drop
df.drop(columns=["unused_col"])
df.drop(index=[0, 1])

# Apply
df["label"] = df["score"].apply(score_label)
df["summary"] = df.apply(make_summary, axis=1)

Explanation

Value counting and sorting operations including single column counts, multi-column combinations, and various sorting configurations with ascending/descending options
Ranking functionality using dense method for both global and group-based score rankings within DataFrame columns
Index management through setting, resetting, and dropping index columns while maintaining data integrity
Column renaming and unique value extraction with count statistics for identifying distinct data entries
Missing data handling including detection, filtering, filling strategies, and duplicate removal techniques for data cleaning and preprocessing

32. Interview-Style Questions

1. What is the difference between `sort_values` and `sort_index`?

sort_values sorts by data inside columns. sort_index sorts by row labels or column labels.

2. What does `value_counts` do?

It counts how often values appear. On a Series it counts one column. On a DataFrame it counts combinations of columns.

3. What does `rank(method="dense")` mean?

Dense ranking gives the same rank to ties and does not skip the next rank.

4. Why use `set_index`?

Use set_index when one column is a better row label, such as an ID, date, username, or product code.

5. What is the purpose of `reset_index(drop=True)`?

It resets the index to simple row numbers and discards the old index instead of keeping it as a column.

6. What is the difference between `unique` and `nunique`?

unique returns the distinct values. nunique returns the number of distinct values.

7. When should you use `dropna`?

Use it when rows with missing values are not useful for the specific analysis. Do not use it blindly.

8. What does `drop_duplicates(subset=[...], keep="last")` do?

It removes duplicate rows based on selected columns and keeps the last row from each duplicate group.

9. What does `apply(axis=1)` do?

It runs a function row by row. The function receives each row as a Series.

10. Why should row-wise `apply` be used carefully?

It is flexible but can be slower than vectorized operations on large DataFrames.

Sources And Further Reading

This guide uses original examples and original CSV files. The notebook was used only as a topic outline.

Official Pandas references:

Pandas API reference: https://pandas.pydata.org/docs/reference/
DataFrame basics: https://pandas.pydata.org/docs/user_guide/basics.html
Indexing and selecting data: https://pandas.pydata.org/docs/user_guide/indexing.html
Working with missing data: https://pandas.pydata.org/docs/user_guide/missing_data.html
Group by user guide: https://pandas.pydata.org/docs/user_guide/groupby.html

Next in this series: Mastering Pandas GroupBy: A Comprehensive Guide to Data Analysis →

Pandas DataFrame Methods: Sort, Rank, Clean, and Apply

AI Insights

Q1. Which track has the most rows?

Q2. Which city-plan pair appears most often?

Q3. Sort learners by score, highest first.

Q4. Sort by track ascending and score descending.

Q5. Add a global score rank.

Q6. Add a rank inside each track.

Q7. Set learner_id as the index and select one learner.

Q8. Convert track counts into a DataFrame.

Q9. Rename quiz_score to score.

Q10. Count unique learners by track.

Q11. Show rows where mentor is missing.

Q12. Fill missing mentors with Unassigned.

Q13. Show duplicate learner-lesson attempts.

Q14. Keep only the last attempt for every learner and lesson.

Q15. Create a priority label using row-wise apply.

Q1. Count toss decisions.

Q2. Count match stages.

Q3. Find how many matches each team played.

Q4. Rank teams by number of wins.

Q5. Sort matches by season, stage, and margin.

Q6. Set match_id as the index.

Q7. Show unique stages.

Q8. Count unique cities.

Q9. Drop rows where winner is missing.

Q10. Fill missing winner with No Result.

Q11. Add a close_match column.

Q12. Add a match summary using row-wise apply.