Pandas DataFrame Methods: Sort, Rank, Clean, and Apply

Jun 2, 2026
38 min read

AI Insights

Powered by GPT-4o-mini

Verified Context: pandas-dataframe-methods-sort-rank-clean-and-apply
Quick Answer

Learn the most useful Pandas DataFrame methods with original CSV datasets: value_counts, sort_values, rank, sort_index, set_index, reset_index, rename, unique, nunique, isna, notna, dropna, fillna, duplicated, drop_duplicates, drop, and apply.

Quick Summary

Master essential Pandas DataFrame methods like sort, rank, and clean to analyze and manipulate your data effectively.

Pandas Important DataFrame Methods: Sort, Rank, Index, Clean, Apply, and Practice

Once you understand how to create, inspect, select, and filter a DataFrame, the next step is learning the methods you will use again and again.

These methods help you answer questions like:

  • Which category appears most often?
  • Which learner has the highest score?
  • Which rows have missing values?
  • Which records are duplicates?
  • How do I sort by more than one column?
  • How do I make one column the index?
  • How do I reset the index after sorting?
  • How do I rename columns?
  • How do I apply custom logic row by row?

This guide uses original sample datasets, not copied third-party datasets.

Files used in this lesson:

  • learning_activity.csv
  • mini_league_matches_methods.csv

Place both files in the same folder as this Markdown file before running the examples.

What You Will Learn

By the end, you should be able to use:

  • value_counts
  • sort_values
  • rank
  • sort_index
  • set_index
  • reset_index
  • rename
  • unique
  • nunique
  • isna, isnull, notna, and notnull
  • hasnans
  • dropna
  • fillna, ffill, and bfill
  • duplicated
  • drop_duplicates
  • drop
  • apply

You will also solve realistic practice tasks using CSV files.

1. Setup

python
import pandas as pd
import numpy as np

Load the learner activity dataset:

python
activity = pd.read_csv("learning_activity.csv")
activity.head()

Load the match dataset:

python
matches = pd.read_csv("mini_league_matches_methods.csv")
matches.head()

2. Quick Dataset Check

Before using any advanced method, inspect the data.

python
activity.shape
activity.info()
activity.head()
activity.tail()

Check column names:

python
activity.columns

Check missing values:

python
activity.isna().sum()

Check duplicates:

python
activity.duplicated().sum()

This small routine catches many problems early.

3. value_counts On A Series

Use value_counts to count how often each value appears in one column.

python
activity["track"].value_counts()

Example questions:

  • Which course track has the most activity?
  • Which city has the most learners?
  • Which plan is most common?
  • Which completion status appears most often?
python
activity["city"].value_counts()
activity["plan"].value_counts()
activity["status"].value_counts()

By default, missing values are ignored.

To include missing values:

python
activity["mentor"].value_counts(dropna=False)

To get percentages:

python
activity["track"].value_counts(normalize=True) * 100

4. value_counts On A DataFrame

DataFrame.value_counts counts repeated combinations of columns.

python
activity[["track", "status"]].value_counts()

This answers:

text
How many rows exist for each track-status pair?

You can include missing combinations:

python
activity[["track", "mentor"]].value_counts(dropna=False)

You can normalize:

python
activity[["track", "status"]].value_counts(normalize=True) * 100

This is useful when a single column is not enough.

5. sort_values On A Series

Create a simple Series:

python
scores = pd.Series([88, 72, 95, 64, 91], index=["Aarav", "Meera", "Nila", "Kabir", "Tara"])

Sort ascending:

python
scores.sort_values()

Sort descending:

python
scores.sort_values(ascending=False)

This returns a new sorted Series. It does not change the original unless you assign it back.

python
scores = scores.sort_values(ascending=False)

6. sort_values On A DataFrame

Sort learners by quiz score:

python
activity.sort_values("quiz_score", ascending=False)

Sort by minutes watched:

python
activity.sort_values("minutes_watched", ascending=False)

Sort by multiple columns:

python
activity.sort_values(
    ["track", "quiz_score"],
    ascending=[True, False]
)

This means:

  • track in ascending order
  • quiz score in descending order inside each track

7. Sorting With Missing Values

If a column has missing values, choose where they should go.

python
activity.sort_values("quiz_score", na_position="first")

Or:

python
activity.sort_values("quiz_score", na_position="last")

For reports, na_position="last" is usually easier to read.

8. rank

Ranking converts values into positions.

python
activity["score_rank"] = activity["quiz_score"].rank(ascending=False)

Now the highest quiz score gets rank 1.

python
activity.sort_values("score_rank").head()

If two learners have the same score, Pandas uses average ranking by default.

For dense ranking:

python
activity["dense_rank"] = activity["quiz_score"].rank(
    ascending=False,
    method="dense"
)

Dense ranking does not leave gaps after ties.

Example:

text
Score: 95, 91, 91, 88
Dense rank: 1, 2, 2, 3

9. Ranking Within Groups

Rank learners inside each track.

python
activity["track_rank"] = activity.groupby("track")["quiz_score"].rank(
    ascending=False,
    method="dense"
)

Show top learners per track:

python
activity.sort_values(["track", "track_rank", "learner"])

This pattern is powerful for leaderboards.

10. sort_index

Sorting by values sorts the data inside columns.

Sorting by index sorts row labels.

python
activity_by_id = activity.set_index("learner_id")
activity_by_id.sort_index()

Descending:

python
activity_by_id.sort_index(ascending=False)

You use sort_index when the index itself has meaning, such as:

  • IDs
  • dates
  • alphabetical labels
  • ordered categories

11. set_index

Use set_index when one column should become the row label.

python
activity_by_id = activity.set_index("learner_id")

Now learner_id is the index.

You can select by ID:

python
activity_by_id.loc["L003"]

By default, set_index returns a new DataFrame.

If you want to modify the same variable:

python
activity = activity.set_index("learner_id")

In notebooks, returning a new DataFrame is often safer than using inplace=True.

12. reset_index

Use reset_index to convert the index back into a column.

python
activity_by_id.reset_index()

If you do not want the old index as a column:

python
activity_by_id.reset_index(drop=True)

Common pattern after sorting:

python
top_learners = (
    activity
    .sort_values("quiz_score", ascending=False)
    .reset_index(drop=True)
)

This gives clean row numbers after sorting.

13. Series To DataFrame With reset_index

value_counts returns a Series.

python
track_counts = activity["track"].value_counts()
type(track_counts)

Convert it into a DataFrame:

python
track_counts_df = track_counts.reset_index()
track_counts_df.columns = ["track", "row_count"]
track_counts_df

This is useful before exporting a summary.

14. rename

Rename columns:

python
activity = activity.rename(columns={
    "minutes_watched": "minutes",
    "quiz_score": "score"
})

Rename index labels:

python
activity_by_id = activity.set_index("learner_id")

activity_by_id.rename(index={
    "L001": "learner-001",
    "L002": "learner-002"
})

Clean column names after reading messy CSV files:

python
activity.columns = (
    activity.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

Clean column names reduce bugs.

15. unique

Use unique to see distinct values in a Series.

python
activity["track"].unique()

This returns an array of unique values.

For city names:

python
activity["city"].unique()

If you see values like "Pune" and "pune", you have a cleaning issue.

Normalize before analysis:

python
activity["city"] = activity["city"].str.title()

16. nunique

Use nunique to count unique values.

python
activity["learner_id"].nunique()

Count unique learners by track:

python
activity.groupby("track")["learner_id"].nunique()

By default, missing values are ignored.

Include missing values:

python
activity["mentor"].nunique(dropna=False)

17. isna, isnull, notna, And notnull

Check missing values:

python
activity.isna()

Count missing values per column:

python
activity.isna().sum()

isnull is an alias many people still use:

python
activity.isnull().sum()

Check non-missing values:

python
activity.notna()
activity.notnull()

Select rows with missing mentor:

python
activity[activity["mentor"].isna()]

Select rows with known mentor:

python
activity[activity["mentor"].notna()]

18. hasnans

hasnans is available on a Series.

python
activity["mentor"].hasnans

It returns True if the Series contains missing values.

It is a quick yes/no check.

19. dropna

dropna removes rows or columns with missing values.

Drop rows with any missing value:

python
activity.dropna()

Drop rows only if all values are missing:

python
activity.dropna(how="all")

Drop rows where a specific column is missing:

python
activity.dropna(subset=["mentor"])

Drop rows where either mentor or completed_at is missing:

python
activity.dropna(subset=["mentor", "completed_at"])

Drop rows only when both mentor and completed_at are missing:

python
activity.dropna(subset=["mentor", "completed_at"], how="all")

Do not use dropna blindly. In learning data, missing completed_at may simply mean the learner has not completed the lesson yet.

20. fillna

Use fillna to replace missing values.

Fill one column:

python
activity["mentor"] = activity["mentor"].fillna("Unassigned")

Fill numeric missing values with median:

python
activity["quiz_score"] = activity["quiz_score"].fillna(activity["quiz_score"].median())

Fill multiple columns with different values:

python
activity = activity.fillna({
    "mentor": "Unassigned",
    "completed_at": "Not completed",
    "status": "unknown"
})

Column-specific filling is usually better than filling the whole DataFrame with one value.

21. Forward Fill And Backward Fill

Forward fill uses the previous known value.

python
activity["mentor"].ffill()

Backward fill uses the next known value.

python
activity["mentor"].bfill()

These are useful for time series or ordered logs, but they can be misleading if the row order has no meaning.

22. duplicated

Check full duplicate rows:

python
activity.duplicated()

Count duplicates:

python
activity.duplicated().sum()

Check duplicates by specific columns:

python
activity.duplicated(subset=["learner_id", "lesson"])

Show duplicate lesson attempts:

python
activity[activity.duplicated(subset=["learner_id", "lesson"], keep=False)]

keep=False marks all duplicate rows in each duplicate group.

23. drop_duplicates

Remove exact duplicate rows:

python
activity.drop_duplicates()

Keep the last attempt for each learner and lesson:

python
latest_attempts = activity.drop_duplicates(
    subset=["learner_id", "lesson"],
    keep="last"
)

Drop all duplicated learner-lesson rows:

python
unique_only = activity.drop_duplicates(
    subset=["learner_id", "lesson"],
    keep=False
)

For event data, keep="last" is common when later rows represent updated attempts.

24. drop

Drop columns:

python
activity.drop(columns=["mentor"])

Drop rows by index:

python
activity.drop(index=[0, 1])

After setting an index, drop by label:

python
activity_by_id = activity.set_index("learner_id")
activity_by_id.drop(index=["L001", "L002"])

Drop with axis also works:

python
activity.drop(["mentor"], axis=1)

But columns=[...] is usually clearer.

25. apply On A Series

apply runs a function on each value in a Series.

python
def score_label(score):
    if pd.isna(score):
        return "missing"
    if score >= 85:
        return "strong"
    if score >= 70:
        return "steady"
    return "needs_practice"

activity["score_label"] = activity["quiz_score"].apply(score_label)

For many simple conditions, vectorized methods like np.where or pd.cut can be faster.

But apply is easy to understand and useful for custom logic.

26. apply Row By Row

Use axis=1 when your function needs more than one column from the same row.

python
def learner_priority(row):
    if row["status"] != "completed" and row["minutes_watched"] >= 250:
        return "follow_up"
    if row["quiz_score"] >= 90:
        return "celebrate"
    return "normal"

activity["priority"] = activity.apply(learner_priority, axis=1)

Row-wise apply is flexible, but it can be slower on very large datasets.

Use it when clarity matters or when the logic is too custom for simple vectorized expressions.

27. Mini Practice: Learner Activity

Reload a clean copy:

python
activity = pd.read_csv("learning_activity.csv")

Practice Task

Q1. Which track has the most rows?

python
activity["track"].value_counts()

Explanation

  • The code examines the distribution of different activity types in a dataset by counting how many times each unique activity appears
  • It uses the pandas value_counts() method to generate a frequency table of activity track values
  • The result shows the count of each distinct activity type, helping identify which activities occur most frequently
  • This analysis is useful for understanding user engagement patterns and popular activity categories
  • The output provides insights into data distribution that can inform business decisions and resource allocation

Practice Task

Q2. Which city-plan pair appears most often?

python
activity[["city", "plan"]].value_counts()

Explanation

  • This code performs a frequency count of unique combinations between city and plan columns in the activity DataFrame
  • The double bracket notation [["city", "plan"]] selects multiple columns simultaneously for analysis
  • value_counts() method returns a Series with MultiIndex where each index represents a unique city-plan pair
  • The output shows how many times each city-plan combination appears in the dataset, useful for identifying popular activity patterns
  • This technique is commonly used in data exploration to understand distribution patterns across categorical variables

Practice Task

Q3. Sort learners by score, highest first.

python
activity.sort_values("quiz_score", ascending=False)

Explanation

  • The code sorts a pandas DataFrame called "activity" based on the "quiz_score" column in descending order
  • Values with higher quiz scores appear first in the sorted result
  • The ascending=False parameter ensures the highest scores are displayed at the top
  • This operation modifies the DataFrame in place without creating a new object
  • Commonly used for ranking students or participants by performance metrics

Practice Task

Q4. Sort by track ascending and score descending.

python
activity.sort_values(["track", "quiz_score"], ascending=[True, False])

Explanation

  • The code sorts a pandas DataFrame called 'activity' by two columns: "track" in ascending order and "quiz_score" in descending order
  • The sort_values() method rearranges rows based on the specified column priorities, with tracks grouped together and quiz scores ordered from highest to lowest within each track
  • This creates a structured dataset where activities are organized first by track category, then by performance scores within each category
  • The ascending parameter controls sorting direction, with True for ascending (A-Z) and False for descending (Z-A)
  • This type of sorting is commonly used for creating performance reports or organizing educational content by difficulty level

Practice Task

Q5. Add a global score rank.

python
activity["global_rank"] = activity["quiz_score"].rank(
    ascending=False,
    method="dense"
)

Explanation

  • Assigns a global ranking to each activity entry based on quiz scores in descending order
  • Uses the dense ranking method which assigns the same rank to tied values without skipping subsequent ranks
  • Higher quiz scores receive lower rank numbers (1st place, 2nd place, etc.)
  • Modifies the activity DataFrame by adding a new "global_rank" column with calculated rankings
  • The ranking process ensures consistent ordering where ties share identical rank values

Practice Task

Q6. Add a rank inside each track.

python
activity["track_rank"] = activity.groupby("track")["quiz_score"].rank(
    ascending=False,
    method="dense"
)

Explanation

  • Groups the activity dataframe by the "track" column to organize data by track categories
  • Within each track group, ranks the "quiz_score" values in descending order (highest scores first)
  • Uses the "dense" ranking method which assigns the same rank to tied scores without skipping subsequent rank numbers
  • Creates a new "track_rank" column that contains the rank position for each quiz score within its respective track
  • The resulting ranking helps identify top performers within each track category efficiently

Practice Task

Q7. Set learner_id as the index and select one learner.

python
activity_by_id = activity.set_index("learner_id")
activity_by_id.loc["L003"]

Explanation

  • Creates a new DataFrame view with "learner_id" column as the index using set_index() method
  • Uses .loc accessor to retrieve all rows where learner_id equals "L003" from the indexed DataFrame
  • This approach enables efficient lookup of specific learner records by their unique identifier
  • The indexed structure provides faster access times when querying multiple records for the same learner ID
  • Returns all activity records associated with the specified learner ID in a structured format

Practice Task

Q8. Convert track counts into a DataFrame.

python
track_counts = activity["track"].value_counts().reset_index()
track_counts.columns = ["track", "rows"]
track_counts

Explanation

  • The code extracts unique track values from the activity DataFrame's "track" column and counts how many times each track appears
  • It uses value_counts() to generate a Series with track names as index and their occurrence counts as values
  • The reset_index() method converts the Series into a DataFrame format with tracks as a regular column instead of an index
  • Column names are renamed to "track" and "rows" for better readability and clarity
  • The final result displays a tabular view showing each track and its corresponding count of occurrences

Practice Task

Q9. Rename quiz_score to score.

python
activity.rename(columns={"quiz_score": "score"})

Explanation

  • The code uses the pandas rename() method to change the column name "quiz_score" to "score" in a DataFrame called activity
  • This operation creates a new DataFrame with the updated column names while preserving all original data
  • The rename() method accepts a dictionary where keys are current column names and values are new column names
  • This technique is commonly used to standardize column naming across datasets or make column names more readable
  • The operation is non-destructive when using the inplace parameter or assigning the result to a new variable

Practice Task

Q10. Count unique learners by track.

python
activity.groupby("track")["learner_id"].nunique()

Explanation

  • Groups data by the "track" column to organize records by track category
  • For each track group, counts the number of distinct learner IDs using nunique()
  • Returns a pandas Series with track names as index and unique learner counts as values
  • This operation efficiently summarizes learner diversity across different tracks
  • Commonly used for analytics to measure engagement breadth within educational pathways

Practice Task

Q11. Show rows where mentor is missing.

python
activity[activity["mentor"].isna()]

Explanation

  • Uses boolean indexing to filter rows from the activity DataFrame where the "mentor" column has null/NaN values
  • The isna() method returns a boolean Series identifying which entries are null in the mentor column
  • This creates a subset of the original DataFrame containing only records without assigned mentors
  • Commonly used for data cleaning to identify unassigned or missing mentorship records
  • Returns a new DataFrame with the same structure but filtered to show only null mentor entries

Practice Task

Q12. Fill missing mentors with Unassigned.

python
activity["mentor"] = activity["mentor"].fillna("Unassigned")

Explanation

  • Replaces all null/missing values in the "mentor" column with the string "Unassigned"
  • Uses pandas fillna() method to handle missing data gracefully in the activity dataset
  • Ensures consistent data representation by providing a default value for undefined mentor assignments
  • Maintains data integrity by preventing null values from propagating through subsequent operations
  • Common pattern for cleaning categorical data where missing values need explicit handling

Practice Task

Q13. Show duplicate learner-lesson attempts.

python
activity[
    activity.duplicated(subset=["learner_id", "lesson"], keep=False)
]

Explanation

  • This code identifies all rows in the activity DataFrame that have duplicate combinations of learner_id and lesson values
  • The duplicated() method returns a boolean mask indicating which rows are duplicates based on the specified subset columns
  • Setting keep=False ensures all duplicate instances are marked as True, not just the subsequent ones
  • The result filters the original DataFrame to show only rows where the same learner has completed the same lesson multiple times
  • This is useful for identifying repeated learning activities or potential data entry issues in educational analytics

Practice Task

Q14. Keep only the last attempt for every learner and lesson.

python
activity = activity.drop_duplicates(
    subset=["learner_id", "lesson"],
    keep="last"
)

Explanation

  • The code eliminates duplicate rows from the activity DataFrame based on combinations of learner_id and lesson columns
  • It keeps only the last occurrence of each duplicate combination, ensuring the most recent record is retained
  • The subset parameter specifies which columns to consider when identifying duplicates
  • This operation helps clean data by removing redundant learner progress entries for the same lesson

Practice Task

Q15. Create a priority label using row-wise apply.

python
def priority(row):
    if row["status"] != "completed" and row["minutes_watched"] >= 250:
        return "follow_up"
    if row["quiz_score"] >= 90:
        return "celebrate"
    return "normal"

activity["priority"] = activity.apply(priority, axis=1)

Explanation

  • The priority function evaluates student data rows to determine appropriate follow-up actions based on multiple criteria
  • Students who haven't completed their status but have watched at least 250 minutes receive "follow_up" priority status
  • High-performing students with quiz scores of 90 or above are flagged with "celebrate" priority level
  • All other students automatically receive the default "normal" priority classification
  • The apply method processes each row of the activity DataFrame through the priority function to create a new priority column

28. Mini Practice: Fictional League Matches

Reload the matches dataset:

python
matches = pd.read_csv("mini_league_matches_methods.csv")

Explanation

  • Reads match data from a CSV file named "mini_league_matches_methods.csv" into a pandas DataFrame called matches
  • The CSV file likely contains information about football matches including teams, scores, dates, and other statistical data
  • This operation creates a structured dataset that can be manipulated and analyzed using pandas methods
  • The resulting DataFrame provides a foundation for performing data analysis, filtering, grouping, and visualization tasks
  • Common use cases include generating match statistics, calculating team performance metrics, or creating visualizations of league standings

This dataset is fictional and created for Pandas practice.

Practice Task

Q1. Count toss decisions.

python
matches["toss_decision"].value_counts()

Explanation

  • The code examines the frequency distribution of toss decisions (bat or field) made by teams in cricket matches
  • It uses the value_counts() method to count how many times each toss decision appears in the toss_decision column
  • This provides insights into whether teams prefer batting or fielding after winning the toss
  • The output shows the absolute counts for each decision type, helping identify patterns in team strategy
  • This analysis is commonly used in sports analytics to understand tactical preferences and decision-making patterns

Practice Task

Q2. Count match stages.

python
matches["stage"].value_counts()

Explanation

  • This code performs a frequency count of unique values in the "stage" column of a pandas DataFrame called matches
  • The value_counts() method returns a Series with each unique stage value as the index and its corresponding count as the value
  • This is useful for understanding how many records exist for each stage category in your dataset
  • The output helps identify which stages are most common or rare in your data
  • This operation is commonly used for exploratory data analysis and data validation purposes

Practice Task

Q3. Find how many matches each team played.

Because a team can appear in either team_a or team_b, combine both columns.

python
teams_played = pd.concat([matches["team_a"], matches["team_b"]])
teams_played.value_counts()

Explanation

  • Concatenates two columns from a matches DataFrame containing team names (team_a and team_b) into a single Series
  • Creates a unified list of all teams that participated in matches by combining both home and away team columns
  • Uses value_counts() method to tally how many times each team appears in the concatenated data
  • Returns a sorted Series showing team names as index and their total match participation counts as values
  • Useful for identifying most active teams or analyzing team involvement patterns in sports analytics

Practice Task

Q4. Rank teams by number of wins.

python
team_wins = matches["winner"].value_counts().reset_index()
team_wins.columns = ["team", "wins"]
team_wins["rank"] = team_wins["wins"].rank(
    ascending=False,
    method="dense"
)
team_wins.sort_values(["rank", "team"])

Explanation

  • The code retrieves the count of wins for each team from the "winner" column in the matches DataFrame.
  • It resets the index to convert the Series into a DataFrame and renames the columns to "team" and "wins".
  • The rank method is applied to the "wins" column to assign a ranking based on the number of wins, with ties receiving the same rank.
  • Finally, the DataFrame is sorted by rank in ascending order and by team name to ensure a clear presentation of the results.

Practice Task

Q5. Sort matches by season, stage, and margin.

python
matches.sort_values(
    ["season", "stage", "margin"],
    ascending=[True, True, False]
)

Explanation

  • The sort_values method is used to sort the DataFrame named matches.
  • The sorting is performed based on three columns: season, stage, and margin.
  • The ascending parameter specifies the sort order: season and stage are sorted in ascending order, while margin is sorted in descending order.
  • This allows for a structured organization of match data, prioritizing seasons and stages while highlighting larger margins.

Practice Task

Q6. Set match_id as the index.

python
matches_by_id = matches.set_index("match_id")
matches_by_id.sort_index()

Explanation

  • The set_index("match_id") method transforms the DataFrame by using the "match_id" column as the new index, allowing for quicker lookups.
  • The sort_index() method sorts the DataFrame based on the new index, ensuring that the data is organized in ascending order of match IDs.
  • This approach enhances data manipulation and retrieval, particularly when working with large datasets.
  • The original DataFrame remains unchanged unless reassigned, so the sorted version needs to be stored if required.

Practice Task

Q7. Show unique stages.

python
matches["stage"].unique()

Explanation

  • Retrieves all distinct values present in the "stage" column of the matches DataFrame
  • Returns a numpy array containing only unique stage identifiers without duplicates
  • Useful for understanding the variety of stages available in the dataset
  • Commonly used for data exploration and validation before further analysis
  • Helps identify potential data inconsistencies or missing stage categories

Practice Task

Q8. Count unique cities.

python
matches["city"].nunique()

Explanation

  • The code uses the nunique() method to count distinct values in the "city" column of a pandas DataFrame called matches
  • This method efficiently handles missing values by excluding them from the count automatically
  • The result returns a single integer representing how many unique cities exist in that dataset column
  • This operation is useful for understanding data diversity and identifying potential data quality issues
  • The syntax follows pandas convention where nunique() is applied directly to a Series object

Practice Task

Q9. Drop rows where winner is missing.

python
matches.dropna(subset=["winner"])

Explanation

  • The dropna method removes any rows where the "winner" column contains null or NaN values
  • The subset parameter specifically targets only the "winner" column for null value detection
  • This operation returns a new DataFrame without the incomplete match records
  • Useful for data cleaning when you need to ensure all matches have a defined winner before analysis
  • The original DataFrame remains unchanged since dropna creates a copy by default

In this sample dataset, one row has no winner because the match had no result.

Practice Task

Q10. Fill missing winner with No Result.

python
matches["winner"] = matches["winner"].fillna("No Result")

Explanation

  • The code fills NaN (missing) values in the "winner" column of a DataFrame called matches
  • Missing values are replaced with the string "No Result" to indicate incomplete match outcomes
  • This approach ensures consistent data representation where every match has a defined winner value
  • The fillna() method operates in-place on the DataFrame column without requiring assignment to a new variable
  • Useful for data analysis workflows where missing results need to be handled explicitly rather than left as null values

Practice Task

Q11. Add a close_match column.

python
matches["close_match"] = matches["margin"].apply(
    lambda margin: "yes" if margin <= 8 else "no"
)

Explanation

  • The code creates a new column called "close_match" in the matches DataFrame
  • It applies a lambda function to each value in the "margin" column to evaluate whether the margin is less than or equal to 8
  • Values meeting the condition (margin ≤ 8) are labeled as "yes", otherwise "no"
  • This operation uses pandas apply method for vectorized conditional assignment across all rows
  • The resulting column helps categorize matches as either "close" or "not close" based on the defined threshold

Practice Task

Q12. Add a match summary using row-wise apply.

python
def match_summary(row):
    if pd.isna(row["winner"]):
        return f'{row["team_a"]} vs {row["team_b"]}: no result'
    return f'{row["winner"]} won by {row["margin"]}'

matches["summary"] = matches.apply(match_summary, axis=1)

Explanation

  • The function match_summary takes a row of match data and checks if the winner column contains missing values using pd.isna()
  • When there's no winner (missing value), it returns a string indicating the match ended without a result
  • When there is a winner, it formats a summary showing which team won and by what margin
  • The apply() method processes each row of the matches DataFrame through this function
  • The resulting summary strings are stored in a new column called "summary" in the matches DataFrame

29. Method Selection Guide

Use this as a quick decision table.

TaskMethod
Count values in one columnvalue_counts
Count combinations of columnsDataFrame.value_counts
Sort rows by columnsort_values
Sort rows by indexsort_index
Create ranksrank
Make a column the indexset_index
Convert index back to columnreset_index
Rename columns or index labelsrename
See unique valuesunique
Count unique valuesnunique
Detect missing valuesisna or isnull
Detect non-missing valuesnotna or notnull
Remove missing rowsdropna
Fill missing valuesfillna, ffill, bfill
Detect duplicatesduplicated
Remove duplicatesdrop_duplicates
Remove rows or columnsdrop
Run custom logicapply

30. Common Mistakes

Mistake 1: Sorting Without Assigning

This returns a sorted DataFrame but does not change activity.

python
activity.sort_values("quiz_score")

Explanation

  • The code sorts a pandas DataFrame called "activity" based on the "quiz_score" column in ascending order
  • By default, sort_values() arranges values from lowest to highest
  • This operation returns a new sorted DataFrame without modifying the original data
  • Useful for analyzing student performance or ranking participants by their quiz results
  • The method can be modified with additional parameters like ascending=False for descending order

Assign it:

python
activity = activity.sort_values("quiz_score")

Explanation

  • The code sorts a pandas DataFrame called 'activity' based on the "quiz_score" column in ascending order
  • This operation rearranges all rows in the dataset so that records with lower quiz scores appear first
  • The sort_values() method modifies the DataFrame in-place by default, reordering the entire dataset
  • This sorting approach is commonly used for analyzing student performance or ranking participants by score
  • The resulting dataset maintains all original columns while organizing rows according to the specified sorting criteria

Mistake 2: Dropping Rows Too Early

Missing values are not always bad.

For example, a missing completed_at value can mean the learner has not completed the lesson yet.

Understand the meaning before using:

python
activity.dropna()

Explanation

  • The dropna() method eliminates rows or columns containing null values (NaN, None) from a pandas DataFrame
  • By default, it removes any row that contains at least one missing value, preserving only complete records
  • This operation returns a new DataFrame without modifying the original data structure
  • Commonly used in data cleaning processes to ensure analysis accuracy when missing values could skew results
  • Can be customized with parameters like subset= to target specific columns or how='all' to remove rows where all values are missing

Mistake 3: Using Row-Wise apply For Everything

This works:

python
activity.apply(priority, axis=1)

Explanation

  • The code applies a priority function to each row of the activity DataFrame using the pandas apply method
  • The axis=1 parameter specifies that the function should be applied horizontally across rows rather than vertically down columns
  • This approach enables vectorized processing of each row's data through the custom priority function
  • The result represents transformed activity data where each row has been processed according to the priority logic
  • This pattern is commonly used for data transformation tasks like scoring, categorization, or conditional calculations across dataset rows

But for simple comparisons, vectorized code is often cleaner:

python
activity["is_high_score"] = activity["quiz_score"] >= 85

Explanation

  • This code creates a new column called "is_high_score" in the activity DataFrame
  • It evaluates each row's quiz_score against the value 85 using a greater than or equal comparison
  • The result is a boolean True/False value indicating whether each student achieved a score of 85 or higher
  • This boolean flag can be used for filtering, grouping, or conditional operations on high-performing students
  • The operation is vectorized, meaning it efficiently processes all rows in the DataFrame simultaneously

Use apply when you really need custom logic.

Mistake 4: Forgetting subset In Duplicate Checks

This checks full-row duplicates:

python
activity.duplicated()

Explanation

  • The duplicated() method returns a boolean Series indicating which rows in a DataFrame are duplicates of previously occurring rows
  • By default, it marks all duplicates as True except for the first occurrence, which is marked as False
  • This method is useful for data cleaning and quality assessment to identify redundant records
  • The result can be used to filter or remove duplicate entries from datasets
  • The method supports various parameters like keep='first', keep='last', or keep=False to customize duplicate detection behavior

This checks duplicate attempts by learner and lesson:

python
activity.duplicated(subset=["learner_id", "lesson"])

Explanation

  • The code checks for duplicate rows in an activity dataset based on specific columns
  • It examines combinations of learner_id and lesson to find repeated entries
  • Returns a boolean Series indicating which rows are duplicates of previous rows
  • Useful for data cleaning to identify redundant activity records in educational analytics
  • The subset parameter specifies exactly which columns to consider when determining duplication criteria

They answer different questions.

Mistake 5: Resetting Index And Keeping Old Index Accidentally

This keeps the old index as a column:

python
df.reset_index()

Explanation

  • Resets the DataFrame's index to the default integer index starting from zero
  • Removes any existing index labels and replaces them with sequential integers
  • Useful when index values are no longer meaningful or need to be standardized
  • Commonly used after filtering or sorting operations that may have left irregular index values

This drops it:

python
df.reset_index(drop=True)

Explanation

  • Removes the current index labels and replaces them with a new sequential integer index starting from zero
  • The drop=True parameter ensures the old index is completely discarded rather than being added as a new column
  • Useful when you need to reorganize data after filtering, sorting, or concatenating DataFrames
  • Helps maintain consistent indexing when performing operations that might disrupt the original index structure

Choose intentionally.

31. Compact Cheat Sheet

python
# Counts
df["col"].value_counts()
df[["a", "b"]].value_counts()

# Sorting
df.sort_values("score", ascending=False)
df.sort_values(["track", "score"], ascending=[True, False])
df.sort_index()

# Ranking
df["rank"] = df["score"].rank(ascending=False, method="dense")
df["group_rank"] = df.groupby("track")["score"].rank(ascending=False)

# Index
df = df.set_index("id")
df = df.reset_index()
df = df.reset_index(drop=True)

# Rename
df = df.rename(columns={"old": "new"})

# Unique
df["city"].unique()
df["city"].nunique()

# Missing
df.isna().sum()
df[df["mentor"].notna()]
df.dropna(subset=["mentor"])
df["mentor"] = df["mentor"].fillna("Unassigned")
df["mentor"].ffill()
df["mentor"].bfill()

# Duplicates
df.duplicated().sum()
df.duplicated(subset=["id", "lesson"], keep=False)
df.drop_duplicates(subset=["id", "lesson"], keep="last")

# Drop
df.drop(columns=["unused_col"])
df.drop(index=[0, 1])

# Apply
df["label"] = df["score"].apply(score_label)
df["summary"] = df.apply(make_summary, axis=1)

Explanation

  • Value counting and sorting operations including single column counts, multi-column combinations, and various sorting configurations with ascending/descending options
  • Ranking functionality using dense method for both global and group-based score rankings within DataFrame columns
  • Index management through setting, resetting, and dropping index columns while maintaining data integrity
  • Column renaming and unique value extraction with count statistics for identifying distinct data entries
  • Missing data handling including detection, filtering, filling strategies, and duplicate removal techniques for data cleaning and preprocessing

32. Interview-Style Questions

1. What is the difference between sort_values and sort_index?

sort_values sorts by data inside columns. sort_index sorts by row labels or column labels.

2. What does value_counts do?

It counts how often values appear. On a Series it counts one column. On a DataFrame it counts combinations of columns.

3. What does rank(method="dense") mean?

Dense ranking gives the same rank to ties and does not skip the next rank.

4. Why use set_index?

Use set_index when one column is a better row label, such as an ID, date, username, or product code.

5. What is the purpose of reset_index(drop=True)?

It resets the index to simple row numbers and discards the old index instead of keeping it as a column.

6. What is the difference between unique and nunique?

unique returns the distinct values. nunique returns the number of distinct values.

7. When should you use dropna?

Use it when rows with missing values are not useful for the specific analysis. Do not use it blindly.

8. What does drop_duplicates(subset=[...], keep="last") do?

It removes duplicate rows based on selected columns and keeps the last row from each duplicate group.

9. What does apply(axis=1) do?

It runs a function row by row. The function receives each row as a Series.

10. Why should row-wise apply be used carefully?

It is flexible but can be slower than vectorized operations on large DataFrames.

Sources And Further Reading

This guide uses original examples and original CSV files. The notebook was used only as a topic outline.

Official Pandas references:


Next in this series: Mastering Pandas GroupBy: A Comprehensive Guide to Data Analysis →