Pandas Important DataFrame Methods: Sort, Rank, Index, Clean, Apply, and Practice
Once you understand how to create, inspect, select, and filter a DataFrame, the next step is learning the methods you will use again and again.
These methods help you answer questions like:
- Which category appears most often?
- Which learner has the highest score?
- Which rows have missing values?
- Which records are duplicates?
- How do I sort by more than one column?
- How do I make one column the index?
- How do I reset the index after sorting?
- How do I rename columns?
- How do I apply custom logic row by row?
This guide uses original sample datasets, not copied third-party datasets.
Files used in this lesson:
learning_activity.csvmini_league_matches_methods.csv
Place both files in the same folder as this Markdown file before running the examples.
What You Will Learn
By the end, you should be able to use:
value_countssort_valuesranksort_indexset_indexreset_indexrenameuniquenuniqueisna,isnull,notna, andnotnullhasnansdropnafillna,ffill, andbfillduplicateddrop_duplicatesdropapply
You will also solve realistic practice tasks using CSV files.
1. Setup
import pandas as pd
import numpy as npLoad the learner activity dataset:
activity = pd.read_csv("learning_activity.csv")
activity.head()Load the match dataset:
matches = pd.read_csv("mini_league_matches_methods.csv")
matches.head()2. Quick Dataset Check
Before using any advanced method, inspect the data.
activity.shape
activity.info()
activity.head()
activity.tail()Check column names:
activity.columnsCheck missing values:
activity.isna().sum()Check duplicates:
activity.duplicated().sum()This small routine catches many problems early.
3. value_counts On A Series
Use value_counts to count how often each value appears in one column.
activity["track"].value_counts()Example questions:
- Which course track has the most activity?
- Which city has the most learners?
- Which plan is most common?
- Which completion status appears most often?
activity["city"].value_counts()
activity["plan"].value_counts()
activity["status"].value_counts()By default, missing values are ignored.
To include missing values:
activity["mentor"].value_counts(dropna=False)To get percentages:
activity["track"].value_counts(normalize=True) * 1004. value_counts On A DataFrame
DataFrame.value_counts counts repeated combinations of columns.
activity[["track", "status"]].value_counts()This answers:
How many rows exist for each track-status pair?You can include missing combinations:
activity[["track", "mentor"]].value_counts(dropna=False)You can normalize:
activity[["track", "status"]].value_counts(normalize=True) * 100This is useful when a single column is not enough.
5. sort_values On A Series
Create a simple Series:
scores = pd.Series([88, 72, 95, 64, 91], index=["Aarav", "Meera", "Nila", "Kabir", "Tara"])Sort ascending:
scores.sort_values()Sort descending:
scores.sort_values(ascending=False)This returns a new sorted Series. It does not change the original unless you assign it back.
scores = scores.sort_values(ascending=False)6. sort_values On A DataFrame
Sort learners by quiz score:
activity.sort_values("quiz_score", ascending=False)Sort by minutes watched:
activity.sort_values("minutes_watched", ascending=False)Sort by multiple columns:
activity.sort_values(
["track", "quiz_score"],
ascending=[True, False]
)This means:
- track in ascending order
- quiz score in descending order inside each track
7. Sorting With Missing Values
If a column has missing values, choose where they should go.
activity.sort_values("quiz_score", na_position="first")Or:
activity.sort_values("quiz_score", na_position="last")For reports, na_position="last" is usually easier to read.
8. rank
Ranking converts values into positions.
activity["score_rank"] = activity["quiz_score"].rank(ascending=False)Now the highest quiz score gets rank 1.
activity.sort_values("score_rank").head()If two learners have the same score, Pandas uses average ranking by default.
For dense ranking:
activity["dense_rank"] = activity["quiz_score"].rank(
ascending=False,
method="dense"
)Dense ranking does not leave gaps after ties.
Example:
Score: 95, 91, 91, 88
Dense rank: 1, 2, 2, 39. Ranking Within Groups
Rank learners inside each track.
activity["track_rank"] = activity.groupby("track")["quiz_score"].rank(
ascending=False,
method="dense"
)Show top learners per track:
activity.sort_values(["track", "track_rank", "learner"])This pattern is powerful for leaderboards.
10. sort_index
Sorting by values sorts the data inside columns.
Sorting by index sorts row labels.
activity_by_id = activity.set_index("learner_id")
activity_by_id.sort_index()Descending:
activity_by_id.sort_index(ascending=False)You use sort_index when the index itself has meaning, such as:
- IDs
- dates
- alphabetical labels
- ordered categories
11. set_index
Use set_index when one column should become the row label.
activity_by_id = activity.set_index("learner_id")Now learner_id is the index.
You can select by ID:
activity_by_id.loc["L003"]By default, set_index returns a new DataFrame.
If you want to modify the same variable:
activity = activity.set_index("learner_id")In notebooks, returning a new DataFrame is often safer than using inplace=True.
12. reset_index
Use reset_index to convert the index back into a column.
activity_by_id.reset_index()If you do not want the old index as a column:
activity_by_id.reset_index(drop=True)Common pattern after sorting:
top_learners = (
activity
.sort_values("quiz_score", ascending=False)
.reset_index(drop=True)
)This gives clean row numbers after sorting.
13. Series To DataFrame With reset_index
value_counts returns a Series.
track_counts = activity["track"].value_counts()
type(track_counts)Convert it into a DataFrame:
track_counts_df = track_counts.reset_index()
track_counts_df.columns = ["track", "row_count"]
track_counts_dfThis is useful before exporting a summary.
14. rename
Rename columns:
activity = activity.rename(columns={
"minutes_watched": "minutes",
"quiz_score": "score"
})Rename index labels:
activity_by_id = activity.set_index("learner_id")
activity_by_id.rename(index={
"L001": "learner-001",
"L002": "learner-002"
})Clean column names after reading messy CSV files:
activity.columns = (
activity.columns
.str.strip()
.str.lower()
.str.replace(" ", "_")
)Clean column names reduce bugs.
15. unique
Use unique to see distinct values in a Series.
activity["track"].unique()This returns an array of unique values.
For city names:
activity["city"].unique()If you see values like "Pune" and "pune", you have a cleaning issue.
Normalize before analysis:
activity["city"] = activity["city"].str.title()16. nunique
Use nunique to count unique values.
activity["learner_id"].nunique()Count unique learners by track:
activity.groupby("track")["learner_id"].nunique()By default, missing values are ignored.
Include missing values:
activity["mentor"].nunique(dropna=False)17. isna, isnull, notna, And notnull
Check missing values:
activity.isna()Count missing values per column:
activity.isna().sum()isnull is an alias many people still use:
activity.isnull().sum()Check non-missing values:
activity.notna()
activity.notnull()Select rows with missing mentor:
activity[activity["mentor"].isna()]Select rows with known mentor:
activity[activity["mentor"].notna()]18. hasnans
hasnans is available on a Series.
activity["mentor"].hasnansIt returns True if the Series contains missing values.
It is a quick yes/no check.
19. dropna
dropna removes rows or columns with missing values.
Drop rows with any missing value:
activity.dropna()Drop rows only if all values are missing:
activity.dropna(how="all")Drop rows where a specific column is missing:
activity.dropna(subset=["mentor"])Drop rows where either mentor or completed_at is missing:
activity.dropna(subset=["mentor", "completed_at"])Drop rows only when both mentor and completed_at are missing:
activity.dropna(subset=["mentor", "completed_at"], how="all")Do not use dropna blindly. In learning data, missing completed_at may simply mean the learner has not completed the lesson yet.
20. fillna
Use fillna to replace missing values.
Fill one column:
activity["mentor"] = activity["mentor"].fillna("Unassigned")Fill numeric missing values with median:
activity["quiz_score"] = activity["quiz_score"].fillna(activity["quiz_score"].median())Fill multiple columns with different values:
activity = activity.fillna({
"mentor": "Unassigned",
"completed_at": "Not completed",
"status": "unknown"
})Column-specific filling is usually better than filling the whole DataFrame with one value.
21. Forward Fill And Backward Fill
Forward fill uses the previous known value.
activity["mentor"].ffill()Backward fill uses the next known value.
activity["mentor"].bfill()These are useful for time series or ordered logs, but they can be misleading if the row order has no meaning.
22. duplicated
Check full duplicate rows:
activity.duplicated()Count duplicates:
activity.duplicated().sum()Check duplicates by specific columns:
activity.duplicated(subset=["learner_id", "lesson"])Show duplicate lesson attempts:
activity[activity.duplicated(subset=["learner_id", "lesson"], keep=False)]keep=False marks all duplicate rows in each duplicate group.
23. drop_duplicates
Remove exact duplicate rows:
activity.drop_duplicates()Keep the last attempt for each learner and lesson:
latest_attempts = activity.drop_duplicates(
subset=["learner_id", "lesson"],
keep="last"
)Drop all duplicated learner-lesson rows:
unique_only = activity.drop_duplicates(
subset=["learner_id", "lesson"],
keep=False
)For event data, keep="last" is common when later rows represent updated attempts.
24. drop
Drop columns:
activity.drop(columns=["mentor"])Drop rows by index:
activity.drop(index=[0, 1])After setting an index, drop by label:
activity_by_id = activity.set_index("learner_id")
activity_by_id.drop(index=["L001", "L002"])Drop with axis also works:
activity.drop(["mentor"], axis=1)But columns=[...] is usually clearer.
25. apply On A Series
apply runs a function on each value in a Series.
def score_label(score):
if pd.isna(score):
return "missing"
if score >= 85:
return "strong"
if score >= 70:
return "steady"
return "needs_practice"
activity["score_label"] = activity["quiz_score"].apply(score_label)For many simple conditions, vectorized methods like np.where or pd.cut can be faster.
But apply is easy to understand and useful for custom logic.
26. apply Row By Row
Use axis=1 when your function needs more than one column from the same row.
def learner_priority(row):
if row["status"] != "completed" and row["minutes_watched"] >= 250:
return "follow_up"
if row["quiz_score"] >= 90:
return "celebrate"
return "normal"
activity["priority"] = activity.apply(learner_priority, axis=1)Row-wise apply is flexible, but it can be slower on very large datasets.
Use it when clarity matters or when the logic is too custom for simple vectorized expressions.
27. Mini Practice: Learner Activity
Reload a clean copy:
activity = pd.read_csv("learning_activity.csv")Practice Task
Q1. Which track has the most rows?
activity["track"].value_counts()Explanation
- The code examines the distribution of different activity types in a dataset by counting how many times each unique activity appears
- It uses the pandas value_counts() method to generate a frequency table of activity track values
- The result shows the count of each distinct activity type, helping identify which activities occur most frequently
- This analysis is useful for understanding user engagement patterns and popular activity categories
- The output provides insights into data distribution that can inform business decisions and resource allocation
Practice Task
Q2. Which city-plan pair appears most often?
activity[["city", "plan"]].value_counts()Explanation
- This code performs a frequency count of unique combinations between city and plan columns in the activity DataFrame
- The double bracket notation [["city", "plan"]] selects multiple columns simultaneously for analysis
- value_counts() method returns a Series with MultiIndex where each index represents a unique city-plan pair
- The output shows how many times each city-plan combination appears in the dataset, useful for identifying popular activity patterns
- This technique is commonly used in data exploration to understand distribution patterns across categorical variables
Practice Task
Q3. Sort learners by score, highest first.
activity.sort_values("quiz_score", ascending=False)Explanation
- The code sorts a pandas DataFrame called "activity" based on the "quiz_score" column in descending order
- Values with higher quiz scores appear first in the sorted result
- The ascending=False parameter ensures the highest scores are displayed at the top
- This operation modifies the DataFrame in place without creating a new object
- Commonly used for ranking students or participants by performance metrics
Practice Task
Q4. Sort by track ascending and score descending.
activity.sort_values(["track", "quiz_score"], ascending=[True, False])Explanation
- The code sorts a pandas DataFrame called 'activity' by two columns: "track" in ascending order and "quiz_score" in descending order
- The sort_values() method rearranges rows based on the specified column priorities, with tracks grouped together and quiz scores ordered from highest to lowest within each track
- This creates a structured dataset where activities are organized first by track category, then by performance scores within each category
- The ascending parameter controls sorting direction, with True for ascending (A-Z) and False for descending (Z-A)
- This type of sorting is commonly used for creating performance reports or organizing educational content by difficulty level
Practice Task
Q5. Add a global score rank.
activity["global_rank"] = activity["quiz_score"].rank(
ascending=False,
method="dense"
)Explanation
- Assigns a global ranking to each activity entry based on quiz scores in descending order
- Uses the dense ranking method which assigns the same rank to tied values without skipping subsequent ranks
- Higher quiz scores receive lower rank numbers (1st place, 2nd place, etc.)
- Modifies the activity DataFrame by adding a new "global_rank" column with calculated rankings
- The ranking process ensures consistent ordering where ties share identical rank values
Practice Task
Q6. Add a rank inside each track.
activity["track_rank"] = activity.groupby("track")["quiz_score"].rank(
ascending=False,
method="dense"
)Explanation
- Groups the activity dataframe by the "track" column to organize data by track categories
- Within each track group, ranks the "quiz_score" values in descending order (highest scores first)
- Uses the "dense" ranking method which assigns the same rank to tied scores without skipping subsequent rank numbers
- Creates a new "track_rank" column that contains the rank position for each quiz score within its respective track
- The resulting ranking helps identify top performers within each track category efficiently
Practice Task
Q7. Set learner_id as the index and select one learner.
activity_by_id = activity.set_index("learner_id")
activity_by_id.loc["L003"]Explanation
- Creates a new DataFrame view with "learner_id" column as the index using set_index() method
- Uses .loc accessor to retrieve all rows where learner_id equals "L003" from the indexed DataFrame
- This approach enables efficient lookup of specific learner records by their unique identifier
- The indexed structure provides faster access times when querying multiple records for the same learner ID
- Returns all activity records associated with the specified learner ID in a structured format
Practice Task
Q8. Convert track counts into a DataFrame.
track_counts = activity["track"].value_counts().reset_index()
track_counts.columns = ["track", "rows"]
track_countsExplanation
- The code extracts unique track values from the activity DataFrame's "track" column and counts how many times each track appears
- It uses value_counts() to generate a Series with track names as index and their occurrence counts as values
- The reset_index() method converts the Series into a DataFrame format with tracks as a regular column instead of an index
- Column names are renamed to "track" and "rows" for better readability and clarity
- The final result displays a tabular view showing each track and its corresponding count of occurrences
Practice Task
Q9. Rename quiz_score to score.
activity.rename(columns={"quiz_score": "score"})Explanation
- The code uses the pandas rename() method to change the column name "quiz_score" to "score" in a DataFrame called activity
- This operation creates a new DataFrame with the updated column names while preserving all original data
- The rename() method accepts a dictionary where keys are current column names and values are new column names
- This technique is commonly used to standardize column naming across datasets or make column names more readable
- The operation is non-destructive when using the inplace parameter or assigning the result to a new variable
Practice Task
Q10. Count unique learners by track.
activity.groupby("track")["learner_id"].nunique()Explanation
- Groups data by the "track" column to organize records by track category
- For each track group, counts the number of distinct learner IDs using nunique()
- Returns a pandas Series with track names as index and unique learner counts as values
- This operation efficiently summarizes learner diversity across different tracks
- Commonly used for analytics to measure engagement breadth within educational pathways
Practice Task
Q11. Show rows where mentor is missing.
activity[activity["mentor"].isna()]Explanation
- Uses boolean indexing to filter rows from the activity DataFrame where the "mentor" column has null/NaN values
- The isna() method returns a boolean Series identifying which entries are null in the mentor column
- This creates a subset of the original DataFrame containing only records without assigned mentors
- Commonly used for data cleaning to identify unassigned or missing mentorship records
- Returns a new DataFrame with the same structure but filtered to show only null mentor entries
Practice Task
Q12. Fill missing mentors with Unassigned.
activity["mentor"] = activity["mentor"].fillna("Unassigned")Explanation
- Replaces all null/missing values in the "mentor" column with the string "Unassigned"
- Uses pandas fillna() method to handle missing data gracefully in the activity dataset
- Ensures consistent data representation by providing a default value for undefined mentor assignments
- Maintains data integrity by preventing null values from propagating through subsequent operations
- Common pattern for cleaning categorical data where missing values need explicit handling
Practice Task
Q13. Show duplicate learner-lesson attempts.
activity[
activity.duplicated(subset=["learner_id", "lesson"], keep=False)
]Explanation
- This code identifies all rows in the activity DataFrame that have duplicate combinations of learner_id and lesson values
- The duplicated() method returns a boolean mask indicating which rows are duplicates based on the specified subset columns
- Setting keep=False ensures all duplicate instances are marked as True, not just the subsequent ones
- The result filters the original DataFrame to show only rows where the same learner has completed the same lesson multiple times
- This is useful for identifying repeated learning activities or potential data entry issues in educational analytics
Practice Task
Q14. Keep only the last attempt for every learner and lesson.
activity = activity.drop_duplicates(
subset=["learner_id", "lesson"],
keep="last"
)Explanation
- The code eliminates duplicate rows from the activity DataFrame based on combinations of learner_id and lesson columns
- It keeps only the last occurrence of each duplicate combination, ensuring the most recent record is retained
- The subset parameter specifies which columns to consider when identifying duplicates
- This operation helps clean data by removing redundant learner progress entries for the same lesson
Practice Task
Q15. Create a priority label using row-wise apply.
def priority(row):
if row["status"] != "completed" and row["minutes_watched"] >= 250:
return "follow_up"
if row["quiz_score"] >= 90:
return "celebrate"
return "normal"
activity["priority"] = activity.apply(priority, axis=1)Explanation
- The priority function evaluates student data rows to determine appropriate follow-up actions based on multiple criteria
- Students who haven't completed their status but have watched at least 250 minutes receive "follow_up" priority status
- High-performing students with quiz scores of 90 or above are flagged with "celebrate" priority level
- All other students automatically receive the default "normal" priority classification
- The apply method processes each row of the activity DataFrame through the priority function to create a new priority column
28. Mini Practice: Fictional League Matches
Reload the matches dataset:
matches = pd.read_csv("mini_league_matches_methods.csv")Explanation
- Reads match data from a CSV file named "mini_league_matches_methods.csv" into a pandas DataFrame called matches
- The CSV file likely contains information about football matches including teams, scores, dates, and other statistical data
- This operation creates a structured dataset that can be manipulated and analyzed using pandas methods
- The resulting DataFrame provides a foundation for performing data analysis, filtering, grouping, and visualization tasks
- Common use cases include generating match statistics, calculating team performance metrics, or creating visualizations of league standings
This dataset is fictional and created for Pandas practice.
Practice Task
Q1. Count toss decisions.
matches["toss_decision"].value_counts()Explanation
- The code examines the frequency distribution of toss decisions (bat or field) made by teams in cricket matches
- It uses the value_counts() method to count how many times each toss decision appears in the toss_decision column
- This provides insights into whether teams prefer batting or fielding after winning the toss
- The output shows the absolute counts for each decision type, helping identify patterns in team strategy
- This analysis is commonly used in sports analytics to understand tactical preferences and decision-making patterns
Practice Task
Q2. Count match stages.
matches["stage"].value_counts()Explanation
- This code performs a frequency count of unique values in the "stage" column of a pandas DataFrame called matches
- The value_counts() method returns a Series with each unique stage value as the index and its corresponding count as the value
- This is useful for understanding how many records exist for each stage category in your dataset
- The output helps identify which stages are most common or rare in your data
- This operation is commonly used for exploratory data analysis and data validation purposes
Practice Task
Q3. Find how many matches each team played.
Because a team can appear in either team_a or team_b, combine both columns.
teams_played = pd.concat([matches["team_a"], matches["team_b"]])
teams_played.value_counts()Explanation
- Concatenates two columns from a matches DataFrame containing team names (team_a and team_b) into a single Series
- Creates a unified list of all teams that participated in matches by combining both home and away team columns
- Uses value_counts() method to tally how many times each team appears in the concatenated data
- Returns a sorted Series showing team names as index and their total match participation counts as values
- Useful for identifying most active teams or analyzing team involvement patterns in sports analytics
Practice Task
Q4. Rank teams by number of wins.
team_wins = matches["winner"].value_counts().reset_index()
team_wins.columns = ["team", "wins"]
team_wins["rank"] = team_wins["wins"].rank(
ascending=False,
method="dense"
)
team_wins.sort_values(["rank", "team"])Explanation
- The code retrieves the count of wins for each team from the "winner" column in the
matchesDataFrame. - It resets the index to convert the Series into a DataFrame and renames the columns to "team" and "wins".
- The
rankmethod is applied to the "wins" column to assign a ranking based on the number of wins, with ties receiving the same rank. - Finally, the DataFrame is sorted by rank in ascending order and by team name to ensure a clear presentation of the results.
Practice Task
Q5. Sort matches by season, stage, and margin.
matches.sort_values(
["season", "stage", "margin"],
ascending=[True, True, False]
)Explanation
- The
sort_valuesmethod is used to sort the DataFrame namedmatches. - The sorting is performed based on three columns:
season,stage, andmargin. - The
ascendingparameter specifies the sort order:seasonandstageare sorted in ascending order, whilemarginis sorted in descending order. - This allows for a structured organization of match data, prioritizing seasons and stages while highlighting larger margins.
Practice Task
Q6. Set match_id as the index.
matches_by_id = matches.set_index("match_id")
matches_by_id.sort_index()Explanation
- The
set_index("match_id")method transforms the DataFrame by using the "match_id" column as the new index, allowing for quicker lookups. - The
sort_index()method sorts the DataFrame based on the new index, ensuring that the data is organized in ascending order of match IDs. - This approach enhances data manipulation and retrieval, particularly when working with large datasets.
- The original DataFrame remains unchanged unless reassigned, so the sorted version needs to be stored if required.
Practice Task
Q7. Show unique stages.
matches["stage"].unique()Explanation
- Retrieves all distinct values present in the "stage" column of the matches DataFrame
- Returns a numpy array containing only unique stage identifiers without duplicates
- Useful for understanding the variety of stages available in the dataset
- Commonly used for data exploration and validation before further analysis
- Helps identify potential data inconsistencies or missing stage categories
Practice Task
Q8. Count unique cities.
matches["city"].nunique()Explanation
- The code uses the nunique() method to count distinct values in the "city" column of a pandas DataFrame called matches
- This method efficiently handles missing values by excluding them from the count automatically
- The result returns a single integer representing how many unique cities exist in that dataset column
- This operation is useful for understanding data diversity and identifying potential data quality issues
- The syntax follows pandas convention where nunique() is applied directly to a Series object
Practice Task
Q9. Drop rows where winner is missing.
matches.dropna(subset=["winner"])Explanation
- The dropna method removes any rows where the "winner" column contains null or NaN values
- The subset parameter specifically targets only the "winner" column for null value detection
- This operation returns a new DataFrame without the incomplete match records
- Useful for data cleaning when you need to ensure all matches have a defined winner before analysis
- The original DataFrame remains unchanged since dropna creates a copy by default
In this sample dataset, one row has no winner because the match had no result.
Practice Task
Q10. Fill missing winner with No Result.
matches["winner"] = matches["winner"].fillna("No Result")Explanation
- The code fills NaN (missing) values in the "winner" column of a DataFrame called matches
- Missing values are replaced with the string "No Result" to indicate incomplete match outcomes
- This approach ensures consistent data representation where every match has a defined winner value
- The fillna() method operates in-place on the DataFrame column without requiring assignment to a new variable
- Useful for data analysis workflows where missing results need to be handled explicitly rather than left as null values
Practice Task
Q11. Add a close_match column.
matches["close_match"] = matches["margin"].apply(
lambda margin: "yes" if margin <= 8 else "no"
)Explanation
- The code creates a new column called "close_match" in the matches DataFrame
- It applies a lambda function to each value in the "margin" column to evaluate whether the margin is less than or equal to 8
- Values meeting the condition (margin ≤ 8) are labeled as "yes", otherwise "no"
- This operation uses pandas apply method for vectorized conditional assignment across all rows
- The resulting column helps categorize matches as either "close" or "not close" based on the defined threshold
Practice Task
Q12. Add a match summary using row-wise apply.
def match_summary(row):
if pd.isna(row["winner"]):
return f'{row["team_a"]} vs {row["team_b"]}: no result'
return f'{row["winner"]} won by {row["margin"]}'
matches["summary"] = matches.apply(match_summary, axis=1)Explanation
- The function
match_summarytakes a row of match data and checks if the winner column contains missing values usingpd.isna() - When there's no winner (missing value), it returns a string indicating the match ended without a result
- When there is a winner, it formats a summary showing which team won and by what margin
- The
apply()method processes each row of the matches DataFrame through this function - The resulting summary strings are stored in a new column called "summary" in the matches DataFrame
29. Method Selection Guide
Use this as a quick decision table.
| Task | Method |
|---|---|
| Count values in one column | value_counts |
| Count combinations of columns | DataFrame.value_counts |
| Sort rows by column | sort_values |
| Sort rows by index | sort_index |
| Create ranks | rank |
| Make a column the index | set_index |
| Convert index back to column | reset_index |
| Rename columns or index labels | rename |
| See unique values | unique |
| Count unique values | nunique |
| Detect missing values | isna or isnull |
| Detect non-missing values | notna or notnull |
| Remove missing rows | dropna |
| Fill missing values | fillna, ffill, bfill |
| Detect duplicates | duplicated |
| Remove duplicates | drop_duplicates |
| Remove rows or columns | drop |
| Run custom logic | apply |
30. Common Mistakes
Mistake 1: Sorting Without Assigning
This returns a sorted DataFrame but does not change activity.
activity.sort_values("quiz_score")Explanation
- The code sorts a pandas DataFrame called "activity" based on the "quiz_score" column in ascending order
- By default, sort_values() arranges values from lowest to highest
- This operation returns a new sorted DataFrame without modifying the original data
- Useful for analyzing student performance or ranking participants by their quiz results
- The method can be modified with additional parameters like ascending=False for descending order
Assign it:
activity = activity.sort_values("quiz_score")Explanation
- The code sorts a pandas DataFrame called 'activity' based on the "quiz_score" column in ascending order
- This operation rearranges all rows in the dataset so that records with lower quiz scores appear first
- The sort_values() method modifies the DataFrame in-place by default, reordering the entire dataset
- This sorting approach is commonly used for analyzing student performance or ranking participants by score
- The resulting dataset maintains all original columns while organizing rows according to the specified sorting criteria
Mistake 2: Dropping Rows Too Early
Missing values are not always bad.
For example, a missing completed_at value can mean the learner has not completed the lesson yet.
Understand the meaning before using:
activity.dropna()Explanation
- The dropna() method eliminates rows or columns containing null values (NaN, None) from a pandas DataFrame
- By default, it removes any row that contains at least one missing value, preserving only complete records
- This operation returns a new DataFrame without modifying the original data structure
- Commonly used in data cleaning processes to ensure analysis accuracy when missing values could skew results
- Can be customized with parameters like subset= to target specific columns or how='all' to remove rows where all values are missing
Mistake 3: Using Row-Wise apply For Everything
This works:
activity.apply(priority, axis=1)Explanation
- The code applies a priority function to each row of the activity DataFrame using the pandas apply method
- The axis=1 parameter specifies that the function should be applied horizontally across rows rather than vertically down columns
- This approach enables vectorized processing of each row's data through the custom priority function
- The result represents transformed activity data where each row has been processed according to the priority logic
- This pattern is commonly used for data transformation tasks like scoring, categorization, or conditional calculations across dataset rows
But for simple comparisons, vectorized code is often cleaner:
activity["is_high_score"] = activity["quiz_score"] >= 85Explanation
- This code creates a new column called "is_high_score" in the activity DataFrame
- It evaluates each row's quiz_score against the value 85 using a greater than or equal comparison
- The result is a boolean True/False value indicating whether each student achieved a score of 85 or higher
- This boolean flag can be used for filtering, grouping, or conditional operations on high-performing students
- The operation is vectorized, meaning it efficiently processes all rows in the DataFrame simultaneously
Use apply when you really need custom logic.
Mistake 4: Forgetting subset In Duplicate Checks
This checks full-row duplicates:
activity.duplicated()Explanation
- The duplicated() method returns a boolean Series indicating which rows in a DataFrame are duplicates of previously occurring rows
- By default, it marks all duplicates as True except for the first occurrence, which is marked as False
- This method is useful for data cleaning and quality assessment to identify redundant records
- The result can be used to filter or remove duplicate entries from datasets
- The method supports various parameters like keep='first', keep='last', or keep=False to customize duplicate detection behavior
This checks duplicate attempts by learner and lesson:
activity.duplicated(subset=["learner_id", "lesson"])Explanation
- The code checks for duplicate rows in an activity dataset based on specific columns
- It examines combinations of learner_id and lesson to find repeated entries
- Returns a boolean Series indicating which rows are duplicates of previous rows
- Useful for data cleaning to identify redundant activity records in educational analytics
- The subset parameter specifies exactly which columns to consider when determining duplication criteria
They answer different questions.
Mistake 5: Resetting Index And Keeping Old Index Accidentally
This keeps the old index as a column:
df.reset_index()Explanation
- Resets the DataFrame's index to the default integer index starting from zero
- Removes any existing index labels and replaces them with sequential integers
- Useful when index values are no longer meaningful or need to be standardized
- Commonly used after filtering or sorting operations that may have left irregular index values
This drops it:
df.reset_index(drop=True)Explanation
- Removes the current index labels and replaces them with a new sequential integer index starting from zero
- The drop=True parameter ensures the old index is completely discarded rather than being added as a new column
- Useful when you need to reorganize data after filtering, sorting, or concatenating DataFrames
- Helps maintain consistent indexing when performing operations that might disrupt the original index structure
Choose intentionally.
31. Compact Cheat Sheet
# Counts
df["col"].value_counts()
df[["a", "b"]].value_counts()
# Sorting
df.sort_values("score", ascending=False)
df.sort_values(["track", "score"], ascending=[True, False])
df.sort_index()
# Ranking
df["rank"] = df["score"].rank(ascending=False, method="dense")
df["group_rank"] = df.groupby("track")["score"].rank(ascending=False)
# Index
df = df.set_index("id")
df = df.reset_index()
df = df.reset_index(drop=True)
# Rename
df = df.rename(columns={"old": "new"})
# Unique
df["city"].unique()
df["city"].nunique()
# Missing
df.isna().sum()
df[df["mentor"].notna()]
df.dropna(subset=["mentor"])
df["mentor"] = df["mentor"].fillna("Unassigned")
df["mentor"].ffill()
df["mentor"].bfill()
# Duplicates
df.duplicated().sum()
df.duplicated(subset=["id", "lesson"], keep=False)
df.drop_duplicates(subset=["id", "lesson"], keep="last")
# Drop
df.drop(columns=["unused_col"])
df.drop(index=[0, 1])
# Apply
df["label"] = df["score"].apply(score_label)
df["summary"] = df.apply(make_summary, axis=1)Explanation
- Value counting and sorting operations including single column counts, multi-column combinations, and various sorting configurations with ascending/descending options
- Ranking functionality using dense method for both global and group-based score rankings within DataFrame columns
- Index management through setting, resetting, and dropping index columns while maintaining data integrity
- Column renaming and unique value extraction with count statistics for identifying distinct data entries
- Missing data handling including detection, filtering, filling strategies, and duplicate removal techniques for data cleaning and preprocessing
32. Interview-Style Questions
1. What is the difference between sort_values and sort_index?
sort_values sorts by data inside columns. sort_index sorts by row labels or column labels.
2. What does value_counts do?
It counts how often values appear. On a Series it counts one column. On a DataFrame it counts combinations of columns.
3. What does rank(method="dense") mean?
Dense ranking gives the same rank to ties and does not skip the next rank.
4. Why use set_index?
Use set_index when one column is a better row label, such as an ID, date, username, or product code.
5. What is the purpose of reset_index(drop=True)?
It resets the index to simple row numbers and discards the old index instead of keeping it as a column.
6. What is the difference between unique and nunique?
unique returns the distinct values. nunique returns the number of distinct values.
7. When should you use dropna?
Use it when rows with missing values are not useful for the specific analysis. Do not use it blindly.
8. What does drop_duplicates(subset=[...], keep="last") do?
It removes duplicate rows based on selected columns and keeps the last row from each duplicate group.
9. What does apply(axis=1) do?
It runs a function row by row. The function receives each row as a Series.
10. Why should row-wise apply be used carefully?
It is flexible but can be slower than vectorized operations on large DataFrames.
Sources And Further Reading
This guide uses original examples and original CSV files. The notebook was used only as a topic outline.
Official Pandas references:
- Pandas API reference: https://pandas.pydata.org/docs/reference/
- DataFrame basics: https://pandas.pydata.org/docs/user_guide/basics.html
- Indexing and selecting data: https://pandas.pydata.org/docs/user_guide/indexing.html
- Working with missing data: https://pandas.pydata.org/docs/user_guide/missing_data.html
- Group by user guide: https://pandas.pydata.org/docs/user_guide/groupby.html
Next in this series: Mastering Pandas GroupBy: A Comprehensive Guide to Data Analysis →
