# Essential Pandas DataFrame Methods: Sort, Rank, and Clean Data
URL: https://madhudadi.in/blog/posts/pandas-dataframe-methods-sort-rank-clean-and-apply
Published: 2026-06-02
Tags: Pandas, python
Read time: 38 min
Difficulty: beginner
> Learn the most useful Pandas DataFrame methods with original CSV datasets: value_counts, sort_values, rank, sort_index, set_index, reset_index, rename, unique, nunique, isna, notna, dropna, fillna, duplicated, drop_duplicates, drop, and apply.# Pandas Important DataFrame Methods: Sort, Rank, Index, Clean, Apply, and Practice

Once you understand how to create, inspect, select, and filter a DataFrame, the next step is learning the methods you will use again and again.

These methods help you answer questions like:

- Which category appears most often?
- Which learner has the highest score?
- Which rows have missing values?
- Which records are duplicates?
- How do I sort by more than one column?
- How do I make one column the index?
- How do I reset the index after sorting?
- How do I rename columns?
- How do I apply custom logic row by row?

This guide uses original sample datasets, not copied third-party datasets.

Files used in this lesson:

- `learning_activity.csv`
- `mini_league_matches_methods.csv`

Place both files in the same folder as this Markdown file before running the examples.

## What You Will Learn

By the end, you should be able to use:

- `value_counts`
- `sort_values`
- `rank`
- `sort_index`
- `set_index`
- `reset_index`
- `rename`
- `unique`
- `nunique`
- `isna`, `isnull`, `notna`, and `notnull`
- `hasnans`
- `dropna`
- `fillna`, `ffill`, and `bfill`
- `duplicated`
- `drop_duplicates`
- `drop`
- `apply`

You will also solve realistic practice tasks using CSV files.

## 1. Setup

```python
import pandas as pd
import numpy as np
```

Load the learner activity dataset:

```python
activity = pd.read_csv("learning_activity.csv")
activity.head()
```

Load the match dataset:

```python
matches = pd.read_csv("mini_league_matches_methods.csv")
matches.head()
```

## 2. Quick Dataset Check

Before using any advanced method, inspect the data.

```python
activity.shape
activity.info()
activity.head()
activity.tail()
```

Check column names:

```python
activity.columns
```

Check missing values:

```python
activity.isna().sum()
```

Check duplicates:

```python
activity.duplicated().sum()
```

This small routine catches many problems early.

## 3. `value_counts` On A Series

Use `value_counts` to count how often each value appears in one column.

```python
activity["track"].value_counts()
```

Example questions:

- Which course track has the most activity?
- Which city has the most learners?
- Which plan is most common?
- Which completion status appears most often?

```python
activity["city"].value_counts()
activity["plan"].value_counts()
activity["status"].value_counts()
```

By default, missing values are ignored.

To include missing values:

```python
activity["mentor"].value_counts(dropna=False)
```

To get percentages:

```python
activity["track"].value_counts(normalize=True) * 100
```

## 4. `value_counts` On A DataFrame

`DataFrame.value_counts` counts repeated combinations of columns.

```python
activity[["track", "status"]].value_counts()
```

This answers:

```text
How many rows exist for each track-status pair?
```

You can include missing combinations:

```python
activity[["track", "mentor"]].value_counts(dropna=False)
```

You can normalize:

```python
activity[["track", "status"]].value_counts(normalize=True) * 100
```

This is useful when a single column is not enough.

## 5. `sort_values` On A Series

Create a simple Series:

```python
scores = pd.Series([88, 72, 95, 64, 91], index=["Aarav", "Meera", "Nila", "Kabir", "Tara"])
```

Sort ascending:

```python
scores.sort_values()
```

Sort descending:

```python
scores.sort_values(ascending=False)
```

This returns a new sorted Series. It does not change the original unless you assign it back.

```python
scores = scores.sort_values(ascending=False)
```

## 6. `sort_values` On A DataFrame

Sort learners by quiz score:

```python
activity.sort_values("quiz_score", ascending=False)
```

Sort by minutes watched:

```python
activity.sort_values("minutes_watched", ascending=False)
```

Sort by multiple columns:

```python
activity.sort_values(
    ["track", "quiz_score"],
    ascending=[True, False]
)
```

This means:

- track in ascending order
- quiz score in descending order inside each track

## 7. Sorting With Missing Values

If a column has missing values, choose where they should go.

```python
activity.sort_values("quiz_score", na_position="first")
```

Or:

```python
activity.sort_values("quiz_score", na_position="last")
```

For reports, `na_position="last"` is usually easier to read.

## 8. `rank`

Ranking converts values into positions.

```python
activity["score_rank"] = activity["quiz_score"].rank(ascending=False)
```

Now the highest quiz score gets rank 1.

```python
activity.sort_values("score_rank").head()
```

If two learners have the same score, Pandas uses average ranking by default.

For dense ranking:

```python
activity["dense_rank"] = activity["quiz_score"].rank(
    ascending=False,
    method="dense"
)
```

Dense ranking does not leave gaps after ties.

Example:

```text
Score: 95, 91, 91, 88
Dense rank: 1, 2, 2, 3
```

## 9. Ranking Within Groups

Rank learners inside each track.

```python
activity["track_rank"] = activity.groupby("track")["quiz_score"].rank(
    ascending=False,
    method="dense"
)
```

Show top learners per track:

```python
activity.sort_values(["track", "track_rank", "learner"])
```

This pattern is powerful for leaderboards.

## 10. `sort_index`

Sorting by values sorts the data inside columns.

Sorting by index sorts row labels.

```python
activity_by_id = activity.set_index("learner_id")
activity_by_id.sort_index()
```

Descending:

```python
activity_by_id.sort_index(ascending=False)
```

You use `sort_index` when the index itself has meaning, such as:

- IDs
- dates
- alphabetical labels
- ordered categories

## 11. `set_index`

Use `set_index` when one column should become the row label.

```python
activity_by_id = activity.set_index("learner_id")
```

Now `learner_id` is the index.

You can select by ID:

```python
activity_by_id.loc["L003"]
```

By default, `set_index` returns a new DataFrame.

If you want to modify the same variable:

```python
activity = activity.set_index("learner_id")
```

In notebooks, returning a new DataFrame is often safer than using `inplace=True`.

## 12. `reset_index`

Use `reset_index` to convert the index back into a column.

```python
activity_by_id.reset_index()
```

If you do not want the old index as a column:

```python
activity_by_id.reset_index(drop=True)
```

Common pattern after sorting:

```python
top_learners = (
    activity
    .sort_values("quiz_score", ascending=False)
    .reset_index(drop=True)
)
```

This gives clean row numbers after sorting.

## 13. Series To DataFrame With `reset_index`

`value_counts` returns a Series.

```python
track_counts = activity["track"].value_counts()
type(track_counts)
```

Convert it into a DataFrame:

```python
track_counts_df = track_counts.reset_index()
track_counts_df.columns = ["track", "row_count"]
track_counts_df
```

This is useful before exporting a summary.

## 14. `rename`

Rename columns:

```python
activity = activity.rename(columns={
    "minutes_watched": "minutes",
    "quiz_score": "score"
})
```

Rename index labels:

```python
activity_by_id = activity.set_index("learner_id")

activity_by_id.rename(index={
    "L001": "learner-001",
    "L002": "learner-002"
})
```

Clean column names after reading messy CSV files:

```python
activity.columns = (
    activity.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)
```

Clean column names reduce bugs.

## 15. `unique`

Use `unique` to see distinct values in a Series.

```python
activity["track"].unique()
```

This returns an array of unique values.

For city names:

```python
activity["city"].unique()
```

If you see values like `"Pune"` and `"pune"`, you have a cleaning issue.

Normalize before analysis:

```python
activity["city"] = activity["city"].str.title()
```

## 16. `nunique`

Use `nunique` to count unique values.

```python
activity["learner_id"].nunique()
```

Count unique learners by track:

```python
activity.groupby("track")["learner_id"].nunique()
```

By default, missing values are ignored.

Include missing values:

```python
activity["mentor"].nunique(dropna=False)
```

## 17. `isna`, `isnull`, `notna`, And `notnull`

Check missing values:

```python
activity.isna()
```

Count missing values per column:

```python
activity.isna().sum()
```

`isnull` is an alias many people still use:

```python
activity.isnull().sum()
```

Check non-missing values:

```python
activity.notna()
activity.notnull()
```

Select rows with missing mentor:

```python
activity[activity["mentor"].isna()]
```

Select rows with known mentor:

```python
activity[activity["mentor"].notna()]
```

## 18. `hasnans`

`hasnans` is available on a Series.

```python
activity["mentor"].hasnans
```

It returns `True` if the Series contains missing values.

It is a quick yes/no check.

## 19. `dropna`

`dropna` removes rows or columns with missing values.

Drop rows with any missing value:

```python
activity.dropna()
```

Drop rows only if all values are missing:

```python
activity.dropna(how="all")
```

Drop rows where a specific column is missing:

```python
activity.dropna(subset=["mentor"])
```

Drop rows where either mentor or completed_at is missing:

```python
activity.dropna(subset=["mentor", "completed_at"])
```

Drop rows only when both mentor and completed_at are missing:

```python
activity.dropna(subset=["mentor", "completed_at"], how="all")
```

Do not use `dropna` blindly. In learning data, missing `completed_at` may simply mean the learner has not completed the lesson yet.

## 20. `fillna`

Use `fillna` to replace missing values.

Fill one column:

```python
activity["mentor"] = activity["mentor"].fillna("Unassigned")
```

Fill numeric missing values with median:

```python
activity["quiz_score"] = activity["quiz_score"].fillna(activity["quiz_score"].median())
```

Fill multiple columns with different values:

```python
activity = activity.fillna({
    "mentor": "Unassigned",
    "completed_at": "Not completed",
    "status": "unknown"
})
```

Column-specific filling is usually better than filling the whole DataFrame with one value.

## 21. Forward Fill And Backward Fill

Forward fill uses the previous known value.

```python
activity["mentor"].ffill()
```

Backward fill uses the next known value.

```python
activity["mentor"].bfill()
```

These are useful for time series or ordered logs, but they can be misleading if the row order has no meaning.

## 22. `duplicated`

Check full duplicate rows:

```python
activity.duplicated()
```

Count duplicates:

```python
activity.duplicated().sum()
```

Check duplicates by specific columns:

```python
activity.duplicated(subset=["learner_id", "lesson"])
```

Show duplicate lesson attempts:

```python
activity[activity.duplicated(subset=["learner_id", "lesson"], keep=False)]
```

`keep=False` marks all duplicate rows in each duplicate group.

## 23. `drop_duplicates`

Remove exact duplicate rows:

```python
activity.drop_duplicates()
```

Keep the last attempt for each learner and lesson:

```python
latest_attempts = activity.drop_duplicates(
    subset=["learner_id", "lesson"],
    keep="last"
)
```

Drop all duplicated learner-lesson rows:

```python
unique_only = activity.drop_duplicates(
    subset=["learner_id", "lesson"],
    keep=False
)
```

For event data, `keep="last"` is common when later rows represent updated attempts.

## 24. `drop`

Drop columns:

```python
activity.drop(columns=["mentor"])
```

Drop rows by index:

```python
activity.drop(index=[0, 1])
```

After setting an index, drop by label:

```python
activity_by_id = activity.set_index("learner_id")
activity_by_id.drop(index=["L001", "L002"])
```

Drop with `axis` also works:

```python
activity.drop(["mentor"], axis=1)
```

But `columns=[...]` is usually clearer.

## 25. `apply` On A Series

`apply` runs a function on each value in a Series.

```python
def score_label(score):
    if pd.isna(score):
        return "missing"
    if score >= 85:
        return "strong"
    if score >= 70:
        return "steady"
    return "needs_practice"

activity["score_label"] = activity["quiz_score"].apply(score_label)
```

For many simple conditions, vectorized methods like `np.where` or `pd.cut` can be faster.

But `apply` is easy to understand and useful for custom logic.

## 26. `apply` Row By Row

Use `axis=1` when your function needs more than one column from the same row.

```python
def learner_priority(row):
    if row["status"] != "completed" and row["minutes_watched"] >= 250:
        return "follow_up"
    if row["quiz_score"] >= 90:
        return "celebrate"
    return "normal"

activity["priority"] = activity.apply(learner_priority, axis=1)
```

Row-wise `apply` is flexible, but it can be slower on very large datasets.

Use it when clarity matters or when the logic is too custom for simple vectorized expressions.

## 27. Mini Practice: Learner Activity

Reload a clean copy:

```python
activity = pd.read_csv("learning_activity.csv")
```

### Q1. Which track has the most rows?

```python
activity["track"].value_counts()
```
**Explanation**

- The code examines the distribution of different activity types in a dataset by counting how many times each unique activity appears
- It uses the pandas value_counts() method to generate a frequency table of activity track values
- The result shows the count of each distinct activity type, helping identify which activities occur most frequently
- This analysis is useful for understanding user engagement patterns and popular activity categories
- The output provides insights into data distribution that can inform business decisions and resource allocation


### Q2. Which city-plan pair appears most often?

```python
activity[["city", "plan"]].value_counts()
```
**Explanation**

- This code performs a frequency count of unique combinations between city and plan columns in the activity DataFrame
- The double bracket notation [["city", "plan"]] selects multiple columns simultaneously for analysis
- value_counts() method returns a Series with MultiIndex where each index represents a unique city-plan pair
- The output shows how many times each city-plan combination appears in the dataset, useful for identifying popular activity patterns
- This technique is commonly used in data exploration to understand distribution patterns across categorical variables


### Q3. Sort learners by score, highest first.

```python
activity.sort_values("quiz_score", ascending=False)
```
**Explanation**

- The code sorts a pandas DataFrame called "activity" based on the "quiz_score" column in descending order
- Values with higher quiz scores appear first in the sorted result
- The ascending=False parameter ensures the highest scores are displayed at the top
- This operation modifies the DataFrame in place without creating a new object
- Commonly used for ranking students or participants by performance metrics


### Q4. Sort by track ascending and score descending.

```python
activity.sort_values(["track", "quiz_score"], ascending=[True, False])
```
**Explanation**

- The code sorts a pandas DataFrame called 'activity' by two columns: "track" in ascending order and "quiz_score" in descending order
- The sort_values() method rearranges rows based on the specified column priorities, with tracks grouped together and quiz scores ordered from highest to lowest within each track
- This creates a structured dataset where activities are organized first by track category, then by performance scores within each category
- The ascending parameter controls sorting direction, with True for ascending (A-Z) and False for descending (Z-A)
- This type of sorting is commonly used for creating performance reports or organizing educational content by difficulty level


### Q5. Add a global score rank.

```python
activity["global_rank"] = activity["quiz_score"].rank(
    ascending=False,
    method="dense"
)
```
**Explanation**

- Assigns a global ranking to each activity entry based on quiz scores in descending order
- Uses the dense ranking method which assigns the same rank to tied values without skipping subsequent ranks
- Higher quiz scores receive lower rank numbers (1st place, 2nd place, etc.)
- Modifies the activity DataFrame by adding a new "global_rank" column with calculated rankings
- The ranking process ensures consistent ordering where ties share identical rank values


### Q6. Add a rank inside each track.

```python
activity["track_rank"] = activity.groupby("track")["quiz_score"].rank(
    ascending=False,
    method="dense"
)
```
**Explanation**

- Groups the activity dataframe by the "track" column to organize data by track categories
- Within each track group, ranks the "quiz_score" values in descending order (highest scores first)
- Uses the "dense" ranking method which assigns the same rank to tied scores without skipping subsequent rank numbers
- Creates a new "track_rank" column that contains the rank position for each quiz score within its respective track
- The resulting ranking helps identify top performers within each track category efficiently


### Q7. Set `learner_id` as the index and select one learner.

```python
activity_by_id = activity.set_index("learner_id")
activity_by_id.loc["L003"]
```
**Explanation**

- Creates a new DataFrame view with "learner_id" column as the index using set_index() method
- Uses .loc accessor to retrieve all rows where learner_id equals "L003" from the indexed DataFrame
- This approach enables efficient lookup of specific learner records by their unique identifier
- The indexed structure provides faster access times when querying multiple records for the same learner ID
- Returns all activity records associated with the specified learner ID in a structured format


### Q8. Convert track counts into a DataFrame.

```python
track_counts = activity["track"].value_counts().reset_index()
track_counts.columns = ["track", "rows"]
track_counts
```
**Explanation**

- The code extracts unique track values from the activity DataFrame's "track" column and counts how many times each track appears
- It uses value_counts() to generate a Series with track names as index and their occurrence counts as values
- The reset_index() method converts the Series into a DataFrame format with tracks as a regular column instead of an index
- Column names are renamed to "track" and "rows" for better readability and clarity
- The final result displays a tabular view showing each track and its corresponding count of occurrences


### Q9. Rename `quiz_score` to `score`.

```python
activity.rename(columns={"quiz_score": "score"})
```
**Explanation**

- The code uses the pandas rename() method to change the column name "quiz_score" to "score" in a DataFrame called activity
- This operation creates a new DataFrame with the updated column names while preserving all original data
- The rename() method accepts a dictionary where keys are current column names and values are new column names
- This technique is commonly used to standardize column naming across datasets or make column names more readable
- The operation is non-destructive when using the inplace parameter or assigning the result to a new variable


### Q10. Count unique learners by track.

```python
activity.groupby("track")["learner_id"].nunique()
```
**Explanation**

- Groups data by the "track" column to organize records by track category
- For each track group, counts the number of distinct learner IDs using nunique()
- Returns a pandas Series with track names as index and unique learner counts as values
- This operation efficiently summarizes learner diversity across different tracks
- Commonly used for analytics to measure engagement breadth within educational pathways


### Q11. Show rows where mentor is missing.

```python
activity[activity["mentor"].isna()]
```
**Explanation**

- Uses boolean indexing to filter rows from the activity DataFrame where the "mentor" column has null/NaN values
- The isna() method returns a boolean Series identifying which entries are null in the mentor column
- This creates a subset of the original DataFrame containing only records without assigned mentors
- Commonly used for data cleaning to identify unassigned or missing mentorship records
- Returns a new DataFrame with the same structure but filtered to show only null mentor entries


### Q12. Fill missing mentors with `Unassigned`.

```python
activity["mentor"] = activity["mentor"].fillna("Unassigned")
```
**Explanation**

- Replaces all null/missing values in the "mentor" column with the string "Unassigned"
- Uses pandas fillna() method to handle missing data gracefully in the activity dataset
- Ensures consistent data representation by providing a default value for undefined mentor assignments
- Maintains data integrity by preventing null values from propagating through subsequent operations
- Common pattern for cleaning categorical data where missing values need explicit handling


### Q13. Show duplicate learner-lesson attempts.

```python
activity[
    activity.duplicated(subset=["learner_id", "lesson"], keep=False)
]
```
**Explanation**

- This code identifies all rows in the activity DataFrame that have duplicate combinations of learner_id and lesson values
- The duplicated() method returns a boolean mask indicating which rows are duplicates based on the specified subset columns
- Setting keep=False ensures all duplicate instances are marked as True, not just the subsequent ones
- The result filters the original DataFrame to show only rows where the same learner has completed the same lesson multiple times
- This is useful for identifying repeated learning activities or potential data entry issues in educational analytics


### Q14. Keep only the last attempt for every learner and lesson.

```python
activity = activity.drop_duplicates(
    subset=["learner_id", "lesson"],
    keep="last"
)
```
**Explanation**

- The code eliminates duplicate rows from the activity DataFrame based on combinations of learner_id and lesson columns
- It keeps only the last occurrence of each duplicate combination, ensuring the most recent record is retained
- The subset parameter specifies which columns to consider when identifying duplicates
- This operation helps clean data by removing redundant learner progress entries for the same lesson


### Q15. Create a priority label using row-wise apply.

```python
def priority(row):
    if row["status"] != "completed" and row["minutes_watched"] >= 250:
        return "follow_up"
    if row["quiz_score"] >= 90:
        return "celebrate"
    return "normal"

activity["priority"] = activity.apply(priority, axis=1)
```
**Explanation**

- The priority function evaluates student data rows to determine appropriate follow-up actions based on multiple criteria
- Students who haven't completed their status but have watched at least 250 minutes receive "follow_up" priority status
- High-performing students with quiz scores of 90 or above are flagged with "celebrate" priority level
- All other students automatically receive the default "normal" priority classification
- The apply method processes each row of the activity DataFrame through the priority function to create a new priority column


## 28. Mini Practice: Fictional League Matches

Reload the matches dataset:

```python
matches = pd.read_csv("mini_league_matches_methods.csv")
```
**Explanation**

- Reads match data from a CSV file named "mini_league_matches_methods.csv" into a pandas DataFrame called matches
- The CSV file likely contains information about football matches including teams, scores, dates, and other statistical data
- This operation creates a structured dataset that can be manipulated and analyzed using pandas methods
- The resulting DataFrame provides a foundation for performing data analysis, filtering, grouping, and visualization tasks
- Common use cases include generating match statistics, calculating team performance metrics, or creating visualizations of league standings


This dataset is fictional and created for Pandas practice.

### Q1. Count toss decisions.

```python
matches["toss_decision"].value_counts()
```
**Explanation**

- The code examines the frequency distribution of toss decisions (bat or field) made by teams in cricket matches
- It uses the value_counts() method to count how many times each toss decision appears in the toss_decision column
- This provides insights into whether teams prefer batting or fielding after winning the toss
- The output shows the absolute counts for each decision type, helping identify patterns in team strategy
- This analysis is commonly used in sports analytics to understand tactical preferences and decision-making patterns


### Q2. Count match stages.

```python
matches["stage"].value_counts()
```
**Explanation**

- This code performs a frequency count of unique values in the "stage" column of a pandas DataFrame called matches
- The value_counts() method returns a Series with each unique stage value as the index and its corresponding count as the value
- This is useful for understanding how many records exist for each stage category in your dataset
- The output helps identify which stages are most common or rare in your data
- This operation is commonly used for exploratory data analysis and data validation purposes


### Q3. Find how many matches each team played.

Because a team can appear in either `team_a` or `team_b`, combine both columns.

```python
teams_played = pd.concat([matches["team_a"], matches["team_b"]])
teams_played.value_counts()
```
**Explanation**

- Concatenates two columns from a matches DataFrame containing team names (team_a and team_b) into a single Series
- Creates a unified list of all teams that participated in matches by combining both home and away team columns
- Uses value_counts() method to tally how many times each team appears in the concatenated data
- Returns a sorted Series showing team names as index and their total match participation counts as values
- Useful for identifying most active teams or analyzing team involvement patterns in sports analytics


### Q4. Rank teams by number of wins.

```python
team_wins = matches["winner"].value_counts().reset_index()
team_wins.columns = ["team", "wins"]
team_wins["rank"] = team_wins["wins"].rank(
    ascending=False,
    method="dense"
)
team_wins.sort_values(["rank", "team"])
```
**Explanation**

- The code retrieves the count of wins for each team from the "winner" column in the `matches` DataFrame.  
- It resets the index to convert the Series into a DataFrame and renames the columns to "team" and "wins".  
- The `rank` method is applied to the "wins" column to assign a ranking based on the number of wins, with ties receiving the same rank.  
- Finally, the DataFrame is sorted by rank in ascending order and by team name to ensure a clear presentation of the results.


### Q5. Sort matches by season, stage, and margin.

```python
matches.sort_values(
    ["season", "stage", "margin"],
    ascending=[True, True, False]
)
```
**Explanation**

- The `sort_values` method is used to sort the DataFrame named `matches`.  
- The sorting is performed based on three columns: `season`, `stage`, and `margin`.  
- The `ascending` parameter specifies the sort order: `season` and `stage` are sorted in ascending order, while `margin` is sorted in descending order.  
- This allows for a structured organization of match data, prioritizing seasons and stages while highlighting larger margins.


### Q6. Set `match_id` as the index.

```python
matches_by_id = matches.set_index("match_id")
matches_by_id.sort_index()
```
**Explanation**

- The `set_index("match_id")` method transforms the DataFrame by using the "match_id" column as the new index, allowing for quicker lookups.  
- The `sort_index()` method sorts the DataFrame based on the new index, ensuring that the data is organized in ascending order of match IDs.  
- This approach enhances data manipulation and retrieval, particularly when working with large datasets.  
- The original DataFrame remains unchanged unless reassigned, so the sorted version needs to be stored if required.


### Q7. Show unique stages.

```python
matches["stage"].unique()
```
**Explanation**

- Retrieves all distinct values present in the "stage" column of the matches DataFrame
- Returns a numpy array containing only unique stage identifiers without duplicates
- Useful for understanding the variety of stages available in the dataset
- Commonly used for data exploration and validation before further analysis
- Helps identify potential data inconsistencies or missing stage categories


### Q8. Count unique cities.

```python
matches["city"].nunique()
```
**Explanation**

- The code uses the nunique() method to count distinct values in the "city" column of a pandas DataFrame called matches
- This method efficiently handles missing values by excluding them from the count automatically
- The result returns a single integer representing how many unique cities exist in that dataset column
- This operation is useful for understanding data diversity and identifying potential data quality issues
- The syntax follows pandas convention where nunique() is applied directly to a Series object


### Q9. Drop rows where winner is missing.

```python
matches.dropna(subset=["winner"])
```
**Explanation**

- The dropna method removes any rows where the "winner" column contains null or NaN values
- The subset parameter specifically targets only the "winner" column for null value detection
- This operation returns a new DataFrame without the incomplete match records
- Useful for data cleaning when you need to ensure all matches have a defined winner before analysis
- The original DataFrame remains unchanged since dropna creates a copy by default


In this sample dataset, one row has no winner because the match had no result.

### Q10. Fill missing winner with `No Result`.

```python
matches["winner"] = matches["winner"].fillna("No Result")
```
**Explanation**

- The code fills NaN (missing) values in the "winner" column of a DataFrame called matches
- Missing values are replaced with the string "No Result" to indicate incomplete match outcomes
- This approach ensures consistent data representation where every match has a defined winner value
- The fillna() method operates in-place on the DataFrame column without requiring assignment to a new variable
- Useful for data analysis workflows where missing results need to be handled explicitly rather than left as null values


### Q11. Add a close_match column.

```python
matches["close_match"] = matches["margin"].apply(
    lambda margin: "yes" if margin <= 8 else "no"
)
```
**Explanation**

- The code creates a new column called "close_match" in the matches DataFrame
- It applies a lambda function to each value in the "margin" column to evaluate whether the margin is less than or equal to 8
- Values meeting the condition (margin ≤ 8) are labeled as "yes", otherwise "no"
- This operation uses pandas apply method for vectorized conditional assignment across all rows
- The resulting column helps categorize matches as either "close" or "not close" based on the defined threshold


### Q12. Add a match summary using row-wise apply.

```python
def match_summary(row):
    if pd.isna(row["winner"]):
        return f'{row["team_a"]} vs {row["team_b"]}: no result'
    return f'{row["winner"]} won by {row["margin"]}'

matches["summary"] = matches.apply(match_summary, axis=1)
```
**Explanation**

- The function `match_summary` takes a row of match data and checks if the winner column contains missing values using `pd.isna()`
- When there's no winner (missing value), it returns a string indicating the match ended without a result
- When there is a winner, it formats a summary showing which team won and by what margin
- The `apply()` method processes each row of the matches DataFrame through this function
- The resulting summary strings are stored in a new column called "summary" in the matches DataFrame


## 29. Method Selection Guide

Use this as a quick decision table.

| Task | Method |
|---|---|
| Count values in one column | `value_counts` |
| Count combinations of columns | `DataFrame.value_counts` |
| Sort rows by column | `sort_values` |
| Sort rows by index | `sort_index` |
| Create ranks | `rank` |
| Make a column the index | `set_index` |
| Convert index back to column | `reset_index` |
| Rename columns or index labels | `rename` |
| See unique values | `unique` |
| Count unique values | `nunique` |
| Detect missing values | `isna` or `isnull` |
| Detect non-missing values | `notna` or `notnull` |
| Remove missing rows | `dropna` |
| Fill missing values | `fillna`, `ffill`, `bfill` |
| Detect duplicates | `duplicated` |
| Remove duplicates | `drop_duplicates` |
| Remove rows or columns | `drop` |
| Run custom logic | `apply` |

## 30. Common Mistakes

### Mistake 1: Sorting Without Assigning

This returns a sorted DataFrame but does not change `activity`.

```python
activity.sort_values("quiz_score")
```
**Explanation**

- The code sorts a pandas DataFrame called "activity" based on the "quiz_score" column in ascending order
- By default, sort_values() arranges values from lowest to highest
- This operation returns a new sorted DataFrame without modifying the original data
- Useful for analyzing student performance or ranking participants by their quiz results
- The method can be modified with additional parameters like ascending=False for descending order


Assign it:

```python
activity = activity.sort_values("quiz_score")
```
**Explanation**

- The code sorts a pandas DataFrame called 'activity' based on the "quiz_score" column in ascending order
- This operation rearranges all rows in the dataset so that records with lower quiz scores appear first
- The sort_values() method modifies the DataFrame in-place by default, reordering the entire dataset
- This sorting approach is commonly used for analyzing student performance or ranking participants by score
- The resulting dataset maintains all original columns while organizing rows according to the specified sorting criteria


### Mistake 2: Dropping Rows Too Early

Missing values are not always bad.

For example, a missing `completed_at` value can mean the learner has not completed the lesson yet.

Understand the meaning before using:

```python
activity.dropna()
```
**Explanation**

- The dropna() method eliminates rows or columns containing null values (NaN, None) from a pandas DataFrame
- By default, it removes any row that contains at least one missing value, preserving only complete records
- This operation returns a new DataFrame without modifying the original data structure
- Commonly used in data cleaning processes to ensure analysis accuracy when missing values could skew results
- Can be customized with parameters like subset= to target specific columns or how='all' to remove rows where all values are missing


### Mistake 3: Using Row-Wise `apply` For Everything

This works:

```python
activity.apply(priority, axis=1)
```
**Explanation**

- The code applies a priority function to each row of the activity DataFrame using the pandas apply method
- The axis=1 parameter specifies that the function should be applied horizontally across rows rather than vertically down columns
- This approach enables vectorized processing of each row's data through the custom priority function
- The result represents transformed activity data where each row has been processed according to the priority logic
- This pattern is commonly used for data transformation tasks like scoring, categorization, or conditional calculations across dataset rows


But for simple comparisons, vectorized code is often cleaner:

```python
activity["is_high_score"] = activity["quiz_score"] >= 85
```
**Explanation**

- This code creates a new column called "is_high_score" in the activity DataFrame
- It evaluates each row's quiz_score against the value 85 using a greater than or equal comparison
- The result is a boolean True/False value indicating whether each student achieved a score of 85 or higher
- This boolean flag can be used for filtering, grouping, or conditional operations on high-performing students
- The operation is vectorized, meaning it efficiently processes all rows in the DataFrame simultaneously


Use `apply` when you really need custom logic.

### Mistake 4: Forgetting `subset` In Duplicate Checks

This checks full-row duplicates:

```python
activity.duplicated()
```
**Explanation**

- The duplicated() method returns a boolean Series indicating which rows in a DataFrame are duplicates of previously occurring rows
- By default, it marks all duplicates as True except for the first occurrence, which is marked as False
- This method is useful for data cleaning and quality assessment to identify redundant records
- The result can be used to filter or remove duplicate entries from datasets
- The method supports various parameters like keep='first', keep='last', or keep=False to customize duplicate detection behavior


This checks duplicate attempts by learner and lesson:

```python
activity.duplicated(subset=["learner_id", "lesson"])
```
**Explanation**

- The code checks for duplicate rows in an activity dataset based on specific columns
- It examines combinations of learner_id and lesson to find repeated entries
- Returns a boolean Series indicating which rows are duplicates of previous rows
- Useful for data cleaning to identify redundant activity records in educational analytics
- The subset parameter specifies exactly which columns to consider when determining duplication criteria


They answer different questions.

### Mistake 5: Resetting Index And Keeping Old Index Accidentally

This keeps the old index as a column:

```python
df.reset_index()
```
**Explanation**

- Resets the DataFrame's index to the default integer index starting from zero
- Removes any existing index labels and replaces them with sequential integers
- Useful when index values are no longer meaningful or need to be standardized
- Commonly used after filtering or sorting operations that may have left irregular index values


This drops it:

```python
df.reset_index(drop=True)
```
**Explanation**

- Removes the current index labels and replaces them with a new sequential integer index starting from zero
- The drop=True parameter ensures the old index is completely discarded rather than being added as a new column
- Useful when you need to reorganize data after filtering, sorting, or concatenating DataFrames
- Helps maintain consistent indexing when performing operations that might disrupt the original index structure


Choose intentionally.

## 31. Compact Cheat Sheet

```python
# Counts
df["col"].value_counts()
df[["a", "b"]].value_counts()

# Sorting
df.sort_values("score", ascending=False)
df.sort_values(["track", "score"], ascending=[True, False])
df.sort_index()

# Ranking
df["rank"] = df["score"].rank(ascending=False, method="dense")
df["group_rank"] = df.groupby("track")["score"].rank(ascending=False)

# Index
df = df.set_index("id")
df = df.reset_index()
df = df.reset_index(drop=True)

# Rename
df = df.rename(columns={"old": "new"})

# Unique
df["city"].unique()
df["city"].nunique()

# Missing
df.isna().sum()
df[df["mentor"].notna()]
df.dropna(subset=["mentor"])
df["mentor"] = df["mentor"].fillna("Unassigned")
df["mentor"].ffill()
df["mentor"].bfill()

# Duplicates
df.duplicated().sum()
df.duplicated(subset=["id", "lesson"], keep=False)
df.drop_duplicates(subset=["id", "lesson"], keep="last")

# Drop
df.drop(columns=["unused_col"])
df.drop(index=[0, 1])

# Apply
df["label"] = df["score"].apply(score_label)
df["summary"] = df.apply(make_summary, axis=1)
```
**Explanation**

- Value counting and sorting operations including single column counts, multi-column combinations, and various sorting configurations with ascending/descending options
- Ranking functionality using dense method for both global and group-based score rankings within DataFrame columns
- Index management through setting, resetting, and dropping index columns while maintaining data integrity
- Column renaming and unique value extraction with count statistics for identifying distinct data entries
- Missing data handling including detection, filtering, filling strategies, and duplicate removal techniques for data cleaning and preprocessing


## 32. Interview-Style Questions

### 1. What is the difference between `sort_values` and `sort_index`?

`sort_values` sorts by data inside columns. `sort_index` sorts by row labels or column labels.

### 2. What does `value_counts` do?

It counts how often values appear. On a Series it counts one column. On a DataFrame it counts combinations of columns.

### 3. What does `rank(method="dense")` mean?

Dense ranking gives the same rank to ties and does not skip the next rank.

### 4. Why use `set_index`?

Use `set_index` when one column is a better row label, such as an ID, date, username, or product code.

### 5. What is the purpose of `reset_index(drop=True)`?

It resets the index to simple row numbers and discards the old index instead of keeping it as a column.

### 6. What is the difference between `unique` and `nunique`?

`unique` returns the distinct values. `nunique` returns the number of distinct values.

### 7. When should you use `dropna`?

Use it when rows with missing values are not useful for the specific analysis. Do not use it blindly.

### 8. What does `drop_duplicates(subset=[...], keep="last")` do?

It removes duplicate rows based on selected columns and keeps the last row from each duplicate group.

### 9. What does `apply(axis=1)` do?

It runs a function row by row. The function receives each row as a Series.

### 10. Why should row-wise `apply` be used carefully?

It is flexible but can be slower than vectorized operations on large DataFrames.

## Sources And Further Reading

This guide uses original examples and original CSV files. The notebook was used only as a topic outline.

Official Pandas references:

- Pandas API reference: https://pandas.pydata.org/docs/reference/
- DataFrame basics: https://pandas.pydata.org/docs/user_guide/basics.html
- Indexing and selecting data: https://pandas.pydata.org/docs/user_guide/indexing.html
- Working with missing data: https://pandas.pydata.org/docs/user_guide/missing_data.html
- Group by user guide: https://pandas.pydata.org/docs/user_guide/groupby.html