# Essential Pandas DataFrame Methods: Sort, Rank, and Clean Data URL: https://madhudadi.in/blog/posts/pandas-dataframe-methods-sort-rank-clean-and-apply Published: 2026-06-02 Tags: Pandas, python Read time: 38 min Difficulty: beginner > Learn the most useful Pandas DataFrame methods with original CSV datasets: value_counts, sort_values, rank, sort_index, set_index, reset_index, rename, unique, nunique, isna, notna, dropna, fillna, duplicated, drop_duplicates, drop, and apply.# Pandas Important DataFrame Methods: Sort, Rank, Index, Clean, Apply, and Practice Once you understand how to create, inspect, select, and filter a DataFrame, the next step is learning the methods you will use again and again. These methods help you answer questions like: - Which category appears most often? - Which learner has the highest score? - Which rows have missing values? - Which records are duplicates? - How do I sort by more than one column? - How do I make one column the index? - How do I reset the index after sorting? - How do I rename columns? - How do I apply custom logic row by row? This guide uses original sample datasets, not copied third-party datasets. Files used in this lesson: - `learning_activity.csv` - `mini_league_matches_methods.csv` Place both files in the same folder as this Markdown file before running the examples. ## What You Will Learn By the end, you should be able to use: - `value_counts` - `sort_values` - `rank` - `sort_index` - `set_index` - `reset_index` - `rename` - `unique` - `nunique` - `isna`, `isnull`, `notna`, and `notnull` - `hasnans` - `dropna` - `fillna`, `ffill`, and `bfill` - `duplicated` - `drop_duplicates` - `drop` - `apply` You will also solve realistic practice tasks using CSV files. ## 1. Setup ```python import pandas as pd import numpy as np ``` Load the learner activity dataset: ```python activity = pd.read_csv("learning_activity.csv") activity.head() ``` Load the match dataset: ```python matches = pd.read_csv("mini_league_matches_methods.csv") matches.head() ``` ## 2. Quick Dataset Check Before using any advanced method, inspect the data. ```python activity.shape activity.info() activity.head() activity.tail() ``` Check column names: ```python activity.columns ``` Check missing values: ```python activity.isna().sum() ``` Check duplicates: ```python activity.duplicated().sum() ``` This small routine catches many problems early. ## 3. `value_counts` On A Series Use `value_counts` to count how often each value appears in one column. ```python activity["track"].value_counts() ``` Example questions: - Which course track has the most activity? - Which city has the most learners? - Which plan is most common? - Which completion status appears most often? ```python activity["city"].value_counts() activity["plan"].value_counts() activity["status"].value_counts() ``` By default, missing values are ignored. To include missing values: ```python activity["mentor"].value_counts(dropna=False) ``` To get percentages: ```python activity["track"].value_counts(normalize=True) * 100 ``` ## 4. `value_counts` On A DataFrame `DataFrame.value_counts` counts repeated combinations of columns. ```python activity[["track", "status"]].value_counts() ``` This answers: ```text How many rows exist for each track-status pair? ``` You can include missing combinations: ```python activity[["track", "mentor"]].value_counts(dropna=False) ``` You can normalize: ```python activity[["track", "status"]].value_counts(normalize=True) * 100 ``` This is useful when a single column is not enough. ## 5. `sort_values` On A Series Create a simple Series: ```python scores = pd.Series([88, 72, 95, 64, 91], index=["Aarav", "Meera", "Nila", "Kabir", "Tara"]) ``` Sort ascending: ```python scores.sort_values() ``` Sort descending: ```python scores.sort_values(ascending=False) ``` This returns a new sorted Series. It does not change the original unless you assign it back. ```python scores = scores.sort_values(ascending=False) ``` ## 6. `sort_values` On A DataFrame Sort learners by quiz score: ```python activity.sort_values("quiz_score", ascending=False) ``` Sort by minutes watched: ```python activity.sort_values("minutes_watched", ascending=False) ``` Sort by multiple columns: ```python activity.sort_values( ["track", "quiz_score"], ascending=[True, False] ) ``` This means: - track in ascending order - quiz score in descending order inside each track ## 7. Sorting With Missing Values If a column has missing values, choose where they should go. ```python activity.sort_values("quiz_score", na_position="first") ``` Or: ```python activity.sort_values("quiz_score", na_position="last") ``` For reports, `na_position="last"` is usually easier to read. ## 8. `rank` Ranking converts values into positions. ```python activity["score_rank"] = activity["quiz_score"].rank(ascending=False) ``` Now the highest quiz score gets rank 1. ```python activity.sort_values("score_rank").head() ``` If two learners have the same score, Pandas uses average ranking by default. For dense ranking: ```python activity["dense_rank"] = activity["quiz_score"].rank( ascending=False, method="dense" ) ``` Dense ranking does not leave gaps after ties. Example: ```text Score: 95, 91, 91, 88 Dense rank: 1, 2, 2, 3 ``` ## 9. Ranking Within Groups Rank learners inside each track. ```python activity["track_rank"] = activity.groupby("track")["quiz_score"].rank( ascending=False, method="dense" ) ``` Show top learners per track: ```python activity.sort_values(["track", "track_rank", "learner"]) ``` This pattern is powerful for leaderboards. ## 10. `sort_index` Sorting by values sorts the data inside columns. Sorting by index sorts row labels. ```python activity_by_id = activity.set_index("learner_id") activity_by_id.sort_index() ``` Descending: ```python activity_by_id.sort_index(ascending=False) ``` You use `sort_index` when the index itself has meaning, such as: - IDs - dates - alphabetical labels - ordered categories ## 11. `set_index` Use `set_index` when one column should become the row label. ```python activity_by_id = activity.set_index("learner_id") ``` Now `learner_id` is the index. You can select by ID: ```python activity_by_id.loc["L003"] ``` By default, `set_index` returns a new DataFrame. If you want to modify the same variable: ```python activity = activity.set_index("learner_id") ``` In notebooks, returning a new DataFrame is often safer than using `inplace=True`. ## 12. `reset_index` Use `reset_index` to convert the index back into a column. ```python activity_by_id.reset_index() ``` If you do not want the old index as a column: ```python activity_by_id.reset_index(drop=True) ``` Common pattern after sorting: ```python top_learners = ( activity .sort_values("quiz_score", ascending=False) .reset_index(drop=True) ) ``` This gives clean row numbers after sorting. ## 13. Series To DataFrame With `reset_index` `value_counts` returns a Series. ```python track_counts = activity["track"].value_counts() type(track_counts) ``` Convert it into a DataFrame: ```python track_counts_df = track_counts.reset_index() track_counts_df.columns = ["track", "row_count"] track_counts_df ``` This is useful before exporting a summary. ## 14. `rename` Rename columns: ```python activity = activity.rename(columns={ "minutes_watched": "minutes", "quiz_score": "score" }) ``` Rename index labels: ```python activity_by_id = activity.set_index("learner_id") activity_by_id.rename(index={ "L001": "learner-001", "L002": "learner-002" }) ``` Clean column names after reading messy CSV files: ```python activity.columns = ( activity.columns .str.strip() .str.lower() .str.replace(" ", "_") ) ``` Clean column names reduce bugs. ## 15. `unique` Use `unique` to see distinct values in a Series. ```python activity["track"].unique() ``` This returns an array of unique values. For city names: ```python activity["city"].unique() ``` If you see values like `"Pune"` and `"pune"`, you have a cleaning issue. Normalize before analysis: ```python activity["city"] = activity["city"].str.title() ``` ## 16. `nunique` Use `nunique` to count unique values. ```python activity["learner_id"].nunique() ``` Count unique learners by track: ```python activity.groupby("track")["learner_id"].nunique() ``` By default, missing values are ignored. Include missing values: ```python activity["mentor"].nunique(dropna=False) ``` ## 17. `isna`, `isnull`, `notna`, And `notnull` Check missing values: ```python activity.isna() ``` Count missing values per column: ```python activity.isna().sum() ``` `isnull` is an alias many people still use: ```python activity.isnull().sum() ``` Check non-missing values: ```python activity.notna() activity.notnull() ``` Select rows with missing mentor: ```python activity[activity["mentor"].isna()] ``` Select rows with known mentor: ```python activity[activity["mentor"].notna()] ``` ## 18. `hasnans` `hasnans` is available on a Series. ```python activity["mentor"].hasnans ``` It returns `True` if the Series contains missing values. It is a quick yes/no check. ## 19. `dropna` `dropna` removes rows or columns with missing values. Drop rows with any missing value: ```python activity.dropna() ``` Drop rows only if all values are missing: ```python activity.dropna(how="all") ``` Drop rows where a specific column is missing: ```python activity.dropna(subset=["mentor"]) ``` Drop rows where either mentor or completed_at is missing: ```python activity.dropna(subset=["mentor", "completed_at"]) ``` Drop rows only when both mentor and completed_at are missing: ```python activity.dropna(subset=["mentor", "completed_at"], how="all") ``` Do not use `dropna` blindly. In learning data, missing `completed_at` may simply mean the learner has not completed the lesson yet. ## 20. `fillna` Use `fillna` to replace missing values. Fill one column: ```python activity["mentor"] = activity["mentor"].fillna("Unassigned") ``` Fill numeric missing values with median: ```python activity["quiz_score"] = activity["quiz_score"].fillna(activity["quiz_score"].median()) ``` Fill multiple columns with different values: ```python activity = activity.fillna({ "mentor": "Unassigned", "completed_at": "Not completed", "status": "unknown" }) ``` Column-specific filling is usually better than filling the whole DataFrame with one value. ## 21. Forward Fill And Backward Fill Forward fill uses the previous known value. ```python activity["mentor"].ffill() ``` Backward fill uses the next known value. ```python activity["mentor"].bfill() ``` These are useful for time series or ordered logs, but they can be misleading if the row order has no meaning. ## 22. `duplicated` Check full duplicate rows: ```python activity.duplicated() ``` Count duplicates: ```python activity.duplicated().sum() ``` Check duplicates by specific columns: ```python activity.duplicated(subset=["learner_id", "lesson"]) ``` Show duplicate lesson attempts: ```python activity[activity.duplicated(subset=["learner_id", "lesson"], keep=False)] ``` `keep=False` marks all duplicate rows in each duplicate group. ## 23. `drop_duplicates` Remove exact duplicate rows: ```python activity.drop_duplicates() ``` Keep the last attempt for each learner and lesson: ```python latest_attempts = activity.drop_duplicates( subset=["learner_id", "lesson"], keep="last" ) ``` Drop all duplicated learner-lesson rows: ```python unique_only = activity.drop_duplicates( subset=["learner_id", "lesson"], keep=False ) ``` For event data, `keep="last"` is common when later rows represent updated attempts. ## 24. `drop` Drop columns: ```python activity.drop(columns=["mentor"]) ``` Drop rows by index: ```python activity.drop(index=[0, 1]) ``` After setting an index, drop by label: ```python activity_by_id = activity.set_index("learner_id") activity_by_id.drop(index=["L001", "L002"]) ``` Drop with `axis` also works: ```python activity.drop(["mentor"], axis=1) ``` But `columns=[...]` is usually clearer. ## 25. `apply` On A Series `apply` runs a function on each value in a Series. ```python def score_label(score): if pd.isna(score): return "missing" if score >= 85: return "strong" if score >= 70: return "steady" return "needs_practice" activity["score_label"] = activity["quiz_score"].apply(score_label) ``` For many simple conditions, vectorized methods like `np.where` or `pd.cut` can be faster. But `apply` is easy to understand and useful for custom logic. ## 26. `apply` Row By Row Use `axis=1` when your function needs more than one column from the same row. ```python def learner_priority(row): if row["status"] != "completed" and row["minutes_watched"] >= 250: return "follow_up" if row["quiz_score"] >= 90: return "celebrate" return "normal" activity["priority"] = activity.apply(learner_priority, axis=1) ``` Row-wise `apply` is flexible, but it can be slower on very large datasets. Use it when clarity matters or when the logic is too custom for simple vectorized expressions. ## 27. Mini Practice: Learner Activity Reload a clean copy: ```python activity = pd.read_csv("learning_activity.csv") ``` ### Q1. Which track has the most rows? ```python activity["track"].value_counts() ``` **Explanation** - The code examines the distribution of different activity types in a dataset by counting how many times each unique activity appears - It uses the pandas value_counts() method to generate a frequency table of activity track values - The result shows the count of each distinct activity type, helping identify which activities occur most frequently - This analysis is useful for understanding user engagement patterns and popular activity categories - The output provides insights into data distribution that can inform business decisions and resource allocation ### Q2. Which city-plan pair appears most often? ```python activity[["city", "plan"]].value_counts() ``` **Explanation** - This code performs a frequency count of unique combinations between city and plan columns in the activity DataFrame - The double bracket notation [["city", "plan"]] selects multiple columns simultaneously for analysis - value_counts() method returns a Series with MultiIndex where each index represents a unique city-plan pair - The output shows how many times each city-plan combination appears in the dataset, useful for identifying popular activity patterns - This technique is commonly used in data exploration to understand distribution patterns across categorical variables ### Q3. Sort learners by score, highest first. ```python activity.sort_values("quiz_score", ascending=False) ``` **Explanation** - The code sorts a pandas DataFrame called "activity" based on the "quiz_score" column in descending order - Values with higher quiz scores appear first in the sorted result - The ascending=False parameter ensures the highest scores are displayed at the top - This operation modifies the DataFrame in place without creating a new object - Commonly used for ranking students or participants by performance metrics ### Q4. Sort by track ascending and score descending. ```python activity.sort_values(["track", "quiz_score"], ascending=[True, False]) ``` **Explanation** - The code sorts a pandas DataFrame called 'activity' by two columns: "track" in ascending order and "quiz_score" in descending order - The sort_values() method rearranges rows based on the specified column priorities, with tracks grouped together and quiz scores ordered from highest to lowest within each track - This creates a structured dataset where activities are organized first by track category, then by performance scores within each category - The ascending parameter controls sorting direction, with True for ascending (A-Z) and False for descending (Z-A) - This type of sorting is commonly used for creating performance reports or organizing educational content by difficulty level ### Q5. Add a global score rank. ```python activity["global_rank"] = activity["quiz_score"].rank( ascending=False, method="dense" ) ``` **Explanation** - Assigns a global ranking to each activity entry based on quiz scores in descending order - Uses the dense ranking method which assigns the same rank to tied values without skipping subsequent ranks - Higher quiz scores receive lower rank numbers (1st place, 2nd place, etc.) - Modifies the activity DataFrame by adding a new "global_rank" column with calculated rankings - The ranking process ensures consistent ordering where ties share identical rank values ### Q6. Add a rank inside each track. ```python activity["track_rank"] = activity.groupby("track")["quiz_score"].rank( ascending=False, method="dense" ) ``` **Explanation** - Groups the activity dataframe by the "track" column to organize data by track categories - Within each track group, ranks the "quiz_score" values in descending order (highest scores first) - Uses the "dense" ranking method which assigns the same rank to tied scores without skipping subsequent rank numbers - Creates a new "track_rank" column that contains the rank position for each quiz score within its respective track - The resulting ranking helps identify top performers within each track category efficiently ### Q7. Set `learner_id` as the index and select one learner. ```python activity_by_id = activity.set_index("learner_id") activity_by_id.loc["L003"] ``` **Explanation** - Creates a new DataFrame view with "learner_id" column as the index using set_index() method - Uses .loc accessor to retrieve all rows where learner_id equals "L003" from the indexed DataFrame - This approach enables efficient lookup of specific learner records by their unique identifier - The indexed structure provides faster access times when querying multiple records for the same learner ID - Returns all activity records associated with the specified learner ID in a structured format ### Q8. Convert track counts into a DataFrame. ```python track_counts = activity["track"].value_counts().reset_index() track_counts.columns = ["track", "rows"] track_counts ``` **Explanation** - The code extracts unique track values from the activity DataFrame's "track" column and counts how many times each track appears - It uses value_counts() to generate a Series with track names as index and their occurrence counts as values - The reset_index() method converts the Series into a DataFrame format with tracks as a regular column instead of an index - Column names are renamed to "track" and "rows" for better readability and clarity - The final result displays a tabular view showing each track and its corresponding count of occurrences ### Q9. Rename `quiz_score` to `score`. ```python activity.rename(columns={"quiz_score": "score"}) ``` **Explanation** - The code uses the pandas rename() method to change the column name "quiz_score" to "score" in a DataFrame called activity - This operation creates a new DataFrame with the updated column names while preserving all original data - The rename() method accepts a dictionary where keys are current column names and values are new column names - This technique is commonly used to standardize column naming across datasets or make column names more readable - The operation is non-destructive when using the inplace parameter or assigning the result to a new variable ### Q10. Count unique learners by track. ```python activity.groupby("track")["learner_id"].nunique() ``` **Explanation** - Groups data by the "track" column to organize records by track category - For each track group, counts the number of distinct learner IDs using nunique() - Returns a pandas Series with track names as index and unique learner counts as values - This operation efficiently summarizes learner diversity across different tracks - Commonly used for analytics to measure engagement breadth within educational pathways ### Q11. Show rows where mentor is missing. ```python activity[activity["mentor"].isna()] ``` **Explanation** - Uses boolean indexing to filter rows from the activity DataFrame where the "mentor" column has null/NaN values - The isna() method returns a boolean Series identifying which entries are null in the mentor column - This creates a subset of the original DataFrame containing only records without assigned mentors - Commonly used for data cleaning to identify unassigned or missing mentorship records - Returns a new DataFrame with the same structure but filtered to show only null mentor entries ### Q12. Fill missing mentors with `Unassigned`. ```python activity["mentor"] = activity["mentor"].fillna("Unassigned") ``` **Explanation** - Replaces all null/missing values in the "mentor" column with the string "Unassigned" - Uses pandas fillna() method to handle missing data gracefully in the activity dataset - Ensures consistent data representation by providing a default value for undefined mentor assignments - Maintains data integrity by preventing null values from propagating through subsequent operations - Common pattern for cleaning categorical data where missing values need explicit handling ### Q13. Show duplicate learner-lesson attempts. ```python activity[ activity.duplicated(subset=["learner_id", "lesson"], keep=False) ] ``` **Explanation** - This code identifies all rows in the activity DataFrame that have duplicate combinations of learner_id and lesson values - The duplicated() method returns a boolean mask indicating which rows are duplicates based on the specified subset columns - Setting keep=False ensures all duplicate instances are marked as True, not just the subsequent ones - The result filters the original DataFrame to show only rows where the same learner has completed the same lesson multiple times - This is useful for identifying repeated learning activities or potential data entry issues in educational analytics ### Q14. Keep only the last attempt for every learner and lesson. ```python activity = activity.drop_duplicates( subset=["learner_id", "lesson"], keep="last" ) ``` **Explanation** - The code eliminates duplicate rows from the activity DataFrame based on combinations of learner_id and lesson columns - It keeps only the last occurrence of each duplicate combination, ensuring the most recent record is retained - The subset parameter specifies which columns to consider when identifying duplicates - This operation helps clean data by removing redundant learner progress entries for the same lesson ### Q15. Create a priority label using row-wise apply. ```python def priority(row): if row["status"] != "completed" and row["minutes_watched"] >= 250: return "follow_up" if row["quiz_score"] >= 90: return "celebrate" return "normal" activity["priority"] = activity.apply(priority, axis=1) ``` **Explanation** - The priority function evaluates student data rows to determine appropriate follow-up actions based on multiple criteria - Students who haven't completed their status but have watched at least 250 minutes receive "follow_up" priority status - High-performing students with quiz scores of 90 or above are flagged with "celebrate" priority level - All other students automatically receive the default "normal" priority classification - The apply method processes each row of the activity DataFrame through the priority function to create a new priority column ## 28. Mini Practice: Fictional League Matches Reload the matches dataset: ```python matches = pd.read_csv("mini_league_matches_methods.csv") ``` **Explanation** - Reads match data from a CSV file named "mini_league_matches_methods.csv" into a pandas DataFrame called matches - The CSV file likely contains information about football matches including teams, scores, dates, and other statistical data - This operation creates a structured dataset that can be manipulated and analyzed using pandas methods - The resulting DataFrame provides a foundation for performing data analysis, filtering, grouping, and visualization tasks - Common use cases include generating match statistics, calculating team performance metrics, or creating visualizations of league standings This dataset is fictional and created for Pandas practice. ### Q1. Count toss decisions. ```python matches["toss_decision"].value_counts() ``` **Explanation** - The code examines the frequency distribution of toss decisions (bat or field) made by teams in cricket matches - It uses the value_counts() method to count how many times each toss decision appears in the toss_decision column - This provides insights into whether teams prefer batting or fielding after winning the toss - The output shows the absolute counts for each decision type, helping identify patterns in team strategy - This analysis is commonly used in sports analytics to understand tactical preferences and decision-making patterns ### Q2. Count match stages. ```python matches["stage"].value_counts() ``` **Explanation** - This code performs a frequency count of unique values in the "stage" column of a pandas DataFrame called matches - The value_counts() method returns a Series with each unique stage value as the index and its corresponding count as the value - This is useful for understanding how many records exist for each stage category in your dataset - The output helps identify which stages are most common or rare in your data - This operation is commonly used for exploratory data analysis and data validation purposes ### Q3. Find how many matches each team played. Because a team can appear in either `team_a` or `team_b`, combine both columns. ```python teams_played = pd.concat([matches["team_a"], matches["team_b"]]) teams_played.value_counts() ``` **Explanation** - Concatenates two columns from a matches DataFrame containing team names (team_a and team_b) into a single Series - Creates a unified list of all teams that participated in matches by combining both home and away team columns - Uses value_counts() method to tally how many times each team appears in the concatenated data - Returns a sorted Series showing team names as index and their total match participation counts as values - Useful for identifying most active teams or analyzing team involvement patterns in sports analytics ### Q4. Rank teams by number of wins. ```python team_wins = matches["winner"].value_counts().reset_index() team_wins.columns = ["team", "wins"] team_wins["rank"] = team_wins["wins"].rank( ascending=False, method="dense" ) team_wins.sort_values(["rank", "team"]) ``` **Explanation** - The code retrieves the count of wins for each team from the "winner" column in the `matches` DataFrame. - It resets the index to convert the Series into a DataFrame and renames the columns to "team" and "wins". - The `rank` method is applied to the "wins" column to assign a ranking based on the number of wins, with ties receiving the same rank. - Finally, the DataFrame is sorted by rank in ascending order and by team name to ensure a clear presentation of the results. ### Q5. Sort matches by season, stage, and margin. ```python matches.sort_values( ["season", "stage", "margin"], ascending=[True, True, False] ) ``` **Explanation** - The `sort_values` method is used to sort the DataFrame named `matches`. - The sorting is performed based on three columns: `season`, `stage`, and `margin`. - The `ascending` parameter specifies the sort order: `season` and `stage` are sorted in ascending order, while `margin` is sorted in descending order. - This allows for a structured organization of match data, prioritizing seasons and stages while highlighting larger margins. ### Q6. Set `match_id` as the index. ```python matches_by_id = matches.set_index("match_id") matches_by_id.sort_index() ``` **Explanation** - The `set_index("match_id")` method transforms the DataFrame by using the "match_id" column as the new index, allowing for quicker lookups. - The `sort_index()` method sorts the DataFrame based on the new index, ensuring that the data is organized in ascending order of match IDs. - This approach enhances data manipulation and retrieval, particularly when working with large datasets. - The original DataFrame remains unchanged unless reassigned, so the sorted version needs to be stored if required. ### Q7. Show unique stages. ```python matches["stage"].unique() ``` **Explanation** - Retrieves all distinct values present in the "stage" column of the matches DataFrame - Returns a numpy array containing only unique stage identifiers without duplicates - Useful for understanding the variety of stages available in the dataset - Commonly used for data exploration and validation before further analysis - Helps identify potential data inconsistencies or missing stage categories ### Q8. Count unique cities. ```python matches["city"].nunique() ``` **Explanation** - The code uses the nunique() method to count distinct values in the "city" column of a pandas DataFrame called matches - This method efficiently handles missing values by excluding them from the count automatically - The result returns a single integer representing how many unique cities exist in that dataset column - This operation is useful for understanding data diversity and identifying potential data quality issues - The syntax follows pandas convention where nunique() is applied directly to a Series object ### Q9. Drop rows where winner is missing. ```python matches.dropna(subset=["winner"]) ``` **Explanation** - The dropna method removes any rows where the "winner" column contains null or NaN values - The subset parameter specifically targets only the "winner" column for null value detection - This operation returns a new DataFrame without the incomplete match records - Useful for data cleaning when you need to ensure all matches have a defined winner before analysis - The original DataFrame remains unchanged since dropna creates a copy by default In this sample dataset, one row has no winner because the match had no result. ### Q10. Fill missing winner with `No Result`. ```python matches["winner"] = matches["winner"].fillna("No Result") ``` **Explanation** - The code fills NaN (missing) values in the "winner" column of a DataFrame called matches - Missing values are replaced with the string "No Result" to indicate incomplete match outcomes - This approach ensures consistent data representation where every match has a defined winner value - The fillna() method operates in-place on the DataFrame column without requiring assignment to a new variable - Useful for data analysis workflows where missing results need to be handled explicitly rather than left as null values ### Q11. Add a close_match column. ```python matches["close_match"] = matches["margin"].apply( lambda margin: "yes" if margin <= 8 else "no" ) ``` **Explanation** - The code creates a new column called "close_match" in the matches DataFrame - It applies a lambda function to each value in the "margin" column to evaluate whether the margin is less than or equal to 8 - Values meeting the condition (margin ≤ 8) are labeled as "yes", otherwise "no" - This operation uses pandas apply method for vectorized conditional assignment across all rows - The resulting column helps categorize matches as either "close" or "not close" based on the defined threshold ### Q12. Add a match summary using row-wise apply. ```python def match_summary(row): if pd.isna(row["winner"]): return f'{row["team_a"]} vs {row["team_b"]}: no result' return f'{row["winner"]} won by {row["margin"]}' matches["summary"] = matches.apply(match_summary, axis=1) ``` **Explanation** - The function `match_summary` takes a row of match data and checks if the winner column contains missing values using `pd.isna()` - When there's no winner (missing value), it returns a string indicating the match ended without a result - When there is a winner, it formats a summary showing which team won and by what margin - The `apply()` method processes each row of the matches DataFrame through this function - The resulting summary strings are stored in a new column called "summary" in the matches DataFrame ## 29. Method Selection Guide Use this as a quick decision table. | Task | Method | |---|---| | Count values in one column | `value_counts` | | Count combinations of columns | `DataFrame.value_counts` | | Sort rows by column | `sort_values` | | Sort rows by index | `sort_index` | | Create ranks | `rank` | | Make a column the index | `set_index` | | Convert index back to column | `reset_index` | | Rename columns or index labels | `rename` | | See unique values | `unique` | | Count unique values | `nunique` | | Detect missing values | `isna` or `isnull` | | Detect non-missing values | `notna` or `notnull` | | Remove missing rows | `dropna` | | Fill missing values | `fillna`, `ffill`, `bfill` | | Detect duplicates | `duplicated` | | Remove duplicates | `drop_duplicates` | | Remove rows or columns | `drop` | | Run custom logic | `apply` | ## 30. Common Mistakes ### Mistake 1: Sorting Without Assigning This returns a sorted DataFrame but does not change `activity`. ```python activity.sort_values("quiz_score") ``` **Explanation** - The code sorts a pandas DataFrame called "activity" based on the "quiz_score" column in ascending order - By default, sort_values() arranges values from lowest to highest - This operation returns a new sorted DataFrame without modifying the original data - Useful for analyzing student performance or ranking participants by their quiz results - The method can be modified with additional parameters like ascending=False for descending order Assign it: ```python activity = activity.sort_values("quiz_score") ``` **Explanation** - The code sorts a pandas DataFrame called 'activity' based on the "quiz_score" column in ascending order - This operation rearranges all rows in the dataset so that records with lower quiz scores appear first - The sort_values() method modifies the DataFrame in-place by default, reordering the entire dataset - This sorting approach is commonly used for analyzing student performance or ranking participants by score - The resulting dataset maintains all original columns while organizing rows according to the specified sorting criteria ### Mistake 2: Dropping Rows Too Early Missing values are not always bad. For example, a missing `completed_at` value can mean the learner has not completed the lesson yet. Understand the meaning before using: ```python activity.dropna() ``` **Explanation** - The dropna() method eliminates rows or columns containing null values (NaN, None) from a pandas DataFrame - By default, it removes any row that contains at least one missing value, preserving only complete records - This operation returns a new DataFrame without modifying the original data structure - Commonly used in data cleaning processes to ensure analysis accuracy when missing values could skew results - Can be customized with parameters like subset= to target specific columns or how='all' to remove rows where all values are missing ### Mistake 3: Using Row-Wise `apply` For Everything This works: ```python activity.apply(priority, axis=1) ``` **Explanation** - The code applies a priority function to each row of the activity DataFrame using the pandas apply method - The axis=1 parameter specifies that the function should be applied horizontally across rows rather than vertically down columns - This approach enables vectorized processing of each row's data through the custom priority function - The result represents transformed activity data where each row has been processed according to the priority logic - This pattern is commonly used for data transformation tasks like scoring, categorization, or conditional calculations across dataset rows But for simple comparisons, vectorized code is often cleaner: ```python activity["is_high_score"] = activity["quiz_score"] >= 85 ``` **Explanation** - This code creates a new column called "is_high_score" in the activity DataFrame - It evaluates each row's quiz_score against the value 85 using a greater than or equal comparison - The result is a boolean True/False value indicating whether each student achieved a score of 85 or higher - This boolean flag can be used for filtering, grouping, or conditional operations on high-performing students - The operation is vectorized, meaning it efficiently processes all rows in the DataFrame simultaneously Use `apply` when you really need custom logic. ### Mistake 4: Forgetting `subset` In Duplicate Checks This checks full-row duplicates: ```python activity.duplicated() ``` **Explanation** - The duplicated() method returns a boolean Series indicating which rows in a DataFrame are duplicates of previously occurring rows - By default, it marks all duplicates as True except for the first occurrence, which is marked as False - This method is useful for data cleaning and quality assessment to identify redundant records - The result can be used to filter or remove duplicate entries from datasets - The method supports various parameters like keep='first', keep='last', or keep=False to customize duplicate detection behavior This checks duplicate attempts by learner and lesson: ```python activity.duplicated(subset=["learner_id", "lesson"]) ``` **Explanation** - The code checks for duplicate rows in an activity dataset based on specific columns - It examines combinations of learner_id and lesson to find repeated entries - Returns a boolean Series indicating which rows are duplicates of previous rows - Useful for data cleaning to identify redundant activity records in educational analytics - The subset parameter specifies exactly which columns to consider when determining duplication criteria They answer different questions. ### Mistake 5: Resetting Index And Keeping Old Index Accidentally This keeps the old index as a column: ```python df.reset_index() ``` **Explanation** - Resets the DataFrame's index to the default integer index starting from zero - Removes any existing index labels and replaces them with sequential integers - Useful when index values are no longer meaningful or need to be standardized - Commonly used after filtering or sorting operations that may have left irregular index values This drops it: ```python df.reset_index(drop=True) ``` **Explanation** - Removes the current index labels and replaces them with a new sequential integer index starting from zero - The drop=True parameter ensures the old index is completely discarded rather than being added as a new column - Useful when you need to reorganize data after filtering, sorting, or concatenating DataFrames - Helps maintain consistent indexing when performing operations that might disrupt the original index structure Choose intentionally. ## 31. Compact Cheat Sheet ```python # Counts df["col"].value_counts() df[["a", "b"]].value_counts() # Sorting df.sort_values("score", ascending=False) df.sort_values(["track", "score"], ascending=[True, False]) df.sort_index() # Ranking df["rank"] = df["score"].rank(ascending=False, method="dense") df["group_rank"] = df.groupby("track")["score"].rank(ascending=False) # Index df = df.set_index("id") df = df.reset_index() df = df.reset_index(drop=True) # Rename df = df.rename(columns={"old": "new"}) # Unique df["city"].unique() df["city"].nunique() # Missing df.isna().sum() df[df["mentor"].notna()] df.dropna(subset=["mentor"]) df["mentor"] = df["mentor"].fillna("Unassigned") df["mentor"].ffill() df["mentor"].bfill() # Duplicates df.duplicated().sum() df.duplicated(subset=["id", "lesson"], keep=False) df.drop_duplicates(subset=["id", "lesson"], keep="last") # Drop df.drop(columns=["unused_col"]) df.drop(index=[0, 1]) # Apply df["label"] = df["score"].apply(score_label) df["summary"] = df.apply(make_summary, axis=1) ``` **Explanation** - Value counting and sorting operations including single column counts, multi-column combinations, and various sorting configurations with ascending/descending options - Ranking functionality using dense method for both global and group-based score rankings within DataFrame columns - Index management through setting, resetting, and dropping index columns while maintaining data integrity - Column renaming and unique value extraction with count statistics for identifying distinct data entries - Missing data handling including detection, filtering, filling strategies, and duplicate removal techniques for data cleaning and preprocessing ## 32. Interview-Style Questions ### 1. What is the difference between `sort_values` and `sort_index`? `sort_values` sorts by data inside columns. `sort_index` sorts by row labels or column labels. ### 2. What does `value_counts` do? It counts how often values appear. On a Series it counts one column. On a DataFrame it counts combinations of columns. ### 3. What does `rank(method="dense")` mean? Dense ranking gives the same rank to ties and does not skip the next rank. ### 4. Why use `set_index`? Use `set_index` when one column is a better row label, such as an ID, date, username, or product code. ### 5. What is the purpose of `reset_index(drop=True)`? It resets the index to simple row numbers and discards the old index instead of keeping it as a column. ### 6. What is the difference between `unique` and `nunique`? `unique` returns the distinct values. `nunique` returns the number of distinct values. ### 7. When should you use `dropna`? Use it when rows with missing values are not useful for the specific analysis. Do not use it blindly. ### 8. What does `drop_duplicates(subset=[...], keep="last")` do? It removes duplicate rows based on selected columns and keeps the last row from each duplicate group. ### 9. What does `apply(axis=1)` do? It runs a function row by row. The function receives each row as a Series. ### 10. Why should row-wise `apply` be used carefully? It is flexible but can be slower than vectorized operations on large DataFrames. ## Sources And Further Reading This guide uses original examples and original CSV files. The notebook was used only as a topic outline. Official Pandas references: - Pandas API reference: https://pandas.pydata.org/docs/reference/ - DataFrame basics: https://pandas.pydata.org/docs/user_guide/basics.html - Indexing and selecting data: https://pandas.pydata.org/docs/user_guide/indexing.html - Working with missing data: https://pandas.pydata.org/docs/user_guide/missing_data.html - Group by user guide: https://pandas.pydata.org/docs/user_guide/groupby.html