# Mastering Pandas DataFrame: Create, Inspect, and Analyze Data URL: https://madhudadi.in/blog/posts/pandas-dataframe-create-inspect-and-analyze-data Published: 2026-05-31 Tags: Pandas, python Read time: 35 min Difficulty: beginner > Learn Pandas DataFrames from scratch with original examples: create tables, read CSV files, inspect data, use loc and iloc, filter rows, add columns, clean missing values, remove duplicates, convert data types, and solve practical exercises.# Pandas DataFrame: Create, Inspect, Select, Filter, Clean, and Analyze Tables A Pandas DataFrame is the main object you use when working with table-shaped data in Python. If a `Series` is one labeled column, a `DataFrame` is a full table made from many labeled columns. You can use a DataFrame to represent: - students and marks - orders and payments - website traffic logs - course enrollments - app events - survey responses - sports match records - product inventory - interview practice datasets In this guide, you will learn DataFrames through original examples built around students, learning courses, small store orders, and simple sports records. The goal is not just to memorize commands. The goal is to become comfortable asking questions from data. ## What You Will Learn By the end, you should be able to: - explain what a DataFrame is - create DataFrames from lists, dictionaries, and records - read data from a CSV file - inspect shape, columns, index, dtypes, and values - preview rows with `head`, `tail`, and `sample` - understand `info`, `describe`, `isna`, and `duplicated` - rename columns safely - select one column, many columns, one row, and many rows - use `loc` for labels and `iloc` for positions - select rows and columns together - filter rows using one condition and multiple conditions - use `isin`, `between`, and string filters - add new columns from constants and formulas - convert data types with `astype` and `pd.to_numeric` - clean missing values with `fillna` and `dropna` - remove duplicate rows - sort rows - count categories with `value_counts` - group data with `groupby` - solve beginner-to-practical DataFrame exercises ## 1. Importing Pandas Most Pandas code uses the alias `pd`. ```python import pandas as pd import numpy as np ``` You do not always need NumPy, but it is useful for missing values and numeric examples. ## 2. What Is A DataFrame? A DataFrame is a two-dimensional labeled data structure. Simple meaning: - rows represent records - columns represent fields - row labels live in the index - column labels live in `columns` Example: ```python students = pd.DataFrame({ "name": ["Aarav", "Meera", "Ishan", "Tara"], "python_score": [86, 92, 74, 88], "sql_score": [78, 85, 81, 90], "city": ["Pune", "Kochi", "Delhi", "Pune"] }) print(students) ``` Output: ```text name python_score sql_score city 0 Aarav 86 78 Pune 1 Meera 92 85 Kochi 2 Ishan 74 81 Delhi 3 Tara 88 90 Pune ``` This is a DataFrame with 4 rows and 4 columns. ## 3. DataFrame Vs Series Use this mental model: ```text Series = one labeled column DataFrame = many Series side by side ``` Selecting a single column from a DataFrame usually gives a Series: ```python students["python_score"] ``` Selecting multiple columns gives another DataFrame: ```python students[["name", "python_score"]] ``` This difference matters because a Series and a DataFrame support many of the same operations, but their shapes are different. ## 4. Creating A DataFrame From A List Of Lists You can create a DataFrame from rows. ```python rows = [ ["Aarav", 86, "Pune"], ["Meera", 92, "Kochi"], ["Ishan", 74, "Delhi"] ] df = pd.DataFrame(rows, columns=["name", "score", "city"]) print(df) ``` Output: ```text name score city 0 Aarav 86 Pune 1 Meera 92 Kochi 2 Ishan 74 Delhi ``` This style is useful when your data naturally arrives row by row. ## 5. Creating A DataFrame From A Dictionary Of Lists This is one of the most common beginner-friendly ways. ```python course_data = { "student": ["Aarav", "Meera", "Ishan", "Tara"], "course": ["Python", "Pandas", "Python", "SQL"], "hours_spent": [14, 18, 9, 12], "completed": [True, True, False, True] } courses = pd.DataFrame(course_data) print(courses) ``` Each dictionary key becomes a column name. Each list becomes a column. Important rule: all lists must have the same length. ## 6. Creating A DataFrame From A List Of Dictionaries This style is excellent for API-like records. ```python records = [ {"order_id": 101, "customer": "Anika", "amount": 499, "paid": True}, {"order_id": 102, "customer": "Rohan", "amount": 299, "paid": False}, {"order_id": 103, "customer": "Zoya", "amount": 799, "paid": True} ] orders = pd.DataFrame(records) print(orders) ``` Output: ```text order_id customer amount paid 0 101 Anika 499 True 1 102 Rohan 299 False 2 103 Zoya 799 True ``` This is often easier to read than a list of lists because every value carries a field name. ## 7. Creating A DataFrame With A Custom Index The index identifies rows. ```python students = pd.DataFrame( { "python_score": [86, 92, 74, 88], "sql_score": [78, 85, 81, 90], "city": ["Pune", "Kochi", "Delhi", "Pune"] }, index=["aarav", "meera", "ishan", "tara"] ) print(students) ``` Output: ```text python_score sql_score city aarav 86 78 Pune meera 92 85 Kochi ishan 74 81 Delhi tara 88 90 Pune ``` Custom indexes are useful when your rows already have meaningful labels. ## 8. Reading A CSV File In real work, you often read data from files. ```python orders = pd.read_csv("orders.csv") ``` If the CSV has a column that should become the index: ```python orders = pd.read_csv("orders.csv", index_col="order_id") ``` If you only want selected columns: ```python orders = pd.read_csv("orders.csv", usecols=["order_id", "customer", "amount"]) ``` If date columns should be parsed: ```python orders = pd.read_csv("orders.csv", parse_dates=["ordered_at"]) ``` For beginners, start with plain `pd.read_csv("file.csv")`, inspect the result, and then add options only when needed. ## 9. Inspecting A DataFrame Let us build one reusable DataFrame for the next few sections. ```python students = pd.DataFrame({ "name": ["Aarav", "Meera", "Ishan", "Tara", "Kabir", "Nila"], "city": ["Pune", "Kochi", "Delhi", "Pune", "Jaipur", "Kochi"], "python_score": [86, 92, 74, 88, 67, 95], "sql_score": [78, 85, 81, 90, 72, 89], "practice_hours": [14, 18, 9, 16, 7, 20], "completed": [True, True, False, True, False, True] }) ``` ### Shape ```python students.shape ``` **Explanation** - The `shape` attribute of a DataFrame returns a tuple representing its dimensions. - The first element of the tuple indicates the number of rows (students), while the second element indicates the number of columns (attributes). - This is useful for quickly assessing the size of the dataset before performing further analysis or operations. Output: ```text (6, 6) ``` This means 6 rows and 6 columns. ### Data Types ```python students.dtypes ``` **Explanation** - The code accesses the `dtypes` attribute of a Pandas DataFrame named `students`. - It returns a Series object that contains the data type of each column in the DataFrame. - This information is useful for understanding the structure of the data and for ensuring that data types are appropriate for analysis. - Common data types include `int64`, `float64`, `object`, and `datetime64`, among others. - Analyzing data types can help identify potential issues such as incorrect data formats or the need for type conversions. This tells you the type of each column. Common Pandas dtypes: - `int64` - `float64` - `bool` - `object` - `datetime64` - `category` Many text columns appear as `object`. ### Index ```python students.index ``` **Explanation** - The `students.index` attribute retrieves the index (row labels) of the DataFrame named `students`. - This index can be used to identify and access specific rows in the DataFrame. - It is particularly useful for operations that require referencing or manipulating data based on row positions. - The index can be customized or set to meaningful labels to enhance data readability and management. The index is the row label system. ### Columns ```python students.columns ``` **Explanation** - Accesses the `columns` attribute of a pandas DataFrame named `students`. - Returns an Index object containing the names of the columns in the DataFrame. - Useful for understanding the structure of the DataFrame and for data manipulation tasks. - Can be used in data analysis to quickly identify available features for further processing. This returns the column names. ### Values ```python students.values ``` **Explanation** - The code snippet retrieves all the values from the `students` dictionary. - `students.values` returns a view object that displays a list of all the values in the dictionary. - This can be useful for iterating over student data or performing operations on the values without needing the keys. - To convert the view object into a list, you can wrap it with `list(students.values())`. This gives the raw underlying values as an array-like object. In beginner Pandas work, you usually interact with columns and rows through Pandas methods instead of relying on `.values`. ## 10. Previewing Rows ### First Rows With `head` ```python students.head() ``` **Explanation** - The `head()` method is called on a DataFrame named `students`. - It returns the first five rows by default, allowing for a quick overview of the data structure and contents. - This is useful for verifying data loading and understanding the format of the dataset. - Users can specify a different number of rows to display by passing an integer argument to `head()`. By default, it returns the first 5 rows. ```python students.head(3) ``` **Explanation** - The `head()` function is a method used on a pandas DataFrame to display the first few rows of the dataset. - In this case, `students.head(3)` returns the first three rows of the `students` DataFrame. - This is useful for quickly inspecting the structure and contents of the DataFrame without displaying the entire dataset. - The number passed to `head()` can be adjusted to retrieve a different number of rows as needed. Returns the first 3 rows. ### Last Rows With `tail` ```python students.tail() students.tail(2) ``` **Explanation** - The `students.tail()` function returns the last five rows of the DataFrame named `students`. - By calling `students.tail(2)`, the code retrieves the last two rows specifically, allowing for a focused view of the most recent entries. - This method is useful for quickly inspecting the end of a dataset without displaying the entire DataFrame. - The `tail()` function is commonly used in data analysis to check for data integrity and trends at the end of the dataset. ### Random Rows With `sample` ```python students.sample(2, random_state=42) ``` **Explanation** - The `sample` method is used to randomly select a specified number of rows from a DataFrame. - In this case, it selects 2 random students from the `students` DataFrame. - The `random_state` parameter is set to 42, ensuring that the random selection is reproducible across different runs. - This is useful for testing or demonstration purposes, as it allows the same subset of data to be retrieved consistently. - If `random_state` were omitted, the selection would vary each time the code is executed. Use `random_state` when you want reproducible examples. ## 11. Using `info` `info` gives a compact technical summary. ```python students.info() ``` **Explanation** - The `info()` method is called on a DataFrame named `students`. - It provides a concise summary of the DataFrame, including the number of entries, column names, data types, and memory usage. - This method is useful for quickly assessing the structure and data types of the dataset before performing further analysis. - It helps identify any missing values or inconsistencies in the data. It shows: - row count - column names - non-null counts - data types - memory usage This is one of the first commands you should run after loading a new dataset. ## 12. Using `describe` `describe` summarizes numeric columns by default. ```python students.describe() ``` **Explanation** - The `describe()` method is called on the `students` DataFrame, which is typically used in pandas. - It generates summary statistics such as count, mean, standard deviation, minimum, and maximum values for numerical columns. - For categorical columns, it provides counts of unique values and the most frequent value. - This method is useful for quickly understanding the distribution and central tendencies of the data. - The output helps in identifying potential outliers and the overall shape of the dataset. It gives: - count - mean - standard deviation - minimum - quartiles - maximum For non-numeric columns: ```python students.describe(include="object") ``` **Explanation** - The `describe()` method is called on a DataFrame named `students`. - The parameter `include="object"` specifies that only columns with object data types (typically strings) should be included in the output. - This method provides insights such as count, unique values, top values, and frequency of the top values for the specified columns. - It is useful for understanding the distribution and characteristics of categorical data in the DataFrame. For all columns: ```python students.describe(include="all") ``` **Explanation** - The `describe()` method is called on a DataFrame named `students` to generate descriptive statistics. - The parameter `include="all"` ensures that the summary includes statistics for all columns, regardless of their data types. - For numeric columns, it provides metrics like count, mean, standard deviation, min, max, and quartiles. - For categorical columns, it includes counts of unique values, the most frequent value, and its frequency. - This method is useful for quickly understanding the distribution and characteristics of the dataset. ## 13. Checking Missing Values Missing values are common in real data. ```python students.isna().sum() ``` **Explanation** - The `isna()` function is called on the `students` DataFrame to identify missing values, returning a DataFrame of the same shape with boolean values. - The `sum()` function is then applied to this boolean DataFrame, which counts the number of `True` values (missing entries) for each column. - The result is a Series object where the index corresponds to the column names and the values represent the count of missing entries in each column. - This is useful for data cleaning and preprocessing, allowing users to quickly assess the completeness of their dataset. You may also see: ```python students.isnull().sum() ``` **Explanation** - The `isnull()` method is called on the `students` DataFrame to identify missing values, returning a DataFrame of the same shape with boolean values. - The `sum()` method is then applied to this boolean DataFrame, which counts the `True` values (indicating missing data) for each column. - The result is a Series object where the index corresponds to the column names and the values represent the count of missing entries in each column. - This is useful for data cleaning and preprocessing, allowing users to quickly assess the completeness of their dataset. In Pandas, `isna` and `isnull` are commonly used for the same purpose. ## 14. Checking Duplicate Rows ```python students.duplicated().sum() ``` **Explanation** - The `duplicated()` method is called on the `students` DataFrame to identify duplicate rows. - It returns a boolean Series where `True` indicates a duplicate row. - The `sum()` function then counts the number of `True` values in the Series, effectively giving the total number of duplicate entries. - This code is useful for data cleaning and ensuring the integrity of the dataset by identifying redundancy. This counts complete duplicate rows. To see duplicated rows: ```python students[students.duplicated()] ``` **Explanation** - The code uses the `duplicated()` method from the pandas library to check for duplicate rows in the `students` DataFrame. - It returns a boolean Series indicating whether each row is a duplicate of a previous row. - By indexing the `students` DataFrame with this boolean Series, it filters and displays only the duplicate entries. - This is useful for data cleaning and ensuring the integrity of student records. To remove duplicate rows: ```python students = students.drop_duplicates() ``` **Explanation** - The `drop_duplicates()` method is called on the `students` DataFrame. - It identifies and removes any rows that are exact duplicates of each other. - The operation helps ensure that each student is represented only once in the DataFrame. - The result is a cleaner dataset, which is essential for accurate data analysis and reporting. ## 15. Renaming Columns Readable column names make your code easier to understand. ```python students = students.rename(columns={ "python_score": "python", "sql_score": "sql", "practice_hours": "hours" }) ``` **Explanation** - The `rename` method is used to change the names of specific columns in the `students` DataFrame. - The `columns` parameter takes a dictionary where the keys are the current column names and the values are the new names. - In this case, "python_score" is renamed to "python", "sql_score" to "sql", and "practice_hours" to "hours". - This operation helps in making the DataFrame more user-friendly and easier to work with in subsequent analyses. You can also rename in place: ```python students.rename(columns={"hours": "study_hours"}, inplace=True) ``` **Explanation** - The `rename` method is used to change the name of a specific column in a pandas DataFrame. - The `columns` parameter takes a dictionary where the key is the current column name and the value is the new name. - In this case, the column "hours" is renamed to "study_hours". - The `inplace=True` argument modifies the original DataFrame directly without needing to create a copy. Many teams avoid `inplace=True` because returning a new DataFrame is easier to reason about in longer pipelines. ## 16. Math Across Columns And Rows Create a small score DataFrame: ```python scores = pd.DataFrame({ "python": [86, 92, 74], "sql": [78, 85, 81], "excel": [90, 88, 76] }, index=["Aarav", "Meera", "Ishan"]) ``` **Explanation** - Initializes a pandas DataFrame named `scores` to organize student scores across different subjects. - Contains three subjects: "python", "sql", and "excel", each with a list of scores. - Uses a custom index with student names: "Aarav", "Meera", and "Ishan" to label the rows. - Facilitates easy data manipulation and analysis for educational performance tracking. Column-wise sum: ```python scores.sum(axis=0) ``` **Explanation** - The `scores` variable is expected to be a NumPy array or a similar data structure. - The `sum` method computes the total of the array elements. - The `axis=0` argument specifies that the summation should be performed column-wise, aggregating values from each row. - The result will be a new array containing the sum of each column in the original array. Row-wise sum: ```python scores.sum(axis=1) ``` **Explanation** - The `scores` variable is expected to be a 2D array or matrix, where each row represents a different set of scores. - The `sum` method is called with the parameter `axis=1`, indicating that the summation should be performed across rows. - The result will be a 1D array containing the total score for each row, effectively aggregating the scores horizontally. - This operation is commonly used in data analysis to compute totals or aggregates for each observation or sample. Row-wise average: ```python scores.mean(axis=1) ``` **Explanation** - The `mean` function computes the average of the elements along the specified axis. - By setting `axis=1`, the function calculates the mean for each row, treating each row as a separate dataset. - This is useful for obtaining a single average score per entity represented by the rows in the array. - The output will be a 1D array containing the mean values for each row. Other useful methods: ```python scores.median(axis=1) scores.std(axis=1) scores.var(axis=1) scores.min(axis=1) scores.max(axis=1) ``` **Explanation** - Computes the median of scores across each row using `median(axis=1)`. - Calculates the standard deviation of scores for each row with `std(axis=1)`. - Determines the variance of scores in each row using `var(axis=1)`. - Finds the minimum score in each row with `min(axis=1)`. - Identifies the maximum score in each row using `max(axis=1)`. Remember: ```text axis=0 means down the rows, producing one result per column axis=1 means across the columns, producing one result per row ``` ## 17. Selecting One Column ```python students["name"] ``` **Explanation** - This code snippet retrieves the value associated with the key "name" from the `students` dictionary. - It assumes that `students` is a dictionary where each key corresponds to a student's identifier and the value is another dictionary containing student details. - If "name" is not a valid key in the dictionary, it will raise a KeyError. - This operation is commonly used to extract specific information about a student for further processing or display. This returns a Series. You can verify: ```python type(students["name"]) ``` **Explanation** - The code accesses the 'name' key from the 'students' dictionary. - It uses the `type()` function to retrieve the data type of the value associated with the 'name' key. - This is useful for validating the expected data structure and ensuring data integrity within the dictionary. Use bracket syntax for column names because it works with spaces and special characters. ```python students["practice_hours"] ``` **Explanation** - This code snippet retrieves the "practice_hours" data from a collection named "students". - It assumes that "students" is a dictionary or a similar data structure where "practice_hours" is a key. - The output will be the value associated with "practice_hours", which could be a list, integer, or any other data type. - This operation is useful for analyzing or displaying the amount of practice time logged by each student. Dot syntax can be convenient: ```python students.city ``` **Explanation** - This code snippet retrieves the 'city' attribute from the 'students' object. - It assumes that 'students' is an instance of a class or a data structure that has a defined 'city' property. - The value returned will depend on the specific implementation of the 'students' object. - This is a common way to access object attributes in Python, allowing for easy retrieval of stored data. But dot syntax can break when column names have spaces, conflict with methods, or are created dynamically. Prefer bracket syntax in teaching and production code. ## 18. Selecting Multiple Columns Use a list of column names inside brackets. ```python students[["name", "city", "python_score"]] ``` **Explanation** - This code snippet accesses a DataFrame named `students`. - It retrieves only the columns "name", "city", and "python_score". - The result is a new DataFrame containing only the specified columns for further analysis or display. - This operation is useful for focusing on relevant data without altering the original DataFrame. This returns a DataFrame. Column order follows the list you pass: ```python students[["python_score", "name"]] ``` **Explanation** - This code snippet accesses a DataFrame named `students`. - It selects and displays only the "python_score" and "name" columns. - The result is a new DataFrame containing only the specified data for further analysis or display. - This operation is useful for focusing on relevant information without altering the original DataFrame. ## 19. Selecting Rows With `iloc` `iloc` selects by integer position. ```python students.iloc[0] ``` **Explanation** - The code retrieves the first row of the DataFrame named `students` using the `iloc` indexer. - `iloc[0]` specifically targets the first row, as indexing starts at 0 in Python. - This operation returns a Series object containing all the column values for the first student in the DataFrame. - It is useful for quickly inspecting the data of the first entry without needing to display the entire DataFrame. This selects the first row. Multiple rows: ```python students.iloc[0:3] ``` **Explanation** - `students` is a pandas DataFrame that likely contains information about students. - The `iloc` method is used for integer-location based indexing to select rows. - The slice `0:3` specifies that the code should return rows at index 0, 1, and 2, effectively giving the first three entries. - This operation is useful for quickly viewing a subset of data for analysis or debugging. Rows at specific positions: ```python students.iloc[[0, 2, 5]] ``` **Explanation** - The code utilizes the `iloc` method from the pandas library to access specific rows of a DataFrame named `students`. - It retrieves the rows at index positions 0, 2, and 5, allowing for targeted data extraction. - This method is particularly useful for accessing rows by their integer location rather than by label. - The resulting output will be a new DataFrame containing only the selected rows. - Ensure that the indices specified exist within the DataFrame to avoid an IndexError. Last row: ```python students.iloc[-1] ``` **Explanation** - The code uses the `iloc` method from the pandas library to access data by integer-location based indexing. - `-1` indicates that the last row of the DataFrame `students` will be selected. - This is useful for quickly viewing the most recent entry or record in the dataset. - The returned output will be a Series object representing the last student's data. In `iloc`, slices behave like normal Python slices. The stop position is excluded. ```python students.iloc[1:4] ``` **Explanation** - The `iloc` method is used for integer-location based indexing in pandas DataFrames. - The slice `1:4` indicates that it will return rows with index 1, 2, and 3 (the end index is exclusive). - This allows for easy selection of a subset of data for analysis or manipulation. - The resulting output will be a new DataFrame containing only the specified rows. This selects positions 1, 2, and 3. ## 20. Selecting Rows With `loc` `loc` selects by index label. Create a labeled DataFrame: ```python students_by_id = students.set_index("name") ``` **Explanation** - The `set_index` method is used to change the index of the DataFrame `students`. - The "name" column is specified as the new index, allowing for more intuitive data retrieval based on student names. - After executing this line, the DataFrame will be indexed by student names instead of the default integer index. - This can improve performance and readability when performing operations that involve looking up students by their names. Select a row by label: ```python students_by_id.loc["Meera"] ``` **Explanation** - The code snippet uses the `loc` method from the pandas library to retrieve data. - It targets the row corresponding to the index label "Meera" in the `students_by_id` DataFrame. - This operation returns all columns for the student identified by the index "Meera". - It is useful for quickly accessing information about a specific student based on their unique identifier. Select multiple labels: ```python students_by_id.loc[["Aarav", "Tara", "Nila"]] ``` **Explanation** - The code accesses a DataFrame named `students_by_id` to retrieve data. - It uses the `.loc[]` method, which allows for label-based indexing to select rows. - The list `["Aarav", "Tara", "Nila"]` specifies the IDs of the students whose records are to be fetched. - This operation returns the rows corresponding to the specified student IDs, enabling focused data analysis. Label slicing: ```python students_by_id.loc["Meera":"Kabir"] ``` **Explanation** - Utilizes the `loc` method to access rows in a pandas DataFrame based on index labels. - The range specified, from "Meera" to "Kabir", includes all rows with index labels within this range. - This operation is useful for filtering data based on specific criteria or for viewing a subset of records. - Assumes that the DataFrame `students_by_id` is indexed by student names or IDs. - The output will include all records between the specified start and end labels, inclusive. Important difference: ```text loc label slices include the stop label when it exists. iloc position slices exclude the stop position. ``` This is a common beginner trap. ## 21. Selecting Rows And Columns Together Use `iloc` when both rows and columns are positions. ```python students.iloc[0:3, 0:2] ``` **Explanation** - Utilizes the `iloc` method from the pandas library to access specific rows and columns by integer-location based indexing. - The slice `0:3` indicates that it selects rows 0, 1, and 2, effectively retrieving the first three rows. - The slice `0:2` specifies that it selects columns 0 and 1, thus retrieving the first two columns of the DataFrame. - This operation is useful for quickly examining a subset of data within a larger DataFrame. - The result is a new DataFrame containing only the specified rows and columns. This means: - rows from position 0 to 2 - columns from position 0 to 1 Use `loc` when both rows and columns are labels. ```python students_by_id.loc["Aarav":"Tara", ["city", "python_score"]] ``` **Explanation** - Uses the `loc` method to access rows and columns in a pandas DataFrame. - Selects rows from "Aarav" to "Tara" based on the index labels. - Retrieves only the "city" and "python_score" columns for the specified range of students. - Assumes that the DataFrame `students_by_id` is indexed by student names or IDs. - Useful for filtering data to analyze specific attributes of a subset of students. You can also filter rows and select columns in one clean expression: ```python students.loc[students["city"] == "Pune", ["name", "python_score"]] ``` **Explanation** - Uses the `loc` method from the pandas library to filter rows in the `students` DataFrame. - The condition checks for students whose "city" column matches "Pune". - Only the "name" and "python_score" columns are selected for the output. - This operation is useful for analyzing performance based on geographical location. - The result is a subset of the original DataFrame containing relevant information for further analysis. This is one of the most useful DataFrame patterns. ## 22. Filtering Rows With One Condition Find students with Python score above 85: ```python high_python = students[students["python_score"] > 85] ``` **Explanation** - The code filters a DataFrame named `students` to identify those with a Python score greater than 85. - It uses boolean indexing to create a new DataFrame called `high_python`. - This operation allows for easy analysis of students who excel in Python, facilitating targeted educational strategies. - The resulting DataFrame `high_python` contains only the rows that meet the specified condition. Find students from Pune: ```python pune_students = students[students["city"] == "Pune"] ``` **Explanation** - The code snippet filters a DataFrame named `students` to create a new DataFrame `pune_students`. - It uses a boolean condition to check if the "city" column equals "Pune". - Only the rows where the condition is true are included in the `pune_students` DataFrame. - This operation is useful for analyzing or processing data specific to students located in Pune. Find students who completed the course: ```python completed = students[students["completed"] == True] ``` **Explanation** - The code filters a DataFrame named `students` to create a new DataFrame called `completed`. - It selects only the rows where the "completed" column has a value of `True`. - This is useful for analyzing or processing only those students who have finished their assignments or tasks. - The resulting `completed` DataFrame contains all relevant information about the students who have completed their tasks. Cleaner version: ```python completed = students[students["completed"]] ``` **Explanation** - This code snippet filters a DataFrame named `students` to create a new variable `completed`. - It selects only the rows where the "completed" column is truthy, indicating that those students have completed their tasks. - The result is a subset of the original DataFrame containing only the relevant entries. - This operation is useful for analyzing or processing data related to students who have finished their assignments or courses. ## 23. Filtering With Multiple Conditions Use `&` for AND. ```python strong_and_consistent = students[ (students["python_score"] >= 85) & (students["sql_score"] >= 85) ] ``` **Explanation** - The code filters a DataFrame named `students` to identify those with high scores in both Python and SQL. - It uses boolean indexing to select rows where the `python_score` is greater than or equal to 85 and the `sql_score` is also greater than or equal to 85. - The result is stored in the variable `strong_and_consistent`, which contains only the students meeting these criteria. - This approach is useful for analyzing student performance and identifying top performers in specific subjects. Use `|` for OR. ```python needs_attention = students[ (students["python_score"] < 75) | (students["practice_hours"] < 10) ] ``` **Explanation** - The code creates a new DataFrame called `needs_attention` from an existing DataFrame `students`. - It selects students whose `python_score` is less than 75 or who have logged less than 10 practice hours. - The logical operator `|` is used to combine the two conditions, meaning either condition being true will include the student in the result. - This filtering helps identify students who may need extra help in their studies. Important rule: ```text When combining conditions with & or |, wrap each condition in parentheses. ``` Do not write: ```python # Avoid this students[students["python_score"] >= 85 & students["sql_score"] >= 85] ``` **Explanation** - The code attempts to filter a DataFrame of students based on their scores in Python and SQL. - It uses the bitwise AND operator `&` incorrectly with the comparison operators, which can lead to unexpected results. - The correct approach should use parentheses around each condition to ensure proper evaluation order. - The intended functionality is to select students with scores of 85 or higher in both subjects. - Proper syntax would be: `students[(students["python_score"] >= 85) & (students["sql_score"] >= 85)]`. The operator precedence will not behave the way you expect. ## 24. Filtering With `isin` Use `isin` when a column should match any value from a list. ```python south_or_west = students[students["city"].isin(["Pune", "Kochi"])] ``` **Explanation** - The code filters a DataFrame named `students` to select only those entries where the "city" column matches either "Pune" or "Kochi". - The `isin()` method is used to check if the values in the "city" column are present in the specified list of cities. - The result is stored in the variable `south_or_west`, which will contain a subset of the original DataFrame. - This operation is useful for analyzing or processing data related to specific geographic locations. This is cleaner than writing many OR conditions. ## 25. Filtering With `between` Use `between` for ranges. ```python middle_scores = students[students["python_score"].between(75, 90)] ``` **Explanation** - The code filters a DataFrame named `students` to select only those rows where the `python_score` is between 75 and 90. - The `between` method is used to check if the `python_score` falls within the inclusive range of 75 to 90. - The result is stored in the variable `middle_scores`, which will contain only the students who meet this score criterion. - This operation is useful for analyzing a subset of students who are performing at a moderate level in Python. By default, the endpoints are included. ## 26. Filtering Text With String Methods Pandas gives string methods through `.str`. ```python students[students["name"].str.startswith("M")] ``` **Explanation** - Utilizes the pandas library to filter a DataFrame named `students`. - The `str.startswith("M")` method checks if the 'name' column values begin with the letter 'M'. - The result is a subset of the original DataFrame containing only the rows that meet this condition. - This operation is useful for quickly identifying specific groups of data based on string patterns. Case-insensitive contains: ```python students[students["city"].str.contains("pu", case=False, na=False)] ``` **Explanation** - The code accesses a DataFrame named `students` which contains student data. - It uses the `str.contains()` method to check if the "city" column contains the substring "pu". - The `case=False` argument ensures that the search is case insensitive, allowing matches for "Pu", "pu", "PU", etc. - The `na=False` argument prevents errors by treating NaN values as False, meaning they won't be included in the results. - The result is a filtered DataFrame containing only the rows where the "city" column matches the specified condition. Use `na=False` when the column may contain missing values. ## 27. Adding A New Column Add a constant column: ```python students["country"] = "India" ``` **Explanation** - The code modifies the `students` dictionary by adding or updating the key "country". - All entries in the `students` dictionary will have their "country" value set to "India". - This operation is useful for bulk updating information for a group of students. - If "country" already exists, its value will be overwritten with "India". Add a calculated column: ```python students["total_score"] = students["python_score"] + students["sql_score"] ``` **Explanation** - The code accesses a DataFrame named `students`, which contains individual student scores. - It creates a new column called `total_score` in the DataFrame. - The `total_score` is computed by adding the values from the `python_score` and `sql_score` columns for each student. - This operation is vectorized, meaning it efficiently processes all rows in the DataFrame simultaneously. - The resulting `total_score` column provides a quick reference for the overall performance of each student in both subjects. Add an average: ```python students["average_score"] = students[["python_score", "sql_score"]].mean(axis=1) ``` **Explanation** - The code snippet adds a new column named "average_score" to the `students` DataFrame. - It computes the mean of the scores from the "python_score" and "sql_score" columns. - The `mean(axis=1)` function calculates the average across the specified columns for each row. - This operation allows for easy assessment of student performance in both subjects. Add a category from a condition: ```python students["performance"] = np.where( students["average_score"] >= 85, "strong", "practice_more" ) ``` **Explanation** - The code uses NumPy's `where` function to create a new column in the `students` DataFrame called "performance". - It assigns the value "strong" to students whose "average_score" is 85 or higher. - For students with an "average_score" below 85, it assigns the value "practice_more". - This operation allows for efficient conditional logic application across the DataFrame. ## 28. Updating Existing Columns You can update text values: ```python students["city"] = students["city"].str.title() ``` **Explanation** - The code accesses the "city" column of the `students` DataFrame. - It applies the `str.title()` method, which converts each city name to title case (capitalizing the first letter of each word). - This transformation helps standardize city names, improving data quality and readability. - The modified city names are then reassigned back to the "city" column in the DataFrame. You can normalize inconsistent labels: ```python students["city"] = students["city"].replace({ "Bengaluru": "Bangalore", "Bombay": "Mumbai" }) ``` **Explanation** - The code modifies the "city" column in the `students` DataFrame. - It uses the `replace` method to change specific city names to their standardized versions. - "Bengaluru" is replaced with "Bangalore" and "Bombay" with "Mumbai". - This ensures uniformity in city names, which is important for data analysis and reporting. - The operation is performed in-place, updating the original DataFrame directly. You can update numeric values: ```python students["practice_hours"] = students["practice_hours"].clip(lower=0) ``` **Explanation** - The code modifies the "practice_hours" column in the "students" DataFrame. - It uses the `clip` method to set a lower limit of 0 for the values in this column. - Any negative values in "practice_hours" will be replaced with 0, ensuring all entries are non-negative. - This is useful for data cleaning, preventing unrealistic negative practice hours from affecting analysis. ## 29. Converting Data Types Use `astype` when the conversion is direct. ```python students["completed"] = students["completed"].astype("bool") ``` **Explanation** - The code modifies the "completed" column in the `students` DataFrame. - It uses the `astype` method to change the data type of the "completed" column to boolean. - This conversion allows for more efficient storage and logical operations on the column's values. - The operation is performed in-place, meaning the original DataFrame is updated directly. Convert a repeated text column to category: ```python students["city"] = students["city"].astype("category") ``` **Explanation** - The code modifies the "city" column of the `students` DataFrame. - It uses the `astype` method to change the data type of the column to "category". - Categorical data types are more memory-efficient and can improve performance for certain operations. - This conversion is particularly useful when the column contains a limited number of unique values. This can reduce memory for columns with repeated values. For messy numeric text, use `pd.to_numeric`. ```python raw = pd.DataFrame({ "amount": ["499", "299", "missing", "799"] }) raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce") ``` **Explanation** - A pandas DataFrame named `raw` is created with a column "amount" containing string representations of numbers and a missing value. - The `pd.to_numeric()` function is applied to the "amount" column to convert its values to numeric types. - The `errors="coerce"` parameter ensures that any non-convertible values (like "missing") are replaced with NaN (Not a Number) instead of raising an error. - This approach is useful for data cleaning, allowing for easier numerical analysis and operations on the DataFrame. Invalid values become missing values. ## 30. Handling Missing Values Create an example: ```python feedback = pd.DataFrame({ "student": ["Aarav", "Meera", "Ishan", "Tara", "Kabir"], "rating": [5, np.nan, 4, np.nan, 3], "comment": ["Clear", "Helpful", None, "Practical", None] }) ``` **Explanation** - A pandas DataFrame named `feedback` is created to organize student feedback data. - The DataFrame contains three columns: `student`, `rating`, and `comment`. - The `student` column lists the names of five students. - The `rating` column includes numerical ratings, with some entries as NaN (not a number) indicating missing values. - The `comment` column contains textual feedback, with some entries as None, representing absent comments. Check missing values: ```python feedback.isna().sum() ``` **Explanation** - The `isna()` function checks for missing values (NaNs) in the DataFrame `feedback`. - The `sum()` function aggregates the boolean results from `isna()`, counting the total number of missing entries for each column. - This code snippet is useful for data cleaning and preprocessing, allowing users to quickly identify columns that require attention due to missing data. Fill missing numeric values with median: ```python feedback["rating"] = feedback["rating"].fillna(feedback["rating"].median()) ``` **Explanation** - The code accesses the 'rating' column of the `feedback` DataFrame. - It uses the `fillna()` method to fill any NaN (missing) values in the 'rating' column. - The missing values are replaced with the median of the existing ratings, calculated using `feedback["rating"].median()`. - This approach helps maintain the integrity of the dataset by ensuring that all entries have a valid rating. - Using the median is beneficial as it is less affected by outliers compared to the mean. Fill missing text: ```python feedback["comment"] = feedback["comment"].fillna("No comment") ``` **Explanation** - The code accesses the "comment" column of the `feedback` DataFrame. - It uses the `fillna()` method to identify and replace any `NaN` (missing) values. - The default value used for replacement is the string "No comment". - This ensures that all entries in the "comment" column have a valid string, improving data consistency. Drop rows with missing values: ```python feedback.dropna() ``` **Explanation** - The `dropna()` method is called on the DataFrame `feedback`. - It eliminates any rows that contain at least one NaN (Not a Number) value. - The operation modifies the DataFrame in place unless specified otherwise. - This is useful for cleaning data before analysis to ensure completeness. - No parameters are provided, so it defaults to removing rows with any missing values. Drop rows only when a specific column is missing: ```python feedback.dropna(subset=["rating"]) ``` **Explanation** - The `dropna` method is used to eliminate rows that contain NaN (Not a Number) values. - The `subset` parameter specifies that only the 'rating' column should be checked for missing values. - Rows where 'rating' is NaN will be dropped from the DataFrame named `feedback`. - This operation helps in cleaning the dataset for analysis by ensuring that only complete entries are retained. Use dropping carefully. Sometimes missing data is itself meaningful. ## 31. Sorting Rows Sort by one column: ```python students.sort_values("python_score", ascending=False) ``` **Explanation** - The code snippet utilizes the `sort_values` method from a DataFrame to organize student records. - It specifically targets the "python_score" column for sorting. - The `ascending=False` parameter ensures that the highest scores appear first in the sorted list. - This operation modifies the order of the DataFrame in place, allowing for easy access to top-performing students. Sort by multiple columns: ```python students.sort_values( ["city", "python_score"], ascending=[True, False] ) ``` **Explanation** - The `sort_values` method is used to sort the DataFrame named `students`. - The sorting is performed first by the `city` column in ascending order. - Within each city, the `python_score` column is sorted in descending order. - This allows for a structured view of students, grouped by their city and ranked by their Python scores. - The result is a DataFrame that prioritizes city organization while highlighting top performers in Python within each city. ## 32. Counting Categories Use `value_counts` on a Series. ```python students["city"].value_counts() ``` **Explanation** - The code accesses the "city" column from the `students` DataFrame. - It uses the `value_counts()` method to count the number of occurrences of each unique city. - The result is a Series object that displays cities as the index and their respective counts as values. - This is useful for quickly analyzing the distribution of students across different cities. - The output can help in understanding demographic information or planning resources based on student locations. Normalize to percentages: ```python students["city"].value_counts(normalize=True) * 100 ``` **Explanation** - The code accesses the "city" column from a DataFrame named `students`. - It uses the `value_counts()` method to count the occurrences of each unique city. - The `normalize=True` parameter converts these counts into proportions of the total. - Multiplying by 100 converts the proportions into percentages, providing a clearer view of city distribution. - The result is a Series showing the percentage of students from each city. Include missing values: ```python students["city"].value_counts(dropna=False) ``` **Explanation** - The code accesses the "city" column from a DataFrame named `students`. - It uses the `value_counts()` method to count the unique occurrences of each city. - The parameter `dropna=False` ensures that missing values (NaN) are included in the count. - This is useful for understanding the distribution of students across different cities, including those without specified locations. ## 33. Grouping Data With `groupby` `groupby` helps you answer questions by category. Average score by city: ```python students.groupby("city")["python_score"].mean() ``` **Explanation** - The code uses the `groupby` method from the Pandas library to group student data by the "city" column. - It selects the "python_score" column to focus on the scores of students in each city. - The `mean()` function computes the average Python score for each city group. - This operation results in a Series with cities as the index and their corresponding average Python scores as values. - It is useful for analyzing performance trends based on geographic location. Multiple aggregations: ```python students.groupby("city").agg( learners=("name", "count"), avg_python=("python_score", "mean"), avg_sql=("sql_score", "mean"), total_hours=("practice_hours", "sum") ) ``` **Explanation** - The code uses the `groupby` method to organize student data based on the "city" column. - It aggregates the data to compute the number of learners, average Python scores, average SQL scores, and total practice hours for each city. - The `agg` function specifies multiple aggregation operations, including counting names, calculating means for scores, and summing practice hours. - The results are returned as a DataFrame with new column names for clarity, such as "learners", "avg_python", "avg_sql", and "total_hours". Sort the result: ```python students.groupby("city")["practice_hours"].sum().sort_values(ascending=False) ``` **Explanation** - The code uses the `groupby` method to categorize student data based on the "city" column. - It then aggregates the "practice_hours" for each city using the `sum()` function. - The results are sorted in descending order with `sort_values(ascending=False)`, allowing for easy identification of cities with the highest total practice hours. - This operation is useful for analyzing regional engagement in practice activities among students. ## 34. Mini Project 1: Course Enrollment Table Create a small course enrollment dataset. ```python enrollments = pd.DataFrame({ "learner": ["Aarav", "Meera", "Ishan", "Tara", "Kabir", "Nila", "Rohan"], "track": ["Python", "Pandas", "Python", "SQL", "Pandas", "Python", "SQL"], "level": ["beginner", "beginner", "beginner", "intermediate", "beginner", "intermediate", "beginner"], "minutes_watched": [320, 410, 180, 260, 390, 520, 150], "quiz_score": [82, 91, 67, 74, 88, 95, 61], "certificate": [True, True, False, False, True, True, False] }) ``` **Explanation** - This code snippet initializes a pandas DataFrame named `enrollments` to manage data related to learners in various tracks. - It includes columns for `learner` names, the `track` they are enrolled in, their `level` of expertise, `minutes_watched` for course content, `quiz_score` achieved, and whether they received a `certificate`. - The data is structured as a dictionary, where each key corresponds to a column name and the associated list contains the respective values for each learner. - This DataFrame can be used for further analysis, such as calculating average scores or filtering learners based on their progress. ### Question 1: Which learners watched more than 300 minutes? ```python enrollments[enrollments["minutes_watched"] > 300] ``` **Explanation** - This code snippet filters a DataFrame named `enrollments`. - It selects only the rows where the value in the "minutes_watched" column is greater than 300. - The result is a new DataFrame containing only the enrollments with significant viewing time. - This operation is useful for analyzing user engagement or identifying highly active users. ### Question 2: Which beginner learners scored at least 80? ```python enrollments[ (enrollments["level"] == "beginner") & (enrollments["quiz_score"] >= 80) ] ``` **Explanation** - The code filters a DataFrame named `enrollments` to select rows that meet specific criteria. - It checks for entries where the "level" column is equal to "beginner". - Additionally, it ensures that the "quiz_score" column has values greater than or equal to 80. - The result is a subset of the original DataFrame containing only the qualifying records. ### Question 3: What is the average score per track? ```python enrollments.groupby("track")["quiz_score"].mean() ``` **Explanation** - The code uses the `groupby` method to organize data based on the "track" column. - It selects the "quiz_score" column to perform calculations on. - The `mean()` function computes the average quiz score for each unique track. - This operation is useful for analyzing performance trends across different tracks. - The result is a new Series with track names as the index and their corresponding average quiz scores. ### Question 4: Which track has the most certificates? ```python enrollments.groupby("track")["certificate"].sum().sort_values(ascending=False) ``` **Explanation** - Groups the `enrollments` DataFrame by the `track` column. - Sums the `certificate` values for each group, providing the total certificates earned per track. - Sorts the resulting sums in descending order to highlight the tracks with the highest certificate totals. ### Question 5: Add an engagement column ```python enrollments["engagement"] = np.where( enrollments["minutes_watched"] >= 300, "high", "low" ) ``` **Explanation** - Utilizes the `np.where` function from the NumPy library to create a new column called "engagement" in the `enrollments` DataFrame. - Checks if the value in the "minutes_watched" column is greater than or equal to 300. - Assigns the label "high" for users who meet or exceed 300 minutes watched, and "low" for those who do not. - This categorization helps in analyzing user engagement levels effectively. This mini project covers filtering, grouping, sorting, and column creation. ## 35. Mini Project 2: Cleaning Store Orders Create a messy order table. ```python orders = pd.DataFrame({ "order_id": [1, 2, 3, 3, 4, 5, 6], "customer": ["Anika", "Rohan", "Zoya", "Zoya", "Kabir", "Meera", "Ishan"], "city": ["pune", "Delhi", "PUNE", "PUNE", None, "Kochi", "delhi"], "amount": ["499", "299", "799", "799", "bad", "999", "450"], "status": ["paid", "pending", "paid", "paid", "paid", None, "pending"] }) ``` **Explanation** - Initializes a pandas DataFrame named `orders` containing order details such as order ID, customer name, city, amount, and payment status. - The `order_id` column includes duplicate entries, indicating multiple orders from the same customer. - The `city` column contains inconsistent casing (e.g., "PUNE" vs "pune") and a missing value (None). - The `amount` column has a non-numeric entry ("bad"), which may lead to data type issues during analysis. - The `status` column includes both valid entries and a missing value (None), which may require handling for accurate reporting. ### Step 1: Normalize city names ```python orders["city"] = orders["city"].str.title() ``` **Explanation** - The code accesses the "city" column of the `orders` DataFrame. - It applies the `str.title()` method, which converts each city name to title case (capitalizing the first letter of each word). - The formatted city names are then reassigned back to the "city" column, updating the DataFrame in place. - This transformation ensures consistency in the representation of city names for better readability and data analysis. ### Step 2: Convert amount to numeric ```python orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce") ``` **Explanation** - The code uses the `pd.to_numeric()` function from the Pandas library to convert the "amount" column in the `orders` DataFrame to numeric data types. - The `errors="coerce"` parameter ensures that any values that cannot be converted to numbers are replaced with `NaN` (Not a Number) instead of raising an error. - This approach is useful for cleaning data, as it allows for the identification and handling of invalid entries in the "amount" column. - After execution, the "amount" column will contain only numeric values or `NaN`, making it suitable for further analysis or calculations. ### Step 3: Fill missing status ```python orders["status"] = orders["status"].fillna("unknown") ``` **Explanation** - The code accesses the 'status' column of the DataFrame named `orders`. - It uses the `fillna()` method to replace any `NaN` (missing) values in that column. - The replacement value for missing entries is set to the string "unknown". - This operation helps in maintaining data integrity by ensuring there are no missing values in the 'status' column. - The updated DataFrame retains all other data while only modifying the specified column. ### Step 4: Remove duplicate orders ```python orders = orders.drop_duplicates(subset=["order_id"]) ``` **Explanation** - The `drop_duplicates` method is called on the `orders` DataFrame. - The `subset` parameter specifies that duplicates should be identified based on the `order_id` column. - Only the first occurrence of each unique `order_id` is retained in the DataFrame. - This operation helps in cleaning the data by ensuring that each order is represented only once. - The result is a DataFrame with unique order entries, which is crucial for accurate data analysis. ### Step 5: Find paid orders above 400 ```python orders[ (orders["status"] == "paid") & (orders["amount"] > 400) ] ``` **Explanation** - The code filters a DataFrame named `orders` to retrieve specific rows based on conditions. - It checks for rows where the `status` column is equal to "paid". - Additionally, it ensures that the `amount` column has values greater than 400. - The result is a subset of the `orders` DataFrame that meets both criteria, useful for analyzing high-value transactions. This is closer to real work because data is rarely clean when it arrives. ## 36. Practice Dataset: Workshop Attendance Use this dataset for the exercises. ```python attendance = pd.DataFrame({ "learner": ["Aarav", "Meera", "Ishan", "Tara", "Kabir", "Nila", "Rohan", "Zoya", "Aarav"], "topic": ["Python", "Pandas", "Python", "SQL", "Pandas", "Python", "SQL", "Pandas", "Python"], "age": [21, 24, np.nan, 23, 26, 22, np.nan, 25, 21], "visits": [3, 5, 2, np.nan, 4, 6, 1, 5, 3], "score": [78, 91, 65, 72, 88, 95, 59, 84, 78], "city": ["Pune", "Kochi", "Delhi", "Pune", "Jaipur", "Kochi", "Delhi", "Pune", "Pune"] }) ``` **Explanation** - Initializes a pandas DataFrame named `attendance` to store information about learners. - Contains columns for `learner` names, `topic` of study, `age`, `visits` to sessions, `score` achieved, and `city` of residence. - Includes missing values represented by `np.nan` for the `age` and `visits` columns, indicating incomplete data for some learners. - The DataFrame is structured to facilitate analysis of attendance patterns and performance metrics across different topics and demographics. ## 37. Practice Questions With Solutions ### Q1. Display basic information about the DataFrame. ```python attendance.info() attendance.describe() ``` **Explanation** - The `attendance.info()` method provides a concise summary of the DataFrame, including the number of entries, column data types, and memory usage. - The `attendance.describe()` method generates descriptive statistics for numerical columns, such as count, mean, standard deviation, min, and max values. - Both methods are useful for understanding the structure and distribution of data within the DataFrame before performing further analysis. - These functions help identify potential issues like missing values or outliers in the dataset. ### Q2. Show alternate rows. ```python attendance.iloc[::2] ``` **Explanation** - The code uses the `iloc` method from the pandas library to access rows by their integer index. - The slicing notation `::2` indicates that every second row will be selected, starting from the first row (index 0). - This is useful for reducing the size of a DataFrame or for sampling data at regular intervals. - The resulting output is a new DataFrame containing only the selected rows, which can be further processed or analyzed. ### Q3. Show rows at positions 0, 2, and 6 with only `learner` and `age`. ```python attendance.loc[attendance.index[[0, 2, 6]], ["learner", "age"]] ``` **Explanation** - The code uses the `loc` method from the Pandas library to access specific data in a DataFrame named `attendance`. - It retrieves rows at index positions 0, 2, and 6, which correspond to the first, third, and seventh entries in the DataFrame. - The columns selected for output are "learner" and "age", meaning only these two attributes will be displayed for the specified rows. - This method is useful for extracting a subset of data for analysis or reporting based on specific criteria. Or: ```python attendance[["learner", "age"]].iloc[[0, 2, 6]] ``` **Explanation** - The code snippet utilizes the pandas library to manipulate a DataFrame named `attendance`. - It selects the columns "learner" and "age" from the DataFrame. - The `iloc` method is used to retrieve rows at indices 0, 2, and 6. - This operation results in a new DataFrame containing only the specified rows and columns. ### Q4. Select rows where visits are less than 4. ```python attendance[attendance["visits"] < 4] ``` **Explanation** - The code accesses a DataFrame named `attendance`. - It filters the rows where the "visits" column has values less than 4. - The result is a new DataFrame containing only those records that meet the specified condition. - This operation is useful for identifying individuals who may need additional engagement or support. ### Q5. Select rows where `age` or `visits` is missing. ```python attendance[ attendance["age"].isna() | attendance["visits"].isna() ] ``` **Explanation** - The code filters the `attendance` DataFrame to identify rows where either the `age` or `visits` columns contain missing values (NaN). - The `isna()` function is used to check for NaN values in the specified columns. - The logical OR operator (`|`) combines the two conditions, ensuring that any row with a NaN in either column is included in the result. - The resulting DataFrame will only contain entries that have incomplete data regarding age or visit counts. ### Q6. Fill missing ages with the median age. ```python attendance["age"] = attendance["age"].fillna(attendance["age"].median()) ``` **Explanation** - The code accesses the "age" column of the `attendance` DataFrame. - It uses the `fillna()` method to replace any `NaN` (missing) values in the "age" column. - The replacement value is calculated using the `median()` function, which finds the median age from the existing values. - This approach helps maintain the integrity of the dataset by avoiding the removal of rows with missing values. - Filling missing values with the median is a common practice to reduce bias in statistical analysis. ### Q7. Fill missing visits with the most common visits value. ```python attendance["visits"] = attendance["visits"].fillna(attendance["visits"].mode()[0]) ``` **Explanation** - The code accesses the 'visits' column of the 'attendance' DataFrame. - It uses the `fillna()` method to replace any NaN (missing) values. - The replacement value is determined by calculating the mode (most frequent value) of the 'visits' column. - The mode is accessed using `attendance["visits"].mode()[0]`, which retrieves the first mode value. - This approach ensures that the 'visits' column has no missing values, improving data integrity for analysis. ### Q8. Find total visits for learners attending Python workshops. ```python attendance[attendance["topic"] == "Python"]["visits"].sum() ``` **Explanation** - Filters the `attendance` DataFrame to include only rows where the `topic` column equals "Python". - Accesses the `visits` column from the filtered DataFrame. - Uses the `sum()` function to compute the total number of visits for the Python topic. - Returns a single numeric value representing the aggregated visits. - Assumes that the `attendance` DataFrame is already defined and contains the relevant columns. ### Q9. Count learners per topic. ```python attendance["topic"].value_counts() ``` **Explanation** - The code accesses the "topic" column of the `attendance` DataFrame. - It utilizes the `value_counts()` method to compute the frequency of each unique value in that column. - The result is a Series object that displays topics as the index and their corresponding counts as values. - This is useful for quickly understanding the distribution of topics in the attendance data. ### Q10. Count duplicate rows. ```python attendance.duplicated().sum() ``` **Explanation** - The `attendance.duplicated()` method identifies duplicate rows in the `attendance` DataFrame, returning a boolean Series. - The `sum()` function then counts the number of `True` values in this Series, effectively giving the total number of duplicate entries. - This code is useful for data cleaning and validation, ensuring that each entry in the attendance record is unique. - It helps in maintaining data integrity before performing further analysis or operations on the dataset. ### Q11. Drop duplicate rows. ```python attendance = attendance.drop_duplicates() ``` **Explanation** - The `drop_duplicates()` method is called on the `attendance` DataFrame. - It eliminates any rows that are exact duplicates of others, ensuring each entry is unique. - The operation modifies the DataFrame in place, meaning the original `attendance` variable will now only contain unique records. - This is useful for cleaning data before analysis or reporting, as duplicates can skew results. ### Q12. Find average score by topic. ```python attendance.groupby("topic")["score"].mean() ``` **Explanation** - The code uses the `groupby` method to organize the data in the `attendance` DataFrame by the "topic" column. - It then selects the "score" column to perform calculations on. - The `mean()` function computes the average score for each unique topic. - This operation is useful for summarizing performance metrics across different topics. - The result is a new Series with topics as the index and their corresponding average scores as values. ### Q13. Find top cities by total visits. ```python attendance.groupby("city")["visits"].sum().sort_values(ascending=False) ``` **Explanation** - The code uses the `groupby` method to aggregate data in the `attendance` DataFrame based on the unique values in the "city" column. - It selects the "visits" column to calculate the total number of visits for each city. - The `sum()` function computes the total visits for each group (city). - Finally, `sort_values(ascending=False)` arranges the resulting totals in descending order, allowing easy identification of cities with the highest visits. ### Q14. Create a `passed` column where score is at least 70. ```python attendance["passed"] = attendance["score"] >= 70 ``` **Explanation** - The code adds a new column named "passed" to the `attendance` DataFrame. - It evaluates whether each student's score is greater than or equal to 70. - The result is a boolean value (`True` or `False`) indicating if the student has passed. - This operation utilizes vectorized operations for efficient computation across the DataFrame. ### Q15. Create a `score_band` column. ```python attendance["score_band"] = np.where( attendance["score"] >= 85, "strong", np.where(attendance["score"] >= 70, "good", "needs_practice") ) ``` **Explanation** - Utilizes NumPy's `where` function to create a new column `score_band` in the `attendance` DataFrame. - Assigns the value "strong" if the `score` is 85 or higher. - Assigns the value "good" if the `score` is between 70 and 84. - Assigns the value "needs_practice" for scores below 70. - This approach allows for efficient conditional assignment in a DataFrame. ## 38. Practice Dataset: League Matches This dataset is fictional and made only for practice. ```python matches = pd.DataFrame({ "season": [2023, 2023, 2023, 2024, 2024, 2024, 2024], "city": ["Pune", "Delhi", "Kochi", "Pune", "Jaipur", "Delhi", "Kochi"], "team_a": ["Falcons", "Tigers", "Falcons", "Rockets", "Tigers", "Falcons", "Rockets"], "team_b": ["Rockets", "Falcons", "Tigers", "Falcons", "Rockets", "Tigers", "Tigers"], "winner": ["Falcons", "Tigers", "Falcons", "Falcons", "Rockets", "Falcons", "Tigers"], "margin": [12, 8, 20, 5, 15, 6, 9], "final": [False, False, True, False, False, True, False] }) ``` **Explanation** - A pandas DataFrame named `matches` is created to organize data related to sports match results. - The DataFrame contains columns for the season year, city of the match, teams involved (team_a and team_b), the winner of each match, the margin of victory, and whether the match was a final. - The data includes matches from two different seasons (2023 and 2024) with various cities and teams participating. - The `final` column indicates if a match is a final, with boolean values (True or False) for each match entry. - This structure allows for easy analysis and manipulation of match data using pandas functionalities. ## 39. Match Practice Questions With Solutions ### Q1. Find all final winners. ```python matches.loc[matches["final"], ["season", "winner"]] ``` **Explanation** - Utilizes the `loc` method to filter rows in the `matches` DataFrame where the "final" column is `True`. - Selects only the "season" and "winner" columns from the filtered results. - This operation is useful for analyzing the outcomes of championship matches across different seasons. ### Q2. Count matches won by Falcons in Delhi. ```python matches[ (matches["winner"] == "Falcons") & (matches["city"] == "Delhi") ].shape[0] ``` **Explanation** - Filters the `matches` DataFrame to include only rows where the winner is "Falcons" and the city is "Delhi". - Uses the bitwise AND operator `&` to combine the two conditions for filtering. - The `shape[0]` attribute returns the number of rows that meet the specified criteria, effectively counting the matches. - This code is useful for analyzing specific match outcomes based on team performance in a particular location. ### Q3. Find top teams by number of matches played. Because a team can appear in `team_a` or `team_b`, combine both columns. ```python played = pd.concat([matches["team_a"], matches["team_b"]]) played.value_counts() ``` **Explanation** - The code uses the `pd.concat()` function from the pandas library to combine two columns, `team_a` and `team_b`, from the `matches` DataFrame into a single Series named `played`. - The `value_counts()` method is then called on the `played` Series to count how many times each team appears in the combined list of matches. - This allows for easy analysis of team participation across all matches in the dataset. - The resulting output will show the teams along with the number of matches they played, sorted in descending order by count. ### Q4. Build a team-vs-team summary function. ```python def team_summary(data, team_one, team_two): mask_one = (data["team_a"] == team_one) | (data["team_b"] == team_one) mask_two = (data["team_a"] == team_two) | (data["team_b"] == team_two) head_to_head = data[mask_one & mask_two] return { "matches": head_to_head.shape[0], f"{team_one}_wins": (head_to_head["winner"] == team_one).sum(), f"{team_two}_wins": (head_to_head["winner"] == team_two).sum(), "average_margin": head_to_head["margin"].mean() } team_summary(matches, "Falcons", "Tigers") ``` **Explanation** - The function `team_summary` takes three parameters: a DataFrame `data` containing match details, and two team names `team_one` and `team_two`. - It creates boolean masks to filter matches where either team participated using logical OR operations. - The filtered DataFrame `head_to_head` contains only the matches involving both specified teams. - The function returns a dictionary with the total number of matches, the number of wins for each team, and the average margin of victory across those matches. - The function is called with a DataFrame `matches` and two team names, "Falcons" and "Tigers", to retrieve their match statistics. ### Q5. Find the top cities where Falcons played. ```python falcons_matches = matches[ (matches["team_a"] == "Falcons") | (matches["team_b"] == "Falcons") ] falcons_matches["city"].value_counts() ``` **Explanation** - The code filters a DataFrame named `matches` to include only those rows where either `team_a` or `team_b` is "Falcons". - The filtered DataFrame is stored in the variable `falcons_matches`. - It then counts the occurrences of each unique city in the `city` column of the `falcons_matches` DataFrame. - The result is a Series that shows how many matches involving the Falcons took place in each city. ### Q6. Find average winning margin for Falcons in 2024. ```python matches[ (matches["season"] == 2024) & ( (matches["team_a"] == "Falcons") | (matches["team_b"] == "Falcons") ) ]["margin"].mean() ``` **Explanation** - Filters the `matches` DataFrame to include only rows where the season is 2024. - Further narrows down the results to matches where either `team_a` or `team_b` is "Falcons". - Selects the `margin` column from the filtered DataFrame. - Computes the mean of the `margin` values to determine the average margin for the specified conditions. ## 40. Common Beginner Mistakes ### Mistake 1: Forgetting Double Brackets For Multiple Columns Wrong: ```python students["name", "city"] ``` **Explanation** - This code snippet retrieves the "name" and "city" columns from a DataFrame named `students`. - The syntax `students["name", "city"]` indicates that multiple columns can be accessed simultaneously. - This operation is useful for filtering data to focus on specific attributes of the students. - Ensure that the `students` DataFrame is defined and contains the specified columns to avoid errors. Right: ```python students[["name", "city"]] ``` **Explanation** - This code snippet accesses a pandas DataFrame named `students`. - It retrieves only the "name" and "city" columns from the DataFrame. - The double square brackets indicate that a subset of columns is being selected, returning a new DataFrame. - This operation is useful for focusing on specific data attributes for analysis or reporting. ### Mistake 2: Confusing `loc` And `iloc` Use: - `loc` for labels - `iloc` for integer positions ```python students.iloc[0] students_by_id.loc["Meera"] ``` **Explanation** - The first line retrieves the first row of the `students` DataFrame using integer-location based indexing with `iloc`. - The second line accesses the record of a student named "Meera" from the `students_by_id` DataFrame using label-based indexing with `loc`. - Both methods are part of the pandas library, which is commonly used for data manipulation and analysis in Python. - These indexing techniques allow for efficient retrieval of data based on either position or label, enhancing data handling capabilities. ### Mistake 3: Missing Parentheses In Multiple Conditions Wrong: ```python students[students["python_score"] > 80 & students["sql_score"] > 80] ``` **Explanation** - The code filters a DataFrame named `students` to select rows where the `python_score` is greater than 80 and the `sql_score` is also greater than 80. - The logical condition uses the bitwise AND operator `&` to combine the two conditions for filtering. - It is important to use parentheses around each condition to ensure proper evaluation order in the expression. - The result is a subset of the original DataFrame containing only students who excel in both subjects. Right: ```python students[ (students["python_score"] > 80) & (students["sql_score"] > 80) ] ``` **Explanation** - The code filters a DataFrame named `students` to find rows where both the `python_score` and `sql_score` are greater than 80. - It uses boolean indexing with the `&` operator to combine conditions for both scores. - The resulting DataFrame will only include students who excel in both subjects, useful for identifying top performers. - This operation is efficient for data analysis in educational contexts, allowing for targeted insights. ### Mistake 4: Using `=` Instead Of `==` In Filters Wrong: ```python # This is assignment, not comparison students["city"] = "Pune" ``` **Explanation** - The code modifies the dictionary `students` by assigning the string value `"Pune"` to the key `"city"`. - This operation updates the existing value associated with the key if it already exists or creates a new key-value pair if it does not. - The comment clarifies that the single equals sign (`=`) is used for assignment, not for comparison, which uses double equals (`==`). Right: ```python students[students["city"] == "Pune"] ``` **Explanation** - The code snippet accesses a DataFrame named `students`. - It filters the DataFrame to include only those rows where the "city" column has the value "Pune". - The result is a new DataFrame containing only the students from Pune. - This operation is useful for analyzing or processing data specific to a particular location. ### Mistake 5: Dropping Missing Values Too Early Before dropping rows, ask: - Why is the value missing? - Is the row still useful? - Can the value be filled safely? - Will dropping rows bias the analysis? ### Mistake 6: Changing Original Data Accidentally If you want to experiment, create a copy. ```python practice = students.copy() ``` **Explanation** - The `copy()` method is used to create a shallow copy of the `students` list. - The new list, `practice`, is independent of the original `students` list, meaning changes to one will not affect the other. - This is useful for scenarios where you want to manipulate data without altering the original dataset. - The operation is efficient for lists, ensuring that the original data remains intact while allowing for experimentation or practice. ## 41. DataFrame Cheat Sheet ```python # Create pd.DataFrame(data) pd.read_csv("file.csv") # Inspect df.shape df.dtypes df.index df.columns df.head() df.tail() df.sample(5) df.info() df.describe() # Missing and duplicates df.isna().sum() df.duplicated().sum() df.drop_duplicates() # Select columns df["col"] df[["col1", "col2"]] # Select rows df.iloc[0] df.iloc[0:5] df.loc["label"] df.loc["start":"stop"] # Select rows and columns df.iloc[0:5, 0:3] df.loc[df["score"] > 80, ["name", "score"]] # Filter df[df["city"] == "Pune"] df[(df["score"] > 80) & (df["city"] == "Pune")] df[df["city"].isin(["Pune", "Kochi"])] df[df["score"].between(70, 90)] # Add or update df["new_col"] = value df["total"] = df["a"] + df["b"] df.rename(columns={"old": "new"}) # Convert df["col"].astype("category") pd.to_numeric(df["amount"], errors="coerce") # Summarize df["city"].value_counts() df.groupby("city")["score"].mean() df.sort_values("score", ascending=False) ``` **Explanation** - The code demonstrates various operations on a Pandas DataFrame, including creation, inspection, and manipulation of data. - It shows how to read data from a CSV file and inspect the DataFrame's shape, data types, and summary statistics. - The snippet includes methods for handling missing values and duplicates, allowing for data cleaning. - It illustrates techniques for selecting specific rows and columns, as well as filtering data based on conditions. - The code also covers adding new columns, updating existing ones, converting data types, and summarizing data through grouping and counting. ## 42. Interview-Style Questions ### 1. What is a Pandas DataFrame? A DataFrame is a two-dimensional labeled data structure with rows and columns. It is commonly used for tabular data. ### 2. What is the difference between a Series and a DataFrame? A Series is one-dimensional. A DataFrame is two-dimensional and can be thought of as multiple Series sharing the same row index. ### 3. What does `df.shape` return? It returns a tuple: `(number_of_rows, number_of_columns)`. ### 4. What is the difference between `loc` and `iloc`? `loc` selects by labels. `iloc` selects by integer positions. ### 5. Why do we use parentheses with `&` and `|` in filters? Because Pandas comparisons must be grouped before combining them. Without parentheses, Python operator precedence can produce errors or incorrect logic. ### 6. How do you select multiple columns? Use a list of column names: ```python df[["name", "score"]] ``` **Explanation** - This code snippet retrieves two columns, "name" and "score", from a pandas DataFrame named `df`. - The double square brackets indicate that a subset of the DataFrame is being created, returning only the specified columns. - The result is a new DataFrame containing only the data from the "name" and "score" columns, which can be useful for analysis or visualization. - This operation does not modify the original DataFrame `df`; it creates a new view of the data. ### 7. How do you count missing values per column? ```python df.isna().sum() ``` **Explanation** - The `isna()` function is called on a DataFrame `df`, which returns a DataFrame of the same shape with boolean values indicating the presence of missing values (True for NaN). - The `sum()` function is then applied to this boolean DataFrame, which counts the number of True values (missing values) in each column. - The result is a Series object where the index corresponds to the column names and the values represent the count of missing entries for each column. - This is useful for data cleaning and preprocessing, allowing users to quickly identify columns that may require attention due to missing data. ### 8. How do you remove duplicate rows? ```python df.drop_duplicates() ``` **Explanation** - The `drop_duplicates()` method is called on a pandas DataFrame `df`. - It identifies and removes any rows that are exact duplicates of each other. - By default, it keeps the first occurrence of each duplicate and drops subsequent ones. - This operation helps in cleaning the dataset for accurate analysis and processing. - The method can be customized with parameters to specify which columns to consider for identifying duplicates. ### 9. How do you convert a column to numeric? ```python df["amount"] = pd.to_numeric(df["amount"], errors="coerce") ``` **Explanation** - The code uses the `pd.to_numeric()` function from the pandas library to convert the values in the 'amount' column of a DataFrame (`df`) to numeric types. - The `errors="coerce"` parameter ensures that any values that cannot be converted to numeric will be replaced with `NaN` (Not a Number), preventing the code from raising an error. - This is useful for cleaning data, especially when dealing with mixed types or invalid entries in the 'amount' column. - The resulting 'amount' column will contain only numeric values or `NaN`, making it easier to perform numerical operations or analyses later on. ### 10. How do you find the average score by city? ```python df.groupby("city")["score"].mean() ``` **Explanation** - The code uses the `groupby` method on a DataFrame `df` to organize data based on unique values in the "city" column. - It then selects the "score" column to compute the mean for each group created by the cities. - The result is a new Series containing the average score for each city, which can be useful for data analysis and reporting. ## 43. A Good Learning Path After This Once you understand DataFrames, continue with: - deeper missing-value handling - `merge` and `join` - `groupby` in detail - pivot tables - time series data - plotting with Pandas - performance-friendly Pandas patterns - exploratory data analysis projects DataFrame basics are the foundation. Most Pandas work is just these ideas repeated on larger and messier datasets. ## Sources And Further Reading This guide uses original examples and practice datasets. The notebook was used only as a topic outline, not as copied lesson text or copied datasets. Official Pandas references: - Pandas DataFrame introduction: https://pandas.pydata.org/docs/user_guide/dsintro.html - Pandas getting started tutorials: https://pandas.pydata.org/docs/getting_started/intro_tutorials/ - Indexing and selecting data: https://pandas.pydata.org/docs/user_guide/indexing.html - Group by user guide: https://pandas.pydata.org/docs/user_guide/groupby.html - Pandas API reference: https://pandas.pydata.org/docs/reference/