Pandas DataFrame: Create, Inspect, and Analyze Data

May 31, 2026
35 min read

AI Insights

Powered by GPT-4o-mini

Verified Context: pandas-dataframe-create-inspect-and-analyze-data
Quick Answer

Learn Pandas DataFrames from scratch with original examples: create tables, read CSV files, inspect data, use loc and iloc, filter rows, add columns, clean missing values, remove duplicates, convert data types, and solve practical exercises.

Quick Summary

Learn how to create, inspect, and analyze Pandas DataFrames in Python with practical examples and tips for effective data handling.

Pandas DataFrame: Create, Inspect, Select, Filter, Clean, and Analyze Tables

A Pandas DataFrame is the main object you use when working with table-shaped data in Python.

If a Series is one labeled column, a DataFrame is a full table made from many labeled columns.

You can use a DataFrame to represent:

  • students and marks
  • orders and payments
  • website traffic logs
  • course enrollments
  • app events
  • survey responses
  • sports match records
  • product inventory
  • interview practice datasets

In this guide, you will learn DataFrames through original examples built around students, learning courses, small store orders, and simple sports records.

The goal is not just to memorize commands. The goal is to become comfortable asking questions from data.

What You Will Learn

By the end, you should be able to:

  • explain what a DataFrame is
  • create DataFrames from lists, dictionaries, and records
  • read data from a CSV file
  • inspect shape, columns, index, dtypes, and values
  • preview rows with head, tail, and sample
  • understand info, describe, isna, and duplicated
  • rename columns safely
  • select one column, many columns, one row, and many rows
  • use loc for labels and iloc for positions
  • select rows and columns together
  • filter rows using one condition and multiple conditions
  • use isin, between, and string filters
  • add new columns from constants and formulas
  • convert data types with astype and pd.to_numeric
  • clean missing values with fillna and dropna
  • remove duplicate rows
  • sort rows
  • count categories with value_counts
  • group data with groupby
  • solve beginner-to-practical DataFrame exercises

1. Importing Pandas

Most Pandas code uses the alias pd.

python
import pandas as pd
import numpy as np

You do not always need NumPy, but it is useful for missing values and numeric examples.

2. What Is A DataFrame?

A DataFrame is a two-dimensional labeled data structure.

Simple meaning:

  • rows represent records
  • columns represent fields
  • row labels live in the index
  • column labels live in columns

Example:

python
students = pd.DataFrame({
    "name": ["Aarav", "Meera", "Ishan", "Tara"],
    "python_score": [86, 92, 74, 88],
    "sql_score": [78, 85, 81, 90],
    "city": ["Pune", "Kochi", "Delhi", "Pune"]
})

print(students)

Output:

text
    name  python_score  sql_score   city
0  Aarav            86         78   Pune
1  Meera            92         85  Kochi
2  Ishan            74         81  Delhi
3   Tara            88         90   Pune

This is a DataFrame with 4 rows and 4 columns.

3. DataFrame Vs Series

Use this mental model:

text
Series     = one labeled column
DataFrame  = many Series side by side

Selecting a single column from a DataFrame usually gives a Series:

python
students["python_score"]

Selecting multiple columns gives another DataFrame:

python
students[["name", "python_score"]]

This difference matters because a Series and a DataFrame support many of the same operations, but their shapes are different.

4. Creating A DataFrame From A List Of Lists

You can create a DataFrame from rows.

python
rows = [
    ["Aarav", 86, "Pune"],
    ["Meera", 92, "Kochi"],
    ["Ishan", 74, "Delhi"]
]

df = pd.DataFrame(rows, columns=["name", "score", "city"])
print(df)

Output:

text
    name  score   city
0  Aarav     86   Pune
1  Meera     92  Kochi
2  Ishan     74  Delhi

This style is useful when your data naturally arrives row by row.

5. Creating A DataFrame From A Dictionary Of Lists

This is one of the most common beginner-friendly ways.

python
course_data = {
    "student": ["Aarav", "Meera", "Ishan", "Tara"],
    "course": ["Python", "Pandas", "Python", "SQL"],
    "hours_spent": [14, 18, 9, 12],
    "completed": [True, True, False, True]
}

courses = pd.DataFrame(course_data)
print(courses)

Each dictionary key becomes a column name.

Each list becomes a column.

Important rule: all lists must have the same length.

6. Creating A DataFrame From A List Of Dictionaries

This style is excellent for API-like records.

python
records = [
    {"order_id": 101, "customer": "Anika", "amount": 499, "paid": True},
    {"order_id": 102, "customer": "Rohan", "amount": 299, "paid": False},
    {"order_id": 103, "customer": "Zoya", "amount": 799, "paid": True}
]

orders = pd.DataFrame(records)
print(orders)

Output:

text
   order_id customer  amount   paid
0       101    Anika     499   True
1       102    Rohan     299  False
2       103     Zoya     799   True

This is often easier to read than a list of lists because every value carries a field name.

7. Creating A DataFrame With A Custom Index

The index identifies rows.

python
students = pd.DataFrame(
    {
        "python_score": [86, 92, 74, 88],
        "sql_score": [78, 85, 81, 90],
        "city": ["Pune", "Kochi", "Delhi", "Pune"]
    },
    index=["aarav", "meera", "ishan", "tara"]
)

print(students)

Output:

text
       python_score  sql_score   city
aarav            86         78   Pune
meera            92         85  Kochi
ishan            74         81  Delhi
tara             88         90   Pune

Custom indexes are useful when your rows already have meaningful labels.

8. Reading A CSV File

In real work, you often read data from files.

python
orders = pd.read_csv("orders.csv")

If the CSV has a column that should become the index:

python
orders = pd.read_csv("orders.csv", index_col="order_id")

If you only want selected columns:

python
orders = pd.read_csv("orders.csv", usecols=["order_id", "customer", "amount"])

If date columns should be parsed:

python
orders = pd.read_csv("orders.csv", parse_dates=["ordered_at"])

For beginners, start with plain pd.read_csv("file.csv"), inspect the result, and then add options only when needed.

9. Inspecting A DataFrame

Let us build one reusable DataFrame for the next few sections.

python
students = pd.DataFrame({
    "name": ["Aarav", "Meera", "Ishan", "Tara", "Kabir", "Nila"],
    "city": ["Pune", "Kochi", "Delhi", "Pune", "Jaipur", "Kochi"],
    "python_score": [86, 92, 74, 88, 67, 95],
    "sql_score": [78, 85, 81, 90, 72, 89],
    "practice_hours": [14, 18, 9, 16, 7, 20],
    "completed": [True, True, False, True, False, True]
})

Shape

python
students.shape

Explanation

  • The shape attribute of a DataFrame returns a tuple representing its dimensions.
  • The first element of the tuple indicates the number of rows (students), while the second element indicates the number of columns (attributes).
  • This is useful for quickly assessing the size of the dataset before performing further analysis or operations.

Output:

text
(6, 6)

This means 6 rows and 6 columns.

Data Types

python
students.dtypes

Explanation

  • The code accesses the dtypes attribute of a Pandas DataFrame named students.
  • It returns a Series object that contains the data type of each column in the DataFrame.
  • This information is useful for understanding the structure of the data and for ensuring that data types are appropriate for analysis.
  • Common data types include int64, float64, object, and datetime64, among others.
  • Analyzing data types can help identify potential issues such as incorrect data formats or the need for type conversions.

This tells you the type of each column.

Common Pandas dtypes:

  • int64
  • float64
  • bool
  • object
  • datetime64
  • category

Many text columns appear as object.

Index

python
students.index

Explanation

  • The students.index attribute retrieves the index (row labels) of the DataFrame named students.
  • This index can be used to identify and access specific rows in the DataFrame.
  • It is particularly useful for operations that require referencing or manipulating data based on row positions.
  • The index can be customized or set to meaningful labels to enhance data readability and management.

The index is the row label system.

Columns

python
students.columns

Explanation

  • Accesses the columns attribute of a pandas DataFrame named students.
  • Returns an Index object containing the names of the columns in the DataFrame.
  • Useful for understanding the structure of the DataFrame and for data manipulation tasks.
  • Can be used in data analysis to quickly identify available features for further processing.

This returns the column names.

Values

python
students.values

Explanation

  • The code snippet retrieves all the values from the students dictionary.
  • students.values returns a view object that displays a list of all the values in the dictionary.
  • This can be useful for iterating over student data or performing operations on the values without needing the keys.
  • To convert the view object into a list, you can wrap it with list(students.values()).

This gives the raw underlying values as an array-like object.

In beginner Pandas work, you usually interact with columns and rows through Pandas methods instead of relying on .values.

10. Previewing Rows

First Rows With head

python
students.head()

Explanation

  • The head() method is called on a DataFrame named students.
  • It returns the first five rows by default, allowing for a quick overview of the data structure and contents.
  • This is useful for verifying data loading and understanding the format of the dataset.
  • Users can specify a different number of rows to display by passing an integer argument to head().

By default, it returns the first 5 rows.

python
students.head(3)

Explanation

  • The head() function is a method used on a pandas DataFrame to display the first few rows of the dataset.
  • In this case, students.head(3) returns the first three rows of the students DataFrame.
  • This is useful for quickly inspecting the structure and contents of the DataFrame without displaying the entire dataset.
  • The number passed to head() can be adjusted to retrieve a different number of rows as needed.

Returns the first 3 rows.

Last Rows With tail

python
students.tail()
students.tail(2)

Explanation

  • The students.tail() function returns the last five rows of the DataFrame named students.
  • By calling students.tail(2), the code retrieves the last two rows specifically, allowing for a focused view of the most recent entries.
  • This method is useful for quickly inspecting the end of a dataset without displaying the entire DataFrame.
  • The tail() function is commonly used in data analysis to check for data integrity and trends at the end of the dataset.

Random Rows With sample

python
students.sample(2, random_state=42)

Explanation

  • The sample method is used to randomly select a specified number of rows from a DataFrame.
  • In this case, it selects 2 random students from the students DataFrame.
  • The random_state parameter is set to 42, ensuring that the random selection is reproducible across different runs.
  • This is useful for testing or demonstration purposes, as it allows the same subset of data to be retrieved consistently.
  • If random_state were omitted, the selection would vary each time the code is executed.

Use random_state when you want reproducible examples.

11. Using info

info gives a compact technical summary.

python
students.info()

Explanation

  • The info() method is called on a DataFrame named students.
  • It provides a concise summary of the DataFrame, including the number of entries, column names, data types, and memory usage.
  • This method is useful for quickly assessing the structure and data types of the dataset before performing further analysis.
  • It helps identify any missing values or inconsistencies in the data.

It shows:

  • row count
  • column names
  • non-null counts
  • data types
  • memory usage

This is one of the first commands you should run after loading a new dataset.

12. Using describe

describe summarizes numeric columns by default.

python
students.describe()

Explanation

  • The describe() method is called on the students DataFrame, which is typically used in pandas.
  • It generates summary statistics such as count, mean, standard deviation, minimum, and maximum values for numerical columns.
  • For categorical columns, it provides counts of unique values and the most frequent value.
  • This method is useful for quickly understanding the distribution and central tendencies of the data.
  • The output helps in identifying potential outliers and the overall shape of the dataset.

It gives:

  • count
  • mean
  • standard deviation
  • minimum
  • quartiles
  • maximum

For non-numeric columns:

python
students.describe(include="object")

Explanation

  • The describe() method is called on a DataFrame named students.
  • The parameter include="object" specifies that only columns with object data types (typically strings) should be included in the output.
  • This method provides insights such as count, unique values, top values, and frequency of the top values for the specified columns.
  • It is useful for understanding the distribution and characteristics of categorical data in the DataFrame.

For all columns:

python
students.describe(include="all")

Explanation

  • The describe() method is called on a DataFrame named students to generate descriptive statistics.
  • The parameter include="all" ensures that the summary includes statistics for all columns, regardless of their data types.
  • For numeric columns, it provides metrics like count, mean, standard deviation, min, max, and quartiles.
  • For categorical columns, it includes counts of unique values, the most frequent value, and its frequency.
  • This method is useful for quickly understanding the distribution and characteristics of the dataset.

13. Checking Missing Values

Missing values are common in real data.

python
students.isna().sum()

Explanation

  • The isna() function is called on the students DataFrame to identify missing values, returning a DataFrame of the same shape with boolean values.
  • The sum() function is then applied to this boolean DataFrame, which counts the number of True values (missing entries) for each column.
  • The result is a Series object where the index corresponds to the column names and the values represent the count of missing entries in each column.
  • This is useful for data cleaning and preprocessing, allowing users to quickly assess the completeness of their dataset.

You may also see:

python
students.isnull().sum()

Explanation

  • The isnull() method is called on the students DataFrame to identify missing values, returning a DataFrame of the same shape with boolean values.
  • The sum() method is then applied to this boolean DataFrame, which counts the True values (indicating missing data) for each column.
  • The result is a Series object where the index corresponds to the column names and the values represent the count of missing entries in each column.
  • This is useful for data cleaning and preprocessing, allowing users to quickly assess the completeness of their dataset.

In Pandas, isna and isnull are commonly used for the same purpose.

14. Checking Duplicate Rows

python
students.duplicated().sum()

Explanation

  • The duplicated() method is called on the students DataFrame to identify duplicate rows.
  • It returns a boolean Series where True indicates a duplicate row.
  • The sum() function then counts the number of True values in the Series, effectively giving the total number of duplicate entries.
  • This code is useful for data cleaning and ensuring the integrity of the dataset by identifying redundancy.

This counts complete duplicate rows.

To see duplicated rows:

python
students[students.duplicated()]

Explanation

  • The code uses the duplicated() method from the pandas library to check for duplicate rows in the students DataFrame.
  • It returns a boolean Series indicating whether each row is a duplicate of a previous row.
  • By indexing the students DataFrame with this boolean Series, it filters and displays only the duplicate entries.
  • This is useful for data cleaning and ensuring the integrity of student records.

To remove duplicate rows:

python
students = students.drop_duplicates()

Explanation

  • The drop_duplicates() method is called on the students DataFrame.
  • It identifies and removes any rows that are exact duplicates of each other.
  • The operation helps ensure that each student is represented only once in the DataFrame.
  • The result is a cleaner dataset, which is essential for accurate data analysis and reporting.

15. Renaming Columns

Readable column names make your code easier to understand.

python
students = students.rename(columns={
    "python_score": "python",
    "sql_score": "sql",
    "practice_hours": "hours"
})

Explanation

  • The rename method is used to change the names of specific columns in the students DataFrame.
  • The columns parameter takes a dictionary where the keys are the current column names and the values are the new names.
  • In this case, "python_score" is renamed to "python", "sql_score" to "sql", and "practice_hours" to "hours".
  • This operation helps in making the DataFrame more user-friendly and easier to work with in subsequent analyses.

You can also rename in place:

python
students.rename(columns={"hours": "study_hours"}, inplace=True)

Explanation

  • The rename method is used to change the name of a specific column in a pandas DataFrame.
  • The columns parameter takes a dictionary where the key is the current column name and the value is the new name.
  • In this case, the column "hours" is renamed to "study_hours".
  • The inplace=True argument modifies the original DataFrame directly without needing to create a copy.

Many teams avoid inplace=True because returning a new DataFrame is easier to reason about in longer pipelines.

16. Math Across Columns And Rows

Create a small score DataFrame:

python
scores = pd.DataFrame({
    "python": [86, 92, 74],
    "sql": [78, 85, 81],
    "excel": [90, 88, 76]
}, index=["Aarav", "Meera", "Ishan"])

Explanation

  • Initializes a pandas DataFrame named scores to organize student scores across different subjects.
  • Contains three subjects: "python", "sql", and "excel", each with a list of scores.
  • Uses a custom index with student names: "Aarav", "Meera", and "Ishan" to label the rows.
  • Facilitates easy data manipulation and analysis for educational performance tracking.

Column-wise sum:

python
scores.sum(axis=0)

Explanation

  • The scores variable is expected to be a NumPy array or a similar data structure.
  • The sum method computes the total of the array elements.
  • The axis=0 argument specifies that the summation should be performed column-wise, aggregating values from each row.
  • The result will be a new array containing the sum of each column in the original array.

Row-wise sum:

python
scores.sum(axis=1)

Explanation

  • The scores variable is expected to be a 2D array or matrix, where each row represents a different set of scores.
  • The sum method is called with the parameter axis=1, indicating that the summation should be performed across rows.
  • The result will be a 1D array containing the total score for each row, effectively aggregating the scores horizontally.
  • This operation is commonly used in data analysis to compute totals or aggregates for each observation or sample.

Row-wise average:

python
scores.mean(axis=1)

Explanation

  • The mean function computes the average of the elements along the specified axis.
  • By setting axis=1, the function calculates the mean for each row, treating each row as a separate dataset.
  • This is useful for obtaining a single average score per entity represented by the rows in the array.
  • The output will be a 1D array containing the mean values for each row.

Other useful methods:

python
scores.median(axis=1)
scores.std(axis=1)
scores.var(axis=1)
scores.min(axis=1)
scores.max(axis=1)

Explanation

  • Computes the median of scores across each row using median(axis=1).
  • Calculates the standard deviation of scores for each row with std(axis=1).
  • Determines the variance of scores in each row using var(axis=1).
  • Finds the minimum score in each row with min(axis=1).
  • Identifies the maximum score in each row using max(axis=1).

Remember:

text
axis=0 means down the rows, producing one result per column
axis=1 means across the columns, producing one result per row

17. Selecting One Column

python
students["name"]

Explanation

  • This code snippet retrieves the value associated with the key "name" from the students dictionary.
  • It assumes that students is a dictionary where each key corresponds to a student's identifier and the value is another dictionary containing student details.
  • If "name" is not a valid key in the dictionary, it will raise a KeyError.
  • This operation is commonly used to extract specific information about a student for further processing or display.

This returns a Series.

You can verify:

python
type(students["name"])

Explanation

  • The code accesses the 'name' key from the 'students' dictionary.
  • It uses the type() function to retrieve the data type of the value associated with the 'name' key.
  • This is useful for validating the expected data structure and ensuring data integrity within the dictionary.

Use bracket syntax for column names because it works with spaces and special characters.

python
students["practice_hours"]

Explanation

  • This code snippet retrieves the "practice_hours" data from a collection named "students".
  • It assumes that "students" is a dictionary or a similar data structure where "practice_hours" is a key.
  • The output will be the value associated with "practice_hours", which could be a list, integer, or any other data type.
  • This operation is useful for analyzing or displaying the amount of practice time logged by each student.

Dot syntax can be convenient:

python
students.city

Explanation

  • This code snippet retrieves the 'city' attribute from the 'students' object.
  • It assumes that 'students' is an instance of a class or a data structure that has a defined 'city' property.
  • The value returned will depend on the specific implementation of the 'students' object.
  • This is a common way to access object attributes in Python, allowing for easy retrieval of stored data.

But dot syntax can break when column names have spaces, conflict with methods, or are created dynamically.

Prefer bracket syntax in teaching and production code.

18. Selecting Multiple Columns

Use a list of column names inside brackets.

python
students[["name", "city", "python_score"]]

Explanation

  • This code snippet accesses a DataFrame named students.
  • It retrieves only the columns "name", "city", and "python_score".
  • The result is a new DataFrame containing only the specified columns for further analysis or display.
  • This operation is useful for focusing on relevant data without altering the original DataFrame.

This returns a DataFrame.

Column order follows the list you pass:

python
students[["python_score", "name"]]

Explanation

  • This code snippet accesses a DataFrame named students.
  • It selects and displays only the "python_score" and "name" columns.
  • The result is a new DataFrame containing only the specified data for further analysis or display.
  • This operation is useful for focusing on relevant information without altering the original DataFrame.

19. Selecting Rows With iloc

iloc selects by integer position.

python
students.iloc[0]

Explanation

  • The code retrieves the first row of the DataFrame named students using the iloc indexer.
  • iloc[0] specifically targets the first row, as indexing starts at 0 in Python.
  • This operation returns a Series object containing all the column values for the first student in the DataFrame.
  • It is useful for quickly inspecting the data of the first entry without needing to display the entire DataFrame.

This selects the first row.

Multiple rows:

python
students.iloc[0:3]

Explanation

  • students is a pandas DataFrame that likely contains information about students.
  • The iloc method is used for integer-location based indexing to select rows.
  • The slice 0:3 specifies that the code should return rows at index 0, 1, and 2, effectively giving the first three entries.
  • This operation is useful for quickly viewing a subset of data for analysis or debugging.

Rows at specific positions:

python
students.iloc[[0, 2, 5]]

Explanation

  • The code utilizes the iloc method from the pandas library to access specific rows of a DataFrame named students.
  • It retrieves the rows at index positions 0, 2, and 5, allowing for targeted data extraction.
  • This method is particularly useful for accessing rows by their integer location rather than by label.
  • The resulting output will be a new DataFrame containing only the selected rows.
  • Ensure that the indices specified exist within the DataFrame to avoid an IndexError.

Last row:

python
students.iloc[-1]

Explanation

  • The code uses the iloc method from the pandas library to access data by integer-location based indexing.
  • -1 indicates that the last row of the DataFrame students will be selected.
  • This is useful for quickly viewing the most recent entry or record in the dataset.
  • The returned output will be a Series object representing the last student's data.

In iloc, slices behave like normal Python slices. The stop position is excluded.

python
students.iloc[1:4]

Explanation

  • The iloc method is used for integer-location based indexing in pandas DataFrames.
  • The slice 1:4 indicates that it will return rows with index 1, 2, and 3 (the end index is exclusive).
  • This allows for easy selection of a subset of data for analysis or manipulation.
  • The resulting output will be a new DataFrame containing only the specified rows.

This selects positions 1, 2, and 3.

20. Selecting Rows With loc

loc selects by index label.

Create a labeled DataFrame:

python
students_by_id = students.set_index("name")

Explanation

  • The set_index method is used to change the index of the DataFrame students.
  • The "name" column is specified as the new index, allowing for more intuitive data retrieval based on student names.
  • After executing this line, the DataFrame will be indexed by student names instead of the default integer index.
  • This can improve performance and readability when performing operations that involve looking up students by their names.

Select a row by label:

python
students_by_id.loc["Meera"]

Explanation

  • The code snippet uses the loc method from the pandas library to retrieve data.
  • It targets the row corresponding to the index label "Meera" in the students_by_id DataFrame.
  • This operation returns all columns for the student identified by the index "Meera".
  • It is useful for quickly accessing information about a specific student based on their unique identifier.

Select multiple labels:

python
students_by_id.loc[["Aarav", "Tara", "Nila"]]

Explanation

  • The code accesses a DataFrame named students_by_id to retrieve data.
  • It uses the .loc[] method, which allows for label-based indexing to select rows.
  • The list ["Aarav", "Tara", "Nila"] specifies the IDs of the students whose records are to be fetched.
  • This operation returns the rows corresponding to the specified student IDs, enabling focused data analysis.

Label slicing:

python
students_by_id.loc["Meera":"Kabir"]

Explanation

  • Utilizes the loc method to access rows in a pandas DataFrame based on index labels.
  • The range specified, from "Meera" to "Kabir", includes all rows with index labels within this range.
  • This operation is useful for filtering data based on specific criteria or for viewing a subset of records.
  • Assumes that the DataFrame students_by_id is indexed by student names or IDs.
  • The output will include all records between the specified start and end labels, inclusive.

Important difference:

text
loc label slices include the stop label when it exists.
iloc position slices exclude the stop position.

This is a common beginner trap.

21. Selecting Rows And Columns Together

Use iloc when both rows and columns are positions.

python
students.iloc[0:3, 0:2]

Explanation

  • Utilizes the iloc method from the pandas library to access specific rows and columns by integer-location based indexing.
  • The slice 0:3 indicates that it selects rows 0, 1, and 2, effectively retrieving the first three rows.
  • The slice 0:2 specifies that it selects columns 0 and 1, thus retrieving the first two columns of the DataFrame.
  • This operation is useful for quickly examining a subset of data within a larger DataFrame.
  • The result is a new DataFrame containing only the specified rows and columns.

This means:

  • rows from position 0 to 2
  • columns from position 0 to 1

Use loc when both rows and columns are labels.

python
students_by_id.loc["Aarav":"Tara", ["city", "python_score"]]

Explanation

  • Uses the loc method to access rows and columns in a pandas DataFrame.
  • Selects rows from "Aarav" to "Tara" based on the index labels.
  • Retrieves only the "city" and "python_score" columns for the specified range of students.
  • Assumes that the DataFrame students_by_id is indexed by student names or IDs.
  • Useful for filtering data to analyze specific attributes of a subset of students.

You can also filter rows and select columns in one clean expression:

python
students.loc[students["city"] == "Pune", ["name", "python_score"]]

Explanation

  • Uses the loc method from the pandas library to filter rows in the students DataFrame.
  • The condition checks for students whose "city" column matches "Pune".
  • Only the "name" and "python_score" columns are selected for the output.
  • This operation is useful for analyzing performance based on geographical location.
  • The result is a subset of the original DataFrame containing relevant information for further analysis.

This is one of the most useful DataFrame patterns.

22. Filtering Rows With One Condition

Find students with Python score above 85:

python
high_python = students[students["python_score"] > 85]

Explanation

  • The code filters a DataFrame named students to identify those with a Python score greater than 85.
  • It uses boolean indexing to create a new DataFrame called high_python.
  • This operation allows for easy analysis of students who excel in Python, facilitating targeted educational strategies.
  • The resulting DataFrame high_python contains only the rows that meet the specified condition.

Find students from Pune:

python
pune_students = students[students["city"] == "Pune"]

Explanation

  • The code snippet filters a DataFrame named students to create a new DataFrame pune_students.
  • It uses a boolean condition to check if the "city" column equals "Pune".
  • Only the rows where the condition is true are included in the pune_students DataFrame.
  • This operation is useful for analyzing or processing data specific to students located in Pune.

Find students who completed the course:

python
completed = students[students["completed"] == True]

Explanation

  • The code filters a DataFrame named students to create a new DataFrame called completed.
  • It selects only the rows where the "completed" column has a value of True.
  • This is useful for analyzing or processing only those students who have finished their assignments or tasks.
  • The resulting completed DataFrame contains all relevant information about the students who have completed their tasks.

Cleaner version:

python
completed = students[students["completed"]]

Explanation

  • This code snippet filters a DataFrame named students to create a new variable completed.
  • It selects only the rows where the "completed" column is truthy, indicating that those students have completed their tasks.
  • The result is a subset of the original DataFrame containing only the relevant entries.
  • This operation is useful for analyzing or processing data related to students who have finished their assignments or courses.

23. Filtering With Multiple Conditions

Use & for AND.

python
strong_and_consistent = students[
    (students["python_score"] >= 85) &
    (students["sql_score"] >= 85)
]

Explanation

  • The code filters a DataFrame named students to identify those with high scores in both Python and SQL.
  • It uses boolean indexing to select rows where the python_score is greater than or equal to 85 and the sql_score is also greater than or equal to 85.
  • The result is stored in the variable strong_and_consistent, which contains only the students meeting these criteria.
  • This approach is useful for analyzing student performance and identifying top performers in specific subjects.

Use | for OR.

python
needs_attention = students[
    (students["python_score"] < 75) |
    (students["practice_hours"] < 10)
]

Explanation

  • The code creates a new DataFrame called needs_attention from an existing DataFrame students.
  • It selects students whose python_score is less than 75 or who have logged less than 10 practice hours.
  • The logical operator | is used to combine the two conditions, meaning either condition being true will include the student in the result.
  • This filtering helps identify students who may need extra help in their studies.

Important rule:

text
When combining conditions with & or |, wrap each condition in parentheses.

Do not write:

python
# Avoid this
students[students["python_score"] >= 85 & students["sql_score"] >= 85]

Explanation

  • The code attempts to filter a DataFrame of students based on their scores in Python and SQL.
  • It uses the bitwise AND operator & incorrectly with the comparison operators, which can lead to unexpected results.
  • The correct approach should use parentheses around each condition to ensure proper evaluation order.
  • The intended functionality is to select students with scores of 85 or higher in both subjects.
  • Proper syntax would be: students[(students["python_score"] >= 85) & (students["sql_score"] >= 85)].

The operator precedence will not behave the way you expect.

24. Filtering With isin

Use isin when a column should match any value from a list.

python
south_or_west = students[students["city"].isin(["Pune", "Kochi"])]

Explanation

  • The code filters a DataFrame named students to select only those entries where the "city" column matches either "Pune" or "Kochi".
  • The isin() method is used to check if the values in the "city" column are present in the specified list of cities.
  • The result is stored in the variable south_or_west, which will contain a subset of the original DataFrame.
  • This operation is useful for analyzing or processing data related to specific geographic locations.

This is cleaner than writing many OR conditions.

25. Filtering With between

Use between for ranges.

python
middle_scores = students[students["python_score"].between(75, 90)]

Explanation

  • The code filters a DataFrame named students to select only those rows where the python_score is between 75 and 90.
  • The between method is used to check if the python_score falls within the inclusive range of 75 to 90.
  • The result is stored in the variable middle_scores, which will contain only the students who meet this score criterion.
  • This operation is useful for analyzing a subset of students who are performing at a moderate level in Python.

By default, the endpoints are included.

26. Filtering Text With String Methods

Pandas gives string methods through .str.

python
students[students["name"].str.startswith("M")]

Explanation

  • Utilizes the pandas library to filter a DataFrame named students.
  • The str.startswith("M") method checks if the 'name' column values begin with the letter 'M'.
  • The result is a subset of the original DataFrame containing only the rows that meet this condition.
  • This operation is useful for quickly identifying specific groups of data based on string patterns.

Case-insensitive contains:

python
students[students["city"].str.contains("pu", case=False, na=False)]

Explanation

  • The code accesses a DataFrame named students which contains student data.
  • It uses the str.contains() method to check if the "city" column contains the substring "pu".
  • The case=False argument ensures that the search is case insensitive, allowing matches for "Pu", "pu", "PU", etc.
  • The na=False argument prevents errors by treating NaN values as False, meaning they won't be included in the results.
  • The result is a filtered DataFrame containing only the rows where the "city" column matches the specified condition.

Use na=False when the column may contain missing values.

27. Adding A New Column

Add a constant column:

python
students["country"] = "India"

Explanation

  • The code modifies the students dictionary by adding or updating the key "country".
  • All entries in the students dictionary will have their "country" value set to "India".
  • This operation is useful for bulk updating information for a group of students.
  • If "country" already exists, its value will be overwritten with "India".

Add a calculated column:

python
students["total_score"] = students["python_score"] + students["sql_score"]

Explanation

  • The code accesses a DataFrame named students, which contains individual student scores.
  • It creates a new column called total_score in the DataFrame.
  • The total_score is computed by adding the values from the python_score and sql_score columns for each student.
  • This operation is vectorized, meaning it efficiently processes all rows in the DataFrame simultaneously.
  • The resulting total_score column provides a quick reference for the overall performance of each student in both subjects.

Add an average:

python
students["average_score"] = students[["python_score", "sql_score"]].mean(axis=1)

Explanation

  • The code snippet adds a new column named "average_score" to the students DataFrame.
  • It computes the mean of the scores from the "python_score" and "sql_score" columns.
  • The mean(axis=1) function calculates the average across the specified columns for each row.
  • This operation allows for easy assessment of student performance in both subjects.

Add a category from a condition:

python
students["performance"] = np.where(
    students["average_score"] >= 85,
    "strong",
    "practice_more"
)

Explanation

  • The code uses NumPy's where function to create a new column in the students DataFrame called "performance".
  • It assigns the value "strong" to students whose "average_score" is 85 or higher.
  • For students with an "average_score" below 85, it assigns the value "practice_more".
  • This operation allows for efficient conditional logic application across the DataFrame.

28. Updating Existing Columns

You can update text values:

python
students["city"] = students["city"].str.title()

Explanation

  • The code accesses the "city" column of the students DataFrame.
  • It applies the str.title() method, which converts each city name to title case (capitalizing the first letter of each word).
  • This transformation helps standardize city names, improving data quality and readability.
  • The modified city names are then reassigned back to the "city" column in the DataFrame.

You can normalize inconsistent labels:

python
students["city"] = students["city"].replace({
    "Bengaluru": "Bangalore",
    "Bombay": "Mumbai"
})

Explanation

  • The code modifies the "city" column in the students DataFrame.
  • It uses the replace method to change specific city names to their standardized versions.
  • "Bengaluru" is replaced with "Bangalore" and "Bombay" with "Mumbai".
  • This ensures uniformity in city names, which is important for data analysis and reporting.
  • The operation is performed in-place, updating the original DataFrame directly.

You can update numeric values:

python
students["practice_hours"] = students["practice_hours"].clip(lower=0)

Explanation

  • The code modifies the "practice_hours" column in the "students" DataFrame.
  • It uses the clip method to set a lower limit of 0 for the values in this column.
  • Any negative values in "practice_hours" will be replaced with 0, ensuring all entries are non-negative.
  • This is useful for data cleaning, preventing unrealistic negative practice hours from affecting analysis.

29. Converting Data Types

Use astype when the conversion is direct.

python
students["completed"] = students["completed"].astype("bool")

Explanation

  • The code modifies the "completed" column in the students DataFrame.
  • It uses the astype method to change the data type of the "completed" column to boolean.
  • This conversion allows for more efficient storage and logical operations on the column's values.
  • The operation is performed in-place, meaning the original DataFrame is updated directly.

Convert a repeated text column to category:

python
students["city"] = students["city"].astype("category")

Explanation

  • The code modifies the "city" column of the students DataFrame.
  • It uses the astype method to change the data type of the column to "category".
  • Categorical data types are more memory-efficient and can improve performance for certain operations.
  • This conversion is particularly useful when the column contains a limited number of unique values.

This can reduce memory for columns with repeated values.

For messy numeric text, use pd.to_numeric.

python
raw = pd.DataFrame({
    "amount": ["499", "299", "missing", "799"]
})

raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")

Explanation

  • A pandas DataFrame named raw is created with a column "amount" containing string representations of numbers and a missing value.
  • The pd.to_numeric() function is applied to the "amount" column to convert its values to numeric types.
  • The errors="coerce" parameter ensures that any non-convertible values (like "missing") are replaced with NaN (Not a Number) instead of raising an error.
  • This approach is useful for data cleaning, allowing for easier numerical analysis and operations on the DataFrame.

Invalid values become missing values.

30. Handling Missing Values

Create an example:

python
feedback = pd.DataFrame({
    "student": ["Aarav", "Meera", "Ishan", "Tara", "Kabir"],
    "rating": [5, np.nan, 4, np.nan, 3],
    "comment": ["Clear", "Helpful", None, "Practical", None]
})

Explanation

  • A pandas DataFrame named feedback is created to organize student feedback data.
  • The DataFrame contains three columns: student, rating, and comment.
  • The student column lists the names of five students.
  • The rating column includes numerical ratings, with some entries as NaN (not a number) indicating missing values.
  • The comment column contains textual feedback, with some entries as None, representing absent comments.

Check missing values:

python
feedback.isna().sum()

Explanation

  • The isna() function checks for missing values (NaNs) in the DataFrame feedback.
  • The sum() function aggregates the boolean results from isna(), counting the total number of missing entries for each column.
  • This code snippet is useful for data cleaning and preprocessing, allowing users to quickly identify columns that require attention due to missing data.

Fill missing numeric values with median:

python
feedback["rating"] = feedback["rating"].fillna(feedback["rating"].median())

Explanation

  • The code accesses the 'rating' column of the feedback DataFrame.
  • It uses the fillna() method to fill any NaN (missing) values in the 'rating' column.
  • The missing values are replaced with the median of the existing ratings, calculated using feedback["rating"].median().
  • This approach helps maintain the integrity of the dataset by ensuring that all entries have a valid rating.
  • Using the median is beneficial as it is less affected by outliers compared to the mean.

Fill missing text:

python
feedback["comment"] = feedback["comment"].fillna("No comment")

Explanation

  • The code accesses the "comment" column of the feedback DataFrame.
  • It uses the fillna() method to identify and replace any NaN (missing) values.
  • The default value used for replacement is the string "No comment".
  • This ensures that all entries in the "comment" column have a valid string, improving data consistency.

Drop rows with missing values:

python
feedback.dropna()

Explanation

  • The dropna() method is called on the DataFrame feedback.
  • It eliminates any rows that contain at least one NaN (Not a Number) value.
  • The operation modifies the DataFrame in place unless specified otherwise.
  • This is useful for cleaning data before analysis to ensure completeness.
  • No parameters are provided, so it defaults to removing rows with any missing values.

Drop rows only when a specific column is missing:

python
feedback.dropna(subset=["rating"])

Explanation

  • The dropna method is used to eliminate rows that contain NaN (Not a Number) values.
  • The subset parameter specifies that only the 'rating' column should be checked for missing values.
  • Rows where 'rating' is NaN will be dropped from the DataFrame named feedback.
  • This operation helps in cleaning the dataset for analysis by ensuring that only complete entries are retained.

Use dropping carefully. Sometimes missing data is itself meaningful.

31. Sorting Rows

Sort by one column:

python
students.sort_values("python_score", ascending=False)

Explanation

  • The code snippet utilizes the sort_values method from a DataFrame to organize student records.
  • It specifically targets the "python_score" column for sorting.
  • The ascending=False parameter ensures that the highest scores appear first in the sorted list.
  • This operation modifies the order of the DataFrame in place, allowing for easy access to top-performing students.

Sort by multiple columns:

python
students.sort_values(
    ["city", "python_score"],
    ascending=[True, False]
)

Explanation

  • The sort_values method is used to sort the DataFrame named students.
  • The sorting is performed first by the city column in ascending order.
  • Within each city, the python_score column is sorted in descending order.
  • This allows for a structured view of students, grouped by their city and ranked by their Python scores.
  • The result is a DataFrame that prioritizes city organization while highlighting top performers in Python within each city.

32. Counting Categories

Use value_counts on a Series.

python
students["city"].value_counts()

Explanation

  • The code accesses the "city" column from the students DataFrame.
  • It uses the value_counts() method to count the number of occurrences of each unique city.
  • The result is a Series object that displays cities as the index and their respective counts as values.
  • This is useful for quickly analyzing the distribution of students across different cities.
  • The output can help in understanding demographic information or planning resources based on student locations.

Normalize to percentages:

python
students["city"].value_counts(normalize=True) * 100

Explanation

  • The code accesses the "city" column from a DataFrame named students.
  • It uses the value_counts() method to count the occurrences of each unique city.
  • The normalize=True parameter converts these counts into proportions of the total.
  • Multiplying by 100 converts the proportions into percentages, providing a clearer view of city distribution.
  • The result is a Series showing the percentage of students from each city.

Include missing values:

python
students["city"].value_counts(dropna=False)

Explanation

  • The code accesses the "city" column from a DataFrame named students.
  • It uses the value_counts() method to count the unique occurrences of each city.
  • The parameter dropna=False ensures that missing values (NaN) are included in the count.
  • This is useful for understanding the distribution of students across different cities, including those without specified locations.

33. Grouping Data With groupby

groupby helps you answer questions by category.

Average score by city:

python
students.groupby("city")["python_score"].mean()

Explanation

  • The code uses the groupby method from the Pandas library to group student data by the "city" column.
  • It selects the "python_score" column to focus on the scores of students in each city.
  • The mean() function computes the average Python score for each city group.
  • This operation results in a Series with cities as the index and their corresponding average Python scores as values.
  • It is useful for analyzing performance trends based on geographic location.

Multiple aggregations:

python
students.groupby("city").agg(
    learners=("name", "count"),
    avg_python=("python_score", "mean"),
    avg_sql=("sql_score", "mean"),
    total_hours=("practice_hours", "sum")
)

Explanation

  • The code uses the groupby method to organize student data based on the "city" column.
  • It aggregates the data to compute the number of learners, average Python scores, average SQL scores, and total practice hours for each city.
  • The agg function specifies multiple aggregation operations, including counting names, calculating means for scores, and summing practice hours.
  • The results are returned as a DataFrame with new column names for clarity, such as "learners", "avg_python", "avg_sql", and "total_hours".

Sort the result:

python
students.groupby("city")["practice_hours"].sum().sort_values(ascending=False)

Explanation

  • The code uses the groupby method to categorize student data based on the "city" column.
  • It then aggregates the "practice_hours" for each city using the sum() function.
  • The results are sorted in descending order with sort_values(ascending=False), allowing for easy identification of cities with the highest total practice hours.
  • This operation is useful for analyzing regional engagement in practice activities among students.

34. Mini Project 1: Course Enrollment Table

Create a small course enrollment dataset.

python
enrollments = pd.DataFrame({
    "learner": ["Aarav", "Meera", "Ishan", "Tara", "Kabir", "Nila", "Rohan"],
    "track": ["Python", "Pandas", "Python", "SQL", "Pandas", "Python", "SQL"],
    "level": ["beginner", "beginner", "beginner", "intermediate", "beginner", "intermediate", "beginner"],
    "minutes_watched": [320, 410, 180, 260, 390, 520, 150],
    "quiz_score": [82, 91, 67, 74, 88, 95, 61],
    "certificate": [True, True, False, False, True, True, False]
})

Explanation

  • This code snippet initializes a pandas DataFrame named enrollments to manage data related to learners in various tracks.
  • It includes columns for learner names, the track they are enrolled in, their level of expertise, minutes_watched for course content, quiz_score achieved, and whether they received a certificate.
  • The data is structured as a dictionary, where each key corresponds to a column name and the associated list contains the respective values for each learner.
  • This DataFrame can be used for further analysis, such as calculating average scores or filtering learners based on their progress.

Practice Task

Question 1: Which learners watched more than 300 minutes?

python
enrollments[enrollments["minutes_watched"] > 300]

Explanation

  • This code snippet filters a DataFrame named enrollments.
  • It selects only the rows where the value in the "minutes_watched" column is greater than 300.
  • The result is a new DataFrame containing only the enrollments with significant viewing time.
  • This operation is useful for analyzing user engagement or identifying highly active users.

Practice Task

Question 2: Which beginner learners scored at least 80?

python
enrollments[
    (enrollments["level"] == "beginner") &
    (enrollments["quiz_score"] >= 80)
]

Explanation

  • The code filters a DataFrame named enrollments to select rows that meet specific criteria.
  • It checks for entries where the "level" column is equal to "beginner".
  • Additionally, it ensures that the "quiz_score" column has values greater than or equal to 80.
  • The result is a subset of the original DataFrame containing only the qualifying records.

Practice Task

Question 3: What is the average score per track?

python
enrollments.groupby("track")["quiz_score"].mean()

Explanation

  • The code uses the groupby method to organize data based on the "track" column.
  • It selects the "quiz_score" column to perform calculations on.
  • The mean() function computes the average quiz score for each unique track.
  • This operation is useful for analyzing performance trends across different tracks.
  • The result is a new Series with track names as the index and their corresponding average quiz scores.

Practice Task

Question 4: Which track has the most certificates?

python
enrollments.groupby("track")["certificate"].sum().sort_values(ascending=False)

Explanation

  • Groups the enrollments DataFrame by the track column.
  • Sums the certificate values for each group, providing the total certificates earned per track.
  • Sorts the resulting sums in descending order to highlight the tracks with the highest certificate totals.

Practice Task

Question 5: Add an engagement column

python
enrollments["engagement"] = np.where(
    enrollments["minutes_watched"] >= 300,
    "high",
    "low"
)

Explanation

  • Utilizes the np.where function from the NumPy library to create a new column called "engagement" in the enrollments DataFrame.
  • Checks if the value in the "minutes_watched" column is greater than or equal to 300.
  • Assigns the label "high" for users who meet or exceed 300 minutes watched, and "low" for those who do not.
  • This categorization helps in analyzing user engagement levels effectively.

This mini project covers filtering, grouping, sorting, and column creation.

35. Mini Project 2: Cleaning Store Orders

Create a messy order table.

python
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 3, 4, 5, 6],
    "customer": ["Anika", "Rohan", "Zoya", "Zoya", "Kabir", "Meera", "Ishan"],
    "city": ["pune", "Delhi", "PUNE", "PUNE", None, "Kochi", "delhi"],
    "amount": ["499", "299", "799", "799", "bad", "999", "450"],
    "status": ["paid", "pending", "paid", "paid", "paid", None, "pending"]
})

Explanation

  • Initializes a pandas DataFrame named orders containing order details such as order ID, customer name, city, amount, and payment status.
  • The order_id column includes duplicate entries, indicating multiple orders from the same customer.
  • The city column contains inconsistent casing (e.g., "PUNE" vs "pune") and a missing value (None).
  • The amount column has a non-numeric entry ("bad"), which may lead to data type issues during analysis.
  • The status column includes both valid entries and a missing value (None), which may require handling for accurate reporting.

Step 1: Normalize city names

python
orders["city"] = orders["city"].str.title()

Explanation

  • The code accesses the "city" column of the orders DataFrame.
  • It applies the str.title() method, which converts each city name to title case (capitalizing the first letter of each word).
  • The formatted city names are then reassigned back to the "city" column, updating the DataFrame in place.
  • This transformation ensures consistency in the representation of city names for better readability and data analysis.

Step 2: Convert amount to numeric

python
orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce")

Explanation

  • The code uses the pd.to_numeric() function from the Pandas library to convert the "amount" column in the orders DataFrame to numeric data types.
  • The errors="coerce" parameter ensures that any values that cannot be converted to numbers are replaced with NaN (Not a Number) instead of raising an error.
  • This approach is useful for cleaning data, as it allows for the identification and handling of invalid entries in the "amount" column.
  • After execution, the "amount" column will contain only numeric values or NaN, making it suitable for further analysis or calculations.

Step 3: Fill missing status

python
orders["status"] = orders["status"].fillna("unknown")

Explanation

  • The code accesses the 'status' column of the DataFrame named orders.
  • It uses the fillna() method to replace any NaN (missing) values in that column.
  • The replacement value for missing entries is set to the string "unknown".
  • This operation helps in maintaining data integrity by ensuring there are no missing values in the 'status' column.
  • The updated DataFrame retains all other data while only modifying the specified column.

Step 4: Remove duplicate orders

python
orders = orders.drop_duplicates(subset=["order_id"])

Explanation

  • The drop_duplicates method is called on the orders DataFrame.
  • The subset parameter specifies that duplicates should be identified based on the order_id column.
  • Only the first occurrence of each unique order_id is retained in the DataFrame.
  • This operation helps in cleaning the data by ensuring that each order is represented only once.
  • The result is a DataFrame with unique order entries, which is crucial for accurate data analysis.

Step 5: Find paid orders above 400

python
orders[
    (orders["status"] == "paid") &
    (orders["amount"] > 400)
]

Explanation

  • The code filters a DataFrame named orders to retrieve specific rows based on conditions.
  • It checks for rows where the status column is equal to "paid".
  • Additionally, it ensures that the amount column has values greater than 400.
  • The result is a subset of the orders DataFrame that meets both criteria, useful for analyzing high-value transactions.

This is closer to real work because data is rarely clean when it arrives.

36. Practice Dataset: Workshop Attendance

Use this dataset for the exercises.

python
attendance = pd.DataFrame({
    "learner": ["Aarav", "Meera", "Ishan", "Tara", "Kabir", "Nila", "Rohan", "Zoya", "Aarav"],
    "topic": ["Python", "Pandas", "Python", "SQL", "Pandas", "Python", "SQL", "Pandas", "Python"],
    "age": [21, 24, np.nan, 23, 26, 22, np.nan, 25, 21],
    "visits": [3, 5, 2, np.nan, 4, 6, 1, 5, 3],
    "score": [78, 91, 65, 72, 88, 95, 59, 84, 78],
    "city": ["Pune", "Kochi", "Delhi", "Pune", "Jaipur", "Kochi", "Delhi", "Pune", "Pune"]
})

Explanation

  • Initializes a pandas DataFrame named attendance to store information about learners.
  • Contains columns for learner names, topic of study, age, visits to sessions, score achieved, and city of residence.
  • Includes missing values represented by np.nan for the age and visits columns, indicating incomplete data for some learners.
  • The DataFrame is structured to facilitate analysis of attendance patterns and performance metrics across different topics and demographics.

37. Practice Questions With Solutions

Practice Task

Q1. Display basic information about the DataFrame.

python
attendance.info()
attendance.describe()

Explanation

  • The attendance.info() method provides a concise summary of the DataFrame, including the number of entries, column data types, and memory usage.
  • The attendance.describe() method generates descriptive statistics for numerical columns, such as count, mean, standard deviation, min, and max values.
  • Both methods are useful for understanding the structure and distribution of data within the DataFrame before performing further analysis.
  • These functions help identify potential issues like missing values or outliers in the dataset.

Practice Task

Q2. Show alternate rows.

python
attendance.iloc[::2]

Explanation

  • The code uses the iloc method from the pandas library to access rows by their integer index.
  • The slicing notation ::2 indicates that every second row will be selected, starting from the first row (index 0).
  • This is useful for reducing the size of a DataFrame or for sampling data at regular intervals.
  • The resulting output is a new DataFrame containing only the selected rows, which can be further processed or analyzed.

Practice Task

Q3. Show rows at positions 0, 2, and 6 with only learner and age.

python
attendance.loc[attendance.index[[0, 2, 6]], ["learner", "age"]]

Explanation

  • The code uses the loc method from the Pandas library to access specific data in a DataFrame named attendance.
  • It retrieves rows at index positions 0, 2, and 6, which correspond to the first, third, and seventh entries in the DataFrame.
  • The columns selected for output are "learner" and "age", meaning only these two attributes will be displayed for the specified rows.
  • This method is useful for extracting a subset of data for analysis or reporting based on specific criteria.

Or:

python
attendance[["learner", "age"]].iloc[[0, 2, 6]]

Explanation

  • The code snippet utilizes the pandas library to manipulate a DataFrame named attendance.
  • It selects the columns "learner" and "age" from the DataFrame.
  • The iloc method is used to retrieve rows at indices 0, 2, and 6.
  • This operation results in a new DataFrame containing only the specified rows and columns.

Practice Task

Q4. Select rows where visits are less than 4.

python
attendance[attendance["visits"] < 4]

Explanation

  • The code accesses a DataFrame named attendance.
  • It filters the rows where the "visits" column has values less than 4.
  • The result is a new DataFrame containing only those records that meet the specified condition.
  • This operation is useful for identifying individuals who may need additional engagement or support.

Practice Task

Q5. Select rows where age or visits is missing.

python
attendance[
    attendance["age"].isna() |
    attendance["visits"].isna()
]

Explanation

  • The code filters the attendance DataFrame to identify rows where either the age or visits columns contain missing values (NaN).
  • The isna() function is used to check for NaN values in the specified columns.
  • The logical OR operator (|) combines the two conditions, ensuring that any row with a NaN in either column is included in the result.
  • The resulting DataFrame will only contain entries that have incomplete data regarding age or visit counts.

Practice Task

Q6. Fill missing ages with the median age.

python
attendance["age"] = attendance["age"].fillna(attendance["age"].median())

Explanation

  • The code accesses the "age" column of the attendance DataFrame.
  • It uses the fillna() method to replace any NaN (missing) values in the "age" column.
  • The replacement value is calculated using the median() function, which finds the median age from the existing values.
  • This approach helps maintain the integrity of the dataset by avoiding the removal of rows with missing values.
  • Filling missing values with the median is a common practice to reduce bias in statistical analysis.

Practice Task

Q7. Fill missing visits with the most common visits value.

python
attendance["visits"] = attendance["visits"].fillna(attendance["visits"].mode()[0])

Explanation

  • The code accesses the 'visits' column of the 'attendance' DataFrame.
  • It uses the fillna() method to replace any NaN (missing) values.
  • The replacement value is determined by calculating the mode (most frequent value) of the 'visits' column.
  • The mode is accessed using attendance["visits"].mode()[0], which retrieves the first mode value.
  • This approach ensures that the 'visits' column has no missing values, improving data integrity for analysis.

Practice Task

Q8. Find total visits for learners attending Python workshops.

python
attendance[attendance["topic"] == "Python"]["visits"].sum()

Explanation

  • Filters the attendance DataFrame to include only rows where the topic column equals "Python".
  • Accesses the visits column from the filtered DataFrame.
  • Uses the sum() function to compute the total number of visits for the Python topic.
  • Returns a single numeric value representing the aggregated visits.
  • Assumes that the attendance DataFrame is already defined and contains the relevant columns.

Practice Task

Q9. Count learners per topic.

python
attendance["topic"].value_counts()

Explanation

  • The code accesses the "topic" column of the attendance DataFrame.
  • It utilizes the value_counts() method to compute the frequency of each unique value in that column.
  • The result is a Series object that displays topics as the index and their corresponding counts as values.
  • This is useful for quickly understanding the distribution of topics in the attendance data.

Practice Task

Q10. Count duplicate rows.

python
attendance.duplicated().sum()

Explanation

  • The attendance.duplicated() method identifies duplicate rows in the attendance DataFrame, returning a boolean Series.
  • The sum() function then counts the number of True values in this Series, effectively giving the total number of duplicate entries.
  • This code is useful for data cleaning and validation, ensuring that each entry in the attendance record is unique.
  • It helps in maintaining data integrity before performing further analysis or operations on the dataset.

Practice Task

Q11. Drop duplicate rows.

python
attendance = attendance.drop_duplicates()

Explanation

  • The drop_duplicates() method is called on the attendance DataFrame.
  • It eliminates any rows that are exact duplicates of others, ensuring each entry is unique.
  • The operation modifies the DataFrame in place, meaning the original attendance variable will now only contain unique records.
  • This is useful for cleaning data before analysis or reporting, as duplicates can skew results.

Practice Task

Q12. Find average score by topic.

python
attendance.groupby("topic")["score"].mean()

Explanation

  • The code uses the groupby method to organize the data in the attendance DataFrame by the "topic" column.
  • It then selects the "score" column to perform calculations on.
  • The mean() function computes the average score for each unique topic.
  • This operation is useful for summarizing performance metrics across different topics.
  • The result is a new Series with topics as the index and their corresponding average scores as values.

Practice Task

Q13. Find top cities by total visits.

python
attendance.groupby("city")["visits"].sum().sort_values(ascending=False)

Explanation

  • The code uses the groupby method to aggregate data in the attendance DataFrame based on the unique values in the "city" column.
  • It selects the "visits" column to calculate the total number of visits for each city.
  • The sum() function computes the total visits for each group (city).
  • Finally, sort_values(ascending=False) arranges the resulting totals in descending order, allowing easy identification of cities with the highest visits.

Practice Task

Q14. Create a passed column where score is at least 70.

python
attendance["passed"] = attendance["score"] >= 70

Explanation

  • The code adds a new column named "passed" to the attendance DataFrame.
  • It evaluates whether each student's score is greater than or equal to 70.
  • The result is a boolean value (True or False) indicating if the student has passed.
  • This operation utilizes vectorized operations for efficient computation across the DataFrame.

Practice Task

Q15. Create a score_band column.

python
attendance["score_band"] = np.where(
    attendance["score"] >= 85,
    "strong",
    np.where(attendance["score"] >= 70, "good", "needs_practice")
)

Explanation

  • Utilizes NumPy's where function to create a new column score_band in the attendance DataFrame.
  • Assigns the value "strong" if the score is 85 or higher.
  • Assigns the value "good" if the score is between 70 and 84.
  • Assigns the value "needs_practice" for scores below 70.
  • This approach allows for efficient conditional assignment in a DataFrame.

38. Practice Dataset: League Matches

This dataset is fictional and made only for practice.

python
matches = pd.DataFrame({
    "season": [2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "city": ["Pune", "Delhi", "Kochi", "Pune", "Jaipur", "Delhi", "Kochi"],
    "team_a": ["Falcons", "Tigers", "Falcons", "Rockets", "Tigers", "Falcons", "Rockets"],
    "team_b": ["Rockets", "Falcons", "Tigers", "Falcons", "Rockets", "Tigers", "Tigers"],
    "winner": ["Falcons", "Tigers", "Falcons", "Falcons", "Rockets", "Falcons", "Tigers"],
    "margin": [12, 8, 20, 5, 15, 6, 9],
    "final": [False, False, True, False, False, True, False]
})

Explanation

  • A pandas DataFrame named matches is created to organize data related to sports match results.
  • The DataFrame contains columns for the season year, city of the match, teams involved (team_a and team_b), the winner of each match, the margin of victory, and whether the match was a final.
  • The data includes matches from two different seasons (2023 and 2024) with various cities and teams participating.
  • The final column indicates if a match is a final, with boolean values (True or False) for each match entry.
  • This structure allows for easy analysis and manipulation of match data using pandas functionalities.

39. Match Practice Questions With Solutions

Practice Task

Q1. Find all final winners.

python
matches.loc[matches["final"], ["season", "winner"]]

Explanation

  • Utilizes the loc method to filter rows in the matches DataFrame where the "final" column is True.
  • Selects only the "season" and "winner" columns from the filtered results.
  • This operation is useful for analyzing the outcomes of championship matches across different seasons.

Practice Task

Q2. Count matches won by Falcons in Delhi.

python
matches[
    (matches["winner"] == "Falcons") &
    (matches["city"] == "Delhi")
].shape[0]

Explanation

  • Filters the matches DataFrame to include only rows where the winner is "Falcons" and the city is "Delhi".
  • Uses the bitwise AND operator & to combine the two conditions for filtering.
  • The shape[0] attribute returns the number of rows that meet the specified criteria, effectively counting the matches.
  • This code is useful for analyzing specific match outcomes based on team performance in a particular location.

Practice Task

Q3. Find top teams by number of matches played.

Because a team can appear in team_a or team_b, combine both columns.

python
played = pd.concat([matches["team_a"], matches["team_b"]])
played.value_counts()

Explanation

  • The code uses the pd.concat() function from the pandas library to combine two columns, team_a and team_b, from the matches DataFrame into a single Series named played.
  • The value_counts() method is then called on the played Series to count how many times each team appears in the combined list of matches.
  • This allows for easy analysis of team participation across all matches in the dataset.
  • The resulting output will show the teams along with the number of matches they played, sorted in descending order by count.

Practice Task

Q4. Build a team-vs-team summary function.

python
def team_summary(data, team_one, team_two):
    mask_one = (data["team_a"] == team_one) | (data["team_b"] == team_one)
    mask_two = (data["team_a"] == team_two) | (data["team_b"] == team_two)
    head_to_head = data[mask_one & mask_two]

    return {
        "matches": head_to_head.shape[0],
        f"{team_one}_wins": (head_to_head["winner"] == team_one).sum(),
        f"{team_two}_wins": (head_to_head["winner"] == team_two).sum(),
        "average_margin": head_to_head["margin"].mean()
    }

team_summary(matches, "Falcons", "Tigers")

Explanation

  • The function team_summary takes three parameters: a DataFrame data containing match details, and two team names team_one and team_two.
  • It creates boolean masks to filter matches where either team participated using logical OR operations.
  • The filtered DataFrame head_to_head contains only the matches involving both specified teams.
  • The function returns a dictionary with the total number of matches, the number of wins for each team, and the average margin of victory across those matches.
  • The function is called with a DataFrame matches and two team names, "Falcons" and "Tigers", to retrieve their match statistics.

Practice Task

Q5. Find the top cities where Falcons played.

python
falcons_matches = matches[
    (matches["team_a"] == "Falcons") |
    (matches["team_b"] == "Falcons")
]

falcons_matches["city"].value_counts()

Explanation

  • The code filters a DataFrame named matches to include only those rows where either team_a or team_b is "Falcons".
  • The filtered DataFrame is stored in the variable falcons_matches.
  • It then counts the occurrences of each unique city in the city column of the falcons_matches DataFrame.
  • The result is a Series that shows how many matches involving the Falcons took place in each city.

Practice Task

Q6. Find average winning margin for Falcons in 2024.

python
matches[
    (matches["season"] == 2024) &
    (
        (matches["team_a"] == "Falcons") |
        (matches["team_b"] == "Falcons")
    )
]["margin"].mean()

Explanation

  • Filters the matches DataFrame to include only rows where the season is 2024.
  • Further narrows down the results to matches where either team_a or team_b is "Falcons".
  • Selects the margin column from the filtered DataFrame.
  • Computes the mean of the margin values to determine the average margin for the specified conditions.

40. Common Beginner Mistakes

Mistake 1: Forgetting Double Brackets For Multiple Columns

Wrong:

python
students["name", "city"]

Explanation

  • This code snippet retrieves the "name" and "city" columns from a DataFrame named students.
  • The syntax students["name", "city"] indicates that multiple columns can be accessed simultaneously.
  • This operation is useful for filtering data to focus on specific attributes of the students.
  • Ensure that the students DataFrame is defined and contains the specified columns to avoid errors.

Right:

python
students[["name", "city"]]

Explanation

  • This code snippet accesses a pandas DataFrame named students.
  • It retrieves only the "name" and "city" columns from the DataFrame.
  • The double square brackets indicate that a subset of columns is being selected, returning a new DataFrame.
  • This operation is useful for focusing on specific data attributes for analysis or reporting.

Mistake 2: Confusing loc And iloc

Use:

  • loc for labels
  • iloc for integer positions
python
students.iloc[0]
students_by_id.loc["Meera"]

Explanation

  • The first line retrieves the first row of the students DataFrame using integer-location based indexing with iloc.
  • The second line accesses the record of a student named "Meera" from the students_by_id DataFrame using label-based indexing with loc.
  • Both methods are part of the pandas library, which is commonly used for data manipulation and analysis in Python.
  • These indexing techniques allow for efficient retrieval of data based on either position or label, enhancing data handling capabilities.

Mistake 3: Missing Parentheses In Multiple Conditions

Wrong:

python
students[students["python_score"] > 80 & students["sql_score"] > 80]

Explanation

  • The code filters a DataFrame named students to select rows where the python_score is greater than 80 and the sql_score is also greater than 80.
  • The logical condition uses the bitwise AND operator & to combine the two conditions for filtering.
  • It is important to use parentheses around each condition to ensure proper evaluation order in the expression.
  • The result is a subset of the original DataFrame containing only students who excel in both subjects.

Right:

python
students[
    (students["python_score"] > 80) &
    (students["sql_score"] > 80)
]

Explanation

  • The code filters a DataFrame named students to find rows where both the python_score and sql_score are greater than 80.
  • It uses boolean indexing with the & operator to combine conditions for both scores.
  • The resulting DataFrame will only include students who excel in both subjects, useful for identifying top performers.
  • This operation is efficient for data analysis in educational contexts, allowing for targeted insights.

Mistake 4: Using = Instead Of == In Filters

Wrong:

python
# This is assignment, not comparison
students["city"] = "Pune"

Explanation

  • The code modifies the dictionary students by assigning the string value "Pune" to the key "city".
  • This operation updates the existing value associated with the key if it already exists or creates a new key-value pair if it does not.
  • The comment clarifies that the single equals sign (=) is used for assignment, not for comparison, which uses double equals (==).

Right:

python
students[students["city"] == "Pune"]

Explanation

  • The code snippet accesses a DataFrame named students.
  • It filters the DataFrame to include only those rows where the "city" column has the value "Pune".
  • The result is a new DataFrame containing only the students from Pune.
  • This operation is useful for analyzing or processing data specific to a particular location.

Mistake 5: Dropping Missing Values Too Early

Before dropping rows, ask:

  • Why is the value missing?
  • Is the row still useful?
  • Can the value be filled safely?
  • Will dropping rows bias the analysis?

Mistake 6: Changing Original Data Accidentally

If you want to experiment, create a copy.

python
practice = students.copy()

Explanation

  • The copy() method is used to create a shallow copy of the students list.
  • The new list, practice, is independent of the original students list, meaning changes to one will not affect the other.
  • This is useful for scenarios where you want to manipulate data without altering the original dataset.
  • The operation is efficient for lists, ensuring that the original data remains intact while allowing for experimentation or practice.

41. DataFrame Cheat Sheet

python
# Create
pd.DataFrame(data)
pd.read_csv("file.csv")

# Inspect
df.shape
df.dtypes
df.index
df.columns
df.head()
df.tail()
df.sample(5)
df.info()
df.describe()

# Missing and duplicates
df.isna().sum()
df.duplicated().sum()
df.drop_duplicates()

# Select columns
df["col"]
df[["col1", "col2"]]

# Select rows
df.iloc[0]
df.iloc[0:5]
df.loc["label"]
df.loc["start":"stop"]

# Select rows and columns
df.iloc[0:5, 0:3]
df.loc[df["score"] > 80, ["name", "score"]]

# Filter
df[df["city"] == "Pune"]
df[(df["score"] > 80) & (df["city"] == "Pune")]
df[df["city"].isin(["Pune", "Kochi"])]
df[df["score"].between(70, 90)]

# Add or update
df["new_col"] = value
df["total"] = df["a"] + df["b"]
df.rename(columns={"old": "new"})

# Convert
df["col"].astype("category")
pd.to_numeric(df["amount"], errors="coerce")

# Summarize
df["city"].value_counts()
df.groupby("city")["score"].mean()
df.sort_values("score", ascending=False)

Explanation

  • The code demonstrates various operations on a Pandas DataFrame, including creation, inspection, and manipulation of data.
  • It shows how to read data from a CSV file and inspect the DataFrame's shape, data types, and summary statistics.
  • The snippet includes methods for handling missing values and duplicates, allowing for data cleaning.
  • It illustrates techniques for selecting specific rows and columns, as well as filtering data based on conditions.
  • The code also covers adding new columns, updating existing ones, converting data types, and summarizing data through grouping and counting.

42. Interview-Style Questions

1. What is a Pandas DataFrame?

A DataFrame is a two-dimensional labeled data structure with rows and columns. It is commonly used for tabular data.

2. What is the difference between a Series and a DataFrame?

A Series is one-dimensional. A DataFrame is two-dimensional and can be thought of as multiple Series sharing the same row index.

3. What does df.shape return?

It returns a tuple: (number_of_rows, number_of_columns).

4. What is the difference between loc and iloc?

loc selects by labels. iloc selects by integer positions.

5. Why do we use parentheses with & and | in filters?

Because Pandas comparisons must be grouped before combining them. Without parentheses, Python operator precedence can produce errors or incorrect logic.

6. How do you select multiple columns?

Use a list of column names:

python
df[["name", "score"]]

Explanation

  • This code snippet retrieves two columns, "name" and "score", from a pandas DataFrame named df.
  • The double square brackets indicate that a subset of the DataFrame is being created, returning only the specified columns.
  • The result is a new DataFrame containing only the data from the "name" and "score" columns, which can be useful for analysis or visualization.
  • This operation does not modify the original DataFrame df; it creates a new view of the data.

7. How do you count missing values per column?

python
df.isna().sum()

Explanation

  • The isna() function is called on a DataFrame df, which returns a DataFrame of the same shape with boolean values indicating the presence of missing values (True for NaN).
  • The sum() function is then applied to this boolean DataFrame, which counts the number of True values (missing values) in each column.
  • The result is a Series object where the index corresponds to the column names and the values represent the count of missing entries for each column.
  • This is useful for data cleaning and preprocessing, allowing users to quickly identify columns that may require attention due to missing data.

8. How do you remove duplicate rows?

python
df.drop_duplicates()

Explanation

  • The drop_duplicates() method is called on a pandas DataFrame df.
  • It identifies and removes any rows that are exact duplicates of each other.
  • By default, it keeps the first occurrence of each duplicate and drops subsequent ones.
  • This operation helps in cleaning the dataset for accurate analysis and processing.
  • The method can be customized with parameters to specify which columns to consider for identifying duplicates.

9. How do you convert a column to numeric?

python
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

Explanation

  • The code uses the pd.to_numeric() function from the pandas library to convert the values in the 'amount' column of a DataFrame (df) to numeric types.
  • The errors="coerce" parameter ensures that any values that cannot be converted to numeric will be replaced with NaN (Not a Number), preventing the code from raising an error.
  • This is useful for cleaning data, especially when dealing with mixed types or invalid entries in the 'amount' column.
  • The resulting 'amount' column will contain only numeric values or NaN, making it easier to perform numerical operations or analyses later on.

10. How do you find the average score by city?

python
df.groupby("city")["score"].mean()

Explanation

  • The code uses the groupby method on a DataFrame df to organize data based on unique values in the "city" column.
  • It then selects the "score" column to compute the mean for each group created by the cities.
  • The result is a new Series containing the average score for each city, which can be useful for data analysis and reporting.

43. A Good Learning Path After This

Once you understand DataFrames, continue with:

  • deeper missing-value handling
  • merge and join
  • groupby in detail
  • pivot tables
  • time series data
  • plotting with Pandas
  • performance-friendly Pandas patterns
  • exploratory data analysis projects

DataFrame basics are the foundation. Most Pandas work is just these ideas repeated on larger and messier datasets.

Sources And Further Reading

This guide uses original examples and practice datasets. The notebook was used only as a topic outline, not as copied lesson text or copied datasets.

Official Pandas references: