Pandas DataFrame: Create, Inspect, Select, Filter, Clean, and Analyze Tables
A Pandas DataFrame is the main object you use when working with table-shaped data in Python.
If a Series is one labeled column, a DataFrame is a full table made from many labeled columns.
You can use a DataFrame to represent:
- students and marks
- orders and payments
- website traffic logs
- course enrollments
- app events
- survey responses
- sports match records
- product inventory
- interview practice datasets
In this guide, you will learn DataFrames through original examples built around students, learning courses, small store orders, and simple sports records.
The goal is not just to memorize commands. The goal is to become comfortable asking questions from data.
What You Will Learn
By the end, you should be able to:
- explain what a DataFrame is
- create DataFrames from lists, dictionaries, and records
- read data from a CSV file
- inspect shape, columns, index, dtypes, and values
- preview rows with
head,tail, andsample - understand
info,describe,isna, andduplicated - rename columns safely
- select one column, many columns, one row, and many rows
- use
locfor labels andilocfor positions - select rows and columns together
- filter rows using one condition and multiple conditions
- use
isin,between, and string filters - add new columns from constants and formulas
- convert data types with
astypeandpd.to_numeric - clean missing values with
fillnaanddropna - remove duplicate rows
- sort rows
- count categories with
value_counts - group data with
groupby - solve beginner-to-practical DataFrame exercises
1. Importing Pandas
Most Pandas code uses the alias pd.
import pandas as pd
import numpy as npYou do not always need NumPy, but it is useful for missing values and numeric examples.
2. What Is A DataFrame?
A DataFrame is a two-dimensional labeled data structure.
Simple meaning:
- rows represent records
- columns represent fields
- row labels live in the index
- column labels live in
columns
Example:
students = pd.DataFrame({
"name": ["Aarav", "Meera", "Ishan", "Tara"],
"python_score": [86, 92, 74, 88],
"sql_score": [78, 85, 81, 90],
"city": ["Pune", "Kochi", "Delhi", "Pune"]
})
print(students)Output:
name python_score sql_score city
0 Aarav 86 78 Pune
1 Meera 92 85 Kochi
2 Ishan 74 81 Delhi
3 Tara 88 90 PuneThis is a DataFrame with 4 rows and 4 columns.
3. DataFrame Vs Series
Use this mental model:
Series = one labeled column
DataFrame = many Series side by sideSelecting a single column from a DataFrame usually gives a Series:
students["python_score"]Selecting multiple columns gives another DataFrame:
students[["name", "python_score"]]This difference matters because a Series and a DataFrame support many of the same operations, but their shapes are different.
4. Creating A DataFrame From A List Of Lists
You can create a DataFrame from rows.
rows = [
["Aarav", 86, "Pune"],
["Meera", 92, "Kochi"],
["Ishan", 74, "Delhi"]
]
df = pd.DataFrame(rows, columns=["name", "score", "city"])
print(df)Output:
name score city
0 Aarav 86 Pune
1 Meera 92 Kochi
2 Ishan 74 DelhiThis style is useful when your data naturally arrives row by row.
5. Creating A DataFrame From A Dictionary Of Lists
This is one of the most common beginner-friendly ways.
course_data = {
"student": ["Aarav", "Meera", "Ishan", "Tara"],
"course": ["Python", "Pandas", "Python", "SQL"],
"hours_spent": [14, 18, 9, 12],
"completed": [True, True, False, True]
}
courses = pd.DataFrame(course_data)
print(courses)Each dictionary key becomes a column name.
Each list becomes a column.
Important rule: all lists must have the same length.
6. Creating A DataFrame From A List Of Dictionaries
This style is excellent for API-like records.
records = [
{"order_id": 101, "customer": "Anika", "amount": 499, "paid": True},
{"order_id": 102, "customer": "Rohan", "amount": 299, "paid": False},
{"order_id": 103, "customer": "Zoya", "amount": 799, "paid": True}
]
orders = pd.DataFrame(records)
print(orders)Output:
order_id customer amount paid
0 101 Anika 499 True
1 102 Rohan 299 False
2 103 Zoya 799 TrueThis is often easier to read than a list of lists because every value carries a field name.
7. Creating A DataFrame With A Custom Index
The index identifies rows.
students = pd.DataFrame(
{
"python_score": [86, 92, 74, 88],
"sql_score": [78, 85, 81, 90],
"city": ["Pune", "Kochi", "Delhi", "Pune"]
},
index=["aarav", "meera", "ishan", "tara"]
)
print(students)Output:
python_score sql_score city
aarav 86 78 Pune
meera 92 85 Kochi
ishan 74 81 Delhi
tara 88 90 PuneCustom indexes are useful when your rows already have meaningful labels.
8. Reading A CSV File
In real work, you often read data from files.
orders = pd.read_csv("orders.csv")If the CSV has a column that should become the index:
orders = pd.read_csv("orders.csv", index_col="order_id")If you only want selected columns:
orders = pd.read_csv("orders.csv", usecols=["order_id", "customer", "amount"])If date columns should be parsed:
orders = pd.read_csv("orders.csv", parse_dates=["ordered_at"])For beginners, start with plain pd.read_csv("file.csv"), inspect the result, and then add options only when needed.
9. Inspecting A DataFrame
Let us build one reusable DataFrame for the next few sections.
students = pd.DataFrame({
"name": ["Aarav", "Meera", "Ishan", "Tara", "Kabir", "Nila"],
"city": ["Pune", "Kochi", "Delhi", "Pune", "Jaipur", "Kochi"],
"python_score": [86, 92, 74, 88, 67, 95],
"sql_score": [78, 85, 81, 90, 72, 89],
"practice_hours": [14, 18, 9, 16, 7, 20],
"completed": [True, True, False, True, False, True]
})Shape
students.shapeExplanation
- The
shapeattribute of a DataFrame returns a tuple representing its dimensions. - The first element of the tuple indicates the number of rows (students), while the second element indicates the number of columns (attributes).
- This is useful for quickly assessing the size of the dataset before performing further analysis or operations.
Output:
(6, 6)This means 6 rows and 6 columns.
Data Types
students.dtypesExplanation
- The code accesses the
dtypesattribute of a Pandas DataFrame namedstudents. - It returns a Series object that contains the data type of each column in the DataFrame.
- This information is useful for understanding the structure of the data and for ensuring that data types are appropriate for analysis.
- Common data types include
int64,float64,object, anddatetime64, among others. - Analyzing data types can help identify potential issues such as incorrect data formats or the need for type conversions.
This tells you the type of each column.
Common Pandas dtypes:
int64float64boolobjectdatetime64category
Many text columns appear as object.
Index
students.indexExplanation
- The
students.indexattribute retrieves the index (row labels) of the DataFrame namedstudents. - This index can be used to identify and access specific rows in the DataFrame.
- It is particularly useful for operations that require referencing or manipulating data based on row positions.
- The index can be customized or set to meaningful labels to enhance data readability and management.
The index is the row label system.
Columns
students.columnsExplanation
- Accesses the
columnsattribute of a pandas DataFrame namedstudents. - Returns an Index object containing the names of the columns in the DataFrame.
- Useful for understanding the structure of the DataFrame and for data manipulation tasks.
- Can be used in data analysis to quickly identify available features for further processing.
This returns the column names.
Values
students.valuesExplanation
- The code snippet retrieves all the values from the
studentsdictionary. students.valuesreturns a view object that displays a list of all the values in the dictionary.- This can be useful for iterating over student data or performing operations on the values without needing the keys.
- To convert the view object into a list, you can wrap it with
list(students.values()).
This gives the raw underlying values as an array-like object.
In beginner Pandas work, you usually interact with columns and rows through Pandas methods instead of relying on .values.
10. Previewing Rows
First Rows With head
students.head()Explanation
- The
head()method is called on a DataFrame namedstudents. - It returns the first five rows by default, allowing for a quick overview of the data structure and contents.
- This is useful for verifying data loading and understanding the format of the dataset.
- Users can specify a different number of rows to display by passing an integer argument to
head().
By default, it returns the first 5 rows.
students.head(3)Explanation
- The
head()function is a method used on a pandas DataFrame to display the first few rows of the dataset. - In this case,
students.head(3)returns the first three rows of thestudentsDataFrame. - This is useful for quickly inspecting the structure and contents of the DataFrame without displaying the entire dataset.
- The number passed to
head()can be adjusted to retrieve a different number of rows as needed.
Returns the first 3 rows.
Last Rows With tail
students.tail()
students.tail(2)Explanation
- The
students.tail()function returns the last five rows of the DataFrame namedstudents. - By calling
students.tail(2), the code retrieves the last two rows specifically, allowing for a focused view of the most recent entries. - This method is useful for quickly inspecting the end of a dataset without displaying the entire DataFrame.
- The
tail()function is commonly used in data analysis to check for data integrity and trends at the end of the dataset.
Random Rows With sample
students.sample(2, random_state=42)Explanation
- The
samplemethod is used to randomly select a specified number of rows from a DataFrame. - In this case, it selects 2 random students from the
studentsDataFrame. - The
random_stateparameter is set to 42, ensuring that the random selection is reproducible across different runs. - This is useful for testing or demonstration purposes, as it allows the same subset of data to be retrieved consistently.
- If
random_statewere omitted, the selection would vary each time the code is executed.
Use random_state when you want reproducible examples.
11. Using info
info gives a compact technical summary.
students.info()Explanation
- The
info()method is called on a DataFrame namedstudents. - It provides a concise summary of the DataFrame, including the number of entries, column names, data types, and memory usage.
- This method is useful for quickly assessing the structure and data types of the dataset before performing further analysis.
- It helps identify any missing values or inconsistencies in the data.
It shows:
- row count
- column names
- non-null counts
- data types
- memory usage
This is one of the first commands you should run after loading a new dataset.
12. Using describe
describe summarizes numeric columns by default.
students.describe()Explanation
- The
describe()method is called on thestudentsDataFrame, which is typically used in pandas. - It generates summary statistics such as count, mean, standard deviation, minimum, and maximum values for numerical columns.
- For categorical columns, it provides counts of unique values and the most frequent value.
- This method is useful for quickly understanding the distribution and central tendencies of the data.
- The output helps in identifying potential outliers and the overall shape of the dataset.
It gives:
- count
- mean
- standard deviation
- minimum
- quartiles
- maximum
For non-numeric columns:
students.describe(include="object")Explanation
- The
describe()method is called on a DataFrame namedstudents. - The parameter
include="object"specifies that only columns with object data types (typically strings) should be included in the output. - This method provides insights such as count, unique values, top values, and frequency of the top values for the specified columns.
- It is useful for understanding the distribution and characteristics of categorical data in the DataFrame.
For all columns:
students.describe(include="all")Explanation
- The
describe()method is called on a DataFrame namedstudentsto generate descriptive statistics. - The parameter
include="all"ensures that the summary includes statistics for all columns, regardless of their data types. - For numeric columns, it provides metrics like count, mean, standard deviation, min, max, and quartiles.
- For categorical columns, it includes counts of unique values, the most frequent value, and its frequency.
- This method is useful for quickly understanding the distribution and characteristics of the dataset.
13. Checking Missing Values
Missing values are common in real data.
students.isna().sum()Explanation
- The
isna()function is called on thestudentsDataFrame to identify missing values, returning a DataFrame of the same shape with boolean values. - The
sum()function is then applied to this boolean DataFrame, which counts the number ofTruevalues (missing entries) for each column. - The result is a Series object where the index corresponds to the column names and the values represent the count of missing entries in each column.
- This is useful for data cleaning and preprocessing, allowing users to quickly assess the completeness of their dataset.
You may also see:
students.isnull().sum()Explanation
- The
isnull()method is called on thestudentsDataFrame to identify missing values, returning a DataFrame of the same shape with boolean values. - The
sum()method is then applied to this boolean DataFrame, which counts theTruevalues (indicating missing data) for each column. - The result is a Series object where the index corresponds to the column names and the values represent the count of missing entries in each column.
- This is useful for data cleaning and preprocessing, allowing users to quickly assess the completeness of their dataset.
In Pandas, isna and isnull are commonly used for the same purpose.
14. Checking Duplicate Rows
students.duplicated().sum()Explanation
- The
duplicated()method is called on thestudentsDataFrame to identify duplicate rows. - It returns a boolean Series where
Trueindicates a duplicate row. - The
sum()function then counts the number ofTruevalues in the Series, effectively giving the total number of duplicate entries. - This code is useful for data cleaning and ensuring the integrity of the dataset by identifying redundancy.
This counts complete duplicate rows.
To see duplicated rows:
students[students.duplicated()]Explanation
- The code uses the
duplicated()method from the pandas library to check for duplicate rows in thestudentsDataFrame. - It returns a boolean Series indicating whether each row is a duplicate of a previous row.
- By indexing the
studentsDataFrame with this boolean Series, it filters and displays only the duplicate entries. - This is useful for data cleaning and ensuring the integrity of student records.
To remove duplicate rows:
students = students.drop_duplicates()Explanation
- The
drop_duplicates()method is called on thestudentsDataFrame. - It identifies and removes any rows that are exact duplicates of each other.
- The operation helps ensure that each student is represented only once in the DataFrame.
- The result is a cleaner dataset, which is essential for accurate data analysis and reporting.
15. Renaming Columns
Readable column names make your code easier to understand.
students = students.rename(columns={
"python_score": "python",
"sql_score": "sql",
"practice_hours": "hours"
})Explanation
- The
renamemethod is used to change the names of specific columns in thestudentsDataFrame. - The
columnsparameter takes a dictionary where the keys are the current column names and the values are the new names. - In this case, "python_score" is renamed to "python", "sql_score" to "sql", and "practice_hours" to "hours".
- This operation helps in making the DataFrame more user-friendly and easier to work with in subsequent analyses.
You can also rename in place:
students.rename(columns={"hours": "study_hours"}, inplace=True)Explanation
- The
renamemethod is used to change the name of a specific column in a pandas DataFrame. - The
columnsparameter takes a dictionary where the key is the current column name and the value is the new name. - In this case, the column "hours" is renamed to "study_hours".
- The
inplace=Trueargument modifies the original DataFrame directly without needing to create a copy.
Many teams avoid inplace=True because returning a new DataFrame is easier to reason about in longer pipelines.
16. Math Across Columns And Rows
Create a small score DataFrame:
scores = pd.DataFrame({
"python": [86, 92, 74],
"sql": [78, 85, 81],
"excel": [90, 88, 76]
}, index=["Aarav", "Meera", "Ishan"])Explanation
- Initializes a pandas DataFrame named
scoresto organize student scores across different subjects. - Contains three subjects: "python", "sql", and "excel", each with a list of scores.
- Uses a custom index with student names: "Aarav", "Meera", and "Ishan" to label the rows.
- Facilitates easy data manipulation and analysis for educational performance tracking.
Column-wise sum:
scores.sum(axis=0)Explanation
- The
scoresvariable is expected to be a NumPy array or a similar data structure. - The
summethod computes the total of the array elements. - The
axis=0argument specifies that the summation should be performed column-wise, aggregating values from each row. - The result will be a new array containing the sum of each column in the original array.
Row-wise sum:
scores.sum(axis=1)Explanation
- The
scoresvariable is expected to be a 2D array or matrix, where each row represents a different set of scores. - The
summethod is called with the parameteraxis=1, indicating that the summation should be performed across rows. - The result will be a 1D array containing the total score for each row, effectively aggregating the scores horizontally.
- This operation is commonly used in data analysis to compute totals or aggregates for each observation or sample.
Row-wise average:
scores.mean(axis=1)Explanation
- The
meanfunction computes the average of the elements along the specified axis. - By setting
axis=1, the function calculates the mean for each row, treating each row as a separate dataset. - This is useful for obtaining a single average score per entity represented by the rows in the array.
- The output will be a 1D array containing the mean values for each row.
Other useful methods:
scores.median(axis=1)
scores.std(axis=1)
scores.var(axis=1)
scores.min(axis=1)
scores.max(axis=1)Explanation
- Computes the median of scores across each row using
median(axis=1). - Calculates the standard deviation of scores for each row with
std(axis=1). - Determines the variance of scores in each row using
var(axis=1). - Finds the minimum score in each row with
min(axis=1). - Identifies the maximum score in each row using
max(axis=1).
Remember:
axis=0 means down the rows, producing one result per column
axis=1 means across the columns, producing one result per row17. Selecting One Column
students["name"]Explanation
- This code snippet retrieves the value associated with the key "name" from the
studentsdictionary. - It assumes that
studentsis a dictionary where each key corresponds to a student's identifier and the value is another dictionary containing student details. - If "name" is not a valid key in the dictionary, it will raise a KeyError.
- This operation is commonly used to extract specific information about a student for further processing or display.
This returns a Series.
You can verify:
type(students["name"])Explanation
- The code accesses the 'name' key from the 'students' dictionary.
- It uses the
type()function to retrieve the data type of the value associated with the 'name' key. - This is useful for validating the expected data structure and ensuring data integrity within the dictionary.
Use bracket syntax for column names because it works with spaces and special characters.
students["practice_hours"]Explanation
- This code snippet retrieves the "practice_hours" data from a collection named "students".
- It assumes that "students" is a dictionary or a similar data structure where "practice_hours" is a key.
- The output will be the value associated with "practice_hours", which could be a list, integer, or any other data type.
- This operation is useful for analyzing or displaying the amount of practice time logged by each student.
Dot syntax can be convenient:
students.cityExplanation
- This code snippet retrieves the 'city' attribute from the 'students' object.
- It assumes that 'students' is an instance of a class or a data structure that has a defined 'city' property.
- The value returned will depend on the specific implementation of the 'students' object.
- This is a common way to access object attributes in Python, allowing for easy retrieval of stored data.
But dot syntax can break when column names have spaces, conflict with methods, or are created dynamically.
Prefer bracket syntax in teaching and production code.
18. Selecting Multiple Columns
Use a list of column names inside brackets.
students[["name", "city", "python_score"]]Explanation
- This code snippet accesses a DataFrame named
students. - It retrieves only the columns "name", "city", and "python_score".
- The result is a new DataFrame containing only the specified columns for further analysis or display.
- This operation is useful for focusing on relevant data without altering the original DataFrame.
This returns a DataFrame.
Column order follows the list you pass:
students[["python_score", "name"]]Explanation
- This code snippet accesses a DataFrame named
students. - It selects and displays only the "python_score" and "name" columns.
- The result is a new DataFrame containing only the specified data for further analysis or display.
- This operation is useful for focusing on relevant information without altering the original DataFrame.
19. Selecting Rows With iloc
iloc selects by integer position.
students.iloc[0]Explanation
- The code retrieves the first row of the DataFrame named
studentsusing theilocindexer. iloc[0]specifically targets the first row, as indexing starts at 0 in Python.- This operation returns a Series object containing all the column values for the first student in the DataFrame.
- It is useful for quickly inspecting the data of the first entry without needing to display the entire DataFrame.
This selects the first row.
Multiple rows:
students.iloc[0:3]Explanation
studentsis a pandas DataFrame that likely contains information about students.- The
ilocmethod is used for integer-location based indexing to select rows. - The slice
0:3specifies that the code should return rows at index 0, 1, and 2, effectively giving the first three entries. - This operation is useful for quickly viewing a subset of data for analysis or debugging.
Rows at specific positions:
students.iloc[[0, 2, 5]]Explanation
- The code utilizes the
ilocmethod from the pandas library to access specific rows of a DataFrame namedstudents. - It retrieves the rows at index positions 0, 2, and 5, allowing for targeted data extraction.
- This method is particularly useful for accessing rows by their integer location rather than by label.
- The resulting output will be a new DataFrame containing only the selected rows.
- Ensure that the indices specified exist within the DataFrame to avoid an IndexError.
Last row:
students.iloc[-1]Explanation
- The code uses the
ilocmethod from the pandas library to access data by integer-location based indexing. -1indicates that the last row of the DataFramestudentswill be selected.- This is useful for quickly viewing the most recent entry or record in the dataset.
- The returned output will be a Series object representing the last student's data.
In iloc, slices behave like normal Python slices. The stop position is excluded.
students.iloc[1:4]Explanation
- The
ilocmethod is used for integer-location based indexing in pandas DataFrames. - The slice
1:4indicates that it will return rows with index 1, 2, and 3 (the end index is exclusive). - This allows for easy selection of a subset of data for analysis or manipulation.
- The resulting output will be a new DataFrame containing only the specified rows.
This selects positions 1, 2, and 3.
20. Selecting Rows With loc
loc selects by index label.
Create a labeled DataFrame:
students_by_id = students.set_index("name")Explanation
- The
set_indexmethod is used to change the index of the DataFramestudents. - The "name" column is specified as the new index, allowing for more intuitive data retrieval based on student names.
- After executing this line, the DataFrame will be indexed by student names instead of the default integer index.
- This can improve performance and readability when performing operations that involve looking up students by their names.
Select a row by label:
students_by_id.loc["Meera"]Explanation
- The code snippet uses the
locmethod from the pandas library to retrieve data. - It targets the row corresponding to the index label "Meera" in the
students_by_idDataFrame. - This operation returns all columns for the student identified by the index "Meera".
- It is useful for quickly accessing information about a specific student based on their unique identifier.
Select multiple labels:
students_by_id.loc[["Aarav", "Tara", "Nila"]]Explanation
- The code accesses a DataFrame named
students_by_idto retrieve data. - It uses the
.loc[]method, which allows for label-based indexing to select rows. - The list
["Aarav", "Tara", "Nila"]specifies the IDs of the students whose records are to be fetched. - This operation returns the rows corresponding to the specified student IDs, enabling focused data analysis.
Label slicing:
students_by_id.loc["Meera":"Kabir"]Explanation
- Utilizes the
locmethod to access rows in a pandas DataFrame based on index labels. - The range specified, from "Meera" to "Kabir", includes all rows with index labels within this range.
- This operation is useful for filtering data based on specific criteria or for viewing a subset of records.
- Assumes that the DataFrame
students_by_idis indexed by student names or IDs. - The output will include all records between the specified start and end labels, inclusive.
Important difference:
loc label slices include the stop label when it exists.
iloc position slices exclude the stop position.This is a common beginner trap.
21. Selecting Rows And Columns Together
Use iloc when both rows and columns are positions.
students.iloc[0:3, 0:2]Explanation
- Utilizes the
ilocmethod from the pandas library to access specific rows and columns by integer-location based indexing. - The slice
0:3indicates that it selects rows 0, 1, and 2, effectively retrieving the first three rows. - The slice
0:2specifies that it selects columns 0 and 1, thus retrieving the first two columns of the DataFrame. - This operation is useful for quickly examining a subset of data within a larger DataFrame.
- The result is a new DataFrame containing only the specified rows and columns.
This means:
- rows from position 0 to 2
- columns from position 0 to 1
Use loc when both rows and columns are labels.
students_by_id.loc["Aarav":"Tara", ["city", "python_score"]]Explanation
- Uses the
locmethod to access rows and columns in a pandas DataFrame. - Selects rows from "Aarav" to "Tara" based on the index labels.
- Retrieves only the "city" and "python_score" columns for the specified range of students.
- Assumes that the DataFrame
students_by_idis indexed by student names or IDs. - Useful for filtering data to analyze specific attributes of a subset of students.
You can also filter rows and select columns in one clean expression:
students.loc[students["city"] == "Pune", ["name", "python_score"]]Explanation
- Uses the
locmethod from the pandas library to filter rows in thestudentsDataFrame. - The condition checks for students whose "city" column matches "Pune".
- Only the "name" and "python_score" columns are selected for the output.
- This operation is useful for analyzing performance based on geographical location.
- The result is a subset of the original DataFrame containing relevant information for further analysis.
This is one of the most useful DataFrame patterns.
22. Filtering Rows With One Condition
Find students with Python score above 85:
high_python = students[students["python_score"] > 85]Explanation
- The code filters a DataFrame named
studentsto identify those with a Python score greater than 85. - It uses boolean indexing to create a new DataFrame called
high_python. - This operation allows for easy analysis of students who excel in Python, facilitating targeted educational strategies.
- The resulting DataFrame
high_pythoncontains only the rows that meet the specified condition.
Find students from Pune:
pune_students = students[students["city"] == "Pune"]Explanation
- The code snippet filters a DataFrame named
studentsto create a new DataFramepune_students. - It uses a boolean condition to check if the "city" column equals "Pune".
- Only the rows where the condition is true are included in the
pune_studentsDataFrame. - This operation is useful for analyzing or processing data specific to students located in Pune.
Find students who completed the course:
completed = students[students["completed"] == True]Explanation
- The code filters a DataFrame named
studentsto create a new DataFrame calledcompleted. - It selects only the rows where the "completed" column has a value of
True. - This is useful for analyzing or processing only those students who have finished their assignments or tasks.
- The resulting
completedDataFrame contains all relevant information about the students who have completed their tasks.
Cleaner version:
completed = students[students["completed"]]Explanation
- This code snippet filters a DataFrame named
studentsto create a new variablecompleted. - It selects only the rows where the "completed" column is truthy, indicating that those students have completed their tasks.
- The result is a subset of the original DataFrame containing only the relevant entries.
- This operation is useful for analyzing or processing data related to students who have finished their assignments or courses.
23. Filtering With Multiple Conditions
Use & for AND.
strong_and_consistent = students[
(students["python_score"] >= 85) &
(students["sql_score"] >= 85)
]Explanation
- The code filters a DataFrame named
studentsto identify those with high scores in both Python and SQL. - It uses boolean indexing to select rows where the
python_scoreis greater than or equal to 85 and thesql_scoreis also greater than or equal to 85. - The result is stored in the variable
strong_and_consistent, which contains only the students meeting these criteria. - This approach is useful for analyzing student performance and identifying top performers in specific subjects.
Use | for OR.
needs_attention = students[
(students["python_score"] < 75) |
(students["practice_hours"] < 10)
]Explanation
- The code creates a new DataFrame called
needs_attentionfrom an existing DataFramestudents. - It selects students whose
python_scoreis less than 75 or who have logged less than 10 practice hours. - The logical operator
|is used to combine the two conditions, meaning either condition being true will include the student in the result. - This filtering helps identify students who may need extra help in their studies.
Important rule:
When combining conditions with & or |, wrap each condition in parentheses.Do not write:
# Avoid this
students[students["python_score"] >= 85 & students["sql_score"] >= 85]Explanation
- The code attempts to filter a DataFrame of students based on their scores in Python and SQL.
- It uses the bitwise AND operator
&incorrectly with the comparison operators, which can lead to unexpected results. - The correct approach should use parentheses around each condition to ensure proper evaluation order.
- The intended functionality is to select students with scores of 85 or higher in both subjects.
- Proper syntax would be:
students[(students["python_score"] >= 85) & (students["sql_score"] >= 85)].
The operator precedence will not behave the way you expect.
24. Filtering With isin
Use isin when a column should match any value from a list.
south_or_west = students[students["city"].isin(["Pune", "Kochi"])]Explanation
- The code filters a DataFrame named
studentsto select only those entries where the "city" column matches either "Pune" or "Kochi". - The
isin()method is used to check if the values in the "city" column are present in the specified list of cities. - The result is stored in the variable
south_or_west, which will contain a subset of the original DataFrame. - This operation is useful for analyzing or processing data related to specific geographic locations.
This is cleaner than writing many OR conditions.
25. Filtering With between
Use between for ranges.
middle_scores = students[students["python_score"].between(75, 90)]Explanation
- The code filters a DataFrame named
studentsto select only those rows where thepython_scoreis between 75 and 90. - The
betweenmethod is used to check if thepython_scorefalls within the inclusive range of 75 to 90. - The result is stored in the variable
middle_scores, which will contain only the students who meet this score criterion. - This operation is useful for analyzing a subset of students who are performing at a moderate level in Python.
By default, the endpoints are included.
26. Filtering Text With String Methods
Pandas gives string methods through .str.
students[students["name"].str.startswith("M")]Explanation
- Utilizes the pandas library to filter a DataFrame named
students. - The
str.startswith("M")method checks if the 'name' column values begin with the letter 'M'. - The result is a subset of the original DataFrame containing only the rows that meet this condition.
- This operation is useful for quickly identifying specific groups of data based on string patterns.
Case-insensitive contains:
students[students["city"].str.contains("pu", case=False, na=False)]Explanation
- The code accesses a DataFrame named
studentswhich contains student data. - It uses the
str.contains()method to check if the "city" column contains the substring "pu". - The
case=Falseargument ensures that the search is case insensitive, allowing matches for "Pu", "pu", "PU", etc. - The
na=Falseargument prevents errors by treating NaN values as False, meaning they won't be included in the results. - The result is a filtered DataFrame containing only the rows where the "city" column matches the specified condition.
Use na=False when the column may contain missing values.
27. Adding A New Column
Add a constant column:
students["country"] = "India"Explanation
- The code modifies the
studentsdictionary by adding or updating the key "country". - All entries in the
studentsdictionary will have their "country" value set to "India". - This operation is useful for bulk updating information for a group of students.
- If "country" already exists, its value will be overwritten with "India".
Add a calculated column:
students["total_score"] = students["python_score"] + students["sql_score"]Explanation
- The code accesses a DataFrame named
students, which contains individual student scores. - It creates a new column called
total_scorein the DataFrame. - The
total_scoreis computed by adding the values from thepython_scoreandsql_scorecolumns for each student. - This operation is vectorized, meaning it efficiently processes all rows in the DataFrame simultaneously.
- The resulting
total_scorecolumn provides a quick reference for the overall performance of each student in both subjects.
Add an average:
students["average_score"] = students[["python_score", "sql_score"]].mean(axis=1)Explanation
- The code snippet adds a new column named "average_score" to the
studentsDataFrame. - It computes the mean of the scores from the "python_score" and "sql_score" columns.
- The
mean(axis=1)function calculates the average across the specified columns for each row. - This operation allows for easy assessment of student performance in both subjects.
Add a category from a condition:
students["performance"] = np.where(
students["average_score"] >= 85,
"strong",
"practice_more"
)Explanation
- The code uses NumPy's
wherefunction to create a new column in thestudentsDataFrame called "performance". - It assigns the value "strong" to students whose "average_score" is 85 or higher.
- For students with an "average_score" below 85, it assigns the value "practice_more".
- This operation allows for efficient conditional logic application across the DataFrame.
28. Updating Existing Columns
You can update text values:
students["city"] = students["city"].str.title()Explanation
- The code accesses the "city" column of the
studentsDataFrame. - It applies the
str.title()method, which converts each city name to title case (capitalizing the first letter of each word). - This transformation helps standardize city names, improving data quality and readability.
- The modified city names are then reassigned back to the "city" column in the DataFrame.
You can normalize inconsistent labels:
students["city"] = students["city"].replace({
"Bengaluru": "Bangalore",
"Bombay": "Mumbai"
})Explanation
- The code modifies the "city" column in the
studentsDataFrame. - It uses the
replacemethod to change specific city names to their standardized versions. - "Bengaluru" is replaced with "Bangalore" and "Bombay" with "Mumbai".
- This ensures uniformity in city names, which is important for data analysis and reporting.
- The operation is performed in-place, updating the original DataFrame directly.
You can update numeric values:
students["practice_hours"] = students["practice_hours"].clip(lower=0)Explanation
- The code modifies the "practice_hours" column in the "students" DataFrame.
- It uses the
clipmethod to set a lower limit of 0 for the values in this column. - Any negative values in "practice_hours" will be replaced with 0, ensuring all entries are non-negative.
- This is useful for data cleaning, preventing unrealistic negative practice hours from affecting analysis.
29. Converting Data Types
Use astype when the conversion is direct.
students["completed"] = students["completed"].astype("bool")Explanation
- The code modifies the "completed" column in the
studentsDataFrame. - It uses the
astypemethod to change the data type of the "completed" column to boolean. - This conversion allows for more efficient storage and logical operations on the column's values.
- The operation is performed in-place, meaning the original DataFrame is updated directly.
Convert a repeated text column to category:
students["city"] = students["city"].astype("category")Explanation
- The code modifies the "city" column of the
studentsDataFrame. - It uses the
astypemethod to change the data type of the column to "category". - Categorical data types are more memory-efficient and can improve performance for certain operations.
- This conversion is particularly useful when the column contains a limited number of unique values.
This can reduce memory for columns with repeated values.
For messy numeric text, use pd.to_numeric.
raw = pd.DataFrame({
"amount": ["499", "299", "missing", "799"]
})
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")Explanation
- A pandas DataFrame named
rawis created with a column "amount" containing string representations of numbers and a missing value. - The
pd.to_numeric()function is applied to the "amount" column to convert its values to numeric types. - The
errors="coerce"parameter ensures that any non-convertible values (like "missing") are replaced with NaN (Not a Number) instead of raising an error. - This approach is useful for data cleaning, allowing for easier numerical analysis and operations on the DataFrame.
Invalid values become missing values.
30. Handling Missing Values
Create an example:
feedback = pd.DataFrame({
"student": ["Aarav", "Meera", "Ishan", "Tara", "Kabir"],
"rating": [5, np.nan, 4, np.nan, 3],
"comment": ["Clear", "Helpful", None, "Practical", None]
})Explanation
- A pandas DataFrame named
feedbackis created to organize student feedback data. - The DataFrame contains three columns:
student,rating, andcomment. - The
studentcolumn lists the names of five students. - The
ratingcolumn includes numerical ratings, with some entries as NaN (not a number) indicating missing values. - The
commentcolumn contains textual feedback, with some entries as None, representing absent comments.
Check missing values:
feedback.isna().sum()Explanation
- The
isna()function checks for missing values (NaNs) in the DataFramefeedback. - The
sum()function aggregates the boolean results fromisna(), counting the total number of missing entries for each column. - This code snippet is useful for data cleaning and preprocessing, allowing users to quickly identify columns that require attention due to missing data.
Fill missing numeric values with median:
feedback["rating"] = feedback["rating"].fillna(feedback["rating"].median())Explanation
- The code accesses the 'rating' column of the
feedbackDataFrame. - It uses the
fillna()method to fill any NaN (missing) values in the 'rating' column. - The missing values are replaced with the median of the existing ratings, calculated using
feedback["rating"].median(). - This approach helps maintain the integrity of the dataset by ensuring that all entries have a valid rating.
- Using the median is beneficial as it is less affected by outliers compared to the mean.
Fill missing text:
feedback["comment"] = feedback["comment"].fillna("No comment")Explanation
- The code accesses the "comment" column of the
feedbackDataFrame. - It uses the
fillna()method to identify and replace anyNaN(missing) values. - The default value used for replacement is the string "No comment".
- This ensures that all entries in the "comment" column have a valid string, improving data consistency.
Drop rows with missing values:
feedback.dropna()Explanation
- The
dropna()method is called on the DataFramefeedback. - It eliminates any rows that contain at least one NaN (Not a Number) value.
- The operation modifies the DataFrame in place unless specified otherwise.
- This is useful for cleaning data before analysis to ensure completeness.
- No parameters are provided, so it defaults to removing rows with any missing values.
Drop rows only when a specific column is missing:
feedback.dropna(subset=["rating"])Explanation
- The
dropnamethod is used to eliminate rows that contain NaN (Not a Number) values. - The
subsetparameter specifies that only the 'rating' column should be checked for missing values. - Rows where 'rating' is NaN will be dropped from the DataFrame named
feedback. - This operation helps in cleaning the dataset for analysis by ensuring that only complete entries are retained.
Use dropping carefully. Sometimes missing data is itself meaningful.
31. Sorting Rows
Sort by one column:
students.sort_values("python_score", ascending=False)Explanation
- The code snippet utilizes the
sort_valuesmethod from a DataFrame to organize student records. - It specifically targets the "python_score" column for sorting.
- The
ascending=Falseparameter ensures that the highest scores appear first in the sorted list. - This operation modifies the order of the DataFrame in place, allowing for easy access to top-performing students.
Sort by multiple columns:
students.sort_values(
["city", "python_score"],
ascending=[True, False]
)Explanation
- The
sort_valuesmethod is used to sort the DataFrame namedstudents. - The sorting is performed first by the
citycolumn in ascending order. - Within each city, the
python_scorecolumn is sorted in descending order. - This allows for a structured view of students, grouped by their city and ranked by their Python scores.
- The result is a DataFrame that prioritizes city organization while highlighting top performers in Python within each city.
32. Counting Categories
Use value_counts on a Series.
students["city"].value_counts()Explanation
- The code accesses the "city" column from the
studentsDataFrame. - It uses the
value_counts()method to count the number of occurrences of each unique city. - The result is a Series object that displays cities as the index and their respective counts as values.
- This is useful for quickly analyzing the distribution of students across different cities.
- The output can help in understanding demographic information or planning resources based on student locations.
Normalize to percentages:
students["city"].value_counts(normalize=True) * 100Explanation
- The code accesses the "city" column from a DataFrame named
students. - It uses the
value_counts()method to count the occurrences of each unique city. - The
normalize=Trueparameter converts these counts into proportions of the total. - Multiplying by 100 converts the proportions into percentages, providing a clearer view of city distribution.
- The result is a Series showing the percentage of students from each city.
Include missing values:
students["city"].value_counts(dropna=False)Explanation
- The code accesses the "city" column from a DataFrame named
students. - It uses the
value_counts()method to count the unique occurrences of each city. - The parameter
dropna=Falseensures that missing values (NaN) are included in the count. - This is useful for understanding the distribution of students across different cities, including those without specified locations.
33. Grouping Data With groupby
groupby helps you answer questions by category.
Average score by city:
students.groupby("city")["python_score"].mean()Explanation
- The code uses the
groupbymethod from the Pandas library to group student data by the "city" column. - It selects the "python_score" column to focus on the scores of students in each city.
- The
mean()function computes the average Python score for each city group. - This operation results in a Series with cities as the index and their corresponding average Python scores as values.
- It is useful for analyzing performance trends based on geographic location.
Multiple aggregations:
students.groupby("city").agg(
learners=("name", "count"),
avg_python=("python_score", "mean"),
avg_sql=("sql_score", "mean"),
total_hours=("practice_hours", "sum")
)Explanation
- The code uses the
groupbymethod to organize student data based on the "city" column. - It aggregates the data to compute the number of learners, average Python scores, average SQL scores, and total practice hours for each city.
- The
aggfunction specifies multiple aggregation operations, including counting names, calculating means for scores, and summing practice hours. - The results are returned as a DataFrame with new column names for clarity, such as "learners", "avg_python", "avg_sql", and "total_hours".
Sort the result:
students.groupby("city")["practice_hours"].sum().sort_values(ascending=False)Explanation
- The code uses the
groupbymethod to categorize student data based on the "city" column. - It then aggregates the "practice_hours" for each city using the
sum()function. - The results are sorted in descending order with
sort_values(ascending=False), allowing for easy identification of cities with the highest total practice hours. - This operation is useful for analyzing regional engagement in practice activities among students.
34. Mini Project 1: Course Enrollment Table
Create a small course enrollment dataset.
enrollments = pd.DataFrame({
"learner": ["Aarav", "Meera", "Ishan", "Tara", "Kabir", "Nila", "Rohan"],
"track": ["Python", "Pandas", "Python", "SQL", "Pandas", "Python", "SQL"],
"level": ["beginner", "beginner", "beginner", "intermediate", "beginner", "intermediate", "beginner"],
"minutes_watched": [320, 410, 180, 260, 390, 520, 150],
"quiz_score": [82, 91, 67, 74, 88, 95, 61],
"certificate": [True, True, False, False, True, True, False]
})Explanation
- This code snippet initializes a pandas DataFrame named
enrollmentsto manage data related to learners in various tracks. - It includes columns for
learnernames, thetrackthey are enrolled in, theirlevelof expertise,minutes_watchedfor course content,quiz_scoreachieved, and whether they received acertificate. - The data is structured as a dictionary, where each key corresponds to a column name and the associated list contains the respective values for each learner.
- This DataFrame can be used for further analysis, such as calculating average scores or filtering learners based on their progress.
Practice Task
Question 1: Which learners watched more than 300 minutes?
enrollments[enrollments["minutes_watched"] > 300]Explanation
- This code snippet filters a DataFrame named
enrollments. - It selects only the rows where the value in the "minutes_watched" column is greater than 300.
- The result is a new DataFrame containing only the enrollments with significant viewing time.
- This operation is useful for analyzing user engagement or identifying highly active users.
Practice Task
Question 2: Which beginner learners scored at least 80?
enrollments[
(enrollments["level"] == "beginner") &
(enrollments["quiz_score"] >= 80)
]Explanation
- The code filters a DataFrame named
enrollmentsto select rows that meet specific criteria. - It checks for entries where the "level" column is equal to "beginner".
- Additionally, it ensures that the "quiz_score" column has values greater than or equal to 80.
- The result is a subset of the original DataFrame containing only the qualifying records.
Practice Task
Question 3: What is the average score per track?
enrollments.groupby("track")["quiz_score"].mean()Explanation
- The code uses the
groupbymethod to organize data based on the "track" column. - It selects the "quiz_score" column to perform calculations on.
- The
mean()function computes the average quiz score for each unique track. - This operation is useful for analyzing performance trends across different tracks.
- The result is a new Series with track names as the index and their corresponding average quiz scores.
Practice Task
Question 4: Which track has the most certificates?
enrollments.groupby("track")["certificate"].sum().sort_values(ascending=False)Explanation
- Groups the
enrollmentsDataFrame by thetrackcolumn. - Sums the
certificatevalues for each group, providing the total certificates earned per track. - Sorts the resulting sums in descending order to highlight the tracks with the highest certificate totals.
Practice Task
Question 5: Add an engagement column
enrollments["engagement"] = np.where(
enrollments["minutes_watched"] >= 300,
"high",
"low"
)Explanation
- Utilizes the
np.wherefunction from the NumPy library to create a new column called "engagement" in theenrollmentsDataFrame. - Checks if the value in the "minutes_watched" column is greater than or equal to 300.
- Assigns the label "high" for users who meet or exceed 300 minutes watched, and "low" for those who do not.
- This categorization helps in analyzing user engagement levels effectively.
This mini project covers filtering, grouping, sorting, and column creation.
35. Mini Project 2: Cleaning Store Orders
Create a messy order table.
orders = pd.DataFrame({
"order_id": [1, 2, 3, 3, 4, 5, 6],
"customer": ["Anika", "Rohan", "Zoya", "Zoya", "Kabir", "Meera", "Ishan"],
"city": ["pune", "Delhi", "PUNE", "PUNE", None, "Kochi", "delhi"],
"amount": ["499", "299", "799", "799", "bad", "999", "450"],
"status": ["paid", "pending", "paid", "paid", "paid", None, "pending"]
})Explanation
- Initializes a pandas DataFrame named
orderscontaining order details such as order ID, customer name, city, amount, and payment status. - The
order_idcolumn includes duplicate entries, indicating multiple orders from the same customer. - The
citycolumn contains inconsistent casing (e.g., "PUNE" vs "pune") and a missing value (None). - The
amountcolumn has a non-numeric entry ("bad"), which may lead to data type issues during analysis. - The
statuscolumn includes both valid entries and a missing value (None), which may require handling for accurate reporting.
Step 1: Normalize city names
orders["city"] = orders["city"].str.title()Explanation
- The code accesses the "city" column of the
ordersDataFrame. - It applies the
str.title()method, which converts each city name to title case (capitalizing the first letter of each word). - The formatted city names are then reassigned back to the "city" column, updating the DataFrame in place.
- This transformation ensures consistency in the representation of city names for better readability and data analysis.
Step 2: Convert amount to numeric
orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce")Explanation
- The code uses the
pd.to_numeric()function from the Pandas library to convert the "amount" column in theordersDataFrame to numeric data types. - The
errors="coerce"parameter ensures that any values that cannot be converted to numbers are replaced withNaN(Not a Number) instead of raising an error. - This approach is useful for cleaning data, as it allows for the identification and handling of invalid entries in the "amount" column.
- After execution, the "amount" column will contain only numeric values or
NaN, making it suitable for further analysis or calculations.
Step 3: Fill missing status
orders["status"] = orders["status"].fillna("unknown")Explanation
- The code accesses the 'status' column of the DataFrame named
orders. - It uses the
fillna()method to replace anyNaN(missing) values in that column. - The replacement value for missing entries is set to the string "unknown".
- This operation helps in maintaining data integrity by ensuring there are no missing values in the 'status' column.
- The updated DataFrame retains all other data while only modifying the specified column.
Step 4: Remove duplicate orders
orders = orders.drop_duplicates(subset=["order_id"])Explanation
- The
drop_duplicatesmethod is called on theordersDataFrame. - The
subsetparameter specifies that duplicates should be identified based on theorder_idcolumn. - Only the first occurrence of each unique
order_idis retained in the DataFrame. - This operation helps in cleaning the data by ensuring that each order is represented only once.
- The result is a DataFrame with unique order entries, which is crucial for accurate data analysis.
Step 5: Find paid orders above 400
orders[
(orders["status"] == "paid") &
(orders["amount"] > 400)
]Explanation
- The code filters a DataFrame named
ordersto retrieve specific rows based on conditions. - It checks for rows where the
statuscolumn is equal to "paid". - Additionally, it ensures that the
amountcolumn has values greater than 400. - The result is a subset of the
ordersDataFrame that meets both criteria, useful for analyzing high-value transactions.
This is closer to real work because data is rarely clean when it arrives.
36. Practice Dataset: Workshop Attendance
Use this dataset for the exercises.
attendance = pd.DataFrame({
"learner": ["Aarav", "Meera", "Ishan", "Tara", "Kabir", "Nila", "Rohan", "Zoya", "Aarav"],
"topic": ["Python", "Pandas", "Python", "SQL", "Pandas", "Python", "SQL", "Pandas", "Python"],
"age": [21, 24, np.nan, 23, 26, 22, np.nan, 25, 21],
"visits": [3, 5, 2, np.nan, 4, 6, 1, 5, 3],
"score": [78, 91, 65, 72, 88, 95, 59, 84, 78],
"city": ["Pune", "Kochi", "Delhi", "Pune", "Jaipur", "Kochi", "Delhi", "Pune", "Pune"]
})Explanation
- Initializes a pandas DataFrame named
attendanceto store information about learners. - Contains columns for
learnernames,topicof study,age,visitsto sessions,scoreachieved, andcityof residence. - Includes missing values represented by
np.nanfor theageandvisitscolumns, indicating incomplete data for some learners. - The DataFrame is structured to facilitate analysis of attendance patterns and performance metrics across different topics and demographics.
37. Practice Questions With Solutions
Practice Task
Q1. Display basic information about the DataFrame.
attendance.info()
attendance.describe()Explanation
- The
attendance.info()method provides a concise summary of the DataFrame, including the number of entries, column data types, and memory usage. - The
attendance.describe()method generates descriptive statistics for numerical columns, such as count, mean, standard deviation, min, and max values. - Both methods are useful for understanding the structure and distribution of data within the DataFrame before performing further analysis.
- These functions help identify potential issues like missing values or outliers in the dataset.
Practice Task
Q2. Show alternate rows.
attendance.iloc[::2]Explanation
- The code uses the
ilocmethod from the pandas library to access rows by their integer index. - The slicing notation
::2indicates that every second row will be selected, starting from the first row (index 0). - This is useful for reducing the size of a DataFrame or for sampling data at regular intervals.
- The resulting output is a new DataFrame containing only the selected rows, which can be further processed or analyzed.
Practice Task
Q3. Show rows at positions 0, 2, and 6 with only learner and age.
attendance.loc[attendance.index[[0, 2, 6]], ["learner", "age"]]Explanation
- The code uses the
locmethod from the Pandas library to access specific data in a DataFrame namedattendance. - It retrieves rows at index positions 0, 2, and 6, which correspond to the first, third, and seventh entries in the DataFrame.
- The columns selected for output are "learner" and "age", meaning only these two attributes will be displayed for the specified rows.
- This method is useful for extracting a subset of data for analysis or reporting based on specific criteria.
Or:
attendance[["learner", "age"]].iloc[[0, 2, 6]]Explanation
- The code snippet utilizes the pandas library to manipulate a DataFrame named
attendance. - It selects the columns "learner" and "age" from the DataFrame.
- The
ilocmethod is used to retrieve rows at indices 0, 2, and 6. - This operation results in a new DataFrame containing only the specified rows and columns.
Practice Task
Q4. Select rows where visits are less than 4.
attendance[attendance["visits"] < 4]Explanation
- The code accesses a DataFrame named
attendance. - It filters the rows where the "visits" column has values less than 4.
- The result is a new DataFrame containing only those records that meet the specified condition.
- This operation is useful for identifying individuals who may need additional engagement or support.
Practice Task
Q5. Select rows where age or visits is missing.
attendance[
attendance["age"].isna() |
attendance["visits"].isna()
]Explanation
- The code filters the
attendanceDataFrame to identify rows where either theageorvisitscolumns contain missing values (NaN). - The
isna()function is used to check for NaN values in the specified columns. - The logical OR operator (
|) combines the two conditions, ensuring that any row with a NaN in either column is included in the result. - The resulting DataFrame will only contain entries that have incomplete data regarding age or visit counts.
Practice Task
Q6. Fill missing ages with the median age.
attendance["age"] = attendance["age"].fillna(attendance["age"].median())Explanation
- The code accesses the "age" column of the
attendanceDataFrame. - It uses the
fillna()method to replace anyNaN(missing) values in the "age" column. - The replacement value is calculated using the
median()function, which finds the median age from the existing values. - This approach helps maintain the integrity of the dataset by avoiding the removal of rows with missing values.
- Filling missing values with the median is a common practice to reduce bias in statistical analysis.
Practice Task
Q7. Fill missing visits with the most common visits value.
attendance["visits"] = attendance["visits"].fillna(attendance["visits"].mode()[0])Explanation
- The code accesses the 'visits' column of the 'attendance' DataFrame.
- It uses the
fillna()method to replace any NaN (missing) values. - The replacement value is determined by calculating the mode (most frequent value) of the 'visits' column.
- The mode is accessed using
attendance["visits"].mode()[0], which retrieves the first mode value. - This approach ensures that the 'visits' column has no missing values, improving data integrity for analysis.
Practice Task
Q8. Find total visits for learners attending Python workshops.
attendance[attendance["topic"] == "Python"]["visits"].sum()Explanation
- Filters the
attendanceDataFrame to include only rows where thetopiccolumn equals "Python". - Accesses the
visitscolumn from the filtered DataFrame. - Uses the
sum()function to compute the total number of visits for the Python topic. - Returns a single numeric value representing the aggregated visits.
- Assumes that the
attendanceDataFrame is already defined and contains the relevant columns.
Practice Task
Q9. Count learners per topic.
attendance["topic"].value_counts()Explanation
- The code accesses the "topic" column of the
attendanceDataFrame. - It utilizes the
value_counts()method to compute the frequency of each unique value in that column. - The result is a Series object that displays topics as the index and their corresponding counts as values.
- This is useful for quickly understanding the distribution of topics in the attendance data.
Practice Task
Q10. Count duplicate rows.
attendance.duplicated().sum()Explanation
- The
attendance.duplicated()method identifies duplicate rows in theattendanceDataFrame, returning a boolean Series. - The
sum()function then counts the number ofTruevalues in this Series, effectively giving the total number of duplicate entries. - This code is useful for data cleaning and validation, ensuring that each entry in the attendance record is unique.
- It helps in maintaining data integrity before performing further analysis or operations on the dataset.
Practice Task
Q11. Drop duplicate rows.
attendance = attendance.drop_duplicates()Explanation
- The
drop_duplicates()method is called on theattendanceDataFrame. - It eliminates any rows that are exact duplicates of others, ensuring each entry is unique.
- The operation modifies the DataFrame in place, meaning the original
attendancevariable will now only contain unique records. - This is useful for cleaning data before analysis or reporting, as duplicates can skew results.
Practice Task
Q12. Find average score by topic.
attendance.groupby("topic")["score"].mean()Explanation
- The code uses the
groupbymethod to organize the data in theattendanceDataFrame by the "topic" column. - It then selects the "score" column to perform calculations on.
- The
mean()function computes the average score for each unique topic. - This operation is useful for summarizing performance metrics across different topics.
- The result is a new Series with topics as the index and their corresponding average scores as values.
Practice Task
Q13. Find top cities by total visits.
attendance.groupby("city")["visits"].sum().sort_values(ascending=False)Explanation
- The code uses the
groupbymethod to aggregate data in theattendanceDataFrame based on the unique values in the "city" column. - It selects the "visits" column to calculate the total number of visits for each city.
- The
sum()function computes the total visits for each group (city). - Finally,
sort_values(ascending=False)arranges the resulting totals in descending order, allowing easy identification of cities with the highest visits.
Practice Task
Q14. Create a passed column where score is at least 70.
attendance["passed"] = attendance["score"] >= 70Explanation
- The code adds a new column named "passed" to the
attendanceDataFrame. - It evaluates whether each student's score is greater than or equal to 70.
- The result is a boolean value (
TrueorFalse) indicating if the student has passed. - This operation utilizes vectorized operations for efficient computation across the DataFrame.
Practice Task
Q15. Create a score_band column.
attendance["score_band"] = np.where(
attendance["score"] >= 85,
"strong",
np.where(attendance["score"] >= 70, "good", "needs_practice")
)Explanation
- Utilizes NumPy's
wherefunction to create a new columnscore_bandin theattendanceDataFrame. - Assigns the value "strong" if the
scoreis 85 or higher. - Assigns the value "good" if the
scoreis between 70 and 84. - Assigns the value "needs_practice" for scores below 70.
- This approach allows for efficient conditional assignment in a DataFrame.
38. Practice Dataset: League Matches
This dataset is fictional and made only for practice.
matches = pd.DataFrame({
"season": [2023, 2023, 2023, 2024, 2024, 2024, 2024],
"city": ["Pune", "Delhi", "Kochi", "Pune", "Jaipur", "Delhi", "Kochi"],
"team_a": ["Falcons", "Tigers", "Falcons", "Rockets", "Tigers", "Falcons", "Rockets"],
"team_b": ["Rockets", "Falcons", "Tigers", "Falcons", "Rockets", "Tigers", "Tigers"],
"winner": ["Falcons", "Tigers", "Falcons", "Falcons", "Rockets", "Falcons", "Tigers"],
"margin": [12, 8, 20, 5, 15, 6, 9],
"final": [False, False, True, False, False, True, False]
})Explanation
- A pandas DataFrame named
matchesis created to organize data related to sports match results. - The DataFrame contains columns for the season year, city of the match, teams involved (team_a and team_b), the winner of each match, the margin of victory, and whether the match was a final.
- The data includes matches from two different seasons (2023 and 2024) with various cities and teams participating.
- The
finalcolumn indicates if a match is a final, with boolean values (True or False) for each match entry. - This structure allows for easy analysis and manipulation of match data using pandas functionalities.
39. Match Practice Questions With Solutions
Practice Task
Q1. Find all final winners.
matches.loc[matches["final"], ["season", "winner"]]Explanation
- Utilizes the
locmethod to filter rows in thematchesDataFrame where the "final" column isTrue. - Selects only the "season" and "winner" columns from the filtered results.
- This operation is useful for analyzing the outcomes of championship matches across different seasons.
Practice Task
Q2. Count matches won by Falcons in Delhi.
matches[
(matches["winner"] == "Falcons") &
(matches["city"] == "Delhi")
].shape[0]Explanation
- Filters the
matchesDataFrame to include only rows where the winner is "Falcons" and the city is "Delhi". - Uses the bitwise AND operator
&to combine the two conditions for filtering. - The
shape[0]attribute returns the number of rows that meet the specified criteria, effectively counting the matches. - This code is useful for analyzing specific match outcomes based on team performance in a particular location.
Practice Task
Q3. Find top teams by number of matches played.
Because a team can appear in team_a or team_b, combine both columns.
played = pd.concat([matches["team_a"], matches["team_b"]])
played.value_counts()Explanation
- The code uses the
pd.concat()function from the pandas library to combine two columns,team_aandteam_b, from thematchesDataFrame into a single Series namedplayed. - The
value_counts()method is then called on theplayedSeries to count how many times each team appears in the combined list of matches. - This allows for easy analysis of team participation across all matches in the dataset.
- The resulting output will show the teams along with the number of matches they played, sorted in descending order by count.
Practice Task
Q4. Build a team-vs-team summary function.
def team_summary(data, team_one, team_two):
mask_one = (data["team_a"] == team_one) | (data["team_b"] == team_one)
mask_two = (data["team_a"] == team_two) | (data["team_b"] == team_two)
head_to_head = data[mask_one & mask_two]
return {
"matches": head_to_head.shape[0],
f"{team_one}_wins": (head_to_head["winner"] == team_one).sum(),
f"{team_two}_wins": (head_to_head["winner"] == team_two).sum(),
"average_margin": head_to_head["margin"].mean()
}
team_summary(matches, "Falcons", "Tigers")Explanation
- The function
team_summarytakes three parameters: a DataFramedatacontaining match details, and two team namesteam_oneandteam_two. - It creates boolean masks to filter matches where either team participated using logical OR operations.
- The filtered DataFrame
head_to_headcontains only the matches involving both specified teams. - The function returns a dictionary with the total number of matches, the number of wins for each team, and the average margin of victory across those matches.
- The function is called with a DataFrame
matchesand two team names, "Falcons" and "Tigers", to retrieve their match statistics.
Practice Task
Q5. Find the top cities where Falcons played.
falcons_matches = matches[
(matches["team_a"] == "Falcons") |
(matches["team_b"] == "Falcons")
]
falcons_matches["city"].value_counts()Explanation
- The code filters a DataFrame named
matchesto include only those rows where eitherteam_aorteam_bis "Falcons". - The filtered DataFrame is stored in the variable
falcons_matches. - It then counts the occurrences of each unique city in the
citycolumn of thefalcons_matchesDataFrame. - The result is a Series that shows how many matches involving the Falcons took place in each city.
Practice Task
Q6. Find average winning margin for Falcons in 2024.
matches[
(matches["season"] == 2024) &
(
(matches["team_a"] == "Falcons") |
(matches["team_b"] == "Falcons")
)
]["margin"].mean()Explanation
- Filters the
matchesDataFrame to include only rows where the season is 2024. - Further narrows down the results to matches where either
team_aorteam_bis "Falcons". - Selects the
margincolumn from the filtered DataFrame. - Computes the mean of the
marginvalues to determine the average margin for the specified conditions.
40. Common Beginner Mistakes
Mistake 1: Forgetting Double Brackets For Multiple Columns
Wrong:
students["name", "city"]Explanation
- This code snippet retrieves the "name" and "city" columns from a DataFrame named
students. - The syntax
students["name", "city"]indicates that multiple columns can be accessed simultaneously. - This operation is useful for filtering data to focus on specific attributes of the students.
- Ensure that the
studentsDataFrame is defined and contains the specified columns to avoid errors.
Right:
students[["name", "city"]]Explanation
- This code snippet accesses a pandas DataFrame named
students. - It retrieves only the "name" and "city" columns from the DataFrame.
- The double square brackets indicate that a subset of columns is being selected, returning a new DataFrame.
- This operation is useful for focusing on specific data attributes for analysis or reporting.
Mistake 2: Confusing loc And iloc
Use:
locfor labelsilocfor integer positions
students.iloc[0]
students_by_id.loc["Meera"]Explanation
- The first line retrieves the first row of the
studentsDataFrame using integer-location based indexing withiloc. - The second line accesses the record of a student named "Meera" from the
students_by_idDataFrame using label-based indexing withloc. - Both methods are part of the pandas library, which is commonly used for data manipulation and analysis in Python.
- These indexing techniques allow for efficient retrieval of data based on either position or label, enhancing data handling capabilities.
Mistake 3: Missing Parentheses In Multiple Conditions
Wrong:
students[students["python_score"] > 80 & students["sql_score"] > 80]Explanation
- The code filters a DataFrame named
studentsto select rows where thepython_scoreis greater than 80 and thesql_scoreis also greater than 80. - The logical condition uses the bitwise AND operator
&to combine the two conditions for filtering. - It is important to use parentheses around each condition to ensure proper evaluation order in the expression.
- The result is a subset of the original DataFrame containing only students who excel in both subjects.
Right:
students[
(students["python_score"] > 80) &
(students["sql_score"] > 80)
]Explanation
- The code filters a DataFrame named
studentsto find rows where both thepython_scoreandsql_scoreare greater than 80. - It uses boolean indexing with the
&operator to combine conditions for both scores. - The resulting DataFrame will only include students who excel in both subjects, useful for identifying top performers.
- This operation is efficient for data analysis in educational contexts, allowing for targeted insights.
Mistake 4: Using = Instead Of == In Filters
Wrong:
# This is assignment, not comparison
students["city"] = "Pune"Explanation
- The code modifies the dictionary
studentsby assigning the string value"Pune"to the key"city". - This operation updates the existing value associated with the key if it already exists or creates a new key-value pair if it does not.
- The comment clarifies that the single equals sign (
=) is used for assignment, not for comparison, which uses double equals (==).
Right:
students[students["city"] == "Pune"]Explanation
- The code snippet accesses a DataFrame named
students. - It filters the DataFrame to include only those rows where the "city" column has the value "Pune".
- The result is a new DataFrame containing only the students from Pune.
- This operation is useful for analyzing or processing data specific to a particular location.
Mistake 5: Dropping Missing Values Too Early
Before dropping rows, ask:
- Why is the value missing?
- Is the row still useful?
- Can the value be filled safely?
- Will dropping rows bias the analysis?
Mistake 6: Changing Original Data Accidentally
If you want to experiment, create a copy.
practice = students.copy()Explanation
- The
copy()method is used to create a shallow copy of thestudentslist. - The new list,
practice, is independent of the originalstudentslist, meaning changes to one will not affect the other. - This is useful for scenarios where you want to manipulate data without altering the original dataset.
- The operation is efficient for lists, ensuring that the original data remains intact while allowing for experimentation or practice.
41. DataFrame Cheat Sheet
# Create
pd.DataFrame(data)
pd.read_csv("file.csv")
# Inspect
df.shape
df.dtypes
df.index
df.columns
df.head()
df.tail()
df.sample(5)
df.info()
df.describe()
# Missing and duplicates
df.isna().sum()
df.duplicated().sum()
df.drop_duplicates()
# Select columns
df["col"]
df[["col1", "col2"]]
# Select rows
df.iloc[0]
df.iloc[0:5]
df.loc["label"]
df.loc["start":"stop"]
# Select rows and columns
df.iloc[0:5, 0:3]
df.loc[df["score"] > 80, ["name", "score"]]
# Filter
df[df["city"] == "Pune"]
df[(df["score"] > 80) & (df["city"] == "Pune")]
df[df["city"].isin(["Pune", "Kochi"])]
df[df["score"].between(70, 90)]
# Add or update
df["new_col"] = value
df["total"] = df["a"] + df["b"]
df.rename(columns={"old": "new"})
# Convert
df["col"].astype("category")
pd.to_numeric(df["amount"], errors="coerce")
# Summarize
df["city"].value_counts()
df.groupby("city")["score"].mean()
df.sort_values("score", ascending=False)Explanation
- The code demonstrates various operations on a Pandas DataFrame, including creation, inspection, and manipulation of data.
- It shows how to read data from a CSV file and inspect the DataFrame's shape, data types, and summary statistics.
- The snippet includes methods for handling missing values and duplicates, allowing for data cleaning.
- It illustrates techniques for selecting specific rows and columns, as well as filtering data based on conditions.
- The code also covers adding new columns, updating existing ones, converting data types, and summarizing data through grouping and counting.
42. Interview-Style Questions
1. What is a Pandas DataFrame?
A DataFrame is a two-dimensional labeled data structure with rows and columns. It is commonly used for tabular data.
2. What is the difference between a Series and a DataFrame?
A Series is one-dimensional. A DataFrame is two-dimensional and can be thought of as multiple Series sharing the same row index.
3. What does df.shape return?
It returns a tuple: (number_of_rows, number_of_columns).
4. What is the difference between loc and iloc?
loc selects by labels. iloc selects by integer positions.
5. Why do we use parentheses with & and | in filters?
Because Pandas comparisons must be grouped before combining them. Without parentheses, Python operator precedence can produce errors or incorrect logic.
6. How do you select multiple columns?
Use a list of column names:
df[["name", "score"]]Explanation
- This code snippet retrieves two columns, "name" and "score", from a pandas DataFrame named
df. - The double square brackets indicate that a subset of the DataFrame is being created, returning only the specified columns.
- The result is a new DataFrame containing only the data from the "name" and "score" columns, which can be useful for analysis or visualization.
- This operation does not modify the original DataFrame
df; it creates a new view of the data.
7. How do you count missing values per column?
df.isna().sum()Explanation
- The
isna()function is called on a DataFramedf, which returns a DataFrame of the same shape with boolean values indicating the presence of missing values (True for NaN). - The
sum()function is then applied to this boolean DataFrame, which counts the number of True values (missing values) in each column. - The result is a Series object where the index corresponds to the column names and the values represent the count of missing entries for each column.
- This is useful for data cleaning and preprocessing, allowing users to quickly identify columns that may require attention due to missing data.
8. How do you remove duplicate rows?
df.drop_duplicates()Explanation
- The
drop_duplicates()method is called on a pandas DataFramedf. - It identifies and removes any rows that are exact duplicates of each other.
- By default, it keeps the first occurrence of each duplicate and drops subsequent ones.
- This operation helps in cleaning the dataset for accurate analysis and processing.
- The method can be customized with parameters to specify which columns to consider for identifying duplicates.
9. How do you convert a column to numeric?
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")Explanation
- The code uses the
pd.to_numeric()function from the pandas library to convert the values in the 'amount' column of a DataFrame (df) to numeric types. - The
errors="coerce"parameter ensures that any values that cannot be converted to numeric will be replaced withNaN(Not a Number), preventing the code from raising an error. - This is useful for cleaning data, especially when dealing with mixed types or invalid entries in the 'amount' column.
- The resulting 'amount' column will contain only numeric values or
NaN, making it easier to perform numerical operations or analyses later on.
10. How do you find the average score by city?
df.groupby("city")["score"].mean()Explanation
- The code uses the
groupbymethod on a DataFramedfto organize data based on unique values in the "city" column. - It then selects the "score" column to compute the mean for each group created by the cities.
- The result is a new Series containing the average score for each city, which can be useful for data analysis and reporting.
43. A Good Learning Path After This
Once you understand DataFrames, continue with:
- deeper missing-value handling
mergeandjoingroupbyin detail- pivot tables
- time series data
- plotting with Pandas
- performance-friendly Pandas patterns
- exploratory data analysis projects
DataFrame basics are the foundation. Most Pandas work is just these ideas repeated on larger and messier datasets.
Sources And Further Reading
This guide uses original examples and practice datasets. The notebook was used only as a topic outline, not as copied lesson text or copied datasets.
Official Pandas references:
- Pandas DataFrame introduction: https://pandas.pydata.org/docs/user_guide/dsintro.html
- Pandas getting started tutorials: https://pandas.pydata.org/docs/getting_started/intro_tutorials/
- Indexing and selecting data: https://pandas.pydata.org/docs/user_guide/indexing.html
- Group by user guide: https://pandas.pydata.org/docs/user_guide/groupby.html
- Pandas API reference: https://pandas.pydata.org/docs/reference/
