Pandas Series: Create, Index, Clean, Analyze, and Practice
Pandas is one of the most important Python libraries for data analysis.
If NumPy gives you fast arrays, Pandas gives you labeled data structures that feel closer to real-world tables.
The first Pandas object you should understand is the Series.
A Pandas Series is a one-dimensional labeled array. You can think of it as a single column of data with row labels.
Examples of Series-style data:
- daily website visitors
- monthly revenue
- marks scored by students
- product prices
- customer ratings
- city temperatures
- movie genres
- app downloads by date
In this lesson, you will learn how to create, inspect, select, clean, analyze, and transform Pandas Series.
What You Will Learn
By the end, you should be able to:
- explain what a Pandas Series is
- create Series from lists, dictionaries, and scalar values
- use custom indexes and names
- inspect size, dtype, name, index, values, and is_unique
- read a CSV column as a Series
- use head, tail, sample, value_counts, sort_values, and sort_index
- calculate sum, mean, median, mode, std, var, min, max, and describe
- select values using labels and positions
- understand loc and iloc
- edit values safely
- use boolean indexing
- handle missing data with isna, dropna, and fillna
- convert values using astype and pd.to_numeric
- use between, clip, duplicated, drop_duplicates, isin, map, and apply
- solve beginner Pandas Series practice problems
1. Installing And Importing Pandas
Install Pandas if needed:
pip install pandas
Import it with the standard alias:
import pandas as pd
import numpy as np
Most Pandas code uses pd as the alias.
2. What Is A Pandas Series?
A Series is a one-dimensional labeled array.
Create a simple Series:
import pandas as pd
visitors = pd.Series([120, 135, 150, 160])
print(visitors)
Output:
0 120
1 135
2 150
3 160
dtype: int64
The left side is the index.
The right side is the value.
By default, Pandas creates a numeric index starting from 0.
3. Series vs Python List
A Python list stores values by position.
visitors_list = [120, 135, 150, 160]
A Series stores values with labels.
visitors = pd.Series([120, 135, 150, 160])
Why does this matter?
Because real data often needs labels.
visitors = pd.Series(
[120, 135, 150, 160],
index=["Mon", "Tue", "Wed", "Thu"],
)
print(visitors)
Output:
Mon 120
Tue 135
Wed 150
Thu 160
dtype: int64
Now each value has a meaningful row label.
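A minimal sketch of the difference, reusing the values from this section: the list can only be read by position, while the Series can also be read by its row label.

```python
import pandas as pd

visitors_list = [120, 135, 150, 160]
visitors = pd.Series(
    [120, 135, 150, 160],
    index=["Mon", "Tue", "Wed", "Thu"],
)

# A list only knows positions: Wednesday is the third item.
wed_from_list = visitors_list[2]

# A Series can be read by its row label.
wed_from_series = visitors["Wed"]

print(wed_from_list, wed_from_series)
```

Both lookups return the same number, but only the second one says which day it refers to.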
4. Creating A Series From A List
topics = pd.Series(["Python", "NumPy", "Pandas", "SQL"])
print(topics)
Output:
0 Python
1 NumPy
2 Pandas
3 SQL
dtype: object
In many Pandas versions, strings are displayed with the object dtype.
Create a numeric Series:
scores = pd.Series([82, 91, 76, 88])
print(scores)
Output:
0 82
1 91
2 76
3 88
dtype: int64
5. Creating A Series With Custom Index
scores = pd.Series(
[82, 91, 76, 88],
index=["Asha", "Ravi", "Meera", "Kabir"],
)
print(scores)
Output:
Asha 82
Ravi 91
Meera 76
Kabir 88
dtype: int64
Now the index contains student names.
You can select by label:
print(scores["Ravi"])
Output:
91
6. Naming A Series
A Series can have a name.
scores = pd.Series(
[82, 91, 76, 88],
index=["Asha", "Ravi", "Meera", "Kabir"],
name="math_score",
)
print(scores)
Output:
Asha 82
Ravi 91
Meera 76
Kabir 88
Name: math_score, dtype: int64
The name is useful when a Series becomes a column in a DataFrame.
7. Creating A Series From A Dictionary
When you create a Series from a dictionary, keys become the index and values become the data.
course_minutes = {
"Python Basics": 45,
"NumPy Arrays": 55,
"Pandas Series": 60,
"SQL Joins": 50,
}
minutes = pd.Series(course_minutes, name="duration_minutes")
print(minutes)
Output:
Python Basics 45
NumPy Arrays 55
Pandas Series 60
SQL Joins 50
Name: duration_minutes, dtype: int64
This is one of the cleanest ways to create a labeled Series.
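As a small illustrative sketch, pd.Series also accepts an index argument together with a dictionary. In that case only the listed labels are used, in the listed order, and a label with no matching key becomes NaN. The label "Git Basics" is a made-up example, not part of the lesson data.

```python
import pandas as pd

course_minutes = {
    "Python Basics": 45,
    "NumPy Arrays": 55,
    "Pandas Series": 60,
}

# "Git Basics" is a hypothetical label with no key in the dictionary.
selected = pd.Series(
    course_minutes,
    index=["Pandas Series", "Python Basics", "Git Basics"],
)
print(selected)
```

Because of the NaN entry, the values are stored as floats rather than integers.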
8. Creating A Series From A Scalar
You can repeat one value across an index.
status = pd.Series("draft", index=["post_1", "post_2", "post_3"])
print(status)
Output:
post_1 draft
post_2 draft
post_3 draft
dtype: object
This is useful when creating default values.
9. Important Series Attributes
Create a Series:
ratings = pd.Series(
[4.5, 4.8, 3.9, 4.8, np.nan],
index=["course_a", "course_b", "course_c", "course_d", "course_e"],
name="rating",
)
size
size returns the total number of values, including missing values.
print(ratings.size)
Explanation
- ratings is the Pandas Series created above.
- The size attribute returns the total number of elements in the Series, counting missing values such as NaN.
- print displays that number.
Output:
5
count
count() returns the number of non-missing values.
print(ratings.count())
Explanation
- count() is a Series method that counts only the non-missing values.
- ratings holds five elements, one of which is NaN, so count() returns 4 while size returns 5.
- Comparing the two is a quick way to see how many values are missing.
Output:
4
This distinction is important in interviews.
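The difference between size and count() can be checked side by side. This is a minimal sketch that rebuilds the ratings values from above without the custom index; isna().sum() is used here as one way to count the missing entries.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.5, 4.8, 3.9, 4.8, np.nan], name="rating")

total = ratings.size            # every element, including NaN
present = ratings.count()       # only non-missing values
missing = ratings.isna().sum()  # only missing values

print(total, present, missing)
```

The three numbers always satisfy size = count + missing.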
dtype
print(ratings.dtype)
Explanation
- ratings.dtype returns the data type of the values stored in the Series.
- The values here are floats, and NaN is itself a float, so the dtype is float64.
- Knowing the dtype helps with debugging, because it determines which operations are valid and how they behave.
Output:
float64
name
print(ratings.name)
Explanation
- ratings.name returns the name given to the Series when it was created, here "rating".
- A Series created without a name has its name set to None.
Output:
rating
index
print(ratings.index)
Explanation
- ratings.index returns the Index object that holds the row labels of the Series.
- Here the labels are course_a through course_e.
- Inspecting the index is a quick way to check how the data is labeled.
The index stores row labels.
values
print(ratings.values)
Explanation
- ratings.values returns the data of the Series without the index labels.
- For this numeric Series, the result is a NumPy array of floats that includes the NaN entry.
This gives the underlying values as an array-like object.
is_unique
print(ratings.is_unique)
Explanation
- is_unique is a Series attribute that returns True when no value in the Series is repeated.
- It checks the values, not the index labels; use ratings.index.is_unique to check the labels.
- This is useful for validating data that should not contain duplicates, such as IDs.
Output:
False
It is False because 4.8 appears more than once.
10. Reading A CSV Column As A Series
Assume you have a CSV file called daily_visitors.csv:
date,visitors
2026-01-01,120
2026-01-02,135
2026-01-03,150
Read the file:
df = pd.read_csv("daily_visitors.csv")
print(df)
Explanation
- pd.read_csv loads the CSV file "daily_visitors.csv" into a Pandas DataFrame called df.
- print(df) displays the whole DataFrame so you can inspect its columns and rows.
Select one column as a Series:
visitors = df["visitors"]
print(type(visitors))
print(visitors)
Explanation
- df["visitors"] selects the visitors column and assigns it to the variable visitors.
- print(type(visitors)) confirms that the selected column is a Series.
- print(visitors) displays the values together with the default integer index.
If you want the date column as the index:
df = pd.read_csv("daily_visitors.csv", index_col="date")
visitors = df["visitors"]
print(visitors)
Explanation
- index_col="date" tells read_csv to use the date column as the index of the DataFrame.
- df["visitors"] then returns a Series whose index holds the dates.
- print(visitors) shows each visitor count next to its date label.
CSV reading usually returns a DataFrame. A single selected column is a Series.
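The following sketch runs without a file on disk. It assumes the same three rows as daily_visitors.csv and uses io.StringIO as a stand-in for the file, which is an illustrative substitution rather than part of the lesson's setup.

```python
import io

import pandas as pd

# io.StringIO stands in for the daily_visitors.csv file.
csv_text = """date,visitors
2026-01-01,120
2026-01-02,135
2026-01-03,150
"""

df = pd.read_csv(io.StringIO(csv_text), index_col="date")
visitors = df["visitors"]

print(isinstance(df, pd.DataFrame))     # the whole file is a DataFrame
print(isinstance(visitors, pd.Series))  # one selected column is a Series
print(visitors.loc["2026-01-02"])       # look up a value by its date label
```

Without parse_dates, the dates in the index are plain strings, so they are looked up as strings.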
11. head() And tail()
Use head() to preview the first rows.
sales = pd.Series([120, 135, 150, 160, 155, 170, 180])
print(sales.head())
print(sales.head(3))
Explanation
- sales is a Pandas Series of seven sales figures.
- sales.head() returns the first five values by default.
- sales.head(3) returns only the first three values.
Use tail() to preview the last rows.
print(sales.tail())
print(sales.tail(2))
Explanation
- sales.tail() returns the last five values of the sales Series by default.
- sales.tail(2) returns only the last two values.
- This is useful for checking the end of a dataset, for example the most recent entries.
These are useful for checking data quickly.
12. sample()
sample() returns random rows.
products = pd.Series(
["notebook", "pen", "marker", "bag", "bottle", "eraser"],
name="product",
)
print(products.sample(3, random_state=42))
Explanation
- products is a Pandas Series of product names.
- sample(3) randomly selects three values and keeps their original index labels.
- random_state=42 fixes the random seed, so the same three values are selected on every run.
random_state makes the sample reproducible.
13. value_counts()
value_counts() counts unique values.
categories = pd.Series(
["free", "pro", "free", "team", "pro", "free"],
name="plan",
)
print(categories.value_counts())
Explanation
- categories is a Pandas Series of subscription plan names.
- value_counts() counts how often each unique value occurs.
- The result is a new Series sorted by count in descending order, with the unique values as its index.
Output:
free 3
pro 2
team 1
Name: count, dtype: int64
Use this for categorical summaries.
To include missing values:
print(categories.value_counts(dropna=False))
Explanation
- By default, value_counts() leaves missing values out of the result.
- dropna=False adds a row that counts the NaN values.
- categories has no missing values, so the output here matches the previous one; the option matters for data with gaps.
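To see dropna=False make a difference, the data needs at least one missing entry. The sketch below uses a variation of the plan data with one NaN added for illustration, and also shows normalize=True, which returns proportions instead of counts.

```python
import numpy as np
import pandas as pd

# A variation of the plan data with one missing entry (illustrative values).
plans = pd.Series(["free", "pro", "free", np.nan, "pro", "free"], name="plan")

default_counts = plans.value_counts()          # NaN is left out
all_counts = plans.value_counts(dropna=False)  # NaN gets its own row
shares = plans.value_counts(normalize=True)    # proportions of non-missing values

print(default_counts)
print(all_counts)
print(shares)
```

Comparing the totals of the first two results is a quick check for missing categories.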
14. Sorting Values
scores = pd.Series(
[82, 91, 76, 88],
index=["Asha", "Ravi", "Meera", "Kabir"],
)
print(scores.sort_values())
Explanation
- scores is a Pandas Series of four scores labeled with student names.
- sort_values() returns a new Series sorted by value in ascending order.
- Each label stays attached to its value, so the names move together with the scores.
Output:
Meera 76
Asha 82
Kabir 88
Ravi 91
dtype: int64
Descending order:
print(scores.sort_values(ascending=False))
Explanation
- ascending=False sorts from the highest value to the lowest.
- The result is a new Series; scores itself is unchanged.
Get the top scorer:
top_student = scores.sort_values(ascending=False).head(1)
print(top_student)
Explanation
- sort_values(ascending=False) puts the highest score first.
- head(1) keeps only the first entry, so top_student is a one-element Series holding the top scorer's name and score.
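Sorting the whole Series is not the only way to find the top scorer. As an alternative sketch, idxmax() returns the label of the highest value and nlargest() returns the highest few values; both are standard Series methods, used here on the same scores.

```python
import pandas as pd

scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)

top_name = scores.idxmax()    # label of the highest value
top_score = scores.max()      # the highest value itself
top_two = scores.nlargest(2)  # the two highest values, as a Series

print(top_name, top_score)
print(top_two)
```

Use idxmax() when you only need the name, and nlargest() when you need a short leaderboard.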
15. Sorting Index
scores = pd.Series(
[82, 91, 76, 88],
index=["Asha", "Ravi", "Meera", "Kabir"],
)
print(scores.sort_index())
Explanation
- scores is created with student names as the index.
- sort_index() returns a new Series sorted by index label, here alphabetically.
- The values stay attached to their labels.
Output:
Asha 82
Kabir 88
Meera 76
Ravi 91
dtype: int64
Use sort_index() when label order matters.
16. inplace=True: Should You Use It?
Many Pandas methods can return a new object.
sorted_scores = scores.sort_values()
Explanation
- sort_values() sorts in ascending order by default and returns a new Series.
- The sorted result is stored in sorted_scores.
- The original scores Series is not modified.
Some methods also support inplace=True.
scores.sort_values(inplace=True)
Explanation
- inplace=True sorts the scores Series itself instead of returning a new one.
- With inplace=True the method returns None, so its result should not be assigned to a variable.
For learning and production code, returning a new object is often clearer.
Why?
- it avoids accidental mutation
- it makes code easier to debug
- it works well with method chaining
Prefer:
scores = scores.sort_values()
Explanation
- sort_values() returns a new, sorted Series.
- Assigning the result back to scores rebinds the name to the sorted Series; no object is mutated in place.
- The same pattern works inside a method chain.
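The two styles can be compared directly. This sketch uses a three-student version of the scores data and records the index order before and after each call; the variable names order_before and order_after are illustrative.

```python
import pandas as pd

scores = pd.Series([82, 91, 76], index=["Asha", "Ravi", "Meera"])

# Without inplace, a new Series is returned and the original keeps its order.
sorted_scores = scores.sort_values()
order_before = list(scores.index)

# With inplace=True, the Series itself is reordered and None is returned.
result = scores.sort_values(inplace=True)
order_after = list(scores.index)

print(order_before)
print(list(sorted_scores.index))
print(result)
print(order_after)
```

A common mistake is writing scores = scores.sort_values(inplace=True), which replaces the Series with None.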
17. Mathematical Methods
Create a Series:
orders = pd.Series([12, 18, 10, 25, 17, np.nan], name="orders")
Explanation
- orders is a Pandas Series of order quantities named "orders".
- np.nan marks one missing value, which makes the dtype float64.
sum
print(orders.sum())
Explanation
- sum() adds up the values in the orders Series.
- The NaN entry is skipped, so the result is 12 + 18 + 10 + 25 + 17 = 82.
- The result is a float because the Series has dtype float64.
Output:
82.0
By default, missing values are skipped.
mean
print(orders.mean())
Explanation
- mean() returns the average of the non-missing values.
- Here that is 82 / 5 = 16.4.
median
print(orders.median())
Explanation
- median() returns the middle value of the sorted non-missing values.
- The sorted values are 10, 12, 17, 18, 25, so the median is 17.
- With an even number of values, the median is the average of the two middle values.
mode
print(orders.mode())
Explanation
- mode() returns the value or values that occur most often, as a Series.
- Every value in orders occurs exactly once, so all five values are returned.
mode() can return more than one value.
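A short sketch of a tie, using hypothetical order counts in which two values are equally frequent. Because mode() always returns a Series, a single value has to be picked explicitly when a scalar is needed.

```python
import pandas as pd

# Hypothetical order counts in which 12 and 18 both occur twice.
daily_orders = pd.Series([12, 18, 12, 25, 18])

modes = daily_orders.mode()  # a Series, even when there is only one mode
print(modes.tolist())

first_mode = modes.iloc[0]   # pick one value explicitly if a scalar is needed
print(first_mode)
```

The returned modes are sorted, so iloc[0] gives the smallest of the tied values.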
standard deviation and variance
print(orders.std())
print(orders.var())
Explanation
- std() returns the standard deviation, which measures how spread out the values are.
- var() returns the variance, which is the square of the standard deviation.
- Both use the sample formula (ddof=1) by default and skip missing values.
min and max
print(orders.min())
print(orders.max())
Explanation
- min() returns the smallest non-missing value, here 10.0.
- max() returns the largest non-missing value, here 25.0.
- Together they give the range of the data.
18. describe()
describe() gives a quick statistical summary.
print(orders.describe())
Explanation
- describe() returns a Series of summary statistics for orders.
- For numeric data it reports count, mean, standard deviation, minimum, quartiles, and maximum.
- Missing values are excluded, which is why count is 5 rather than 6.
Possible output:
count 5.000000
mean 16.400000
std 5.856620
min 10.000000
25% 12.000000
50% 17.000000
75% 18.000000
max 25.000000
Name: orders, dtype: float64
For text data:
plans = pd.Series(["free", "pro", "free", "team", "free"])
print(plans.describe())
Explanation
- plans is a Pandas Series of subscription plan names.
- For text data, describe() returns a different set of statistics than for numbers.
- Here count is 5, unique is 3, top is "free", and freq is 3.
It reports count, unique values, top value, and frequency.
19. Selecting Values By Position With iloc
Use iloc for integer-position selection.
scores = pd.Series(
[82, 91, 76, 88],
index=["Asha", "Ravi", "Meera", "Kabir"],
)
print(scores.iloc[0])
print(scores.iloc[1:3])
print(scores.iloc[[0, 3]])
Explanation
- scores is a Pandas Series with student names as the index.
- iloc[0] returns the value at the first position, Asha's score of 82.
- iloc[1:3] returns positions 1 and 2 (Ravi and Meera); position 3 is excluded.
- iloc[[0, 3]] returns positions 0 and 3 (Asha and Kabir) as a Series.
iloc ignores labels and uses positions.
20. Selecting Values By Label With loc
Use loc for label-based selection.
print(scores.loc["Ravi"])
print(scores.loc["Asha":"Meera"])
Explanation
- scores.loc["Ravi"] returns the value labeled "Ravi", which is 91.
- scores.loc["Asha":"Meera"] returns every entry from "Asha" through "Meera", including "Meera".
- loc selects by label, and the slice follows the order of the index, not alphabetical order.
- scores is the Series defined in the previous section.
Important:
Label slicing with loc includes the stop label when it exists.
Position slicing with iloc excludes the stop position.
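The two slicing rules are easiest to see on the same Series. This is a minimal sketch using the scores from above; the names by_label and by_position are illustrative.

```python
import pandas as pd

scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)

by_label = scores.loc["Asha":"Meera"]  # stop label "Meera" is included
by_position = scores.iloc[0:2]         # stop position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```

The label slice returns three entries and the position slice returns two, even though "Meera" sits at position 2.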
21. Why Avoid Ambiguous Integer Indexing?
Consider this Series:
numbers = pd.Series([100, 200, 300], index=[10, 20, 30])
Explanation
- numbers is a Pandas Series holding 100, 200, and 300.
- Its index labels are the integers 10, 20, and 30, not the default 0, 1, 2.
- Integer labels that differ from the positions make plain bracket access easy to misread.
This can be confusing:
numbers[10]
Explanation
- numbers is a Series, not a list, and its index contains integers.
- With an integer index, plain brackets look up the label, so numbers[10] returns 100.
- numbers[0] raises a KeyError because there is no label 0, even though a value exists at position 0.
- A reader who expects list-style positions can easily misread this code.
Does 10 mean label or position?
Use explicit access:
print(numbers.loc[10])
print(numbers.iloc[0])
Explanation
- numbers.loc[10] selects by label and returns 100.
- numbers.iloc[0] selects by position and also returns 100.
- loc raises a KeyError for a missing label, and iloc raises an IndexError for a position that is out of range.
Good Pandas code is explicit about label vs position.
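The sketch below puts the three access styles next to each other on the numbers Series. The flag lookup_failed is an illustrative name used to record that plain brackets with 0 do not find a label.

```python
import pandas as pd

numbers = pd.Series([100, 200, 300], index=[10, 20, 30])

by_label = numbers.loc[10]     # label 10
by_position = numbers.iloc[0]  # position 0

# Plain brackets on an integer index look up labels, so 0 is not found.
try:
    numbers[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(by_label, by_position, lookup_failed)
```

With loc and iloc, the intent is visible in the code and does not depend on what the index happens to contain.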
22. Slicing A Series
weekly_sales = pd.Series(
[120, 135, 150, 160, 155, 170, 180],
index=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
)
print(weekly_sales.iloc[1:4])
print(weekly_sales.loc["Tue":"Thu"])
Explanation
- weekly_sales holds one sales figure per day of the week.
- iloc[1:4] selects positions 1, 2, and 3; the stop position 4 is excluded.
- loc["Tue":"Thu"] selects the labels from "Tue" through "Thu"; the stop label is included.
Both select Tuesday through Thursday here, but they use different rules.
23. Editing Values
Create a Series:
scores = pd.Series(
[82, 91, 76, 88],
index=["Asha", "Ravi", "Meera", "Kabir"],
)
Explanation
- scores is a Pandas Series of four scores labeled "Asha", "Ravi", "Meera", and "Kabir".
- The labels let you read and update each score by name.
Edit by label:
scores.loc["Meera"] = 80
Explanation
- loc selects the entry labeled "Meera" and assigns it the value 80.
- The label already exists, so the existing value 76 is overwritten.
- Assigning to a label that does not exist adds a new entry instead.
Edit by position:
scores.iloc[0] = 85
Explanation
- iloc[0] selects the first position in the Series, which holds Asha's score.
- The value at that position is replaced with 85.
- Unlike loc, iloc cannot add new entries; the position must already exist.
Add a new label:
scores.loc["Nisha"] = 92
Explanation
- "Nisha" is not yet in the index, so this assignment adds a new entry to the Series.
- The new entry is placed at the end with the value 92.
Print:
print(scores)
Explanation
- print(scores) displays the updated Series: Asha 85, Ravi 91, Meera 80, Kabir 88, and Nisha 92.
24. Copying A Series Safely
When you take a subset and plan to modify it, use .copy().
top_scores = scores.head(2).copy()
top_scores.iloc[0] = 999
print(top_scores)
print(scores)
Explanation
- scores.head(2).copy() creates an independent copy of the first two entries of the scores Series.
- top_scores.iloc[0] = 999 changes only the copy.
- Printing both shows that top_scores has changed while scores has not.
Why?
It prevents accidental changes or warnings caused by modifying a view-like object.
Interview answer:
If I need an independent object, I call .copy() before modifying a subset.
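A related point that is easy to miss: plain assignment does not copy anything. The sketch below contrasts a second name for the same Series with a real copy; the names alias and independent are illustrative.

```python
import pandas as pd

scores = pd.Series([82, 91, 76], index=["Asha", "Ravi", "Meera"])

alias = scores               # a second name for the same object
independent = scores.copy()  # a separate object with its own data

alias.loc["Asha"] = 0        # also visible through scores
independent.loc["Ravi"] = 0  # scores is unaffected

print(scores.loc["Asha"], scores.loc["Ravi"])
print(alias is scores, independent is scores)
```

Only the change made through the copy leaves the original Series untouched.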
25. Python Functions With Series
Many Python built-ins work with Series.
scores = pd.Series([82, 91, 76, 88])
print(len(scores))
print(type(scores))
print(max(scores))
print(min(scores))
print(sorted(scores))
Explanation
- scores is a Pandas Series of four integer scores.
- len(scores) returns the number of elements.
- type(scores) confirms that scores is a Pandas Series.
- max(scores) and min(scores) return the highest and lowest scores.
- sorted(scores) returns the values in ascending order as a plain Python list, not a Series.
Convert to a list:
print(scores.tolist())
Explanation
- tolist() converts the values of the scores Series into a standard Python list.
- The index labels are dropped; only the values are kept.
- This is useful when another function expects a plain list.
Convert to a dictionary:
named_scores = pd.Series(
[82, 91, 76],
index=["Asha", "Ravi", "Meera"],
)
print(named_scores.to_dict())
Explanation
- named_scores is a Pandas Series with names as the index.
- to_dict() converts it to a dictionary in which the index labels become the keys and the scores become the values.
- The printed result maps each name to its score.
26. Membership: Index vs Values
For a Series, the in operator checks the index, not the values.
scores = pd.Series(
[82, 91, 76],
index=["Asha", "Ravi", "Meera"],
)
print("Ravi" in scores)
print(91 in scores)
Explanation
- scores is a Pandas Series of three scores with names as the index.
- "Ravi" in scores is True because "Ravi" is an index label.
- 91 in scores is False because in checks the index, and 91 is a value, not a label.
Output:
True
False
To check values, use:
print(91 in scores.values)
Explanation
- scores is a Series, and scores.values returns its values as a NumPy array.
- The in operator then checks whether 91 is among those values.
- The result is True because Ravi's score is 91.
Or use isin():
print(scores.isin([91]))
Explanation
- isin() checks each value in scores against the given list.
- It returns a boolean Series with True where the value is 91 and False elsewhere.
- Call .any() on the result to get a single True or False.
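The membership checks from this section can be collected in one place. This is a minimal sketch on the same scores; the comparison (scores == 91).any() is included as an equivalent way to test for a single value.

```python
import pandas as pd

scores = pd.Series([82, 91, 76], index=["Asha", "Ravi", "Meera"])

label_found = "Ravi" in scores         # checks the index
value_as_label = 91 in scores          # 91 is not an index label
value_found = scores.isin([91]).any()  # checks the values
also_found = (scores == 91).any()      # equivalent check for one value

print(label_found, value_as_label, value_found, also_found)
```

Only the second check is False, because it looks for 91 among the labels.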
27. Looping Over A Series
Looping over a Series gives values:
for score in scores:
    print(score)
Explanation
- Iterating over a Series yields its values one at a time, not its index labels.
- print(score) displays each value.
Loop over index and values:
for name, score in scores.items():
    print(name, score)
Explanation
- scores.items() yields (label, value) pairs from the Series.
- On each iteration, name receives the index label and score receives the matching value.
Use vectorized operations when possible. Loops are useful for display, debugging, or custom logic.
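As an illustration of the difference, the sketch below adds a hypothetical 5-point bonus to every score, first with a loop and then with a single vectorized expression.

```python
import pandas as pd

scores = pd.Series([82, 91, 76], index=["Asha", "Ravi", "Meera"])

# Loop version: build the result one value at a time (labels are lost).
looped = []
for score in scores:
    looped.append(score + 5)

# Vectorized version: one expression applies to every value and keeps the labels.
boosted = scores + 5

print(looped)
print(boosted)
```

The vectorized form is shorter, keeps the index, and is usually faster on large data.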
28. Arithmetic Operations
Series operations align by index labels.
jan = pd.Series([100, 200, 300], index=["A", "B", "C"])
feb = pd.Series([110, 190, 250], index=["A", "B", "D"])
print(feb - jan)
Explanation
- jan has the labels "A", "B", "C" and feb has the labels "A", "B", "D".
- feb - jan matches values by label, not by position.
- A label that exists in only one Series, such as "C" or "D", produces NaN.
- The result is float64 because NaN is a float.
Output:
A 10.0
B -10.0
C NaN
D NaN
dtype: float64
Why?
- A and B exist in both Series
- C is missing from February
- D is missing from January
If you want missing values treated as zero:
print(feb.sub(jan, fill_value=0))
Explanation
- The sub method performs element-wise subtraction between the two Series, feb and jan.
- fill_value=0 substitutes 0 for a label that is missing from one of the Series before subtracting.
- This avoids NaN results when one Series has labels the other lacks. A label missing from both Series would still give NaN.
- The result is a new Series whose index is the union of both indexes.
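To make the effect of fill_value concrete, here is the same subtraction with the resulting values written out. This is only a sketch using the jan and feb Series from above.

```python
import pandas as pd

jan = pd.Series([100, 200, 300], index=["A", "B", "C"])
feb = pd.Series([110, 190, 250], index=["A", "B", "D"])

# A label missing from one side is treated as 0 before subtracting.
diff = feb.sub(jan, fill_value=0)

print(diff.to_dict())
# {'A': 10.0, 'B': -10.0, 'C': -300.0, 'D': 250.0}
```

Note what the numbers mean: C becomes 0 - 300 and D becomes 250 - 0. Treating a missing label as zero is a modeling decision, so use it only when "absent" really does mean "zero" in your data.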
29. Relational Operations
scores = pd.Series([82, 91, 76, 88])
print(scores >= 85)
Explanation
- A pandas Series named scores is created containing four integer scores.
- The expression scores >= 85 compares each score with 85, element by element.
- The result is a boolean Series in which each element indicates whether the corresponding score meets the threshold.
- print outputs the boolean Series, showing which scores are greater than or equal to 85.
Output:
0 False
1 True
2 False
3 True
dtype: bool
This creates a boolean Series.
30. Boolean Indexing
Use a boolean condition to filter values.
scores = pd.Series(
[82, 91, 76, 88],
index=["Asha", "Ravi", "Meera", "Kabir"],
)
high_scores = scores[scores >= 85]
print(high_scores)
Explanation
- A pandas Series named scores is created with student names as index labels and their scores as values.
- scores >= 85 builds a boolean mask, and indexing scores with that mask keeps only the rows where the mask is True.
- high_scores therefore holds only the scores that are greater than or equal to 85, with their original labels.
- print shows the students who reached the threshold.
Output:
Ravi 91
Kabir 88
dtype: int64
Count values above a threshold:
print((scores >= 85).sum())
Explanation
- The expression scores >= 85 creates a boolean Series in which each element indicates whether the corresponding score meets the condition.
- sum() adds up the True values, which counts how many scores are 85 or higher.
- This is a quick way to count the entries that satisfy a condition.
- Here scores is the pandas Series defined above; the same pattern also works on NumPy arrays.
This works because True behaves like 1 and False behaves like 0.
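The same idea gives a proportion as well as a count. The short sketch below reuses the scores from this section and takes the mean of the boolean mask.

```python
import pandas as pd

scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)

mask = scores >= 85

print(mask.sum())   # number of True values: 2
print(mask.mean())  # share of True values: 0.5
```

Since the mean of zeros and ones is the fraction of ones, mask.mean() answers "what share of students scored 85 or more?".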
31. Multiple Conditions
Use & for AND and | for OR.
Wrap each condition in parentheses.
scores = pd.Series([45, 62, 78, 91, 38, 84])
selected = scores[(scores >= 60) & (scores <= 85)]
print(selected)
Explanation
- A pandas Series named scores is created from a list of integer scores.
- selected keeps only the values that are greater than or equal to 60 and less than or equal to 85.
- The two comparisons are combined with the element-wise AND operator (&), and each comparison sits in its own parentheses.
- print displays the scores that meet both conditions, along with their original index positions.
Output:
1 62
2 78
5 84
dtype: int64
Common mistake:
scores >= 60 & scores <= 85
Explanation
- The intent is to check whether each score is at least 60 and at most 85.
- Because & binds more tightly than >= and <=, Python reads the expression as scores >= (60 & scores) <= 85.
- That is a chained comparison, which needs a single True or False from a whole Series, so pandas raises a ValueError instead of filtering.
- Wrapping each comparison in parentheses, as in the previous example, gives the intended result.
This is wrong because operator precedence changes the meaning of the expression.
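The sketch below shows the failure and the fix side by side, and also illustrates | (OR) and ~ (NOT). It assumes the same scores Series as above; the variable names are only for illustration.

```python
import pandas as pd

scores = pd.Series([45, 62, 78, 91, 38, 84])

# Without parentheses, & is evaluated before the comparisons.
try:
    scores >= 60 & scores <= 85
    outcome = "no error"
except ValueError as exc:
    outcome = "ValueError"
    print("Raised:", exc)

# With parentheses, each comparison is evaluated first.
in_range = scores[(scores >= 60) & (scores <= 85)]

# | means OR and ~ means NOT; both follow the same parentheses rule.
out_of_range = scores[(scores < 60) | (scores > 85)]
also_out_of_range = scores[~((scores >= 60) & (scores <= 85))]

print(in_range.tolist())      # [62, 78, 84]
print(out_of_range.tolist())  # [45, 91, 38]
```

Python's and, or, and not keywords do not work element by element on a Series, which is why the symbols &, |, and ~ are used instead.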
32. Plotting A Series
Pandas can plot Series using Matplotlib behind the scenes.
daily_visitors = pd.Series(
[120, 135, 150, 160, 155, 170, 180],
index=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
)
daily_visitors.plot(kind="line", title="Daily Visitors")
Explanation
- Creates a pandas Series named daily_visitors containing visitor counts for each day of the week.
- Assigns custom index labels ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun") to represent the days.
- Calls the plot method to draw a line chart of the daily visitors, with the index labels along the x-axis.
- Sets the title of the plot to "Daily Visitors".
For category counts:
plans = pd.Series(["free", "pro", "free", "team", "pro", "free"])
plans.value_counts().plot(kind="bar", title="Plan Counts")
Explanation
- The code creates a pandas Series containing subscription plan types.
- value_counts() counts the occurrences of each unique plan.
- The resulting counts are plotted as a bar chart using plot() with the title "Plan Counts".
- The chart shows the popularity of each plan at a glance.
In notebooks, plots display inline if plotting is configured.
33. Changing Data Type With astype
scores = pd.Series([82.0, 91.0, 76.0])
scores_int = scores.astype("int64")
print(scores_int)
print(scores_int.dtype)
Explanation
- A pandas Series named scores is created containing three float values.
- astype("int64") converts the float values to integers and returns a new Series, scores_int. The original scores Series is unchanged.
- The first print displays the integer values.
- The second print shows the dtype, confirming the conversion to int64.
Use astype() when conversion is straightforward.
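Two details are worth seeing in code: converting floats to integers drops the fractional part rather than rounding, and astype() raises an error when a value cannot be converted. The values below are made up for this sketch.

```python
import pandas as pd

# astype("int64") drops the fractional part; it does not round.
measurements = pd.Series([82.7, 91.2, 76.9])
truncated = measurements.astype("int64")
print(truncated.tolist())  # [82, 91, 76]

# Round first if rounding is what you want.
rounded = measurements.round().astype("int64")
print(rounded.tolist())  # [83, 91, 77]

# astype() raises when a value cannot be converted.
messy = pd.Series(["82", "91", "n/a"])
try:
    messy.astype("int64")
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```

This is why "straightforward" matters: astype() is a good fit for clean data, while messy strings call for the approach in the next section.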
34. Safer Numeric Conversion With pd.to_numeric
Real data often has messy strings.
raw_prices = pd.Series(["120", "99.5", "missing", "150"])
prices = pd.to_numeric(raw_prices, errors="coerce")
print(prices)
Explanation
- The code creates a pandas Series of price strings that includes one non-numeric value ("missing").
- pd.to_numeric() converts the Series to numbers. errors="coerce" replaces any value that cannot be converted with NaN (Not a Number) instead of raising an error.
- The resulting Series, prices, holds numeric values for the valid entries and NaN for the invalid one.
- print displays the converted Series for inspection.
Output:
0 120.0
1 99.5
2 NaN
3 150.0
dtype: float64
errors="coerce" converts invalid values to missing values.
This is useful for cleaning imported CSV data.
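Coercion is silent, so it helps to check what was lost. One possible way, sketched below with the same raw_prices, is to list the original strings that turned into NaN during conversion.

```python
import pandas as pd

raw_prices = pd.Series(["120", "99.5", "missing", "150"])
prices = pd.to_numeric(raw_prices, errors="coerce")

# Rows that became NaN during conversion but were not missing to begin with.
failed = raw_prices[prices.isna() & raw_prices.notna()]

print(failed.tolist())                 # ['missing']
print("Invalid entries:", len(failed))  # Invalid entries: 1
```

Reviewing the failed entries before filling or dropping them can reveal fixable patterns, such as currency symbols or thousands separators.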
35. between()
between() checks whether values lie inside a range.
scores = pd.Series([45, 62, 78, 91, 38, 84])
print(scores.between(60, 85))
print(scores[scores.between(60, 85)])
Explanation
- The code creates a pandas Series named scores containing numeric scores.
- between(60, 85) checks which scores fall in the range 60 to 85 and returns a boolean Series. Both endpoints are included by default.
- The first print outputs that boolean Series.
- The second print uses the boolean Series as a mask to show only the scores between 60 and 85.
Output:
0 False
1 True
2 True
3 False
4 False
5 True
dtype: bool
1 62
2 78
5 84
dtype: int64
36. clip()
clip() limits values to a lower and upper bound.
ratings = pd.Series([2, 5, 8, 11, -3, 7])
safe_ratings = ratings.clip(lower=0, upper=10)
print(safe_ratings)
Explanation
- The code creates a pandas Series named ratings containing integers, including a negative value and a value above 10.
- clip limits the values in ratings to a lower bound of 0 and an upper bound of 10.
- Any rating below 0 is replaced with 0, and any rating above 10 is replaced with 10, so all values end up within the range.
- The new Series, safe_ratings, is printed. The original ratings Series is not modified.
Output:
0 2
1 5
2 8
3 10
4 0
5 7
dtype: int64
Use this for outlier capping or valid range enforcement.
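clip() also accepts just one bound. The sketch below, using the same ratings, applies a floor only, a ceiling only, and then counts how many values the two-sided clip changed.

```python
import pandas as pd

ratings = pd.Series([2, 5, 8, 11, -3, 7])

# Only a floor: negative values become 0, large values are left alone.
floor_only = ratings.clip(lower=0)
print(floor_only.tolist())  # [2, 5, 8, 11, 0, 7]

# Only a ceiling: values above 10 become 10, small values are left alone.
ceiling_only = ratings.clip(upper=10)
print(ceiling_only.tolist())  # [2, 5, 8, 10, -3, 7]

# Count how many values the two-sided clip actually changed.
changed = (ratings != ratings.clip(lower=0, upper=10)).sum()
print(changed)  # 2
```

Counting the changed values is a cheap sanity check: a large number suggests the bounds are too tight or the data has a deeper problem.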
37. Duplicates: duplicated() And drop_duplicates()
plans = pd.Series(["free", "pro", "free", "team", "pro", "free"])
print(plans.duplicated())
print(plans.duplicated().sum())
Explanation
- The code creates a pandas Series named plans containing subscription types.
- duplicated() returns a boolean Series in which True marks a value that has already appeared earlier in the Series. The first occurrence of each value is marked False.
- The first print outputs that boolean Series.
- The second print sums the True values, giving the number of repeated entries in plans.
Output:
0 False
1 False
2 True
3 False
4 True
5 True
dtype: bool
3
Drop duplicates:
print(plans.drop_duplicates())
Explanation
- drop_duplicates() is called on the plans Series to remove repeated values.
- It returns a new Series that keeps only the first occurrence of each value, together with its original index label.
- print() outputs the resulting Series.
- This is a common data-cleaning step before analysis.
Output:
0 free
1 pro
3 team
dtype: object
Keep the last occurrence:
print(plans.drop_duplicates(keep="last"))
Explanation
- drop_duplicates is called on the Series named plans.
- keep="last" specifies that when a value is repeated, its last occurrence is the one that is kept.
- The result printed to the console is a Series without duplicates.
- Use this when the most recent entry for each value is the one you want to keep.
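The three keep options are easiest to compare by looking at which index labels survive. A small sketch with the same plans Series:

```python
import pandas as pd

plans = pd.Series(["free", "pro", "free", "team", "pro", "free"])

# keep="first" is the default: the first occurrence of each value survives.
first_kept = plans.drop_duplicates()
print(first_kept.index.tolist())  # [0, 1, 3]

# keep="last": the last occurrence of each value survives.
last_kept = plans.drop_duplicates(keep="last")
print(last_kept.index.tolist())  # [3, 4, 5]

# keep=False: every value that appears more than once is dropped entirely.
unique_only = plans.drop_duplicates(keep=False)
print(unique_only.tolist())  # ['team']
```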
38. Missing Data
Create a Series with missing values:
ratings = pd.Series([4.5, np.nan, 3.8, np.nan, 4.9])
Explanation
- Creates a pandas Series named ratings with five entries, three of which are numeric ratings.
- np.nan marks the two missing ratings.
- Because NaN is a float, the Series has dtype float64.
- The methods below identify and handle these missing entries.
Find missing values:
print(ratings.isna())
print(ratings.isna().sum())
Explanation
- print(ratings.isna()) outputs a boolean Series of the same length as ratings, where True marks a missing value and False marks a present one.
- print(ratings.isna().sum()) prints the total number of missing values in ratings by summing the booleans (True counts as 1). Here the result is 2.
- This is useful during data cleaning because it quickly shows how much data is missing.
isnull() is an alias for isna().
Drop missing values:
print(ratings.dropna())
Explanation
- dropna() is called on the ratings Series to remove the entries whose value is NaN (missing).
- The result is a new Series containing only the non-missing values, with their original index labels. ratings itself is unchanged.
- print() outputs the cleaned Series.
- This is a common preprocessing step before statistical analysis or machine learning.
Fill missing values:
filled = ratings.fillna(ratings.mean())
print(filled)
Explanation
- fillna() replaces the NaN (missing) values in the ratings Series.
- ratings.mean() computes the mean of the non-missing values (NaN is skipped by default), which is 4.4 here, and that number is used as the fill value.
- The result is stored in filled, a new Series with no missing values.
- print(filled) outputs the filled Series.
Use a domain-appropriate fill value. Do not blindly use the mean for every dataset.
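To illustrate what "domain-appropriate" can look like, the sketch below fills the same ratings in three different ways. Which one is right depends on the data; none of them is a universal default.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.5, np.nan, 3.8, np.nan, 4.9])

# Median of the non-missing values (3.8, 4.5, 4.9) is 4.5.
by_median = ratings.fillna(ratings.median())
print(by_median.tolist())  # [4.5, 4.5, 3.8, 4.5, 4.9]

# A fixed value chosen by the analyst.
by_constant = ratings.fillna(0)
print(by_constant.tolist())  # [4.5, 0.0, 3.8, 0.0, 4.9]

# Forward fill: carry the previous observed value forward.
by_ffill = ratings.ffill()
print(by_ffill.tolist())  # [4.5, 4.5, 3.8, 3.8, 4.9]
```

The median is less sensitive to outliers than the mean, a constant suits cases where missing truly means a known value, and forward fill suits ordered data such as daily readings.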
39. isin()
isin() checks whether each value is in a list-like collection.
scores = pd.Series([49, 50, 75, 99, 100, 42])
near_milestones = scores[scores.isin([49, 99])]
print(near_milestones)
Explanation
- A pandas Series named scores is created from a list of integer scores.
- isin([49, 99]) returns a boolean mask that is True where a score equals 49 or 99, and the mask is used to filter the Series.
- The filtered result is stored in near_milestones, so it contains only the scores that are exactly 49 or 99 (one short of 50 and 100).
- print() outputs the filtered Series.
Output:
0 49
3 99
dtype: int64
Use it for membership filters.
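A membership filter can also be inverted. As a small sketch with the same scores, the ~ operator flips the mask so that the listed values are excluded instead of selected.

```python
import pandas as pd

scores = pd.Series([49, 50, 75, 99, 100, 42])
excluded = [49, 99]

# ~ flips the boolean mask, keeping values that are NOT in the list.
others = scores[~scores.isin(excluded)]
print(others.tolist())  # [50, 75, 100, 42]
```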
40. map()
map() is useful for value replacement using a dictionary or function.
plans = pd.Series(["free", "pro", "team", "free"])
plan_labels = plans.map({
"free": "Starter",
"pro": "Professional",
"team": "Team",
})
print(plan_labels)
Explanation
- A pandas Series named plans is created containing subscription plan types.
- map replaces each plan type with the corresponding label from a dictionary.
- The dictionary maps "free" to "Starter", "pro" to "Professional", and "team" to "Team".
- The transformed labels are stored in the variable plan_labels.
- Finally, the new labels are printed to the console.
Output:
0 Starter
1 Professional
2 Team
3 Starter
dtype: object
If a value is not found in the dictionary, the result becomes missing (NaN) for that value.
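Here is a sketch of that behavior and two possible ways to handle it. The "enterprise" plan and the "Unknown" label are invented for this example.

```python
import pandas as pd

plans = pd.Series(["free", "pro", "enterprise"])
labels = {"free": "Starter", "pro": "Professional"}

# "enterprise" has no entry in the dictionary, so it maps to NaN.
mapped = plans.map(labels)
print(mapped.isna().tolist())  # [False, False, True]

# Option 1: replace unmapped values with a fixed label.
with_default = mapped.fillna("Unknown")
print(with_default.tolist())  # ['Starter', 'Professional', 'Unknown']

# Option 2: fall back to the original value.
with_original = mapped.fillna(plans)
print(with_original.tolist())  # ['Starter', 'Professional', 'enterprise']
```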
41. apply()
apply() applies a function to each value.
prices = pd.Series([99, 149, 249])
def add_tax(price):
    return price * 1.18
final_prices = prices.apply(add_tax)
print(final_prices)
Explanation
- A pandas Series named prices is created containing three price values.
- The function add_tax takes a single price and returns it increased by 18% to account for tax.
- apply calls add_tax once for each element of prices and collects the results in a new Series called final_prices.
- Finally, final_prices is printed, showing the prices after tax.
Output:
0 116.82
1 175.82
2 293.82
dtype: float64
For simple arithmetic, vectorized code is better:
final_prices = prices * 1.18
Explanation
- final_prices stores the updated price values.
- The prices Series is multiplied by 1.18, which represents an 18% increase for tax.
- The multiplication is applied to every element of the Series at once and returns a new Series; no explicit loop or function call per element is needed.
- This works because prices has a numeric dtype.
Use apply() when the logic is custom and cannot be expressed cleanly with vectorized operations.
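As an example of logic that does not reduce to a single arithmetic expression, the sketch below assigns letter grades with branching rules. The grade bands and the function name to_grade are hypothetical.

```python
import pandas as pd

scores = pd.Series([45, 62, 78, 91], index=["Asha", "Ravi", "Meera", "Kabir"])

def to_grade(score):
    # Hypothetical grading bands, chosen only for this example.
    if score >= 85:
        return "A"
    if score >= 60:
        return "B"
    return "C"

grades = scores.apply(to_grade)
print(grades.tolist())  # ['C', 'B', 'B', 'A']
```

The index labels carry over to the result, so grades.loc["Kabir"] returns "A".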
42. Cleaning Price Strings
Real CSV data often stores prices as strings:
raw_prices = pd.Series(["$2.39", "$3.50", None, "$10.25", "not available"])
Explanation
- Creates a pandas Series named raw_prices containing price strings, a None value, and a text placeholder.
- Three entries are valid prices stored as strings ("$2.39", "$3.50", "$10.25").
- None represents a missing value, and "not available" is a string that cannot be converted to a number.
- Both kinds of bad entry need to end up as NaN before the prices can be analyzed.
Remove the dollar symbol:
clean_text = raw_prices.str.replace("$", "", regex=False)
Explanation
- raw_prices is a pandas Series containing price strings with dollar signs.
- str.replace removes the dollar sign character ("$") from each string in the Series.
- regex=False tells pandas to treat "$" as a literal character rather than as a regular expression, where "$" means end of string.
- The result is stored in clean_text, which holds the price strings without dollar signs. Missing entries stay missing.
Convert to numbers:
prices_usd = pd.to_numeric(clean_text, errors="coerce")
print(prices_usd)
Explanation
- pd.to_numeric() converts the strings in clean_text to numeric values.
- errors="coerce" replaces any value that cannot be converted with NaN (Not a Number) instead of raising an error.
- The resulting numeric Series is stored in prices_usd.
- Finally, prices_usd is printed to show the converted values.
Output:
0 2.39
1 3.50
2 NaN
3 10.25
4 NaN
dtype: float64
Fill missing values:
prices_usd = prices_usd.fillna(prices_usd.mean())
Explanation
- fillna() replaces the NaN (missing) values in the prices_usd Series.
- prices_usd.mean() computes the mean of the non-missing prices, which is 5.38 here.
- Every missing price is replaced with that average, so the Series has no gaps afterward.
- Filling with the mean is a common preprocessing choice, but it changes the data, so check that it suits your analysis.
Convert to rupees:
prices_inr = prices_usd * 83
print(prices_inr)
Explanation
- The code multiplies prices_usd by 83, used here as the exchange rate from USD to INR.
- The result is stored in prices_inr, which holds the equivalent prices in Indian Rupees.
- print outputs the converted prices.
- The snippet relies on prices_usd from the previous steps, which contains only numeric values.
In production, use a real exchange rate source. In practice exercises, a fixed rate is fine.
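The steps of this section can also be written as one short pipeline. This is only a sketch that chains the same calls; the rounding to two decimals is an addition for readability, and 83 is the fixed practice rate from above.

```python
import pandas as pd

raw_prices = pd.Series(["$2.39", "$3.50", None, "$10.25", "not available"])

# Strip the dollar sign, then turn anything non-numeric into NaN.
numeric = pd.to_numeric(
    raw_prices.str.replace("$", "", regex=False),
    errors="coerce",
)

# Mean of the three valid prices: (2.39 + 3.50 + 10.25) / 3 = 5.38
prices_usd = numeric.fillna(numeric.mean())

# Convert with the fixed practice rate and round to two decimals.
prices_inr = (prices_usd * 83).round(2)

print(prices_inr.tolist())
# [198.37, 290.5, 446.54, 850.75, 446.54]
```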
43. Mini Project: Analyze Daily Subscribers
Suppose you track daily subscribers gained:
subscribers = pd.Series(
[120, 135, 150, 90, 210, 240, 180, 160, 260, 300],
name="subscribers_gained",
)
Explanation
- Creates a pandas Series whose name attribute is "subscribers_gained".
- The values are integers giving the number of subscribers gained on each of ten days.
- No index is passed, so the days are labeled 0 through 9 by the default index.
- The Series is used for the calculations below.
Find:
- total subscribers gained
- average daily gain
- best day
- number of days above 200
- capped values between 100 and 250
Solution:
total = subscribers.sum()
average = subscribers.mean()
best_day = subscribers.idxmax()
days_above_200 = (subscribers > 200).sum()
capped = subscribers.clip(lower=100, upper=250)
print("Total:", total)
print("Average:", average)
print("Best day index:", best_day)
print("Days above 200:", days_above_200)
print(capped)
Explanation
- Computes the total number of subscribers gained using sum().
- Calculates the average daily gain with mean().
- Finds the index label of the day with the highest gain using idxmax().
- Counts how many days had a gain above 200 by summing a boolean condition.
- Limits the values to the range 100 to 250 using clip(); values outside the range are replaced by the nearest bound.
- Prints the total, average, best day index, count of days above 200, and the capped values.
This project uses:
- aggregation
- boolean indexing
- idxmax
- clip
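With the default index, idxmax() returns a position-like number. As a variation, the sketch below attaches dates so that the "best day" is a real label. The start date is hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical dates, used only to give each value a readable label.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
subscribers = pd.Series(
    [120, 135, 150, 90, 210, 240, 180, 160, 260, 300],
    index=dates,
    name="subscribers_gained",
)

total = subscribers.sum()
average = subscribers.mean()
best_day = subscribers.idxmax()
days_above_200 = (subscribers > 200).sum()

print(total)            # 1845
print(average)          # 184.5
print(best_day.date())  # 2024-01-10
print(days_above_200)   # 4
```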
44. Mini Project: Clean Product Prices
raw_prices = pd.Series(
["$2.39", "$3.39", "$5.99", None, "$12.50", "unknown"],
index=[
"chips",
"juice",
"sandwich",
"salad",
"bowl",
"soup",
],
name="price_usd",
)
Explanation
- Creates a pandas Series named raw_prices containing price data as strings, including valid prices, a None value, and an invalid entry ("unknown").
- The index is set explicitly to food item names: "chips", "juice", "sandwich", "salad", "bowl", and "soup".
- The name attribute of the Series is set to "price_usd", indicating that the values are prices in USD.
- None stands for a missing price, while "unknown" is a string that cannot be converted to a number.
- Both need to be handled before the prices can be analyzed.
Clean and analyze:
price_text = raw_prices.str.replace("$", "", regex=False)
prices = pd.to_numeric(price_text, errors="coerce")
prices = prices.fillna(prices.mean())
prices_inr = prices * 83
print(prices_inr)
print("Mean INR:", prices_inr.mean())
print("30th percentile:", prices_inr.quantile(0.30))
print("60th percentile:", prices_inr.quantile(0.60))
print("Between 300 and 800:")
print(prices_inr[prices_inr.between(300, 800)])
Explanation
- The code first removes the dollar sign from the raw price strings using str.replace.
- It converts the cleaned strings into numeric values, coercing anything invalid to NaN.
- Missing values are filled with the mean of the valid prices so that no gaps remain.
- The prices are then converted to Indian Rupees (INR) by multiplying by a fixed rate of 83.
- Finally, it prints the converted prices, their mean, the 30th and 60th percentiles, and the prices that fall between 300 and 800 INR.
This kind of cleaning appears often in data analyst tasks.
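One optional refinement, sketched below, is to record which items had no usable price before filling them in. The variable name was_imputed is an assumption for this example and not part of the project above.

```python
import pandas as pd

raw_prices = pd.Series(
    ["$2.39", "$3.39", "$5.99", None, "$12.50", "unknown"],
    index=["chips", "juice", "sandwich", "salad", "bowl", "soup"],
    name="price_usd",
)

prices = pd.to_numeric(
    raw_prices.str.replace("$", "", regex=False),
    errors="coerce",
)

# Remember which rows had no usable price before anything is filled in.
was_imputed = prices.isna()
filled = prices.fillna(prices.mean())

imputed_items = was_imputed[was_imputed].index.tolist()
print(imputed_items)              # ['salad', 'soup']
print(round(filled["salad"], 4))  # 6.0675
```

Keeping the mask lets you report real and estimated prices separately later, or exclude the estimates from sensitive calculations.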
45. Practice Exercises
Try these before reading the solutions.
Practice Lab
Exercise 1: Empty Series
Create an empty Series with dtype float.
Practice Lab
Exercise 2: Series Arithmetic
Create two Series:
first = pd.Series([2, 4, 6, 8, 10])
second = pd.Series([1, 3, 5, 7, 10])
Explanation
- Creates the first Series, named first, containing the even numbers from 2 to 10.
- Creates the second Series, named second, containing the values 1, 3, 5, 7, and 10.
- Both Series use the default index 0 through 4, so their elements line up position by position.
- The last value is 10 in both Series, which matters for the comparison exercise that follows.
Print addition, subtraction, multiplication, and division.
Practice Lab
Exercise 3: Series Comparison
Using the same two Series, compare:
- greater than
- less than
- equal to
Practice Lab
Exercise 4: Convert Mixed Data To Numeric
Create:
mixed = pd.Series([1, 2, "Python", 2.0, True, 100])
Explanation
- The code creates a pandas Series named mixed containing integers, a string, a float, and a boolean.
- Because the values do not share one type, pandas stores them with dtype object.
- Each element can still be accessed by its index.
- Mixed data like this is common in imported datasets and usually needs cleaning before numeric analysis.
Convert it to numeric values, turning invalid values into missing values.
Practice Lab
Exercise 5: Top Values
Create a Series of player scores and print the top 5 values.
Practice Lab
Exercise 6: Count Above Mean
Create a numeric Series and count how many values are greater than the mean.
Practice Lab
Exercise 7: Missing Values
Create a Series with three missing values. Count missing values, drop them, and fill them with the median.
Practice Lab
Exercise 8: Price Cleaning
Create a Series of price strings such as "$10.50", "$20.00", and "missing". Remove $, convert to numeric, and fill missing values with the mean.
Practice Lab
Exercise 9: Category Counts
Create a Series of course categories and show the top 3 most common categories.
Practice Lab
Exercise 10: Range Filter
Create a Series of product prices and return prices between 100 and 500.
46. Practice Solutions
Solution Key
Solution 1: Empty Series
empty = pd.Series(dtype="float64")
print(empty)
Explanation
- Creates an empty pandas Series with dtype float64.
- The dtype parameter sets the data type explicitly, since an empty Series has no values from which to infer one.
- print outputs the Series, which shows no rows and the float64 dtype.
- An empty Series can serve as a placeholder, although it is usually better to build a Series from complete data than to grow it entry by entry.
Solution Key
Solution 2: Series Arithmetic
first = pd.Series([2, 4, 6, 8, 10])
second = pd.Series([1, 3, 5, 7, 10])
print(first + second)
print(first - second)
print(first * second)
print(first / second)
Explanation
- The code creates two pandas Series, first and second, containing integer values.
- It performs element-wise addition, subtraction, multiplication, and division between the two Series.
- The result of each operation is printed. Division returns float64 values even though both inputs are integers.
- This shows how pandas handles vectorized operations, with no explicit loop needed.
- The operations align on the index, and because both Series share the default index 0 through 4, corresponding elements are paired.
Solution Key
Solution 3: Series Comparison
first = pd.Series([2, 4, 6, 8, 10])
second = pd.Series([1, 3, 5, 7, 10])
print(first > second)
print(first < second)
print(first == second)
Explanation
- Creates two pandas Series, first and second, containing integer values.
- Performs element-wise comparison between the two Series using the greater than (>), less than (<), and equality (==) operators.
- Outputs three boolean Series giving the result of each comparison for corresponding elements in first and second. Only the last pair (10 and 10) is equal.
- Useful for data analysis tasks where two sets of values need to be compared row by row.
Solution Key
Solution 4: Convert Mixed Data To Numeric
mixed = pd.Series([1, 2, "Python", 2.0, True, 100])
converted = pd.to_numeric(mixed, errors="coerce")
print(converted)
Explanation
- The code creates a pandas Series named mixed containing integers, a string, a float, and a boolean.
- pd.to_numeric() converts the elements of mixed to numeric values, and errors="coerce" replaces any value that cannot be converted with NaN.
- The result is stored in converted. The string "Python" cannot be converted, so it becomes NaN.
- Finally, print() outputs the converted Series, which has a float dtype because it contains NaN.
Solution Key
Solution 5: Top Values
scores = pd.Series([420, 180, 550, 610, 320, 720, 150])
top_5 = scores.sort_values(ascending=False).head(5)
print(top_5)
Explanation
- A pandas Series named scores is created from a list of numeric values.
- sort_values(ascending=False) sorts the scores in descending order.
- head(5) takes the first five entries of the sorted Series, which are the top five scores.
- Finally, the top five scores are printed with their original index labels.
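pandas also offers nlargest() for this task. The sketch below shows it giving the same top five as the sort-then-head approach for these scores.

```python
import pandas as pd

scores = pd.Series([420, 180, 550, 610, 320, 720, 150])

# nlargest(5) returns the five largest values in descending order.
top_5 = scores.nlargest(5)
print(top_5.tolist())  # [720, 610, 550, 420, 320]
```

nsmallest() works the same way for the lowest values.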
Solution Key
Solution 6: Count Above Mean
values = pd.Series([10, 20, 30, 40, 50])
above_mean_count = (values > values.mean()).sum()
print(above_mean_count)
Explanation
- A pandas Series is created with five integer values: 10, 20, 30, 40, and 50.
- values.mean() computes the mean of the Series, which is 30.
- The comparison values > values.mean() produces a boolean Series of True/False values.
- sum() counts the True values, giving the number of elements above the mean. Here the count is 2 (40 and 50).
- Finally, the count is printed to the console.
Solution Key
Solution 7: Missing Values
values = pd.Series([10, np.nan, 30, np.nan, 50, np.nan])
print(values.isna().sum())
print(values.dropna())
print(values.fillna(values.median()))
Explanation
- A pandas Series is created with three numeric values and three NaN (Not a Number) entries that represent missing data.
- isna().sum() counts the missing values in the Series and prints the total, which is 3.
- dropna() returns the Series without the NaN entries, and the result is printed.
- fillna() replaces the NaN values with the median of the non-missing values, which is 30, as a simple way to impute missing data.
Solution Key
Solution 8: Price Cleaning
prices = pd.Series(["$10.50", "$20.00", "missing", "$15.75"])
clean_text = prices.str.replace("$", "", regex=False)
numeric_prices = pd.to_numeric(clean_text, errors="coerce")
filled_prices = numeric_prices.fillna(numeric_prices.mean())
print(filled_prices)
Explanation
- The code creates a pandas Series of price strings, one of which ("missing") is not a valid price.
- str.replace removes the dollar sign from each price string.
- pd.to_numeric converts the cleaned strings into numeric values, and errors="coerce" turns any entry that cannot be converted into NaN.
- fillna replaces the NaN value with the mean of the valid prices, so no missing data remains.
- Finally, the cleaned and filled prices are printed to the console.
Solution Key
Solution 9: Category Counts
categories = pd.Series([
"python",
"pandas",
"python",
"sql",
"pandas",
"python",
"excel",
])
print(categories.value_counts().head(3))
Explanation
- A pandas Series named categories is created containing course category names.
- value_counts() counts the occurrences of each unique category and sorts the counts in descending order.
- head(3) keeps the first three rows of that result, which are the most frequent categories.
- Here "python" appears 3 times and "pandas" 2 times. "sql" and "excel" are tied at 1, so which of them appears third depends on how the tie is ordered.
Solution Key
Solution 10: Range Filter
prices = pd.Series([50, 120, 250, 600, 499, 80])
selected = prices[prices.between(100, 500)]
print(selected)
Explanation
- A pandas Series named prices is created from a list of numeric prices.
- between(100, 500) builds a boolean mask that is True for prices from 100 to 500, inclusive, and the mask is used to filter the Series.
- The filtered results are stored in the variable selected.
- Finally, print outputs the filtered prices: 120, 250, and 499.
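Whether the endpoints count can be controlled with the inclusive parameter. The sketch below uses made-up prices that sit exactly on the boundaries; it assumes pandas 1.3 or newer, where inclusive accepts the strings "both", "neither", "left", and "right".

```python
import pandas as pd

prices = pd.Series([100, 120, 250, 500, 600])

# Default: both endpoints count as inside the range.
both = prices[prices.between(100, 500)]
print(both.tolist())  # [100, 120, 250, 500]

# inclusive="neither" excludes both endpoints.
neither = prices[prices.between(100, 500, inclusive="neither")]
print(neither.tolist())  # [120, 250]

# inclusive="left" keeps the lower endpoint only.
left = prices[prices.between(100, 500, inclusive="left")]
print(left.tolist())  # [100, 120, 250]
```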
47. Quick Interview Questions
1. What is a Pandas Series?
A one-dimensional labeled array.
2. What is the difference between size and count()?
size counts all entries, including missing values. count() counts non-missing values.
3. What does value_counts() do?
It counts how many times each unique value occurs in a Series.
4. What is the difference between loc and iloc?
loc selects by label. iloc selects by integer position.
5. Why use pd.to_numeric()?
To convert messy values to numbers with options like errors="coerce".
6. What does dropna() do?
It removes missing values.
7. What does fillna() do?
It replaces missing values with a chosen value.
8. What does isin() do?
It checks whether values are present in a given list-like collection.
9. When should you use .copy()?
When you want to modify a subset independently from the original object.
10. Why can Series arithmetic produce missing values?
Because Series align by index labels. If a label is missing from one side, the result becomes missing for that label.
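A few of these answers are easy to verify in code. The sketch below, with made-up ratings, checks the difference between size and count() and shows how value_counts() treats missing values.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.5, np.nan, 3.8, np.nan, 4.9, 4.5])

print(ratings.size)     # 6: every entry, missing or not
print(ratings.count())  # 4: non-missing entries only

# value_counts() skips missing values unless dropna=False is passed.
counts = ratings.value_counts()
print(counts.loc[4.5])  # 2
print(counts.sum())     # 4

counts_with_missing = ratings.value_counts(dropna=False)
print(counts_with_missing.sum())  # 6
```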
48. Common Beginner Mistakes
Mistake 1: Confusing label and position
Use loc for labels and iloc for positions.
Mistake 2: Thinking size ignores missing values
size includes missing values. Use count() for non-missing values.
Mistake 3: Forgetting index alignment
Series arithmetic aligns by labels, not only by row order.
Mistake 4: Using astype() on messy strings
If values are messy, use pd.to_numeric(..., errors="coerce").
Mistake 5: Modifying a subset without copying
Use .copy() when you intentionally want an independent object.
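The difference between a second name and a real copy can be seen in a short sketch. The names alias and backup are chosen only for this example.

```python
import pandas as pd

scores = pd.Series([82, 91, 76], index=["Asha", "Ravi", "Meera"])

alias = scores          # a second name for the same object
backup = scores.copy()  # an independent copy of the data

scores.loc["Asha"] = 0

print(alias.loc["Asha"])   # 0: the alias sees the change
print(backup.loc["Asha"])  # 82: the copy does not
```

Plain assignment never copies data in Python, so make the copy before editing whenever the original values still matter.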
Final Takeaway
A Pandas Series is simple at first glance: one column of values with labels.
But it becomes powerful because it supports:
- labeled indexing
- automatic alignment
- missing-data handling
- statistical summaries
- boolean filtering
- value counts
- sorting
- type conversion
- string cleaning
- plotting
- element-wise transformation
If you are new to Pandas, master Series before moving deeply into DataFrames. DataFrames are mostly collections of Series working together.
Sources and Further Reading
- Pandas documentation: https://pandas.pydata.org/docs/
- Pandas Series user guide: https://pandas.pydata.org/docs/user_guide/dsintro.html#series
- Pandas Series API reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.html
- Pandas indexing guide: https://pandas.pydata.org/docs/user_guide/indexing.html
- Pandas missing data guide: https://pandas.pydata.org/docs/user_guide/missing_data.html
- Pandas to_numeric: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html
- Pandas visualization guide: https://pandas.pydata.org/docs/user_guide/visualization.html
