Pandas Series: Create, Index, and Analyze Data Efficiently

May 30, 2026
28 min read

Quick Answer

Learn Pandas Series from scratch with original examples: create Series from lists and dictionaries, inspect attributes, read CSV columns, use loc and iloc, sort values, count categories, handle missing data, filter with conditions, transform values, and solve practice tasks.

Quick Summary

Learn to create, index, and analyze Pandas Series in Python. Master data manipulation with practical examples and tips for effective data analysis.

Pandas Series: Create, Index, Clean, Analyze, and Practice

Pandas is one of the most important Python libraries for data analysis.

If NumPy gives you fast arrays, Pandas gives you labeled data structures that feel closer to real-world tables.

The first Pandas object you should understand is the Series.

A Pandas Series is a one-dimensional labeled array. You can think of it as a single column of data with row labels.

Examples of Series-style data:

  • daily website visitors
  • monthly revenue
  • marks scored by students
  • product prices
  • customer ratings
  • city temperatures
  • movie genres
  • app downloads by date

In this lesson, you will learn how to create, inspect, select, clean, analyze, and transform Pandas Series.

What You Will Learn

By the end, you should be able to:

  • explain what a Pandas Series is
  • create Series from lists, dictionaries, and scalar values
  • use custom indexes and names
  • inspect size, dtype, name, index, values, and is_unique
  • read a CSV column as a Series
  • use head, tail, sample, value_counts, sort_values, and sort_index
  • calculate sum, mean, median, mode, std, var, min, max, and describe
  • select values using labels and positions
  • understand loc and iloc
  • edit values safely
  • use boolean indexing
  • handle missing data with isna, dropna, and fillna
  • convert values using astype and pd.to_numeric
  • use between, clip, duplicated, drop_duplicates, isin, map, and apply
  • solve beginner Pandas Series practice problems

1. Installing And Importing Pandas

Install Pandas if needed:

bash
pip install pandas

Import it with the standard alias:

python
import pandas as pd
import numpy as np

Most Pandas code uses pd as the alias.

2. What Is A Pandas Series?

A Series is a one-dimensional labeled array.

Create a simple Series:

python
import pandas as pd

visitors = pd.Series([120, 135, 150, 160])

print(visitors)

Output:

text
0    120
1    135
2    150
3    160
dtype: int64

The left side is the index.

The right side is the value.

By default, Pandas creates a numeric index starting from 0.

3. Series vs Python List

A Python list stores values by position.

python
visitors_list = [120, 135, 150, 160]

A Series stores values with labels.

python
visitors = pd.Series([120, 135, 150, 160])

Why does this matter?

Because real data often needs labels.

python
visitors = pd.Series(
    [120, 135, 150, 160],
    index=["Mon", "Tue", "Wed", "Thu"],
)

print(visitors)

Output:

text
Mon    120
Tue    135
Wed    150
Thu    160
dtype: int64

Now each value has a meaningful row label.

4. Creating A Series From A List

python
topics = pd.Series(["Python", "NumPy", "Pandas", "SQL"])

print(topics)

Output:

text
0    Python
1     NumPy
2    Pandas
3       SQL
dtype: object

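Older Pandas versions display string data with the object dtype. Newer versions may show a dedicated string dtype instead, so the last line of this output can differ on your machine.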

Create a numeric Series:

python
scores = pd.Series([82, 91, 76, 88])

print(scores)

Output:

text
0    82
1    91
2    76
3    88
dtype: int64

5. Creating A Series With Custom Index

python
scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)

print(scores)

Output:

text
Asha     82
Ravi     91
Meera    76
Kabir    88
dtype: int64

Now the index contains student names.

You can select by label:

python
print(scores["Ravi"])

Output:

text
91

6. Naming A Series

A Series can have a name.

python
scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
    name="math_score",
)

print(scores)

Output:

text
Asha     82
Ravi     91
Meera    76
Kabir    88
Name: math_score, dtype: int64

The name is useful when a Series becomes a column in a DataFrame.

7. Creating A Series From A Dictionary

When you create a Series from a dictionary, keys become the index and values become the data.

python
course_minutes = {
    "Python Basics": 45,
    "NumPy Arrays": 55,
    "Pandas Series": 60,
    "SQL Joins": 50,
}

minutes = pd.Series(course_minutes, name="duration_minutes")

print(minutes)

Output:

text
Python Basics    45
NumPy Arrays     55
Pandas Series    60
SQL Joins        50
Name: duration_minutes, dtype: int64

This is one of the cleanest ways to create a labeled Series.
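
A dictionary can also be combined with an explicit index. The sketch below reuses course_minutes and passes a list of wanted labels; "Git Basics" is a made-up label that is not in the dictionary, included only to show what happens to unknown keys.

```python
import pandas as pd

course_minutes = {
    "Python Basics": 45,
    "NumPy Arrays": 55,
    "Pandas Series": 60,
    "SQL Joins": 50,
}

# "Git Basics" is a hypothetical label that is not a key in the dictionary.
wanted = ["SQL Joins", "Python Basics", "Git Basics"]

# The index argument selects and orders the keys; unknown labels become NaN.
subset = pd.Series(course_minutes, index=wanted, name="duration_minutes")

print(subset)
```

Because NaN is a float, the resulting Series uses float64 even though the dictionary values are integers.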

8. Creating A Series From A Scalar

You can repeat one value across an index.

python
status = pd.Series("draft", index=["post_1", "post_2", "post_3"])

print(status)

Output:

text
post_1    draft
post_2    draft
post_3    draft
dtype: object

This is useful when creating default values.

9. Important Series Attributes

Create a Series:

python
ratings = pd.Series(
    [4.5, 4.8, 3.9, 4.8, np.nan],
    index=["course_a", "course_b", "course_c", "course_d", "course_e"],
    name="rating",
)

size

size returns the total number of values, including missing values.

python
print(ratings.size)

Explanation

  • ratings is the Pandas Series created above, with five entries.
  • The size attribute returns the total number of elements, counting the NaN entry.
  • size is an attribute, not a method, so it is accessed without parentheses.

Output:

text
5

count

count() returns non-missing values.

python
print(ratings.count())

Explanation

  • count() is a Series method that returns the number of non-missing values.
  • ratings holds five entries, one of which is NaN, so the result is 4.
  • Compare this with size, which counts every entry including NaN.
  • The result is an integer.

Output:

text
4

This distinction is important in interviews.

dtype

python
print(ratings.dtype)

Explanation

  • ratings.dtype returns the data type of the values stored in the ratings Series.
  • The result is float64 because the values are decimals and NaN is itself a float.
  • The dtype affects which operations are available and how values are stored.

Output:

text
float64

name

python
print(ratings.name)

Explanation

  • ratings.name returns the name given to the Series when it was created, here "rating".
  • If no name was set, the attribute is None.
  • The name is what labels the column when the Series is placed in a DataFrame.

Output:

text
rating

index

python
print(ratings.index)

Explanation

  • ratings.index returns the Index object of the ratings Series.
  • For this Series it holds the labels course_a through course_e.
  • Printing the index is a quick way to check which labels are available for selection.

The index stores row labels.

values

python
print(ratings.values)

Explanation

  • ratings.values returns the data of the Series as a NumPy array, without the index labels.
  • For this Series the array holds four floats and one nan.
  • ratings.to_numpy() is the method form that the Pandas documentation recommends for the same purpose.

This gives the underlying values as an array-like object.

is_unique

python
print(ratings.is_unique)

Explanation

  • is_unique is an attribute of the ratings Series.
  • It returns True when every value in the Series is distinct and False otherwise. It checks the values, not the index labels.
  • To test the index labels instead, use ratings.index.is_unique.

Output:

text
False

It is false because 4.8 appears more than once.
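
As a recap of the attributes above, here is a minimal sketch that puts size, count(), and the two uniqueness checks side by side. The variable names on the left are illustrative and not part of Pandas.

```python
import numpy as np
import pandas as pd

ratings = pd.Series(
    [4.5, 4.8, 3.9, 4.8, np.nan],
    index=["course_a", "course_b", "course_c", "course_d", "course_e"],
    name="rating",
)

total_entries = ratings.size             # every entry, including NaN
non_missing = ratings.count()            # NaN is skipped
missing = int(ratings.isna().sum())      # number of NaN entries

values_unique = ratings.is_unique        # checks the values
labels_unique = ratings.index.is_unique  # checks the index labels

print(total_entries, non_missing, missing)  # 5 4 1
print(values_unique, labels_unique)         # False True
```

The values are not unique because 4.8 repeats, while the five labels are all distinct.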

10. Reading A CSV Column As A Series

Assume you have a CSV file called daily_visitors.csv:

text
date,visitors
2026-01-01,120
2026-01-02,135
2026-01-03,150

Read the file:

python
df = pd.read_csv("daily_visitors.csv")

print(df)

Explanation

  • The pd.read_csv function is used to load data from a CSV file named "daily_visitors.csv" into a pandas DataFrame called df.
  • The print(df) statement outputs the entire DataFrame to the console, allowing users to view the data contained in the CSV file.
  • This code is useful for quickly inspecting the structure and contents of the dataset for further analysis.

Select one column as a Series:

python
visitors = df["visitors"]

print(type(visitors))
print(visitors)

Explanation

  • The variable visitors is assigned the 'visitors' column from the DataFrame df.
  • The type(visitors) function is called to print the data type of the visitors variable, which helps in understanding the structure of the data.
  • The print(visitors) statement outputs the actual content of the 'visitors' column, allowing for a quick inspection of the data values.

If you want the date column as the index:

python
df = pd.read_csv("daily_visitors.csv", index_col="date")
visitors = df["visitors"]

print(visitors)

Explanation

  • The code imports a CSV file named "daily_visitors.csv" into a pandas DataFrame, setting the "date" column as the index.
  • It extracts the "visitors" column from the DataFrame for further analysis or display.
  • The print function outputs the values of the "visitors" column to the console, allowing for a quick review of the data.

CSV reading usually returns a DataFrame. A single selected column is a Series.
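
If you do not have the file on disk, the same steps can be tried with the CSV text held in a string. This is a sketch that assumes the three rows shown above; io.StringIO is only a stand-in for a real file.

```python
import io

import pandas as pd

csv_text = """date,visitors
2026-01-01,120
2026-01-02,135
2026-01-03,150
"""

# StringIO lets read_csv treat the string like a file on disk.
df = pd.read_csv(io.StringIO(csv_text), index_col="date")

visitors = df["visitors"]          # single brackets give a Series
visitors_frame = df[["visitors"]]  # double brackets give a one-column DataFrame

print(type(visitors).__name__)        # Series
print(type(visitors_frame).__name__)  # DataFrame
print(visitors.loc["2026-01-02"])     # 135
```

The dates are read as plain strings here, so they work as ordinary labels with loc.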

11. head() And tail()

Use head() to preview the first rows.

python
sales = pd.Series([120, 135, 150, 160, 155, 170, 180])

print(sales.head())
print(sales.head(3))

Explanation

  • The code initializes a Pandas Series named sales containing a list of sales figures.
  • The print(sales.head()) statement outputs the first five entries of the Series by default.
  • The print(sales.head(3)) statement specifically retrieves and displays the first three entries of the Series.
  • This functionality is useful for quickly inspecting the data structure and values within the Series.

Use tail() to preview the last rows.

python
print(sales.tail())
print(sales.tail(2))

Explanation

  • print(sales.tail()) outputs the last five values of the sales Series by default.
  • print(sales.tail(2)) outputs only the last two values.
  • This is useful for quickly inspecting the end of a dataset, for example to confirm that the latest entries were loaded.

These are useful for checking data quickly.

12. sample()

sample() returns random rows.

python
products = pd.Series(
    ["notebook", "pen", "marker", "bag", "bottle", "eraser"],
    name="product",
)

print(products.sample(3, random_state=42))

Explanation

  • A Pandas Series named products is created containing a list of product names.
  • The sample method is used to randomly select 3 items from the Series.
  • The random_state parameter is set to 42, ensuring that the random selection is reproducible across different runs.
  • The selected products are printed to the console, allowing for easy inspection of the random sample.

random_state makes the sample reproducible.

13. value_counts()

value_counts() counts unique values.

python
categories = pd.Series(
    ["free", "pro", "free", "team", "pro", "free"],
    name="plan",
)

print(categories.value_counts())

Explanation

  • A pandas Series named categories is created containing different subscription plan types.
  • The value_counts() method is called on the Series to count the number of occurrences of each unique value.
  • The result is printed, displaying the frequency of each subscription plan type in descending order.
  • This code is useful for quickly analyzing categorical data and understanding the distribution of different categories.

Output:

text
plan
free    3
pro     2
team    1
Name: count, dtype: int64

Use this for categorical summaries.

To include missing values:

python
print(categories.value_counts(dropna=False))

Explanation

  • Uses the value_counts() method from the pandas library to count unique values in the categories Series.
  • The dropna=False parameter ensures that NaN (missing) values are included in the count.
  • The output is a Series showing the count of each unique category, which can be useful for data analysis and understanding distribution.
  • This method is commonly used in data preprocessing and exploratory data analysis to identify the presence of missing data.
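
The categories Series above has no missing values, so dropna=False does not change its output. The sketch below uses a small made-up Series named plans with two missing entries to make the difference visible, and also shows normalize=True for proportions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the six entries are missing.
plans = pd.Series(["free", "pro", np.nan, "free", np.nan, "team"], name="plan")

default_counts = plans.value_counts()          # missing values are left out
all_counts = plans.value_counts(dropna=False)  # missing values get their own row
shares = plans.value_counts(normalize=True)    # proportions of the non-missing values

print(default_counts)
print(all_counts)
print(shares)
```

With the default settings the counts add up to 4, while dropna=False accounts for all 6 entries.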

14. Sorting Values

python
scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)

print(scores.sort_values())

Explanation

  • A Pandas Series named scores is created with four integer values representing scores and corresponding string indices for names.
  • The sort_values() method is called on the scores Series to sort the scores in ascending order.
  • The sorted Series is printed, displaying the names alongside their scores in order from lowest to highest.
  • This code snippet demonstrates how to efficiently sort and display data using the Pandas library in Python.

Output:

text
Meera    76
Asha     82
Kabir    88
Ravi     91
dtype: int64

Descending order:

python
print(scores.sort_values(ascending=False))

Explanation

  • sort_values() is called on the scores Series.
  • The ascending=False argument specifies that the sorting should be done in descending order, meaning higher scores will appear first.
  • The print function outputs the sorted scores to the console, allowing for immediate visibility of the results.
  • This operation is useful for quickly identifying the highest scores in a dataset.

Get the top scorer:

python
top_student = scores.sort_values(ascending=False).head(1)

print(top_student)

Explanation

  • scores is the Pandas Series of student scores created above.
  • sort_values(ascending=False) sorts the scores in descending order, placing the highest score first.
  • head(1) keeps the first entry of the sorted Series, which is the top student.
  • The result is a one-element Series, so print(top_student) shows both the label Ravi and the score 91.
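
Sorting is not the only way to find the top scorer. As an alternative sketch, idxmax() returns the label of the largest value and nlargest() returns the top entries directly. The variable names are illustrative.

```python
import pandas as pd

scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)

top_label = scores.idxmax()   # label of the highest score
top_value = scores.max()      # the highest score itself
top_two = scores.nlargest(2)  # the two highest scores, largest first

print(top_label, top_value)   # Ravi 91
print(top_two)
```

idxmax() is handy when you only need the name, and nlargest() when you need a short leaderboard.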

15. Sorting Index

python
scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)

print(scores.sort_index())

Explanation

  • Creates a Pandas Series named scores with specified values and custom indices representing names.
  • Uses the sort_index() method to sort the Series based on the alphabetical order of the indices.
  • The sorted Series is printed, displaying the scores associated with each name in order.
  • This approach enhances data readability and organization, making it easier to locate specific entries.

Output:

text
Asha     82
Kabir    88
Meera    76
Ravi     91
dtype: int64

Use sort_index() when label order matters.

16. inplace=True: Should You Use It?

Many Pandas methods can return a new object.

python
sorted_scores = scores.sort_values()

Explanation

  • sort_values() is called on the scores Series.
  • Values are sorted in ascending order by default.
  • The sorted result is stored in the variable sorted_scores.
  • The original scores Series is left unchanged, because sort_values() returns a new Series.

Some methods also support inplace=True.

python
scores.sort_values(inplace=True)

Explanation

  • sort_values() is called on the scores Series with inplace=True.
  • The Series is sorted in ascending order and the result is written back to scores instead of being returned as a new object.
  • Any code that relied on the original order of scores sees the sorted order after this line runs.

For learning and production code, returning a new object is often clearer.

Why?

  • it avoids accidental mutation
  • it makes code easier to debug
  • it works well with method chaining

Prefer:

python
scores = scores.sort_values()

Explanation

  • sort_values() is called on the scores Series and returns a new Series sorted in ascending order.
  • The assignment rebinds the name scores to that new Series.
  • The original object is not mutated, so any other variable that still refers to it keeps the unsorted order.
  • The same pattern works in the middle of a method chain.
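
The points above can be checked with a short sketch: the original Series keeps its order when the sorted result goes into a new variable, and the returned object can be chained with further methods.

```python
import pandas as pd

scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)

# sort_values() returns a new Series; scores keeps its original order.
sorted_scores = scores.sort_values()

print(scores.index.tolist())         # ['Asha', 'Ravi', 'Meera', 'Kabir']
print(sorted_scores.index.tolist())  # ['Meera', 'Asha', 'Kabir', 'Ravi']

# Because each step returns a Series, the calls can be chained.
top_two_names = scores.sort_values(ascending=False).head(2).index.tolist()

print(top_two_names)                 # ['Ravi', 'Kabir']
```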

17. Mathematical Methods

Create a Series:

python
orders = pd.Series([12, 18, 10, 25, 17, np.nan], name="orders")

Explanation

  • Initializes a Pandas Series named "orders" containing a list of integers representing order quantities.
  • The list includes a NaN (Not a Number) value to represent missing data in the series.
  • The use of pd.Series allows for easy manipulation and analysis of the order data.
  • The name parameter assigns a label to the series, making it easier to reference in data analysis tasks.

sum

python
print(orders.sum())

Explanation

  • orders is the Pandas Series created above.
  • sum() adds up all non-missing values: 12 + 18 + 10 + 25 + 17.
  • The result is printed as a float because the NaN entry makes the Series float64.

Output:

text
82.0

By default, missing values are skipped.
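
If you would rather have a missing value propagate into the result, the skipna parameter controls this. A minimal sketch with the same orders Series; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

orders = pd.Series([12, 18, 10, 25, 17, np.nan], name="orders")

total_skipping_nan = orders.sum()        # NaN is ignored
total_strict = orders.sum(skipna=False)  # NaN propagates into the result
missing_count = int(orders.isna().sum()) # how many entries were skipped

print(total_skipping_nan)  # 82.0
print(total_strict)        # nan
print(missing_count)       # 1
```

Checking the missing count next to a total makes it clear how many entries the total leaves out.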

mean

python
print(orders.mean())

Explanation

  • The code snippet uses the mean() function to compute the average of the values in the orders dataset.
  • The print() function outputs the calculated mean to the console for easy viewing.
  • This operation is typically used in data analysis to summarize the central tendency of numerical data.

median

python
print(orders.median())

Explanation

  • orders is the Pandas Series created above.
  • median() returns the middle value of the sorted non-missing data.
  • After NaN is skipped, five values remain (10, 12, 17, 18, 25), so the median is 17.0.
  • With an even number of values, the median is the average of the two middle values.

mode

python
print(orders.mode())

Explanation

  • mode() is called on the orders Series and returns the value or values that appear most often.
  • The result is always a Series, even when there is only one mode.
  • In orders every non-missing value appears exactly once, so all five values are returned, in sorted order.
  • mode() is most informative on data with repeated values, such as categories or ratings.

mode() can return more than one value.
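
Here is a small sketch of a tie, using a made-up Series named ratings_given in which two values share the highest frequency.

```python
import pandas as pd

# Hypothetical data: 5 and 3 both appear twice.
ratings_given = pd.Series([5, 3, 5, 3, 4])

modes = ratings_given.mode()  # a Series containing every tied value, sorted
first_mode = modes.iloc[0]    # pick one explicitly if a single value is needed

print(modes.tolist())  # [3, 5]
print(first_mode)      # 3
```

Taking iloc[0] is a simple convention for picking one mode, not a rule. Choose whatever tie-break suits your data.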

standard deviation and variance

python
print(orders.std())
print(orders.var())

Explanation

  • orders.std() returns the standard deviation of the non-missing values, a measure of how spread out they are. For this Series it is about 5.857.
  • orders.var() returns the variance, which is the square of the standard deviation. For this Series it is 34.3.
  • Both methods compute the sample statistic by default (ddof=1), dividing by n - 1 rather than n.
  • NaN is skipped in both calculations.

min and max

python
print(orders.min())
print(orders.max())

Explanation

  • orders.min() returns the smallest non-missing value, 10.0.
  • orders.max() returns the largest non-missing value, 25.0.
  • Together they give the range of the order values.
  • NaN is skipped by default in both methods.

18. describe()

describe() gives a quick statistical summary.

python
print(orders.describe())

Explanation

  • orders is the Pandas Series created in the previous section.
  • For numeric data, describe() returns count, mean, standard deviation, minimum, the 25%, 50%, and 75% quantiles, and maximum.
  • Missing values are excluded, so count is 5 rather than 6.
  • The result is itself a Series whose index holds the statistic names.

Possible output:

text
count     5.000000
mean     16.400000
std       5.856620
min      10.000000
25%      12.000000
50%      17.000000
75%      18.000000
max      25.000000
Name: orders, dtype: float64

For text data:

python
plans = pd.Series(["free", "pro", "free", "team", "free"])

print(plans.describe())

Explanation

  • The code creates a Pandas Series named plans containing subscription plan names.
  • For text data, describe() returns count, unique, top, and freq instead of numeric statistics.
  • top is the most frequent value and freq is how many times it appears.
  • To see the count for every plan, use value_counts() instead.

It reports count, unique values, top value, and frequency.
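
The result of describe() is itself a Series, so individual statistics can be read by label. A short sketch with the same plans data:

```python
import pandas as pd

plans = pd.Series(["free", "pro", "free", "team", "free"])

summary = plans.describe()

print(summary["count"])   # 5
print(summary["unique"])  # 3
print(summary["top"])     # free
print(summary["freq"])    # 3
```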

19. Selecting Values By Position With iloc

Use iloc for integer-position selection.

python
scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)

print(scores.iloc[0])
print(scores.iloc[1:3])
print(scores.iloc[[0, 3]])

Explanation

  • A Pandas Series named scores is created with integer values and custom string indices representing names.
  • The first print statement retrieves the score of the first index ("Asha") using iloc[0].
  • The second print statement retrieves a slice of scores from the second to the third index ("Ravi" and "Meera") using iloc[1:3].
  • The third print statement accesses the scores of the first and last indices ("Asha" and "Kabir") using a list with iloc[[0, 3]].

iloc ignores labels and uses positions.

20. Selecting Values By Label With loc

Use loc for label-based selection.

python
print(scores.loc["Ravi"])
print(scores.loc["Asha":"Meera"])

Explanation

  • The first line prints the value stored under the label "Ravi" in the scores Series.
  • The second line prints every value from the label "Asha" through "Meera", with both ends included.
  • loc selects by label rather than by position.
  • The code uses the scores Series defined in the previous section.

Important:

Label slicing with loc includes the stop label when it exists.

Position slicing with iloc excludes the stop position.
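
To make the difference concrete, here is a minimal sketch that slices the same scores Series both ways. The label slice stops at "Meera" and includes it. The position slice stops at 2 and excludes it.

```python
import pandas as pd

scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)

by_label = scores.loc["Asha":"Meera"]  # stop label is included
by_position = scores.iloc[0:2]         # stop position is excluded

print(by_label.index.tolist())     # ['Asha', 'Ravi', 'Meera']
print(by_position.index.tolist())  # ['Asha', 'Ravi']
```

"Meera" sits at position 2, yet only the label slice returns it.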

21. Why Avoid Ambiguous Integer Indexing?

Consider this Series:

python
numbers = pd.Series([100, 200, 300], index=[10, 20, 30])

Explanation

  • Initializes a Pandas Series named numbers containing three integer values: 100, 200, and 300.
  • Assigns custom indices of 10, 20, and 30 to the respective values in the Series.
  • Facilitates easier data manipulation and retrieval by using meaningful indices instead of default integer indices.
  • Useful for scenarios where data points need to be accessed or analyzed based on specific labels rather than their position.

This can be confusing:

python
numbers[10]

Explanation

  • numbers is a Series, not a list, and its index labels are the integers 10, 20, and 30.
  • With an integer index, numbers[10] is treated as a label lookup and returns 100.
  • numbers[0] raises a KeyError, because 0 is not one of the labels, even though position 0 exists.
  • A reader used to list indexing can easily misread this code.

Does 10 mean label or position?

Use explicit access:

python
print(numbers.loc[10])
print(numbers.iloc[0])

Explanation

  • The first line uses loc to look up the label 10 in the numbers Series and prints 100.
  • The second line uses iloc to read the first position (position 0) and also prints 100.
  • Both lines return the same value here, but each states clearly whether it means a label or a position.
  • loc raises a KeyError for a missing label, and iloc raises an IndexError for a position that is out of range.

Good Pandas code is explicit about label vs position.
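
The sketch below shows why the explicit accessors matter for an integer index. The flag label_zero_exists is an illustrative name used only to record whether the lookup succeeded.

```python
import pandas as pd

numbers = pd.Series([100, 200, 300], index=[10, 20, 30])

by_label = numbers.loc[10]     # label 10
by_position = numbers.iloc[0]  # first position
last_value = numbers.iloc[-1]  # negative positions count from the end

# 0 is a valid position but not a label, so a label lookup fails.
try:
    numbers.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(by_label, by_position, last_value)  # 100 100 300
print(label_zero_exists)                  # False
```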

22. Slicing A Series

python
weekly_sales = pd.Series(
    [120, 135, 150, 160, 155, 170, 180],
    index=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
)

print(weekly_sales.iloc[1:4])
print(weekly_sales.loc["Tue":"Thu"])

Explanation

  • A Pandas Series named weekly_sales is created with sales data for each day of the week.
  • The iloc method is used to retrieve sales data for Tuesday to Thursday using integer-based indexing.
  • The loc method is employed to access sales data from Tuesday to Thursday using label-based indexing.
  • Both methods demonstrate different ways to slice data from the Series, showcasing flexibility in data retrieval.

Both select Tuesday through Thursday here, but they use different rules.

23. Editing Values

Create a Series:

python
scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)

Explanation

  • Initializes a Pandas Series named scores containing four numerical values representing scores.
  • The scores are associated with custom indices: "Asha", "Ravi", "Meera", and "Kabir".
  • This structure allows for easy access and manipulation of scores using the corresponding names as labels.
  • The use of pd.Series indicates that the code relies on the Pandas library for data manipulation.

Edit by label:

python
scores.loc["Meera"] = 80

Explanation

  • loc selects the entry labeled "Meera" in the scores Series and assigns 80 to it.
  • "Meera" already exists in the index, so the old value 76 is overwritten.
  • If the label did not exist, the same statement would add a new entry with that label.

Edit by position:

python
scores.iloc[0] = 85

Explanation

  • iloc selects by integer position, so iloc[0] refers to the first entry of the scores Series, which is labeled "Asha".
  • The value 85 replaces the existing value 82.
  • Unlike loc, iloc cannot add entries: assigning to a position that does not exist raises an IndexError.

Add a new label:

python
scores.loc["Nisha"] = 92

Explanation

  • "Nisha" is not yet a label in the scores Series.
  • Assigning through loc with a new label adds an entry instead of updating one.
  • After this line the Series has five entries, and Nisha's score is 92.

Print:

python
print(scores)

Explanation

  • print(scores) outputs the scores Series after the three edits above.
  • Asha is now 85, Meera is 80, and Nisha has been added with 92.
  • Ravi (91) and Kabir (88) are unchanged.

24. Copying A Series Safely

When you take a subset and plan to modify it, use .copy().

python
top_scores = scores.head(2).copy()
top_scores.iloc[0] = 999

print(top_scores)
print(scores)

Explanation

  • head(2) selects the first two entries of the scores Series, and copy() makes the result an independent Series.
  • The first entry of top_scores is then set to 999.
  • The original scores Series remains unchanged, because the edit was made on the copy.
  • Both top_scores and scores are printed to show the difference.

Why?

It prevents accidental changes or warnings caused by modifying a view-like object.

Interview answer:

If I need an independent object, I call .copy() before modifying a subset.

25. Python Functions With Series

Many Python built-ins work with Series.

python
scores = pd.Series([82, 91, 76, 88])

print(len(scores))
print(type(scores))
print(max(scores))
print(min(scores))
print(sorted(scores))

Explanation

  • Initializes a Pandas Series named scores with a list of integers representing scores.
  • Uses len(scores) to print the number of elements in the Series.
  • Utilizes type(scores) to display the data type of the scores object, confirming it is a Pandas Series.
  • Calls max(scores) to find and print the highest score in the Series.
  • Calls min(scores) to find and print the lowest score in the Series.
  • Uses sorted(scores) to print the scores in ascending order.

Convert to a list:

python
print(scores.tolist())

Explanation

  • tolist() converts the scores Series into a plain Python list of its values.
  • The index labels are dropped; only the values remain.
  • This is useful when another function expects a list rather than a Series.

Convert to a dictionary:

python
named_scores = pd.Series(
    [82, 91, 76],
    index=["Asha", "Ravi", "Meera"],
)

print(named_scores.to_dict())

Explanation

  • A Pandas Series is created with scores assigned to specific names as indices.
  • The pd.Series constructor takes a list of scores and an index list to associate each score with a name.
  • The to_dict() method is called on the Series to convert it into a dictionary, where names are keys and scores are values.
  • The resulting dictionary is printed, displaying the mapping of names to their corresponding scores.

26. Membership: Index vs Values

For a Series, the in operator checks the index, not the values.

python
scores = pd.Series(
    [82, 91, 76],
    index=["Asha", "Ravi", "Meera"],
)

print("Ravi" in scores)
print(91 in scores)

Explanation

  • A Pandas Series named scores is created with three integer values and string index labels.
  • The first print checks whether "Ravi" is an index label, which is True.
  • The second print also checks the index labels, not the values. 91 is a value but not a label, so the result is False.
  • Membership testing with in therefore behaves like a dictionary key check.

Output:

text
True
False

To check values, use:

python
print(91 in scores.values)

Explanation

  • scores.values returns the underlying values of the scores Series as a NumPy array.
  • The in operator checks whether 91 appears among those values.
  • The result is True if 91 is found and False otherwise, so this prints True.

Or use isin():

python
print(scores.isin([91]))

Explanation

  • The isin() method tests each element of the scores Series against the list [91].
  • It returns a boolean Series where each element indicates whether the corresponding score is in that list.
  • The boolean Series can be used directly as a filter, or reduced to a single answer.
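
When a single True or False answer is enough, the boolean Series from isin() can be reduced with any(). The following is a minimal sketch that reuses the scores data from this section; the variable names label_present, value_present, and value_absent are illustrative and not part of the original lesson.

```python
import pandas as pd

scores = pd.Series([82, 91, 76], index=["Asha", "Ravi", "Meera"])

# Membership in the index labels
label_present = "Ravi" in scores.index

# Membership in the values, reduced to one boolean with any()
value_present = bool(scores.isin([91]).any())
value_absent = bool(scores.isin([100]).any())

print(label_present, value_present, value_absent)
```

Checking scores.index explicitly makes it obvious to a reader that labels, not values, are being tested.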

27. Looping Over A Series

Looping over a Series gives values:

python
for score in scores:
    print(score)

Explanation

  • The code uses a for loop to iterate over the scores Series.
  • Iterating over a Series yields its values one at a time, not its index labels.
  • The print() function outputs the current score on each iteration.

Loop over index and values:

python
for name, score in scores.items():
    print(name, score)

Explanation

  • The code uses a for loop over scores.items(), which yields (index label, value) pairs.
  • name receives the index label and score receives the corresponding value in each iteration.
  • The print function outputs each label and value pair to the console.
  • This approach is useful for displaying or logging the contents of a Series in a readable format.

Use vectorized operations when possible. Loops are useful for display, debugging, or custom logic.
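
To make the advice concrete, here is a small sketch that adds a 5-point bonus to every score in two ways. The bonus value and the variable names looped and vectorized are assumptions made for illustration.

```python
import pandas as pd

scores = pd.Series([82, 91, 76], index=["Asha", "Ravi", "Meera"])

# Loop version: build a new list one value at a time
looped = pd.Series([score + 5 for score in scores], index=scores.index)

# Vectorized version: one expression applied to every value
vectorized = scores + 5

print(vectorized)
print(looped.equals(vectorized))
```

Both versions produce the same Series, but the vectorized form is shorter and keeps the index without any extra work.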

28. Arithmetic Operations

Series operations align by index labels.

python
jan = pd.Series([100, 200, 300], index=["A", "B", "C"])
feb = pd.Series([110, 190, 250], index=["A", "B", "D"])

print(feb - jan)

Explanation

  • Creates two pandas Series, jan and feb, with specified indices "A", "B", "C" for jan and "A", "B", "D" for feb.
  • The subtraction operation feb - jan is performed, aligning the indices of both Series.
  • For indices that do not match, such as "C" in jan and "D" in feb, the result will contain NaN (Not a Number) for those positions.
  • The output will display the differences for matching indices and NaN for non-matching ones.

Output:

text
A    10.0
B   -10.0
C     NaN
D     NaN
dtype: float64

Why?

  • A and B exist in both Series
  • C is missing from February
  • D is missing from January

If you want missing values treated as zero:

python
print(feb.sub(jan, fill_value=0))

Explanation

  • The code uses the sub method from the pandas library to perform element-wise subtraction between two Series, feb and jan.
  • The fill_value=0 argument ensures that any missing values in either Series are treated as zeros during the subtraction.
  • This approach helps to avoid NaN results when one Series has values that the other does not, providing a cleaner output.
  • The result will be a new Series containing the differences, with indices from both Series preserved.
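
As a quick check of what fill_value=0 does, the sketch below reruns the subtraction with the same jan and feb data and stores the result in a variable named change (an illustrative name).

```python
import pandas as pd

jan = pd.Series([100, 200, 300], index=["A", "B", "C"])
feb = pd.Series([110, 190, 250], index=["A", "B", "D"])

# A label missing from one side is treated as 0 on that side
change = feb.sub(jan, fill_value=0)

print(change)
```

C comes out as -300.0 because February has no entry for C, and D comes out as 250.0 because January has no entry for D. Whether zero is the right stand-in depends on what a missing label means in your data.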

29. Relational Operations

python
scores = pd.Series([82, 91, 76, 88])

print(scores >= 85)

Explanation

  • A pandas Series named scores is created containing four integer values representing scores.
  • The expression scores >= 85 performs a comparison operation, checking each score to see if it is greater than or equal to 85.
  • The result of this comparison is a boolean Series, where each element indicates whether the corresponding score meets the threshold.
  • The print function outputs the boolean Series to the console, allowing users to see which scores are above or equal to 85.

Output:

text
0    False
1     True
2    False
3     True
dtype: bool

This creates a boolean Series.

30. Boolean Indexing

Use a boolean condition to filter values.

python
scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)

high_scores = scores[scores >= 85]

print(high_scores)

Explanation

  • A pandas Series named scores is created with student names as indices and their corresponding scores as values.
  • The high_scores variable filters the scores Series to include only those scores that are greater than or equal to 85.
  • The filtered high scores are then printed to the console, showing only the students who achieved this threshold.
  • This code demonstrates basic data manipulation and filtering using pandas in Python.

Output:

text
Ravi     91
Kabir    88
dtype: int64

Count values above a threshold:

python
print((scores >= 85).sum())

Explanation

  • The expression scores >= 85 creates a boolean Series where each element indicates whether the corresponding score meets the condition.
  • The sum() method adds up the True values, which counts how many scores are 85 or higher.
  • For the scores Series above, the result is 2.

This works because True behaves like 1 and False behaves like 0.

31. Multiple Conditions

Use & for AND and | for OR.

Wrap each condition in parentheses.

python
scores = pd.Series([45, 62, 78, 91, 38, 84])

selected = scores[(scores >= 60) & (scores <= 85)]

print(selected)

Explanation

  • A Pandas Series named scores is created with a list of integer values representing scores.
  • The selected variable filters the scores Series to include only those values that are greater than or equal to 60 and less than or equal to 85.
  • The filtering is done using a boolean condition that combines two comparisons with the logical AND operator (&).
  • Finally, the filtered results stored in selected are printed to the console, displaying only the scores that meet the specified criteria.

Output:

text
1    62
2    78
5    84
dtype: int64

Common mistake:

python
scores >= 60 & scores <= 85

Explanation

  • The intent is to test whether each score is at least 60 and at most 85.
  • Because & binds more tightly than >= and <=, Python reads the expression as scores >= (60 & scores) <= 85.
  • That is a chained comparison, which needs a single True or False from a whole Series, so pandas raises a ValueError saying the truth value of a Series is ambiguous.
  • Parenthesizing each comparison, as in (scores >= 60) & (scores <= 85), gives the intended result.

This is wrong because & has higher precedence than the comparison operators, so the conditions are not evaluated the way they read.
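
The difference can be checked directly. In this sketch the unparenthesized form is wrapped in try/except so the script keeps running; the flag name raised and the variable mask are illustrative.

```python
import pandas as pd

scores = pd.Series([45, 62, 78, 91, 38, 84])

# Without parentheses, Python parses this as a chained comparison
# on whole Series, which pandas refuses to reduce to one boolean.
try:
    scores >= 60 & scores <= 85
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# With parentheses, each comparison produces its own boolean Series
mask = (scores >= 60) & (scores <= 85)
print(scores[mask])
```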

32. Plotting A Series

Pandas can plot Series using Matplotlib behind the scenes.

python
daily_visitors = pd.Series(
    [120, 135, 150, 160, 155, 170, 180],
    index=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
)

daily_visitors.plot(kind="line", title="Daily Visitors")

Explanation

  • Creates a pandas Series named daily_visitors containing visitor counts for each day of the week.
  • Assigns custom index labels ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun") to represent the days.
  • Utilizes the plot method to generate a line graph of the daily visitors.
  • Sets the title of the plot to "Daily Visitors" for clarity in visualization.

For category counts:

python
plans = pd.Series(["free", "pro", "free", "team", "pro", "free"])

plans.value_counts().plot(kind="bar", title="Plan Counts")

Explanation

  • The code creates a pandas Series containing different subscription plan types.
  • It utilizes the value_counts() method to count the occurrences of each unique plan.
  • The resulting counts are then plotted as a bar chart using the plot() method with the title "Plan Counts".
  • This visualization helps in understanding the popularity of each subscription plan at a glance.

In notebooks, plots display inline if plotting is configured. In a plain Python script, call matplotlib.pyplot.show() after plotting to open the figure window.

33. Changing Data Type With astype

python
scores = pd.Series([82.0, 91.0, 76.0])

scores_int = scores.astype("int64")

print(scores_int)
print(scores_int.dtype)

Explanation

  • A Pandas Series named scores is created containing three float values.
  • The astype method is used to convert the float values in the Series to integers, resulting in a new Series scores_int.
  • The converted Series scores_int is printed to display the integer values.
  • The data type of the scores_int Series is printed to confirm the conversion to int64.

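Use astype() when every value can be converted cleanly. Converting floats to int64 truncates any fractional part, and the conversion fails if the Series contains missing values.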
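
Two edge cases are worth seeing once: fractional values and missing values. The sketch below uses made-up numbers and illustrative variable names (truncated, with_gap, nullable). It relies on the nullable "Int64" dtype, spelled with a capital I, which is one possible way to keep integers alongside missing entries.

```python
import numpy as np
import pandas as pd

# Fractional parts are truncated, not rounded
truncated = pd.Series([82.9, 91.2, 76.5]).astype("int64")
print(truncated.tolist())

with_gap = pd.Series([82.0, np.nan, 76.0])

# NaN cannot be stored in int64, so this conversion fails
try:
    with_gap.astype("int64")
    failed = False
except ValueError:
    failed = True
print("int64 conversion failed:", failed)

# The nullable "Int64" dtype keeps the gap as <NA>
nullable = with_gap.astype("Int64")
print(nullable)
```

If rounding is what you want, call round() before astype("int64").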

34. Safer Numeric Conversion With pd.to_numeric

Real data often has messy strings.

python
raw_prices = pd.Series(["120", "99.5", "missing", "150"])

prices = pd.to_numeric(raw_prices, errors="coerce")

print(prices)

Explanation

  • The code initializes a pandas Series containing string representations of prices, including a non-numeric value ("missing").
  • The pd.to_numeric() function is used to convert the Series to numeric values, with the errors="coerce" argument ensuring that any non-convertible values are replaced with NaN (Not a Number).
  • The resulting Series, prices, contains numeric values for valid entries and NaN for the invalid entry.
  • Finally, the code prints the converted Series, allowing for easy inspection of the numeric values.

Output:

text
0    120.0
1     99.5
2      NaN
3    150.0
dtype: float64

errors="coerce" converts invalid values to missing values.

This is useful for cleaning imported CSV data.
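
After coercing, it often helps to see which raw entries were rejected before deciding how to handle them. This is a small sketch using the same raw_prices data; the variable name rejected is illustrative.

```python
import pandas as pd

raw_prices = pd.Series(["120", "99.5", "missing", "150"])
prices = pd.to_numeric(raw_prices, errors="coerce")

# Entries that were present in the raw data but became NaN after conversion
rejected = raw_prices[prices.isna() & raw_prices.notna()]

print(rejected)
```

Inspecting the rejected entries can reveal patterns, such as a currency symbol or a placeholder word, that are better fixed before conversion than silently dropped.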

35. between()

between() checks whether values lie inside a range.

python
scores = pd.Series([45, 62, 78, 91, 38, 84])

print(scores.between(60, 85))
print(scores[scores.between(60, 85)])

Explanation

  • The code creates a Pandas Series named scores containing a list of numerical values representing scores.
  • The between method is used to check which scores fall within the range of 60 to 85, returning a boolean Series.
  • The first print statement outputs the boolean Series indicating whether each score meets the criteria.
  • The second print statement filters the original scores Series to display only those scores that are between 60 and 85, based on the boolean mask.

Output:

text
0    False
1     True
2     True
3    False
4    False
5     True
dtype: bool
1    62
2    78
5    84
dtype: int64
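
By default between() includes both endpoints. The sketch below uses a different set of scores, chosen so that values sit exactly on the bounds, and assumes pandas 1.3 or later, where the inclusive parameter accepts the strings "both", "neither", "left", and "right".

```python
import pandas as pd

scores = pd.Series([60, 62, 85, 91])

# Default: both 60 and 85 count as inside the range
both_ends = scores[scores.between(60, 85)]

# Exclude both endpoints
neither_end = scores[scores.between(60, 85, inclusive="neither")]

print(both_ends.tolist())
print(neither_end.tolist())
```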

36. clip()

clip() limits values to a lower and upper bound.

python
ratings = pd.Series([2, 5, 8, 11, -3, 7])

safe_ratings = ratings.clip(lower=0, upper=10)

print(safe_ratings)

Explanation

  • The code creates a Pandas Series named ratings containing a mix of integers, including negative values.
  • The clip method is used to limit the values in the ratings Series, setting a lower bound of 0 and an upper bound of 10.
  • Any ratings below 0 are replaced with 0, and any ratings above 10 are replaced with 10, ensuring all values are within the desired range.
  • The modified Series, safe_ratings, is then printed, displaying the adjusted ratings.

Output:

text
0     2
1     5
2     8
3    10
4     0
5     7
dtype: int64

Use this for outlier capping or valid range enforcement.

37. Duplicates: duplicated() And drop_duplicates()

python
plans = pd.Series(["free", "pro", "free", "team", "pro", "free"])

print(plans.duplicated())
print(plans.duplicated().sum())

Explanation

  • The code creates a pandas Series named plans containing various subscription types.
  • The duplicated() method is called on the Series to identify duplicate entries, returning a boolean Series where True indicates a duplicate.
  • The first print statement outputs the boolean Series showing which entries are duplicates.
  • The second print statement sums the True values from the boolean Series, providing the total count of duplicate entries in the plans Series.

Output:

text
0    False
1    False
2     True
3    False
4     True
5     True
dtype: bool
3

Drop duplicates:

python
print(plans.drop_duplicates())

Explanation

  • The drop_duplicates() method is called on the plans Series to remove repeated values.
  • The method returns a new Series containing only the first occurrence of each value, with the original index labels preserved.
  • The print() function outputs the resulting Series.
  • This operation is useful for data cleaning before analysis.

Output:

text
0    free
1     pro
3    team
dtype: object

Keep the last occurrence:

python
print(plans.drop_duplicates(keep="last"))

Explanation

  • The drop_duplicates method is called on the plans Series.
  • The parameter keep="last" specifies that when a value is repeated, its last occurrence is kept instead of its first.
  • The result is printed to the console, showing one entry per distinct value.
  • This operation is useful for cleaning data by ensuring that only unique entries remain based on the specified criteria.
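
The keep parameter also accepts False, which discards every value that appears more than once. A short sketch comparing the options on the same plans data (the variable names are illustrative):

```python
import pandas as pd

plans = pd.Series(["free", "pro", "free", "team", "pro", "free"])

# Keep the last occurrence of each value
last_kept = plans.drop_duplicates(keep="last")

# Keep only values that occur exactly once
only_unique = plans.drop_duplicates(keep=False)

# Mark every member of a repeated group, not just the later copies
all_repeats = plans.duplicated(keep=False)

print(last_kept)
print(only_unique)
print(all_repeats.tolist())
```

With keep="last" the surviving index labels are 3, 4, and 5, which shows that the original positions are preserved rather than renumbered.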

38. Missing Data

Create a Series with missing values:

python
ratings = pd.Series([4.5, np.nan, 3.8, np.nan, 4.9])

Explanation

  • Initializes a Pandas Series named ratings containing five numerical values representing ratings.
  • Utilizes np.nan to denote missing or undefined ratings in the dataset.
  • The Series can be used for further data analysis or manipulation, leveraging Pandas' powerful data handling capabilities.
  • This structure allows for easy identification and handling of missing data points in subsequent operations.

Find missing values:

python
print(ratings.isna())
print(ratings.isna().sum())

Explanation

  • The first line, print(ratings.isna()), outputs a boolean Series of the same length as ratings, where each entry is True if the corresponding value is missing and False otherwise.
  • The second line, print(ratings.isna().sum()), prints the total number of missing values by summing the booleans (True counts as 1). For this Series the result is 2.
  • This is useful for data cleaning and preprocessing, because it quickly shows how much data is missing.

isnull() is an alias for isna().

Drop missing values:

python
print(ratings.dropna())

Explanation

  • The dropna() method is called on the ratings Series to remove the entries whose value is NaN (missing).
  • The result is a new Series containing only the three valid ratings, which keep their original index labels 0, 2, and 4.
  • The print() function outputs the cleaned Series.
  • This is a common preprocessing step before statistical analysis or machine learning tasks.

Fill missing values:

python
filled = ratings.fillna(ratings.mean())

print(filled)

Explanation

  • The fillna() method replaces the NaN (missing) values in the ratings Series.
  • The argument ratings.mean() is the mean of the non-missing ratings, here 4.4, because mean() skips NaN by default.
  • The result is stored in the variable filled, a new Series with no missing values; ratings itself is unchanged.
  • The print(filled) statement outputs the filled Series.

Use a domain-appropriate fill value. Do not blindly use the mean for every dataset.
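
Two common alternatives to the mean are the median, which is less affected by extreme values, and forward fill, which copies the previous valid value and suits ordered data such as daily readings. A sketch using the ratings Series from this section, with illustrative variable names:

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.5, np.nan, 3.8, np.nan, 4.9])

# Fill with the median of the non-missing values
median_filled = ratings.fillna(ratings.median())

# Fill each gap with the most recent valid value before it
forward_filled = ratings.ffill()

print(median_filled.tolist())
print(forward_filled.tolist())
```

Forward fill leaves a leading gap unfilled, because there is no earlier value to copy.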

39. isin()

isin() checks whether each value is in a list-like collection.

python
scores = pd.Series([49, 50, 75, 99, 100, 42])

near_milestones = scores[scores.isin([49, 99])]

print(near_milestones)

Explanation

  • A Pandas Series named scores is created containing a list of integer values representing scores.
  • The isin() method builds a boolean mask that is True where a score equals 49 or 99, and the mask is used to filter the Series.
  • The filtered results are stored in the variable near_milestones, which contains only the scores that exactly match one of the listed values.
  • Finally, the print() function outputs the filtered Series to the console, displaying the selected milestone scores.

Output:

text
0    49
3    99
dtype: int64

Use it for membership filters.

40. map()

map() is useful for value replacement using a dictionary or function.

python
plans = pd.Series(["free", "pro", "team", "free"])

plan_labels = plans.map({
    "free": "Starter",
    "pro": "Professional",
    "team": "Team",
})

print(plan_labels)

Explanation

  • A pandas Series named plans is created containing different subscription plan types.
  • The map function is utilized to replace each plan type with a corresponding label defined in a dictionary.
  • The dictionary maps "free" to "Starter", "pro" to "Professional", and "team" to "Team".
  • The transformed labels are stored in the variable plan_labels.
  • Finally, the new labels are printed to the console, displaying the mapped values.

Output:

text
0          Starter
1     Professional
2             Team
3          Starter
dtype: object

If a value is not found in the dictionary, the result becomes missing for that value.
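
The sketch below shows that behaviour with a plan that is absent from the dictionary, then supplies a fallback label with fillna(). The "enterprise" plan and the "Unknown" label are made-up values for illustration. The last line shows map() with a function instead of a dictionary.

```python
import pandas as pd

plans = pd.Series(["free", "pro", "enterprise"])
labels = {"free": "Starter", "pro": "Professional"}

# "enterprise" has no dictionary entry, so it maps to a missing value
mapped = plans.map(labels)

# Replace the missing result with a fallback label
with_default = mapped.fillna("Unknown")

# map() also accepts a function applied to each value
upper = plans.map(str.upper)

print(mapped)
print(with_default)
print(upper)
```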

41. apply()

apply() applies a function to each value.

python
prices = pd.Series([99, 149, 249])

def add_tax(price):
    return price * 1.18

final_prices = prices.apply(add_tax)

print(final_prices)

Explanation

  • A Pandas Series named prices is created containing three initial price values.
  • The function add_tax takes a single price as input and returns the price increased by 18% to account for tax.
  • The apply method is used on the prices Series to apply the add_tax function to each element, resulting in a new Series called final_prices.
  • Finally, the final_prices Series is printed, displaying the prices after tax has been added.

Output:

text
0    116.82
1    175.82
2    293.82
dtype: float64

For simple arithmetic, vectorized code is better:

python
final_prices = prices * 1.18

Explanation

  • The variable final_prices stores the updated price values.
  • The prices Series is multiplied by 1.18, which adds 18% to every value.
  • The multiplication is applied element by element and returns a new Series; no explicit loop or function call is needed.
  • The code assumes that prices is a numeric Series.

Use apply() when the logic is custom and cannot be expressed cleanly with vectorized operations.
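
Branching logic is a typical case where apply() reads more naturally than arithmetic. In this sketch the grade function and its score bands are hypothetical, chosen only to illustrate the pattern.

```python
import pandas as pd

scores = pd.Series([45, 62, 78, 91], index=["Asha", "Ravi", "Meera", "Kabir"])

def grade(score):
    # Hypothetical grading bands chosen for illustration
    if score >= 85:
        return "A"
    if score >= 60:
        return "B"
    return "C"

# apply() calls grade once per value and collects the results
grades = scores.apply(grade)

print(grades)
```

The result keeps the original index, so each grade stays attached to the right student.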

42. Cleaning Price Strings

Real CSV data often stores prices as strings:

python
raw_prices = pd.Series(["$2.39", "$3.50", None, "$10.25", "not available"])

Explanation

  • Initializes a Pandas Series named raw_prices containing various price strings and a None value.
  • The Series includes valid price entries as strings (e.g., "$2.39", "$3.50", "$10.25") and a placeholder for missing data ("not available").
  • This structure allows for easy manipulation and analysis of price data, despite the presence of inconsistent formats.
  • The use of None indicates missing data, which is a common practice in data handling with Pandas.

Remove the dollar symbol:

python
clean_text = raw_prices.str.replace("$", "", regex=False)

Explanation

  • The raw_prices variable is expected to be a pandas Series containing price strings with dollar signs.
  • The str.replace method is used to search for the dollar sign character ("$") in each string of the Series.
  • The regex=False argument indicates that the dollar sign should be treated as a literal character, not a regular expression.
  • The result is stored in the clean_text variable, which contains the price strings without the dollar signs.

Convert to numbers:

python
prices_usd = pd.to_numeric(clean_text, errors="coerce")

print(prices_usd)

Explanation

  • The code uses the pd.to_numeric() function from the Pandas library to convert a variable clean_text into numeric values.
  • The parameter errors="coerce" ensures that any non-convertible values in clean_text are replaced with NaN (Not a Number) instead of raising an error.
  • The resulting numeric values are stored in the variable prices_usd.
  • Finally, the code prints the prices_usd variable to display the converted numeric values.

Output:

text
0     2.39
1     3.50
2      NaN
3    10.25
4      NaN
dtype: float64

Fill missing values:

python
prices_usd = prices_usd.fillna(prices_usd.mean())

Explanation

  • The fillna() method replaces the NaN (missing) values in the prices_usd Series.
  • The argument prices_usd.mean() is the mean of the non-missing prices; mean() skips NaN by default.
  • The missing entries therefore receive the average of the valid prices, here about 5.38.
  • Handling missing data before analysis or modeling is a common preprocessing step.

Convert to rupees:

python
prices_inr = prices_usd * 83

print(prices_inr)

Explanation

  • The code multiplies a variable prices_usd by 83, which represents the exchange rate from USD to INR.
  • The result is stored in the variable prices_inr, which contains the equivalent prices in Indian Rupees.
  • The print function outputs the converted prices to the console for the user to see.
  • This snippet assumes that prices_usd is already defined and contains numeric values.

In production, use a real exchange rate source. In practice exercises, a fixed rate is fine.

43. Mini Project: Analyze Daily Subscribers

Suppose you track daily subscribers gained:

python
subscribers = pd.Series(
    [120, 135, 150, 90, 210, 240, 180, 160, 260, 300],
    name="subscribers_gained",
)

Explanation

  • Creates a Pandas Series stored in the variable subscribers, with its name attribute set to "subscribers_gained".
  • The values are integers giving the number of subscribers gained on each of ten days.
  • No index is passed, so the days are labeled 0 through 9 by default.
  • The Series can be used for further analysis and visualization.

Find:

  • total subscribers gained
  • average daily gain
  • best day
  • number of days above 200
  • capped values between 100 and 250

Solution:

python
total = subscribers.sum()
average = subscribers.mean()
best_day = subscribers.idxmax()
days_above_200 = (subscribers > 200).sum()
capped = subscribers.clip(lower=100, upper=250)

print("Total:", total)
print("Average:", average)
print("Best day index:", best_day)
print("Days above 200:", days_above_200)
print(capped)

Explanation

  • Computes the total number of subscribers using the sum() method.
  • Calculates the average number of subscribers with the mean() function.
  • Identifies the index of the day with the highest subscriber count using idxmax().
  • Counts how many days had more than 200 subscribers by summing a boolean condition.
  • Clips the subscriber values to a range between 100 and 250 using the clip() method, ensuring no values fall outside this range.
  • Outputs the total, average, best day index, count of days above 200, and the capped subscriber values.

This project uses:

  • aggregation
  • boolean conditions
  • idxmax
  • clip
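
With the default index, idxmax() returns the position label 9. If the Series carries dates as its index, the same call returns the date itself. The sketch below attaches a daily date index; the start date 2026-01-01 is an assumption made purely for illustration.

```python
import pandas as pd

subscribers = pd.Series(
    [120, 135, 150, 90, 210, 240, 180, 160, 260, 300],
    name="subscribers_gained",
)

# Assumed start date, for illustration only
subscribers.index = pd.date_range("2026-01-01", periods=len(subscribers), freq="D")

# idxmax() now returns a Timestamp instead of an integer position
best_day = subscribers.idxmax()

print(best_day.date(), subscribers.max())
print(subscribers[subscribers > 200])
```

Filtering with subscribers > 200 then lists the strong days by date, which is usually easier to read in a report.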

44. Mini Project: Clean Product Prices

python
raw_prices = pd.Series(
    ["$2.39", "$3.39", "$5.99", None, "$12.50", "unknown"],
    index=[
        "chips",
        "juice",
        "sandwich",
        "salad",
        "bowl",
        "soup",
    ],
    name="price_usd",
)

Explanation

  • Initializes a Pandas Series named raw_prices containing price data as strings, including valid prices, a None value, and an invalid entry ("unknown").
  • The index of the Series is explicitly defined with food item names: "chips", "juice", "sandwich", "salad", "bowl", and "soup".
  • The name attribute of the Series is set to "price_usd", indicating the context of the data as prices in USD.
  • This structure allows for easy manipulation and analysis of price data, despite the presence of non-numeric values.
  • The use of None and a string like "unknown" demonstrates how to handle missing or invalid data in a dataset.

Clean and analyze:

python
price_text = raw_prices.str.replace("$", "", regex=False)
prices = pd.to_numeric(price_text, errors="coerce")

prices = prices.fillna(prices.mean())
prices_inr = prices * 83

print(prices_inr)
print("Mean INR:", prices_inr.mean())
print("30th percentile:", prices_inr.quantile(0.30))
print("60th percentile:", prices_inr.quantile(0.60))
print("Between 300 and 800:")
print(prices_inr[prices_inr.between(300, 800)])

Explanation

  • The code first removes the dollar sign from a series of raw price strings using str.replace.
  • It converts the cleaned price strings into numeric values, coercing any errors to NaN.
  • Missing values are filled with the mean of the prices to ensure no gaps in the data.
  • The prices are then converted to Indian Rupees (INR) by multiplying by a conversion rate of 83.
  • Finally, it prints the converted prices, their mean, specific percentiles, and filters prices that fall between 300 and 800 INR.

This kind of cleaning appears often in data analyst tasks.
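
Once missing prices are filled, they look like real data. One possible safeguard is to record which entries were imputed before filling, as in this sketch. The names was_imputed and filled are illustrative.

```python
import pandas as pd

raw_prices = pd.Series(
    ["$2.39", "$3.39", "$5.99", None, "$12.50", "unknown"],
    index=["chips", "juice", "sandwich", "salad", "bowl", "soup"],
    name="price_usd",
)

prices = pd.to_numeric(raw_prices.str.replace("$", "", regex=False), errors="coerce")

# Remember which items had no usable price before filling
was_imputed = prices.isna()
filled = prices.fillna(prices.mean())

print(was_imputed[was_imputed].index.tolist())
print(filled.round(2))
```

Keeping the mask lets later steps exclude or flag estimated prices instead of treating them as observed values.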

45. Practice Exercises

Try these before reading the solutions.

Practice Lab

Exercise 1: Empty Series

Create an empty Series with dtype float.

Practice Lab

Exercise 2: Series Arithmetic

Create two Series:

python
first = pd.Series([2, 4, 6, 8, 10])
second = pd.Series([1, 3, 5, 7, 10])

Explanation

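  • Initializes the first Series, named first, containing the even numbers from 2 to 10.
  • Initializes the second Series, named second, containing 1, 3, 5, 7, and 10. The final value matches the final value of first, which matters for the equality comparison in Exercise 3.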
  • Both Series are created using the Pandas library, which is commonly used for data manipulation in Python.
  • These Series can be used for various operations such as mathematical computations, comparisons, or visualizations.

Print addition, subtraction, multiplication, and division.

Practice Lab

Exercise 3: Series Comparison

Using the same two Series, compare:

  • greater than
  • less than
  • equal to

Practice Lab

Exercise 4: Convert Mixed Data To Numeric

Create:

python
mixed = pd.Series([1, 2, "Python", 2.0, True, 100])

Explanation

  • The code creates a Pandas Series named mixed that holds integers, a string, a float, and a boolean.
  • Because the values do not share one type, pandas stores them with the general-purpose object dtype.
  • Numeric operations on an object Series can be slow or fail outright, which is why converting to a numeric dtype is a common cleaning step.

Convert it to numeric values, turning invalid values into missing values.

Practice Lab

Exercise 5: Top Values

Create a Series of player scores and print the top 5 values.

Practice Lab

Exercise 6: Count Above Mean

Create a numeric Series and count how many values are greater than the mean.

Practice Lab

Exercise 7: Missing Values

Create a Series with three missing values. Count missing values, drop them, and fill them with the median.

Practice Lab

Exercise 8: Price Cleaning

Create a Series of price strings such as "$10.50", "$20.00", and "missing". Remove $, convert to numeric, and fill missing values with the mean.

Practice Lab

Exercise 9: Category Counts

Create a Series of course categories and show the top 3 most common categories.

Practice Lab

Exercise 10: Range Filter

Create a Series of product prices and return prices between 100 and 500.

46. Practice Solutions

Solution Key

Solution 1: Empty Series

python
empty = pd.Series(dtype="float64")

print(empty)

Explanation

  • Creates an empty Pandas Series with dtype float64.
  • Passing dtype explicitly makes the intended type clear; without it, the dtype of an empty Series depends on the pandas version.
  • The print function outputs the Series, which shows no values, only the dtype.
  • This is useful as a placeholder before data is added in later steps.

Solution Key

Solution 2: Series Arithmetic

python
first = pd.Series([2, 4, 6, 8, 10])
second = pd.Series([1, 3, 5, 7, 10])

print(first + second)
print(first - second)
print(first * second)
print(first / second)

Explanation

  • The code creates two Pandas Series, first and second, containing integer values.
  • It performs element-wise addition, subtraction, multiplication, and division between the two Series.
  • The results of these operations are printed to the console, showing the output for each arithmetic operation.
  • This demonstrates how Pandas handles vectorized operations, allowing for efficient calculations on Series data.
  • The operations align based on the index of the Series, ensuring that corresponding elements are processed together.

Solution Key

Solution 3: Series Comparison

python
first = pd.Series([2, 4, 6, 8, 10])
second = pd.Series([1, 3, 5, 7, 10])

print(first > second)
print(first < second)
print(first == second)

Explanation

  • Creates two pandas Series, first and second, containing integer values.
  • Performs element-wise comparison between the two Series using greater than (>), less than (<), and equality (==) operators.
  • Outputs three boolean Series indicating the result of each comparison for corresponding elements in first and second.
  • Useful for data analysis tasks where relational comparisons between datasets are needed.

Solution Key

Solution 4: Convert Mixed Data To Numeric

python
mixed = pd.Series([1, 2, "Python", 2.0, True, 100])

converted = pd.to_numeric(mixed, errors="coerce")

print(converted)

Explanation

  • The code creates a Pandas Series named mixed containing various data types, including integers, strings, floats, and booleans.
  • The pd.to_numeric() function converts the elements of the mixed Series to numeric values. With errors="coerce", the string "Python" cannot be converted and becomes NaN, while the boolean True is converted to 1.0.
  • The result of the conversion is stored in the converted variable, which will contain numeric representations of the original values where possible.
  • Finally, the print() function outputs the converted Series, displaying the numeric values along with any NaN entries for the non-numeric data.

Solution Key

Solution 5: Top Values

python
scores = pd.Series([420, 180, 550, 610, 320, 720, 150])

top_5 = scores.sort_values(ascending=False).head(5)

print(top_5)

Explanation

  • A pandas Series named scores is created containing a list of numerical values.
  • The sort_values method is used to sort the scores in descending order.
  • The head(5) method extracts the top five scores from the sorted Series.
  • Finally, the top five scores are printed to the console.
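
An alternative worth knowing is nlargest(), which returns the top values directly without sorting the whole Series first. A short sketch on the same scores, with nsmallest() shown for the opposite end:

```python
import pandas as pd

scores = pd.Series([420, 180, 550, 610, 320, 720, 150])

# Five highest values, largest first
top_5 = scores.nlargest(5)

# Two lowest values, smallest first
bottom_2 = scores.nsmallest(2)

print(top_5)
print(bottom_2)
```

Both methods keep the original index labels, so you can still see where each value came from.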

Solution Key

Solution 6: Count Above Mean

python
values = pd.Series([10, 20, 30, 40, 50])

above_mean_count = (values > values.mean()).sum()

print(above_mean_count)

Explanation

  • A Pandas Series is created with five integer values: 10, 20, 30, 40, and 50.
  • The mean of the Series is calculated using values.mean().
  • A boolean condition checks which elements are greater than the mean, resulting in a Series of True/False values.
  • The sum() function counts the number of True values, indicating how many elements are above the mean.
  • Finally, the count of elements above the mean is printed to the console.

Solution Key

Solution 7: Missing Values

python
values = pd.Series([10, np.nan, 30, np.nan, 50, np.nan])

print(values.isna().sum())
print(values.dropna())
print(values.fillna(values.median()))

Explanation

  • A Pandas Series is created with some numeric values and NaN (Not a Number) entries to represent missing data.
  • The isna().sum() method counts and prints the total number of missing values in the Series.
  • The dropna() method removes all entries with NaN values and prints the cleaned Series.
  • The fillna() method replaces NaN values with the median of the Series, providing a way to impute missing data.

Solution Key

Solution 8: Price Cleaning

python
prices = pd.Series(["$10.50", "$20.00", "missing", "$15.75"])

clean_text = prices.str.replace("$", "", regex=False)
numeric_prices = pd.to_numeric(clean_text, errors="coerce")
filled_prices = numeric_prices.fillna(numeric_prices.mean())

print(filled_prices)

Explanation

  • The code creates a pandas Series of price strings, one of which is the placeholder text "missing" instead of a price.
  • It uses the str.replace method to remove the dollar sign from each price string, resulting in a clean text representation.
  • The pd.to_numeric function converts the cleaned strings into numeric values, with the errors="coerce" argument turning any non-convertible entries into NaN.
  • The fillna method replaces NaN values with the mean of the valid numeric prices, ensuring no missing data remains.
  • Finally, the cleaned and filled prices are printed to the console.

Solution 9: Category Counts

python
categories = pd.Series([
    "python",
    "pandas",
    "python",
    "sql",
    "pandas",
    "python",
    "excel",
])

print(categories.value_counts().head(3))

Explanation

  • A pandas Series named categories is created containing various programming-related strings.
  • The value_counts() method is called on the Series to count the occurrences of each unique category.
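  • The head(3) method keeps the first three rows of that result, which are the three most frequent categories. Here sql and excel each appear once, so they tie for third place and only one of them is shown.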
  • Finally, the result is printed, showing the most common categories in descending order.

Solution 10: Range Filter

python
prices = pd.Series([50, 120, 250, 600, 499, 80])

selected = prices[prices.between(100, 500)]

print(selected)

Explanation

  • A Pandas Series named prices is created containing a list of numerical values representing prices.
  • The between method is used to filter the Series, selecting only the prices that fall between 100 and 500, inclusive.
  • The filtered results are stored in the variable selected.
  • Finally, the print function outputs the filtered prices to the console.
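
For comparison, the sketch below writes the same filter with explicit comparison operators. It assumes the default behavior of between(), where both boundaries are included; with_between and with_operators are illustrative names.

```python
import pandas as pd

prices = pd.Series([50, 120, 250, 600, 499, 80])

# between() with its default settings includes both boundaries.
with_between = prices[prices.between(100, 500)]

# The same condition spelled out with comparison operators.
with_operators = prices[(prices >= 100) & (prices <= 500)]

print(with_operators.tolist())              # [120, 250, 499]
print(with_between.equals(with_operators))  # True
```

The parentheses around each comparison are required, because & binds more tightly than >= and <= in Python.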

47. Quick Interview Questions

1. What is a Pandas Series?

A one-dimensional labeled array.

2. What is the difference between size and count()?

size counts all entries, including missing values. count() counts non-missing values.
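
A minimal sketch of the difference, using a made-up ratings Series with two missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical data: five entries, two of them missing.
ratings = pd.Series([4.5, np.nan, 3.0, np.nan, 5.0])

print(ratings.size)     # 5, every entry is counted
print(ratings.count())  # 3, only the non-missing entries
```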

3. What does value_counts() do?

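It counts how many times each unique value occurs in a Series. By default the result is sorted from most to least frequent and missing values are left out.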
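
A small sketch of the distinction, reusing the category data from Solution 9: value_counts() reports how often each value occurs, while nunique() reports how many distinct values there are.

```python
import pandas as pd

categories = pd.Series([
    "python", "pandas", "python", "sql", "pandas", "python", "excel",
])

# Occurrences per distinct value, most frequent first.
counts = categories.value_counts()
print(counts)

# Number of distinct values.
print(categories.nunique())  # 4
```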

4. What is the difference between loc and iloc?

loc selects by label. iloc selects by integer position.
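
A short sketch with a made-up stock Series shows the difference, including one detail that often surprises beginners: a loc slice includes its end label, while an iloc slice excludes its end position.

```python
import pandas as pd

# Hypothetical data with string labels.
stock = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

print(stock.loc["b"])   # 20, selected by label
print(stock.iloc[1])    # 20, selected by position

# Label slices include the end label.
print(stock.loc["a":"c"].tolist())   # [10, 20, 30]

# Position slices exclude the end position, like normal Python slicing.
print(stock.iloc[0:2].tolist())      # [10, 20]
```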

5. Why use pd.to_numeric()?

To convert messy values to numbers with options like errors="coerce".

6. What does dropna() do?

It removes missing values.

7. What does fillna() do?

It replaces missing values with a chosen value.

8. What does isin() do?

It checks whether values are present in a given list-like collection.

9. When should you use .copy()?

When you want to modify a subset independently from the original object.

10. Why can Series arithmetic produce missing values?

Because Series align by index labels. If a label is missing from one side, the result becomes missing for that label.
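
The sketch below illustrates this with two made-up Series, jan and feb, that share only one label. It also shows one possible way to avoid the missing results, the add() method with fill_value, which treats a label missing from one side as 0.

```python
import pandas as pd

# Hypothetical data: only "south" appears in both Series.
jan = pd.Series([100, 200], index=["north", "south"])
feb = pd.Series([50, 70], index=["south", "east"])

# Labels found on only one side produce NaN.
total = jan + feb
print(total)

# Treat a label missing from one side as 0 instead.
total_filled = jan.add(feb, fill_value=0)
print(total_filled)
```

Only the south label has a value in the first result (250.0), while the second result has a value for every label.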

48. Common Beginner Mistakes

Mistake 1: Confusing label and position

Use loc for labels and iloc for positions.

Mistake 2: Thinking size ignores missing values

size includes missing values. Use count() for non-missing values.

Mistake 3: Forgetting index alignment

Series arithmetic aligns values by index label, not by row position. Two Series with the same length but different labels will not be combined row by row.

Mistake 4: Using astype() on messy strings

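astype() raises an error as soon as it meets a value it cannot convert. If values are messy, use pd.to_numeric(..., errors="coerce"), which turns unconvertible entries into NaN instead.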
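
A minimal sketch of both behaviors, using a made-up raw Series that contains one unconvertible entry:

```python
import pandas as pd

# Hypothetical messy input: "n/a" cannot be parsed as a number.
raw = pd.Series(["12", "7.5", "n/a"])

# astype(float) fails on the whole Series because of one bad entry.
astype_failed = False
try:
    raw.astype(float)
except ValueError as exc:
    astype_failed = True
    print("astype failed:", exc)

# to_numeric with errors="coerce" converts what it can and marks the rest as NaN.
converted = pd.to_numeric(raw, errors="coerce")
print(converted)
```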

Mistake 5: Modifying a subset without copying

Use .copy() when you intentionally want an independent object.
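
As a hedged illustration with made-up data, taking an explicit copy of a subset guarantees that later edits stay local to the copy:

```python
import pandas as pd

# Hypothetical data.
prices = pd.Series([100, 200, 300, 400])

# Take an explicit, independent copy of the first two entries.
discounted = prices.iloc[:2].copy()
discounted.iloc[0] = 90

print(discounted.tolist())  # [90, 200]
print(prices.tolist())      # [100, 200, 300, 400], the original is unchanged
```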

Final Takeaway

A Pandas Series is simple at first glance: one column of values with labels.

But it becomes powerful because it supports:

  • labeled indexing
  • automatic alignment
  • missing-data handling
  • statistical summaries
  • boolean filtering
  • value counts
  • sorting
  • type conversion
  • string cleaning
  • plotting
  • element-wise transformation

If you are new to Pandas, master Series before moving deeply into DataFrames. A DataFrame is essentially a set of Series, one per column, that share a single index.

Sources and Further Reading