# Mastering Pandas Series: Creation, Indexing, and Analysis

URL: https://madhudadi.in/blog/posts/pandas-series-create-index-and-analyze-data-efficiently
Published: 2026-05-30
Tags: python, Pandas
Read time: 28 min
Difficulty: beginner

> Learn Pandas Series from scratch with original examples: create Series from lists and dictionaries, inspect attributes, read CSV columns, use loc and iloc, sort values, count categories, handle missing data, filter with conditions, transform values, and solve practice tasks.

# Pandas Series: Create, Index, Clean, Analyze, and Practice

Pandas is one of the most important Python libraries for data analysis. If NumPy gives you fast arrays, Pandas gives you labeled data structures that feel closer to real-world tables.

The first Pandas object you should understand is the `Series`. A Pandas Series is a one-dimensional labeled array. You can think of it as a single column of data with row labels.

Examples of Series-style data:

- daily website visitors
- monthly revenue
- marks scored by students
- product prices
- customer ratings
- city temperatures
- movie genres
- app downloads by date

In this lesson, you will learn how to create, inspect, select, clean, analyze, and transform Pandas Series.
## What You Will Learn

By the end, you should be able to:

- explain what a Pandas Series is
- create Series from lists, dictionaries, and scalar values
- use custom indexes and names
- inspect `size`, `dtype`, `name`, `index`, `values`, and `is_unique`
- read a CSV column as a Series
- use `head`, `tail`, `sample`, `value_counts`, `sort_values`, and `sort_index`
- calculate `sum`, `mean`, `median`, `mode`, `std`, `var`, `min`, `max`, and `describe`
- select values using labels and positions
- understand `loc` and `iloc`
- edit values safely
- use boolean indexing
- handle missing data with `isna`, `dropna`, and `fillna`
- convert values using `astype` and `pd.to_numeric`
- use `between`, `clip`, `duplicated`, `drop_duplicates`, `isin`, `map`, and `apply`
- solve beginner Pandas Series practice problems

## 1. Installing And Importing Pandas

Install Pandas if needed:

```bash
pip install pandas
```

Import it with the standard alias:

```python
import pandas as pd
import numpy as np
```

Most Pandas code uses `pd` as the alias. NumPy is imported as `np` because later examples use `np.nan` for missing values.

## 2. What Is A Pandas Series?

A Series is a one-dimensional labeled array.

Create a simple Series:

```python
import pandas as pd

visitors = pd.Series([120, 135, 150, 160])
print(visitors)
```

Output:

```text
0    120
1    135
2    150
3    160
dtype: int64
```

The left side is the index. The right side is the value. By default, Pandas creates a numeric index starting from 0.

## 3. Series vs Python List

A Python list stores values by position.

```python
visitors_list = [120, 135, 150, 160]
```

A Series stores values with labels.

```python
visitors = pd.Series([120, 135, 150, 160])
```

Why does this matter? Because real data often needs labels.

```python
visitors = pd.Series(
    [120, 135, 150, 160],
    index=["Mon", "Tue", "Wed", "Thu"],
)
print(visitors)
```

Output:

```text
Mon    120
Tue    135
Wed    150
Thu    160
dtype: int64
```

Now each value has a meaningful row label.

## 4. Creating A Series From A List

```python
topics = pd.Series(["Python", "NumPy", "Pandas", "SQL"])
print(topics)
```

Output:

```text
0    Python
1     NumPy
2    Pandas
3       SQL
dtype: object
```

Many Pandas versions store strings with the `object` dtype by default; newer releases may display a dedicated string dtype instead.

Create a numeric Series:

```python
scores = pd.Series([82, 91, 76, 88])
print(scores)
```

Output:

```text
0    82
1    91
2    76
3    88
dtype: int64
```

## 5. Creating A Series With Custom Index

```python
scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)
print(scores)
```

Output:

```text
Asha     82
Ravi     91
Meera    76
Kabir    88
dtype: int64
```

Now the index contains student names. You can select by label:

```python
print(scores["Ravi"])
```

Output:

```text
91
```

## 6. Naming A Series

A Series can have a name.

```python
scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
    name="math_score",
)
print(scores)
```

Output:

```text
Asha     82
Ravi     91
Meera    76
Kabir    88
Name: math_score, dtype: int64
```

The name is useful when a Series becomes a column in a DataFrame.

## 7. Creating A Series From A Dictionary

When you create a Series from a dictionary, keys become the index and values become the data.

```python
course_minutes = {
    "Python Basics": 45,
    "NumPy Arrays": 55,
    "Pandas Series": 60,
    "SQL Joins": 50,
}
minutes = pd.Series(course_minutes, name="duration_minutes")
print(minutes)
```

Output:

```text
Python Basics    45
NumPy Arrays     55
Pandas Series    60
SQL Joins        50
Name: duration_minutes, dtype: int64
```

This is one of the cleanest ways to create a labeled Series.

## 8. Creating A Series From A Scalar

You can repeat one value across an index.

```python
status = pd.Series("draft", index=["post_1", "post_2", "post_3"])
print(status)
```

Output:

```text
post_1    draft
post_2    draft
post_3    draft
dtype: object
```

This is useful when creating default values.

## 9. Important Series Attributes

Create a Series:

```python
ratings = pd.Series(
    [4.5, 4.8, 3.9, 4.8, np.nan],
    index=["course_a", "course_b", "course_c", "course_d", "course_e"],
    name="rating",
)
```

### size

`size` returns the total number of values, including missing values.

```python
print(ratings.size)
```

**Explanation**

- `ratings` is the Pandas Series created above.
- The `size` attribute returns the total number of elements in the Series, counting the `NaN` entry as well.
- This tells you how many rows the Series holds, regardless of whether each row has a usable value.

Output:

```text
5
```

### count

`count()` returns non-missing values.

```python
print(ratings.count())
```

**Explanation**

- `count()` is a Series method, not the list method of the same name.
- It returns the number of non-missing elements, so the `NaN` for `course_e` is excluded.
- The result is an integer. Comparing it with `size` shows how many values are missing (here, `5 - 4 = 1`).

Output:

```text
4
```

This distinction is important in interviews.

### dtype

```python
print(ratings.dtype)
```

**Explanation**

- `ratings.dtype` returns the data type of the values stored in the Series.
- The result is `float64` because the values are decimals and `NaN` is itself a floating-point value.
- Knowing the dtype helps with debugging and with choosing operations that are valid for the data.

Output:

```text
float64
```

### name

```python
print(ratings.name)
```

**Explanation**

- The code prints the `name` attribute of the `ratings` Series.
- The value is the string passed as `name="rating"` when the Series was created.
- If no name had been given, the attribute would be `None`.
- The name becomes the column label when the Series is placed in a DataFrame.

Output:

```text
rating
```

### index

```python
print(ratings.index)
```

**Explanation**

- The code reads the `index` attribute of the `ratings` Series.
- It returns an `Index` object holding the row labels, here `course_a` through `course_e`.
- Inspecting the index is a quick way to check how the data is labeled.

The index stores row labels.

### values

```python
print(ratings.values)
```

**Explanation**

- `ratings.values` returns the data without the index labels.
- For this numeric Series the result is a NumPy array of floats, including the `nan` entry.
- This is handy when another library expects a plain array rather than a Series.

This gives the underlying values as an array-like object.

### is_unique

```python
print(ratings.is_unique)
```

**Explanation**

- The code reads the `is_unique` attribute of the `ratings` Series.
- It returns a boolean indicating whether all values in the Series are distinct. It checks the values, not the index labels.
- `True` means no value repeats; `False` means at least one value appears more than once.

Output:

```text
False
```

It is false because `4.8` appears more than once.

## 10. Reading A CSV Column As A Series

Assume you have a CSV file called `daily_visitors.csv`:

```text
date,visitors
2026-01-01,120
2026-01-02,135
2026-01-03,150
```

Read the file:

```python
df = pd.read_csv("daily_visitors.csv")
print(df)
```

**Explanation**

- `pd.read_csv` loads the CSV file `daily_visitors.csv` into a Pandas DataFrame called `df`.
- `print(df)` displays the whole DataFrame so you can check its columns and rows.

Select one column as a Series:

```python
visitors = df["visitors"]
print(type(visitors))
print(visitors)
```

**Explanation**

- `df["visitors"]` selects the `visitors` column and assigns it to `visitors`.
- `type(visitors)` confirms that a single selected column is a `pandas.Series`.
- `print(visitors)` shows the column values with the default integer index.

If you want the date column as the index:

```python
df = pd.read_csv("daily_visitors.csv", index_col="date")
visitors = df["visitors"]
print(visitors)
```

**Explanation**

- `index_col="date"` tells `read_csv` to use the `date` column as the row index instead of the default integers.
- The `visitors` column is then selected as a Series whose index holds the dates.
- `print(visitors)` shows each visitor count next to its date label.

CSV reading usually returns a DataFrame. A single selected column is a Series.

## 11. `head()` And `tail()`

Use `head()` to preview the first rows.

```python
sales = pd.Series([120, 135, 150, 160, 155, 170, 180])
print(sales.head())
print(sales.head(3))
```

**Explanation**

- A Pandas Series named `sales` is created from a list of sales figures.
- `sales.head()` returns the first five entries of the Series by default.
- `sales.head(3)` returns only the first three entries.
- This is useful for quickly inspecting the values at the start of a Series.

Use `tail()` to preview the last rows.

```python
print(sales.tail())
print(sales.tail(2))
```

**Explanation**

- `sales.tail()` returns the last five entries of the `sales` Series.
- `sales.tail(2)` returns only the last two entries.
- This is useful for checking the end of a dataset, for example the most recent records.

These are useful for checking data quickly.

## 12. `sample()`

`sample()` returns randomly selected entries.

```python
products = pd.Series(
    ["notebook", "pen", "marker", "bag", "bottle", "eraser"],
    name="product",
)
print(products.sample(3, random_state=42))
```

**Explanation**

- A Pandas Series named `products` is created from a list of product names.
- `sample(3, ...)` randomly selects 3 entries from the Series, keeping their original index labels.
- `random_state=42` fixes the random seed, so the same entries are selected on every run.

`random_state` makes the sample reproducible.

## 13. `value_counts()`

`value_counts()` counts unique values.

```python
categories = pd.Series(
    ["free", "pro", "free", "team", "pro", "free"],
    name="plan",
)
print(categories.value_counts())
```

**Explanation**

- A Pandas Series named `categories` is created containing subscription plan types.
- `value_counts()` counts how many times each unique value occurs.
- The result is a new Series, sorted by frequency in descending order.
- This is a quick way to see the distribution of a categorical column.

Output:

```text
plan
free    3
pro     2
team    1
Name: count, dtype: int64
```

Use this for categorical summaries.

To include missing values:

```python
print(categories.value_counts(dropna=False))
```

**Explanation**

- `value_counts()` drops missing values by default.
- `dropna=False` adds a row for `NaN` whenever the Series contains missing values.
- `categories` has no missing values here, so the output is the same as before. On real data, the extra row shows how much is missing.

## 14. Sorting Values

```python
scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)
print(scores.sort_values())
```

**Explanation**

- A Pandas Series named `scores` is created with four integer scores and student names as the index.
- `sort_values()` returns a new Series sorted by value in ascending order.
- Each label stays attached to its value, so the names move together with the scores.

Output:

```text
Meera    76
Asha     82
Kabir    88
Ravi     91
dtype: int64
```

Descending order:

```python
print(scores.sort_values(ascending=False))
```

**Explanation**

- `sort_values` is called on the `scores` Series.
- `ascending=False` sorts in descending order, so higher scores appear first.
- The sorted result is printed; `scores` itself is left unchanged.
- This is a quick way to find the highest scores in a dataset.

Get the top scorer:

```python
top_student = scores.sort_values(ascending=False).head(1)
print(top_student)
```

**Explanation**

- `scores` is the Pandas Series of student scores defined above.
- `sort_values(ascending=False)` sorts the scores in descending order, placing the highest score first.
- `head(1)` keeps only the first entry. The result is still a Series with one row, so both the name and the score are shown.

## 15. Sorting Index

```python
scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)
print(scores.sort_index())
```

**Explanation**

- A Pandas Series named `scores` is created with student names as the index.
- `sort_index()` returns a new Series ordered by index label, here alphabetically by name.
- The values follow their labels, so each score stays with the right student.

Output:

```text
Asha     82
Kabir    88
Meera    76
Ravi     91
dtype: int64
```

Use `sort_index()` when label order matters.

## 16. `inplace=True`: Should You Use It?

Many Pandas methods can return a new object.

```python
sorted_scores = scores.sort_values()
```

**Explanation**

- `sort_values()` is called on the `scores` Series and sorts in ascending order by default.
- The sorted result is a new Series stored in `sorted_scores`.
- The original `scores` Series is not modified.

Some methods also support `inplace=True`.

```python
scores.sort_values(inplace=True)
```

**Explanation**

- `sort_values()` is called on the `scores` Series.
- `inplace=True` modifies the original Series directly and returns `None` instead of a new sorted Series.
- After this call, `scores` itself holds the values in ascending order.
- Assigning the result to a variable would be a mistake, because that variable would be `None`.

For learning and production code, returning a new object is often clearer.

Why?

- it avoids accidental mutation
- it makes code easier to debug
- it works well with method chaining

Prefer:

```python
scores = scores.sort_values()
```

**Explanation**

- `sort_values()` returns a new Series sorted in ascending order.
- Assigning the result back to `scores` rebinds the name to the new sorted Series. Nothing is mutated in place.
- The end result looks the same as with `inplace=True`, but the data flow is explicit and the call can be chained with other methods.

## 17. Mathematical Methods

Create a Series:

```python
orders = pd.Series([12, 18, 10, 25, 17, np.nan], name="orders")
```

**Explanation**

- Creates a Pandas Series named `orders` holding order quantities.
- The list includes `np.nan` to represent one missing value.
- Because of the `NaN`, the Series is stored as `float64` even though the other values are whole numbers.
- The `name` parameter labels the Series, which is shown in printed output.

### sum

```python
print(orders.sum())
```

**Explanation**

- `orders` is the Pandas Series defined above.
- `sum()` adds up all non-missing values: `12 + 18 + 10 + 25 + 17 = 82`.
- The result is `82.0` rather than `82` because the Series has a float dtype.

Output:

```text
82.0
```

By default, missing values are skipped.
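Skipping missing values is only the default. The sketch below recreates the `orders` Series and shows two standard `Series.sum` parameters, `skipna` and `min_count`, that change this behavior. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

orders = pd.Series([12, 18, 10, 25, 17, np.nan], name="orders")

# Default behavior: the NaN entry is ignored.
total_default = orders.sum()

# skipna=False lets the missing value propagate, so the result is NaN.
total_strict = orders.sum(skipna=False)

# A Series with only missing values sums to 0.0 by default...
all_missing = pd.Series([np.nan, np.nan])
empty_total = all_missing.sum()

# ...unless min_count requires at least one valid value.
empty_total_strict = all_missing.sum(min_count=1)

print(total_default)       # 82.0
print(total_strict)        # nan
print(empty_total)         # 0.0
print(empty_total_strict)  # nan
```

Using `skipna=False` can be a reasonable choice when a missing value should make the whole total untrustworthy instead of being silently dropped.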
### mean

```python
print(orders.mean())
```

**Explanation**

- `mean()` computes the average of the non-missing values in the `orders` Series.
- Here that is `82 / 5 = 16.4`; the `NaN` entry is not counted in the denominator.
- The mean summarizes the central tendency of numerical data.

### median

```python
print(orders.median())
```

**Explanation**

- `orders` is the Pandas Series defined above.
- `median()` returns the middle value of the sorted non-missing data.
- The five valid values sorted are `10, 12, 17, 18, 25`, so the median is `17.0`.
- With an even number of values, the median is the average of the two middle values.

### mode

```python
print(orders.mode())
```

**Explanation**

- `mode()` finds the value or values that appear most often in the `orders` Series.
- The result is always a Series, because several values can tie for most frequent.
- Every valid value in `orders` appears exactly once, so all five values are returned, in sorted order.
- Mode is most informative for categorical data or data with repeated values.

`mode()` can return more than one value.

### standard deviation and variance

```python
print(orders.std())
print(orders.var())
```

**Explanation**

- `orders.std()` returns the standard deviation, which measures how spread out the values are around the mean.
- `orders.var()` returns the variance, which is the square of the standard deviation.
- Both use the sample formula by default (`ddof=1`, dividing by `n - 1`) and skip missing values.
- `orders` is the Pandas Series of numerical data defined above.
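The choice of divisor matters when comparing Pandas results with NumPy. The following sketch, which reuses the `orders` data, contrasts the Pandas default (`ddof=1`) with the population formula (`ddof=0`) that `np.nanvar` uses by default. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

orders = pd.Series([12, 18, 10, 25, 17, np.nan], name="orders")

# Pandas default: sample variance, dividing by n - 1 (ddof=1).
sample_var = orders.var()

# Population variance, dividing by n (ddof=0).
population_var = orders.var(ddof=0)

# NumPy's NaN-aware variance defaults to ddof=0, so it matches the line above.
numpy_var = np.nanvar(orders.to_numpy())

print(sample_var)
print(population_var)
print(numpy_var)
```

If two tools report slightly different spreads for the same column, a differing `ddof` default is a common cause worth checking first.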
### min and max

```python
print(orders.min())
print(orders.max())
```

**Explanation**

- `orders.min()` returns the smallest non-missing value in the `orders` Series.
- `orders.max()` returns the largest non-missing value.
- Together they give the range of the data at a glance.

## 18. `describe()`

`describe()` gives a quick statistical summary.

```python
print(orders.describe())
```

**Explanation**

- `orders` is the numeric Pandas Series defined above.
- `describe()` returns a Series of descriptive statistics: count, mean, standard deviation, min, quartiles, and max.
- Missing values are excluded, which is why `count` is 5 and not 6.
- It is a convenient first step when exploring a new column.

Possible output:

```text
count     5.000000
mean     16.400000
std       5.856620
min      10.000000
25%      12.000000
50%      17.000000
75%      18.000000
max      25.000000
Name: orders, dtype: float64
```

For text data:

```python
plans = pd.Series(["free", "pro", "free", "team", "free"])
print(plans.describe())
```

**Explanation**

- The code creates a Pandas Series named `plans` containing subscription types.
- For non-numeric data, `describe()` returns `count`, `unique`, `top`, and `freq` instead of numeric statistics.
- Here `top` is `free` and `freq` is 3, because `free` is the most common plan and appears three times.

It reports count, unique values, top value, and frequency.

## 19. Selecting Values By Position With `iloc`

Use `iloc` for integer-position selection.

```python
scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)
print(scores.iloc[0])
print(scores.iloc[1:3])
print(scores.iloc[[0, 3]])
```

**Explanation**

- A Pandas Series named `scores` is created with integer values and student names as the index.
- `iloc[0]` returns the value at the first position, which is the score for "Asha".
- `iloc[1:3]` returns positions 1 and 2 ("Ravi" and "Meera"); the stop position 3 is excluded.
- `iloc[[0, 3]]` takes a list of positions and returns the first and last entries ("Asha" and "Kabir").

`iloc` ignores labels and uses positions.

## 20. Selecting Values By Label With `loc`

Use `loc` for label-based selection.

```python
print(scores.loc["Ravi"])
print(scores.loc["Asha":"Meera"])
```

**Explanation**

- The first line returns the single value stored under the label "Ravi" in the `scores` Series.
- The second line returns every entry from "Asha" through "Meera", inclusive, in index order.
- `loc` selects by label, so it works with whatever labels the index contains.
- The code uses the `scores` Series defined in the previous section.

Important:

Label slicing with `loc` includes the stop label when it exists. Position slicing with `iloc` excludes the stop position.

## 21. Why Avoid Ambiguous Integer Indexing?

Consider this Series:

```python
numbers = pd.Series([100, 200, 300], index=[10, 20, 30])
```

**Explanation**

- Creates a Pandas Series named `numbers` containing the values 100, 200, and 300.
- Assigns the integer labels 10, 20, and 30 to those values.
- The labels are integers, but they do not match the positions 0, 1, and 2.
- This is the situation where plain square-bracket access becomes hard to read.

This can be confusing:

```python
numbers[10]
```

**Explanation**

- `numbers` is a Series, not a list, and its index contains integers.
- With an integer index, `numbers[10]` is treated as a label lookup and returns `100`.
- `numbers[0]` would raise a `KeyError`, because there is no label `0`, even though position 0 exists.
- A reader used to Python lists could easily expect position-based behavior here.

Does `10` mean label or position?

Use explicit access:

```python
print(numbers.loc[10])
print(numbers.iloc[0])
```

**Explanation**

- The first line uses `loc` to return the value stored under the label `10` in the `numbers` Series.
- The second line uses `iloc` to return the value at the first position (position 0).
- Both lines return `100`, but each states clearly whether it means a label or a position.
- `loc` raises a `KeyError` for a missing label, and `iloc` raises an `IndexError` for a position that is out of range.

Good Pandas code is explicit about label vs position.

## 22. Slicing A Series

```python
weekly_sales = pd.Series(
    [120, 135, 150, 160, 155, 170, 180],
    index=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
)
print(weekly_sales.iloc[1:4])
print(weekly_sales.loc["Tue":"Thu"])
```

**Explanation**

- A Pandas Series named `weekly_sales` is created with one sales figure per day of the week.
- `iloc[1:4]` selects positions 1, 2, and 3 (Tuesday to Thursday); the stop position 4 is excluded.
- `loc["Tue":"Thu"]` selects the labels from "Tue" to "Thu"; the stop label is included.
- The two slices return the same rows, but one counts positions and the other follows labels.

Both select Tuesday through Thursday here, but they use different rules.

## 23. Editing Values

Create a Series:

```python
scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)
```

**Explanation**

- Creates a Pandas Series named `scores` containing four numerical scores.
- The scores are labeled with the names "Asha", "Ravi", "Meera", and "Kabir".
- The labels make it possible to read and edit a score by student name.

Edit by label:

```python
scores.loc["Meera"] = 80
```

**Explanation**

- The code assigns the value `80` to the entry labeled "Meera" in the `scores` Series.
- `loc` selects by label, so the position of "Meera" in the Series does not matter.
- If the label did not already exist, the same statement would add a new entry instead of updating one.

Edit by position:

```python
scores.iloc[0] = 85
```

**Explanation**

- `iloc[0]` targets the first position of the `scores` Series, which holds the score for "Asha".
- The value at that position is replaced with `85`.
- Unlike `loc`, `iloc` cannot add new entries. A position that is out of range raises an `IndexError`.

Add a new label:

```python
scores.loc["Nisha"] = 92
```

**Explanation**

- The code assigns a value through `loc` using the label "Nisha".
- Because "Nisha" is not yet in the index, Pandas adds a new entry instead of updating an existing one.
- The value `92` is stored under the new label "Nisha" at the end of the Series.
- This is convenient for small additions, but a misspelled label silently creates a new entry, so check labels carefully.

Print:

```python
print(scores)
```

**Explanation**

- `print(scores)` displays the `scores` Series after the three edits above.
- The output now has five entries: the updated scores for "Asha" and "Meera", the unchanged scores for "Ravi" and "Kabir", and the new entry for "Nisha".

## 24. Copying A Series Safely

When you take a subset and plan to modify it, use `.copy()`.

```python
top_scores = scores.head(2).copy()
top_scores.iloc[0] = 999
print(top_scores)
print(scores)
```

**Explanation**

- `scores.head(2).copy()` creates an independent copy of the first two entries of the `scores` Series.
- The first entry of the copy (`top_scores`) is then set to 999.
- The original `scores` Series remains unchanged, because the edit was applied to the copy.
- Printing both objects shows the difference.

Why? It prevents accidental changes or warnings caused by modifying a view-like object.

Interview answer:

> If I need an independent object, I call `.copy()` before modifying a subset.

## 25. Python Functions With Series

Many Python built-ins work with Series.

```python
scores = pd.Series([82, 91, 76, 88])
print(len(scores))
print(type(scores))
print(max(scores))
print(min(scores))
print(sorted(scores))
```

**Explanation**

- Creates a Pandas Series named `scores` from a list of integer scores.
- `len(scores)` returns the number of elements in the Series.
- `type(scores)` confirms that the object is a Pandas Series.
- `max(scores)` returns the highest score in the Series.
- `min(scores)` returns the lowest score in the Series.
- `sorted(scores)` returns the scores in ascending order as a plain Python list, without the index.

Convert to a list:

```python
print(scores.tolist())
```

**Explanation**

- `tolist()` converts the values of the `scores` Series into a standard Python list.
- The index labels are dropped; only the values are kept.
- This is useful when a function expects a plain list instead of a Series.

Convert to a dictionary:

```python
named_scores = pd.Series(
    [82, 91, 76],
    index=["Asha", "Ravi", "Meera"],
)
print(named_scores.to_dict())
```

**Explanation**

- A Pandas Series is created with scores labeled by student name.
- `to_dict()` converts the Series into a dictionary in which the index labels are the keys and the scores are the values.
- The resulting dictionary is printed, showing each name mapped to its score.

## 26. Membership: Index vs Values

For a Series, the `in` operator checks the index, not the values.

```python
scores = pd.Series(
    [82, 91, 76],
    index=["Asha", "Ravi", "Meera"],
)
print("Ravi" in scores)
print(91 in scores)
```

**Explanation**

- A Pandas Series named `scores` is created with three integer values and names as the index.
- `"Ravi" in scores` checks whether "Ravi" is one of the index labels, which it is.
- `91 in scores` also checks the index labels, not the values. There is no label `91`, so the result is `False` even though 91 is a stored value.
- In this respect a Series behaves like a dictionary, where `in` tests the keys.

Output:

```text
True
False
```

To check values, use:

```python
print(91 in scores.values)
```

**Explanation**

- `scores.values` returns the values of the `scores` Series as a NumPy array, without the index.
- The `in` operator then checks whether 91 appears among those values.
- The result is `True` here, because 91 is Ravi's score.

Or use `isin()`:

```python
print(scores.isin([91]))
```

**Explanation**

- The `isin()` method checks each element of the `scores` Series against the given list.
- It returns a boolean Series where each element indicates whether the corresponding value in `scores` equals `91`.
- This is useful for filtering or validating data within a Series.

## 27. Looping Over A Series

Looping over a Series gives values:

```python
for score in scores:
    print(score)
```

**Explanation**

- The `for` loop iterates over the values of the `scores` Series, not the index labels.
- Each `score` is accessed one at a time during each iteration of the loop.
- The `print()` function outputs the current `score` to the console.

Loop over index and values:

```python
for name, score in scores.items():
    print(name, score)
```

**Explanation**

- `scores.items()` yields `(label, value)` pairs from the Series.
- `name` is the index label, while `score` is the corresponding value in each iteration.
- The `print` function outputs each label-value pair to the console.

Use vectorized operations when possible. Loops are useful for display, debugging, or custom logic.

## 28. Arithmetic Operations

Series operations align by index labels.
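
Alignment means pandas pairs values by label before computing, so row order does not decide which values meet. A minimal sketch, using made-up region labels and numbers that are not part of this lesson's data:

```python
import pandas as pd

# Hypothetical figures: the same three labels, listed in a different order.
online = pd.Series([10, 20, 30], index=["north", "south", "east"])
in_store = pd.Series([3, 1, 2], index=["east", "north", "south"])

# Addition pairs values by label, not by position.
total = online + in_store

print(total["north"])  # 20 (online south) is NOT added to 1; north is 10 + 1
print(total["east"])   # 30 + 3
```

Here every label exists in both Series, so no missing values appear. Labels that exist in only one Series behave differently, as the next example shows.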
```python
jan = pd.Series([100, 200, 300], index=["A", "B", "C"])
feb = pd.Series([110, 190, 250], index=["A", "B", "D"])

print(feb - jan)
```

**Explanation**

- Creates two pandas Series, `jan` and `feb`, with specified indices "A", "B", "C" for `jan` and "A", "B", "D" for `feb`.
- The subtraction operation `feb - jan` is performed, aligning the indices of both Series.
- For indices that do not match, such as "C" in `jan` and "D" in `feb`, the result will contain NaN (Not a Number) for those positions.
- The output will display the differences for matching indices and NaN for non-matching ones.

Output:

```text
A    10.0
B   -10.0
C     NaN
D     NaN
dtype: float64
```

Why?

- `A` and `B` exist in both Series
- `C` is missing from February
- `D` is missing from January

If you want missing values treated as zero:

```python
print(feb.sub(jan, fill_value=0))
```

**Explanation**

- The code uses the `sub` method to perform element-wise subtraction between two Series, `feb` and `jan`.
- The `fill_value=0` argument ensures that any missing values in either Series are treated as zeros during the subtraction.
- This approach helps to avoid NaN results when one Series has values that the other does not, providing a cleaner output.
- The result will be a new Series containing the differences, with indices from both Series preserved.

## 29. Relational Operations

```python
scores = pd.Series([82, 91, 76, 88])

print(scores >= 85)
```

**Explanation**

- A pandas Series named `scores` is created containing four integer values representing scores.
- The expression `scores >= 85` performs a comparison operation, checking each score to see if it is greater than or equal to 85.
- The result of this comparison is a boolean Series, where each element indicates whether the corresponding score meets the threshold.
- The `print` function outputs the boolean Series to the console, allowing users to see which scores are above or equal to 85.


Output:

```text
0    False
1     True
2    False
3     True
dtype: bool
```

This creates a boolean Series.

## 30. Boolean Indexing

Use a boolean condition to filter values.

```python
scores = pd.Series(
    [82, 91, 76, 88],
    index=["Asha", "Ravi", "Meera", "Kabir"],
)

high_scores = scores[scores >= 85]
print(high_scores)
```

**Explanation**

- A pandas Series named `scores` is created with student names as index labels and their corresponding scores as values.
- The `high_scores` variable filters the `scores` Series to include only those scores that are greater than or equal to 85.
- The filtered high scores are then printed to the console, showing only the students who reached this threshold.
- The index labels are preserved, so each remaining score still carries its student name.

Output:

```text
Ravi     91
Kabir    88
dtype: int64
```

Count values above a threshold:

```python
print((scores >= 85).sum())
```

**Explanation**

- The expression `scores >= 85` creates a boolean Series where each element indicates whether the corresponding score meets the condition.
- The `sum()` method adds up the `True` values, which counts how many scores are 85 or higher.
- For the `scores` Series above, the result is 2.

This works because `True` behaves like 1 and `False` behaves like 0.

## 31. Multiple Conditions

Use `&` for AND and `|` for OR.

Wrap each condition in parentheses.

```python
scores = pd.Series([45, 62, 78, 91, 38, 84])

selected = scores[(scores >= 60) & (scores <= 85)]
print(selected)
```

**Explanation**

- A Pandas Series named `scores` is created with a list of integer values representing scores.
- The `selected` variable filters the `scores` Series to include only those values that are greater than or equal to 60 and less than or equal to 85.
- The filtering is done using a boolean condition that combines two comparisons with the element-wise AND operator (`&`).
- Finally, the filtered results stored in `selected` are printed to the console, displaying only the scores that meet the specified criteria.

Output:

```text
1    62
2    78
5    84
dtype: int64
```

Common mistake:

```python
scores >= 60 & scores <= 85
```

**Explanation**

- The intent is to keep scores from 60 to 85, but the parentheses around each comparison are missing.
- `&` has higher precedence than `>=` and `<=`, so Python evaluates `60 & scores` first.
- The remaining chained comparison then needs a single `True` or `False` from a whole Series, which raises `ValueError: The truth value of a Series is ambiguous`.
- Wrapping each comparison in parentheses, as in the previous example, fixes the problem.

This is wrong because operator precedence changes the meaning: `&` is applied before the comparisons.

## 32. Plotting A Series

Pandas can plot Series using Matplotlib behind the scenes.

```python
daily_visitors = pd.Series(
    [120, 135, 150, 160, 155, 170, 180],
    index=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
)

daily_visitors.plot(kind="line", title="Daily Visitors")
```

**Explanation**

- Creates a pandas Series named `daily_visitors` containing visitor counts for each day of the week.
- Assigns custom index labels ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun") to represent the days.
- Uses the `plot` method to generate a line graph of the daily visitors, with the index on the x-axis.
- Sets the title of the plot to "Daily Visitors".

For category counts:

```python
plans = pd.Series(["free", "pro", "free", "team", "pro", "free"])

plans.value_counts().plot(kind="bar", title="Plan Counts")
```

**Explanation**

- The code creates a pandas Series containing different subscription plan types.
- It uses the `value_counts()` method to count the occurrences of each unique plan.
- The resulting counts are then plotted as a bar chart using the `plot()` method with the title "Plan Counts".
- This visualization helps in understanding the popularity of each subscription plan at a glance.

In notebooks, plots display inline if plotting is configured.

## 33. Changing Data Type With `astype`

```python
scores = pd.Series([82.0, 91.0, 76.0])

scores_int = scores.astype("int64")
print(scores_int)
print(scores_int.dtype)
```

**Explanation**

- A Pandas Series named `scores` is created containing three float values.
- The `astype` method is used to convert the float values in the Series to integers, resulting in a new Series `scores_int`.
- The converted Series `scores_int` is printed to display the integer values.
- The data type of the `scores_int` Series is printed to confirm the conversion to `int64`.

Use `astype()` when conversion is straightforward.

## 34. Safer Numeric Conversion With `pd.to_numeric`

Real data often has messy strings.

```python
raw_prices = pd.Series(["120", "99.5", "missing", "150"])

prices = pd.to_numeric(raw_prices, errors="coerce")
print(prices)
```

**Explanation**

- The code initializes a pandas Series containing string representations of prices, including a non-numeric value ("missing").
- The `pd.to_numeric()` function is used to convert the Series to numeric values, with the `errors="coerce"` argument ensuring that any non-convertible values are replaced with NaN (Not a Number).
- The resulting Series, `prices`, contains numeric values for valid entries and NaN for the invalid entry.
- Finally, the code prints the converted Series, allowing for easy inspection of the numeric values.

Output:

```text
0    120.0
1     99.5
2      NaN
3    150.0
dtype: float64
```

`errors="coerce"` converts invalid values to missing values.

This is useful for cleaning imported CSV data.

## 35. `between()`

`between()` checks whether values lie inside a range.
```python
scores = pd.Series([45, 62, 78, 91, 38, 84])

print(scores.between(60, 85))
print(scores[scores.between(60, 85)])
```

**Explanation**

- The code creates a Pandas Series named `scores` containing a list of numerical values representing scores.
- The `between` method is used to check which scores fall within the range of 60 to 85, returning a boolean Series.
- The first `print` statement outputs the boolean Series indicating whether each score meets the criteria.
- The second `print` statement filters the original `scores` Series to display only those scores that are between 60 and 85, based on the boolean mask.

Output:

```text
0    False
1     True
2     True
3    False
4    False
5     True
dtype: bool
1    62
2    78
5    84
dtype: int64
```

## 36. `clip()`

`clip()` limits values to a lower and upper bound.

```python
ratings = pd.Series([2, 5, 8, 11, -3, 7])

safe_ratings = ratings.clip(lower=0, upper=10)
print(safe_ratings)
```

**Explanation**

- The code creates a Pandas Series named `ratings` containing a mix of integers, including negative values.
- The `clip` method is used to limit the values in the `ratings` Series, setting a lower bound of 0 and an upper bound of 10.
- Any ratings below 0 are replaced with 0, and any ratings above 10 are replaced with 10, ensuring all values are within the desired range.
- The modified Series, `safe_ratings`, is then printed, displaying the adjusted ratings.

Output:

```text
0     2
1     5
2     8
3    10
4     0
5     7
dtype: int64
```

Use this for outlier capping or valid range enforcement.

## 37. Duplicates: `duplicated()` And `drop_duplicates()`

```python
plans = pd.Series(["free", "pro", "free", "team", "pro", "free"])

print(plans.duplicated())
print(plans.duplicated().sum())
```

**Explanation**

- The code creates a pandas Series named `plans` containing various subscription types.
- The `duplicated()` method is called on the Series to identify duplicate entries, returning a boolean Series where `True` indicates a duplicate.
- The first `print` statement outputs the boolean Series showing which entries are duplicates. The first occurrence of each value is not marked as a duplicate.
- The second `print` statement sums the `True` values from the boolean Series, providing the total count of duplicate entries in the `plans` Series.

Output:

```text
0    False
1    False
2     True
3    False
4     True
5     True
dtype: bool
3
```

Drop duplicates:

```python
print(plans.drop_duplicates())
```

**Explanation**

- The `drop_duplicates()` method is called on the `plans` Series to remove repeated values.
- The method returns a new Series with only unique values, preserving the first occurrence of each one along with its original index label.
- The `print()` function outputs the resulting Series to the console for review.
- This operation is useful for data cleaning before analysis.

Output:

```text
0    free
1     pro
3    team
dtype: object
```

Keep the last occurrence:

```python
print(plans.drop_duplicates(keep="last"))
```

**Explanation**

- The `drop_duplicates` method is called on the `plans` Series.
- The parameter `keep="last"` specifies that when duplicates are found, the last occurrence is kept instead of the first.
- The result is printed to the console, displaying the Series without duplicates.
- The kept entries carry the index labels of the last occurrences (3, 4, and 5 here).

## 38. Missing Data

Create a Series with missing values:

```python
ratings = pd.Series([4.5, np.nan, 3.8, np.nan, 4.9])
```

**Explanation**

- Initializes a Pandas Series named `ratings` with five entries: three numeric ratings and two missing values.
- Uses `np.nan` to denote missing or undefined ratings in the dataset.
- The Series can be used for further data analysis or manipulation.
- This structure allows for easy identification and handling of missing data points in subsequent operations.

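
Missing values also affect how entries are counted and averaged. The following sketch rebuilds the same `ratings` Series and compares `size` with `count()`; the comments describe pandas' default behavior as I understand it, so treat it as an illustration rather than a complete reference:

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.5, np.nan, 3.8, np.nan, 4.9])

# NaN is a float, so the Series is stored with a float dtype.
print(ratings.dtype)

# size counts every entry, including missing ones.
print(ratings.size)

# count() counts only the non-missing entries.
print(ratings.count())

# mean() skips NaN by default, so it averages 4.5, 3.8, and 4.9.
print(ratings.mean())
```

The gap between `size` and `count()` is a quick way to see how many values are missing before you decide whether to drop or fill them.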

Find missing values:

```python
print(ratings.isna())
print(ratings.isna().sum())
```

**Explanation**

- The first line, `print(ratings.isna())`, outputs a boolean Series of the same length as `ratings`, where each entry is `True` if the corresponding value is missing and `False` otherwise.
- The second line, `print(ratings.isna().sum())`, prints the total number of missing values in the Series by summing the boolean values (`True` counts as 1). For `ratings`, the result is 2.
- This is useful for data cleaning, because it shows quickly how much data is missing.

`isnull()` is an alias for `isna()`.

Drop missing values:

```python
print(ratings.dropna())
```

**Explanation**

- The `dropna()` method is called on the `ratings` Series to remove entries that are NaN (missing).
- The result is a new Series that contains only the non-missing values, with their original index labels.
- The `print()` function outputs the cleaned Series to the console.
- The original `ratings` Series is not modified.

Fill missing values:

```python
filled = ratings.fillna(ratings.mean())
print(filled)
```

**Explanation**

- The `fillna()` method replaces NaN (missing) values in the `ratings` Series.
- The argument `ratings.mean()` is a single number: the mean of the non-missing values, because `mean()` skips NaN by default.
- The result is stored in the variable `filled`, a new Series with no missing values.
- The `print(filled)` statement outputs the filled Series to the console for review.

Use a domain-appropriate fill value. Do not blindly use the mean for every dataset.

## 39. `isin()`

`isin()` checks whether each value is in a list-like collection.
```python
scores = pd.Series([49, 50, 75, 99, 100, 42])

near_milestones = scores[scores.isin([49, 99])]
print(near_milestones)
```

**Explanation**

- A Pandas Series named `scores` is created containing a list of integer values representing scores.
- The `isin()` method is used to filter the Series, selecting only the scores that exactly match one of the listed values, 49 or 99.
- The filtered results are stored in the variable `near_milestones`.
- Finally, the `print()` function outputs the filtered Series to the console.

Output:

```text
0    49
3    99
dtype: int64
```

Use it for membership filters.

## 40. `map()`

`map()` is useful for value replacement using a dictionary or function.

```python
plans = pd.Series(["free", "pro", "team", "free"])

plan_labels = plans.map({
    "free": "Starter",
    "pro": "Professional",
    "team": "Team",
})

print(plan_labels)
```

**Explanation**

- A pandas Series named `plans` is created containing different subscription plan types.
- The `map` method replaces each plan type with the corresponding label defined in a dictionary.
- The dictionary maps "free" to "Starter", "pro" to "Professional", and "team" to "Team".
- The transformed labels are stored in the variable `plan_labels`.
- Finally, the new labels are printed to the console, displaying the mapped values.

Output:

```text
0         Starter
1    Professional
2            Team
3         Starter
dtype: object
```

If a value is not found in the dictionary, the result becomes missing for that value.

## 41. `apply()`

`apply()` applies a function to each value.

```python
prices = pd.Series([99, 149, 249])

def add_tax(price):
    return price * 1.18

final_prices = prices.apply(add_tax)
print(final_prices)
```

**Explanation**

- A Pandas Series named `prices` is created containing three initial price values.
- The function `add_tax` takes a single price as input and returns the price increased by 18% to account for tax.
- The `apply` method is used on the `prices` Series to apply the `add_tax` function to each element, resulting in a new Series called `final_prices`.
- Finally, the `final_prices` Series is printed, displaying the prices after tax has been added.

Output:

```text
0    116.82
1    175.82
2    293.82
dtype: float64
```

For simple arithmetic, vectorized code is better:

```python
final_prices = prices * 1.18
```

**Explanation**

- The variable `final_prices` stores the updated price values.
- The `prices` Series is multiplied by `1.18`, which represents an 18% increase for tax.
- The multiplication is applied to every element of the Series at once, producing a new Series with the same index.
- No Python-level loop or helper function is needed.

Use `apply()` when the logic is custom and cannot be expressed cleanly with vectorized operations.

## 42. Cleaning Price Strings

Real CSV data often stores prices as strings:

```python
raw_prices = pd.Series(["$2.39", "$3.50", None, "$10.25", "not available"])
```

**Explanation**

- Initializes a Pandas Series named `raw_prices` containing price strings, a `None` value, and an invalid entry.
- The valid prices are stored as strings (for example "$2.39", "$3.50", "$10.25"), and "not available" is a text placeholder for missing data.
- `None` marks a truly missing entry, which Pandas treats as a missing value.
- The goal of the next steps is to turn this inconsistent column into clean numbers.

Remove the dollar symbol:

```python
clean_text = raw_prices.str.replace("$", "", regex=False)
```

**Explanation**

- The `raw_prices` variable is the pandas Series of price strings with dollar signs defined above.
- The `str.replace` method is used to search for the dollar sign character (`"$"`) in each string of the Series.
- The `regex=False` argument indicates that the dollar sign should be treated as a literal character, not a regular expression.
- The result is stored in the `clean_text` variable, which contains the price strings without the dollar signs.

Convert to numbers:

```python
prices_usd = pd.to_numeric(clean_text, errors="coerce")
print(prices_usd)
```

**Explanation**

- The code uses the `pd.to_numeric()` function to convert the `clean_text` Series into numeric values.
- The parameter `errors="coerce"` ensures that any non-convertible values in `clean_text`, such as "not available", are replaced with NaN (Not a Number) instead of raising an error.
- The resulting numeric Series is stored in the variable `prices_usd`.
- Finally, the code prints `prices_usd` to display the converted values.

Output:

```text
0     2.39
1     3.50
2      NaN
3    10.25
4      NaN
dtype: float64
```

Fill missing values:

```python
prices_usd = prices_usd.fillna(prices_usd.mean())
```

**Explanation**

- The `fillna()` method replaces NaN (missing) values in the `prices_usd` Series.
- The argument `prices_usd.mean()` is a single number: the mean of the non-missing prices.
- Every missing value is replaced with that average, so the Series no longer has gaps.
- Handling missing data this way is a common preprocessing step before analysis or modeling.

Convert to rupees:

```python
prices_inr = prices_usd * 83
print(prices_inr)
```

**Explanation**

- The code multiplies the `prices_usd` Series by 83, which is used here as a fixed exchange rate from USD to INR.
- The result is stored in the variable `prices_inr`, which contains the equivalent prices in Indian Rupees.
- The `print` function outputs the converted prices to the console.
- This snippet assumes that `prices_usd` is already defined and contains numeric values.

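
The four steps in this section can be combined into one small helper. This is a minimal sketch: the function name `clean_price_series` and its `usd_to_inr` parameter are hypothetical names chosen for illustration, and the default rate of 83 is the same fixed practice rate used above.

```python
import pandas as pd


def clean_price_series(raw, usd_to_inr=83):
    """Hypothetical helper: strip "$", coerce to numbers, fill gaps, convert."""
    # Step 1: remove the literal dollar sign from each string.
    text = raw.str.replace("$", "", regex=False)
    # Step 2: convert to numbers; invalid text becomes NaN.
    usd = pd.to_numeric(text, errors="coerce")
    # Step 3: fill missing values with the mean of the valid prices.
    usd = usd.fillna(usd.mean())
    # Step 4: convert to rupees using the assumed fixed rate.
    return usd * usd_to_inr


raw_prices = pd.Series(["$2.39", "$3.50", None, "$10.25", "not available"])
prices_inr = clean_price_series(raw_prices)
print(prices_inr.round(2))
```

Keeping the rate as a parameter makes it easy to pass a different value later without touching the cleaning steps. Filling with the mean is only one option; a different fill strategy may suit other datasets better.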

In production, use a real exchange rate source. In practice exercises, a fixed rate is fine.

## 43. Mini Project: Analyze Daily Subscribers

Suppose you track daily subscribers gained:

```python
subscribers = pd.Series(
    [120, 135, 150, 90, 210, 240, 180, 160, 260, 300],
    name="subscribers_gained",
)
```

**Explanation**

- Initializes a Pandas Series stored in the variable `subscribers`, with its `name` attribute set to "subscribers_gained".
- Contains ten integers, each representing the number of subscribers gained on one day.
- No index is given, so the days are labeled 0 through 9.
- The Series can be used for further data manipulation and visualization.

Find:

- total subscribers gained
- average daily gain
- best day
- number of days above 200
- capped values between 100 and 250

Solution:

```python
total = subscribers.sum()
average = subscribers.mean()
best_day = subscribers.idxmax()
days_above_200 = (subscribers > 200).sum()
capped = subscribers.clip(lower=100, upper=250)

print("Total:", total)
print("Average:", average)
print("Best day index:", best_day)
print("Days above 200:", days_above_200)
print(capped)
```

**Explanation**

- Computes the total number of subscribers using the `sum()` method.
- Calculates the average number of subscribers with the `mean()` method.
- Identifies the index label of the day with the highest subscriber count using `idxmax()`.
- Counts how many days had more than 200 subscribers by summing a boolean condition.
- Clips the subscriber values to a range between 100 and 250 using the `clip()` method, ensuring no values fall outside this range.
- Outputs the total, average, best day index, count of days above 200, and the capped subscriber values.

This project uses:

- aggregation
- boolean conditions
- `idxmax`
- `clip`

## 44. Mini Project: Clean Product Prices

```python
raw_prices = pd.Series(
    ["$2.39", "$3.39", "$5.99", None, "$12.50", "unknown"],
    index=[
        "chips",
        "juice",
        "sandwich",
        "salad",
        "bowl",
        "soup",
    ],
    name="price_usd",
)
```

**Explanation**

- Initializes a Pandas Series named `raw_prices` containing price data as strings, including valid prices, a `None` value, and an invalid entry ("unknown").
- The index of the Series is explicitly defined with food item names: "chips", "juice", "sandwich", "salad", "bowl", and "soup".
- The `name` attribute of the Series is set to "price_usd", indicating that the data holds prices in USD.
- The `None` value and the string "unknown" stand in for the missing or invalid data that the cleaning steps must handle.

Clean and analyze:

```python
price_text = raw_prices.str.replace("$", "", regex=False)
prices = pd.to_numeric(price_text, errors="coerce")
prices = prices.fillna(prices.mean())

prices_inr = prices * 83

print(prices_inr)
print("Mean INR:", prices_inr.mean())
print("30th percentile:", prices_inr.quantile(0.30))
print("60th percentile:", prices_inr.quantile(0.60))
print("Between 300 and 800:")
print(prices_inr[prices_inr.between(300, 800)])
```

**Explanation**

- The code first removes the dollar sign from a series of raw price strings using `str.replace`.
- It converts the cleaned price strings into numeric values, coercing any errors to NaN.
- Missing values are filled with the mean of the prices to ensure no gaps in the data.
- The prices are then converted to Indian Rupees (INR) by multiplying by a conversion rate of 83.
- Finally, it prints the converted prices, their mean, specific percentiles, and filters prices that fall between 300 and 800 INR.

This kind of cleaning appears often in data analyst tasks.

## 45. Practice Exercises

Try these before reading the solutions.

### Exercise 1: Empty Series

Create an empty Series with dtype `float`.

### Exercise 2: Series Arithmetic

Create two Series:

```python
first = pd.Series([2, 4, 6, 8, 10])
second = pd.Series([1, 3, 5, 7, 10])
```

**Explanation**

- Initializes the first Series named `first` containing even numbers from 2 to 10.
- Initializes the second Series named `second` containing the values 1, 3, 5, 7, and 10.
- Both Series use the default index 0 through 4, so their elements line up one to one.
- These Series can be used for various operations such as mathematical computations and comparisons.

Print addition, subtraction, multiplication, and division.

### Exercise 3: Series Comparison

Using the same two Series, compare:

- greater than
- less than
- equal to

### Exercise 4: Convert Mixed Data To Numeric

Create:

```python
mixed = pd.Series([1, 2, "Python", 2.0, True, 100])
```

**Explanation**

- The code initializes a Pandas Series named `mixed` containing various data types including integers, strings, floats, and booleans.
- Because the values are mixed, Pandas stores the Series with dtype `object`.
- Each element in the Series can be accessed using its index.
- Mixed columns like this are common in raw, uncleaned datasets.

Convert it to numeric values, turning invalid values into missing values.

### Exercise 5: Top Values

Create a Series of player scores and print the top 5 values.

### Exercise 6: Count Above Mean

Create a numeric Series and count how many values are greater than the mean.

### Exercise 7: Missing Values

Create a Series with three missing values. Count missing values, drop them, and fill them with the median.

### Exercise 8: Price Cleaning

Create a Series of price strings such as `"$10.50"`, `"$20.00"`, and `"missing"`.
Remove `$`, convert to numeric, and fill missing values with the mean.

### Exercise 9: Category Counts

Create a Series of course categories and show the top 3 most common categories.

### Exercise 10: Range Filter

Create a Series of product prices and return prices between 100 and 500.

## 46. Practice Solutions

### Solution 1: Empty Series

```python
empty = pd.Series(dtype="float64")
print(empty)
```

**Explanation**

- Initializes an empty Pandas Series object with a data type of float64.
- Passing `dtype` explicitly sets the data type of the Series, since there are no values to infer it from.
- The `print` function outputs the Series to the console, showing its current state (which is empty).
- This code is useful for initializing a Series before populating it with data in subsequent operations.

### Solution 2: Series Arithmetic

```python
first = pd.Series([2, 4, 6, 8, 10])
second = pd.Series([1, 3, 5, 7, 10])

print(first + second)
print(first - second)
print(first * second)
print(first / second)
```

**Explanation**

- The code creates two Pandas Series, `first` and `second`, containing integer values.
- It performs element-wise addition, subtraction, multiplication, and division between the two Series.
- The results of these operations are printed to the console, showing the output for each arithmetic operation.
- This demonstrates how Pandas handles vectorized operations, allowing for efficient calculations on Series data.
- The operations align based on the index of the Series, ensuring that corresponding elements are processed together.

### Solution 3: Series Comparison

```python
first = pd.Series([2, 4, 6, 8, 10])
second = pd.Series([1, 3, 5, 7, 10])

print(first > second)
print(first < second)
print(first == second)
```

**Explanation**

- Creates two pandas Series, `first` and `second`, containing integer values.
- Performs element-wise comparison between the two Series using greater than (`>`), less than (`<`), and equality (`==`) operators.
- Outputs three boolean Series indicating the result of each comparison for corresponding elements in `first` and `second`.
- Useful for data analysis tasks where relational comparisons between datasets are needed.

### Solution 4: Convert Mixed Data To Numeric

```python
mixed = pd.Series([1, 2, "Python", 2.0, True, 100])

converted = pd.to_numeric(mixed, errors="coerce")
print(converted)
```

**Explanation**

- The code creates a Pandas Series named `mixed` containing various data types, including integers, strings, floats, and booleans.
- The `pd.to_numeric()` function is used to convert the elements of the `mixed` Series to numeric values, with the `errors="coerce"` parameter ensuring that any non-convertible values are replaced with `NaN`.
- The result of the conversion is stored in the `converted` variable, which will contain numeric representations of the original values where possible.
- Finally, the `print()` function outputs the `converted` Series, displaying the numeric values along with any `NaN` entries for the non-numeric data.

### Solution 5: Top Values

```python
scores = pd.Series([420, 180, 550, 610, 320, 720, 150])

top_5 = scores.sort_values(ascending=False).head(5)
print(top_5)
```

**Explanation**

- A pandas Series named `scores` is created containing a list of numerical values.
- The `sort_values` method is used to sort the scores in descending order.
- The `head(5)` method extracts the top five scores from the sorted Series.
- Finally, the top five scores are printed to the console.

### Solution 6: Count Above Mean

```python
values = pd.Series([10, 20, 30, 40, 50])

above_mean_count = (values > values.mean()).sum()
print(above_mean_count)
```

**Explanation**

- A Pandas Series is created with five integer values: 10, 20, 30, 40, and 50.
- The mean of the Series is calculated using `values.mean()`.
- A boolean condition checks which elements are greater than the mean, resulting in a Series of True/False values.
- The `sum()` method counts the number of True values, indicating how many elements are above the mean.
- Finally, the count of elements above the mean is printed to the console.

### Solution 7: Missing Values

```python
values = pd.Series([10, np.nan, 30, np.nan, 50, np.nan])

print(values.isna().sum())
print(values.dropna())
print(values.fillna(values.median()))
```

**Explanation**

- A Pandas Series is created with some numeric values and NaN (Not a Number) entries to represent missing data.
- The `isna().sum()` method counts and prints the total number of missing values in the Series.
- The `dropna()` method removes all entries with NaN values and prints the cleaned Series.
- The `fillna()` method replaces NaN values with the median of the Series, providing a way to impute missing data.

### Solution 8: Price Cleaning

```python
prices = pd.Series(["$10.50", "$20.00", "missing", "$15.75"])

clean_text = prices.str.replace("$", "", regex=False)
numeric_prices = pd.to_numeric(clean_text, errors="coerce")
filled_prices = numeric_prices.fillna(numeric_prices.mean())

print(filled_prices)
```

**Explanation**

- The code initializes a pandas Series containing price strings, some of which are invalid or missing.
- It uses the `str.replace` method to remove the dollar sign from each price string, resulting in a clean text representation.
- The `pd.to_numeric` function converts the cleaned strings into numeric values, with the `errors="coerce"` argument turning any non-convertible entries into NaN.
- The `fillna` method replaces NaN values with the mean of the valid numeric prices, ensuring no missing data remains.
- Finally, the cleaned and filled prices are printed to the console.

### Solution 9: Category Counts

```python
categories = pd.Series([
    "python",
    "pandas",
    "python",
    "sql",
    "pandas",
    "python",
    "excel",
])

print(categories.value_counts().head(3))
```

**Explanation**

- A pandas Series named `categories` is created containing various programming-related strings.
- The `value_counts()` method counts the occurrences of each unique category and sorts the result from most to least frequent.
- The `head(3)` method keeps the first three rows, which are the three most frequent categories.
- Finally, the result is printed, showing the most common categories in descending order of frequency.

### Solution 10: Range Filter

```python
prices = pd.Series([50, 120, 250, 600, 499, 80])

# between() includes both endpoints by default
selected = prices[prices.between(100, 500)]
print(selected)
```

**Explanation**

- A Pandas Series named `prices` is created containing a list of numerical values representing prices.
- The `between` method returns a boolean Series that is True for prices from 100 to 500, inclusive.
- Using that boolean Series as a filter keeps only the matching prices, which are stored in `selected`.
- Finally, the `print` function outputs the filtered prices to the console.

## 47. Quick Interview Questions

### 1. What is a Pandas Series?

A one-dimensional labeled array.

### 2. What is the difference between `size` and `count()`?

`size` counts all entries, including missing values. `count()` counts non-missing values.

### 3. What does `value_counts()` do?

It counts how many times each unique value appears in a Series, sorted from most to least frequent. Missing values are excluded by default.

### 4. What is the difference between `loc` and `iloc`?

`loc` selects by label. `iloc` selects by integer position.

### 5. Why use `pd.to_numeric()`?

To convert messy values to numbers with options like `errors="coerce"`.

### 6. What does `dropna()` do?

It removes missing values.

### 7. What does `fillna()` do?

It replaces missing values with a chosen value.

### 8. What does `isin()` do?

It checks whether each value is present in a given list-like collection and returns a boolean Series.

### 9. When should you use `.copy()`?

When you want to modify a subset independently from the original object.

### 10. Why can Series arithmetic produce missing values?

Because Series align by index labels. If a label is missing from one side, the result becomes missing for that label.

## 48. Common Beginner Mistakes

### Mistake 1: Confusing label and position

Use `loc` for labels and `iloc` for positions.

### Mistake 2: Thinking `size` ignores missing values

`size` includes missing values. Use `count()` for non-missing values.

### Mistake 3: Forgetting index alignment

Series arithmetic aligns by labels, not by row position.

### Mistake 4: Using `astype()` on messy strings

`astype()` raises an error when a value cannot be converted. If values are messy, use `pd.to_numeric(..., errors="coerce")`, which turns unconvertible values into `NaN` instead.

### Mistake 5: Modifying a subset without copying

Use `.copy()` when you intentionally want an independent object.

## Final Takeaway

A Pandas Series is simple at first glance: one column of values with labels.

But it becomes powerful because it supports:

- labeled indexing
- automatic alignment
- missing-data handling
- statistical summaries
- boolean filtering
- value counts
- sorting
- type conversion
- string cleaning
- plotting
- element-wise transformation

If you are new to Pandas, master Series before moving deeply into DataFrames. DataFrames are mostly collections of Series working together.

## Sources and Further Reading

- Pandas documentation: https://pandas.pydata.org/docs/
- Pandas Series user guide: https://pandas.pydata.org/docs/user_guide/dsintro.html#series
- Pandas Series API reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.html
- Pandas indexing guide: https://pandas.pydata.org/docs/user_guide/indexing.html
- Pandas missing data guide: https://pandas.pydata.org/docs/user_guide/missing_data.html
- Pandas `to_numeric`: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html
- Pandas visualization guide: https://pandas.pydata.org/docs/user_guide/visualization.html
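
The index-alignment behaviour from interview question 10 and Mistake 3, together with the `size` versus `count()` distinction from Mistake 2, can be sketched in a few lines. This is a minimal illustration, not part of the original practice set: the Series names `jan` and `feb`, the region labels, and the numbers are all made up for the example.

```python
import pandas as pd

# Hypothetical data: two Series whose labels only partly overlap
jan = pd.Series([100, 200, 300], index=["north", "south", "east"])
feb = pd.Series([10, 20, 30], index=["south", "east", "west"])

# Addition pairs values by label, not by row position.
# "north" and "west" exist on only one side, so their result is NaN.
total = jan + feb
print(total)

# size counts every entry; count() skips the missing ones
print(total.size)     # 4
print(total.count())  # 2

# add() with fill_value=0 treats a label missing from one side as 0
filled_total = jan.add(feb, fill_value=0)
print(filled_total)
```

Whether `NaN` or a fill value is the right choice depends on the data. Filling with 0 is reasonable only when a missing label really means "nothing recorded", which is an assumption you should check before using it.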