# Mastering Advanced NumPy: Indexing, Broadcasting, and More

- URL: https://madhudadi.in/blog/posts/advanced-numpy-indexing-broadcasting-missing-values
- Published: 2026-05-25
- Tags: Numpy, python
- Read time: 24 min
- Difficulty: intermediate

> Go beyond NumPy basics with performance comparisons, memory-aware dtypes, fancy indexing, boolean masks, broadcasting rules, vectorized formulas, missing value handling, plotting-ready arrays, and practice problems.

# Advanced NumPy: Fancy Indexing, Broadcasting, Missing Values, and Vectorized Math

Once you know how to create arrays, check shapes, slice data, and run simple operations, the next step is learning how NumPy helps you write compact, fast, data-focused code.

This lesson covers the ideas that make NumPy feel powerful:

- vectorized operations
- memory-friendly data types
- fancy indexing
- boolean filtering
- broadcasting
- mathematical formulas on whole arrays
- missing value handling
- arrays for plotting

These are not just academic features. They appear in real data cleaning, feature engineering, machine learning, analytics dashboards, simulations, and scientific computing.

## What you will learn

By the end, you should be able to:

- explain why NumPy is usually faster than Python loops for numerical work
- choose smaller dtypes when memory matters
- select rows and columns with fancy indexing
- filter arrays with boolean masks
- combine multiple mask conditions correctly
- understand NumPy broadcasting rules
- write vectorized versions of formulas such as sigmoid and mean squared error
- detect and replace missing values
- generate arrays for plotting mathematical functions
- solve practical intermediate NumPy exercises

## 1. Why NumPy Can Be Faster Than Python Lists

Python lists are flexible, but flexibility has a cost. A list can hold many different object types, so Python must manage references to separate objects. NumPy arrays are more specialized.
A NumPy array usually stores values of the same type in a compact memory layout. That makes numerical operations easier to optimize.

Here is a small comparison:

```python
import time
import numpy as np

size = 1_000_000

list_a = list(range(size))
list_b = list(range(size, size * 2))

start = time.perf_counter()
list_result = [x + y for x, y in zip(list_a, list_b)]
print("List time:", time.perf_counter() - start)

array_a = np.arange(size)
array_b = np.arange(size, size * 2)

start = time.perf_counter()
array_result = array_a + array_b
print("NumPy time:", time.perf_counter() - start)
```

The exact time depends on your machine, but the NumPy version is usually much faster for large numerical arrays.

The important idea is not just speed. It is also readability:

```python
array_result = array_a + array_b
```

That line clearly says: add the arrays element by element.

## 2. Memory and dtype Choices

The dtype controls how much memory each value needs.

```python
import numpy as np

large_default = np.arange(1_000_000)
small_int = np.arange(1_000_000, dtype=np.int16)

print(large_default.dtype, large_default.nbytes)
print(small_int.dtype, small_int.nbytes)
```

`nbytes` tells you how many bytes the array data uses.

Smaller dtypes can save memory, but they also have smaller value ranges. Note that `int16` only holds values from -32768 to 32767, so the one million values in `small_int` above do not all fit either; the example is only meant to show the difference in `nbytes`.

```python
# 130 is outside the int8 range of -128 to 127
tiny = np.array([120, 125, 130], dtype=np.int8)
print(tiny)
```

`int8` can store values from -128 to 127. The value `130` does not fit: depending on your NumPy version, this either raises an `OverflowError` or wraps the value around to `-126`. Either way, you should not choose tiny dtypes blindly.

Use smaller dtypes when:

- you know the value range
- the dataset is large
- memory pressure matters
- precision loss is acceptable

For most beginner work, default integer and float types are fine.

## 3. Normal Slicing Review

Before fancy indexing, remember normal slicing.
```python
data = np.arange(20).reshape(5, 4)
print(data)
```

Output:

```text
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
```

Get rows 1 to 3:

```python
print(data[1:4])
```

Get columns 0 and 1 for all rows:

```python
print(data[:, 0:2])
```

Normal slicing works with continuous ranges. Fancy indexing is useful when you want specific positions.

## 4. Fancy Indexing Rows

Fancy indexing lets you pass a list of indexes.

```python
data = np.arange(20).reshape(5, 4)

selected_rows = data[[0, 2, 4]]
print(selected_rows)
```

Output:

```text
[[ 0  1  2  3]
 [ 8  9 10 11]
 [16 17 18 19]]
```

This selected row 0, row 2, and row 4.

You can also change the order:

```python
print(data[[4, 0, 1]])
```

Fancy indexing returns a copy, not a simple slice view. That distinction matters when you start modifying results.

## 5. Fancy Indexing Columns

To pick specific columns, use `:` for all rows and a list for columns.

```python
data = np.arange(20).reshape(5, 4)

selected_columns = data[:, [0, 3]]
print(selected_columns)
```

Output:

```text
[[ 0  3]
 [ 4  7]
 [ 8 11]
 [12 15]
 [16 19]]
```

This is useful when you want selected features from a table-like array.

## 6. Selecting Specific Rows and Columns Together

If you want rows first and then columns, write it in two steps:

```python
data = np.arange(20).reshape(5, 4)

rows = data[[1, 3, 4]]
result = rows[:, [0, 2]]

print(result)
```

Output:

```text
[[ 4  6]
 [12 14]
 [16 18]]
```

This is easier to read than trying to do everything in one expression.

Another compact version:

```python
result = data[[1, 3, 4]][:, [0, 2]]
print(result)
```

Use the two-step version when teaching, debugging, or reviewing code.

## 7. Updating Values With Fancy Indexing

You can update selected values.

```python
scores = np.array([50, 60, 70, 80, 90])
indexes = [1, 3]

scores[indexes] += 5
print(scores)
```

Output:

```text
[50 65 70 85 90]
```

This is useful when selected positions need special treatment.

## 8. Boolean Indexing

Boolean indexing uses a condition to filter values.

```python
marks = np.array([35, 62, 88, 49, 73])

passed = marks[marks >= 50]
print(passed)
```

Output:

```text
[62 88 73]
```

The expression `marks >= 50` creates a boolean mask:

```python
print(marks >= 50)
```

Output:

```text
[False  True  True False  True]
```

NumPy returns only the values where the mask is `True`.

## 9. Combining Boolean Conditions

Use:

- `&` for and
- `|` for or
- `~` for not

Each condition should be wrapped in parentheses.

```python
values = np.array([12, 17, 24, 33, 40, 55, 72])

result = values[(values > 20) & (values % 2 == 0)]
print(result)
```

Output:

```text
[24 40 72]
```

Values greater than 20 or divisible by 5:

```python
result = values[(values > 20) | (values % 5 == 0)]
print(result)
```

Values not divisible by 3:

```python
result = values[~(values % 3 == 0)]
print(result)
```

Avoid Python's `and`, `or`, and `not` for NumPy array masks. Use `&`, `|`, and `~`.

## 10. Boolean Indexing in 2D Arrays

Create a random table:

```python
rng = np.random.default_rng(4)
table = rng.integers(1, 100, size=(4, 5))

print(table)
```

Find all values above 70:

```python
high_values = table[table > 70]
print(high_values)
```

This returns a 1D array of matching values.

Replace all values below 20 with zero:

```python
cleaned = table.copy()
cleaned[cleaned < 20] = 0

print(cleaned)
```

Using `.copy()` protects the original array.

## 11. Broadcasting: The Big Idea

Broadcasting is how NumPy performs operations between arrays with different but compatible shapes.

Example:

```python
prices = np.array([
    [100, 200, 300],
    [150, 250, 350],
])
discount = np.array([10, 20, 30])

print(prices - discount)
```

Output:

```text
[[ 90 180 270]
 [140 230 320]]
```

The discount array has shape `(3,)`. NumPy treats it as if the same row of discounts applies to every row in `prices`.

That is broadcasting.

## 12. Broadcasting Rules

When NumPy compares two shapes, it checks dimensions from right to left.
Two dimensions are compatible if:

- they are equal
- one of them is `1`

Example:

```text
(4, 3)
(   3)
```

The second shape behaves like:

```text
(1, 3)
```

Then NumPy stretches it to:

```text
(4, 3)
```

So this works:

```python
a = np.ones((4, 3))
b = np.array([10, 20, 30])

print(a + b)
```

This does not work:

```python
a = np.ones((3, 4))
b = np.array([10, 20, 30])

print(a + b)  # raises ValueError: shapes cannot be broadcast together
```

Why?

```text
(3, 4)
(   3)
```

The last dimensions are `4` and `3`. They are not equal, and neither is `1`.

## 13. Broadcasting With Columns

Sometimes you want to apply a different value to each row. Use a column-shaped array:

```python
scores = np.array([
    [80, 85, 90],
    [70, 75, 78],
    [88, 92, 95],
])
bonus = np.array([[5], [10], [2]])

print(scores + bonus)
```

Output:

```text
[[85 90 95]
 [80 85 88]
 [90 94 97]]
```

`bonus` has shape `(3, 1)`, so each row gets its own bonus.

## 14. Broadcasting to Create a Grid

Broadcasting can combine row and column arrays.

```python
row = np.array([[1, 2, 3]])
column = np.array([[10], [20], [30], [40]])

grid = row + column
print(grid)
```

Output:

```text
[[11 12 13]
 [21 22 23]
 [31 32 33]
 [41 42 43]]
```

Shapes:

```text
row shape    = (1, 3)
column shape = (4, 1)
result shape = (4, 3)
```

This pattern is useful for distance matrices, pairwise comparisons, lookup grids, and feature engineering.

## 15. Checking Broadcast Compatibility

Here is a simple helper that checks whether two shapes are broadcast-compatible.

```python
def can_broadcast(shape_a, shape_b):
    a = tuple(shape_a)
    b = tuple(shape_b)

    # Pad the shorter shape with leading 1s so both have the same length
    max_len = max(len(a), len(b))
    a = (1,) * (max_len - len(a)) + a
    b = (1,) * (max_len - len(b)) + b

    for dim_a, dim_b in zip(a, b):
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            return False
    return True

print(can_broadcast((4, 3), (3,)))
print(can_broadcast((3, 4), (3,)))
print(can_broadcast((5, 1, 7), (1, 3, 7)))
```

Output:

```text
True
False
True
```

This helper mirrors the basic rule: dimensions must match, or one side must be 1.

## 16. Vectorized Mathematical Formulas

NumPy lets you write formulas almost the same way they appear mathematically.

### Sigmoid

The sigmoid function is common in machine learning:

```text
sigmoid(x) = 1 / (1 + exp(-x))
```

Vectorized version:

```python
def sigmoid(x):
    x = np.asarray(x)
    return 1 / (1 + np.exp(-x))

values = np.array([-3, -1, 0, 1, 3])
print(sigmoid(values))
```

**Explanation**

- The `sigmoid` function takes an input `x`, converts it to a NumPy array, and applies the sigmoid formula.
- The formula `1 / (1 + np.exp(-x))` computes the sigmoid value, which is commonly used in machine learning for binary classification.
- The `values` array contains a set of integers, which are passed to the `sigmoid` function.
- The result is printed, showing how each input value is mapped into the (0, 1) range.

### Mean Squared Error

Mean squared error compares predictions with actual values.

```python
def mean_squared_error(actual, predicted):
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)

    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same shape")

    return np.mean((actual - predicted) ** 2)

actual = np.array([10, 20, 30, 40])
predicted = np.array([12, 18, 33, 37])

print(mean_squared_error(actual, predicted))
```

**Explanation**

- The function `mean_squared_error` computes the mean squared error (MSE) between two arrays: `actual` and `predicted`.
- It first converts both inputs to NumPy arrays with `np.asarray`, so lists work as well as arrays.
- A shape check ensures that both arrays have the same dimensions; if not, a `ValueError` is raised.
- The MSE is calculated by taking the average of the squared differences between the actual and predicted values.
- For the sample data the squared differences are 4, 4, 9, and 9, so the printed MSE is 6.5.

The formula works on the whole array without writing a loop.

## 17. Missing Values With np.nan

Numerical datasets often have missing values. NumPy represents missing floating-point values with `np.nan`.

```python
readings = np.array([12.5, 14.2, np.nan, 13.8, np.nan])

print(readings)
```

**Explanation**

- The code assumes NumPy has already been imported as `np`.
- It creates a NumPy array named `readings` containing five values, two of which are `np.nan`, representing missing data.
- The `print` function outputs the contents of the `readings` array for inspection.
- Because `np.nan` is a floating-point value, an array that contains it has a float dtype.

Check missing values:

```python
print(np.isnan(readings))
```

**Explanation**

- `np.isnan()` identifies NaN (Not a Number) values.
- The function returns a boolean array of the same shape as the input, with `True` for NaN values and `False` for non-NaN values.
- Printing the result lets you see immediately where the missing values are in the `readings` array.
- This is useful for data cleaning and preprocessing, so that later analyses handle missing values appropriately.

Output:

```text
[False False  True False  True]
```

Remove missing values:

```python
clean = readings[~np.isnan(readings)]
print(clean)
```

**Explanation**

- The `~np.isnan(readings)` expression creates a boolean mask that identifies non-NaN values in the `readings` array.
- The `readings[...]` syntax uses this mask to select only the elements that are not NaN.
- The result is stored in the variable `clean`, which contains only valid numerical readings.
- The `print(clean)` statement outputs the cleaned array to the console for verification.
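One detail is worth checking before you filter: `np.nan` never compares equal to itself, so an equality test cannot find missing values. The short sketch below reuses the `readings` array from this section; the names `wrong_mask` and `missing_count` are illustrative and not part of the original lesson.

```python
import numpy as np

readings = np.array([12.5, 14.2, np.nan, 13.8, np.nan])

# Comparing with == never matches NaN, because NaN is not equal to anything,
# including itself.
wrong_mask = readings == np.nan
print(wrong_mask.any())  # False

# np.isnan is the reliable check; summing the boolean mask counts the True entries.
missing_count = int(np.isnan(readings).sum())
print(missing_count)  # 2
```

Counting missing values first is a cheap way to decide whether dropping or filling them is the more reasonable choice.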
Replace missing values with zero:

```python
filled = np.nan_to_num(readings, nan=0.0)
print(filled)
```

**Explanation**

- The `np.nan_to_num()` function is used to convert NaN (Not a Number) values in the `readings` array to a specified numerical value, which is 0.0 in this case.
- The result is stored in the variable `filled`, which will contain the original data with NaNs replaced by zeros.
- The `print()` function outputs the modified array, allowing users to see the changes made to the original data.
- This approach is useful for data preprocessing, especially in scenarios where NaN values can disrupt calculations or analyses.

## 18. Filling Missing Values With the Mean

Replacing missing values with zero is not always a good choice. Sometimes the mean is better.

```python
readings = np.array([12.5, 14.2, np.nan, 13.8, np.nan])

mean_value = np.nanmean(readings)
filled = np.where(np.isnan(readings), mean_value, readings)

print(filled)
```

**Explanation**

- The code initializes a NumPy array `readings` containing some numerical values and NaN entries.
- It uses `np.nanmean()` to compute the mean of the array while ignoring any NaN values.
- The `np.where()` function replaces NaN values in the original array with the calculated mean, resulting in a new array `filled`.
- Finally, the filled array is printed, showing the original values with NaNs replaced by the mean.

`np.nanmean()` ignores missing values while calculating the mean.

For 2D arrays, you can fill by column:

```python
data = np.array([
    [10.0, 2.0, np.nan],
    [12.0, np.nan, 5.0],
    [14.0, 4.0, 7.0],
])

column_means = np.nanmean(data, axis=0)
missing_mask = np.isnan(data)

filled = data.copy()
# np.where(missing_mask)[1] gives the column index of every missing entry
filled[missing_mask] = np.take(column_means, np.where(missing_mask)[1])

print(filled)
```

**Explanation**

- The code initializes a 2D NumPy array containing some `NaN` values.
- It calculates the mean of each column while ignoring `NaN` values using `np.nanmean()`.
- A boolean mask is created to identify the positions of `NaN` values in the array.
- A copy of the original array is made, and the `NaN` values are replaced with the corresponding column means.
- Finally, the modified array with filled values is printed to the console.

This fills each missing value using the mean of its column.

## 19. Finding Nearest Values

To find the array value nearest to a target number, compare distances.

```python
values = np.array([8, 14, 19, 27, 35, 42])
target = 25

distance = np.abs(values - target)
nearest_index = distance.argmin()

print(values[nearest_index])
```

**Explanation**

- Initializes a NumPy array `values` containing a set of integers.
- Defines a `target` integer to which the nearest value in the array will be found.
- Calculates the absolute distance between each element in `values` and the `target`.
- Identifies the index of the smallest distance using `argmin()`, which indicates the nearest value.
- Prints the value from the `values` array that is closest to the specified `target`.

Output:

```text
27
```

This is a common pattern:

```python
array[np.abs(array - target).argmin()]
```

**Explanation**

- Uses NumPy's `abs` function to compute the absolute difference between each element in the array and a specified target value.
- The `argmin` method identifies the index of the smallest difference, which locates the closest value to the target.
- The final expression accesses the element in the array at the index found, returning the closest value.
- The whole computation is vectorized, so no Python loop is needed.

## 20. Element-Wise Maximum With np.where()

Suppose you have two arrays with the same shape:

```python
model_a = np.array([72, 88, 91, 64])
model_b = np.array([75, 80, 93, 70])
```

**Explanation**

- The code assumes NumPy has already been imported with `import numpy as np` (not shown here).
- `model_a` is created as a NumPy array containing the scores [72, 88, 91, 64].
- `model_b` is created as a NumPy array containing the scores [75, 80, 93, 70].
- Because the two arrays have the same shape, they can be compared position by position.

Choose the larger value at each position:

```python
best = np.where(model_a >= model_b, model_a, model_b)
print(best)
```

**Explanation**

- Uses NumPy's `where` function to compare two arrays, `model_a` and `model_b`.
- For each element, it checks if the value in `model_a` is greater than or equal to the corresponding value in `model_b`.
- If true, it selects the value from `model_a`; otherwise, it selects from `model_b`.
- The result is stored in the variable `best`, which contains the larger of the two values at each position.
- Finally, it prints the `best` array to display the selected values.

Output:

```text
[75 88 93 70]
```

`np.where(condition, value_if_true, value_if_false)` is extremely useful for conditional array logic. For this particular case, `np.maximum(model_a, model_b)` gives the same result.

## 21. Repeating and Tiling Values

`np.repeat()` repeats each element.

```python
items = np.array([1, 2, 3])

print(np.repeat(items, 3))
```

**Explanation**

- The code creates a NumPy array named `items` containing the integers 1, 2, and 3.
- The `np.repeat()` function is called with `items` as the first argument and `3` as the second argument, indicating that each element should be repeated three times.
- The printed result is a new array with each original element repeated consecutively.

Output:

```text
[1 1 1 2 2 2 3 3 3]
```

`np.tile()` repeats the whole array.

```python
print(np.tile(items, 3))
```

**Explanation**

- The `np.tile` function is used to construct an array by repeating the input array.
- In this case, `items` is the input array that will be repeated.
- The number `3` specifies that the array should be repeated three times.
- The result is a new array that contains the elements of `items` concatenated three times in sequence.
- This is useful for creating larger datasets or for preparing data for operations that require repeated patterns.

Output:

```text
[1 2 3 1 2 3 1 2 3]
```

Combine both:

```python
pattern = np.hstack([np.repeat(items, 3), np.tile(items, 3)])
print(pattern)
```

**Explanation**

- `np.repeat(items, 3)` creates a new array by repeating each element in `items` three times.
- `np.tile(items, 3)` constructs an array by repeating the entire `items` array three times.
- `np.hstack([...])` horizontally stacks the two resulting arrays from the repeat and tile operations into a single array.
- The final output, printed with `print(pattern)`, displays the combined pattern of repeated and tiled elements.

Output:

```text
[1 1 1 2 2 2 3 3 3 1 2 3 1 2 3 1 2 3]
```

## 22. Arrays for Plotting

NumPy is often used to generate x and y values for plots.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)
y = x ** 2

plt.plot(x, y)
plt.title("y = x^2")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```

**Explanation**

- Imports the NumPy library for numerical operations and Matplotlib for plotting.
- Generates 200 evenly spaced values between -5 and 5 for the x-axis using `np.linspace()`.
- Calculates the corresponding y values by squaring each x value (`y = x ** 2`).
- Plots the quadratic function with labeled axes and a title for clarity.
- Displays the plot using `plt.show()`, allowing users to visualize the parabolic curve.

For a sine wave:

```python
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x)

plt.plot(x, y)
plt.title("Sine wave")
plt.show()
```

**Explanation**

- The `np.linspace` function creates an array of 200 evenly spaced values between 0 and `2 * np.pi`.
- The `np.sin` function computes the sine of each value in the array, resulting in the y-coordinates for the sine wave.
- The `plt.plot` function is used to plot the x and y values, creating a graphical representation of the sine wave.
- The `plt.title` function sets the title of the plot to "Sine wave".
- Finally, `plt.show` displays the plot in a window.

For sigmoid:

```python
x = np.linspace(-10, 10, 300)
y = 1 / (1 + np.exp(-x))

plt.plot(x, y)
plt.title("Sigmoid curve")
plt.show()
```

**Explanation**

- The code generates 300 evenly spaced values between -10 and 10 using NumPy's `linspace` function, which are stored in the variable `x`.
- It calculates the sigmoid function values for each `x` using the formula `1 / (1 + np.exp(-x))`, storing the results in the variable `y`.
- The `plt.plot` function from Matplotlib is used to create a line plot of the sigmoid curve by plotting `x` against `y`.
- A title "Sigmoid curve" is added to the plot for clarity using `plt.title`.
- Finally, `plt.show()` displays the generated plot in a window.

The plotting library draws the chart, but NumPy creates the numerical data.

## 23. Practice Exercises

Try these before reading the solutions.

### Exercise 1: Compare memory

Create three arrays with one million values:

- default integer dtype
- `int32`
- `int16`

Print each array's `dtype` and `nbytes`.

### Exercise 2: Select rows and columns

Create a 6 by 5 array from 0 to 29. Select rows 0, 2, and 5, then select columns 1 and 4.

### Exercise 3: Filter values

Create an array from 1 to 50. Return values that are divisible by 4 but not divisible by 8.

### Exercise 4: Replace multiples

Create an array from 1 to 20. Replace values divisible by 3 or 5 with 0.

### Exercise 5: Row-wise bonus

Create a 3 by 4 score array. Add a different bonus to each row using broadcasting.

### Exercise 6: Column centering

Create a 4 by 3 array. Subtract the mean of each column from that column.
### Exercise 7: Fill missing values

Create a 2D array with `np.nan` values. Replace missing values with the column mean.

### Exercise 8: Nearest element

Given an array and a target number, find the nearest value.

### Exercise 9: Cauchy-style matrix

Given:

```python
x = np.array([1, 2, 4])
y = np.array([6, 8, 10, 12])
```

**Explanation**

- Initializes a NumPy array `x` containing three integers: 1, 2, and 4.
- Initializes a second NumPy array `y` containing four integers: 6, 8, 10, and 12.
- No value in `x` equals a value in `y`, so every difference `x_i - y_j` is non-zero.

Create a matrix where each value is:

```text
1 / (x_i - y_j)
```

### Exercise 10: Plot tanh

Generate x values from -6 to 6 and plot:

```text
tanh(x)
```

Use either `np.tanh(x)` or the formula with `np.exp()`.

## 24. Practice Solutions

### Solution 1: Compare memory

```python
arrays = [
    np.arange(1_000_000),
    np.arange(1_000_000, dtype=np.int32),
    np.arange(1_000_000, dtype=np.int16),  # too narrow for these values; size comparison only
]

for arr in arrays:
    print(arr.dtype, arr.nbytes)
```

**Explanation**

- Creates a list of three NumPy arrays with different data types: the platform default integer (usually `int64`), `int32`, and `int16`.
- Each array has one million elements generated by `np.arange()`. The values only fit in the first two dtypes, because `int16` stops at 32767.
- Iterates through the list of arrays, printing the data type (`dtype`) and the total memory size in bytes (`nbytes`) for each array.
- Highlights the impact of data type selection on memory consumption in NumPy arrays.

### Solution 2: Select rows and columns

```python
data = np.arange(30).reshape(6, 5)

result = data[[0, 2, 5]][:, [1, 4]]
print(result)
```

**Explanation**

- The code creates a 2D NumPy array `data` with values from 0 to 29, reshaped into 6 rows and 5 columns.
- It selects rows 0, 2, and 5 from the `data` array using advanced indexing.
- From the selected rows, it further extracts columns 1 and 4, resulting in a new array `result`.
- Finally, the `result` array is printed, displaying the specified rows and columns.

### Solution 3: Filter values

```python
values = np.arange(1, 51)

result = values[(values % 4 == 0) & ~(values % 8 == 0)]
print(result)
```

**Explanation**

- The `np.arange(1, 51)` function generates an array of integers from 1 to 50.
- The condition `(values % 4 == 0)` checks for numbers that are divisible by 4.
- The condition `~(values % 8 == 0)` negates the check for numbers that are divisible by 8.
- The combined condition filters the array to include only those numbers that meet both criteria.
- Finally, `print(result)` outputs the filtered array to the console.

### Solution 4: Replace multiples

```python
values = np.arange(1, 21)

values[(values % 3 == 0) | (values % 5 == 0)] = 0
print(values)
```

**Explanation**

- The code initializes an array of integers from 1 to 20 using NumPy's `arange` function.
- It uses a boolean mask to identify elements in the array that are multiples of 3 or 5.
- The identified elements are then set to zero, effectively replacing them in the original array.
- Finally, the modified array is printed, showing the changes made.

### Solution 5: Row-wise bonus

```python
scores = np.array([
    [70, 75, 80, 85],
    [60, 65, 70, 75],
    [88, 90, 92, 94],
])
bonus = np.array([[5], [10], [2]])

print(scores + bonus)
```

**Explanation**

- The code initializes a 2D NumPy array `scores` representing the scores of three students across four subjects.
- A second 2D NumPy array `bonus` contains bonus points for each student, structured as a column vector.
- The addition operation `scores + bonus` utilizes broadcasting, allowing the bonus points to be added to each corresponding row of the `scores` array.
- The result is printed, showing each student's original scores increased by their respective bonus points.
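In practice the bonus values often arrive as a flat 1D array rather than a ready-made column. The sketch below shows one way to handle that case, assuming a hypothetical `bonus_1d` array of shape `(3,)`; adding it directly to the `(3, 4)` scores would fail, because the trailing dimensions `4` and `3` do not match.

```python
import numpy as np

scores = np.array([
    [70, 75, 80, 85],
    [60, 65, 70, 75],
    [88, 90, 92, 94],
])
bonus_1d = np.array([5, 10, 2])  # illustrative: one bonus per row, shape (3,)

# Indexing with np.newaxis adds a trailing axis, giving shape (3, 1),
# which broadcasts across the four columns of scores.
boosted = scores + bonus_1d[:, np.newaxis]

print(bonus_1d[:, np.newaxis].shape)
print(boosted)
```

`bonus_1d.reshape(-1, 1)` produces the same column shape if you prefer an explicit reshape.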
### Solution 6: Column centering

```python
data = np.array([
    [10, 20, 30],
    [12, 18, 33],
    [14, 22, 36],
    [16, 24, 39],
])

column_means = data.mean(axis=0)
centered = data - column_means

print(centered)
```

**Explanation**

- The code initializes a 2D NumPy array named `data` with four rows and three columns.
- It calculates the mean of each column using `data.mean(axis=0)`, resulting in a 1D array of column means.
- The original array `data` is centered by subtracting the corresponding column means from each element in that column.
- The centered array is printed, showing how each value has been adjusted relative to its column mean.

### Solution 7: Fill missing values

```python
data = np.array([
    [10.0, np.nan, 30.0],
    [12.0, 22.0, np.nan],
    [14.0, 24.0, 36.0],
])

column_means = np.nanmean(data, axis=0)
mask = np.isnan(data)

filled = data.copy()
filled[mask] = np.take(column_means, np.where(mask)[1])

print(filled)
```

**Explanation**

- The code initializes a 2D NumPy array containing some NaN (Not a Number) values.
- It calculates the mean of each column while ignoring NaN values using `np.nanmean()`.
- A mask is created to identify the positions of NaN values in the original array.
- A copy of the original array is made, and the NaN values are replaced with the corresponding column means.
- Finally, the modified array with filled values is printed, showing the imputed data.

### Solution 8: Nearest element

```python
values = np.array([11, 18, 26, 33, 47])
target = 29

nearest = values[np.abs(values - target).argmin()]
print(nearest)
```

**Explanation**

- Initializes a NumPy array `values` containing a set of integers.
- Defines a `target` integer to which the nearest value in the array will be found.
- Calculates the absolute difference between each element in `values` and the `target`, then finds the index of the minimum difference using `argmin()`.
- Uses this index to retrieve the nearest value from the original array.
- Prints the nearest value to the console.
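One behavior of this pattern is easy to miss: when two values are equally close to the target, `argmin()` returns the first one. The sketch below uses made-up values chosen to force a tie; the names `tied_values` and `all_nearest` are illustrative only.

```python
import numpy as np

tied_values = np.array([10, 20, 30])
target = 15  # exactly halfway between 10 and 20

distance = np.abs(tied_values - target)

# argmin returns the index of the first minimum, so 10 wins the tie.
nearest = tied_values[distance.argmin()]
print(nearest)

# To keep every equally near value, compare against the minimum distance.
all_nearest = tied_values[distance == distance.min()]
print(all_nearest)
```

If ties matter for your data, the boolean-mask version makes them visible instead of silently picking one.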
### Solution 9: Cauchy-style matrix

```python
x = np.array([1, 2, 4]).reshape(-1, 1)
y = np.array([6, 8, 10, 12])

matrix = 1 / (x - y)
print(matrix)
```

**Explanation**

- The variable `x` is reshaped into a 3x1 NumPy array so that it has a single column.
- The variable `y` is a 1D NumPy array containing four elements.
- The expression `x - y` uses broadcasting to compute the pairwise differences between each element in `x` and each element in `y`, resulting in a 3x4 matrix.
- The final output, `matrix`, contains the reciprocal of each of these differences, which is printed to the console.
- The same reshape-and-broadcast pattern works for any pairwise computation between two 1D arrays.

Broadcasting handles the shape difference:

```text
x shape      = (3, 1)
y shape      = (4,)
result shape = (3, 4)
```

### Solution 10: Plot tanh

```python
x = np.linspace(-6, 6, 300)
y = np.tanh(x)

plt.plot(x, y)
plt.title("tanh(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True)
plt.show()
```

**Explanation**

- Generates 300 evenly spaced values between -6 and 6 using NumPy's `linspace` function.
- Computes the hyperbolic tangent of each value in the array `x` with NumPy's `tanh` function, storing the results in `y`.
- Plots the `x` values against the `y` values to visualize the tanh function using Matplotlib's `plot` method.
- Sets the title of the plot to "tanh(x)" and labels the x-axis and y-axis accordingly.
- Enables a grid for better readability and displays the plot with `show()`.

Using the formula:

```python
y = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
```

**Explanation**

- The code uses NumPy's `exp` function to evaluate the exponentials on the whole array at once.
- It computes the hyperbolic tangent (tanh) of `x` by applying the formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
- The result `y` is an array of values between -1 and 1, one for each element of `x`.
- This function is commonly used in machine learning and neural networks as an activation function.

`np.tanh(x)` is preferred because it is shorter, clearer, and numerically safer: the explicit formula can overflow in `np.exp()` when `x` has a very large magnitude.

## 25. Mini Project: Clean and Score Sensor Readings

Suppose you receive sensor readings from 4 devices across 5 time points.

```python
readings = np.array([
    [12.0, 13.5, np.nan, 15.0, 14.5],
    [9.0, np.nan, 11.5, 12.0, 13.0],
    [20.0, 19.5, 21.0, np.nan, 22.0],
    [7.5, 8.0, 8.5, 9.0, np.nan],
])
```

**Explanation**

- Initializes a 2D NumPy array named `readings` to store the sensor data.
- Contains floating-point readings, with some values set to `np.nan` to indicate missing data.
- The array has four rows (one per device) and five columns (one per time point).
- Each row contains exactly one missing value, so every row mean can still be computed from the remaining readings.

Tasks:

- fill missing values with each device's row mean
- calculate each device's average reading
- mark readings above the device average
- normalize each row between 0 and 1

Solution:

```python
row_means = np.nanmean(readings, axis=1, keepdims=True)
missing = np.isnan(readings)

filled = readings.copy()
# np.where(missing)[0] gives the row index of every missing entry
filled[missing] = np.take(row_means.ravel(), np.where(missing)[0])

device_average = filled.mean(axis=1, keepdims=True)
above_average = filled > device_average

row_min = filled.min(axis=1, keepdims=True)
row_max = filled.max(axis=1, keepdims=True)
# Assumes row_max > row_min for every row; a constant row would divide by zero
normalized = (filled - row_min) / (row_max - row_min)

print("Filled readings:")
print(filled)

print("Device averages:")
print(device_average.ravel())

print("Above average mask:")
print(above_average)

print("Normalized readings:")
print(normalized)
```

**Explanation**

- Computes the mean of each row in the `readings` array while ignoring NaN values, storing the result in `row_means`.
- Identifies missing values in the `readings` array and creates a copy to fill these missing entries with the corresponding row means.
- Calculates the average of the filled readings for each device and creates a boolean mask indicating which readings are above the average.
- Normalizes the filled readings by scaling them between 0 and 1 based on the minimum and maximum values of each row.
- Outputs the filled readings, device averages, above-average mask, and normalized readings to the console for review.

This mini project uses:

- `np.nanmean`
- boolean masks
- broadcasting
- row-wise operations
- normalization

These are core skills for real data cleaning.

## 26. Quick Quiz

### 1. Why is NumPy often faster than Python loops?

Because NumPy stores numerical data compactly and runs many operations in optimized lower-level code.

### 2. What is fancy indexing?

Fancy indexing means selecting array values using lists or arrays of indexes.

### 3. Which operators should you use for NumPy boolean masks?

Use `&`, `|`, and `~`, with each condition wrapped in parentheses.

### 4. What makes two dimensions broadcast-compatible?

They are compatible if the sizes are equal or one of the sizes is `1`.

### 5. Why is `np.nanmean()` useful?

It calculates the mean while ignoring `np.nan` values.

## Final Takeaway

Advanced NumPy is mostly about thinking in arrays instead of loops.

The key habits are:

- check shapes before operations
- use boolean masks for filtering
- use fancy indexing for specific rows or columns
- understand broadcasting before writing repeated loops
- use vectorized formulas for mathematical work
- handle `np.nan` values deliberately

When your code feels complicated, print the shape. Most NumPy confusion becomes easier once you know the shape of every array involved.
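As a small illustration of the "print the shape" habit, the sketch below checks shapes before combining arrays. It relies on `np.broadcast_shapes`, which is available in NumPy 1.20 and later; the array names and the `compatible` flag are illustrative and not taken from the lesson.

```python
import numpy as np

a = np.ones((4, 3))
b = np.array([10, 20, 30])
c = np.ones((3, 4))

# Printing the shapes is often enough to spot a mismatch.
print(a.shape, b.shape, c.shape)

# np.broadcast_shapes reports the result shape without doing any arithmetic.
result_shape = np.broadcast_shapes(a.shape, b.shape)
print(result_shape)

# Incompatible shapes raise ValueError, the same error an actual a + b would raise.
try:
    np.broadcast_shapes(c.shape, b.shape)
    compatible = True
except ValueError as exc:
    compatible = False
    print("Incompatible shapes:", exc)
```

Checking shapes this way costs almost nothing, because no array data is touched.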
## Sources and Further Reading

- NumPy indexing guide: https://numpy.org/doc/stable/user/basics.indexing.html
- NumPy broadcasting guide: https://numpy.org/doc/stable/user/basics.broadcasting.html
- NumPy dtype basics: https://numpy.org/doc/stable/user/basics.types.html
- NumPy missing value helpers: https://numpy.org/doc/stable/reference/generated/numpy.isnan.html
- Matplotlib pyplot guide: https://matplotlib.org/stable/tutorials/pyplot.html