Advanced NumPy: Fancy Indexing, Broadcasting, Missing Values, and Vectorized Math
Once you know how to create arrays, check shapes, slice data, and run simple operations, the next step is learning how NumPy helps you write compact, fast, data-focused code.
This lesson covers the ideas that make NumPy feel powerful:
- vectorized operations
- memory-friendly data types
- fancy indexing
- boolean filtering
- broadcasting
- mathematical formulas on whole arrays
- missing value handling
- arrays for plotting
These are not just academic features. They appear in real data cleaning, feature engineering, machine learning, analytics dashboards, simulations, and scientific computing.
What you will learn
By the end, you should be able to:
- explain why NumPy is usually faster than Python loops for numerical work
- choose smaller dtypes when memory matters
- select rows and columns with fancy indexing
- filter arrays with boolean masks
- combine multiple mask conditions correctly
- understand NumPy broadcasting rules
- write vectorized versions of formulas such as sigmoid and mean squared error
- detect and replace missing values
- generate arrays for plotting mathematical functions
- solve practical intermediate NumPy exercises
1. Why NumPy Can Be Faster Than Python Lists
Python lists are flexible, but flexibility has a cost. A list can hold many different object types, so Python must manage references to separate objects.
NumPy arrays are more specialized. A NumPy array usually stores values of the same type in a compact memory layout. That makes numerical operations easier to optimize.
Here is a small comparison:
import time
import numpy as np
size = 1_000_000
list_a = list(range(size))
list_b = list(range(size, size * 2))
start = time.perf_counter()
list_result = [x + y for x, y in zip(list_a, list_b)]
print("List time:", time.perf_counter() - start)
array_a = np.arange(size)
array_b = np.arange(size, size * 2)
start = time.perf_counter()
array_result = array_a + array_b
print("NumPy time:", time.perf_counter() - start)
The exact time depends on your machine, but the NumPy version is usually much faster for large numerical arrays.
The important idea is not just speed. It is also readability:
array_result = array_a + array_b
That line clearly says: add the arrays element by element.
2. Memory and dtype Choices
The dtype controls how much memory each value needs.
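import numpy as np
large_default = np.zeros(1_000_000, dtype=int)
small_int = np.zeros(1_000_000, dtype=np.int16)
print(large_default.dtype, large_default.nbytes)
print(small_int.dtype, small_int.nbytes)
nbytes tells you how many bytes the array data uses. Zeros are used here so that the values fit in both dtypes; the numbers 0 to 999,999 would not fit in int16, which tops out at 32,767.
Smaller dtypes can save memory, but they also have smaller value ranges.
tiny = np.array([120, 125, 130], dtype=np.int8)
print(tiny)
int8 can store values from -128 to 127. The value 130 does not fit: depending on your NumPy version, this line either raises an OverflowError (NumPy 2.x) or silently wraps 130 around to -126 (older versions). Either way, you should not choose tiny dtypes blindly.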
Use smaller dtypes when:
- you know the value range
- the dataset is large
- memory pressure matters
- precision loss is acceptable
For most beginner work, default integer and float types are fine.
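One way to check a value range before picking a smaller dtype is np.iinfo, which reports the limits of an integer type. The sketch below is only an illustration; the helper name fits_in_dtype is hypothetical and not part of NumPy.

```python
import numpy as np

def fits_in_dtype(values, dtype):
    """Return True if every value lies inside the integer dtype's range.

    Hypothetical helper for illustration; not part of NumPy.
    """
    values = np.asarray(values)
    info = np.iinfo(dtype)  # smallest and largest representable values
    return bool(values.min() >= info.min and values.max() <= info.max)

data = np.array([120, 125, 130])
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127
print(fits_in_dtype(data, np.int8))    # False, because 130 is too large
print(fits_in_dtype(data, np.int16))   # True
```

Running a check like this before calling astype() makes the range assumption explicit instead of relying on how a given NumPy version handles overflow.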
3. Normal Slicing Review
Before fancy indexing, remember normal slicing.
data = np.arange(20).reshape(5, 4)
print(data)
Output:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
Get rows 1 to 3:
print(data[1:4])
Get columns 0 and 1 for all rows:
print(data[:, 0:2])
Normal slicing works with contiguous ranges. Fancy indexing is useful when you want specific positions.
4. Fancy Indexing Rows
Fancy indexing lets you pass a list of indexes.
data = np.arange(20).reshape(5, 4)
selected_rows = data[[0, 2, 4]]
print(selected_rows)
Output:
[[ 0  1  2  3]
 [ 8  9 10 11]
 [16 17 18 19]]
This selected row 0, row 2, and row 4.
You can also change the order:
print(data[[4, 0, 1]])
Fancy indexing returns a copy, not a view like basic slicing does. That distinction matters when you start modifying results.
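The copy-versus-view difference can be checked directly. The short sketch below modifies a slice and a fancy-indexed selection and uses np.shares_memory to confirm which one is tied to the original array.

```python
import numpy as np

data = np.arange(20).reshape(5, 4)

view = data[1:3]        # basic slice: a view onto data
copy = data[[1, 2]]     # fancy indexing: an independent copy

view[0, 0] = -1         # also changes data[1, 0]
copy[0, 1] = -99        # leaves data untouched

print(data[1])                       # first element is now -1
print(np.shares_memory(data, view))  # True
print(np.shares_memory(data, copy))  # False
```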
5. Fancy Indexing Columns
To pick specific columns, use : for all rows and a list for columns.
data = np.arange(20).reshape(5, 4)
selected_columns = data[:, [0, 3]]
print(selected_columns)
Output:
[[ 0  3]
 [ 4  7]
 [ 8 11]
 [12 15]
 [16 19]]
This is useful when you want selected features from a table-like array.
6. Selecting Specific Rows and Columns Together
If you want rows first and then columns, write it in two steps:
data = np.arange(20).reshape(5, 4)
rows = data[[1, 3, 4]]
result = rows[:, [0, 2]]
print(result)
Output:
[[ 4  6]
 [12 14]
 [16 18]]
This is easier to read than trying to do everything in one expression.
Another compact version:
result = data[[1, 3, 4]][:, [0, 2]]
print(result)
Use the two-step version when teaching, debugging, or reviewing code.
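A single-expression alternative is np.ix_, which builds index arrays that select every combination of the given rows and columns. Note that data[[1, 3, 4], [0, 2]] does not do this: NumPy tries to pair the two lists element by element and raises an IndexError because their lengths differ. A minimal sketch:

```python
import numpy as np

data = np.arange(20).reshape(5, 4)

# np.ix_ turns the two lists into a (3, 1) and a (1, 2) index array,
# which broadcast to every row/column combination.
result = data[np.ix_([1, 3, 4], [0, 2])]
print(result)

# Same values as the two-step version from the text
two_step = data[[1, 3, 4]][:, [0, 2]]
print(np.array_equal(result, two_step))
```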
7. Updating Values With Fancy Indexing
You can update selected values.
scores = np.array([50, 60, 70, 80, 90])
indexes = [1, 3]
scores[indexes] += 5
print(scores)
Output:
[50 65 70 85 90]
This is useful when selected positions need special treatment.
8. Boolean Indexing
Boolean indexing uses a condition to filter values.
marks = np.array([35, 62, 88, 49, 73])
passed = marks[marks >= 50]
print(passed)
Output:
[62 88 73]
The expression marks >= 50 creates a boolean mask:
print(marks >= 50)
Output:
[False  True  True False  True]
NumPy returns only the values where the mask is True.
9. Combining Boolean Conditions
Use:
- & for and
- | for or
- ~ for not
Each condition should be wrapped in parentheses, because &, |, and ~ bind more tightly than comparison operators such as > and ==.
values = np.array([12, 17, 24, 33, 40, 55, 72])
result = values[(values > 20) & (values % 2 == 0)]
print(result)
Output:
[24 40 72]
Values greater than 20 or divisible by 5:
result = values[(values > 20) | (values % 5 == 0)]
print(result)
Values not divisible by 3:
result = values[~(values % 3 == 0)]
print(result)
Avoid Python's and, or, and not for NumPy array masks. Use &, |, and ~.
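To see why the Python keywords do not work, the sketch below tries and on two masks. Python asks each array for a single True/False value, which NumPy refuses to give for arrays with more than one element, so a ValueError is raised.

```python
import numpy as np

values = np.array([12, 17, 24, 33, 40, 55, 72])

raised = False
try:
    # "and" needs one truth value per operand, not one per element
    result = values[(values > 20) and (values % 2 == 0)]
except ValueError as error:
    raised = True
    print("ValueError:", error)

# The element-wise operator works as intended
result = values[(values > 20) & (values % 2 == 0)]
print(result)  # [24 40 72]
```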
10. Boolean Indexing in 2D Arrays
Create a random table:
rng = np.random.default_rng(4)
table = rng.integers(1, 100, size=(4, 5))
print(table)
Find all values above 70:
high_values = table[table > 70]
print(high_values)
This returns a 1D array of matching values.
Replace all values below 20 with zero:
cleaned = table.copy()
cleaned[cleaned < 20] = 0
print(cleaned)
Using .copy() protects the original array.
11. Broadcasting: The Big Idea
Broadcasting is how NumPy performs operations between arrays with different but compatible shapes.
Example:
prices = np.array([
    [100, 200, 300],
    [150, 250, 350],
])
discount = np.array([10, 20, 30])
print(prices - discount)
Output:
[[ 90 180 270]
 [140 230 320]]
The discount array has shape (3,). NumPy treats it as if the same row of discounts applies to every row in prices.
That is broadcasting.
12. Broadcasting Rules
When NumPy compares two shapes, it checks dimensions from right to left.
Two dimensions are compatible if:
- they are equal
- one of them is 1
Example:
(4, 3)
(   3)
The second shape behaves like:
(1, 3)
Then NumPy stretches it to:
(4, 3)
So this works:
a = np.ones((4, 3))
b = np.array([10, 20, 30])
print(a + b)
This does not work:
a = np.ones((3, 4))
b = np.array([10, 20, 30])
print(a + b)  # raises ValueError
Why?
(3, 4)
(   3)
The last dimensions are 4 and 3. They are not equal, and neither is 1.
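If the intent in the failing example was to add one value per row, a common fix is to give b a trailing axis of length 1 so the shapes line up as (3, 4) and (3, 1). A minimal sketch, assuming that is the intended behaviour:

```python
import numpy as np

a = np.ones((3, 4))
b = np.array([10, 20, 30])

b_column = b[:, np.newaxis]   # shape (3, 1); b.reshape(-1, 1) is equivalent
result = a + b_column         # (3, 4) + (3, 1) -> (3, 4)

print(b_column.shape)
print(result)
```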
13. Broadcasting With Columns
Sometimes you want to apply a different value to each row.
Use a column-shaped array:
scores = np.array([
    [80, 85, 90],
    [70, 75, 78],
    [88, 92, 95],
])
bonus = np.array([[5], [10], [2]])
print(scores + bonus)
Output:
[[85 90 95]
 [80 85 88]
 [90 94 97]]
bonus has shape (3, 1), so each row gets its own bonus.
14. Broadcasting to Create a Grid
Broadcasting can combine row and column arrays.
row = np.array([[1, 2, 3]])
column = np.array([[10], [20], [30], [40]])
grid = row + column
print(grid)
Output:
[[11 12 13]
 [21 22 23]
 [31 32 33]
 [41 42 43]]
Shapes:
row shape = (1, 3)
column shape = (4, 1)
result shape = (4, 3)
This pattern is useful for distance matrices, pairwise comparisons, lookup grids, and feature engineering.
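As one illustration of the pairwise pattern, the sketch below builds a matrix of absolute differences between every pair of points on a line. The array name points is made up for this example.

```python
import numpy as np

points = np.array([1.0, 4.0, 6.0])

# (3, 1) minus (1, 3) broadcasts to (3, 3): every point against every other
distances = np.abs(points[:, np.newaxis] - points[np.newaxis, :])
print(distances)
```

The diagonal is zero because each point is at distance 0 from itself, and the matrix is symmetric.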
15. Checking Broadcast Compatibility
Here is a simple helper that checks whether two shapes are broadcast-compatible.
def can_broadcast(shape_a, shape_b):
    a = tuple(shape_a)
    b = tuple(shape_b)
    # Left-pad the shorter shape with 1s so both have the same length
    max_len = max(len(a), len(b))
    a = (1,) * (max_len - len(a)) + a
    b = (1,) * (max_len - len(b)) + b
    # Each pair of dimensions must match, or one of them must be 1
    for dim_a, dim_b in zip(a, b):
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            return False
    return True
print(can_broadcast((4, 3), (3,)))
print(can_broadcast((3, 4), (3,)))
print(can_broadcast((5, 1, 7), (1, 3, 7)))
Output:
True
False
True
This helper mirrors the basic rule: dimensions must match, or one side must be 1.
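NumPy also provides a built-in for this check. np.broadcast_shapes (available in NumPy 1.20 and later) returns the resulting shape, or raises ValueError when the shapes are incompatible. The wrapper name below is hypothetical and only there to make the comparison easy to print:

```python
import numpy as np

def broadcast_result_shape(shape_a, shape_b):
    """Return the broadcast shape, or None if the shapes are incompatible.

    Hypothetical wrapper around np.broadcast_shapes for illustration.
    """
    try:
        return np.broadcast_shapes(shape_a, shape_b)
    except ValueError:
        return None

print(broadcast_result_shape((4, 3), (3,)))          # (4, 3)
print(broadcast_result_shape((3, 4), (3,)))          # None
print(broadcast_result_shape((5, 1, 7), (1, 3, 7)))  # (5, 3, 7)
```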
16. Vectorized Mathematical Formulas
NumPy lets you write formulas almost the same way they appear mathematically.
Sigmoid
The sigmoid function is common in machine learning:
sigmoid(x) = 1 / (1 + exp(-x))
Vectorized version:
def sigmoid(x):
    x = np.asarray(x)
    return 1 / (1 + np.exp(-x))
values = np.array([-3, -1, 0, 1, 3])
print(sigmoid(values))
Explanation
- The sigmoid function takes an input x, converts it to a NumPy array, and applies the sigmoid formula.
- The formula 1 / (1 + np.exp(-x)) computes the sigmoid value, which is commonly used in machine learning for binary classification.
- The values array contains a set of integers, which are passed to the sigmoid function.
- The result is printed, showing how each input value is mapped into the (0, 1) range.
Mean Squared Error
Mean squared error compares predictions with actual values.
def mean_squared_error(actual, predicted):
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same shape")
    return np.mean((actual - predicted) ** 2)
actual = np.array([10, 20, 30, 40])
predicted = np.array([12, 18, 33, 37])
print(mean_squared_error(actual, predicted))
Explanation
- The function mean_squared_error computes the mean squared error (MSE) between two arrays: actual and predicted.
- It first passes both inputs through np.asarray, so lists are converted to NumPy arrays and existing arrays are used as they are.
- A shape check ensures that both arrays have the same dimensions; if not, a ValueError is raised.
- The MSE is the average of the squared differences between the actual and predicted values.
- The example calls the function with sample data and prints the resulting MSE.
The formula works on the whole array without writing a loop.
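To make the arithmetic concrete, the same calculation can be split into its intermediate arrays. This sketch reuses the sample data from the example above.

```python
import numpy as np

actual = np.array([10, 20, 30, 40])
predicted = np.array([12, 18, 33, 37])

errors = actual - predicted      # [-2  2 -3  3]
squared = errors ** 2            # [4 4 9 9]
mse = squared.mean()             # (4 + 4 + 9 + 9) / 4 = 6.5

print(errors, squared, mse)
```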
17. Missing Values With np.nan
Numerical datasets often have missing values. NumPy represents missing floating-point values with np.nan.
readings = np.array([12.5, 14.2, np.nan, 13.8, np.nan])
print(readings)
Explanation
- The snippet assumes NumPy has already been imported as np.
- It creates a NumPy array named readings containing five values, two of which are np.nan, representing missing data.
- The print function outputs the contents of readings so the data can be inspected.
- This is the starting point for handling arrays with missing values in data analysis.
Check missing values:
print(np.isnan(readings))
Explanation
- np.isnan() identifies NaN (Not a Number) values.
- It returns a boolean array of the same shape as the input, with True for NaN values and False for non-NaN values.
- The print() function outputs the result, so you can see immediately where the NaN values sit in readings.
- This is useful for data cleaning and preprocessing, so that later analysis handles missing values appropriately.
Output:
[False False  True False  True]
Remove missing values:
clean = readings[~np.isnan(readings)]
print(clean)
Explanation
- The ~np.isnan(readings) expression creates a boolean mask that identifies the non-NaN values in readings.
- The readings[...] syntax uses this mask to select only the elements that are not NaN.
- The result is stored in the variable clean, which contains only valid numerical readings.
- The print(clean) statement outputs the cleaned array for verification.
Replace missing values with zero:
filled = np.nan_to_num(readings, nan=0.0)
print(filled)
Explanation
- np.nan_to_num() replaces the NaN values in readings with a specified number, which is 0.0 in this case.
- The result is stored in the variable filled, which contains the original data with NaNs replaced by zeros.
- The print() function outputs the new array; readings itself is left unchanged.
- This approach is useful for data preprocessing, especially where NaN values would disrupt calculations.
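A common mistake is to test for missing values with ==. NaN never compares equal to anything, including itself, so the comparison finds nothing. The sketch below contrasts the two approaches on the readings array from this section.

```python
import numpy as np

readings = np.array([12.5, 14.2, np.nan, 13.8, np.nan])

wrong_mask = readings == np.nan    # always False: NaN is not equal to NaN
right_mask = np.isnan(readings)

print(wrong_mask.sum())   # 0 missing values found
print(right_mask.sum())   # 2 missing values found
```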
18. Filling Missing Values With the Mean
Replacing missing values with zero is not always a good choice. Sometimes the mean is better.
readings = np.array([12.5, 14.2, np.nan, 13.8, np.nan])
mean_value = np.nanmean(readings)
filled = np.where(np.isnan(readings), mean_value, readings)
print(filled)
Explanation
- The code initializes a NumPy array readings containing some numerical values and NaN entries.
- It uses np.nanmean() to compute the mean of the array while ignoring any NaN values.
- The np.where() function picks mean_value wherever readings is NaN and the original value everywhere else, producing a new array filled.
- Finally, the filled array is printed, showing the original values with NaNs replaced by the mean.
np.nanmean() ignores missing values while calculating the mean.
For 2D arrays, you can fill by column:
data = np.array([
    [10.0, 2.0, np.nan],
    [12.0, np.nan, 5.0],
    [14.0, 4.0, 7.0],
])
column_means = np.nanmean(data, axis=0)
missing_mask = np.isnan(data)
filled = data.copy()
# np.where(missing_mask)[1] holds the column index of every missing value
filled[missing_mask] = np.take(column_means, np.where(missing_mask)[1])
print(filled)
Explanation
- The code initializes a 2D NumPy array containing some NaN values.
- It calculates the mean of each column while ignoring NaN values using np.nanmean().
- A boolean mask is created to identify the positions of NaN values in the array.
- A copy of the original array is made, and the NaN values are replaced with the corresponding column means.
- Finally, the filled array is printed.
This fills each missing value using the mean of its column.
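The same column-wise fill can also be written with np.where and broadcasting, which some readers find easier to follow than np.take. This is an alternative sketch, not a replacement for the version above:

```python
import numpy as np

data = np.array([
    [10.0, 2.0, np.nan],
    [12.0, np.nan, 5.0],
    [14.0, 4.0, 7.0],
])

column_means = np.nanmean(data, axis=0)   # shape (3,): one mean per column

# column_means broadcasts across the rows, so each NaN picks up the mean
# of the column it sits in
filled = np.where(np.isnan(data), column_means, data)
print(filled)
```

Because column_means has shape (3,) and data has shape (3, 3), the broadcasting rules line the means up with the columns automatically.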
19. Finding Nearest Values
To find the array value nearest to a target number, compare distances.
values = np.array([8, 14, 19, 27, 35, 42])
target = 25
distance = np.abs(values - target)
nearest_index = distance.argmin()
print(values[nearest_index])
Explanation
- Initializes a NumPy array values containing a set of integers.
- Defines a target integer whose nearest value in the array will be found.
- Calculates the absolute distance between each element in values and the target.
- Identifies the index of the smallest distance using argmin(), which points to the nearest value. If two values are equally close, argmin() returns the first one.
- Prints the value from values that is closest to target.
Output:
27
This is a common pattern:
array[np.abs(array - target).argmin()]
Explanation
- np.abs computes the absolute difference between each element in the array and the target value.
- The argmin method returns the index of the smallest difference, which locates the closest value.
- The final expression indexes the array at that position, returning the closest value.
- The approach is vectorized, so it needs no explicit Python loop and relies on NumPy's optimized operations.
20. Element-Wise Maximum With np.where()
Suppose you have two arrays with the same shape:
model_a = np.array([72, 88, 91, 64])
model_b = np.array([75, 80, 93, 70])
Explanation
- The snippet assumes NumPy has been imported with import numpy as np (not shown here).
- model_a is created as a NumPy array containing the scores [72, 88, 91, 64].
- model_b is created as a NumPy array containing the scores [75, 80, 93, 70].
- These arrays can be used for numerical operations such as statistical analysis or model comparison.
- NumPy keeps such operations efficient even on large datasets.
Choose the larger value at each position:
best = np.where(model_a >= model_b, model_a, model_b)
print(best)
Explanation
- Uses NumPy's where function to compare two arrays, model_a and model_b.
- For each element, it checks whether the value in model_a is greater than or equal to the corresponding value in model_b.
- If true, it selects the value from model_a; otherwise, it selects the value from model_b.
- The result is stored in the variable best, which contains the larger of the two values at each position.
- Finally, it prints the best array.
Output:
[75 88 93 70]
np.where(condition, value_if_true, value_if_false) is extremely useful for conditional array logic.
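For the specific case of an element-wise maximum, NumPy also offers np.maximum, which gives the same result in one call. np.where remains the more general tool, for example when you want a label instead of the value. A short sketch with the same two arrays:

```python
import numpy as np

model_a = np.array([72, 88, 91, 64])
model_b = np.array([75, 80, 93, 70])

best = np.maximum(model_a, model_b)   # element-wise maximum
print(best)  # [75 88 93 70]

# np.where can return labels rather than the values themselves
winner = np.where(model_a >= model_b, "a", "b")
print(winner)
```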
21. Repeating and Tiling Values
np.repeat() repeats each element.
items = np.array([1, 2, 3])
print(np.repeat(items, 3))
Explanation
- The code creates a NumPy array named items containing the integers 1, 2, and 3 (NumPy is assumed to be imported as np).
- np.repeat() is called with items as the first argument and 3 as the second, so each element is repeated three times.
- The result is a new array with each original element repeated consecutively.
Output:
[1 1 1 2 2 2 3 3 3]
np.tile() repeats the whole array.
print(np.tile(items, 3))
Explanation
- The np.tile function constructs an array by repeating the input array.
- In this case, items is the input array that will be repeated.
- The number 3 specifies that the array should be repeated three times.
- The result is a new array that contains the elements of items concatenated three times in sequence.
- This is useful for creating larger datasets or for preparing data for operations that require repeated patterns.
Output:
[1 2 3 1 2 3 1 2 3]
Combine both:
pattern = np.hstack([np.repeat(items, 3), np.tile(items, 3)])
print(pattern)
Explanation
- np.repeat(items, 3) creates a new array by repeating each element in items three times.
- np.tile(items, 3) constructs an array by repeating the entire items array three times.
- np.hstack([...]) horizontally stacks the two resulting arrays into a single array.
- print(pattern) displays the combined pattern of repeated and tiled elements.
Output:
[1 1 1 2 2 2 3 3 3 1 2 3 1 2 3 1 2 3]
22. Arrays for Plotting
NumPy is often used to generate x and y values for plots.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-5, 5, 200)
y = x ** 2
plt.plot(x, y)
plt.title("y = x^2")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Explanation
- Imports NumPy for numerical operations and Matplotlib for plotting.
- Generates 200 evenly spaced values between -5 and 5 for the x-axis using np.linspace().
- Calculates the corresponding y values by squaring each x value (y = x ** 2).
- Plots the quadratic function with labeled axes and a title.
- Displays the plot using plt.show(), so you can see the parabolic curve.
For a sine wave:
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x)
plt.plot(x, y)
plt.title("Sine wave")
plt.show()
Explanation
- The np.linspace function creates an array of 200 evenly spaced values between 0 and 2π.
- The np.sin function computes the sine of each value in the array, giving the y-coordinates of the sine wave.
- The plt.plot function plots the x and y values.
- The plt.title function sets the title of the plot to "Sine wave".
- Finally, plt.show displays the plot in a window.
For sigmoid:
x = np.linspace(-10, 10, 300)
y = 1 / (1 + np.exp(-x))
plt.plot(x, y)
plt.title("Sigmoid curve")
plt.show()
Explanation
- The code generates 300 evenly spaced values between -10 and 10 using NumPy's linspace function and stores them in the variable x.
- It calculates the sigmoid value for each x using the formula 1 / (1 + np.exp(-x)), storing the results in the variable y.
- The plt.plot function from Matplotlib draws the sigmoid curve by plotting x against y.
- A title "Sigmoid curve" is added to the plot using plt.title.
- Finally, plt.show() displays the generated plot in a window.
The plotting library draws the chart, but NumPy creates the numerical data.
23. Practice Exercises
Try these before reading the solutions.
Practice Lab
Exercise 1: Compare memory
Create three arrays with one million values:
- default integer dtype
- int32
- int16
Print each array's dtype and nbytes.
Practice Lab
Exercise 2: Select rows and columns
Create a 6 by 5 array from 0 to 29. Select rows 0, 2, and 5, then select columns 1 and 4.
Practice Lab
Exercise 3: Filter values
Create an array from 1 to 50. Return values that are divisible by 4 but not divisible by 8.
Practice Lab
Exercise 4: Replace multiples
Create an array from 1 to 20. Replace values divisible by 3 or 5 with 0.
Practice Lab
Exercise 5: Row-wise bonus
Create a 3 by 4 score array. Add a different bonus to each row using broadcasting.
Practice Lab
Exercise 6: Column centering
Create a 4 by 3 array. Subtract the mean of each column from that column.
Practice Lab
Exercise 7: Fill missing values
Create a 2D array with np.nan values. Replace missing values with the column mean.
Practice Lab
Exercise 8: Nearest element
Given an array and a target number, find the nearest value.
Practice Lab
Exercise 9: Cauchy-style matrix
Given:
x = np.array([1, 2, 4])
y = np.array([6, 8, 10, 12])
Explanation
- Initializes a NumPy array x containing three integers: 1, 2, and 4.
- Initializes a second NumPy array y containing four integers: 6, 8, 10, and 12.
- These arrays can be used for various mathematical operations and data analysis tasks.
- NumPy provides support for large, multi-dimensional arrays and matrices.
Create a matrix where each value is:
1 / (x_i - y_j)
Practice Lab
Exercise 10: Plot tanh
Generate x values from -6 to 6 and plot:
tanh(x)
Use either np.tanh(x) or the formula with np.exp().
24. Practice Solutions
Solution Key
Solution 1: Compare memory
arrays = [
    np.zeros(1_000_000, dtype=int),
    np.zeros(1_000_000, dtype=np.int32),
    np.zeros(1_000_000, dtype=np.int16),
]
for arr in arrays:
    print(arr.dtype, arr.nbytes)
Explanation
- Creates a list of three NumPy arrays with different data types: the default integer (int64 on most 64-bit platforms), int32, and int16.
- Each array contains one million zeros created by np.zeros(). Zeros are used instead of np.arange() because the values 0 to 999,999 do not fit in int16, which tops out at 32,767.
- Iterates through the list of arrays, printing the data type (dtype) and the memory size in bytes (nbytes) for each array.
- Highlights the impact of data type selection on memory consumption in NumPy arrays.
Solution Key
Solution 2: Select rows and columns
data = np.arange(30).reshape(6, 5)
result = data[[0, 2, 5]][:, [1, 4]]
print(result)
Explanation
- The code creates a 2D NumPy array data with values from 0 to 29, reshaped into 6 rows and 5 columns.
- It selects rows 0, 2, and 5 from data using fancy (advanced) indexing.
- From the selected rows, it extracts columns 1 and 4, resulting in a new array result.
- Finally, result is printed, showing the specified rows and columns.
Solution Key
Solution 3: Filter values
values = np.arange(1, 51)
result = values[(values % 4 == 0) & ~(values % 8 == 0)]
print(result)
Explanation
- The np.arange(1, 51) call generates an array of integers from 1 to 50.
- The condition (values % 4 == 0) checks for numbers that are divisible by 4.
- The condition ~(values % 8 == 0) negates the check for numbers that are divisible by 8.
- Combining them with & keeps only the numbers that meet both criteria: 4, 12, 20, 28, 36, and 44.
- Finally, print(result) outputs the filtered array.
Solution Key
Solution 4: Replace multiples
values = np.arange(1, 21)
values[(values % 3 == 0) | (values % 5 == 0)] = 0
print(values)
Explanation
- The code initializes an array of integers from 1 to 20 using NumPy's arange function.
- It uses a boolean mask to identify elements in the array that are multiples of 3 or 5.
- The identified elements are set to zero in place, so the original array is modified.
- Finally, the modified array is printed, showing the changes made.
Solution Key
Solution 5: Row-wise bonus
scores = np.array([
    [70, 75, 80, 85],
    [60, 65, 70, 75],
    [88, 90, 92, 94],
])
bonus = np.array([[5], [10], [2]])
print(scores + bonus)
Explanation
- The code initializes a 2D NumPy array scores with three rows and four columns (for example, three students across four subjects).
- A second 2D NumPy array bonus holds one bonus per row, structured as a column of shape (3, 1).
- The addition scores + bonus uses broadcasting, so each bonus is added to every value in its corresponding row of scores.
- The result is printed, showing each row's original scores increased by that row's bonus.
Solution Key
Solution 6: Column centering
data = np.array([
    [10, 20, 30],
    [12, 18, 33],
    [14, 22, 36],
    [16, 24, 39],
])
column_means = data.mean(axis=0)
centered = data - column_means
print(centered)
Explanation
- The code initializes a 2D NumPy array named data with four rows and three columns.
- It calculates the mean of each column using data.mean(axis=0), resulting in a 1D array of column means.
- The array is centered by subtracting the corresponding column mean from each element in that column; broadcasting lines the (3,) means up with the (4, 3) data.
- The centered array is printed, showing how each value has been adjusted relative to its column mean.
Solution Key
Solution 7: Fill missing values
data = np.array([
    [10.0, np.nan, 30.0],
    [12.0, 22.0, np.nan],
    [14.0, 24.0, 36.0],
])
column_means = np.nanmean(data, axis=0)
mask = np.isnan(data)
filled = data.copy()
filled[mask] = np.take(column_means, np.where(mask)[1])
print(filled)
Explanation
- The code initializes a 2D NumPy array containing some NaN (Not a Number) values.
- It calculates the mean of each column while ignoring NaN values using np.nanmean().
- A mask is created to identify the positions of NaN values in the original array.
- A copy of the original array is made, and the NaN values are replaced with the corresponding column means.
- Finally, the filled array is printed, showing the imputed data.
Solution Key
Solution 8: Nearest element
values = np.array([11, 18, 26, 33, 47])
target = 29
nearest = values[np.abs(values - target).argmin()]
print(nearest)
Explanation
- Initializes a NumPy array values containing a set of integers.
- Defines a target integer whose nearest value in the array will be found.
- Calculates the absolute difference between each element in values and the target, then finds the index of the minimum difference using argmin().
- Uses this index to retrieve the nearest value from the original array.
- Prints the nearest value to the console.
Solution Key
Solution 9: Cauchy-style matrix
x = np.array([1, 2, 4]).reshape(-1, 1)
y = np.array([6, 8, 10, 12])
matrix = 1 / (x - y)
print(matrix)
Explanation
- The variable x is reshaped into a 3x1 NumPy array so that it has a single column.
- The variable y is a 1D NumPy array containing four elements.
- The expression x - y uses broadcasting to compute the pairwise differences between each element in x and y, resulting in a 3x4 matrix.
- The final output, matrix, contains the reciprocal of each difference, which is printed to the console.
- Matrices with entries of the form 1 / (x_i - y_j) are known as Cauchy matrices.
Broadcasting handles the shape difference:
x shape = (3, 1)
y shape = (4,)
result shape = (3, 4)
Solution Key
Solution 10: Plot tanh
x = np.linspace(-6, 6, 300)
y = np.tanh(x)
plt.plot(x, y)
plt.title("tanh(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True)
plt.show()
Explanation
- Generates 300 evenly spaced values between -6 and 6 using NumPy's linspace function.
- Computes the hyperbolic tangent of each value in the array x with NumPy's tanh function, storing the results in y.
- Plots the x values against the y values with Matplotlib's plot function.
- Sets the title of the plot to "tanh(x)" and labels the x-axis and y-axis.
- Enables a grid for better readability and displays the plot with show().
Using the formula:
y = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
Explanation
- The code uses NumPy's exp function to evaluate the exponentials on the whole array at once.
- It computes the hyperbolic tangent (tanh) of x by applying the formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
- Every value in the result y lies between -1 and 1.
- tanh is commonly used as an activation function in machine learning and neural networks.
np.tanh(x) is preferred because it is shorter, clearer, and numerically safer: for inputs with a large absolute value the formula overflows inside np.exp(), while np.tanh() returns values close to -1 or 1.
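The numerical-safety point can be demonstrated with a deliberately large input. In the sketch below, np.exp(1000.0) overflows to inf, so the formula ends up computing inf / inf, which is nan, while np.tanh handles the same input without trouble. np.errstate is used only to silence the warnings.

```python
import numpy as np

x = np.array([0.0, 1.0, 1000.0])

# Silence the overflow/invalid warnings so the comparison prints cleanly
with np.errstate(over="ignore", invalid="ignore"):
    by_formula = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

by_tanh = np.tanh(x)

print(by_formula)  # last entry is nan
print(by_tanh)     # last entry is 1.0
```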
25. Mini Project: Clean and Score Sensor Readings
Suppose you receive sensor readings from 4 devices across 5 time points.
readings = np.array([
    [12.0, 13.5, np.nan, 15.0, 14.5],
    [9.0, np.nan, 11.5, 12.0, 13.0],
    [20.0, 19.5, 21.0, np.nan, 22.0],
    [7.5, 8.0, 8.5, 9.0, np.nan],
])
Explanation
- Initializes a 2D NumPy array named readings to store the sensor data.
- Contains floating-point readings, with some values set to np.nan to indicate missing data.
- The array has four rows (one per device) and five columns (one per time point).
- This layout is typical of data analysis tasks where missing values must be handled, such as scientific research or data preprocessing.
Tasks:
- fill missing values with each device's row mean
- calculate each device's average reading
- mark readings above the device average
- normalize each row between 0 and 1
Solution:
row_means = np.nanmean(readings, axis=1, keepdims=True)
missing = np.isnan(readings)
filled = readings.copy()
filled[missing] = np.take(row_means.ravel(), np.where(missing)[0])
device_average = filled.mean(axis=1, keepdims=True)
above_average = filled > device_average
row_min = filled.min(axis=1, keepdims=True)
row_max = filled.max(axis=1, keepdims=True)
normalized = (filled - row_min) / (row_max - row_min)
print("Filled readings:")
print(filled)
print("Device averages:")
print(device_average.ravel())
print("Above average mask:")
print(above_average)
print("Normalized readings:")
print(normalized)
Explanation
- Computes the mean of each row in the readings array while ignoring NaN values, storing the result in row_means.
- Identifies missing values in the readings array and creates a copy in which these entries are filled with the corresponding row means.
- Calculates the average of the filled readings for each device and creates a boolean mask indicating which readings are above that average.
- Normalizes the filled readings by scaling them between 0 and 1 based on the minimum and maximum values of each row.
- Outputs the filled readings, device averages, above-average mask, and normalized readings to the console for review.
This mini project uses:
- np.nanmean
- boolean masks
- broadcasting
- row-wise operations
- normalization
These are core skills for real data cleaning.
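One edge case the mini project does not cover: if a device reports the same value at every time point, row_max - row_min is 0 and the division produces nan. A possible guard, shown here on a made-up array named flat_example, replaces a zero range with 1 so that constant rows normalize to 0. This is one reasonable convention, not the only one.

```python
import numpy as np

flat_example = np.array([
    [5.0, 5.0, 5.0],     # constant row: max - min is 0
    [1.0, 2.0, 3.0],
])

row_min = flat_example.min(axis=1, keepdims=True)
row_max = flat_example.max(axis=1, keepdims=True)
span = row_max - row_min

safe_span = np.where(span == 0, 1.0, span)   # avoid dividing by zero
normalized = (flat_example - row_min) / safe_span
print(normalized)
```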
26. Quick Quiz
1. Why is NumPy often faster than Python loops?
Because NumPy stores numerical data compactly and runs many operations in optimized lower-level code.
2. What is fancy indexing?
Fancy indexing means selecting array values using lists or arrays of indexes.
3. Which operators should you use for NumPy boolean masks?
Use &, |, and ~, with each condition wrapped in parentheses.
4. What makes two dimensions broadcast-compatible?
They are compatible if the sizes are equal or one of the sizes is 1.
5. Why is np.nanmean() useful?
It calculates the mean while ignoring np.nan values.
Final Takeaway
Advanced NumPy is mostly about thinking in arrays instead of loops.
The key habits are:
- check shapes before operations
- use boolean masks for filtering
- use fancy indexing for specific rows or columns
- understand broadcasting before writing repeated loops
- use vectorized formulas for mathematical work
- handle np.nan values deliberately
When your code feels complicated, print the shape. Most NumPy confusion becomes easier once you know the shape of every array involved.
Sources and Further Reading
- NumPy indexing guide: https://numpy.org/doc/stable/user/basics.indexing.html
- NumPy broadcasting guide: https://numpy.org/doc/stable/user/basics.broadcasting.html
- NumPy dtype basics: https://numpy.org/doc/stable/user/basics.types.html
- NumPy missing value helpers: https://numpy.org/doc/stable/reference/generated/numpy.isnan.html
- Matplotlib pyplot guide: https://matplotlib.org/stable/tutorials/pyplot.html
