
Advanced NumPy: Indexing, Broadcasting & Missing Values

May 25, 2026
24 min read

Quick Answer

Go beyond NumPy basics with performance comparisons, memory-aware dtypes, fancy indexing, boolean masks, broadcasting rules, vectorized formulas, missing value handling, plotting-ready arrays, and practice problems.

Quick Summary

Explore advanced NumPy techniques like fancy indexing, broadcasting, and handling missing values to enhance your data analysis skills.

Advanced NumPy: Fancy Indexing, Broadcasting, Missing Values, and Vectorized Math

Once you know how to create arrays, check shapes, slice data, and run simple operations, the next step is learning how NumPy helps you write compact, fast, data-focused code.

This lesson covers the ideas that make NumPy feel powerful:

  • vectorized operations
  • memory-friendly data types
  • fancy indexing
  • boolean filtering
  • broadcasting
  • mathematical formulas on whole arrays
  • missing value handling
  • arrays for plotting

These are not just academic features. They appear in real data cleaning, feature engineering, machine learning, analytics dashboards, simulations, and scientific computing.

What you will learn

By the end, you should be able to:

  • explain why NumPy is usually faster than Python loops for numerical work
  • choose smaller dtypes when memory matters
  • select rows and columns with fancy indexing
  • filter arrays with boolean masks
  • combine multiple mask conditions correctly
  • understand NumPy broadcasting rules
  • write vectorized versions of formulas such as sigmoid and mean squared error
  • detect and replace missing values
  • generate arrays for plotting mathematical functions
  • solve practical intermediate NumPy exercises

1. Why NumPy Can Be Faster Than Python Lists

Python lists are flexible, but flexibility has a cost. A list can hold many different object types, so Python must manage references to separate objects.

NumPy arrays are more specialized. A NumPy array usually stores values of the same type in a compact memory layout. That makes numerical operations easier to optimize.

Here is a small comparison:

python
import time
import numpy as np

size = 1_000_000

list_a = list(range(size))
list_b = list(range(size, size * 2))

start = time.perf_counter()
list_result = [x + y for x, y in zip(list_a, list_b)]
print("List time:", time.perf_counter() - start)

array_a = np.arange(size)
array_b = np.arange(size, size * 2)

start = time.perf_counter()
array_result = array_a + array_b
print("NumPy time:", time.perf_counter() - start)

The exact time depends on your machine, but the NumPy version is usually much faster for large numerical arrays.
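
A single perf_counter measurement can be noisy. As a rough sketch, the same comparison can be repeated with the standard library's timeit module, keeping the best of several runs. The helper names add_lists and add_arrays and the smaller size are assumptions made for this illustration, not part of the original example.

```python
import timeit
import numpy as np

size = 100_000  # smaller than the example above so repeated runs stay quick

list_a = list(range(size))
list_b = list(range(size, size * 2))
array_a = np.arange(size)
array_b = np.arange(size, size * 2)

def add_lists():
    # Pure-Python element-by-element addition
    return [x + y for x, y in zip(list_a, list_b)]

def add_arrays():
    # Vectorized addition
    return array_a + array_b

# Each measurement runs the function 10 times; keep the fastest of 5 measurements.
list_time = min(timeit.repeat(add_lists, number=10, repeat=5))
numpy_time = min(timeit.repeat(add_arrays, number=10, repeat=5))

print(f"List:  {list_time:.5f} s for 10 runs")
print(f"NumPy: {numpy_time:.5f} s for 10 runs")
```

Taking the minimum is a common convention because slower runs usually reflect interference from other processes rather than the code itself.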

The important idea is not just speed. It is also readability:

python
array_result = array_a + array_b

That line clearly says: add the arrays element by element.

2. Memory and dtype Choices

The dtype controls how much memory each value needs.

python
import numpy as np

large_default = np.arange(1_000_000)
small_int = np.arange(1_000_000, dtype=np.int16)

print(large_default.dtype, large_default.nbytes)
print(small_int.dtype, small_int.nbytes)

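nbytes tells you how many bytes the array data uses. The default integer dtype is int64 on most platforms, so expect about 8,000,000 bytes against 2,000,000 for int16.

Smaller dtypes can save memory, but they also have smaller value ranges. int16 tops out at 32,767, so small_int cannot actually hold every value from 0 to 999,999 (the larger ones wrap around). Treat this example as a memory comparison only.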

python
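tiny = np.array([120, 125, 130]).astype(np.int8)
print(tiny)

int8 can store values from -128 to 127. The value 130 does not fit, so the cast silently wraps it around to -126. (Passing the out-of-range list directly with dtype=np.int8 raises an OverflowError in NumPy 2.x.) Either way, you should not choose tiny dtypes blindly.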

Use smaller dtypes when:

  • you know the value range
  • the dataset is large
  • memory pressure matters
  • precision loss is acceptable

For most beginner work, default integer and float types are fine.
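
One way to act on "you know the value range" is to compare the data against the limits reported by np.iinfo before downcasting. The sketch below assumes integer data; fits_in is a hypothetical helper name introduced only for this illustration.

```python
import numpy as np

def fits_in(values, dtype):
    """Return True if every integer in values fits in the integer dtype."""
    values = np.asarray(values)
    info = np.iinfo(dtype)  # min/max limits of an integer dtype
    return bool(values.min() >= info.min and values.max() <= info.max)

print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127
print(fits_in([120, 125, 130], np.int8))   # False
print(fits_in([120, 125, 130], np.int16))  # True
```

For floating-point dtypes, np.finfo plays the same role as np.iinfo.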

3. Normal Slicing Review

Before fancy indexing, remember normal slicing.

python
data = np.arange(20).reshape(5, 4)
print(data)

Output:

text
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]

Get rows 1 to 3:

python
print(data[1:4])

Get columns 0 and 1 for all rows:

python
print(data[:, 0:2])

Normal slicing works with contiguous or evenly stepped ranges. Fancy indexing is useful when you want specific positions in any order.

4. Fancy Indexing Rows

Fancy indexing lets you pass a list of indexes.

python
data = np.arange(20).reshape(5, 4)

selected_rows = data[[0, 2, 4]]
print(selected_rows)

Output:

text
[[ 0  1  2  3]
 [ 8  9 10 11]
 [16 17 18 19]]

This selected row 0, row 2, and row 4.

You can also change the order:

python
print(data[[4, 0, 1]])

Fancy indexing returns a copy, not a simple slice view. That distinction matters when you start modifying results.
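
The difference between a view and a copy can be checked directly. This short sketch uses np.shares_memory to compare a basic slice with a fancy-indexed selection of the same rows; the variable names sliced and fancy are chosen just for this example.

```python
import numpy as np

data = np.arange(20).reshape(5, 4)

sliced = data[1:3]      # basic slice: a view onto data
fancy = data[[1, 2]]    # fancy index: an independent copy

print(np.shares_memory(data, sliced))  # True
print(np.shares_memory(data, fancy))   # False

fancy[0, 0] = -1    # changes only the copy
sliced[0, 0] = 99   # writes through to data

print(data[1, 0])   # 99
```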

5. Fancy Indexing Columns

To pick specific columns, use : for all rows and a list for columns.

python
data = np.arange(20).reshape(5, 4)

selected_columns = data[:, [0, 3]]
print(selected_columns)

Output:

text
[[ 0  3]
 [ 4  7]
 [ 8 11]
 [12 15]
 [16 19]]

This is useful when you want selected features from a table-like array.

6. Selecting Specific Rows and Columns Together

If you want rows first and then columns, write it in two steps:

python
data = np.arange(20).reshape(5, 4)

rows = data[[1, 3, 4]]
result = rows[:, [0, 2]]

print(result)

Output:

text
[[ 4  6]
 [12 14]
 [16 18]]

This is easier to read than trying to do everything in one expression.

Another compact version:

python
result = data[[1, 3, 4]][:, [0, 2]]
print(result)

Use the two-step version when teaching, debugging, or reviewing code.
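
If you prefer a single indexing step, np.ix_ is one option. The sketch below shows it producing the same result as the two-step version, and also shows why passing both lists directly does not work here.

```python
import numpy as np

data = np.arange(20).reshape(5, 4)

# np.ix_ builds index arrays of shape (3, 1) and (1, 2) that broadcast
# into every row/column combination.
result = data[np.ix_([1, 3, 4], [0, 2])]
print(result)

# Passing both lists directly pairs them element by element instead,
# and lists of different lengths cannot be paired.
try:
    data[[1, 3, 4], [0, 2]]
except IndexError as error:
    print("IndexError:", error)
```

With two lists of equal length, the direct form would run but return individual elements such as data[1, 0] and data[3, 2], not a sub-table.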

7. Updating Values With Fancy Indexing

You can update selected values.

python
scores = np.array([50, 60, 70, 80, 90])
indexes = [1, 3]

scores[indexes] += 5

print(scores)

Output:

text
[50 65 70 85 90]

This is useful when selected positions need special treatment.
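
One caveat worth knowing: when the same index appears more than once, an in-place update through fancy indexing is applied only once per position. The sketch below contrasts that with np.add.at, which accumulates every occurrence. The index list with a repeated entry is an assumption added for illustration.

```python
import numpy as np

scores = np.array([50, 60, 70, 80, 90])
repeated = [1, 1, 3]  # index 1 appears twice

buffered = scores.copy()
buffered[repeated] += 5              # index 1 is only increased once
print(buffered)                      # [50 65 70 85 90]

accumulated = scores.copy()
np.add.at(accumulated, repeated, 5)  # index 1 is increased twice
print(accumulated)                   # [50 70 70 85 90]
```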

8. Boolean Indexing

Boolean indexing uses a condition to filter values.

python
marks = np.array([35, 62, 88, 49, 73])

passed = marks[marks >= 50]
print(passed)

Output:

text
[62 88 73]

The expression marks >= 50 creates a boolean mask:

python
print(marks >= 50)

Output:

text
[False  True  True False  True]

NumPy returns only the values where the mask is True.

9. Combining Boolean Conditions

Use:

  • & for and
  • | for or
  • ~ for not

Each condition should be wrapped in parentheses, because &, |, and ~ bind more tightly than comparison operators such as > and ==.

python
values = np.array([12, 17, 24, 33, 40, 55, 72])

result = values[(values > 20) & (values % 2 == 0)]
print(result)

Output:

text
[24 40 72]

Values greater than 20 or divisible by 5:

python
result = values[(values > 20) | (values % 5 == 0)]
print(result)  # [24 33 40 55 72]

Values not divisible by 3:

python
result = values[~(values % 3 == 0)]
print(result)  # [17 40 55]

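Avoid Python's and, or, and not for NumPy array masks. They try to reduce a whole array to a single True or False, which raises a ValueError for arrays with more than one element. Use &, |, and ~.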
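
To see what goes wrong, here is a small sketch that tries the Python keyword first and then the element-wise operator. The raised flag is only there to record that the error occurred.

```python
import numpy as np

values = np.array([12, 17, 24, 33, 40, 55, 72])

raised = False
try:
    # "and" asks each mask for a single truth value, which is ambiguous.
    result = values[(values > 20) and (values % 2 == 0)]
except ValueError as error:
    raised = True
    print("ValueError:", error)

# The element-wise operator works as intended.
result = values[(values > 20) & (values % 2 == 0)]
print(result)  # [24 40 72]
```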

10. Boolean Indexing in 2D Arrays

Create a random table:

python
rng = np.random.default_rng(4)
table = rng.integers(1, 100, size=(4, 5))

print(table)

Find all values above 70:

python
high_values = table[table > 70]
print(high_values)

This returns a 1D array of matching values.

Replace all values below 20 with zero:

python
cleaned = table.copy()
cleaned[cleaned < 20] = 0

print(cleaned)

Using .copy() protects the original array.

11. Broadcasting: The Big Idea

Broadcasting is how NumPy performs operations between arrays with different but compatible shapes.

Example:

python
prices = np.array([
    [100, 200, 300],
    [150, 250, 350],
])

discount = np.array([10, 20, 30])

print(prices - discount)

Output:

text
[[ 90 180 270]
 [140 230 320]]

The discount array has shape (3,). NumPy treats it as if the same row of discounts applies to every row in prices.

That is broadcasting.

12. Broadcasting Rules

When NumPy compares two shapes, it checks dimensions from right to left.

Two dimensions are compatible if:

  • they are equal
  • one of them is 1

Example:

text
(4, 3)
(   3)

The second shape behaves like:

text
(1, 3)

Then NumPy stretches it to:

text
(4, 3)

So this works:

python
a = np.ones((4, 3))
b = np.array([10, 20, 30])

print(a + b)

This does not work:

python
a = np.ones((3, 4))
b = np.array([10, 20, 30])

print(a + b)

Why?

text
(3, 4)
(   3)

The last dimensions are 4 and 3. They are not equal, and neither is 1.
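
If the intent was to add one value per row, a possible fix is to reshape b into a column so that the shapes line up as (3, 4) and (3, 1). The sketch below catches the original error and then applies that reshape; b_column is a name introduced for this example.

```python
import numpy as np

a = np.ones((3, 4))
b = np.array([10, 20, 30])

try:
    a + b
except ValueError as error:
    print("ValueError:", error)

# Reshape b from (3,) to (3, 1) so it lines up with the rows of a.
b_column = b[:, np.newaxis]
result = a + b_column

print(b_column.shape)  # (3, 1)
print(result.shape)    # (3, 4)
```

b.reshape(-1, 1) produces the same column shape.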

13. Broadcasting With Columns

Sometimes you want to apply a different value to each row.

Use a column-shaped array:

python
scores = np.array([
    [80, 85, 90],
    [70, 75, 78],
    [88, 92, 95],
])

bonus = np.array([[5], [10], [2]])

print(scores + bonus)

Output:

text
[[85 90 95]
 [80 85 88]
 [90 94 97]]

bonus has shape (3, 1), so each row gets its own bonus.

14. Broadcasting to Create a Grid

Broadcasting can combine row and column arrays.

python
row = np.array([[1, 2, 3]])
column = np.array([[10], [20], [30], [40]])

grid = row + column

print(grid)

Output:

text
[[11 12 13]
 [21 22 23]
 [31 32 33]
 [41 42 43]]

Shapes:

text
row shape    = (1, 3)
column shape = (4, 1)
result shape = (4, 3)

This pattern is useful for distance matrices, pairwise comparisons, lookup grids, and feature engineering.

15. Checking Broadcast Compatibility

Here is a simple helper that checks whether two shapes are broadcast-compatible.

python
def can_broadcast(shape_a, shape_b):
    a = tuple(shape_a)
    b = tuple(shape_b)

    max_len = max(len(a), len(b))
    a = (1,) * (max_len - len(a)) + a
    b = (1,) * (max_len - len(b)) + b

    for dim_a, dim_b in zip(a, b):
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            return False

    return True

print(can_broadcast((4, 3), (3,)))
print(can_broadcast((3, 4), (3,)))
print(can_broadcast((5, 1, 7), (1, 3, 7)))

Output:

text
True
False
True

This helper mirrors the basic rule: dimensions must match, or one side must be 1.
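
The hand-written helper is useful for learning. In practice, NumPy can answer the same question itself through np.broadcast_shapes, assuming a version that includes it (it was added in NumPy 1.20). The wrapper name can_broadcast_np is hypothetical.

```python
import numpy as np

def can_broadcast_np(shape_a, shape_b):
    """Same question as the hand-written helper, answered by NumPy itself."""
    try:
        np.broadcast_shapes(shape_a, shape_b)
    except ValueError:
        return False
    return True

print(np.broadcast_shapes((5, 1, 7), (1, 3, 7)))  # (5, 3, 7)
print(can_broadcast_np((4, 3), (3,)))   # True
print(can_broadcast_np((3, 4), (3,)))   # False
```

Unlike a plain True or False, np.broadcast_shapes also tells you the shape of the result.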

16. Vectorized Mathematical Formulas

NumPy lets you write formulas almost the same way they appear mathematically.

Sigmoid

The sigmoid function is common in machine learning:

text
sigmoid(x) = 1 / (1 + exp(-x))

Vectorized version:

python
def sigmoid(x):
    x = np.asarray(x)
    return 1 / (1 + np.exp(-x))

values = np.array([-3, -1, 0, 1, 3])
print(sigmoid(values))

Explanation

  • The sigmoid function takes an input x, converts it to a NumPy array, and applies the sigmoid formula.
  • The formula 1 / (1 + np.exp(-x)) computes the sigmoid value, which is commonly used in machine learning for binary classification.
  • The values array contains a set of integers, which are passed to the sigmoid function.
  • The result of the sigmoid function is printed, showing how each input value is transformed to fall within the (0, 1) range.
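
For very negative inputs, np.exp(-x) in the direct formula overflows and NumPy emits a RuntimeWarning, even though the final result still rounds to 0. One common workaround, sketched here under the name stable_sigmoid (not part of the lesson's code), is to exponentiate only non-positive numbers.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that only ever exponentiates non-positive numbers."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))  # never larger than 1, so it cannot overflow
    # x >= 0: 1 / (1 + exp(-x));  x < 0: exp(x) / (1 + exp(x))
    return np.where(x >= 0, 1 / (1 + e), e / (1 + e))

values = np.array([-1000.0, -3.0, 0.0, 3.0, 1000.0])
print(stable_sigmoid(values))
```

Both branches are algebraically the same function; they differ only in which form is evaluated for each sign of x.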

Mean Squared Error

Mean squared error compares predictions with actual values.

python
def mean_squared_error(actual, predicted):
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)

    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same shape")

    return np.mean((actual - predicted) ** 2)

actual = np.array([10, 20, 30, 40])
predicted = np.array([12, 18, 33, 37])

print(mean_squared_error(actual, predicted))

Explanation

  • The function mean_squared_error computes the mean squared error (MSE) between two NumPy arrays: actual and predicted.
  • It first passes both inputs through np.asarray, so lists are converted to arrays and existing arrays are used as they are.
  • A shape check ensures that both arrays have the same dimensions; if not, a ValueError is raised.
  • The MSE is calculated by taking the average of the squared differences between the actual and predicted values.
  • The provided example demonstrates how to use the function with sample data and prints the resulting MSE.

The formula works on the whole array without writing a loop.

17. Missing Values With np.nan

Numerical datasets often have missing values. NumPy represents missing floating-point values with np.nan. Because np.nan is itself a float, an array that contains it has a floating-point dtype.

python
readings = np.array([12.5, 14.2, np.nan, 13.8, np.nan])

print(readings)

Explanation

  • The snippet assumes NumPy has already been imported as np.
  • It creates a NumPy array named readings containing five values, two of which are np.nan, representing missing data.
  • The print function outputs the contents of the readings array to the console, where the missing entries appear as nan.
  • This snippet shows what an array with missing values looks like before any cleaning.

Check missing values:

python
print(np.isnan(readings))

Explanation

  • Utilizes the np.isnan() function from the NumPy library to identify NaN (Not a Number) values.
  • The function returns a boolean array of the same shape as the input, with True for NaN values and False for non-NaN values.
  • The print() function outputs the result to the console, allowing for immediate inspection of the NaN presence in the readings array.
  • This is useful for data cleaning and preprocessing, ensuring that subsequent analyses handle missing values appropriately.

Output:

text
[False False  True False  True]
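
A plain equality test does not work for this, because NaN is defined to be unequal to everything, including itself. The short sketch below shows the difference and counts the missing entries.

```python
import numpy as np

readings = np.array([12.5, 14.2, np.nan, 13.8, np.nan])

print(np.nan == np.nan)          # False: NaN never equals anything
print(readings == np.nan)        # all False, so this finds nothing
print(np.isnan(readings))        # the reliable check
print(np.isnan(readings).sum())  # 2 missing values
```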

Remove missing values:

python
clean = readings[~np.isnan(readings)]
print(clean)

Explanation

  • The ~np.isnan(readings) expression creates a boolean mask that identifies non-NaN values in the readings array.
  • The readings[...] syntax uses this mask to select only the elements that are not NaN, effectively cleaning the data.
  • The result is stored in the variable clean, which contains only valid numerical readings.
  • The print(clean) statement outputs the cleaned array to the console for verification.

Replace missing values with zero:

python
filled = np.nan_to_num(readings, nan=0.0)
print(filled)

Explanation

  • The np.nan_to_num() function converts NaN (Not a Number) values in the readings array to a specified numerical value, which is 0.0 in this case. By default it also replaces positive and negative infinity with very large finite numbers.
  • The result is stored in the variable filled, which will contain the original data with NaNs replaced by zeros.
  • The print() function outputs the modified array, allowing users to see the changes made to the original data.
  • This approach is useful for data preprocessing, especially in scenarios where NaN values can disrupt calculations or analyses.

18. Filling Missing Values With the Mean

Replacing missing values with zero is not always a good choice. Sometimes the mean is better.

python
readings = np.array([12.5, 14.2, np.nan, 13.8, np.nan])

mean_value = np.nanmean(readings)
filled = np.where(np.isnan(readings), mean_value, readings)

print(filled)

Explanation

  • The code initializes a NumPy array readings containing some numerical values and NaN entries.
  • It uses np.nanmean() to compute the mean of the array while ignoring any NaN values.
  • The np.where() function replaces NaN values in the original array with the calculated mean, resulting in a new array filled.
  • Finally, the filled array is printed, showing the original values with NaNs replaced by the mean.

np.nanmean() ignores missing values while calculating the mean.

For 2D arrays, you can fill by column:

python
data = np.array([
    [10.0, 2.0, np.nan],
    [12.0, np.nan, 5.0],
    [14.0, 4.0, 7.0],
])

column_means = np.nanmean(data, axis=0)
missing_mask = np.isnan(data)

filled = data.copy()
filled[missing_mask] = np.take(column_means, np.where(missing_mask)[1])

print(filled)

Explanation

  • The code initializes a 2D NumPy array containing some NaN values.
  • It calculates the mean of each column while ignoring NaN values using np.nanmean().
  • A boolean mask is created to identify the positions of NaN values in the array.
  • A copy of the original array is made, and the NaN values are replaced with the corresponding column means.
  • Finally, the modified array with filled values is printed to the console.

This fills each missing value using the mean of its column.
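
The same fill can also be written with np.where and broadcasting, which some readers may find easier to follow than np.take. This is an alternative sketch on the same data, not a replacement for the approach above.

```python
import numpy as np

data = np.array([
    [10.0, 2.0, np.nan],
    [12.0, np.nan, 5.0],
    [14.0, 4.0, 7.0],
])

column_means = np.nanmean(data, axis=0)  # shape (3,)

# column_means broadcasts across the rows, so each NaN picks up
# the mean of the column it sits in.
filled = np.where(np.isnan(data), column_means, data)
print(filled)
```

np.where returns a new array, so data itself is left unchanged.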

19. Finding Nearest Values

To find the array value nearest to a target number, compare distances.

python
values = np.array([8, 14, 19, 27, 35, 42])
target = 25

distance = np.abs(values - target)
nearest_index = distance.argmin()

print(values[nearest_index])

Explanation

  • Initializes a NumPy array values containing a set of integers.
  • Defines a target integer to which the nearest value in the array will be found.
  • Calculates the absolute distance between each element in values and the target.
  • Identifies the index of the smallest distance using argmin(), which indicates the nearest value.
  • Prints the value from the values array that is closest to the specified target.

Output:

text
27

This is a common pattern:

python
array[np.abs(array - target).argmin()]

Explanation

  • Utilizes NumPy's abs function to compute the absolute difference between each element in the array and a specified target value.
  • The argmin method identifies the index of the smallest difference, effectively locating the closest value to the target.
  • The final expression accesses the element in the array at the index found, returning the closest value.
  • This approach is efficient for finding proximity in numerical datasets, leveraging NumPy's optimized operations.
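
The pattern extends to several targets at once by combining it with broadcasting. In this sketch the targets array is an assumption added for illustration.

```python
import numpy as np

values = np.array([8, 14, 19, 27, 35, 42])
targets = np.array([10, 25, 40])

# values[:, np.newaxis] has shape (6, 1); targets has shape (3,).
# Broadcasting yields a (6, 3) table of distances, one column per target.
distance = np.abs(values[:, np.newaxis] - targets)
nearest_index = distance.argmin(axis=0)

print(values[nearest_index])  # [ 8 27 42]
```

When two values are equally close, argmin returns the first one it encounters.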

20. Element-Wise Maximum With np.where()

Suppose you have two arrays with the same shape:

python
model_a = np.array([72, 88, 91, 64])
model_b = np.array([75, 80, 93, 70])

Explanation

  • The snippet assumes NumPy has already been imported as np.
  • model_a is created as a NumPy array containing the scores [72, 88, 91, 64].
  • model_b is created as a NumPy array containing the scores [75, 80, 93, 70].
  • These arrays can be used for various numerical operations, such as statistical analysis or model comparison.
  • The use of NumPy allows for efficient computation and manipulation of large datasets.

Choose the larger value at each position:

python
best = np.where(model_a >= model_b, model_a, model_b)

print(best)

Explanation

  • Utilizes NumPy's where function to compare two arrays, model_a and model_b.
  • For each element, it checks if the value in model_a is greater than or equal to the corresponding value in model_b.
  • If true, it selects the value from model_a; otherwise, it selects from model_b.
  • The result is stored in the variable best, which contains the maximum values from both models for each position.
  • Finally, it prints the best array to display the selected values.

Output:

text
[75 88 93 70]

np.where(condition, value_if_true, value_if_false) is extremely useful for conditional array logic.
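
For the specific case of an element-wise maximum, NumPy also provides np.maximum, which gives the same result more directly. The sketch below shows both, using np.where for a case where it is the better fit; the winner labels are an assumption added for illustration.

```python
import numpy as np

model_a = np.array([72, 88, 91, 64])
model_b = np.array([75, 80, 93, 70])

best = np.maximum(model_a, model_b)
print(best)  # [75 88 93 70]

# np.where is still the right tool when you want something other than
# the values themselves, such as a label for the winning model.
winner = np.where(model_a >= model_b, "a", "b")
print(winner)  # ['b' 'a' 'b' 'b']
```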

21. Repeating and Tiling Values

np.repeat() repeats each element.

python
items = np.array([1, 2, 3])

print(np.repeat(items, 3))

Explanation

  • The code creates a NumPy array named items containing the integers 1, 2, and 3 (NumPy is assumed to be imported as np).
  • The np.repeat() function is called with items as the first argument and 3 as the second argument, indicating that each element should be repeated three times.
  • The result of the np.repeat() function is printed, which will display a new array with each original element repeated consecutively.
  • The output will be [1, 1, 1, 2, 2, 2, 3, 3, 3], showing the repeated elements.

Output:

text
[1 1 1 2 2 2 3 3 3]

np.tile() repeats the whole array.

python
print(np.tile(items, 3))

Explanation

  • The np.tile function is used to construct an array by repeating the input array.
  • In this case, items is the input array that will be repeated.
  • The number 3 specifies that the array should be repeated three times.
  • The result is a new array that contains the elements of items concatenated three times in sequence.
  • This is useful for creating larger datasets or for preparing data for operations that require repeated patterns.

Output:

text
[1 2 3 1 2 3 1 2 3]

Combine both:

python
pattern = np.hstack([np.repeat(items, 3), np.tile(items, 3)])

print(pattern)

Explanation

  • np.repeat(items, 3) creates a new array by repeating each element in items three times.
  • np.tile(items, 3) constructs an array by repeating the entire items array three times.
  • np.hstack([...]) horizontally stacks the two resulting arrays from the repeat and tile operations into a single array.
  • The final output, printed with print(pattern), displays the combined pattern of repeated and tiled elements.

Output:

text
[1 1 1 2 2 2 3 3 3 1 2 3 1 2 3 1 2 3]

22. Arrays for Plotting

NumPy is often used to generate x and y values for plots.

python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)
y = x ** 2

plt.plot(x, y)
plt.title("y = x^2")
plt.xlabel("x")
plt.ylabel("y")
plt.show()

Explanation

  • Imports the NumPy library for numerical operations and Matplotlib for plotting.
  • Generates 200 evenly spaced values between -5 and 5 for the x-axis using np.linspace().
  • Calculates the corresponding y values by squaring each x value (y = x ** 2).
  • Plots the quadratic function with labeled axes and a title for clarity.
  • Displays the plot using plt.show(), allowing users to visualize the parabolic curve.

For a sine wave:

python
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x)

plt.plot(x, y)
plt.title("Sine wave")
plt.show()

Explanation

  • The np.linspace function creates an array of 200 evenly spaced values between 0 and 2π.
  • The np.sin function computes the sine of each value in the array, resulting in the y-coordinates for the sine wave.
  • The plt.plot function is used to plot the x and y values, creating a graphical representation of the sine wave.
  • The plt.title function sets the title of the plot to "Sine wave".
  • Finally, plt.show displays the plot in a window.

For sigmoid:

python
x = np.linspace(-10, 10, 300)
y = 1 / (1 + np.exp(-x))

plt.plot(x, y)
plt.title("Sigmoid curve")
plt.show()

Explanation

  • The code generates 300 evenly spaced values between -10 and 10 using NumPy's linspace function, which are stored in the variable x.
  • It calculates the sigmoid function values for each x using the formula 1 / (1 + np.exp(-x)), storing the results in the variable y.
  • The plt.plot function from Matplotlib is used to create a line plot of the sigmoid curve by plotting x against y.
  • A title "Sigmoid curve" is added to the plot for clarity using plt.title.
  • Finally, plt.show() displays the generated plot in a window.

The plotting library draws the chart, but NumPy creates the numerical data.

23. Practice Exercises

Try these before reading the solutions.

Practice Lab

Exercise 1: Compare memory

Create three arrays with one million values:

  • default integer dtype
  • int32
  • int16

Print each array's dtype and nbytes.

Practice Lab

Exercise 2: Select rows and columns

Create a 6 by 5 array from 0 to 29. Select rows 0, 2, and 5, then select columns 1 and 4.

Practice Lab

Exercise 3: Filter values

Create an array from 1 to 50. Return values that are divisible by 4 but not divisible by 8.

Practice Lab

Exercise 4: Replace multiples

Create an array from 1 to 20. Replace values divisible by 3 or 5 with 0.

Practice Lab

Exercise 5: Row-wise bonus

Create a 3 by 4 score array. Add a different bonus to each row using broadcasting.

Practice Lab

Exercise 6: Column centering

Create a 4 by 3 array. Subtract the mean of each column from that column.

Practice Lab

Exercise 7: Fill missing values

Create a 2D array with np.nan values. Replace missing values with the column mean.

Practice Lab

Exercise 8: Nearest element

Given an array and a target number, find the nearest value.

Practice Lab

Exercise 9: Cauchy-style matrix

Given:

python
x = np.array([1, 2, 4])
y = np.array([6, 8, 10, 12])

Explanation

  • Initializes a NumPy array x containing three integers: 1, 2, and 4.
  • Initializes a second NumPy array y containing four integers: 6, 8, 10, and 12.
  • These arrays can be used for various mathematical operations and data analysis tasks.
  • NumPy is a powerful library for numerical computing in Python, providing support for large, multi-dimensional arrays and matrices.

Create a matrix where each value is:

text
1 / (x_i - y_j)

Practice Lab

Exercise 10: Plot tanh

Generate x values from -6 to 6 and plot:

text
tanh(x)

Use either np.tanh(x) or the formula with np.exp().

24. Practice Solutions

Solution Key

Solution 1: Compare memory

python
arrays = [
    np.arange(1_000_000),
    np.arange(1_000_000, dtype=np.int32),
    np.arange(1_000_000, dtype=np.int16),
]

for arr in arrays:
    print(arr.dtype, arr.nbytes)

Explanation

  • Creates a list of three NumPy arrays with varying data types: the platform default integer (int64 on most 64-bit systems), int32, and int16.
  • Each array contains one million sequential integers generated by np.arange().
  • Iterates through the list of arrays, printing the data type (dtype) and the total memory size in bytes (nbytes) for each array.
  • Highlights the impact of data type selection on memory consumption in NumPy arrays.

Solution Key

Solution 2: Select rows and columns

python
data = np.arange(30).reshape(6, 5)

result = data[[0, 2, 5]][:, [1, 4]]

print(result)

Explanation

  • The code creates a 2D NumPy array data with values from 0 to 29, reshaped into 6 rows and 5 columns.
  • It selects rows 0, 2, and 5 from the data array using advanced indexing.
  • From the selected rows, it further extracts columns 1 and 4, resulting in a new array result.
  • Finally, the result array is printed, displaying the specified rows and columns.

Solution Key

Solution 3: Filter values

python
values = np.arange(1, 51)

result = values[(values % 4 == 0) & ~(values % 8 == 0)]

print(result)

Explanation

  • The np.arange(1, 51) function generates an array of integers from 1 to 50.
  • The condition (values % 4 == 0) checks for numbers that are divisible by 4.
  • The condition ~(values % 8 == 0) negates the check for numbers that are divisible by 8.
  • The combined condition filters the array to include only those numbers that meet both criteria.
  • Finally, print(result) outputs the filtered array to the console.

Solution Key

Solution 4: Replace multiples

python
values = np.arange(1, 21)

values[(values % 3 == 0) | (values % 5 == 0)] = 0

print(values)

Explanation

  • The code initializes an array of integers from 1 to 20 using NumPy's arange function.
  • It uses a boolean mask to identify elements in the array that are multiples of 3 or 5.
  • The identified elements are then set to zero, effectively replacing them in the original array.
  • Finally, the modified array is printed, showing the changes made.

Solution Key

Solution 5: Row-wise bonus

python
scores = np.array([
    [70, 75, 80, 85],
    [60, 65, 70, 75],
    [88, 90, 92, 94],
])

bonus = np.array([[5], [10], [2]])

print(scores + bonus)

Explanation

  • The code initializes a 2D NumPy array scores representing the scores of three students across four subjects.
  • A second 2D NumPy array bonus contains bonus points for each student, structured as a column vector.
  • The addition operation scores + bonus utilizes broadcasting, allowing the bonus points to be added to each corresponding row of the scores array.
  • The result is printed, showing each student's original scores increased by their respective bonus points.

Solution Key

Solution 6: Column centering

python
data = np.array([
    [10, 20, 30],
    [12, 18, 33],
    [14, 22, 36],
    [16, 24, 39],
])

column_means = data.mean(axis=0)
centered = data - column_means

print(centered)

Explanation

  • The code initializes a 2D NumPy array named data with four rows and three columns.
  • It calculates the mean of each column using data.mean(axis=0), resulting in a 1D array of column means.
  • The original array data is centered by subtracting the corresponding column means from each element in that column.
  • The centered array is printed, showing how each value has been adjusted relative to its column mean.

Solution Key

Solution 7: Fill missing values

python
data = np.array([
    [10.0, np.nan, 30.0],
    [12.0, 22.0, np.nan],
    [14.0, 24.0, 36.0],
])

column_means = np.nanmean(data, axis=0)
mask = np.isnan(data)

filled = data.copy()
filled[mask] = np.take(column_means, np.where(mask)[1])

print(filled)

Explanation

  • The code initializes a 2D NumPy array containing some NaN (Not a Number) values.
  • It calculates the mean of each column while ignoring NaN values using np.nanmean().
  • A mask is created to identify the positions of NaN values in the original array.
  • A copy of the original array is made, and the NaN values are replaced with the corresponding column means.
  • Finally, the modified array with filled values is printed, showing the imputed data.

Solution Key

Solution 8: Nearest element

python
values = np.array([11, 18, 26, 33, 47])
target = 29

nearest = values[np.abs(values - target).argmin()]

print(nearest)

Explanation

  • Initializes a NumPy array values containing a set of integers.
  • Defines a target integer to which the nearest value in the array will be found.
  • Calculates the absolute difference between each element in values and the target, then finds the index of the minimum difference using argmin().
  • Uses this index to retrieve the nearest value from the original array.
  • Prints the nearest value to the console.

Solution Key

Solution 9: Cauchy-style matrix

python
x = np.array([1, 2, 4]).reshape(-1, 1)
y = np.array([6, 8, 10, 12])

matrix = 1 / (x - y)

print(matrix)

Explanation

  • The variable x is initialized as a 3x1 NumPy array, reshaping the input to ensure it has a single column.
  • The variable y is a 1D NumPy array containing four elements.
  • The expression 1 / (x - y) utilizes broadcasting to compute the pairwise differences between each element in x and y, resulting in a 3x4 matrix.
  • The final output, matrix, contains the reciprocal of each of these differences, which is printed to the console.
  • This operation is useful in various mathematical and statistical applications, such as calculating distances or similarities between data points.

Broadcasting handles the shape difference:

text
x shape = (3, 1)
y shape = (4,)
result shape = (3, 4)
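
An alternative that avoids the explicit reshape is the outer method available on NumPy ufuncs. The sketch below uses np.subtract.outer and adds a guard for the case where some x_i equals some y_j; the guard is an addition for illustration, since the exercise data never hits it.

```python
import numpy as np

x = np.array([1, 2, 4])
y = np.array([6, 8, 10, 12])

# Every pairwise difference x_i - y_j, arranged as a (3, 4) table.
differences = np.subtract.outer(x, y)

# The formula is undefined wherever a difference is zero.
if np.any(differences == 0):
    raise ValueError("some x_i equals some y_j, so 1 / (x_i - y_j) is undefined")

matrix = 1 / differences
print(matrix.shape)  # (3, 4)
```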

Solution Key

Solution 10: Plot tanh

python
x = np.linspace(-6, 6, 300)
y = np.tanh(x)

plt.plot(x, y)
plt.title("tanh(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True)
plt.show()

Explanation

  • Generates 300 evenly spaced values between -6 and 6 using NumPy's linspace function.
  • Computes the hyperbolic tangent of each value in the array x with NumPy's tanh function, storing the results in y.
  • Plots the x values against the y values to visualize the tanh function using Matplotlib's plot method.
  • Sets the title of the plot to "tanh(x)" and labels the x-axis and y-axis accordingly.
  • Enables a grid for better readability and displays the plot with show().

Using the formula:

python
y = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

Explanation

  • The code uses the NumPy library to perform exponential calculations efficiently.
  • It computes the hyperbolic tangent (tanh) of x by applying the formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
  • The result y will be a value between -1 and 1, representing the hyperbolic tangent of the input x.
  • This function is commonly used in machine learning and neural networks for activation functions.

np.tanh(x) is preferred because it is shorter, clearer, and numerically safer.
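
To make "numerically safer" concrete, here is a small sketch that evaluates both versions at extreme inputs. The input values are chosen only to trigger overflow in np.exp.

```python
import numpy as np

x = np.array([-1000.0, 0.0, 1000.0])

# Silence the overflow/invalid warnings so the comparison is easy to read.
with np.errstate(over="ignore", invalid="ignore"):
    by_formula = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

by_function = np.tanh(x)

print(by_formula)   # nan at both extremes, 0 in the middle
print(by_function)  # -1, 0, and 1
```

The formula ends up dividing infinity by infinity, which is nan, while np.tanh returns the correct limiting values.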

25. Mini Project: Clean and Score Sensor Readings

Suppose you receive sensor readings from 4 devices across 5 time points.

python
readings = np.array([
    [12.0, 13.5, np.nan, 15.0, 14.5],
    [9.0, np.nan, 11.5, 12.0, 13.0],
    [20.0, 19.5, 21.0, np.nan, 22.0],
    [7.5, 8.0, 8.5, 9.0, np.nan],
])

Explanation

  • Initializes a 2D NumPy array named readings to store the sensor data.
  • Contains floating-point sensor readings, with some values set to np.nan to indicate missing data.
  • The array has four rows and five columns, allowing for structured data representation.
  • Useful for data analysis tasks where handling of missing values is necessary, such as in scientific research or data preprocessing.

Tasks:

  • fill missing values with each device's row mean
  • calculate each device's average reading
  • mark readings above the device average
  • normalize each row between 0 and 1

Solution:

python
row_means = np.nanmean(readings, axis=1, keepdims=True)

missing = np.isnan(readings)
filled = readings.copy()
filled[missing] = np.take(row_means.ravel(), np.where(missing)[0])

device_average = filled.mean(axis=1, keepdims=True)
above_average = filled > device_average

row_min = filled.min(axis=1, keepdims=True)
row_max = filled.max(axis=1, keepdims=True)
normalized = (filled - row_min) / (row_max - row_min)

print("Filled readings:")
print(filled)
print("Device averages:")
print(device_average.ravel())
print("Above average mask:")
print(above_average)
print("Normalized readings:")
print(normalized)

Explanation

  • Computes the mean of each row in the readings array while ignoring NaN values, storing the result in row_means.
  • Identifies missing values in the readings array and creates a copy to fill these missing entries with the corresponding row means.
  • Calculates the average of the filled readings for each device and creates a boolean mask indicating which readings are above the average.
  • Normalizes the filled readings by scaling them between 0 and 1 based on the minimum and maximum values of each row.
  • Outputs the filled readings, device averages, above-average mask, and normalized readings to the console for review.

This mini project uses:

  • np.nanmean
  • boolean masks
  • broadcasting
  • row-wise operations
  • normalization

These are core skills for real data cleaning.
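
One edge case the solution does not cover: if every reading in a row is identical, row_max - row_min is zero and the normalization divides by zero. A possible guard is sketched below on a small hypothetical array; mapping constant rows to all zeros is one reasonable convention among several.

```python
import numpy as np

filled = np.array([
    [12.0, 13.5, 15.0],
    [5.0, 5.0, 5.0],     # hypothetical constant row: max equals min
])

row_min = filled.min(axis=1, keepdims=True)
row_max = filled.max(axis=1, keepdims=True)
span = row_max - row_min

# Replace a zero span with 1 so the division is defined; the numerator
# is already 0 for those rows, so they normalize to all zeros.
safe_span = np.where(span == 0, 1.0, span)
normalized = (filled - row_min) / safe_span

print(normalized)
```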

26. Quick Quiz

1. Why is NumPy often faster than Python loops?

Because NumPy stores numerical data compactly and runs many operations in optimized lower-level code.

2. What is fancy indexing?

Fancy indexing means selecting array values using lists or arrays of indexes.

3. Which operators should you use for NumPy boolean masks?

Use &, |, and ~, with each condition wrapped in parentheses.

4. What makes two dimensions broadcast-compatible?

They are compatible if the sizes are equal or one of the sizes is 1.

5. Why is np.nanmean() useful?

It calculates the mean while ignoring np.nan values.

Final Takeaway

Advanced NumPy is mostly about thinking in arrays instead of loops.

The key habits are:

  • check shapes before operations
  • use boolean masks for filtering
  • use fancy indexing for specific rows or columns
  • understand broadcasting before writing repeated loops
  • use vectorized formulas for mathematical work
  • handle np.nan values deliberately

When your code feels complicated, print the shape. Most NumPy confusion becomes easier once you know the shape of every array involved.
