Pandas MultiIndex: Hierarchical Indexing, Reshaping, and Pivot Tables
A normal DataFrame index has one level.
That is enough when each row has one simple label.
But many real datasets have nested meaning:
- campus -> track -> year -> quarter
- country -> city -> month
- learner -> course -> attempt
- account -> category -> subcategory
- metric group -> metric name
Pandas handles this with MultiIndex, also called hierarchical indexing.
This guide uses original sample datasets created for this lesson. It does not use copied course, college, disease, restaurant, finance, or public dataset examples.
Files used in this lesson:
pandas_multiindex_course_metrics.csvpandas_multiindex_course_metrics_wide.csvpandas_multiindex_expense_log.csv
Place all CSV files in the same folder as this Markdown file before running the examples.
What You Will Learn
By the end, you should be able to use:
pd.MultiIndex.from_tuples()pd.MultiIndex.from_product()set_index()with multiple columnsindex.names,index.levels, andindex.get_level_values().locwith partial and complete MultiIndex keyspd.IndexSlice.xs()for cross sectionssort_index()swaplevel()reset_index()stack()andunstack()- MultiIndex columns
melt()for wide-to-long conversionpivot_table()fill_value,aggfunc, andmargins- practical reporting patterns with hierarchical data
1. Setup
import pandas as pd
import numpy as npLoad the datasets:
metrics = pd.read_csv("pandas_multiindex_course_metrics.csv")
wide_metrics = pd.read_csv("pandas_multiindex_course_metrics_wide.csv")
expenses = pd.read_csv("pandas_multiindex_expense_log.csv", parse_dates=["date"])Inspect them:
metrics.head()wide_metrics.head()expenses.head()2. Why MultiIndex Exists
Suppose you want to store one metric for each campus and year.
You could create tuple labels:
labels = [
("Pune", 2025),
("Pune", 2026),
("Kochi", 2025),
("Kochi", 2026),
]
scores = pd.Series([82, 87, 78, 84], index=labels)
scoresThis works, but the index is just a plain list of tuples.
Try selecting all Pune rows:
scores["Pune"]That does not behave like a real nested index.
The better solution is a proper MultiIndex.
3. Create A MultiIndex From Tuples
multi_index = pd.MultiIndex.from_tuples(
labels,
names=["campus", "year"]
)
scores = pd.Series([82, 87, 78, 84], index=multi_index)
scoresNow select all Pune rows:
scores.loc["Pune"]Select one exact value:
scores.loc[("Pune", 2026)]The outer level is campus.
The inner level is year.
4. Create A MultiIndex From Product
When you need every combination, use from_product().
index = pd.MultiIndex.from_product(
[["Pune", "Kochi"], ["Pandas", "SQL"], [2025, 2026]],
names=["campus", "track", "year"]
)
indexThis creates:
- Pune + Pandas + 2025
- Pune + Pandas + 2026
- Pune + SQL + 2025
- Pune + SQL + 2026
- and so on
Use from_product() when all combinations should exist.
Use from_tuples() when you already have the exact combinations.
5. Create A MultiIndex DataFrame From CSV
The course metrics dataset is in long format.
metrics.head()Create a MultiIndex from multiple columns:
metrics_mi = metrics.set_index(["campus", "track", "year", "quarter"])
metrics_mi.head()Check the index names:
metrics_mi.index.namesCheck the index levels:
metrics_mi.index.levelsGet all unique campus labels from the index:
metrics_mi.index.get_level_values("campus").unique()Get all unique tracks:
metrics_mi.index.get_level_values("track").unique()6. Select Rows With .loc
Select all rows for Pune:
metrics_mi.loc["Pune"]Select all Pune Pandas rows:
metrics_mi.loc[("Pune", "Pandas")]Select Pune Pandas in 2026:
metrics_mi.loc[("Pune", "Pandas", 2026)]Select one exact row:
metrics_mi.loc[("Pune", "Pandas", 2026, "Q2")]When you pass a full tuple, Pandas selects one exact MultiIndex path.
When you pass a partial key, Pandas selects everything below that level.
7. Sort The MultiIndex
Sorting makes MultiIndex selection and slicing easier.
metrics_mi = metrics_mi.sort_index()Sort descending by campus but ascending by track/year/quarter:
metrics_mi.sort_index(ascending=[False, True, True, True]).head()Sort by one level:
metrics_mi.sort_index(level="track").head()In production code, sort before slicing ranges in a MultiIndex.
8. Select With pd.IndexSlice
For more complex selections, use pd.IndexSlice.
idx = pd.IndexSliceSelect all campuses, only Pandas, all years, only Q2:
metrics_mi.loc[idx[:, "Pandas", :, "Q2"], :]Select Pune and Kochi, all tracks, year 2026, all quarters:
metrics_mi.loc[idx[["Pune", "Kochi"], :, 2026, :], :]Select one metric column for all 2026 records:
metrics_mi.loc[idx[:, :, 2026, :], "avg_score"]IndexSlice helps when you need to keep some levels open while filtering others.
9. Use .xs() For Cross Sections
.xs() means cross section.
Select all Q1 rows from the quarter level:
metrics_mi.xs("Q1", level="quarter")Select all Pandas rows from the track level:
metrics_mi.xs("Pandas", level="track")Select one campus and one quarter using multiple levels:
metrics_mi.xs(("Pune", "Q2"), level=["campus", "quarter"])Use .loc for normal selection.
Use .xs() when you want to slice one or more named levels without writing full tuples.
10. Reset A MultiIndex
Convert the index levels back into normal columns:
metrics_mi.reset_index().head()Reset only one level:
metrics_mi.reset_index(level="quarter").head()Resetting is useful before:
- exporting to CSV
- merging with another table
- plotting with libraries that expect normal columns
- creating API responses
11. unstack() Moves Index Levels To Columns
Unstack quarter into columns:
quarter_score = metrics_mi["avg_score"].unstack("quarter")
quarter_score.head()Now Q1 and Q2 are columns.
Unstack year:
yearly_learners = metrics_mi["learners"].unstack("year")
yearly_learners.head()Unstack multiple levels:
metrics_mi["avg_score"].unstack(["year", "quarter"]).head()Use unstack() when you want to make nested row labels into columns.
12. stack() Moves Columns Back To Index
Take the unstacked score table:
quarter_score = metrics_mi["avg_score"].unstack("quarter")
quarter_score.head()Stack it back:
quarter_score.stack()stack() is the reverse of unstack() in many common cases.
Use:
unstack()for long-to-widestack()for wide-to-long when columns are hierarchical or repeated groups
13. MultiIndex Columns
You can also have MultiIndex columns.
Create a pivot table with multiple value columns:
campus_track_report = metrics.pivot_table(
index=["campus", "track"],
columns=["year", "quarter"],
values=["learners", "avg_score"],
aggfunc="mean"
)
campus_track_reportThis table has:
- MultiIndex rows:
campus,track - MultiIndex columns: metric name, year, quarter
Select one top-level column group:
campus_track_report["avg_score"]Select 2026 Q2 average score:
campus_track_report[("avg_score", 2026, "Q2")]MultiIndex columns are powerful, but they can look intimidating at first.
14. Swap Levels
Swap row index levels:
metrics_mi.swaplevel("campus", "track").head()Sort after swapping:
metrics_mi.swaplevel("campus", "track").sort_index().head()Swap column levels in a pivot table:
campus_track_report.swaplevel(0, 1, axis=1).sort_index(axis=1).head()Use swaplevel() when the hierarchy is correct, but the level order is inconvenient.
15. Flatten MultiIndex Columns
Sometimes you need simple column names.
flat_report = campus_track_report.copy()
flat_report.columns = [
f"{metric}_{year}_{quarter}"
for metric, year, quarter in flat_report.columns
]
flat_report.reset_index().head()Flatten columns before exporting to systems that do not understand MultiIndex columns.
16. Long Versus Wide Data
Long data stores one observation per row.
Wide data stores many observations across columns.
The file pandas_multiindex_course_metrics_wide.csv is wide:
wide_metrics.head()It has columns like:
2025_Q1_learners2025_Q2_learners2026_Q1_learners2026_Q2_learners
This is readable in a spreadsheet, but harder to analyze programmatically.
17. Convert Wide Data To Long With melt()
Use melt() to gather metric columns into rows.
wide_long = wide_metrics.melt(
id_vars=["campus", "track"],
var_name="period_metric",
value_name="value"
)
wide_long.head()Split the combined column name:
wide_long[["year", "quarter", "metric"]] = wide_long["period_metric"].str.split("_", expand=True)
wide_long["year"] = wide_long["year"].astype(int)
wide_long.head()Create a clean MultiIndex:
wide_long_mi = wide_long.set_index(["campus", "track", "year", "quarter", "metric"]).sort_index()
wide_long_mi.head()Now the wide data has become hierarchical long data.
18. Pivot Table Basics
The expense log is naturally long.
expenses.head()Create monthly spend by category:
expenses.pivot_table(
index="month",
columns="category",
values="amount",
aggfunc="sum",
fill_value=0
)Create monthly income versus expense:
expenses.pivot_table(
index="month",
columns="flow",
values="amount",
aggfunc="sum",
fill_value=0
)Create a multi-dimensional pivot:
expenses.pivot_table(
index=["month", "account"],
columns=["flow", "category"],
values="amount",
aggfunc="sum",
fill_value=0
)Pivot tables often create MultiIndex rows, MultiIndex columns, or both.
19. Pivot Table With Margins
Add totals with margins=True.
expenses.pivot_table(
index="month",
columns="category",
values="amount",
aggfunc="sum",
fill_value=0,
margins=True
)Use margins for reports where totals matter.
20. Pivot Table With Multiple Aggregations
Use multiple aggregations:
expenses.pivot_table(
index="category",
columns="flow",
values="amount",
aggfunc=["sum", "mean", "count"],
fill_value=0
)The result has MultiIndex columns because there are multiple aggregation functions.
21. Mini Project 1: Campus Performance Dashboard
Build a dashboard with:
- campus
- track
- total learners
- average score
- average completion rate
- total project submissions
Solution:
campus_dashboard = metrics.groupby(["campus", "track"], as_index=False).agg(
total_learners=("learners", "sum"),
average_score=("avg_score", "mean"),
average_completion_rate=("completion_rate", "mean"),
total_project_submissions=("project_submissions", "sum")
)
campus_dashboard["average_score"] = campus_dashboard["average_score"].round(2)
campus_dashboard["average_completion_rate"] = campus_dashboard["average_completion_rate"].round(2)
campus_dashboard.sort_values(["campus", "average_score"], ascending=[True, False])22. Mini Project 2: Quarter Comparison Table
Create a table where each quarter is a column and each row is campus-track-year.
quarter_comparison = metrics_mi["learners"].unstack("quarter")
quarter_comparison["growth_Q2_minus_Q1"] = quarter_comparison["Q2"] - quarter_comparison["Q1"]
quarter_comparison.sort_values("growth_Q2_minus_Q1", ascending=False).head()This is a clean use case for unstack().
23. Mini Project 3: Expense Summary Report
Build a report with monthly expense categories and a total column.
expense_only = expenses[expenses["flow"] == "Expense"]
monthly_expense = expense_only.pivot_table(
index="month",
columns="category",
values="amount",
aggfunc="sum",
fill_value=0
)
monthly_expense["total_expense"] = monthly_expense.sum(axis=1)
monthly_expenseSort months by total expense:
monthly_expense.sort_values("total_expense", ascending=False)24. Common Mistakes
Mistake 1: Thinking MultiIndex makes data higher-dimensional
A DataFrame is still two-dimensional.
MultiIndex adds multiple label levels on the row axis or column axis.
Mistake 2: Forgetting to sort before slicing
Sort first:
metrics_mi = metrics_mi.sort_index()Explanation
- Sorts the metrics_mi DataFrame by its index values in ascending order to ensure consistent data arrangement
- The sort_index() method reorders rows based on the index labels rather than any column values
- This operation helps maintain chronological or categorical order when working with time-series or grouped data
- Useful for preparing data for visualization or analysis where ordered presentation is required
- The sorting is performed in-place, modifying the original DataFrame structure directly
This prevents confusing behavior when slicing MultiIndex ranges.
Mistake 3: Using full tuple syntax when a partial key is enough
This selects one exact row:
metrics_mi.loc[("Pune", "Pandas", 2026, "Q2")]Explanation
- This code retrieves a specific row from a multi-index DataFrame named metrics_mi using a hierarchical index tuple
- The tuple ("Pune", "Pandas", 2026, "Q2") corresponds to four levels of the multi-index: location, tool, year, and quarter respectively
- The operation returns the metric value(s) associated with the exact combination of these four index levels
- This approach enables efficient querying of time-series data with multiple categorical dimensions in a structured format
- The result could be a single scalar value or a Series depending on how many columns exist in the original DataFrame
This selects all Pune Pandas rows:
metrics_mi.loc[("Pune", "Pandas")]Explanation
- This code retrieves a specific row from a multi-index DataFrame named metrics_mi by using a tuple containing the index values ("Pune", "Pandas")
- The first element "Pune" corresponds to the first level of the multi-index (likely city names), while "Pandas" corresponds to the second level (likely metric categories)
- This indexing approach allows for precise data retrieval when working with hierarchical data structures in pandas
- The result returns the complete row of data associated with the specified multi-index combination
- This technique is commonly used when analyzing nested categorical data such as performance metrics across different locations and categories
Mistake 4: Forgetting that unstack creates columns
metrics_mi["avg_score"].unstack("quarter")Explanation
- The code transforms a multi-index DataFrame by pivoting the "quarter" level from rows to columns
- This reshapes the data structure to create separate columns for each quarter's average scores
- The unstack operation converts the hierarchical index into a wide-format table layout
- Results in cleaner data organization where quarters become column headers instead of index levels
- Useful for time-series analysis or creating summary tables with quarterly breakdowns
After this, Q1 and Q2 are columns, not row labels.
Mistake 5: Keeping MultiIndex columns when exporting
Flatten before export if the target tool expects simple columns:
flat = campus_track_report.copy()
flat.columns = ["_".join(map(str, col)) for col in flat.columns]
flat.reset_index()Explanation
- Creates a copy of the original campus track report DataFrame to avoid modifying the source data
- Transforms multi-level column names into single string labels by joining tuple elements with underscores
- Resets the DataFrame index to return to default integer indexing starting from zero
- Prepares the data for easier analysis and visualization by simplifying the column structure
25. Practice Questions
Try these before looking at the solutions.
Practice Task
Q1. Create a MultiIndex DataFrame using campus, track, year, and quarter.
metrics_mi = metrics.set_index(["campus", "track", "year", "quarter"]).sort_index()
metrics_mi.head()Explanation
- Converts the metrics DataFrame into a multi-indexed structure using campus, track, year, and quarter as index levels
- Sorts the DataFrame by the new multi-index to ensure organized and predictable data ordering
- Displays the first few rows of the restructured DataFrame to verify the indexing operation
- Enables more efficient data querying and analysis through hierarchical grouping and sorting
- Prepares the data structure for advanced operations that require multi-level categorization
Practice Task
Q2. Select all SQL rows for Kochi.
metrics_mi.loc[("Kochi", "SQL")]Explanation
- This code retrieves a specific value from a pandas DataFrame with a MultiIndex structure using tuple-based indexing
- The operation targets the row with index levels ("Kochi", "SQL") where Kochi represents a city and SQL represents a skill category
- The result returns the metric value associated with the intersection of these two index levels
- This approach enables efficient lookup of nested data structures while maintaining clean hierarchical organization
- The syntax demonstrates pandas' capability to handle complex multi-level indexing for analytical data retrieval
Practice Task
Q3. Select all Q2 rows across all campuses and tracks.
idx = pd.IndexSlice
metrics_mi.loc[idx[:, :, :, "Q2"], :]Explanation
- IndexSlice creates a convenient way to slice multi-level indices without specifying each level explicitly
- The syntax idx[:, :, :, "Q2"] selects all rows where the fourth level equals "Q2" while keeping all values from other levels
- This approach enables efficient filtering of hierarchical data structures by specifying only the desired values for specific index levels
- The colon notation (:) represents "all values" for each dimension that isn't being filtered
- This technique is particularly useful when working with complex multi-index DataFrames where manual level specification would be cumbersome
Practice Task
Q4. Show average score as a year-quarter column table.
metrics_mi["avg_score"].unstack(["year", "quarter"])Explanation
- The code transforms a multi-indexed pandas DataFrame by unstacking the "year" and "quarter" levels from the index
- This operation pivots the data structure to create separate columns for each unique combination of year and quarter values
- The resulting DataFrame will have year-quarter combinations as column headers instead of index levels
- This reshaping technique is commonly used for time series analysis and creating wide-format datasets for reporting
- The "avg_score" column values are distributed across the newly created column structure based on their original year and quarter groupings
Practice Task
Q5. Find Q2 learner growth over Q1 for each campus-track-year.
learner_quarters = metrics_mi["learners"].unstack("quarter")
learner_quarters["q2_growth"] = learner_quarters["Q2"] - learner_quarters["Q1"]
learner_quarters.sort_values("q2_growth", ascending=False)Explanation
- Unstacks the quarter dimension from the learners metric data to create a wide format dataframe with quarters as columns
- Computes the growth rate between Q2 and Q1 by subtracting Q1 values from Q2 values for each learner category
- Sorts the resulting dataframe in descending order based on the calculated Q2 growth values to identify highest performing categories
- The analysis reveals which learner segments showed the most significant improvement from first to second quarter
Practice Task
Q6. Convert the wide metrics CSV into long format.
long_from_wide = wide_metrics.melt(
id_vars=["campus", "track"],
var_name="period_metric",
value_name="value"
)
long_from_wide.head()Explanation
- Transforms data from wide format where each period/metric is a separate column into long format where periods/metrics become rows
- Uses id_vars parameter to specify columns that identify each observation (campus and track)
- Creates new columns for the melted data: period_metric contains the original column names, and value contains the corresponding data values
- The head() method displays the first 5 rows of the transformed dataset for quick inspection
- This reshaping operation enables easier analysis and visualization of time-series or multi-metric data
Practice Task
Q7. Build a pivot table of monthly expenses by category.
expenses[expenses["flow"] == "Expense"].pivot_table(
index="month",
columns="category",
values="amount",
aggfunc="sum",
fill_value=0
)Explanation
- Filters the expenses DataFrame to include only rows where the flow column equals "Expense"
- Creates a pivot table structure with months as rows and categories as columns
- Aggregates the amount values using sum function to total expenses per month-category combination
- Fills any missing combinations with zero values instead of NaN
- Returns a clean tabular representation showing expense patterns across different time periods and categories
Practice Task
Q8. Build a pivot table of total amount by account and flow.
expenses.pivot_table(
index="account",
columns="flow",
values="amount",
aggfunc="sum",
fill_value=0
)Explanation
- Transforms flat expense data into a structured matrix format with accounts as rows and transaction flows as columns
- Calculates total amounts for each combination of account and flow type using sum aggregation
- Fills missing values with zero to ensure complete data representation for all account-flow combinations
- Provides a clear overview of financial transactions organized by account category and inflow/outflow status
- Enables easy analysis of spending patterns and cash flow management across different accounts
Practice Task
Q9. Swap campus and track levels in the MultiIndex.
metrics_mi.swaplevel("campus", "track").sort_index().head()Explanation
- The swaplevel() method exchanges the positions of two index levels in a multi-index DataFrame, allowing for different sorting priorities
- In this case, it swaps the "campus" and "track" levels to reorder the DataFrame structure for more logical grouping
- sort_index() then arranges the DataFrame rows in ascending order based on the new level arrangement
- head() displays only the first few rows of the resulting sorted DataFrame for quick preview
- This technique is commonly used when working with hierarchical data where you need to change the primary sorting criteria
Practice Task
Q10. Flatten a pivot table with MultiIndex columns.
report = expenses.pivot_table(
index="month",
columns=["flow", "category"],
values="amount",
aggfunc="sum",
fill_value=0
)
report.columns = ["_".join(map(str, col)) for col in report.columns]
report.reset_index()Explanation
- Generates a pivot table from expense data organizing by month with flow and category as nested column headers
- Applies sum aggregation to amount values while filling missing combinations with zero values
- Flattens the multi-level column structure into single string labels by joining tuple elements with underscores
- Resets the index to make the month column a regular column rather than an index for easier data manipulation
26. Interview Questions
1. What is a MultiIndex?
A MultiIndex is an index with multiple levels. It lets one axis carry hierarchical labels such as campus, track, year, and quarter.
2. Does MultiIndex make a DataFrame three-dimensional?
No. A DataFrame remains two-dimensional. MultiIndex only adds multiple label levels to rows or columns.
3. What is the difference between from_tuples() and from_product()?
from_tuples() builds an index from exact combinations you provide. from_product() creates every possible combination from input lists.
4. How do you create a MultiIndex from existing columns?
Use set_index() with a list of columns:
df.set_index(["campus", "track", "year"])Explanation
- Transforms the DataFrame by setting multiple columns (campus, track, year) as the index structure
- Creates a hierarchical index that enables efficient grouping and filtering operations across multiple dimensions
- Allows for easier data retrieval using multi-level indexing syntax like df.loc[("campus1", "track1", 2023)]
- Improves data organization for analytical work involving nested categorical data structures
- Prepares the DataFrame for advanced operations like pivot tables and grouped aggregations
5. What does unstack() do?
It moves one or more index levels into columns.
6. What does stack() do?
It moves column levels into the row index.
7. What is .xs() used for?
.xs() selects a cross section from one or more MultiIndex levels.
8. Why should you sort a MultiIndex?
Sorting makes selection, slicing, and reading results more predictable.
9. What is the difference between melt() and pivot_table()?
melt() usually converts wide data into long data. pivot_table() summarizes long data into a table, often wide.
10. Why can pivot tables create MultiIndex columns?
They create MultiIndex columns when you use multiple column keys, multiple value columns, or multiple aggregation functions.
11. When should you flatten MultiIndex columns?
Flatten them before exporting to tools or APIs that expect one-level column names.
12. What is pd.IndexSlice?
It is a helper for selecting across MultiIndex levels while leaving some levels open with :.
27. Final Mental Model
Use this table:
| Need | Use |
|---|---|
| Create nested labels from exact pairs | pd.MultiIndex.from_tuples() |
| Create all combinations of labels | pd.MultiIndex.from_product() |
| Make existing columns hierarchical labels | set_index([...]) |
| Select nested rows | .loc[...] |
| Select a level cross section | .xs(..., level=...) |
| Move row levels to columns | unstack() |
| Move column levels to rows | stack() |
| Change level order | swaplevel() |
| Return labels to normal columns | reset_index() |
| Convert wide data to long data | melt() |
| Summarize long data into a report | pivot_table() |
If you remember only one thing, remember this:
MultiIndex helps you represent nested labels. Stack, unstack, melt, and pivot tables help you reshape those labels into the form your analysis needs.
Official References
- Pandas advanced indexing and MultiIndex user guide: https://pandas.pydata.org/docs/user_guide/advanced.html
- Pandas reshaping and pivot tables user guide: https://pandas.pydata.org/docs/user_guide/reshaping.html
- Pandas
MultiIndexAPI reference: https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html - Pandas
pivot_tableAPI reference: https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html - Pandas
meltAPI reference: https://pandas.pydata.org/docs/reference/api/pandas.melt.html
