What is the purpose of using pd.concat in Pandas?

pd.concat is used for stacking DataFrames when they have the same meaning and similar columns, such as combining January and February enrollments.

How can you reset the index when concatenating DataFrames?

You can reset the index by using the parameter ignore_index=True in pd.concat, which will create a clean index from 0 to n - 1.

What is the use of the keys parameter in pd.concat?

The keys parameter is used to create a MultiIndex, allowing you to label each row with the original file it came from, such as 'Jan' or 'Feb'.

How do you perform a database-style join in Pandas?

You can perform a database-style join using the merge function, which supports inner, left, right, outer, and cross joins.

What are some scenarios where pd.concat is useful?

pd.concat is useful for combining monthly sales files, daily logs, region-wise CSVs, scraped pages with the same schema, and batches of survey responses.

Pandas Merge, Join, and Concat: Essential

Pandas Merge, Join, and Concat: Combine DataFrames With Practice

Most real projects do not give you one perfect CSV.

You may get:

one file for January enrollments
one file for February enrollments
one file for learners
one file for courses
one file for mentors
one file for reviews

To answer useful questions, you need to combine these files correctly.

This is where pd.concat(), merge(), and join() become essential.

This guide uses original sample datasets created for this lesson. It does not use copied course, sports, movie, ecommerce, or public dataset examples.

Files used in this lesson:

pandas_join_learners.csv
pandas_join_courses.csv
pandas_join_mentors.csv
pandas_join_enrollments_jan.csv
pandas_join_enrollments_feb.csv
pandas_join_reviews.csv

Place all CSV files in the same folder as this Markdown file before running the examples.

What You Will Learn

By the end, you should be able to use:

pd.concat() for stacking DataFrames
ignore_index=True
keys= to create a MultiIndex while concatenating
axis=0 versus axis=1
merge() for database-style joins
inner, left, right, outer, and cross joins
on, left_on, and right_on
joining on multiple columns
suffixes
indicator=True
validate
self joins
semi joins and anti joins
index-based .join()
combine_first()
reporting workflows after merging

1. Setup

python

import pandas as pd
import numpy as np

Load the datasets:

python

learners = pd.read_csv("pandas_join_learners.csv")
courses = pd.read_csv("pandas_join_courses.csv")
mentors = pd.read_csv("pandas_join_mentors.csv")
jan = pd.read_csv("pandas_join_enrollments_jan.csv")
feb = pd.read_csv("pandas_join_enrollments_feb.csv")
reviews = pd.read_csv("pandas_join_reviews.csv")

Inspect the shapes:

python

for name, df in {
    "learners": learners,
    "courses": courses,
    "mentors": mentors,
    "jan": jan,
    "feb": feb,
    "reviews": reviews,
}.items():
    print(name, df.shape)

Look at the first few rows:

python

learners.head()

python

courses.head()

python

jan.head()

2. `concat()` For Stacking Rows

Use pd.concat() when two or more DataFrames have the same meaning and similar columns.

January enrollments and February enrollments are the same kind of data.

python

enrollments = pd.concat([jan, feb])
enrollments

Notice the index.

Each file kept its original index, so you may see duplicate index labels.

For most beginner workflows, reset the index while concatenating:

python

enrollments = pd.concat([jan, feb], ignore_index=True)
enrollments

Now the index is clean from 0 to n - 1.

Use this pattern for:

monthly sales files
daily logs
region-wise CSVs
scraped pages with the same schema
batches of survey responses

3. Add Source Labels With `keys`

Sometimes you want to remember which original file each row came from.

python

enrollments_by_month = pd.concat(
    [jan, feb],
    keys=["Jan", "Feb"]
)

enrollments_by_month

This creates a MultiIndex.

Get all January rows:

python

enrollments_by_month.loc["Jan"]

Get the first row inside February:

python

enrollments_by_month.loc[("Feb", 0)]

If you want the month label as a normal column:

python

enrollments_with_month = pd.concat(
    [jan, feb],
    keys=["Jan", "Feb"]
).reset_index(level=0).rename(columns={"level_0": "month"})

enrollments_with_month.head()

This is useful for month-wise revenue analysis.

4. `concat()` Side By Side With `axis=1`

axis=0 stacks rows.

axis=1 attaches columns side by side.

python

pd.concat([learners.head(3), courses.head(3)], axis=1)

This example is not a real relationship. It simply places columns next to each other based on index position.

Use axis=1 only when rows are already aligned by index and you know that row 0 in one DataFrame belongs with row 0 in another.

For most relationship-based combining, use merge().

5. `merge()` For Relationship-Based Joins

Use merge() when rows should match by one or more key columns.

Example:

python

enrollments.merge(learners, on="learner_id")

This matches enrollment rows to learner details using learner_id.

By default, merge() performs an inner join.

Inner join means:

keep rows with matching keys in both DataFrames
drop rows that do not match

6. Inner Join

Show only enrollments that have matching learner records:

python

enrollments.merge(learners, how="inner", on="learner_id")

Create a richer table by merging enrollments, learners, and courses:

python

enrollment_detail = (
    enrollments
    .merge(learners, how="inner", on="learner_id")
    .merge(courses, how="inner", on="course_id")
)

enrollment_detail.head()

Now each enrollment row has learner and course information.

7. Left Join

A left join keeps every row from the left DataFrame.

For example, keep all courses even if some courses have no enrollments:

python

courses.merge(enrollments, how="left", on="course_id")

Rows for courses without enrollments will have missing values in enrollment columns.

This is useful when the left table is your master list.

Examples:

all courses, whether enrolled or not
all customers, whether ordered or not
all employees, whether assigned or not
all products, whether sold or not

8. Right Join

A right join keeps every row from the right DataFrame.

python

learners.merge(enrollments, how="right", on="learner_id")

This keeps every enrollment row from the right table.

You can usually rewrite right joins as left joins by swapping the order of the DataFrames. Many teams prefer left joins because they are easier to read.

9. Outer Join

An outer join keeps all keys from both DataFrames.

python

learners.merge(enrollments, how="outer", on="learner_id")

This is useful for audits:

learners with enrollments
learners without enrollments
enrollment rows that reference missing learners

Outer joins help find data quality gaps.

10. Use `indicator=True` To Audit Matches

The indicator=True option adds a _merge column.

python

learner_enrollment_audit = learners.merge(
    enrollments,
    how="outer",
    on="learner_id",
    indicator=True
)

learner_enrollment_audit["_merge"].value_counts()

The _merge values are:

both: key exists in both tables
left_only: key exists only in the left table
right_only: key exists only in the right table

Find learners who have not enrolled:

python

learner_enrollment_audit[
    learner_enrollment_audit["_merge"] == "left_only"
]

Find enrollment records with missing learner profiles:

python

learner_enrollment_audit[
    learner_enrollment_audit["_merge"] == "right_only"
]

This is one of the best debugging tools for joins.

11. Anti Join

An anti join means: show rows from one table that do not have a match in another table.

Pandas does not have a direct anti_join() method, but you can build one with indicator=True.

Courses with no enrollment:

python

course_audit = courses.merge(
    enrollments,
    how="left",
    on="course_id",
    indicator=True
)

courses_without_enrollment = course_audit[
    course_audit["_merge"] == "left_only"
]

courses_without_enrollment[["course_id", "course_name", "category", "price"]]

Learners with no enrollment:

python

learner_audit = learners.merge(
    enrollments,
    how="left",
    on="learner_id",
    indicator=True
)

learners_without_enrollment = learner_audit[
    learner_audit["_merge"] == "left_only"
]

learners_without_enrollment[["learner_id", "learner", "city", "plan"]]

Anti joins are common in analytics work.

12. Semi Join

A semi join means: keep rows from one table if they have a match in another table.

Learners who enrolled at least once:

python

enrolled_learner_ids = enrollments["learner_id"].unique()

learners[learners["learner_id"].isin(enrolled_learner_ids)]

Courses that received at least one enrollment:

python

enrolled_course_ids = enrollments["course_id"].unique()

courses[courses["course_id"].isin(enrolled_course_ids)]

Use semi joins when you need the original table rows, not the combined table.

13. Merge With Different Column Names

The reviews dataset uses different key names:

student_ref points to learner_id
course_ref points to course_id

Use left_on and right_on.

python

reviews_with_learners = reviews.merge(
    learners,
    how="left",
    left_on="student_ref",
    right_on="learner_id"
)

reviews_with_learners.head()

Merge reviews with courses:

python

reviews_full = (
    reviews
    .merge(learners, how="left", left_on="student_ref", right_on="learner_id")
    .merge(courses, how="left", left_on="course_ref", right_on="course_id")
)

reviews_full[["review_id", "learner", "course_name", "rating", "review_text"]]

14. Merge On Multiple Columns

Sometimes one column is not enough to identify a match.

Create a small completion table:

python

completion_checks = pd.DataFrame({
    "learner_id": ["L001", "L002", "L006", "L010", "L015"],
    "course_id": ["C101", "C101", "C101", "C104", "C106"],
    "certificate_sent": [True, False, True, False, True]
})

Merge using both learner_id and course_id:

python

enrollments.merge(
    completion_checks,
    how="left",
    on=["learner_id", "course_id"]
)

Use multiple keys when the relationship depends on a combination of columns.

15. Handle Duplicate Column Names With `suffixes`

If both DataFrames contain columns with the same name, Pandas adds suffixes like _x and _y.

You can choose better suffixes.

python

mentor_course_table = courses.merge(
    mentors,
    how="left",
    on="mentor_id",
    suffixes=("_course", "_mentor")
)

mentor_course_table.head()

In this dataset, both tables have specialty, so custom suffixes keep the result readable.

16. Validate Join Relationships

validate helps catch accidental many-to-many joins.

For example, each course has one mentor, but one mentor can teach many courses.

python

courses.merge(
    mentors,
    how="left",
    on="mentor_id",
    validate="many_to_one"
)

Each enrollment belongs to one learner, but one learner can have many enrollments.

python

enrollments.merge(
    learners,
    how="left",
    on="learner_id",
    validate="many_to_one"
)

Common validation options:

Option	Meaning
`one_to_one`	each key appears once in both tables
`one_to_many`	left keys are unique, right keys may repeat
`many_to_one`	left keys may repeat, right keys are unique
`many_to_many`	keys may repeat in both tables

Use validate when correctness matters.

It can save you from silent row explosions.

17. Many-To-Many Joins Can Expand Rows

If the same key appears multiple times in both tables, merge creates all matching combinations.

Example:

python

left = pd.DataFrame({
    "topic": ["pandas", "pandas", "sql"],
    "left_note": ["A", "B", "C"]
})

right = pd.DataFrame({
    "topic": ["pandas", "pandas", "numpy"],
    "right_note": ["X", "Y", "Z"]
})

left.merge(right, on="topic", how="inner")

The pandas rows multiply because there are two matches on each side.

Before merging, check key uniqueness:

python

learners["learner_id"].is_unique

python

courses["course_id"].is_unique

python

enrollments["learner_id"].is_unique

It is normal for enrollments to have repeated learner IDs, but it is not normal for the learner master table to have duplicate learner IDs.

18. Self Join

A self join means merging a table with itself.

In the learners dataset, referral_id points to another learner.

python

learners[["learner_id", "learner", "referral_id"]]

Merge learners with learners:

python

referral_map = learners.merge(
    learners,
    how="left",
    left_on="referral_id",
    right_on="learner_id",
    suffixes=("_learner", "_referrer")
)

referral_map[[
    "learner_id_learner",
    "learner_learner",
    "referral_id_learner",
    "learner_referrer"
]]

Self joins are useful for:

employee-manager relationships
user-referrer relationships
product replacement mappings
category-parent relationships

19. Index-Based `.join()`

.join() is mostly used for index-based joins.

Set learner ID as the index:

python

learners_by_id = learners.set_index("learner_id")
enrollments_by_learner = enrollments.set_index("learner_id")

Join enrollment rows with learner columns:

python

enrollments_by_learner.join(
    learners_by_id,
    how="left",
    rsuffix="_learner"
).head()

For beginners, merge() is usually clearer because the join keys are visible as columns.

Use .join() when your DataFrames are already indexed by the relationship key.

20. `combine_first()` For Filling Missing Values

combine_first() fills missing values in one object using values from another object with aligned indexes.

Example:

python

primary_prices = courses.set_index("course_id")["price"]

backup_prices = pd.Series({
    "C101": 1499,
    "C102": 999,
    "C103": 1299,
    "C104": 1199,
    "C105": 1999,
    "C106": 799
})

primary_prices.combine_first(backup_prices)

This is not a replacement for merge().

Use it when you want to fill missing aligned values.

21. Build A Complete Enrollment Report

Create one table with enrollment, learner, course, and mentor information.

python

enrollment_report = (
    enrollments
    .merge(learners, how="left", on="learner_id", validate="many_to_one")
    .merge(courses, how="left", on="course_id", validate="many_to_one")
    .merge(mentors, how="left", on="mentor_id", validate="many_to_one", suffixes=("_course", "_mentor"))
)

enrollment_report.head()

Select useful columns:

python

enrollment_report = enrollment_report[[
    "enrollment_id",
    "enrolled_at",
    "learner",
    "city",
    "plan",
    "course_name",
    "category",
    "price",
    "payment_status",
    "mentor",
    "coupon"
]]

enrollment_report.head()

22. Month-Wise Revenue

Use the month label from concat(keys=...).

python

enrollments_with_month = pd.concat(
    [jan, feb],
    keys=["Jan", "Feb"]
).reset_index(level=0).rename(columns={"level_0": "month"})

month_revenue = (
    enrollments_with_month
    .merge(courses, how="left", on="course_id")
    .query("payment_status == 'paid'")
    .groupby("month", as_index=False)["price"]
    .sum()
    .rename(columns={"price": "revenue"})
)

month_revenue

23. Course-Wise Revenue

python

course_revenue = (
    enrollments
    .merge(courses, how="left", on="course_id")
    .query("payment_status == 'paid'")
    .groupby(["course_id", "course_name"], as_index=False)["price"]
    .sum()
    .rename(columns={"price": "revenue"})
    .sort_values("revenue", ascending=False)
)

course_revenue

24. Top Learners By Spend

python

learner_spend = (
    enrollments
    .merge(learners, how="left", on="learner_id")
    .merge(courses, how="left", on="course_id")
    .query("payment_status == 'paid'")
    .groupby(["learner_id", "learner"], as_index=False)["price"]
    .sum()
    .rename(columns={"price": "total_spend"})
    .sort_values("total_spend", ascending=False)
)

learner_spend.head(5)

25. Courses With No Paid Enrollment

This is slightly different from courses with no enrollment at all.

python

paid_enrollments = enrollments[enrollments["payment_status"] == "paid"]

paid_course_audit = courses.merge(
    paid_enrollments,
    how="left",
    on="course_id",
    indicator=True
)

paid_course_audit[
    paid_course_audit["_merge"] == "left_only"
][["course_id", "course_name", "category", "price"]]

This helps identify courses that exist in the catalog but did not produce paid orders.

26. Review Report

Merge reviews with learners and courses.

python

review_report = (
    reviews
    .merge(learners, how="left", left_on="student_ref", right_on="learner_id")
    .merge(courses, how="left", left_on="course_ref", right_on="course_id")
)

review_report[[
    "review_id",
    "learner",
    "course_name",
    "rating",
    "review_text"
]]

Average rating by course:

python

review_report.groupby("course_name", as_index=False)["rating"].mean()

27. Cross Join

A cross join creates every combination of rows.

Example: every active mentor with every active course category.

python

mentor_names = mentors[["mentor_id", "mentor"]].head(3)
categories = pd.DataFrame({"category": ["Python", "Data", "Project"]})

mentor_names.merge(categories, how="cross")

Use cross joins carefully.

If the left table has 1000 rows and the right table has 1000 rows, the result has 1,000,000 rows.

28. Common Mistakes

Mistake 1: Using `concat()` when you need `merge()`

This only places rows or columns together:

python

pd.concat([learners, enrollments])

Explanation

The pd.concat() function vertically stacks two pandas DataFrames (learners and enrollments) by concatenating their rows
This operation preserves all columns from both DataFrames while appending the rows of the second DataFrame below the first
The resulting DataFrame contains combined data from both sources, useful for merging related datasets
Column names must match between DataFrames for proper alignment during concatenation
This technique is commonly used in data processing workflows to aggregate multiple data sources into a single dataset

It does not match learner details to enrollments.

For relationship matching, use:

python

enrollments.merge(learners, on="learner_id")

Explanation

Combines two datasets (enrollments and learners) based on matching learner IDs to create a unified view of enrollment details alongside learner demographics
The merge operation aligns records from both tables where the "learner_id" field values match between the datasets
This creates a comprehensive dataset that includes both enrollment information and corresponding learner attributes in a single dataframe
The resulting merged dataset enables analysis of learner behavior, enrollment patterns, and demographic breakdowns simultaneously
Commonly used in educational analytics to understand student engagement and course participation metrics

Mistake 2: Forgetting `ignore_index=True`

When stacking files, duplicated index labels can confuse later code.

python

pd.concat([jan, feb], ignore_index=True)

Explanation

Concatenates two pandas DataFrame objects (jan and feb) along the row axis (axis=0 by default)
The ignore_index=True parameter creates a new continuous integer index starting from zero
Preserves all columns from both DataFrames without any merging or joining operations
Useful for combining time-series data or appending datasets with identical column structures
Returns a new DataFrame object containing all rows from both input DataFrames sequentially

Mistake 3: Ignoring unmatched rows

An inner join can silently drop rows.

Audit with:

python

learners.merge(enrollments, how="outer", on="learner_id", indicator=True)

Explanation

Performs an outer join operation between learners and enrollments DataFrames based on the common learner_id column
The indicator=True parameter adds a special column showing which DataFrame each record came from (left_only, right_only, both)
Preserves all records from both datasets regardless of matching conditions, making it ideal for identifying missing enrollments or unenrolled learners
Useful for data quality analysis to spot records that exist in one dataset but not the other
Returns a merged DataFrame that can be used to analyze complete enrollment coverage across all learners

Mistake 4: Not checking key uniqueness

Before merging master tables, check:

python

learners["learner_id"].is_unique

Explanation

The code checks whether all values in the 'learner_id' column of the learners DataFrame are distinct and non-duplicate
The is_unique property returns a boolean value (True/False) indicating if all elements in the Series are unique
This is useful for data validation to ensure each learner has a unique identifier
The operation is efficient as it leverages pandas' optimized internal methods for checking uniqueness
This check helps maintain data integrity before performing operations that require unique identifiers

python

courses["course_id"].is_unique

Explanation

This code checks whether all values in the 'course_id' column of the courses DataFrame are distinct
The .is_unique property returns a boolean value (True/False) indicating if there are no duplicate entries
When True, it confirms that each course ID appears only once in the dataset
When False, it indicates there are duplicate course IDs that need to be addressed
This validation is essential for ensuring data integrity in educational course management systems

Mistake 5: Letting duplicate columns become confusing

Use meaningful suffixes:

python

courses.merge(
    mentors,
    on="mentor_id",
    suffixes=("_course", "_mentor")
)

Explanation

Combines two pandas DataFrames (courses and mentors) based on their common "mentor_id" column
Uses suffixes parameter to distinguish columns with identical names between the two dataframes
Appends "_course" and "_mentor" suffixes to overlapping column names respectively
Creates a unified dataset that preserves all course information while adding corresponding mentor details
Returns a new merged DataFrame containing combined data from both source datasets

29. Practice Questions

Try these before looking at the solutions.

Practice Task

Q1. Append January and February enrollment files into one clean DataFrame.

python

enrollments = pd.concat([jan, feb], ignore_index=True)
enrollments.head()

Explanation

The code merges two pandas DataFrames (jan and feb) into a single DataFrame called enrollments by stacking them vertically
The ignore_index=True parameter resets the row indices to create a continuous sequential index starting from zero
This operation is useful for combining monthly enrollment data into a unified dataset for analysis
The head() method displays the first 5 rows of the combined DataFrame to verify the concatenation worked correctly
The resulting enrollments DataFrame contains all records from both input DataFrames in chronological order

Practice Task

Q2. Create a full enrollment table with learner and course names.

python

full_enrollments = (
    enrollments
    .merge(learners, how="left", on="learner_id")
    .merge(courses, how="left", on="course_id")
)

full_enrollments[["enrollment_id", "learner", "course_name", "payment_status"]]

Explanation

Combines three datasets (enrollments, learners, and courses) using left joins to preserve all enrollment records while adding corresponding learner and course details
Creates a unified dataframe that links student enrollments to specific courses and learner information through common identifier fields
Selects only the most relevant columns (enrollment ID, learner name, course name, and payment status) for reporting or analysis purposes
The merge operations maintain data integrity by matching records based on learner_id and course_id foreign keys
This structured approach enables easy tracking of which students are enrolled in which courses and their payment statuses

Practice Task

Q3. Find learners who never enrolled.

python

audit = learners.merge(
    enrollments,
    how="left",
    on="learner_id",
    indicator=True
)

audit[audit["_merge"] == "left_only"][["learner_id", "learner", "city"]]

Explanation

Performs a left join between learners and enrollments datasets using learner_id as the common key
Adds a merge indicator column to track which records exist in both datasets versus only in the left dataset
Filters results to show only records that exist in learners but not in enrollments (left_only matches)
Returns specific columns (learner_id, learner name, and city) for audit purposes
Useful for data quality checks to identify unenrolled learners in the system

Practice Task

Q4. Find courses that never received any enrollment.

python

audit = courses.merge(
    enrollments,
    how="left",
    on="course_id",
    indicator=True
)

audit[audit["_merge"] == "left_only"][["course_id", "course_name"]]

Explanation

Performs a left join between courses and enrollments dataframes using course_id as the merging column
The indicator parameter tracks which records exist in each dataframe, creating a _merge column with values like "both", "left_only", or "right_only"
Filters results to show only records where the course exists in the courses table but has no matching enrollment record
Returns specific columns (course_id and course_name) for courses that have no enrollments
Useful for audit purposes to identify inactive or unenrolled courses in a learning management system

Practice Task

Q5. Find total paid revenue by mentor.

python

mentor_revenue = (
    enrollments
    .merge(courses, how="left", on="course_id")
    .merge(mentors, how="left", on="mentor_id")
    .query("payment_status == 'paid'")
    .groupby(["mentor_id", "mentor"], as_index=False)["price"]
    .sum()
    .rename(columns={"price": "revenue"})
    .sort_values("revenue", ascending=False)
)

mentor_revenue

Explanation

Merges three datasets (enrollments, courses, and mentors) using left joins to combine enrollment data with course pricing and mentor information
Filters the merged dataset to include only paid enrollments using query method
Groups the filtered data by mentor ID and name, then sums the price column to calculate total revenue per mentor
Renames the summed price column to "revenue" for clarity and readability
Sorts the final result in descending order of revenue to show highest earning mentors first

Practice Task

Q6. Find students who enrolled in both months.

python

jan_learners = set(jan["learner_id"])
feb_learners = set(feb["learner_id"])

both_months = jan_learners.intersection(feb_learners)

learners[learners["learner_id"].isin(both_months)]

Explanation

Creates two sets containing unique learner IDs from January and February datasets respectively
Uses set intersection operation to identify learners who exist in both months
Filters the main learners dataframe to return only records where learner_id matches those found in both months
This approach efficiently identifies overlapping data between time periods using set operations

Practice Task

Q7. Find courses that were enrolled in January but not February.

python

jan_courses = set(jan["course_id"])
feb_courses = set(feb["course_id"])

jan_only_courses = jan_courses.difference(feb_courses)

courses[courses["course_id"].isin(jan_only_courses)]

Explanation

Creates two sets from course IDs in January and February dataframes to enable set operations
Uses the difference method to identify courses that exist in January but not in February
Filters the main courses dataframe to return only rows where course IDs match the January-only courses
This approach efficiently identifies courses that were exclusively offered during the January period

Practice Task

Q8. Build learner-referrer pairs.

python

referral_pairs = learners.merge(
    learners,
    how="left",
    left_on="referral_id",
    right_on="learner_id",
    suffixes=("_learner", "_referrer")
)

referral_pairs[["learner_learner", "learner_referrer"]]

Explanation

Merges the learners DataFrame with itself using a left join to establish referral relationships between learners and their referrers
Links records where the referral_id from one learner matches the learner_id of another learner to create referral pairs
Uses suffixes to distinguish between columns from the left (learner) and right (referrer) sides of the merge operation
Selects specific columns to display the learner and referrer identifiers from the resulting joined dataset
This approach enables analysis of referral patterns and tracking of which learners referred others in the system

Practice Task

Q9. Create a review table with learner name, course name, and rating.

python

review_table = (
    reviews
    .merge(learners, how="left", left_on="student_ref", right_on="learner_id")
    .merge(courses, how="left", left_on="course_ref", right_on="course_id")
)

review_table[["review_id", "learner", "course_name", "rating"]]

Explanation

The code performs two sequential merge operations to combine review data with corresponding learner and course information using foreign key relationships
First merge joins the reviews dataset with learners table on student_ref = learner_id to add learner details to each review
Second merge joins the result with courses table on course_ref = course_id to include course names and other course-related information
The final selection filters the merged dataset to show only specific columns: review identifier, learner name, course name, and rating score
This approach creates a unified view that connects individual reviews to both the students who wrote them and the courses they reviewed

Practice Task

Q10. Use validate to confirm that enrollment-to-course is many-to-one.

python

enrollments.merge(
    courses,
    how="left",
    on="course_id",
    validate="many_to_one"
)

Explanation

Performs a left join operation between enrollments and courses DataFrames based on the common course_id column
Preserves all records from the enrollments DataFrame while adding matching course data where available
Validates the merge relationship to ensure each enrollment maps to at most one course record
Useful for enriching enrollment data with course details like titles, credits, or department information
The many_to_one validation prevents duplicate course entries that could corrupt the merged dataset

Need	Use
Stack rows from similar files	`pd.concat([...], ignore_index=True)`
Keep source labels while stacking	`pd.concat([...], keys=[...])`
Match rows by key columns	`merge()`
Keep all rows from the left table	`how="left"`
Keep only matching rows	`how="inner"`
Keep all keys from both tables	`how="outer"`
Audit unmatched rows	`indicator=True`
Catch wrong relationship assumptions	`validate`
Merge using different key names	`left_on`, `right_on`
Merge a table with itself	self join
Join by index	`.join()`

If you remember only one rule, remember this:

Use concat() when files are the same kind of thing. Use merge() when files describe related things.

Official References

Pandas merging, joining, and concatenating user guide: https://pandas.pydata.org/docs/user_guide/merging.html
Pandas merge API reference: https://pandas.pydata.org/docs/reference/api/pandas.merge.html
Pandas concat API reference: https://pandas.pydata.org/docs/reference/api/pandas.concat.html
Pandas DataFrame.join API reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html

Pandas Merge, Join, and Concat: Essential Techniques

AI Insights

Q1. Append January and February enrollment files into one clean DataFrame.

Q2. Create a full enrollment table with learner and course names.

Q3. Find learners who never enrolled.

Q4. Find courses that never received any enrollment.

Q5. Find total paid revenue by mentor.

Q6. Find students who enrolled in both months.

Q7. Find courses that were enrolled in January but not February.

Q8. Build learner-referrer pairs.

Q9. Create a review table with learner name, course name, and rating.

Q10. Use validate to confirm that enrollment-to-course is many-to-one.