#Pandas#pythonIntermediate

Pandas Merge, Join, and Concat: Essential Techniques

Jun 5, 2026
48 min read

AI Insights

Powered by GPT-4o-mini

Verified Context: pandas-merge-join-and-concat-essential-techniques
Quick Answer

Learn how to combine Pandas DataFrames with concat, merge, and join using original CSV datasets. Practice row appends, join types, self joins, anti joins, index joins, validation, indicators, suffixes, and realistic reporting workflows.

Quick Summary

Learn how to effectively merge, join, and concat DataFrames in Pandas with practical examples and tips for data analysis.

Pandas Merge, Join, and Concat: Combine DataFrames With Practice

Most real projects do not give you one perfect CSV.

You may get:

  • one file for January enrollments
  • one file for February enrollments
  • one file for learners
  • one file for courses
  • one file for mentors
  • one file for reviews

To answer useful questions, you need to combine these files correctly.

This is where pd.concat(), merge(), and join() become essential.

This guide uses original sample datasets created for this lesson. It does not use copied course, sports, movie, ecommerce, or public dataset examples.

Files used in this lesson:

  • pandas_join_learners.csv
  • pandas_join_courses.csv
  • pandas_join_mentors.csv
  • pandas_join_enrollments_jan.csv
  • pandas_join_enrollments_feb.csv
  • pandas_join_reviews.csv

Place all CSV files in the same folder as this Markdown file before running the examples.

What You Will Learn

By the end, you should be able to use:

  • pd.concat() for stacking DataFrames
  • ignore_index=True
  • keys= to create a MultiIndex while concatenating
  • axis=0 versus axis=1
  • merge() for database-style joins
  • inner, left, right, outer, and cross joins
  • on, left_on, and right_on
  • joining on multiple columns
  • suffixes
  • indicator=True
  • validate
  • self joins
  • semi joins and anti joins
  • index-based .join()
  • combine_first()
  • reporting workflows after merging

1. Setup

python
import pandas as pd
import numpy as np

Load the datasets:

python
learners = pd.read_csv("pandas_join_learners.csv")
courses = pd.read_csv("pandas_join_courses.csv")
mentors = pd.read_csv("pandas_join_mentors.csv")
jan = pd.read_csv("pandas_join_enrollments_jan.csv")
feb = pd.read_csv("pandas_join_enrollments_feb.csv")
reviews = pd.read_csv("pandas_join_reviews.csv")

Inspect the shapes:

python
for name, df in {
    "learners": learners,
    "courses": courses,
    "mentors": mentors,
    "jan": jan,
    "feb": feb,
    "reviews": reviews,
}.items():
    print(name, df.shape)

Look at the first few rows:

python
learners.head()
python
courses.head()
python
jan.head()

2. concat() For Stacking Rows

Use pd.concat() when two or more DataFrames have the same meaning and similar columns.

January enrollments and February enrollments are the same kind of data.

python
enrollments = pd.concat([jan, feb])
enrollments

Notice the index.

Each file kept its original index, so you may see duplicate index labels.

For most beginner workflows, reset the index while concatenating:

python
enrollments = pd.concat([jan, feb], ignore_index=True)
enrollments

Now the index is clean from 0 to n - 1.

Use this pattern for:

  • monthly sales files
  • daily logs
  • region-wise CSVs
  • scraped pages with the same schema
  • batches of survey responses

3. Add Source Labels With keys

Sometimes you want to remember which original file each row came from.

python
enrollments_by_month = pd.concat(
    [jan, feb],
    keys=["Jan", "Feb"]
)

enrollments_by_month

This creates a MultiIndex.

Get all January rows:

python
enrollments_by_month.loc["Jan"]

Get the first row inside February:

python
enrollments_by_month.loc[("Feb", 0)]

If you want the month label as a normal column:

python
enrollments_with_month = pd.concat(
    [jan, feb],
    keys=["Jan", "Feb"]
).reset_index(level=0).rename(columns={"level_0": "month"})

enrollments_with_month.head()

This is useful for month-wise revenue analysis.

4. concat() Side By Side With axis=1

axis=0 stacks rows.

axis=1 attaches columns side by side.

python
pd.concat([learners.head(3), courses.head(3)], axis=1)

This example is not a real relationship. It simply places columns next to each other based on index position.

Use axis=1 only when rows are already aligned by index and you know that row 0 in one DataFrame belongs with row 0 in another.

For most relationship-based combining, use merge().

5. merge() For Relationship-Based Joins

Use merge() when rows should match by one or more key columns.

Example:

python
enrollments.merge(learners, on="learner_id")

This matches enrollment rows to learner details using learner_id.

By default, merge() performs an inner join.

Inner join means:

  • keep rows with matching keys in both DataFrames
  • drop rows that do not match

6. Inner Join

Show only enrollments that have matching learner records:

python
enrollments.merge(learners, how="inner", on="learner_id")

Create a richer table by merging enrollments, learners, and courses:

python
enrollment_detail = (
    enrollments
    .merge(learners, how="inner", on="learner_id")
    .merge(courses, how="inner", on="course_id")
)

enrollment_detail.head()

Now each enrollment row has learner and course information.

7. Left Join

A left join keeps every row from the left DataFrame.

For example, keep all courses even if some courses have no enrollments:

python
courses.merge(enrollments, how="left", on="course_id")

Rows for courses without enrollments will have missing values in enrollment columns.

This is useful when the left table is your master list.

Examples:

  • all courses, whether enrolled or not
  • all customers, whether ordered or not
  • all employees, whether assigned or not
  • all products, whether sold or not

8. Right Join

A right join keeps every row from the right DataFrame.

python
learners.merge(enrollments, how="right", on="learner_id")

This keeps every enrollment row from the right table.

You can usually rewrite right joins as left joins by swapping the order of the DataFrames. Many teams prefer left joins because they are easier to read.

9. Outer Join

An outer join keeps all keys from both DataFrames.

python
learners.merge(enrollments, how="outer", on="learner_id")

This is useful for audits:

  • learners with enrollments
  • learners without enrollments
  • enrollment rows that reference missing learners

Outer joins help find data quality gaps.

10. Use indicator=True To Audit Matches

The indicator=True option adds a _merge column.

python
learner_enrollment_audit = learners.merge(
    enrollments,
    how="outer",
    on="learner_id",
    indicator=True
)

learner_enrollment_audit["_merge"].value_counts()

The _merge values are:

  • both: key exists in both tables
  • left_only: key exists only in the left table
  • right_only: key exists only in the right table

Find learners who have not enrolled:

python
learner_enrollment_audit[
    learner_enrollment_audit["_merge"] == "left_only"
]

Find enrollment records with missing learner profiles:

python
learner_enrollment_audit[
    learner_enrollment_audit["_merge"] == "right_only"
]

This is one of the best debugging tools for joins.

11. Anti Join

An anti join means: show rows from one table that do not have a match in another table.

Pandas does not have a direct anti_join() method, but you can build one with indicator=True.

Courses with no enrollment:

python
course_audit = courses.merge(
    enrollments,
    how="left",
    on="course_id",
    indicator=True
)

courses_without_enrollment = course_audit[
    course_audit["_merge"] == "left_only"
]

courses_without_enrollment[["course_id", "course_name", "category", "price"]]

Learners with no enrollment:

python
learner_audit = learners.merge(
    enrollments,
    how="left",
    on="learner_id",
    indicator=True
)

learners_without_enrollment = learner_audit[
    learner_audit["_merge"] == "left_only"
]

learners_without_enrollment[["learner_id", "learner", "city", "plan"]]

Anti joins are common in analytics work.

12. Semi Join

A semi join means: keep rows from one table if they have a match in another table.

Learners who enrolled at least once:

python
enrolled_learner_ids = enrollments["learner_id"].unique()

learners[learners["learner_id"].isin(enrolled_learner_ids)]

Courses that received at least one enrollment:

python
enrolled_course_ids = enrollments["course_id"].unique()

courses[courses["course_id"].isin(enrolled_course_ids)]

Use semi joins when you need the original table rows, not the combined table.

13. Merge With Different Column Names

The reviews dataset uses different key names:

  • student_ref points to learner_id
  • course_ref points to course_id

Use left_on and right_on.

python
reviews_with_learners = reviews.merge(
    learners,
    how="left",
    left_on="student_ref",
    right_on="learner_id"
)

reviews_with_learners.head()

Merge reviews with courses:

python
reviews_full = (
    reviews
    .merge(learners, how="left", left_on="student_ref", right_on="learner_id")
    .merge(courses, how="left", left_on="course_ref", right_on="course_id")
)

reviews_full[["review_id", "learner", "course_name", "rating", "review_text"]]

14. Merge On Multiple Columns

Sometimes one column is not enough to identify a match.

Create a small completion table:

python
completion_checks = pd.DataFrame({
    "learner_id": ["L001", "L002", "L006", "L010", "L015"],
    "course_id": ["C101", "C101", "C101", "C104", "C106"],
    "certificate_sent": [True, False, True, False, True]
})

Merge using both learner_id and course_id:

python
enrollments.merge(
    completion_checks,
    how="left",
    on=["learner_id", "course_id"]
)

Use multiple keys when the relationship depends on a combination of columns.

15. Handle Duplicate Column Names With suffixes

If both DataFrames contain columns with the same name, Pandas adds suffixes like _x and _y.

You can choose better suffixes.

python
mentor_course_table = courses.merge(
    mentors,
    how="left",
    on="mentor_id",
    suffixes=("_course", "_mentor")
)

mentor_course_table.head()

In this dataset, both tables have specialty, so custom suffixes keep the result readable.

16. Validate Join Relationships

validate helps catch accidental many-to-many joins.

For example, each course has one mentor, but one mentor can teach many courses.

python
courses.merge(
    mentors,
    how="left",
    on="mentor_id",
    validate="many_to_one"
)

Each enrollment belongs to one learner, but one learner can have many enrollments.

python
enrollments.merge(
    learners,
    how="left",
    on="learner_id",
    validate="many_to_one"
)

Common validation options:

OptionMeaning
one_to_oneeach key appears once in both tables
one_to_manyleft keys are unique, right keys may repeat
many_to_oneleft keys may repeat, right keys are unique
many_to_manykeys may repeat in both tables

Use validate when correctness matters.

It can save you from silent row explosions.

17. Many-To-Many Joins Can Expand Rows

If the same key appears multiple times in both tables, merge creates all matching combinations.

Example:

python
left = pd.DataFrame({
    "topic": ["pandas", "pandas", "sql"],
    "left_note": ["A", "B", "C"]
})

right = pd.DataFrame({
    "topic": ["pandas", "pandas", "numpy"],
    "right_note": ["X", "Y", "Z"]
})

left.merge(right, on="topic", how="inner")

The pandas rows multiply because there are two matches on each side.

Before merging, check key uniqueness:

python
learners["learner_id"].is_unique
python
courses["course_id"].is_unique
python
enrollments["learner_id"].is_unique

It is normal for enrollments to have repeated learner IDs, but it is not normal for the learner master table to have duplicate learner IDs.

18. Self Join

A self join means merging a table with itself.

In the learners dataset, referral_id points to another learner.

python
learners[["learner_id", "learner", "referral_id"]]

Merge learners with learners:

python
referral_map = learners.merge(
    learners,
    how="left",
    left_on="referral_id",
    right_on="learner_id",
    suffixes=("_learner", "_referrer")
)

referral_map[[
    "learner_id_learner",
    "learner_learner",
    "referral_id_learner",
    "learner_referrer"
]]

Self joins are useful for:

  • employee-manager relationships
  • user-referrer relationships
  • product replacement mappings
  • category-parent relationships

19. Index-Based .join()

.join() is mostly used for index-based joins.

Set learner ID as the index:

python
learners_by_id = learners.set_index("learner_id")
enrollments_by_learner = enrollments.set_index("learner_id")

Join enrollment rows with learner columns:

python
enrollments_by_learner.join(
    learners_by_id,
    how="left",
    rsuffix="_learner"
).head()

For beginners, merge() is usually clearer because the join keys are visible as columns.

Use .join() when your DataFrames are already indexed by the relationship key.

20. combine_first() For Filling Missing Values

combine_first() fills missing values in one object using values from another object with aligned indexes.

Example:

python
primary_prices = courses.set_index("course_id")["price"]

backup_prices = pd.Series({
    "C101": 1499,
    "C102": 999,
    "C103": 1299,
    "C104": 1199,
    "C105": 1999,
    "C106": 799
})

primary_prices.combine_first(backup_prices)

This is not a replacement for merge().

Use it when you want to fill missing aligned values.

21. Build A Complete Enrollment Report

Create one table with enrollment, learner, course, and mentor information.

python
enrollment_report = (
    enrollments
    .merge(learners, how="left", on="learner_id", validate="many_to_one")
    .merge(courses, how="left", on="course_id", validate="many_to_one")
    .merge(mentors, how="left", on="mentor_id", validate="many_to_one", suffixes=("_course", "_mentor"))
)

enrollment_report.head()

Select useful columns:

python
enrollment_report = enrollment_report[[
    "enrollment_id",
    "enrolled_at",
    "learner",
    "city",
    "plan",
    "course_name",
    "category",
    "price",
    "payment_status",
    "mentor",
    "coupon"
]]

enrollment_report.head()

22. Month-Wise Revenue

Use the month label from concat(keys=...).

python
enrollments_with_month = pd.concat(
    [jan, feb],
    keys=["Jan", "Feb"]
).reset_index(level=0).rename(columns={"level_0": "month"})

month_revenue = (
    enrollments_with_month
    .merge(courses, how="left", on="course_id")
    .query("payment_status == 'paid'")
    .groupby("month", as_index=False)["price"]
    .sum()
    .rename(columns={"price": "revenue"})
)

month_revenue

23. Course-Wise Revenue

python
course_revenue = (
    enrollments
    .merge(courses, how="left", on="course_id")
    .query("payment_status == 'paid'")
    .groupby(["course_id", "course_name"], as_index=False)["price"]
    .sum()
    .rename(columns={"price": "revenue"})
    .sort_values("revenue", ascending=False)
)

course_revenue

24. Top Learners By Spend

python
learner_spend = (
    enrollments
    .merge(learners, how="left", on="learner_id")
    .merge(courses, how="left", on="course_id")
    .query("payment_status == 'paid'")
    .groupby(["learner_id", "learner"], as_index=False)["price"]
    .sum()
    .rename(columns={"price": "total_spend"})
    .sort_values("total_spend", ascending=False)
)

learner_spend.head(5)

25. Courses With No Paid Enrollment

This is slightly different from courses with no enrollment at all.

python
paid_enrollments = enrollments[enrollments["payment_status"] == "paid"]

paid_course_audit = courses.merge(
    paid_enrollments,
    how="left",
    on="course_id",
    indicator=True
)

paid_course_audit[
    paid_course_audit["_merge"] == "left_only"
][["course_id", "course_name", "category", "price"]]

This helps identify courses that exist in the catalog but did not produce paid orders.

26. Review Report

Merge reviews with learners and courses.

python
review_report = (
    reviews
    .merge(learners, how="left", left_on="student_ref", right_on="learner_id")
    .merge(courses, how="left", left_on="course_ref", right_on="course_id")
)

review_report[[
    "review_id",
    "learner",
    "course_name",
    "rating",
    "review_text"
]]

Average rating by course:

python
review_report.groupby("course_name", as_index=False)["rating"].mean()

27. Cross Join

A cross join creates every combination of rows.

Example: every active mentor with every active course category.

python
mentor_names = mentors[["mentor_id", "mentor"]].head(3)
categories = pd.DataFrame({"category": ["Python", "Data", "Project"]})

mentor_names.merge(categories, how="cross")

Use cross joins carefully.

If the left table has 1000 rows and the right table has 1000 rows, the result has 1,000,000 rows.

28. Common Mistakes

Mistake 1: Using concat() when you need merge()

This only places rows or columns together:

python
pd.concat([learners, enrollments])

Explanation

  • The pd.concat() function vertically stacks two pandas DataFrames (learners and enrollments) by concatenating their rows
  • This operation preserves all columns from both DataFrames while appending the rows of the second DataFrame below the first
  • The resulting DataFrame contains combined data from both sources, useful for merging related datasets
  • Column names must match between DataFrames for proper alignment during concatenation
  • This technique is commonly used in data processing workflows to aggregate multiple data sources into a single dataset

It does not match learner details to enrollments.

For relationship matching, use:

python
enrollments.merge(learners, on="learner_id")

Explanation

  • Combines two datasets (enrollments and learners) based on matching learner IDs to create a unified view of enrollment details alongside learner demographics
  • The merge operation aligns records from both tables where the "learner_id" field values match between the datasets
  • This creates a comprehensive dataset that includes both enrollment information and corresponding learner attributes in a single dataframe
  • The resulting merged dataset enables analysis of learner behavior, enrollment patterns, and demographic breakdowns simultaneously
  • Commonly used in educational analytics to understand student engagement and course participation metrics

Mistake 2: Forgetting ignore_index=True

When stacking files, duplicated index labels can confuse later code.

python
pd.concat([jan, feb], ignore_index=True)

Explanation

  • Concatenates two pandas DataFrame objects (jan and feb) along the row axis (axis=0 by default)
  • The ignore_index=True parameter creates a new continuous integer index starting from zero
  • Preserves all columns from both DataFrames without any merging or joining operations
  • Useful for combining time-series data or appending datasets with identical column structures
  • Returns a new DataFrame object containing all rows from both input DataFrames sequentially

Mistake 3: Ignoring unmatched rows

An inner join can silently drop rows.

Audit with:

python
learners.merge(enrollments, how="outer", on="learner_id", indicator=True)

Explanation

  • Performs an outer join operation between learners and enrollments DataFrames based on the common learner_id column
  • The indicator=True parameter adds a special column showing which DataFrame each record came from (left_only, right_only, both)
  • Preserves all records from both datasets regardless of matching conditions, making it ideal for identifying missing enrollments or unenrolled learners
  • Useful for data quality analysis to spot records that exist in one dataset but not the other
  • Returns a merged DataFrame that can be used to analyze complete enrollment coverage across all learners

Mistake 4: Not checking key uniqueness

Before merging master tables, check:

python
learners["learner_id"].is_unique

Explanation

  • The code checks whether all values in the 'learner_id' column of the learners DataFrame are distinct and non-duplicate
  • The is_unique property returns a boolean value (True/False) indicating if all elements in the Series are unique
  • This is useful for data validation to ensure each learner has a unique identifier
  • The operation is efficient as it leverages pandas' optimized internal methods for checking uniqueness
  • This check helps maintain data integrity before performing operations that require unique identifiers
python
courses["course_id"].is_unique

Explanation

  • This code checks whether all values in the 'course_id' column of the courses DataFrame are distinct
  • The .is_unique property returns a boolean value (True/False) indicating if there are no duplicate entries
  • When True, it confirms that each course ID appears only once in the dataset
  • When False, it indicates there are duplicate course IDs that need to be addressed
  • This validation is essential for ensuring data integrity in educational course management systems

Mistake 5: Letting duplicate columns become confusing

Use meaningful suffixes:

python
courses.merge(
    mentors,
    on="mentor_id",
    suffixes=("_course", "_mentor")
)

Explanation

  • Combines two pandas DataFrames (courses and mentors) based on their common "mentor_id" column
  • Uses suffixes parameter to distinguish columns with identical names between the two dataframes
  • Appends "_course" and "_mentor" suffixes to overlapping column names respectively
  • Creates a unified dataset that preserves all course information while adding corresponding mentor details
  • Returns a new merged DataFrame containing combined data from both source datasets

29. Practice Questions

Try these before looking at the solutions.

Practice Task

Q1. Append January and February enrollment files into one clean DataFrame.

python
enrollments = pd.concat([jan, feb], ignore_index=True)
enrollments.head()

Explanation

  • The code merges two pandas DataFrames (jan and feb) into a single DataFrame called enrollments by stacking them vertically
  • The ignore_index=True parameter resets the row indices to create a continuous sequential index starting from zero
  • This operation is useful for combining monthly enrollment data into a unified dataset for analysis
  • The head() method displays the first 5 rows of the combined DataFrame to verify the concatenation worked correctly
  • The resulting enrollments DataFrame contains all records from both input DataFrames in chronological order

Practice Task

Q2. Create a full enrollment table with learner and course names.

python
full_enrollments = (
    enrollments
    .merge(learners, how="left", on="learner_id")
    .merge(courses, how="left", on="course_id")
)

full_enrollments[["enrollment_id", "learner", "course_name", "payment_status"]]

Explanation

  • Combines three datasets (enrollments, learners, and courses) using left joins to preserve all enrollment records while adding corresponding learner and course details
  • Creates a unified dataframe that links student enrollments to specific courses and learner information through common identifier fields
  • Selects only the most relevant columns (enrollment ID, learner name, course name, and payment status) for reporting or analysis purposes
  • The merge operations maintain data integrity by matching records based on learner_id and course_id foreign keys
  • This structured approach enables easy tracking of which students are enrolled in which courses and their payment statuses

Practice Task

Q3. Find learners who never enrolled.

python
audit = learners.merge(
    enrollments,
    how="left",
    on="learner_id",
    indicator=True
)

audit[audit["_merge"] == "left_only"][["learner_id", "learner", "city"]]

Explanation

  • Performs a left join between learners and enrollments datasets using learner_id as the common key
  • Adds a merge indicator column to track which records exist in both datasets versus only in the left dataset
  • Filters results to show only records that exist in learners but not in enrollments (left_only matches)
  • Returns specific columns (learner_id, learner name, and city) for audit purposes
  • Useful for data quality checks to identify unenrolled learners in the system

Practice Task

Q4. Find courses that never received any enrollment.

python
audit = courses.merge(
    enrollments,
    how="left",
    on="course_id",
    indicator=True
)

audit[audit["_merge"] == "left_only"][["course_id", "course_name"]]

Explanation

  • Performs a left join between courses and enrollments dataframes using course_id as the merging column
  • The indicator parameter tracks which records exist in each dataframe, creating a _merge column with values like "both", "left_only", or "right_only"
  • Filters results to show only records where the course exists in the courses table but has no matching enrollment record
  • Returns specific columns (course_id and course_name) for courses that have no enrollments
  • Useful for audit purposes to identify inactive or unenrolled courses in a learning management system

Practice Task

Q5. Find total paid revenue by mentor.

python
mentor_revenue = (
    enrollments
    .merge(courses, how="left", on="course_id")
    .merge(mentors, how="left", on="mentor_id")
    .query("payment_status == 'paid'")
    .groupby(["mentor_id", "mentor"], as_index=False)["price"]
    .sum()
    .rename(columns={"price": "revenue"})
    .sort_values("revenue", ascending=False)
)

mentor_revenue

Explanation

  • Merges three datasets (enrollments, courses, and mentors) using left joins to combine enrollment data with course pricing and mentor information
  • Filters the merged dataset to include only paid enrollments using query method
  • Groups the filtered data by mentor ID and name, then sums the price column to calculate total revenue per mentor
  • Renames the summed price column to "revenue" for clarity and readability
  • Sorts the final result in descending order of revenue to show highest earning mentors first

Practice Task

Q6. Find students who enrolled in both months.

python
jan_learners = set(jan["learner_id"])
feb_learners = set(feb["learner_id"])

both_months = jan_learners.intersection(feb_learners)

learners[learners["learner_id"].isin(both_months)]

Explanation

  • Creates two sets containing unique learner IDs from January and February datasets respectively
  • Uses set intersection operation to identify learners who exist in both months
  • Filters the main learners dataframe to return only records where learner_id matches those found in both months
  • This approach efficiently identifies overlapping data between time periods using set operations

Practice Task

Q7. Find courses that were enrolled in January but not February.

python
jan_courses = set(jan["course_id"])
feb_courses = set(feb["course_id"])

jan_only_courses = jan_courses.difference(feb_courses)

courses[courses["course_id"].isin(jan_only_courses)]

Explanation

  • Creates two sets from course IDs in January and February dataframes to enable set operations
  • Uses the difference method to identify courses that exist in January but not in February
  • Filters the main courses dataframe to return only rows where course IDs match the January-only courses
  • This approach efficiently identifies courses that were exclusively offered during the January period

Practice Task

Q8. Build learner-referrer pairs.

python
referral_pairs = learners.merge(
    learners,
    how="left",
    left_on="referral_id",
    right_on="learner_id",
    suffixes=("_learner", "_referrer")
)

referral_pairs[["learner_learner", "learner_referrer"]]

Explanation

  • Merges the learners DataFrame with itself using a left join to establish referral relationships between learners and their referrers
  • Links records where the referral_id from one learner matches the learner_id of another learner to create referral pairs
  • Uses suffixes to distinguish between columns from the left (learner) and right (referrer) sides of the merge operation
  • Selects specific columns to display the learner and referrer identifiers from the resulting joined dataset
  • This approach enables analysis of referral patterns and tracking of which learners referred others in the system

Practice Task

Q9. Create a review table with learner name, course name, and rating.

python
review_table = (
    reviews
    .merge(learners, how="left", left_on="student_ref", right_on="learner_id")
    .merge(courses, how="left", left_on="course_ref", right_on="course_id")
)

review_table[["review_id", "learner", "course_name", "rating"]]

Explanation

  • The code performs two sequential merge operations to combine review data with corresponding learner and course information using foreign key relationships
  • First merge joins the reviews dataset with learners table on student_ref = learner_id to add learner details to each review
  • Second merge joins the result with courses table on course_ref = course_id to include course names and other course-related information
  • The final selection filters the merged dataset to show only specific columns: review identifier, learner name, course name, and rating score
  • This approach creates a unified view that connects individual reviews to both the students who wrote them and the courses they reviewed

Practice Task

Q10. Use validate to confirm that enrollment-to-course is many-to-one.

python
enrollments.merge(
    courses,
    how="left",
    on="course_id",
    validate="many_to_one"
)

Explanation

  • Performs a left join operation between enrollments and courses DataFrames based on the common course_id column
  • Preserves all records from the enrollments DataFrame while adding matching course data where available
  • Validates the merge relationship to ensure each enrollment maps to at most one course record
  • Useful for enriching enrollment data with course details like titles, credits, or department information
  • The many_to_one validation prevents duplicate course entries that could corrupt the merged dataset

30. Interview Questions

1. What is the difference between concat() and merge()?

concat() stacks or attaches DataFrames along an axis. merge() combines DataFrames by matching key columns.

2. When should you use ignore_index=True?

Use it when stacking rows and you want a clean new index instead of preserving old indexes from each input DataFrame.

3. What is an inner join?

An inner join keeps only matching keys from both DataFrames.

4. What is a left join?

A left join keeps all rows from the left DataFrame and adds matching data from the right DataFrame.

5. What is an outer join?

An outer join keeps all keys from both DataFrames and uses missing values where no match exists.

6. What does indicator=True do?

It adds a _merge column that tells whether each row came from the left table only, right table only, or both tables.

7. What is an anti join?

An anti join returns rows from one table that do not have a match in another table.

8. What is a self join?

A self join merges a DataFrame with itself, usually to resolve parent-child or referral relationships.

9. Why should you use validate?

It checks whether the merge relationship matches your expectation, such as many_to_one. This helps catch accidental duplicate keys.

10. Why can many-to-many joins be risky?

They can multiply rows because every matching left row combines with every matching right row for the same key.

11. When should you use .join()?

Use .join() when DataFrames are already indexed by the relationship key.

12. What is a cross join?

A cross join returns every possible combination of rows from both DataFrames.

31. Final Mental Model

Use this table:

NeedUse
Stack rows from similar filespd.concat([...], ignore_index=True)
Keep source labels while stackingpd.concat([...], keys=[...])
Match rows by key columnsmerge()
Keep all rows from the left tablehow="left"
Keep only matching rowshow="inner"
Keep all keys from both tableshow="outer"
Audit unmatched rowsindicator=True
Catch wrong relationship assumptionsvalidate
Merge using different key namesleft_on, right_on
Merge a table with itselfself join
Join by index.join()

If you remember only one rule, remember this:

Use concat() when files are the same kind of thing. Use merge() when files describe related things.

Official References