Pandas Merge, Join, and Concat: Combine DataFrames With Practice
Most real projects do not give you one perfect CSV.
You may get:
- one file for January enrollments
- one file for February enrollments
- one file for learners
- one file for courses
- one file for mentors
- one file for reviews
To answer useful questions, you need to combine these files correctly.
This is where pd.concat(), merge(), and join() become essential.
This guide uses original sample datasets created for this lesson. It does not use copied course, sports, movie, ecommerce, or public dataset examples.
Files used in this lesson:
pandas_join_learners.csvpandas_join_courses.csvpandas_join_mentors.csvpandas_join_enrollments_jan.csvpandas_join_enrollments_feb.csvpandas_join_reviews.csv
Place all CSV files in the same folder as this Markdown file before running the examples.
What You Will Learn
By the end, you should be able to use:
pd.concat()for stacking DataFramesignore_index=Truekeys=to create a MultiIndex while concatenatingaxis=0versusaxis=1merge()for database-style joins- inner, left, right, outer, and cross joins
on,left_on, andright_on- joining on multiple columns
suffixesindicator=Truevalidate- self joins
- semi joins and anti joins
- index-based
.join() combine_first()- reporting workflows after merging
1. Setup
import pandas as pd
import numpy as npLoad the datasets:
learners = pd.read_csv("pandas_join_learners.csv")
courses = pd.read_csv("pandas_join_courses.csv")
mentors = pd.read_csv("pandas_join_mentors.csv")
jan = pd.read_csv("pandas_join_enrollments_jan.csv")
feb = pd.read_csv("pandas_join_enrollments_feb.csv")
reviews = pd.read_csv("pandas_join_reviews.csv")Inspect the shapes:
for name, df in {
"learners": learners,
"courses": courses,
"mentors": mentors,
"jan": jan,
"feb": feb,
"reviews": reviews,
}.items():
print(name, df.shape)Look at the first few rows:
learners.head()courses.head()jan.head()2. concat() For Stacking Rows
Use pd.concat() when two or more DataFrames have the same meaning and similar columns.
January enrollments and February enrollments are the same kind of data.
enrollments = pd.concat([jan, feb])
enrollmentsNotice the index.
Each file kept its original index, so you may see duplicate index labels.
For most beginner workflows, reset the index while concatenating:
enrollments = pd.concat([jan, feb], ignore_index=True)
enrollmentsNow the index is clean from 0 to n - 1.
Use this pattern for:
- monthly sales files
- daily logs
- region-wise CSVs
- scraped pages with the same schema
- batches of survey responses
3. Add Source Labels With keys
Sometimes you want to remember which original file each row came from.
enrollments_by_month = pd.concat(
[jan, feb],
keys=["Jan", "Feb"]
)
enrollments_by_monthThis creates a MultiIndex.
Get all January rows:
enrollments_by_month.loc["Jan"]Get the first row inside February:
enrollments_by_month.loc[("Feb", 0)]If you want the month label as a normal column:
enrollments_with_month = pd.concat(
[jan, feb],
keys=["Jan", "Feb"]
).reset_index(level=0).rename(columns={"level_0": "month"})
enrollments_with_month.head()This is useful for month-wise revenue analysis.
4. concat() Side By Side With axis=1
axis=0 stacks rows.
axis=1 attaches columns side by side.
pd.concat([learners.head(3), courses.head(3)], axis=1)This example is not a real relationship. It simply places columns next to each other based on index position.
Use axis=1 only when rows are already aligned by index and you know that row 0 in one DataFrame belongs with row 0 in another.
For most relationship-based combining, use merge().
5. merge() For Relationship-Based Joins
Use merge() when rows should match by one or more key columns.
Example:
enrollments.merge(learners, on="learner_id")This matches enrollment rows to learner details using learner_id.
By default, merge() performs an inner join.
Inner join means:
- keep rows with matching keys in both DataFrames
- drop rows that do not match
6. Inner Join
Show only enrollments that have matching learner records:
enrollments.merge(learners, how="inner", on="learner_id")Create a richer table by merging enrollments, learners, and courses:
enrollment_detail = (
enrollments
.merge(learners, how="inner", on="learner_id")
.merge(courses, how="inner", on="course_id")
)
enrollment_detail.head()Now each enrollment row has learner and course information.
7. Left Join
A left join keeps every row from the left DataFrame.
For example, keep all courses even if some courses have no enrollments:
courses.merge(enrollments, how="left", on="course_id")Rows for courses without enrollments will have missing values in enrollment columns.
This is useful when the left table is your master list.
Examples:
- all courses, whether enrolled or not
- all customers, whether ordered or not
- all employees, whether assigned or not
- all products, whether sold or not
8. Right Join
A right join keeps every row from the right DataFrame.
learners.merge(enrollments, how="right", on="learner_id")This keeps every enrollment row from the right table.
You can usually rewrite right joins as left joins by swapping the order of the DataFrames. Many teams prefer left joins because they are easier to read.
9. Outer Join
An outer join keeps all keys from both DataFrames.
learners.merge(enrollments, how="outer", on="learner_id")This is useful for audits:
- learners with enrollments
- learners without enrollments
- enrollment rows that reference missing learners
Outer joins help find data quality gaps.
10. Use indicator=True To Audit Matches
The indicator=True option adds a _merge column.
learner_enrollment_audit = learners.merge(
enrollments,
how="outer",
on="learner_id",
indicator=True
)
learner_enrollment_audit["_merge"].value_counts()The _merge values are:
both: key exists in both tablesleft_only: key exists only in the left tableright_only: key exists only in the right table
Find learners who have not enrolled:
learner_enrollment_audit[
learner_enrollment_audit["_merge"] == "left_only"
]Find enrollment records with missing learner profiles:
learner_enrollment_audit[
learner_enrollment_audit["_merge"] == "right_only"
]This is one of the best debugging tools for joins.
11. Anti Join
An anti join means: show rows from one table that do not have a match in another table.
Pandas does not have a direct anti_join() method, but you can build one with indicator=True.
Courses with no enrollment:
course_audit = courses.merge(
enrollments,
how="left",
on="course_id",
indicator=True
)
courses_without_enrollment = course_audit[
course_audit["_merge"] == "left_only"
]
courses_without_enrollment[["course_id", "course_name", "category", "price"]]Learners with no enrollment:
learner_audit = learners.merge(
enrollments,
how="left",
on="learner_id",
indicator=True
)
learners_without_enrollment = learner_audit[
learner_audit["_merge"] == "left_only"
]
learners_without_enrollment[["learner_id", "learner", "city", "plan"]]Anti joins are common in analytics work.
12. Semi Join
A semi join means: keep rows from one table if they have a match in another table.
Learners who enrolled at least once:
enrolled_learner_ids = enrollments["learner_id"].unique()
learners[learners["learner_id"].isin(enrolled_learner_ids)]Courses that received at least one enrollment:
enrolled_course_ids = enrollments["course_id"].unique()
courses[courses["course_id"].isin(enrolled_course_ids)]Use semi joins when you need the original table rows, not the combined table.
13. Merge With Different Column Names
The reviews dataset uses different key names:
student_refpoints tolearner_idcourse_refpoints tocourse_id
Use left_on and right_on.
reviews_with_learners = reviews.merge(
learners,
how="left",
left_on="student_ref",
right_on="learner_id"
)
reviews_with_learners.head()Merge reviews with courses:
reviews_full = (
reviews
.merge(learners, how="left", left_on="student_ref", right_on="learner_id")
.merge(courses, how="left", left_on="course_ref", right_on="course_id")
)
reviews_full[["review_id", "learner", "course_name", "rating", "review_text"]]14. Merge On Multiple Columns
Sometimes one column is not enough to identify a match.
Create a small completion table:
completion_checks = pd.DataFrame({
"learner_id": ["L001", "L002", "L006", "L010", "L015"],
"course_id": ["C101", "C101", "C101", "C104", "C106"],
"certificate_sent": [True, False, True, False, True]
})Merge using both learner_id and course_id:
enrollments.merge(
completion_checks,
how="left",
on=["learner_id", "course_id"]
)Use multiple keys when the relationship depends on a combination of columns.
15. Handle Duplicate Column Names With suffixes
If both DataFrames contain columns with the same name, Pandas adds suffixes like _x and _y.
You can choose better suffixes.
mentor_course_table = courses.merge(
mentors,
how="left",
on="mentor_id",
suffixes=("_course", "_mentor")
)
mentor_course_table.head()In this dataset, both tables have specialty, so custom suffixes keep the result readable.
16. Validate Join Relationships
validate helps catch accidental many-to-many joins.
For example, each course has one mentor, but one mentor can teach many courses.
courses.merge(
mentors,
how="left",
on="mentor_id",
validate="many_to_one"
)Each enrollment belongs to one learner, but one learner can have many enrollments.
enrollments.merge(
learners,
how="left",
on="learner_id",
validate="many_to_one"
)Common validation options:
| Option | Meaning |
|---|---|
one_to_one | each key appears once in both tables |
one_to_many | left keys are unique, right keys may repeat |
many_to_one | left keys may repeat, right keys are unique |
many_to_many | keys may repeat in both tables |
Use validate when correctness matters.
It can save you from silent row explosions.
17. Many-To-Many Joins Can Expand Rows
If the same key appears multiple times in both tables, merge creates all matching combinations.
Example:
left = pd.DataFrame({
"topic": ["pandas", "pandas", "sql"],
"left_note": ["A", "B", "C"]
})
right = pd.DataFrame({
"topic": ["pandas", "pandas", "numpy"],
"right_note": ["X", "Y", "Z"]
})
left.merge(right, on="topic", how="inner")The pandas rows multiply because there are two matches on each side.
Before merging, check key uniqueness:
learners["learner_id"].is_uniquecourses["course_id"].is_uniqueenrollments["learner_id"].is_uniqueIt is normal for enrollments to have repeated learner IDs, but it is not normal for the learner master table to have duplicate learner IDs.
18. Self Join
A self join means merging a table with itself.
In the learners dataset, referral_id points to another learner.
learners[["learner_id", "learner", "referral_id"]]Merge learners with learners:
referral_map = learners.merge(
learners,
how="left",
left_on="referral_id",
right_on="learner_id",
suffixes=("_learner", "_referrer")
)
referral_map[[
"learner_id_learner",
"learner_learner",
"referral_id_learner",
"learner_referrer"
]]Self joins are useful for:
- employee-manager relationships
- user-referrer relationships
- product replacement mappings
- category-parent relationships
19. Index-Based .join()
.join() is mostly used for index-based joins.
Set learner ID as the index:
learners_by_id = learners.set_index("learner_id")
enrollments_by_learner = enrollments.set_index("learner_id")Join enrollment rows with learner columns:
enrollments_by_learner.join(
learners_by_id,
how="left",
rsuffix="_learner"
).head()For beginners, merge() is usually clearer because the join keys are visible as columns.
Use .join() when your DataFrames are already indexed by the relationship key.
20. combine_first() For Filling Missing Values
combine_first() fills missing values in one object using values from another object with aligned indexes.
Example:
primary_prices = courses.set_index("course_id")["price"]
backup_prices = pd.Series({
"C101": 1499,
"C102": 999,
"C103": 1299,
"C104": 1199,
"C105": 1999,
"C106": 799
})
primary_prices.combine_first(backup_prices)This is not a replacement for merge().
Use it when you want to fill missing aligned values.
21. Build A Complete Enrollment Report
Create one table with enrollment, learner, course, and mentor information.
enrollment_report = (
enrollments
.merge(learners, how="left", on="learner_id", validate="many_to_one")
.merge(courses, how="left", on="course_id", validate="many_to_one")
.merge(mentors, how="left", on="mentor_id", validate="many_to_one", suffixes=("_course", "_mentor"))
)
enrollment_report.head()Select useful columns:
enrollment_report = enrollment_report[[
"enrollment_id",
"enrolled_at",
"learner",
"city",
"plan",
"course_name",
"category",
"price",
"payment_status",
"mentor",
"coupon"
]]
enrollment_report.head()22. Month-Wise Revenue
Use the month label from concat(keys=...).
enrollments_with_month = pd.concat(
[jan, feb],
keys=["Jan", "Feb"]
).reset_index(level=0).rename(columns={"level_0": "month"})
month_revenue = (
enrollments_with_month
.merge(courses, how="left", on="course_id")
.query("payment_status == 'paid'")
.groupby("month", as_index=False)["price"]
.sum()
.rename(columns={"price": "revenue"})
)
month_revenue23. Course-Wise Revenue
course_revenue = (
enrollments
.merge(courses, how="left", on="course_id")
.query("payment_status == 'paid'")
.groupby(["course_id", "course_name"], as_index=False)["price"]
.sum()
.rename(columns={"price": "revenue"})
.sort_values("revenue", ascending=False)
)
course_revenue24. Top Learners By Spend
learner_spend = (
enrollments
.merge(learners, how="left", on="learner_id")
.merge(courses, how="left", on="course_id")
.query("payment_status == 'paid'")
.groupby(["learner_id", "learner"], as_index=False)["price"]
.sum()
.rename(columns={"price": "total_spend"})
.sort_values("total_spend", ascending=False)
)
learner_spend.head(5)25. Courses With No Paid Enrollment
This is slightly different from courses with no enrollment at all.
paid_enrollments = enrollments[enrollments["payment_status"] == "paid"]
paid_course_audit = courses.merge(
paid_enrollments,
how="left",
on="course_id",
indicator=True
)
paid_course_audit[
paid_course_audit["_merge"] == "left_only"
][["course_id", "course_name", "category", "price"]]This helps identify courses that exist in the catalog but did not produce paid orders.
26. Review Report
Merge reviews with learners and courses.
review_report = (
reviews
.merge(learners, how="left", left_on="student_ref", right_on="learner_id")
.merge(courses, how="left", left_on="course_ref", right_on="course_id")
)
review_report[[
"review_id",
"learner",
"course_name",
"rating",
"review_text"
]]Average rating by course:
review_report.groupby("course_name", as_index=False)["rating"].mean()27. Cross Join
A cross join creates every combination of rows.
Example: every active mentor with every active course category.
mentor_names = mentors[["mentor_id", "mentor"]].head(3)
categories = pd.DataFrame({"category": ["Python", "Data", "Project"]})
mentor_names.merge(categories, how="cross")Use cross joins carefully.
If the left table has 1000 rows and the right table has 1000 rows, the result has 1,000,000 rows.
28. Common Mistakes
Mistake 1: Using concat() when you need merge()
This only places rows or columns together:
pd.concat([learners, enrollments])Explanation
- The pd.concat() function vertically stacks two pandas DataFrames (learners and enrollments) by concatenating their rows
- This operation preserves all columns from both DataFrames while appending the rows of the second DataFrame below the first
- The resulting DataFrame contains combined data from both sources, useful for merging related datasets
- Column names must match between DataFrames for proper alignment during concatenation
- This technique is commonly used in data processing workflows to aggregate multiple data sources into a single dataset
It does not match learner details to enrollments.
For relationship matching, use:
enrollments.merge(learners, on="learner_id")Explanation
- Combines two datasets (enrollments and learners) based on matching learner IDs to create a unified view of enrollment details alongside learner demographics
- The merge operation aligns records from both tables where the "learner_id" field values match between the datasets
- This creates a comprehensive dataset that includes both enrollment information and corresponding learner attributes in a single dataframe
- The resulting merged dataset enables analysis of learner behavior, enrollment patterns, and demographic breakdowns simultaneously
- Commonly used in educational analytics to understand student engagement and course participation metrics
Mistake 2: Forgetting ignore_index=True
When stacking files, duplicated index labels can confuse later code.
pd.concat([jan, feb], ignore_index=True)Explanation
- Concatenates two pandas DataFrame objects (jan and feb) along the row axis (axis=0 by default)
- The ignore_index=True parameter creates a new continuous integer index starting from zero
- Preserves all columns from both DataFrames without any merging or joining operations
- Useful for combining time-series data or appending datasets with identical column structures
- Returns a new DataFrame object containing all rows from both input DataFrames sequentially
Mistake 3: Ignoring unmatched rows
An inner join can silently drop rows.
Audit with:
learners.merge(enrollments, how="outer", on="learner_id", indicator=True)Explanation
- Performs an outer join operation between learners and enrollments DataFrames based on the common learner_id column
- The indicator=True parameter adds a special column showing which DataFrame each record came from (left_only, right_only, both)
- Preserves all records from both datasets regardless of matching conditions, making it ideal for identifying missing enrollments or unenrolled learners
- Useful for data quality analysis to spot records that exist in one dataset but not the other
- Returns a merged DataFrame that can be used to analyze complete enrollment coverage across all learners
Mistake 4: Not checking key uniqueness
Before merging master tables, check:
learners["learner_id"].is_uniqueExplanation
- The code checks whether all values in the 'learner_id' column of the learners DataFrame are distinct and non-duplicate
- The is_unique property returns a boolean value (True/False) indicating if all elements in the Series are unique
- This is useful for data validation to ensure each learner has a unique identifier
- The operation is efficient as it leverages pandas' optimized internal methods for checking uniqueness
- This check helps maintain data integrity before performing operations that require unique identifiers
courses["course_id"].is_uniqueExplanation
- This code checks whether all values in the 'course_id' column of the courses DataFrame are distinct
- The .is_unique property returns a boolean value (True/False) indicating if there are no duplicate entries
- When True, it confirms that each course ID appears only once in the dataset
- When False, it indicates there are duplicate course IDs that need to be addressed
- This validation is essential for ensuring data integrity in educational course management systems
Mistake 5: Letting duplicate columns become confusing
Use meaningful suffixes:
courses.merge(
mentors,
on="mentor_id",
suffixes=("_course", "_mentor")
)Explanation
- Combines two pandas DataFrames (courses and mentors) based on their common "mentor_id" column
- Uses suffixes parameter to distinguish columns with identical names between the two dataframes
- Appends "_course" and "_mentor" suffixes to overlapping column names respectively
- Creates a unified dataset that preserves all course information while adding corresponding mentor details
- Returns a new merged DataFrame containing combined data from both source datasets
29. Practice Questions
Try these before looking at the solutions.
Practice Task
Q1. Append January and February enrollment files into one clean DataFrame.
enrollments = pd.concat([jan, feb], ignore_index=True)
enrollments.head()Explanation
- The code merges two pandas DataFrames (jan and feb) into a single DataFrame called enrollments by stacking them vertically
- The ignore_index=True parameter resets the row indices to create a continuous sequential index starting from zero
- This operation is useful for combining monthly enrollment data into a unified dataset for analysis
- The head() method displays the first 5 rows of the combined DataFrame to verify the concatenation worked correctly
- The resulting enrollments DataFrame contains all records from both input DataFrames in chronological order
Practice Task
Q2. Create a full enrollment table with learner and course names.
full_enrollments = (
enrollments
.merge(learners, how="left", on="learner_id")
.merge(courses, how="left", on="course_id")
)
full_enrollments[["enrollment_id", "learner", "course_name", "payment_status"]]Explanation
- Combines three datasets (enrollments, learners, and courses) using left joins to preserve all enrollment records while adding corresponding learner and course details
- Creates a unified dataframe that links student enrollments to specific courses and learner information through common identifier fields
- Selects only the most relevant columns (enrollment ID, learner name, course name, and payment status) for reporting or analysis purposes
- The merge operations maintain data integrity by matching records based on learner_id and course_id foreign keys
- This structured approach enables easy tracking of which students are enrolled in which courses and their payment statuses
Practice Task
Q3. Find learners who never enrolled.
audit = learners.merge(
enrollments,
how="left",
on="learner_id",
indicator=True
)
audit[audit["_merge"] == "left_only"][["learner_id", "learner", "city"]]Explanation
- Performs a left join between learners and enrollments datasets using learner_id as the common key
- Adds a merge indicator column to track which records exist in both datasets versus only in the left dataset
- Filters results to show only records that exist in learners but not in enrollments (left_only matches)
- Returns specific columns (learner_id, learner name, and city) for audit purposes
- Useful for data quality checks to identify unenrolled learners in the system
Practice Task
Q4. Find courses that never received any enrollment.
audit = courses.merge(
enrollments,
how="left",
on="course_id",
indicator=True
)
audit[audit["_merge"] == "left_only"][["course_id", "course_name"]]Explanation
- Performs a left join between courses and enrollments dataframes using course_id as the merging column
- The indicator parameter tracks which records exist in each dataframe, creating a _merge column with values like "both", "left_only", or "right_only"
- Filters results to show only records where the course exists in the courses table but has no matching enrollment record
- Returns specific columns (course_id and course_name) for courses that have no enrollments
- Useful for audit purposes to identify inactive or unenrolled courses in a learning management system
Practice Task
Q5. Find total paid revenue by mentor.
mentor_revenue = (
enrollments
.merge(courses, how="left", on="course_id")
.merge(mentors, how="left", on="mentor_id")
.query("payment_status == 'paid'")
.groupby(["mentor_id", "mentor"], as_index=False)["price"]
.sum()
.rename(columns={"price": "revenue"})
.sort_values("revenue", ascending=False)
)
mentor_revenueExplanation
- Merges three datasets (enrollments, courses, and mentors) using left joins to combine enrollment data with course pricing and mentor information
- Filters the merged dataset to include only paid enrollments using query method
- Groups the filtered data by mentor ID and name, then sums the price column to calculate total revenue per mentor
- Renames the summed price column to "revenue" for clarity and readability
- Sorts the final result in descending order of revenue to show highest earning mentors first
Practice Task
Q6. Find students who enrolled in both months.
jan_learners = set(jan["learner_id"])
feb_learners = set(feb["learner_id"])
both_months = jan_learners.intersection(feb_learners)
learners[learners["learner_id"].isin(both_months)]Explanation
- Creates two sets containing unique learner IDs from January and February datasets respectively
- Uses set intersection operation to identify learners who exist in both months
- Filters the main learners dataframe to return only records where learner_id matches those found in both months
- This approach efficiently identifies overlapping data between time periods using set operations
Practice Task
Q7. Find courses that were enrolled in January but not February.
jan_courses = set(jan["course_id"])
feb_courses = set(feb["course_id"])
jan_only_courses = jan_courses.difference(feb_courses)
courses[courses["course_id"].isin(jan_only_courses)]Explanation
- Creates two sets from course IDs in January and February dataframes to enable set operations
- Uses the difference method to identify courses that exist in January but not in February
- Filters the main courses dataframe to return only rows where course IDs match the January-only courses
- This approach efficiently identifies courses that were exclusively offered during the January period
Practice Task
Q8. Build learner-referrer pairs.
referral_pairs = learners.merge(
learners,
how="left",
left_on="referral_id",
right_on="learner_id",
suffixes=("_learner", "_referrer")
)
referral_pairs[["learner_learner", "learner_referrer"]]Explanation
- Merges the learners DataFrame with itself using a left join to establish referral relationships between learners and their referrers
- Links records where the referral_id from one learner matches the learner_id of another learner to create referral pairs
- Uses suffixes to distinguish between columns from the left (learner) and right (referrer) sides of the merge operation
- Selects specific columns to display the learner and referrer identifiers from the resulting joined dataset
- This approach enables analysis of referral patterns and tracking of which learners referred others in the system
Practice Task
Q9. Create a review table with learner name, course name, and rating.
review_table = (
reviews
.merge(learners, how="left", left_on="student_ref", right_on="learner_id")
.merge(courses, how="left", left_on="course_ref", right_on="course_id")
)
review_table[["review_id", "learner", "course_name", "rating"]]Explanation
- The code performs two sequential merge operations to combine review data with corresponding learner and course information using foreign key relationships
- First merge joins the reviews dataset with learners table on student_ref = learner_id to add learner details to each review
- Second merge joins the result with courses table on course_ref = course_id to include course names and other course-related information
- The final selection filters the merged dataset to show only specific columns: review identifier, learner name, course name, and rating score
- This approach creates a unified view that connects individual reviews to both the students who wrote them and the courses they reviewed
Practice Task
Q10. Use validate to confirm that enrollment-to-course is many-to-one.
enrollments.merge(
courses,
how="left",
on="course_id",
validate="many_to_one"
)Explanation
- Performs a left join operation between enrollments and courses DataFrames based on the common course_id column
- Preserves all records from the enrollments DataFrame while adding matching course data where available
- Validates the merge relationship to ensure each enrollment maps to at most one course record
- Useful for enriching enrollment data with course details like titles, credits, or department information
- The many_to_one validation prevents duplicate course entries that could corrupt the merged dataset
30. Interview Questions
1. What is the difference between concat() and merge()?
concat() stacks or attaches DataFrames along an axis. merge() combines DataFrames by matching key columns.
2. When should you use ignore_index=True?
Use it when stacking rows and you want a clean new index instead of preserving old indexes from each input DataFrame.
3. What is an inner join?
An inner join keeps only matching keys from both DataFrames.
4. What is a left join?
A left join keeps all rows from the left DataFrame and adds matching data from the right DataFrame.
5. What is an outer join?
An outer join keeps all keys from both DataFrames and uses missing values where no match exists.
6. What does indicator=True do?
It adds a _merge column that tells whether each row came from the left table only, right table only, or both tables.
7. What is an anti join?
An anti join returns rows from one table that do not have a match in another table.
8. What is a self join?
A self join merges a DataFrame with itself, usually to resolve parent-child or referral relationships.
9. Why should you use validate?
It checks whether the merge relationship matches your expectation, such as many_to_one. This helps catch accidental duplicate keys.
10. Why can many-to-many joins be risky?
They can multiply rows because every matching left row combines with every matching right row for the same key.
11. When should you use .join()?
Use .join() when DataFrames are already indexed by the relationship key.
12. What is a cross join?
A cross join returns every possible combination of rows from both DataFrames.
31. Final Mental Model
Use this table:
| Need | Use |
|---|---|
| Stack rows from similar files | pd.concat([...], ignore_index=True) |
| Keep source labels while stacking | pd.concat([...], keys=[...]) |
| Match rows by key columns | merge() |
| Keep all rows from the left table | how="left" |
| Keep only matching rows | how="inner" |
| Keep all keys from both tables | how="outer" |
| Audit unmatched rows | indicator=True |
| Catch wrong relationship assumptions | validate |
| Merge using different key names | left_on, right_on |
| Merge a table with itself | self join |
| Join by index | .join() |
If you remember only one rule, remember this:
Use concat() when files are the same kind of thing. Use merge() when files describe related things.
Official References
- Pandas merging, joining, and concatenating user guide: https://pandas.pydata.org/docs/user_guide/merging.html
- Pandas
mergeAPI reference: https://pandas.pydata.org/docs/reference/api/pandas.merge.html - Pandas
concatAPI reference: https://pandas.pydata.org/docs/reference/api/pandas.concat.html - Pandas
DataFrame.joinAPI reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
