# Mastering Pandas: Merge, Join, and Concat DataFrames Effectively URL: https://madhudadi.in/blog/posts/pandas-merge-join-and-concat-essential-techniques Published: 2026-06-05 Tags: python, Pandas Read time: 48 min Difficulty: intermediate > Learn how to combine Pandas DataFrames with concat, merge, and join using original CSV datasets. Practice row appends, join types, self joins, anti joins, index joins, validation, indicators, suffixes, and realistic reporting workflows.# Pandas Merge, Join, and Concat: Combine DataFrames With Practice Most real projects do not give you one perfect CSV. You may get: - one file for January enrollments - one file for February enrollments - one file for learners - one file for courses - one file for mentors - one file for reviews To answer useful questions, you need to combine these files correctly. This is where `pd.concat()`, `merge()`, and `join()` become essential. This guide uses original sample datasets created for this lesson. It does not use copied course, sports, movie, ecommerce, or public dataset examples. Files used in this lesson: - `pandas_join_learners.csv` - `pandas_join_courses.csv` - `pandas_join_mentors.csv` - `pandas_join_enrollments_jan.csv` - `pandas_join_enrollments_feb.csv` - `pandas_join_reviews.csv` Place all CSV files in the same folder as this Markdown file before running the examples. ## What You Will Learn By the end, you should be able to use: - `pd.concat()` for stacking DataFrames - `ignore_index=True` - `keys=` to create a MultiIndex while concatenating - `axis=0` versus `axis=1` - `merge()` for database-style joins - inner, left, right, outer, and cross joins - `on`, `left_on`, and `right_on` - joining on multiple columns - `suffixes` - `indicator=True` - `validate` - self joins - semi joins and anti joins - index-based `.join()` - `combine_first()` - reporting workflows after merging ## 1. Setup ```python import pandas as pd import numpy as np ``` Load the datasets: ```python learners = pd.read_csv("pandas_join_learners.csv") courses = pd.read_csv("pandas_join_courses.csv") mentors = pd.read_csv("pandas_join_mentors.csv") jan = pd.read_csv("pandas_join_enrollments_jan.csv") feb = pd.read_csv("pandas_join_enrollments_feb.csv") reviews = pd.read_csv("pandas_join_reviews.csv") ``` Inspect the shapes: ```python for name, df in { "learners": learners, "courses": courses, "mentors": mentors, "jan": jan, "feb": feb, "reviews": reviews, }.items(): print(name, df.shape) ``` Look at the first few rows: ```python learners.head() ``` ```python courses.head() ``` ```python jan.head() ``` ## 2. `concat()` For Stacking Rows Use `pd.concat()` when two or more DataFrames have the same meaning and similar columns. January enrollments and February enrollments are the same kind of data. ```python enrollments = pd.concat([jan, feb]) enrollments ``` Notice the index. Each file kept its original index, so you may see duplicate index labels. For most beginner workflows, reset the index while concatenating: ```python enrollments = pd.concat([jan, feb], ignore_index=True) enrollments ``` Now the index is clean from `0` to `n - 1`. Use this pattern for: - monthly sales files - daily logs - region-wise CSVs - scraped pages with the same schema - batches of survey responses ## 3. Add Source Labels With `keys` Sometimes you want to remember which original file each row came from. ```python enrollments_by_month = pd.concat( [jan, feb], keys=["Jan", "Feb"] ) enrollments_by_month ``` This creates a MultiIndex. Get all January rows: ```python enrollments_by_month.loc["Jan"] ``` Get the first row inside February: ```python enrollments_by_month.loc[("Feb", 0)] ``` If you want the month label as a normal column: ```python enrollments_with_month = pd.concat( [jan, feb], keys=["Jan", "Feb"] ).reset_index(level=0).rename(columns={"level_0": "month"}) enrollments_with_month.head() ``` This is useful for month-wise revenue analysis. ## 4. `concat()` Side By Side With `axis=1` `axis=0` stacks rows. `axis=1` attaches columns side by side. ```python pd.concat([learners.head(3), courses.head(3)], axis=1) ``` This example is not a real relationship. It simply places columns next to each other based on index position. Use `axis=1` only when rows are already aligned by index and you know that row 0 in one DataFrame belongs with row 0 in another. For most relationship-based combining, use `merge()`. ## 5. `merge()` For Relationship-Based Joins Use `merge()` when rows should match by one or more key columns. Example: ```python enrollments.merge(learners, on="learner_id") ``` This matches enrollment rows to learner details using `learner_id`. By default, `merge()` performs an inner join. Inner join means: - keep rows with matching keys in both DataFrames - drop rows that do not match ## 6. Inner Join Show only enrollments that have matching learner records: ```python enrollments.merge(learners, how="inner", on="learner_id") ``` Create a richer table by merging enrollments, learners, and courses: ```python enrollment_detail = ( enrollments .merge(learners, how="inner", on="learner_id") .merge(courses, how="inner", on="course_id") ) enrollment_detail.head() ``` Now each enrollment row has learner and course information. ## 7. Left Join A left join keeps every row from the left DataFrame. For example, keep all courses even if some courses have no enrollments: ```python courses.merge(enrollments, how="left", on="course_id") ``` Rows for courses without enrollments will have missing values in enrollment columns. This is useful when the left table is your master list. Examples: - all courses, whether enrolled or not - all customers, whether ordered or not - all employees, whether assigned or not - all products, whether sold or not ## 8. Right Join A right join keeps every row from the right DataFrame. ```python learners.merge(enrollments, how="right", on="learner_id") ``` This keeps every enrollment row from the right table. You can usually rewrite right joins as left joins by swapping the order of the DataFrames. Many teams prefer left joins because they are easier to read. ## 9. Outer Join An outer join keeps all keys from both DataFrames. ```python learners.merge(enrollments, how="outer", on="learner_id") ``` This is useful for audits: - learners with enrollments - learners without enrollments - enrollment rows that reference missing learners Outer joins help find data quality gaps. ## 10. Use `indicator=True` To Audit Matches The `indicator=True` option adds a `_merge` column. ```python learner_enrollment_audit = learners.merge( enrollments, how="outer", on="learner_id", indicator=True ) learner_enrollment_audit["_merge"].value_counts() ``` The `_merge` values are: - `both`: key exists in both tables - `left_only`: key exists only in the left table - `right_only`: key exists only in the right table Find learners who have not enrolled: ```python learner_enrollment_audit[ learner_enrollment_audit["_merge"] == "left_only" ] ``` Find enrollment records with missing learner profiles: ```python learner_enrollment_audit[ learner_enrollment_audit["_merge"] == "right_only" ] ``` This is one of the best debugging tools for joins. ## 11. Anti Join An anti join means: show rows from one table that do not have a match in another table. Pandas does not have a direct `anti_join()` method, but you can build one with `indicator=True`. Courses with no enrollment: ```python course_audit = courses.merge( enrollments, how="left", on="course_id", indicator=True ) courses_without_enrollment = course_audit[ course_audit["_merge"] == "left_only" ] courses_without_enrollment[["course_id", "course_name", "category", "price"]] ``` Learners with no enrollment: ```python learner_audit = learners.merge( enrollments, how="left", on="learner_id", indicator=True ) learners_without_enrollment = learner_audit[ learner_audit["_merge"] == "left_only" ] learners_without_enrollment[["learner_id", "learner", "city", "plan"]] ``` Anti joins are common in analytics work. ## 12. Semi Join A semi join means: keep rows from one table if they have a match in another table. Learners who enrolled at least once: ```python enrolled_learner_ids = enrollments["learner_id"].unique() learners[learners["learner_id"].isin(enrolled_learner_ids)] ``` Courses that received at least one enrollment: ```python enrolled_course_ids = enrollments["course_id"].unique() courses[courses["course_id"].isin(enrolled_course_ids)] ``` Use semi joins when you need the original table rows, not the combined table. ## 13. Merge With Different Column Names The reviews dataset uses different key names: - `student_ref` points to `learner_id` - `course_ref` points to `course_id` Use `left_on` and `right_on`. ```python reviews_with_learners = reviews.merge( learners, how="left", left_on="student_ref", right_on="learner_id" ) reviews_with_learners.head() ``` Merge reviews with courses: ```python reviews_full = ( reviews .merge(learners, how="left", left_on="student_ref", right_on="learner_id") .merge(courses, how="left", left_on="course_ref", right_on="course_id") ) reviews_full[["review_id", "learner", "course_name", "rating", "review_text"]] ``` ## 14. Merge On Multiple Columns Sometimes one column is not enough to identify a match. Create a small completion table: ```python completion_checks = pd.DataFrame({ "learner_id": ["L001", "L002", "L006", "L010", "L015"], "course_id": ["C101", "C101", "C101", "C104", "C106"], "certificate_sent": [True, False, True, False, True] }) ``` Merge using both `learner_id` and `course_id`: ```python enrollments.merge( completion_checks, how="left", on=["learner_id", "course_id"] ) ``` Use multiple keys when the relationship depends on a combination of columns. ## 15. Handle Duplicate Column Names With `suffixes` If both DataFrames contain columns with the same name, Pandas adds suffixes like `_x` and `_y`. You can choose better suffixes. ```python mentor_course_table = courses.merge( mentors, how="left", on="mentor_id", suffixes=("_course", "_mentor") ) mentor_course_table.head() ``` In this dataset, both tables have `specialty`, so custom suffixes keep the result readable. ## 16. Validate Join Relationships `validate` helps catch accidental many-to-many joins. For example, each course has one mentor, but one mentor can teach many courses. ```python courses.merge( mentors, how="left", on="mentor_id", validate="many_to_one" ) ``` Each enrollment belongs to one learner, but one learner can have many enrollments. ```python enrollments.merge( learners, how="left", on="learner_id", validate="many_to_one" ) ``` Common validation options: | Option | Meaning | |---|---| | `one_to_one` | each key appears once in both tables | | `one_to_many` | left keys are unique, right keys may repeat | | `many_to_one` | left keys may repeat, right keys are unique | | `many_to_many` | keys may repeat in both tables | Use `validate` when correctness matters. It can save you from silent row explosions. ## 17. Many-To-Many Joins Can Expand Rows If the same key appears multiple times in both tables, merge creates all matching combinations. Example: ```python left = pd.DataFrame({ "topic": ["pandas", "pandas", "sql"], "left_note": ["A", "B", "C"] }) right = pd.DataFrame({ "topic": ["pandas", "pandas", "numpy"], "right_note": ["X", "Y", "Z"] }) left.merge(right, on="topic", how="inner") ``` The `pandas` rows multiply because there are two matches on each side. Before merging, check key uniqueness: ```python learners["learner_id"].is_unique ``` ```python courses["course_id"].is_unique ``` ```python enrollments["learner_id"].is_unique ``` It is normal for enrollments to have repeated learner IDs, but it is not normal for the learner master table to have duplicate learner IDs. ## 18. Self Join A self join means merging a table with itself. In the learners dataset, `referral_id` points to another learner. ```python learners[["learner_id", "learner", "referral_id"]] ``` Merge learners with learners: ```python referral_map = learners.merge( learners, how="left", left_on="referral_id", right_on="learner_id", suffixes=("_learner", "_referrer") ) referral_map[[ "learner_id_learner", "learner_learner", "referral_id_learner", "learner_referrer" ]] ``` Self joins are useful for: - employee-manager relationships - user-referrer relationships - product replacement mappings - category-parent relationships ## 19. Index-Based `.join()` `.join()` is mostly used for index-based joins. Set learner ID as the index: ```python learners_by_id = learners.set_index("learner_id") enrollments_by_learner = enrollments.set_index("learner_id") ``` Join enrollment rows with learner columns: ```python enrollments_by_learner.join( learners_by_id, how="left", rsuffix="_learner" ).head() ``` For beginners, `merge()` is usually clearer because the join keys are visible as columns. Use `.join()` when your DataFrames are already indexed by the relationship key. ## 20. `combine_first()` For Filling Missing Values `combine_first()` fills missing values in one object using values from another object with aligned indexes. Example: ```python primary_prices = courses.set_index("course_id")["price"] backup_prices = pd.Series({ "C101": 1499, "C102": 999, "C103": 1299, "C104": 1199, "C105": 1999, "C106": 799 }) primary_prices.combine_first(backup_prices) ``` This is not a replacement for `merge()`. Use it when you want to fill missing aligned values. ## 21. Build A Complete Enrollment Report Create one table with enrollment, learner, course, and mentor information. ```python enrollment_report = ( enrollments .merge(learners, how="left", on="learner_id", validate="many_to_one") .merge(courses, how="left", on="course_id", validate="many_to_one") .merge(mentors, how="left", on="mentor_id", validate="many_to_one", suffixes=("_course", "_mentor")) ) enrollment_report.head() ``` Select useful columns: ```python enrollment_report = enrollment_report[[ "enrollment_id", "enrolled_at", "learner", "city", "plan", "course_name", "category", "price", "payment_status", "mentor", "coupon" ]] enrollment_report.head() ``` ## 22. Month-Wise Revenue Use the month label from `concat(keys=...)`. ```python enrollments_with_month = pd.concat( [jan, feb], keys=["Jan", "Feb"] ).reset_index(level=0).rename(columns={"level_0": "month"}) month_revenue = ( enrollments_with_month .merge(courses, how="left", on="course_id") .query("payment_status == 'paid'") .groupby("month", as_index=False)["price"] .sum() .rename(columns={"price": "revenue"}) ) month_revenue ``` ## 23. Course-Wise Revenue ```python course_revenue = ( enrollments .merge(courses, how="left", on="course_id") .query("payment_status == 'paid'") .groupby(["course_id", "course_name"], as_index=False)["price"] .sum() .rename(columns={"price": "revenue"}) .sort_values("revenue", ascending=False) ) course_revenue ``` ## 24. Top Learners By Spend ```python learner_spend = ( enrollments .merge(learners, how="left", on="learner_id") .merge(courses, how="left", on="course_id") .query("payment_status == 'paid'") .groupby(["learner_id", "learner"], as_index=False)["price"] .sum() .rename(columns={"price": "total_spend"}) .sort_values("total_spend", ascending=False) ) learner_spend.head(5) ``` ## 25. Courses With No Paid Enrollment This is slightly different from courses with no enrollment at all. ```python paid_enrollments = enrollments[enrollments["payment_status"] == "paid"] paid_course_audit = courses.merge( paid_enrollments, how="left", on="course_id", indicator=True ) paid_course_audit[ paid_course_audit["_merge"] == "left_only" ][["course_id", "course_name", "category", "price"]] ``` This helps identify courses that exist in the catalog but did not produce paid orders. ## 26. Review Report Merge reviews with learners and courses. ```python review_report = ( reviews .merge(learners, how="left", left_on="student_ref", right_on="learner_id") .merge(courses, how="left", left_on="course_ref", right_on="course_id") ) review_report[[ "review_id", "learner", "course_name", "rating", "review_text" ]] ``` Average rating by course: ```python review_report.groupby("course_name", as_index=False)["rating"].mean() ``` ## 27. Cross Join A cross join creates every combination of rows. Example: every active mentor with every active course category. ```python mentor_names = mentors[["mentor_id", "mentor"]].head(3) categories = pd.DataFrame({"category": ["Python", "Data", "Project"]}) mentor_names.merge(categories, how="cross") ``` Use cross joins carefully. If the left table has 1000 rows and the right table has 1000 rows, the result has 1,000,000 rows. ## 28. Common Mistakes ### Mistake 1: Using `concat()` when you need `merge()` This only places rows or columns together: ```python pd.concat([learners, enrollments]) ``` **Explanation** - The pd.concat() function vertically stacks two pandas DataFrames (learners and enrollments) by concatenating their rows - This operation preserves all columns from both DataFrames while appending the rows of the second DataFrame below the first - The resulting DataFrame contains combined data from both sources, useful for merging related datasets - Column names must match between DataFrames for proper alignment during concatenation - This technique is commonly used in data processing workflows to aggregate multiple data sources into a single dataset It does not match learner details to enrollments. For relationship matching, use: ```python enrollments.merge(learners, on="learner_id") ``` **Explanation** - Combines two datasets (enrollments and learners) based on matching learner IDs to create a unified view of enrollment details alongside learner demographics - The merge operation aligns records from both tables where the "learner_id" field values match between the datasets - This creates a comprehensive dataset that includes both enrollment information and corresponding learner attributes in a single dataframe - The resulting merged dataset enables analysis of learner behavior, enrollment patterns, and demographic breakdowns simultaneously - Commonly used in educational analytics to understand student engagement and course participation metrics ### Mistake 2: Forgetting `ignore_index=True` When stacking files, duplicated index labels can confuse later code. ```python pd.concat([jan, feb], ignore_index=True) ``` **Explanation** - Concatenates two pandas DataFrame objects (jan and feb) along the row axis (axis=0 by default) - The ignore_index=True parameter creates a new continuous integer index starting from zero - Preserves all columns from both DataFrames without any merging or joining operations - Useful for combining time-series data or appending datasets with identical column structures - Returns a new DataFrame object containing all rows from both input DataFrames sequentially ### Mistake 3: Ignoring unmatched rows An inner join can silently drop rows. Audit with: ```python learners.merge(enrollments, how="outer", on="learner_id", indicator=True) ``` **Explanation** - Performs an outer join operation between learners and enrollments DataFrames based on the common learner_id column - The indicator=True parameter adds a special column showing which DataFrame each record came from (left_only, right_only, both) - Preserves all records from both datasets regardless of matching conditions, making it ideal for identifying missing enrollments or unenrolled learners - Useful for data quality analysis to spot records that exist in one dataset but not the other - Returns a merged DataFrame that can be used to analyze complete enrollment coverage across all learners ### Mistake 4: Not checking key uniqueness Before merging master tables, check: ```python learners["learner_id"].is_unique ``` **Explanation** - The code checks whether all values in the 'learner_id' column of the learners DataFrame are distinct and non-duplicate - The is_unique property returns a boolean value (True/False) indicating if all elements in the Series are unique - This is useful for data validation to ensure each learner has a unique identifier - The operation is efficient as it leverages pandas' optimized internal methods for checking uniqueness - This check helps maintain data integrity before performing operations that require unique identifiers ```python courses["course_id"].is_unique ``` **Explanation** - This code checks whether all values in the 'course_id' column of the courses DataFrame are distinct - The .is_unique property returns a boolean value (True/False) indicating if there are no duplicate entries - When True, it confirms that each course ID appears only once in the dataset - When False, it indicates there are duplicate course IDs that need to be addressed - This validation is essential for ensuring data integrity in educational course management systems ### Mistake 5: Letting duplicate columns become confusing Use meaningful suffixes: ```python courses.merge( mentors, on="mentor_id", suffixes=("_course", "_mentor") ) ``` **Explanation** - Combines two pandas DataFrames (courses and mentors) based on their common "mentor_id" column - Uses suffixes parameter to distinguish columns with identical names between the two dataframes - Appends "_course" and "_mentor" suffixes to overlapping column names respectively - Creates a unified dataset that preserves all course information while adding corresponding mentor details - Returns a new merged DataFrame containing combined data from both source datasets ## 29. Practice Questions Try these before looking at the solutions. ### Q1. Append January and February enrollment files into one clean DataFrame. ```python enrollments = pd.concat([jan, feb], ignore_index=True) enrollments.head() ``` **Explanation** - The code merges two pandas DataFrames (jan and feb) into a single DataFrame called enrollments by stacking them vertically - The ignore_index=True parameter resets the row indices to create a continuous sequential index starting from zero - This operation is useful for combining monthly enrollment data into a unified dataset for analysis - The head() method displays the first 5 rows of the combined DataFrame to verify the concatenation worked correctly - The resulting enrollments DataFrame contains all records from both input DataFrames in chronological order ### Q2. Create a full enrollment table with learner and course names. ```python full_enrollments = ( enrollments .merge(learners, how="left", on="learner_id") .merge(courses, how="left", on="course_id") ) full_enrollments[["enrollment_id", "learner", "course_name", "payment_status"]] ``` **Explanation** - Combines three datasets (enrollments, learners, and courses) using left joins to preserve all enrollment records while adding corresponding learner and course details - Creates a unified dataframe that links student enrollments to specific courses and learner information through common identifier fields - Selects only the most relevant columns (enrollment ID, learner name, course name, and payment status) for reporting or analysis purposes - The merge operations maintain data integrity by matching records based on learner_id and course_id foreign keys - This structured approach enables easy tracking of which students are enrolled in which courses and their payment statuses ### Q3. Find learners who never enrolled. ```python audit = learners.merge( enrollments, how="left", on="learner_id", indicator=True ) audit[audit["_merge"] == "left_only"][["learner_id", "learner", "city"]] ``` **Explanation** - Performs a left join between learners and enrollments datasets using learner_id as the common key - Adds a merge indicator column to track which records exist in both datasets versus only in the left dataset - Filters results to show only records that exist in learners but not in enrollments (left_only matches) - Returns specific columns (learner_id, learner name, and city) for audit purposes - Useful for data quality checks to identify unenrolled learners in the system ### Q4. Find courses that never received any enrollment. ```python audit = courses.merge( enrollments, how="left", on="course_id", indicator=True ) audit[audit["_merge"] == "left_only"][["course_id", "course_name"]] ``` **Explanation** - Performs a left join between courses and enrollments dataframes using course_id as the merging column - The indicator parameter tracks which records exist in each dataframe, creating a _merge column with values like "both", "left_only", or "right_only" - Filters results to show only records where the course exists in the courses table but has no matching enrollment record - Returns specific columns (course_id and course_name) for courses that have no enrollments - Useful for audit purposes to identify inactive or unenrolled courses in a learning management system ### Q5. Find total paid revenue by mentor. ```python mentor_revenue = ( enrollments .merge(courses, how="left", on="course_id") .merge(mentors, how="left", on="mentor_id") .query("payment_status == 'paid'") .groupby(["mentor_id", "mentor"], as_index=False)["price"] .sum() .rename(columns={"price": "revenue"}) .sort_values("revenue", ascending=False) ) mentor_revenue ``` **Explanation** - Merges three datasets (enrollments, courses, and mentors) using left joins to combine enrollment data with course pricing and mentor information - Filters the merged dataset to include only paid enrollments using query method - Groups the filtered data by mentor ID and name, then sums the price column to calculate total revenue per mentor - Renames the summed price column to "revenue" for clarity and readability - Sorts the final result in descending order of revenue to show highest earning mentors first ### Q6. Find students who enrolled in both months. ```python jan_learners = set(jan["learner_id"]) feb_learners = set(feb["learner_id"]) both_months = jan_learners.intersection(feb_learners) learners[learners["learner_id"].isin(both_months)] ``` **Explanation** - Creates two sets containing unique learner IDs from January and February datasets respectively - Uses set intersection operation to identify learners who exist in both months - Filters the main learners dataframe to return only records where learner_id matches those found in both months - This approach efficiently identifies overlapping data between time periods using set operations ### Q7. Find courses that were enrolled in January but not February. ```python jan_courses = set(jan["course_id"]) feb_courses = set(feb["course_id"]) jan_only_courses = jan_courses.difference(feb_courses) courses[courses["course_id"].isin(jan_only_courses)] ``` **Explanation** - Creates two sets from course IDs in January and February dataframes to enable set operations - Uses the difference method to identify courses that exist in January but not in February - Filters the main courses dataframe to return only rows where course IDs match the January-only courses - This approach efficiently identifies courses that were exclusively offered during the January period ### Q8. Build learner-referrer pairs. ```python referral_pairs = learners.merge( learners, how="left", left_on="referral_id", right_on="learner_id", suffixes=("_learner", "_referrer") ) referral_pairs[["learner_learner", "learner_referrer"]] ``` **Explanation** - Merges the learners DataFrame with itself using a left join to establish referral relationships between learners and their referrers - Links records where the referral_id from one learner matches the learner_id of another learner to create referral pairs - Uses suffixes to distinguish between columns from the left (learner) and right (referrer) sides of the merge operation - Selects specific columns to display the learner and referrer identifiers from the resulting joined dataset - This approach enables analysis of referral patterns and tracking of which learners referred others in the system ### Q9. Create a review table with learner name, course name, and rating. ```python review_table = ( reviews .merge(learners, how="left", left_on="student_ref", right_on="learner_id") .merge(courses, how="left", left_on="course_ref", right_on="course_id") ) review_table[["review_id", "learner", "course_name", "rating"]] ``` **Explanation** - The code performs two sequential merge operations to combine review data with corresponding learner and course information using foreign key relationships - First merge joins the reviews dataset with learners table on student_ref = learner_id to add learner details to each review - Second merge joins the result with courses table on course_ref = course_id to include course names and other course-related information - The final selection filters the merged dataset to show only specific columns: review identifier, learner name, course name, and rating score - This approach creates a unified view that connects individual reviews to both the students who wrote them and the courses they reviewed ### Q10. Use `validate` to confirm that enrollment-to-course is many-to-one. ```python enrollments.merge( courses, how="left", on="course_id", validate="many_to_one" ) ``` **Explanation** - Performs a left join operation between enrollments and courses DataFrames based on the common course_id column - Preserves all records from the enrollments DataFrame while adding matching course data where available - Validates the merge relationship to ensure each enrollment maps to at most one course record - Useful for enriching enrollment data with course details like titles, credits, or department information - The many_to_one validation prevents duplicate course entries that could corrupt the merged dataset ## 30. Interview Questions ### 1. What is the difference between `concat()` and `merge()`? `concat()` stacks or attaches DataFrames along an axis. `merge()` combines DataFrames by matching key columns. ### 2. When should you use `ignore_index=True`? Use it when stacking rows and you want a clean new index instead of preserving old indexes from each input DataFrame. ### 3. What is an inner join? An inner join keeps only matching keys from both DataFrames. ### 4. What is a left join? A left join keeps all rows from the left DataFrame and adds matching data from the right DataFrame. ### 5. What is an outer join? An outer join keeps all keys from both DataFrames and uses missing values where no match exists. ### 6. What does `indicator=True` do? It adds a `_merge` column that tells whether each row came from the left table only, right table only, or both tables. ### 7. What is an anti join? An anti join returns rows from one table that do not have a match in another table. ### 8. What is a self join? A self join merges a DataFrame with itself, usually to resolve parent-child or referral relationships. ### 9. Why should you use `validate`? It checks whether the merge relationship matches your expectation, such as `many_to_one`. This helps catch accidental duplicate keys. ### 10. Why can many-to-many joins be risky? They can multiply rows because every matching left row combines with every matching right row for the same key. ### 11. When should you use `.join()`? Use `.join()` when DataFrames are already indexed by the relationship key. ### 12. What is a cross join? A cross join returns every possible combination of rows from both DataFrames. ## 31. Final Mental Model Use this table: | Need | Use | |---|---| | Stack rows from similar files | `pd.concat([...], ignore_index=True)` | | Keep source labels while stacking | `pd.concat([...], keys=[...])` | | Match rows by key columns | `merge()` | | Keep all rows from the left table | `how="left"` | | Keep only matching rows | `how="inner"` | | Keep all keys from both tables | `how="outer"` | | Audit unmatched rows | `indicator=True` | | Catch wrong relationship assumptions | `validate` | | Merge using different key names | `left_on`, `right_on` | | Merge a table with itself | self join | | Join by index | `.join()` | If you remember only one rule, remember this: Use `concat()` when files are the same kind of thing. Use `merge()` when files describe related things. ## Official References - Pandas merging, joining, and concatenating user guide: https://pandas.pydata.org/docs/user_guide/merging.html - Pandas `merge` API reference: https://pandas.pydata.org/docs/reference/api/pandas.merge.html - Pandas `concat` API reference: https://pandas.pydata.org/docs/reference/api/pandas.concat.html - Pandas `DataFrame.join` API reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html