What are the main components of data wrangling?

Data wrangling has three major parts: gather data, assess data, and clean data.

What is the difference between dirty data and messy data?

Dirty data has quality issues such as missing values and duplicate rows, while messy data has structural issues, where tidy data follows the rules of each variable being a column, each observation a row, and each observational unit a table.

What files are used in this guide for data assessing and cleaning?

The guide uses the following CSV files: edamessylearners.csv, edacourseprogress.csv, edacourseprogressextra.csv, and edalearningissues.csv.

What should you be able to do by the end of this guide?

By the end of the guide, you should be able to explain the difference between dirty and messy data, write a dataset summary, perform assessments, document data quality issues, identify problems, clean data, and reshape data among other skills.

What is the purpose of the define-code-test workflow in data cleaning?

The define-code-test workflow is used to clean data by defining the problem, writing code to fix it, and testing to ensure the cleaning step worked.

EDA Data Assessing and Cleaning: Key

EDA Data Assessing and Cleaning: Quality, Tidiness, Missing Values, Duplicates, Types, and Reshaping

After data gathering, the next job is to decide whether the data is trustworthy enough for analysis.

This stage is called data assessing and cleaning.

You inspect the data, document the issues, clean them in a controlled order, and test that each cleaning step worked.

This guide uses original synthetic learning-platform datasets. It does not use copied clinical-trial data, patient data, proprietary course notebooks, or third-party CSV files.

Files Used In This Guide

Use these CSV files:

eda_messy_learners.csv
eda_course_progress.csv
eda_course_progress_extra.csv
eda_learning_issues.csv

Place them in the same folder as your notebook or script.

If you keep them in a data/ folder, change the paths:

python

learners = pd.read_csv("data/eda_messy_learners.csv")
progress = pd.read_csv("data/eda_course_progress.csv")
progress_extra = pd.read_csv("data/eda_course_progress_extra.csv")
issues = pd.read_csv("data/eda_learning_issues.csv")

What You Will Learn

By the end, you should be able to:

explain the difference between dirty data and messy data
write a short dataset summary and column dictionary
perform manual and programmatic assessment
document data quality issues by dimension
identify completeness, validity, accuracy, and consistency problems
identify tidiness problems
clean data with the define-code-test workflow
copy raw DataFrames before cleaning
fix missing values, duplicates, bad types, and invalid values
extract email and phone fields with regex
combine split tables
reshape wide course columns into tidy rows
split progress ranges into start and end columns
merge issue flags back into the cleaned table

1. Data Wrangling Refresher

Data wrangling has three major parts:

Gather data.
Assess data.
Clean data.

Assessment and cleaning should not be mixed randomly.

First, discover and document problems. Then clean them one by one.

2. Dirty Data vs Messy Data

Unclean data usually has two broad problem types.

Dirty data has quality issues.

Examples:

missing values
duplicate rows
invalid IDs
impossible ages
inconsistent category names
incorrect data types
inaccurate values

Messy data has structural issues.

Tidy data follows three rules:

each variable is a column
each observation is a row
each observational unit is a table

If course names are stored as separate columns instead of rows, that is usually a tidiness issue.

3. Setup

Install the required packages:

bash

pip install pandas openpyxl

Import libraries:

python

import re

import numpy as np
import pandas as pd

Load the datasets:

python

learners = pd.read_csv("eda_messy_learners.csv")
progress = pd.read_csv("eda_course_progress.csv")
progress_extra = pd.read_csv("eda_course_progress_extra.csv")
issues = pd.read_csv("eda_learning_issues.csv")

Preview them:

python

learners.head()

python

progress.head()

python

issues.head()

4. Write A Dataset Summary

Before cleaning, write a plain-English summary.

Example:

This dataset contains learner profile records, course progress ranges, and issue reports for a synthetic online learning platform. The goal is to prepare the data for exploratory analysis of study behavior, course completion, and learner support issues.

This summary prevents you from cleaning without context.

5. Write Column Descriptions

Create a small data dictionary.

python

data_dictionary = pd.DataFrame(
    [
        ["learners", "learner_id", "Unique learner identifier"],
        ["learners", "given_name", "Learner first name"],
        ["learners", "surname", "Learner last name"],
        ["learners", "city", "Learner city"],
        ["learners", "state", "State name or abbreviation"],
        ["learners", "pin_code", "Postal code"],
        ["learners", "contact", "Combined email and phone text"],
        ["learners", "age", "Learner age in years"],
        ["learners", "study_hours", "Total study hours in the period"],
        ["progress", "python/sql/statistics", "Start-to-end course completion range"],
        ["issues", "issue_type", "Support or data-quality issue label"],
    ],
    columns=["table", "column", "description"],
)

data_dictionary

Column descriptions help you decide whether a value is wrong or just unfamiliar.

6. Manual Assessment

Manual assessment means looking at the data directly.

Export tables to Excel for inspection:

python

with pd.ExcelWriter("eda_assessment_review.xlsx", engine="openpyxl") as writer:
    learners.to_excel(writer, sheet_name="learners", index=False)
    progress.to_excel(writer, sheet_name="progress", index=False)
    progress_extra.to_excel(writer, sheet_name="progress_extra", index=False)
    issues.to_excel(writer, sheet_name="issues", index=False)
    data_dictionary.to_excel(writer, sheet_name="dictionary", index=False)

Manual assessment is useful for spotting:

strange spellings
impossible values
repeated people
mixed formats
columns that should be split
columns that should be reshaped

7. Programmatic Assessment

Programmatic assessment uses Pandas checks.

python

learners.info()

python

learners.sample(5, random_state=7)

python

learners.isna().sum()

python

learners.duplicated().sum()

python

learners.duplicated(subset=["learner_id"]).sum()

python

learners.describe(include="all")

Use both manual and programmatic assessment. Each catches different problems.

8. Document Assessment Findings

Document issues before cleaning.

python

assessment_findings = pd.DataFrame(
    [
        ["learners", "learner_id", "duplicate learner_id L012", "accuracy"],
        ["learners", "given_name/surname", "same learner appears twice with a new ID", "accuracy"],
        ["learners", "city/state/pin_code/country", "one learner has missing address fields", "completeness"],
        ["learners", "state", "state values mix names and abbreviations", "consistency"],
        ["learners", "pin_code", "one pin code has only four digits", "validity"],
        ["learners", "contact", "one learner has missing contact details", "completeness"],
        ["learners", "age", "one learner has impossible age 120", "accuracy"],
        ["learners", "study_hours", "one learner has unrealistic study_hours 500", "accuracy"],
        ["progress/progress_extra", "course columns", "course names are columns instead of rows", "tidiness"],
        ["progress/progress_extra", "course progress", "ranges include % signs and need splitting", "validity"],
        ["progress/progress_extra", "given_name/surname", "names are lowercase while learners table uses title case", "consistency"],
    ],
    columns=["table", "column", "issue", "dimension"],
)

assessment_findings

Good cleaning starts with documented issues.

9. Data Quality Dimensions

Four useful quality dimensions:

Dimension	Question	Example
Completeness	Is data missing?	missing city or contact
Validity	Does the value follow the expected rule?	four-digit pin code
Accuracy	Is the value correct in the real world?	age 120 for a learner record
Consistency	Is the same idea represented the same way?	`MH` and `Maharashtra`

Tidiness is separate from quality. A dataset can have valid values but still be structurally awkward.

10. Cleaning Order

A practical cleaning order:

Copy raw data.
Fix completeness issues.
Fix tidiness issues.
Fix validity issues.
Fix accuracy issues.
Fix consistency issues.
Test everything.

Always use define-code-test.

Define: describe the cleaning action.
Code: perform the cleaning.
Test: verify it worked.

11. Copy Before Cleaning

Never clean the only copy.

python

learners_clean = learners.copy()
progress_clean = progress.copy()
progress_extra_clean = progress_extra.copy()
issues_clean = issues.copy()

Raw data should remain available for audit and recovery.

12. Completeness: Fill Missing Address Fields

Define

Replace missing location fields with "Unknown" and missing contact with "No contact provided".

Code

python

location_cols = ["city", "state", "pin_code", "country"]

learners_clean[location_cols] = learners_clean[location_cols].fillna("Unknown")
learners_clean["contact"] = learners_clean["contact"].fillna("No contact provided")

Explanation

The code identifies specific columns related to location information including city, state, pin code, and country
It fills all missing values in these location columns with the string "Unknown" to maintain data consistency
The contact column receives a different default value of "No contact provided" for missing entries
This approach ensures no null values remain in critical identification fields while preserving the original data structure
The operation modifies the dataframe in-place by assigning the filled values back to their respective columns

Test

python

learners_clean[location_cols + ["contact"]].isna().sum()

Explanation

This code examines the learners dataset to identify missing values in specific columns
It selects columns related to location information and contact details using list concatenation
The .isna() method creates a boolean DataFrame where True represents missing values
The .sum() function counts the total number of missing values per column
This analysis helps determine data completeness and identify areas needing data cleaning or imputation strategies

13. Completeness: Avoid Blind Filling

Not every missing value should be filled with text.

For numeric columns, choose carefully:

python

learners_clean[["age", "study_hours"]].isna().sum()

Explanation

The code selects the "age" and "study_hours" columns from the learners_clean DataFrame using double brackets notation
It applies the isna() method to identify all missing values (NaN) in these two columns, returning a boolean DataFrame
The sum() method counts the total number of True values (missing entries) for each column
This operation efficiently provides a quick overview of data quality by showing how many records are missing values in these specific fields
The result displays the count of missing values for each column, helping identify potential data integrity issues

If numeric values are missing, options include:

leave as missing
impute with median
impute by group
remove rows if justified
ask the source owner

In this dataset, numeric missingness is not the main issue; unrealistic numeric values are.

14. Tidiness: Combine Split Progress Tables

The progress data is split across two files.

Define

Append progress_extra below progress.

Code

python

all_progress = pd.concat(
    [progress_clean, progress_extra_clean],
    ignore_index=True,
)

Explanation

Concatenates two pandas DataFrames (progress_clean and progress_extra_clean) along the row axis (axis=0 by default)
The ignore_index=True parameter resets the index values to create a continuous sequence from 0 to n-1
This operation preserves all rows from both DataFrames in a single unified DataFrame structure
Useful for merging datasets with identical column structures while maintaining proper sequential indexing
Returns a new DataFrame object containing combined data from both input DataFrames

Test

python

print(len(progress_clean), len(progress_extra_clean), len(all_progress))

Explanation

This code snippet outputs the number of elements in three different progress-related variables using the len() function
The variables appear to represent different stages of data cleaning where progress_clean contains raw progress data, progress_extra_clean contains more thoroughly cleaned data, and all_progress contains comprehensive progress information
The comparison of these lengths helps determine how much data was removed or filtered during the cleaning process
This type of analysis is commonly used in data preprocessing to track data quality and understand the impact of cleaning operations
The output values would show whether data was lost, gained, or remained consistent across different cleaning steps

The combined row count should equal the sum of the two input tables.

15. Tidiness: Reshape Course Columns

The columns python, sql, and statistics are course names. They should be values in a course column.

Define

Use melt to convert wide course progress into tidy long format.

Code

python

progress_long = all_progress.melt(
    id_vars=[
        "given_name",
        "surname",
        "project_score_start",
        "project_score_end",
    ],
    value_vars=["python", "sql", "statistics"],
    var_name="course",
    value_name="progress_range",
)

Explanation

The code reshapes a DataFrame from wide format to long format using the melt function
It preserves identifying columns (given_name, surname, project_score_start, project_score_end) as identifiers while melting course-specific columns
The melt operation converts python, sql, and statistics columns into two new columns: course (containing the original column names) and progress_range (containing their values)
This transformation makes it easier to analyze progress across different courses in a unified structure
The resulting DataFrame has one row per student-course combination instead of separate columns for each course

Remove learners who did not take a course:

python

progress_long = progress_long[progress_long["progress_range"] != "-"].copy()

Explanation

Removes rows where the "progress_range" column contains empty string values ("-") from the progress_long DataFrame
Creates a new filtered DataFrame by copying the subset that excludes these empty entries
The operation preserves all other columns and rows while eliminating invalid progress range data
This filtering step ensures data integrity by removing incomplete or placeholder progress range information
The copy() method creates an explicit copy of the filtered results to avoid potential reference issues

Test

python

progress_long.head()

Explanation

Shows the initial portion of a DataFrame named 'progress_long' to examine its structure and content
The head() method displays the first 5 rows by default, providing a quick overview of the dataset
Useful for verifying data loading and understanding column names and data types
Helps identify any immediate issues with data formatting or missing values before further analysis

python

progress_long["course"].value_counts()

Explanation

The code performs a frequency count of unique values in the "course" column of the progress_long DataFrame
It returns a Series with each course name as the index and its corresponding enrollment count as the value
This operation helps identify which courses have the highest and lowest student participation rates
The result can be used for educational planning, resource allocation, and identifying popular vs underutilized courses
Value_counts() automatically sorts results in descending order by default, showing most enrolled courses first

Now each row represents one learner-course observation.

16. Validity: Split Progress Ranges

Progress values like 22%-76% should become two numeric columns.

Define

Split progress_range into progress_start_pct and progress_end_pct.

Code

python

progress_parts = progress_long["progress_range"].str.replace("%", "", regex=False).str.split("-", expand=True)

progress_long["progress_start_pct"] = progress_parts[0].astype(int)
progress_long["progress_end_pct"] = progress_parts[1].astype(int)

progress_long = progress_long.drop(columns=["progress_range"])

Explanation

Removes the percentage symbol from the progress_range column values and splits the resulting strings by the dash character to separate start and end values
Converts the separated start and end percentage values into integers and assigns them to new columns progress_start_pct and progress_end_pct
Drops the original progress_range column since its data has been extracted into the two new numeric columns
This transformation converts categorical string representations of progress ranges into discrete numerical values for easier analysis and calculations

Test

python

progress_long[["progress_start_pct", "progress_end_pct"]].describe()

Explanation

This code performs descriptive statistics analysis on two specific columns containing percentage values
The describe() method calculates key statistical measures including count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values
It provides insights into the distribution and central tendency of progress percentage data across the dataset
The output helps identify data patterns, outliers, and overall range of progress measurements
This analysis is useful for understanding how progress percentages are distributed and identifying any potential anomalies in the data

python

(progress_long["progress_end_pct"] >= progress_long["progress_start_pct"]).all()

Explanation

This code checks if every record in the progress_long DataFrame has a progress_end_pct value that is greater than or equal to its progress_start_pct value
The comparison operation creates a boolean Series for each row, returning True when the end percentage meets or exceeds the start percentage
The .all() method aggregates all boolean results into a single boolean value, returning True only if ALL records pass the validation check
This validation ensures data integrity by confirming that progress measurements don't regress from start to end points
The result helps identify potential data entry errors or processing issues where end progress values are unexpectedly lower than start values

The end value should be greater than or equal to the start value.

17. Validity: Fix Data Types

Dates and postal codes need careful handling.

python

learners_clean["signup_date"] = pd.to_datetime(learners_clean["signup_date"], errors="coerce")
learners_clean["pin_code"] = learners_clean["pin_code"].astype(str).str.replace(".0", "", regex=False)

Explanation

Converts the signup_date column to proper datetime format while handling invalid dates gracefully through error coercion
Transforms pin_code values to strings and removes trailing ".0" decimal notation to standardize numeric postal codes
Uses regex=False parameter in string replacement for better performance when simple pattern matching is sufficient
The coerce parameter ensures malformed date entries don't raise exceptions but instead become NaT (Not a Time) values
This data cleaning approach prepares structured data for time-based analysis and consistent numeric formatting

Postal codes are identifiers, not numbers. Keep them as strings.

Check date parsing:

python

learners_clean["signup_date"].isna().sum()

Explanation

The code accesses the "signup_date" column of the learners_clean DataFrame.
It uses the isna() method to create a boolean Series indicating which values are NaN (missing).
The sum() function then counts the total number of True values in the boolean Series, effectively counting the missing signup dates.
This is useful for data cleaning and understanding the completeness of the dataset.

18. Validity: Validate Pin Codes

In this synthetic India-style dataset, a valid pin code should have six digits.

python

pin_is_valid = learners_clean["pin_code"].str.fullmatch(r"\d{6}")

learners_clean.loc[~pin_is_valid & learners_clean["pin_code"].ne("Unknown"), ["learner_id", "pin_code"]]

Explanation

The code checks if the "pin_code" column in the DataFrame learners_clean contains valid 6-digit numeric values using a regular expression.
It creates a boolean Series pin_is_valid that indicates whether each pin code matches the specified pattern.
The loc method is used to filter rows where the pin code is invalid (not matching the pattern) and is not equal to "Unknown".
The filtered DataFrame displays only the "learner_id" and "pin_code" columns for the invalid entries, allowing for easy identification of issues.

Fix the known bad value:

python

learners_clean.loc[learners_clean["pin_code"] == "4001", "pin_code"] = "400001"

Explanation

The code targets a DataFrame named learners_clean which likely contains learner information.
It uses the .loc method to locate rows where the "pin_code" column matches the value "4001".
For those identified rows, it updates the "pin_code" to "400001", ensuring a standardized format.
This operation is useful for data cleaning and preparation before analysis or reporting.
The change is made in-place, meaning the original DataFrame is modified directly without creating a copy.

Test again:

python

pin_is_valid = learners_clean["pin_code"].str.fullmatch(r"\d{6}")
learners_clean.loc[~pin_is_valid & learners_clean["pin_code"].ne("Unknown"), ["learner_id", "pin_code"]]

Explanation

The code checks if the "pin_code" column in the learners_clean DataFrame contains valid 6-digit numeric PINs using a regular expression.
The str.fullmatch method returns a boolean Series indicating which entries match the 6-digit pattern.
The loc method is then used to filter the DataFrame, selecting rows where the PIN is invalid (not matching the regex) and not equal to "Unknown".
The resulting DataFrame displays only the "learner_id" and "pin_code" columns for the filtered entries.

19. Accuracy: Handle Outliers

The dataset intentionally includes unrealistic values.

python

learners_clean[["age", "study_hours"]].describe()

Explanation

The code selects the "age" and "study_hours" columns from the learners_clean DataFrame.
It uses the describe() method to compute summary statistics such as count, mean, standard deviation, min, max, and quartiles for the selected columns.
This method provides a quick overview of the distribution and central tendency of the data in these columns.
The output is useful for understanding the characteristics of the learners' age and study habits.

Find suspicious rows:

python

learners_clean.query("age > 90 or study_hours > 120")

Explanation

The code uses the query method to filter rows in the learners_clean DataFrame.
It selects rows where the age column is greater than 90 or the study_hours column exceeds 120.
This allows for efficient data manipulation and retrieval based on specific criteria.
The resulting DataFrame will only include learners who meet at least one of the specified conditions.
This approach enhances readability and maintainability of the code compared to traditional indexing methods.

Define

Replace impossible age and unrealistic study_hours with missing values for review.

Code

python

learners_clean.loc[learners_clean["age"] > 90, "age"] = np.nan
learners_clean.loc[learners_clean["study_hours"] > 120, "study_hours"] = np.nan

Explanation

The code uses the loc method from pandas to access specific rows in the DataFrame learners_clean.
It checks for ages greater than 90 and replaces those values with NaN, indicating missing data.
Similarly, it identifies study hours exceeding 120 and sets those values to NaN.
This data cleaning step helps ensure that the dataset is more accurate and reliable for analysis.
Using np.nan from the NumPy library allows for consistent handling of missing values in the DataFrame.

Test

python

learners_clean.query("age > 90 or study_hours > 120")

Explanation

The code uses the query method to filter rows in the learners_clean DataFrame.
It selects rows where the age column is greater than 90 or the study_hours column exceeds 120.
This allows for efficient data manipulation and retrieval based on specific criteria.
The resulting DataFrame will only include learners who meet at least one of the specified conditions.
This approach enhances readability and maintainability of the code compared to traditional indexing methods.

This is safer than inventing corrected values.

20. Accuracy: Remove Duplicate Records

Check duplicate IDs:

python

learners_clean[learners_clean.duplicated(subset=["learner_id"], keep=False)]

Explanation

The code uses the duplicated() method from the pandas library to check for duplicate rows in the DataFrame learners_clean.
The subset parameter specifies that only the learner_id column should be considered when identifying duplicates.
The keep=False argument ensures that all occurrences of the duplicates are marked as True, not just the first or last occurrence.
The result is a DataFrame containing all rows that have duplicate learner_id values, allowing for further analysis or cleaning of the data.

Check duplicate names and contact:

python

learners_clean[
    learners_clean.duplicated(subset=["given_name", "surname", "contact"], keep=False)
]

Explanation

The code filters a DataFrame named learners_clean to identify duplicate rows.
It uses the duplicated method, specifying the subset of columns ["given_name", "surname", "contact"] to check for duplicates.
The keep=False argument ensures that all occurrences of duplicates are marked as True, not just the subsequent ones.
The result is a DataFrame containing only the rows that have duplicates in the specified columns.

Define

Remove exact repeated learner records and keep the first copy.

Code

python

learners_clean = learners_clean.drop_duplicates()
learners_clean = learners_clean.drop_duplicates(subset=["learner_id"], keep="first")

Explanation

The first line removes all duplicate rows from the learners_clean DataFrame, keeping only unique combinations of all column values
The second line specifically targets duplicate entries based on the "learner_id" column, keeping only the first occurrence of each unique learner ID
This two-step approach ensures comprehensive deduplication while maintaining data integrity for individual learners
The drop_duplicates() method with keep="first" parameter preserves the earliest record when duplicates are found
This cleaning process prevents data skewing in analytics and ensures accurate learner tracking

Test

python

learners_clean.duplicated(subset=["learner_id"]).sum()

Explanation

The code checks for duplicate rows in a DataFrame called learners_clean based on the learner_id column
It uses the duplicated() method with subset parameter to identify duplicate entries while focusing only on the specified column
The sum() function counts the total number of True values returned by duplicated(), giving the count of duplicate records
This approach efficiently identifies how many learner records appear more than once in the dataset
The result helps data analysts understand data quality issues and potential redundancy in learner identification

Some duplicates need business rules. Do not blindly drop rows when records conflict.

21. Consistency: Normalize State Names

State values mix abbreviations and full names.

python

learners_clean["state"].value_counts(dropna=False)

Explanation

The code examines the frequency distribution of values in the "state" column of the learners_clean DataFrame
It uses value_counts() method to count occurrences of each unique state value while preserving NaN values through dropna=False parameter
This approach helps identify how many records exist for each state category and reveals the extent of missing data in the state field
The results show both populated states and any null/missing entries that might need handling in subsequent data processing steps
Useful for understanding data completeness and identifying potential issues with state information across the dataset

Create a mapping:

python

state_map = {
    "MH": "Maharashtra",
    "Maharashtra": "Maharashtra",
    "DL": "Delhi",
    "Delhi": "Delhi",
    "Karnataka": "Karnataka",
    "Telangana": "Telangana",
    "TS": "Telangana",
    "TN": "Tamil Nadu",
    "Tamil Nadu": "Tamil Nadu",
    "Gujarat": "Gujarat",
    "West Bengal": "West Bengal",
    "Unknown": "Unknown",
}

learners_clean["state"] = learners_clean["state"].map(state_map).fillna(learners_clean["state"])

Explanation

Creates a dictionary mapping state abbreviations and partial names to their complete state names for data standardization
Uses the pandas map() function to replace state values in the DataFrame column with their full names based on the dictionary lookup
Applies fillna() to preserve original values for any states that don't match the mapping dictionary, ensuring no data loss
This approach handles multiple naming conventions (abbreviations vs full names) in a single clean operation
The resulting DataFrame column contains standardized state names for consistent data analysis

Test:

python

learners_clean["state"].value_counts(dropna=False)

Explanation

The code examines the frequency distribution of values in the "state" column of the learners_clean DataFrame
It uses value_counts() method to count occurrences of each unique state value while preserving NaN values through dropna=False parameter
This approach helps identify how many records exist for each state category and reveals the extent of missing data in the state field
The results show both populated states and any null/missing entries that might need handling in subsequent data processing steps
Useful for understanding data completeness and identifying potential issues with state information across the dataset

22. Consistency: Normalize Names For Joining

Progress and issue tables use lowercase names. Learner records use title case.

Create join keys in all tables:

python

def make_name_key(df):
    return (
        df["given_name"].str.strip().str.lower()
        + "|"
        + df["surname"].str.strip().str.lower()
    )

learners_clean["name_key"] = make_name_key(learners_clean)
progress_long["name_key"] = make_name_key(progress_long)
issues_clean["name_key"] = make_name_key(issues_clean)

Explanation

The function make_name_key generates a consistent identifier by combining stripped and lowercased given names and surnames with a pipe delimiter
Each dataframe (learners_clean, progress_long, issues_clean) receives a new name_key column that enables reliable matching across datasets
The string operations ensure case-insensitive comparison and remove whitespace variations that could prevent proper data joining
This approach facilitates efficient data merging and deduplication when working with potentially inconsistent name formatting
The resulting name_key columns serve as primary keys for aligning learner information across different data tables

Test key coverage:

python

missing_progress_matches = set(progress_long["name_key"]) - set(learners_clean["name_key"])
missing_progress_matches

Explanation

Calculates the difference between two sets of keys to identify records that exist in progress data but not in learners data
Uses set subtraction operation to find missing matches where learner names don't have corresponding entries in the clean learners dataset
The result shows which name keys from progress_long are absent from learners_clean, helping identify data quality issues
This technique efficiently identifies orphaned records that may need investigation or reconciliation
The output displays the actual missing key values that can be used for further data analysis or cleaning steps

23. Extract Email And Phone With Regex

The contact column combines email and phone.

Define

Create separate email and phone columns, then remove the original combined field.

Code

python

email_pattern = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
phone_pattern = re.compile(r"(?:\+91\s*)?\d{5}[-\s]?\d{5}")

learners_clean["email"] = learners_clean["contact"].str.extract(f"({email_pattern.pattern})", expand=False)
learners_clean["phone"] = learners_clean["contact"].str.extract(f"({phone_pattern.pattern})", expand=False)

learners_clean["phone"] = (
    learners_clean["phone"]
    .str.replace("+91", "", regex=False)
    .str.replace(" ", "", regex=False)
    .str.replace("-", "", regex=False)
)

learners_clean = learners_clean.drop(columns=["contact"])

Explanation

Compiles regular expression patterns to identify email addresses and Indian phone numbers in the contact data
Uses string extraction methods to isolate matching email and phone number patterns from the contact column
Cleans phone numbers by removing country code prefix (+91) and any whitespace or hyphens for standardization
Drops the original contact column since extracted email and phone number data are now stored separately

Test

python

learners_clean[["email", "phone"]].head()

Explanation

This code selects and displays the first 5 rows of data from two specific columns ('email' and 'phone') in the learners_clean DataFrame
The double square brackets [[ ]] indicate selection of multiple columns by name rather than a single column
The .head() method limits the output to show only the initial records, making it useful for quick data inspection
This approach helps verify the structure and content of contact information fields before further data processing
The operation is purely observational and doesn't modify the original dataset

python

learners_clean[["email", "phone"]].isna().sum()

Explanation

This code examines the learners_clean DataFrame to identify missing data in the email and phone columns
The double brackets [["email", "phone"]] select multiple columns simultaneously from the DataFrame
The .isna() method returns a boolean DataFrame where True indicates missing values (NaN) and False indicates valid entries
The .sum() function counts the True values (missing entries) for each column, returning a Series with the total missing count per column
This approach efficiently provides a quick overview of data quality issues in contact information fields

Missing contact fields should remain missing. Do not fabricate personal details.

24. Merge Cleaned Tables

Merge learner details with progress rows.

python

learner_progress = progress_long.merge(
    learners_clean,
    on="name_key",
    how="left",
    validate="many_to_one",
    suffixes=("_progress", "_learner"),
)

learner_progress.head()

Explanation

Combines two datasets (progress_long and learners_clean) based on the common "name_key" column to create a comprehensive learner progress dataset
Uses a left join operation to preserve all records from the progress dataset while adding matching learner information
Implements validation to ensure the merge maintains a many-to-one relationship between datasets
Applies suffixes to distinguish columns with identical names between the two datasets during the merge process
Displays the first few rows of the merged dataset to verify the successful combination of data

Merge issue flags:

python

issue_summary = (
    issues_clean.groupby("name_key", as_index=False)
    .agg(
        issue_count=("issue_type", "count"),
        highest_severity=("severity", "max"),
    )
)

learner_progress = learner_progress.merge(
    issue_summary,
    on="name_key",
    how="left",
    validate="many_to_one",
)

learner_progress["issue_count"] = learner_progress["issue_count"].fillna(0).astype(int)
learner_progress["highest_severity"] = learner_progress["highest_severity"].fillna("none")

Explanation

Groups cleaned issues by name key and calculates total issue count and highest severity level for each group
Merges the aggregated issue data with learner progress records using a left join to preserve all learner data
Handles missing values by filling null issue counts with zero and null severities with "none" string
Ensures proper data types by converting issue count to integer and maintaining severity as string format

Test:

python

learner_progress[["course", "progress_start_pct", "progress_end_pct", "issue_count"]].head()

Explanation

Retrieves specific columns (course name, initial progress percentage, final progress percentage, and issue count) from the learner_progress DataFrame
Uses double square brackets to select multiple columns as a subset of the original dataset
Calls head() method to show only the first 5 rows of this filtered data for quick preview
This operation helps analyze how learners progressed through courses and identifies potential problem areas with issue counts
Commonly used for data exploration and validation before performing deeper analysis on learning metrics

25. Create Derived Cleaning Columns

Calculate changes after cleaning.

python

learner_progress["progress_gain_pct"] = (
    learner_progress["progress_end_pct"] - learner_progress["progress_start_pct"]
)

learner_progress["project_score_gain"] = (
    learner_progress["project_score_end"] - learner_progress["project_score_start"]
)

Explanation

Computes the percentage point gain in learner progress by subtracting initial progress from final progress percentages
Calculates the numerical score improvement in learner projects by finding the difference between end and start project scores
These calculations help measure learning effectiveness and performance enhancement over time
The operations create new columns in the learner progress dataframe that quantify measurable improvements
This approach enables tracking of both relative progress changes and absolute score gains for performance analysis

Check results:

python

learner_progress[["course", "progress_gain_pct", "project_score_gain"]].describe()

Explanation

The code snippet utilizes the describe() method from the Pandas library to generate summary statistics.
It focuses on three specific columns: "course", "progress_gain_pct", and "project_score_gain" from the learner_progress DataFrame.
The output includes key statistical measures such as count, mean, standard deviation, minimum, and maximum values for the selected columns.
This analysis helps in understanding the overall performance and progress of learners in various courses.

Derived columns should be created after types and structure are fixed.

26. Final Quality Checks

Run final checks before saving.

python

checks = {
    "duplicate_learner_ids": learners_clean.duplicated(subset=["learner_id"]).sum(),
    "missing_name_keys_in_progress": learner_progress["learner_id"].isna().sum(),
    "invalid_progress_gain": (learner_progress["progress_gain_pct"] < 0).sum(),
    "missing_course": learner_progress["course"].isna().sum(),
}

checks

Explanation

A dictionary named checks is created to store various validation results related to learner data.
The duplicate_learner_ids key counts the number of duplicate entries in the learner_id column of the learners_clean DataFrame.
The missing_name_keys_in_progress key calculates how many learner_id entries in the learner_progress DataFrame are missing (NaN).
The invalid_progress_gain key sums up instances where the progress_gain_pct is less than zero, indicating invalid progress data.
The missing_course key counts the number of missing course entries in the learner_progress DataFrame.

All checks should be explainable.

27. Save Cleaned Outputs

Save cleaned tables separately.

python

learners_clean.to_csv("eda_learners_clean.csv", index=False)
progress_long.to_csv("eda_progress_tidy.csv", index=False)
learner_progress.to_csv("eda_learner_progress_analysis_ready.csv", index=False)
assessment_findings.to_csv("eda_assessment_findings.csv", index=False)

Explanation

Each line of code uses the to_csv method to export a DataFrame to a CSV file.
The first argument specifies the filename, while index=False prevents the DataFrame index from being written to the file.
This approach is useful for preparing cleaned and processed data for further analysis or sharing.
The filenames indicate that the data pertains to learners, progress, and assessment findings, suggesting a focus on educational data analysis.
By saving the data in CSV format, it ensures compatibility with various data analysis tools and software.

Keep raw files and cleaned files separate.

28. Cleaning Checklist

Before moving into deeper EDA, confirm:

raw DataFrames were copied before cleaning
missing values were handled intentionally
duplicate IDs were checked
duplicate entity records were checked
date columns were parsed
ID-like columns stayed as strings
invalid values were fixed or marked missing
inconsistent categories were normalized
untidy wide columns were reshaped
derived columns were created after cleaning
joins used validation
cleaning steps were tested
findings were documented

29. Common Mistakes

Cleaning Without Documentation

If you do not document issues, later readers cannot tell why the code exists.

Filling Every Missing Value With One Word

"Unknown" may work for address fields. It is usually wrong for numeric measurements.

Treating IDs As Numbers

Postal codes, learner IDs, phone numbers, and course IDs are identifiers. Store them as strings.

Dropping Duplicates Blindly

Some duplicates are exact repeats. Others are conflicting records. Conflicting records need business rules.

Fixing Outliers By Guessing

If you cannot verify the true value, mark it missing or flag it for review.

Joining Before Normalizing Keys

Names, IDs, and category labels should be normalized before joins.

30. Practice Tasks

Use the included CSV files:

Load all four datasets and write a short summary.
Create a data dictionary for all tables.
Export all raw tables into one Excel workbook.
Find missing values in eda_messy_learners.csv.
Find duplicate learner IDs.
Detect unrealistic age and study_hours values.
Normalize state names.
Extract email and phone from contact.
Combine the two course progress tables.
Reshape course columns with melt.
Split progress ranges into start and end percentages.
Merge learner, progress, and issue tables.
Create progress gain and project score gain columns.
Save cleaned analysis-ready CSV files.

Good cleaning is not about making data look perfect. It is about making every assumption visible and every transformation testable.

EDA Data Assessing and Cleaning: Key Techniques Explained

AI Insights

Frequently Asked Questions

Related Work

FastAPI Browser-Workflow Automation

AI Visibility & SEO Auditing Platform

RAG-Powered Learning Platform with AI Assistant