EDA Data Assessing and Cleaning: Quality, Tidiness, Missing Values, Duplicates, Types, and Reshaping
After data gathering, the next job is to decide whether the data is trustworthy enough for analysis.
This stage is called data assessing and cleaning.
You inspect the data, document the issues, clean them in a controlled order, and test that each cleaning step worked.
This guide uses original synthetic learning-platform datasets. It does not use copied clinical-trial data, patient data, proprietary course notebooks, or third-party CSV files.
Files Used In This Guide
Use these CSV files:
eda_messy_learners.csveda_course_progress.csveda_course_progress_extra.csveda_learning_issues.csv
Place them in the same folder as your notebook or script.
If you keep them in a data/ folder, change the paths:
learners = pd.read_csv("data/eda_messy_learners.csv")
progress = pd.read_csv("data/eda_course_progress.csv")
progress_extra = pd.read_csv("data/eda_course_progress_extra.csv")
issues = pd.read_csv("data/eda_learning_issues.csv")What You Will Learn
By the end, you should be able to:
- explain the difference between dirty data and messy data
- write a short dataset summary and column dictionary
- perform manual and programmatic assessment
- document data quality issues by dimension
- identify completeness, validity, accuracy, and consistency problems
- identify tidiness problems
- clean data with the define-code-test workflow
- copy raw DataFrames before cleaning
- fix missing values, duplicates, bad types, and invalid values
- extract email and phone fields with regex
- combine split tables
- reshape wide course columns into tidy rows
- split progress ranges into start and end columns
- merge issue flags back into the cleaned table
1. Data Wrangling Refresher
Data wrangling has three major parts:
- Gather data.
- Assess data.
- Clean data.
Assessment and cleaning should not be mixed randomly.
First, discover and document problems. Then clean them one by one.
2. Dirty Data vs Messy Data
Unclean data usually has two broad problem types.
Dirty data has quality issues.
Examples:
- missing values
- duplicate rows
- invalid IDs
- impossible ages
- inconsistent category names
- incorrect data types
- inaccurate values
Messy data has structural issues.
Tidy data follows three rules:
- each variable is a column
- each observation is a row
- each observational unit is a table
If course names are stored as separate columns instead of rows, that is usually a tidiness issue.
3. Setup
Install the required packages:
pip install pandas openpyxlImport libraries:
import re
import numpy as np
import pandas as pdLoad the datasets:
learners = pd.read_csv("eda_messy_learners.csv")
progress = pd.read_csv("eda_course_progress.csv")
progress_extra = pd.read_csv("eda_course_progress_extra.csv")
issues = pd.read_csv("eda_learning_issues.csv")Preview them:
learners.head()progress.head()issues.head()4. Write A Dataset Summary
Before cleaning, write a plain-English summary.
Example:
This dataset contains learner profile records, course progress ranges, and issue reports for a synthetic online learning platform. The goal is to prepare the data for exploratory analysis of study behavior, course completion, and learner support issues.
This summary prevents you from cleaning without context.
5. Write Column Descriptions
Create a small data dictionary.
data_dictionary = pd.DataFrame(
[
["learners", "learner_id", "Unique learner identifier"],
["learners", "given_name", "Learner first name"],
["learners", "surname", "Learner last name"],
["learners", "city", "Learner city"],
["learners", "state", "State name or abbreviation"],
["learners", "pin_code", "Postal code"],
["learners", "contact", "Combined email and phone text"],
["learners", "age", "Learner age in years"],
["learners", "study_hours", "Total study hours in the period"],
["progress", "python/sql/statistics", "Start-to-end course completion range"],
["issues", "issue_type", "Support or data-quality issue label"],
],
columns=["table", "column", "description"],
)
data_dictionaryColumn descriptions help you decide whether a value is wrong or just unfamiliar.
6. Manual Assessment
Manual assessment means looking at the data directly.
Export tables to Excel for inspection:
with pd.ExcelWriter("eda_assessment_review.xlsx", engine="openpyxl") as writer:
learners.to_excel(writer, sheet_name="learners", index=False)
progress.to_excel(writer, sheet_name="progress", index=False)
progress_extra.to_excel(writer, sheet_name="progress_extra", index=False)
issues.to_excel(writer, sheet_name="issues", index=False)
data_dictionary.to_excel(writer, sheet_name="dictionary", index=False)Manual assessment is useful for spotting:
- strange spellings
- impossible values
- repeated people
- mixed formats
- columns that should be split
- columns that should be reshaped
7. Programmatic Assessment
Programmatic assessment uses Pandas checks.
learners.info()learners.sample(5, random_state=7)learners.isna().sum()learners.duplicated().sum()learners.duplicated(subset=["learner_id"]).sum()learners.describe(include="all")Use both manual and programmatic assessment. Each catches different problems.
8. Document Assessment Findings
Document issues before cleaning.
assessment_findings = pd.DataFrame(
[
["learners", "learner_id", "duplicate learner_id L012", "accuracy"],
["learners", "given_name/surname", "same learner appears twice with a new ID", "accuracy"],
["learners", "city/state/pin_code/country", "one learner has missing address fields", "completeness"],
["learners", "state", "state values mix names and abbreviations", "consistency"],
["learners", "pin_code", "one pin code has only four digits", "validity"],
["learners", "contact", "one learner has missing contact details", "completeness"],
["learners", "age", "one learner has impossible age 120", "accuracy"],
["learners", "study_hours", "one learner has unrealistic study_hours 500", "accuracy"],
["progress/progress_extra", "course columns", "course names are columns instead of rows", "tidiness"],
["progress/progress_extra", "course progress", "ranges include % signs and need splitting", "validity"],
["progress/progress_extra", "given_name/surname", "names are lowercase while learners table uses title case", "consistency"],
],
columns=["table", "column", "issue", "dimension"],
)
assessment_findingsGood cleaning starts with documented issues.
9. Data Quality Dimensions
Four useful quality dimensions:
| Dimension | Question | Example |
|---|---|---|
| Completeness | Is data missing? | missing city or contact |
| Validity | Does the value follow the expected rule? | four-digit pin code |
| Accuracy | Is the value correct in the real world? | age 120 for a learner record |
| Consistency | Is the same idea represented the same way? | MH and Maharashtra |
Tidiness is separate from quality. A dataset can have valid values but still be structurally awkward.
10. Cleaning Order
A practical cleaning order:
- Copy raw data.
- Fix completeness issues.
- Fix tidiness issues.
- Fix validity issues.
- Fix accuracy issues.
- Fix consistency issues.
- Test everything.
Always use define-code-test.
- Define: describe the cleaning action.
- Code: perform the cleaning.
- Test: verify it worked.
11. Copy Before Cleaning
Never clean the only copy.
learners_clean = learners.copy()
progress_clean = progress.copy()
progress_extra_clean = progress_extra.copy()
issues_clean = issues.copy()Raw data should remain available for audit and recovery.
12. Completeness: Fill Missing Address Fields
Define
Replace missing location fields with "Unknown" and missing contact with "No contact provided".
Code
location_cols = ["city", "state", "pin_code", "country"]
learners_clean[location_cols] = learners_clean[location_cols].fillna("Unknown")
learners_clean["contact"] = learners_clean["contact"].fillna("No contact provided")Explanation
- The code identifies specific columns related to location information including city, state, pin code, and country
- It fills all missing values in these location columns with the string "Unknown" to maintain data consistency
- The contact column receives a different default value of "No contact provided" for missing entries
- This approach ensures no null values remain in critical identification fields while preserving the original data structure
- The operation modifies the dataframe in-place by assigning the filled values back to their respective columns
Test
learners_clean[location_cols + ["contact"]].isna().sum()Explanation
- This code examines the learners dataset to identify missing values in specific columns
- It selects columns related to location information and contact details using list concatenation
- The .isna() method creates a boolean DataFrame where True represents missing values
- The .sum() function counts the total number of missing values per column
- This analysis helps determine data completeness and identify areas needing data cleaning or imputation strategies
13. Completeness: Avoid Blind Filling
Not every missing value should be filled with text.
For numeric columns, choose carefully:
learners_clean[["age", "study_hours"]].isna().sum()Explanation
- The code selects the "age" and "study_hours" columns from the learners_clean DataFrame using double brackets notation
- It applies the isna() method to identify all missing values (NaN) in these two columns, returning a boolean DataFrame
- The sum() method counts the total number of True values (missing entries) for each column
- This operation efficiently provides a quick overview of data quality by showing how many records are missing values in these specific fields
- The result displays the count of missing values for each column, helping identify potential data integrity issues
If numeric values are missing, options include:
- leave as missing
- impute with median
- impute by group
- remove rows if justified
- ask the source owner
In this dataset, numeric missingness is not the main issue; unrealistic numeric values are.
14. Tidiness: Combine Split Progress Tables
The progress data is split across two files.
Define
Append progress_extra below progress.
Code
all_progress = pd.concat(
[progress_clean, progress_extra_clean],
ignore_index=True,
)Explanation
- Concatenates two pandas DataFrames (progress_clean and progress_extra_clean) along the row axis (axis=0 by default)
- The ignore_index=True parameter resets the index values to create a continuous sequence from 0 to n-1
- This operation preserves all rows from both DataFrames in a single unified DataFrame structure
- Useful for merging datasets with identical column structures while maintaining proper sequential indexing
- Returns a new DataFrame object containing combined data from both input DataFrames
Test
print(len(progress_clean), len(progress_extra_clean), len(all_progress))Explanation
- This code snippet outputs the number of elements in three different progress-related variables using the len() function
- The variables appear to represent different stages of data cleaning where progress_clean contains raw progress data, progress_extra_clean contains more thoroughly cleaned data, and all_progress contains comprehensive progress information
- The comparison of these lengths helps determine how much data was removed or filtered during the cleaning process
- This type of analysis is commonly used in data preprocessing to track data quality and understand the impact of cleaning operations
- The output values would show whether data was lost, gained, or remained consistent across different cleaning steps
The combined row count should equal the sum of the two input tables.
15. Tidiness: Reshape Course Columns
The columns python, sql, and statistics are course names. They should be values in a course column.
Define
Use melt to convert wide course progress into tidy long format.
Code
progress_long = all_progress.melt(
id_vars=[
"given_name",
"surname",
"project_score_start",
"project_score_end",
],
value_vars=["python", "sql", "statistics"],
var_name="course",
value_name="progress_range",
)Explanation
- The code reshapes a DataFrame from wide format to long format using the melt function
- It preserves identifying columns (given_name, surname, project_score_start, project_score_end) as identifiers while melting course-specific columns
- The melt operation converts python, sql, and statistics columns into two new columns: course (containing the original column names) and progress_range (containing their values)
- This transformation makes it easier to analyze progress across different courses in a unified structure
- The resulting DataFrame has one row per student-course combination instead of separate columns for each course
Remove learners who did not take a course:
progress_long = progress_long[progress_long["progress_range"] != "-"].copy()Explanation
- Removes rows where the "progress_range" column contains empty string values ("-") from the progress_long DataFrame
- Creates a new filtered DataFrame by copying the subset that excludes these empty entries
- The operation preserves all other columns and rows while eliminating invalid progress range data
- This filtering step ensures data integrity by removing incomplete or placeholder progress range information
- The copy() method creates an explicit copy of the filtered results to avoid potential reference issues
Test
progress_long.head()Explanation
- Shows the initial portion of a DataFrame named 'progress_long' to examine its structure and content
- The head() method displays the first 5 rows by default, providing a quick overview of the dataset
- Useful for verifying data loading and understanding column names and data types
- Helps identify any immediate issues with data formatting or missing values before further analysis
progress_long["course"].value_counts()Explanation
- The code performs a frequency count of unique values in the "course" column of the progress_long DataFrame
- It returns a Series with each course name as the index and its corresponding enrollment count as the value
- This operation helps identify which courses have the highest and lowest student participation rates
- The result can be used for educational planning, resource allocation, and identifying popular vs underutilized courses
- Value_counts() automatically sorts results in descending order by default, showing most enrolled courses first
Now each row represents one learner-course observation.
16. Validity: Split Progress Ranges
Progress values like 22%-76% should become two numeric columns.
Define
Split progress_range into progress_start_pct and progress_end_pct.
Code
progress_parts = progress_long["progress_range"].str.replace("%", "", regex=False).str.split("-", expand=True)
progress_long["progress_start_pct"] = progress_parts[0].astype(int)
progress_long["progress_end_pct"] = progress_parts[1].astype(int)
progress_long = progress_long.drop(columns=["progress_range"])Explanation
- Removes the percentage symbol from the progress_range column values and splits the resulting strings by the dash character to separate start and end values
- Converts the separated start and end percentage values into integers and assigns them to new columns progress_start_pct and progress_end_pct
- Drops the original progress_range column since its data has been extracted into the two new numeric columns
- This transformation converts categorical string representations of progress ranges into discrete numerical values for easier analysis and calculations
Test
progress_long[["progress_start_pct", "progress_end_pct"]].describe()Explanation
- This code performs descriptive statistics analysis on two specific columns containing percentage values
- The describe() method calculates key statistical measures including count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values
- It provides insights into the distribution and central tendency of progress percentage data across the dataset
- The output helps identify data patterns, outliers, and overall range of progress measurements
- This analysis is useful for understanding how progress percentages are distributed and identifying any potential anomalies in the data
(progress_long["progress_end_pct"] >= progress_long["progress_start_pct"]).all()Explanation
- This code checks if every record in the progress_long DataFrame has a progress_end_pct value that is greater than or equal to its progress_start_pct value
- The comparison operation creates a boolean Series for each row, returning True when the end percentage meets or exceeds the start percentage
- The .all() method aggregates all boolean results into a single boolean value, returning True only if ALL records pass the validation check
- This validation ensures data integrity by confirming that progress measurements don't regress from start to end points
- The result helps identify potential data entry errors or processing issues where end progress values are unexpectedly lower than start values
The end value should be greater than or equal to the start value.
17. Validity: Fix Data Types
Dates and postal codes need careful handling.
learners_clean["signup_date"] = pd.to_datetime(learners_clean["signup_date"], errors="coerce")
learners_clean["pin_code"] = learners_clean["pin_code"].astype(str).str.replace(".0", "", regex=False)Explanation
- Converts the signup_date column to proper datetime format while handling invalid dates gracefully through error coercion
- Transforms pin_code values to strings and removes trailing ".0" decimal notation to standardize numeric postal codes
- Uses regex=False parameter in string replacement for better performance when simple pattern matching is sufficient
- The coerce parameter ensures malformed date entries don't raise exceptions but instead become NaT (Not a Time) values
- This data cleaning approach prepares structured data for time-based analysis and consistent numeric formatting
Postal codes are identifiers, not numbers. Keep them as strings.
Check date parsing:
learners_clean["signup_date"].isna().sum()Explanation
- The code accesses the "signup_date" column of the
learners_cleanDataFrame. - It uses the
isna()method to create a boolean Series indicating which values are NaN (missing). - The
sum()function then counts the total number of True values in the boolean Series, effectively counting the missing signup dates. - This is useful for data cleaning and understanding the completeness of the dataset.
18. Validity: Validate Pin Codes
In this synthetic India-style dataset, a valid pin code should have six digits.
pin_is_valid = learners_clean["pin_code"].str.fullmatch(r"\d{6}")
learners_clean.loc[~pin_is_valid & learners_clean["pin_code"].ne("Unknown"), ["learner_id", "pin_code"]]Explanation
- The code checks if the "pin_code" column in the DataFrame
learners_cleancontains valid 6-digit numeric values using a regular expression. - It creates a boolean Series
pin_is_validthat indicates whether each pin code matches the specified pattern. - The
locmethod is used to filter rows where the pin code is invalid (not matching the pattern) and is not equal to "Unknown". - The filtered DataFrame displays only the "learner_id" and "pin_code" columns for the invalid entries, allowing for easy identification of issues.
Fix the known bad value:
learners_clean.loc[learners_clean["pin_code"] == "4001", "pin_code"] = "400001"Explanation
- The code targets a DataFrame named
learners_cleanwhich likely contains learner information. - It uses the
.locmethod to locate rows where the "pin_code" column matches the value "4001". - For those identified rows, it updates the "pin_code" to "400001", ensuring a standardized format.
- This operation is useful for data cleaning and preparation before analysis or reporting.
- The change is made in-place, meaning the original DataFrame is modified directly without creating a copy.
Test again:
pin_is_valid = learners_clean["pin_code"].str.fullmatch(r"\d{6}")
learners_clean.loc[~pin_is_valid & learners_clean["pin_code"].ne("Unknown"), ["learner_id", "pin_code"]]Explanation
- The code checks if the "pin_code" column in the
learners_cleanDataFrame contains valid 6-digit numeric PINs using a regular expression. - The
str.fullmatchmethod returns a boolean Series indicating which entries match the 6-digit pattern. - The
locmethod is then used to filter the DataFrame, selecting rows where the PIN is invalid (not matching the regex) and not equal to "Unknown". - The resulting DataFrame displays only the "learner_id" and "pin_code" columns for the filtered entries.
19. Accuracy: Handle Outliers
The dataset intentionally includes unrealistic values.
learners_clean[["age", "study_hours"]].describe()Explanation
- The code selects the "age" and "study_hours" columns from the
learners_cleanDataFrame. - It uses the
describe()method to compute summary statistics such as count, mean, standard deviation, min, max, and quartiles for the selected columns. - This method provides a quick overview of the distribution and central tendency of the data in these columns.
- The output is useful for understanding the characteristics of the learners' age and study habits.
Find suspicious rows:
learners_clean.query("age > 90 or study_hours > 120")Explanation
- The code uses the
querymethod to filter rows in thelearners_cleanDataFrame. - It selects rows where the
agecolumn is greater than 90 or thestudy_hourscolumn exceeds 120. - This allows for efficient data manipulation and retrieval based on specific criteria.
- The resulting DataFrame will only include learners who meet at least one of the specified conditions.
- This approach enhances readability and maintainability of the code compared to traditional indexing methods.
Define
Replace impossible age and unrealistic study_hours with missing values for review.
Code
learners_clean.loc[learners_clean["age"] > 90, "age"] = np.nan
learners_clean.loc[learners_clean["study_hours"] > 120, "study_hours"] = np.nanExplanation
- The code uses the
locmethod from pandas to access specific rows in the DataFramelearners_clean. - It checks for ages greater than 90 and replaces those values with
NaN, indicating missing data. - Similarly, it identifies study hours exceeding 120 and sets those values to
NaN. - This data cleaning step helps ensure that the dataset is more accurate and reliable for analysis.
- Using
np.nanfrom the NumPy library allows for consistent handling of missing values in the DataFrame.
Test
learners_clean.query("age > 90 or study_hours > 120")Explanation
- The code uses the
querymethod to filter rows in thelearners_cleanDataFrame. - It selects rows where the
agecolumn is greater than 90 or thestudy_hourscolumn exceeds 120. - This allows for efficient data manipulation and retrieval based on specific criteria.
- The resulting DataFrame will only include learners who meet at least one of the specified conditions.
- This approach enhances readability and maintainability of the code compared to traditional indexing methods.
This is safer than inventing corrected values.
20. Accuracy: Remove Duplicate Records
Check duplicate IDs:
learners_clean[learners_clean.duplicated(subset=["learner_id"], keep=False)]Explanation
- The code uses the
duplicated()method from the pandas library to check for duplicate rows in the DataFramelearners_clean. - The
subsetparameter specifies that only thelearner_idcolumn should be considered when identifying duplicates. - The
keep=Falseargument ensures that all occurrences of the duplicates are marked as True, not just the first or last occurrence. - The result is a DataFrame containing all rows that have duplicate
learner_idvalues, allowing for further analysis or cleaning of the data.
Check duplicate names and contact:
learners_clean[
learners_clean.duplicated(subset=["given_name", "surname", "contact"], keep=False)
]Explanation
- The code filters a DataFrame named
learners_cleanto identify duplicate rows. - It uses the
duplicatedmethod, specifying the subset of columns["given_name", "surname", "contact"]to check for duplicates. - The
keep=Falseargument ensures that all occurrences of duplicates are marked as True, not just the subsequent ones. - The result is a DataFrame containing only the rows that have duplicates in the specified columns.
Define
Remove exact repeated learner records and keep the first copy.
Code
learners_clean = learners_clean.drop_duplicates()
learners_clean = learners_clean.drop_duplicates(subset=["learner_id"], keep="first")Explanation
- The first line removes all duplicate rows from the learners_clean DataFrame, keeping only unique combinations of all column values
- The second line specifically targets duplicate entries based on the "learner_id" column, keeping only the first occurrence of each unique learner ID
- This two-step approach ensures comprehensive deduplication while maintaining data integrity for individual learners
- The drop_duplicates() method with keep="first" parameter preserves the earliest record when duplicates are found
- This cleaning process prevents data skewing in analytics and ensures accurate learner tracking
Test
learners_clean.duplicated(subset=["learner_id"]).sum()Explanation
- The code checks for duplicate rows in a DataFrame called learners_clean based on the learner_id column
- It uses the duplicated() method with subset parameter to identify duplicate entries while focusing only on the specified column
- The sum() function counts the total number of True values returned by duplicated(), giving the count of duplicate records
- This approach efficiently identifies how many learner records appear more than once in the dataset
- The result helps data analysts understand data quality issues and potential redundancy in learner identification
Some duplicates need business rules. Do not blindly drop rows when records conflict.
21. Consistency: Normalize State Names
State values mix abbreviations and full names.
learners_clean["state"].value_counts(dropna=False)Explanation
- The code examines the frequency distribution of values in the "state" column of the learners_clean DataFrame
- It uses value_counts() method to count occurrences of each unique state value while preserving NaN values through dropna=False parameter
- This approach helps identify how many records exist for each state category and reveals the extent of missing data in the state field
- The results show both populated states and any null/missing entries that might need handling in subsequent data processing steps
- Useful for understanding data completeness and identifying potential issues with state information across the dataset
Create a mapping:
state_map = {
"MH": "Maharashtra",
"Maharashtra": "Maharashtra",
"DL": "Delhi",
"Delhi": "Delhi",
"Karnataka": "Karnataka",
"Telangana": "Telangana",
"TS": "Telangana",
"TN": "Tamil Nadu",
"Tamil Nadu": "Tamil Nadu",
"Gujarat": "Gujarat",
"West Bengal": "West Bengal",
"Unknown": "Unknown",
}
learners_clean["state"] = learners_clean["state"].map(state_map).fillna(learners_clean["state"])Explanation
- Creates a dictionary mapping state abbreviations and partial names to their complete state names for data standardization
- Uses the pandas map() function to replace state values in the DataFrame column with their full names based on the dictionary lookup
- Applies fillna() to preserve original values for any states that don't match the mapping dictionary, ensuring no data loss
- This approach handles multiple naming conventions (abbreviations vs full names) in a single clean operation
- The resulting DataFrame column contains standardized state names for consistent data analysis
Test:
learners_clean["state"].value_counts(dropna=False)Explanation
- The code examines the frequency distribution of values in the "state" column of the learners_clean DataFrame
- It uses value_counts() method to count occurrences of each unique state value while preserving NaN values through dropna=False parameter
- This approach helps identify how many records exist for each state category and reveals the extent of missing data in the state field
- The results show both populated states and any null/missing entries that might need handling in subsequent data processing steps
- Useful for understanding data completeness and identifying potential issues with state information across the dataset
22. Consistency: Normalize Names For Joining
Progress and issue tables use lowercase names. Learner records use title case.
Create join keys in all tables:
def make_name_key(df):
return (
df["given_name"].str.strip().str.lower()
+ "|"
+ df["surname"].str.strip().str.lower()
)
learners_clean["name_key"] = make_name_key(learners_clean)
progress_long["name_key"] = make_name_key(progress_long)
issues_clean["name_key"] = make_name_key(issues_clean)Explanation
- The function
make_name_keygenerates a consistent identifier by combining stripped and lowercased given names and surnames with a pipe delimiter - Each dataframe (learners_clean, progress_long, issues_clean) receives a new name_key column that enables reliable matching across datasets
- The string operations ensure case-insensitive comparison and remove whitespace variations that could prevent proper data joining
- This approach facilitates efficient data merging and deduplication when working with potentially inconsistent name formatting
- The resulting name_key columns serve as primary keys for aligning learner information across different data tables
Test key coverage:
missing_progress_matches = set(progress_long["name_key"]) - set(learners_clean["name_key"])
missing_progress_matchesExplanation
- Calculates the difference between two sets of keys to identify records that exist in progress data but not in learners data
- Uses set subtraction operation to find missing matches where learner names don't have corresponding entries in the clean learners dataset
- The result shows which name keys from progress_long are absent from learners_clean, helping identify data quality issues
- This technique efficiently identifies orphaned records that may need investigation or reconciliation
- The output displays the actual missing key values that can be used for further data analysis or cleaning steps
23. Extract Email And Phone With Regex
The contact column combines email and phone.
Define
Create separate email and phone columns, then remove the original combined field.
Code
email_pattern = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
phone_pattern = re.compile(r"(?:\+91\s*)?\d{5}[-\s]?\d{5}")
learners_clean["email"] = learners_clean["contact"].str.extract(f"({email_pattern.pattern})", expand=False)
learners_clean["phone"] = learners_clean["contact"].str.extract(f"({phone_pattern.pattern})", expand=False)
learners_clean["phone"] = (
learners_clean["phone"]
.str.replace("+91", "", regex=False)
.str.replace(" ", "", regex=False)
.str.replace("-", "", regex=False)
)
learners_clean = learners_clean.drop(columns=["contact"])Explanation
- Compiles regular expression patterns to identify email addresses and Indian phone numbers in the contact data
- Uses string extraction methods to isolate matching email and phone number patterns from the contact column
- Cleans phone numbers by removing country code prefix (+91) and any whitespace or hyphens for standardization
- Drops the original contact column since extracted email and phone number data are now stored separately
Test
learners_clean[["email", "phone"]].head()Explanation
- This code selects and displays the first 5 rows of data from two specific columns ('email' and 'phone') in the learners_clean DataFrame
- The double square brackets [[ ]] indicate selection of multiple columns by name rather than a single column
- The .head() method limits the output to show only the initial records, making it useful for quick data inspection
- This approach helps verify the structure and content of contact information fields before further data processing
- The operation is purely observational and doesn't modify the original dataset
learners_clean[["email", "phone"]].isna().sum()Explanation
- This code examines the learners_clean DataFrame to identify missing data in the email and phone columns
- The double brackets [["email", "phone"]] select multiple columns simultaneously from the DataFrame
- The .isna() method returns a boolean DataFrame where True indicates missing values (NaN) and False indicates valid entries
- The .sum() function counts the True values (missing entries) for each column, returning a Series with the total missing count per column
- This approach efficiently provides a quick overview of data quality issues in contact information fields
Missing contact fields should remain missing. Do not fabricate personal details.
24. Merge Cleaned Tables
Merge learner details with progress rows.
learner_progress = progress_long.merge(
learners_clean,
on="name_key",
how="left",
validate="many_to_one",
suffixes=("_progress", "_learner"),
)
learner_progress.head()Explanation
- Combines two datasets (progress_long and learners_clean) based on the common "name_key" column to create a comprehensive learner progress dataset
- Uses a left join operation to preserve all records from the progress dataset while adding matching learner information
- Implements validation to ensure the merge maintains a many-to-one relationship between datasets
- Applies suffixes to distinguish columns with identical names between the two datasets during the merge process
- Displays the first few rows of the merged dataset to verify the successful combination of data
Merge issue flags:
issue_summary = (
issues_clean.groupby("name_key", as_index=False)
.agg(
issue_count=("issue_type", "count"),
highest_severity=("severity", "max"),
)
)
learner_progress = learner_progress.merge(
issue_summary,
on="name_key",
how="left",
validate="many_to_one",
)
learner_progress["issue_count"] = learner_progress["issue_count"].fillna(0).astype(int)
learner_progress["highest_severity"] = learner_progress["highest_severity"].fillna("none")Explanation
- Groups cleaned issues by name key and calculates total issue count and highest severity level for each group
- Merges the aggregated issue data with learner progress records using a left join to preserve all learner data
- Handles missing values by filling null issue counts with zero and null severities with "none" string
- Ensures proper data types by converting issue count to integer and maintaining severity as string format
Test:
learner_progress[["course", "progress_start_pct", "progress_end_pct", "issue_count"]].head()Explanation
- Retrieves specific columns (course name, initial progress percentage, final progress percentage, and issue count) from the learner_progress DataFrame
- Uses double square brackets to select multiple columns as a subset of the original dataset
- Calls head() method to show only the first 5 rows of this filtered data for quick preview
- This operation helps analyze how learners progressed through courses and identifies potential problem areas with issue counts
- Commonly used for data exploration and validation before performing deeper analysis on learning metrics
25. Create Derived Cleaning Columns
Calculate changes after cleaning.
learner_progress["progress_gain_pct"] = (
learner_progress["progress_end_pct"] - learner_progress["progress_start_pct"]
)
learner_progress["project_score_gain"] = (
learner_progress["project_score_end"] - learner_progress["project_score_start"]
)Explanation
- Computes the percentage point gain in learner progress by subtracting initial progress from final progress percentages
- Calculates the numerical score improvement in learner projects by finding the difference between end and start project scores
- These calculations help measure learning effectiveness and performance enhancement over time
- The operations create new columns in the learner progress dataframe that quantify measurable improvements
- This approach enables tracking of both relative progress changes and absolute score gains for performance analysis
Check results:
learner_progress[["course", "progress_gain_pct", "project_score_gain"]].describe()Explanation
- The code snippet utilizes the
describe()method from the Pandas library to generate summary statistics. - It focuses on three specific columns: "course", "progress_gain_pct", and "project_score_gain" from the
learner_progressDataFrame. - The output includes key statistical measures such as count, mean, standard deviation, minimum, and maximum values for the selected columns.
- This analysis helps in understanding the overall performance and progress of learners in various courses.
Derived columns should be created after types and structure are fixed.
26. Final Quality Checks
Run final checks before saving.
checks = {
"duplicate_learner_ids": learners_clean.duplicated(subset=["learner_id"]).sum(),
"missing_name_keys_in_progress": learner_progress["learner_id"].isna().sum(),
"invalid_progress_gain": (learner_progress["progress_gain_pct"] < 0).sum(),
"missing_course": learner_progress["course"].isna().sum(),
}
checksExplanation
- A dictionary named
checksis created to store various validation results related to learner data. - The
duplicate_learner_idskey counts the number of duplicate entries in thelearner_idcolumn of thelearners_cleanDataFrame. - The
missing_name_keys_in_progresskey calculates how manylearner_identries in thelearner_progressDataFrame are missing (NaN). - The
invalid_progress_gainkey sums up instances where theprogress_gain_pctis less than zero, indicating invalid progress data. - The
missing_coursekey counts the number of missing course entries in thelearner_progressDataFrame.
All checks should be explainable.
27. Save Cleaned Outputs
Save cleaned tables separately.
learners_clean.to_csv("eda_learners_clean.csv", index=False)
progress_long.to_csv("eda_progress_tidy.csv", index=False)
learner_progress.to_csv("eda_learner_progress_analysis_ready.csv", index=False)
assessment_findings.to_csv("eda_assessment_findings.csv", index=False)Explanation
- Each line of code uses the
to_csvmethod to export a DataFrame to a CSV file. - The first argument specifies the filename, while
index=Falseprevents the DataFrame index from being written to the file. - This approach is useful for preparing cleaned and processed data for further analysis or sharing.
- The filenames indicate that the data pertains to learners, progress, and assessment findings, suggesting a focus on educational data analysis.
- By saving the data in CSV format, it ensures compatibility with various data analysis tools and software.
Keep raw files and cleaned files separate.
28. Cleaning Checklist
Before moving into deeper EDA, confirm:
- raw DataFrames were copied before cleaning
- missing values were handled intentionally
- duplicate IDs were checked
- duplicate entity records were checked
- date columns were parsed
- ID-like columns stayed as strings
- invalid values were fixed or marked missing
- inconsistent categories were normalized
- untidy wide columns were reshaped
- derived columns were created after cleaning
- joins used validation
- cleaning steps were tested
- findings were documented
29. Common Mistakes
Cleaning Without Documentation
If you do not document issues, later readers cannot tell why the code exists.
Filling Every Missing Value With One Word
"Unknown" may work for address fields. It is usually wrong for numeric measurements.
Treating IDs As Numbers
Postal codes, learner IDs, phone numbers, and course IDs are identifiers. Store them as strings.
Dropping Duplicates Blindly
Some duplicates are exact repeats. Others are conflicting records. Conflicting records need business rules.
Fixing Outliers By Guessing
If you cannot verify the true value, mark it missing or flag it for review.
Joining Before Normalizing Keys
Names, IDs, and category labels should be normalized before joins.
30. Practice Tasks
Use the included CSV files:
- Load all four datasets and write a short summary.
- Create a data dictionary for all tables.
- Export all raw tables into one Excel workbook.
- Find missing values in
eda_messy_learners.csv. - Find duplicate learner IDs.
- Detect unrealistic
ageandstudy_hoursvalues. - Normalize state names.
- Extract
emailandphonefromcontact. - Combine the two course progress tables.
- Reshape course columns with
melt. - Split progress ranges into start and end percentages.
- Merge learner, progress, and issue tables.
- Create progress gain and project score gain columns.
- Save cleaned analysis-ready CSV files.
31. Interview-Style Questions
What is the difference between dirty and messy data?
Dirty data has content quality problems. Messy data has structural problems.
What are the three rules of tidy data?
Each variable is a column, each observation is a row, and each observational unit is a table.
Why should you copy a DataFrame before cleaning?
Copying preserves the raw data so you can audit, compare, or restart cleaning if needed.
What does define-code-test mean?
Define the cleaning action, write the code, then test that the action worked.
Why should postal codes be strings?
Postal codes are identifiers. Numeric conversion can remove leading zeros or create decimal artifacts.
Why is melt useful?
melt converts wide columns into tidy long rows, which often makes grouping and plotting easier.
Why should you use validate in merge?
It checks whether the join relationship matches your expectation and catches accidental row multiplication.
Should outliers always be removed?
No. Outliers should be investigated. Remove, correct, or mark them only when you have a defensible reason.
32. Final Notes
Assessment and cleaning are iterative.
You rarely discover every issue in one pass. A chart, join, or summary table may reveal a new problem. When that happens, go back to assessment, document the issue, clean it, and test again.
Good cleaning is not about making data look perfect. It is about making every assumption visible and every transformation testable.
