End-to-End Smartphone Marketplace EDA Case Study: From Messy Listings to Business Insights
This case study uses original synthetic data.
It does not use copied course data, scraped marketplace data, proprietary rows, or any third-party CSV.
The goal is to practice a complete EDA workflow on a realistic smartphone listing problem: raw product pages contain inconsistent brands, mixed currency formats, missing ratings, duplicate listings, non-phone devices, text-heavy specifications, invalid dates, outliers, and a second table with sales-channel metrics.
Files Used In This Case Study
Use these CSV files from the same folder as this post:
smartphone_eda_listings_raw.csvsmartphone_eda_sales_channels.csvsmartphone_eda_brand_reference.csvsmartphone_eda_data_dictionary.csv
Load them like this:
import re
from pathlib import Path
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
DATA_DIR = Path(".")
listings_raw = pd.read_csv(DATA_DIR / "smartphone_eda_listings_raw.csv")
sales_raw = pd.read_csv(DATA_DIR / "smartphone_eda_sales_channels.csv")
brand_ref = pd.read_csv(DATA_DIR / "smartphone_eda_brand_reference.csv")
dictionary = pd.read_csv(DATA_DIR / "smartphone_eda_data_dictionary.csv")Business Questions
Before cleaning anything, define what the analysis should answer.
In this case study, the marketplace team wants to know:
- Which price segments dominate the current smartphone catalog?
- Which brands and channels are driving units sold?
- How do specs such as RAM, storage, 5G, battery, display refresh rate, and camera resolution relate to price?
- Are discounts associated with higher sales?
- Which listings look suspicious, incomplete, duplicated, invalid, or operationally risky?
- What data-quality rules should be added before the next listing ingest?
Non-goals:
- predicting future sales
- scraping live ecommerce websites
- recommending real investment or purchasing decisions
- treating the synthetic dataset as market truth
Full EDA Workflow Map
Use this map as the checklist for the notebook.
- Problem framing
- Data inventory
- Data dictionary review
- Manual assessment
- Programmatic assessment
- Quality issue log
- Tidiness issue log
- Cleaning plan
- Type conversion
- Category standardization
- Text parsing and feature extraction
- Duplicate detection
- Missing-value analysis
- Invalid-value checks
- Outlier detection
- Domain-rule validation
- Feature engineering
- Table joins
- Post-cleaning tests
- Univariate analysis
- Bivariate analysis
- Multivariate analysis
- Segment and cohort analysis
- Time-based checks
- Correlation and association checks
- Distribution diagnostics
- Robust statistics
- Sensitivity analysis
- Bias and data-collection limitations
- Final insight report
- Export of analysis-ready data
No single project can literally cover every possible EDA method ever invented, but this workflow covers the practical EDA steps expected in production analytics, data-science interviews, and portfolio notebooks.
1. Data Inventory
Start with shapes and a preview.
tables = {
"listings_raw": listings_raw,
"sales_raw": sales_raw,
"brand_ref": brand_ref,
"dictionary": dictionary,
}
for name, df in tables.items():
print(name, df.shape)
display(df.head())Expected observations:
listings_rawis the primary table.sales_rawis a listing-level performance table.brand_refis a reference table used for canonical brand names.dictionaryexplains the columns and known intentional quality issues.
2. Manual Assessment
Look at real rows before writing cleaning code.
pd.set_option("display.max_columns", None)
listings_raw.sample(10, random_state=42)Manual assessment should catch issues like:
- inconsistent brand spellings such as
Samsung,samsung,Moto,Nothng - mixed price formats such as
INR 34,999,Rs. 27999,TBD - duplicate listing IDs
- non-smartphone rows
- invalid ratings
- impossible return windows
- invalid launch dates
- specs stored as dense text instead of numeric columns
Write the issue log before cleaning.
issue_log = pd.DataFrame(
[
["brand_raw", "consistency", "Same brand appears in multiple spellings"],
["price_text", "validity", "Currency symbols, spaces, commas, and TBD appear"],
["rating_text", "validity", "Rating should be between 0 and 5"],
["launch_date_text", "validity", "Multiple formats and one invalid date"],
["device_type_raw", "validity", "Dataset includes non-smartphone devices"],
["listing_id", "uniqueness", "One listing ID appears more than once"],
["return_window_days", "validity", "Return window should not be negative"],
["spec columns", "tidiness", "Processor, battery, display, memory, and camera contain multiple variables"],
],
columns=["column", "issue_type", "description"],
)
issue_log3. Programmatic Assessment
Use code to quantify what manual review suggested.
listings_raw.info()listings_raw.describe(include="all").Tmissing_summary = (
listings_raw.isna().sum()
.rename("missing_count")
.to_frame()
.assign(missing_pct=lambda x: x["missing_count"] / len(listings_raw) * 100)
.sort_values("missing_pct", ascending=False)
)
missing_summarylistings_raw.duplicated().sum()listings_raw["listing_id"].duplicated().sum()for col in ["device_type_raw", "brand_raw", "marketplace", "stock_status", "seller_tier"]:
display(listings_raw[col].value_counts(dropna=False))4. Cleaning Helpers
Create small reusable helpers instead of repeating regex logic.
def parse_money(value):
if pd.isna(value):
return np.nan
text = str(value).strip().lower()
if text in {"", "tbd", "na", "n/a", "not available"}:
return np.nan
digits = re.sub(r"[^0-9.]", "", text)
return float(digits) if digits else np.nan
def parse_number(value):
if pd.isna(value):
return np.nan
text = str(value).strip()
text = text.replace(",", "")
match = re.search(r"-?[0-9]+(?:\.[0-9]+)?", text)
return float(match.group()) if match else np.nan
def contains_token(series, pattern):
return series.fillna("").str.contains(pattern, case=False, regex=True)5. Define-Code-Test Cleaning Workflow
Copy raw data first.
listings = listings_raw.copy()
sales = sales_raw.copy()Normalize obvious string columns.
text_cols = listings.select_dtypes(include="object").columns
for col in text_cols:
listings[col] = listings[col].str.strip()Parse dates.
listings["snapshot_date"] = pd.to_datetime(listings["snapshot_date"], errors="coerce")
listings["launch_date"] = pd.to_datetime(listings["launch_date_text"], errors="coerce", dayfirst=False)
listings.loc[listings["launch_date"].isna(), ["listing_id", "launch_date_text"]]Parse numeric fields.
listings["price_inr"] = listings["price_text"].apply(parse_money)
listings["mrp_inr"] = listings["mrp_text"].apply(parse_money)
listings["rating"] = listings["rating_text"].apply(parse_number)
listings["review_count"] = listings["review_count_text"].apply(parse_number)
listings["return_window_days_num"] = listings["return_window_days"].apply(parse_number)Test the conversions.
assert listings["price_inr"].dropna().ge(0).all()
assert listings["mrp_inr"].dropna().ge(0).all()6. Canonical Brand Mapping
Use the reference table rather than manual replacement chains.
listings = listings.merge(
brand_ref,
left_on="brand_raw",
right_on="raw_brand",
how="left",
validate="many_to_one",
)
listings["brand"] = listings["canonical_brand"].fillna(listings["brand_raw"].str.title())
listings.loc[listings["canonical_brand"].isna(), ["brand_raw", "model_raw"]]7. Device-Type Cleaning
The raw table intentionally contains non-phone rows.
listings["device_type"] = (
listings["device_type_raw"]
.str.lower()
.str.replace(r"\s+", " ", regex=True)
.str.strip()
)
smartphone_mask = listings["device_type"].isin(
["smartphone", "smart phone", "phone", "foldable phone"]
)
non_phone_rows = listings.loc[~smartphone_mask, ["listing_id", "device_type_raw", "brand_raw", "model_raw"]]
non_phone_rowsFor smartphone EDA, remove non-phone rows after documenting them.
listings = listings.loc[smartphone_mask].copy()8. Duplicate Handling
Find exact duplicates and key duplicates separately.
exact_duplicate_count = listings.duplicated().sum()
id_duplicate_rows = listings[listings["listing_id"].duplicated(keep=False)].sort_values("listing_id")
exact_duplicate_count, id_duplicate_rows[["listing_id", "brand", "model_raw", "price_inr"]]Deduplicate by keeping the first exact listing record.
listings = listings.drop_duplicates()
listings = listings.sort_values(["listing_id", "snapshot_date"]).drop_duplicates("listing_id", keep="last")9. Connectivity Feature Extraction
listings["has_5g"] = contains_token(listings["sim_features"], r"\b5g\b")
listings["has_nfc"] = contains_token(listings["sim_features"], r"\bnfc\b")
listings["has_ir_blaster"] = contains_token(listings["sim_features"], r"\bir\b|ir blaster")
listings["has_esim"] = contains_token(listings["sim_features"], r"\besim\b")
listings["has_uwb"] = contains_token(listings["sim_features"], r"\buwb\b")10. Processor Parsing
Extract chipset, core label, and clock speed.
listings["processor_clean"] = listings["processor_text"].str.replace("|", ",", regex=False)
listings["processor_clock_ghz"] = (
listings["processor_clean"]
.str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*ghz", flags=re.IGNORECASE)[0]
.astype(float)
)
listings["processor_core_label"] = (
listings["processor_clean"]
.str.extract(r"((?:hexa|octa|quad|8)\s*(?:core|cores|core processor)?)", flags=re.IGNORECASE)[0]
.str.lower()
.str.replace("8 cores", "octa core")
.str.replace("8 core processor", "octa core")
)
listings["processor_family"] = (
listings["processor_clean"]
.str.extract(r"^(.*?)(?:,|\||$)")[0]
.str.strip()
)11. Memory and Storage Parsing
listings["ram_gb"] = (
listings["memory_text"]
.str.extract(r"([0-9]+)\s*gb\s*ram", flags=re.IGNORECASE)[0]
.astype(float)
)
listings["storage_gb"] = (
listings["memory_text"]
.str.extract(r"([0-9]+)\s*gb\s*(?:rom|storage)", flags=re.IGNORECASE)[0]
.astype(float)
)
listings["storage_gb"] = listings["storage_gb"].fillna(
listings["memory_text"].str.extract(r"([0-9]+)\s*gb$", flags=re.IGNORECASE)[0].astype(float)
)12. Battery and Charging Parsing
listings["battery_mah"] = (
listings["battery_text"]
.str.extract(r"([0-9]{4})\s*mah", flags=re.IGNORECASE)[0]
.astype(float)
)
listings["fast_charging_w"] = (
listings["battery_text"]
.str.extract(r"([0-9]{2,3})\s*w", flags=re.IGNORECASE)[0]
.astype(float)
)
listings["charger_not_bundled"] = contains_token(listings["battery_text"], r"not bundled|no charger")13. Display Parsing
listings["screen_size_in"] = (
listings["display_text"]
.str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*(?:inch|in)", flags=re.IGNORECASE)[0]
.astype(float)
)
listings["refresh_rate_hz"] = (
listings["display_text"]
.str.extract(r"([0-9]{2,3})\s*hz", flags=re.IGNORECASE)[0]
.astype(float)
)
listings["panel_type"] = np.select(
[
contains_token(listings["display_text"], "ltpo"),
contains_token(listings["display_text"], "amoled"),
contains_token(listings["display_text"], "oled"),
contains_token(listings["display_text"], "lcd"),
],
["LTPO OLED", "AMOLED", "OLED", "LCD"],
default="Unknown",
)14. Camera Parsing
rear_text = listings["camera_text"].str.split("&| and ", regex=True).str[0]
front_text = listings["camera_text"].str.extract(r"(?:&|and)\s*(.*front.*|.*selfie.*)", flags=re.IGNORECASE)[0]
rear_text_clean = rear_text.fillna("").str.lower()
listings["primary_rear_mp"] = rear_text.str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*mp", flags=re.IGNORECASE)[0].astype(float)
listings["front_camera_mp"] = front_text.str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*mp", flags=re.IGNORECASE)[0].astype(float)
listings["rear_camera_count"] = rear_text_clean.str.count(r"\bmp\b")15. Card Slot and OS Normalization
listings["has_memory_card"] = contains_token(listings["card_slot"], r"supported|microsd|hybrid|dedicated")
listings["hybrid_card_slot"] = contains_token(listings["card_slot"], r"hybrid")
listings["max_card_tb"] = (
listings["card_slot"]
.str.extract(r"upto\s*([0-9]+)\s*tb", flags=re.IGNORECASE)[0]
.astype(float)
)
listings["os_family"] = np.select(
[
contains_token(listings["os_text"], "ios"),
contains_token(listings["os_text"], "harmony|emui"),
contains_token(listings["os_text"], "android|oxygen|coloros|funtouch|hyperos|realme|magic|xos|hios|rog|nothing|one ui"),
],
["iOS", "HarmonyOS/EMUI", "Android"],
default="Other/Unknown",
)16. Stock Status and Return Policy
listings["stock_status_clean"] = np.select(
[
contains_token(listings["stock_status"], r"out"),
contains_token(listings["stock_status"], r"preorder"),
contains_token(listings["stock_status"], r"limited"),
contains_token(listings["stock_status"], r"in stock|available"),
],
["out_of_stock", "preorder", "limited", "available"],
default="unknown",
)
invalid_return_windows = listings[listings["return_window_days_num"] < 0]
invalid_return_windows[["listing_id", "return_window_days"]]Set invalid return windows to missing.
listings.loc[listings["return_window_days_num"] < 0, "return_window_days_num"] = np.nan17. Domain Validation
Create validation checks after cleaning.
validation_report = {
"negative_price_rows": int((listings["price_inr"] < 0).sum()),
"price_above_mrp_rows": int((listings["price_inr"] > listings["mrp_inr"]).sum()),
"invalid_rating_rows": int(((listings["rating"] < 0) | (listings["rating"] > 5)).sum()),
"missing_brand_rows": int(listings["brand"].isna().sum()),
"missing_price_rows": int(listings["price_inr"].isna().sum()),
"missing_launch_date_rows": int(listings["launch_date"].isna().sum()),
"duplicate_listing_ids": int(listings["listing_id"].duplicated().sum()),
}
validation_reportHandle invalid ratings.
listings.loc[(listings["rating"] < 0) | (listings["rating"] > 5), "rating"] = np.nan18. Outlier Detection
Use multiple methods; do not blindly delete rows.
price_q1 = listings["price_inr"].quantile(0.25)
price_q3 = listings["price_inr"].quantile(0.75)
price_iqr = price_q3 - price_q1
lower_fence = price_q1 - 1.5 * price_iqr
upper_fence = price_q3 + 1.5 * price_iqr
price_outliers = listings[
(listings["price_inr"] < lower_fence) |
(listings["price_inr"] > upper_fence)
][["listing_id", "brand", "model_raw", "price_inr", "mrp_inr"]]
price_outliersFor this dataset, investigate the extreme low price row.
listings["price_issue_flag"] = listings["price_inr"].lt(3000) | listings["price_inr"].isna()19. Missing-Value Strategy
Separate missing because unavailable from missing because invalid.
missing_after_cleaning = (
listings[
[
"price_inr",
"rating",
"review_count",
"launch_date",
"ram_gb",
"storage_gb",
"battery_mah",
"refresh_rate_hz",
"front_camera_mp",
]
]
.isna()
.mean()
.mul(100)
.sort_values(ascending=False)
)
missing_after_cleaningPractical strategy:
- keep missing
price_inras missing for price analysis - keep missing
ratingas missing, not zero - impute missing
refresh_rate_hzto 60 only if listing copy clearly implies a basic display - do not impute launch dates unless there is a trusted catalog source
- use missingness flags when missingness may carry business meaning
listings["rating_missing"] = listings["rating"].isna()
listings["launch_date_missing"] = listings["launch_date"].isna()20. Feature Engineering
listings["discount_pct"] = (1 - listings["price_inr"] / listings["mrp_inr"]) * 100
listings["listing_age_days"] = (listings["snapshot_date"] - listings["launch_date"]).dt.days
listings["price_segment"] = pd.cut(
listings["price_inr"],
bins=[0, 15000, 30000, 50000, 80000, np.inf],
labels=["budget", "midrange", "upper_midrange", "premium", "ultra_premium"],
)
listings["storage_per_1000_inr"] = listings["storage_gb"] / (listings["price_inr"] / 1000)
listings["battery_per_1000_inr"] = listings["battery_mah"] / (listings["price_inr"] / 1000)21. Join Sales Channel Data
Clean numeric sales fields first.
sales["units_sold_30d"] = sales["units_sold_30d"].apply(parse_number)
sales["ad_spend_inr"] = sales["ad_spend_inr"].apply(parse_number)
sales["avg_delivery_days"] = sales["avg_delivery_days"].apply(parse_number)
sales["return_rate_pct"] = sales["return_rate_pct"].apply(parse_number)Join to the cleaned listing table.
analysis = listings.merge(
sales,
on="listing_id",
how="left",
validate="one_to_one",
)
analysis.shapeCreate sales efficiency metrics.
analysis["ad_spend_per_unit"] = analysis["ad_spend_inr"] / analysis["units_sold_30d"].replace(0, np.nan)
analysis["revenue_30d"] = analysis["price_inr"] * analysis["units_sold_30d"]22. Post-Cleaning Dataset Contract
required_columns = [
"listing_id",
"brand",
"model_raw",
"price_inr",
"rating",
"ram_gb",
"storage_gb",
"battery_mah",
"has_5g",
"price_segment",
"units_sold_30d",
]
assert analysis["listing_id"].is_unique
assert set(required_columns).issubset(analysis.columns)
assert analysis["rating"].dropna().between(0, 5).all()
assert (analysis["price_inr"].dropna() >= 0).all()23. Univariate EDA
Numeric summaries:
numeric_cols = [
"price_inr",
"mrp_inr",
"discount_pct",
"rating",
"review_count",
"ram_gb",
"storage_gb",
"battery_mah",
"fast_charging_w",
"screen_size_in",
"refresh_rate_hz",
"primary_rear_mp",
"units_sold_30d",
"return_rate_pct",
]
analysis[numeric_cols].describe(percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]).TCategorical summaries:
for col in ["brand", "marketplace", "price_segment", "os_family", "stock_status_clean", "seller_tier", "channel"]:
display(analysis[col].value_counts(dropna=False))Visuals:
sns.histplot(data=analysis, x="price_inr", bins=20, kde=True)
plt.title("Distribution of Smartphone Prices")
plt.show()sns.boxplot(data=analysis, x="price_segment", y="price_inr")
plt.title("Price Spread by Segment")
plt.xticks(rotation=30)
plt.show()24. Bivariate EDA
Price by brand:
brand_price = (
analysis.groupby("brand", as_index=False)
.agg(median_price=("price_inr", "median"), listings=("listing_id", "count"))
.sort_values("median_price", ascending=False)
)
brand_priceSpecs vs price:
sns.scatterplot(data=analysis, x="ram_gb", y="price_inr", hue="price_segment")
plt.title("RAM vs Price")
plt.show()sns.scatterplot(data=analysis, x="battery_mah", y="price_inr", hue="has_5g")
plt.title("Battery Capacity vs Price")
plt.show()Sales vs discount:
sns.regplot(data=analysis.dropna(subset=["discount_pct", "units_sold_30d"]), x="discount_pct", y="units_sold_30d")
plt.title("Discount vs Units Sold")
plt.show()25. Multivariate EDA
sns.scatterplot(
data=analysis,
x="price_inr",
y="units_sold_30d",
hue="price_segment",
size="rating",
sizes=(30, 220),
)
plt.title("Price, Sales, Segment, and Rating")
plt.show()pivot = analysis.pivot_table(
index="price_segment",
columns="channel",
values="units_sold_30d",
aggfunc="median",
)
sns.heatmap(pivot, annot=True, fmt=".0f", cmap="YlGnBu")
plt.title("Median Units Sold by Price Segment and Channel")
plt.show()26. Correlation and Association
Correlation is not causation, but it helps find relationships worth investigating.
corr_cols = [
"price_inr",
"discount_pct",
"rating",
"review_count",
"ram_gb",
"storage_gb",
"battery_mah",
"fast_charging_w",
"screen_size_in",
"refresh_rate_hz",
"primary_rear_mp",
"units_sold_30d",
"return_rate_pct",
]
corr = analysis[corr_cols].corr(numeric_only=True)
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Numeric Feature Correlation")
plt.show()For categorical association, compare group distributions.
pd.crosstab(analysis["price_segment"], analysis["has_5g"], normalize="index")pd.crosstab(analysis["channel"], analysis["city_tier"], normalize="index")27. Segment Analysis
segment_summary = (
analysis.groupby("price_segment", observed=True)
.agg(
listings=("listing_id", "count"),
median_price=("price_inr", "median"),
median_discount=("discount_pct", "median"),
median_rating=("rating", "median"),
median_units=("units_sold_30d", "median"),
median_return_rate=("return_rate_pct", "median"),
)
.reset_index()
)
segment_summaryBrand-positioning summary:
positioning_summary = (
analysis.groupby("typical_positioning")
.agg(
listings=("listing_id", "count"),
median_price=("price_inr", "median"),
median_units=("units_sold_30d", "median"),
median_return_rate=("return_rate_pct", "median"),
)
.sort_values("median_units", ascending=False)
)
positioning_summary28. Time-Based Checks
analysis["launch_month"] = analysis["launch_date"].dt.to_period("M").astype(str)
monthly_launches = analysis.groupby("launch_month").size()
monthly_launches.plot(kind="bar", title="Launch Month Distribution")
plt.show()Check whether newer phones sell differently.
sns.scatterplot(data=analysis, x="listing_age_days", y="units_sold_30d", hue="price_segment")
plt.title("Listing Age vs Units Sold")
plt.show()29. Robust Statistics
Outliers can distort means, so compare mean and median.
analysis.groupby("price_segment", observed=True)["price_inr"].agg(["count", "mean", "median", "std"])Use trimmed means for a less outlier-sensitive summary.
def trimmed_mean(series, trim=0.1):
s = series.dropna().sort_values()
n = len(s)
k = int(n * trim)
if n == 0 or 2 * k >= n:
return np.nan
return s.iloc[k:n-k].mean()
analysis.groupby("price_segment", observed=True)["price_inr"].apply(trimmed_mean)30. Statistical Checks
EDA is not formal modeling, but lightweight checks can prevent weak claims.
Difference in sales between 5G and non-5G listings:
analysis.groupby("has_5g")["units_sold_30d"].agg(["count", "mean", "median"])Bootstrap a confidence interval for median price.
rng = np.random.default_rng(42)
prices = analysis["price_inr"].dropna().to_numpy()
boot_medians = [
np.median(rng.choice(prices, size=len(prices), replace=True))
for _ in range(2000)
]
np.percentile(boot_medians, [2.5, 50, 97.5])31. Sensitivity Analysis
Check whether decisions change when suspicious rows are excluded.
clean_core = analysis[
(~analysis["price_issue_flag"]) &
(analysis["stock_status_clean"] != "out_of_stock") &
(analysis["price_inr"].notna())
].copy()
comparison = pd.DataFrame(
{
"all_rows": analysis["price_inr"].describe(),
"clean_core": clean_core["price_inr"].describe(),
}
)
comparison32. Data Bias and Limitations
Document limitations in the notebook.
This synthetic dataset has:
- only a snapshot window, not full historical pricing
- marketplace-visible ratings, not verified buyer satisfaction
- synthetic sales metrics, not real revenue
- intentional defects inserted for learning
- no warranty, exchange, financing, or inventory-cost fields
- incomplete information about regional availability
In a real project, ask how listings were sampled and whether missing rows differ from included rows.
33. Final Insight Template
Use a short executive summary.
Example format:
1. Catalog shape:
Most listings are concentrated in midrange and upper-midrange segments.
2. Data quality:
Brand normalization, duplicate IDs, invalid rating values, invalid launch dates,
and text-heavy specs were the biggest cleaning risks.
3. Product patterns:
Higher RAM/storage and 5G support generally appear in higher price segments,
but discount and channel effects need separate analysis before causal claims.
4. Operational risks:
Out-of-stock/preorder rows and suspicious prices should be excluded from sales-efficiency reporting.
5. Next data work:
Add ingestion validation for price, rating, device type, duplicate listing IDs,
launch date parsing, and negative return windows.34. Export Analysis-Ready Data
analysis_ready_cols = [
"listing_id",
"snapshot_date",
"marketplace",
"brand",
"model_raw",
"parent_company",
"country_or_region",
"typical_positioning",
"price_inr",
"mrp_inr",
"discount_pct",
"price_segment",
"rating",
"review_count",
"launch_date",
"listing_age_days",
"has_5g",
"has_nfc",
"has_ir_blaster",
"has_esim",
"has_uwb",
"processor_family",
"processor_core_label",
"processor_clock_ghz",
"ram_gb",
"storage_gb",
"battery_mah",
"fast_charging_w",
"screen_size_in",
"refresh_rate_hz",
"panel_type",
"primary_rear_mp",
"front_camera_mp",
"rear_camera_count",
"has_memory_card",
"os_family",
"seller_tier",
"stock_status_clean",
"return_window_days_num",
"channel",
"city_tier",
"units_sold_30d",
"ad_spend_inr",
"avg_delivery_days",
"return_rate_pct",
"promo_label",
"ad_spend_per_unit",
"revenue_30d",
"price_issue_flag",
]
analysis_ready = analysis[analysis_ready_cols].copy()
analysis_ready.to_csv("smartphone_eda_analysis_ready.csv", index=False)35. Notebook Review Checklist
Before publishing the post, check:
- the problem statement is clear
- raw data is never overwritten
- cleaning decisions are explained before code
- invalid rows are documented before removal
- duplicates are handled by rule, not by guesswork
- missing values are not silently filled with fake certainty
- outliers are investigated before exclusion
- plots have titles and readable labels
- every major insight has a table or visual behind it
- correlation is not described as causation
- limitations are included
- the analysis-ready CSV can be regenerated from raw files
Practice Lab
Practice Extensions
- Add a warranty table and analyze whether longer warranty periods reduce returns.
- Build a small validation report that fails when a new raw file has invalid ratings or duplicate IDs.
- Compare robust regression and ordinary least squares for price estimation.
- Add a Plotly dashboard for segment, brand, and channel filters.
- Convert the cleaning workflow into a reusable Python script.
