What is the primary goal of this EDA case study?

The goal is to practice a complete EDA workflow on a realistic smartphone listing problem, focusing on transforming messy listings into business insights.

Which CSV files are used in this case study?

The CSV files used are smartphoneedalistingsraw.csv, smartphoneedasaleschannels.csv, smartphoneedabrandreference.csv, and smartphoneedadatadictionary.csv.

What are some of the business questions the analysis aims to answer?

The analysis aims to answer which price segments dominate the smartphone catalog, which brands and channels drive units sold, how specs relate to price, if discounts are associated with higher sales, and which listings are suspicious or risky.

What are the non-goals of this EDA case study?

The non-goals include predicting future sales, scraping live ecommerce websites, recommending real investment or purchasing decisions, and treating the synthetic dataset as market truth.

Smartphone Marketplace EDA Case Study:

End-to-End Smartphone Marketplace EDA Case Study: From Messy Listings to Business Insights

This case study uses original synthetic data.

It does not use copied course data, scraped marketplace data, proprietary rows, or any third-party CSV.

The goal is to practice a complete EDA workflow on a realistic smartphone listing problem: raw product pages contain inconsistent brands, mixed currency formats, missing ratings, duplicate listings, non-phone devices, text-heavy specifications, invalid dates, outliers, and a second table with sales-channel metrics.

Files Used In This Case Study

Use these CSV files from the same folder as this post:

smartphone_eda_listings_raw.csv
smartphone_eda_sales_channels.csv
smartphone_eda_brand_reference.csv
smartphone_eda_data_dictionary.csv

Load them like this:

python

import re
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

DATA_DIR = Path(".")

listings_raw = pd.read_csv(DATA_DIR / "smartphone_eda_listings_raw.csv")
sales_raw = pd.read_csv(DATA_DIR / "smartphone_eda_sales_channels.csv")
brand_ref = pd.read_csv(DATA_DIR / "smartphone_eda_brand_reference.csv")
dictionary = pd.read_csv(DATA_DIR / "smartphone_eda_data_dictionary.csv")

Business Questions

Before cleaning anything, define what the analysis should answer.

In this case study, the marketplace team wants to know:

Which price segments dominate the current smartphone catalog?
Which brands and channels are driving units sold?
How do specs such as RAM, storage, 5G, battery, display refresh rate, and camera resolution relate to price?
Are discounts associated with higher sales?
Which listings look suspicious, incomplete, duplicated, invalid, or operationally risky?
What data-quality rules should be added before the next listing ingest?

Non-goals:

predicting future sales
scraping live ecommerce websites
recommending real investment or purchasing decisions
treating the synthetic dataset as market truth

Full EDA Workflow Map

Use this map as the checklist for the notebook.

Problem framing
Data inventory
Data dictionary review
Manual assessment
Programmatic assessment
Quality issue log
Tidiness issue log
Cleaning plan
Type conversion
Category standardization
Text parsing and feature extraction
Duplicate detection
Missing-value analysis
Invalid-value checks
Outlier detection
Domain-rule validation
Feature engineering
Table joins
Post-cleaning tests
Univariate analysis
Bivariate analysis
Multivariate analysis
Segment and cohort analysis
Time-based checks
Correlation and association checks
Distribution diagnostics
Robust statistics
Sensitivity analysis
Bias and data-collection limitations
Final insight report
Export of analysis-ready data

No single project can literally cover every possible EDA method ever invented, but this workflow covers the practical EDA steps expected in production analytics, data-science interviews, and portfolio notebooks.

1. Data Inventory

Start with shapes and a preview.

python

tables = {
    "listings_raw": listings_raw,
    "sales_raw": sales_raw,
    "brand_ref": brand_ref,
    "dictionary": dictionary,
}

for name, df in tables.items():
    print(name, df.shape)
    display(df.head())

Expected observations:

listings_raw is the primary table.
sales_raw is a listing-level performance table.
brand_ref is a reference table used for canonical brand names.
dictionary explains the columns and known intentional quality issues.

2. Manual Assessment

Look at real rows before writing cleaning code.

python

pd.set_option("display.max_columns", None)

listings_raw.sample(10, random_state=42)

Manual assessment should catch issues like:

inconsistent brand spellings such as Samsung, samsung, Moto, Nothng
mixed price formats such as INR 34,999, Rs. 27999, TBD
duplicate listing IDs
non-smartphone rows
invalid ratings
impossible return windows
invalid launch dates
specs stored as dense text instead of numeric columns

Write the issue log before cleaning.

python

issue_log = pd.DataFrame(
    [
        ["brand_raw", "consistency", "Same brand appears in multiple spellings"],
        ["price_text", "validity", "Currency symbols, spaces, commas, and TBD appear"],
        ["rating_text", "validity", "Rating should be between 0 and 5"],
        ["launch_date_text", "validity", "Multiple formats and one invalid date"],
        ["device_type_raw", "validity", "Dataset includes non-smartphone devices"],
        ["listing_id", "uniqueness", "One listing ID appears more than once"],
        ["return_window_days", "validity", "Return window should not be negative"],
        ["spec columns", "tidiness", "Processor, battery, display, memory, and camera contain multiple variables"],
    ],
    columns=["column", "issue_type", "description"],
)

issue_log

3. Programmatic Assessment

Use code to quantify what manual review suggested.

python

listings_raw.info()

python

listings_raw.describe(include="all").T

python

missing_summary = (
    listings_raw.isna().sum()
    .rename("missing_count")
    .to_frame()
    .assign(missing_pct=lambda x: x["missing_count"] / len(listings_raw) * 100)
    .sort_values("missing_pct", ascending=False)
)

missing_summary

python

listings_raw.duplicated().sum()

python

listings_raw["listing_id"].duplicated().sum()

python

for col in ["device_type_raw", "brand_raw", "marketplace", "stock_status", "seller_tier"]:
    display(listings_raw[col].value_counts(dropna=False))

4. Cleaning Helpers

Create small reusable helpers instead of repeating regex logic.

python

def parse_money(value):
    if pd.isna(value):
        return np.nan
    text = str(value).strip().lower()
    if text in {"", "tbd", "na", "n/a", "not available"}:
        return np.nan
    digits = re.sub(r"[^0-9.]", "", text)
    return float(digits) if digits else np.nan

def parse_number(value):
    if pd.isna(value):
        return np.nan
    text = str(value).strip()
    text = text.replace(",", "")
    match = re.search(r"-?[0-9]+(?:\.[0-9]+)?", text)
    return float(match.group()) if match else np.nan

def contains_token(series, pattern):
    return series.fillna("").str.contains(pattern, case=False, regex=True)

5. Define-Code-Test Cleaning Workflow

Copy raw data first.

python

listings = listings_raw.copy()
sales = sales_raw.copy()

Normalize obvious string columns.

python

text_cols = listings.select_dtypes(include="object").columns
for col in text_cols:
    listings[col] = listings[col].str.strip()

Parse dates.

python

listings["snapshot_date"] = pd.to_datetime(listings["snapshot_date"], errors="coerce")
listings["launch_date"] = pd.to_datetime(listings["launch_date_text"], errors="coerce", dayfirst=False)

listings.loc[listings["launch_date"].isna(), ["listing_id", "launch_date_text"]]

Parse numeric fields.

python

listings["price_inr"] = listings["price_text"].apply(parse_money)
listings["mrp_inr"] = listings["mrp_text"].apply(parse_money)
listings["rating"] = listings["rating_text"].apply(parse_number)
listings["review_count"] = listings["review_count_text"].apply(parse_number)
listings["return_window_days_num"] = listings["return_window_days"].apply(parse_number)

Test the conversions.

python

assert listings["price_inr"].dropna().ge(0).all()
assert listings["mrp_inr"].dropna().ge(0).all()

6. Canonical Brand Mapping

Use the reference table rather than manual replacement chains.

python

listings = listings.merge(
    brand_ref,
    left_on="brand_raw",
    right_on="raw_brand",
    how="left",
    validate="many_to_one",
)

listings["brand"] = listings["canonical_brand"].fillna(listings["brand_raw"].str.title())

listings.loc[listings["canonical_brand"].isna(), ["brand_raw", "model_raw"]]

7. Device-Type Cleaning

The raw table intentionally contains non-phone rows.

python

listings["device_type"] = (
    listings["device_type_raw"]
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

smartphone_mask = listings["device_type"].isin(
    ["smartphone", "smart phone", "phone", "foldable phone"]
)

non_phone_rows = listings.loc[~smartphone_mask, ["listing_id", "device_type_raw", "brand_raw", "model_raw"]]
non_phone_rows

For smartphone EDA, remove non-phone rows after documenting them.

python

listings = listings.loc[smartphone_mask].copy()

8. Duplicate Handling

Find exact duplicates and key duplicates separately.

python

exact_duplicate_count = listings.duplicated().sum()
id_duplicate_rows = listings[listings["listing_id"].duplicated(keep=False)].sort_values("listing_id")

exact_duplicate_count, id_duplicate_rows[["listing_id", "brand", "model_raw", "price_inr"]]

Deduplicate by keeping the first exact listing record.

python

listings = listings.drop_duplicates()
listings = listings.sort_values(["listing_id", "snapshot_date"]).drop_duplicates("listing_id", keep="last")

9. Connectivity Feature Extraction

python

listings["has_5g"] = contains_token(listings["sim_features"], r"\b5g\b")
listings["has_nfc"] = contains_token(listings["sim_features"], r"\bnfc\b")
listings["has_ir_blaster"] = contains_token(listings["sim_features"], r"\bir\b|ir blaster")
listings["has_esim"] = contains_token(listings["sim_features"], r"\besim\b")
listings["has_uwb"] = contains_token(listings["sim_features"], r"\buwb\b")

10. Processor Parsing

Extract chipset, core label, and clock speed.

python

listings["processor_clean"] = listings["processor_text"].str.replace("|", ",", regex=False)

listings["processor_clock_ghz"] = (
    listings["processor_clean"]
    .str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*ghz", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["processor_core_label"] = (
    listings["processor_clean"]
    .str.extract(r"((?:hexa|octa|quad|8)\s*(?:core|cores|core processor)?)", flags=re.IGNORECASE)[0]
    .str.lower()
    .str.replace("8 cores", "octa core")
    .str.replace("8 core processor", "octa core")
)

listings["processor_family"] = (
    listings["processor_clean"]
    .str.extract(r"^(.*?)(?:,|\||$)")[0]
    .str.strip()
)

11. Memory and Storage Parsing

python

listings["ram_gb"] = (
    listings["memory_text"]
    .str.extract(r"([0-9]+)\s*gb\s*ram", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["storage_gb"] = (
    listings["memory_text"]
    .str.extract(r"([0-9]+)\s*gb\s*(?:rom|storage)", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["storage_gb"] = listings["storage_gb"].fillna(
    listings["memory_text"].str.extract(r"([0-9]+)\s*gb$", flags=re.IGNORECASE)[0].astype(float)
)

12. Battery and Charging Parsing

python

listings["battery_mah"] = (
    listings["battery_text"]
    .str.extract(r"([0-9]{4})\s*mah", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["fast_charging_w"] = (
    listings["battery_text"]
    .str.extract(r"([0-9]{2,3})\s*w", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["charger_not_bundled"] = contains_token(listings["battery_text"], r"not bundled|no charger")

13. Display Parsing

python

listings["screen_size_in"] = (
    listings["display_text"]
    .str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*(?:inch|in)", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["refresh_rate_hz"] = (
    listings["display_text"]
    .str.extract(r"([0-9]{2,3})\s*hz", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["panel_type"] = np.select(
    [
        contains_token(listings["display_text"], "ltpo"),
        contains_token(listings["display_text"], "amoled"),
        contains_token(listings["display_text"], "oled"),
        contains_token(listings["display_text"], "lcd"),
    ],
    ["LTPO OLED", "AMOLED", "OLED", "LCD"],
    default="Unknown",
)

14. Camera Parsing

python

rear_text = listings["camera_text"].str.split("&| and ", regex=True).str[0]
front_text = listings["camera_text"].str.extract(r"(?:&|and)\s*(.*front.*|.*selfie.*)", flags=re.IGNORECASE)[0]
rear_text_clean = rear_text.fillna("").str.lower()

listings["primary_rear_mp"] = rear_text.str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*mp", flags=re.IGNORECASE)[0].astype(float)
listings["front_camera_mp"] = front_text.str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*mp", flags=re.IGNORECASE)[0].astype(float)
listings["rear_camera_count"] = rear_text_clean.str.count(r"\bmp\b")

15. Card Slot and OS Normalization

python

listings["has_memory_card"] = contains_token(listings["card_slot"], r"supported|microsd|hybrid|dedicated")
listings["hybrid_card_slot"] = contains_token(listings["card_slot"], r"hybrid")

listings["max_card_tb"] = (
    listings["card_slot"]
    .str.extract(r"upto\s*([0-9]+)\s*tb", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["os_family"] = np.select(
    [
        contains_token(listings["os_text"], "ios"),
        contains_token(listings["os_text"], "harmony|emui"),
        contains_token(listings["os_text"], "android|oxygen|coloros|funtouch|hyperos|realme|magic|xos|hios|rog|nothing|one ui"),
    ],
    ["iOS", "HarmonyOS/EMUI", "Android"],
    default="Other/Unknown",
)

16. Stock Status and Return Policy

python

listings["stock_status_clean"] = np.select(
    [
        contains_token(listings["stock_status"], r"out"),
        contains_token(listings["stock_status"], r"preorder"),
        contains_token(listings["stock_status"], r"limited"),
        contains_token(listings["stock_status"], r"in stock|available"),
    ],
    ["out_of_stock", "preorder", "limited", "available"],
    default="unknown",
)

invalid_return_windows = listings[listings["return_window_days_num"] < 0]
invalid_return_windows[["listing_id", "return_window_days"]]

Set invalid return windows to missing.

python

listings.loc[listings["return_window_days_num"] < 0, "return_window_days_num"] = np.nan

17. Domain Validation

Create validation checks after cleaning.

python

validation_report = {
    "negative_price_rows": int((listings["price_inr"] < 0).sum()),
    "price_above_mrp_rows": int((listings["price_inr"] > listings["mrp_inr"]).sum()),
    "invalid_rating_rows": int(((listings["rating"] < 0) | (listings["rating"] > 5)).sum()),
    "missing_brand_rows": int(listings["brand"].isna().sum()),
    "missing_price_rows": int(listings["price_inr"].isna().sum()),
    "missing_launch_date_rows": int(listings["launch_date"].isna().sum()),
    "duplicate_listing_ids": int(listings["listing_id"].duplicated().sum()),
}

validation_report

Handle invalid ratings.

python

listings.loc[(listings["rating"] < 0) | (listings["rating"] > 5), "rating"] = np.nan

18. Outlier Detection

Use multiple methods; do not blindly delete rows.

python

price_q1 = listings["price_inr"].quantile(0.25)
price_q3 = listings["price_inr"].quantile(0.75)
price_iqr = price_q3 - price_q1

lower_fence = price_q1 - 1.5 * price_iqr
upper_fence = price_q3 + 1.5 * price_iqr

price_outliers = listings[
    (listings["price_inr"] < lower_fence) |
    (listings["price_inr"] > upper_fence)
][["listing_id", "brand", "model_raw", "price_inr", "mrp_inr"]]

price_outliers

For this dataset, investigate the extreme low price row.

python

listings["price_issue_flag"] = listings["price_inr"].lt(3000) | listings["price_inr"].isna()

19. Missing-Value Strategy

Separate missing because unavailable from missing because invalid.

python

missing_after_cleaning = (
    listings[
        [
            "price_inr",
            "rating",
            "review_count",
            "launch_date",
            "ram_gb",
            "storage_gb",
            "battery_mah",
            "refresh_rate_hz",
            "front_camera_mp",
        ]
    ]
    .isna()
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)

missing_after_cleaning

Practical strategy:

keep missing price_inr as missing for price analysis
keep missing rating as missing, not zero
impute missing refresh_rate_hz to 60 only if listing copy clearly implies a basic display
do not impute launch dates unless there is a trusted catalog source
use missingness flags when missingness may carry business meaning

python

listings["rating_missing"] = listings["rating"].isna()
listings["launch_date_missing"] = listings["launch_date"].isna()

20. Feature Engineering

python

listings["discount_pct"] = (1 - listings["price_inr"] / listings["mrp_inr"]) * 100
listings["listing_age_days"] = (listings["snapshot_date"] - listings["launch_date"]).dt.days

listings["price_segment"] = pd.cut(
    listings["price_inr"],
    bins=[0, 15000, 30000, 50000, 80000, np.inf],
    labels=["budget", "midrange", "upper_midrange", "premium", "ultra_premium"],
)

listings["storage_per_1000_inr"] = listings["storage_gb"] / (listings["price_inr"] / 1000)
listings["battery_per_1000_inr"] = listings["battery_mah"] / (listings["price_inr"] / 1000)

21. Join Sales Channel Data

Clean numeric sales fields first.

python

sales["units_sold_30d"] = sales["units_sold_30d"].apply(parse_number)
sales["ad_spend_inr"] = sales["ad_spend_inr"].apply(parse_number)
sales["avg_delivery_days"] = sales["avg_delivery_days"].apply(parse_number)
sales["return_rate_pct"] = sales["return_rate_pct"].apply(parse_number)

Join to the cleaned listing table.

python

analysis = listings.merge(
    sales,
    on="listing_id",
    how="left",
    validate="one_to_one",
)

analysis.shape

Create sales efficiency metrics.

python

analysis["ad_spend_per_unit"] = analysis["ad_spend_inr"] / analysis["units_sold_30d"].replace(0, np.nan)
analysis["revenue_30d"] = analysis["price_inr"] * analysis["units_sold_30d"]

22. Post-Cleaning Dataset Contract

python

required_columns = [
    "listing_id",
    "brand",
    "model_raw",
    "price_inr",
    "rating",
    "ram_gb",
    "storage_gb",
    "battery_mah",
    "has_5g",
    "price_segment",
    "units_sold_30d",
]

assert analysis["listing_id"].is_unique
assert set(required_columns).issubset(analysis.columns)
assert analysis["rating"].dropna().between(0, 5).all()
assert (analysis["price_inr"].dropna() >= 0).all()

23. Univariate EDA

Numeric summaries:

python

numeric_cols = [
    "price_inr",
    "mrp_inr",
    "discount_pct",
    "rating",
    "review_count",
    "ram_gb",
    "storage_gb",
    "battery_mah",
    "fast_charging_w",
    "screen_size_in",
    "refresh_rate_hz",
    "primary_rear_mp",
    "units_sold_30d",
    "return_rate_pct",
]

analysis[numeric_cols].describe(percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]).T

Categorical summaries:

python

for col in ["brand", "marketplace", "price_segment", "os_family", "stock_status_clean", "seller_tier", "channel"]:
    display(analysis[col].value_counts(dropna=False))

Visuals:

python

sns.histplot(data=analysis, x="price_inr", bins=20, kde=True)
plt.title("Distribution of Smartphone Prices")
plt.show()

python

sns.boxplot(data=analysis, x="price_segment", y="price_inr")
plt.title("Price Spread by Segment")
plt.xticks(rotation=30)
plt.show()

24. Bivariate EDA

Price by brand:

python

brand_price = (
    analysis.groupby("brand", as_index=False)
    .agg(median_price=("price_inr", "median"), listings=("listing_id", "count"))
    .sort_values("median_price", ascending=False)
)

brand_price

Specs vs price:

python

sns.scatterplot(data=analysis, x="ram_gb", y="price_inr", hue="price_segment")
plt.title("RAM vs Price")
plt.show()

python

sns.scatterplot(data=analysis, x="battery_mah", y="price_inr", hue="has_5g")
plt.title("Battery Capacity vs Price")
plt.show()

Sales vs discount:

python

sns.regplot(data=analysis.dropna(subset=["discount_pct", "units_sold_30d"]), x="discount_pct", y="units_sold_30d")
plt.title("Discount vs Units Sold")
plt.show()

25. Multivariate EDA

python

sns.scatterplot(
    data=analysis,
    x="price_inr",
    y="units_sold_30d",
    hue="price_segment",
    size="rating",
    sizes=(30, 220),
)
plt.title("Price, Sales, Segment, and Rating")
plt.show()

python

pivot = analysis.pivot_table(
    index="price_segment",
    columns="channel",
    values="units_sold_30d",
    aggfunc="median",
)

sns.heatmap(pivot, annot=True, fmt=".0f", cmap="YlGnBu")
plt.title("Median Units Sold by Price Segment and Channel")
plt.show()

26. Correlation and Association

Correlation is not causation, but it helps find relationships worth investigating.

python

corr_cols = [
    "price_inr",
    "discount_pct",
    "rating",
    "review_count",
    "ram_gb",
    "storage_gb",
    "battery_mah",
    "fast_charging_w",
    "screen_size_in",
    "refresh_rate_hz",
    "primary_rear_mp",
    "units_sold_30d",
    "return_rate_pct",
]

corr = analysis[corr_cols].corr(numeric_only=True)
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Numeric Feature Correlation")
plt.show()

For categorical association, compare group distributions.

python

pd.crosstab(analysis["price_segment"], analysis["has_5g"], normalize="index")

python

pd.crosstab(analysis["channel"], analysis["city_tier"], normalize="index")

27. Segment Analysis

python

segment_summary = (
    analysis.groupby("price_segment", observed=True)
    .agg(
        listings=("listing_id", "count"),
        median_price=("price_inr", "median"),
        median_discount=("discount_pct", "median"),
        median_rating=("rating", "median"),
        median_units=("units_sold_30d", "median"),
        median_return_rate=("return_rate_pct", "median"),
    )
    .reset_index()
)

segment_summary

Brand-positioning summary:

python

positioning_summary = (
    analysis.groupby("typical_positioning")
    .agg(
        listings=("listing_id", "count"),
        median_price=("price_inr", "median"),
        median_units=("units_sold_30d", "median"),
        median_return_rate=("return_rate_pct", "median"),
    )
    .sort_values("median_units", ascending=False)
)

positioning_summary

28. Time-Based Checks

python

analysis["launch_month"] = analysis["launch_date"].dt.to_period("M").astype(str)

monthly_launches = analysis.groupby("launch_month").size()
monthly_launches.plot(kind="bar", title="Launch Month Distribution")
plt.show()

Check whether newer phones sell differently.

python

sns.scatterplot(data=analysis, x="listing_age_days", y="units_sold_30d", hue="price_segment")
plt.title("Listing Age vs Units Sold")
plt.show()

29. Robust Statistics

Outliers can distort means, so compare mean and median.

python

analysis.groupby("price_segment", observed=True)["price_inr"].agg(["count", "mean", "median", "std"])

Use trimmed means for a less outlier-sensitive summary.

python

def trimmed_mean(series, trim=0.1):
    s = series.dropna().sort_values()
    n = len(s)
    k = int(n * trim)
    if n == 0 or 2 * k >= n:
        return np.nan
    return s.iloc[k:n-k].mean()

analysis.groupby("price_segment", observed=True)["price_inr"].apply(trimmed_mean)

30. Statistical Checks

EDA is not formal modeling, but lightweight checks can prevent weak claims.

Difference in sales between 5G and non-5G listings:

python

analysis.groupby("has_5g")["units_sold_30d"].agg(["count", "mean", "median"])

Bootstrap a confidence interval for median price.

python

rng = np.random.default_rng(42)
prices = analysis["price_inr"].dropna().to_numpy()

boot_medians = [
    np.median(rng.choice(prices, size=len(prices), replace=True))
    for _ in range(2000)
]

np.percentile(boot_medians, [2.5, 50, 97.5])

31. Sensitivity Analysis

Check whether decisions change when suspicious rows are excluded.

python

clean_core = analysis[
    (~analysis["price_issue_flag"]) &
    (analysis["stock_status_clean"] != "out_of_stock") &
    (analysis["price_inr"].notna())
].copy()

comparison = pd.DataFrame(
    {
        "all_rows": analysis["price_inr"].describe(),
        "clean_core": clean_core["price_inr"].describe(),
    }
)

comparison

32. Data Bias and Limitations

Document limitations in the notebook.

This synthetic dataset has:

only a snapshot window, not full historical pricing
marketplace-visible ratings, not verified buyer satisfaction
synthetic sales metrics, not real revenue
intentional defects inserted for learning
no warranty, exchange, financing, or inventory-cost fields
incomplete information about regional availability

In a real project, ask how listings were sampled and whether missing rows differ from included rows.

33. Final Insight Template

Use a short executive summary.

Example format:

text

1. Catalog shape:
   Most listings are concentrated in midrange and upper-midrange segments.

2. Data quality:
   Brand normalization, duplicate IDs, invalid rating values, invalid launch dates,
   and text-heavy specs were the biggest cleaning risks.

3. Product patterns:
   Higher RAM/storage and 5G support generally appear in higher price segments,
   but discount and channel effects need separate analysis before causal claims.

4. Operational risks:
   Out-of-stock/preorder rows and suspicious prices should be excluded from sales-efficiency reporting.

5. Next data work:
   Add ingestion validation for price, rating, device type, duplicate listing IDs,
   launch date parsing, and negative return windows.

34. Export Analysis-Ready Data

python

analysis_ready_cols = [
    "listing_id",
    "snapshot_date",
    "marketplace",
    "brand",
    "model_raw",
    "parent_company",
    "country_or_region",
    "typical_positioning",
    "price_inr",
    "mrp_inr",
    "discount_pct",
    "price_segment",
    "rating",
    "review_count",
    "launch_date",
    "listing_age_days",
    "has_5g",
    "has_nfc",
    "has_ir_blaster",
    "has_esim",
    "has_uwb",
    "processor_family",
    "processor_core_label",
    "processor_clock_ghz",
    "ram_gb",
    "storage_gb",
    "battery_mah",
    "fast_charging_w",
    "screen_size_in",
    "refresh_rate_hz",
    "panel_type",
    "primary_rear_mp",
    "front_camera_mp",
    "rear_camera_count",
    "has_memory_card",
    "os_family",
    "seller_tier",
    "stock_status_clean",
    "return_window_days_num",
    "channel",
    "city_tier",
    "units_sold_30d",
    "ad_spend_inr",
    "avg_delivery_days",
    "return_rate_pct",
    "promo_label",
    "ad_spend_per_unit",
    "revenue_30d",
    "price_issue_flag",
]

analysis_ready = analysis[analysis_ready_cols].copy()
analysis_ready.to_csv("smartphone_eda_analysis_ready.csv", index=False)

35. Notebook Review Checklist

Before publishing the post, check:

the problem statement is clear
raw data is never overwritten
cleaning decisions are explained before code
invalid rows are documented before removal
duplicates are handled by rule, not by guesswork
missing values are not silently filled with fake certainty
outliers are investigated before exclusion
plots have titles and readable labels
every major insight has a table or visual behind it
correlation is not described as causation
limitations are included
the analysis-ready CSV can be regenerated from raw files

Practice Lab

Practice Extensions

Add a warranty table and analyze whether longer warranty periods reduce returns.
Build a small validation report that fails when a new raw file has invalid ratings or duplicate IDs.
Compare robust regression and ordinary least squares for price estimation.
Add a Plotly dashboard for segment, brand, and channel filters.
Convert the cleaning workflow into a reusable Python script.

Smartphone Marketplace EDA Case Study: Insights & Analysis

AI Insights

Practice Extensions

Frequently Asked Questions

Related Work

FastAPI Browser-Workflow Automation

AI Visibility & SEO Auditing Platform

RAG-Powered Learning Platform with AI Assistant