Smartphone Marketplace EDA Case Study: Insights & Analysis

Jul 4, 2026
75 min read

AI Insights

Powered by GPT-4o-mini

Verified Context: smartphone-marketplace-eda-case-study-insights-analysis
Quick Answer

A copyright-safe, original smartphone marketplace EDA case study covering data assessment, cleaning, validation, feature extraction, missing values, outliers, joins, univariate, bivariate, multivariate, statistical, visual, and reporting workflows.

Quick Summary

Explore an in-depth EDA case study on smartphone listings, uncovering insights from messy data to actionable business intelligence.

End-to-End Smartphone Marketplace EDA Case Study: From Messy Listings to Business Insights

This case study uses original synthetic data.

It does not use copied course data, scraped marketplace data, proprietary rows, or any third-party CSV.

The goal is to practice a complete EDA workflow on a realistic smartphone listing problem: raw product pages contain inconsistent brands, mixed currency formats, missing ratings, duplicate listings, non-phone devices, text-heavy specifications, invalid dates, outliers, and a second table with sales-channel metrics.

Files Used In This Case Study

Use these CSV files from the same folder as this post:

  • smartphone_eda_listings_raw.csv
  • smartphone_eda_sales_channels.csv
  • smartphone_eda_brand_reference.csv
  • smartphone_eda_data_dictionary.csv

Load them like this:

python
import re
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

DATA_DIR = Path(".")

listings_raw = pd.read_csv(DATA_DIR / "smartphone_eda_listings_raw.csv")
sales_raw = pd.read_csv(DATA_DIR / "smartphone_eda_sales_channels.csv")
brand_ref = pd.read_csv(DATA_DIR / "smartphone_eda_brand_reference.csv")
dictionary = pd.read_csv(DATA_DIR / "smartphone_eda_data_dictionary.csv")

Business Questions

Before cleaning anything, define what the analysis should answer.

In this case study, the marketplace team wants to know:

  1. Which price segments dominate the current smartphone catalog?
  2. Which brands and channels are driving units sold?
  3. How do specs such as RAM, storage, 5G, battery, display refresh rate, and camera resolution relate to price?
  4. Are discounts associated with higher sales?
  5. Which listings look suspicious, incomplete, duplicated, invalid, or operationally risky?
  6. What data-quality rules should be added before the next listing ingest?

Non-goals:

  • predicting future sales
  • scraping live ecommerce websites
  • recommending real investment or purchasing decisions
  • treating the synthetic dataset as market truth

Full EDA Workflow Map

Use this map as the checklist for the notebook.

  1. Problem framing
  2. Data inventory
  3. Data dictionary review
  4. Manual assessment
  5. Programmatic assessment
  6. Quality issue log
  7. Tidiness issue log
  8. Cleaning plan
  9. Type conversion
  10. Category standardization
  11. Text parsing and feature extraction
  12. Duplicate detection
  13. Missing-value analysis
  14. Invalid-value checks
  15. Outlier detection
  16. Domain-rule validation
  17. Feature engineering
  18. Table joins
  19. Post-cleaning tests
  20. Univariate analysis
  21. Bivariate analysis
  22. Multivariate analysis
  23. Segment and cohort analysis
  24. Time-based checks
  25. Correlation and association checks
  26. Distribution diagnostics
  27. Robust statistics
  28. Sensitivity analysis
  29. Bias and data-collection limitations
  30. Final insight report
  31. Export of analysis-ready data

No single project can literally cover every possible EDA method ever invented, but this workflow covers the practical EDA steps expected in production analytics, data-science interviews, and portfolio notebooks.

1. Data Inventory

Start with shapes and a preview.

python
tables = {
    "listings_raw": listings_raw,
    "sales_raw": sales_raw,
    "brand_ref": brand_ref,
    "dictionary": dictionary,
}

for name, df in tables.items():
    print(name, df.shape)
    display(df.head())

Expected observations:

  • listings_raw is the primary table.
  • sales_raw is a listing-level performance table.
  • brand_ref is a reference table used for canonical brand names.
  • dictionary explains the columns and known intentional quality issues.

2. Manual Assessment

Look at real rows before writing cleaning code.

python
pd.set_option("display.max_columns", None)

listings_raw.sample(10, random_state=42)

Manual assessment should catch issues like:

  • inconsistent brand spellings such as Samsung, samsung, Moto, Nothng
  • mixed price formats such as INR 34,999, Rs. 27999, TBD
  • duplicate listing IDs
  • non-smartphone rows
  • invalid ratings
  • impossible return windows
  • invalid launch dates
  • specs stored as dense text instead of numeric columns

Write the issue log before cleaning.

python
issue_log = pd.DataFrame(
    [
        ["brand_raw", "consistency", "Same brand appears in multiple spellings"],
        ["price_text", "validity", "Currency symbols, spaces, commas, and TBD appear"],
        ["rating_text", "validity", "Rating should be between 0 and 5"],
        ["launch_date_text", "validity", "Multiple formats and one invalid date"],
        ["device_type_raw", "validity", "Dataset includes non-smartphone devices"],
        ["listing_id", "uniqueness", "One listing ID appears more than once"],
        ["return_window_days", "validity", "Return window should not be negative"],
        ["spec columns", "tidiness", "Processor, battery, display, memory, and camera contain multiple variables"],
    ],
    columns=["column", "issue_type", "description"],
)

issue_log

3. Programmatic Assessment

Use code to quantify what manual review suggested.

python
listings_raw.info()
python
listings_raw.describe(include="all").T
python
missing_summary = (
    listings_raw.isna().sum()
    .rename("missing_count")
    .to_frame()
    .assign(missing_pct=lambda x: x["missing_count"] / len(listings_raw) * 100)
    .sort_values("missing_pct", ascending=False)
)

missing_summary
python
listings_raw.duplicated().sum()
python
listings_raw["listing_id"].duplicated().sum()
python
for col in ["device_type_raw", "brand_raw", "marketplace", "stock_status", "seller_tier"]:
    display(listings_raw[col].value_counts(dropna=False))

4. Cleaning Helpers

Create small reusable helpers instead of repeating regex logic.

python
def parse_money(value):
    if pd.isna(value):
        return np.nan
    text = str(value).strip().lower()
    if text in {"", "tbd", "na", "n/a", "not available"}:
        return np.nan
    digits = re.sub(r"[^0-9.]", "", text)
    return float(digits) if digits else np.nan

def parse_number(value):
    if pd.isna(value):
        return np.nan
    text = str(value).strip()
    text = text.replace(",", "")
    match = re.search(r"-?[0-9]+(?:\.[0-9]+)?", text)
    return float(match.group()) if match else np.nan

def contains_token(series, pattern):
    return series.fillna("").str.contains(pattern, case=False, regex=True)

5. Define-Code-Test Cleaning Workflow

Copy raw data first.

python
listings = listings_raw.copy()
sales = sales_raw.copy()

Normalize obvious string columns.

python
text_cols = listings.select_dtypes(include="object").columns
for col in text_cols:
    listings[col] = listings[col].str.strip()

Parse dates.

python
listings["snapshot_date"] = pd.to_datetime(listings["snapshot_date"], errors="coerce")
listings["launch_date"] = pd.to_datetime(listings["launch_date_text"], errors="coerce", dayfirst=False)

listings.loc[listings["launch_date"].isna(), ["listing_id", "launch_date_text"]]

Parse numeric fields.

python
listings["price_inr"] = listings["price_text"].apply(parse_money)
listings["mrp_inr"] = listings["mrp_text"].apply(parse_money)
listings["rating"] = listings["rating_text"].apply(parse_number)
listings["review_count"] = listings["review_count_text"].apply(parse_number)
listings["return_window_days_num"] = listings["return_window_days"].apply(parse_number)

Test the conversions.

python
assert listings["price_inr"].dropna().ge(0).all()
assert listings["mrp_inr"].dropna().ge(0).all()

6. Canonical Brand Mapping

Use the reference table rather than manual replacement chains.

python
listings = listings.merge(
    brand_ref,
    left_on="brand_raw",
    right_on="raw_brand",
    how="left",
    validate="many_to_one",
)

listings["brand"] = listings["canonical_brand"].fillna(listings["brand_raw"].str.title())

listings.loc[listings["canonical_brand"].isna(), ["brand_raw", "model_raw"]]

7. Device-Type Cleaning

The raw table intentionally contains non-phone rows.

python
listings["device_type"] = (
    listings["device_type_raw"]
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

smartphone_mask = listings["device_type"].isin(
    ["smartphone", "smart phone", "phone", "foldable phone"]
)

non_phone_rows = listings.loc[~smartphone_mask, ["listing_id", "device_type_raw", "brand_raw", "model_raw"]]
non_phone_rows

For smartphone EDA, remove non-phone rows after documenting them.

python
listings = listings.loc[smartphone_mask].copy()

8. Duplicate Handling

Find exact duplicates and key duplicates separately.

python
exact_duplicate_count = listings.duplicated().sum()
id_duplicate_rows = listings[listings["listing_id"].duplicated(keep=False)].sort_values("listing_id")

exact_duplicate_count, id_duplicate_rows[["listing_id", "brand", "model_raw", "price_inr"]]

Deduplicate by keeping the first exact listing record.

python
listings = listings.drop_duplicates()
listings = listings.sort_values(["listing_id", "snapshot_date"]).drop_duplicates("listing_id", keep="last")

9. Connectivity Feature Extraction

python
listings["has_5g"] = contains_token(listings["sim_features"], r"\b5g\b")
listings["has_nfc"] = contains_token(listings["sim_features"], r"\bnfc\b")
listings["has_ir_blaster"] = contains_token(listings["sim_features"], r"\bir\b|ir blaster")
listings["has_esim"] = contains_token(listings["sim_features"], r"\besim\b")
listings["has_uwb"] = contains_token(listings["sim_features"], r"\buwb\b")

10. Processor Parsing

Extract chipset, core label, and clock speed.

python
listings["processor_clean"] = listings["processor_text"].str.replace("|", ",", regex=False)

listings["processor_clock_ghz"] = (
    listings["processor_clean"]
    .str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*ghz", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["processor_core_label"] = (
    listings["processor_clean"]
    .str.extract(r"((?:hexa|octa|quad|8)\s*(?:core|cores|core processor)?)", flags=re.IGNORECASE)[0]
    .str.lower()
    .str.replace("8 cores", "octa core")
    .str.replace("8 core processor", "octa core")
)

listings["processor_family"] = (
    listings["processor_clean"]
    .str.extract(r"^(.*?)(?:,|\||$)")[0]
    .str.strip()
)

11. Memory and Storage Parsing

python
listings["ram_gb"] = (
    listings["memory_text"]
    .str.extract(r"([0-9]+)\s*gb\s*ram", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["storage_gb"] = (
    listings["memory_text"]
    .str.extract(r"([0-9]+)\s*gb\s*(?:rom|storage)", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["storage_gb"] = listings["storage_gb"].fillna(
    listings["memory_text"].str.extract(r"([0-9]+)\s*gb$", flags=re.IGNORECASE)[0].astype(float)
)

12. Battery and Charging Parsing

python
listings["battery_mah"] = (
    listings["battery_text"]
    .str.extract(r"([0-9]{4})\s*mah", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["fast_charging_w"] = (
    listings["battery_text"]
    .str.extract(r"([0-9]{2,3})\s*w", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["charger_not_bundled"] = contains_token(listings["battery_text"], r"not bundled|no charger")

13. Display Parsing

python
listings["screen_size_in"] = (
    listings["display_text"]
    .str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*(?:inch|in)", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["refresh_rate_hz"] = (
    listings["display_text"]
    .str.extract(r"([0-9]{2,3})\s*hz", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["panel_type"] = np.select(
    [
        contains_token(listings["display_text"], "ltpo"),
        contains_token(listings["display_text"], "amoled"),
        contains_token(listings["display_text"], "oled"),
        contains_token(listings["display_text"], "lcd"),
    ],
    ["LTPO OLED", "AMOLED", "OLED", "LCD"],
    default="Unknown",
)

14. Camera Parsing

python
rear_text = listings["camera_text"].str.split("&| and ", regex=True).str[0]
front_text = listings["camera_text"].str.extract(r"(?:&|and)\s*(.*front.*|.*selfie.*)", flags=re.IGNORECASE)[0]
rear_text_clean = rear_text.fillna("").str.lower()

listings["primary_rear_mp"] = rear_text.str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*mp", flags=re.IGNORECASE)[0].astype(float)
listings["front_camera_mp"] = front_text.str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*mp", flags=re.IGNORECASE)[0].astype(float)
listings["rear_camera_count"] = rear_text_clean.str.count(r"\bmp\b")

15. Card Slot and OS Normalization

python
listings["has_memory_card"] = contains_token(listings["card_slot"], r"supported|microsd|hybrid|dedicated")
listings["hybrid_card_slot"] = contains_token(listings["card_slot"], r"hybrid")

listings["max_card_tb"] = (
    listings["card_slot"]
    .str.extract(r"upto\s*([0-9]+)\s*tb", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["os_family"] = np.select(
    [
        contains_token(listings["os_text"], "ios"),
        contains_token(listings["os_text"], "harmony|emui"),
        contains_token(listings["os_text"], "android|oxygen|coloros|funtouch|hyperos|realme|magic|xos|hios|rog|nothing|one ui"),
    ],
    ["iOS", "HarmonyOS/EMUI", "Android"],
    default="Other/Unknown",
)

16. Stock Status and Return Policy

python
listings["stock_status_clean"] = np.select(
    [
        contains_token(listings["stock_status"], r"out"),
        contains_token(listings["stock_status"], r"preorder"),
        contains_token(listings["stock_status"], r"limited"),
        contains_token(listings["stock_status"], r"in stock|available"),
    ],
    ["out_of_stock", "preorder", "limited", "available"],
    default="unknown",
)

invalid_return_windows = listings[listings["return_window_days_num"] < 0]
invalid_return_windows[["listing_id", "return_window_days"]]

Set invalid return windows to missing.

python
listings.loc[listings["return_window_days_num"] < 0, "return_window_days_num"] = np.nan

17. Domain Validation

Create validation checks after cleaning.

python
validation_report = {
    "negative_price_rows": int((listings["price_inr"] < 0).sum()),
    "price_above_mrp_rows": int((listings["price_inr"] > listings["mrp_inr"]).sum()),
    "invalid_rating_rows": int(((listings["rating"] < 0) | (listings["rating"] > 5)).sum()),
    "missing_brand_rows": int(listings["brand"].isna().sum()),
    "missing_price_rows": int(listings["price_inr"].isna().sum()),
    "missing_launch_date_rows": int(listings["launch_date"].isna().sum()),
    "duplicate_listing_ids": int(listings["listing_id"].duplicated().sum()),
}

validation_report

Handle invalid ratings.

python
listings.loc[(listings["rating"] < 0) | (listings["rating"] > 5), "rating"] = np.nan

18. Outlier Detection

Use multiple methods; do not blindly delete rows.

python
price_q1 = listings["price_inr"].quantile(0.25)
price_q3 = listings["price_inr"].quantile(0.75)
price_iqr = price_q3 - price_q1

lower_fence = price_q1 - 1.5 * price_iqr
upper_fence = price_q3 + 1.5 * price_iqr

price_outliers = listings[
    (listings["price_inr"] < lower_fence) |
    (listings["price_inr"] > upper_fence)
][["listing_id", "brand", "model_raw", "price_inr", "mrp_inr"]]

price_outliers

For this dataset, investigate the extreme low price row.

python
listings["price_issue_flag"] = listings["price_inr"].lt(3000) | listings["price_inr"].isna()

19. Missing-Value Strategy

Separate missing because unavailable from missing because invalid.

python
missing_after_cleaning = (
    listings[
        [
            "price_inr",
            "rating",
            "review_count",
            "launch_date",
            "ram_gb",
            "storage_gb",
            "battery_mah",
            "refresh_rate_hz",
            "front_camera_mp",
        ]
    ]
    .isna()
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)

missing_after_cleaning

Practical strategy:

  • keep missing price_inr as missing for price analysis
  • keep missing rating as missing, not zero
  • impute missing refresh_rate_hz to 60 only if listing copy clearly implies a basic display
  • do not impute launch dates unless there is a trusted catalog source
  • use missingness flags when missingness may carry business meaning
python
listings["rating_missing"] = listings["rating"].isna()
listings["launch_date_missing"] = listings["launch_date"].isna()

20. Feature Engineering

python
listings["discount_pct"] = (1 - listings["price_inr"] / listings["mrp_inr"]) * 100
listings["listing_age_days"] = (listings["snapshot_date"] - listings["launch_date"]).dt.days

listings["price_segment"] = pd.cut(
    listings["price_inr"],
    bins=[0, 15000, 30000, 50000, 80000, np.inf],
    labels=["budget", "midrange", "upper_midrange", "premium", "ultra_premium"],
)

listings["storage_per_1000_inr"] = listings["storage_gb"] / (listings["price_inr"] / 1000)
listings["battery_per_1000_inr"] = listings["battery_mah"] / (listings["price_inr"] / 1000)

21. Join Sales Channel Data

Clean numeric sales fields first.

python
sales["units_sold_30d"] = sales["units_sold_30d"].apply(parse_number)
sales["ad_spend_inr"] = sales["ad_spend_inr"].apply(parse_number)
sales["avg_delivery_days"] = sales["avg_delivery_days"].apply(parse_number)
sales["return_rate_pct"] = sales["return_rate_pct"].apply(parse_number)

Join to the cleaned listing table.

python
analysis = listings.merge(
    sales,
    on="listing_id",
    how="left",
    validate="one_to_one",
)

analysis.shape

Create sales efficiency metrics.

python
analysis["ad_spend_per_unit"] = analysis["ad_spend_inr"] / analysis["units_sold_30d"].replace(0, np.nan)
analysis["revenue_30d"] = analysis["price_inr"] * analysis["units_sold_30d"]

22. Post-Cleaning Dataset Contract

python
required_columns = [
    "listing_id",
    "brand",
    "model_raw",
    "price_inr",
    "rating",
    "ram_gb",
    "storage_gb",
    "battery_mah",
    "has_5g",
    "price_segment",
    "units_sold_30d",
]

assert analysis["listing_id"].is_unique
assert set(required_columns).issubset(analysis.columns)
assert analysis["rating"].dropna().between(0, 5).all()
assert (analysis["price_inr"].dropna() >= 0).all()

23. Univariate EDA

Numeric summaries:

python
numeric_cols = [
    "price_inr",
    "mrp_inr",
    "discount_pct",
    "rating",
    "review_count",
    "ram_gb",
    "storage_gb",
    "battery_mah",
    "fast_charging_w",
    "screen_size_in",
    "refresh_rate_hz",
    "primary_rear_mp",
    "units_sold_30d",
    "return_rate_pct",
]

analysis[numeric_cols].describe(percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]).T

Categorical summaries:

python
for col in ["brand", "marketplace", "price_segment", "os_family", "stock_status_clean", "seller_tier", "channel"]:
    display(analysis[col].value_counts(dropna=False))

Visuals:

python
sns.histplot(data=analysis, x="price_inr", bins=20, kde=True)
plt.title("Distribution of Smartphone Prices")
plt.show()
python
sns.boxplot(data=analysis, x="price_segment", y="price_inr")
plt.title("Price Spread by Segment")
plt.xticks(rotation=30)
plt.show()

24. Bivariate EDA

Price by brand:

python
brand_price = (
    analysis.groupby("brand", as_index=False)
    .agg(median_price=("price_inr", "median"), listings=("listing_id", "count"))
    .sort_values("median_price", ascending=False)
)

brand_price

Specs vs price:

python
sns.scatterplot(data=analysis, x="ram_gb", y="price_inr", hue="price_segment")
plt.title("RAM vs Price")
plt.show()
python
sns.scatterplot(data=analysis, x="battery_mah", y="price_inr", hue="has_5g")
plt.title("Battery Capacity vs Price")
plt.show()

Sales vs discount:

python
sns.regplot(data=analysis.dropna(subset=["discount_pct", "units_sold_30d"]), x="discount_pct", y="units_sold_30d")
plt.title("Discount vs Units Sold")
plt.show()

25. Multivariate EDA

python
sns.scatterplot(
    data=analysis,
    x="price_inr",
    y="units_sold_30d",
    hue="price_segment",
    size="rating",
    sizes=(30, 220),
)
plt.title("Price, Sales, Segment, and Rating")
plt.show()
python
pivot = analysis.pivot_table(
    index="price_segment",
    columns="channel",
    values="units_sold_30d",
    aggfunc="median",
)

sns.heatmap(pivot, annot=True, fmt=".0f", cmap="YlGnBu")
plt.title("Median Units Sold by Price Segment and Channel")
plt.show()

26. Correlation and Association

Correlation is not causation, but it helps find relationships worth investigating.

python
corr_cols = [
    "price_inr",
    "discount_pct",
    "rating",
    "review_count",
    "ram_gb",
    "storage_gb",
    "battery_mah",
    "fast_charging_w",
    "screen_size_in",
    "refresh_rate_hz",
    "primary_rear_mp",
    "units_sold_30d",
    "return_rate_pct",
]

corr = analysis[corr_cols].corr(numeric_only=True)
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Numeric Feature Correlation")
plt.show()

For categorical association, compare group distributions.

python
pd.crosstab(analysis["price_segment"], analysis["has_5g"], normalize="index")
python
pd.crosstab(analysis["channel"], analysis["city_tier"], normalize="index")

27. Segment Analysis

python
segment_summary = (
    analysis.groupby("price_segment", observed=True)
    .agg(
        listings=("listing_id", "count"),
        median_price=("price_inr", "median"),
        median_discount=("discount_pct", "median"),
        median_rating=("rating", "median"),
        median_units=("units_sold_30d", "median"),
        median_return_rate=("return_rate_pct", "median"),
    )
    .reset_index()
)

segment_summary

Brand-positioning summary:

python
positioning_summary = (
    analysis.groupby("typical_positioning")
    .agg(
        listings=("listing_id", "count"),
        median_price=("price_inr", "median"),
        median_units=("units_sold_30d", "median"),
        median_return_rate=("return_rate_pct", "median"),
    )
    .sort_values("median_units", ascending=False)
)

positioning_summary

28. Time-Based Checks

python
analysis["launch_month"] = analysis["launch_date"].dt.to_period("M").astype(str)

monthly_launches = analysis.groupby("launch_month").size()
monthly_launches.plot(kind="bar", title="Launch Month Distribution")
plt.show()

Check whether newer phones sell differently.

python
sns.scatterplot(data=analysis, x="listing_age_days", y="units_sold_30d", hue="price_segment")
plt.title("Listing Age vs Units Sold")
plt.show()

29. Robust Statistics

Outliers can distort means, so compare mean and median.

python
analysis.groupby("price_segment", observed=True)["price_inr"].agg(["count", "mean", "median", "std"])

Use trimmed means for a less outlier-sensitive summary.

python
def trimmed_mean(series, trim=0.1):
    s = series.dropna().sort_values()
    n = len(s)
    k = int(n * trim)
    if n == 0 or 2 * k >= n:
        return np.nan
    return s.iloc[k:n-k].mean()

analysis.groupby("price_segment", observed=True)["price_inr"].apply(trimmed_mean)

30. Statistical Checks

EDA is not formal modeling, but lightweight checks can prevent weak claims.

Difference in sales between 5G and non-5G listings:

python
analysis.groupby("has_5g")["units_sold_30d"].agg(["count", "mean", "median"])

Bootstrap a confidence interval for median price.

python
rng = np.random.default_rng(42)
prices = analysis["price_inr"].dropna().to_numpy()

boot_medians = [
    np.median(rng.choice(prices, size=len(prices), replace=True))
    for _ in range(2000)
]

np.percentile(boot_medians, [2.5, 50, 97.5])

31. Sensitivity Analysis

Check whether decisions change when suspicious rows are excluded.

python
clean_core = analysis[
    (~analysis["price_issue_flag"]) &
    (analysis["stock_status_clean"] != "out_of_stock") &
    (analysis["price_inr"].notna())
].copy()

comparison = pd.DataFrame(
    {
        "all_rows": analysis["price_inr"].describe(),
        "clean_core": clean_core["price_inr"].describe(),
    }
)

comparison

32. Data Bias and Limitations

Document limitations in the notebook.

This synthetic dataset has:

  • only a snapshot window, not full historical pricing
  • marketplace-visible ratings, not verified buyer satisfaction
  • synthetic sales metrics, not real revenue
  • intentional defects inserted for learning
  • no warranty, exchange, financing, or inventory-cost fields
  • incomplete information about regional availability

In a real project, ask how listings were sampled and whether missing rows differ from included rows.

33. Final Insight Template

Use a short executive summary.

Example format:

text
1. Catalog shape:
   Most listings are concentrated in midrange and upper-midrange segments.

2. Data quality:
   Brand normalization, duplicate IDs, invalid rating values, invalid launch dates,
   and text-heavy specs were the biggest cleaning risks.

3. Product patterns:
   Higher RAM/storage and 5G support generally appear in higher price segments,
   but discount and channel effects need separate analysis before causal claims.

4. Operational risks:
   Out-of-stock/preorder rows and suspicious prices should be excluded from sales-efficiency reporting.

5. Next data work:
   Add ingestion validation for price, rating, device type, duplicate listing IDs,
   launch date parsing, and negative return windows.

34. Export Analysis-Ready Data

python
analysis_ready_cols = [
    "listing_id",
    "snapshot_date",
    "marketplace",
    "brand",
    "model_raw",
    "parent_company",
    "country_or_region",
    "typical_positioning",
    "price_inr",
    "mrp_inr",
    "discount_pct",
    "price_segment",
    "rating",
    "review_count",
    "launch_date",
    "listing_age_days",
    "has_5g",
    "has_nfc",
    "has_ir_blaster",
    "has_esim",
    "has_uwb",
    "processor_family",
    "processor_core_label",
    "processor_clock_ghz",
    "ram_gb",
    "storage_gb",
    "battery_mah",
    "fast_charging_w",
    "screen_size_in",
    "refresh_rate_hz",
    "panel_type",
    "primary_rear_mp",
    "front_camera_mp",
    "rear_camera_count",
    "has_memory_card",
    "os_family",
    "seller_tier",
    "stock_status_clean",
    "return_window_days_num",
    "channel",
    "city_tier",
    "units_sold_30d",
    "ad_spend_inr",
    "avg_delivery_days",
    "return_rate_pct",
    "promo_label",
    "ad_spend_per_unit",
    "revenue_30d",
    "price_issue_flag",
]

analysis_ready = analysis[analysis_ready_cols].copy()
analysis_ready.to_csv("smartphone_eda_analysis_ready.csv", index=False)

35. Notebook Review Checklist

Before publishing the post, check:

  • the problem statement is clear
  • raw data is never overwritten
  • cleaning decisions are explained before code
  • invalid rows are documented before removal
  • duplicates are handled by rule, not by guesswork
  • missing values are not silently filled with fake certainty
  • outliers are investigated before exclusion
  • plots have titles and readable labels
  • every major insight has a table or visual behind it
  • correlation is not described as causation
  • limitations are included
  • the analysis-ready CSV can be regenerated from raw files

Practice Lab

Practice Extensions

  1. Add a warranty table and analyze whether longer warranty periods reduce returns.
  2. Build a small validation report that fails when a new raw file has invalid ratings or duplicate IDs.
  3. Compare robust regression and ordinary least squares for price estimation.
  4. Add a Plotly dashboard for segment, brand, and channel filters.
  5. Convert the cleaning workflow into a reusable Python script.

Frequently Asked Questions

What is the primary goal of this EDA case study?
The goal is to practice a complete EDA workflow on a realistic smartphone listing problem, focusing on transforming messy listings into business insights.
What types of data issues are addressed in the case study?
The case study addresses inconsistent brands, mixed currency formats, missing ratings, duplicate listings, non-phone devices, text-heavy specifications, invalid dates, and outliers.
Which CSV files are used in this case study?
The CSV files used are smartphoneedalistingsraw.csv, smartphoneedasaleschannels.csv, smartphoneedabrandreference.csv, and smartphoneedadatadictionary.csv.
What are some of the business questions the analysis aims to answer?
The analysis aims to answer which price segments dominate the smartphone catalog, which brands and channels drive units sold, how specs relate to price, if discounts are associated with higher sales, and which listings are suspicious or risky.
What are the non-goals of this EDA case study?
The non-goals include predicting future sales, scraping live ecommerce websites, recommending real investment or purchasing decisions, and treating the synthetic dataset as market truth.

Related Work

See how this thinking shows up in shipped systems.