# Comprehensive EDA Case Study on Smartphone Market Listings
URL: https://madhudadi.in/blog/posts/smartphone-marketplace-eda-case-study-insights-analysis
Published: 2026-07-04
Tags: Pandas, python, Exploratory Data Analysis
Read time: 75 min
Difficulty: intermediate
> A copyright-safe, original smartphone marketplace EDA case study covering data assessment, cleaning, validation, feature extraction, missing values, outliers, joins, univariate, bivariate, multivariate, statistical, visual, and reporting workflows.# End-to-End Smartphone Marketplace EDA Case Study: From Messy Listings to Business Insights

This case study uses original synthetic data.

It does not use copied course data, scraped marketplace data, proprietary rows, or any third-party CSV.

The goal is to practice a complete EDA workflow on a realistic smartphone listing problem: raw product pages contain inconsistent brands, mixed currency formats, missing ratings, duplicate listings, non-phone devices, text-heavy specifications, invalid dates, outliers, and a second table with sales-channel metrics.

## Files Used In This Case Study

Use these CSV files from the same folder as this post:

- `smartphone_eda_listings_raw.csv`
- `smartphone_eda_sales_channels.csv`
- `smartphone_eda_brand_reference.csv`
- `smartphone_eda_data_dictionary.csv`

Load them like this:

```python
import re
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

DATA_DIR = Path(".")

listings_raw = pd.read_csv(DATA_DIR / "smartphone_eda_listings_raw.csv")
sales_raw = pd.read_csv(DATA_DIR / "smartphone_eda_sales_channels.csv")
brand_ref = pd.read_csv(DATA_DIR / "smartphone_eda_brand_reference.csv")
dictionary = pd.read_csv(DATA_DIR / "smartphone_eda_data_dictionary.csv")
```

## Business Questions

Before cleaning anything, define what the analysis should answer.

In this case study, the marketplace team wants to know:

1. Which price segments dominate the current smartphone catalog?
2. Which brands and channels are driving units sold?
3. How do specs such as RAM, storage, 5G, battery, display refresh rate, and camera resolution relate to price?
4. Are discounts associated with higher sales?
5. Which listings look suspicious, incomplete, duplicated, invalid, or operationally risky?
6. What data-quality rules should be added before the next listing ingest?

Non-goals:

- predicting future sales
- scraping live ecommerce websites
- recommending real investment or purchasing decisions
- treating the synthetic dataset as market truth

## Full EDA Workflow Map

Use this map as the checklist for the notebook.

1. Problem framing
2. Data inventory
3. Data dictionary review
4. Manual assessment
5. Programmatic assessment
6. Quality issue log
7. Tidiness issue log
8. Cleaning plan
9. Type conversion
10. Category standardization
11. Text parsing and feature extraction
12. Duplicate detection
13. Missing-value analysis
14. Invalid-value checks
15. Outlier detection
16. Domain-rule validation
17. Feature engineering
18. Table joins
19. Post-cleaning tests
20. Univariate analysis
21. Bivariate analysis
22. Multivariate analysis
23. Segment and cohort analysis
24. Time-based checks
25. Correlation and association checks
26. Distribution diagnostics
27. Robust statistics
28. Sensitivity analysis
29. Bias and data-collection limitations
30. Final insight report
31. Export of analysis-ready data

No single project can literally cover every possible EDA method ever invented, but this workflow covers the practical EDA steps expected in production analytics, data-science interviews, and portfolio notebooks.

## 1. Data Inventory

Start with shapes and a preview.

```python
tables = {
    "listings_raw": listings_raw,
    "sales_raw": sales_raw,
    "brand_ref": brand_ref,
    "dictionary": dictionary,
}

for name, df in tables.items():
    print(name, df.shape)
    display(df.head())
```

Expected observations:

- `listings_raw` is the primary table.
- `sales_raw` is a listing-level performance table.
- `brand_ref` is a reference table used for canonical brand names.
- `dictionary` explains the columns and known intentional quality issues.

## 2. Manual Assessment

Look at real rows before writing cleaning code.

```python
pd.set_option("display.max_columns", None)

listings_raw.sample(10, random_state=42)
```

Manual assessment should catch issues like:

- inconsistent brand spellings such as `Samsung`, `samsung`, `Moto`, `Nothng`
- mixed price formats such as `INR 34,999`, `Rs. 27999`, `TBD`
- duplicate listing IDs
- non-smartphone rows
- invalid ratings
- impossible return windows
- invalid launch dates
- specs stored as dense text instead of numeric columns

Write the issue log before cleaning.

```python
issue_log = pd.DataFrame(
    [
        ["brand_raw", "consistency", "Same brand appears in multiple spellings"],
        ["price_text", "validity", "Currency symbols, spaces, commas, and TBD appear"],
        ["rating_text", "validity", "Rating should be between 0 and 5"],
        ["launch_date_text", "validity", "Multiple formats and one invalid date"],
        ["device_type_raw", "validity", "Dataset includes non-smartphone devices"],
        ["listing_id", "uniqueness", "One listing ID appears more than once"],
        ["return_window_days", "validity", "Return window should not be negative"],
        ["spec columns", "tidiness", "Processor, battery, display, memory, and camera contain multiple variables"],
    ],
    columns=["column", "issue_type", "description"],
)

issue_log
```

## 3. Programmatic Assessment

Use code to quantify what manual review suggested.

```python
listings_raw.info()
```

```python
listings_raw.describe(include="all").T
```

```python
missing_summary = (
    listings_raw.isna().sum()
    .rename("missing_count")
    .to_frame()
    .assign(missing_pct=lambda x: x["missing_count"] / len(listings_raw) * 100)
    .sort_values("missing_pct", ascending=False)
)

missing_summary
```

```python
listings_raw.duplicated().sum()
```

```python
listings_raw["listing_id"].duplicated().sum()
```

```python
for col in ["device_type_raw", "brand_raw", "marketplace", "stock_status", "seller_tier"]:
    display(listings_raw[col].value_counts(dropna=False))
```

## 4. Cleaning Helpers

Create small reusable helpers instead of repeating regex logic.

```python
def parse_money(value):
    if pd.isna(value):
        return np.nan
    text = str(value).strip().lower()
    if text in {"", "tbd", "na", "n/a", "not available"}:
        return np.nan
    digits = re.sub(r"[^0-9.]", "", text)
    return float(digits) if digits else np.nan


def parse_number(value):
    if pd.isna(value):
        return np.nan
    text = str(value).strip()
    text = text.replace(",", "")
    match = re.search(r"-?[0-9]+(?:\.[0-9]+)?", text)
    return float(match.group()) if match else np.nan


def contains_token(series, pattern):
    return series.fillna("").str.contains(pattern, case=False, regex=True)
```

## 5. Define-Code-Test Cleaning Workflow

Copy raw data first.

```python
listings = listings_raw.copy()
sales = sales_raw.copy()
```

Normalize obvious string columns.

```python
text_cols = listings.select_dtypes(include="object").columns
for col in text_cols:
    listings[col] = listings[col].str.strip()
```

Parse dates.

```python
listings["snapshot_date"] = pd.to_datetime(listings["snapshot_date"], errors="coerce")
listings["launch_date"] = pd.to_datetime(listings["launch_date_text"], errors="coerce", dayfirst=False)

listings.loc[listings["launch_date"].isna(), ["listing_id", "launch_date_text"]]
```

Parse numeric fields.

```python
listings["price_inr"] = listings["price_text"].apply(parse_money)
listings["mrp_inr"] = listings["mrp_text"].apply(parse_money)
listings["rating"] = listings["rating_text"].apply(parse_number)
listings["review_count"] = listings["review_count_text"].apply(parse_number)
listings["return_window_days_num"] = listings["return_window_days"].apply(parse_number)
```

Test the conversions.

```python
assert listings["price_inr"].dropna().ge(0).all()
assert listings["mrp_inr"].dropna().ge(0).all()
```

## 6. Canonical Brand Mapping

Use the reference table rather than manual replacement chains.

```python
listings = listings.merge(
    brand_ref,
    left_on="brand_raw",
    right_on="raw_brand",
    how="left",
    validate="many_to_one",
)

listings["brand"] = listings["canonical_brand"].fillna(listings["brand_raw"].str.title())

listings.loc[listings["canonical_brand"].isna(), ["brand_raw", "model_raw"]]
```

## 7. Device-Type Cleaning

The raw table intentionally contains non-phone rows.

```python
listings["device_type"] = (
    listings["device_type_raw"]
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

smartphone_mask = listings["device_type"].isin(
    ["smartphone", "smart phone", "phone", "foldable phone"]
)

non_phone_rows = listings.loc[~smartphone_mask, ["listing_id", "device_type_raw", "brand_raw", "model_raw"]]
non_phone_rows
```

For smartphone EDA, remove non-phone rows after documenting them.

```python
listings = listings.loc[smartphone_mask].copy()
```

## 8. Duplicate Handling

Find exact duplicates and key duplicates separately.

```python
exact_duplicate_count = listings.duplicated().sum()
id_duplicate_rows = listings[listings["listing_id"].duplicated(keep=False)].sort_values("listing_id")

exact_duplicate_count, id_duplicate_rows[["listing_id", "brand", "model_raw", "price_inr"]]
```

Deduplicate by keeping the first exact listing record.

```python
listings = listings.drop_duplicates()
listings = listings.sort_values(["listing_id", "snapshot_date"]).drop_duplicates("listing_id", keep="last")
```

## 9. Connectivity Feature Extraction

```python
listings["has_5g"] = contains_token(listings["sim_features"], r"\b5g\b")
listings["has_nfc"] = contains_token(listings["sim_features"], r"\bnfc\b")
listings["has_ir_blaster"] = contains_token(listings["sim_features"], r"\bir\b|ir blaster")
listings["has_esim"] = contains_token(listings["sim_features"], r"\besim\b")
listings["has_uwb"] = contains_token(listings["sim_features"], r"\buwb\b")
```

## 10. Processor Parsing

Extract chipset, core label, and clock speed.

```python
listings["processor_clean"] = listings["processor_text"].str.replace("|", ",", regex=False)

listings["processor_clock_ghz"] = (
    listings["processor_clean"]
    .str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*ghz", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["processor_core_label"] = (
    listings["processor_clean"]
    .str.extract(r"((?:hexa|octa|quad|8)\s*(?:core|cores|core processor)?)", flags=re.IGNORECASE)[0]
    .str.lower()
    .str.replace("8 cores", "octa core")
    .str.replace("8 core processor", "octa core")
)

listings["processor_family"] = (
    listings["processor_clean"]
    .str.extract(r"^(.*?)(?:,|\||$)")[0]
    .str.strip()
)
```

## 11. Memory and Storage Parsing

```python
listings["ram_gb"] = (
    listings["memory_text"]
    .str.extract(r"([0-9]+)\s*gb\s*ram", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["storage_gb"] = (
    listings["memory_text"]
    .str.extract(r"([0-9]+)\s*gb\s*(?:rom|storage)", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["storage_gb"] = listings["storage_gb"].fillna(
    listings["memory_text"].str.extract(r"([0-9]+)\s*gb$", flags=re.IGNORECASE)[0].astype(float)
)
```

## 12. Battery and Charging Parsing

```python
listings["battery_mah"] = (
    listings["battery_text"]
    .str.extract(r"([0-9]{4})\s*mah", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["fast_charging_w"] = (
    listings["battery_text"]
    .str.extract(r"([0-9]{2,3})\s*w", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["charger_not_bundled"] = contains_token(listings["battery_text"], r"not bundled|no charger")
```

## 13. Display Parsing

```python
listings["screen_size_in"] = (
    listings["display_text"]
    .str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*(?:inch|in)", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["refresh_rate_hz"] = (
    listings["display_text"]
    .str.extract(r"([0-9]{2,3})\s*hz", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["panel_type"] = np.select(
    [
        contains_token(listings["display_text"], "ltpo"),
        contains_token(listings["display_text"], "amoled"),
        contains_token(listings["display_text"], "oled"),
        contains_token(listings["display_text"], "lcd"),
    ],
    ["LTPO OLED", "AMOLED", "OLED", "LCD"],
    default="Unknown",
)
```

## 14. Camera Parsing

```python
rear_text = listings["camera_text"].str.split("&| and ", regex=True).str[0]
front_text = listings["camera_text"].str.extract(r"(?:&|and)\s*(.*front.*|.*selfie.*)", flags=re.IGNORECASE)[0]
rear_text_clean = rear_text.fillna("").str.lower()

listings["primary_rear_mp"] = rear_text.str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*mp", flags=re.IGNORECASE)[0].astype(float)
listings["front_camera_mp"] = front_text.str.extract(r"([0-9]+(?:\.[0-9]+)?)\s*mp", flags=re.IGNORECASE)[0].astype(float)
listings["rear_camera_count"] = rear_text_clean.str.count(r"\bmp\b")
```

## 15. Card Slot and OS Normalization

```python
listings["has_memory_card"] = contains_token(listings["card_slot"], r"supported|microsd|hybrid|dedicated")
listings["hybrid_card_slot"] = contains_token(listings["card_slot"], r"hybrid")

listings["max_card_tb"] = (
    listings["card_slot"]
    .str.extract(r"upto\s*([0-9]+)\s*tb", flags=re.IGNORECASE)[0]
    .astype(float)
)

listings["os_family"] = np.select(
    [
        contains_token(listings["os_text"], "ios"),
        contains_token(listings["os_text"], "harmony|emui"),
        contains_token(listings["os_text"], "android|oxygen|coloros|funtouch|hyperos|realme|magic|xos|hios|rog|nothing|one ui"),
    ],
    ["iOS", "HarmonyOS/EMUI", "Android"],
    default="Other/Unknown",
)
```

## 16. Stock Status and Return Policy

```python
listings["stock_status_clean"] = np.select(
    [
        contains_token(listings["stock_status"], r"out"),
        contains_token(listings["stock_status"], r"preorder"),
        contains_token(listings["stock_status"], r"limited"),
        contains_token(listings["stock_status"], r"in stock|available"),
    ],
    ["out_of_stock", "preorder", "limited", "available"],
    default="unknown",
)

invalid_return_windows = listings[listings["return_window_days_num"] < 0]
invalid_return_windows[["listing_id", "return_window_days"]]
```

Set invalid return windows to missing.

```python
listings.loc[listings["return_window_days_num"] < 0, "return_window_days_num"] = np.nan
```

## 17. Domain Validation

Create validation checks after cleaning.

```python
validation_report = {
    "negative_price_rows": int((listings["price_inr"] < 0).sum()),
    "price_above_mrp_rows": int((listings["price_inr"] > listings["mrp_inr"]).sum()),
    "invalid_rating_rows": int(((listings["rating"] < 0) | (listings["rating"] > 5)).sum()),
    "missing_brand_rows": int(listings["brand"].isna().sum()),
    "missing_price_rows": int(listings["price_inr"].isna().sum()),
    "missing_launch_date_rows": int(listings["launch_date"].isna().sum()),
    "duplicate_listing_ids": int(listings["listing_id"].duplicated().sum()),
}

validation_report
```

Handle invalid ratings.

```python
listings.loc[(listings["rating"] < 0) | (listings["rating"] > 5), "rating"] = np.nan
```

## 18. Outlier Detection

Use multiple methods; do not blindly delete rows.

```python
price_q1 = listings["price_inr"].quantile(0.25)
price_q3 = listings["price_inr"].quantile(0.75)
price_iqr = price_q3 - price_q1

lower_fence = price_q1 - 1.5 * price_iqr
upper_fence = price_q3 + 1.5 * price_iqr

price_outliers = listings[
    (listings["price_inr"] < lower_fence) |
    (listings["price_inr"] > upper_fence)
][["listing_id", "brand", "model_raw", "price_inr", "mrp_inr"]]

price_outliers
```

For this dataset, investigate the extreme low price row.

```python
listings["price_issue_flag"] = listings["price_inr"].lt(3000) | listings["price_inr"].isna()
```

## 19. Missing-Value Strategy

Separate missing because unavailable from missing because invalid.

```python
missing_after_cleaning = (
    listings[
        [
            "price_inr",
            "rating",
            "review_count",
            "launch_date",
            "ram_gb",
            "storage_gb",
            "battery_mah",
            "refresh_rate_hz",
            "front_camera_mp",
        ]
    ]
    .isna()
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)

missing_after_cleaning
```

Practical strategy:

- keep missing `price_inr` as missing for price analysis
- keep missing `rating` as missing, not zero
- impute missing `refresh_rate_hz` to 60 only if listing copy clearly implies a basic display
- do not impute launch dates unless there is a trusted catalog source
- use missingness flags when missingness may carry business meaning

```python
listings["rating_missing"] = listings["rating"].isna()
listings["launch_date_missing"] = listings["launch_date"].isna()
```

## 20. Feature Engineering

```python
listings["discount_pct"] = (1 - listings["price_inr"] / listings["mrp_inr"]) * 100
listings["listing_age_days"] = (listings["snapshot_date"] - listings["launch_date"]).dt.days

listings["price_segment"] = pd.cut(
    listings["price_inr"],
    bins=[0, 15000, 30000, 50000, 80000, np.inf],
    labels=["budget", "midrange", "upper_midrange", "premium", "ultra_premium"],
)

listings["storage_per_1000_inr"] = listings["storage_gb"] / (listings["price_inr"] / 1000)
listings["battery_per_1000_inr"] = listings["battery_mah"] / (listings["price_inr"] / 1000)
```

## 21. Join Sales Channel Data

Clean numeric sales fields first.

```python
sales["units_sold_30d"] = sales["units_sold_30d"].apply(parse_number)
sales["ad_spend_inr"] = sales["ad_spend_inr"].apply(parse_number)
sales["avg_delivery_days"] = sales["avg_delivery_days"].apply(parse_number)
sales["return_rate_pct"] = sales["return_rate_pct"].apply(parse_number)
```

Join to the cleaned listing table.

```python
analysis = listings.merge(
    sales,
    on="listing_id",
    how="left",
    validate="one_to_one",
)

analysis.shape
```

Create sales efficiency metrics.

```python
analysis["ad_spend_per_unit"] = analysis["ad_spend_inr"] / analysis["units_sold_30d"].replace(0, np.nan)
analysis["revenue_30d"] = analysis["price_inr"] * analysis["units_sold_30d"]
```

## 22. Post-Cleaning Dataset Contract

```python
required_columns = [
    "listing_id",
    "brand",
    "model_raw",
    "price_inr",
    "rating",
    "ram_gb",
    "storage_gb",
    "battery_mah",
    "has_5g",
    "price_segment",
    "units_sold_30d",
]

assert analysis["listing_id"].is_unique
assert set(required_columns).issubset(analysis.columns)
assert analysis["rating"].dropna().between(0, 5).all()
assert (analysis["price_inr"].dropna() >= 0).all()
```

## 23. Univariate EDA

Numeric summaries:

```python
numeric_cols = [
    "price_inr",
    "mrp_inr",
    "discount_pct",
    "rating",
    "review_count",
    "ram_gb",
    "storage_gb",
    "battery_mah",
    "fast_charging_w",
    "screen_size_in",
    "refresh_rate_hz",
    "primary_rear_mp",
    "units_sold_30d",
    "return_rate_pct",
]

analysis[numeric_cols].describe(percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]).T
```

Categorical summaries:

```python
for col in ["brand", "marketplace", "price_segment", "os_family", "stock_status_clean", "seller_tier", "channel"]:
    display(analysis[col].value_counts(dropna=False))
```

Visuals:

```python
sns.histplot(data=analysis, x="price_inr", bins=20, kde=True)
plt.title("Distribution of Smartphone Prices")
plt.show()
```

```python
sns.boxplot(data=analysis, x="price_segment", y="price_inr")
plt.title("Price Spread by Segment")
plt.xticks(rotation=30)
plt.show()
```

## 24. Bivariate EDA

Price by brand:

```python
brand_price = (
    analysis.groupby("brand", as_index=False)
    .agg(median_price=("price_inr", "median"), listings=("listing_id", "count"))
    .sort_values("median_price", ascending=False)
)

brand_price
```

Specs vs price:

```python
sns.scatterplot(data=analysis, x="ram_gb", y="price_inr", hue="price_segment")
plt.title("RAM vs Price")
plt.show()
```

```python
sns.scatterplot(data=analysis, x="battery_mah", y="price_inr", hue="has_5g")
plt.title("Battery Capacity vs Price")
plt.show()
```

Sales vs discount:

```python
sns.regplot(data=analysis.dropna(subset=["discount_pct", "units_sold_30d"]), x="discount_pct", y="units_sold_30d")
plt.title("Discount vs Units Sold")
plt.show()
```

## 25. Multivariate EDA

```python
sns.scatterplot(
    data=analysis,
    x="price_inr",
    y="units_sold_30d",
    hue="price_segment",
    size="rating",
    sizes=(30, 220),
)
plt.title("Price, Sales, Segment, and Rating")
plt.show()
```

```python
pivot = analysis.pivot_table(
    index="price_segment",
    columns="channel",
    values="units_sold_30d",
    aggfunc="median",
)

sns.heatmap(pivot, annot=True, fmt=".0f", cmap="YlGnBu")
plt.title("Median Units Sold by Price Segment and Channel")
plt.show()
```

## 26. Correlation and Association

Correlation is not causation, but it helps find relationships worth investigating.

```python
corr_cols = [
    "price_inr",
    "discount_pct",
    "rating",
    "review_count",
    "ram_gb",
    "storage_gb",
    "battery_mah",
    "fast_charging_w",
    "screen_size_in",
    "refresh_rate_hz",
    "primary_rear_mp",
    "units_sold_30d",
    "return_rate_pct",
]

corr = analysis[corr_cols].corr(numeric_only=True)
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Numeric Feature Correlation")
plt.show()
```

For categorical association, compare group distributions.

```python
pd.crosstab(analysis["price_segment"], analysis["has_5g"], normalize="index")
```

```python
pd.crosstab(analysis["channel"], analysis["city_tier"], normalize="index")
```

## 27. Segment Analysis

```python
segment_summary = (
    analysis.groupby("price_segment", observed=True)
    .agg(
        listings=("listing_id", "count"),
        median_price=("price_inr", "median"),
        median_discount=("discount_pct", "median"),
        median_rating=("rating", "median"),
        median_units=("units_sold_30d", "median"),
        median_return_rate=("return_rate_pct", "median"),
    )
    .reset_index()
)

segment_summary
```

Brand-positioning summary:

```python
positioning_summary = (
    analysis.groupby("typical_positioning")
    .agg(
        listings=("listing_id", "count"),
        median_price=("price_inr", "median"),
        median_units=("units_sold_30d", "median"),
        median_return_rate=("return_rate_pct", "median"),
    )
    .sort_values("median_units", ascending=False)
)

positioning_summary
```

## 28. Time-Based Checks

```python
analysis["launch_month"] = analysis["launch_date"].dt.to_period("M").astype(str)

monthly_launches = analysis.groupby("launch_month").size()
monthly_launches.plot(kind="bar", title="Launch Month Distribution")
plt.show()
```

Check whether newer phones sell differently.

```python
sns.scatterplot(data=analysis, x="listing_age_days", y="units_sold_30d", hue="price_segment")
plt.title("Listing Age vs Units Sold")
plt.show()
```

## 29. Robust Statistics

Outliers can distort means, so compare mean and median.

```python
analysis.groupby("price_segment", observed=True)["price_inr"].agg(["count", "mean", "median", "std"])
```

Use trimmed means for a less outlier-sensitive summary.

```python
def trimmed_mean(series, trim=0.1):
    s = series.dropna().sort_values()
    n = len(s)
    k = int(n * trim)
    if n == 0 or 2 * k >= n:
        return np.nan
    return s.iloc[k:n-k].mean()


analysis.groupby("price_segment", observed=True)["price_inr"].apply(trimmed_mean)
```

## 30. Statistical Checks

EDA is not formal modeling, but lightweight checks can prevent weak claims.

Difference in sales between 5G and non-5G listings:

```python
analysis.groupby("has_5g")["units_sold_30d"].agg(["count", "mean", "median"])
```

Bootstrap a confidence interval for median price.

```python
rng = np.random.default_rng(42)
prices = analysis["price_inr"].dropna().to_numpy()

boot_medians = [
    np.median(rng.choice(prices, size=len(prices), replace=True))
    for _ in range(2000)
]

np.percentile(boot_medians, [2.5, 50, 97.5])
```

## 31. Sensitivity Analysis

Check whether decisions change when suspicious rows are excluded.

```python
clean_core = analysis[
    (~analysis["price_issue_flag"]) &
    (analysis["stock_status_clean"] != "out_of_stock") &
    (analysis["price_inr"].notna())
].copy()

comparison = pd.DataFrame(
    {
        "all_rows": analysis["price_inr"].describe(),
        "clean_core": clean_core["price_inr"].describe(),
    }
)

comparison
```

## 32. Data Bias and Limitations

Document limitations in the notebook.

This synthetic dataset has:

- only a snapshot window, not full historical pricing
- marketplace-visible ratings, not verified buyer satisfaction
- synthetic sales metrics, not real revenue
- intentional defects inserted for learning
- no warranty, exchange, financing, or inventory-cost fields
- incomplete information about regional availability

In a real project, ask how listings were sampled and whether missing rows differ from included rows.

## 33. Final Insight Template

Use a short executive summary.

Example format:

```text
1. Catalog shape:
   Most listings are concentrated in midrange and upper-midrange segments.

2. Data quality:
   Brand normalization, duplicate IDs, invalid rating values, invalid launch dates,
   and text-heavy specs were the biggest cleaning risks.

3. Product patterns:
   Higher RAM/storage and 5G support generally appear in higher price segments,
   but discount and channel effects need separate analysis before causal claims.

4. Operational risks:
   Out-of-stock/preorder rows and suspicious prices should be excluded from sales-efficiency reporting.

5. Next data work:
   Add ingestion validation for price, rating, device type, duplicate listing IDs,
   launch date parsing, and negative return windows.
```

## 34. Export Analysis-Ready Data

```python
analysis_ready_cols = [
    "listing_id",
    "snapshot_date",
    "marketplace",
    "brand",
    "model_raw",
    "parent_company",
    "country_or_region",
    "typical_positioning",
    "price_inr",
    "mrp_inr",
    "discount_pct",
    "price_segment",
    "rating",
    "review_count",
    "launch_date",
    "listing_age_days",
    "has_5g",
    "has_nfc",
    "has_ir_blaster",
    "has_esim",
    "has_uwb",
    "processor_family",
    "processor_core_label",
    "processor_clock_ghz",
    "ram_gb",
    "storage_gb",
    "battery_mah",
    "fast_charging_w",
    "screen_size_in",
    "refresh_rate_hz",
    "panel_type",
    "primary_rear_mp",
    "front_camera_mp",
    "rear_camera_count",
    "has_memory_card",
    "os_family",
    "seller_tier",
    "stock_status_clean",
    "return_window_days_num",
    "channel",
    "city_tier",
    "units_sold_30d",
    "ad_spend_inr",
    "avg_delivery_days",
    "return_rate_pct",
    "promo_label",
    "ad_spend_per_unit",
    "revenue_30d",
    "price_issue_flag",
]

analysis_ready = analysis[analysis_ready_cols].copy()
analysis_ready.to_csv("smartphone_eda_analysis_ready.csv", index=False)
```

## 35. Notebook Review Checklist

Before publishing the post, check:

- the problem statement is clear
- raw data is never overwritten
- cleaning decisions are explained before code
- invalid rows are documented before removal
- duplicates are handled by rule, not by guesswork
- missing values are not silently filled with fake certainty
- outliers are investigated before exclusion
- plots have titles and readable labels
- every major insight has a table or visual behind it
- correlation is not described as causation
- limitations are included
- the analysis-ready CSV can be regenerated from raw files

## Practice Extensions

1. Add a warranty table and analyze whether longer warranty periods reduce returns.
2. Build a small validation report that fails when a new raw file has invalid ratings or duplicate IDs.
3. Compare robust regression and ordinary least squares for price estimation.
4. Add a Plotly dashboard for segment, brand, and channel filters.
5. Convert the cleaning workflow into a reusable Python script.
