# Pandas Case Study: Cleaning and Analyzing Text Feedback Data
URL: https://madhudadi.in/blog/posts/pandas-text-data-case-study-clean-and-analyze-feedback
Published: 2026-06-06
Tags: Pandas, python
Read time: 84 min
Difficulty: advanced
> Build a complete Pandas text data case study with original content-feedback data. Clean messy messages, remove HTML and URLs, normalize text, extract keywords, tokenize, remove stopwords, build n-grams and a document-term matrix, join post and author data, parse dates, create pivots, resample trends, and produce final content-quality reports.# Pandas Text Data Case Study: Clean, Explore, Vectorize, and Report

This is a full Pandas case study for working with textual data.

The scenario is simple:

A learning platform collects feedback from learners on blog posts, coding lessons, projects, and interview-prep content. The feedback arrives as messy text with mixed casing, HTML snippets, URLs, typos, repeated records, ratings, timestamps, and support categories.

Your job is to turn raw feedback into useful content-quality reports.

By the end, you will answer questions like:

- Which posts receive the most negative feedback?
- Which authors get the strongest ratings?
- Which topics are learners asking about?
- Which feedback messages mention refund, invoice, bugs, copyright, or confusion?
- Which channels bring high-quality feedback?
- Which words and phrases appear most often?
- How do feedback volume and response time change over time?
- Which posts need review first?

This case study uses original sample datasets created for this lesson. It does not use copied movie reviews, social media posts, ecommerce reviews, public datasets, or third-party course material.

Files used in this case study:

- `pandas_text_case_posts.csv`
- `pandas_text_case_authors.csv`
- `pandas_text_case_feedback_jan.csv`
- `pandas_text_case_feedback_feb.csv`
- `pandas_text_case_topic_rules.csv`

Place all CSV files in the same folder as this Markdown file before running the examples.

## What This Case Study Covers

This project brings together many Pandas skills:

- `read_csv`
- `concat`
- `merge`
- `validate`
- `indicator=True`
- `shape`, `info`, `describe`
- missing values
- duplicate rows
- string cleaning with `.str`
- regex replacement
- regex extraction
- URL and HTML removal
- punctuation removal
- text normalization
- text feature engineering
- tokenization with `.str.split()`
- `explode`
- stopword filtering
- word counts
- n-grams
- keyword rule matching
- document-term matrices
- simple TF-IDF with Pandas and NumPy
- datetime parsing
- Timedelta response-time analysis
- `groupby`
- named aggregation
- `transform`
- `pivot_table`
- `set_index`
- `resample`
- rolling averages
- MultiIndex-style summaries
- final export-ready reports

## 1. Import Libraries

```python
import pandas as pd
import numpy as np
```

Optional display settings:

```python
pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)
```

## 2. Load The Data

```python
posts = pd.read_csv("pandas_text_case_posts.csv")
authors = pd.read_csv("pandas_text_case_authors.csv")
rules = pd.read_csv("pandas_text_case_topic_rules.csv")
jan = pd.read_csv("pandas_text_case_feedback_jan.csv")
feb = pd.read_csv("pandas_text_case_feedback_feb.csv")
```

Check the shapes:

```python
for name, df in {
    "posts": posts,
    "authors": authors,
    "rules": rules,
    "jan": jan,
    "feb": feb,
}.items():
    print(name, df.shape)
```

Preview the data:

```python
posts.head()
```

```python
jan.head()
```

```python
rules.head()
```

## 3. Understand The Data Model

The data has three supporting tables:

- `posts`: one row per content item
- `authors`: one row per content author
- `rules`: keyword rules used to detect topics in feedback text

It has two monthly feedback tables:

- `jan`: feedback received in January
- `feb`: feedback received in February

Important keys:

| Table | Key Columns |
|---|---|
| posts | `post_id`, `author_id` |
| authors | `author_id` |
| feedback | `feedback_id`, `post_id` |
| rules | `rule_id` |

## 4. Inspect Raw Tables

```python
posts.info()
```

```python
jan.info()
```

```python
jan.describe(include="object")
```

Check numeric fields:

```python
jan.describe(numeric_only=True)
```

Check sentiment labels:

```python
jan["sentiment_label"].value_counts(dropna=False)
```

## 5. Concatenate Monthly Feedback

January and February feedback have the same schema.

```python
feedback = pd.concat([jan, feb], ignore_index=True)
feedback.shape
```

Keep source month when needed:

```python
feedback_with_source = pd.concat(
    [jan, feb],
    keys=["Jan", "Feb"]
).reset_index(level=0).rename(columns={"level_0": "source_month"})

feedback_with_source.head()
```

For the main workflow:

```python
feedback = pd.concat([jan, feb], ignore_index=True)
```

## 6. Data Quality Checks

Check duplicate full rows:

```python
feedback.duplicated().sum()
```

Drop exact duplicates:

```python
feedback = feedback.drop_duplicates().copy()
```

Check duplicate feedback IDs:

```python
feedback["feedback_id"].duplicated().sum()
```

Check missing values:

```python
feedback.isna().sum()
```

Check unknown post IDs before merging:

```python
feedback.merge(posts, how="left", on="post_id", indicator=True)["_merge"].value_counts()
```

Find feedback rows pointing to missing posts:

```python
feedback.merge(posts, how="left", on="post_id", indicator=True) \
    .query("_merge == 'left_only'")[["feedback_id", "post_id", "raw_message"]]
```

Do not hide this. Data quality findings are part of the case study.

## 7. Parse Dates

Convert timestamps:

```python
feedback["submitted_at"] = pd.to_datetime(feedback["submitted_at"], errors="coerce")
feedback["responded_at"] = pd.to_datetime(feedback["responded_at"], errors="coerce")
posts["published_at"] = pd.to_datetime(posts["published_at"], errors="coerce")
```

Find invalid submitted dates:

```python
feedback[feedback["submitted_at"].isna()]
```

Keep valid submitted feedback for time analysis:

```python
feedback_valid = feedback.dropna(subset=["submitted_at"]).copy()
```

Create response-time features:

```python
feedback_valid["response_time"] = feedback_valid["responded_at"] - feedback_valid["submitted_at"]
feedback_valid["response_hours"] = (
    feedback_valid["response_time"].dt.total_seconds() / 3600
).round(2)
```

Flag feedback not answered within 24 hours:

```python
feedback_valid["late_response"] = feedback_valid["response_hours"] > 24
```

Rows with missing `responded_at` will have missing response hours.

## 8. Clean Column Names

Column names can also be text data.

```python
feedback_valid.columns = (
    feedback_valid.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)
```

This dataset already has clean column names, but the pattern is useful in real projects.

## 9. Clean Raw Feedback Text

Start with a copy:

```python
text_df = feedback_valid.copy()
```

Inspect raw text:

```python
text_df["raw_message"].head(10)
```

Create a cleaned text column:

```python
text_df["message_clean"] = text_df["raw_message"].fillna("")
```

Lowercase:

```python
text_df["message_clean"] = text_df["message_clean"].str.lower()
```

Remove leading and trailing spaces:

```python
text_df["message_clean"] = text_df["message_clean"].str.strip()
```

Remove HTML tags:

```python
text_df["message_clean"] = text_df["message_clean"].str.replace(
    r"<[^>]+>",
    " ",
    regex=True
)
```

Remove URLs:

```python
text_df["message_clean"] = text_df["message_clean"].str.replace(
    r"https?://\S+|www\.\S+",
    " ",
    regex=True
)
```

Normalize a few common contractions and short forms:

```python
replacements = {
    r"\bcan't\b": "cannot",
    r"\bwon't\b": "will not",
    r"\bi'm\b": "i am",
    r"\bi've\b": "i have",
    r"\bpls\b": "please",
    r"\bplz\b": "please",
}

text_df["message_clean"] = text_df["message_clean"].replace(
    replacements,
    regex=True
)
```

Remove punctuation and symbols:

```python
text_df["message_clean"] = text_df["message_clean"].str.replace(
    r"[^a-z0-9\s]",
    " ",
    regex=True
)
```

Collapse repeated spaces:

```python
text_df["message_clean"] = text_df["message_clean"].str.replace(
    r"\s+",
    " ",
    regex=True
).str.strip()
```

Compare raw and cleaned text:

```python
text_df[["raw_message", "message_clean"]].head(10)
```

## 10. Text Quality Features

Create length features:

```python
text_df["char_count_raw"] = text_df["raw_message"].fillna("").str.len()
text_df["char_count_clean"] = text_df["message_clean"].str.len()
```

Word count:

```python
text_df["word_count"] = text_df["message_clean"].str.split().str.len()
```

Count exclamation marks in raw text:

```python
text_df["exclamation_count"] = text_df["raw_message"].fillna("").str.count("!")
```

Check whether the raw message had a URL:

```python
text_df["has_url"] = text_df["raw_message"].fillna("").str.contains(
    r"https?://|www\.",
    regex=True,
    case=False,
    na=False
)
```

Check whether the raw message had HTML:

```python
text_df["has_html"] = text_df["raw_message"].fillna("").str.contains(
    r"<[^>]+>",
    regex=True,
    na=False
)
```

Feature summary:

```python
text_df[[
    "feedback_id",
    "char_count_raw",
    "char_count_clean",
    "word_count",
    "exclamation_count",
    "has_url",
    "has_html"
]].head()
```

## 11. Extract Useful Text Patterns

Extract email-like text if a user typed one into the message:

```python
text_df["mentioned_email"] = text_df["raw_message"].str.extract(
    r"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)
```

Extract issue IDs like `ERR-204`:

```python
text_df["issue_code"] = text_df["raw_message"].str.extract(
    r"\b(ERR-\d{3})\b"
)
```

Find messages that mention money or billing:

```python
text_df["mentions_billing"] = text_df["message_clean"].str.contains(
    r"\b(?:refund|invoice|payment|paid|money|billing)\b",
    regex=True,
    na=False
)
```

Find messages that mention bugs:

```python
text_df["mentions_bug"] = text_df["message_clean"].str.contains(
    r"\b(?:error|bug|crash|broken|not working|err)\b",
    regex=True,
    na=False
)
```

Find messages that mention content clarity:

```python
text_df["mentions_confusion"] = text_df["message_clean"].str.contains(
    r"\b(?:confusing|unclear|hard|lost|explain|example)\b",
    regex=True,
    na=False
)
```

## 12. Tokenize Text

Tokenization means splitting text into words.

```python
text_df["tokens"] = text_df["message_clean"].str.split()
```

Inspect:

```python
text_df[["feedback_id", "tokens"]].head()
```

Explode tokens into one row per token:

```python
tokens = text_df[["feedback_id", "post_id", "sentiment_label", "tokens"]].explode("tokens")
tokens = tokens.rename(columns={"tokens": "token"})
tokens.head()
```

Remove missing or empty tokens:

```python
tokens = tokens.dropna(subset=["token"])
tokens = tokens[tokens["token"].str.len() > 0]
```

## 13. Remove Stopwords

Use a small custom stopword list.

```python
stopwords = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "can", "for", "from", "has", "have", "i", "in", "is",
    "it", "my", "of", "on", "or", "please", "so", "that",
    "the", "this", "to", "was", "with", "you", "your"
}
```

Filter:

```python
meaningful_tokens = tokens[~tokens["token"].isin(stopwords)].copy()
```

Top words:

```python
meaningful_tokens["token"].value_counts().head(20)
```

Top words by sentiment:

```python
meaningful_tokens.groupby("sentiment_label")["token"].value_counts().groupby(level=0).head(10)
```

## 14. N-Grams With Pandas

An n-gram is a sequence of words.

For bigrams, use the current token plus the next token inside the same feedback message.

```python
meaningful_tokens = meaningful_tokens.sort_values(["feedback_id"]).copy()
meaningful_tokens["next_token"] = meaningful_tokens.groupby("feedback_id")["token"].shift(-1)
meaningful_tokens["bigram"] = meaningful_tokens["token"] + " " + meaningful_tokens["next_token"]
```

Remove rows where there is no next token:

```python
bigrams = meaningful_tokens.dropna(subset=["next_token"]).copy()
```

Top bigrams:

```python
bigrams["bigram"].value_counts().head(20)
```

Top bigrams by sentiment:

```python
bigrams.groupby("sentiment_label")["bigram"].value_counts().groupby(level=0).head(10)
```

## 15. Build A Document-Term Matrix

A document-term matrix has:

- one row per document
- one column per token
- values showing token counts

```python
token_counts = (
    meaningful_tokens
    .groupby(["feedback_id", "token"])
    .size()
    .rename("count")
)
```

Convert to wide format:

```python
dtm = token_counts.unstack(fill_value=0)
dtm.head()
```

Keep the most common tokens:

```python
top_tokens = meaningful_tokens["token"].value_counts().head(25).index
dtm_top = dtm.reindex(columns=top_tokens, fill_value=0)
dtm_top.head()
```

This is a Pandas-native version of a simple bag-of-words table.

## 16. Simple TF-IDF With Pandas

Term frequency:

```python
tf = dtm_top.div(dtm_top.sum(axis=1).replace(0, np.nan), axis=0).fillna(0)
```

Document frequency:

```python
doc_freq = (dtm_top > 0).sum(axis=0)
```

Inverse document frequency:

```python
idf = np.log((1 + len(dtm_top)) / (1 + doc_freq)) + 1
```

TF-IDF:

```python
tfidf = tf * idf
tfidf.head()
```

Top TF-IDF token per feedback:

```python
top_tfidf = tfidf.idxmax(axis=1).rename("top_keyword").reset_index()
top_tfidf.head()
```

This is not a full NLP library, but it teaches the idea using Pandas.

## 17. Match Topic Rules

The topic rules table has regex patterns.

```python
rules
```

Create a long table of rule hits.

```python
rule_hits = []

for row in rules.itertuples(index=False):
    mask = text_df["message_clean"].str.contains(row.pattern, regex=True, na=False)
    matched = text_df.loc[mask, ["feedback_id", "post_id", "message_clean"]].copy()
    matched["rule_id"] = row.rule_id
    matched["topic"] = row.topic
    matched["severity"] = row.severity
    matched["recommended_action"] = row.recommended_action
    rule_hits.append(matched)

rule_hits = pd.concat(rule_hits, ignore_index=True)
rule_hits.head()
```

Topic counts:

```python
rule_hits["topic"].value_counts()
```

Severity counts:

```python
rule_hits["severity"].value_counts()
```

Topics by post:

```python
rule_hits.groupby(["post_id", "topic"]).size().reset_index(name="mentions")
```

## 18. Merge With Posts And Authors

Audit post matches:

```python
text_df.merge(posts, how="left", on="post_id", indicator=True)["_merge"].value_counts()
```

Build the analysis table:

```python
analysis = (
    text_df
    .merge(posts, how="left", on="post_id", validate="many_to_one")
    .merge(authors, how="left", on="author_id", validate="many_to_one")
)

analysis.head()
```

Select useful columns:

```python
analysis = analysis[[
    "feedback_id",
    "post_id",
    "slug",
    "title",
    "series",
    "post_type",
    "author",
    "submitted_at",
    "responded_at",
    "response_hours",
    "late_response",
    "rating",
    "sentiment_label",
    "channel",
    "status",
    "message_clean",
    "word_count",
    "char_count_clean",
    "exclamation_count",
    "has_url",
    "has_html",
    "mentions_billing",
    "mentions_bug",
    "mentions_confusion"
]]

analysis.head()
```

## 19. Post-Level Content Quality Report

Create a post-level report:

```python
post_quality = (
    analysis
    .groupby(["post_id", "title", "series", "post_type", "author"], as_index=False)
    .agg(
        feedback_count=("feedback_id", "count"),
        avg_rating=("rating", "mean"),
        negative_count=("sentiment_label", lambda s: (s == "negative").sum()),
        positive_count=("sentiment_label", lambda s: (s == "positive").sum()),
        avg_word_count=("word_count", "mean"),
        billing_mentions=("mentions_billing", "sum"),
        bug_mentions=("mentions_bug", "sum"),
        confusion_mentions=("mentions_confusion", "sum"),
        late_responses=("late_response", "sum")
    )
)

post_quality["avg_rating"] = post_quality["avg_rating"].round(2)
post_quality["avg_word_count"] = post_quality["avg_word_count"].round(2)
```

Create a review priority score:

```python
post_quality["review_priority_score"] = (
    post_quality["negative_count"] * 3
    + post_quality["bug_mentions"] * 3
    + post_quality["confusion_mentions"] * 2
    + post_quality["billing_mentions"] * 2
    + post_quality["late_responses"]
)

post_quality.sort_values("review_priority_score", ascending=False)
```

This table tells the content team what to improve first.

## 20. Author-Level Report

```python
author_report = (
    analysis
    .groupby("author", as_index=False)
    .agg(
        posts=("post_id", "nunique"),
        feedback_count=("feedback_id", "count"),
        avg_rating=("rating", "mean"),
        negative_count=("sentiment_label", lambda s: (s == "negative").sum()),
        avg_response_hours=("response_hours", "mean")
    )
)

author_report["avg_rating"] = author_report["avg_rating"].round(2)
author_report["avg_response_hours"] = author_report["avg_response_hours"].round(2)
author_report.sort_values("feedback_count", ascending=False)
```

## 21. Channel And Sentiment Report

```python
channel_report = (
    analysis
    .groupby(["channel", "sentiment_label"], as_index=False)
    .agg(
        feedback_count=("feedback_id", "count"),
        avg_rating=("rating", "mean"),
        avg_word_count=("word_count", "mean")
    )
)

channel_report["avg_rating"] = channel_report["avg_rating"].round(2)
channel_report["avg_word_count"] = channel_report["avg_word_count"].round(2)
channel_report
```

Pivot version:

```python
analysis.pivot_table(
    index="channel",
    columns="sentiment_label",
    values="feedback_id",
    aggfunc="count",
    fill_value=0
)
```

## 22. Topic Report From Rule Hits

Merge rule hits with post metadata:

```python
topic_report_base = (
    rule_hits
    .merge(posts, how="left", on="post_id", validate="many_to_one")
    .merge(authors, how="left", on="author_id", validate="many_to_one")
)
```

Topic count by series:

```python
topic_by_series = (
    topic_report_base
    .groupby(["series", "topic"], as_index=False)
    .agg(mentions=("feedback_id", "count"))
    .sort_values(["series", "mentions"], ascending=[True, False])
)

topic_by_series
```

Topic severity report:

```python
topic_report_base.pivot_table(
    index="topic",
    columns="severity",
    values="feedback_id",
    aggfunc="count",
    fill_value=0
)
```

## 23. Date Features And Time-Based Analysis

Add date features:

```python
analysis["submitted_date"] = analysis["submitted_at"].dt.date
analysis["submitted_month"] = analysis["submitted_at"].dt.to_period("M").astype(str)
analysis["submitted_day_name"] = analysis["submitted_at"].dt.day_name()
analysis["submitted_hour"] = analysis["submitted_at"].dt.hour
```

Feedback by day name:

```python
analysis.groupby("submitted_day_name")["feedback_id"].count()
```

Feedback by hour:

```python
analysis.groupby("submitted_hour")["feedback_id"].count().sort_index()
```

Monthly sentiment:

```python
analysis.pivot_table(
    index="submitted_month",
    columns="sentiment_label",
    values="feedback_id",
    aggfunc="count",
    fill_value=0
)
```

## 24. Resample Feedback Trends

Set a datetime index:

```python
analysis_ts = analysis.set_index("submitted_at").sort_index()
```

Daily feedback count:

```python
daily_feedback = analysis_ts["feedback_id"].resample("D").count()
daily_feedback.head()
```

Daily average rating:

```python
daily_rating = analysis_ts["rating"].resample("D").mean()
daily_rating.head()
```

Daily negative feedback:

```python
daily_negative = (
    analysis_ts["sentiment_label"]
    .eq("negative")
    .resample("D")
    .sum()
)
```

Combine daily metrics:

```python
daily_report = pd.DataFrame({
    "feedback_count": daily_feedback,
    "avg_rating": daily_rating,
    "negative_count": daily_negative
})

daily_report.head()
```

Rolling 7-day feedback volume:

```python
daily_report["feedback_7d_avg"] = daily_report["feedback_count"].rolling(
    7,
    min_periods=1
).mean()
```

Weekly report:

```python
weekly_report = analysis_ts.resample("W").agg(
    feedback_count=("feedback_id", "count"),
    avg_rating=("rating", "mean"),
    avg_response_hours=("response_hours", "mean")
)

weekly_report.round(2)
```

## 25. Response Time SLA Report

Create an SLA bucket:

```python
analysis["response_bucket"] = pd.cut(
    analysis["response_hours"],
    bins=[-0.01, 6, 24, 48, np.inf],
    labels=["0-6h", "6-24h", "24-48h", "48h+"]
)
```

SLA by channel:

```python
analysis.pivot_table(
    index="channel",
    columns="response_bucket",
    values="feedback_id",
    aggfunc="count",
    fill_value=0
)
```

Average response by post type:

```python
analysis.groupby("post_type", as_index=False).agg(
    feedback_count=("feedback_id", "count"),
    avg_response_hours=("response_hours", "mean"),
    late_responses=("late_response", "sum")
).round(2)
```

## 26. MultiIndex Text Summary

Group by series, post type, and sentiment:

```python
multi_summary = (
    analysis
    .groupby(["series", "post_type", "sentiment_label"])
    .agg(
        feedback_count=("feedback_id", "count"),
        avg_rating=("rating", "mean"),
        avg_word_count=("word_count", "mean")
    )
)

multi_summary
```

Select one series:

```python
multi_summary.loc["Pandas"]
```

Unstack sentiment:

```python
multi_summary["feedback_count"].unstack("sentiment_label", fill_value=0)
```

Reset index for export:

```python
multi_summary.reset_index().head()
```

## 27. Build A Simple Triage Score

Create a row-level score:

```python
analysis["triage_score"] = (
    analysis["mentions_bug"].astype(int) * 4
    + analysis["mentions_billing"].astype(int) * 3
    + analysis["mentions_confusion"].astype(int) * 2
    + analysis["sentiment_label"].eq("negative").astype(int) * 3
    + analysis["late_response"].fillna(False).astype(int)
    + analysis["exclamation_count"].clip(upper=3)
)
```

Assign priority:

```python
analysis["triage_priority"] = pd.cut(
    analysis["triage_score"],
    bins=[-1, 2, 5, 99],
    labels=["low", "medium", "high"]
)
```

Top messages to review:

```python
analysis.sort_values("triage_score", ascending=False)[[
    "feedback_id",
    "title",
    "sentiment_label",
    "triage_score",
    "triage_priority",
    "message_clean"
]].head(10)
```

## 28. Create Final Reports

Executive summary:

```python
executive_summary = pd.DataFrame({
    "metric": [
        "feedback_rows",
        "unique_posts",
        "average_rating",
        "negative_feedback",
        "billing_mentions",
        "bug_mentions",
        "confusion_mentions",
        "average_response_hours",
        "late_responses"
    ],
    "value": [
        analysis["feedback_id"].nunique(),
        analysis["post_id"].nunique(),
        round(analysis["rating"].mean(), 2),
        int(analysis["sentiment_label"].eq("negative").sum()),
        int(analysis["mentions_billing"].sum()),
        int(analysis["mentions_bug"].sum()),
        int(analysis["mentions_confusion"].sum()),
        round(analysis["response_hours"].mean(), 2),
        int(analysis["late_response"].sum())
    ]
})

executive_summary
```

Final reports dictionary:

```python
reports = {
    "executive_summary": executive_summary,
    "post_quality": post_quality,
    "author_report": author_report,
    "channel_report": channel_report,
    "topic_by_series": topic_by_series,
    "daily_report": daily_report,
    "weekly_report": weekly_report,
    "multi_summary": multi_summary.reset_index(),
}
```

Optional exports:

```python
# executive_summary.to_csv("text_case_executive_summary.csv", index=False)
# post_quality.to_csv("text_case_post_quality.csv", index=False)
# daily_report.to_csv("text_case_daily_report.csv")
```

## 29. Case Study Checklist

Use this checklist for text-heavy Pandas projects:

1. Load all raw files.
2. Concatenate same-schema monthly files.
3. Check shapes, types, missing values, and duplicates.
4. Parse timestamps with `errors="coerce"`.
5. Audit joins before merging.
6. Create a clean text column.
7. Lowercase and strip text.
8. Remove HTML, URLs, punctuation, and repeated spaces.
9. Normalize common short forms.
10. Extract useful patterns with regex.
11. Create text-length features.
12. Tokenize with `.str.split()`.
13. Use `explode()` for one-token-per-row analysis.
14. Remove stopwords.
15. Count words and phrases.
16. Build a document-term matrix.
17. Match topic rules.
18. Merge metadata.
19. Build post, author, channel, and topic reports.
20. Use pivots for cross-tab reports.
21. Use `resample()` for time trends.
22. Use rolling metrics for smoothing.
23. Create triage scores.
24. Create export-ready final tables.

## 30. Practice Tasks

Try these after completing the walkthrough.

### Task 1. Find the top 10 longest feedback messages.

```python
analysis.sort_values("word_count", ascending=False)[[
    "feedback_id",
    "title",
    "word_count",
    "message_clean"
]].head(10)
```
**Explanation**

- Sorts a DataFrame named 'analysis' by the 'word_count' column in descending order to prioritize longer entries
- Selects specific columns: 'feedback_id', 'title', 'word_count', and 'message_clean' for display
- Returns only the top 10 rows with the highest word counts using the head() method
- This approach helps quickly identify the most verbose feedback entries for detailed review
- The resulting DataFrame shows the longest feedback messages along with their identifying information


### Task 2. Find posts with average rating below 3.5.

```python
post_quality[post_quality["avg_rating"] < 3.5].sort_values("avg_rating")
```
**Explanation**

- Filters the post_quality DataFrame to include only rows where avg_rating is less than 3.5
- Sorts the filtered results by avg_rating column in ascending order (lowest ratings first)
- Returns a new DataFrame containing only poor-rated posts ordered from worst to best among the low-rated subset
- Useful for identifying underperforming content that needs improvement or review
- The chained operations allow for quick identification of posts requiring attention based on quality metrics


### Task 3. Find the most common token in negative feedback.

```python
meaningful_tokens[meaningful_tokens["sentiment_label"] == "negative"]["token"] \
    .value_counts() \
    .head(10)
```
**Explanation**

- Filters the meaningful_tokens DataFrame to include only rows where sentiment_label equals "negative"
- Selects the token column from the filtered results
- Counts the frequency of each unique token value using value_counts()
- Returns the top 10 most frequently occurring tokens with negative sentiment labels
- This analysis helps identify common words associated with negative sentiment in the dataset


### Task 4. Find messages that mention both bug and confusion.

```python
analysis[analysis["mentions_bug"] & analysis["mentions_confusion"]][[
    "feedback_id",
    "title",
    "message_clean"
]]
```
**Explanation**

- The code filters a DataFrame named 'analysis' to include only rows where both "mentions_bug" and "mentions_confusion" columns contain True values
- It then selects three specific columns: "feedback_id", "title", and "message_clean" from the filtered results
- The boolean filtering uses the & operator to combine two conditions, ensuring both criteria must be met
- This approach is commonly used for data analysis tasks where you need to isolate specific subsets of data based on multiple attributes
- The result returns a new DataFrame containing only the specified columns for entries that satisfy both boolean conditions


### Task 5. Find average response time by sentiment.

```python
analysis.groupby("sentiment_label", as_index=False)["response_hours"].mean().round(2)
```
**Explanation**

- Groups data by the "sentiment_label" column to aggregate responses by sentiment category
- Calculates the mean of "response_hours" for each sentiment group using the mean() function
- Rounds the resulting averages to 2 decimal places for cleaner presentation
- Returns a new DataFrame with sentiment labels and their corresponding average response times
- The as_index=False parameter ensures the grouping column remains a regular column rather than becoming the index


### Task 6. Find top bigrams for the Pandas series.

```python
pandas_feedback_ids = analysis[analysis["series"] == "Pandas"]["feedback_id"]

bigrams[bigrams["feedback_id"].isin(pandas_feedback_ids)]["bigram"] \
    .value_counts() \
    .head(10)
```
**Explanation**

- Filters the analysis dataframe to select only rows where the series column equals "Pandas" and extracts the corresponding feedback_id values
- Uses the extracted feedback_ids to filter the bigrams dataframe for matching entries
- Counts the frequency of each unique bigram in the filtered results
- Returns the top 10 most frequently occurring bigrams from the Pandas series feedback data


### Task 7. Build a post type and sentiment pivot.

```python
analysis.pivot_table(
    index="post_type",
    columns="sentiment_label",
    values="feedback_id",
    aggfunc="count",
    fill_value=0
)
```
**Explanation**

- Generates a cross-tabulation showing how feedback counts are distributed between different post types and sentiment labels
- Uses post_type as row identifiers and sentiment_label as column headers for structured data analysis
- Counts feedback_id occurrences within each combination of post type and sentiment category
- Fills missing combinations with zero values to ensure complete dataset representation
- Provides insights into which post types generate feedback with specific sentiment patterns


### Task 8. Find weekly negative feedback rate.

```python
weekly_negative_rate = (
    analysis_ts["sentiment_label"].eq("negative").resample("W").mean() * 100
)

weekly_negative_rate.round(2)
```
**Explanation**

- Creates a time series analysis by counting negative sentiment occurrences and resampling to weekly intervals
- Converts boolean True/False values to numeric 1/0 for calculation using eq("negative") method
- Applies mean() function to compute the average proportion of negative sentiments per week
- Multiplies by 100 to convert decimal proportions to percentage values
- Rounds final results to 2 decimal places for improved readability


### Task 9. Find posts with high review priority and many confusion mentions.

```python
post_quality.query("review_priority_score >= 5 and confusion_mentions > 0") \
    .sort_values("review_priority_score", ascending=False)
```
**Explanation**

- Filters customer reviews where the review priority score is 5 or higher and there are mentions of confusion
- Sorts the filtered results by review priority score in descending order to show highest priority items first
- Uses pandas query method to apply boolean conditions on the post_quality DataFrame
- Returns a sorted subset of reviews that meet both quality and confusion criteria for prioritized attention


### Task 10. Build a term-frequency table by sentiment.

```python
sentiment_token_counts = (
    meaningful_tokens
    .groupby(["sentiment_label", "token"])
    .size()
    .rename("count")
    .reset_index()
    .sort_values(["sentiment_label", "count"], ascending=[True, False])
)

sentiment_token_counts.groupby("sentiment_label").head(10)
```
**Explanation**

- Groups meaningful tokens by sentiment label and token identity to count occurrences of each token within each sentiment category
- Renames the count column for clarity and resets the index to create a clean dataframe structure
- Sorts results by sentiment label ascending and token count descending to prioritize highest frequency tokens within each sentiment group
- Uses groupby with head(10) to extract the top 10 most frequent tokens for each sentiment label, enabling comparative analysis across sentiment categories


## 31. Interview Questions From This Case Study

### 1. Why should text cleaning happen before token analysis?

Cleaning removes inconsistent casing, symbols, URLs, and formatting noise, so token counts represent meaning instead of messy variants.

### 2. Why use `.str` methods instead of Python string methods directly?

A Pandas Series is a column of values, not one string. `.str` applies string methods vectorized across the Series and handles missing values better.

### 3. Why use regex for HTML and URL removal?

HTML tags and URLs follow patterns, so regex can remove many variants with one expression.

### 4. Why should stopword removal happen after tokenization?

Stopwords are word-level items, so you need tokens before filtering them.

### 5. What does `explode()` do?

It turns each list-like value into multiple rows, which is perfect for token-level analysis.

### 6. What is a document-term matrix?

It is a table where each row is a document, each column is a token, and each value is the count of that token in that document.

### 7. What is TF-IDF?

TF-IDF increases the importance of words that are common in one document but not common across all documents.

### 8. Why use rule-based topic matching?

It gives a transparent first-pass classification system that analysts can inspect and improve.

### 9. Why audit joins before merging?

Join audits reveal missing post IDs, missing author IDs, or feedback records that will lose context after a merge.

### 10. Why use `validate` in `merge()`?

It checks that the relationship is what you expect, such as many feedback rows matching one post.

### 11. Why parse dates in a text case study?

Feedback is text, but it also arrives over time. Dates let you analyze trends, response times, and weekly quality changes.

### 12. Why create a triage score?

A triage score turns many signals into one review priority, helping teams decide what to handle first.

### 13. What is the limitation of simple keyword rules?

They can miss context, sarcasm, synonyms, and complex intent. They are useful for transparent baseline analysis, not perfect understanding.

### 14. Why keep raw and cleaned text?

Raw text preserves evidence. Cleaned text supports analysis.

### 15. Why is this a Pandas case study instead of an NLP project?

The focus is on data wrangling, quality checks, feature engineering, aggregation, and reporting using Pandas. NLP models can come later.

## 32. Final Mental Model

For textual data in Pandas, think in layers:

| Layer | Pandas Tools |
|---|---|
| Ingest | `read_csv`, `concat` |
| Inspect | `shape`, `info`, `describe`, `isna`, `duplicated` |
| Clean | `.str.lower`, `.str.strip`, `.str.replace` |
| Extract | `.str.extract`, `.str.contains`, regex |
| Tokenize | `.str.split`, `explode` |
| Count | `value_counts`, `groupby`, `size` |
| Vectorize | `unstack`, document-term matrix, TF-IDF |
| Enrich | `merge`, `validate`, `indicator` |
| Time | `to_datetime`, Timedelta, `resample`, rolling |
| Report | `agg`, `pivot_table`, sorted summaries |

If you remember only one thing, remember this:

Text analysis in Pandas is not just cleaning strings. It is a complete workflow for turning messy human language into structured signals that can be joined, counted, trended, and reported.

## Official References

- Pandas working with text data user guide: https://pandas.pydata.org/docs/user_guide/text.html
- Pandas `Series.str` API reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html
- Pandas missing data user guide: https://pandas.pydata.org/docs/user_guide/missing_data.html
- Pandas merging, joining, and concatenating user guide: https://pandas.pydata.org/docs/user_guide/merging.html
- Pandas GroupBy user guide: https://pandas.pydata.org/docs/user_guide/groupby.html
- Pandas time series user guide: https://pandas.pydata.org/docs/user_guide/timeseries.html