What types of feedback data does the learning platform collect?

The platform collects feedback on blog posts, coding lessons, projects, and interview-prep content, which includes messy text with mixed casing, HTML snippets, URLs, typos, repeated records, ratings, timestamps, and support categories.

What are some of the key questions this case study aims to answer?

The case study aims to answer questions such as which posts receive the most negative feedback, which authors get the strongest ratings, which topics learners are asking about, and which feedback messages mention specific issues like refund or bugs.

What skills and techniques does this Pandas case study cover?

The case study covers skills such as reading CSV files, merging data, cleaning text with regex, removing URLs and HTML, text normalization, tokenization, stopword filtering, creating document-term matrices, and performing datetime parsing and analysis.

What files are used in this case study?

The files used are pandastextcaseposts.csv, pandastextcaseauthors.csv, pandastextcasefeedbackjan.csv, pandastextcasefeedbackfeb.csv, and pandastextcasetopicrules.csv.

What is the purpose of the 'rules' table in the data model?

The 'rules' table contains keyword rules used to detect topics in feedback text.

Pandas Text Data Case Study: Clean and Analyze

Pandas Text Data Case Study: Clean, Explore, Vectorize, and Report

This is a full Pandas case study for working with textual data.

The scenario is simple:

A learning platform collects feedback from learners on blog posts, coding lessons, projects, and interview-prep content. The feedback arrives as messy text with mixed casing, HTML snippets, URLs, typos, repeated records, ratings, timestamps, and support categories.

Your job is to turn raw feedback into useful content-quality reports.

By the end, you will answer questions like:

Which posts receive the most negative feedback?
Which authors get the strongest ratings?
Which topics are learners asking about?
Which feedback messages mention refund, invoice, bugs, copyright, or confusion?
Which channels bring high-quality feedback?
Which words and phrases appear most often?
How do feedback volume and response time change over time?
Which posts need review first?

This case study uses original sample datasets created for this lesson. It does not use copied movie reviews, social media posts, ecommerce reviews, public datasets, or third-party course material.

Files used in this case study:

pandas_text_case_posts.csv
pandas_text_case_authors.csv
pandas_text_case_feedback_jan.csv
pandas_text_case_feedback_feb.csv
pandas_text_case_topic_rules.csv

Place all CSV files in the same folder as this Markdown file before running the examples.

What This Case Study Covers

This project brings together many Pandas skills:

read_csv
concat
merge
validate
indicator=True
shape, info, describe
missing values
duplicate rows
string cleaning with .str
regex replacement
regex extraction
URL and HTML removal
punctuation removal
text normalization
text feature engineering
tokenization with .str.split()
explode
stopword filtering
word counts
n-grams
keyword rule matching
document-term matrices
simple TF-IDF with Pandas and NumPy
datetime parsing
Timedelta response-time analysis
groupby
named aggregation
transform
pivot_table
set_index
resample
rolling averages
MultiIndex-style summaries
final export-ready reports

1. Import Libraries

python

import pandas as pd
import numpy as np

Optional display settings:

python

pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)

2. Load The Data

python

posts = pd.read_csv("pandas_text_case_posts.csv")
authors = pd.read_csv("pandas_text_case_authors.csv")
rules = pd.read_csv("pandas_text_case_topic_rules.csv")
jan = pd.read_csv("pandas_text_case_feedback_jan.csv")
feb = pd.read_csv("pandas_text_case_feedback_feb.csv")

Check the shapes:

python

for name, df in {
    "posts": posts,
    "authors": authors,
    "rules": rules,
    "jan": jan,
    "feb": feb,
}.items():
    print(name, df.shape)

Preview the data:

python

posts.head()

python

jan.head()

python

rules.head()

3. Understand The Data Model

The data has three supporting tables:

posts: one row per content item
authors: one row per content author
rules: keyword rules used to detect topics in feedback text

It has two monthly feedback tables:

jan: feedback received in January
feb: feedback received in February

Important keys:

Table	Key Columns
posts	`post_id`, `author_id`
authors	`author_id`
feedback	`feedback_id`, `post_id`
rules	`rule_id`

4. Inspect Raw Tables

python

posts.info()

python

jan.info()

python

jan.describe(include="object")

Check numeric fields:

python

jan.describe(numeric_only=True)

Check sentiment labels:

python

jan["sentiment_label"].value_counts(dropna=False)

5. Concatenate Monthly Feedback

January and February feedback have the same schema.

python

feedback = pd.concat([jan, feb], ignore_index=True)
feedback.shape

Keep source month when needed:

python

feedback_with_source = pd.concat(
    [jan, feb],
    keys=["Jan", "Feb"]
).reset_index(level=0).rename(columns={"level_0": "source_month"})

feedback_with_source.head()

For the main workflow:

python

feedback = pd.concat([jan, feb], ignore_index=True)

6. Data Quality Checks

Check duplicate full rows:

python

feedback.duplicated().sum()

Drop exact duplicates:

python

feedback = feedback.drop_duplicates().copy()

Check duplicate feedback IDs:

python

feedback["feedback_id"].duplicated().sum()

Check missing values:

python

feedback.isna().sum()

Check unknown post IDs before merging:

python

feedback.merge(posts, how="left", on="post_id", indicator=True)["_merge"].value_counts()

Find feedback rows pointing to missing posts:

python

feedback.merge(posts, how="left", on="post_id", indicator=True) \
    .query("_merge == 'left_only'")[["feedback_id", "post_id", "raw_message"]]

Do not hide this. Data quality findings are part of the case study.

7. Parse Dates

Convert timestamps:

python

feedback["submitted_at"] = pd.to_datetime(feedback["submitted_at"], errors="coerce")
feedback["responded_at"] = pd.to_datetime(feedback["responded_at"], errors="coerce")
posts["published_at"] = pd.to_datetime(posts["published_at"], errors="coerce")

Find invalid submitted dates:

python

feedback[feedback["submitted_at"].isna()]

Keep valid submitted feedback for time analysis:

python

feedback_valid = feedback.dropna(subset=["submitted_at"]).copy()

Create response-time features:

python

feedback_valid["response_time"] = feedback_valid["responded_at"] - feedback_valid["submitted_at"]
feedback_valid["response_hours"] = (
    feedback_valid["response_time"].dt.total_seconds() / 3600
).round(2)

Flag feedback not answered within 24 hours:

python

feedback_valid["late_response"] = feedback_valid["response_hours"] > 24

Rows with missing responded_at will have missing response hours.

8. Clean Column Names

Column names can also be text data.

python

feedback_valid.columns = (
    feedback_valid.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

This dataset already has clean column names, but the pattern is useful in real projects.

9. Clean Raw Feedback Text

Start with a copy:

python

text_df = feedback_valid.copy()

Inspect raw text:

python

text_df["raw_message"].head(10)

Create a cleaned text column:

python

text_df["message_clean"] = text_df["raw_message"].fillna("")

Lowercase:

python

text_df["message_clean"] = text_df["message_clean"].str.lower()

Remove leading and trailing spaces:

python

text_df["message_clean"] = text_df["message_clean"].str.strip()

Remove HTML tags:

python

text_df["message_clean"] = text_df["message_clean"].str.replace(
    r"<[^>]+>",
    " ",
    regex=True
)

Remove URLs:

python

text_df["message_clean"] = text_df["message_clean"].str.replace(
    r"https?://\S+|www\.\S+",
    " ",
    regex=True
)

Normalize a few common contractions and short forms:

python

replacements = {
    r"\bcan't\b": "cannot",
    r"\bwon't\b": "will not",
    r"\bi'm\b": "i am",
    r"\bi've\b": "i have",
    r"\bpls\b": "please",
    r"\bplz\b": "please",
}

text_df["message_clean"] = text_df["message_clean"].replace(
    replacements,
    regex=True
)

Remove punctuation and symbols:

python

text_df["message_clean"] = text_df["message_clean"].str.replace(
    r"[^a-z0-9\s]",
    " ",
    regex=True
)

Collapse repeated spaces:

python

text_df["message_clean"] = text_df["message_clean"].str.replace(
    r"\s+",
    " ",
    regex=True
).str.strip()

Compare raw and cleaned text:

python

text_df[["raw_message", "message_clean"]].head(10)

10. Text Quality Features

Create length features:

python

text_df["char_count_raw"] = text_df["raw_message"].fillna("").str.len()
text_df["char_count_clean"] = text_df["message_clean"].str.len()

Word count:

python

text_df["word_count"] = text_df["message_clean"].str.split().str.len()

Count exclamation marks in raw text:

python

text_df["exclamation_count"] = text_df["raw_message"].fillna("").str.count("!")

Check whether the raw message had a URL:

python

text_df["has_url"] = text_df["raw_message"].fillna("").str.contains(
    r"https?://|www\.",
    regex=True,
    case=False,
    na=False
)

Check whether the raw message had HTML:

python

text_df["has_html"] = text_df["raw_message"].fillna("").str.contains(
    r"<[^>]+>",
    regex=True,
    na=False
)

Feature summary:

python

text_df[[
    "feedback_id",
    "char_count_raw",
    "char_count_clean",
    "word_count",
    "exclamation_count",
    "has_url",
    "has_html"
]].head()

11. Extract Useful Text Patterns

Extract email-like text if a user typed one into the message:

python

text_df["mentioned_email"] = text_df["raw_message"].str.extract(
    r"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)

Extract issue IDs like ERR-204:

python

text_df["issue_code"] = text_df["raw_message"].str.extract(
    r"\b(ERR-\d{3})\b"
)

Find messages that mention money or billing:

python

text_df["mentions_billing"] = text_df["message_clean"].str.contains(
    r"\b(?:refund|invoice|payment|paid|money|billing)\b",
    regex=True,
    na=False
)

Find messages that mention bugs:

python

text_df["mentions_bug"] = text_df["message_clean"].str.contains(
    r"\b(?:error|bug|crash|broken|not working|err)\b",
    regex=True,
    na=False
)

Find messages that mention content clarity:

python

text_df["mentions_confusion"] = text_df["message_clean"].str.contains(
    r"\b(?:confusing|unclear|hard|lost|explain|example)\b",
    regex=True,
    na=False
)

12. Tokenize Text

Tokenization means splitting text into words.

python

text_df["tokens"] = text_df["message_clean"].str.split()

Inspect:

python

text_df[["feedback_id", "tokens"]].head()

Explode tokens into one row per token:

python

tokens = text_df[["feedback_id", "post_id", "sentiment_label", "tokens"]].explode("tokens")
tokens = tokens.rename(columns={"tokens": "token"})
tokens.head()

Remove missing or empty tokens:

python

tokens = tokens.dropna(subset=["token"])
tokens = tokens[tokens["token"].str.len() > 0]

13. Remove Stopwords

Use a small custom stopword list.

python

stopwords = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "can", "for", "from", "has", "have", "i", "in", "is",
    "it", "my", "of", "on", "or", "please", "so", "that",
    "the", "this", "to", "was", "with", "you", "your"
}

Filter:

python

meaningful_tokens = tokens[~tokens["token"].isin(stopwords)].copy()

Top words:

python

meaningful_tokens["token"].value_counts().head(20)

Top words by sentiment:

python

meaningful_tokens.groupby("sentiment_label")["token"].value_counts().groupby(level=0).head(10)

14. N-Grams With Pandas

An n-gram is a sequence of words.

For bigrams, use the current token plus the next token inside the same feedback message.

python

meaningful_tokens = meaningful_tokens.sort_values(["feedback_id"]).copy()
meaningful_tokens["next_token"] = meaningful_tokens.groupby("feedback_id")["token"].shift(-1)
meaningful_tokens["bigram"] = meaningful_tokens["token"] + " " + meaningful_tokens["next_token"]

Remove rows where there is no next token:

python

bigrams = meaningful_tokens.dropna(subset=["next_token"]).copy()

Top bigrams:

python

bigrams["bigram"].value_counts().head(20)

Top bigrams by sentiment:

python

bigrams.groupby("sentiment_label")["bigram"].value_counts().groupby(level=0).head(10)

15. Build A Document-Term Matrix

A document-term matrix has:

one row per document
one column per token
values showing token counts

python

token_counts = (
    meaningful_tokens
    .groupby(["feedback_id", "token"])
    .size()
    .rename("count")
)

Convert to wide format:

python

dtm = token_counts.unstack(fill_value=0)
dtm.head()

Keep the most common tokens:

python

top_tokens = meaningful_tokens["token"].value_counts().head(25).index
dtm_top = dtm.reindex(columns=top_tokens, fill_value=0)
dtm_top.head()

This is a Pandas-native version of a simple bag-of-words table.

16. Simple TF-IDF With Pandas

Term frequency:

python

tf = dtm_top.div(dtm_top.sum(axis=1).replace(0, np.nan), axis=0).fillna(0)

Document frequency:

python

doc_freq = (dtm_top > 0).sum(axis=0)

Inverse document frequency:

python

idf = np.log((1 + len(dtm_top)) / (1 + doc_freq)) + 1

TF-IDF:

python

tfidf = tf * idf
tfidf.head()

Top TF-IDF token per feedback:

python

top_tfidf = tfidf.idxmax(axis=1).rename("top_keyword").reset_index()
top_tfidf.head()

This is not a full NLP library, but it teaches the idea using Pandas.

17. Match Topic Rules

The topic rules table has regex patterns.

python

rules

Create a long table of rule hits.

python

rule_hits = []

for row in rules.itertuples(index=False):
    mask = text_df["message_clean"].str.contains(row.pattern, regex=True, na=False)
    matched = text_df.loc[mask, ["feedback_id", "post_id", "message_clean"]].copy()
    matched["rule_id"] = row.rule_id
    matched["topic"] = row.topic
    matched["severity"] = row.severity
    matched["recommended_action"] = row.recommended_action
    rule_hits.append(matched)

rule_hits = pd.concat(rule_hits, ignore_index=True)
rule_hits.head()

Topic counts:

python

rule_hits["topic"].value_counts()

Severity counts:

python

rule_hits["severity"].value_counts()

Topics by post:

python

rule_hits.groupby(["post_id", "topic"]).size().reset_index(name="mentions")

18. Merge With Posts And Authors

Audit post matches:

python

text_df.merge(posts, how="left", on="post_id", indicator=True)["_merge"].value_counts()

Build the analysis table:

python

analysis = (
    text_df
    .merge(posts, how="left", on="post_id", validate="many_to_one")
    .merge(authors, how="left", on="author_id", validate="many_to_one")
)

analysis.head()

Select useful columns:

python

analysis = analysis[[
    "feedback_id",
    "post_id",
    "slug",
    "title",
    "series",
    "post_type",
    "author",
    "submitted_at",
    "responded_at",
    "response_hours",
    "late_response",
    "rating",
    "sentiment_label",
    "channel",
    "status",
    "message_clean",
    "word_count",
    "char_count_clean",
    "exclamation_count",
    "has_url",
    "has_html",
    "mentions_billing",
    "mentions_bug",
    "mentions_confusion"
]]

analysis.head()

19. Post-Level Content Quality Report

Create a post-level report:

python

post_quality = (
    analysis
    .groupby(["post_id", "title", "series", "post_type", "author"], as_index=False)
    .agg(
        feedback_count=("feedback_id", "count"),
        avg_rating=("rating", "mean"),
        negative_count=("sentiment_label", lambda s: (s == "negative").sum()),
        positive_count=("sentiment_label", lambda s: (s == "positive").sum()),
        avg_word_count=("word_count", "mean"),
        billing_mentions=("mentions_billing", "sum"),
        bug_mentions=("mentions_bug", "sum"),
        confusion_mentions=("mentions_confusion", "sum"),
        late_responses=("late_response", "sum")
    )
)

post_quality["avg_rating"] = post_quality["avg_rating"].round(2)
post_quality["avg_word_count"] = post_quality["avg_word_count"].round(2)

Create a review priority score:

python

post_quality["review_priority_score"] = (
    post_quality["negative_count"] * 3
    + post_quality["bug_mentions"] * 3
    + post_quality["confusion_mentions"] * 2
    + post_quality["billing_mentions"] * 2
    + post_quality["late_responses"]
)

post_quality.sort_values("review_priority_score", ascending=False)

This table tells the content team what to improve first.

20. Author-Level Report

python

author_report = (
    analysis
    .groupby("author", as_index=False)
    .agg(
        posts=("post_id", "nunique"),
        feedback_count=("feedback_id", "count"),
        avg_rating=("rating", "mean"),
        negative_count=("sentiment_label", lambda s: (s == "negative").sum()),
        avg_response_hours=("response_hours", "mean")
    )
)

author_report["avg_rating"] = author_report["avg_rating"].round(2)
author_report["avg_response_hours"] = author_report["avg_response_hours"].round(2)
author_report.sort_values("feedback_count", ascending=False)

21. Channel And Sentiment Report

python

channel_report = (
    analysis
    .groupby(["channel", "sentiment_label"], as_index=False)
    .agg(
        feedback_count=("feedback_id", "count"),
        avg_rating=("rating", "mean"),
        avg_word_count=("word_count", "mean")
    )
)

channel_report["avg_rating"] = channel_report["avg_rating"].round(2)
channel_report["avg_word_count"] = channel_report["avg_word_count"].round(2)
channel_report

Pivot version:

python

analysis.pivot_table(
    index="channel",
    columns="sentiment_label",
    values="feedback_id",
    aggfunc="count",
    fill_value=0
)

22. Topic Report From Rule Hits

Merge rule hits with post metadata:

python

topic_report_base = (
    rule_hits
    .merge(posts, how="left", on="post_id", validate="many_to_one")
    .merge(authors, how="left", on="author_id", validate="many_to_one")
)

Topic count by series:

python

topic_by_series = (
    topic_report_base
    .groupby(["series", "topic"], as_index=False)
    .agg(mentions=("feedback_id", "count"))
    .sort_values(["series", "mentions"], ascending=[True, False])
)

topic_by_series

Topic severity report:

python

topic_report_base.pivot_table(
    index="topic",
    columns="severity",
    values="feedback_id",
    aggfunc="count",
    fill_value=0
)

23. Date Features And Time-Based Analysis

Add date features:

python

analysis["submitted_date"] = analysis["submitted_at"].dt.date
analysis["submitted_month"] = analysis["submitted_at"].dt.to_period("M").astype(str)
analysis["submitted_day_name"] = analysis["submitted_at"].dt.day_name()
analysis["submitted_hour"] = analysis["submitted_at"].dt.hour

Feedback by day name:

python

analysis.groupby("submitted_day_name")["feedback_id"].count()

Feedback by hour:

python

analysis.groupby("submitted_hour")["feedback_id"].count().sort_index()

Monthly sentiment:

python

analysis.pivot_table(
    index="submitted_month",
    columns="sentiment_label",
    values="feedback_id",
    aggfunc="count",
    fill_value=0
)

24. Resample Feedback Trends

Set a datetime index:

python

analysis_ts = analysis.set_index("submitted_at").sort_index()

Daily feedback count:

python

daily_feedback = analysis_ts["feedback_id"].resample("D").count()
daily_feedback.head()

Daily average rating:

python

daily_rating = analysis_ts["rating"].resample("D").mean()
daily_rating.head()

Daily negative feedback:

python

daily_negative = (
    analysis_ts["sentiment_label"]
    .eq("negative")
    .resample("D")
    .sum()
)

Combine daily metrics:

python

daily_report = pd.DataFrame({
    "feedback_count": daily_feedback,
    "avg_rating": daily_rating,
    "negative_count": daily_negative
})

daily_report.head()

Rolling 7-day feedback volume:

python

daily_report["feedback_7d_avg"] = daily_report["feedback_count"].rolling(
    7,
    min_periods=1
).mean()

Weekly report:

python

weekly_report = analysis_ts.resample("W").agg(
    feedback_count=("feedback_id", "count"),
    avg_rating=("rating", "mean"),
    avg_response_hours=("response_hours", "mean")
)

weekly_report.round(2)

25. Response Time SLA Report

Create an SLA bucket:

python

analysis["response_bucket"] = pd.cut(
    analysis["response_hours"],
    bins=[-0.01, 6, 24, 48, np.inf],
    labels=["0-6h", "6-24h", "24-48h", "48h+"]
)

SLA by channel:

python

analysis.pivot_table(
    index="channel",
    columns="response_bucket",
    values="feedback_id",
    aggfunc="count",
    fill_value=0
)

Average response by post type:

python

analysis.groupby("post_type", as_index=False).agg(
    feedback_count=("feedback_id", "count"),
    avg_response_hours=("response_hours", "mean"),
    late_responses=("late_response", "sum")
).round(2)

26. MultiIndex Text Summary

Group by series, post type, and sentiment:

python

multi_summary = (
    analysis
    .groupby(["series", "post_type", "sentiment_label"])
    .agg(
        feedback_count=("feedback_id", "count"),
        avg_rating=("rating", "mean"),
        avg_word_count=("word_count", "mean")
    )
)

multi_summary

Select one series:

python

multi_summary.loc["Pandas"]

Unstack sentiment:

python

multi_summary["feedback_count"].unstack("sentiment_label", fill_value=0)

Reset index for export:

python

multi_summary.reset_index().head()

27. Build A Simple Triage Score

Create a row-level score:

python

analysis["triage_score"] = (
    analysis["mentions_bug"].astype(int) * 4
    + analysis["mentions_billing"].astype(int) * 3
    + analysis["mentions_confusion"].astype(int) * 2
    + analysis["sentiment_label"].eq("negative").astype(int) * 3
    + analysis["late_response"].fillna(False).astype(int)
    + analysis["exclamation_count"].clip(upper=3)
)

Assign priority:

python

analysis["triage_priority"] = pd.cut(
    analysis["triage_score"],
    bins=[-1, 2, 5, 99],
    labels=["low", "medium", "high"]
)

Top messages to review:

python

analysis.sort_values("triage_score", ascending=False)[[
    "feedback_id",
    "title",
    "sentiment_label",
    "triage_score",
    "triage_priority",
    "message_clean"
]].head(10)

28. Create Final Reports

Executive summary:

python

executive_summary = pd.DataFrame({
    "metric": [
        "feedback_rows",
        "unique_posts",
        "average_rating",
        "negative_feedback",
        "billing_mentions",
        "bug_mentions",
        "confusion_mentions",
        "average_response_hours",
        "late_responses"
    ],
    "value": [
        analysis["feedback_id"].nunique(),
        analysis["post_id"].nunique(),
        round(analysis["rating"].mean(), 2),
        int(analysis["sentiment_label"].eq("negative").sum()),
        int(analysis["mentions_billing"].sum()),
        int(analysis["mentions_bug"].sum()),
        int(analysis["mentions_confusion"].sum()),
        round(analysis["response_hours"].mean(), 2),
        int(analysis["late_response"].sum())
    ]
})

executive_summary

Final reports dictionary:

python

reports = {
    "executive_summary": executive_summary,
    "post_quality": post_quality,
    "author_report": author_report,
    "channel_report": channel_report,
    "topic_by_series": topic_by_series,
    "daily_report": daily_report,
    "weekly_report": weekly_report,
    "multi_summary": multi_summary.reset_index(),
}

Optional exports:

python

# executive_summary.to_csv("text_case_executive_summary.csv", index=False)
# post_quality.to_csv("text_case_post_quality.csv", index=False)
# daily_report.to_csv("text_case_daily_report.csv")

29. Case Study Checklist

Use this checklist for text-heavy Pandas projects:

Load all raw files.
Concatenate same-schema monthly files.
Check shapes, types, missing values, and duplicates.
Parse timestamps with errors="coerce".
Audit joins before merging.
Create a clean text column.
Lowercase and strip text.
Remove HTML, URLs, punctuation, and repeated spaces.
Normalize common short forms.
Extract useful patterns with regex.
Create text-length features.
Tokenize with .str.split().
Use explode() for one-token-per-row analysis.
Remove stopwords.
Count words and phrases.
Build a document-term matrix.
Match topic rules.
Merge metadata.
Build post, author, channel, and topic reports.
Use pivots for cross-tab reports.
Use resample() for time trends.
Use rolling metrics for smoothing.
Create triage scores.
Create export-ready final tables.

30. Practice Tasks

Try these after completing the walkthrough.

Practice Lab

Task 1. Find the top 10 longest feedback messages.

python

analysis.sort_values("word_count", ascending=False)[[
    "feedback_id",
    "title",
    "word_count",
    "message_clean"
]].head(10)

Explanation

Sorts a DataFrame named 'analysis' by the 'word_count' column in descending order to prioritize longer entries
Selects specific columns: 'feedback_id', 'title', 'word_count', and 'message_clean' for display
Returns only the top 10 rows with the highest word counts using the head() method
This approach helps quickly identify the most verbose feedback entries for detailed review
The resulting DataFrame shows the longest feedback messages along with their identifying information

Practice Lab

Task 2. Find posts with average rating below 3.5.

python

post_quality[post_quality["avg_rating"] < 3.5].sort_values("avg_rating")

Explanation

Filters the post_quality DataFrame to include only rows where avg_rating is less than 3.5
Sorts the filtered results by avg_rating column in ascending order (lowest ratings first)
Returns a new DataFrame containing only poor-rated posts ordered from worst to best among the low-rated subset
Useful for identifying underperforming content that needs improvement or review
The chained operations allow for quick identification of posts requiring attention based on quality metrics

Practice Lab

Task 3. Find the most common token in negative feedback.

python

meaningful_tokens[meaningful_tokens["sentiment_label"] == "negative"]["token"] \
    .value_counts() \
    .head(10)

Explanation

Filters the meaningful_tokens DataFrame to include only rows where sentiment_label equals "negative"
Selects the token column from the filtered results
Counts the frequency of each unique token value using value_counts()
Returns the top 10 most frequently occurring tokens with negative sentiment labels
This analysis helps identify common words associated with negative sentiment in the dataset

Practice Lab

Task 4. Find messages that mention both bug and confusion.

python

analysis[analysis["mentions_bug"] & analysis["mentions_confusion"]][[
    "feedback_id",
    "title",
    "message_clean"
]]

Explanation

The code filters a DataFrame named 'analysis' to include only rows where both "mentions_bug" and "mentions_confusion" columns contain True values
It then selects three specific columns: "feedback_id", "title", and "message_clean" from the filtered results
The boolean filtering uses the & operator to combine two conditions, ensuring both criteria must be met
This approach is commonly used for data analysis tasks where you need to isolate specific subsets of data based on multiple attributes
The result returns a new DataFrame containing only the specified columns for entries that satisfy both boolean conditions

Practice Lab

Task 5. Find average response time by sentiment.

python

analysis.groupby("sentiment_label", as_index=False)["response_hours"].mean().round(2)

Explanation

Groups data by the "sentiment_label" column to aggregate responses by sentiment category
Calculates the mean of "response_hours" for each sentiment group using the mean() function
Rounds the resulting averages to 2 decimal places for cleaner presentation
Returns a new DataFrame with sentiment labels and their corresponding average response times
The as_index=False parameter ensures the grouping column remains a regular column rather than becoming the index

Practice Lab

Task 6. Find top bigrams for the Pandas series.

python

pandas_feedback_ids = analysis[analysis["series"] == "Pandas"]["feedback_id"]

bigrams[bigrams["feedback_id"].isin(pandas_feedback_ids)]["bigram"] \
    .value_counts() \
    .head(10)

Explanation

Filters the analysis dataframe to select only rows where the series column equals "Pandas" and extracts the corresponding feedback_id values
Uses the extracted feedback_ids to filter the bigrams dataframe for matching entries
Counts the frequency of each unique bigram in the filtered results
Returns the top 10 most frequently occurring bigrams from the Pandas series feedback data

Practice Lab

Task 7. Build a post type and sentiment pivot.

python

analysis.pivot_table(
    index="post_type",
    columns="sentiment_label",
    values="feedback_id",
    aggfunc="count",
    fill_value=0
)

Explanation

Generates a cross-tabulation showing how feedback counts are distributed between different post types and sentiment labels
Uses post_type as row identifiers and sentiment_label as column headers for structured data analysis
Counts feedback_id occurrences within each combination of post type and sentiment category
Fills missing combinations with zero values to ensure complete dataset representation
Provides insights into which post types generate feedback with specific sentiment patterns

Practice Lab

Task 8. Find weekly negative feedback rate.

python

weekly_negative_rate = (
    analysis_ts["sentiment_label"].eq("negative").resample("W").mean() * 100
)

weekly_negative_rate.round(2)

Explanation

Creates a time series analysis by counting negative sentiment occurrences and resampling to weekly intervals
Converts boolean True/False values to numeric 1/0 for calculation using eq("negative") method
Applies mean() function to compute the average proportion of negative sentiments per week
Multiplies by 100 to convert decimal proportions to percentage values
Rounds final results to 2 decimal places for improved readability

Practice Lab

Task 9. Find posts with high review priority and many confusion mentions.

python

post_quality.query("review_priority_score >= 5 and confusion_mentions > 0") \
    .sort_values("review_priority_score", ascending=False)

Explanation

Filters customer reviews where the review priority score is 5 or higher and there are mentions of confusion
Sorts the filtered results by review priority score in descending order to show highest priority items first
Uses pandas query method to apply boolean conditions on the post_quality DataFrame
Returns a sorted subset of reviews that meet both quality and confusion criteria for prioritized attention

Practice Lab

Task 10. Build a term-frequency table by sentiment.

python

sentiment_token_counts = (
    meaningful_tokens
    .groupby(["sentiment_label", "token"])
    .size()
    .rename("count")
    .reset_index()
    .sort_values(["sentiment_label", "count"], ascending=[True, False])
)

sentiment_token_counts.groupby("sentiment_label").head(10)

Explanation

Groups meaningful tokens by sentiment label and token identity to count occurrences of each token within each sentiment category
Renames the count column for clarity and resets the index to create a clean dataframe structure
Sorts results by sentiment label ascending and token count descending to prioritize highest frequency tokens within each sentiment group
Uses groupby with head(10) to extract the top 10 most frequent tokens for each sentiment label, enabling comparative analysis across sentiment categories

Layer	Pandas Tools
Ingest	`read_csv`, `concat`
Inspect	`shape`, `info`, `describe`, `isna`, `duplicated`
Clean	`.str.lower`, `.str.strip`, `.str.replace`
Extract	`.str.extract`, `.str.contains`, regex
Tokenize	`.str.split`, `explode`
Count	`value_counts`, `groupby`, `size`
Vectorize	`unstack`, document-term matrix, TF-IDF
Enrich	`merge`, `validate`, `indicator`
Time	`to_datetime`, Timedelta, `resample`, rolling
Report	`agg`, `pivot_table`, sorted summaries

If you remember only one thing, remember this:

Text analysis in Pandas is not just cleaning strings. It is a complete workflow for turning messy human language into structured signals that can be joined, counted, trended, and reported.

Official References

Pandas working with text data user guide: https://pandas.pydata.org/docs/user_guide/text.html
Pandas Series.str API reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html
Pandas missing data user guide: https://pandas.pydata.org/docs/user_guide/missing_data.html
Pandas merging, joining, and concatenating user guide: https://pandas.pydata.org/docs/user_guide/merging.html
Pandas GroupBy user guide: https://pandas.pydata.org/docs/user_guide/groupby.html
Pandas time series user guide: https://pandas.pydata.org/docs/user_guide/timeseries.html

Pandas Text Data Case Study: Clean and Analyze Feedback

AI Insights

Task 1. Find the top 10 longest feedback messages.

Task 2. Find posts with average rating below 3.5.

Task 3. Find the most common token in negative feedback.

Task 4. Find messages that mention both bug and confusion.

Task 5. Find average response time by sentiment.

Task 6. Find top bigrams for the Pandas series.

Task 7. Build a post type and sentiment pivot.

Task 8. Find weekly negative feedback rate.

Task 9. Find posts with high review priority and many confusion mentions.

Task 10. Build a term-frequency table by sentiment.