Pandas Text Data Case Study: Clean, Explore, Vectorize, and Report
This is a full Pandas case study for working with textual data.
The scenario is simple:
A learning platform collects feedback from learners on blog posts, coding lessons, projects, and interview-prep content. The feedback arrives as messy text with mixed casing, HTML snippets, URLs, typos, repeated records, ratings, timestamps, and support categories.
Your job is to turn raw feedback into useful content-quality reports.
By the end, you will answer questions like:
- Which posts receive the most negative feedback?
- Which authors get the strongest ratings?
- Which topics are learners asking about?
- Which feedback messages mention refund, invoice, bugs, copyright, or confusion?
- Which channels bring high-quality feedback?
- Which words and phrases appear most often?
- How do feedback volume and response time change over time?
- Which posts need review first?
This case study uses original sample datasets created for this lesson. It does not use copied movie reviews, social media posts, ecommerce reviews, public datasets, or third-party course material.
Files used in this case study:
pandas_text_case_posts.csvpandas_text_case_authors.csvpandas_text_case_feedback_jan.csvpandas_text_case_feedback_feb.csvpandas_text_case_topic_rules.csv
Place all CSV files in the same folder as this Markdown file before running the examples.
What This Case Study Covers
This project brings together many Pandas skills:
read_csvconcatmergevalidateindicator=Trueshape,info,describe- missing values
- duplicate rows
- string cleaning with
.str - regex replacement
- regex extraction
- URL and HTML removal
- punctuation removal
- text normalization
- text feature engineering
- tokenization with
.str.split() explode- stopword filtering
- word counts
- n-grams
- keyword rule matching
- document-term matrices
- simple TF-IDF with Pandas and NumPy
- datetime parsing
- Timedelta response-time analysis
groupby- named aggregation
transformpivot_tableset_indexresample- rolling averages
- MultiIndex-style summaries
- final export-ready reports
1. Import Libraries
import pandas as pd
import numpy as npOptional display settings:
pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)2. Load The Data
posts = pd.read_csv("pandas_text_case_posts.csv")
authors = pd.read_csv("pandas_text_case_authors.csv")
rules = pd.read_csv("pandas_text_case_topic_rules.csv")
jan = pd.read_csv("pandas_text_case_feedback_jan.csv")
feb = pd.read_csv("pandas_text_case_feedback_feb.csv")Check the shapes:
for name, df in {
"posts": posts,
"authors": authors,
"rules": rules,
"jan": jan,
"feb": feb,
}.items():
print(name, df.shape)Preview the data:
posts.head()jan.head()rules.head()3. Understand The Data Model
The data has three supporting tables:
posts: one row per content itemauthors: one row per content authorrules: keyword rules used to detect topics in feedback text
It has two monthly feedback tables:
jan: feedback received in Januaryfeb: feedback received in February
Important keys:
| Table | Key Columns |
|---|---|
| posts | post_id, author_id |
| authors | author_id |
| feedback | feedback_id, post_id |
| rules | rule_id |
4. Inspect Raw Tables
posts.info()jan.info()jan.describe(include="object")Check numeric fields:
jan.describe(numeric_only=True)Check sentiment labels:
jan["sentiment_label"].value_counts(dropna=False)5. Concatenate Monthly Feedback
January and February feedback have the same schema.
feedback = pd.concat([jan, feb], ignore_index=True)
feedback.shapeKeep source month when needed:
feedback_with_source = pd.concat(
[jan, feb],
keys=["Jan", "Feb"]
).reset_index(level=0).rename(columns={"level_0": "source_month"})
feedback_with_source.head()For the main workflow:
feedback = pd.concat([jan, feb], ignore_index=True)6. Data Quality Checks
Check duplicate full rows:
feedback.duplicated().sum()Drop exact duplicates:
feedback = feedback.drop_duplicates().copy()Check duplicate feedback IDs:
feedback["feedback_id"].duplicated().sum()Check missing values:
feedback.isna().sum()Check unknown post IDs before merging:
feedback.merge(posts, how="left", on="post_id", indicator=True)["_merge"].value_counts()Find feedback rows pointing to missing posts:
feedback.merge(posts, how="left", on="post_id", indicator=True) \
.query("_merge == 'left_only'")[["feedback_id", "post_id", "raw_message"]]Do not hide this. Data quality findings are part of the case study.
7. Parse Dates
Convert timestamps:
feedback["submitted_at"] = pd.to_datetime(feedback["submitted_at"], errors="coerce")
feedback["responded_at"] = pd.to_datetime(feedback["responded_at"], errors="coerce")
posts["published_at"] = pd.to_datetime(posts["published_at"], errors="coerce")Find invalid submitted dates:
feedback[feedback["submitted_at"].isna()]Keep valid submitted feedback for time analysis:
feedback_valid = feedback.dropna(subset=["submitted_at"]).copy()Create response-time features:
feedback_valid["response_time"] = feedback_valid["responded_at"] - feedback_valid["submitted_at"]
feedback_valid["response_hours"] = (
feedback_valid["response_time"].dt.total_seconds() / 3600
).round(2)Flag feedback not answered within 24 hours:
feedback_valid["late_response"] = feedback_valid["response_hours"] > 24Rows with missing responded_at will have missing response hours.
8. Clean Column Names
Column names can also be text data.
feedback_valid.columns = (
feedback_valid.columns
.str.strip()
.str.lower()
.str.replace(" ", "_", regex=False)
)This dataset already has clean column names, but the pattern is useful in real projects.
9. Clean Raw Feedback Text
Start with a copy:
text_df = feedback_valid.copy()Inspect raw text:
text_df["raw_message"].head(10)Create a cleaned text column:
text_df["message_clean"] = text_df["raw_message"].fillna("")Lowercase:
text_df["message_clean"] = text_df["message_clean"].str.lower()Remove leading and trailing spaces:
text_df["message_clean"] = text_df["message_clean"].str.strip()Remove HTML tags:
text_df["message_clean"] = text_df["message_clean"].str.replace(
r"<[^>]+>",
" ",
regex=True
)Remove URLs:
text_df["message_clean"] = text_df["message_clean"].str.replace(
r"https?://\S+|www\.\S+",
" ",
regex=True
)Normalize a few common contractions and short forms:
replacements = {
r"\bcan't\b": "cannot",
r"\bwon't\b": "will not",
r"\bi'm\b": "i am",
r"\bi've\b": "i have",
r"\bpls\b": "please",
r"\bplz\b": "please",
}
text_df["message_clean"] = text_df["message_clean"].replace(
replacements,
regex=True
)Remove punctuation and symbols:
text_df["message_clean"] = text_df["message_clean"].str.replace(
r"[^a-z0-9\s]",
" ",
regex=True
)Collapse repeated spaces:
text_df["message_clean"] = text_df["message_clean"].str.replace(
r"\s+",
" ",
regex=True
).str.strip()Compare raw and cleaned text:
text_df[["raw_message", "message_clean"]].head(10)10. Text Quality Features
Create length features:
text_df["char_count_raw"] = text_df["raw_message"].fillna("").str.len()
text_df["char_count_clean"] = text_df["message_clean"].str.len()Word count:
text_df["word_count"] = text_df["message_clean"].str.split().str.len()Count exclamation marks in raw text:
text_df["exclamation_count"] = text_df["raw_message"].fillna("").str.count("!")Check whether the raw message had a URL:
text_df["has_url"] = text_df["raw_message"].fillna("").str.contains(
r"https?://|www\.",
regex=True,
case=False,
na=False
)Check whether the raw message had HTML:
text_df["has_html"] = text_df["raw_message"].fillna("").str.contains(
r"<[^>]+>",
regex=True,
na=False
)Feature summary:
text_df[[
"feedback_id",
"char_count_raw",
"char_count_clean",
"word_count",
"exclamation_count",
"has_url",
"has_html"
]].head()11. Extract Useful Text Patterns
Extract email-like text if a user typed one into the message:
text_df["mentioned_email"] = text_df["raw_message"].str.extract(
r"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)Extract issue IDs like ERR-204:
text_df["issue_code"] = text_df["raw_message"].str.extract(
r"\b(ERR-\d{3})\b"
)Find messages that mention money or billing:
text_df["mentions_billing"] = text_df["message_clean"].str.contains(
r"\b(?:refund|invoice|payment|paid|money|billing)\b",
regex=True,
na=False
)Find messages that mention bugs:
text_df["mentions_bug"] = text_df["message_clean"].str.contains(
r"\b(?:error|bug|crash|broken|not working|err)\b",
regex=True,
na=False
)Find messages that mention content clarity:
text_df["mentions_confusion"] = text_df["message_clean"].str.contains(
r"\b(?:confusing|unclear|hard|lost|explain|example)\b",
regex=True,
na=False
)12. Tokenize Text
Tokenization means splitting text into words.
text_df["tokens"] = text_df["message_clean"].str.split()Inspect:
text_df[["feedback_id", "tokens"]].head()Explode tokens into one row per token:
tokens = text_df[["feedback_id", "post_id", "sentiment_label", "tokens"]].explode("tokens")
tokens = tokens.rename(columns={"tokens": "token"})
tokens.head()Remove missing or empty tokens:
tokens = tokens.dropna(subset=["token"])
tokens = tokens[tokens["token"].str.len() > 0]13. Remove Stopwords
Use a small custom stopword list.
stopwords = {
"a", "an", "and", "are", "as", "at", "be", "but", "by",
"can", "for", "from", "has", "have", "i", "in", "is",
"it", "my", "of", "on", "or", "please", "so", "that",
"the", "this", "to", "was", "with", "you", "your"
}Filter:
meaningful_tokens = tokens[~tokens["token"].isin(stopwords)].copy()Top words:
meaningful_tokens["token"].value_counts().head(20)Top words by sentiment:
meaningful_tokens.groupby("sentiment_label")["token"].value_counts().groupby(level=0).head(10)14. N-Grams With Pandas
An n-gram is a sequence of words.
For bigrams, use the current token plus the next token inside the same feedback message.
meaningful_tokens = meaningful_tokens.sort_values(["feedback_id"]).copy()
meaningful_tokens["next_token"] = meaningful_tokens.groupby("feedback_id")["token"].shift(-1)
meaningful_tokens["bigram"] = meaningful_tokens["token"] + " " + meaningful_tokens["next_token"]Remove rows where there is no next token:
bigrams = meaningful_tokens.dropna(subset=["next_token"]).copy()Top bigrams:
bigrams["bigram"].value_counts().head(20)Top bigrams by sentiment:
bigrams.groupby("sentiment_label")["bigram"].value_counts().groupby(level=0).head(10)15. Build A Document-Term Matrix
A document-term matrix has:
- one row per document
- one column per token
- values showing token counts
token_counts = (
meaningful_tokens
.groupby(["feedback_id", "token"])
.size()
.rename("count")
)Convert to wide format:
dtm = token_counts.unstack(fill_value=0)
dtm.head()Keep the most common tokens:
top_tokens = meaningful_tokens["token"].value_counts().head(25).index
dtm_top = dtm.reindex(columns=top_tokens, fill_value=0)
dtm_top.head()This is a Pandas-native version of a simple bag-of-words table.
16. Simple TF-IDF With Pandas
Term frequency:
tf = dtm_top.div(dtm_top.sum(axis=1).replace(0, np.nan), axis=0).fillna(0)Document frequency:
doc_freq = (dtm_top > 0).sum(axis=0)Inverse document frequency:
idf = np.log((1 + len(dtm_top)) / (1 + doc_freq)) + 1TF-IDF:
tfidf = tf * idf
tfidf.head()Top TF-IDF token per feedback:
top_tfidf = tfidf.idxmax(axis=1).rename("top_keyword").reset_index()
top_tfidf.head()This is not a full NLP library, but it teaches the idea using Pandas.
17. Match Topic Rules
The topic rules table has regex patterns.
rulesCreate a long table of rule hits.
rule_hits = []
for row in rules.itertuples(index=False):
mask = text_df["message_clean"].str.contains(row.pattern, regex=True, na=False)
matched = text_df.loc[mask, ["feedback_id", "post_id", "message_clean"]].copy()
matched["rule_id"] = row.rule_id
matched["topic"] = row.topic
matched["severity"] = row.severity
matched["recommended_action"] = row.recommended_action
rule_hits.append(matched)
rule_hits = pd.concat(rule_hits, ignore_index=True)
rule_hits.head()Topic counts:
rule_hits["topic"].value_counts()Severity counts:
rule_hits["severity"].value_counts()Topics by post:
rule_hits.groupby(["post_id", "topic"]).size().reset_index(name="mentions")18. Merge With Posts And Authors
Audit post matches:
text_df.merge(posts, how="left", on="post_id", indicator=True)["_merge"].value_counts()Build the analysis table:
analysis = (
text_df
.merge(posts, how="left", on="post_id", validate="many_to_one")
.merge(authors, how="left", on="author_id", validate="many_to_one")
)
analysis.head()Select useful columns:
analysis = analysis[[
"feedback_id",
"post_id",
"slug",
"title",
"series",
"post_type",
"author",
"submitted_at",
"responded_at",
"response_hours",
"late_response",
"rating",
"sentiment_label",
"channel",
"status",
"message_clean",
"word_count",
"char_count_clean",
"exclamation_count",
"has_url",
"has_html",
"mentions_billing",
"mentions_bug",
"mentions_confusion"
]]
analysis.head()19. Post-Level Content Quality Report
Create a post-level report:
post_quality = (
analysis
.groupby(["post_id", "title", "series", "post_type", "author"], as_index=False)
.agg(
feedback_count=("feedback_id", "count"),
avg_rating=("rating", "mean"),
negative_count=("sentiment_label", lambda s: (s == "negative").sum()),
positive_count=("sentiment_label", lambda s: (s == "positive").sum()),
avg_word_count=("word_count", "mean"),
billing_mentions=("mentions_billing", "sum"),
bug_mentions=("mentions_bug", "sum"),
confusion_mentions=("mentions_confusion", "sum"),
late_responses=("late_response", "sum")
)
)
post_quality["avg_rating"] = post_quality["avg_rating"].round(2)
post_quality["avg_word_count"] = post_quality["avg_word_count"].round(2)Create a review priority score:
post_quality["review_priority_score"] = (
post_quality["negative_count"] * 3
+ post_quality["bug_mentions"] * 3
+ post_quality["confusion_mentions"] * 2
+ post_quality["billing_mentions"] * 2
+ post_quality["late_responses"]
)
post_quality.sort_values("review_priority_score", ascending=False)This table tells the content team what to improve first.
20. Author-Level Report
author_report = (
analysis
.groupby("author", as_index=False)
.agg(
posts=("post_id", "nunique"),
feedback_count=("feedback_id", "count"),
avg_rating=("rating", "mean"),
negative_count=("sentiment_label", lambda s: (s == "negative").sum()),
avg_response_hours=("response_hours", "mean")
)
)
author_report["avg_rating"] = author_report["avg_rating"].round(2)
author_report["avg_response_hours"] = author_report["avg_response_hours"].round(2)
author_report.sort_values("feedback_count", ascending=False)21. Channel And Sentiment Report
channel_report = (
analysis
.groupby(["channel", "sentiment_label"], as_index=False)
.agg(
feedback_count=("feedback_id", "count"),
avg_rating=("rating", "mean"),
avg_word_count=("word_count", "mean")
)
)
channel_report["avg_rating"] = channel_report["avg_rating"].round(2)
channel_report["avg_word_count"] = channel_report["avg_word_count"].round(2)
channel_reportPivot version:
analysis.pivot_table(
index="channel",
columns="sentiment_label",
values="feedback_id",
aggfunc="count",
fill_value=0
)22. Topic Report From Rule Hits
Merge rule hits with post metadata:
topic_report_base = (
rule_hits
.merge(posts, how="left", on="post_id", validate="many_to_one")
.merge(authors, how="left", on="author_id", validate="many_to_one")
)Topic count by series:
topic_by_series = (
topic_report_base
.groupby(["series", "topic"], as_index=False)
.agg(mentions=("feedback_id", "count"))
.sort_values(["series", "mentions"], ascending=[True, False])
)
topic_by_seriesTopic severity report:
topic_report_base.pivot_table(
index="topic",
columns="severity",
values="feedback_id",
aggfunc="count",
fill_value=0
)23. Date Features And Time-Based Analysis
Add date features:
analysis["submitted_date"] = analysis["submitted_at"].dt.date
analysis["submitted_month"] = analysis["submitted_at"].dt.to_period("M").astype(str)
analysis["submitted_day_name"] = analysis["submitted_at"].dt.day_name()
analysis["submitted_hour"] = analysis["submitted_at"].dt.hourFeedback by day name:
analysis.groupby("submitted_day_name")["feedback_id"].count()Feedback by hour:
analysis.groupby("submitted_hour")["feedback_id"].count().sort_index()Monthly sentiment:
analysis.pivot_table(
index="submitted_month",
columns="sentiment_label",
values="feedback_id",
aggfunc="count",
fill_value=0
)24. Resample Feedback Trends
Set a datetime index:
analysis_ts = analysis.set_index("submitted_at").sort_index()Daily feedback count:
daily_feedback = analysis_ts["feedback_id"].resample("D").count()
daily_feedback.head()Daily average rating:
daily_rating = analysis_ts["rating"].resample("D").mean()
daily_rating.head()Daily negative feedback:
daily_negative = (
analysis_ts["sentiment_label"]
.eq("negative")
.resample("D")
.sum()
)Combine daily metrics:
daily_report = pd.DataFrame({
"feedback_count": daily_feedback,
"avg_rating": daily_rating,
"negative_count": daily_negative
})
daily_report.head()Rolling 7-day feedback volume:
daily_report["feedback_7d_avg"] = daily_report["feedback_count"].rolling(
7,
min_periods=1
).mean()Weekly report:
weekly_report = analysis_ts.resample("W").agg(
feedback_count=("feedback_id", "count"),
avg_rating=("rating", "mean"),
avg_response_hours=("response_hours", "mean")
)
weekly_report.round(2)25. Response Time SLA Report
Create an SLA bucket:
analysis["response_bucket"] = pd.cut(
analysis["response_hours"],
bins=[-0.01, 6, 24, 48, np.inf],
labels=["0-6h", "6-24h", "24-48h", "48h+"]
)SLA by channel:
analysis.pivot_table(
index="channel",
columns="response_bucket",
values="feedback_id",
aggfunc="count",
fill_value=0
)Average response by post type:
analysis.groupby("post_type", as_index=False).agg(
feedback_count=("feedback_id", "count"),
avg_response_hours=("response_hours", "mean"),
late_responses=("late_response", "sum")
).round(2)26. MultiIndex Text Summary
Group by series, post type, and sentiment:
multi_summary = (
analysis
.groupby(["series", "post_type", "sentiment_label"])
.agg(
feedback_count=("feedback_id", "count"),
avg_rating=("rating", "mean"),
avg_word_count=("word_count", "mean")
)
)
multi_summarySelect one series:
multi_summary.loc["Pandas"]Unstack sentiment:
multi_summary["feedback_count"].unstack("sentiment_label", fill_value=0)Reset index for export:
multi_summary.reset_index().head()27. Build A Simple Triage Score
Create a row-level score:
analysis["triage_score"] = (
analysis["mentions_bug"].astype(int) * 4
+ analysis["mentions_billing"].astype(int) * 3
+ analysis["mentions_confusion"].astype(int) * 2
+ analysis["sentiment_label"].eq("negative").astype(int) * 3
+ analysis["late_response"].fillna(False).astype(int)
+ analysis["exclamation_count"].clip(upper=3)
)Assign priority:
analysis["triage_priority"] = pd.cut(
analysis["triage_score"],
bins=[-1, 2, 5, 99],
labels=["low", "medium", "high"]
)Top messages to review:
analysis.sort_values("triage_score", ascending=False)[[
"feedback_id",
"title",
"sentiment_label",
"triage_score",
"triage_priority",
"message_clean"
]].head(10)28. Create Final Reports
Executive summary:
executive_summary = pd.DataFrame({
"metric": [
"feedback_rows",
"unique_posts",
"average_rating",
"negative_feedback",
"billing_mentions",
"bug_mentions",
"confusion_mentions",
"average_response_hours",
"late_responses"
],
"value": [
analysis["feedback_id"].nunique(),
analysis["post_id"].nunique(),
round(analysis["rating"].mean(), 2),
int(analysis["sentiment_label"].eq("negative").sum()),
int(analysis["mentions_billing"].sum()),
int(analysis["mentions_bug"].sum()),
int(analysis["mentions_confusion"].sum()),
round(analysis["response_hours"].mean(), 2),
int(analysis["late_response"].sum())
]
})
executive_summaryFinal reports dictionary:
reports = {
"executive_summary": executive_summary,
"post_quality": post_quality,
"author_report": author_report,
"channel_report": channel_report,
"topic_by_series": topic_by_series,
"daily_report": daily_report,
"weekly_report": weekly_report,
"multi_summary": multi_summary.reset_index(),
}Optional exports:
# executive_summary.to_csv("text_case_executive_summary.csv", index=False)
# post_quality.to_csv("text_case_post_quality.csv", index=False)
# daily_report.to_csv("text_case_daily_report.csv")29. Case Study Checklist
Use this checklist for text-heavy Pandas projects:
- Load all raw files.
- Concatenate same-schema monthly files.
- Check shapes, types, missing values, and duplicates.
- Parse timestamps with
errors="coerce". - Audit joins before merging.
- Create a clean text column.
- Lowercase and strip text.
- Remove HTML, URLs, punctuation, and repeated spaces.
- Normalize common short forms.
- Extract useful patterns with regex.
- Create text-length features.
- Tokenize with
.str.split(). - Use
explode()for one-token-per-row analysis. - Remove stopwords.
- Count words and phrases.
- Build a document-term matrix.
- Match topic rules.
- Merge metadata.
- Build post, author, channel, and topic reports.
- Use pivots for cross-tab reports.
- Use
resample()for time trends. - Use rolling metrics for smoothing.
- Create triage scores.
- Create export-ready final tables.
30. Practice Tasks
Try these after completing the walkthrough.
Practice Lab
Task 1. Find the top 10 longest feedback messages.
analysis.sort_values("word_count", ascending=False)[[
"feedback_id",
"title",
"word_count",
"message_clean"
]].head(10)Explanation
- Sorts a DataFrame named 'analysis' by the 'word_count' column in descending order to prioritize longer entries
- Selects specific columns: 'feedback_id', 'title', 'word_count', and 'message_clean' for display
- Returns only the top 10 rows with the highest word counts using the head() method
- This approach helps quickly identify the most verbose feedback entries for detailed review
- The resulting DataFrame shows the longest feedback messages along with their identifying information
Practice Lab
Task 2. Find posts with average rating below 3.5.
post_quality[post_quality["avg_rating"] < 3.5].sort_values("avg_rating")Explanation
- Filters the post_quality DataFrame to include only rows where avg_rating is less than 3.5
- Sorts the filtered results by avg_rating column in ascending order (lowest ratings first)
- Returns a new DataFrame containing only poor-rated posts ordered from worst to best among the low-rated subset
- Useful for identifying underperforming content that needs improvement or review
- The chained operations allow for quick identification of posts requiring attention based on quality metrics
Practice Lab
Task 3. Find the most common token in negative feedback.
meaningful_tokens[meaningful_tokens["sentiment_label"] == "negative"]["token"] \
.value_counts() \
.head(10)Explanation
- Filters the meaningful_tokens DataFrame to include only rows where sentiment_label equals "negative"
- Selects the token column from the filtered results
- Counts the frequency of each unique token value using value_counts()
- Returns the top 10 most frequently occurring tokens with negative sentiment labels
- This analysis helps identify common words associated with negative sentiment in the dataset
Practice Lab
Task 4. Find messages that mention both bug and confusion.
analysis[analysis["mentions_bug"] & analysis["mentions_confusion"]][[
"feedback_id",
"title",
"message_clean"
]]Explanation
- The code filters a DataFrame named 'analysis' to include only rows where both "mentions_bug" and "mentions_confusion" columns contain True values
- It then selects three specific columns: "feedback_id", "title", and "message_clean" from the filtered results
- The boolean filtering uses the & operator to combine two conditions, ensuring both criteria must be met
- This approach is commonly used for data analysis tasks where you need to isolate specific subsets of data based on multiple attributes
- The result returns a new DataFrame containing only the specified columns for entries that satisfy both boolean conditions
Practice Lab
Task 5. Find average response time by sentiment.
analysis.groupby("sentiment_label", as_index=False)["response_hours"].mean().round(2)Explanation
- Groups data by the "sentiment_label" column to aggregate responses by sentiment category
- Calculates the mean of "response_hours" for each sentiment group using the mean() function
- Rounds the resulting averages to 2 decimal places for cleaner presentation
- Returns a new DataFrame with sentiment labels and their corresponding average response times
- The as_index=False parameter ensures the grouping column remains a regular column rather than becoming the index
Practice Lab
Task 6. Find top bigrams for the Pandas series.
pandas_feedback_ids = analysis[analysis["series"] == "Pandas"]["feedback_id"]
bigrams[bigrams["feedback_id"].isin(pandas_feedback_ids)]["bigram"] \
.value_counts() \
.head(10)Explanation
- Filters the analysis dataframe to select only rows where the series column equals "Pandas" and extracts the corresponding feedback_id values
- Uses the extracted feedback_ids to filter the bigrams dataframe for matching entries
- Counts the frequency of each unique bigram in the filtered results
- Returns the top 10 most frequently occurring bigrams from the Pandas series feedback data
Practice Lab
Task 7. Build a post type and sentiment pivot.
analysis.pivot_table(
index="post_type",
columns="sentiment_label",
values="feedback_id",
aggfunc="count",
fill_value=0
)Explanation
- Generates a cross-tabulation showing how feedback counts are distributed between different post types and sentiment labels
- Uses post_type as row identifiers and sentiment_label as column headers for structured data analysis
- Counts feedback_id occurrences within each combination of post type and sentiment category
- Fills missing combinations with zero values to ensure complete dataset representation
- Provides insights into which post types generate feedback with specific sentiment patterns
Practice Lab
Task 8. Find weekly negative feedback rate.
weekly_negative_rate = (
analysis_ts["sentiment_label"].eq("negative").resample("W").mean() * 100
)
weekly_negative_rate.round(2)Explanation
- Creates a time series analysis by counting negative sentiment occurrences and resampling to weekly intervals
- Converts boolean True/False values to numeric 1/0 for calculation using eq("negative") method
- Applies mean() function to compute the average proportion of negative sentiments per week
- Multiplies by 100 to convert decimal proportions to percentage values
- Rounds final results to 2 decimal places for improved readability
Practice Lab
Task 9. Find posts with high review priority and many confusion mentions.
post_quality.query("review_priority_score >= 5 and confusion_mentions > 0") \
.sort_values("review_priority_score", ascending=False)Explanation
- Filters customer reviews where the review priority score is 5 or higher and there are mentions of confusion
- Sorts the filtered results by review priority score in descending order to show highest priority items first
- Uses pandas query method to apply boolean conditions on the post_quality DataFrame
- Returns a sorted subset of reviews that meet both quality and confusion criteria for prioritized attention
Practice Lab
Task 10. Build a term-frequency table by sentiment.
sentiment_token_counts = (
meaningful_tokens
.groupby(["sentiment_label", "token"])
.size()
.rename("count")
.reset_index()
.sort_values(["sentiment_label", "count"], ascending=[True, False])
)
sentiment_token_counts.groupby("sentiment_label").head(10)Explanation
- Groups meaningful tokens by sentiment label and token identity to count occurrences of each token within each sentiment category
- Renames the count column for clarity and resets the index to create a clean dataframe structure
- Sorts results by sentiment label ascending and token count descending to prioritize highest frequency tokens within each sentiment group
- Uses groupby with head(10) to extract the top 10 most frequent tokens for each sentiment label, enabling comparative analysis across sentiment categories
31. Interview Questions From This Case Study
1. Why should text cleaning happen before token analysis?
Cleaning removes inconsistent casing, symbols, URLs, and formatting noise, so token counts represent meaning instead of messy variants.
2. Why use .str methods instead of Python string methods directly?
A Pandas Series is a column of values, not one string. .str applies string methods vectorized across the Series and handles missing values better.
3. Why use regex for HTML and URL removal?
HTML tags and URLs follow patterns, so regex can remove many variants with one expression.
4. Why should stopword removal happen after tokenization?
Stopwords are word-level items, so you need tokens before filtering them.
5. What does explode() do?
It turns each list-like value into multiple rows, which is perfect for token-level analysis.
6. What is a document-term matrix?
It is a table where each row is a document, each column is a token, and each value is the count of that token in that document.
7. What is TF-IDF?
TF-IDF increases the importance of words that are common in one document but not common across all documents.
8. Why use rule-based topic matching?
It gives a transparent first-pass classification system that analysts can inspect and improve.
9. Why audit joins before merging?
Join audits reveal missing post IDs, missing author IDs, or feedback records that will lose context after a merge.
10. Why use validate in merge()?
It checks that the relationship is what you expect, such as many feedback rows matching one post.
11. Why parse dates in a text case study?
Feedback is text, but it also arrives over time. Dates let you analyze trends, response times, and weekly quality changes.
12. Why create a triage score?
A triage score turns many signals into one review priority, helping teams decide what to handle first.
13. What is the limitation of simple keyword rules?
They can miss context, sarcasm, synonyms, and complex intent. They are useful for transparent baseline analysis, not perfect understanding.
14. Why keep raw and cleaned text?
Raw text preserves evidence. Cleaned text supports analysis.
15. Why is this a Pandas case study instead of an NLP project?
The focus is on data wrangling, quality checks, feature engineering, aggregation, and reporting using Pandas. NLP models can come later.
32. Final Mental Model
For textual data in Pandas, think in layers:
| Layer | Pandas Tools |
|---|---|
| Ingest | read_csv, concat |
| Inspect | shape, info, describe, isna, duplicated |
| Clean | .str.lower, .str.strip, .str.replace |
| Extract | .str.extract, .str.contains, regex |
| Tokenize | .str.split, explode |
| Count | value_counts, groupby, size |
| Vectorize | unstack, document-term matrix, TF-IDF |
| Enrich | merge, validate, indicator |
| Time | to_datetime, Timedelta, resample, rolling |
| Report | agg, pivot_table, sorted summaries |
If you remember only one thing, remember this:
Text analysis in Pandas is not just cleaning strings. It is a complete workflow for turning messy human language into structured signals that can be joined, counted, trended, and reported.
Official References
- Pandas working with text data user guide: https://pandas.pydata.org/docs/user_guide/text.html
- Pandas
Series.strAPI reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html - Pandas missing data user guide: https://pandas.pydata.org/docs/user_guide/missing_data.html
- Pandas merging, joining, and concatenating user guide: https://pandas.pydata.org/docs/user_guide/merging.html
- Pandas GroupBy user guide: https://pandas.pydata.org/docs/user_guide/groupby.html
- Pandas time series user guide: https://pandas.pydata.org/docs/user_guide/timeseries.html
