# Pandas Case Study: Cleaning and Analyzing Text Feedback Data URL: https://madhudadi.in/blog/posts/pandas-text-data-case-study-clean-and-analyze-feedback Published: 2026-06-06 Tags: Pandas, python Read time: 84 min Difficulty: advanced > Build a complete Pandas text data case study with original content-feedback data. Clean messy messages, remove HTML and URLs, normalize text, extract keywords, tokenize, remove stopwords, build n-grams and a document-term matrix, join post and author data, parse dates, create pivots, resample trends, and produce final content-quality reports.# Pandas Text Data Case Study: Clean, Explore, Vectorize, and Report This is a full Pandas case study for working with textual data. The scenario is simple: A learning platform collects feedback from learners on blog posts, coding lessons, projects, and interview-prep content. The feedback arrives as messy text with mixed casing, HTML snippets, URLs, typos, repeated records, ratings, timestamps, and support categories. Your job is to turn raw feedback into useful content-quality reports. By the end, you will answer questions like: - Which posts receive the most negative feedback? - Which authors get the strongest ratings? - Which topics are learners asking about? - Which feedback messages mention refund, invoice, bugs, copyright, or confusion? - Which channels bring high-quality feedback? - Which words and phrases appear most often? - How do feedback volume and response time change over time? - Which posts need review first? This case study uses original sample datasets created for this lesson. It does not use copied movie reviews, social media posts, ecommerce reviews, public datasets, or third-party course material. Files used in this case study: - `pandas_text_case_posts.csv` - `pandas_text_case_authors.csv` - `pandas_text_case_feedback_jan.csv` - `pandas_text_case_feedback_feb.csv` - `pandas_text_case_topic_rules.csv` Place all CSV files in the same folder as this Markdown file before running the examples. ## What This Case Study Covers This project brings together many Pandas skills: - `read_csv` - `concat` - `merge` - `validate` - `indicator=True` - `shape`, `info`, `describe` - missing values - duplicate rows - string cleaning with `.str` - regex replacement - regex extraction - URL and HTML removal - punctuation removal - text normalization - text feature engineering - tokenization with `.str.split()` - `explode` - stopword filtering - word counts - n-grams - keyword rule matching - document-term matrices - simple TF-IDF with Pandas and NumPy - datetime parsing - Timedelta response-time analysis - `groupby` - named aggregation - `transform` - `pivot_table` - `set_index` - `resample` - rolling averages - MultiIndex-style summaries - final export-ready reports ## 1. Import Libraries ```python import pandas as pd import numpy as np ``` Optional display settings: ```python pd.set_option("display.max_columns", 100) pd.set_option("display.width", 120) ``` ## 2. Load The Data ```python posts = pd.read_csv("pandas_text_case_posts.csv") authors = pd.read_csv("pandas_text_case_authors.csv") rules = pd.read_csv("pandas_text_case_topic_rules.csv") jan = pd.read_csv("pandas_text_case_feedback_jan.csv") feb = pd.read_csv("pandas_text_case_feedback_feb.csv") ``` Check the shapes: ```python for name, df in { "posts": posts, "authors": authors, "rules": rules, "jan": jan, "feb": feb, }.items(): print(name, df.shape) ``` Preview the data: ```python posts.head() ``` ```python jan.head() ``` ```python rules.head() ``` ## 3. Understand The Data Model The data has three supporting tables: - `posts`: one row per content item - `authors`: one row per content author - `rules`: keyword rules used to detect topics in feedback text It has two monthly feedback tables: - `jan`: feedback received in January - `feb`: feedback received in February Important keys: | Table | Key Columns | |---|---| | posts | `post_id`, `author_id` | | authors | `author_id` | | feedback | `feedback_id`, `post_id` | | rules | `rule_id` | ## 4. Inspect Raw Tables ```python posts.info() ``` ```python jan.info() ``` ```python jan.describe(include="object") ``` Check numeric fields: ```python jan.describe(numeric_only=True) ``` Check sentiment labels: ```python jan["sentiment_label"].value_counts(dropna=False) ``` ## 5. Concatenate Monthly Feedback January and February feedback have the same schema. ```python feedback = pd.concat([jan, feb], ignore_index=True) feedback.shape ``` Keep source month when needed: ```python feedback_with_source = pd.concat( [jan, feb], keys=["Jan", "Feb"] ).reset_index(level=0).rename(columns={"level_0": "source_month"}) feedback_with_source.head() ``` For the main workflow: ```python feedback = pd.concat([jan, feb], ignore_index=True) ``` ## 6. Data Quality Checks Check duplicate full rows: ```python feedback.duplicated().sum() ``` Drop exact duplicates: ```python feedback = feedback.drop_duplicates().copy() ``` Check duplicate feedback IDs: ```python feedback["feedback_id"].duplicated().sum() ``` Check missing values: ```python feedback.isna().sum() ``` Check unknown post IDs before merging: ```python feedback.merge(posts, how="left", on="post_id", indicator=True)["_merge"].value_counts() ``` Find feedback rows pointing to missing posts: ```python feedback.merge(posts, how="left", on="post_id", indicator=True) \ .query("_merge == 'left_only'")[["feedback_id", "post_id", "raw_message"]] ``` Do not hide this. Data quality findings are part of the case study. ## 7. Parse Dates Convert timestamps: ```python feedback["submitted_at"] = pd.to_datetime(feedback["submitted_at"], errors="coerce") feedback["responded_at"] = pd.to_datetime(feedback["responded_at"], errors="coerce") posts["published_at"] = pd.to_datetime(posts["published_at"], errors="coerce") ``` Find invalid submitted dates: ```python feedback[feedback["submitted_at"].isna()] ``` Keep valid submitted feedback for time analysis: ```python feedback_valid = feedback.dropna(subset=["submitted_at"]).copy() ``` Create response-time features: ```python feedback_valid["response_time"] = feedback_valid["responded_at"] - feedback_valid["submitted_at"] feedback_valid["response_hours"] = ( feedback_valid["response_time"].dt.total_seconds() / 3600 ).round(2) ``` Flag feedback not answered within 24 hours: ```python feedback_valid["late_response"] = feedback_valid["response_hours"] > 24 ``` Rows with missing `responded_at` will have missing response hours. ## 8. Clean Column Names Column names can also be text data. ```python feedback_valid.columns = ( feedback_valid.columns .str.strip() .str.lower() .str.replace(" ", "_", regex=False) ) ``` This dataset already has clean column names, but the pattern is useful in real projects. ## 9. Clean Raw Feedback Text Start with a copy: ```python text_df = feedback_valid.copy() ``` Inspect raw text: ```python text_df["raw_message"].head(10) ``` Create a cleaned text column: ```python text_df["message_clean"] = text_df["raw_message"].fillna("") ``` Lowercase: ```python text_df["message_clean"] = text_df["message_clean"].str.lower() ``` Remove leading and trailing spaces: ```python text_df["message_clean"] = text_df["message_clean"].str.strip() ``` Remove HTML tags: ```python text_df["message_clean"] = text_df["message_clean"].str.replace( r"<[^>]+>", " ", regex=True ) ``` Remove URLs: ```python text_df["message_clean"] = text_df["message_clean"].str.replace( r"https?://\S+|www\.\S+", " ", regex=True ) ``` Normalize a few common contractions and short forms: ```python replacements = { r"\bcan't\b": "cannot", r"\bwon't\b": "will not", r"\bi'm\b": "i am", r"\bi've\b": "i have", r"\bpls\b": "please", r"\bplz\b": "please", } text_df["message_clean"] = text_df["message_clean"].replace( replacements, regex=True ) ``` Remove punctuation and symbols: ```python text_df["message_clean"] = text_df["message_clean"].str.replace( r"[^a-z0-9\s]", " ", regex=True ) ``` Collapse repeated spaces: ```python text_df["message_clean"] = text_df["message_clean"].str.replace( r"\s+", " ", regex=True ).str.strip() ``` Compare raw and cleaned text: ```python text_df[["raw_message", "message_clean"]].head(10) ``` ## 10. Text Quality Features Create length features: ```python text_df["char_count_raw"] = text_df["raw_message"].fillna("").str.len() text_df["char_count_clean"] = text_df["message_clean"].str.len() ``` Word count: ```python text_df["word_count"] = text_df["message_clean"].str.split().str.len() ``` Count exclamation marks in raw text: ```python text_df["exclamation_count"] = text_df["raw_message"].fillna("").str.count("!") ``` Check whether the raw message had a URL: ```python text_df["has_url"] = text_df["raw_message"].fillna("").str.contains( r"https?://|www\.", regex=True, case=False, na=False ) ``` Check whether the raw message had HTML: ```python text_df["has_html"] = text_df["raw_message"].fillna("").str.contains( r"<[^>]+>", regex=True, na=False ) ``` Feature summary: ```python text_df[[ "feedback_id", "char_count_raw", "char_count_clean", "word_count", "exclamation_count", "has_url", "has_html" ]].head() ``` ## 11. Extract Useful Text Patterns Extract email-like text if a user typed one into the message: ```python text_df["mentioned_email"] = text_df["raw_message"].str.extract( r"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})" ) ``` Extract issue IDs like `ERR-204`: ```python text_df["issue_code"] = text_df["raw_message"].str.extract( r"\b(ERR-\d{3})\b" ) ``` Find messages that mention money or billing: ```python text_df["mentions_billing"] = text_df["message_clean"].str.contains( r"\b(?:refund|invoice|payment|paid|money|billing)\b", regex=True, na=False ) ``` Find messages that mention bugs: ```python text_df["mentions_bug"] = text_df["message_clean"].str.contains( r"\b(?:error|bug|crash|broken|not working|err)\b", regex=True, na=False ) ``` Find messages that mention content clarity: ```python text_df["mentions_confusion"] = text_df["message_clean"].str.contains( r"\b(?:confusing|unclear|hard|lost|explain|example)\b", regex=True, na=False ) ``` ## 12. Tokenize Text Tokenization means splitting text into words. ```python text_df["tokens"] = text_df["message_clean"].str.split() ``` Inspect: ```python text_df[["feedback_id", "tokens"]].head() ``` Explode tokens into one row per token: ```python tokens = text_df[["feedback_id", "post_id", "sentiment_label", "tokens"]].explode("tokens") tokens = tokens.rename(columns={"tokens": "token"}) tokens.head() ``` Remove missing or empty tokens: ```python tokens = tokens.dropna(subset=["token"]) tokens = tokens[tokens["token"].str.len() > 0] ``` ## 13. Remove Stopwords Use a small custom stopword list. ```python stopwords = { "a", "an", "and", "are", "as", "at", "be", "but", "by", "can", "for", "from", "has", "have", "i", "in", "is", "it", "my", "of", "on", "or", "please", "so", "that", "the", "this", "to", "was", "with", "you", "your" } ``` Filter: ```python meaningful_tokens = tokens[~tokens["token"].isin(stopwords)].copy() ``` Top words: ```python meaningful_tokens["token"].value_counts().head(20) ``` Top words by sentiment: ```python meaningful_tokens.groupby("sentiment_label")["token"].value_counts().groupby(level=0).head(10) ``` ## 14. N-Grams With Pandas An n-gram is a sequence of words. For bigrams, use the current token plus the next token inside the same feedback message. ```python meaningful_tokens = meaningful_tokens.sort_values(["feedback_id"]).copy() meaningful_tokens["next_token"] = meaningful_tokens.groupby("feedback_id")["token"].shift(-1) meaningful_tokens["bigram"] = meaningful_tokens["token"] + " " + meaningful_tokens["next_token"] ``` Remove rows where there is no next token: ```python bigrams = meaningful_tokens.dropna(subset=["next_token"]).copy() ``` Top bigrams: ```python bigrams["bigram"].value_counts().head(20) ``` Top bigrams by sentiment: ```python bigrams.groupby("sentiment_label")["bigram"].value_counts().groupby(level=0).head(10) ``` ## 15. Build A Document-Term Matrix A document-term matrix has: - one row per document - one column per token - values showing token counts ```python token_counts = ( meaningful_tokens .groupby(["feedback_id", "token"]) .size() .rename("count") ) ``` Convert to wide format: ```python dtm = token_counts.unstack(fill_value=0) dtm.head() ``` Keep the most common tokens: ```python top_tokens = meaningful_tokens["token"].value_counts().head(25).index dtm_top = dtm.reindex(columns=top_tokens, fill_value=0) dtm_top.head() ``` This is a Pandas-native version of a simple bag-of-words table. ## 16. Simple TF-IDF With Pandas Term frequency: ```python tf = dtm_top.div(dtm_top.sum(axis=1).replace(0, np.nan), axis=0).fillna(0) ``` Document frequency: ```python doc_freq = (dtm_top > 0).sum(axis=0) ``` Inverse document frequency: ```python idf = np.log((1 + len(dtm_top)) / (1 + doc_freq)) + 1 ``` TF-IDF: ```python tfidf = tf * idf tfidf.head() ``` Top TF-IDF token per feedback: ```python top_tfidf = tfidf.idxmax(axis=1).rename("top_keyword").reset_index() top_tfidf.head() ``` This is not a full NLP library, but it teaches the idea using Pandas. ## 17. Match Topic Rules The topic rules table has regex patterns. ```python rules ``` Create a long table of rule hits. ```python rule_hits = [] for row in rules.itertuples(index=False): mask = text_df["message_clean"].str.contains(row.pattern, regex=True, na=False) matched = text_df.loc[mask, ["feedback_id", "post_id", "message_clean"]].copy() matched["rule_id"] = row.rule_id matched["topic"] = row.topic matched["severity"] = row.severity matched["recommended_action"] = row.recommended_action rule_hits.append(matched) rule_hits = pd.concat(rule_hits, ignore_index=True) rule_hits.head() ``` Topic counts: ```python rule_hits["topic"].value_counts() ``` Severity counts: ```python rule_hits["severity"].value_counts() ``` Topics by post: ```python rule_hits.groupby(["post_id", "topic"]).size().reset_index(name="mentions") ``` ## 18. Merge With Posts And Authors Audit post matches: ```python text_df.merge(posts, how="left", on="post_id", indicator=True)["_merge"].value_counts() ``` Build the analysis table: ```python analysis = ( text_df .merge(posts, how="left", on="post_id", validate="many_to_one") .merge(authors, how="left", on="author_id", validate="many_to_one") ) analysis.head() ``` Select useful columns: ```python analysis = analysis[[ "feedback_id", "post_id", "slug", "title", "series", "post_type", "author", "submitted_at", "responded_at", "response_hours", "late_response", "rating", "sentiment_label", "channel", "status", "message_clean", "word_count", "char_count_clean", "exclamation_count", "has_url", "has_html", "mentions_billing", "mentions_bug", "mentions_confusion" ]] analysis.head() ``` ## 19. Post-Level Content Quality Report Create a post-level report: ```python post_quality = ( analysis .groupby(["post_id", "title", "series", "post_type", "author"], as_index=False) .agg( feedback_count=("feedback_id", "count"), avg_rating=("rating", "mean"), negative_count=("sentiment_label", lambda s: (s == "negative").sum()), positive_count=("sentiment_label", lambda s: (s == "positive").sum()), avg_word_count=("word_count", "mean"), billing_mentions=("mentions_billing", "sum"), bug_mentions=("mentions_bug", "sum"), confusion_mentions=("mentions_confusion", "sum"), late_responses=("late_response", "sum") ) ) post_quality["avg_rating"] = post_quality["avg_rating"].round(2) post_quality["avg_word_count"] = post_quality["avg_word_count"].round(2) ``` Create a review priority score: ```python post_quality["review_priority_score"] = ( post_quality["negative_count"] * 3 + post_quality["bug_mentions"] * 3 + post_quality["confusion_mentions"] * 2 + post_quality["billing_mentions"] * 2 + post_quality["late_responses"] ) post_quality.sort_values("review_priority_score", ascending=False) ``` This table tells the content team what to improve first. ## 20. Author-Level Report ```python author_report = ( analysis .groupby("author", as_index=False) .agg( posts=("post_id", "nunique"), feedback_count=("feedback_id", "count"), avg_rating=("rating", "mean"), negative_count=("sentiment_label", lambda s: (s == "negative").sum()), avg_response_hours=("response_hours", "mean") ) ) author_report["avg_rating"] = author_report["avg_rating"].round(2) author_report["avg_response_hours"] = author_report["avg_response_hours"].round(2) author_report.sort_values("feedback_count", ascending=False) ``` ## 21. Channel And Sentiment Report ```python channel_report = ( analysis .groupby(["channel", "sentiment_label"], as_index=False) .agg( feedback_count=("feedback_id", "count"), avg_rating=("rating", "mean"), avg_word_count=("word_count", "mean") ) ) channel_report["avg_rating"] = channel_report["avg_rating"].round(2) channel_report["avg_word_count"] = channel_report["avg_word_count"].round(2) channel_report ``` Pivot version: ```python analysis.pivot_table( index="channel", columns="sentiment_label", values="feedback_id", aggfunc="count", fill_value=0 ) ``` ## 22. Topic Report From Rule Hits Merge rule hits with post metadata: ```python topic_report_base = ( rule_hits .merge(posts, how="left", on="post_id", validate="many_to_one") .merge(authors, how="left", on="author_id", validate="many_to_one") ) ``` Topic count by series: ```python topic_by_series = ( topic_report_base .groupby(["series", "topic"], as_index=False) .agg(mentions=("feedback_id", "count")) .sort_values(["series", "mentions"], ascending=[True, False]) ) topic_by_series ``` Topic severity report: ```python topic_report_base.pivot_table( index="topic", columns="severity", values="feedback_id", aggfunc="count", fill_value=0 ) ``` ## 23. Date Features And Time-Based Analysis Add date features: ```python analysis["submitted_date"] = analysis["submitted_at"].dt.date analysis["submitted_month"] = analysis["submitted_at"].dt.to_period("M").astype(str) analysis["submitted_day_name"] = analysis["submitted_at"].dt.day_name() analysis["submitted_hour"] = analysis["submitted_at"].dt.hour ``` Feedback by day name: ```python analysis.groupby("submitted_day_name")["feedback_id"].count() ``` Feedback by hour: ```python analysis.groupby("submitted_hour")["feedback_id"].count().sort_index() ``` Monthly sentiment: ```python analysis.pivot_table( index="submitted_month", columns="sentiment_label", values="feedback_id", aggfunc="count", fill_value=0 ) ``` ## 24. Resample Feedback Trends Set a datetime index: ```python analysis_ts = analysis.set_index("submitted_at").sort_index() ``` Daily feedback count: ```python daily_feedback = analysis_ts["feedback_id"].resample("D").count() daily_feedback.head() ``` Daily average rating: ```python daily_rating = analysis_ts["rating"].resample("D").mean() daily_rating.head() ``` Daily negative feedback: ```python daily_negative = ( analysis_ts["sentiment_label"] .eq("negative") .resample("D") .sum() ) ``` Combine daily metrics: ```python daily_report = pd.DataFrame({ "feedback_count": daily_feedback, "avg_rating": daily_rating, "negative_count": daily_negative }) daily_report.head() ``` Rolling 7-day feedback volume: ```python daily_report["feedback_7d_avg"] = daily_report["feedback_count"].rolling( 7, min_periods=1 ).mean() ``` Weekly report: ```python weekly_report = analysis_ts.resample("W").agg( feedback_count=("feedback_id", "count"), avg_rating=("rating", "mean"), avg_response_hours=("response_hours", "mean") ) weekly_report.round(2) ``` ## 25. Response Time SLA Report Create an SLA bucket: ```python analysis["response_bucket"] = pd.cut( analysis["response_hours"], bins=[-0.01, 6, 24, 48, np.inf], labels=["0-6h", "6-24h", "24-48h", "48h+"] ) ``` SLA by channel: ```python analysis.pivot_table( index="channel", columns="response_bucket", values="feedback_id", aggfunc="count", fill_value=0 ) ``` Average response by post type: ```python analysis.groupby("post_type", as_index=False).agg( feedback_count=("feedback_id", "count"), avg_response_hours=("response_hours", "mean"), late_responses=("late_response", "sum") ).round(2) ``` ## 26. MultiIndex Text Summary Group by series, post type, and sentiment: ```python multi_summary = ( analysis .groupby(["series", "post_type", "sentiment_label"]) .agg( feedback_count=("feedback_id", "count"), avg_rating=("rating", "mean"), avg_word_count=("word_count", "mean") ) ) multi_summary ``` Select one series: ```python multi_summary.loc["Pandas"] ``` Unstack sentiment: ```python multi_summary["feedback_count"].unstack("sentiment_label", fill_value=0) ``` Reset index for export: ```python multi_summary.reset_index().head() ``` ## 27. Build A Simple Triage Score Create a row-level score: ```python analysis["triage_score"] = ( analysis["mentions_bug"].astype(int) * 4 + analysis["mentions_billing"].astype(int) * 3 + analysis["mentions_confusion"].astype(int) * 2 + analysis["sentiment_label"].eq("negative").astype(int) * 3 + analysis["late_response"].fillna(False).astype(int) + analysis["exclamation_count"].clip(upper=3) ) ``` Assign priority: ```python analysis["triage_priority"] = pd.cut( analysis["triage_score"], bins=[-1, 2, 5, 99], labels=["low", "medium", "high"] ) ``` Top messages to review: ```python analysis.sort_values("triage_score", ascending=False)[[ "feedback_id", "title", "sentiment_label", "triage_score", "triage_priority", "message_clean" ]].head(10) ``` ## 28. Create Final Reports Executive summary: ```python executive_summary = pd.DataFrame({ "metric": [ "feedback_rows", "unique_posts", "average_rating", "negative_feedback", "billing_mentions", "bug_mentions", "confusion_mentions", "average_response_hours", "late_responses" ], "value": [ analysis["feedback_id"].nunique(), analysis["post_id"].nunique(), round(analysis["rating"].mean(), 2), int(analysis["sentiment_label"].eq("negative").sum()), int(analysis["mentions_billing"].sum()), int(analysis["mentions_bug"].sum()), int(analysis["mentions_confusion"].sum()), round(analysis["response_hours"].mean(), 2), int(analysis["late_response"].sum()) ] }) executive_summary ``` Final reports dictionary: ```python reports = { "executive_summary": executive_summary, "post_quality": post_quality, "author_report": author_report, "channel_report": channel_report, "topic_by_series": topic_by_series, "daily_report": daily_report, "weekly_report": weekly_report, "multi_summary": multi_summary.reset_index(), } ``` Optional exports: ```python # executive_summary.to_csv("text_case_executive_summary.csv", index=False) # post_quality.to_csv("text_case_post_quality.csv", index=False) # daily_report.to_csv("text_case_daily_report.csv") ``` ## 29. Case Study Checklist Use this checklist for text-heavy Pandas projects: 1. Load all raw files. 2. Concatenate same-schema monthly files. 3. Check shapes, types, missing values, and duplicates. 4. Parse timestamps with `errors="coerce"`. 5. Audit joins before merging. 6. Create a clean text column. 7. Lowercase and strip text. 8. Remove HTML, URLs, punctuation, and repeated spaces. 9. Normalize common short forms. 10. Extract useful patterns with regex. 11. Create text-length features. 12. Tokenize with `.str.split()`. 13. Use `explode()` for one-token-per-row analysis. 14. Remove stopwords. 15. Count words and phrases. 16. Build a document-term matrix. 17. Match topic rules. 18. Merge metadata. 19. Build post, author, channel, and topic reports. 20. Use pivots for cross-tab reports. 21. Use `resample()` for time trends. 22. Use rolling metrics for smoothing. 23. Create triage scores. 24. Create export-ready final tables. ## 30. Practice Tasks Try these after completing the walkthrough. ### Task 1. Find the top 10 longest feedback messages. ```python analysis.sort_values("word_count", ascending=False)[[ "feedback_id", "title", "word_count", "message_clean" ]].head(10) ``` **Explanation** - Sorts a DataFrame named 'analysis' by the 'word_count' column in descending order to prioritize longer entries - Selects specific columns: 'feedback_id', 'title', 'word_count', and 'message_clean' for display - Returns only the top 10 rows with the highest word counts using the head() method - This approach helps quickly identify the most verbose feedback entries for detailed review - The resulting DataFrame shows the longest feedback messages along with their identifying information ### Task 2. Find posts with average rating below 3.5. ```python post_quality[post_quality["avg_rating"] < 3.5].sort_values("avg_rating") ``` **Explanation** - Filters the post_quality DataFrame to include only rows where avg_rating is less than 3.5 - Sorts the filtered results by avg_rating column in ascending order (lowest ratings first) - Returns a new DataFrame containing only poor-rated posts ordered from worst to best among the low-rated subset - Useful for identifying underperforming content that needs improvement or review - The chained operations allow for quick identification of posts requiring attention based on quality metrics ### Task 3. Find the most common token in negative feedback. ```python meaningful_tokens[meaningful_tokens["sentiment_label"] == "negative"]["token"] \ .value_counts() \ .head(10) ``` **Explanation** - Filters the meaningful_tokens DataFrame to include only rows where sentiment_label equals "negative" - Selects the token column from the filtered results - Counts the frequency of each unique token value using value_counts() - Returns the top 10 most frequently occurring tokens with negative sentiment labels - This analysis helps identify common words associated with negative sentiment in the dataset ### Task 4. Find messages that mention both bug and confusion. ```python analysis[analysis["mentions_bug"] & analysis["mentions_confusion"]][[ "feedback_id", "title", "message_clean" ]] ``` **Explanation** - The code filters a DataFrame named 'analysis' to include only rows where both "mentions_bug" and "mentions_confusion" columns contain True values - It then selects three specific columns: "feedback_id", "title", and "message_clean" from the filtered results - The boolean filtering uses the & operator to combine two conditions, ensuring both criteria must be met - This approach is commonly used for data analysis tasks where you need to isolate specific subsets of data based on multiple attributes - The result returns a new DataFrame containing only the specified columns for entries that satisfy both boolean conditions ### Task 5. Find average response time by sentiment. ```python analysis.groupby("sentiment_label", as_index=False)["response_hours"].mean().round(2) ``` **Explanation** - Groups data by the "sentiment_label" column to aggregate responses by sentiment category - Calculates the mean of "response_hours" for each sentiment group using the mean() function - Rounds the resulting averages to 2 decimal places for cleaner presentation - Returns a new DataFrame with sentiment labels and their corresponding average response times - The as_index=False parameter ensures the grouping column remains a regular column rather than becoming the index ### Task 6. Find top bigrams for the Pandas series. ```python pandas_feedback_ids = analysis[analysis["series"] == "Pandas"]["feedback_id"] bigrams[bigrams["feedback_id"].isin(pandas_feedback_ids)]["bigram"] \ .value_counts() \ .head(10) ``` **Explanation** - Filters the analysis dataframe to select only rows where the series column equals "Pandas" and extracts the corresponding feedback_id values - Uses the extracted feedback_ids to filter the bigrams dataframe for matching entries - Counts the frequency of each unique bigram in the filtered results - Returns the top 10 most frequently occurring bigrams from the Pandas series feedback data ### Task 7. Build a post type and sentiment pivot. ```python analysis.pivot_table( index="post_type", columns="sentiment_label", values="feedback_id", aggfunc="count", fill_value=0 ) ``` **Explanation** - Generates a cross-tabulation showing how feedback counts are distributed between different post types and sentiment labels - Uses post_type as row identifiers and sentiment_label as column headers for structured data analysis - Counts feedback_id occurrences within each combination of post type and sentiment category - Fills missing combinations with zero values to ensure complete dataset representation - Provides insights into which post types generate feedback with specific sentiment patterns ### Task 8. Find weekly negative feedback rate. ```python weekly_negative_rate = ( analysis_ts["sentiment_label"].eq("negative").resample("W").mean() * 100 ) weekly_negative_rate.round(2) ``` **Explanation** - Creates a time series analysis by counting negative sentiment occurrences and resampling to weekly intervals - Converts boolean True/False values to numeric 1/0 for calculation using eq("negative") method - Applies mean() function to compute the average proportion of negative sentiments per week - Multiplies by 100 to convert decimal proportions to percentage values - Rounds final results to 2 decimal places for improved readability ### Task 9. Find posts with high review priority and many confusion mentions. ```python post_quality.query("review_priority_score >= 5 and confusion_mentions > 0") \ .sort_values("review_priority_score", ascending=False) ``` **Explanation** - Filters customer reviews where the review priority score is 5 or higher and there are mentions of confusion - Sorts the filtered results by review priority score in descending order to show highest priority items first - Uses pandas query method to apply boolean conditions on the post_quality DataFrame - Returns a sorted subset of reviews that meet both quality and confusion criteria for prioritized attention ### Task 10. Build a term-frequency table by sentiment. ```python sentiment_token_counts = ( meaningful_tokens .groupby(["sentiment_label", "token"]) .size() .rename("count") .reset_index() .sort_values(["sentiment_label", "count"], ascending=[True, False]) ) sentiment_token_counts.groupby("sentiment_label").head(10) ``` **Explanation** - Groups meaningful tokens by sentiment label and token identity to count occurrences of each token within each sentiment category - Renames the count column for clarity and resets the index to create a clean dataframe structure - Sorts results by sentiment label ascending and token count descending to prioritize highest frequency tokens within each sentiment group - Uses groupby with head(10) to extract the top 10 most frequent tokens for each sentiment label, enabling comparative analysis across sentiment categories ## 31. Interview Questions From This Case Study ### 1. Why should text cleaning happen before token analysis? Cleaning removes inconsistent casing, symbols, URLs, and formatting noise, so token counts represent meaning instead of messy variants. ### 2. Why use `.str` methods instead of Python string methods directly? A Pandas Series is a column of values, not one string. `.str` applies string methods vectorized across the Series and handles missing values better. ### 3. Why use regex for HTML and URL removal? HTML tags and URLs follow patterns, so regex can remove many variants with one expression. ### 4. Why should stopword removal happen after tokenization? Stopwords are word-level items, so you need tokens before filtering them. ### 5. What does `explode()` do? It turns each list-like value into multiple rows, which is perfect for token-level analysis. ### 6. What is a document-term matrix? It is a table where each row is a document, each column is a token, and each value is the count of that token in that document. ### 7. What is TF-IDF? TF-IDF increases the importance of words that are common in one document but not common across all documents. ### 8. Why use rule-based topic matching? It gives a transparent first-pass classification system that analysts can inspect and improve. ### 9. Why audit joins before merging? Join audits reveal missing post IDs, missing author IDs, or feedback records that will lose context after a merge. ### 10. Why use `validate` in `merge()`? It checks that the relationship is what you expect, such as many feedback rows matching one post. ### 11. Why parse dates in a text case study? Feedback is text, but it also arrives over time. Dates let you analyze trends, response times, and weekly quality changes. ### 12. Why create a triage score? A triage score turns many signals into one review priority, helping teams decide what to handle first. ### 13. What is the limitation of simple keyword rules? They can miss context, sarcasm, synonyms, and complex intent. They are useful for transparent baseline analysis, not perfect understanding. ### 14. Why keep raw and cleaned text? Raw text preserves evidence. Cleaned text supports analysis. ### 15. Why is this a Pandas case study instead of an NLP project? The focus is on data wrangling, quality checks, feature engineering, aggregation, and reporting using Pandas. NLP models can come later. ## 32. Final Mental Model For textual data in Pandas, think in layers: | Layer | Pandas Tools | |---|---| | Ingest | `read_csv`, `concat` | | Inspect | `shape`, `info`, `describe`, `isna`, `duplicated` | | Clean | `.str.lower`, `.str.strip`, `.str.replace` | | Extract | `.str.extract`, `.str.contains`, regex | | Tokenize | `.str.split`, `explode` | | Count | `value_counts`, `groupby`, `size` | | Vectorize | `unstack`, document-term matrix, TF-IDF | | Enrich | `merge`, `validate`, `indicator` | | Time | `to_datetime`, Timedelta, `resample`, rolling | | Report | `agg`, `pivot_table`, sorted summaries | If you remember only one thing, remember this: Text analysis in Pandas is not just cleaning strings. It is a complete workflow for turning messy human language into structured signals that can be joined, counted, trended, and reported. ## Official References - Pandas working with text data user guide: https://pandas.pydata.org/docs/user_guide/text.html - Pandas `Series.str` API reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html - Pandas missing data user guide: https://pandas.pydata.org/docs/user_guide/missing_data.html - Pandas merging, joining, and concatenating user guide: https://pandas.pydata.org/docs/user_guide/merging.html - Pandas GroupBy user guide: https://pandas.pydata.org/docs/user_guide/groupby.html - Pandas time series user guide: https://pandas.pydata.org/docs/user_guide/timeseries.html