# Mastering Pandas: Vectorized String and DateTime Operations URL: https://madhudadi.in/blog/posts/pandas-vectorized-string-datetime-operations-guide Published: 2026-06-05 Tags: Pandas, python Read time: 50 min Difficulty: intermediate > Learn Pandas vectorized string operations and datetime tools with original CSV datasets. Clean names, emails, phone numbers, plan codes, and notes with Series.str, then parse dates, use Timestamp, date_range, to_datetime, dt accessors, time filtering, timedeltas, and resampling.# Pandas Vectorized String Operations and DateTime: Clean Text and Analyze Time Real datasets are rarely clean. You may receive columns like: - names with extra spaces - emails with mixed case - phone numbers with symbols - plan codes packed into one string - notes with inconsistent keywords - dates stored as plain text - event timestamps with invalid values Pandas gives you two powerful accessors for this kind of work: - `.str` for vectorized string operations - `.dt` for vectorized datetime operations This guide uses original sample datasets created for this lesson. It does not use copied passenger, finance, ecommerce, or public dataset examples. Files used in this lesson: - `pandas_string_profiles.csv` - `pandas_datetime_events.csv` Place both CSV files in the same folder as this Markdown file before running the examples. ## What You Will Learn By the end, you should be able to use: - vectorized operations - `Series.str.lower()`, `upper()`, `title()`, and `capitalize()` - `Series.str.strip()`, `lstrip()`, and `rstrip()` - `Series.str.len()` - `Series.str.split()` - `Series.str.get()` - `Series.str.extract()` - `Series.str.replace()` - `Series.str.contains()` - `Series.str.startswith()` and `endswith()` - `Series.str.isdigit()`, `isalpha()`, and `isalnum()` - string slicing with `.str[]` - `pd.Timestamp` - `pd.DatetimeIndex` - `pd.date_range()` - `pd.to_datetime()` - `errors="coerce"` - `.dt.year`, `.dt.month`, `.dt.month_name()`, `.dt.day_name()`, `.dt.hour` - date filtering - `pd.to_timedelta()` - resampling time series data ## 1. Setup ```python import pandas as pd import numpy as np ``` Load the datasets: ```python profiles = pd.read_csv("pandas_string_profiles.csv") events = pd.read_csv("pandas_datetime_events.csv") ``` Inspect them: ```python profiles.head() ``` ```python events.head() ``` ## 2. What Vectorized Operations Mean In NumPy, vectorized operations apply one operation to many values at once. ```python arr = np.array([10, 20, 30, 40]) arr * 2 ``` Pandas brings the same idea to columns. Without vectorized string methods, this kind of code can break: ```python names = [" diya rao ", "KABIR SEN", None, "meera iyer"] [name.strip().title() for name in names] ``` The `None` value causes a problem. Pandas handles missing values more safely: ```python s = pd.Series(names) s.str.strip().str.title() ``` The `.str` accessor applies string methods across the Series while preserving missing values. ## 3. Clean Text With Case Methods Start with raw profile names: ```python profiles["raw_name"].head() ``` Convert to lowercase: ```python profiles["raw_name"].str.lower() ``` Convert to uppercase: ```python profiles["raw_name"].str.upper() ``` Convert to title case: ```python profiles["raw_name"].str.title() ``` Clean the name with strip plus title: ```python profiles["clean_name"] = profiles["raw_name"].str.strip().str.title() profiles[["raw_name", "clean_name"]].head() ``` String methods can be chained when each step returns another Series. ## 4. Strip Extra Spaces Extra spaces are common in CSV files. ```python profiles["raw_name"].str.strip() ``` Remove only left-side spaces: ```python profiles["raw_name"].str.lstrip() ``` Remove only right-side spaces: ```python profiles["raw_name"].str.rstrip() ``` Clean city names: ```python profiles["city_clean"] = profiles["city"].str.strip().str.title() profiles[["city", "city_clean"]].head() ``` This is a simple but high-value cleaning step. ## 5. String Length Find name lengths: ```python profiles["clean_name"].str.len() ``` Add a length column: ```python profiles["name_length"] = profiles["clean_name"].str.len() ``` Find the longest name: ```python profiles.loc[ profiles["name_length"] == profiles["name_length"].max(), ["profile_id", "clean_name", "name_length"] ] ``` Use `.str.len()` for: - validating IDs - detecting suspiciously short names - finding long text fields - checking code formats ## 6. Split Strings Into Parts Clean the plan code: ```python profiles["plan_code"].head() ``` Each plan code follows this pattern: ```text TRACK-TIER-YEAR ``` Split it: ```python plan_parts = profiles["plan_code"].str.split("-", expand=True) plan_parts.head() ``` Name the columns: ```python profiles[["track_code", "tier_code", "plan_year"]] = plan_parts profiles[["plan_code", "track_code", "tier_code", "plan_year"]].head() ``` Convert year to number: ```python profiles["plan_year"] = profiles["plan_year"].astype(int) ``` Get only the first piece without expanding: ```python profiles["plan_code"].str.split("-").str.get(0) ``` `.str.get(0)` gets the first item from each split list. ## 7. Extract With Regex Splitting works when the delimiter is simple. Regex extraction is better when you need named pieces. Extract an email domain: ```python profiles["email_clean"] = profiles["email"].str.strip().str.lower() profiles["email_domain"] = profiles["email_clean"].str.extract( r"@(?P[^@\s]+)$" ) profiles[["email", "email_clean", "email_domain"]].head() ``` Extract name title, first name, and last name: ```python name_parts = profiles["clean_name"].str.extract( r"^(?:(?PMr|Ms|Dr|Prof)\s+)?(?P<first_name>[A-Za-z]+)\s+(?P<last_name>[A-Za-z]+)$" ) name_parts.head() ``` Join the extracted columns back: ```python profiles = pd.concat([profiles, name_parts], axis=1) profiles[["clean_name", "title", "first_name", "last_name"]].head() ``` Use `.str.extract()` when you want a clean DataFrame of pattern groups. ## 8. Replace Text Standardize plan tiers: ```python profiles["tier_code"].value_counts() ``` ```python profiles["tier_label"] = profiles["tier_code"].str.replace("PRO", "Premium", regex=False) profiles["tier_label"] = profiles["tier_label"].str.replace("FREE", "Free", regex=False) profiles["tier_label"] = profiles["tier_label"].str.replace("TRIAL", "Trial", regex=False) ``` Use `regex=False` when you want a literal replacement. Clean phone numbers by keeping only digits: ```python profiles["phone_digits"] = profiles["phone"].str.replace(r"\D", "", regex=True) profiles[["phone", "phone_digits"]].head() ``` Now check digit counts: ```python profiles["phone_digits"].str.len().value_counts(dropna=False) ``` ## 9. Filter With Contains, Startswith, And Endswith Find profiles related to Pandas: ```python profiles[profiles["plan_code"].str.contains("PANDAS", case=False, na=False)] ``` Find Gmail users: ```python profiles[profiles["email_clean"].str.endswith("@gmail.com", na=False)] ``` Find learners whose first name starts with `A`: ```python profiles[profiles["first_name"].str.startswith("A", na=False)] ``` Find notes that mention refund: ```python profiles[profiles["support_note"].str.contains("refund", case=False, na=False)] ``` Always think about missing values. Use `na=False` when missing text should behave like "not matched". ## 10. Character Checks Check whether a value contains only digits: ```python profiles["phone_digits"].str.isdigit() ``` Check whether first names are alphabetic: ```python profiles["first_name"].str.isalpha() ``` Check whether plan codes are alphanumeric after removing hyphens: ```python profiles["plan_code"].str.replace("-", "", regex=False).str.isalnum() ``` These checks are useful for validation. ## 11. String Slicing Get the first three characters of the plan code: ```python profiles["plan_code"].str[:3] ``` Get the last four characters: ```python profiles["plan_code"].str[-4:] ``` Reverse cleaned names: ```python profiles["clean_name"].str[::-1].head() ``` String slicing with `.str[]` behaves like Python slicing, but across the whole Series. ## 12. Create Display Labels Use `.str.cat()` to combine text columns. ```python profiles["display_label"] = ( profiles["first_name"] .str.cat(profiles["last_name"], sep=" ") .str.cat(profiles["tier_label"], sep=" - ") ) profiles[["profile_id", "display_label"]].head() ``` This is useful when building labels for reports, dropdowns, and dashboards. ## 13. Timestamp Objects A `Timestamp` represents one point in time. ```python pd.Timestamp("2026-02-14") ``` Create a timestamp with time: ```python pd.Timestamp("2026-02-14 17:42:00") ``` Read timestamp attributes: ```python ts = pd.Timestamp("2026-02-14 17:42:00") ts.year, ts.month, ts.day, ts.hour, ts.minute ``` Pandas timestamps work well with Series and DataFrames. ## 14. DatetimeIndex A `DatetimeIndex` is a collection of timestamps. ```python date_index = pd.DatetimeIndex([ "2026-01-01", "2026-02-01", "2026-03-01" ]) date_index ``` Use it as a Series index: ```python monthly_visits = pd.Series([1200, 1800, 2400], index=date_index) monthly_visits ``` Now the index understands time. ## 15. Generate Dates With `date_range()` Daily dates: ```python pd.date_range(start="2026-01-01", end="2026-01-07", freq="D") ``` Every 2 days: ```python pd.date_range(start="2026-01-01", end="2026-01-15", freq="2D") ``` Business days: ```python pd.date_range(start="2026-01-01", end="2026-01-15", freq="B") ``` Every 6 hours: ```python pd.date_range(start="2026-01-01", periods=8, freq="6h") ``` Month starts: ```python pd.date_range(start="2026-01-01", periods=6, freq="MS") ``` Use `date_range()` when you need a planned calendar, missing-date scaffold, or regular time index. ## 16. Convert Text To Datetime The event dataset stores time as text. ```python events.info() ``` Convert it: ```python events["event_at"] = pd.to_datetime(events["event_time"], errors="coerce") ``` Check invalid timestamps: ```python events[events["event_at"].isna()] ``` `errors="coerce"` turns invalid dates into `NaT`. `NaT` means "not a time", similar to missing datetime. ## 17. Use The `.dt` Accessor Extract year: ```python events["event_at"].dt.year ``` Extract month number: ```python events["event_at"].dt.month ``` Extract month name: ```python events["event_at"].dt.month_name() ``` Extract day name: ```python events["event_at"].dt.day_name() ``` Extract hour: ```python events["event_at"].dt.hour ``` Add useful columns: ```python events["event_date"] = events["event_at"].dt.date events["event_month"] = events["event_at"].dt.month_name() events["event_day_name"] = events["event_at"].dt.day_name() events["event_hour"] = events["event_at"].dt.hour ``` Now the event log is ready for time-based analysis. ## 18. Date Flags Check month start and month end: ```python events["is_month_start"] = events["event_at"].dt.is_month_start events["is_month_end"] = events["event_at"].dt.is_month_end ``` Check quarter start and quarter end: ```python events["is_quarter_start"] = events["event_at"].dt.is_quarter_start events["is_quarter_end"] = events["event_at"].dt.is_quarter_end ``` View month-end events: ```python events[events["is_month_end"]] ``` These flags help in reporting and finance-style analysis. ## 19. Filter By Date Filter events after February 1: ```python events[events["event_at"] >= pd.Timestamp("2026-02-01")] ``` Filter events between two dates: ```python events[ events["event_at"].between( pd.Timestamp("2026-02-01"), pd.Timestamp("2026-02-28 23:59:59") ) ] ``` Filter weekend events: ```python events[events["event_at"].dt.dayofweek >= 5] ``` In Pandas: - Monday is `0` - Sunday is `6` ## 20. Timedelta Basics Convert duration seconds into timedeltas: ```python events["session_duration"] = pd.to_timedelta(events["duration_seconds"], unit="s") events[["duration_seconds", "session_duration"]].head() ``` Convert duration to minutes: ```python events["duration_minutes"] = events["session_duration"].dt.total_seconds() / 60 ``` Round it: ```python events["duration_minutes"] = events["duration_minutes"].round(2) events[["event_id", "duration_minutes"]].head() ``` Timedeltas are useful for session length, response time, age of record, and retention analysis. ## 21. Recency Analysis Choose a fixed analysis date: ```python analysis_date = pd.Timestamp("2026-03-15") ``` Find each profile's latest valid event: ```python latest_event = events.groupby("profile_id")["event_at"].max().reset_index() latest_event = latest_event.rename(columns={"event_at": "last_event_at"}) ``` Merge with profiles: ```python profile_recency = profiles.merge(latest_event, on="profile_id", how="left") profile_recency["days_since_last_event"] = ( analysis_date - profile_recency["last_event_at"] ).dt.days profile_recency[["profile_id", "clean_name", "last_event_at", "days_since_last_event"]] ``` This is a common product analytics pattern. ## 22. Group By Date Parts Events by day name: ```python events.groupby("event_day_name")["event_id"].count() ``` Events by hour: ```python events.groupby("event_hour")["event_id"].count().sort_index() ``` Revenue by month: ```python events.groupby("event_month")["amount"].sum() ``` Average duration by event type: ```python events.groupby("event_type")["duration_minutes"].mean().round(2) ``` The `.dt` accessor often prepares columns for normal `groupby()`. ## 23. Resample Time Series Data To use `resample()`, set a datetime column as the index. ```python events_by_time = events.dropna(subset=["event_at"]).set_index("event_at").sort_index() ``` Daily event count: ```python events_by_time["event_id"].resample("D").count() ``` Weekly amount: ```python events_by_time["amount"].resample("W").sum() ``` Monthly event type count: ```python events_by_time.groupby("event_type")["event_id"].resample("MS").count() ``` Use `resample()` when your rows are time-indexed and you want regular time buckets. ## 24. Mini Project 1: Clean Profile Table Build a clean profile table with: - profile ID - clean name - first name - last name - clean email - email domain - clean city - track code - tier label - phone digits Solution: ```python clean_profiles = profiles.copy() clean_profiles["clean_name"] = clean_profiles["raw_name"].str.strip().str.title() clean_profiles["email_clean"] = clean_profiles["email"].str.strip().str.lower() clean_profiles["city_clean"] = clean_profiles["city"].str.strip().str.title() clean_profiles["phone_digits"] = clean_profiles["phone"].str.replace(r"\D", "", regex=True) clean_profiles[["track_code", "tier_code", "plan_year"]] = ( clean_profiles["plan_code"].str.split("-", expand=True) ) clean_profiles["tier_label"] = clean_profiles["tier_code"].replace({ "PRO": "Premium", "FREE": "Free", "TRIAL": "Trial" }) name_parts = clean_profiles["clean_name"].str.extract( r"^(?:(?P<title>Mr|Ms|Dr|Prof)\s+)?(?P<first_name>[A-Za-z]+)\s+(?P<last_name>[A-Za-z]+)$" ) clean_profiles = pd.concat([clean_profiles, name_parts], axis=1) clean_profiles["email_domain"] = clean_profiles["email_clean"].str.extract(r"@([^@\s]+)$") clean_profiles[[ "profile_id", "clean_name", "first_name", "last_name", "email_clean", "email_domain", "city_clean", "track_code", "tier_label", "phone_digits" ]] ``` ## 25. Mini Project 2: Event Activity Report Build a report with: - valid event timestamp - event month - event day name - event hour - duration minutes - revenue amount Solution: ```python activity_report = events.copy() activity_report["event_at"] = pd.to_datetime(activity_report["event_time"], errors="coerce") activity_report = activity_report.dropna(subset=["event_at"]) activity_report["event_month"] = activity_report["event_at"].dt.month_name() activity_report["event_day_name"] = activity_report["event_at"].dt.day_name() activity_report["event_hour"] = activity_report["event_at"].dt.hour activity_report["duration_minutes"] = ( pd.to_timedelta(activity_report["duration_seconds"], unit="s").dt.total_seconds() / 60 ).round(2) activity_report[[ "event_id", "profile_id", "event_type", "event_month", "event_day_name", "event_hour", "duration_minutes", "amount" ]].head() ``` ## 26. Mini Project 3: Weekly Revenue And Engagement Build a weekly table with: - event count - total amount - average session duration Solution: ```python weekly_report = ( activity_report .set_index("event_at") .resample("W") .agg( event_count=("event_id", "count"), total_amount=("amount", "sum"), average_duration=("duration_minutes", "mean") ) ) weekly_report["average_duration"] = weekly_report["average_duration"].round(2) weekly_report ``` ## 27. Common Mistakes ### Mistake 1: Using Python string methods directly on a Series This is wrong: ```python profiles["raw_name"].strip() ``` **Explanation** - Removes leading and trailing whitespace characters from the 'raw_name' column values - Returns a new string value without any surrounding spaces, tabs, or newline characters - Commonly used for data cleaning to standardize text formatting before processing - Preserves internal spaces within names while eliminating unwanted boundary whitespace - Typically applied as part of data preprocessing pipelines to ensure consistent data quality Use `.str`: ```python profiles["raw_name"].str.strip() ``` **Explanation** - The code removes leading and trailing whitespace characters from the "raw_name" column in a pandas DataFrame - Uses the str.strip() method which eliminates empty spaces, tabs, and newline characters from both ends of strings - This operation is commonly used for data cleaning to standardize text formatting before analysis - The method returns a new Series with cleaned string values while preserving the original DataFrame structure - Useful for eliminating data inconsistencies caused by accidental spacing during data entry or import processes ### Mistake 2: Forgetting missing values in filters This may fail when values are missing: ```python profiles[profiles["support_note"].str.contains("refund")] ``` **Explanation** - This code filters a DataFrame called 'profiles' to return only rows where the 'support_note' column contains the substring "refund" - The str.contains() method performs a case-sensitive search for the specified pattern within each string value - The result is a new DataFrame containing only the matching rows that have "refund" anywhere in their support_note field - This filtering approach is commonly used for data analysis tasks like identifying customer service cases related to refunds - The operation preserves all columns from the original DataFrame while narrowing down to relevant records based on text content Use: ```python profiles[profiles["support_note"].str.contains("refund", case=False, na=False)] ``` **Explanation** - Uses pandas boolean indexing to filter rows where the "support_note" column contains the word "refund" regardless of case sensitivity - The case=False parameter ensures the search is case-insensitive, matching "Refund", "REFUND", or "refund" - The na=False parameter handles missing values by treating them as False rather than raising an error - This operation returns a filtered DataFrame containing only profiles with refund-related support notes - Commonly used for customer service analysis to identify cases requiring refund processing ### Mistake 3: Forgetting `regex=False` for literal replacement When replacing exact text, use: ```python profiles["tier_code"].str.replace("PRO", "Premium", regex=False) ``` **Explanation** - This code performs a string replacement on the 'tier_code' column in the profiles DataFrame - It substitutes all occurrences of "PRO" with "Premium" without using regular expressions - The regex=False parameter ensures literal string matching rather than pattern matching - This transformation standardizes tier naming conventions by converting abbreviated codes to full labels - The operation returns a new Series with the modified string values while preserving the original DataFrame structure ### Mistake 4: Using `.dt` before converting to datetime This fails if the column is still text: ```python events["event_time"].dt.year ``` **Explanation** - Accesses the year attribute of datetime data using pandas' dt accessor for efficient time-based data extraction - Enables temporal analysis by isolating specific year values from full datetime timestamps - Returns a Series containing only the year integers, useful for grouping, filtering, or aggregating time-series data - Commonly used in data preprocessing workflows for time-based feature engineering and chronological analysis Convert first: ```python events["event_at"] = pd.to_datetime(events["event_time"], errors="coerce") events["event_at"].dt.year ``` **Explanation** - The first line converts the 'event_time' column values to datetime objects using pd.to_datetime() with error handling that coerces invalid parsing attempts into NaT (Not a Time) values - The second line accesses the year component from the converted datetime objects using the .dt accessor and .year attribute - This approach handles potential data quality issues by gracefully managing malformed date strings - The operation returns a pandas Series containing only the extracted year integers from valid datetime entries - This pattern is commonly used for time-based data analysis and grouping operations ### Mistake 5: Ignoring invalid dates Use: ```python pd.to_datetime(events["event_time"], errors="coerce") ``` **Explanation** - Converts the event_time column from string or other formats to proper datetime objects using pandas to_datetime function - The errors="coerce" parameter handles invalid date values by converting them to NaT (Not a Time) instead of raising an exception - This approach ensures data cleaning and consistency when working with temporal data in pandas DataFrames - Useful for time-series analysis and chronological sorting of event data - Prevents data type conversion errors that would occur with malformed timestamp entries Then inspect `NaT` rows. ## 28. Practice Questions Try these before looking at the solutions. ### Q1. Create a clean lowercase email column. ```python profiles["email_clean"] = profiles["email"].str.strip().str.lower() profiles[["email", "email_clean"]].head() ``` **Explanation** - The code creates a new column "email_clean" by applying string methods to the existing "email" column - First .str.strip() removes leading and trailing whitespace from each email address - Then .str.lower() converts all characters to lowercase for consistent formatting - The final .head() displays the original and cleaned email columns side-by-side for verification - This approach ensures standardized email data format for better data processing and matching ### Q2. Extract the domain from each email. ```python profiles["email_domain"] = profiles["email_clean"].str.extract(r"@([^@\s]+)$") profiles["email_domain"].value_counts(dropna=False) ``` **Explanation** - The first line uses pandas string methods to extract the domain portion of email addresses by capturing text after the @ symbol until the end of the string - The regex pattern r"@([^@\s]+)$" captures one or more characters that are not @ or whitespace, starting after @ and ending at the string end - The second line displays frequency counts of all unique email domains including null values to show distribution across different email providers - This approach efficiently categorizes users by their email service providers for analytics and segmentation purposes - The extracted domains can be used for targeted marketing campaigns or identifying user demographics based on email provider trends ### Q3. Find all profiles whose support note mentions invoice. ```python profiles[profiles["support_note"].str.contains("invoice", case=False, na=False)] ``` **Explanation** - Uses pandas DataFrame boolean indexing to filter rows where the "support_note" column contains the substring "invoice" - The str.contains() method performs case-insensitive matching with case=False parameter - Handles missing values gracefully with na=False parameter, treating NaN values as False rather than raising errors - Returns a new DataFrame containing only rows that match the filtering criteria - Commonly used for data cleaning and analysis tasks involving text-based filtering ### Q4. Split plan code into track, tier, and year. ```python profiles[["track_code", "tier_code", "plan_year"]] = profiles["plan_code"].str.split("-", expand=True) profiles[["plan_code", "track_code", "tier_code", "plan_year"]].head() ``` **Explanation** - The code uses pandas str.split() method to decompose the plan_code column into four separate columns using the hyphen delimiter - The expand=True parameter ensures the split results are returned as separate columns rather than a single Series of lists - The assignment creates three new columns (track_code, tier_code, plan_year) by splitting the original plan_code values at each hyphen character - The final line displays the first few rows of all four columns to verify the successful transformation and data extraction ### Q5. Convert event time into datetime and find invalid rows. ```python events["event_at"] = pd.to_datetime(events["event_time"], errors="coerce") events[events["event_at"].isna()] ``` **Explanation** - Converts the event_time column to datetime format using pd.to_datetime() with error handling that coerces invalid parsing attempts to NaT (Not a Time) values - The errors="coerce" parameter ensures that any malformed date strings are converted to null values instead of raising exceptions - Filters and displays only the rows where the conversion resulted in null values, helping identify problematic timestamp entries - This approach is useful for data cleaning and validation when working with potentially inconsistent datetime data from external sources ### Q6. Count events by day name. ```python events["day_name"] = events["event_at"].dt.day_name() events.groupby("day_name")["event_id"].count() ``` **Explanation** - Extracts the day name from a datetime column using the dt accessor and assigns it as a new column called "day_name" - Groups the dataframe by the newly created "day_name" column to aggregate events by weekday - Counts the number of unique event IDs for each day to determine event frequency distribution - This pattern enables time-based analysis to identify peak activity days in event data ### Q7. Find events that happened after 6 PM. ```python events[events["event_at"].dt.hour >= 18] ``` **Explanation** - Filters a DataFrame named events to include only rows where the event occurred after 6 PM (hour >= 18) - Uses the dt accessor to extract hour component from the datetime column "event_at" - The comparison operator >= selects events from 6 PM onwards including the exact 6 PM timestamp - This operation returns a new DataFrame containing only the filtered evening/night events - Commonly used for time-based data analysis to isolate specific time periods of interest ### Q8. Find total amount by month. ```python events["month"] = events["event_at"].dt.month_name() events.groupby("month")["amount"].sum() ``` **Explanation** - Extracts the full month name from a datetime column named "event_at" and stores it in a new "month" column - Groups all events by the newly created month column to aggregate data by time periods - Calculates the sum of monetary amounts for each month to show total spending or revenue per month - This pattern enables easy monthly trend analysis and reporting of financial data ### Q9. Convert duration seconds into minutes. ```python events["duration_minutes"] = ( pd.to_timedelta(events["duration_seconds"], unit="s").dt.total_seconds() / 60 ).round(2) events[["event_id", "duration_seconds", "duration_minutes"]].head() ``` **Explanation** - Converts the duration_seconds column to timedelta format using pd.to_timedelta with seconds unit - Extracts total seconds from the timedelta object and divides by 60 to convert to minutes - Rounds the resulting minutes value to 2 decimal places for precision - Displays the first few rows showing event_id, original duration in seconds, and converted duration in minutes ### Q10. Create daily event counts with resample. ```python events.dropna(subset=["event_at"]).set_index("event_at")["event_id"].resample("D").count() ``` **Explanation** - The code removes rows with missing values in the "event_at" column using dropna() to ensure valid timestamps - It sets the "event_at" column as the DataFrame index to enable time-based operations and grouping - The event_id column is selected and resampled on a daily frequency ("D") to aggregate events by date - The count() method aggregates the number of events per day by counting non-null values in each time period - This creates a time series showing daily event counts for analysis and visualization purposes ## 29. Interview Questions ### 1. What is vectorization? Vectorization means applying an operation to an entire array or Series at once instead of writing a manual loop for each value. ### 2. Why do Pandas string methods use `.str`? Because a Series is not a single Python string. `.str` exposes vectorized string methods for each value in the Series. ### 3. Why is `.str.contains(..., na=False)` useful? It treats missing values as non-matches, which prevents filtering errors when text contains nulls. ### 4. What is the difference between `split()` and `extract()`? `split()` divides text using a delimiter. `extract()` pulls regex groups into columns. ### 5. Why use `regex=False` in `str.replace()`? Use it when replacing literal text so Pandas does not interpret the pattern as a regular expression. ### 6. What is a Pandas `Timestamp`? It is Pandas' representation of one moment in time. ### 7. What is `DatetimeIndex`? It is an index made of datetime values, useful for time series selection and resampling. ### 8. What does `pd.to_datetime(..., errors="coerce")` do? It converts valid date strings into datetime values and invalid ones into `NaT`. ### 9. What does the `.dt` accessor do? It exposes vectorized datetime properties and methods such as year, month, day name, hour, and month-end flags. ### 10. What is `resample()` used for? It groups time-indexed data into regular time buckets such as daily, weekly, or monthly periods. ### 11. What is a timedelta? A timedelta represents a duration, such as 45 seconds, 2 hours, or 10 days. ### 12. Why should you inspect invalid dates after conversion? Because invalid strings become `NaT` when using `errors="coerce"`, and those rows may affect analysis if ignored. ## 30. Final Mental Model Use this table: | Need | Use | |---|---| | Clean text values in a Series | `.str` | | Remove spaces | `.str.strip()` | | Standardize case | `.str.lower()`, `.str.title()` | | Split codes into columns | `.str.split(..., expand=True)` | | Extract regex groups | `.str.extract()` | | Filter text by pattern | `.str.contains()` | | Convert text to datetime | `pd.to_datetime()` | | Generate regular dates | `pd.date_range()` | | Read parts of datetime | `.dt.year`, `.dt.month_name()`, `.dt.day_name()` | | Convert seconds to duration | `pd.to_timedelta()` | | Group time-indexed data | `.resample()` | If you remember only one thing, remember this: Use `.str` before text analysis and `.dt` before time analysis. Clean text and parsed time are the foundation of reliable Pandas workflows. ## Official References - Pandas working with text data user guide: https://pandas.pydata.org/docs/user_guide/text.html - Pandas time series user guide: https://pandas.pydata.org/docs/user_guide/timeseries.html - Pandas `Series.str` API reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html - Pandas `to_datetime` API reference: https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html - Pandas `date_range` API reference: https://pandas.pydata.org/docs/reference/api/pandas.date_range.html