EDA Data Gathering: SQL, APIs, Web Scraping, Files, and the Data Analysis Process
Exploratory Data Analysis starts before charts.
The first job is to gather reliable data from the places where it lives:
- CSV files
- Excel workbooks
- SQL databases
- APIs
- web pages
- JSON exports
- internal application logs
This guide teaches the data gathering stage with original synthetic examples. It does not use copied SQL dumps, API keys, external spreadsheets, scraped ecommerce pages, or proprietary course data.
Files Used In This Guide
Use these CSV files:
eda_invoices.csveda_order_leads.csveda_city_metrics.csveda_product_listing_snapshot.csv
Place them in the same folder as your notebook or script.
If you keep them in a data/ folder, update the paths:
invoices = pd.read_csv("data/eda_invoices.csv")
leads = pd.read_csv("data/eda_order_leads.csv")
city_metrics = pd.read_csv("data/eda_city_metrics.csv")
products = pd.read_csv("data/eda_product_listing_snapshot.csv")What You Will Learn
By the end, you should be able to:
- explain where data gathering fits in the data analysis process
- load raw CSV extracts safely
- read data from a SQL database into Pandas
- export multiple tables into one Excel workbook
- design an API collection loop with pagination
- store API data into SQL
- parse product-like HTML with BeautifulSoup
- save scraped data to JSON
- maintain a source log for reproducibility
- run basic data quality checks before analysis
1. The Data Analysis Process
A practical data analysis workflow has six stages:
- Define the question.
- Gather the data.
- Assess the data.
- Clean and transform the data.
- Explore and model the data.
- Communicate the result.
This guide focuses on stage 2 and the beginning of stage 3.
Data gathering is not just downloading files. It is the discipline of knowing:
- what was collected
- where it came from
- when it was collected
- which fields exist
- which assumptions were made
- whether the collection can be repeated
2. Setup
Install the common libraries:
pip install pandas openpyxl sqlalchemy beautifulsoup4 lxml requestsImport the libraries:
import json
import sqlite3
import time
from io import StringIO
import pandas as pd
import requests
from bs4 import BeautifulSoupFor this guide, every example can run offline except the optional API pattern. The API section uses simulated pages so you do not need accounts, keys, or live network calls.
3. Load Raw CSV Extracts
Start with simple file-based gathering.
invoices = pd.read_csv("eda_invoices.csv")
leads = pd.read_csv("eda_order_leads.csv")
city_metrics = pd.read_csv("eda_city_metrics.csv")
products = pd.read_csv("eda_product_listing_snapshot.csv")
print(invoices.head())
print(leads.head())
print(city_metrics.head())
print(products.head())Check shape and columns:
for name, df in {
"invoices": invoices,
"leads": leads,
"city_metrics": city_metrics,
"products": products,
}.items():
print(name)
print("shape:", df.shape)
print("columns:", list(df.columns))
print()Raw file gathering should always answer two questions:
- Did I load the expected number of rows?
- Did I get the expected columns?
4. Create A Source Log
A source log records collection metadata.
source_log = pd.DataFrame(
[
{
"source_name": "eda_invoices.csv",
"source_type": "csv",
"owner": "finance_ops",
"refresh_frequency": "daily",
"loaded_rows": len(invoices),
},
{
"source_name": "eda_order_leads.csv",
"source_type": "csv",
"owner": "growth_team",
"refresh_frequency": "daily",
"loaded_rows": len(leads),
},
{
"source_name": "eda_city_metrics.csv",
"source_type": "simulated_api",
"owner": "market_research",
"refresh_frequency": "monthly",
"loaded_rows": len(city_metrics),
},
{
"source_name": "eda_product_listing_snapshot.csv",
"source_type": "simulated_scrape",
"owner": "catalog_team",
"refresh_frequency": "weekly",
"loaded_rows": len(products),
},
]
)
source_logThis is a small habit with a large payoff. When a project grows, a source log prevents confusion.
5. Read Data From SQL
In real projects, data often comes from databases.
This example uses SQLite so it runs locally without a database server.
conn = sqlite3.connect("eda_demo.db")
invoices.to_sql("invoices", conn, if_exists="replace", index=False)
leads.to_sql("order_leads", conn, if_exists="replace", index=False)
city_metrics.to_sql("city_metrics", conn, if_exists="replace", index=False)Now query those tables back into Pandas:
paid_invoices = pd.read_sql_query(
"""
SELECT
invoice_id,
order_id,
city,
category,
amount,
payment_status
FROM invoices
WHERE payment_status = 'Paid'
""",
conn,
)
paid_invoices.head()Use SQL when filtering or joining inside the database is more efficient than loading everything into memory.
6. Join SQL Tables
Join invoices with leads by order_id.
orders_with_leads = pd.read_sql_query(
"""
SELECT
i.invoice_id,
i.order_id,
i.invoice_date,
i.city,
i.category,
i.amount,
i.payment_status,
l.source,
l.campaign,
l.lead_score,
l.converted
FROM invoices AS i
LEFT JOIN order_leads AS l
ON i.order_id = l.order_id
""",
conn,
)
orders_with_leads.head()After joining, verify row counts:
print("invoices:", len(invoices))
print("joined:", len(orders_with_leads))If a join unexpectedly increases row count, you may have duplicate keys on one side.
7. Export Multiple Tables To Excel
Sometimes stakeholders want one workbook with multiple sheets.
with pd.ExcelWriter("eda_gathered_sources.xlsx", engine="openpyxl") as writer:
invoices.to_excel(writer, sheet_name="invoices", index=False)
leads.to_excel(writer, sheet_name="order_leads", index=False)
city_metrics.to_excel(writer, sheet_name="city_metrics", index=False)
source_log.to_excel(writer, sheet_name="source_log", index=False)This creates one Excel file with separate tabs.
Use clear sheet names. Avoid spaces and very long names when the workbook will be used by scripts later.
8. API Gathering Concepts
APIs usually return JSON.
Real API collection often involves:
- endpoint URL
- query parameters
- authentication header
- pagination
- rate limits
- retries
- schema changes
Never hard-code secret API keys inside notebooks. Use environment variables or a secret manager.
Example pattern:
import os
api_key = os.environ.get("CITY_API_KEY")
headers = {
"Authorization": f"Bearer {api_key}",
}The rest of this guide uses simulated API pages so the code is safe to publish.
9. Simulate Paginated API Pages
Turn the city metrics CSV into API-like pages.
api_pages = []
for page_number, page_df in city_metrics.groupby("api_page"):
api_pages.append(
{
"page": int(page_number),
"has_next": page_number < city_metrics["api_page"].max(),
"data": page_df.drop(columns=["api_page"]).to_dict(orient="records"),
}
)
api_pages[0]This structure behaves like a paginated API response:
- metadata lives at the page level
- records live inside
data has_nexttells the loop whether to continue
10. Collect All API Pages
Use a loop that stops when there are no more pages.
all_city_records = []
for page in api_pages:
all_city_records.extend(page["data"])
print(f"Collected page {page['page']}: {len(page['data'])} rows")
if not page["has_next"]:
break
time.sleep(0.2)
cities_from_api = pd.DataFrame(all_city_records)
cities_from_api.head()In a real API loop:
- sleep between requests if the API asks you to
- handle non-200 status codes
- log failed pages
- retry temporary failures
- stop before rate limits are exceeded
11. Store API Data Into SQL
After gathering API data, store it in a database table.
cities_from_api.to_sql(
"api_city_metrics",
conn,
if_exists="replace",
index=False,
)
pd.read_sql_query("SELECT * FROM api_city_metrics LIMIT 5", conn)This makes the gathered API data available for later joins.
12. API Collection Checklist
Before trusting API data, check:
print(cities_from_api.shape)
print(cities_from_api.isna().sum())
print(cities_from_api.duplicated(subset=["city_id"]).sum())API data can fail in quiet ways:
- missing pages
- duplicated pages
- changed field names
- nested JSON that was flattened incorrectly
- numbers returned as strings
13. Web Scraping Safety
Web scraping should be done carefully.
Before scraping a site:
- read the site's terms
- check
robots.txt - avoid private or sensitive data
- use a respectful delay
- do not bypass anti-bot systems
- stop if the site blocks or rate-limits you
For a public blog post, it is better to teach scraping with synthetic HTML.
14. Parse One HTML Page
Here is a tiny product listing page.
html_page_1 = """
<html>
<body>
<div class="product-card">
<img src="https://example.com/notebook-pro.jpg" alt="Notebook Pro 14">
<h2 class="name">Notebook Pro 14</h2>
<span class="rating">4.5</span>
<span class="reviews">312 reviews</span>
<span class="discounted">58990</span>
<span class="actual">72990</span>
</div>
<div class="product-card">
<img src="https://example.com/studytab.jpg" alt="StudyTab 10">
<h2 class="name">StudyTab 10</h2>
<span class="rating">4.2</span>
<span class="reviews">140 reviews</span>
<span class="discounted">17990</span>
<span class="actual">22990</span>
</div>
</body>
</html>
"""Parse it:
soup = BeautifulSoup(html_page_1, "lxml")
cards = soup.select(".product-card")
records = []
for card in cards:
records.append(
{
"image_url": card.select_one("img").get("src"),
"product_name": card.select_one(".name").get_text(strip=True),
"rating": float(card.select_one(".rating").get_text(strip=True)),
"reviews": card.select_one(".reviews").get_text(strip=True),
"discounted_price": int(card.select_one(".discounted").get_text(strip=True)),
"actual_price": int(card.select_one(".actual").get_text(strip=True)),
}
)
scraped_sample = pd.DataFrame(records)
scraped_sampleThe key idea is to select stable HTML elements and extract text or attributes.
15. Scrape Multiple Simulated Pages
Use the included product CSV to simulate pages gathered from a listing.
pages = []
for page_number, page_df in products.groupby("page"):
pages.append(page_df.to_dict(orient="records"))
scraped_records = []
for page_number, page_records in enumerate(pages, start=1):
print(f"Collecting page {page_number}")
scraped_records.extend(page_records)
time.sleep(0.2)
scraped_products = pd.DataFrame(scraped_records)
scraped_products.head()This is the same control flow you would use for real pagination, but without touching a live website.
16. Save Scraped Data To JSON
JSON is useful when downstream systems expect nested or semi-structured records.
scraped_products.to_json(
"eda_product_listing_snapshot.json",
orient="records",
indent=2,
)Read it back:
products_from_json = pd.read_json("eda_product_listing_snapshot.json")
products_from_json.head()Always test that your saved file can be read back correctly.
17. Normalize Raw Data Types
Gathered data often arrives with wrong data types.
invoices["invoice_date"] = pd.to_datetime(invoices["invoice_date"])
leads["lead_date"] = pd.to_datetime(leads["lead_date"])
invoices["amount"] = pd.to_numeric(invoices["amount"], errors="coerce")
products["discounted_price"] = pd.to_numeric(products["discounted_price"], errors="coerce")
products["actual_price"] = pd.to_numeric(products["actual_price"], errors="coerce")Check the result:
print(invoices.dtypes)
print(products.dtypes)Do this before analysis. Charts and aggregations can be wrong if numbers are still strings.
18. Basic Quality Checks
Create a reusable summary function.
def quality_report(df, key_columns=None):
key_columns = key_columns or []
report = {
"rows": len(df),
"columns": len(df.columns),
"missing_cells": int(df.isna().sum().sum()),
}
for key in key_columns:
report[f"duplicate_{key}"] = int(df.duplicated(subset=[key]).sum())
return reportRun it:
quality_report(invoices, key_columns=["invoice_id"])quality_report(leads, key_columns=["lead_id"])quality_report(products, key_columns=["product_id"])Quality checks should happen before serious EDA.
19. Validate Expected Categories
Check categorical columns against known values.
expected_payment_status = {"Paid", "Pending", "Failed"}
actual_payment_status = set(invoices["payment_status"].unique())
unexpected = actual_payment_status - expected_payment_status
print("Unexpected payment statuses:", unexpected)Validate product stock values:
expected_stock_values = {"Yes", "No"}
actual_stock_values = set(products["in_stock"].unique())
print("Unexpected stock values:", actual_stock_values - expected_stock_values)This catches spelling drift like paid, PAID, pendng, or blank values.
20. Combine Gathered Sources
Join leads, invoices, and city metrics.
orders = invoices.merge(
leads,
on=["order_id", "city"],
how="left",
validate="one_to_one",
)
orders = orders.merge(
city_metrics[["city", "internet_pct", "learning_centers"]],
on="city",
how="left",
validate="many_to_one",
)
orders.head()Use validate in merge when you know the expected relationship.
It can catch accidental many-to-many joins early.
21. First EDA After Gathering
Now that data is gathered and lightly validated, basic EDA can begin.
Revenue by category:
revenue_by_category = (
orders.query("payment_status == 'Paid'")
.groupby("category", as_index=False)["amount"]
.sum()
.sort_values("amount", ascending=False)
)
revenue_by_categoryConversion by source:
conversion_by_source = (
leads.assign(converted_flag=leads["converted"].eq("Yes").astype(int))
.groupby("source", as_index=False)["converted_flag"]
.mean()
)
conversion_by_source["conversion_rate_pct"] = conversion_by_source["converted_flag"] * 100
conversion_by_sourceProduct discount percentage:
products["discount_pct"] = (
(products["actual_price"] - products["discounted_price"])
/ products["actual_price"]
* 100
).round(1)
products[["product_name", "category", "discount_pct"]].head()22. Save A Cleaned Starter Dataset
After basic gathering and checks, save a starter file for the next EDA stage.
orders.to_csv("eda_orders_starter_clean.csv", index=False)
products.to_csv("eda_products_starter_clean.csv", index=False)
source_log.to_csv("eda_source_log.csv", index=False)The word starter is intentional. At this point, the data is gathered and lightly checked, not fully analyzed.
23. Data Gathering Checklist
Before you move to deeper EDA, confirm:
- every source has a name and owner
- every source has a collection date
- raw files are preserved
- secrets are not stored in notebooks
- row counts are recorded
- key columns are checked for duplicates
- important categories are validated
- date and numeric columns have correct data types
- joined tables use expected relationships
- gathered outputs can be read back
24. Common Mistakes
Hard-coding API keys
Never publish API keys in notebooks, Markdown files, or Git commits.
Scraping live sites without permission
Use official APIs when possible. If scraping is allowed, be slow, respectful, and transparent.
Ignoring pagination
Many APIs and listing pages return only the first page by default. Always check whether more pages exist.
Trusting row counts blindly
Record expected and actual row counts. A successful request can still return partial data.
Joining before checking keys
Bad joins can duplicate rows and inflate metrics. Check duplicates before merging.
Overwriting raw data
Keep raw data separate from cleaned data. Raw files are your audit trail.
25. Practice Tasks
Use the included CSV files:
- Load all four CSV files and print shape, columns, and first five rows.
- Create a source log with source name, source type, owner, and row count.
- Store
invoices,order_leads, andcity_metricsin SQLite. - Read paid invoices from SQLite into Pandas.
- Export all raw tables into one Excel workbook with multiple sheets.
- Simulate API pagination from
eda_city_metrics.csv. - Store the simulated API result into SQLite.
- Parse the provided HTML snippet and create a product DataFrame.
- Save product listings to JSON and read them back.
- Build a quality report for every gathered table.
- Join invoices with leads and city metrics.
- Calculate paid revenue by category.
26. Interview-Style Questions
What is data gathering in EDA?
Data gathering is the process of collecting raw data from files, databases, APIs, web pages, logs, or internal systems before cleaning and analysis.
Why should you keep raw data unchanged?
Raw data is the audit trail. If cleaning logic fails, you can return to the original source.
Why is pagination important in APIs?
Many APIs return data in pages. If you collect only the first page, your dataset may be incomplete.
Why should API keys not be written in notebooks?
Notebook files are often shared, committed, or exported. Hard-coded keys can leak access to private systems.
What is the purpose of a source log?
A source log records where each dataset came from, who owns it, how often it refreshes, and how many rows were loaded.
Why use SQL before Pandas?
SQL is useful when filtering, joining, or aggregating large data close to the database is more efficient.
Why use validate in Pandas merge?
validate checks whether the join relationship matches your expectation, such as one-to-one or many-to-one.
What should you check after scraping?
Check row count, missing fields, duplicate records, parsing errors, and whether saved output can be read back.
27. Final Notes
Good EDA starts with careful gathering.
Before plotting, make sure you know:
- what each source represents
- how each source was collected
- whether collection was complete
- which fields are reliable
- which fields need cleaning
- how sources connect to each other
Once this foundation is stable, the next stages of EDA become faster, safer, and easier to explain.
