What are the main data sources covered in the guide for EDA data gathering?

The guide covers gathering data from CSV files, Excel workbooks, SQL databases, APIs, web pages, JSON exports, and internal application logs.

What libraries are recommended for installation to follow the guide?

The guide recommends installing pandas, openpyxl, sqlalchemy, beautifulsoup4, lxml, and requests.

What is the purpose of maintaining a source log during data gathering?

Maintaining a source log helps ensure reproducibility by tracking what data was collected, where it came from, when it was collected, and other relevant details.

What is the first stage of the data analysis process according to the guide?

The first stage of the data analysis process is to define the question.

How does the guide suggest handling API data collection?

The guide suggests designing an API collection loop with pagination and storing the API data into SQL.

EDA Data Gathering: SQL, APIs & Web Scraping

EDA Data Gathering: SQL, APIs, Web Scraping, Files, and the Data Analysis Process

Exploratory Data Analysis starts before charts.

The first job is to gather reliable data from the places where it lives:

CSV files
Excel workbooks
SQL databases
APIs
web pages
JSON exports
internal application logs

This guide teaches the data gathering stage with original synthetic examples. It does not use copied SQL dumps, API keys, external spreadsheets, scraped ecommerce pages, or proprietary course data.

Files Used In This Guide

Use these CSV files:

eda_invoices.csv
eda_order_leads.csv
eda_city_metrics.csv
eda_product_listing_snapshot.csv

Place them in the same folder as your notebook or script.

If you keep them in a data/ folder, update the paths:

python

invoices = pd.read_csv("data/eda_invoices.csv")
leads = pd.read_csv("data/eda_order_leads.csv")
city_metrics = pd.read_csv("data/eda_city_metrics.csv")
products = pd.read_csv("data/eda_product_listing_snapshot.csv")

What You Will Learn

By the end, you should be able to:

explain where data gathering fits in the data analysis process
load raw CSV extracts safely
read data from a SQL database into Pandas
export multiple tables into one Excel workbook
design an API collection loop with pagination
store API data into SQL
parse product-like HTML with BeautifulSoup
save scraped data to JSON
maintain a source log for reproducibility
run basic data quality checks before analysis

1. The Data Analysis Process

A practical data analysis workflow has six stages:

Define the question.
Gather the data.
Assess the data.
Clean and transform the data.
Explore and model the data.
Communicate the result.

This guide focuses on stage 2 and the beginning of stage 3.

Data gathering is not just downloading files. It is the discipline of knowing:

what was collected
where it came from
when it was collected
which fields exist
which assumptions were made
whether the collection can be repeated

2. Setup

Install the common libraries:

bash

pip install pandas openpyxl sqlalchemy beautifulsoup4 lxml requests

Import the libraries:

python

import json
import sqlite3
import time
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

For this guide, every example can run offline except the optional API pattern. The API section uses simulated pages so you do not need accounts, keys, or live network calls.

3. Load Raw CSV Extracts

Start with simple file-based gathering.

python

invoices = pd.read_csv("eda_invoices.csv")
leads = pd.read_csv("eda_order_leads.csv")
city_metrics = pd.read_csv("eda_city_metrics.csv")
products = pd.read_csv("eda_product_listing_snapshot.csv")

print(invoices.head())
print(leads.head())
print(city_metrics.head())
print(products.head())

Check shape and columns:

python

for name, df in {
    "invoices": invoices,
    "leads": leads,
    "city_metrics": city_metrics,
    "products": products,
}.items():
    print(name)
    print("shape:", df.shape)
    print("columns:", list(df.columns))
    print()

Raw file gathering should always answer two questions:

Did I load the expected number of rows?
Did I get the expected columns?

4. Create A Source Log

A source log records collection metadata.

python

source_log = pd.DataFrame(
    [
        {
            "source_name": "eda_invoices.csv",
            "source_type": "csv",
            "owner": "finance_ops",
            "refresh_frequency": "daily",
            "loaded_rows": len(invoices),
        },
        {
            "source_name": "eda_order_leads.csv",
            "source_type": "csv",
            "owner": "growth_team",
            "refresh_frequency": "daily",
            "loaded_rows": len(leads),
        },
        {
            "source_name": "eda_city_metrics.csv",
            "source_type": "simulated_api",
            "owner": "market_research",
            "refresh_frequency": "monthly",
            "loaded_rows": len(city_metrics),
        },
        {
            "source_name": "eda_product_listing_snapshot.csv",
            "source_type": "simulated_scrape",
            "owner": "catalog_team",
            "refresh_frequency": "weekly",
            "loaded_rows": len(products),
        },
    ]
)

source_log

This is a small habit with a large payoff. When a project grows, a source log prevents confusion.

5. Read Data From SQL

In real projects, data often comes from databases.

This example uses SQLite so it runs locally without a database server.

python

conn = sqlite3.connect("eda_demo.db")

invoices.to_sql("invoices", conn, if_exists="replace", index=False)
leads.to_sql("order_leads", conn, if_exists="replace", index=False)
city_metrics.to_sql("city_metrics", conn, if_exists="replace", index=False)

Now query those tables back into Pandas:

python

paid_invoices = pd.read_sql_query(
    """
    SELECT
        invoice_id,
        order_id,
        city,
        category,
        amount,
        payment_status
    FROM invoices
    WHERE payment_status = 'Paid'
    """,
    conn,
)

paid_invoices.head()

Use SQL when filtering or joining inside the database is more efficient than loading everything into memory.

6. Join SQL Tables

Join invoices with leads by order_id.

python

orders_with_leads = pd.read_sql_query(
    """
    SELECT
        i.invoice_id,
        i.order_id,
        i.invoice_date,
        i.city,
        i.category,
        i.amount,
        i.payment_status,
        l.source,
        l.campaign,
        l.lead_score,
        l.converted
    FROM invoices AS i
    LEFT JOIN order_leads AS l
        ON i.order_id = l.order_id
    """,
    conn,
)

orders_with_leads.head()

After joining, verify row counts:

python

print("invoices:", len(invoices))
print("joined:", len(orders_with_leads))

If a join unexpectedly increases row count, you may have duplicate keys on one side.

7. Export Multiple Tables To Excel

Sometimes stakeholders want one workbook with multiple sheets.

python

with pd.ExcelWriter("eda_gathered_sources.xlsx", engine="openpyxl") as writer:
    invoices.to_excel(writer, sheet_name="invoices", index=False)
    leads.to_excel(writer, sheet_name="order_leads", index=False)
    city_metrics.to_excel(writer, sheet_name="city_metrics", index=False)
    source_log.to_excel(writer, sheet_name="source_log", index=False)

This creates one Excel file with separate tabs.

Use clear sheet names. Avoid spaces and very long names when the workbook will be used by scripts later.

8. API Gathering Concepts

APIs usually return JSON.

Real API collection often involves:

endpoint URL
query parameters
authentication header
pagination
rate limits
retries
schema changes

Never hard-code secret API keys inside notebooks. Use environment variables or a secret manager.

Example pattern:

python

import os

api_key = os.environ.get("CITY_API_KEY")

headers = {
    "Authorization": f"Bearer {api_key}",
}

The rest of this guide uses simulated API pages so the code is safe to publish.

9. Simulate Paginated API Pages

Turn the city metrics CSV into API-like pages.

python

api_pages = []

for page_number, page_df in city_metrics.groupby("api_page"):
    api_pages.append(
        {
            "page": int(page_number),
            "has_next": page_number < city_metrics["api_page"].max(),
            "data": page_df.drop(columns=["api_page"]).to_dict(orient="records"),
        }
    )

api_pages[0]

This structure behaves like a paginated API response:

metadata lives at the page level
records live inside data
has_next tells the loop whether to continue

10. Collect All API Pages

Use a loop that stops when there are no more pages.

python

all_city_records = []

for page in api_pages:
    all_city_records.extend(page["data"])

    print(f"Collected page {page['page']}: {len(page['data'])} rows")

    if not page["has_next"]:
        break

    time.sleep(0.2)

cities_from_api = pd.DataFrame(all_city_records)
cities_from_api.head()

In a real API loop:

sleep between requests if the API asks you to
handle non-200 status codes
log failed pages
retry temporary failures
stop before rate limits are exceeded

11. Store API Data Into SQL

After gathering API data, store it in a database table.

python

cities_from_api.to_sql(
    "api_city_metrics",
    conn,
    if_exists="replace",
    index=False,
)

pd.read_sql_query("SELECT * FROM api_city_metrics LIMIT 5", conn)

This makes the gathered API data available for later joins.

12. API Collection Checklist

Before trusting API data, check:

python

print(cities_from_api.shape)
print(cities_from_api.isna().sum())
print(cities_from_api.duplicated(subset=["city_id"]).sum())

API data can fail in quiet ways:

missing pages
duplicated pages
changed field names
nested JSON that was flattened incorrectly
numbers returned as strings

13. Web Scraping Safety

Web scraping should be done carefully.

Before scraping a site:

read the site's terms
check robots.txt
avoid private or sensitive data
use a respectful delay
do not bypass anti-bot systems
stop if the site blocks or rate-limits you

For a public blog post, it is better to teach scraping with synthetic HTML.

14. Parse One HTML Page

Here is a tiny product listing page.

python

html_page_1 = """
<html>
  <body>
    <div class="product-card">
      <img src="https://example.com/notebook-pro.jpg" alt="Notebook Pro 14">
      <h2 class="name">Notebook Pro 14</h2>
      <span class="rating">4.5</span>
      <span class="reviews">312 reviews</span>
      <span class="discounted">58990</span>
      <span class="actual">72990</span>
    </div>
    <div class="product-card">
      <img src="https://example.com/studytab.jpg" alt="StudyTab 10">
      <h2 class="name">StudyTab 10</h2>
      <span class="rating">4.2</span>
      <span class="reviews">140 reviews</span>
      <span class="discounted">17990</span>
      <span class="actual">22990</span>
    </div>
  </body>
</html>
"""

Parse it:

python

soup = BeautifulSoup(html_page_1, "lxml")
cards = soup.select(".product-card")

records = []

for card in cards:
    records.append(
        {
            "image_url": card.select_one("img").get("src"),
            "product_name": card.select_one(".name").get_text(strip=True),
            "rating": float(card.select_one(".rating").get_text(strip=True)),
            "reviews": card.select_one(".reviews").get_text(strip=True),
            "discounted_price": int(card.select_one(".discounted").get_text(strip=True)),
            "actual_price": int(card.select_one(".actual").get_text(strip=True)),
        }
    )

scraped_sample = pd.DataFrame(records)
scraped_sample

The key idea is to select stable HTML elements and extract text or attributes.

15. Scrape Multiple Simulated Pages

Use the included product CSV to simulate pages gathered from a listing.

python

pages = []

for page_number, page_df in products.groupby("page"):
    pages.append(page_df.to_dict(orient="records"))

scraped_records = []

for page_number, page_records in enumerate(pages, start=1):
    print(f"Collecting page {page_number}")
    scraped_records.extend(page_records)
    time.sleep(0.2)

scraped_products = pd.DataFrame(scraped_records)
scraped_products.head()

This is the same control flow you would use for real pagination, but without touching a live website.

16. Save Scraped Data To JSON

JSON is useful when downstream systems expect nested or semi-structured records.

python

scraped_products.to_json(
    "eda_product_listing_snapshot.json",
    orient="records",
    indent=2,
)

Read it back:

python

products_from_json = pd.read_json("eda_product_listing_snapshot.json")
products_from_json.head()

Always test that your saved file can be read back correctly.

17. Normalize Raw Data Types

Gathered data often arrives with wrong data types.

python

invoices["invoice_date"] = pd.to_datetime(invoices["invoice_date"])
leads["lead_date"] = pd.to_datetime(leads["lead_date"])

invoices["amount"] = pd.to_numeric(invoices["amount"], errors="coerce")
products["discounted_price"] = pd.to_numeric(products["discounted_price"], errors="coerce")
products["actual_price"] = pd.to_numeric(products["actual_price"], errors="coerce")

Check the result:

python

print(invoices.dtypes)
print(products.dtypes)

Do this before analysis. Charts and aggregations can be wrong if numbers are still strings.

18. Basic Quality Checks

Create a reusable summary function.

python

def quality_report(df, key_columns=None):
    key_columns = key_columns or []

    report = {
        "rows": len(df),
        "columns": len(df.columns),
        "missing_cells": int(df.isna().sum().sum()),
    }

    for key in key_columns:
        report[f"duplicate_{key}"] = int(df.duplicated(subset=[key]).sum())

    return report

Run it:

python

quality_report(invoices, key_columns=["invoice_id"])

python

quality_report(leads, key_columns=["lead_id"])

python

quality_report(products, key_columns=["product_id"])

Quality checks should happen before serious EDA.

19. Validate Expected Categories

Check categorical columns against known values.

python

expected_payment_status = {"Paid", "Pending", "Failed"}
actual_payment_status = set(invoices["payment_status"].unique())

unexpected = actual_payment_status - expected_payment_status
print("Unexpected payment statuses:", unexpected)

Validate product stock values:

python

expected_stock_values = {"Yes", "No"}
actual_stock_values = set(products["in_stock"].unique())

print("Unexpected stock values:", actual_stock_values - expected_stock_values)

This catches spelling drift like paid, PAID, pendng, or blank values.

20. Combine Gathered Sources

Join leads, invoices, and city metrics.

python

orders = invoices.merge(
    leads,
    on=["order_id", "city"],
    how="left",
    validate="one_to_one",
)

orders = orders.merge(
    city_metrics[["city", "internet_pct", "learning_centers"]],
    on="city",
    how="left",
    validate="many_to_one",
)

orders.head()

Use validate in merge when you know the expected relationship.

It can catch accidental many-to-many joins early.

21. First EDA After Gathering

Now that data is gathered and lightly validated, basic EDA can begin.

Revenue by category:

python

revenue_by_category = (
    orders.query("payment_status == 'Paid'")
    .groupby("category", as_index=False)["amount"]
    .sum()
    .sort_values("amount", ascending=False)
)

revenue_by_category

Conversion by source:

python

conversion_by_source = (
    leads.assign(converted_flag=leads["converted"].eq("Yes").astype(int))
    .groupby("source", as_index=False)["converted_flag"]
    .mean()
)

conversion_by_source["conversion_rate_pct"] = conversion_by_source["converted_flag"] * 100
conversion_by_source

Product discount percentage:

python

products["discount_pct"] = (
    (products["actual_price"] - products["discounted_price"])
    / products["actual_price"]
    * 100
).round(1)

products[["product_name", "category", "discount_pct"]].head()

22. Save A Cleaned Starter Dataset

After basic gathering and checks, save a starter file for the next EDA stage.

python

orders.to_csv("eda_orders_starter_clean.csv", index=False)
products.to_csv("eda_products_starter_clean.csv", index=False)
source_log.to_csv("eda_source_log.csv", index=False)

The word starter is intentional. At this point, the data is gathered and lightly checked, not fully analyzed.

23. Data Gathering Checklist

Before you move to deeper EDA, confirm:

every source has a name and owner
every source has a collection date
raw files are preserved
secrets are not stored in notebooks
row counts are recorded
key columns are checked for duplicates
important categories are validated
date and numeric columns have correct data types
joined tables use expected relationships
gathered outputs can be read back

24. Common Mistakes

Hard-coding API keys

Never publish API keys in notebooks, Markdown files, or Git commits.

Scraping live sites without permission

Use official APIs when possible. If scraping is allowed, be slow, respectful, and transparent.

Ignoring pagination

Many APIs and listing pages return only the first page by default. Always check whether more pages exist.

Trusting row counts blindly

Record expected and actual row counts. A successful request can still return partial data.

Joining before checking keys

Bad joins can duplicate rows and inflate metrics. Check duplicates before merging.

Overwriting raw data

Keep raw data separate from cleaned data. Raw files are your audit trail.

25. Practice Tasks

Use the included CSV files:

Load all four CSV files and print shape, columns, and first five rows.
Create a source log with source name, source type, owner, and row count.
Store invoices, order_leads, and city_metrics in SQLite.
Read paid invoices from SQLite into Pandas.
Export all raw tables into one Excel workbook with multiple sheets.
Simulate API pagination from eda_city_metrics.csv.
Store the simulated API result into SQLite.
Parse the provided HTML snippet and create a product DataFrame.
Save product listings to JSON and read them back.
Build a quality report for every gathered table.
Join invoices with leads and city metrics.
Calculate paid revenue by category.

what each source represents
how each source was collected
whether collection was complete
which fields are reliable
which fields need cleaning
how sources connect to each other

Once this foundation is stable, the next stages of EDA become faster, safer, and easier to explain.

EDA Data Gathering: SQL, APIs & Web Scraping Guide

AI Insights

Frequently Asked Questions

Related Work

FastAPI Browser-Workflow Automation

AI Visibility & SEO Auditing Platform

RAG-Powered Learning Platform with AI Assistant