EDA Data Gathering: SQL, APIs & Web Scraping Guide

Jun 23, 2026
40 min read

AI Insights

Powered by GPT-4o-mini

Verified Context: eda-data-gathering-sql-apis-web-scraping-guide
Quick Answer

Learn the data gathering stage of exploratory data analysis with original offline-safe examples covering CSV files, SQL extraction, Excel export, API pagination, web scraping, JSON output, source logs, and data quality checks.

Quick Summary

Learn effective EDA data gathering techniques using SQL, APIs, and web scraping. Enhance your data analysis skills with practical examples.

EDA Data Gathering: SQL, APIs, Web Scraping, Files, and the Data Analysis Process

Exploratory Data Analysis starts before charts.

The first job is to gather reliable data from the places where it lives:

  • CSV files
  • Excel workbooks
  • SQL databases
  • APIs
  • web pages
  • JSON exports
  • internal application logs

This guide teaches the data gathering stage with original synthetic examples. It does not use copied SQL dumps, API keys, external spreadsheets, scraped ecommerce pages, or proprietary course data.

Files Used In This Guide

Use these CSV files:

  • eda_invoices.csv
  • eda_order_leads.csv
  • eda_city_metrics.csv
  • eda_product_listing_snapshot.csv

Place them in the same folder as your notebook or script.

If you keep them in a data/ folder, update the paths:

python
invoices = pd.read_csv("data/eda_invoices.csv")
leads = pd.read_csv("data/eda_order_leads.csv")
city_metrics = pd.read_csv("data/eda_city_metrics.csv")
products = pd.read_csv("data/eda_product_listing_snapshot.csv")

What You Will Learn

By the end, you should be able to:

  • explain where data gathering fits in the data analysis process
  • load raw CSV extracts safely
  • read data from a SQL database into Pandas
  • export multiple tables into one Excel workbook
  • design an API collection loop with pagination
  • store API data into SQL
  • parse product-like HTML with BeautifulSoup
  • save scraped data to JSON
  • maintain a source log for reproducibility
  • run basic data quality checks before analysis

1. The Data Analysis Process

A practical data analysis workflow has six stages:

  1. Define the question.
  2. Gather the data.
  3. Assess the data.
  4. Clean and transform the data.
  5. Explore and model the data.
  6. Communicate the result.

This guide focuses on stage 2 and the beginning of stage 3.

Data gathering is not just downloading files. It is the discipline of knowing:

  • what was collected
  • where it came from
  • when it was collected
  • which fields exist
  • which assumptions were made
  • whether the collection can be repeated

2. Setup

Install the common libraries:

bash
pip install pandas openpyxl sqlalchemy beautifulsoup4 lxml requests

Import the libraries:

python
import json
import sqlite3
import time
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

For this guide, every example can run offline except the optional API pattern. The API section uses simulated pages so you do not need accounts, keys, or live network calls.

3. Load Raw CSV Extracts

Start with simple file-based gathering.

python
invoices = pd.read_csv("eda_invoices.csv")
leads = pd.read_csv("eda_order_leads.csv")
city_metrics = pd.read_csv("eda_city_metrics.csv")
products = pd.read_csv("eda_product_listing_snapshot.csv")

print(invoices.head())
print(leads.head())
print(city_metrics.head())
print(products.head())

Check shape and columns:

python
for name, df in {
    "invoices": invoices,
    "leads": leads,
    "city_metrics": city_metrics,
    "products": products,
}.items():
    print(name)
    print("shape:", df.shape)
    print("columns:", list(df.columns))
    print()

Raw file gathering should always answer two questions:

  • Did I load the expected number of rows?
  • Did I get the expected columns?

4. Create A Source Log

A source log records collection metadata.

python
source_log = pd.DataFrame(
    [
        {
            "source_name": "eda_invoices.csv",
            "source_type": "csv",
            "owner": "finance_ops",
            "refresh_frequency": "daily",
            "loaded_rows": len(invoices),
        },
        {
            "source_name": "eda_order_leads.csv",
            "source_type": "csv",
            "owner": "growth_team",
            "refresh_frequency": "daily",
            "loaded_rows": len(leads),
        },
        {
            "source_name": "eda_city_metrics.csv",
            "source_type": "simulated_api",
            "owner": "market_research",
            "refresh_frequency": "monthly",
            "loaded_rows": len(city_metrics),
        },
        {
            "source_name": "eda_product_listing_snapshot.csv",
            "source_type": "simulated_scrape",
            "owner": "catalog_team",
            "refresh_frequency": "weekly",
            "loaded_rows": len(products),
        },
    ]
)

source_log

This is a small habit with a large payoff. When a project grows, a source log prevents confusion.

5. Read Data From SQL

In real projects, data often comes from databases.

This example uses SQLite so it runs locally without a database server.

python
conn = sqlite3.connect("eda_demo.db")

invoices.to_sql("invoices", conn, if_exists="replace", index=False)
leads.to_sql("order_leads", conn, if_exists="replace", index=False)
city_metrics.to_sql("city_metrics", conn, if_exists="replace", index=False)

Now query those tables back into Pandas:

python
paid_invoices = pd.read_sql_query(
    """
    SELECT
        invoice_id,
        order_id,
        city,
        category,
        amount,
        payment_status
    FROM invoices
    WHERE payment_status = 'Paid'
    """,
    conn,
)

paid_invoices.head()

Use SQL when filtering or joining inside the database is more efficient than loading everything into memory.

6. Join SQL Tables

Join invoices with leads by order_id.

python
orders_with_leads = pd.read_sql_query(
    """
    SELECT
        i.invoice_id,
        i.order_id,
        i.invoice_date,
        i.city,
        i.category,
        i.amount,
        i.payment_status,
        l.source,
        l.campaign,
        l.lead_score,
        l.converted
    FROM invoices AS i
    LEFT JOIN order_leads AS l
        ON i.order_id = l.order_id
    """,
    conn,
)

orders_with_leads.head()

After joining, verify row counts:

python
print("invoices:", len(invoices))
print("joined:", len(orders_with_leads))

If a join unexpectedly increases row count, you may have duplicate keys on one side.

7. Export Multiple Tables To Excel

Sometimes stakeholders want one workbook with multiple sheets.

python
with pd.ExcelWriter("eda_gathered_sources.xlsx", engine="openpyxl") as writer:
    invoices.to_excel(writer, sheet_name="invoices", index=False)
    leads.to_excel(writer, sheet_name="order_leads", index=False)
    city_metrics.to_excel(writer, sheet_name="city_metrics", index=False)
    source_log.to_excel(writer, sheet_name="source_log", index=False)

This creates one Excel file with separate tabs.

Use clear sheet names. Avoid spaces and very long names when the workbook will be used by scripts later.

8. API Gathering Concepts

APIs usually return JSON.

Real API collection often involves:

  • endpoint URL
  • query parameters
  • authentication header
  • pagination
  • rate limits
  • retries
  • schema changes

Never hard-code secret API keys inside notebooks. Use environment variables or a secret manager.

Example pattern:

python
import os

api_key = os.environ.get("CITY_API_KEY")

headers = {
    "Authorization": f"Bearer {api_key}",
}

The rest of this guide uses simulated API pages so the code is safe to publish.

9. Simulate Paginated API Pages

Turn the city metrics CSV into API-like pages.

python
api_pages = []

for page_number, page_df in city_metrics.groupby("api_page"):
    api_pages.append(
        {
            "page": int(page_number),
            "has_next": page_number < city_metrics["api_page"].max(),
            "data": page_df.drop(columns=["api_page"]).to_dict(orient="records"),
        }
    )

api_pages[0]

This structure behaves like a paginated API response:

  • metadata lives at the page level
  • records live inside data
  • has_next tells the loop whether to continue

10. Collect All API Pages

Use a loop that stops when there are no more pages.

python
all_city_records = []

for page in api_pages:
    all_city_records.extend(page["data"])

    print(f"Collected page {page['page']}: {len(page['data'])} rows")

    if not page["has_next"]:
        break

    time.sleep(0.2)

cities_from_api = pd.DataFrame(all_city_records)
cities_from_api.head()

In a real API loop:

  • sleep between requests if the API asks you to
  • handle non-200 status codes
  • log failed pages
  • retry temporary failures
  • stop before rate limits are exceeded

11. Store API Data Into SQL

After gathering API data, store it in a database table.

python
cities_from_api.to_sql(
    "api_city_metrics",
    conn,
    if_exists="replace",
    index=False,
)

pd.read_sql_query("SELECT * FROM api_city_metrics LIMIT 5", conn)

This makes the gathered API data available for later joins.

12. API Collection Checklist

Before trusting API data, check:

python
print(cities_from_api.shape)
print(cities_from_api.isna().sum())
print(cities_from_api.duplicated(subset=["city_id"]).sum())

API data can fail in quiet ways:

  • missing pages
  • duplicated pages
  • changed field names
  • nested JSON that was flattened incorrectly
  • numbers returned as strings

13. Web Scraping Safety

Web scraping should be done carefully.

Before scraping a site:

  • read the site's terms
  • check robots.txt
  • avoid private or sensitive data
  • use a respectful delay
  • do not bypass anti-bot systems
  • stop if the site blocks or rate-limits you

For a public blog post, it is better to teach scraping with synthetic HTML.

14. Parse One HTML Page

Here is a tiny product listing page.

python
html_page_1 = """
<html>
  <body>
    <div class="product-card">
      <img src="https://example.com/notebook-pro.jpg" alt="Notebook Pro 14">
      <h2 class="name">Notebook Pro 14</h2>
      <span class="rating">4.5</span>
      <span class="reviews">312 reviews</span>
      <span class="discounted">58990</span>
      <span class="actual">72990</span>
    </div>
    <div class="product-card">
      <img src="https://example.com/studytab.jpg" alt="StudyTab 10">
      <h2 class="name">StudyTab 10</h2>
      <span class="rating">4.2</span>
      <span class="reviews">140 reviews</span>
      <span class="discounted">17990</span>
      <span class="actual">22990</span>
    </div>
  </body>
</html>
"""

Parse it:

python
soup = BeautifulSoup(html_page_1, "lxml")
cards = soup.select(".product-card")

records = []

for card in cards:
    records.append(
        {
            "image_url": card.select_one("img").get("src"),
            "product_name": card.select_one(".name").get_text(strip=True),
            "rating": float(card.select_one(".rating").get_text(strip=True)),
            "reviews": card.select_one(".reviews").get_text(strip=True),
            "discounted_price": int(card.select_one(".discounted").get_text(strip=True)),
            "actual_price": int(card.select_one(".actual").get_text(strip=True)),
        }
    )

scraped_sample = pd.DataFrame(records)
scraped_sample

The key idea is to select stable HTML elements and extract text or attributes.

15. Scrape Multiple Simulated Pages

Use the included product CSV to simulate pages gathered from a listing.

python
pages = []

for page_number, page_df in products.groupby("page"):
    pages.append(page_df.to_dict(orient="records"))

scraped_records = []

for page_number, page_records in enumerate(pages, start=1):
    print(f"Collecting page {page_number}")
    scraped_records.extend(page_records)
    time.sleep(0.2)

scraped_products = pd.DataFrame(scraped_records)
scraped_products.head()

This is the same control flow you would use for real pagination, but without touching a live website.

16. Save Scraped Data To JSON

JSON is useful when downstream systems expect nested or semi-structured records.

python
scraped_products.to_json(
    "eda_product_listing_snapshot.json",
    orient="records",
    indent=2,
)

Read it back:

python
products_from_json = pd.read_json("eda_product_listing_snapshot.json")
products_from_json.head()

Always test that your saved file can be read back correctly.

17. Normalize Raw Data Types

Gathered data often arrives with wrong data types.

python
invoices["invoice_date"] = pd.to_datetime(invoices["invoice_date"])
leads["lead_date"] = pd.to_datetime(leads["lead_date"])

invoices["amount"] = pd.to_numeric(invoices["amount"], errors="coerce")
products["discounted_price"] = pd.to_numeric(products["discounted_price"], errors="coerce")
products["actual_price"] = pd.to_numeric(products["actual_price"], errors="coerce")

Check the result:

python
print(invoices.dtypes)
print(products.dtypes)

Do this before analysis. Charts and aggregations can be wrong if numbers are still strings.

18. Basic Quality Checks

Create a reusable summary function.

python
def quality_report(df, key_columns=None):
    key_columns = key_columns or []

    report = {
        "rows": len(df),
        "columns": len(df.columns),
        "missing_cells": int(df.isna().sum().sum()),
    }

    for key in key_columns:
        report[f"duplicate_{key}"] = int(df.duplicated(subset=[key]).sum())

    return report

Run it:

python
quality_report(invoices, key_columns=["invoice_id"])
python
quality_report(leads, key_columns=["lead_id"])
python
quality_report(products, key_columns=["product_id"])

Quality checks should happen before serious EDA.

19. Validate Expected Categories

Check categorical columns against known values.

python
expected_payment_status = {"Paid", "Pending", "Failed"}
actual_payment_status = set(invoices["payment_status"].unique())

unexpected = actual_payment_status - expected_payment_status
print("Unexpected payment statuses:", unexpected)

Validate product stock values:

python
expected_stock_values = {"Yes", "No"}
actual_stock_values = set(products["in_stock"].unique())

print("Unexpected stock values:", actual_stock_values - expected_stock_values)

This catches spelling drift like paid, PAID, pendng, or blank values.

20. Combine Gathered Sources

Join leads, invoices, and city metrics.

python
orders = invoices.merge(
    leads,
    on=["order_id", "city"],
    how="left",
    validate="one_to_one",
)

orders = orders.merge(
    city_metrics[["city", "internet_pct", "learning_centers"]],
    on="city",
    how="left",
    validate="many_to_one",
)

orders.head()

Use validate in merge when you know the expected relationship.

It can catch accidental many-to-many joins early.

21. First EDA After Gathering

Now that data is gathered and lightly validated, basic EDA can begin.

Revenue by category:

python
revenue_by_category = (
    orders.query("payment_status == 'Paid'")
    .groupby("category", as_index=False)["amount"]
    .sum()
    .sort_values("amount", ascending=False)
)

revenue_by_category

Conversion by source:

python
conversion_by_source = (
    leads.assign(converted_flag=leads["converted"].eq("Yes").astype(int))
    .groupby("source", as_index=False)["converted_flag"]
    .mean()
)

conversion_by_source["conversion_rate_pct"] = conversion_by_source["converted_flag"] * 100
conversion_by_source

Product discount percentage:

python
products["discount_pct"] = (
    (products["actual_price"] - products["discounted_price"])
    / products["actual_price"]
    * 100
).round(1)

products[["product_name", "category", "discount_pct"]].head()

22. Save A Cleaned Starter Dataset

After basic gathering and checks, save a starter file for the next EDA stage.

python
orders.to_csv("eda_orders_starter_clean.csv", index=False)
products.to_csv("eda_products_starter_clean.csv", index=False)
source_log.to_csv("eda_source_log.csv", index=False)

The word starter is intentional. At this point, the data is gathered and lightly checked, not fully analyzed.

23. Data Gathering Checklist

Before you move to deeper EDA, confirm:

  • every source has a name and owner
  • every source has a collection date
  • raw files are preserved
  • secrets are not stored in notebooks
  • row counts are recorded
  • key columns are checked for duplicates
  • important categories are validated
  • date and numeric columns have correct data types
  • joined tables use expected relationships
  • gathered outputs can be read back

24. Common Mistakes

Hard-coding API keys

Never publish API keys in notebooks, Markdown files, or Git commits.

Scraping live sites without permission

Use official APIs when possible. If scraping is allowed, be slow, respectful, and transparent.

Ignoring pagination

Many APIs and listing pages return only the first page by default. Always check whether more pages exist.

Trusting row counts blindly

Record expected and actual row counts. A successful request can still return partial data.

Joining before checking keys

Bad joins can duplicate rows and inflate metrics. Check duplicates before merging.

Overwriting raw data

Keep raw data separate from cleaned data. Raw files are your audit trail.

25. Practice Tasks

Use the included CSV files:

  1. Load all four CSV files and print shape, columns, and first five rows.
  2. Create a source log with source name, source type, owner, and row count.
  3. Store invoices, order_leads, and city_metrics in SQLite.
  4. Read paid invoices from SQLite into Pandas.
  5. Export all raw tables into one Excel workbook with multiple sheets.
  6. Simulate API pagination from eda_city_metrics.csv.
  7. Store the simulated API result into SQLite.
  8. Parse the provided HTML snippet and create a product DataFrame.
  9. Save product listings to JSON and read them back.
  10. Build a quality report for every gathered table.
  11. Join invoices with leads and city metrics.
  12. Calculate paid revenue by category.

26. Interview-Style Questions

What is data gathering in EDA?

Data gathering is the process of collecting raw data from files, databases, APIs, web pages, logs, or internal systems before cleaning and analysis.

Why should you keep raw data unchanged?

Raw data is the audit trail. If cleaning logic fails, you can return to the original source.

Why is pagination important in APIs?

Many APIs return data in pages. If you collect only the first page, your dataset may be incomplete.

Why should API keys not be written in notebooks?

Notebook files are often shared, committed, or exported. Hard-coded keys can leak access to private systems.

What is the purpose of a source log?

A source log records where each dataset came from, who owns it, how often it refreshes, and how many rows were loaded.

Why use SQL before Pandas?

SQL is useful when filtering, joining, or aggregating large data close to the database is more efficient.

Why use validate in Pandas merge?

validate checks whether the join relationship matches your expectation, such as one-to-one or many-to-one.

What should you check after scraping?

Check row count, missing fields, duplicate records, parsing errors, and whether saved output can be read back.

27. Final Notes

Good EDA starts with careful gathering.

Before plotting, make sure you know:

  • what each source represents
  • how each source was collected
  • whether collection was complete
  • which fields are reliable
  • which fields need cleaning
  • how sources connect to each other

Once this foundation is stable, the next stages of EDA become faster, safer, and easier to explain.

Frequently Asked Questions

What are the main data sources covered in the guide for EDA data gathering?
The guide covers gathering data from CSV files, Excel workbooks, SQL databases, APIs, web pages, JSON exports, and internal application logs.
What is the purpose of maintaining a source log during data gathering?
Maintaining a source log helps ensure reproducibility by tracking what data was collected, where it came from, when it was collected, and other relevant details.
What is the first stage of the data analysis process according to the guide?
The first stage of the data analysis process is to define the question.
How does the guide suggest handling API data collection?
The guide suggests designing an API collection loop with pagination and storing the API data into SQL.

Related Work

See how this thinking shows up in shipped systems.