
SQL with APIs and JSON Fields in PostgreSQL: Working with Semi-Structured Data


Table of Contents

  1. Introduction
  2. Why JSON Support Matters in Modern SQL
  3. PostgreSQL JSON vs JSONB
  4. Creating Tables with JSON Fields
  5. Inserting JSON Data (from APIs or files)
  6. Accessing JSON Properties in PostgreSQL
  7. Filtering Based on JSON Content
  8. Updating and Modifying JSON Fields
  9. Using JSON Functions: jsonb_set, jsonb_extract_path, jsonb_array_elements
  10. Flattening JSON Arrays
  11. Indexing JSONB for Performance
  12. Parsing API Responses in PostgreSQL
  13. Building a Table to Store API Logs
  14. Practical Example: GitHub API Data in JSON
  15. Joining JSON with Relational Tables
  16. Aggregating Inside JSON Arrays
  17. Validating and Cleaning JSON Payloads
  18. Common Pitfalls and Optimization Tips
  19. JSON Utility Functions and Extensions
  20. Summary and What’s Next

1. Introduction

Modern applications often rely on API responses and semi-structured data, especially in JSON format. PostgreSQL stands out among SQL databases for its robust native JSON support, making it ideal for ingesting and analyzing data directly from APIs.


2. Why JSON Support Matters in Modern SQL

  • APIs return nested JSON by default
  • Avoid premature normalization
  • Store entire documents or responses in a single field
  • Perform flexible querying without rigid schema changes

3. PostgreSQL JSON vs JSONB

Feature            JSON                  JSONB
Storage            Textual format        Binary format
Indexable          No                    Yes (GIN, expression indexes)
Order preserved    Yes                   No
Performance        Slower for lookups    Faster for search/filtering

Recommendation: Use JSONB for analytics and indexing.


4. Creating Tables with JSON Fields

CREATE TABLE api_logs (
id SERIAL PRIMARY KEY,
source TEXT,
payload JSONB,
fetched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

This allows you to store entire API payloads for later parsing and analysis.


5. Inserting JSON Data (from APIs or files)

INSERT INTO api_logs (source, payload)
VALUES (
'github',
'{
"login": "prafull",
"id": 12345,
"repos": [
{"name": "syskool", "stars": 150},
{"name": "xeblabs", "stars": 80}
]
}'
);

Use your backend or ETL pipeline to automate this step.


6. Accessing JSON Properties in PostgreSQL

SELECT
payload->>'login' AS username,
(payload->>'id')::INT AS user_id
FROM api_logs
WHERE source = 'github';
  • -> returns JSON object
  • ->> returns text (castable to scalar types)

7. Filtering Based on JSON Content

SELECT *
FROM api_logs
WHERE payload->>'login' = 'prafull';

With nested data:

WHERE payload->'repos' @> '[{"name": "syskool"}]'

8. Updating and Modifying JSON Fields

UPDATE api_logs
SET payload = jsonb_set(payload, '{login}', '"kumar_prafull"')
WHERE id = 1;
  • First param: JSON field
  • Second: path as array
  • Third: new value (must be JSON-encoded)

9. Using JSON Functions

  • jsonb_set: Update a specific key
  • jsonb_extract_path: Navigate nested levels
  • jsonb_array_elements: Unpack arrays into rows
  • jsonb_build_object: Create JSON in SQL
  • jsonb_object_keys: List keys in an object

Example:

SELECT key, value
FROM api_logs,
jsonb_each_text(payload)
WHERE id = 1;

10. Flattening JSON Arrays

SELECT
payload->>'login' AS username,
repo_elem->>'name' AS repo_name,
(repo_elem->>'stars')::INT AS stars
FROM api_logs,
jsonb_array_elements(payload->'repos') AS repo_elem
WHERE source = 'github';

Each array element becomes its own row — useful for reporting.


11. Indexing JSONB for Performance

GIN Index:

CREATE INDEX idx_jsonb_payload ON api_logs USING GIN (payload jsonb_path_ops);

Expression Index:

CREATE INDEX idx_username ON api_logs ((payload->>'login'));

Improves filter and search performance.


12. Parsing API Responses in PostgreSQL

If API responses are fetched by an ETL process, store the raw JSON first and parse it downstream:

-- Extract nested country code
SELECT payload->'address'->>'country_code' AS country
FROM api_logs;

Use jsonb_populate_recordset() for bulk parsing if schema is known.
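
A minimal sketch of jsonb_populate_recordset(), assuming a composite type that mirrors the repo objects stored earlier (the type name repo_row is illustrative):

-- Hypothetical composite type matching the objects in payload->'repos'
CREATE TYPE repo_row AS (name TEXT, stars INT);

SELECT r.name, r.stars
FROM api_logs,
jsonb_populate_recordset(NULL::repo_row, payload->'repos') AS r
WHERE source = 'github';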


13. Building a Table to Store API Logs

CREATE TABLE external_api_responses (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
endpoint TEXT,
response JSONB,
status_code INT,
requested_at TIMESTAMP DEFAULT NOW()
);

Track performance, failures, and response sizes.
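
One possible monitoring query over this table (pg_column_size() reports the stored, possibly compressed, size of a value):

SELECT endpoint,
COUNT(*) AS calls,
COUNT(*) FILTER (WHERE status_code >= 400) AS failures,
ROUND(AVG(pg_column_size(response))) AS avg_response_bytes
FROM external_api_responses
GROUP BY endpoint
ORDER BY failures DESC;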


14. Practical Example: GitHub API Data in JSON

-- Assume payload from GitHub API
SELECT
payload->>'login' AS username,
jsonb_array_length(payload->'repos') AS repo_count
FROM api_logs
WHERE source = 'github';

15. Joining JSON with Relational Tables

-- users table has user_id
-- api_logs.payload contains matching GitHub user_id

SELECT u.name, a.payload->>'login' AS github_login
FROM users u
JOIN api_logs a ON u.external_id = (a.payload->>'id')::INT;

16. Aggregating Inside JSON Arrays

SELECT
payload->>'login' AS username,
SUM((repo_elem->>'stars')::INT) AS total_stars
FROM api_logs,
jsonb_array_elements(payload->'repos') AS repo_elem
GROUP BY username;

Aggregate metrics from nested structures.


17. Validating and Cleaning JSON Payloads

Check for structure:

SELECT * FROM api_logs
WHERE jsonb_typeof(payload->'repos') != 'array';

Avoid invalid assumptions — always validate before parsing.
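
For example, guard downstream parsing by filtering on the expected type first:

-- Only count repos for rows where the field really is an array
SELECT id, jsonb_array_length(payload->'repos') AS repo_count
FROM api_logs
WHERE jsonb_typeof(payload->'repos') = 'array';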


18. Common Pitfalls and Optimization Tips

  • Use jsonb, not json, for faster querying and indexing
  • Inside JSON, string values must use double quotes (single quotes are not valid JSON)
  • Avoid storing massive payloads you never intend to query
  • Index only the fields used in filtering
  • Be careful with null vs missing key comparisons (see the example below)
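
The last point trips up many queries: ->> returns SQL NULL both for a JSON null and for an absent key, so use the ? operator when you need to tell them apart. A small illustration:

SELECT id,
payload ? 'login' AS has_login_key,                  -- true if the key exists, even if its value is JSON null
payload->>'login' IS NULL AS login_null_or_missing   -- true for JSON null and for a missing key
FROM api_logs;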

19. JSON Utility Functions and Extensions

  • jsonb_pretty(payload) – formatted view
  • jsonb_strip_nulls(payload) – clean nulls
  • jsonb_build_object() – construct JSON in SQL
  • Extensions: pg_partman, pg_cron, plv8 (JS for PostgreSQL)

20. Summary and What’s Next

PostgreSQL provides powerful JSON capabilities that make it an excellent tool for storing and analyzing API data and other semi-structured documents. From storing entire payloads to filtering and joining on nested attributes, you can use pure SQL to explore flexible data structures.

SQL for Data Engineering: ETL Concepts and Building Scalable Data Pipelines


Table of Contents

  1. Introduction
  2. What Is Data Engineering and Why SQL Matters
  3. Understanding ETL (Extract, Transform, Load)
  4. SQL’s Role in Each ETL Stage
  5. Data Sources: Raw Data Ingestion with SQL
  6. Transformations: Cleaning and Normalizing Data
  7. Using Staging Tables in SQL Pipelines
  8. Surrogate Keys and Slowly Changing Dimensions (SCDs)
  9. Incremental Loads and Change Data Capture (CDC)
  10. Partitioning and Batching Strategies
  11. Deduplication and Idempotency in SQL
  12. Building Materialized Views
  13. Scheduling and Automating ETL Workflows
  14. Logging and Monitoring in SQL Pipelines
  15. Error Handling and Data Quality Checks
  16. Working with Semi-Structured Data (JSON, XML)
  17. Schema Design for Analytical Workloads
  18. Integration with Airflow, dbt, and Other Tools
  19. Best Practices for SQL Data Engineering
  20. Summary and What’s Next

1. Introduction

Data engineering is the backbone of data-driven organizations. SQL remains the foundation for building reliable, scalable, and maintainable data pipelines that power reporting, analytics, and machine learning.


2. What Is Data Engineering and Why SQL Matters

Data engineering involves the design, development, and maintenance of systems that enable efficient data collection, storage, transformation, and distribution. SQL is crucial for:

  • Writing ETL logic
  • Creating clean, normalized, and enriched datasets
  • Building warehouse-ready schemas
  • Orchestrating transformations and analytics layers

3. Understanding ETL (Extract, Transform, Load)

Stage        Description
Extract      Pull raw data from various sources
Transform    Clean, enrich, and standardize the data
Load         Store the final data into a warehouse or target table

SQL typically handles the Transform and Load steps, and sometimes the Extract step as well when the source is itself a relational database or an RDBMS-based connector.


4. SQL’s Role in Each ETL Stage

  • Extract: Read from APIs, logs, or staging databases
  • Transform: Use SELECT, JOIN, CASE, WINDOW, etc.
  • Load: Insert into fact/dimension tables or materialized views

5. Data Sources: Raw Data Ingestion with SQL

-- Load raw CSV into staging
COPY staging_orders FROM '/data/orders.csv' DELIMITER ',' CSV HEADER;

Or via external connectors (e.g., Fivetran, Airbyte).


6. Transformations: Cleaning and Normalizing Data

Example transformation SQL:

INSERT INTO cleaned_orders
SELECT
id,
LOWER(TRIM(customer_name)) AS customer_name,
COALESCE(total_amount, 0) AS total_amount,
CASE
WHEN status = 'delivered' THEN 'fulfilled'
ELSE status
END AS status_cleaned
FROM staging_orders
WHERE total_amount IS NOT NULL;

7. Using Staging Tables in SQL Pipelines

Staging tables help isolate raw data and ensure replayability:

-- staging_users
-- cleaned_users
-- users_final

Follow a clear naming pattern: stg_, cleaned_, fct_, dim_


8. Surrogate Keys and Slowly Changing Dimensions (SCDs)

Surrogate Keys:

ALTER TABLE customers ADD COLUMN customer_key SERIAL PRIMARY KEY;

SCD Type 2 (history tracking):

-- Insert new row with new version
INSERT INTO dim_customer_history (customer_id, name, version, valid_from, valid_to)
VALUES (1, 'New Name', 2, CURRENT_DATE, NULL);
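
Before inserting the new version, the previous one is usually closed out so only one row per customer has valid_to = NULL (column names follow the sketch above):

-- Close the currently active version for this customer
UPDATE dim_customer_history
SET valid_to = CURRENT_DATE
WHERE customer_id = 1
AND valid_to IS NULL;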

9. Incremental Loads and Change Data Capture (CDC)

Last updated timestamp method:

INSERT INTO fact_orders
SELECT *
FROM staging_orders
WHERE updated_at > (SELECT MAX(updated_at) FROM fact_orders);

CDC tools: Debezium, Fivetran, Snowpipe (Snowflake), etc.


10. Partitioning and Batching Strategies

Partitioned tables improve performance:

CREATE TABLE sales (
id INT,
sale_date DATE,
amount NUMERIC
) PARTITION BY RANGE (sale_date);

Batching:

-- Load only last 7 days
WHERE order_date >= CURRENT_DATE - INTERVAL '7 days';

11. Deduplication and Idempotency in SQL

-- Deduplicate by latest timestamp
SELECT DISTINCT ON (order_id) *
FROM staging_orders
ORDER BY order_id, updated_at DESC;

Use MERGE or UPSERT patterns to avoid duplicates.
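
A minimal PostgreSQL upsert sketch, assuming order_id has a unique constraint on fact_orders and staging rows are already deduplicated as above:

INSERT INTO fact_orders (order_id, customer_id, total, updated_at)
SELECT order_id, customer_id, total, updated_at
FROM staging_orders
ON CONFLICT (order_id)
DO UPDATE SET total = EXCLUDED.total,
              updated_at = EXCLUDED.updated_at;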


12. Building Materialized Views

CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT order_date, SUM(amount) AS total
FROM fact_orders
GROUP BY order_date;

Use REFRESH MATERIALIZED VIEW to update periodically.


13. Scheduling and Automating ETL Workflows

  • Use cron jobs, Airflow, dbt Cloud, or another orchestrator (see the pg_cron sketch after this list)
  • SQL example: wrap refresh logic in a function the scheduler can call

-- PostgreSQL job
SELECT refresh_sales_summary();

  • Define dependencies and run incrementally
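
If the pg_cron extension is available, the call can be scheduled directly inside PostgreSQL (the job name and schedule below are illustrative):

-- Run the refresh function every day at 02:00
SELECT cron.schedule('refresh-sales-summary', '0 2 * * *', 'SELECT refresh_sales_summary();');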

14. Logging and Monitoring in SQL Pipelines

Create a log table:

CREATE TABLE etl_log (
job_name TEXT,
start_time TIMESTAMP,
end_time TIMESTAMP,
status TEXT,
rows_processed INT
);

Insert log entry at the end of ETL steps.
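
A log entry might be written like this at the end of each step (the values shown are illustrative):

INSERT INTO etl_log (job_name, start_time, end_time, status, rows_processed)
VALUES ('load_fact_orders', NOW() - INTERVAL '5 minutes', NOW(), 'success', 15230);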


15. Error Handling and Data Quality Checks

Examples:

-- Constraint check
SELECT * FROM staging_users WHERE email IS NULL;

-- Threshold check
SELECT COUNT(*) FROM orders WHERE total < 0;

Raise alerts or halt pipeline on failure.
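
In PostgreSQL, a check can halt the run by raising an exception, which most orchestrators treat as a failed task. A minimal sketch:

DO $$
BEGIN
  IF EXISTS (SELECT 1 FROM staging_users WHERE email IS NULL) THEN
    RAISE EXCEPTION 'Data quality check failed: NULL emails in staging_users';
  END IF;
END $$;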


16. Working with Semi-Structured Data (JSON, XML)

JSON:

SELECT json_data->>'user_id' AS user_id
FROM event_logs;

Flattening nested JSON:

SELECT id, json_array_elements(data->'items') AS item
FROM orders_json;

17. Schema Design for Analytical Workloads

Use Star Schema:

  • Fact tables: Events, transactions
  • Dimension tables: Customer, Product, Region

Avoid unnecessary normalization in analytical queries.
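
A minimal star-schema sketch (table and column names are illustrative):

CREATE TABLE dim_customer (customer_key SERIAL PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_product (product_key SERIAL PRIMARY KEY, name TEXT, category TEXT);

CREATE TABLE fct_sales (
sale_id BIGSERIAL PRIMARY KEY,
customer_key INT REFERENCES dim_customer (customer_key),
product_key INT REFERENCES dim_product (product_key),
sale_date DATE,
amount NUMERIC
);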


18. Integration with Airflow, dbt, and Other Tools

  • Airflow: Orchestrate SQL DAGs
  • dbt: SQL-based transformations with version control and testing
  • Dataform, Luigi, Dagster — alternative orchestration tools

Example dbt model:

-- models/fct_orders.sql
SELECT * FROM {{ ref('stg_orders') }}
WHERE status = 'complete'

19. Best Practices for SQL Data Engineering

  • Use idempotent transformations (can re-run safely)
  • Break logic into modular layers (staging → clean → final)
  • Keep metadata logs for traceability
  • Use surrogate keys for dimension tables
  • Prefer batch-based incremental updates
  • Automate quality checks with row counts, NOT NULL, etc.
  • Document pipeline logic and dependencies

20. Summary and What’s Next

SQL plays a central role in modern data engineering — from raw data ingestion to clean, analytics-ready datasets. Mastering SQL-based ETL, transformation layers, and orchestration tools is essential for building trustworthy, production-grade data pipelines.

SQL for Data Analysis: Deep Dive into KPIs and Cohort Analysis


Table of Contents

  1. Introduction
  2. SQL’s Role in Data-Driven Decision Making
  3. Understanding KPIs in Business Context
  4. Designing KPIs with SQL
  5. Measuring Conversion Rates
  6. Calculating Retention and Churn
  7. Defining and Analyzing User Cohorts
  8. Building Signup Cohorts
  9. Tracking Activity Across Time Buckets
  10. Cohort Retention Matrix (Week 0 to Week N)
  11. SQL Techniques for LTV (Lifetime Value)
  12. Funnel Analysis Using SQL
  13. Revenue Growth and Month-over-Month Trends
  14. CAC vs LTV: SQL Computation
  15. Feature Adoption & Usage Segments
  16. SQL for A/B Test Reporting
  17. Segmentation by Geography, Plan, or Behavior
  18. Rolling Averages and Smoothing Metrics
  19. Visualizing Metrics with SQL-Friendly Tables
  20. Summary and What’s Next

1. Introduction

Data analysis is at the heart of business intelligence. SQL allows analysts to compute critical KPIs, evaluate product impact, and understand user behavior at scale. In this module, we’ll focus on advanced metrics like cohort analysis, LTV, retention, and feature adoption.


2. SQL’s Role in Data-Driven Decision Making

SQL enables:

  • Fast, reproducible metric calculations
  • Filtering and slicing for dynamic segments
  • Integration with BI tools for real-time insights
  • Customized views for teams (e.g., marketing, product, sales)

3. Understanding KPIs in Business Context

KPIs (Key Performance Indicators) must be:

Attribute     Description
Specific      Clearly defined (e.g., daily active users)
Measurable    Quantifiable in SQL
Relevant      Tied to business goals
Time-bound    Measured over a defined window (e.g., last 30 days)

Examples:

  • DAU/MAU
  • Churn Rate
  • Average Revenue Per User (ARPU)
  • Conversion Rate
  • Lifetime Value (LTV)

4. Designing KPIs with SQL

Daily Active Users (DAU):

SELECT activity_date, COUNT(DISTINCT user_id) AS dau
FROM user_activity
GROUP BY activity_date
ORDER BY activity_date;

Average Revenue Per User (ARPU):

SELECT SUM(revenue)/COUNT(DISTINCT user_id) AS arpu
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';

5. Measuring Conversion Rates

Example: Landing → Signup → Purchase

WITH visited AS (
SELECT DISTINCT user_id FROM events WHERE page = 'landing'
),
signed_up AS (
SELECT DISTINCT user_id FROM events WHERE page = 'signup'
),
purchased AS (
SELECT DISTINCT user_id FROM events WHERE page = 'checkout'
)
SELECT
(SELECT COUNT(*) FROM visited) AS landing_visits,
(SELECT COUNT(*) FROM signed_up) AS signups,
(SELECT COUNT(*) FROM purchased) AS purchases,
ROUND(100.0 * (SELECT COUNT(*) FROM signed_up) / (SELECT COUNT(*) FROM visited), 2) AS signup_rate,
ROUND(100.0 * (SELECT COUNT(*) FROM purchased) / (SELECT COUNT(*) FROM signed_up), 2) AS purchase_rate;

6. Calculating Retention and Churn

Retention:

WITH first_seen AS (
SELECT user_id, MIN(login_date) AS cohort_date
FROM logins
GROUP BY user_id
),
activity AS (
SELECT user_id, login_date::date AS activity_day
FROM logins
)
SELECT
f.cohort_date,
a.activity_day,
COUNT(DISTINCT a.user_id) AS retained_users
FROM first_seen f
JOIN activity a ON f.user_id = a.user_id
WHERE a.activity_day >= f.cohort_date
GROUP BY f.cohort_date, a.activity_day;

Churn (simplified):

SELECT
COUNT(*) FILTER (WHERE last_login < CURRENT_DATE - INTERVAL '30 days') * 100.0 / COUNT(*) AS churn_rate
FROM users;

7. Defining and Analyzing User Cohorts

Cohorts group users based on a shared time-based event (e.g., signup week/month). This allows performance comparisons across generations of users.

SELECT user_id, DATE_TRUNC('month', signup_date) AS cohort_month
FROM users;

8. Building Signup Cohorts

WITH cohorts AS (
SELECT user_id, DATE_TRUNC('month', signup_date) AS cohort_month
FROM users
)
SELECT cohort_month, COUNT(*) AS new_users
FROM cohorts
GROUP BY cohort_month
ORDER BY cohort_month;

9. Tracking Activity Across Time Buckets

WITH user_logins AS (
SELECT user_id,
DATE_TRUNC('month', MIN(login_date) OVER (PARTITION BY user_id)) AS cohort,
DATE_TRUNC('month', login_date) AS active_month
FROM logins
)
SELECT cohort, active_month, COUNT(DISTINCT user_id) AS active_users
FROM user_logins
GROUP BY cohort, active_month
ORDER BY cohort, active_month;

10. Cohort Retention Matrix (Week 0 to Week N)

WITH base AS (
SELECT user_id, MIN(login_date) AS signup_date
FROM logins
GROUP BY user_id
),
activity AS (
SELECT l.user_id,
DATE_TRUNC('week', b.signup_date) AS cohort_week,
FLOOR(DATE_PART('day', l.login_date - b.signup_date)/7) AS week_number
FROM logins l
JOIN base b ON l.user_id = b.user_id
)
SELECT cohort_week, week_number, COUNT(DISTINCT user_id) AS retained
FROM activity
GROUP BY cohort_week, week_number
ORDER BY cohort_week, week_number;

11. SQL Techniques for LTV (Lifetime Value)

SELECT customer_id, SUM(order_total) AS ltv
FROM orders
GROUP BY customer_id;

Segment by cohort:

WITH cohorted_orders AS (
SELECT o.customer_id,
DATE_TRUNC('month', u.signup_date) AS cohort,
SUM(o.order_total) AS customer_ltv
FROM orders o
JOIN users u ON o.customer_id = u.user_id
GROUP BY o.customer_id, DATE_TRUNC('month', u.signup_date)
)
SELECT cohort, AVG(customer_ltv) AS avg_ltv
FROM cohorted_orders
GROUP BY cohort;

12. Funnel Analysis Using SQL

WITH funnel AS (
SELECT user_id,
MAX(CASE WHEN event = 'visit' THEN 1 END) AS visited,
MAX(CASE WHEN event = 'signup' THEN 1 END) AS signed_up,
MAX(CASE WHEN event = 'purchase' THEN 1 END) AS purchased
FROM events
GROUP BY user_id
)
SELECT
COUNT(*) AS total_users,
COUNT(*) FILTER (WHERE signed_up = 1) AS signed_up,
COUNT(*) FILTER (WHERE purchased = 1) AS purchased,
ROUND(100.0 * COUNT(*) FILTER (WHERE signed_up = 1) / COUNT(*), 2) AS signup_rate,
ROUND(100.0 * COUNT(*) FILTER (WHERE purchased = 1) / COUNT(*), 2) AS conversion_rate
FROM funnel;

13. Revenue Growth and Month-over-Month Trends

WITH monthly_revenue AS (
SELECT DATE_TRUNC('month', order_date) AS month, SUM(total) AS revenue
FROM orders
GROUP BY month
)
SELECT month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month,
revenue - LAG(revenue) OVER (ORDER BY month) AS mom_change,
ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0 / LAG(revenue) OVER (ORDER BY month), 2) AS mom_growth
FROM monthly_revenue;

14. CAC vs LTV: SQL Computation

WITH customer_acquisition AS (
SELECT customer_id, SUM(spend) AS cac
FROM ad_spend
GROUP BY customer_id
),
ltv AS (
SELECT customer_id, SUM(order_total) AS ltv
FROM orders
GROUP BY customer_id
)
SELECT a.customer_id, a.cac, l.ltv,
ROUND(l.ltv / NULLIF(a.cac, 0), 2) AS ltv_to_cac_ratio
FROM customer_acquisition a
JOIN ltv l ON a.customer_id = l.customer_id;

15. Feature Adoption & Usage Segments

SELECT feature_name, COUNT(DISTINCT user_id) AS users_used
FROM feature_usage
GROUP BY feature_name
ORDER BY users_used DESC;

Analyze adoption by segment:

SELECT segment, feature_name, COUNT(*) AS usage_count
FROM users u
JOIN feature_usage f ON u.user_id = f.user_id
GROUP BY segment, feature_name;

16. SQL for A/B Test Reporting

WITH groups AS (
SELECT user_id, experiment_group FROM ab_assignments
),
metrics AS (
SELECT user_id, SUM(purchase_amount) AS revenue
FROM orders
GROUP BY user_id
)
SELECT g.experiment_group, COUNT(*) AS users,
AVG(m.revenue) AS avg_revenue
FROM groups g
LEFT JOIN metrics m ON g.user_id = m.user_id
GROUP BY g.experiment_group;

17. Segmentation by Geography, Plan, or Behavior

SELECT plan_type, region, AVG(lifetime_value) AS avg_ltv
FROM customers
GROUP BY plan_type, region;

Or by behavioral metric:

SELECT CASE
WHEN order_count >= 10 THEN 'high value'
WHEN order_count >= 3 THEN 'medium value'
ELSE 'low value'
END AS segment,
COUNT(*) AS users
FROM (
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
) sub
GROUP BY segment;

18. Rolling Averages and Smoothing Metrics

SELECT date,
AVG(revenue) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS revenue_7day_avg
FROM daily_revenue;

19. Visualizing Metrics with SQL-Friendly Tables

Many BI tools support SQL as a data source. Use consistent column names, date truncation, and ordering:

SELECT DATE_TRUNC('week', login_date) AS week, COUNT(DISTINCT user_id) AS weekly_active_users
FROM logins
GROUP BY week
ORDER BY week;

20. Summary and What’s Next

SQL enables powerful, reproducible, and scalable data analysis — from core KPIs to cohort-based insights. With mastery over cohorting, window functions, and segmentation, SQL becomes your primary tool for driving data-informed decisions.

SQL for Business Intelligence & Reporting: From Raw Data to Insights


Table of Contents

  1. Introduction
  2. Role of SQL in Business Intelligence (BI)
  3. Key Metrics and KPIs in Reporting
  4. Data Preparation for Reporting
  5. Filtering and Slicing Data
  6. Aggregation and Grouping Techniques
  7. Common Analytical Queries
  8. Calculating Growth, YoY, MoM Changes
  9. Using CASE for Conditional Logic
  10. Building Executive Dashboards with SQL
  11. Generating Cross-Tab Reports (Pivoting)
  12. Time-Series Analysis for BI
  13. Funnel and Conversion Analysis
  14. Segmentation and Cohort Queries
  15. Joining Multiple Data Sources
  16. Window Functions for Rankings & Trends
  17. Dynamic Date Filtering and Rolling Periods
  18. SQL for Visual BI Tools (Power BI, Tableau, etc.)
  19. Best Practices for BI SQL Design
  20. Summary and What’s Next

1. Introduction

SQL is the foundation of data analysis in business intelligence. Whether you’re building dashboards or generating reports, SQL empowers analysts to extract insights from raw data and convert it into actionable information.


2. Role of SQL in Business Intelligence (BI)

  • Extract, transform, and summarize large datasets
  • Create reusable queries for dashboards and reporting tools
  • Answer business questions with accuracy and speed
  • Feed BI tools (Tableau, Power BI, Looker) with clean datasets

3. Key Metrics and KPIs in Reporting

Examples include:

  • Revenue, Profit, Gross Margin
  • Conversion Rate, Retention Rate
  • Average Order Value, Customer Lifetime Value
  • DAU, MAU (Daily/Monthly Active Users)
  • NPS, Churn Rate

SQL is used to define and calculate all of the above consistently.
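
For example, Average Order Value for the current quarter can be defined once and reused everywhere (column names are illustrative):

SELECT SUM(total) / NULLIF(COUNT(*), 0) AS avg_order_value
FROM orders
WHERE order_date >= DATE_TRUNC('quarter', CURRENT_DATE);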


4. Data Preparation for Reporting

Create views or CTEs to clean and reshape data:

WITH clean_sales AS (
SELECT *, DATE(order_date) AS day
FROM sales
WHERE status = 'completed'
)
SELECT day, SUM(total_amount) AS daily_sales
FROM clean_sales
GROUP BY day;

5. Filtering and Slicing Data

SELECT category, COUNT(*) AS orders
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY category;

Add filters for geography, channel, segment, etc.


6. Aggregation and Grouping Techniques

Use aggregate functions:

  • SUM()
  • AVG()
  • COUNT()
  • MIN(), MAX()

Example:

SELECT region, AVG(order_value) FROM sales GROUP BY region;

7. Common Analytical Queries

  • Top N products:
SELECT product_id, SUM(quantity) AS sold
FROM order_items
GROUP BY product_id
ORDER BY sold DESC
LIMIT 10;
  • Revenue by customer segment
SELECT segment, SUM(revenue)
FROM customers JOIN orders USING(customer_id)
GROUP BY segment;

8. Calculating Growth, YoY, MoM Changes

WITH sales_by_month AS (
SELECT DATE_TRUNC('month', order_date) AS month, SUM(total) AS revenue
FROM orders
GROUP BY month
)
SELECT month, revenue,
revenue - LAG(revenue) OVER (ORDER BY month) AS mom_change,
ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY month)) / LAG(revenue) OVER (ORDER BY month), 2) AS mom_percent
FROM sales_by_month;

9. Using CASE for Conditional Logic

SELECT
CASE
WHEN total >= 1000 THEN 'High Value'
WHEN total >= 500 THEN 'Medium Value'
ELSE 'Low Value'
END AS order_type,
COUNT(*) AS count
FROM orders
GROUP BY order_type;

10. Building Executive Dashboards with SQL

Create summary tables/views like:

SELECT
COUNT(DISTINCT user_id) AS active_users,
SUM(revenue) AS total_revenue,
AVG(order_value) AS avg_order_value
FROM daily_metrics
WHERE date = CURRENT_DATE;

Feed this into a dashboard tool via scheduled queries or materialized views.


11. Generating Cross-Tab Reports (Pivoting)

PostgreSQL:

SELECT * FROM crosstab(
'SELECT region, product, SUM(sales) FROM data GROUP BY 1,2 ORDER BY 1,2',
'SELECT DISTINCT product FROM data ORDER BY 1'
) AS final(region TEXT, product_a INT, product_b INT);
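
Note that crosstab() comes from the tablefunc extension, which must be enabled once per database:

CREATE EXTENSION IF NOT EXISTS tablefunc;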

MySQL workaround:

SELECT region,
SUM(CASE WHEN product = 'A' THEN sales ELSE 0 END) AS product_a,
SUM(CASE WHEN product = 'B' THEN sales ELSE 0 END) AS product_b
FROM data
GROUP BY region;

12. Time-Series Analysis for BI

Use DATE_TRUNC, GROUP BY on time:

SELECT DATE_TRUNC('week', created_at) AS week, COUNT(*) AS signups
FROM users
GROUP BY week
ORDER BY week;

13. Funnel and Conversion Analysis

WITH step1 AS (
SELECT user_id FROM visits WHERE page = 'landing'
),
step2 AS (
SELECT user_id FROM visits WHERE page = 'signup'
)
SELECT
(SELECT COUNT(*) FROM step1) AS landing_visits,
(SELECT COUNT(*) FROM step2) AS signups,
ROUND(100.0 * (SELECT COUNT(*) FROM step2) / (SELECT COUNT(*) FROM step1), 2) AS conversion_rate;

14. Segmentation and Cohort Queries

WITH cohort AS (
SELECT user_id, MIN(signup_date) AS cohort_date
FROM users
GROUP BY user_id
),
activity AS (
SELECT user_id, DATE_TRUNC('month', login_date) AS activity_month
FROM logins
)
SELECT cohort.cohort_date, activity.activity_month, COUNT(DISTINCT activity.user_id) AS active_users
FROM cohort
JOIN activity USING(user_id)
GROUP BY cohort_date, activity_month;

15. Joining Multiple Data Sources

SELECT c.name, o.order_id, o.total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN payments p ON o.order_id = p.order_id
WHERE p.status = 'paid';

Essential for consolidated reporting across systems.


16. Window Functions for Rankings & Trends

SELECT product_id, region,
RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rank
FROM regional_sales;

Used for leaderboards, trend tracking, etc.


17. Dynamic Date Filtering and Rolling Periods

SELECT * FROM sales
WHERE sale_date >= CURRENT_DATE - INTERVAL '30 days';

Make filters parameterized in tools like Power BI or Tableau.


18. SQL for Visual BI Tools (Power BI, Tableau, etc.)

  • Use views to standardize logic (see the view sketch after this list)
  • Use calculated columns or filters in SQL when possible
  • Avoid subqueries and heavy joins in real-time dashboards
  • Ensure column names are friendly and consistent
  • Feed data via scheduled extract or live connection
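
As a sketch of the first point, a reporting view keeps the logic in the database and gives the BI tool stable, friendly column names (view and table names are illustrative):

CREATE OR REPLACE VIEW vw_weekly_signups AS
SELECT DATE_TRUNC('week', created_at) AS signup_week,
COUNT(*) AS signups
FROM users
GROUP BY DATE_TRUNC('week', created_at);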

19. Best Practices for BI SQL Design

  • Modularize logic with CTEs or views
  • Create intermediate staging tables for slow aggregations
  • Maintain naming conventions (dim_, fact_, agg_)
  • Pre-aggregate metrics for performance
  • Use indexing wisely on filterable dimensions (e.g., region, created_at)

20. Summary and What’s Next

SQL is a cornerstone of BI and reporting workflows, providing the engine that powers real-time dashboards, quarterly reports, and executive insights. Mastering its capabilities for filtering, aggregating, joining, and transforming data is crucial for any data analyst or BI engineer.

Analyzing Time-Series Data in SQL: Techniques for Temporal Queries


Table of Contents

  1. Introduction
  2. What Is Time-Series Data?
  3. Common Use Cases for Time-Based Analysis
  4. SQL Data Types for Timestamps
  5. Formatting and Parsing Date/Time Values
  6. Filtering Data by Time Range
  7. Truncating Time Granularity (DATE_TRUNC, TO_CHAR)
  8. Grouping by Day, Week, Month
  9. Calculating Running Totals and Rolling Averages
  10. Using LAG() and LEAD() for Time Comparisons
  11. Time Gap Detection Between Events
  12. Handling Missing Time Intervals
  13. Binning Time into Fixed Buckets
  14. Time-Based Joins (e.g., Point-in-Time Snapshots)
  15. Creating Time Windows with RANGE or ROWS
  16. Using INTERVAL for Dynamic Time Offsets
  17. Indexing Strategies for Time-Series Data
  18. Real-World Use Case: User Logins Over Time
  19. Best Practices for Time-Series SQL
  20. Summary and What’s Next

1. Introduction

Time-series data represents chronological records of events or measurements. SQL provides powerful tools to query, transform, and analyze such data for dashboards, reporting, and forecasting.


2. What Is Time-Series Data?

Time-series data is a sequence of values recorded over time. Examples include:

  • Sensor readings (IoT)
  • Stock prices
  • Web traffic logs
  • Temperature records
  • Financial transactions

Each row typically contains a timestamp and a measured value.


3. Common Use Cases for Time-Based Analysis

  • Trends over time
  • Month-over-month comparisons
  • User activity heatmaps
  • Downtime detection
  • Forecasting and seasonal analysis

4. SQL Data Types for Timestamps

Data Type      Description
DATE           Year, month, day (no time)
TIME           Time only (no date)
TIMESTAMP      Date and time (with or without time zone)
DATETIME       MySQL-specific full timestamp
TIMESTAMPTZ    PostgreSQL timestamp with time zone

5. Formatting and Parsing Date/Time Values

PostgreSQL:

TO_CHAR(timestamp, 'YYYY-MM-DD')

MySQL:

DATE_FORMAT(timestamp, '%Y-%m-%d')

Useful for human-readable outputs or time grouping.


6. Filtering Data by Time Range

SELECT * FROM logins
WHERE login_time BETWEEN '2024-01-01' AND '2024-01-31';

Use >= and < instead of BETWEEN for precise range control.


7. Truncating Time Granularity

PostgreSQL:

SELECT DATE_TRUNC('day', login_time) AS day, COUNT(*)
FROM logins
GROUP BY 1;

MySQL:

SELECT DATE(login_time) AS day, COUNT(*)
FROM logins
GROUP BY day;

8. Grouping by Day, Week, Month

-- Monthly active users
SELECT DATE_TRUNC('month', signup_date) AS month, COUNT(*)
FROM users
GROUP BY month
ORDER BY month;

Pass 'week', 'month', 'hour', etc. to DATE_TRUNC depending on the granularity you need.


9. Calculating Running Totals and Rolling Averages

SELECT date,
       SUM(sales) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
       AVG(sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7day_avg
FROM sales_data;

Window functions make rolling aggregates simple and powerful.


10. Using LAG() and LEAD() for Time Comparisons

SELECT date, sales,
       LAG(sales) OVER (ORDER BY date) AS previous_day_sales,
       sales - LAG(sales) OVER (ORDER BY date) AS daily_change
FROM sales_data;

Ideal for day-over-day or period-over-period analysis.


11. Time Gap Detection Between Events

SELECT user_id, activity_time,
       LAG(activity_time) OVER (PARTITION BY user_id ORDER BY activity_time) AS previous_time,
       activity_time - LAG(activity_time) OVER (PARTITION BY user_id ORDER BY activity_time) AS time_diff
FROM activity_logs;

Detects inactivity gaps, session timeouts, etc.


12. Handling Missing Time Intervals

Use a calendar or date series table to join:

-- Generate days and LEFT JOIN actual data to fill gaps
SELECT d.day, COUNT(logins.id)
FROM generate_series('2024-01-01'::date, '2024-01-31', interval '1 day') d(day)
LEFT JOIN logins ON DATE(logins.login_time) = d.day
GROUP BY d.day;

13. Binning Time into Fixed Buckets

SELECT FLOOR(EXTRACT(EPOCH FROM event_time) / 300) * 300 AS bucket,
       COUNT(*)
FROM events
GROUP BY bucket;

This groups data into 5-minute buckets (300s).


14. Time-Based Joins (e.g., Point-in-Time Snapshots)

SELECT p.id, p.name, r.region
FROM people p
JOIN regions r ON p.snapshot_time BETWEEN r.start_time AND r.end_time;

Useful for tracking temporal dimensions like active contracts or locations.


15. Creating Time Windows with RANGE or ROWS

SELECT date, SUM(sales) OVER (
  ORDER BY date
  RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW
) AS weekly_sales
FROM sales_data;

RANGE is value-based, while ROWS is row-count-based.


16. Using INTERVAL for Dynamic Time Offsets

SELECT * FROM purchases
WHERE purchase_time >= NOW() - INTERVAL '30 days';

Dynamic filters for last X days/hours/weeks.


17. Indexing Strategies for Time-Series Data

  • Use B-tree indexes on timestamp columns (see the sketch after this list)
  • Composite indexes: (timestamp, user_id)
  • Consider partitioning large datasets by time range
  • Use descending indexes for reverse chronological queries
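
A sketch of the first two points against the login tables used in this article (index names are illustrative):

-- Range scans on time benefit from a plain B-tree index
CREATE INDEX idx_logins_time ON user_logins (login_time);

-- Composite index for per-user, reverse-chronological queries
CREATE INDEX idx_logins_user_time ON user_logins (user_id, login_time DESC);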

18. Real-World Use Case: User Logins Over Time

SELECT DATE_TRUNC('week', login_time) AS week, COUNT(*) AS logins
FROM user_logins
WHERE login_time >= CURRENT_DATE - INTERVAL '3 months'
GROUP BY week
ORDER BY week;

Used for weekly user activity trends.


19. Best Practices for Time-Series SQL

  • Store time in UTC
  • Avoid redundant DATE() or CAST() in filters — may prevent index use
  • Always index time fields in high-volume logs
  • Use materialized views for pre-aggregated reports
  • Benchmark with real volumes to test query plans

20. Summary and What’s Next

SQL offers powerful techniques for analyzing time-series data — from simple grouping to complex windowing functions. Mastering temporal queries equips you to build advanced analytics on user behavior, trends, forecasts, and system events.