
Working with NoSQL-like Features in SQL: Arrays, HSTORE, and Semi-Structured Models


Table of Contents

  1. Introduction
  2. Why SQL Supports NoSQL-Like Structures
  3. Arrays in SQL: Definition and Use Cases
  4. Creating and Querying Array Columns
  5. Common Array Functions and Operators
  6. Unnesting Arrays with unnest()
  7. Searching Inside Arrays
  8. Aggregating and Constructing Arrays
  9. The HSTORE Data Type: Key-Value Storage in SQL
  10. Creating HSTORE Columns and Inserting Data
  11. Querying HSTORE Fields
  12. Updating and Deleting Keys in HSTORE
  13. Indexing Arrays and HSTORE Columns
  14. JSON vs HSTORE vs Arrays: When to Use What
  15. Combining Structured and Semi-Structured Data
  16. Using Arrays and HSTORE in Joins and Subqueries
  17. Real-World Example: Product Tags and Attributes
  18. Best Practices and Anti-Patterns
  19. Performance Considerations
  20. Summary and What’s Next

1. Introduction

Traditional relational databases like PostgreSQL are increasingly used for semi-structured and schema-flexible workloads. Features like ARRAY, HSTORE, and JSON/JSONB allow SQL to behave similarly to NoSQL, while preserving the power of relational logic.


2. Why SQL Supports NoSQL-Like Structures

  • Reduce over-normalization for simple list/map data
  • Flexibly store dynamic or sparse attributes
  • Avoid extra lookup tables for small key-value structures
  • Improve readability and performance for embedded fields

3. Arrays in SQL: Definition and Use Cases

An array is a data type that can store multiple values in a single column.

Use Cases:

  • Tags (['tech', 'finance'])
  • Skills (['SQL', 'Python'])
  • Multi-category products
  • Roles/permissions

4. Creating and Querying Array Columns

CREATE TABLE users (
id SERIAL PRIMARY KEY,
name TEXT,
interests TEXT[]
);

Inserting data:

INSERT INTO users (name, interests)
VALUES ('Anay', ARRAY['reading', 'coding']);

5. Common Array Functions and Operators

Function/Operator        | Description
= ANY(array)             | Checks if a value is in the array
array_length(array, 1)   | Returns the length of the array
array_append(array, val) | Adds an element
array_remove(array, val) | Removes an element
array_position()         | Finds the index of a value

Example:

SELECT * FROM users WHERE 'coding' = ANY(interests);
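
A few more of these in action against the users table above (PostgreSQL syntax; the added value 'writing' is just illustrative):

-- Length of the interests array along dimension 1
SELECT name, array_length(interests, 1) AS num_interests FROM users;

-- array_append/array_remove return a new array; persist it with an UPDATE
UPDATE users SET interests = array_append(interests, 'writing') WHERE name = 'Anay';
UPDATE users SET interests = array_remove(interests, 'reading') WHERE name = 'Anay';

-- 1-based position of a value (NULL if not present)
SELECT array_position(interests, 'coding') FROM users WHERE name = 'Anay';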

6. Unnesting Arrays with unnest()

Convert array to rows:

SELECT id, unnest(interests) AS interest
FROM users;

This is useful for filtering, grouping, and aggregating arrays.
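
For example, counting how many users list each interest (a small sketch on the same users table):

SELECT interest, COUNT(*) AS user_count
FROM users, unnest(interests) AS interest
GROUP BY interest
ORDER BY user_count DESC;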


7. Searching Inside Arrays

SELECT name FROM users WHERE interests @> ARRAY['reading'];

The @> operator checks if the array contains all values of the provided array.


8. Aggregating and Constructing Arrays

Build arrays from rows:

SELECT user_id, ARRAY_AGG(skill ORDER BY skill) AS skill_set
FROM user_skills
GROUP BY user_id;

Remove duplicates:

SELECT ARRAY(SELECT DISTINCT unnest(interests)) FROM users;

9. The HSTORE Data Type: Key-Value Storage in SQL

HSTORE stores sets of key-value pairs within a single column.

Example:

CREATE TABLE products (
id SERIAL PRIMARY KEY,
name TEXT,
attributes HSTORE
);

10. Creating HSTORE Columns and Inserting Data

Enable extension (PostgreSQL):

CREATE EXTENSION IF NOT EXISTS hstore;

Insert data:

INSERT INTO products (name, attributes)
VALUES ('Laptop', 'brand => Dell, cpu => i7, ram => 16GB');

11. Querying HSTORE Fields

SELECT name FROM products WHERE attributes -> 'brand' = 'Dell';

Use -> to get a value and ? to check whether a key exists:

SELECT * FROM products WHERE attributes ? 'cpu';

12. Updating and Deleting Keys in HSTORE

Add a key:

UPDATE products
SET attributes = attributes || 'gpu => "RTX 3060"'::hstore
WHERE name = 'Laptop';

Remove a key:

UPDATE products
SET attributes = delete(attributes, 'ram')
WHERE name = 'Laptop';

13. Indexing Arrays and HSTORE Columns

PostgreSQL supports GIN indexes for both:

CREATE INDEX idx_interests ON users USING GIN (interests);
CREATE INDEX idx_attributes ON products USING GIN (attributes);

This dramatically speeds up @> and key existence queries.


14. JSON vs HSTORE vs Arrays: When to Use What

Type   | Use Case                            | Pros                        | Cons
ARRAY  | Simple lists, same type             | Lightweight, fast indexing  | Limited structure
HSTORE | Flat key-value pairs (strings only) | Simple, compact             | No nesting or arrays
JSONB  | Nested, structured data             | Flexible, full JSON support | Heavier, more verbose

15. Combining Structured and Semi-Structured Data

CREATE TABLE events (
id SERIAL,
user_id INT,
event_type TEXT,
metadata JSONB
);

Example usage:

  • metadata: stores page URL, campaign ID, device info
  • event_type: keeps structure queryable

You can combine array, hstore, and json in one schema depending on requirements.


16. Using Arrays and HSTORE in Joins and Subqueries

Join user skills stored as arrays:

SELECT u.name, s.skill
FROM users u
CROSS JOIN LATERAL unnest(u.interests) AS s(skill)
JOIN skill_details sd ON sd.name = s.skill;

You can also join on HSTORE values by extracting them with -> (casting if needed) and treating the result as a virtual column.
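
For example, joining on an extracted attribute (brands here is a hypothetical lookup table, not part of the schema above):

SELECT p.name, b.country
FROM products p
JOIN brands b ON b.brand_name = p.attributes -> 'brand';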


17. Real-World Example: Product Tags and Attributes

CREATE TABLE catalog (
id SERIAL,
product_name TEXT,
tags TEXT[],
specs HSTORE
);

Use case:

  • tags: quick filtering by category
  • specs: flexible storage of size, color, model, etc.

Query:

SELECT * FROM catalog
WHERE 'electronics' = ANY(tags) AND specs -> 'color' = 'black';

18. Best Practices and Anti-Patterns

✅ Use arrays for short, homogeneous data
✅ Use hstore when keys vary but are flat
✅ Use JSON for nested, structured data
✅ Always index if filtering is frequent
✅ Validate data with CHECK constraints or application logic (see the sketch after this list)

❌ Avoid storing unrelated data in the same JSON or hstore
❌ Avoid deeply nested arrays (hard to query and maintain)
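
A minimal sketch of constraint-based validation, reusing the products and catalog tables from earlier (the limit of 10 tags is arbitrary):

-- Every product must carry a brand attribute
ALTER TABLE products
    ADD CONSTRAINT chk_has_brand CHECK (attributes ? 'brand');

-- Keep tag lists short
ALTER TABLE catalog
    ADD CONSTRAINT chk_max_tags CHECK (cardinality(tags) <= 10);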


19. Performance Considerations

  • Arrays and HSTORE are faster than JSON for flat structures
  • Avoid large arrays in rows — may hit row size limits
  • Use GIN indexes but be mindful of write overhead
  • Normalize heavily queried data if necessary
  • Combine with materialized views for reporting workloads

20. Summary and What’s Next

PostgreSQL’s support for arrays, HSTORE, and other semi-structured types provides SQL with powerful NoSQL-like flexibility — without giving up relational consistency or query optimization. These features let you handle real-world, variable data models efficiently.

Version Control for SQL Scripts: Git, Database Migrations, and Change Management


Table of Contents

  1. Introduction
  2. Why Version Control Matters for SQL
  3. Using Git to Track SQL Scripts
  4. Best Practices for Organizing SQL Repositories
  5. Commit Strategies for Schema and Data Scripts
  6. Understanding Database Migrations
  7. Manual vs Automated Migrations
  8. Tools for Managing Migrations (Flyway, Liquibase, dbmate, etc.)
  9. Writing Migration Files with Up/Down Scripts
  10. Naming Conventions and File Structures
  11. Applying Migrations in CI/CD Pipelines
  12. Rollbacks and Reverting Schema Changes
  13. Tracking Data Migrations (Safe Practices)
  14. Managing Migration History Tables
  15. Handling Conflicts in Team Environments
  16. Git Hooks and Pre-Deployment Validation
  17. Tagging Releases and Change Logs
  18. Managing Environment-Specific Differences
  19. Real-World Workflow Example (Dev → Staging → Prod)
  20. Summary and What’s Next

1. Introduction

While developers commonly version-control application code, SQL scripts are often left unmanaged. Using Git and migration tools allows teams to track, audit, and automate database changes alongside application code, reducing deployment risk and ensuring consistency.


2. Why Version Control Matters for SQL

  • Avoid untracked schema drift
  • Enable reproducible environments
  • Share changes across teams
  • Roll back mistakes quickly
  • Integrate with CI/CD pipelines
  • Document the evolution of database design

3. Using Git to Track SQL Scripts

Store SQL in a structured repo:

sql/
├── schema/
│   ├── 001_init.sql
│   └── 002_add_users_table.sql
├── data/
│   └── seed_users.sql
└── migrations/
    ├── V1__init_schema.sql
    └── V2__add_orders_table.sql

Each file should:

  • Contain a single logical change
  • Be idempotent where possible (see the sketch after this list)
  • Be committed with a descriptive message
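
A sketch of an idempotent schema script (PostgreSQL syntax assumed):

-- Safe to run more than once
CREATE TABLE IF NOT EXISTS customers (
    id SERIAL PRIMARY KEY,
    email TEXT UNIQUE
);

CREATE INDEX IF NOT EXISTS idx_customers_email ON customers (email);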

4. Best Practices for Organizing SQL Repositories

  • Separate DDL (schema) and DML (data)
  • Use consistent file naming (Vx__description.sql)
  • Include environment folders if configs differ
  • Use README files to explain script purpose
  • Avoid dumping entire schema with pg_dump or mysqldump as raw SQL

5. Commit Strategies for Schema and Data Scripts

Good commit:

feat(db): add customers table with email unique constraint

Bad commit:

Update.sql

Group related schema changes and ensure testability before pushing.


6. Understanding Database Migrations

A migration is a tracked, incremental change to the database schema or data. It consists of:

  • Up: the change (e.g., CREATE TABLE)
  • Down: the rollback (e.g., DROP TABLE)

Migrations allow databases to evolve safely over time.


7. Manual vs Automated Migrations

Strategy  | Description
Manual    | Run SQL files manually in order
Automated | Use a tool to track and apply them

Tools enforce order, uniqueness, and rollback logic.


8. Tools for Managing Migrations

Tool      | Language | Features
Flyway    | Java     | Convention over config, SQL or Java support
Liquibase | Java/XML | Supports XML/JSON/YAML change logs
dbmate    | Go       | Lightweight, plain SQL, CI-friendly
Alembic   | Python   | Used with SQLAlchemy (for Python apps)
Sqitch    | Perl     | Git-style database change management

9. Writing Migration Files with Up/Down Scripts

Flyway Example:

-- V3__create_invoices_table.sql
CREATE TABLE invoices (
id SERIAL PRIMARY KEY,
customer_id INT REFERENCES customers(id),
amount NUMERIC,
due_date DATE
);

Rollback (optional):

-- Down.sql
DROP TABLE invoices;

10. Naming Conventions and File Structures

Use clear versioning:

V1__init_schema.sql  
V2__add_products.sql
V3__add_index_to_orders.sql

Avoid:

change1.sql, change2.sql, latest_change.sql

Maintain a chronological, logical order.


11. Applying Migrations in CI/CD Pipelines

Integrate migration tools into your deployment pipeline:

  • Step 1: Build & test application
  • Step 2: Run flyway migrate or equivalent
  • Step 3: Deploy updated app code

Use containerized DB instances to test migrations automatically.


12. Rollbacks and Reverting Schema Changes

Always write reversible migrations if possible:

-- Up
ALTER TABLE orders ADD COLUMN delivered_at TIMESTAMP;

-- Down
ALTER TABLE orders DROP COLUMN delivered_at;

If not reversible (e.g., DROP TABLE), make it explicit.


13. Tracking Data Migrations (Safe Practices)

Avoid destructive data updates unless:

  • Backups exist
  • Scripts are tested
  • Wrapped in transactions

Log data migrations in a data_change_log table if necessary.
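
A minimal sketch of such a log table; the table and column names here are illustrative, not part of any migration tool:

CREATE TABLE IF NOT EXISTS data_change_log (
    id SERIAL PRIMARY KEY,
    script_name TEXT NOT NULL,
    description TEXT,
    rows_affected BIGINT,
    executed_by TEXT DEFAULT CURRENT_USER,
    executed_at TIMESTAMP DEFAULT NOW()
);

-- Record the outcome of a data migration, ideally in the same transaction
INSERT INTO data_change_log (script_name, description, rows_affected)
VALUES ('V7__backfill_customer_country.sql', 'Backfill missing countries from billing address', 1843);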


14. Managing Migration History Tables

Migration tools maintain metadata tables:

Table                 | Purpose
flyway_schema_history | Tracks executed migrations
schema_version        | Version control state

These prevent re-applying the same migration.
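
For example, Flyway's history table can be inspected directly (column names as created by recent Flyway versions):

SELECT installed_rank, version, description, installed_on, success
FROM flyway_schema_history
ORDER BY installed_rank;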


15. Handling Conflicts in Team Environments

Scenario: Two developers create V3 scripts independently
Solution:

  • Use timestamp-based versions (e.g., 20240512_add_x.sql)
  • Or assign migration IDs via PR reviews
  • Avoid merging overlapping schema changes without discussion

16. Git Hooks and Pre-Deployment Validation

Use pre-commit hooks to:

  • Check for duplicate migration versions
  • Enforce naming conventions
  • Lint or format SQL (e.g., via sqlfluff)

Example with Husky (Node):

#!/bin/sh
npx sqlfluff lint sql/migrations/*.sql

17. Tagging Releases and Change Logs

Tag schema versions along with code versions:

git tag v2.1.0-db-migration

Keep a CHANGELOG.md:

## V3 - 2024-05-12
- Added invoices table
- Dropped deprecated indexes

18. Managing Environment-Specific Differences

Use conditional logic or templating tools (dbt, envsubst):

-- Dev only
CREATE TABLE IF NOT EXISTS debug_logs (...);

Avoid hardcoding hostnames, secrets, or environment-specific logic in SQL scripts.


19. Real-World Workflow Example (Dev → Staging → Prod)

  1. Developer writes V5__add_column_x.sql
  2. Commit and push to Git
  3. CI runs:
    • Linting
    • flyway migrate on test DB
    • Run integration tests
  4. PR merged → CD triggers migration on staging
  5. Migration manually approved or auto-run on prod

20. Summary and What’s Next

Version-controlling SQL scripts and using structured database migration tools is critical for modern software development and DevOps. It ensures that your schema is as testable, reviewable, and auditable as your code.

SQL Performance Tuning with EXPLAIN PLAN: Understanding Query Optimization


Table of Contents

  1. Introduction
  2. Why SQL Performance Tuning Matters
  3. What Is EXPLAIN PLAN?
  4. Basic Usage of EXPLAIN PLAN (PostgreSQL, MySQL, Oracle)
  5. Understanding Query Execution Stages
  6. Interpreting Rows, Cost, Width, Time
  7. Common Operators in Query Plans
  8. Seq Scan vs Index Scan
  9. Join Types and Their Cost Implications
  10. Sorting, Filtering, Aggregation Overheads
  11. Using ANALYZE for Actual Execution Times
  12. Detecting Missing Indexes
  13. Query Rewriting for Optimization
  14. Index Selection and Composite Indexes
  15. Table Statistics and Autovacuum (PostgreSQL)
  16. Partitioning and Clustering for Performance
  17. Caching and Materialized Views
  18. Performance Pitfalls and Anti-Patterns
  19. Real-World Tuning Example
  20. Summary and What’s Next

1. Introduction

No matter how well-designed your database is, inefficient SQL queries can lead to performance bottlenecks. EXPLAIN PLAN is a diagnostic tool that helps you understand how a query is executed, and how to improve it.


2. Why SQL Performance Tuning Matters

  • Faster query execution
  • Reduced cloud/data warehouse costs
  • Better user experience for dashboards
  • Scalability for growing data volumes
  • Helps meet SLAs in production systems

3. What Is EXPLAIN PLAN?

EXPLAIN (or EXPLAIN PLAN) is a SQL command that shows the query execution strategy chosen by the database optimizer.

It tells you:

  • Which indexes (if any) are used
  • How rows are scanned or filtered
  • The expected cost and row count
  • The execution order of joins and filters

4. Basic Usage of EXPLAIN PLAN

PostgreSQL:

EXPLAIN SELECT * FROM orders WHERE customer_id = 123;

MySQL:

EXPLAIN SELECT * FROM orders WHERE customer_id = 123;

Oracle:

EXPLAIN PLAN FOR SELECT * FROM orders WHERE customer_id = 123;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

Use EXPLAIN ANALYZE (PostgreSQL) to run the query and show actual performance.


5. Understanding Query Execution Stages

Typical stages include:

  • Table Scan (Seq Scan or Index Scan)
  • Filter
  • Sort
  • Join (Nested Loop, Hash Join, Merge Join)
  • Aggregate (GROUP BY, COUNT, SUM)
  • Output rows

6. Interpreting Rows, Cost, Width, Time

Example (PostgreSQL):

Seq Scan on orders  (cost=0.00..431.00 rows=21000 width=64)

Metric | Meaning
Cost   | Estimated cost in planner units (startup cost..total cost)
Rows   | Estimated number of output rows
Width  | Estimated row size in bytes

Use these estimates to evaluate efficiency and selectivity.


7. Common Operators in Query Plans

Operator          | Description
Seq Scan          | Sequential scan (entire table)
Index Scan        | Scan using an index (ordered access)
Index Only Scan   | Index contains all required columns
Bitmap Index Scan | Efficient for multi-key filtering
Hash Join         | Fast join for large unordered sets
Nested Loop       | Best for small joined tables

8. Seq Scan vs Index Scan

Seq Scan is used when:

  • The table is small
  • Index is not selective
  • The cost of random access is too high

Index Scan is preferred when:

  • Filtering on indexed columns
  • Returning a few rows

Example:

EXPLAIN SELECT * FROM users WHERE id = 5;

This should show an Index Scan using the primary key index (e.g., users_pkey in PostgreSQL).


9. Join Types and Their Cost Implications

Join Type   | Use Case
Nested Loop | Small inner table, indexed
Hash Join   | Large unsorted tables
Merge Join  | Both tables sorted on join key

Joins can dominate query cost — reordering tables or adding indexes may help.


10. Sorting, Filtering, Aggregation Overheads

Sorting is expensive on large datasets:

ORDER BY created_at DESC

Adding an index on created_at DESC can eliminate the sort phase.
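
For example, assuming an orders table with a created_at column:

CREATE INDEX idx_orders_created_at_desc ON orders (created_at DESC);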

Aggregates like GROUP BY, HAVING may trigger hash or sort operations. Use indexes or pre-aggregated tables for heavy workloads.


11. Using ANALYZE for Actual Execution Times

EXPLAIN ANALYZE SELECT * FROM orders WHERE total > 1000;

Returns actual:

  • Row count
  • Execution time per step
  • Number of loops

Helps detect misestimates (e.g., 10k expected rows but 1 million actual).


12. Detecting Missing Indexes

Use EXPLAIN to find sequential scans on large tables:

EXPLAIN SELECT * FROM sales WHERE product_id = 100;

If Seq Scan appears, consider:

CREATE INDEX idx_sales_product_id ON sales(product_id);

13. Query Rewriting for Optimization

Instead of:

SELECT * FROM orders WHERE status != 'shipped';

Rewrite to:

SELECT * FROM orders WHERE status IN ('pending', 'cancelled');

Negations are hard to optimize — always prefer positive filtering.


14. Index Selection and Composite Indexes

Composite indexes help multi-column filters:

CREATE INDEX idx_customer_order_date ON orders(customer_id, order_date);

Useful for:

SELECT * FROM orders WHERE customer_id = 123 AND order_date > '2023-01-01';

The index can be used when the filter covers a leftmost prefix of its columns; a query that skips customer_id generally cannot use it efficiently, as shown below.
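
For example (the plans described here are typical, not guaranteed):

-- Can use idx_customer_order_date (leftmost column in the filter)
SELECT * FROM orders WHERE customer_id = 123;

-- Skips customer_id; usually needs a separate index on order_date or falls back to a scan
SELECT * FROM orders WHERE order_date > '2023-01-01';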


15. Table Statistics and Autovacuum (PostgreSQL)

PostgreSQL’s query planner depends on table stats:

ANALYZE orders;

Autovacuum keeps stats up-to-date, but manual updates can help during large ETL jobs.


16. Partitioning and Clustering for Performance

Partitioning:

CREATE TABLE sales_y2024 PARTITION OF sales
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

Improves filter performance on date ranges; note that the TO bound is exclusive.

Clustering:

CLUSTER orders USING idx_orders_customer_id;

Physically organizes data by index — faster reads for ordered scans.


17. Caching and Materialized Views

Use materialized views to cache expensive queries:

CREATE MATERIALIZED VIEW top_products AS
SELECT product_id, SUM(sales) AS total_sales FROM sales GROUP BY product_id;

Refresh as needed:

REFRESH MATERIALIZED VIEW top_products;

18. Performance Pitfalls and Anti-Patterns

  • SELECT * in large joins
  • Functions applied to columns in the WHERE clause (e.g., DATE(created_at)) prevent index use
  • Using DISTINCT instead of GROUP BY
  • Sorting large unsorted tables
  • Implicit type casts (e.g., WHERE id = '123' with id as INT)

19. Real-World Tuning Example

Before:

SELECT * FROM logs WHERE DATE(timestamp) = CURRENT_DATE;

Problem:

  • Disables index on timestamp

Optimized:

SELECT * FROM logs
WHERE timestamp >= CURRENT_DATE AND timestamp < CURRENT_DATE + INTERVAL '1 day';

Now uses index — dramatically faster.


20. Summary and What’s Next

EXPLAIN PLAN is your window into SQL’s execution logic. By interpreting plans, rewriting queries, and applying indexing strategies, you can significantly improve query performance and reduce compute costs.

SQL for Financial Analytics: Profit & Loss Reporting, Forecasting, and Ratio Analysis


Table of Contents

  1. Introduction
  2. Importance of SQL in Financial Analysis
  3. Core Financial Statements in Data Tables
  4. Building a Profit & Loss (P&L) Report in SQL
  5. Structuring Accounts: Revenue vs Expense Categories
  6. Monthly and Quarterly P&L Trends
  7. EBITDA and Operating Margin in SQL
  8. Using Window Functions for Running Totals
  9. Forecasting Future Revenue with SQL Techniques
  10. Year-over-Year and Quarter-over-Quarter Comparisons
  11. Common Financial Ratios and Metrics
  12. Working with Budget vs Actuals
  13. Rolling 12-Month Financials
  14. Revenue Recognition Logic in SQL
  15. Modeling Deferred Revenue and Accruals
  16. Real-World Example: SaaS Financial Analytics
  17. Cohort-Based Revenue Analysis
  18. Integrating SQL Output into BI Dashboards
  19. Best Practices for Financial Modeling in SQL
  20. Summary and What’s Next

1. Introduction

Financial analytics translates raw transactional data into meaningful insights such as profitability, performance, and forecasted revenue. SQL provides the foundation for transforming structured financial data into clear, scalable reports like P&Ls, forecasts, and KPIs.


2. Importance of SQL in Financial Analysis

  • Extract and aggregate general ledger or ERP data
  • Prepare cash flow, P&L, and balance sheet views
  • Enable dynamic filtering by region, department, or entity
  • Build fiscal calendars and roll-forward reports
  • Power finance dashboards in BI tools

3. Core Financial Statements in Data Tables

Typical financial table structures:

Table        | Description
transactions | Raw debits and credits by account
accounts     | Chart of accounts (COA)
departments  | Business units / cost centers
calendar     | Date and fiscal mappings
budgets      | Forecasted vs actual financials

4. Building a Profit & Loss (P&L) Report in SQL

SELECT
    a.account_group,
    DATE_TRUNC('month', t.transaction_date) AS month,
    SUM(CASE WHEN t.type = 'debit' THEN t.amount ELSE -t.amount END) AS amount
FROM transactions t
JOIN accounts a ON t.account_id = a.account_id
WHERE a.account_group IN ('Revenue', 'COGS', 'Expenses')
GROUP BY a.account_group, month
ORDER BY month, a.account_group;

5. Structuring Accounts: Revenue vs Expense Categories

Account Group | Account Name
Revenue       | Product Revenue, Subscriptions
COGS          | Hosting, Licensing, Support
Expenses      | Salaries, Rent, Marketing

Use a consistent account hierarchy in your accounts table to simplify queries.


6. Monthly and Quarterly P&L Trends

SELECT
DATE_TRUNC('quarter', transaction_date) AS quarter,
account_group,
SUM(amount) AS total
FROM p_l_data
GROUP BY quarter, account_group;

Enable rolling views with window functions.


7. EBITDA and Operating Margin in SQL

EBITDA = Earnings Before Interest, Taxes, Depreciation, and Amortization.

SELECT
    report_month,
    SUM(CASE WHEN account_group = 'Revenue' THEN amount ELSE 0 END) AS revenue,
    SUM(CASE WHEN account_group = 'COGS' THEN amount ELSE 0 END) AS cogs,
    SUM(CASE WHEN account_group IN ('Salaries', 'Rent') THEN amount ELSE 0 END) AS opex,
    SUM(CASE WHEN account_group = 'Revenue' THEN amount ELSE 0 END)
      - SUM(CASE WHEN account_group = 'COGS' THEN amount ELSE 0 END)
      - SUM(CASE WHEN account_group IN ('Salaries', 'Rent') THEN amount ELSE 0 END) AS ebitda
FROM p_l_data
GROUP BY report_month;

Operating Margin:

(EBITDA / Revenue) * 100
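
The same calculation in SQL, reusing the aggregates from the query above (a sketch):

WITH monthly AS (
    SELECT
        report_month,
        SUM(CASE WHEN account_group = 'Revenue' THEN amount ELSE 0 END) AS revenue,
        SUM(CASE WHEN account_group = 'COGS' THEN amount ELSE 0 END) AS cogs,
        SUM(CASE WHEN account_group IN ('Salaries', 'Rent') THEN amount ELSE 0 END) AS opex
    FROM p_l_data
    GROUP BY report_month
)
SELECT
    report_month,
    revenue - cogs - opex AS ebitda,
    (revenue - cogs - opex) / NULLIF(revenue, 0) * 100 AS operating_margin_pct
FROM monthly;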

8. Using Window Functions for Running Totals

SELECT
report_month,
account_group,
SUM(amount) OVER (PARTITION BY account_group ORDER BY report_month) AS cumulative_amount
FROM p_l_data;

Track cumulative revenue or cost trends.


9. Forecasting Future Revenue with SQL Techniques

Simple Linear Projection:

WITH monthly_revenue AS (
    SELECT DATE_TRUNC('month', t.transaction_date) AS month, SUM(t.amount) AS revenue
    FROM transactions t
    JOIN accounts a ON t.account_id = a.account_id
    WHERE a.account_group = 'Revenue'
    GROUP BY month
)
SELECT *,
    LAG(revenue, 1) OVER (ORDER BY month) AS last_month,
    revenue - LAG(revenue, 1) OVER (ORDER BY month) AS monthly_growth
FROM monthly_revenue;

Use that growth to extrapolate future months manually or in views.
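
One deliberately simple extrapolation, assuming monthly_revenue from the query above has been saved as a view: project next month as the latest revenue plus the average month-over-month change.

WITH growth AS (
    SELECT month, revenue,
           revenue - LAG(revenue) OVER (ORDER BY month) AS monthly_growth
    FROM monthly_revenue
)
SELECT
    MAX(month) + INTERVAL '1 month' AS forecast_month,
    (SELECT revenue FROM growth ORDER BY month DESC LIMIT 1) + AVG(monthly_growth) AS forecast_revenue
FROM growth;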


10. Year-over-Year and Quarter-over-Quarter Comparisons

SELECT
month,
revenue,
LAG(revenue, 12) OVER (ORDER BY month) AS revenue_last_year,
(revenue - LAG(revenue, 12) OVER (ORDER BY month)) / NULLIF(LAG(revenue, 12) OVER (ORDER BY month), 0) * 100 AS yoy_growth
FROM monthly_revenue;

11. Common Financial Ratios and Metrics

Metric            | SQL Expression (Conceptual)
Gross Margin      | (Revenue – COGS) / Revenue
Net Profit Margin | Net Profit / Revenue
CAC               | Total Marketing Spend / New Customers Acquired
LTV               | ARPU × Customer Lifespan
Churn Rate        | Lost Customers / Starting Customers

Combine fact tables with customer/order dimensions for full visibility.


12. Working with Budget vs Actuals

SELECT
    b.account_id,
    c.month,
    SUM(b.amount) AS budgeted,
    SUM(t.amount) AS actual,
    SUM(t.amount) - SUM(b.amount) AS variance
FROM budgets b
JOIN transactions t
    ON b.account_id = t.account_id
   AND b.month = DATE_TRUNC('month', t.transaction_date)
JOIN calendar c ON b.month = c.date
GROUP BY b.account_id, c.month;

13. Rolling 12-Month Financials

SELECT
month,
SUM(amount) OVER (ORDER BY month ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS rolling_12m
FROM monthly_revenue;

Useful for smoothing volatility and spotting trends.


14. Revenue Recognition Logic in SQL

For SaaS companies or subscription services:

-- Spread revenue over service period
SELECT
    customer_id,
    service_start,
    service_end,
    GENERATE_SERIES(service_start, service_end, INTERVAL '1 month') AS recognition_month,
    amount / (EXTRACT(YEAR FROM age(service_end, service_start)) * 12
            + EXTRACT(MONTH FROM age(service_end, service_start)) + 1) AS monthly_revenue
FROM subscriptions;

15. Modeling Deferred Revenue and Accruals

-- Deferred revenue = total invoice - revenue recognized so far
SELECT
invoice_id,
total_amount,
SUM(recognized_amount) AS revenue_recognized,
total_amount - SUM(recognized_amount) AS deferred_revenue
FROM revenue_schedule
GROUP BY invoice_id, total_amount;

Track liabilities accurately in SQL reporting views.


16. Real-World Example: SaaS Financial Analytics

Build dashboards using:

  • Monthly Recurring Revenue (MRR)
  • Churn Rate
  • Net Revenue Retention (NRR)
  • Customer Lifetime Value (LTV)
  • ARPA (Average Revenue per Account)

Example:

SELECT DATE_TRUNC('month', invoice_date) AS mrr_month, SUM(invoice_amount) AS mrr
FROM subscriptions
GROUP BY mrr_month;

17. Cohort-Based Revenue Analysis

WITH cohorts AS (
    SELECT customer_id, MIN(subscription_date) AS cohort_date
    FROM subscriptions
    GROUP BY customer_id
),
activity AS (
    SELECT s.customer_id, DATE_TRUNC('month', s.payment_date) AS active_month, s.amount
    FROM subscriptions s
)
SELECT
    cohort_date,
    active_month,
    SUM(amount) AS revenue
FROM cohorts
JOIN activity USING (customer_id)
GROUP BY cohort_date, active_month;

Visualize revenue evolution per cohort.


18. Integrating SQL Output into BI Dashboards

  • Build views or materialized tables for metrics
  • Feed SQL outputs to Tableau, Power BI, or Looker
  • Use parameterized queries for flexible analysis
  • Automate scheduling via dbt or Airflow

19. Best Practices for Financial Modeling in SQL

  • Normalize accounting data with clear account hierarchies
  • Use fiscal_calendar dimensions for YTD/MTD logic
  • Avoid mixing debits and credits in reporting tables
  • Always reconcile totals (trial balance logic)
  • Maintain audit trails (source file, ETL job, timestamp)

20. Summary and What’s Next

SQL is indispensable in financial analytics — from producing core reports like P&Ls to modeling deferred revenue and forecasting trends. With structured design and advanced SQL techniques, finance teams can build powerful, auditable, and automated pipelines.

Data Warehousing Concepts in SQL: Understanding Star and Snowflake Schemas


Table of Contents

  1. Introduction
  2. What Is a Data Warehouse?
  3. OLTP vs OLAP: The Need for Warehousing
  4. Key Components of a Data Warehouse
  5. What Is Dimensional Modeling?
  6. Facts and Dimensions Explained
  7. The Star Schema Architecture
  8. Example of a Star Schema Design
  9. The Snowflake Schema Architecture
  10. Star vs Snowflake Schema Comparison
  11. Denormalization and Performance Trade-offs
  12. Creating Fact and Dimension Tables in SQL
  13. Surrogate Keys and Their Role
  14. Slowly Changing Dimensions (SCD Types)
  15. Handling Time Dimensions
  16. Using Views to Model Analytical Layers
  17. Querying Star Schemas for BI and Reporting
  18. Best Practices for Data Warehouse Schema Design
  19. Real-World Warehouse Scenario: E-Commerce Analytics
  20. Summary and What’s Next

1. Introduction

Data warehousing is the foundation of modern analytics, BI dashboards, and reporting systems. Understanding how to design and query data warehouse schemas—especially Star and Snowflake schemas—is key to building efficient, scalable analytics systems with SQL.


2. What Is a Data Warehouse?

A data warehouse is a centralized repository designed for analytical processing (OLAP) rather than transactional operations. It integrates data from multiple sources, cleans and structures it, and makes it query-ready for decision-making.


3. OLTP vs OLAP: The Need for Warehousing

Feature         | OLTP (e.g., MySQL)     | OLAP (e.g., Snowflake, BigQuery)
Purpose         | Transaction processing | Analytical processing
Normalization   | Highly normalized      | Denormalized or star schemas
Queries         | Simple, short-lived    | Complex, large, aggregate queries
Write-intensive | Yes                    | No (mostly read-only)

4. Key Components of a Data Warehouse

  • Fact tables: Store metrics (e.g., sales, orders)
  • Dimension tables: Store attributes (e.g., customer info, product details)
  • ETL/ELT pipelines: Load and transform data
  • Schema design: Star or Snowflake
  • Views and aggregates: Support analysis and reporting

5. What Is Dimensional Modeling?

Dimensional modeling is a design technique for warehouses where data is organized around:

  • Facts (measurable events)
  • Dimensions (descriptive attributes)

The goal is to create schemas that are intuitive and fast to query.


6. Facts and Dimensions Explained

Component | Description                        | Example
Fact      | Numeric, quantitative values       | sales_amount, units_sold
Dimension | Descriptive attributes about facts | product_name, customer_region

Facts link to dimensions using foreign keys.


7. The Star Schema Architecture

In a star schema, a central fact table connects directly to surrounding denormalized dimension tables.

                  +--------------+
                  | dim_product  |
                  +--------------+
                         |
+--------------+   +------------+   +----------+
| dim_customer |---| fact_sales |---| dim_date |
+--------------+   +------------+   +----------+
                         |
                  +--------------+
                  |  dim_region  |
                  +--------------+

8. Example of a Star Schema Design

Fact Table:

CREATE TABLE fact_sales (
sale_id SERIAL PRIMARY KEY,
product_id INT,
customer_id INT,
date_id INT,
region_id INT,
quantity INT,
total_amount NUMERIC
);

Dimension Table:

CREATE TABLE dim_product (
product_id INT PRIMARY KEY,
product_name TEXT,
category TEXT
);

9. The Snowflake Schema Architecture

In a snowflake schema, dimensions are further normalized into sub-dimensions:

+--------------+
| dim_category |
+--------------+
       |
+--------------+
| dim_product  |
+--------------+
       |
+--------------+
|  fact_sales  |
+--------------+

Product links to category via a foreign key instead of storing the category name directly.
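
In SQL, the snowflaked version of the product dimension from Section 8 might look like this (a sketch):

CREATE TABLE dim_category (
    category_id INT PRIMARY KEY,
    category_name TEXT
);

CREATE TABLE dim_product (
    product_id INT PRIMARY KEY,
    product_name TEXT,
    category_id INT REFERENCES dim_category(category_id)
);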


10. Star vs Snowflake Schema Comparison

Feature     | Star Schema           | Snowflake Schema
Design      | Denormalized          | Normalized
Query speed | Faster (fewer joins)  | Slower (more joins)
Storage     | Larger                | Smaller
Maintenance | Easier to understand  | More scalable, harder to read
Best for    | Performance, BI tools | Complex hierarchies

11. Denormalization and Performance Trade-offs

Denormalizing dimensions (as in star schema) can:

  • Reduce join complexity
  • Speed up reads
  • Slightly increase storage

Snowflake reduces redundancy but adds complexity and join cost.


12. Creating Fact and Dimension Tables in SQL

Example: Creating both dimensions and fact

CREATE TABLE dim_customer (
customer_id INT PRIMARY KEY,
name TEXT,
email TEXT,
signup_date DATE
);

CREATE TABLE fact_orders (
order_id INT PRIMARY KEY,
customer_id INT,
product_id INT,
date_id INT,
total_amount NUMERIC,
FOREIGN KEY (customer_id) REFERENCES dim_customer(customer_id)
);

13. Surrogate Keys and Their Role

Use surrogate keys (auto-incremented IDs) for dimensions rather than natural keys like email or name (a minimal sketch follows this list):

  • Ensure uniqueness
  • Enable SCDs (Slowly Changing Dimensions)
  • Improve indexing and joins
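
A sketch of this pattern, as a variant of the dim_customer table from Section 12 that adds a surrogate key alongside the natural key:

CREATE TABLE dim_customer (
    customer_key SERIAL PRIMARY KEY,  -- surrogate key referenced by fact tables
    customer_id INT NOT NULL,         -- natural/business key from the source system
    name TEXT,
    email TEXT,
    signup_date DATE
);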

14. Slowly Changing Dimensions (SCD Types)

Type | Description
SCD1 | Overwrite the old value
SCD2 | Add a new row with new version/timestamp
SCD3 | Keep only a limited history (e.g., last two values)

Example for SCD Type 2:

INSERT INTO dim_customer_history (customer_id, name, valid_from, valid_to)
VALUES (1, 'Alice', CURRENT_DATE, NULL);
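
When a customer's attributes later change, the usual SCD2 pattern is to close the open row and insert a new version (a sketch; the changed name is illustrative):

-- Expire the currently active row
UPDATE dim_customer_history
SET valid_to = CURRENT_DATE
WHERE customer_id = 1 AND valid_to IS NULL;

-- Insert the new version
INSERT INTO dim_customer_history (customer_id, name, valid_from, valid_to)
VALUES (1, 'Alice Smith', CURRENT_DATE, NULL);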

15. Handling Time Dimensions

Time is one of the most common dimensions in warehousing:

CREATE TABLE dim_date (
date_id INT PRIMARY KEY,
full_date DATE,
year INT,
quarter INT,
month INT,
week INT,
day_of_week TEXT
);

Pre-generate a date dimension to allow fast filtering and grouping.
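
For example, PostgreSQL's generate_series can populate it for a range of years (a sketch; adjust the date range and columns to your dim_date definition):

INSERT INTO dim_date (date_id, full_date, year, quarter, month, week, day_of_week)
SELECT
    TO_CHAR(d, 'YYYYMMDD')::INT,
    d::DATE,
    EXTRACT(YEAR FROM d)::INT,
    EXTRACT(QUARTER FROM d)::INT,
    EXTRACT(MONTH FROM d)::INT,
    EXTRACT(WEEK FROM d)::INT,
    TRIM(TO_CHAR(d, 'Day'))
FROM GENERATE_SERIES(DATE '2020-01-01', DATE '2026-12-31', INTERVAL '1 day') AS d;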


16. Using Views to Model Analytical Layers

Use SQL views to build abstractions:

CREATE VIEW sales_summary AS
SELECT
    d.year,
    c.name AS customer_name,
    p.category,
    SUM(f.total_amount) AS total_spent
FROM fact_sales f
JOIN dim_date d ON f.date_id = d.date_id
JOIN dim_customer c ON f.customer_id = c.customer_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY d.year, c.name, p.category;

17. Querying Star Schemas for BI and Reporting

SELECT
    c.region,
    p.category,
    d.month,
    SUM(f.total_amount) AS monthly_sales
FROM fact_sales f
JOIN dim_customer c ON f.customer_id = c.customer_id
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_date d ON f.date_id = d.date_id
GROUP BY c.region, p.category, d.month
ORDER BY d.month;

18. Best Practices for Data Warehouse Schema Design

  • Use surrogate keys for dimension relationships
  • Prefer star schema for simplicity and speed
  • Design a calendar dimension
  • Avoid circular joins or many-to-many between dimensions
  • Pre-aggregate for performance (e.g., monthly rollups)
  • Document all fact and dimension tables

19. Real-World Warehouse Scenario: E-Commerce Analytics

Use Case: Track sales, customers, product performance.

Table        | Type      | Description
fact_sales   | Fact      | Orders placed with date, customer, product
dim_product  | Dimension | Product details (name, category)
dim_customer | Dimension | Customer demographics
dim_date     | Dimension | Day, week, month breakdown

20. Summary and What’s Next

Designing proper Star and Snowflake schemas is critical for building high-performance, scalable data warehouses. Understanding how to model facts, dimensions, and time can greatly impact the effectiveness of your reporting and BI efforts.