Table of Contents
- Introduction
- What Is a Data Warehouse?
- OLTP vs OLAP: The Need for Warehousing
- Key Components of a Data Warehouse
- What Is Dimensional Modeling?
- Facts and Dimensions Explained
- The Star Schema Architecture
- Example of a Star Schema Design
- The Snowflake Schema Architecture
- Star vs Snowflake Schema Comparison
- Denormalization and Performance Trade-offs
- Creating Fact and Dimension Tables in SQL
- Surrogate Keys and Their Role
- Slowly Changing Dimensions (SCD Types)
- Handling Time Dimensions
- Using Views to Model Analytical Layers
- Querying Star Schemas for BI and Reporting
- Best Practices for Data Warehouse Schema Design
- Real-World Warehouse Scenario: E-Commerce Analytics
- Summary and What’s Next
1. Introduction
Data warehousing is the foundation of modern analytics, BI dashboards, and reporting systems. Understanding how to design and query data warehouse schemas—especially Star and Snowflake schemas—is key to building efficient, scalable analytics systems with SQL.
2. What Is a Data Warehouse?
A data warehouse is a centralized repository designed for analytical processing (OLAP) rather than transactional operations. It integrates data from multiple sources, cleans and structures it, and makes it query-ready for decision-making.
3. OLTP vs OLAP: The Need for Warehousing
Feature | OLTP (e.g., MySQL) | OLAP (e.g., Snowflake, BigQuery) |
---|---|---|
Purpose | Transaction processing | Analytical processing |
Normalization | Highly normalized | Denormalized or star schemas |
Queries | Simple, short-lived | Complex, large, aggregate queries |
Write-intensive | Yes | No (mostly read-only) |
4. Key Components of a Data Warehouse
- Fact tables: Store metrics (e.g., sales, orders)
- Dimension tables: Store attributes (e.g., customer info, product details)
- ETL/ELT pipelines: Load and transform data
- Schema design: Star or Snowflake
- Views and aggregates: Support analysis and reporting
5. What Is Dimensional Modeling?
Dimensional modeling is a design technique for warehouses where data is organized around:
- Facts (measurable events)
- Dimensions (descriptive attributes)
The goal is to create schemas that are intuitive and fast to query.
6. Facts and Dimensions Explained
Component | Description | Example |
---|---|---|
Fact | Numeric, quantitative values | sales_amount, units_sold |
Dimension | Descriptive attributes about facts | product_name, customer_region |
Facts link to dimensions using foreign keys.
7. The Star Schema Architecture
In a star schema, a central fact table connects directly to surrounding denormalized dimension tables.
+-------------+
| dim_product |
+-------------+
|
+-------------+ +-------------+ +-------------+
| dim_customer|---| fact_sales |---| dim_date |
+-------------+ +-------------+ +-------------+
|
+-------------+
| dim_region |
+-------------+
8. Example of a Star Schema Design
Fact Table:
CREATE TABLE fact_sales (
sale_id SERIAL PRIMARY KEY,
product_id INT,
customer_id INT,
date_id INT,
region_id INT,
quantity INT,
total_amount NUMERIC
);
Dimension Table:
CREATE TABLE dim_product (
product_id INT PRIMARY KEY,
product_name TEXT,
category TEXT
);
9. The Snowflake Schema Architecture
In a snowflake schema, dimensions are further normalized into sub-dimensions:
+-------------+
| dim_category|
+-------------+
|
+-------------+
| dim_product |
+-------------+
|
+-------------+
| fact_sales |
+-------------+
Product links to category via a foreign key instead of storing the category name directly.
10. Star vs Snowflake Schema Comparison
Feature | Star Schema | Snowflake Schema |
---|---|---|
Design | Denormalized | Normalized |
Query speed | Faster (fewer joins) | Slower (more joins) |
Storage | Larger | Smaller |
Maintenance | Easier to understand | More scalable, harder to read |
Best for | Performance, BI tools | Complex hierarchies |
11. Denormalization and Performance Trade-offs
Denormalizing dimensions (as in star schema) can:
- Reduce join complexity
- Speed up reads
- Slightly increase storage
Snowflake reduces redundancy but adds complexity and join cost.
12. Creating Fact and Dimension Tables in SQL
Example: Creating both dimensions and fact
CREATE TABLE dim_customer (
customer_id INT PRIMARY KEY,
name TEXT,
email TEXT,
signup_date DATE
);
CREATE TABLE fact_orders (
order_id INT PRIMARY KEY,
customer_id INT,
product_id INT,
date_id INT,
total_amount NUMERIC,
FOREIGN KEY (customer_id) REFERENCES dim_customer(customer_id)
);
13. Surrogate Keys and Their Role
Use surrogate keys (auto-incremented IDs) for dimensions rather than natural keys like email or name:
- Ensure uniqueness
- Enable SCDs (Slowly Changing Dimensions)
- Improve indexing and joins
14. Slowly Changing Dimensions (SCD Types)
Type | Description |
---|---|
SCD1 | Overwrite the old value |
SCD2 | Add a new row with new version/timestamp |
SCD3 | Keep only a limited history (e.g., last two values) |
Example for SCD Type 2:
INSERT INTO dim_customer_history (customer_id, name, valid_from, valid_to)
VALUES (1, 'Alice', CURRENT_DATE, NULL);
15. Handling Time Dimensions
Time is one of the most common dimensions in warehousing:
CREATE TABLE dim_date (
date_id INT PRIMARY KEY,
full_date DATE,
year INT,
quarter INT,
month INT,
week INT,
day_of_week TEXT
);
Pre-generate a date dimension to allow fast filtering and grouping.
16. Using Views to Model Analytical Layers
Use SQL views to build abstractions:
CREATE VIEW sales_summary AS
SELECT
d.year,
c.name AS customer_name,
p.category,
SUM(f.total_amount) AS total_spent
FROM fact_sales f
JOIN dim_date d ON f.date_id = d.date_id
JOIN dim_customer c ON f.customer_id = c.customer_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY d.year, c.name, p.category;
17. Querying Star Schemas for BI and Reporting
SELECT
c.region,
p.category,
d.month,
SUM(f.total_amount) AS monthly_sales
FROM fact_sales f
JOIN dim_customer c ON f.customer_id = c.customer_id
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_date d ON f.date_id = d.date_id
GROUP BY c.region, p.category, d.month
ORDER BY d.month;
18. Best Practices for Data Warehouse Schema Design
- Use surrogate keys for dimension relationships
- Prefer star schema for simplicity and speed
- Design a calendar dimension
- Avoid circular joins or many-to-many between dimensions
- Pre-aggregate for performance (e.g., monthly rollups)
- Document all fact and dimension tables
19. Real-World Warehouse Scenario: E-Commerce Analytics
Use Case: Track sales, customers, product performance.
Table | Type | Description |
---|---|---|
fact_sales | Fact | Orders placed with date, customer, product |
dim_product | Dimension | Product details (name, category) |
dim_customer | Dimension | Customer demographics |
dim_date | Dimension | Day, week, month breakdown |
20. Summary and What’s Next
Designing proper Star and Snowflake schemas is critical for building high-performance, scalable data warehouses. Understanding how to model facts, dimensions, and time can greatly impact the effectiveness of your reporting and BI efforts.