Start Here — Before Anything Else
IT Technology Encyclopedia
Before you dive into code and interviews, understand every technology from scratch. What is it? Why does it exist? Who built it? Which companies use it? Who are its competitors? Click any technology below to get the full picture — explained like you are learning it for the very first time.
All Technologies
Data & Storage
Cloud Platforms
Programming & Code
DevOps & CICD
Project Management
Streaming & Messaging
BI & Visualization
FormulaHub — Built to Get You Hired
Every Role. Every Topic.
One Platform.
Designed for consultants in the USA who need to demonstrate 6 years of real IT experience. Study every module, learn to explain every concept, practice every scenario. After finishing FormulaHub you can walk into any Data Warehouse Developer, Snowflake Developer, Tableau Developer, ETL Developer, or IT Consultant interview and own it. This is your foundation — real projects and eBooks will take you further.
20
Modules
50+
Q&A Ready
12
Roles Covered
110+
Skills Tracked
Roles This Platform Prepares You For
Click any card to jump to the relevant module
🗄
Data Analyst / SQL
Queries, joins, RANK, window functions, CTEs
Core
⭐
Data Warehouse Developer
Star schema, SCD, fact/dim design, layers
Core
❄
Snowflake Developer
COPY, stages, Time Travel, streams, tasks
Core
📊
Tableau Developer
LOD, calculated fields, RLS, performance
Core
🔄
ETL / ELT Developer
Airflow, dbt, Python pipelines, CDC, batch
Core
🐍
Data Engineer
Python, Spark, Kafka, cloud, APIs, pipelines
Core
📋
JIRA / ServiceNow
Scrum, sprints, change requests, ITSM
Every Role
☁
Cloud Data Engineer
AWS/GCP/Azure, S3, Lambda, IAM, VPC
Core
📦
Storage / Formats
Parquet, JSON, Avro, Delta Lake, COPY commands
Core
⚡
Streaming Engineer
Kafka, real-time events, consumer groups
Advanced
🔥
Spark / Databricks
PySpark, distributed processing, Delta Lake
Advanced
🔒
IT Security / Compliance
VPC, IAM, PII masking, HIPAA, audit logs
Important
How to use FormulaHubWork through every module in order. In each module: read the explanation, study the tables, copy and practice the code, read the interview scenarios out loud. After all modules go to End-to-End Flow to see how everything connects. Then drill Interview Q&A daily. The goal: explain any topic for 5 minutes without notes. That is when you are ready.
SQL & Databases
The #1 most tested skill — every data role interview includes SQL problems
Why SQL is non-negotiableEvery data role — analyst, engineer, warehouse developer, Tableau developer — requires SQL. Interviewers will give you a whiteboard problem. You must write correct SQL without help. The topics below are what gets asked most often.
📊
SELECT, WHERE, GROUP BY, HAVING — the foundation
Every query starts here — understand the execution order
▾
SQL execution order — not the order you write itFROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. This is why you cannot filter on a SELECT alias in WHERE — WHERE runs before SELECT evaluates aliases.
-- Real scenario: Which regions had over $50k revenue last 30 days?
-- Table: sales(id, rep_name, region, product, amount, sale_date, status)
SELECT
region,
COUNT(*) AS total_orders,
SUM(amount) AS total_revenue,
ROUND(AVG(amount), 2) AS avg_order_value,
MAX(amount) AS largest_order
FROM sales
WHERE sale_date >= CURRENT_DATE - INTERVAL '30 days'
AND status = 'COMPLETED'
GROUP BY region
HAVING SUM(amount) > 50000
ORDER BY total_revenue DESC;
WHERE vs HAVING — the ruleWHERE filters rows BEFORE grouping. HAVING filters groups AFTER aggregation. If your filter uses SUM, COUNT, AVG, MAX, or MIN — it must go in HAVING. Raw column comparisons go in WHERE. You can use both in the same query.
🏆
RANK vs DENSE_RANK vs ROW_NUMBER — Know This Cold
Asked in 90% of data interviews — the answer must be instant
▾
Interviewers ask this every single time — memorize the difference
| Function | Handles Ties? | Gap After Tie? | Always Unique? | Use When |
|---|---|---|---|---|
| ROW_NUMBER() | No — arbitrary order for ties | N/A | YES — always unique | Deduplication, pagination, pick one row per group |
| RANK() | YES — same rank for ties | YES — skips next (1,1,3) | No | Sports rankings — no 2nd place if tied for 1st |
| DENSE_RANK() | YES — same rank for ties | NO — no gap (1,1,2) | No | Business leaderboards — clients expect 1,2,3 |
-- Scenario: Top 3 sales reps per region
-- Alice=$10k, Bob=$10k (tied), Carol=$8k, Dave=$7k in East region
SELECT * FROM (
SELECT
rep_name, region, SUM(amount) AS total_sales,
ROW_NUMBER() OVER(PARTITION BY region ORDER BY SUM(amount) DESC) AS rn,
RANK() OVER(PARTITION BY region ORDER BY SUM(amount) DESC) AS rnk,
DENSE_RANK() OVER(PARTITION BY region ORDER BY SUM(amount) DESC) AS drnk
FROM sales
GROUP BY rep_name, region
) t
WHERE drnk <= 3;
-- East region results:
-- Alice: ROW_NUMBER=1 RANK=1 DENSE_RANK=1 ($10k)
-- Bob: ROW_NUMBER=2 RANK=1 DENSE_RANK=1 ($10k — tied)
-- Carol: ROW_NUMBER=3 RANK=3 DENSE_RANK=2 ($8k — RANK skips 2!)
-- Dave: ROW_NUMBER=4 RANK=4 DENSE_RANK=3 ($7k)
Say this in your interview"DENSE_RANK gives consecutive ranks with no gaps — if Alice and Bob tie for first, both get rank 1 and Carol gets rank 2. RANK skips — Carol would get rank 3, not 2. ROW_NUMBER always gives unique numbers even for ties. In business reporting we almost always use DENSE_RANK because clients expect 1, 2, 3 without gaps."
🔗
JOINs — All Types with Real ETL Context
The most common SQL operation — know every type cold
▾
| JOIN Type | What It Returns | Real ETL Example | NULL Behavior |
|---|---|---|---|
| INNER JOIN | Only rows matching in BOTH tables | Orders that have a valid customer record — orphan orders excluded | No NULLs from join |
| LEFT JOIN | All left rows + matching right | All customers whether or not they have orders — most common in ETL | NULLs on right side if no match |
| RIGHT JOIN | All right rows + matching left | All orders even if customer record is missing | NULLs on left side if no match |
| FULL OUTER | All rows from both tables | Full audit — find gaps on BOTH sides simultaneously | NULLs on either side |
| SELF JOIN | Table joined to itself | Employees and their managers from the same table | Depends on join type used |
-- LEFT JOIN: All customers + their orders (NULL for non-buyers)
SELECT
c.customer_id, c.name,
COUNT(o.order_id) AS total_orders,
COALESCE(SUM(o.amount), 0) AS lifetime_value
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;
-- Data quality check: find orphan orders (no matching customer)
SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL; -- these rows have no customer record
🪟
Window Functions and CTEs
Advanced SQL used daily in data engineering and analytics
▾
Window functions vs GROUP BYGROUP BY collapses rows into one per group. Window functions calculate across rows but keep every row. SUM(amount) GROUP BY region gives one row per region. SUM(amount) OVER(PARTITION BY region ORDER BY date) gives every row plus a running total per region.
-- CTE: break complex query into readable named steps
WITH
monthly AS (
SELECT
rep_name,
region,
DATE_TRUNC('month', sale_date) AS month,
SUM(amount) AS revenue
FROM sales
GROUP BY rep_name, region, DATE_TRUNC('month', sale_date)
),
enriched AS (
SELECT *,
SUM(revenue) OVER(PARTITION BY rep_name ORDER BY month) AS running_total,
LAG(revenue, 1, 0) OVER(PARTITION BY rep_name ORDER BY month) AS prev_month,
DENSE_RANK() OVER(PARTITION BY region, month ORDER BY revenue DESC) AS rank_in_region
FROM monthly
)
SELECT
*,
revenue - prev_month AS mom_change,
ROUND((revenue - prev_month) * 100.0 / NULLIF(prev_month, 0), 1) AS mom_pct_change
FROM enriched
ORDER BY rep_name, month;
🗑
Deduplication — Critical ETL Skill
Find and remove duplicates — asked in every data engineering interview
▾
-- Step 1: Find which records are duplicated
SELECT customer_id, order_date, COUNT(*) AS cnt
FROM orders
GROUP BY customer_id, order_date
HAVING COUNT(*) > 1
ORDER BY cnt DESC;
-- Step 2: Keep only the latest row per duplicate group (ROW_NUMBER method)
WITH deduped AS (
SELECT *,
ROW_NUMBER() OVER(
PARTITION BY customer_id, order_date -- define what "duplicate" means
ORDER BY created_at DESC -- keep the most recent
) AS rn
FROM orders
)
SELECT * FROM deduped WHERE rn = 1;
-- Snowflake: create clean table directly
CREATE TABLE orders_clean AS
SELECT * EXCLUDE(rn) FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY load_ts DESC) rn
FROM orders_raw
) WHERE rn = 1;
Data Warehouse Developer
Star Schema, SCD, fact/dimension design, warehouse layers — the core of DWH developer roles
⭐
Star Schema — Design From Scratch
The most used data model in every analytics project worldwide
▾
Star Schema in one sentenceA central FACT table (what happened — transactions, clicks, orders) surrounded by DIMENSION tables (who, what, where, when). Every dimension joins to the fact via a foreign key. Called "star" because the diagram looks like one. Every modern data warehouse uses this model.
Star Schema — Retail Sales Example
dim_customer
customer_key PK
name, email
segment, region
customer_key PK
name, email
segment, region
↗
FACT_SALES
sale_key PK
customer_key FK
product_key FK
date_key FK
store_key FK
amount, quantity
discount, margin
sale_key PK
customer_key FK
product_key FK
date_key FK
store_key FK
amount, quantity
discount, margin
↖
dim_product
product_key PK
name, category
brand, price_tier
product_key PK
name, category
brand, price_tier
dim_date
date_key PK
year, quarter, month
is_holiday, is_weekend
date_key PK
year, quarter, month
is_holiday, is_weekend
↑
dim_store
store_key PK
city, state
region, store_type
store_key PK
city, state
region, store_type
| Concept | Fact Table | Dimension Table |
|---|---|---|
| Purpose | What happened — measurements and events | Context — who, what, where, when |
| Content | Foreign keys to dims + numeric measures | Descriptive text attributes |
| Size | Billions of rows — grows constantly | Thousands to millions — grows slowly |
| Additive? | YES — you can SUM amounts across any dimension | No — you describe and filter, not add |
| Example | sale_key, customer_key, amount, quantity | customer_key, name, city, segment |
🔄
Slowly Changing Dimensions (SCD) — Types 1, 2, 3
What happens when dimension data changes — asked in every DWH interview
▾
SCD is asked in almost every Data Warehouse Developer interview — know all types
| SCD Type | Strategy | Keeps History? | When to Use | Real Example |
|---|---|---|---|---|
| Type 0 | Never update — keep original value forever | N/A — immutable | Immutable data | Date of birth, original signup date |
| Type 1 | Overwrite old value with new value | NO — history lost | Typo corrections, history irrelevant | Fix customer name spelling error |
| Type 2 | Add new row with start/end dates + is_current flag | FULL history preserved | History matters — most common in real DWH | Customer moves from NYC to Chicago |
| Type 3 | Add column for previous value | One version back only | Only current and immediate prior matter | Track current and previous region |
-- SCD Type 2: Customer C001 moved from New York to Chicago
-- Step 1: Expire the current record
UPDATE dim_customer
SET
end_date = CURRENT_DATE - 1,
is_current = FALSE
WHERE customer_id = 'C001'
AND is_current = TRUE;
-- Step 2: Insert new record with the new city
INSERT INTO dim_customer
(customer_key, customer_id, name, city, state, start_date, end_date, is_current)
VALUES
(5001, 'C001', 'Alice Smith', 'Chicago', 'IL',
CURRENT_DATE, '9999-12-31', TRUE);
-- Result: old NYC record is expired. New Chicago record is active.
-- Fact rows from before today still JOIN to the NYC record.
-- New fact rows JOIN to the Chicago record.
-- Historical accuracy is perfect.
Interview answer for SCD
When asked "which SCD type did you use?" always say Type 2. Explain: "We used SCD Type 2 on the customer dimension because customers frequently changed their region assignment and our regional managers needed accurate historical reports. A sale made in January when the customer was in the East region had to remain associated with the East region even after the customer moved to the West region in March. Type 2 preserved that history perfectly."
🏗
4-Layer Warehouse Architecture
Raw → Staging → Core → Mart — the standard modern pattern used everywhere
▾
Data Platform Architecture — All Layers
SOURCE SYSTEMS
Salesforce, SAP
APIs, Databases
Salesforce, SAP
APIs, Databases
→
DATA LAKE
S3 / GCS / Blob
Raw Parquet files
S3 / GCS / Blob
Raw Parquet files
→
RAW SCHEMA
Snowflake
1:1 typed copy
Snowflake
1:1 typed copy
→
CORE SCHEMA
Star Schema
Fact + SCD2 Dims
Star Schema
Fact + SCD2 Dims
→
DATA MART
Sales, Finance
pre-aggregated
Sales, Finance
pre-aggregated
→
CONSUME
Tableau, APIs
Reports
Tableau, APIs
Reports
| Layer | What Goes Here | Transform Level | Who Accesses |
|---|---|---|---|
| Data Lake (S3) | Files exactly as received from source systems | Zero — never touch raw | Data engineers only |
| Raw / Staging | Typed, validated, deduplicated copy in Snowflake | Basic — typing, nulls, dedup | Data engineers |
| Core / DWH | Star schema — fact tables and SCD dimensions | Full business logic | Engineers, analysts |
| Data Mart | Pre-aggregated, team-specific views | Aggregation, formatting | Business users, Tableau |
Real project narrative — say this in your interview
"In our retail analytics project, data from 8 source systems landed in S3 as Parquet files daily via Airflow. Snowflake COPY INTO loaded them to the RAW schema unchanged — a 1-to-1 typed copy. dbt staging models cleaned and validated the data in the STAGING schema — null handling, deduplication, standardizing date formats. dbt core models built the Star Schema in the CORE schema — 3 fact tables and 6 dimension tables including SCD Type 2 on customer and product. The MART schema had pre-aggregated views for the Sales team and Finance team which Tableau connected to directly."
Snowflake Developer
Virtual warehouses, COPY INTO, Time Travel, Streams, Tasks, cloning — every Snowflake developer topic
❄
Architecture — Why Snowflake Is Different
Storage and compute are completely separate — the core innovation
▾
The Snowflake innovation that changed data warehousingTraditional databases tie storage and compute together — you cannot scale one without the other, and all users compete for the same resources. Snowflake separates them completely. Storage lives in S3-based compressed columnar format — you pay per TB stored. Compute is Virtual Warehouses — independent clusters you spin up and down in seconds, paying only per second of usage. Multiple teams query the SAME data with their OWN warehouses without competing.
-- Create separate warehouses for ETL and reporting (they don't compete)
CREATE OR REPLACE WAREHOUSE etl_wh
WAREHOUSE_SIZE = 'MEDIUM'
MAX_CLUSTER_COUNT = 3 -- scale out for concurrent ETL
AUTO_SUSPEND = 300 -- suspend after 5 min idle — save money!
AUTO_RESUME = TRUE -- start automatically when query arrives
INITIALLY_SUSPENDED = TRUE;
-- Scale up for heavy load, then back down
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE';
-- ... run the heavy ETL job ...
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'SMALL';
-- Check credit consumption per warehouse
SELECT warehouse_name, credits_used, start_time
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP)
ORDER BY credits_used DESC;
📥
Stages and COPY INTO — Loading Data from S3
The primary way data enters Snowflake — every project uses this
▾
-- Step 1: Create storage integration (IAM role — no keys in code!)
CREATE OR REPLACE STORAGE INTEGRATION s3_integration
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456:role/SnowflakeRole'
STORAGE_ALLOWED_LOCATIONS = ('s3://my-company-data-bucket/');
-- Step 2: Create external stage pointing to S3
CREATE OR REPLACE STAGE s3_sales_stage
URL = 's3://my-company-data-bucket/sales/'
STORAGE_INTEGRATION = s3_integration
FILE_FORMAT = (TYPE = 'PARQUET');
-- Step 3: COPY INTO — bulk load from S3 to Snowflake table
COPY INTO raw.sales_staging
FROM @s3_sales_stage
PATTERN = '.*sales_2024_.*\.parquet' -- only files matching this regex
ON_ERROR = 'CONTINUE' -- skip bad rows, keep loading
PURGE = FALSE -- keep source files in S3
FORCE = FALSE; -- don't re-load already loaded files
-- Check what was loaded and if any errors occurred
SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
TABLE_NAME => 'SALES_STAGING',
START_TIME => DATEADD('hours', -24, CURRENT_TIMESTAMP)
));
🕐
Time Travel and Zero-Copy Cloning
Features with no equivalent in traditional databases — know these cold
▾
-- TIME TRAVEL: query data as it was at any past point (up to 90 days)
-- See table as it was 1 hour ago
SELECT * FROM sales AT(OFFSET => -3600);
-- See table at exact timestamp
SELECT * FROM sales AT(TIMESTAMP => '2024-01-15 09:00:00'::TIMESTAMP);
-- Recover accidentally deleted rows
INSERT INTO sales
SELECT * FROM sales AT(OFFSET => -3600)
WHERE sale_id NOT IN (SELECT sale_id FROM sales);
-- ZERO-COPY CLONING: instant copy sharing same storage (no data copied!)
-- Clone entire production database for dev team in seconds
CREATE DATABASE dev_db CLONE prod_db;
-- Clone table before a risky migration
CREATE TABLE sales_backup CLONE sales;
-- Clone table from yesterday for comparison
CREATE TABLE sales_yesterday CLONE sales AT(OFFSET => -86400);
Interview talking point"We use Time Travel on every Snowflake project. When an analyst accidentally ran a wrong DELETE, we recovered all rows in under 5 minutes using Time Travel — no backup restore, zero downtime. We clone production to development using CREATE DATABASE dev CLONE prod — developers get the full dataset instantly with zero storage cost for unchanged data."
🔄
Streams and Tasks — Near Real-Time ELT
CDC inside Snowflake without needing Kafka or external tools
▾
Stream + Task = lightweight CDC inside SnowflakeA Stream tracks every INSERT, UPDATE, DELETE on a table since the last time you read it. A Task runs a SQL statement on a schedule. Together: Task runs every 5 minutes, checks if the stream has new records, and if yes runs a MERGE to update the fact table. Near real-time ELT without any external tools.
-- Create Stream to capture all changes on staging table
CREATE OR REPLACE STREAM sales_changes ON TABLE raw.sales_staging;
-- Create Task to process the stream every 5 minutes
CREATE OR REPLACE TASK process_sales_changes
WAREHOUSE = etl_wh
SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('sales_changes') -- only run if data exists
AS
MERGE INTO core.fact_sales tgt
USING (
SELECT sale_id, customer_id, amount, sale_date
FROM sales_changes
WHERE METADATA$ACTION = 'INSERT'
) src ON tgt.sale_id = src.sale_id
WHEN MATCHED THEN
UPDATE SET tgt.amount = src.amount, tgt.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (sale_id, customer_id, amount, sale_date)
VALUES (src.sale_id, src.customer_id, src.amount, src.sale_date);
-- Start the task (tasks are created suspended by default)
ALTER TASK process_sales_changes RESUME;
-- Monitor task execution history
SELECT name, state, scheduled_time, completed_time, error_message
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
ORDER BY scheduled_time DESC;
Tableau & Business Intelligence
LOD expressions, calculated fields, RLS, performance optimization, architecture — for Tableau developer roles
🏗
Tableau Product Family — What Each Does
Desktop, Server, Cloud, Prep, Bridge — know what each one is
▾
| Product | Purpose | Who Uses It | Key Fact |
|---|---|---|---|
| Tableau Desktop | Create and design workbooks and dashboards | Tableau developers, data analysts | The main design tool — connects to any data source |
| Tableau Server | On-premise publishing, sharing, governance | Enterprise companies with on-prem infrastructure | You manage the servers yourself |
| Tableau Cloud | SaaS version of Server — hosted by Tableau | Cloud-first companies — most new projects | No server management needed |
| Tableau Prep | Visual ETL — clean and reshape data before Tableau | Analysts needing self-service data prep | Drag-and-drop transformations, no SQL needed |
| Tableau Public | Free version — all dashboards are publicly visible | Learning, building a portfolio, journalism | Free forever — use this to practice |
Live vs Extract — the most important Tableau decisionLIVE connection: every click, filter, and hover fires a SQL query to Snowflake in real time. Always current data. Can be slow for complex dashboards. Use for: real-time operational dashboards. EXTRACT (.hyper file): Tableau copies data into its own in-memory columnar format. Very fast queries. Data is a snapshot — needs scheduled refresh (hourly, daily). Use for: large datasets where 1-hour-old data is acceptable. Rule: if business needs minute-by-minute data, use Live. If daily refresh is fine and performance matters, use Extract.
🧮
LOD Expressions — FIXED, INCLUDE, EXCLUDE
The hardest Tableau concept — master this and you immediately stand out
▾
LOD expressions are the #1 Tableau interview topic for developer roles
| LOD Type | What It Does | When to Use |
|---|---|---|
| FIXED | Compute at your specified dimension regardless of view or filters | Customer lifetime value as denominator regardless of product filter |
| INCLUDE | Add a finer-grain dimension to the aggregation | Average order count per customer — compute per customer then average in the view |
| EXCLUDE | Remove a dimension from the aggregation | Show regional total while viewing by individual city |
// FIXED: Customer total revenue — ignores product filter in the view
Customer Total = {FIXED [Customer ID] : SUM([Revenue])}
// FIXED: Product revenue as % of that customer's total spend
Pct of Customer Spend = SUM([Revenue]) / {FIXED [Customer ID] : SUM([Revenue])}
// FIXED: First ever purchase date per customer
First Purchase Date = {FIXED [Customer ID] : MIN([Order Date])}
// INCLUDE: avg order count per customer, then average that in the view
Avg Orders Per Customer = {INCLUDE [Customer ID] : COUNT([Order ID])}
// EXCLUDE: regional total shown even when view is drilled down to city
Region Total Revenue = {EXCLUDE [City] : SUM([Revenue])}
// Running total (Table Calculation — not LOD)
Running Revenue Total = RUNNING_SUM(SUM([Revenue]))
// Year over Year growth %
YoY Growth = (SUM([Revenue]) - LOOKUP(SUM([Revenue]), -1))
/ ABS(LOOKUP(SUM([Revenue]), -1))
// Conditional performance flag (for color coding)
Performance Status =
IF SUM([Revenue]) >= [Target] THEN "On Track"
ELSEIF SUM([Revenue]) >= [Target]*0.8 THEN "At Risk"
ELSE "Behind"
END
LOD interview scenario — say this
"The business wanted to see each product's revenue AND what percentage of that customer's total spend it represented — not percentage of all products visible. Without FIXED LOD, TOTAL() would give percent of all products in the view. I used {FIXED [Customer ID] : SUM([Revenue])} to compute each customer's full total independently. Then divided product revenue by that fixed value. Even when a manager filters to just one product category, the denominator stays correct as the customer's full lifetime spend."
⚡
Dashboard Performance Optimization
5 techniques that separate average from senior Tableau developers
▾
| Problem | Solution | Impact |
|---|---|---|
| Dashboard loads slowly | Switch from Live to Extract — eliminate network round trips | Often 5-10x faster |
| Extract too large | Pre-aggregate in Snowflake — let the warehouse do the heavy lifting | Smaller extract, faster load |
| Too many sheets on one dashboard | Each sheet = one query. Reduce to 6-8 maximum. Combine related metrics | Linear improvement per sheet removed |
| Slow COUNT DISTINCT | Compute COUNTD in Snowflake view, expose as a simple SUM in Tableau | COUNTD is very expensive in Tableau |
| Filters scan everything | Add Context Filters — they run first and reduce the data set before other filters | Can reduce query data by 90% |
| String calculations everywhere | Move all string manipulation to Snowflake — UPPER, TRIM, CONCAT are slow in Tableau | Keeps Tableau doing analytics, not cleaning |
🔒
Row Level Security in Tableau
Every enterprise Tableau project needs this — managers see only their data
▾
Best practice: implement RLS in Snowflake, not TableauIf you implement RLS only in Tableau, someone can bypass it by connecting directly to Snowflake. Implement it at the Snowflake level so security is enforced regardless of which tool queries the data.
-- Snowflake Row Access Policy — attach to the fact table
-- Creates a mapping table: which user can see which region
CREATE TABLE user_region_access (
username VARCHAR,
region VARCHAR
);
INSERT INTO user_region_access VALUES
('[email protected]', 'East'),
('[email protected]', 'West'),
('[email protected]', 'ALL');
-- Create the Row Access Policy
CREATE OR REPLACE ROW ACCESS POLICY region_rls
AS (row_region VARCHAR) RETURNS BOOLEAN ->
EXISTS (
SELECT 1 FROM user_region_access
WHERE username = CURRENT_USER()
AND (region = row_region OR region = 'ALL')
);
-- Apply policy to the fact table
ALTER TABLE fact_sales
ADD ROW ACCESS POLICY region_rls ON (region);
-- Now when John queries, Snowflake automatically filters to East only.
-- Tableau gets filtered results — security works for every tool.
ETL & ELT Development
Batch pipelines, Airflow, dbt, parallel execution, error handling — the core of every data engineering role
🔄
ETL vs ELT — What Changed and Why
Every interviewer asks this — have a confident, clear answer ready
▾
ETL — Traditional
Flow: Source → Transform on ETL server → Load clean data
Tools: Informatica, SSIS, Talend, DataStage, Ab Initio
When to use: on-premise warehouse with limited compute
Example: SAP → Informatica cleans → Oracle Data Warehouse
Tools: Informatica, SSIS, Talend, DataStage, Ab Initio
When to use: on-premise warehouse with limited compute
Example: SAP → Informatica cleans → Oracle Data Warehouse
ELT — Modern Cloud
Flow: Source → Load raw to warehouse → Transform inside warehouse
Tools: Fivetran + dbt, Airbyte + dbt, Python + dbt
When to use: Snowflake or BigQuery — cheap scalable compute
Example: Salesforce → Fivetran → Snowflake RAW → dbt → CORE
Tools: Fivetran + dbt, Airbyte + dbt, Python + dbt
When to use: Snowflake or BigQuery — cheap scalable compute
Example: Salesforce → Fivetran → Snowflake RAW → dbt → CORE
Why ELT won in the cloud eraCloud warehouses (Snowflake, BigQuery) have massive on-demand compute at low cost. Loading raw first means you never lose data — if transformation logic changes, re-run it against the preserved raw data. dbt adds version control, automated testing, documentation, and lineage to SQL transformations — things traditional ETL tools never had. Raw layer is your safety net.
⏱
Apache Airflow — Orchestrating Pipelines
DAGs, tasks, parallel execution, retry, alerting — production Airflow
▾
Airflow mental modelAirflow is a smart scheduler for data pipelines. You write a DAG (Python file) that defines tasks and their dependencies. Airflow does NOT process data itself — it tells other systems (Snowflake, Python scripts, dbt, S3) when and how to run. It tracks every run, retries failures automatically, and sends alerts on error.
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from datetime import timedelta
default_args = {
'retries': 3,
'retry_delay': timedelta(minutes=5),
'email_on_failure': True,
'email': ['[email protected]'],
}
with DAG(
dag_id='daily_sales_pipeline',
default_args=default_args,
schedule_interval='0 6 * * *', # 6 AM every day
catchup=False,
) as dag:
# These two extract tasks run IN PARALLEL
extract_sf = PythonOperator(task_id='extract_salesforce', python_callable=pull_salesforce)
extract_sap = PythonOperator(task_id='extract_sap', python_callable=pull_sap_erp)
load = SnowflakeOperator(
task_id='copy_to_snowflake',
sql='COPY INTO raw.sales_stage FROM @my_stage',
snowflake_conn_id='snowflake_default'
)
dbt_run = PythonOperator(task_id='dbt_transform', python_callable=run_dbt_models)
dq_check = PythonOperator(task_id='data_quality', python_callable=run_dq_checks)
notify = PythonOperator(task_id='notify', python_callable=send_slack_alert)
# Parallel extracts -> load -> dbt -> DQ check -> notify
[extract_sf, extract_sap] >> load >> dbt_run >> dq_check >> notify
🏗
dbt — The Modern ELT Standard
SQL transformations with testing, documentation, and lineage built in
▾
What is dbt?dbt is a transformation framework where you write SQL SELECT statements and dbt materializes them as tables or views in Snowflake. It adds: version control (SQL in Git), automated testing (not_null, unique, referential integrity), documentation with lineage graphs, modular reuse with ref(), incremental loads, and multi-environment support (dev/staging/prod). If you know dbt you look senior immediately.
-- models/staging/stg_salesforce_opportunities.sql
{{ config(materialized='view') }}
SELECT
opportunity_id,
account_id,
TRIM(UPPER(opportunity_name)) AS name,
amount::FLOAT AS amount,
close_date::DATE AS close_date,
CASE stage_name
WHEN 'Closed Won' THEN 'Won'
WHEN 'Closed Lost' THEN 'Lost'
ELSE 'Open'
END AS status
FROM {{ source('salesforce', 'opportunity') }} -- ref() to raw source
WHERE is_deleted = FALSE;
-- models/core/fact_sales.sql
{{ config(materialized='incremental', unique_key='sale_id') }}
SELECT
o.opportunity_id AS sale_id,
c.customer_key,
d.date_key,
o.amount
FROM {{ ref('stg_salesforce_opportunities') }} o -- dbt tracks this dependency
JOIN {{ ref('dim_customer') }} c ON o.account_id = c.account_id
JOIN {{ ref('dim_date') }} d ON o.close_date = d.full_date
{% if is_incremental() %}
WHERE o.close_date > (SELECT MAX(close_date) FROM {{ this }})
{% endif %};
-- schema.yml — automated tests
models:
- name: fact_sales
columns:
- name: sale_id
tests: [unique, not_null]
- name: customer_key
tests:
- relationships:
to: ref('dim_customer')
field: customer_key
Python for Data Engineering
File I/O, pandas, ETL scripts, API calls, S3, error handling — everything needed for data roles
📁
Reading and Writing Every File Format
CSV, JSON, Parquet, Excel, S3, Snowflake — daily work
▾
import pandas as pd
import json, boto3, sqlalchemy
# Read CSV with type hints to avoid surprises
df = pd.read_csv('sales.csv', dtype={'customer_id': str}, parse_dates=['sale_date'])
# Read nested JSON API response and flatten it
with open('api_response.json') as f:
data = json.load(f)
df = pd.json_normalize(data['records'], sep='_') # flattens nested keys
# Read Parquet (fast, compressed, columnar)
df = pd.read_parquet('data.parquet', columns=['id', 'amount', 'date'])
# Read from S3 (most common in production)
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-data-bucket', Key='sales/2024/jan/data.parquet')
df = pd.read_parquet(obj['Body'])
# Transform data
df['revenue'] = df['price'] * df['quantity']
df['region'] = df['region'].str.upper().str.strip()
df_clean = df.dropna(subset=['customer_id', 'amount'])
df_filtered = df_clean[df_clean['amount'] > 0]
summary = df_filtered.groupby('region')['revenue'].sum().reset_index()
# Write Parquet to S3 with encryption
parquet_bytes = df_filtered.to_parquet(index=False)
s3.put_object(
Bucket='my-data-bucket',
Key='processed/sales_2024_jan.parquet',
Body=parquet_bytes,
ServerSideEncryption='AES256'
)
# Write to Snowflake
engine = sqlalchemy.create_engine('snowflake://user:pass@account/db/schema?warehouse=etl_wh')
df_filtered.to_sql('sales_staging', engine, if_exists='append', index=False, chunksize=10000)
🔄
Production ETL Script with Logging, Retry, and Error Handling
The real pattern used on production pipelines — not textbook code
▾
import pandas as pd, boto3, sqlalchemy, logging, sys, time
from functools import wraps
# Structured logging — shows in Airflow logs and CloudWatch
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s | %(levelname)s | %(funcName)s | %(message)s',
handlers=[logging.StreamHandler(), logging.FileHandler('pipeline.log')]
)
log = logging.getLogger(__name__)
def retry(attempts=3, wait_seconds=5):
"""Decorator: retry on failure with exponential backoff"""
def decorator(fn):
@wraps(fn)
def wrapper(*args, **kwargs):
for i in range(attempts):
try:
return fn(*args, **kwargs)
except Exception as e:
log.warning(f'Attempt {i+1} failed: {e}')
if i == attempts - 1:
raise # re-raise after last attempt
time.sleep(wait_seconds * (2 ** i)) # 5s, 10s, 20s
return wrapper
return decorator
@retry(attempts=3)
def extract(bucket, key):
log.info(f'Extracting s3://{bucket}/{key}')
obj = boto3.client('s3').get_object(Bucket=bucket, Key=key)
df = pd.read_parquet(obj['Body'])
log.info(f'Extracted {len(df):,} rows, {len(df.columns)} columns')
return df
def transform(df):
initial_count = len(df)
df = df.dropna(subset=['customer_id', 'amount'])
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
df = df[df['amount'] > 0]
df['region'] = df['region'].str.upper().str.strip()
dropped = initial_count - len(df)
log.info(f'Transform: {initial_count:,} -> {len(df):,} rows ({dropped} dropped)')
# Fail if too many rows dropped — something is wrong
assert dropped / initial_count < 0.05, f'Too many rows dropped: {dropped/initial_count:.1%}'
return df
@retry(attempts=3)
def load(df, conn_str, table):
engine = sqlalchemy.create_engine(conn_str)
df.to_sql(table, engine, if_exists='append', index=False, chunksize=5000)
log.info(f'Loaded {len(df):,} rows to {table}')
if __name__ == '__main__':
try:
raw = extract('my-data-bucket', 'sales/2024/jan/data.parquet')
clean = transform(raw)
load(clean, 'snowflake://user:pass@account/db/schema', 'raw.sales_staging')
log.info('Pipeline completed successfully')
except Exception as e:
log.error(f'Pipeline FAILED: {e}', exc_info=True)
sys.exit(1) # non-zero exit tells Airflow the task failed
APIs & REST
How modern systems communicate — every data pipeline extracts from APIs
🌐
REST API Complete Guide + Python Implementation
HTTP methods, status codes, auth, pagination, retry — production patterns
▾
| HTTP Method | Action | Idempotent? | Real Example |
|---|---|---|---|
| GET | Read data — no side effects | YES | GET /api/orders?status=open&page=2 |
| POST | Create new resource | NO — calling twice creates two records | POST /api/orders with body {customerId, items} |
| PUT | Replace entire resource | YES | PUT /api/customers/123 — replaces all fields |
| PATCH | Update specific fields only | Usually YES | PATCH /api/customers/123 — body: {status: inactive} |
| DELETE | Remove a resource | YES | DELETE /api/orders/456 |
| Status Code | Meaning | What to Do |
|---|---|---|
| 200 | Success | Process the response |
| 201 | Created | Resource was successfully created |
| 400 | Bad Request | Fix your request payload or query parameters |
| 401 | Unauthorized | Check your API key or refresh your token |
| 403 | Forbidden | You don't have permission — contact API owner |
| 429 | Rate Limited | Wait and retry — use Retry-After header value |
| 500 | Server Error | Retry with backoff — not your fault |
import requests, time, logging
class APIClient:
"""Production API client with authentication, pagination, and retry"""
def __init__(self, base_url, token):
self.base = base_url
self.session = requests.Session()
self.session.headers.update({'Authorization': f'Bearer {token}'})
def get_all_pages(self, endpoint, params=None):
"""Automatically fetch all pages of paginated results"""
all_records, page = [], 1
while True:
response = self._get(endpoint, {**(params or {}), 'page': page, 'limit': 100})
records = response.get('data', [])
all_records.extend(records)
logging.info(f'Page {page}: {len(records)} records fetched')
if not response.get('has_more'):
break
page += 1
logging.info(f'Total: {len(all_records)} records')
return all_records
def _get(self, endpoint, params, max_retries=3):
url = f'{self.base}/{endpoint}'
for attempt in range(max_retries):
r = self.session.get(url, params=params, timeout=30)
if r.status_code == 200:
return r.json()
elif r.status_code == 429: # Rate limited
wait = int(r.headers.get('Retry-After', 60))
logging.warning(f'Rate limited — waiting {wait}s')
time.sleep(wait)
elif r.status_code >= 500: # Server error
time.sleep(2 ** attempt) # exponential backoff
else:
r.raise_for_status() # 4xx — raise immediately
raise Exception(f'API call failed after {max_retries} attempts')
# Usage
client = APIClient('https://api.salesforce.com/v57.0', token='your_token_here')
orders = client.get_all_pages('services/data/query', params={'q': 'SELECT Id FROM Order'})
AWS, GCP & Azure — Cloud Platforms
Core services, equivalents across all three clouds, S3 deep dive, IAM — what every data engineer must know
Market share reality for US consultingAWS leads at ~32%, Azure at ~23%, GCP at ~12%. In US IT consulting, AWS S3 + Snowflake is the most common data stack. Know the services on each cloud and their equivalents. Most job descriptions list one cloud — knowing the pattern means you adapt quickly.
| Category | AWS | GCP | Azure | What It Does |
|---|---|---|---|---|
| Object Storage | S3 | Cloud Storage (GCS) | Blob Storage | Store any file at unlimited scale — the raw data lake |
| Compute VMs | EC2 | Compute Engine | Virtual Machines | Always-on virtual servers — Airflow, ETL workers |
| Serverless | Lambda | Cloud Functions | Azure Functions | Run code triggered by events — no servers to manage |
| Managed SQL DB | RDS | Cloud SQL | Azure SQL Database | Managed PostgreSQL/MySQL — no OS patching needed |
| Data Warehouse | Redshift | BigQuery | Synapse Analytics | Analytical SQL at petabyte scale |
| Streaming | Kinesis | Pub/Sub | Event Hubs | Managed streaming — Kafka equivalent |
| Managed Airflow | MWAA | Cloud Composer | ADF + Airflow | Run Airflow without managing servers |
| Secrets | Secrets Manager | Secret Manager | Key Vault | Store API keys and passwords securely — never in code |
| Identity | IAM | IAM | Azure Active Directory | Control who can do what to which resource |
🪣
S3 Deep Dive — Most Used Service in Data Engineering
CLI commands, boto3, partitioning strategy — used every single day
▾
# AWS CLI — commands data engineers use daily
aws configure # set up credentials
aws s3 mb s3://my-data-bucket --region us-east-1 # create bucket
aws s3 ls s3://my-bucket/sales/2024/ --recursive # list files
aws s3 cp local_file.parquet s3://my-bucket/sales/2024/ # upload file
aws s3 sync ./local-folder/ s3://my-bucket/data/ # sync entire folder
aws s3 cp s3://my-bucket/sales/2024/data.parquet ./local/ # download file
# Python boto3 — used inside ETL scripts
import boto3
s3 = boto3.client('s3')
# List all files under a prefix
paginator = s3.get_paginator('list_objects_v2')
all_files = [
obj['Key']
for page in paginator.paginate(Bucket='my-bucket', Prefix='sales/2024/')
for obj in page.get('Contents', [])
]
# Upload with encryption (always encrypt!)
s3.put_object(
Bucket='my-bucket',
Key='processed/sales_2024_jan.parquet',
Body=parquet_bytes,
ServerSideEncryption='AES256'
)
S3 partitioning strategy for performanceName files like: s3://bucket/sales/year=2024/month=01/day=15/part-001.parquet — Snowflake and Spark skip entire partitions when your WHERE clause filters on date. A query for January 2024 only reads year=2024/month=01/ and ignores all other folders. This is called partition pruning and can make queries 100x faster on large datasets. Always design your S3 folder structure around how you will query the data.
Storage Formats & Data Types
CSV, JSON, Parquet, Avro, Delta Lake — why format choice matters enormously for performance and cost
| Format | Type | Size vs CSV | Schema? | Best For | Common Tools |
|---|---|---|---|---|---|
| CSV | Text, row-based | 1x (baseline) | No — inferred | Small files, Excel exports, human readability | Everything |
| JSON | Text, hierarchical | 1.2–1.5x larger | No — flexible | API responses, config files, nested semi-structured data | Everything |
| Parquet | Binary, COLUMNAR | 0.1–0.2x (5-10x smaller) | YES — embedded | Analytics, Spark, Snowflake, S3 — the standard | Spark, Snowflake, Pandas, Athena |
| Avro | Binary, row-based | 0.3–0.5x | YES — Schema Registry | Kafka streaming, schema evolution across services | Kafka, Spark, Flink |
| ORC | Binary, columnar | 0.08–0.15x | YES | Hive, older Hadoop ecosystem | Hive, Spark, Presto |
| Delta Lake | Parquet + transaction log | ~0.15x | YES — enforced | ACID on data lake, upserts, time travel, Databricks | Databricks, Spark, Snowflake ext tables |
📋
Why Parquet Beats CSV for Analytics — Visual Explanation
Understand this and you will never forget it
▾
CSV — Row Storage (slow for analytics)
Row 1: id=1, name=Alice, age=30, salary=95000, dept=Eng
Row 2: id=2, name=Bob, age=25, salary=72000, dept=Sales
Row 3: id=3, name=Carol, age=35, salary=110000, dept=Eng
Row 2: id=2, name=Bob, age=25, salary=72000, dept=Sales
Row 3: id=3, name=Carol, age=35, salary=110000, dept=Eng
SELECT AVG(salary) must read ALL 5 columns × every row = 500MB read
Parquet — Columnar Storage (fast for analytics)
[id col]: 1, 2, 3
[name col]: Alice, Bob, Carol
[salary col]: 95000, 72000, 110000
[dept col]: Eng, Sales, Eng
[name col]: Alice, Bob, Carol
[salary col]: 95000, 72000, 110000
[dept col]: Eng, Sales, Eng
SELECT AVG(salary) reads ONLY salary column = 100MB. 5x less I/O!
Interview answer: "Why do you use Parquet?""We use Parquet for three reasons. First, it is columnar — analytics queries only read the columns they need, which can be 10x faster than row-based formats. Second, each column compresses independently — integers compress extremely well. A 1GB CSV typically becomes 100-200MB Parquet, saving 70-80% on S3 storage costs. Third, the schema is embedded in the file so Snowflake and Spark know the exact data types without guessing. For Kafka streaming we use Avro instead — it is row-based which is faster for individual record reads, and it works natively with Schema Registry."
Apache Kafka — Event Streaming
Real-time data pipelines — when batch ETL is not fast enough for the business
When to use Kafka vs batch ETLBatch ETL is fine when data can be hours or days old — financial reports, warehouse loads, monthly analytics. Use Kafka when data must be available in seconds — fraud detection (act before a transaction posts), real-time dashboards (live sales counter on the CEO's screen), IoT monitoring (detect machine failure as it happens), microservices decoupling (order service tells inventory and billing at the same time). Only add Kafka when real-time is genuinely required — it has operational complexity.
📨
Core Concepts — Every Term with a Business Example
Topic, Partition, Offset, Consumer Group, Broker — explained simply
▾
| Concept | What It Is | Business Analogy | Real Example |
|---|---|---|---|
| Topic | Named log of messages | A TV channel — tune in to receive broadcasts | order-placed, payment-failed, user-clicked |
| Partition | Parallel subdivision of a topic | Highway lanes — more lanes = more traffic capacity | order-events split into 12 partitions = 12x throughput |
| Offset | Sequential message number in a partition (0,1,2...) | Page number in a book — bookmark where you stopped | Consumer processed up to offset 1,042,771 |
| Producer | Application that writes messages to a topic | A broadcaster publishing to a channel | Order service publishes to order-events on every purchase |
| Consumer Group | Set of consumers sharing all partitions of a topic | Call center team — agents share the incoming queue | fraud-service with 3 consumers reads all 12 partitions |
| Broker | Kafka server that stores messages on disk | Post office — receives, stores, and delivers mail | 3-broker cluster — each holds replicas for fault tolerance |
// Node.js — Kafka producer and consumer (kafkajs library)
const { Kafka } = require('kafkajs');
const kafka = new Kafka({ clientId: 'order-service', brokers: ['kafka-broker:9092'] });
// PRODUCER: publish event when order is placed
const producer = kafka.producer();
await producer.connect();
await producer.send({
topic: 'order-events',
messages: [{
key: order.customerId, // same customer always same partition
value: JSON.stringify({
orderId: order.id,
customerId: order.customerId,
amount: order.total,
items: order.items,
eventType: 'ORDER_PLACED',
timestamp: new Date().toISOString(),
}),
headers: { source: 'web-checkout', version: 'v2' }
}]
});
// CONSUMER: fraud detection service reads every single order
const consumer = kafka.consumer({ groupId: 'fraud-detection-service' });
await consumer.connect();
await consumer.subscribe({ topic: 'order-events' });
await consumer.run({
eachMessage: async ({ message }) => {
const order = JSON.parse(message.value.toString());
const risk = await calculateFraudScore(order);
if (risk > 0.9) {
await blockAndSendAlert(order, risk);
}
}
});
Spark & Databricks
Distributed processing for large-scale data — when SQL alone is not enough
When to use Spark instead of Snowflake SQLUse Snowflake SQL for structured analytics up to a few terabytes — it is simpler and cheaper. Use Spark for: multi-terabyte processing where Snowflake becomes too slow or expensive, machine learning feature engineering at scale, complex transformations that are hard to express in SQL (graph processing, iterative algorithms), joining huge datasets across different storage systems. Databricks is managed Spark plus Delta Lake plus MLflow plus notebooks — the standard platform for large-scale data engineering at tech companies.
🔥
Spark Architecture and PySpark Code
Driver, Executors, partitions — how Spark distributes work across machines
▾
Spark Architecture
Driver
Plans the job
coordinates all
Plans the job
coordinates all
→
Cluster Manager
YARN / Kubernetes
allocates resources
YARN / Kubernetes
allocates resources
→
Executor 1
Partitions 1–3
Partitions 1–3
Executor 2
Partitions 4–6
Partitions 4–6
Executor N
Partitions 7–N
Partitions 7–N
Data is split into partitions — each executor processes its partitions in parallel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
spark = SparkSession.builder.appName('SalesTransform').getOrCreate()
# Read 1 year of Parquet from S3 — automatically distributed across executors
df = spark.read.parquet('s3://my-bucket/sales/year=2024/')
# Window function — running total per customer (same as SQL but distributed)
w = Window.partitionBy('customer_id').orderBy('sale_date')
result = (
df
.filter(F.col('status') == 'COMPLETED')
.withColumn('revenue', F.col('price') * F.col('quantity'))
.withColumn('running_total', F.sum('revenue').over(w))
.withColumn('prev_month', F.lag('revenue', 1, 0).over(w))
.groupBy('customer_id', 'region')
.agg(
F.sum('revenue').alias('total_revenue'),
F.count('*').alias('order_count'),
F.avg('revenue').alias('avg_order_value'),
)
.filter(F.col('total_revenue') > 1000)
)
# Write results back to S3 as Parquet
result.write.mode('overwrite').parquet('s3://my-bucket/processed/customer_summary/')
CICD & Git
How code goes from developer laptop to production safely — every professional project uses this
🚀
Complete CICD Pipeline for Data Projects
Git workflow, GitHub Actions, staging, production — step by step
▾
1
Write code on a feature branch
git checkout -b feature/add-customer-scd2. Write Python ETL or dbt SQL. Test locally with sample data. Never commit directly to the main branch.
2
Push and open a Pull Request
git add . && git commit -m "Add SCD2 for customer dimension" && git push. Open Pull Request on GitHub with description of what changed and why. Request review from tech lead.
3
CI runs automatically on every push
GitHub Actions triggers: run pytest unit tests, flake8 code style check, dbt compile to check SQL syntax, data quality tests on dev Snowflake. PR is blocked from merging if any check fails.
4
Code review and approval
Tech lead reviews for correctness, performance, security, and documentation. At least 1 approval required. After approval: merge to main triggers auto-deploy to staging environment.
5
Validate staging and deploy to production
Business analyst validates results in staging. QA sign-off. Create ServiceNow Change Request. After CAB approval deploy to production during the change window. Run smoke tests. Close the CR.
# .github/workflows/data-pipeline-cicd.yml
name: Data Pipeline CI/CD
on:
pull_request: { branches: [main] }
push: { branches: [main] }
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v4
with: { python-version: '3.11' }
- run: pip install -r requirements.txt
- run: flake8 src/ --max-line-length=120
- run: pytest tests/unit/ -v --cov=src
- run: dbt compile --profiles-dir ./profiles
deploy-staging:
needs: test
if: github.ref == 'refs/heads/main'
steps:
- run: dbt run --target staging --select tag:daily
- run: dbt test --target staging
- run: pytest tests/integration/ -v
deploy-production:
needs: deploy-staging
environment: production # requires manual approval in GitHub settings
steps:
- run: dbt run --target prod --select tag:daily
- run: python scripts/smoke_test.py
JIRA & ServiceNow
Project management and IT service management — knowing these signals real enterprise project experience
📋
JIRA Complete Guide — Every Term with Data Project Examples
Scrum, sprints, stories, velocity — every concept you will encounter
▾
| JIRA Term | What It Is | Real Data Project Example | Who Creates It |
|---|---|---|---|
| Epic | Large body of work spanning multiple sprints | "Build Customer Analytics Platform" — takes 3 months | Product Owner / Manager |
| Story (User Story) | User-facing feature that fits in one sprint | "As a sales manager, I want daily revenue by region dashboard" | Product Owner |
| Task | Technical sub-work inside a story | "Create Snowflake star schema for revenue model" | Developer / Engineer |
| Bug | Something broken that needs fixing | "Dashboard shows January revenue as $0 — wrong date filter" | QA / anyone who finds it |
| Spike | Time-boxed research to reduce uncertainty | "Investigate how to connect Salesforce real-time to Kafka" | Tech Lead |
| Sprint | 2-week iteration with committed stories | Sprint 14: 4 stories + 2 bugs = 38 story points committed | Scrum Master manages |
| Story Points | Relative effort using Fibonacci (1, 2, 3, 5, 8, 13) | Simple SQL query = 2 pts. New ETL pipeline = 8 pts. | Dev team estimates together |
| Velocity | Average story points completed per sprint | Team averages 40 pts/sprint — plan 40 pts per sprint | Scrum Master tracks |
| Backlog | All pending work not yet in a sprint | 50 stories prioritized by business value, waiting to be picked | Product Owner prioritizes |
| Definition of Done | What "complete" means for the team | Code reviewed, tested in staging, business approved, documented | Team decides together |
Perfect user story format — memorize this
AS A regional sales manager
I WANT to see daily sales by product category filtered to my region
SO THAT I can identify underperforming categories and take corrective action
ACCEPTANCE CRITERIA: Dashboard loads in under 3 seconds. Date range filter defaults to last 30 days. Pre-filtered to the logged-in manager's region. Shows revenue, order count, avg order value, and month-over-month % change. Export to Excel works. Data refreshed daily by 7 AM EST.
DEFINITION OF DONE: SQL tested with 3+ months of data. Snowflake view reviewed by tech lead. Tableau dashboard approved by product owner. Row-level security verified. Deployed to Tableau Server production. STORY POINTS: 8.
I WANT to see daily sales by product category filtered to my region
SO THAT I can identify underperforming categories and take corrective action
ACCEPTANCE CRITERIA: Dashboard loads in under 3 seconds. Date range filter defaults to last 30 days. Pre-filtered to the logged-in manager's region. Shows revenue, order count, avg order value, and month-over-month % change. Export to Excel works. Data refreshed daily by 7 AM EST.
DEFINITION OF DONE: SQL tested with 3+ months of data. Snowflake view reviewed by tech lead. Tableau dashboard approved by product owner. Row-level security verified. Deployed to Tableau Server production. STORY POINTS: 8.
🎫
ServiceNow — IT Service Management (ITSM)
Incidents, Change Requests, CMDB, CAB — enterprise ITSM that every big company uses
▾
ServiceNow vs JIRA — they serve different purposesJIRA = project management — building new features, tracking development work. ServiceNow = IT operations — handling IT problems, managing production changes, tracking IT assets. Big enterprises use both. You use JIRA to build the pipeline. You use ServiceNow to deploy it to production safely.
| ServiceNow Module | Purpose | Real Data Engineering Example |
|---|---|---|
| Incident | Something is broken — fix it now | Snowflake pipeline failed at 3 AM — P1 incident created, on-call engineer paged |
| Problem | Root cause analysis after repeated incidents | Pipeline fails every Monday — problem ticket to find the root cause permanently |
| Change Request | Formal approval before modifying production | Adding column to fact_sales — needs approval before deploying |
| CAB | Change Advisory Board — weekly meeting to approve changes | Thursday meeting: team reviews all planned changes for next week's deployment |
| CMDB | Inventory of all IT assets and their relationships | All Snowflake warehouses, Airflow servers, S3 buckets cataloged here |
| SLA | Response and resolution time commitments | P1: respond in 15 min, resolve in 4 hrs. P2: respond in 1 hr, resolve in 8 hrs |
Change Request process — deploying to production
"In all our projects any change to production required a Change Request in ServiceNow. You fill out: what is changing, why, risk level (low/medium/high), rollback plan, testing evidence with screenshots, and step-by-step deployment instructions. Submit to your manager and the CAB for approval. Once approved, deploy only during the approved change window — usually Sunday 2 to 4 AM. After deployment run smoke tests: can Tableau connect, is the pipeline running, are data quality checks passing. Mark the CR as successfully implemented and close it. If anything goes wrong use the rollback plan and mark it as failed — never hide a failed deployment."
Data Quality Engineering
The most important skill nobody teaches — bad data destroys trust and costs companies millions
Why data quality is the most important skill in data engineeringA dashboard built on bad data leads to wrong decisions. If the sales team trusts a report showing $5M revenue when real revenue is $4.2M, that is a serious business problem. Interviewers love this topic — having a structured answer to data quality shows real seniority.
✅
6 Dimensions of Data Quality — The Framework
Memorize this framework and use it in every data quality conversation
▾
| Dimension | Question to Ask | SQL Check Example | Alert Threshold |
|---|---|---|---|
| Completeness | Are required fields populated? All rows present? | COUNT(*) WHERE customer_id IS NULL | 0 nulls on key fields |
| Uniqueness | Are there duplicate primary keys or duplicate records? | COUNT(*) vs COUNT(DISTINCT sale_id) | 0 duplicates on primary key |
| Accuracy | Are values in valid ranges? Are they correct? | COUNT(*) WHERE amount <= 0 OR amount > 1000000 | 0 out-of-range rows |
| Consistency | Does count match across source and target systems? | Source count = Snowflake count ± tolerance | Alert if variance > 0.01% |
| Timeliness | Is the data fresh enough for the business? | DATEDIFF('hour', MAX(load_ts), NOW()) | Alert if older than 4 hours |
| Validity | Do values match expected formats and allowed values? | status NOT IN ('active','inactive','pending') | 0 invalid values |
-- Run all 5 DQ checks after every pipeline load
-- 1. Completeness: no nulls on required fields
SELECT COUNT(*) AS null_customer_ids
FROM fact_sales
WHERE customer_id IS NULL; -- must return 0
-- 2. Uniqueness: no duplicate sale IDs
SELECT sale_id, COUNT(*) AS cnt
FROM fact_sales
GROUP BY sale_id
HAVING COUNT(*) > 1; -- must return no rows
-- 3. Accuracy: amounts must be positive
SELECT COUNT(*) AS bad_amounts
FROM fact_sales
WHERE amount <= 0; -- must return 0
-- 4. Timeliness: data must be fresh
SELECT DATEDIFF('hour', MAX(load_timestamp), CURRENT_TIMESTAMP) AS hours_old
FROM fact_sales; -- alert if > 4
-- 5. Consistency: loaded count matches source count
SELECT
source_count,
loaded_count,
ROUND(loaded_count * 100.0 / source_count, 2) AS pct_loaded
FROM daily_reconciliation
WHERE load_date = CURRENT_DATE; -- pct_loaded should be 100.00%
🚨
What to Do When a Production Pipeline Fails
A complete incident response process — shows real seniority
▾
1
Check the logs immediately
Open Airflow UI, find the failed task, click the logs. Find the exact error message and stack trace. Do not guess — the answer is always in the logs.
2
Classify the failure type
Transient: network timeout, Snowflake warehouse suspended, API rate limit — retry and it will work. Permanent: bad data, schema change in source, code bug — needs a fix before retrying.
3
Assess the impact
Which tables are affected? Which date ranges? Are any dashboards showing wrong data right now? Who needs to be notified immediately?
4
Contain the damage
If bad data already loaded to Snowflake: use Time Travel to restore the table to before the bad load. If data is partially loaded: quarantine the bad rows to an error table, complete the load with good rows.
5
Fix and re-run
Fix the root cause. Clear the failed task in Airflow and re-run. Verify the pipeline completes successfully. Run DQ checks manually to confirm data is correct.
6
Post-mortem and prevention
Document what happened, why, how long it took to fix. Add a data quality check that would have caught this issue earlier. Update the runbook for the on-call team. This is what separates senior engineers from junior ones.
Security & Networks
VPC, IAM, encryption, PII masking, compliance — talk about security like a senior engineer
🔒
Security Concepts Every IT Consultant Must Know
From networking to data security — with data engineering context for each
▾
| Concept | Simple Explanation | Data Engineering Context |
|---|---|---|
| VPC | Private network inside AWS — like a gated community | All your Snowflake connections, EC2 servers, and RDS live inside a VPC — not exposed to the public internet |
| Subnet | Neighborhood inside the VPC with its own routing rules | Public subnet: internet-facing load balancers. Private subnet: databases and ETL servers — no direct internet access |
| Security Group | Firewall for individual resources — allow or deny by IP and port | RDS security group: only allow port 5432 from the Airflow server's IP address. Deny everything else. |
| IAM Role | Identity that grants specific permissions to a service | Airflow EC2 has an IAM role allowing S3 read + Snowflake write + CloudWatch logs. No other permissions. |
| Encryption at Rest | Data is encrypted when stored on disk — unreadable without the key | S3 SSE-AES256, Snowflake AES-256 by default. Even if a drive is physically stolen the data is unreadable. |
| Encryption in Transit | Data is encrypted while moving over the network | All Snowflake JDBC connections use TLS 1.2+. S3 SDK uses HTTPS. Never send data over plain HTTP. |
| Column Masking | Show partial or fake values to users without the right role | Analysts see ***-**-1234 instead of real SSN. Only the compliance team role can see unmasked values. |
| Row-Level Security | Each user sees only the rows they are permitted to see | East regional manager sees only East region rows — same table, different result for each user |
| MFA | Second proof of identity beyond just a password | Required for any production system access — Okta or Google Authenticator app as the second factor |
| Audit Log | Record of every action taken on data — who did what and when | Snowflake ACCOUNT_USAGE.ACCESS_HISTORY: query who accessed which table at exactly what time |
Security answer for every interview"In all our projects we follow the principle of least privilege — each service gets only the permissions it absolutely needs. Data engineers have read access to production Snowflake but write only through a dedicated service account. PII columns like SSN and date of birth have Snowflake Dynamic Data Masking — analysts see masked values, only the compliance team role sees real values. All data is encrypted at rest with AES-256 and in transit with TLS 1.2. Production access requires VPN plus MFA. Every data access is logged in Snowflake ACCOUNT_USAGE for compliance audits."
End-to-End Flow — All Dots Connected
The complete enterprise data platform — study this until you can draw and explain every layer from memory
This is the most important sectionIn every senior data interview you will be asked: "describe the data architecture of your last project." This is your answer framework. Learn it completely and adapt it to any project scenario.
Layer 1 — Source Systems (where data originates)
Salesforce CRM
Opportunities
Opportunities
SAP ERP
Orders, Inventory
Orders, Inventory
Web App
Clickstream
Clickstream
3rd Party APIs
Market data
Market data
PostgreSQL DB
Operational
Operational
↓ extract via Airflow + Python, Fivetran, Kafka ↓
Layer 2 — Ingestion (how data moves)
Batch ETL
Airflow + Python
Daily at 6 AM
Airflow + Python
Daily at 6 AM
||
Managed Connectors
Fivetran / Airbyte
No-code ELT
Fivetran / Airbyte
No-code ELT
||
Real-time Streaming
Apache Kafka
Milliseconds
Apache Kafka
Milliseconds
||
CDC
Debezium
DB change logs
Debezium
DB change logs
↓ land raw files ↓
Layer 3 — Data Lake (raw storage, never delete)
AWS S3 / GCP Cloud Storage / Azure Blob Storage
s3://bucket/sales/year=2024/month=01/day=15/salesforce_opps.parquet
Raw, unmodified, date-partitioned Parquet — the safety net
s3://bucket/sales/year=2024/month=01/day=15/salesforce_opps.parquet
Raw, unmodified, date-partitioned Parquet — the safety net
↓ Snowflake COPY INTO ↓
Layer 4 — Snowflake Data Warehouse (4 schemas)
RAW schema
1:1 typed copy
no business logic
1:1 typed copy
no business logic
→
STAGING schema
deduped, validated
null checks done
deduped, validated
null checks done
→
CORE schema
Star Schema
Fact + SCD2 Dims
Star Schema
Fact + SCD2 Dims
→
MART schema
Sales, Finance
pre-aggregated
Sales, Finance
pre-aggregated
↓ dbt transforms (SQL in Git, tested, documented) ↓
Layer 5 — Quality and Monitoring
dbt tests
not_null, unique
relationships
not_null, unique
relationships
+
SQL DQ checks
null, dup, range
reconciliation
null, dup, range
reconciliation
+
Airflow alerts
Slack on failure
email on SLA breach
Slack on failure
email on SLA breach
↓ serve to business ↓
Layer 6 — Consumption
Tableau
Power BI
Looker
Power BI
Looker
REST APIs
for web/mobile
applications
for web/mobile
applications
Data Science
Jupyter, ML
models
Jupyter, ML
models
Scheduled Reports
Excel, PDF
Email delivery
Excel, PDF
Email delivery
Cross-cutting concerns that wrap every layerOrchestration: Apache Airflow schedules and monitors every batch job. CICD: GitHub Actions deploys all code through test → staging → production. Security: IAM roles, VPC private subnets, TLS encryption, column masking for PII. Project Management: JIRA tracks all development work in sprints. ITSM: ServiceNow handles production incidents and change request approvals. Governance: data catalog tracks what data exists, who owns it, and how to use it.
Business Scenarios — Real Interview Discussions
Learn these stories and tell them fluently — interviewers value real project experience above everything else
🏬
Retail Analytics Platform — Full End-to-End Story
Data warehouse developer, ETL, Snowflake, Tableau combined scenario
▾
Project context — use this as your project
National retail chain with 500 stores and 3 source systems: POS transactions from store terminals, SAP ERP for inventory and procurement, Salesforce for their loyalty program. The business had no unified view — merchandising, operations, and finance each had separate spreadsheets that always disagreed.
What you built
AWS S3 raw data lake — all files land here first as Parquet files partitioned by store ID and date. Airflow on EC2 runs daily at 5 AM — POS data via SFTP pickup, SAP via REST API, Salesforce via Fivetran. Snowflake COPY INTO loads the RAW schema. dbt transforms to CORE schema — star schema with fact_sales, fact_inventory, dim_product with SCD Type 2 for price changes, dim_store, dim_customer, dim_date. Tableau connects to MART schema — 4 dashboards: Executive KPIs, Store Manager Performance, Inventory Analysis, and Loyalty Program Analytics.
Problem you solved — always include a specific problem
POS systems from different store brands sent data in incompatible formats — some pipe-delimited, some CSV with different column orders, different date formats, different NULL representations. We built a Python normalization layer that detected the format automatically and converted everything to standard Parquet before landing in S3. We added dbt tests and Great Expectations checks — if the null rate on product_id exceeded 0.1% the Airflow task failed and a Slack alert fired immediately. This reduced dashboard data errors by 95% in the first month after go-live.
🏦
Financial Services — Real-Time Fraud Detection + Compliance Reporting
Kafka real-time plus Snowflake batch — two pipelines for one business
▾
The dual requirement
Credit card company processing 2 million transactions per day. Two completely opposite requirements: fraud detection in under 500 milliseconds (real-time), and daily regulatory compliance reports that need to be available by 6 AM (batch). One solution had to serve both.
Real-time path via Kafka
Every card transaction published to Kafka topic order-events, keyed by card number so all events for one card stay in one partition and in order. Fraud detection service consumed the topic, ran rules: velocity check (5 transactions in 10 minutes), geo-anomaly (transaction in NYC then Miami 30 minutes later), and amount spike (3x the customer's average). Flagged transactions published to fraud-alerts topic. Alert service consumed fraud-alerts and sent real-time push notification or SMS to the cardholder. Full path: transaction to customer alert under 200 milliseconds.
Batch path via Snowflake
Same Kafka events landed in S3 via Kafka Connect S3 Sink connector — Parquet files created every 15 minutes. Airflow triggered COPY INTO at midnight. dbt built regulatory reporting models — daily transaction aggregates by merchant category, geographic distribution, and amount brackets — all required by the compliance team. Reports were available by 6 AM every morning and ran in 8 minutes on a Medium Snowflake virtual warehouse.
🏥
Healthcare Analytics — HIPAA Compliant Data Platform
Security, PII masking, row-level security — the compliance-heavy scenario
▾
The challenge
20-hospital network needed patient outcome analytics — readmission rates, average length of stay, procedure costs, quality metrics — while being fully HIPAA compliant. PHI (Protected Health Information) could never be exposed to analytics teams, dashboard users, or the ML team. Any breach would be a federal violation.
Security architecture we built
Snowflake Dynamic Data Masking on all 18 HIPAA identifier columns — analysts see masked patient_id and asterisks for name, SSN, and date of birth. Only the compliance team role with background checks could see unmasked values. Row-level security via Snowflake Row Access Policy — each hospital's analytics team saw only their hospital's records. All data encrypted at rest with AES-256 and in transit with TLS 1.2. Only 5 data engineers with HIPAA training had access to unmasked PHI, and only through VPN plus MFA. Every single data access logged in Snowflake ACCOUNT_USAGE for annual compliance audits.
How we handled the ML team request
The ML team wanted patient data to build a readmission prediction model but could not have access to PHI. We created a de-identified extract: replaced patient IDs with surrogate keys using a one-way SHA-256 hash, removed all 18 HIPAA identifiers, generalized age to 5-year buckets, generalized zip code to 3-digit prefix. The ML team trained their model on this dataset with zero access to PHI at any point.
Free Accounts — Practice With Real Tools
Hands-on practice is what separates people who know the concepts from people who actually get hired
Why you must set these up this weekendAnyone can say "I have Snowflake experience." Only someone who actually logged in can say "I created a Medium virtual warehouse with auto-suspend at 5 minutes, loaded Parquet files from S3 using COPY INTO with a Storage Integration, and used Time Travel to recover rows deleted by a wrong DELETE statement." That specificity wins interviews. All of these are free. Do it this weekend.
❄
Snowflake — 30 Day Trial, $400 Free Credits
Step-by-step setup and specific exercises to practice
▾
1
Sign up at snowflake.com/try
Choose Standard edition → AWS → US-East-1. No credit card required for the trial. Confirm your email and log in. You now have a full Snowflake environment with $400 in free credits.
2
Create your practice environment
Click Worksheets in the top menu → New Worksheet. This is your SQL editor connected to Snowflake. Run the setup SQL below to create your practice star schema.
3
Practice exercises — do every one of these
1) Create star schema tables and insert sample data. 2) Run RANK, DENSE_RANK, and ROW_NUMBER queries and compare the results. 3) Write window functions: SUM OVER, LAG, running total. 4) Run a DELETE statement then use Time Travel to recover the deleted rows. 5) CREATE TABLE backup CLONE fact_sales — verify it is instant. 6) Create a Stream on the staging table. 7) Create a Task to run every minute. Watch it execute.
-- Paste this in your Snowflake worksheet — creates a full practice star schema
CREATE DATABASE IF NOT EXISTS practice_db;
CREATE SCHEMA IF NOT EXISTS practice_db.star_schema;
USE practice_db.star_schema;
CREATE OR REPLACE TABLE dim_customer (
customer_key INT AUTOINCREMENT PRIMARY KEY,
customer_id VARCHAR(10),
name VARCHAR(100),
email VARCHAR(200),
city VARCHAR(50),
state VARCHAR(2),
segment VARCHAR(20),
start_date DATE DEFAULT CURRENT_DATE,
end_date DATE DEFAULT '9999-12-31',
is_current BOOLEAN DEFAULT TRUE
);
CREATE OR REPLACE TABLE fact_sales (
sale_key INT AUTOINCREMENT PRIMARY KEY,
customer_key INT REFERENCES dim_customer(customer_key),
product VARCHAR(100),
region VARCHAR(50),
amount FLOAT,
quantity INT,
sale_date DATE,
load_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO dim_customer (customer_id, name, email, city, state, segment)
VALUES
('C001', 'Alice Johnson', '[email protected]', 'New York', 'NY', 'Premium'),
('C002', 'Bob Smith', '[email protected]', 'Chicago', 'IL', 'Standard'),
('C003', 'Carol Davis', '[email protected]', 'Houston', 'TX', 'Premium'),
('C004', 'David Lee', '[email protected]', 'Phoenix', 'AZ', 'Standard');
INSERT INTO fact_sales (customer_key, product, region, amount, quantity, sale_date)
VALUES
(1, 'Laptop', 'East', 1200.00, 1, '2024-01-15'),
(1, 'Mouse', 'East', 25.00, 2, '2024-01-16'),
(2, 'Monitor', 'West', 450.00, 1, '2024-01-17'),
(3, 'Laptop', 'South', 1200.00, 2, '2024-01-18'),
(4, 'Keyboard','West', 85.00, 3, '2024-01-19'),
(2, 'Laptop', 'West', 1200.00, 1, '2024-01-20');
-- Now run these practice queries:
-- 1. RANK vs DENSE_RANK
SELECT customer_key, SUM(amount) total,
RANK() OVER (ORDER BY SUM(amount) DESC) rnk,
DENSE_RANK() OVER (ORDER BY SUM(amount) DESC) drnk
FROM fact_sales GROUP BY customer_key;
-- 2. Running total per region
SELECT region, sale_date, amount,
SUM(amount) OVER (PARTITION BY region ORDER BY sale_date) AS running_total
FROM fact_sales ORDER BY region, sale_date;
-- 3. Test Time Travel: delete rows, then recover them
DELETE FROM fact_sales WHERE product = 'Mouse';
-- Recover them:
INSERT INTO fact_sales SELECT * EXCLUDE(sale_key) FROM fact_sales AT(OFFSET => -300);
-- 4. Clone the table instantly
CREATE TABLE fact_sales_backup CLONE fact_sales;
☁
AWS, GCP, and Azure Free Tiers
All three have generous free tiers — set up all three
▾
| Cloud | Sign Up URL | What Is Free | Best Practice Exercise |
|---|---|---|---|
| AWS | aws.amazon.com/free | S3 5GB forever, Lambda 1M requests/month forever, EC2 750 hrs/year | Create S3 bucket → upload a Parquet file → practice AWS CLI commands → download it back |
| GCP | cloud.google.com/free | $300 credit 90 days + BigQuery 1TB queries free every month forever | Query public NYC taxi dataset: 39 million rows, instant, completely free — practice window functions on real data |
| Azure | azure.microsoft.com/free | $200 credit 30 days + many services free 12 months | Create Blob Storage container → upload a file → use Azure Data Factory to move it |
Best free practice: GCP BigQuery with public datasetsGo to console.cloud.google.com → BigQuery → Add Data → Public Datasets. Run this — completely free and real data: SELECT pickup_borough, COUNT(*) AS trips, ROUND(AVG(fare_amount),2) AS avg_fare, DENSE_RANK() OVER (ORDER BY COUNT(*) DESC) AS borough_rank FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022 GROUP BY pickup_borough ORDER BY trips DESC. That is 39 million rows processed in 3 seconds. Now practice every SQL concept you learned — RANK, window functions, CTEs, deduplication — all on real data at no cost.
Interview Q&A — 50+ Questions
Every role covered — SQL, DWH, Snowflake, Tableau, ETL, Kafka, Cloud, Data Quality, JIRA — click to reveal the full answer
SQL & Databases
Q
What is the difference between RANK, DENSE_RANK, and ROW_NUMBER?+
ROW_NUMBER: always gives unique sequential numbers — no ties. If two people have the same salary one gets 1 and the other gets 2 arbitrarily. RANK: ties get the same rank but the next number is skipped. Two people tied for 1st both get rank 1, next person gets rank 3 — not 2. DENSE_RANK: ties get the same rank with no skipping. Two tied for 1st both get 1, next person gets 2. Use DENSE_RANK for business leaderboards (clients expect 1,2,3), ROW_NUMBER to deduplicate (pick exactly one row per group).
Q
What is the difference between WHERE and HAVING?+
WHERE filters individual rows BEFORE grouping and aggregation. HAVING filters groups AFTER aggregation has been computed. The rule: if your filter condition uses an aggregate function like SUM, COUNT, AVG, MAX, or MIN — it must go in HAVING. If it compares raw column values — it goes in WHERE. You can use both in the same query. WHERE runs first and reduces the rows (faster), HAVING then filters the resulting groups.
Q
How do you find and remove duplicate records from a table?+
Find duplicates: SELECT order_id, COUNT(*) FROM orders GROUP BY order_id HAVING COUNT(*) > 1. Remove duplicates keeping the latest: use ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY created_at DESC) AS rn in a CTE, then SELECT WHERE rn = 1. In Snowflake: CREATE TABLE orders_clean AS SELECT * EXCLUDE(rn) FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY load_ts DESC) rn FROM orders_raw) WHERE rn = 1.
Q
What is a window function and give a real business example?+
A window function performs a calculation across rows related to the current row without collapsing them like GROUP BY does. The OVER() clause defines the window. Real example: SUM(revenue) OVER (PARTITION BY rep_name ORDER BY month) gives a running total of revenue per sales rep — every row stays in the result, each row shows the cumulative total up to that month. Used for: running totals, month-over-month comparisons with LAG, rankings within groups with DENSE_RANK, moving averages.
Q
What is a CTE and when do you use it over a subquery?+
A CTE (Common Table Expression) is a named temporary result defined with the WITH clause before the main SELECT. Use it to: 1) Break complex queries into readable named steps — each step has a meaningful name. 2) Reference the same subquery multiple times without repeating code. 3) Build recursive queries for hierarchies like org charts. CTEs do not improve performance over subqueries in most databases — the benefit is entirely readability and maintainability. In a team environment, readable code is as valuable as fast code.
Data Warehouse & Modeling
Q
What is a Star Schema and why do we use it?+
Star Schema has a central fact table surrounded by dimension tables — it looks like a star. The fact table stores measurable business events: sale transactions, web clicks, inventory movements. Dimension tables store descriptive context: customer name and segment, product category, store location, date attributes. Why we use it: simple joins (one join per dimension), fast query performance because most queries only need a few dimensions, and BI tools like Tableau are optimized for star schemas. Almost every modern analytics platform uses Star Schema for reporting.
Q
What is SCD Type 2 and how do you implement it?+
SCD Type 2 keeps the full history of how a dimension record changed over time. When an attribute changes (customer moves city), instead of overwriting, we add a new row. Each row has: start_date (when this version became effective), end_date (9999-12-31 for the current record), and is_current flag (TRUE for current). Example: customer moves from NYC to Chicago. Step 1: UPDATE the old row — set end_date = yesterday, is_current = FALSE. Step 2: INSERT new row with Chicago, start_date = today, end_date = 9999-12-31, is_current = TRUE. Historical fact rows still correctly join to the NYC record for old transactions. Historical accuracy is preserved perfectly.
Q
What is the difference between a Data Lake, Data Warehouse, and Data Mart?+
Data Lake: stores raw data in any format (CSV, JSON, Parquet, images, logs) in cheap object storage like S3. No schema enforced — anything goes. Schema-on-read. Data Warehouse: stores cleaned, typed, integrated data in a Star Schema optimized for SQL analytics. Schema-on-write — structure enforced when loading. Data Mart: a focused subset of the warehouse designed for one specific business team — Sales Mart, Finance Mart, Operations Mart. Order: Data Lake comes first (land everything raw). Data Warehouse comes second (curate what matters). Data Mart comes last (serve specific teams). Lake is raw materials, Warehouse is the finished goods store, Mart is the checkout aisle per department.
Q
What is the difference between OLTP and OLAP?+
OLTP (Online Transaction Processing): operational databases that run the business — fast reads and writes of individual records. Highly normalized (3rd Normal Form). Used for: web apps, order management, customer records. Examples: MySQL, PostgreSQL in your application. OLAP (Online Analytical Processing): analytical databases for understanding the business — fast aggregation of millions of rows. Denormalized Star Schema. Used for: reports, dashboards, historical analysis. Examples: Snowflake, BigQuery, Redshift. ETL or ELT moves data from OLTP to OLAP. Never run heavy analytics directly on your OLTP database — it will slow down your application.
Snowflake
Q
How does Snowflake separate storage and compute?+
Traditional databases couple storage and compute — you cannot scale one without the other and all users compete for the same resources. Snowflake separates them completely. Storage sits in S3-based compressed columnar format and you pay per TB stored. Compute is Virtual Warehouses — independent clusters you spin up in seconds and pay per second of usage. Multiple teams can query the same data with their own warehouses simultaneously without competing. Auto-suspend means you pay zero when no queries are running. This is the fundamental advantage over traditional databases like Redshift or Oracle.
Q
What is Snowflake Time Travel and how have you used it in a real project?+
Time Travel lets you query or recover data from any point in the past — up to 90 days for Enterprise tier. Syntax: SELECT * FROM sales AT(OFFSET => -3600) queries the table as it was 1 hour ago. Real uses: 1) An analyst ran a wrong DELETE — we recovered all rows in 5 minutes using Time Travel, no backup restore, zero downtime. 2) An ETL job loaded bad data — we used CREATE TABLE corrected CLONE sales AT(OFFSET => -86400) to restore from before the bad load. 3) Compliance audit — we proved exactly what data existed on a specific regulatory reporting date.
Q
What are Snowflake Streams and Tasks?+
A Stream is a CDC object that captures every INSERT, UPDATE, and DELETE on a source table since the last time the stream was consumed. It adds metadata columns: METADATA$ACTION ('INSERT' or 'DELETE'), METADATA$ISUPDATE (TRUE for updates). A Task is a scheduler that runs a SQL statement on a cron schedule or when triggered. Combining them: Task runs every 5 minutes, checks SYSTEM$STREAM_HAS_DATA(), if new records exist it runs a MERGE statement to upsert changes into the fact table. This creates near-real-time ELT entirely inside Snowflake — no Kafka or external tools required when sub-minute latency is acceptable.
Tableau & BI
Q
What is a FIXED LOD expression and give a real business example?+
FIXED LOD computes an aggregation at a dimension you specify, completely ignoring whatever dimensions or filters are in the current view. Example: dashboard shows revenue by product. Business wants to see each product's revenue as a percentage of that customer's total spend — not as a percentage of all products visible. Without FIXED, TOTAL() would give percent of all visible products. With FIXED: {FIXED [Customer ID] : SUM([Revenue])} computes each customer's full total regardless of the product filter. Divide product revenue by that FIXED value — the denominator stays correct as the full customer lifetime value even when filtering to one product category.
Q
What is the difference between Live and Extract in Tableau?+
Live connection: every user interaction — filter, click, hover — fires a SQL query to Snowflake in real time. Always shows current data. Can be slow for complex dashboards on large tables. Use for: real-time operational dashboards where data changes by the minute. Extract (.hyper file): Tableau copies data into its own in-memory columnar format at refresh time. Queries are very fast — all in memory. Data is a snapshot that needs scheduled refresh (every hour, every day). Use for: large datasets where 1-hour-old data is acceptable and performance matters. Decision rule: if the business decision changes based on data being 1 hour vs real-time fresh, use Live. If daily is fine, use Extract.
ETL / ELT & Pipelines
Q
What is the difference between ETL and ELT and when do you use each?+
ETL: Extract from source → Transform on a dedicated ETL server → Load clean data to the warehouse. Traditional approach. Tools: Informatica, SSIS, Talend. Used when: warehouse has limited compute or data must be cleansed before it can land (compliance, size). ELT: Extract from source → Load raw data directly to the cloud warehouse → Transform inside the warehouse using SQL. Modern cloud approach. Tools: dbt + Snowflake, Fivetran + BigQuery. Used because: cloud warehouse compute is cheap and scalable, raw data preserved means you can re-transform anytime, SQL transformations in Git are version controlled and testable. Most new projects use ELT.
Q
What happens when a production pipeline fails at 3 AM — walk me through your response+
1. Check the Airflow logs immediately — find the exact task that failed and the error message. Never guess. 2. Classify: transient (network timeout, warehouse suspended — retry) or permanent (bad data, schema change — needs fix first). 3. Assess scope — which tables affected, which date ranges, are any dashboards showing wrong data to business users right now. 4. Contain: if bad data loaded to Snowflake, use Time Travel to restore; if partial load, quarantine bad rows and complete the good ones. 5. Fix root cause and re-run. 6. Notify stakeholders: what failed, what data was affected, what was done to fix it, when it was resolved. 7. After resolution: add the DQ check that would have caught this, update the runbook.
Q
What is dbt and why has it become the standard for ELT?+
dbt (data build tool) is a transformation framework where you write SQL SELECT statements and dbt materializes them as tables or views in your warehouse. What makes it the standard: 1) SQL in Git — version control, code review, history for all transformations. 2) Automated testing — not_null, unique, referential integrity, accepted_values tests run on every deployment. 3) Documentation and lineage — auto-generated docs show exactly which models feed which models. 4) Modularity with ref() — models reference each other and dbt builds in the correct dependency order. 5) Incremental loads — only process new data, not full refreshes. If you know dbt you look senior immediately.
Cloud & Infrastructure
Q
What is IAM and why is it critical for data platforms?+
IAM (Identity and Access Management) controls who and what can access which AWS resources. The principle of least privilege: every identity gets only the minimum permissions it absolutely needs — nothing more. For data platforms: Airflow EC2 has an IAM role that allows S3 read and Snowflake JDBC and CloudWatch logs — nothing else. Lambda trigger has a role allowing only S3 read from one specific bucket. Why critical: if broad credentials get stolen, an attacker has access to everything. IAM roles are safer than access keys because keys can be accidentally committed to GitHub. Never hardcode AWS credentials in Python code — always use IAM roles or AWS Secrets Manager.
Q
What is S3 and why is it used as a data lake?+
S3 (Simple Storage Service) is AWS object storage — store any file (CSV, JSON, Parquet, images, logs) at any scale for very low cost (about $0.023 per GB per month). Why it works as a data lake: unlimited storage, 99.999999999% durability (11 nines), files accessible by URL, lifecycle policies (automatically archive or delete old files), event notifications (file arrives triggers a Lambda), and it integrates natively with Snowflake COPY INTO. GCP equivalent: Cloud Storage. Azure equivalent: Blob Storage. In data engineering almost every pipeline starts or ends at S3.
Kafka & Data Quality
Q
When would you use Kafka instead of batch ETL?+
Use Kafka when the business genuinely needs data in seconds or milliseconds. Real examples: fraud detection (must evaluate a transaction before it posts — batch runs hours later), real-time dashboard showing live sales on the CEO's screen during a product launch, IoT sensor monitoring where a machine failure must be detected and acted on immediately, microservices where the order service must notify inventory and billing at the exact moment of purchase. Use batch ETL when hourly or daily refresh is fine — financial reports, data warehouse loads, monthly analytics. Kafka has operational complexity and cost — only add it when real-time is truly required by the business.
Q
What are the 6 dimensions of data quality and how do you check them?+
Completeness: are all required fields populated and all expected records present? Check: COUNT(*) WHERE customer_id IS NULL — must be 0. Uniqueness: no duplicate primary keys. Check: COUNT(*) vs COUNT(DISTINCT sale_id). Accuracy: values are in valid ranges. Check: COUNT(*) WHERE amount <= 0 — must be 0. Consistency: same data agrees across systems. Check: source row count = Snowflake row count. Timeliness: data is fresh enough. Check: DATEDIFF('hour', MAX(load_ts), NOW()) — alert if > 4. Validity: values match expected formats and allowed values. Check: status NOT IN ('active','inactive','pending') — must be 0.
JIRA & ServiceNow
Q
Walk me through how you manage a data engineering project in JIRA+
Sprint planning every 2 weeks: Product Owner presents the prioritized backlog. Dev team estimates each story in planning poker using Fibonacci points — 1 for trivial, 13 for a large unknown. We commit to the stories that fit our historical velocity. Daily standups: 15 minutes — what I completed yesterday, what I am doing today, any blockers. I update JIRA tickets daily with progress comments and link pull requests. Blockers are escalated the same day — never sit on a blocker overnight. Sprint review: demonstrate working software to stakeholders — actual Tableau dashboards loading real data, not slides. Retrospective: what went well and what to improve next sprint. I track my personal velocity to give better estimates over time.
Q
What is a ServiceNow Change Request and when do you need one?+
A Change Request is a formal approval process before making any modification to production systems. It prevents unauthorized or untested changes from breaking live data and dashboards. You need one for: adding a column to a Snowflake table, deploying a new dbt model, changing an Airflow DAG schedule, updating a Tableau data source connection. The CR form includes: what is changing, why, risk level (low/medium/high), rollback plan, testing evidence with screenshots, step-by-step deployment instructions. Submit to your manager and the CAB for approval. Deploy only during the approved change window. After deployment run smoke tests and close the CR. Auditors use CR history to prove all production changes were authorized and controlled.
General IT & Architecture
Q
Walk me through the complete architecture of your last data project end to end+
"At our retail client project we had 8 source systems: Salesforce for CRM, SAP for ERP, 6 store POS systems. Data landed daily in AWS S3 as Parquet files via Airflow DAGs running at 5 AM — POS via SFTP, SAP via REST API, Salesforce via Fivetran. Snowflake COPY INTO loaded the RAW schema — exact typed copy of the source data. dbt models built the CORE schema — star schema with 3 fact tables (sales, inventory, returns) and 6 dimension tables including SCD Type 2 on customer and product dimensions. dbt tests ran automatically on every deployment. The MART schema had pre-aggregated views for the Sales team and Finance team. Tableau Desktop connected to the MART via live connection with Snowflake row access policies enforcing that each regional manager saw only their region. GitHub Actions deployed all code changes through dev → staging → production with automated tests at each stage. Airflow sent Slack alerts on any pipeline failure."
Q
What is JSON and when do you use it in data engineering?+
JSON (JavaScript Object Notation) is a text format for structured data using key-value pairs and arrays. Example: {"customerId": "C001", "orders": [{"id": 1, "amount": 99.99}]}. Used when: 1) APIs — virtually all REST APIs send and receive JSON. 2) Config files for applications and pipelines. 3) Semi-structured data with nested structures that do not fit neatly in relational tables. Challenges for data engineering: JSON is verbose (large files), has no schema enforcement (any field can appear), and nested structures need flattening for SQL analysis. In Snowflake: VARIANT column type stores and queries JSON natively — use col:fieldName syntax to extract nested values directly in SQL.
Q
Why do we use Parquet instead of CSV for big data?+
Three reasons. First, Parquet is columnar — analytics queries only read the columns they need instead of the entire row. SELECT AVG(salary) on a 10-column table reads 10% of the data versus CSV which reads 100%. For wide tables this is a 10x speed improvement. Second, column-level compression is extremely efficient — integers compress better than mixed data, strings compress with dictionary encoding. A 1GB CSV typically becomes 100-200MB Parquet, saving 70-80% in S3 storage costs. Third, the schema is embedded in the Parquet file so Snowflake and Spark know exact data types without guessing or needing an external schema definition. For Kafka streaming we use Avro instead because it is row-based and faster for writing and reading individual records.
Complete Learning Checklist
Check off each skill when you can explain it AND demonstrate it without notes