Start Here — Before Anything Else

IT Technology Encyclopedia

Before you dive into code and interviews, understand every technology from scratch. What is it? Why does it exist? Who built it? Which companies use it? Who are its competitors? Click any technology below to get the full picture — explained like you are learning it for the very first time.

All Technologies

Data & Storage

Cloud Platforms

Programming & Code

DevOps & CICD

Project Management

Streaming & Messaging

BI & Visualization

FormulaHub — Built to Get You Hired

Every Role. Every Topic.
One Platform.

Designed for consultants in the USA who need to demonstrate 6 years of real IT experience. Study every module, learn to explain every concept, practice every scenario. After finishing FormulaHub you can walk into any Data Warehouse Developer, Snowflake Developer, Tableau Developer, ETL Developer, or IT Consultant interview and own it. This is your foundation — real projects and eBooks will take you further.

Modules

50+

Q&A Ready

Roles Covered

110+

Skills Tracked

Roles This Platform Prepares You For

Click any card to jump to the relevant module

🗄

Data Analyst / SQL

Queries, joins, RANK, window functions, CTEs

Core

⭐

Data Warehouse Developer

Star schema, SCD, fact/dim design, layers

Core

❄

Snowflake Developer

COPY, stages, Time Travel, streams, tasks

Core

📊

Tableau Developer

LOD, calculated fields, RLS, performance

Core

🔄

ETL / ELT Developer

Airflow, dbt, Python pipelines, CDC, batch

Core

🐍

Data Engineer

Python, Spark, Kafka, cloud, APIs, pipelines

Core

📋

JIRA / ServiceNow

Scrum, sprints, change requests, ITSM

Every Role

☁

Cloud Data Engineer

AWS/GCP/Azure, S3, Lambda, IAM, VPC

Core

📦

Storage / Formats

Parquet, JSON, Avro, Delta Lake, COPY commands

Core

⚡

Streaming Engineer

Kafka, real-time events, consumer groups

Advanced

🔥

Spark / Databricks

PySpark, distributed processing, Delta Lake

Advanced

🔒

IT Security / Compliance

VPC, IAM, PII masking, HIPAA, audit logs

Important

How to use FormulaHubWork through every module in order. In each module: read the explanation, study the tables, copy and practice the code, read the interview scenarios out loud. After all modules go to End-to-End Flow to see how everything connects. Then drill Interview Q&A daily. The goal: explain any topic for 5 minutes without notes. That is when you are ready.

SQL & Databases

The #1 most tested skill — every data role interview includes SQL problems

Why SQL is non-negotiableEvery data role — analyst, engineer, warehouse developer, Tableau developer — requires SQL. Interviewers will give you a whiteboard problem. You must write correct SQL without help. The topics below are what gets asked most often.

📊

SELECT, WHERE, GROUP BY, HAVING — the foundation

Every query starts here — understand the execution order

▾

SQL execution order — not the order you write itFROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. This is why you cannot filter on a SELECT alias in WHERE — WHERE runs before SELECT evaluates aliases.

-- Real scenario: Which regions had over $50k revenue last 30 days? -- Table: sales(id, rep_name, region, product, amount, sale_date, status) SELECT region, COUNT(*) AS total_orders, SUM(amount) AS total_revenue, ROUND(AVG(amount), 2) AS avg_order_value, MAX(amount) AS largest_order FROM sales WHERE sale_date >= CURRENT_DATE - INTERVAL '30 days' AND status = 'COMPLETED' GROUP BY region HAVING SUM(amount) > 50000 ORDER BY total_revenue DESC;

WHERE vs HAVING — the ruleWHERE filters rows BEFORE grouping. HAVING filters groups AFTER aggregation. If your filter uses SUM, COUNT, AVG, MAX, or MIN — it must go in HAVING. Raw column comparisons go in WHERE. You can use both in the same query.

🏆

RANK vs DENSE_RANK vs ROW_NUMBER — Know This Cold

Asked in 90% of data interviews — the answer must be instant

▾

Interviewers ask this every single time — memorize the difference

Function	Handles Ties?	Gap After Tie?	Always Unique?	Use When
ROW_NUMBER()	No — arbitrary order for ties	N/A	YES — always unique	Deduplication, pagination, pick one row per group
RANK()	YES — same rank for ties	YES — skips next (1,1,3)	No	Sports rankings — no 2nd place if tied for 1st
DENSE_RANK()	YES — same rank for ties	NO — no gap (1,1,2)	No	Business leaderboards — clients expect 1,2,3

-- Scenario: Top 3 sales reps per region -- Alice=$10k, Bob=$10k (tied), Carol=$8k, Dave=$7k in East region SELECT * FROM ( SELECT rep_name, region, SUM(amount) AS total_sales, ROW_NUMBER() OVER(PARTITION BY region ORDER BY SUM(amount) DESC) AS rn, RANK() OVER(PARTITION BY region ORDER BY SUM(amount) DESC) AS rnk, DENSE_RANK() OVER(PARTITION BY region ORDER BY SUM(amount) DESC) AS drnk FROM sales GROUP BY rep_name, region ) t WHERE drnk <= 3; -- East region results: -- Alice: ROW_NUMBER=1 RANK=1 DENSE_RANK=1 ($10k) -- Bob: ROW_NUMBER=2 RANK=1 DENSE_RANK=1 ($10k — tied) -- Carol: ROW_NUMBER=3 RANK=3 DENSE_RANK=2 ($8k — RANK skips 2!) -- Dave: ROW_NUMBER=4 RANK=4 DENSE_RANK=3 ($7k)

Say this in your interview"DENSE_RANK gives consecutive ranks with no gaps — if Alice and Bob tie for first, both get rank 1 and Carol gets rank 2. RANK skips — Carol would get rank 3, not 2. ROW_NUMBER always gives unique numbers even for ties. In business reporting we almost always use DENSE_RANK because clients expect 1, 2, 3 without gaps."

🔗

JOINs — All Types with Real ETL Context

The most common SQL operation — know every type cold

▾

JOIN Type	What It Returns	Real ETL Example	NULL Behavior
INNER JOIN	Only rows matching in BOTH tables	Orders that have a valid customer record — orphan orders excluded	No NULLs from join
LEFT JOIN	All left rows + matching right	All customers whether or not they have orders — most common in ETL	NULLs on right side if no match
RIGHT JOIN	All right rows + matching left	All orders even if customer record is missing	NULLs on left side if no match
FULL OUTER	All rows from both tables	Full audit — find gaps on BOTH sides simultaneously	NULLs on either side
SELF JOIN	Table joined to itself	Employees and their managers from the same table	Depends on join type used

-- LEFT JOIN: All customers + their orders (NULL for non-buyers) SELECT c.customer_id, c.name, COUNT(o.order_id) AS total_orders, COALESCE(SUM(o.amount), 0) AS lifetime_value FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id GROUP BY c.customer_id, c.name; -- Data quality check: find orphan orders (no matching customer) SELECT o.order_id, o.customer_id FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id WHERE c.customer_id IS NULL; -- these rows have no customer record

🪟

Window Functions and CTEs

Advanced SQL used daily in data engineering and analytics

▾

Window functions vs GROUP BYGROUP BY collapses rows into one per group. Window functions calculate across rows but keep every row. SUM(amount) GROUP BY region gives one row per region. SUM(amount) OVER(PARTITION BY region ORDER BY date) gives every row plus a running total per region.

-- CTE: break complex query into readable named steps WITH monthly AS ( SELECT rep_name, region, DATE_TRUNC('month', sale_date) AS month, SUM(amount) AS revenue FROM sales GROUP BY rep_name, region, DATE_TRUNC('month', sale_date) ), enriched AS ( SELECT *, SUM(revenue) OVER(PARTITION BY rep_name ORDER BY month) AS running_total, LAG(revenue, 1, 0) OVER(PARTITION BY rep_name ORDER BY month) AS prev_month, DENSE_RANK() OVER(PARTITION BY region, month ORDER BY revenue DESC) AS rank_in_region FROM monthly ) SELECT *, revenue - prev_month AS mom_change, ROUND((revenue - prev_month) * 100.0 / NULLIF(prev_month, 0), 1) AS mom_pct_change FROM enriched ORDER BY rep_name, month;

🗑

Deduplication — Critical ETL Skill

Find and remove duplicates — asked in every data engineering interview

▾

-- Step 1: Find which records are duplicated SELECT customer_id, order_date, COUNT(*) AS cnt FROM orders GROUP BY customer_id, order_date HAVING COUNT(*) > 1 ORDER BY cnt DESC; -- Step 2: Keep only the latest row per duplicate group (ROW_NUMBER method) WITH deduped AS ( SELECT *, ROW_NUMBER() OVER( PARTITION BY customer_id, order_date -- define what "duplicate" means ORDER BY created_at DESC -- keep the most recent ) AS rn FROM orders ) SELECT * FROM deduped WHERE rn = 1; -- Snowflake: create clean table directly CREATE TABLE orders_clean AS SELECT * EXCLUDE(rn) FROM ( SELECT *, ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY load_ts DESC) rn FROM orders_raw ) WHERE rn = 1;

Data Warehouse Developer

Star Schema, SCD, fact/dimension design, warehouse layers — the core of DWH developer roles

⭐

Star Schema — Design From Scratch

The most used data model in every analytics project worldwide

▾

Star Schema in one sentenceA central FACT table (what happened — transactions, clicks, orders) surrounded by DIMENSION tables (who, what, where, when). Every dimension joins to the fact via a foreign key. Called "star" because the diagram looks like one. Every modern data warehouse uses this model.

Star Schema — Retail Sales Example

dim_customer
customer_key PK
name, email
segment, region

↗

FACT_SALES

sale_key PK
customer_key FK
product_key FK
date_key FK
store_key FK

amount, quantity
discount, margin

↖

dim_product
product_key PK
name, category
brand, price_tier

dim_date
date_key PK
year, quarter, month
is_holiday, is_weekend

↑

dim_store
store_key PK
city, state
region, store_type

Concept	Fact Table	Dimension Table
Purpose	What happened — measurements and events	Context — who, what, where, when
Content	Foreign keys to dims + numeric measures	Descriptive text attributes
Size	Billions of rows — grows constantly	Thousands to millions — grows slowly
Additive?	YES — you can SUM amounts across any dimension	No — you describe and filter, not add
Example	sale_key, customer_key, amount, quantity	customer_key, name, city, segment

🔄

Slowly Changing Dimensions (SCD) — Types 1, 2, 3

What happens when dimension data changes — asked in every DWH interview

▾

SCD is asked in almost every Data Warehouse Developer interview — know all types

SCD Type	Strategy	Keeps History?	When to Use	Real Example
Type 0	Never update — keep original value forever	N/A — immutable	Immutable data	Date of birth, original signup date
Type 1	Overwrite old value with new value	NO — history lost	Typo corrections, history irrelevant	Fix customer name spelling error
Type 2	Add new row with start/end dates + is_current flag	FULL history preserved	History matters — most common in real DWH	Customer moves from NYC to Chicago
Type 3	Add column for previous value	One version back only	Only current and immediate prior matter	Track current and previous region

-- SCD Type 2: Customer C001 moved from New York to Chicago -- Step 1: Expire the current record UPDATE dim_customer SET end_date = CURRENT_DATE - 1, is_current = FALSE WHERE customer_id = 'C001' AND is_current = TRUE; -- Step 2: Insert new record with the new city INSERT INTO dim_customer (customer_key, customer_id, name, city, state, start_date, end_date, is_current) VALUES (5001, 'C001', 'Alice Smith', 'Chicago', 'IL', CURRENT_DATE, '9999-12-31', TRUE); -- Result: old NYC record is expired. New Chicago record is active. -- Fact rows from before today still JOIN to the NYC record. -- New fact rows JOIN to the Chicago record. -- Historical accuracy is perfect.

Interview answer for SCD

When asked "which SCD type did you use?" always say Type 2. Explain: "We used SCD Type 2 on the customer dimension because customers frequently changed their region assignment and our regional managers needed accurate historical reports. A sale made in January when the customer was in the East region had to remain associated with the East region even after the customer moved to the West region in March. Type 2 preserved that history perfectly."

🏗

4-Layer Warehouse Architecture

Raw → Staging → Core → Mart — the standard modern pattern used everywhere

▾

Data Platform Architecture — All Layers

SOURCE SYSTEMS
Salesforce, SAP
APIs, Databases

→

DATA LAKE
S3 / GCS / Blob
Raw Parquet files

→

RAW SCHEMA
Snowflake
1:1 typed copy

→

CORE SCHEMA
Star Schema
Fact + SCD2 Dims

→

DATA MART
Sales, Finance
pre-aggregated

→

CONSUME
Tableau, APIs
Reports

Layer	What Goes Here	Transform Level	Who Accesses
Data Lake (S3)	Files exactly as received from source systems	Zero — never touch raw	Data engineers only
Raw / Staging	Typed, validated, deduplicated copy in Snowflake	Basic — typing, nulls, dedup	Data engineers
Core / DWH	Star schema — fact tables and SCD dimensions	Full business logic	Engineers, analysts
Data Mart	Pre-aggregated, team-specific views	Aggregation, formatting	Business users, Tableau

Real project narrative — say this in your interview

"In our retail analytics project, data from 8 source systems landed in S3 as Parquet files daily via Airflow. Snowflake COPY INTO loaded them to the RAW schema unchanged — a 1-to-1 typed copy. dbt staging models cleaned and validated the data in the STAGING schema — null handling, deduplication, standardizing date formats. dbt core models built the Star Schema in the CORE schema — 3 fact tables and 6 dimension tables including SCD Type 2 on customer and product. The MART schema had pre-aggregated views for the Sales team and Finance team which Tableau connected to directly."

Snowflake Developer

Virtual warehouses, COPY INTO, Time Travel, Streams, Tasks, cloning — every Snowflake developer topic

❄

Architecture — Why Snowflake Is Different

Storage and compute are completely separate — the core innovation

▾

The Snowflake innovation that changed data warehousingTraditional databases tie storage and compute together — you cannot scale one without the other, and all users compete for the same resources. Snowflake separates them completely. Storage lives in S3-based compressed columnar format — you pay per TB stored. Compute is Virtual Warehouses — independent clusters you spin up and down in seconds, paying only per second of usage. Multiple teams query the SAME data with their OWN warehouses without competing.

-- Create separate warehouses for ETL and reporting (they don't compete) CREATE OR REPLACE WAREHOUSE etl_wh WAREHOUSE_SIZE = 'MEDIUM' MAX_CLUSTER_COUNT = 3 -- scale out for concurrent ETL AUTO_SUSPEND = 300 -- suspend after 5 min idle — save money! AUTO_RESUME = TRUE -- start automatically when query arrives INITIALLY_SUSPENDED = TRUE; -- Scale up for heavy load, then back down ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE'; -- ... run the heavy ETL job ... ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'SMALL'; -- Check credit consumption per warehouse SELECT warehouse_name, credits_used, start_time FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP) ORDER BY credits_used DESC;

📥

Stages and COPY INTO — Loading Data from S3

The primary way data enters Snowflake — every project uses this

▾

-- Step 1: Create storage integration (IAM role — no keys in code!) CREATE OR REPLACE STORAGE INTEGRATION s3_integration TYPE = EXTERNAL_STAGE STORAGE_PROVIDER = 'S3' ENABLED = TRUE STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456:role/SnowflakeRole' STORAGE_ALLOWED_LOCATIONS = ('s3://my-company-data-bucket/'); -- Step 2: Create external stage pointing to S3 CREATE OR REPLACE STAGE s3_sales_stage URL = 's3://my-company-data-bucket/sales/' STORAGE_INTEGRATION = s3_integration FILE_FORMAT = (TYPE = 'PARQUET'); -- Step 3: COPY INTO — bulk load from S3 to Snowflake table COPY INTO raw.sales_staging FROM @s3_sales_stage PATTERN = '.*sales_2024_.*\.parquet' -- only files matching this regex ON_ERROR = 'CONTINUE' -- skip bad rows, keep loading PURGE = FALSE -- keep source files in S3 FORCE = FALSE; -- don't re-load already loaded files -- Check what was loaded and if any errors occurred SELECT * FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY( TABLE_NAME => 'SALES_STAGING', START_TIME => DATEADD('hours', -24, CURRENT_TIMESTAMP) ));

🕐

Time Travel and Zero-Copy Cloning

Features with no equivalent in traditional databases — know these cold

▾

-- TIME TRAVEL: query data as it was at any past point (up to 90 days) -- See table as it was 1 hour ago SELECT * FROM sales AT(OFFSET => -3600); -- See table at exact timestamp SELECT * FROM sales AT(TIMESTAMP => '2024-01-15 09:00:00'::TIMESTAMP); -- Recover accidentally deleted rows INSERT INTO sales SELECT * FROM sales AT(OFFSET => -3600) WHERE sale_id NOT IN (SELECT sale_id FROM sales); -- ZERO-COPY CLONING: instant copy sharing same storage (no data copied!) -- Clone entire production database for dev team in seconds CREATE DATABASE dev_db CLONE prod_db; -- Clone table before a risky migration CREATE TABLE sales_backup CLONE sales; -- Clone table from yesterday for comparison CREATE TABLE sales_yesterday CLONE sales AT(OFFSET => -86400);

Interview talking point"We use Time Travel on every Snowflake project. When an analyst accidentally ran a wrong DELETE, we recovered all rows in under 5 minutes using Time Travel — no backup restore, zero downtime. We clone production to development using CREATE DATABASE dev CLONE prod — developers get the full dataset instantly with zero storage cost for unchanged data."

🔄

Streams and Tasks — Near Real-Time ELT

CDC inside Snowflake without needing Kafka or external tools

▾

Stream + Task = lightweight CDC inside SnowflakeA Stream tracks every INSERT, UPDATE, DELETE on a table since the last time you read it. A Task runs a SQL statement on a schedule. Together: Task runs every 5 minutes, checks if the stream has new records, and if yes runs a MERGE to update the fact table. Near real-time ELT without any external tools.

-- Create Stream to capture all changes on staging table CREATE OR REPLACE STREAM sales_changes ON TABLE raw.sales_staging; -- Create Task to process the stream every 5 minutes CREATE OR REPLACE TASK process_sales_changes WAREHOUSE = etl_wh SCHEDULE = '5 MINUTE' WHEN SYSTEM$STREAM_HAS_DATA('sales_changes') -- only run if data exists AS MERGE INTO core.fact_sales tgt USING ( SELECT sale_id, customer_id, amount, sale_date FROM sales_changes WHERE METADATA$ACTION = 'INSERT' ) src ON tgt.sale_id = src.sale_id WHEN MATCHED THEN UPDATE SET tgt.amount = src.amount, tgt.updated_at = CURRENT_TIMESTAMP WHEN NOT MATCHED THEN INSERT (sale_id, customer_id, amount, sale_date) VALUES (src.sale_id, src.customer_id, src.amount, src.sale_date); -- Start the task (tasks are created suspended by default) ALTER TASK process_sales_changes RESUME; -- Monitor task execution history SELECT name, state, scheduled_time, completed_time, error_message FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY()) ORDER BY scheduled_time DESC;

Tableau & Business Intelligence

LOD expressions, calculated fields, RLS, performance optimization, architecture — for Tableau developer roles

🏗

Tableau Product Family — What Each Does

Desktop, Server, Cloud, Prep, Bridge — know what each one is

▾

Product	Purpose	Who Uses It	Key Fact
Tableau Desktop	Create and design workbooks and dashboards	Tableau developers, data analysts	The main design tool — connects to any data source
Tableau Server	On-premise publishing, sharing, governance	Enterprise companies with on-prem infrastructure	You manage the servers yourself
Tableau Cloud	SaaS version of Server — hosted by Tableau	Cloud-first companies — most new projects	No server management needed
Tableau Prep	Visual ETL — clean and reshape data before Tableau	Analysts needing self-service data prep	Drag-and-drop transformations, no SQL needed
Tableau Public	Free version — all dashboards are publicly visible	Learning, building a portfolio, journalism	Free forever — use this to practice

Live vs Extract — the most important Tableau decisionLIVE connection: every click, filter, and hover fires a SQL query to Snowflake in real time. Always current data. Can be slow for complex dashboards. Use for: real-time operational dashboards. EXTRACT (.hyper file): Tableau copies data into its own in-memory columnar format. Very fast queries. Data is a snapshot — needs scheduled refresh (hourly, daily). Use for: large datasets where 1-hour-old data is acceptable. Rule: if business needs minute-by-minute data, use Live. If daily refresh is fine and performance matters, use Extract.

🧮

LOD Expressions — FIXED, INCLUDE, EXCLUDE

The hardest Tableau concept — master this and you immediately stand out

▾

LOD expressions are the #1 Tableau interview topic for developer roles

LOD Type	What It Does	When to Use
FIXED	Compute at your specified dimension regardless of view or filters	Customer lifetime value as denominator regardless of product filter
INCLUDE	Add a finer-grain dimension to the aggregation	Average order count per customer — compute per customer then average in the view
EXCLUDE	Remove a dimension from the aggregation	Show regional total while viewing by individual city

// FIXED: Customer total revenue — ignores product filter in the view Customer Total = {FIXED [Customer ID] : SUM([Revenue])} // FIXED: Product revenue as % of that customer's total spend Pct of Customer Spend = SUM([Revenue]) / {FIXED [Customer ID] : SUM([Revenue])} // FIXED: First ever purchase date per customer First Purchase Date = {FIXED [Customer ID] : MIN([Order Date])} // INCLUDE: avg order count per customer, then average that in the view Avg Orders Per Customer = {INCLUDE [Customer ID] : COUNT([Order ID])} // EXCLUDE: regional total shown even when view is drilled down to city Region Total Revenue = {EXCLUDE [City] : SUM([Revenue])} // Running total (Table Calculation — not LOD) Running Revenue Total = RUNNING_SUM(SUM([Revenue])) // Year over Year growth % YoY Growth = (SUM([Revenue]) - LOOKUP(SUM([Revenue]), -1)) / ABS(LOOKUP(SUM([Revenue]), -1)) // Conditional performance flag (for color coding) Performance Status = IF SUM([Revenue]) >= [Target] THEN "On Track" ELSEIF SUM([Revenue]) >= [Target]*0.8 THEN "At Risk" ELSE "Behind" END

LOD interview scenario — say this

"The business wanted to see each product's revenue AND what percentage of that customer's total spend it represented — not percentage of all products visible. Without FIXED LOD, TOTAL() would give percent of all products in the view. I used {FIXED [Customer ID] : SUM([Revenue])} to compute each customer's full total independently. Then divided product revenue by that fixed value. Even when a manager filters to just one product category, the denominator stays correct as the customer's full lifetime spend."

⚡

Dashboard Performance Optimization

5 techniques that separate average from senior Tableau developers

▾

Problem	Solution	Impact
Dashboard loads slowly	Switch from Live to Extract — eliminate network round trips	Often 5-10x faster
Extract too large	Pre-aggregate in Snowflake — let the warehouse do the heavy lifting	Smaller extract, faster load
Too many sheets on one dashboard	Each sheet = one query. Reduce to 6-8 maximum. Combine related metrics	Linear improvement per sheet removed
Slow COUNT DISTINCT	Compute COUNTD in Snowflake view, expose as a simple SUM in Tableau	COUNTD is very expensive in Tableau
Filters scan everything	Add Context Filters — they run first and reduce the data set before other filters	Can reduce query data by 90%
String calculations everywhere	Move all string manipulation to Snowflake — UPPER, TRIM, CONCAT are slow in Tableau	Keeps Tableau doing analytics, not cleaning

🔒

Row Level Security in Tableau

Every enterprise Tableau project needs this — managers see only their data

▾

Best practice: implement RLS in Snowflake, not TableauIf you implement RLS only in Tableau, someone can bypass it by connecting directly to Snowflake. Implement it at the Snowflake level so security is enforced regardless of which tool queries the data.

ETL & ELT Development

Batch pipelines, Airflow, dbt, parallel execution, error handling — the core of every data engineering role

🔄

ETL vs ELT — What Changed and Why

Every interviewer asks this — have a confident, clear answer ready

▾

ETL — Traditional

Flow: Source → Transform on ETL server → Load clean data

Tools: Informatica, SSIS, Talend, DataStage, Ab Initio
When to use: on-premise warehouse with limited compute
Example: SAP → Informatica cleans → Oracle Data Warehouse

ELT — Modern Cloud

Flow: Source → Load raw to warehouse → Transform inside warehouse

Tools: Fivetran + dbt, Airbyte + dbt, Python + dbt
When to use: Snowflake or BigQuery — cheap scalable compute
Example: Salesforce → Fivetran → Snowflake RAW → dbt → CORE

Why ELT won in the cloud eraCloud warehouses (Snowflake, BigQuery) have massive on-demand compute at low cost. Loading raw first means you never lose data — if transformation logic changes, re-run it against the preserved raw data. dbt adds version control, automated testing, documentation, and lineage to SQL transformations — things traditional ETL tools never had. Raw layer is your safety net.

⏱

Apache Airflow — Orchestrating Pipelines

DAGs, tasks, parallel execution, retry, alerting — production Airflow

▾

Airflow mental modelAirflow is a smart scheduler for data pipelines. You write a DAG (Python file) that defines tasks and their dependencies. Airflow does NOT process data itself — it tells other systems (Snowflake, Python scripts, dbt, S3) when and how to run. It tracks every run, retries failures automatically, and sends alerts on error.

🏗

dbt — The Modern ELT Standard

SQL transformations with testing, documentation, and lineage built in

▾

What is dbt?dbt is a transformation framework where you write SQL SELECT statements and dbt materializes them as tables or views in Snowflake. It adds: version control (SQL in Git), automated testing (not_null, unique, referential integrity), documentation with lineage graphs, modular reuse with ref(), incremental loads, and multi-environment support (dev/staging/prod). If you know dbt you look senior immediately.

Python for Data Engineering

File I/O, pandas, ETL scripts, API calls, S3, error handling — everything needed for data roles

📁

Reading and Writing Every File Format

CSV, JSON, Parquet, Excel, S3, Snowflake — daily work

▾

🔄

Production ETL Script with Logging, Retry, and Error Handling

The real pattern used on production pipelines — not textbook code

▾

APIs & REST

How modern systems communicate — every data pipeline extracts from APIs

🌐

REST API Complete Guide + Python Implementation

HTTP methods, status codes, auth, pagination, retry — production patterns

▾

HTTP Method	Action	Idempotent?	Real Example
GET	Read data — no side effects	YES	GET /api/orders?status=open&page=2
POST	Create new resource	NO — calling twice creates two records	POST /api/orders with body {customerId, items}
PUT	Replace entire resource	YES	PUT /api/customers/123 — replaces all fields
PATCH	Update specific fields only	Usually YES	PATCH /api/customers/123 — body: {status: inactive}
DELETE	Remove a resource	YES	DELETE /api/orders/456

Status Code	Meaning	What to Do
200	Success	Process the response
201	Created	Resource was successfully created
400	Bad Request	Fix your request payload or query parameters
401	Unauthorized	Check your API key or refresh your token
403	Forbidden	You don't have permission — contact API owner
429	Rate Limited	Wait and retry — use Retry-After header value
500	Server Error	Retry with backoff — not your fault

AWS, GCP & Azure — Cloud Platforms

Core services, equivalents across all three clouds, S3 deep dive, IAM — what every data engineer must know

Market share reality for US consultingAWS leads at ~32%, Azure at ~23%, GCP at ~12%. In US IT consulting, AWS S3 + Snowflake is the most common data stack. Know the services on each cloud and their equivalents. Most job descriptions list one cloud — knowing the pattern means you adapt quickly.

Category	AWS	GCP	Azure	What It Does
Object Storage	S3	Cloud Storage (GCS)	Blob Storage	Store any file at unlimited scale — the raw data lake
Compute VMs	EC2	Compute Engine	Virtual Machines	Always-on virtual servers — Airflow, ETL workers
Serverless	Lambda	Cloud Functions	Azure Functions	Run code triggered by events — no servers to manage
Managed SQL DB	RDS	Cloud SQL	Azure SQL Database	Managed PostgreSQL/MySQL — no OS patching needed
Data Warehouse	Redshift	BigQuery	Synapse Analytics	Analytical SQL at petabyte scale
Streaming	Kinesis	Pub/Sub	Event Hubs	Managed streaming — Kafka equivalent
Managed Airflow	MWAA	Cloud Composer	ADF + Airflow	Run Airflow without managing servers
Secrets	Secrets Manager	Secret Manager	Key Vault	Store API keys and passwords securely — never in code
Identity	IAM	IAM	Azure Active Directory	Control who can do what to which resource

🪣

S3 Deep Dive — Most Used Service in Data Engineering

CLI commands, boto3, partitioning strategy — used every single day

▾

S3 partitioning strategy for performanceName files like: s3://bucket/sales/year=2024/month=01/day=15/part-001.parquet — Snowflake and Spark skip entire partitions when your WHERE clause filters on date. A query for January 2024 only reads year=2024/month=01/ and ignores all other folders. This is called partition pruning and can make queries 100x faster on large datasets. Always design your S3 folder structure around how you will query the data.

Storage Formats & Data Types

CSV, JSON, Parquet, Avro, Delta Lake — why format choice matters enormously for performance and cost

Format	Type	Size vs CSV	Schema?	Best For	Common Tools
CSV	Text, row-based	1x (baseline)	No — inferred	Small files, Excel exports, human readability	Everything
JSON	Text, hierarchical	1.2–1.5x larger	No — flexible	API responses, config files, nested semi-structured data	Everything
Parquet	Binary, COLUMNAR	0.1–0.2x (5-10x smaller)	YES — embedded	Analytics, Spark, Snowflake, S3 — the standard	Spark, Snowflake, Pandas, Athena
Avro	Binary, row-based	0.3–0.5x	YES — Schema Registry	Kafka streaming, schema evolution across services	Kafka, Spark, Flink
ORC	Binary, columnar	0.08–0.15x	YES	Hive, older Hadoop ecosystem	Hive, Spark, Presto
Delta Lake	Parquet + transaction log	~0.15x	YES — enforced	ACID on data lake, upserts, time travel, Databricks	Databricks, Spark, Snowflake ext tables

📋

Why Parquet Beats CSV for Analytics — Visual Explanation

Understand this and you will never forget it

▾

CSV — Row Storage (slow for analytics)

              Row 1: id=1, name=Alice, age=30, salary=95000, dept=Eng

              Row 2: id=2, name=Bob,   age=25, salary=72000, dept=Sales

              Row 3: id=3, name=Carol, age=35, salary=110000, dept=Eng

SELECT AVG(salary) must read ALL 5 columns × every row = 500MB read

Parquet — Columnar Storage (fast for analytics)

              [id col]:     1, 2, 3

              [name col]:   Alice, Bob, Carol

              [salary col]: 95000, 72000, 110000

              [dept col]:   Eng, Sales, Eng

SELECT AVG(salary) reads ONLY salary column = 100MB. 5x less I/O!

Interview answer: "Why do you use Parquet?""We use Parquet for three reasons. First, it is columnar — analytics queries only read the columns they need, which can be 10x faster than row-based formats. Second, each column compresses independently — integers compress extremely well. A 1GB CSV typically becomes 100-200MB Parquet, saving 70-80% on S3 storage costs. Third, the schema is embedded in the file so Snowflake and Spark know the exact data types without guessing. For Kafka streaming we use Avro instead — it is row-based which is faster for individual record reads, and it works natively with Schema Registry."

Apache Kafka — Event Streaming

Real-time data pipelines — when batch ETL is not fast enough for the business

When to use Kafka vs batch ETLBatch ETL is fine when data can be hours or days old — financial reports, warehouse loads, monthly analytics. Use Kafka when data must be available in seconds — fraud detection (act before a transaction posts), real-time dashboards (live sales counter on the CEO's screen), IoT monitoring (detect machine failure as it happens), microservices decoupling (order service tells inventory and billing at the same time). Only add Kafka when real-time is genuinely required — it has operational complexity.

📨

Core Concepts — Every Term with a Business Example

Topic, Partition, Offset, Consumer Group, Broker — explained simply

▾

Concept	What It Is	Business Analogy	Real Example
Topic	Named log of messages	A TV channel — tune in to receive broadcasts	order-placed, payment-failed, user-clicked
Partition	Parallel subdivision of a topic	Highway lanes — more lanes = more traffic capacity	order-events split into 12 partitions = 12x throughput
Offset	Sequential message number in a partition (0,1,2...)	Page number in a book — bookmark where you stopped	Consumer processed up to offset 1,042,771
Producer	Application that writes messages to a topic	A broadcaster publishing to a channel	Order service publishes to order-events on every purchase
Consumer Group	Set of consumers sharing all partitions of a topic	Call center team — agents share the incoming queue	fraud-service with 3 consumers reads all 12 partitions
Broker	Kafka server that stores messages on disk	Post office — receives, stores, and delivers mail	3-broker cluster — each holds replicas for fault tolerance

Spark & Databricks

Distributed processing for large-scale data — when SQL alone is not enough

When to use Spark instead of Snowflake SQLUse Snowflake SQL for structured analytics up to a few terabytes — it is simpler and cheaper. Use Spark for: multi-terabyte processing where Snowflake becomes too slow or expensive, machine learning feature engineering at scale, complex transformations that are hard to express in SQL (graph processing, iterative algorithms), joining huge datasets across different storage systems. Databricks is managed Spark plus Delta Lake plus MLflow plus notebooks — the standard platform for large-scale data engineering at tech companies.

🔥

Spark Architecture and PySpark Code

Driver, Executors, partitions — how Spark distributes work across machines

▾

Spark Architecture

Driver
Plans the job
coordinates all

→

Cluster Manager
YARN / Kubernetes
allocates resources

→

Executor 1
Partitions 1–3

Executor 2
Partitions 4–6

Executor N
Partitions 7–N

Data is split into partitions — each executor processes its partitions in parallel

CICD & Git

How code goes from developer laptop to production safely — every professional project uses this

🚀

Complete CICD Pipeline for Data Projects

Git workflow, GitHub Actions, staging, production — step by step

▾

Write code on a feature branch

git checkout -b feature/add-customer-scd2. Write Python ETL or dbt SQL. Test locally with sample data. Never commit directly to the main branch.

Push and open a Pull Request

git add . && git commit -m "Add SCD2 for customer dimension" && git push. Open Pull Request on GitHub with description of what changed and why. Request review from tech lead.

CI runs automatically on every push

GitHub Actions triggers: run pytest unit tests, flake8 code style check, dbt compile to check SQL syntax, data quality tests on dev Snowflake. PR is blocked from merging if any check fails.

Code review and approval

Tech lead reviews for correctness, performance, security, and documentation. At least 1 approval required. After approval: merge to main triggers auto-deploy to staging environment.

Validate staging and deploy to production

Business analyst validates results in staging. QA sign-off. Create ServiceNow Change Request. After CAB approval deploy to production during the change window. Run smoke tests. Close the CR.

JIRA & ServiceNow

Project management and IT service management — knowing these signals real enterprise project experience

📋

JIRA Complete Guide — Every Term with Data Project Examples

Scrum, sprints, stories, velocity — every concept you will encounter

▾

JIRA Term	What It Is	Real Data Project Example	Who Creates It
Epic	Large body of work spanning multiple sprints	"Build Customer Analytics Platform" — takes 3 months	Product Owner / Manager
Story (User Story)	User-facing feature that fits in one sprint	"As a sales manager, I want daily revenue by region dashboard"	Product Owner
Task	Technical sub-work inside a story	"Create Snowflake star schema for revenue model"	Developer / Engineer
Bug	Something broken that needs fixing	"Dashboard shows January revenue as $0 — wrong date filter"	QA / anyone who finds it
Spike	Time-boxed research to reduce uncertainty	"Investigate how to connect Salesforce real-time to Kafka"	Tech Lead
Sprint	2-week iteration with committed stories	Sprint 14: 4 stories + 2 bugs = 38 story points committed	Scrum Master manages
Story Points	Relative effort using Fibonacci (1, 2, 3, 5, 8, 13)	Simple SQL query = 2 pts. New ETL pipeline = 8 pts.	Dev team estimates together
Velocity	Average story points completed per sprint	Team averages 40 pts/sprint — plan 40 pts per sprint	Scrum Master tracks
Backlog	All pending work not yet in a sprint	50 stories prioritized by business value, waiting to be picked	Product Owner prioritizes
Definition of Done	What "complete" means for the team	Code reviewed, tested in staging, business approved, documented	Team decides together

Perfect user story format — memorize this

AS A regional sales manager
I WANT to see daily sales by product category filtered to my region
SO THAT I can identify underperforming categories and take corrective action

ACCEPTANCE CRITERIA: Dashboard loads in under 3 seconds. Date range filter defaults to last 30 days. Pre-filtered to the logged-in manager's region. Shows revenue, order count, avg order value, and month-over-month % change. Export to Excel works. Data refreshed daily by 7 AM EST.

DEFINITION OF DONE: SQL tested with 3+ months of data. Snowflake view reviewed by tech lead. Tableau dashboard approved by product owner. Row-level security verified. Deployed to Tableau Server production. STORY POINTS: 8.

🎫

ServiceNow — IT Service Management (ITSM)

Incidents, Change Requests, CMDB, CAB — enterprise ITSM that every big company uses

▾

ServiceNow vs JIRA — they serve different purposesJIRA = project management — building new features, tracking development work. ServiceNow = IT operations — handling IT problems, managing production changes, tracking IT assets. Big enterprises use both. You use JIRA to build the pipeline. You use ServiceNow to deploy it to production safely.

ServiceNow Module	Purpose	Real Data Engineering Example
Incident	Something is broken — fix it now	Snowflake pipeline failed at 3 AM — P1 incident created, on-call engineer paged
Problem	Root cause analysis after repeated incidents	Pipeline fails every Monday — problem ticket to find the root cause permanently
Change Request	Formal approval before modifying production	Adding column to fact_sales — needs approval before deploying
CAB	Change Advisory Board — weekly meeting to approve changes	Thursday meeting: team reviews all planned changes for next week's deployment
CMDB	Inventory of all IT assets and their relationships	All Snowflake warehouses, Airflow servers, S3 buckets cataloged here
SLA	Response and resolution time commitments	P1: respond in 15 min, resolve in 4 hrs. P2: respond in 1 hr, resolve in 8 hrs

Change Request process — deploying to production

"In all our projects any change to production required a Change Request in ServiceNow. You fill out: what is changing, why, risk level (low/medium/high), rollback plan, testing evidence with screenshots, and step-by-step deployment instructions. Submit to your manager and the CAB for approval. Once approved, deploy only during the approved change window — usually Sunday 2 to 4 AM. After deployment run smoke tests: can Tableau connect, is the pipeline running, are data quality checks passing. Mark the CR as successfully implemented and close it. If anything goes wrong use the rollback plan and mark it as failed — never hide a failed deployment."

Data Quality Engineering

The most important skill nobody teaches — bad data destroys trust and costs companies millions

Why data quality is the most important skill in data engineeringA dashboard built on bad data leads to wrong decisions. If the sales team trusts a report showing $5M revenue when real revenue is $4.2M, that is a serious business problem. Interviewers love this topic — having a structured answer to data quality shows real seniority.

✅

6 Dimensions of Data Quality — The Framework

Memorize this framework and use it in every data quality conversation

▾

Dimension	Question to Ask	SQL Check Example	Alert Threshold
Completeness	Are required fields populated? All rows present?	COUNT(*) WHERE customer_id IS NULL	0 nulls on key fields
Uniqueness	Are there duplicate primary keys or duplicate records?	COUNT(*) vs COUNT(DISTINCT sale_id)	0 duplicates on primary key
Accuracy	Are values in valid ranges? Are they correct?	COUNT(*) WHERE amount <= 0 OR amount > 1000000	0 out-of-range rows
Consistency	Does count match across source and target systems?	Source count = Snowflake count ± tolerance	Alert if variance > 0.01%
Timeliness	Is the data fresh enough for the business?	DATEDIFF('hour', MAX(load_ts), NOW())	Alert if older than 4 hours
Validity	Do values match expected formats and allowed values?	status NOT IN ('active','inactive','pending')	0 invalid values

🚨

What to Do When a Production Pipeline Fails

A complete incident response process — shows real seniority

▾

Check the logs immediately

Open Airflow UI, find the failed task, click the logs. Find the exact error message and stack trace. Do not guess — the answer is always in the logs.

Classify the failure type

Transient: network timeout, Snowflake warehouse suspended, API rate limit — retry and it will work. Permanent: bad data, schema change in source, code bug — needs a fix before retrying.

Assess the impact

Which tables are affected? Which date ranges? Are any dashboards showing wrong data right now? Who needs to be notified immediately?

Contain the damage

If bad data already loaded to Snowflake: use Time Travel to restore the table to before the bad load. If data is partially loaded: quarantine the bad rows to an error table, complete the load with good rows.

Fix and re-run

Fix the root cause. Clear the failed task in Airflow and re-run. Verify the pipeline completes successfully. Run DQ checks manually to confirm data is correct.

Post-mortem and prevention

Document what happened, why, how long it took to fix. Add a data quality check that would have caught this issue earlier. Update the runbook for the on-call team. This is what separates senior engineers from junior ones.

Security & Networks

VPC, IAM, encryption, PII masking, compliance — talk about security like a senior engineer

🔒

Security Concepts Every IT Consultant Must Know

From networking to data security — with data engineering context for each

▾

Concept	Simple Explanation	Data Engineering Context
VPC	Private network inside AWS — like a gated community	All your Snowflake connections, EC2 servers, and RDS live inside a VPC — not exposed to the public internet
Subnet	Neighborhood inside the VPC with its own routing rules	Public subnet: internet-facing load balancers. Private subnet: databases and ETL servers — no direct internet access
Security Group	Firewall for individual resources — allow or deny by IP and port	RDS security group: only allow port 5432 from the Airflow server's IP address. Deny everything else.
IAM Role	Identity that grants specific permissions to a service	Airflow EC2 has an IAM role allowing S3 read + Snowflake write + CloudWatch logs. No other permissions.
Encryption at Rest	Data is encrypted when stored on disk — unreadable without the key	S3 SSE-AES256, Snowflake AES-256 by default. Even if a drive is physically stolen the data is unreadable.
Encryption in Transit	Data is encrypted while moving over the network	All Snowflake JDBC connections use TLS 1.2+. S3 SDK uses HTTPS. Never send data over plain HTTP.
Column Masking	Show partial or fake values to users without the right role	Analysts see *--1234 instead of real SSN. Only the compliance team role can see unmasked values.
Row-Level Security	Each user sees only the rows they are permitted to see	East regional manager sees only East region rows — same table, different result for each user
MFA	Second proof of identity beyond just a password	Required for any production system access — Okta or Google Authenticator app as the second factor
Audit Log	Record of every action taken on data — who did what and when	Snowflake ACCOUNT_USAGE.ACCESS_HISTORY: query who accessed which table at exactly what time

Security answer for every interview"In all our projects we follow the principle of least privilege — each service gets only the permissions it absolutely needs. Data engineers have read access to production Snowflake but write only through a dedicated service account. PII columns like SSN and date of birth have Snowflake Dynamic Data Masking — analysts see masked values, only the compliance team role sees real values. All data is encrypted at rest with AES-256 and in transit with TLS 1.2. Production access requires VPN plus MFA. Every data access is logged in Snowflake ACCOUNT_USAGE for compliance audits."

End-to-End Flow — All Dots Connected

The complete enterprise data platform — study this until you can draw and explain every layer from memory

This is the most important sectionIn every senior data interview you will be asked: "describe the data architecture of your last project." This is your answer framework. Learn it completely and adapt it to any project scenario.

Layer 1 — Source Systems (where data originates)

Salesforce CRM
Opportunities

SAP ERP
Orders, Inventory

Web App
Clickstream

3rd Party APIs
Market data

PostgreSQL DB
Operational

↓ extract via Airflow + Python, Fivetran, Kafka ↓

Layer 2 — Ingestion (how data moves)

Batch ETL
Airflow + Python
Daily at 6 AM

Managed Connectors
Fivetran / Airbyte
No-code ELT

Real-time Streaming
Apache Kafka
Milliseconds

CDC
Debezium
DB change logs

↓ land raw files ↓

Layer 3 — Data Lake (raw storage, never delete)

AWS S3 / GCP Cloud Storage / Azure Blob Storage
s3://bucket/sales/year=2024/month=01/day=15/salesforce_opps.parquet
Raw, unmodified, date-partitioned Parquet — the safety net

↓ Snowflake COPY INTO ↓

Layer 4 — Snowflake Data Warehouse (4 schemas)

RAW schema
1:1 typed copy
no business logic

→

STAGING schema
deduped, validated
null checks done

→

CORE schema
Star Schema
Fact + SCD2 Dims

→

MART schema
Sales, Finance
pre-aggregated

↓ dbt transforms (SQL in Git, tested, documented) ↓

Layer 5 — Quality and Monitoring

dbt tests
not_null, unique
relationships

SQL DQ checks
null, dup, range
reconciliation

Airflow alerts
Slack on failure
email on SLA breach

↓ serve to business ↓

Layer 6 — Consumption

Tableau
Power BI
Looker

REST APIs
for web/mobile
applications

Data Science
Jupyter, ML
models

Scheduled Reports
Excel, PDF
Email delivery

Cross-cutting concerns that wrap every layerOrchestration: Apache Airflow schedules and monitors every batch job. CICD: GitHub Actions deploys all code through test → staging → production. Security: IAM roles, VPC private subnets, TLS encryption, column masking for PII. Project Management: JIRA tracks all development work in sprints. ITSM: ServiceNow handles production incidents and change request approvals. Governance: data catalog tracks what data exists, who owns it, and how to use it.

Business Scenarios — Real Interview Discussions

Learn these stories and tell them fluently — interviewers value real project experience above everything else

🏬

Retail Analytics Platform — Full End-to-End Story

Data warehouse developer, ETL, Snowflake, Tableau combined scenario

▾

Project context — use this as your project

National retail chain with 500 stores and 3 source systems: POS transactions from store terminals, SAP ERP for inventory and procurement, Salesforce for their loyalty program. The business had no unified view — merchandising, operations, and finance each had separate spreadsheets that always disagreed.

What you built

AWS S3 raw data lake — all files land here first as Parquet files partitioned by store ID and date. Airflow on EC2 runs daily at 5 AM — POS data via SFTP pickup, SAP via REST API, Salesforce via Fivetran. Snowflake COPY INTO loads the RAW schema. dbt transforms to CORE schema — star schema with fact_sales, fact_inventory, dim_product with SCD Type 2 for price changes, dim_store, dim_customer, dim_date. Tableau connects to MART schema — 4 dashboards: Executive KPIs, Store Manager Performance, Inventory Analysis, and Loyalty Program Analytics.

Problem you solved — always include a specific problem

POS systems from different store brands sent data in incompatible formats — some pipe-delimited, some CSV with different column orders, different date formats, different NULL representations. We built a Python normalization layer that detected the format automatically and converted everything to standard Parquet before landing in S3. We added dbt tests and Great Expectations checks — if the null rate on product_id exceeded 0.1% the Airflow task failed and a Slack alert fired immediately. This reduced dashboard data errors by 95% in the first month after go-live.

🏦

Financial Services — Real-Time Fraud Detection + Compliance Reporting

Kafka real-time plus Snowflake batch — two pipelines for one business

▾

The dual requirement

Credit card company processing 2 million transactions per day. Two completely opposite requirements: fraud detection in under 500 milliseconds (real-time), and daily regulatory compliance reports that need to be available by 6 AM (batch). One solution had to serve both.

Real-time path via Kafka

Every card transaction published to Kafka topic order-events, keyed by card number so all events for one card stay in one partition and in order. Fraud detection service consumed the topic, ran rules: velocity check (5 transactions in 10 minutes), geo-anomaly (transaction in NYC then Miami 30 minutes later), and amount spike (3x the customer's average). Flagged transactions published to fraud-alerts topic. Alert service consumed fraud-alerts and sent real-time push notification or SMS to the cardholder. Full path: transaction to customer alert under 200 milliseconds.

Batch path via Snowflake

Same Kafka events landed in S3 via Kafka Connect S3 Sink connector — Parquet files created every 15 minutes. Airflow triggered COPY INTO at midnight. dbt built regulatory reporting models — daily transaction aggregates by merchant category, geographic distribution, and amount brackets — all required by the compliance team. Reports were available by 6 AM every morning and ran in 8 minutes on a Medium Snowflake virtual warehouse.

🏥

Healthcare Analytics — HIPAA Compliant Data Platform

Security, PII masking, row-level security — the compliance-heavy scenario

▾

The challenge

20-hospital network needed patient outcome analytics — readmission rates, average length of stay, procedure costs, quality metrics — while being fully HIPAA compliant. PHI (Protected Health Information) could never be exposed to analytics teams, dashboard users, or the ML team. Any breach would be a federal violation.

Security architecture we built

Snowflake Dynamic Data Masking on all 18 HIPAA identifier columns — analysts see masked patient_id and asterisks for name, SSN, and date of birth. Only the compliance team role with background checks could see unmasked values. Row-level security via Snowflake Row Access Policy — each hospital's analytics team saw only their hospital's records. All data encrypted at rest with AES-256 and in transit with TLS 1.2. Only 5 data engineers with HIPAA training had access to unmasked PHI, and only through VPN plus MFA. Every single data access logged in Snowflake ACCOUNT_USAGE for annual compliance audits.

How we handled the ML team request

The ML team wanted patient data to build a readmission prediction model but could not have access to PHI. We created a de-identified extract: replaced patient IDs with surrogate keys using a one-way SHA-256 hash, removed all 18 HIPAA identifiers, generalized age to 5-year buckets, generalized zip code to 3-digit prefix. The ML team trained their model on this dataset with zero access to PHI at any point.

Free Accounts — Practice With Real Tools

Hands-on practice is what separates people who know the concepts from people who actually get hired

Why you must set these up this weekendAnyone can say "I have Snowflake experience." Only someone who actually logged in can say "I created a Medium virtual warehouse with auto-suspend at 5 minutes, loaded Parquet files from S3 using COPY INTO with a Storage Integration, and used Time Travel to recover rows deleted by a wrong DELETE statement." That specificity wins interviews. All of these are free. Do it this weekend.

❄

Snowflake — 30 Day Trial, $400 Free Credits

Step-by-step setup and specific exercises to practice

▾

Choose Standard edition → AWS → US-East-1. No credit card required for the trial. Confirm your email and log in. You now have a full Snowflake environment with $400 in free credits.

Create your practice environment

Click Worksheets in the top menu → New Worksheet. This is your SQL editor connected to Snowflake. Run the setup SQL below to create your practice star schema.

Practice exercises — do every one of these

1) Create star schema tables and insert sample data. 2) Run RANK, DENSE_RANK, and ROW_NUMBER queries and compare the results. 3) Write window functions: SUM OVER, LAG, running total. 4) Run a DELETE statement then use Time Travel to recover the deleted rows. 5) CREATE TABLE backup CLONE fact_sales — verify it is instant. 6) Create a Stream on the staging table. 7) Create a Task to run every minute. Watch it execute.

☁

AWS, GCP, and Azure Free Tiers

All three have generous free tiers — set up all three

▾

Cloud	Sign Up URL	What Is Free	Best Practice Exercise
AWS	aws.amazon.com/free	S3 5GB forever, Lambda 1M requests/month forever, EC2 750 hrs/year	Create S3 bucket → upload a Parquet file → practice AWS CLI commands → download it back
GCP	cloud.google.com/free	$300 credit 90 days + BigQuery 1TB queries free every month forever	Query public NYC taxi dataset: 39 million rows, instant, completely free — practice window functions on real data
Azure	azure.microsoft.com/free	$200 credit 30 days + many services free 12 months	Create Blob Storage container → upload a file → use Azure Data Factory to move it

Best free practice: GCP BigQuery with public datasetsGo to console.cloud.google.com → BigQuery → Add Data → Public Datasets. Run this — completely free and real data: SELECT pickup_borough, COUNT(*) AS trips, ROUND(AVG(fare_amount),2) AS avg_fare, DENSE_RANK() OVER (ORDER BY COUNT(*) DESC) AS borough_rank FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022 GROUP BY pickup_borough ORDER BY trips DESC. That is 39 million rows processed in 3 seconds. Now practice every SQL concept you learned — RANK, window functions, CTEs, deduplication — all on real data at no cost.

Interview Q&A — 50+ Questions

Every role covered — SQL, DWH, Snowflake, Tableau, ETL, Kafka, Cloud, Data Quality, JIRA — click to reveal the full answer

SQL & Databases

What is the difference between RANK, DENSE_RANK, and ROW_NUMBER?

ROW_NUMBER: always gives unique sequential numbers — no ties. If two people have the same salary one gets 1 and the other gets 2 arbitrarily. RANK: ties get the same rank but the next number is skipped. Two people tied for 1st both get rank 1, next person gets rank 3 — not 2. DENSE_RANK: ties get the same rank with no skipping. Two tied for 1st both get 1, next person gets 2. Use DENSE_RANK for business leaderboards (clients expect 1,2,3), ROW_NUMBER to deduplicate (pick exactly one row per group).

What is the difference between WHERE and HAVING?

WHERE filters individual rows BEFORE grouping and aggregation. HAVING filters groups AFTER aggregation has been computed. The rule: if your filter condition uses an aggregate function like SUM, COUNT, AVG, MAX, or MIN — it must go in HAVING. If it compares raw column values — it goes in WHERE. You can use both in the same query. WHERE runs first and reduces the rows (faster), HAVING then filters the resulting groups.

How do you find and remove duplicate records from a table?

Find duplicates: SELECT order_id, COUNT(*) FROM orders GROUP BY order_id HAVING COUNT(*) > 1. Remove duplicates keeping the latest: use ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY created_at DESC) AS rn in a CTE, then SELECT WHERE rn = 1. In Snowflake: CREATE TABLE orders_clean AS SELECT * EXCLUDE(rn) FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY load_ts DESC) rn FROM orders_raw) WHERE rn = 1.

What is a window function and give a real business example?

A window function performs a calculation across rows related to the current row without collapsing them like GROUP BY does. The OVER() clause defines the window. Real example: SUM(revenue) OVER (PARTITION BY rep_name ORDER BY month) gives a running total of revenue per sales rep — every row stays in the result, each row shows the cumulative total up to that month. Used for: running totals, month-over-month comparisons with LAG, rankings within groups with DENSE_RANK, moving averages.

What is a CTE and when do you use it over a subquery?

A CTE (Common Table Expression) is a named temporary result defined with the WITH clause before the main SELECT. Use it to: 1) Break complex queries into readable named steps — each step has a meaningful name. 2) Reference the same subquery multiple times without repeating code. 3) Build recursive queries for hierarchies like org charts. CTEs do not improve performance over subqueries in most databases — the benefit is entirely readability and maintainability. In a team environment, readable code is as valuable as fast code.

Data Warehouse & Modeling

What is a Star Schema and why do we use it?

Star Schema has a central fact table surrounded by dimension tables — it looks like a star. The fact table stores measurable business events: sale transactions, web clicks, inventory movements. Dimension tables store descriptive context: customer name and segment, product category, store location, date attributes. Why we use it: simple joins (one join per dimension), fast query performance because most queries only need a few dimensions, and BI tools like Tableau are optimized for star schemas. Almost every modern analytics platform uses Star Schema for reporting.

What is SCD Type 2 and how do you implement it?

SCD Type 2 keeps the full history of how a dimension record changed over time. When an attribute changes (customer moves city), instead of overwriting, we add a new row. Each row has: start_date (when this version became effective), end_date (9999-12-31 for the current record), and is_current flag (TRUE for current). Example: customer moves from NYC to Chicago. Step 1: UPDATE the old row — set end_date = yesterday, is_current = FALSE. Step 2: INSERT new row with Chicago, start_date = today, end_date = 9999-12-31, is_current = TRUE. Historical fact rows still correctly join to the NYC record for old transactions. Historical accuracy is preserved perfectly.

What is the difference between a Data Lake, Data Warehouse, and Data Mart?

Data Lake: stores raw data in any format (CSV, JSON, Parquet, images, logs) in cheap object storage like S3. No schema enforced — anything goes. Schema-on-read. Data Warehouse: stores cleaned, typed, integrated data in a Star Schema optimized for SQL analytics. Schema-on-write — structure enforced when loading. Data Mart: a focused subset of the warehouse designed for one specific business team — Sales Mart, Finance Mart, Operations Mart. Order: Data Lake comes first (land everything raw). Data Warehouse comes second (curate what matters). Data Mart comes last (serve specific teams). Lake is raw materials, Warehouse is the finished goods store, Mart is the checkout aisle per department.

What is the difference between OLTP and OLAP?

OLTP (Online Transaction Processing): operational databases that run the business — fast reads and writes of individual records. Highly normalized (3rd Normal Form). Used for: web apps, order management, customer records. Examples: MySQL, PostgreSQL in your application. OLAP (Online Analytical Processing): analytical databases for understanding the business — fast aggregation of millions of rows. Denormalized Star Schema. Used for: reports, dashboards, historical analysis. Examples: Snowflake, BigQuery, Redshift. ETL or ELT moves data from OLTP to OLAP. Never run heavy analytics directly on your OLTP database — it will slow down your application.

Snowflake

How does Snowflake separate storage and compute?

Traditional databases couple storage and compute — you cannot scale one without the other and all users compete for the same resources. Snowflake separates them completely. Storage sits in S3-based compressed columnar format and you pay per TB stored. Compute is Virtual Warehouses — independent clusters you spin up in seconds and pay per second of usage. Multiple teams can query the same data with their own warehouses simultaneously without competing. Auto-suspend means you pay zero when no queries are running. This is the fundamental advantage over traditional databases like Redshift or Oracle.

What is Snowflake Time Travel and how have you used it in a real project?

Time Travel lets you query or recover data from any point in the past — up to 90 days for Enterprise tier. Syntax: SELECT * FROM sales AT(OFFSET => -3600) queries the table as it was 1 hour ago. Real uses: 1) An analyst ran a wrong DELETE — we recovered all rows in 5 minutes using Time Travel, no backup restore, zero downtime. 2) An ETL job loaded bad data — we used CREATE TABLE corrected CLONE sales AT(OFFSET => -86400) to restore from before the bad load. 3) Compliance audit — we proved exactly what data existed on a specific regulatory reporting date.

What are Snowflake Streams and Tasks?

A Stream is a CDC object that captures every INSERT, UPDATE, and DELETE on a source table since the last time the stream was consumed. It adds metadata columns: METADATA$ACTION ('INSERT' or 'DELETE'), METADATA$ISUPDATE (TRUE for updates). A Task is a scheduler that runs a SQL statement on a cron schedule or when triggered. Combining them: Task runs every 5 minutes, checks SYSTEM$STREAM_HAS_DATA(), if new records exist it runs a MERGE statement to upsert changes into the fact table. This creates near-real-time ELT entirely inside Snowflake — no Kafka or external tools required when sub-minute latency is acceptable.

Tableau & BI

What is a FIXED LOD expression and give a real business example?

FIXED LOD computes an aggregation at a dimension you specify, completely ignoring whatever dimensions or filters are in the current view. Example: dashboard shows revenue by product. Business wants to see each product's revenue as a percentage of that customer's total spend — not as a percentage of all products visible. Without FIXED, TOTAL() would give percent of all visible products. With FIXED: {FIXED [Customer ID] : SUM([Revenue])} computes each customer's full total regardless of the product filter. Divide product revenue by that FIXED value — the denominator stays correct as the full customer lifetime value even when filtering to one product category.

What is the difference between Live and Extract in Tableau?

Live connection: every user interaction — filter, click, hover — fires a SQL query to Snowflake in real time. Always shows current data. Can be slow for complex dashboards on large tables. Use for: real-time operational dashboards where data changes by the minute. Extract (.hyper file): Tableau copies data into its own in-memory columnar format at refresh time. Queries are very fast — all in memory. Data is a snapshot that needs scheduled refresh (every hour, every day). Use for: large datasets where 1-hour-old data is acceptable and performance matters. Decision rule: if the business decision changes based on data being 1 hour vs real-time fresh, use Live. If daily is fine, use Extract.

ETL / ELT & Pipelines

What is the difference between ETL and ELT and when do you use each?

ETL: Extract from source → Transform on a dedicated ETL server → Load clean data to the warehouse. Traditional approach. Tools: Informatica, SSIS, Talend. Used when: warehouse has limited compute or data must be cleansed before it can land (compliance, size). ELT: Extract from source → Load raw data directly to the cloud warehouse → Transform inside the warehouse using SQL. Modern cloud approach. Tools: dbt + Snowflake, Fivetran + BigQuery. Used because: cloud warehouse compute is cheap and scalable, raw data preserved means you can re-transform anytime, SQL transformations in Git are version controlled and testable. Most new projects use ELT.

What happens when a production pipeline fails at 3 AM — walk me through your response

1. Check the Airflow logs immediately — find the exact task that failed and the error message. Never guess. 2. Classify: transient (network timeout, warehouse suspended — retry) or permanent (bad data, schema change — needs fix first). 3. Assess scope — which tables affected, which date ranges, are any dashboards showing wrong data to business users right now. 4. Contain: if bad data loaded to Snowflake, use Time Travel to restore; if partial load, quarantine bad rows and complete the good ones. 5. Fix root cause and re-run. 6. Notify stakeholders: what failed, what data was affected, what was done to fix it, when it was resolved. 7. After resolution: add the DQ check that would have caught this, update the runbook.

What is dbt and why has it become the standard for ELT?

dbt (data build tool) is a transformation framework where you write SQL SELECT statements and dbt materializes them as tables or views in your warehouse. What makes it the standard: 1) SQL in Git — version control, code review, history for all transformations. 2) Automated testing — not_null, unique, referential integrity, accepted_values tests run on every deployment. 3) Documentation and lineage — auto-generated docs show exactly which models feed which models. 4) Modularity with ref() — models reference each other and dbt builds in the correct dependency order. 5) Incremental loads — only process new data, not full refreshes. If you know dbt you look senior immediately.

Cloud & Infrastructure

What is IAM and why is it critical for data platforms?

IAM (Identity and Access Management) controls who and what can access which AWS resources. The principle of least privilege: every identity gets only the minimum permissions it absolutely needs — nothing more. For data platforms: Airflow EC2 has an IAM role that allows S3 read and Snowflake JDBC and CloudWatch logs — nothing else. Lambda trigger has a role allowing only S3 read from one specific bucket. Why critical: if broad credentials get stolen, an attacker has access to everything. IAM roles are safer than access keys because keys can be accidentally committed to GitHub. Never hardcode AWS credentials in Python code — always use IAM roles or AWS Secrets Manager.

What is S3 and why is it used as a data lake?

S3 (Simple Storage Service) is AWS object storage — store any file (CSV, JSON, Parquet, images, logs) at any scale for very low cost (about $0.023 per GB per month). Why it works as a data lake: unlimited storage, 99.999999999% durability (11 nines), files accessible by URL, lifecycle policies (automatically archive or delete old files), event notifications (file arrives triggers a Lambda), and it integrates natively with Snowflake COPY INTO. GCP equivalent: Cloud Storage. Azure equivalent: Blob Storage. In data engineering almost every pipeline starts or ends at S3.

Kafka & Data Quality

When would you use Kafka instead of batch ETL?

Use Kafka when the business genuinely needs data in seconds or milliseconds. Real examples: fraud detection (must evaluate a transaction before it posts — batch runs hours later), real-time dashboard showing live sales on the CEO's screen during a product launch, IoT sensor monitoring where a machine failure must be detected and acted on immediately, microservices where the order service must notify inventory and billing at the exact moment of purchase. Use batch ETL when hourly or daily refresh is fine — financial reports, data warehouse loads, monthly analytics. Kafka has operational complexity and cost — only add it when real-time is truly required by the business.

What are the 6 dimensions of data quality and how do you check them?

Completeness: are all required fields populated and all expected records present? Check: COUNT(*) WHERE customer_id IS NULL — must be 0. Uniqueness: no duplicate primary keys. Check: COUNT(*) vs COUNT(DISTINCT sale_id). Accuracy: values are in valid ranges. Check: COUNT(*) WHERE amount <= 0 — must be 0. Consistency: same data agrees across systems. Check: source row count = Snowflake row count. Timeliness: data is fresh enough. Check: DATEDIFF('hour', MAX(load_ts), NOW()) — alert if > 4. Validity: values match expected formats and allowed values. Check: status NOT IN ('active','inactive','pending') — must be 0.

JIRA & ServiceNow

Walk me through how you manage a data engineering project in JIRA

Sprint planning every 2 weeks: Product Owner presents the prioritized backlog. Dev team estimates each story in planning poker using Fibonacci points — 1 for trivial, 13 for a large unknown. We commit to the stories that fit our historical velocity. Daily standups: 15 minutes — what I completed yesterday, what I am doing today, any blockers. I update JIRA tickets daily with progress comments and link pull requests. Blockers are escalated the same day — never sit on a blocker overnight. Sprint review: demonstrate working software to stakeholders — actual Tableau dashboards loading real data, not slides. Retrospective: what went well and what to improve next sprint. I track my personal velocity to give better estimates over time.

What is a ServiceNow Change Request and when do you need one?

A Change Request is a formal approval process before making any modification to production systems. It prevents unauthorized or untested changes from breaking live data and dashboards. You need one for: adding a column to a Snowflake table, deploying a new dbt model, changing an Airflow DAG schedule, updating a Tableau data source connection. The CR form includes: what is changing, why, risk level (low/medium/high), rollback plan, testing evidence with screenshots, step-by-step deployment instructions. Submit to your manager and the CAB for approval. Deploy only during the approved change window. After deployment run smoke tests and close the CR. Auditors use CR history to prove all production changes were authorized and controlled.

General IT & Architecture

Walk me through the complete architecture of your last data project end to end

"At our retail client project we had 8 source systems: Salesforce for CRM, SAP for ERP, 6 store POS systems. Data landed daily in AWS S3 as Parquet files via Airflow DAGs running at 5 AM — POS via SFTP, SAP via REST API, Salesforce via Fivetran. Snowflake COPY INTO loaded the RAW schema — exact typed copy of the source data. dbt models built the CORE schema — star schema with 3 fact tables (sales, inventory, returns) and 6 dimension tables including SCD Type 2 on customer and product dimensions. dbt tests ran automatically on every deployment. The MART schema had pre-aggregated views for the Sales team and Finance team. Tableau Desktop connected to the MART via live connection with Snowflake row access policies enforcing that each regional manager saw only their region. GitHub Actions deployed all code changes through dev → staging → production with automated tests at each stage. Airflow sent Slack alerts on any pipeline failure."

What is JSON and when do you use it in data engineering?

JSON (JavaScript Object Notation) is a text format for structured data using key-value pairs and arrays. Example: {"customerId": "C001", "orders": [{"id": 1, "amount": 99.99}]}. Used when: 1) APIs — virtually all REST APIs send and receive JSON. 2) Config files for applications and pipelines. 3) Semi-structured data with nested structures that do not fit neatly in relational tables. Challenges for data engineering: JSON is verbose (large files), has no schema enforcement (any field can appear), and nested structures need flattening for SQL analysis. In Snowflake: VARIANT column type stores and queries JSON natively — use col:fieldName syntax to extract nested values directly in SQL.

Why do we use Parquet instead of CSV for big data?

Three reasons. First, Parquet is columnar — analytics queries only read the columns they need instead of the entire row. SELECT AVG(salary) on a 10-column table reads 10% of the data versus CSV which reads 100%. For wide tables this is a 10x speed improvement. Second, column-level compression is extremely efficient — integers compress better than mixed data, strings compress with dictionary encoding. A 1GB CSV typically becomes 100-200MB Parquet, saving 70-80% in S3 storage costs. Third, the schema is embedded in the Parquet file so Snowflake and Spark know exact data types without guessing or needing an external schema definition. For Kafka streaming we use Avro instead because it is row-based and faster for writing and reading individual records.

Complete Learning Checklist

Check off each skill when you can explain it AND demonstrate it without notes

Layer	What It Does	Real Example	Data Consultant Context
HTML	Structure — the skeleton of the page	<table>, <button>, <form>	Tableau dashboard layout, report structure
CSS	Style — colors, fonts, spacing, layout	background-color, font-size, grid	Dashboard themes, branded report colors
JavaScript	Behavior — what happens when you click	onClick, fetch API calls, DOM changes	Filter a chart, call your backend API, show/hide content

Layer	Tool	What it does	Cost
Frontend	HTML/CSS/JS or Next.js	What users see	Free
Backend API	Node.js Express or Python FastAPI	Business logic, auth checks, Claude API calls	Free on Railway
Database	Supabase (PostgreSQL)	Users, subscriptions, usage logs	Free tier
Auth	Clerk or Supabase Auth	Login, signup, JWT tokens	Free tier
Payments	Stripe	$29/month subscriptions	2.9% + $0.30 per transaction
AI	Anthropic Claude API	Resume reviewer, study cards, chatbot	~$0.003 per 1k tokens
Hosting	Vercel (frontend) + Railway (backend)	Deploy in minutes, auto-scale	Free tier

Category	What Gets Checked	Common Issues Found
Impact Bullets	Does every bullet show a result, not just a task?	"Worked with data" → "Built ETL pipeline processing 2M records/day, reducing report time from 4hrs to 15min"
Keywords / ATS	Does the resume contain keywords ATS scanners look for?	Missing: Snowflake, dbt, Airflow, COPY INTO, Star Schema, SCD Type 2
Quantification	Are achievements backed by numbers?	"Improved dashboard" → "Improved dashboard load time from 45 seconds to 3 seconds"
Technical Depth	Does it show real technical knowledge, not just tool names?	"Used Snowflake" → "Designed star schema with SCD Type 2 customer dimension, 3 fact tables, serving 4 Tableau dashboards"
Experience Framing	Is 6 years of experience structured correctly for US consulting?	Missing client names, project context, business impact
Format / Scannability	Can a recruiter scan it in 6 seconds and find the key info?	No clear sections, inconsistent dates, walls of text

IT Technology Encyclopedia

Every Role. Every Topic.One Platform.

Every Role. Every Topic.
One Platform.