FH
FormulaHub
Complete IT Career Platform — USA Consulting Edition
Progress
0%
Start Here — Before Anything Else

IT Technology Encyclopedia

Before you dive into code and interviews, understand every technology from scratch. What is it? Why does it exist? Who built it? Which companies use it? Who are its competitors? Click any technology below to get the full picture — explained like you are learning it for the very first time.

All Technologies
Data & Storage
Cloud Platforms
Programming & Code
DevOps & CICD
Project Management
Streaming & Messaging
BI & Visualization
FormulaHub — Built to Get You Hired

Every Role. Every Topic.
One Platform.

Designed for consultants in the USA who need to demonstrate 6 years of real IT experience. Study every module, learn to explain every concept, practice every scenario. After finishing FormulaHub you can walk into any Data Warehouse Developer, Snowflake Developer, Tableau Developer, ETL Developer, or IT Consultant interview and own it. This is your foundation — real projects and eBooks will take you further.

20
Modules
50+
Q&A Ready
12
Roles Covered
110+
Skills Tracked
Roles This Platform Prepares You For
Click any card to jump to the relevant module
🗄
Data Analyst / SQL
Queries, joins, RANK, window functions, CTEs
Core
Data Warehouse Developer
Star schema, SCD, fact/dim design, layers
Core
Snowflake Developer
COPY, stages, Time Travel, streams, tasks
Core
📊
Tableau Developer
LOD, calculated fields, RLS, performance
Core
🔄
ETL / ELT Developer
Airflow, dbt, Python pipelines, CDC, batch
Core
🐍
Data Engineer
Python, Spark, Kafka, cloud, APIs, pipelines
Core
📋
JIRA / ServiceNow
Scrum, sprints, change requests, ITSM
Every Role
Cloud Data Engineer
AWS/GCP/Azure, S3, Lambda, IAM, VPC
Core
📦
Storage / Formats
Parquet, JSON, Avro, Delta Lake, COPY commands
Core
Streaming Engineer
Kafka, real-time events, consumer groups
Advanced
🔥
Spark / Databricks
PySpark, distributed processing, Delta Lake
Advanced
🔒
IT Security / Compliance
VPC, IAM, PII masking, HIPAA, audit logs
Important
How to use FormulaHubWork through every module in order. In each module: read the explanation, study the tables, copy and practice the code, read the interview scenarios out loud. After all modules go to End-to-End Flow to see how everything connects. Then drill Interview Q&A daily. The goal: explain any topic for 5 minutes without notes. That is when you are ready.
SQL & Databases
The #1 most tested skill — every data role interview includes SQL problems
Why SQL is non-negotiableEvery data role — analyst, engineer, warehouse developer, Tableau developer — requires SQL. Interviewers will give you a whiteboard problem. You must write correct SQL without help. The topics below are what gets asked most often.
📊
SELECT, WHERE, GROUP BY, HAVING — the foundation
Every query starts here — understand the execution order
SQL execution order — not the order you write itFROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. This is why you cannot filter on a SELECT alias in WHERE — WHERE runs before SELECT evaluates aliases.
-- Real scenario: Which regions had over $50k revenue last 30 days? -- Table: sales(id, rep_name, region, product, amount, sale_date, status) SELECT region, COUNT(*) AS total_orders, SUM(amount) AS total_revenue, ROUND(AVG(amount), 2) AS avg_order_value, MAX(amount) AS largest_order FROM sales WHERE sale_date >= CURRENT_DATE - INTERVAL '30 days' AND status = 'COMPLETED' GROUP BY region HAVING SUM(amount) > 50000 ORDER BY total_revenue DESC;
WHERE vs HAVING — the ruleWHERE filters rows BEFORE grouping. HAVING filters groups AFTER aggregation. If your filter uses SUM, COUNT, AVG, MAX, or MIN — it must go in HAVING. Raw column comparisons go in WHERE. You can use both in the same query.
🏆
RANK vs DENSE_RANK vs ROW_NUMBER — Know This Cold
Asked in 90% of data interviews — the answer must be instant
Interviewers ask this every single time — memorize the difference
FunctionHandles Ties?Gap After Tie?Always Unique?Use When
ROW_NUMBER()No — arbitrary order for tiesN/AYES — always uniqueDeduplication, pagination, pick one row per group
RANK()YES — same rank for tiesYES — skips next (1,1,3)NoSports rankings — no 2nd place if tied for 1st
DENSE_RANK()YES — same rank for tiesNO — no gap (1,1,2)NoBusiness leaderboards — clients expect 1,2,3
-- Scenario: Top 3 sales reps per region -- Alice=$10k, Bob=$10k (tied), Carol=$8k, Dave=$7k in East region SELECT * FROM ( SELECT rep_name, region, SUM(amount) AS total_sales, ROW_NUMBER() OVER(PARTITION BY region ORDER BY SUM(amount) DESC) AS rn, RANK() OVER(PARTITION BY region ORDER BY SUM(amount) DESC) AS rnk, DENSE_RANK() OVER(PARTITION BY region ORDER BY SUM(amount) DESC) AS drnk FROM sales GROUP BY rep_name, region ) t WHERE drnk <= 3; -- East region results: -- Alice: ROW_NUMBER=1 RANK=1 DENSE_RANK=1 ($10k) -- Bob: ROW_NUMBER=2 RANK=1 DENSE_RANK=1 ($10k — tied) -- Carol: ROW_NUMBER=3 RANK=3 DENSE_RANK=2 ($8k — RANK skips 2!) -- Dave: ROW_NUMBER=4 RANK=4 DENSE_RANK=3 ($7k)
Say this in your interview"DENSE_RANK gives consecutive ranks with no gaps — if Alice and Bob tie for first, both get rank 1 and Carol gets rank 2. RANK skips — Carol would get rank 3, not 2. ROW_NUMBER always gives unique numbers even for ties. In business reporting we almost always use DENSE_RANK because clients expect 1, 2, 3 without gaps."
🔗
JOINs — All Types with Real ETL Context
The most common SQL operation — know every type cold
JOIN TypeWhat It ReturnsReal ETL ExampleNULL Behavior
INNER JOINOnly rows matching in BOTH tablesOrders that have a valid customer record — orphan orders excludedNo NULLs from join
LEFT JOINAll left rows + matching rightAll customers whether or not they have orders — most common in ETLNULLs on right side if no match
RIGHT JOINAll right rows + matching leftAll orders even if customer record is missingNULLs on left side if no match
FULL OUTERAll rows from both tablesFull audit — find gaps on BOTH sides simultaneouslyNULLs on either side
SELF JOINTable joined to itselfEmployees and their managers from the same tableDepends on join type used
-- LEFT JOIN: All customers + their orders (NULL for non-buyers) SELECT c.customer_id, c.name, COUNT(o.order_id) AS total_orders, COALESCE(SUM(o.amount), 0) AS lifetime_value FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id GROUP BY c.customer_id, c.name; -- Data quality check: find orphan orders (no matching customer) SELECT o.order_id, o.customer_id FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id WHERE c.customer_id IS NULL; -- these rows have no customer record
🪟
Window Functions and CTEs
Advanced SQL used daily in data engineering and analytics
Window functions vs GROUP BYGROUP BY collapses rows into one per group. Window functions calculate across rows but keep every row. SUM(amount) GROUP BY region gives one row per region. SUM(amount) OVER(PARTITION BY region ORDER BY date) gives every row plus a running total per region.
-- CTE: break complex query into readable named steps WITH monthly AS ( SELECT rep_name, region, DATE_TRUNC('month', sale_date) AS month, SUM(amount) AS revenue FROM sales GROUP BY rep_name, region, DATE_TRUNC('month', sale_date) ), enriched AS ( SELECT *, SUM(revenue) OVER(PARTITION BY rep_name ORDER BY month) AS running_total, LAG(revenue, 1, 0) OVER(PARTITION BY rep_name ORDER BY month) AS prev_month, DENSE_RANK() OVER(PARTITION BY region, month ORDER BY revenue DESC) AS rank_in_region FROM monthly ) SELECT *, revenue - prev_month AS mom_change, ROUND((revenue - prev_month) * 100.0 / NULLIF(prev_month, 0), 1) AS mom_pct_change FROM enriched ORDER BY rep_name, month;
🗑
Deduplication — Critical ETL Skill
Find and remove duplicates — asked in every data engineering interview
-- Step 1: Find which records are duplicated SELECT customer_id, order_date, COUNT(*) AS cnt FROM orders GROUP BY customer_id, order_date HAVING COUNT(*) > 1 ORDER BY cnt DESC; -- Step 2: Keep only the latest row per duplicate group (ROW_NUMBER method) WITH deduped AS ( SELECT *, ROW_NUMBER() OVER( PARTITION BY customer_id, order_date -- define what "duplicate" means ORDER BY created_at DESC -- keep the most recent ) AS rn FROM orders ) SELECT * FROM deduped WHERE rn = 1; -- Snowflake: create clean table directly CREATE TABLE orders_clean AS SELECT * EXCLUDE(rn) FROM ( SELECT *, ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY load_ts DESC) rn FROM orders_raw ) WHERE rn = 1;
Data Warehouse Developer
Star Schema, SCD, fact/dimension design, warehouse layers — the core of DWH developer roles
Star Schema — Design From Scratch
The most used data model in every analytics project worldwide
Star Schema in one sentenceA central FACT table (what happened — transactions, clicks, orders) surrounded by DIMENSION tables (who, what, where, when). Every dimension joins to the fact via a foreign key. Called "star" because the diagram looks like one. Every modern data warehouse uses this model.
Star Schema — Retail Sales Example
dim_customer
customer_key PK
name, email
segment, region
FACT_SALES

sale_key PK
customer_key FK
product_key FK
date_key FK
store_key FK

amount, quantity
discount, margin
dim_product
product_key PK
name, category
brand, price_tier
dim_date
date_key PK
year, quarter, month
is_holiday, is_weekend
dim_store
store_key PK
city, state
region, store_type
ConceptFact TableDimension Table
PurposeWhat happened — measurements and eventsContext — who, what, where, when
ContentForeign keys to dims + numeric measuresDescriptive text attributes
SizeBillions of rows — grows constantlyThousands to millions — grows slowly
Additive?YES — you can SUM amounts across any dimensionNo — you describe and filter, not add
Examplesale_key, customer_key, amount, quantitycustomer_key, name, city, segment
🔄
Slowly Changing Dimensions (SCD) — Types 1, 2, 3
What happens when dimension data changes — asked in every DWH interview
SCD is asked in almost every Data Warehouse Developer interview — know all types
SCD TypeStrategyKeeps History?When to UseReal Example
Type 0Never update — keep original value foreverN/A — immutableImmutable dataDate of birth, original signup date
Type 1Overwrite old value with new valueNO — history lostTypo corrections, history irrelevantFix customer name spelling error
Type 2Add new row with start/end dates + is_current flagFULL history preservedHistory matters — most common in real DWHCustomer moves from NYC to Chicago
Type 3Add column for previous valueOne version back onlyOnly current and immediate prior matterTrack current and previous region
-- SCD Type 2: Customer C001 moved from New York to Chicago -- Step 1: Expire the current record UPDATE dim_customer SET end_date = CURRENT_DATE - 1, is_current = FALSE WHERE customer_id = 'C001' AND is_current = TRUE; -- Step 2: Insert new record with the new city INSERT INTO dim_customer (customer_key, customer_id, name, city, state, start_date, end_date, is_current) VALUES (5001, 'C001', 'Alice Smith', 'Chicago', 'IL', CURRENT_DATE, '9999-12-31', TRUE); -- Result: old NYC record is expired. New Chicago record is active. -- Fact rows from before today still JOIN to the NYC record. -- New fact rows JOIN to the Chicago record. -- Historical accuracy is perfect.
Interview answer for SCD
When asked "which SCD type did you use?" always say Type 2. Explain: "We used SCD Type 2 on the customer dimension because customers frequently changed their region assignment and our regional managers needed accurate historical reports. A sale made in January when the customer was in the East region had to remain associated with the East region even after the customer moved to the West region in March. Type 2 preserved that history perfectly."
🏗
4-Layer Warehouse Architecture
Raw → Staging → Core → Mart — the standard modern pattern used everywhere
Data Platform Architecture — All Layers
SOURCE SYSTEMS
Salesforce, SAP
APIs, Databases
DATA LAKE
S3 / GCS / Blob
Raw Parquet files
RAW SCHEMA
Snowflake
1:1 typed copy
CORE SCHEMA
Star Schema
Fact + SCD2 Dims
DATA MART
Sales, Finance
pre-aggregated
CONSUME
Tableau, APIs
Reports
LayerWhat Goes HereTransform LevelWho Accesses
Data Lake (S3)Files exactly as received from source systemsZero — never touch rawData engineers only
Raw / StagingTyped, validated, deduplicated copy in SnowflakeBasic — typing, nulls, dedupData engineers
Core / DWHStar schema — fact tables and SCD dimensionsFull business logicEngineers, analysts
Data MartPre-aggregated, team-specific viewsAggregation, formattingBusiness users, Tableau
Real project narrative — say this in your interview
"In our retail analytics project, data from 8 source systems landed in S3 as Parquet files daily via Airflow. Snowflake COPY INTO loaded them to the RAW schema unchanged — a 1-to-1 typed copy. dbt staging models cleaned and validated the data in the STAGING schema — null handling, deduplication, standardizing date formats. dbt core models built the Star Schema in the CORE schema — 3 fact tables and 6 dimension tables including SCD Type 2 on customer and product. The MART schema had pre-aggregated views for the Sales team and Finance team which Tableau connected to directly."
Snowflake Developer
Virtual warehouses, COPY INTO, Time Travel, Streams, Tasks, cloning — every Snowflake developer topic
Architecture — Why Snowflake Is Different
Storage and compute are completely separate — the core innovation
The Snowflake innovation that changed data warehousingTraditional databases tie storage and compute together — you cannot scale one without the other, and all users compete for the same resources. Snowflake separates them completely. Storage lives in S3-based compressed columnar format — you pay per TB stored. Compute is Virtual Warehouses — independent clusters you spin up and down in seconds, paying only per second of usage. Multiple teams query the SAME data with their OWN warehouses without competing.
-- Create separate warehouses for ETL and reporting (they don't compete) CREATE OR REPLACE WAREHOUSE etl_wh WAREHOUSE_SIZE = 'MEDIUM' MAX_CLUSTER_COUNT = 3 -- scale out for concurrent ETL AUTO_SUSPEND = 300 -- suspend after 5 min idle — save money! AUTO_RESUME = TRUE -- start automatically when query arrives INITIALLY_SUSPENDED = TRUE; -- Scale up for heavy load, then back down ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE'; -- ... run the heavy ETL job ... ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'SMALL'; -- Check credit consumption per warehouse SELECT warehouse_name, credits_used, start_time FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP) ORDER BY credits_used DESC;
📥
Stages and COPY INTO — Loading Data from S3
The primary way data enters Snowflake — every project uses this
-- Step 1: Create storage integration (IAM role — no keys in code!) CREATE OR REPLACE STORAGE INTEGRATION s3_integration TYPE = EXTERNAL_STAGE STORAGE_PROVIDER = 'S3' ENABLED = TRUE STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456:role/SnowflakeRole' STORAGE_ALLOWED_LOCATIONS = ('s3://my-company-data-bucket/'); -- Step 2: Create external stage pointing to S3 CREATE OR REPLACE STAGE s3_sales_stage URL = 's3://my-company-data-bucket/sales/' STORAGE_INTEGRATION = s3_integration FILE_FORMAT = (TYPE = 'PARQUET'); -- Step 3: COPY INTO — bulk load from S3 to Snowflake table COPY INTO raw.sales_staging FROM @s3_sales_stage PATTERN = '.*sales_2024_.*\.parquet' -- only files matching this regex ON_ERROR = 'CONTINUE' -- skip bad rows, keep loading PURGE = FALSE -- keep source files in S3 FORCE = FALSE; -- don't re-load already loaded files -- Check what was loaded and if any errors occurred SELECT * FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY( TABLE_NAME => 'SALES_STAGING', START_TIME => DATEADD('hours', -24, CURRENT_TIMESTAMP) ));
🕐
Time Travel and Zero-Copy Cloning
Features with no equivalent in traditional databases — know these cold
-- TIME TRAVEL: query data as it was at any past point (up to 90 days) -- See table as it was 1 hour ago SELECT * FROM sales AT(OFFSET => -3600); -- See table at exact timestamp SELECT * FROM sales AT(TIMESTAMP => '2024-01-15 09:00:00'::TIMESTAMP); -- Recover accidentally deleted rows INSERT INTO sales SELECT * FROM sales AT(OFFSET => -3600) WHERE sale_id NOT IN (SELECT sale_id FROM sales); -- ZERO-COPY CLONING: instant copy sharing same storage (no data copied!) -- Clone entire production database for dev team in seconds CREATE DATABASE dev_db CLONE prod_db; -- Clone table before a risky migration CREATE TABLE sales_backup CLONE sales; -- Clone table from yesterday for comparison CREATE TABLE sales_yesterday CLONE sales AT(OFFSET => -86400);
Interview talking point"We use Time Travel on every Snowflake project. When an analyst accidentally ran a wrong DELETE, we recovered all rows in under 5 minutes using Time Travel — no backup restore, zero downtime. We clone production to development using CREATE DATABASE dev CLONE prod — developers get the full dataset instantly with zero storage cost for unchanged data."
🔄
Streams and Tasks — Near Real-Time ELT
CDC inside Snowflake without needing Kafka or external tools
Stream + Task = lightweight CDC inside SnowflakeA Stream tracks every INSERT, UPDATE, DELETE on a table since the last time you read it. A Task runs a SQL statement on a schedule. Together: Task runs every 5 minutes, checks if the stream has new records, and if yes runs a MERGE to update the fact table. Near real-time ELT without any external tools.
-- Create Stream to capture all changes on staging table CREATE OR REPLACE STREAM sales_changes ON TABLE raw.sales_staging; -- Create Task to process the stream every 5 minutes CREATE OR REPLACE TASK process_sales_changes WAREHOUSE = etl_wh SCHEDULE = '5 MINUTE' WHEN SYSTEM$STREAM_HAS_DATA('sales_changes') -- only run if data exists AS MERGE INTO core.fact_sales tgt USING ( SELECT sale_id, customer_id, amount, sale_date FROM sales_changes WHERE METADATA$ACTION = 'INSERT' ) src ON tgt.sale_id = src.sale_id WHEN MATCHED THEN UPDATE SET tgt.amount = src.amount, tgt.updated_at = CURRENT_TIMESTAMP WHEN NOT MATCHED THEN INSERT (sale_id, customer_id, amount, sale_date) VALUES (src.sale_id, src.customer_id, src.amount, src.sale_date); -- Start the task (tasks are created suspended by default) ALTER TASK process_sales_changes RESUME; -- Monitor task execution history SELECT name, state, scheduled_time, completed_time, error_message FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY()) ORDER BY scheduled_time DESC;
Tableau & Business Intelligence
LOD expressions, calculated fields, RLS, performance optimization, architecture — for Tableau developer roles
🏗
Tableau Product Family — What Each Does
Desktop, Server, Cloud, Prep, Bridge — know what each one is
ProductPurposeWho Uses ItKey Fact
Tableau DesktopCreate and design workbooks and dashboardsTableau developers, data analystsThe main design tool — connects to any data source
Tableau ServerOn-premise publishing, sharing, governanceEnterprise companies with on-prem infrastructureYou manage the servers yourself
Tableau CloudSaaS version of Server — hosted by TableauCloud-first companies — most new projectsNo server management needed
Tableau PrepVisual ETL — clean and reshape data before TableauAnalysts needing self-service data prepDrag-and-drop transformations, no SQL needed
Tableau PublicFree version — all dashboards are publicly visibleLearning, building a portfolio, journalismFree forever — use this to practice
Live vs Extract — the most important Tableau decisionLIVE connection: every click, filter, and hover fires a SQL query to Snowflake in real time. Always current data. Can be slow for complex dashboards. Use for: real-time operational dashboards. EXTRACT (.hyper file): Tableau copies data into its own in-memory columnar format. Very fast queries. Data is a snapshot — needs scheduled refresh (hourly, daily). Use for: large datasets where 1-hour-old data is acceptable. Rule: if business needs minute-by-minute data, use Live. If daily refresh is fine and performance matters, use Extract.
🧮
LOD Expressions — FIXED, INCLUDE, EXCLUDE
The hardest Tableau concept — master this and you immediately stand out
LOD expressions are the #1 Tableau interview topic for developer roles
LOD TypeWhat It DoesWhen to Use
FIXEDCompute at your specified dimension regardless of view or filtersCustomer lifetime value as denominator regardless of product filter
INCLUDEAdd a finer-grain dimension to the aggregationAverage order count per customer — compute per customer then average in the view
EXCLUDERemove a dimension from the aggregationShow regional total while viewing by individual city
// FIXED: Customer total revenue — ignores product filter in the view Customer Total = {FIXED [Customer ID] : SUM([Revenue])} // FIXED: Product revenue as % of that customer's total spend Pct of Customer Spend = SUM([Revenue]) / {FIXED [Customer ID] : SUM([Revenue])} // FIXED: First ever purchase date per customer First Purchase Date = {FIXED [Customer ID] : MIN([Order Date])} // INCLUDE: avg order count per customer, then average that in the view Avg Orders Per Customer = {INCLUDE [Customer ID] : COUNT([Order ID])} // EXCLUDE: regional total shown even when view is drilled down to city Region Total Revenue = {EXCLUDE [City] : SUM([Revenue])} // Running total (Table Calculation — not LOD) Running Revenue Total = RUNNING_SUM(SUM([Revenue])) // Year over Year growth % YoY Growth = (SUM([Revenue]) - LOOKUP(SUM([Revenue]), -1)) / ABS(LOOKUP(SUM([Revenue]), -1)) // Conditional performance flag (for color coding) Performance Status = IF SUM([Revenue]) >= [Target] THEN "On Track" ELSEIF SUM([Revenue]) >= [Target]*0.8 THEN "At Risk" ELSE "Behind" END
LOD interview scenario — say this
"The business wanted to see each product's revenue AND what percentage of that customer's total spend it represented — not percentage of all products visible. Without FIXED LOD, TOTAL() would give percent of all products in the view. I used {FIXED [Customer ID] : SUM([Revenue])} to compute each customer's full total independently. Then divided product revenue by that fixed value. Even when a manager filters to just one product category, the denominator stays correct as the customer's full lifetime spend."
Dashboard Performance Optimization
5 techniques that separate average from senior Tableau developers
ProblemSolutionImpact
Dashboard loads slowlySwitch from Live to Extract — eliminate network round tripsOften 5-10x faster
Extract too largePre-aggregate in Snowflake — let the warehouse do the heavy liftingSmaller extract, faster load
Too many sheets on one dashboardEach sheet = one query. Reduce to 6-8 maximum. Combine related metricsLinear improvement per sheet removed
Slow COUNT DISTINCTCompute COUNTD in Snowflake view, expose as a simple SUM in TableauCOUNTD is very expensive in Tableau
Filters scan everythingAdd Context Filters — they run first and reduce the data set before other filtersCan reduce query data by 90%
String calculations everywhereMove all string manipulation to Snowflake — UPPER, TRIM, CONCAT are slow in TableauKeeps Tableau doing analytics, not cleaning
🔒
Row Level Security in Tableau
Every enterprise Tableau project needs this — managers see only their data
Best practice: implement RLS in Snowflake, not TableauIf you implement RLS only in Tableau, someone can bypass it by connecting directly to Snowflake. Implement it at the Snowflake level so security is enforced regardless of which tool queries the data.
ETL & ELT Development
Batch pipelines, Airflow, dbt, parallel execution, error handling — the core of every data engineering role
🔄
ETL vs ELT — What Changed and Why
Every interviewer asks this — have a confident, clear answer ready
ETL — Traditional
Flow: Source → Transform on ETL server → Load clean data

Tools: Informatica, SSIS, Talend, DataStage, Ab Initio
When to use: on-premise warehouse with limited compute
Example: SAP → Informatica cleans → Oracle Data Warehouse
ELT — Modern Cloud
Flow: Source → Load raw to warehouse → Transform inside warehouse

Tools: Fivetran + dbt, Airbyte + dbt, Python + dbt
When to use: Snowflake or BigQuery — cheap scalable compute
Example: Salesforce → Fivetran → Snowflake RAW → dbt → CORE
Why ELT won in the cloud eraCloud warehouses (Snowflake, BigQuery) have massive on-demand compute at low cost. Loading raw first means you never lose data — if transformation logic changes, re-run it against the preserved raw data. dbt adds version control, automated testing, documentation, and lineage to SQL transformations — things traditional ETL tools never had. Raw layer is your safety net.
Apache Airflow — Orchestrating Pipelines
DAGs, tasks, parallel execution, retry, alerting — production Airflow
Airflow mental modelAirflow is a smart scheduler for data pipelines. You write a DAG (Python file) that defines tasks and their dependencies. Airflow does NOT process data itself — it tells other systems (Snowflake, Python scripts, dbt, S3) when and how to run. It tracks every run, retries failures automatically, and sends alerts on error.
🏗
dbt — The Modern ELT Standard
SQL transformations with testing, documentation, and lineage built in
What is dbt?dbt is a transformation framework where you write SQL SELECT statements and dbt materializes them as tables or views in Snowflake. It adds: version control (SQL in Git), automated testing (not_null, unique, referential integrity), documentation with lineage graphs, modular reuse with ref(), incremental loads, and multi-environment support (dev/staging/prod). If you know dbt you look senior immediately.
Python for Data Engineering
File I/O, pandas, ETL scripts, API calls, S3, error handling — everything needed for data roles
📁
Reading and Writing Every File Format
CSV, JSON, Parquet, Excel, S3, Snowflake — daily work
🔄
Production ETL Script with Logging, Retry, and Error Handling
The real pattern used on production pipelines — not textbook code
APIs & REST
How modern systems communicate — every data pipeline extracts from APIs
🌐
REST API Complete Guide + Python Implementation
HTTP methods, status codes, auth, pagination, retry — production patterns
HTTP MethodActionIdempotent?Real Example
GETRead data — no side effectsYESGET /api/orders?status=open&page=2
POSTCreate new resourceNO — calling twice creates two recordsPOST /api/orders with body {customerId, items}
PUTReplace entire resourceYESPUT /api/customers/123 — replaces all fields
PATCHUpdate specific fields onlyUsually YESPATCH /api/customers/123 — body: {status: inactive}
DELETERemove a resourceYESDELETE /api/orders/456
Status CodeMeaningWhat to Do
200SuccessProcess the response
201CreatedResource was successfully created
400Bad RequestFix your request payload or query parameters
401UnauthorizedCheck your API key or refresh your token
403ForbiddenYou don't have permission — contact API owner
429Rate LimitedWait and retry — use Retry-After header value
500Server ErrorRetry with backoff — not your fault
AWS, GCP & Azure — Cloud Platforms
Core services, equivalents across all three clouds, S3 deep dive, IAM — what every data engineer must know
Market share reality for US consultingAWS leads at ~32%, Azure at ~23%, GCP at ~12%. In US IT consulting, AWS S3 + Snowflake is the most common data stack. Know the services on each cloud and their equivalents. Most job descriptions list one cloud — knowing the pattern means you adapt quickly.
CategoryAWSGCPAzureWhat It Does
Object StorageS3Cloud Storage (GCS)Blob StorageStore any file at unlimited scale — the raw data lake
Compute VMsEC2Compute EngineVirtual MachinesAlways-on virtual servers — Airflow, ETL workers
ServerlessLambdaCloud FunctionsAzure FunctionsRun code triggered by events — no servers to manage
Managed SQL DBRDSCloud SQLAzure SQL DatabaseManaged PostgreSQL/MySQL — no OS patching needed
Data WarehouseRedshiftBigQuerySynapse AnalyticsAnalytical SQL at petabyte scale
StreamingKinesisPub/SubEvent HubsManaged streaming — Kafka equivalent
Managed AirflowMWAACloud ComposerADF + AirflowRun Airflow without managing servers
SecretsSecrets ManagerSecret ManagerKey VaultStore API keys and passwords securely — never in code
IdentityIAMIAMAzure Active DirectoryControl who can do what to which resource
🪣
S3 Deep Dive — Most Used Service in Data Engineering
CLI commands, boto3, partitioning strategy — used every single day
S3 partitioning strategy for performanceName files like: s3://bucket/sales/year=2024/month=01/day=15/part-001.parquet — Snowflake and Spark skip entire partitions when your WHERE clause filters on date. A query for January 2024 only reads year=2024/month=01/ and ignores all other folders. This is called partition pruning and can make queries 100x faster on large datasets. Always design your S3 folder structure around how you will query the data.
Storage Formats & Data Types
CSV, JSON, Parquet, Avro, Delta Lake — why format choice matters enormously for performance and cost
FormatTypeSize vs CSVSchema?Best ForCommon Tools
CSVText, row-based1x (baseline)No — inferredSmall files, Excel exports, human readabilityEverything
JSONText, hierarchical1.2–1.5x largerNo — flexibleAPI responses, config files, nested semi-structured dataEverything
ParquetBinary, COLUMNAR0.1–0.2x (5-10x smaller)YES — embeddedAnalytics, Spark, Snowflake, S3 — the standardSpark, Snowflake, Pandas, Athena
AvroBinary, row-based0.3–0.5xYES — Schema RegistryKafka streaming, schema evolution across servicesKafka, Spark, Flink
ORCBinary, columnar0.08–0.15xYESHive, older Hadoop ecosystemHive, Spark, Presto
Delta LakeParquet + transaction log~0.15xYES — enforcedACID on data lake, upserts, time travel, DatabricksDatabricks, Spark, Snowflake ext tables
📋
Why Parquet Beats CSV for Analytics — Visual Explanation
Understand this and you will never forget it
CSV — Row Storage (slow for analytics)
Row 1: id=1, name=Alice, age=30, salary=95000, dept=Eng
Row 2: id=2, name=Bob, age=25, salary=72000, dept=Sales
Row 3: id=3, name=Carol, age=35, salary=110000, dept=Eng
SELECT AVG(salary) must read ALL 5 columns × every row = 500MB read
Parquet — Columnar Storage (fast for analytics)
[id col]: 1, 2, 3
[name col]: Alice, Bob, Carol
[salary col]: 95000, 72000, 110000
[dept col]: Eng, Sales, Eng
SELECT AVG(salary) reads ONLY salary column = 100MB. 5x less I/O!
Interview answer: "Why do you use Parquet?""We use Parquet for three reasons. First, it is columnar — analytics queries only read the columns they need, which can be 10x faster than row-based formats. Second, each column compresses independently — integers compress extremely well. A 1GB CSV typically becomes 100-200MB Parquet, saving 70-80% on S3 storage costs. Third, the schema is embedded in the file so Snowflake and Spark know the exact data types without guessing. For Kafka streaming we use Avro instead — it is row-based which is faster for individual record reads, and it works natively with Schema Registry."
Apache Kafka — Event Streaming
Real-time data pipelines — when batch ETL is not fast enough for the business
When to use Kafka vs batch ETLBatch ETL is fine when data can be hours or days old — financial reports, warehouse loads, monthly analytics. Use Kafka when data must be available in seconds — fraud detection (act before a transaction posts), real-time dashboards (live sales counter on the CEO's screen), IoT monitoring (detect machine failure as it happens), microservices decoupling (order service tells inventory and billing at the same time). Only add Kafka when real-time is genuinely required — it has operational complexity.
📨
Core Concepts — Every Term with a Business Example
Topic, Partition, Offset, Consumer Group, Broker — explained simply
ConceptWhat It IsBusiness AnalogyReal Example
TopicNamed log of messagesA TV channel — tune in to receive broadcastsorder-placed, payment-failed, user-clicked
PartitionParallel subdivision of a topicHighway lanes — more lanes = more traffic capacityorder-events split into 12 partitions = 12x throughput
OffsetSequential message number in a partition (0,1,2...)Page number in a book — bookmark where you stoppedConsumer processed up to offset 1,042,771
ProducerApplication that writes messages to a topicA broadcaster publishing to a channelOrder service publishes to order-events on every purchase
Consumer GroupSet of consumers sharing all partitions of a topicCall center team — agents share the incoming queuefraud-service with 3 consumers reads all 12 partitions
BrokerKafka server that stores messages on diskPost office — receives, stores, and delivers mail3-broker cluster — each holds replicas for fault tolerance
Spark & Databricks
Distributed processing for large-scale data — when SQL alone is not enough
When to use Spark instead of Snowflake SQLUse Snowflake SQL for structured analytics up to a few terabytes — it is simpler and cheaper. Use Spark for: multi-terabyte processing where Snowflake becomes too slow or expensive, machine learning feature engineering at scale, complex transformations that are hard to express in SQL (graph processing, iterative algorithms), joining huge datasets across different storage systems. Databricks is managed Spark plus Delta Lake plus MLflow plus notebooks — the standard platform for large-scale data engineering at tech companies.
🔥
Spark Architecture and PySpark Code
Driver, Executors, partitions — how Spark distributes work across machines
Spark Architecture
Driver
Plans the job
coordinates all
Cluster Manager
YARN / Kubernetes
allocates resources
Executor 1
Partitions 1–3
Executor 2
Partitions 4–6
Executor N
Partitions 7–N
Data is split into partitions — each executor processes its partitions in parallel
CICD & Git
How code goes from developer laptop to production safely — every professional project uses this
🚀
Complete CICD Pipeline for Data Projects
Git workflow, GitHub Actions, staging, production — step by step
1
Write code on a feature branch
git checkout -b feature/add-customer-scd2. Write Python ETL or dbt SQL. Test locally with sample data. Never commit directly to the main branch.
2
Push and open a Pull Request
git add . && git commit -m "Add SCD2 for customer dimension" && git push. Open Pull Request on GitHub with description of what changed and why. Request review from tech lead.
3
CI runs automatically on every push
GitHub Actions triggers: run pytest unit tests, flake8 code style check, dbt compile to check SQL syntax, data quality tests on dev Snowflake. PR is blocked from merging if any check fails.
4
Code review and approval
Tech lead reviews for correctness, performance, security, and documentation. At least 1 approval required. After approval: merge to main triggers auto-deploy to staging environment.
5
Validate staging and deploy to production
Business analyst validates results in staging. QA sign-off. Create ServiceNow Change Request. After CAB approval deploy to production during the change window. Run smoke tests. Close the CR.
JIRA & ServiceNow
Project management and IT service management — knowing these signals real enterprise project experience
📋
JIRA Complete Guide — Every Term with Data Project Examples
Scrum, sprints, stories, velocity — every concept you will encounter
JIRA TermWhat It IsReal Data Project ExampleWho Creates It
EpicLarge body of work spanning multiple sprints"Build Customer Analytics Platform" — takes 3 monthsProduct Owner / Manager
Story (User Story)User-facing feature that fits in one sprint"As a sales manager, I want daily revenue by region dashboard"Product Owner
TaskTechnical sub-work inside a story"Create Snowflake star schema for revenue model"Developer / Engineer
BugSomething broken that needs fixing"Dashboard shows January revenue as $0 — wrong date filter"QA / anyone who finds it
SpikeTime-boxed research to reduce uncertainty"Investigate how to connect Salesforce real-time to Kafka"Tech Lead
Sprint2-week iteration with committed storiesSprint 14: 4 stories + 2 bugs = 38 story points committedScrum Master manages
Story PointsRelative effort using Fibonacci (1, 2, 3, 5, 8, 13)Simple SQL query = 2 pts. New ETL pipeline = 8 pts.Dev team estimates together
VelocityAverage story points completed per sprintTeam averages 40 pts/sprint — plan 40 pts per sprintScrum Master tracks
BacklogAll pending work not yet in a sprint50 stories prioritized by business value, waiting to be pickedProduct Owner prioritizes
Definition of DoneWhat "complete" means for the teamCode reviewed, tested in staging, business approved, documentedTeam decides together
Perfect user story format — memorize this
AS A regional sales manager
I WANT to see daily sales by product category filtered to my region
SO THAT I can identify underperforming categories and take corrective action

ACCEPTANCE CRITERIA: Dashboard loads in under 3 seconds. Date range filter defaults to last 30 days. Pre-filtered to the logged-in manager's region. Shows revenue, order count, avg order value, and month-over-month % change. Export to Excel works. Data refreshed daily by 7 AM EST.

DEFINITION OF DONE: SQL tested with 3+ months of data. Snowflake view reviewed by tech lead. Tableau dashboard approved by product owner. Row-level security verified. Deployed to Tableau Server production. STORY POINTS: 8.
🎫
ServiceNow — IT Service Management (ITSM)
Incidents, Change Requests, CMDB, CAB — enterprise ITSM that every big company uses
ServiceNow vs JIRA — they serve different purposesJIRA = project management — building new features, tracking development work. ServiceNow = IT operations — handling IT problems, managing production changes, tracking IT assets. Big enterprises use both. You use JIRA to build the pipeline. You use ServiceNow to deploy it to production safely.
ServiceNow ModulePurposeReal Data Engineering Example
IncidentSomething is broken — fix it nowSnowflake pipeline failed at 3 AM — P1 incident created, on-call engineer paged
ProblemRoot cause analysis after repeated incidentsPipeline fails every Monday — problem ticket to find the root cause permanently
Change RequestFormal approval before modifying productionAdding column to fact_sales — needs approval before deploying
CABChange Advisory Board — weekly meeting to approve changesThursday meeting: team reviews all planned changes for next week's deployment
CMDBInventory of all IT assets and their relationshipsAll Snowflake warehouses, Airflow servers, S3 buckets cataloged here
SLAResponse and resolution time commitmentsP1: respond in 15 min, resolve in 4 hrs. P2: respond in 1 hr, resolve in 8 hrs
Change Request process — deploying to production
"In all our projects any change to production required a Change Request in ServiceNow. You fill out: what is changing, why, risk level (low/medium/high), rollback plan, testing evidence with screenshots, and step-by-step deployment instructions. Submit to your manager and the CAB for approval. Once approved, deploy only during the approved change window — usually Sunday 2 to 4 AM. After deployment run smoke tests: can Tableau connect, is the pipeline running, are data quality checks passing. Mark the CR as successfully implemented and close it. If anything goes wrong use the rollback plan and mark it as failed — never hide a failed deployment."
Data Quality Engineering
The most important skill nobody teaches — bad data destroys trust and costs companies millions
Why data quality is the most important skill in data engineeringA dashboard built on bad data leads to wrong decisions. If the sales team trusts a report showing $5M revenue when real revenue is $4.2M, that is a serious business problem. Interviewers love this topic — having a structured answer to data quality shows real seniority.
6 Dimensions of Data Quality — The Framework
Memorize this framework and use it in every data quality conversation
DimensionQuestion to AskSQL Check ExampleAlert Threshold
CompletenessAre required fields populated? All rows present?COUNT(*) WHERE customer_id IS NULL0 nulls on key fields
UniquenessAre there duplicate primary keys or duplicate records?COUNT(*) vs COUNT(DISTINCT sale_id)0 duplicates on primary key
AccuracyAre values in valid ranges? Are they correct?COUNT(*) WHERE amount <= 0 OR amount > 10000000 out-of-range rows
ConsistencyDoes count match across source and target systems?Source count = Snowflake count ± toleranceAlert if variance > 0.01%
TimelinessIs the data fresh enough for the business?DATEDIFF('hour', MAX(load_ts), NOW())Alert if older than 4 hours
ValidityDo values match expected formats and allowed values?status NOT IN ('active','inactive','pending')0 invalid values
🚨
What to Do When a Production Pipeline Fails
A complete incident response process — shows real seniority
1
Check the logs immediately
Open Airflow UI, find the failed task, click the logs. Find the exact error message and stack trace. Do not guess — the answer is always in the logs.
2
Classify the failure type
Transient: network timeout, Snowflake warehouse suspended, API rate limit — retry and it will work. Permanent: bad data, schema change in source, code bug — needs a fix before retrying.
3
Assess the impact
Which tables are affected? Which date ranges? Are any dashboards showing wrong data right now? Who needs to be notified immediately?
4
Contain the damage
If bad data already loaded to Snowflake: use Time Travel to restore the table to before the bad load. If data is partially loaded: quarantine the bad rows to an error table, complete the load with good rows.
5
Fix and re-run
Fix the root cause. Clear the failed task in Airflow and re-run. Verify the pipeline completes successfully. Run DQ checks manually to confirm data is correct.
6
Post-mortem and prevention
Document what happened, why, how long it took to fix. Add a data quality check that would have caught this issue earlier. Update the runbook for the on-call team. This is what separates senior engineers from junior ones.
Security & Networks
VPC, IAM, encryption, PII masking, compliance — talk about security like a senior engineer
🔒
Security Concepts Every IT Consultant Must Know
From networking to data security — with data engineering context for each
ConceptSimple ExplanationData Engineering Context
VPCPrivate network inside AWS — like a gated communityAll your Snowflake connections, EC2 servers, and RDS live inside a VPC — not exposed to the public internet
SubnetNeighborhood inside the VPC with its own routing rulesPublic subnet: internet-facing load balancers. Private subnet: databases and ETL servers — no direct internet access
Security GroupFirewall for individual resources — allow or deny by IP and portRDS security group: only allow port 5432 from the Airflow server's IP address. Deny everything else.
IAM RoleIdentity that grants specific permissions to a serviceAirflow EC2 has an IAM role allowing S3 read + Snowflake write + CloudWatch logs. No other permissions.
Encryption at RestData is encrypted when stored on disk — unreadable without the keyS3 SSE-AES256, Snowflake AES-256 by default. Even if a drive is physically stolen the data is unreadable.
Encryption in TransitData is encrypted while moving over the networkAll Snowflake JDBC connections use TLS 1.2+. S3 SDK uses HTTPS. Never send data over plain HTTP.
Column MaskingShow partial or fake values to users without the right roleAnalysts see ***-**-1234 instead of real SSN. Only the compliance team role can see unmasked values.
Row-Level SecurityEach user sees only the rows they are permitted to seeEast regional manager sees only East region rows — same table, different result for each user
MFASecond proof of identity beyond just a passwordRequired for any production system access — Okta or Google Authenticator app as the second factor
Audit LogRecord of every action taken on data — who did what and whenSnowflake ACCOUNT_USAGE.ACCESS_HISTORY: query who accessed which table at exactly what time
Security answer for every interview"In all our projects we follow the principle of least privilege — each service gets only the permissions it absolutely needs. Data engineers have read access to production Snowflake but write only through a dedicated service account. PII columns like SSN and date of birth have Snowflake Dynamic Data Masking — analysts see masked values, only the compliance team role sees real values. All data is encrypted at rest with AES-256 and in transit with TLS 1.2. Production access requires VPN plus MFA. Every data access is logged in Snowflake ACCOUNT_USAGE for compliance audits."
End-to-End Flow — All Dots Connected
The complete enterprise data platform — study this until you can draw and explain every layer from memory
This is the most important sectionIn every senior data interview you will be asked: "describe the data architecture of your last project." This is your answer framework. Learn it completely and adapt it to any project scenario.
Layer 1 — Source Systems (where data originates)
Salesforce CRM
Opportunities
SAP ERP
Orders, Inventory
Web App
Clickstream
3rd Party APIs
Market data
PostgreSQL DB
Operational
↓ extract via Airflow + Python, Fivetran, Kafka ↓
Layer 2 — Ingestion (how data moves)
Batch ETL
Airflow + Python
Daily at 6 AM
||
Managed Connectors
Fivetran / Airbyte
No-code ELT
||
Real-time Streaming
Apache Kafka
Milliseconds
||
CDC
Debezium
DB change logs
↓ land raw files ↓
Layer 3 — Data Lake (raw storage, never delete)
AWS S3 / GCP Cloud Storage / Azure Blob Storage
s3://bucket/sales/year=2024/month=01/day=15/salesforce_opps.parquet
Raw, unmodified, date-partitioned Parquet — the safety net
↓ Snowflake COPY INTO ↓
Layer 4 — Snowflake Data Warehouse (4 schemas)
RAW schema
1:1 typed copy
no business logic
STAGING schema
deduped, validated
null checks done
CORE schema
Star Schema
Fact + SCD2 Dims
MART schema
Sales, Finance
pre-aggregated
↓ dbt transforms (SQL in Git, tested, documented) ↓
Layer 5 — Quality and Monitoring
dbt tests
not_null, unique
relationships
+
SQL DQ checks
null, dup, range
reconciliation
+
Airflow alerts
Slack on failure
email on SLA breach
↓ serve to business ↓
Layer 6 — Consumption
Tableau
Power BI
Looker
REST APIs
for web/mobile
applications
Data Science
Jupyter, ML
models
Scheduled Reports
Excel, PDF
Email delivery
Cross-cutting concerns that wrap every layerOrchestration: Apache Airflow schedules and monitors every batch job. CICD: GitHub Actions deploys all code through test → staging → production. Security: IAM roles, VPC private subnets, TLS encryption, column masking for PII. Project Management: JIRA tracks all development work in sprints. ITSM: ServiceNow handles production incidents and change request approvals. Governance: data catalog tracks what data exists, who owns it, and how to use it.
Business Scenarios — Real Interview Discussions
Learn these stories and tell them fluently — interviewers value real project experience above everything else
🏬
Retail Analytics Platform — Full End-to-End Story
Data warehouse developer, ETL, Snowflake, Tableau combined scenario
Project context — use this as your project
National retail chain with 500 stores and 3 source systems: POS transactions from store terminals, SAP ERP for inventory and procurement, Salesforce for their loyalty program. The business had no unified view — merchandising, operations, and finance each had separate spreadsheets that always disagreed.
What you built
AWS S3 raw data lake — all files land here first as Parquet files partitioned by store ID and date. Airflow on EC2 runs daily at 5 AM — POS data via SFTP pickup, SAP via REST API, Salesforce via Fivetran. Snowflake COPY INTO loads the RAW schema. dbt transforms to CORE schema — star schema with fact_sales, fact_inventory, dim_product with SCD Type 2 for price changes, dim_store, dim_customer, dim_date. Tableau connects to MART schema — 4 dashboards: Executive KPIs, Store Manager Performance, Inventory Analysis, and Loyalty Program Analytics.
Problem you solved — always include a specific problem
POS systems from different store brands sent data in incompatible formats — some pipe-delimited, some CSV with different column orders, different date formats, different NULL representations. We built a Python normalization layer that detected the format automatically and converted everything to standard Parquet before landing in S3. We added dbt tests and Great Expectations checks — if the null rate on product_id exceeded 0.1% the Airflow task failed and a Slack alert fired immediately. This reduced dashboard data errors by 95% in the first month after go-live.
🏦
Financial Services — Real-Time Fraud Detection + Compliance Reporting
Kafka real-time plus Snowflake batch — two pipelines for one business
The dual requirement
Credit card company processing 2 million transactions per day. Two completely opposite requirements: fraud detection in under 500 milliseconds (real-time), and daily regulatory compliance reports that need to be available by 6 AM (batch). One solution had to serve both.
Real-time path via Kafka
Every card transaction published to Kafka topic order-events, keyed by card number so all events for one card stay in one partition and in order. Fraud detection service consumed the topic, ran rules: velocity check (5 transactions in 10 minutes), geo-anomaly (transaction in NYC then Miami 30 minutes later), and amount spike (3x the customer's average). Flagged transactions published to fraud-alerts topic. Alert service consumed fraud-alerts and sent real-time push notification or SMS to the cardholder. Full path: transaction to customer alert under 200 milliseconds.
Batch path via Snowflake
Same Kafka events landed in S3 via Kafka Connect S3 Sink connector — Parquet files created every 15 minutes. Airflow triggered COPY INTO at midnight. dbt built regulatory reporting models — daily transaction aggregates by merchant category, geographic distribution, and amount brackets — all required by the compliance team. Reports were available by 6 AM every morning and ran in 8 minutes on a Medium Snowflake virtual warehouse.
🏥
Healthcare Analytics — HIPAA Compliant Data Platform
Security, PII masking, row-level security — the compliance-heavy scenario
The challenge
20-hospital network needed patient outcome analytics — readmission rates, average length of stay, procedure costs, quality metrics — while being fully HIPAA compliant. PHI (Protected Health Information) could never be exposed to analytics teams, dashboard users, or the ML team. Any breach would be a federal violation.
Security architecture we built
Snowflake Dynamic Data Masking on all 18 HIPAA identifier columns — analysts see masked patient_id and asterisks for name, SSN, and date of birth. Only the compliance team role with background checks could see unmasked values. Row-level security via Snowflake Row Access Policy — each hospital's analytics team saw only their hospital's records. All data encrypted at rest with AES-256 and in transit with TLS 1.2. Only 5 data engineers with HIPAA training had access to unmasked PHI, and only through VPN plus MFA. Every single data access logged in Snowflake ACCOUNT_USAGE for annual compliance audits.
How we handled the ML team request
The ML team wanted patient data to build a readmission prediction model but could not have access to PHI. We created a de-identified extract: replaced patient IDs with surrogate keys using a one-way SHA-256 hash, removed all 18 HIPAA identifiers, generalized age to 5-year buckets, generalized zip code to 3-digit prefix. The ML team trained their model on this dataset with zero access to PHI at any point.
Free Accounts — Practice With Real Tools
Hands-on practice is what separates people who know the concepts from people who actually get hired
Why you must set these up this weekendAnyone can say "I have Snowflake experience." Only someone who actually logged in can say "I created a Medium virtual warehouse with auto-suspend at 5 minutes, loaded Parquet files from S3 using COPY INTO with a Storage Integration, and used Time Travel to recover rows deleted by a wrong DELETE statement." That specificity wins interviews. All of these are free. Do it this weekend.
Snowflake — 30 Day Trial, $400 Free Credits
Step-by-step setup and specific exercises to practice
1
Sign up at snowflake.com/try
Choose Standard edition → AWS → US-East-1. No credit card required for the trial. Confirm your email and log in. You now have a full Snowflake environment with $400 in free credits.
2
Create your practice environment
Click Worksheets in the top menu → New Worksheet. This is your SQL editor connected to Snowflake. Run the setup SQL below to create your practice star schema.
3
Practice exercises — do every one of these
1) Create star schema tables and insert sample data. 2) Run RANK, DENSE_RANK, and ROW_NUMBER queries and compare the results. 3) Write window functions: SUM OVER, LAG, running total. 4) Run a DELETE statement then use Time Travel to recover the deleted rows. 5) CREATE TABLE backup CLONE fact_sales — verify it is instant. 6) Create a Stream on the staging table. 7) Create a Task to run every minute. Watch it execute.
AWS, GCP, and Azure Free Tiers
All three have generous free tiers — set up all three
CloudSign Up URLWhat Is FreeBest Practice Exercise
AWSaws.amazon.com/freeS3 5GB forever, Lambda 1M requests/month forever, EC2 750 hrs/yearCreate S3 bucket → upload a Parquet file → practice AWS CLI commands → download it back
GCPcloud.google.com/free$300 credit 90 days + BigQuery 1TB queries free every month foreverQuery public NYC taxi dataset: 39 million rows, instant, completely free — practice window functions on real data
Azureazure.microsoft.com/free$200 credit 30 days + many services free 12 monthsCreate Blob Storage container → upload a file → use Azure Data Factory to move it
Best free practice: GCP BigQuery with public datasetsGo to console.cloud.google.com → BigQuery → Add Data → Public Datasets. Run this — completely free and real data: SELECT pickup_borough, COUNT(*) AS trips, ROUND(AVG(fare_amount),2) AS avg_fare, DENSE_RANK() OVER (ORDER BY COUNT(*) DESC) AS borough_rank FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022 GROUP BY pickup_borough ORDER BY trips DESC. That is 39 million rows processed in 3 seconds. Now practice every SQL concept you learned — RANK, window functions, CTEs, deduplication — all on real data at no cost.
Interview Q&A — 50+ Questions
Every role covered — SQL, DWH, Snowflake, Tableau, ETL, Kafka, Cloud, Data Quality, JIRA — click to reveal the full answer
SQL & Databases
Q
What is the difference between RANK, DENSE_RANK, and ROW_NUMBER?
+
ROW_NUMBER: always gives unique sequential numbers — no ties. If two people have the same salary one gets 1 and the other gets 2 arbitrarily. RANK: ties get the same rank but the next number is skipped. Two people tied for 1st both get rank 1, next person gets rank 3 — not 2. DENSE_RANK: ties get the same rank with no skipping. Two tied for 1st both get 1, next person gets 2. Use DENSE_RANK for business leaderboards (clients expect 1,2,3), ROW_NUMBER to deduplicate (pick exactly one row per group).
Q
What is the difference between WHERE and HAVING?
+
WHERE filters individual rows BEFORE grouping and aggregation. HAVING filters groups AFTER aggregation has been computed. The rule: if your filter condition uses an aggregate function like SUM, COUNT, AVG, MAX, or MIN — it must go in HAVING. If it compares raw column values — it goes in WHERE. You can use both in the same query. WHERE runs first and reduces the rows (faster), HAVING then filters the resulting groups.
Q
How do you find and remove duplicate records from a table?
+
Find duplicates: SELECT order_id, COUNT(*) FROM orders GROUP BY order_id HAVING COUNT(*) > 1. Remove duplicates keeping the latest: use ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY created_at DESC) AS rn in a CTE, then SELECT WHERE rn = 1. In Snowflake: CREATE TABLE orders_clean AS SELECT * EXCLUDE(rn) FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY load_ts DESC) rn FROM orders_raw) WHERE rn = 1.
Q
What is a window function and give a real business example?
+
A window function performs a calculation across rows related to the current row without collapsing them like GROUP BY does. The OVER() clause defines the window. Real example: SUM(revenue) OVER (PARTITION BY rep_name ORDER BY month) gives a running total of revenue per sales rep — every row stays in the result, each row shows the cumulative total up to that month. Used for: running totals, month-over-month comparisons with LAG, rankings within groups with DENSE_RANK, moving averages.
Q
What is a CTE and when do you use it over a subquery?
+
A CTE (Common Table Expression) is a named temporary result defined with the WITH clause before the main SELECT. Use it to: 1) Break complex queries into readable named steps — each step has a meaningful name. 2) Reference the same subquery multiple times without repeating code. 3) Build recursive queries for hierarchies like org charts. CTEs do not improve performance over subqueries in most databases — the benefit is entirely readability and maintainability. In a team environment, readable code is as valuable as fast code.
Data Warehouse & Modeling
Q
What is a Star Schema and why do we use it?
+
Star Schema has a central fact table surrounded by dimension tables — it looks like a star. The fact table stores measurable business events: sale transactions, web clicks, inventory movements. Dimension tables store descriptive context: customer name and segment, product category, store location, date attributes. Why we use it: simple joins (one join per dimension), fast query performance because most queries only need a few dimensions, and BI tools like Tableau are optimized for star schemas. Almost every modern analytics platform uses Star Schema for reporting.
Q
What is SCD Type 2 and how do you implement it?
+
SCD Type 2 keeps the full history of how a dimension record changed over time. When an attribute changes (customer moves city), instead of overwriting, we add a new row. Each row has: start_date (when this version became effective), end_date (9999-12-31 for the current record), and is_current flag (TRUE for current). Example: customer moves from NYC to Chicago. Step 1: UPDATE the old row — set end_date = yesterday, is_current = FALSE. Step 2: INSERT new row with Chicago, start_date = today, end_date = 9999-12-31, is_current = TRUE. Historical fact rows still correctly join to the NYC record for old transactions. Historical accuracy is preserved perfectly.
Q
What is the difference between a Data Lake, Data Warehouse, and Data Mart?
+
Data Lake: stores raw data in any format (CSV, JSON, Parquet, images, logs) in cheap object storage like S3. No schema enforced — anything goes. Schema-on-read. Data Warehouse: stores cleaned, typed, integrated data in a Star Schema optimized for SQL analytics. Schema-on-write — structure enforced when loading. Data Mart: a focused subset of the warehouse designed for one specific business team — Sales Mart, Finance Mart, Operations Mart. Order: Data Lake comes first (land everything raw). Data Warehouse comes second (curate what matters). Data Mart comes last (serve specific teams). Lake is raw materials, Warehouse is the finished goods store, Mart is the checkout aisle per department.
Q
What is the difference between OLTP and OLAP?
+
OLTP (Online Transaction Processing): operational databases that run the business — fast reads and writes of individual records. Highly normalized (3rd Normal Form). Used for: web apps, order management, customer records. Examples: MySQL, PostgreSQL in your application. OLAP (Online Analytical Processing): analytical databases for understanding the business — fast aggregation of millions of rows. Denormalized Star Schema. Used for: reports, dashboards, historical analysis. Examples: Snowflake, BigQuery, Redshift. ETL or ELT moves data from OLTP to OLAP. Never run heavy analytics directly on your OLTP database — it will slow down your application.
Snowflake
Q
How does Snowflake separate storage and compute?
+
Traditional databases couple storage and compute — you cannot scale one without the other and all users compete for the same resources. Snowflake separates them completely. Storage sits in S3-based compressed columnar format and you pay per TB stored. Compute is Virtual Warehouses — independent clusters you spin up in seconds and pay per second of usage. Multiple teams can query the same data with their own warehouses simultaneously without competing. Auto-suspend means you pay zero when no queries are running. This is the fundamental advantage over traditional databases like Redshift or Oracle.
Q
What is Snowflake Time Travel and how have you used it in a real project?
+
Time Travel lets you query or recover data from any point in the past — up to 90 days for Enterprise tier. Syntax: SELECT * FROM sales AT(OFFSET => -3600) queries the table as it was 1 hour ago. Real uses: 1) An analyst ran a wrong DELETE — we recovered all rows in 5 minutes using Time Travel, no backup restore, zero downtime. 2) An ETL job loaded bad data — we used CREATE TABLE corrected CLONE sales AT(OFFSET => -86400) to restore from before the bad load. 3) Compliance audit — we proved exactly what data existed on a specific regulatory reporting date.
Q
What are Snowflake Streams and Tasks?
+
A Stream is a CDC object that captures every INSERT, UPDATE, and DELETE on a source table since the last time the stream was consumed. It adds metadata columns: METADATA$ACTION ('INSERT' or 'DELETE'), METADATA$ISUPDATE (TRUE for updates). A Task is a scheduler that runs a SQL statement on a cron schedule or when triggered. Combining them: Task runs every 5 minutes, checks SYSTEM$STREAM_HAS_DATA(), if new records exist it runs a MERGE statement to upsert changes into the fact table. This creates near-real-time ELT entirely inside Snowflake — no Kafka or external tools required when sub-minute latency is acceptable.
Tableau & BI
Q
What is a FIXED LOD expression and give a real business example?
+
FIXED LOD computes an aggregation at a dimension you specify, completely ignoring whatever dimensions or filters are in the current view. Example: dashboard shows revenue by product. Business wants to see each product's revenue as a percentage of that customer's total spend — not as a percentage of all products visible. Without FIXED, TOTAL() would give percent of all visible products. With FIXED: {FIXED [Customer ID] : SUM([Revenue])} computes each customer's full total regardless of the product filter. Divide product revenue by that FIXED value — the denominator stays correct as the full customer lifetime value even when filtering to one product category.
Q
What is the difference between Live and Extract in Tableau?
+
Live connection: every user interaction — filter, click, hover — fires a SQL query to Snowflake in real time. Always shows current data. Can be slow for complex dashboards on large tables. Use for: real-time operational dashboards where data changes by the minute. Extract (.hyper file): Tableau copies data into its own in-memory columnar format at refresh time. Queries are very fast — all in memory. Data is a snapshot that needs scheduled refresh (every hour, every day). Use for: large datasets where 1-hour-old data is acceptable and performance matters. Decision rule: if the business decision changes based on data being 1 hour vs real-time fresh, use Live. If daily is fine, use Extract.
ETL / ELT & Pipelines
Q
What is the difference between ETL and ELT and when do you use each?
+
ETL: Extract from source → Transform on a dedicated ETL server → Load clean data to the warehouse. Traditional approach. Tools: Informatica, SSIS, Talend. Used when: warehouse has limited compute or data must be cleansed before it can land (compliance, size). ELT: Extract from source → Load raw data directly to the cloud warehouse → Transform inside the warehouse using SQL. Modern cloud approach. Tools: dbt + Snowflake, Fivetran + BigQuery. Used because: cloud warehouse compute is cheap and scalable, raw data preserved means you can re-transform anytime, SQL transformations in Git are version controlled and testable. Most new projects use ELT.
Q
What happens when a production pipeline fails at 3 AM — walk me through your response
+
1. Check the Airflow logs immediately — find the exact task that failed and the error message. Never guess. 2. Classify: transient (network timeout, warehouse suspended — retry) or permanent (bad data, schema change — needs fix first). 3. Assess scope — which tables affected, which date ranges, are any dashboards showing wrong data to business users right now. 4. Contain: if bad data loaded to Snowflake, use Time Travel to restore; if partial load, quarantine bad rows and complete the good ones. 5. Fix root cause and re-run. 6. Notify stakeholders: what failed, what data was affected, what was done to fix it, when it was resolved. 7. After resolution: add the DQ check that would have caught this, update the runbook.
Q
What is dbt and why has it become the standard for ELT?
+
dbt (data build tool) is a transformation framework where you write SQL SELECT statements and dbt materializes them as tables or views in your warehouse. What makes it the standard: 1) SQL in Git — version control, code review, history for all transformations. 2) Automated testing — not_null, unique, referential integrity, accepted_values tests run on every deployment. 3) Documentation and lineage — auto-generated docs show exactly which models feed which models. 4) Modularity with ref() — models reference each other and dbt builds in the correct dependency order. 5) Incremental loads — only process new data, not full refreshes. If you know dbt you look senior immediately.
Cloud & Infrastructure
Q
What is IAM and why is it critical for data platforms?
+
IAM (Identity and Access Management) controls who and what can access which AWS resources. The principle of least privilege: every identity gets only the minimum permissions it absolutely needs — nothing more. For data platforms: Airflow EC2 has an IAM role that allows S3 read and Snowflake JDBC and CloudWatch logs — nothing else. Lambda trigger has a role allowing only S3 read from one specific bucket. Why critical: if broad credentials get stolen, an attacker has access to everything. IAM roles are safer than access keys because keys can be accidentally committed to GitHub. Never hardcode AWS credentials in Python code — always use IAM roles or AWS Secrets Manager.
Q
What is S3 and why is it used as a data lake?
+
S3 (Simple Storage Service) is AWS object storage — store any file (CSV, JSON, Parquet, images, logs) at any scale for very low cost (about $0.023 per GB per month). Why it works as a data lake: unlimited storage, 99.999999999% durability (11 nines), files accessible by URL, lifecycle policies (automatically archive or delete old files), event notifications (file arrives triggers a Lambda), and it integrates natively with Snowflake COPY INTO. GCP equivalent: Cloud Storage. Azure equivalent: Blob Storage. In data engineering almost every pipeline starts or ends at S3.
Kafka & Data Quality
Q
When would you use Kafka instead of batch ETL?
+
Use Kafka when the business genuinely needs data in seconds or milliseconds. Real examples: fraud detection (must evaluate a transaction before it posts — batch runs hours later), real-time dashboard showing live sales on the CEO's screen during a product launch, IoT sensor monitoring where a machine failure must be detected and acted on immediately, microservices where the order service must notify inventory and billing at the exact moment of purchase. Use batch ETL when hourly or daily refresh is fine — financial reports, data warehouse loads, monthly analytics. Kafka has operational complexity and cost — only add it when real-time is truly required by the business.
Q
What are the 6 dimensions of data quality and how do you check them?
+
Completeness: are all required fields populated and all expected records present? Check: COUNT(*) WHERE customer_id IS NULL — must be 0. Uniqueness: no duplicate primary keys. Check: COUNT(*) vs COUNT(DISTINCT sale_id). Accuracy: values are in valid ranges. Check: COUNT(*) WHERE amount <= 0 — must be 0. Consistency: same data agrees across systems. Check: source row count = Snowflake row count. Timeliness: data is fresh enough. Check: DATEDIFF('hour', MAX(load_ts), NOW()) — alert if > 4. Validity: values match expected formats and allowed values. Check: status NOT IN ('active','inactive','pending') — must be 0.
JIRA & ServiceNow
Q
Walk me through how you manage a data engineering project in JIRA
+
Sprint planning every 2 weeks: Product Owner presents the prioritized backlog. Dev team estimates each story in planning poker using Fibonacci points — 1 for trivial, 13 for a large unknown. We commit to the stories that fit our historical velocity. Daily standups: 15 minutes — what I completed yesterday, what I am doing today, any blockers. I update JIRA tickets daily with progress comments and link pull requests. Blockers are escalated the same day — never sit on a blocker overnight. Sprint review: demonstrate working software to stakeholders — actual Tableau dashboards loading real data, not slides. Retrospective: what went well and what to improve next sprint. I track my personal velocity to give better estimates over time.
Q
What is a ServiceNow Change Request and when do you need one?
+
A Change Request is a formal approval process before making any modification to production systems. It prevents unauthorized or untested changes from breaking live data and dashboards. You need one for: adding a column to a Snowflake table, deploying a new dbt model, changing an Airflow DAG schedule, updating a Tableau data source connection. The CR form includes: what is changing, why, risk level (low/medium/high), rollback plan, testing evidence with screenshots, step-by-step deployment instructions. Submit to your manager and the CAB for approval. Deploy only during the approved change window. After deployment run smoke tests and close the CR. Auditors use CR history to prove all production changes were authorized and controlled.
General IT & Architecture
Q
Walk me through the complete architecture of your last data project end to end
+
"At our retail client project we had 8 source systems: Salesforce for CRM, SAP for ERP, 6 store POS systems. Data landed daily in AWS S3 as Parquet files via Airflow DAGs running at 5 AM — POS via SFTP, SAP via REST API, Salesforce via Fivetran. Snowflake COPY INTO loaded the RAW schema — exact typed copy of the source data. dbt models built the CORE schema — star schema with 3 fact tables (sales, inventory, returns) and 6 dimension tables including SCD Type 2 on customer and product dimensions. dbt tests ran automatically on every deployment. The MART schema had pre-aggregated views for the Sales team and Finance team. Tableau Desktop connected to the MART via live connection with Snowflake row access policies enforcing that each regional manager saw only their region. GitHub Actions deployed all code changes through dev → staging → production with automated tests at each stage. Airflow sent Slack alerts on any pipeline failure."
Q
What is JSON and when do you use it in data engineering?
+
JSON (JavaScript Object Notation) is a text format for structured data using key-value pairs and arrays. Example: {"customerId": "C001", "orders": [{"id": 1, "amount": 99.99}]}. Used when: 1) APIs — virtually all REST APIs send and receive JSON. 2) Config files for applications and pipelines. 3) Semi-structured data with nested structures that do not fit neatly in relational tables. Challenges for data engineering: JSON is verbose (large files), has no schema enforcement (any field can appear), and nested structures need flattening for SQL analysis. In Snowflake: VARIANT column type stores and queries JSON natively — use col:fieldName syntax to extract nested values directly in SQL.
Q
Why do we use Parquet instead of CSV for big data?
+
Three reasons. First, Parquet is columnar — analytics queries only read the columns they need instead of the entire row. SELECT AVG(salary) on a 10-column table reads 10% of the data versus CSV which reads 100%. For wide tables this is a 10x speed improvement. Second, column-level compression is extremely efficient — integers compress better than mixed data, strings compress with dictionary encoding. A 1GB CSV typically becomes 100-200MB Parquet, saving 70-80% in S3 storage costs. Third, the schema is embedded in the Parquet file so Snowflake and Spark know exact data types without guessing or needing an external schema definition. For Kafka streaming we use Avro instead because it is row-based and faster for writing and reading individual records.
Complete Learning Checklist
Check off each skill when you can explain it AND demonstrate it without notes
Frontend Development
What it is, how it connects to backend, what you need to know as a data consultant
What is Frontend?Frontend is everything the user sees and interacts with in a browser — the buttons, forms, tables, charts. It runs in the user's browser, not on a server. HTML structures the content, CSS styles it, JavaScript makes it interactive. As a data consultant you don't need to build frontends — but you MUST understand how they work because every dashboard, every data app, every reporting tool is frontend talking to your backend data.
🌐
HTML, CSS, JavaScript — The Three Layers
Every website in the world is built on exactly these three things
LayerWhat It DoesReal ExampleData Consultant Context
HTMLStructure — the skeleton of the page<table>, <button>, <form>Tableau dashboard layout, report structure
CSSStyle — colors, fonts, spacing, layoutbackground-color, font-size, gridDashboard themes, branded report colors
JavaScriptBehavior — what happens when you clickonClick, fetch API calls, DOM changesFilter a chart, call your backend API, show/hide content
The key mental modelUser clicks a button in the browser (Frontend JavaScript) → JavaScript calls your API endpoint (HTTP request to Backend) → Backend queries Snowflake → Returns JSON data → JavaScript renders the data as a chart. This is how EVERY modern web application and dashboard works. Understanding this flow is what separates consultants who can communicate with developers from those who cannot.
<button onclick="loadSalesData()">Load Sales</button> <div id="result"></div> <script> async function loadSalesData() { // Frontend calls the backend API const response = await fetch('/api/sales?region=east&month=2024-01'); const data = await response.json(); // Render the data — in real life this would be a chart document.getElementById('result').innerHTML = data.map(row => `<p>${row.region}: $${row.revenue}</p>`).join(''); } </script>
Cursor — AI Code Editor for IT Consultants
How to write production code by describing what you want
What is Cursor?Cursor is a code editor (like VS Code) with Claude/GPT built in. You describe what you want in plain English and it writes the code. For IT consultants this is transformative — you can produce working Python ETL scripts, dbt models, SQL queries, and even React components without being a developer. Install it free at cursor.sh.
1
Install Cursor — cursor.sh — free
Download and install. It looks identical to VS Code. Open a new folder. Press Ctrl+K or Cmd+K to open the AI chat panel.
2
Describe what you want in plain English
Type: "Write a Python script that reads a CSV from S3, filters rows where amount is greater than 1000, and writes the result as Parquet back to S3 with proper logging." Cursor writes the complete, working code.
3
Ask it to explain, fix, or improve
Select any code and press Ctrl+K: "Explain this line by line" or "Add retry logic" or "Make this work with Snowflake instead of PostgreSQL". It refactors it instantly.
4
Use it for interview prep
Open a blank file, type a SQL question like "Write the SQL for top 3 salespeople per region using DENSE_RANK", let Cursor write it, then read and understand it. You learn 10x faster when you see working code immediately.
The Cursor interview talking point"In our project I use Cursor to accelerate development — I describe the transformation logic in plain English and Cursor generates the dbt SQL model or Airflow DAG. I then review, test, and refine it. It cuts boilerplate time by 70% so I can focus on the business logic and data quality."
🔌
How Frontend Talks to Backend — The Full Picture
REST APIs, JSON, fetch — this is how every modern app works
Complete Request-Response Cycle
Browser
User clicks
"Load Report"
HTTP Request
GET /api/sales
Bearer token
Backend API
Node.js / Python
Validates auth
Snowflake
Runs SQL
Returns rows
JSON Response
[{region,revenue}]
back to browser
Backend Development
APIs, servers, databases, Python calling Java, authentication — the engine behind every data platform
What is Backend?Backend is the code that runs on a server, not in the browser. It handles business logic, talks to databases, enforces security, processes files, and exposes APIs that frontend and other services call. Every data pipeline, every ETL job, every Snowflake query runs on the backend. This is where IT consultants live.
Calling Java Code from Python — Real Pattern
Legacy systems use Java — data engineers need to trigger them from Python
Why this comes up in consultingMany large enterprises (banks, insurance, healthcare) have critical Java applications running legacy business logic. Your modern Python ETL pipeline may need to trigger a Java process — to run a calculation, call a legacy API, or execute a Java-based transformation. You have three clean approaches.
Interview answer for "how do you integrate with legacy Java systems?""We use one of three patterns depending on the situation. If it's a standalone JAR we run it via Python subprocess and capture the output. If we need two-way object passing we use Py4J which creates a bridge between Python and a running JVM. For services that expose HTTP endpoints we simply call them with the requests library like any other REST API. In all cases we wrap the call in retry logic with exponential backoff and log the result to our monitoring pipeline."
🛠
Backend Stack for FormulaHub-style Projects
Node.js / Python FastAPI + Supabase + Stripe + Railway
LayerToolWhat it doesCost
FrontendHTML/CSS/JS or Next.jsWhat users seeFree
Backend APINode.js Express or Python FastAPIBusiness logic, auth checks, Claude API callsFree on Railway
DatabaseSupabase (PostgreSQL)Users, subscriptions, usage logsFree tier
AuthClerk or Supabase AuthLogin, signup, JWT tokensFree tier
PaymentsStripe$29/month subscriptions2.9% + $0.30 per transaction
AIAnthropic Claude APIResume reviewer, study cards, chatbot~$0.003 per 1k tokens
HostingVercel (frontend) + Railway (backend)Deploy in minutes, auto-scaleFree tier
Commands Reference
Every command you will actually use — copy, run, understand. S3, Snowflake, Git, Python, Linux.
🪣
S3 Commands — AWS CLI & Python boto3
The commands data engineers run every single day
Snowflake COPY Commands — Loading Data
Every pattern you need — CSV, Parquet, JSON, error handling
🐍
Python Commands — Data Engineering Daily Use
pandas, file ops, environment, debugging — the complete toolkit
Linux / Terminal Commands for Data Engineers
Navigate servers, manage files, run jobs — the commands you need on any server
Snowflake SQL Commands — Day-to-Day Operations
The commands Snowflake developers run to manage and debug their platform
Resume Reviewer & Builder
Paste your resume — get a full review, fix everything, then generate a beautiful ATS-optimized final version
🔑
Pro Feature — Powered by Claude AI
This tool uses the Claude API to review and rewrite your resume. In the full version this requires a Pro subscription ($29/month).
To connect this to Claude: 1) Add your Anthropic API key to your backend environment. 2) The backend calls Claude with the resume text. 3) Claude returns a structured review JSON. 4) The frontend renders it. The prompt and UI below are production-ready — backend just needs to be wired in.
Step 1 — Paste Your Current Resume
What Claude Reviews In Your Resume
CategoryWhat Gets CheckedCommon Issues Found
Impact BulletsDoes every bullet show a result, not just a task?"Worked with data" → "Built ETL pipeline processing 2M records/day, reducing report time from 4hrs to 15min"
Keywords / ATSDoes the resume contain keywords ATS scanners look for?Missing: Snowflake, dbt, Airflow, COPY INTO, Star Schema, SCD Type 2
QuantificationAre achievements backed by numbers?"Improved dashboard" → "Improved dashboard load time from 45 seconds to 3 seconds"
Technical DepthDoes it show real technical knowledge, not just tool names?"Used Snowflake" → "Designed star schema with SCD Type 2 customer dimension, 3 fact tables, serving 4 Tableau dashboards"
Experience FramingIs 6 years of experience structured correctly for US consulting?Missing client names, project context, business impact
Format / ScannabilityCan a recruiter scan it in 6 seconds and find the key info?No clear sections, inconsistent dates, walls of text
The Claude Prompt Used (for your Cursor backend)
FormulaHub Pro
Upgrade to unlock everything — resume builder, study chatbot, 200+ Q&A, mock interviews
Free Forever
$0
No credit card required
✓ Tech Encyclopedia (18 technologies)
✓ SQL, DWH, Snowflake, Tableau modules
✓ ETL, Python, Cloud, Kafka basics
✓ Commands Reference
✓ 10 sample Interview Q&A per topic
✓ Learning Checklist
✗ Resume Reviewer & Builder
✗ Full 200+ Interview Q&A bank
✗ Study Card AI Chatbot
✗ Mock Interview Generator
✗ Business Scenarios deep dives
MOST POPULAR
Pro — Get Hired Faster
$29/month
or $199/year — save 43%
✓ Everything in Free
AI Resume Reviewer + Builder
Study Card AI Chatbot
Full 200+ Interview Q&A
Mock Interview Generator
Business Scenarios deep dives
Frontend + Backend module
✓ Priority support
How to implement Stripe payments — give this to Cursor1) Install Stripe: npm install stripe. 2) Create a price in Stripe dashboard ($29/month recurring). 3) Backend: POST /api/checkout creates a Stripe Checkout session and returns the URL. 4) Frontend: redirect user to that URL. 5) Stripe handles card input. 6) After payment: Stripe webhook calls POST /api/webhooks/stripe. 7) Your backend marks user as 'pro' in Supabase. 8) Every API call checks user's plan before serving premium content. This is 2-3 days of backend work and the complete pattern is in the Backend module.
Stripe Implementation — Give This File to Cursor
AI Study Cards — Claude Chatbot
Ask any IT topic — get a beautiful formatted study card instantly. Pro feature powered by Claude.
Try example prompts:
How to wire this to Claude in your backendThe button above calls /api/study-card with the prompt. Your backend sends it to Claude with the system prompt below. Claude returns a structured JSON card. The frontend renders it with colors and formatting. This entire flow is 50 lines of backend code.
Complete Learning Checklist
Check off each skill when you can explain it AND demonstrate it without notes