Foundations, platform concepts, interview patterns, and project roadmap

Mindset

  • Think in data systems, not just code.
  • Trace every flow as input → processing → storage → serving → monitoring.
  • Optimize for reliability, lineage, cost, and trust.
  • Learn one cloud deeply and map the patterns elsewhere.
Always Ask
  • What is the source?
  • What is the grain?
  • What is the schema?
  • What is the SLA?
  • Who consumes it?

Core Tools

  • SQL: joins, windows, CTEs, incremental logic
  • Python: packaging, file I/O, APIs, local transforms
  • Linux + CLI: bash, env vars, cron, SSH, permissions
  • Git: branching, PRs, rebase/merge, CI basics
  • Cloud: compute, object storage, IAM, secrets, logs, cost
  • Data modeling: grain, facts, dimensions, marts
  • Distributed systems: partitions, shuffles, retries, idempotency
  • Observability: logs, metrics, freshness, row counts, runbooks

SQL

Fundamentals

  • SELECT, WHERE, GROUP BY, HAVING, ORDER BY
  • INNER, LEFT, and FULL joins
  • UNION vs UNION ALL
  • CASE WHEN, subqueries, CTEs, date logic, NULL handling
  • DDL vs DML, views, materialized views, temp tables
Window Functions
  • ROW_NUMBER, RANK, DENSE_RANK
  • LAG, LEAD, running totals
  • Understand partitioning vs ordering inside windows
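The partition-vs-order distinction can be sketched with SQLite's built-in window functions (table and column names here are illustrative, not from the outline): PARTITION BY resets the window per group, ORDER BY sequences rows inside it.

```python
import sqlite3

# Hypothetical "orders" table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("a", "2024-01-01", 10.0), ("a", "2024-01-05", 20.0),
     ("b", "2024-01-02", 15.0), ("b", "2024-01-03", 5.0)],
)

# PARTITION BY customer restarts numbering per customer; ORDER BY
# order_date drives ROW_NUMBER, LAG, and the running SUM inside it.
rows = conn.execute("""
    SELECT customer,
           order_date,
           ROW_NUMBER() OVER (PARTITION BY customer ORDER BY order_date) AS rn,
           LAG(amount)  OVER (PARTITION BY customer ORDER BY order_date) AS prev_amount,
           SUM(amount)  OVER (PARTITION BY customer ORDER BY order_date) AS running_total
    FROM orders
""").fetchall()
for r in rows:
    print(r)
```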

SQL Interview Patterns

  • Deduplicate data
  • Top N per group
  • Sessionization awareness
  • Funnel and cohort logic awareness
  • Slowly changing joins
  • Incremental upserts / merge logic
  • Data reconciliation queries
  • Root-cause debugging queries
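Deduplication and top-N-per-group share one interview-staple shape: rank inside a partition, then filter on the rank. A minimal sketch in SQLite (table and column names are made up):

```python
import sqlite3

# Keep the latest row per key with ROW_NUMBER over a descending order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event_ts TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("u1", "2024-01-01", "old"), ("u1", "2024-01-02", "new"),
     ("u2", "2024-01-01", "only")],
)

latest = conn.execute("""
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY event_ts DESC
               ) AS rn
        FROM events
    )
    SELECT user_id, event_ts, payload FROM ranked WHERE rn = 1
""").fetchall()
print(latest)
```

Swap `WHERE rn = 1` for `WHERE rn <= 3` and the same query becomes top-3 per group.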

Python

  • Functions, classes, packages
  • Virtual environments
  • File I/O
  • JSON, CSV, Parquet basics
  • Requests / APIs
  • Pandas for local prototyping
  • PySpark basics later

Data Shapes & Formats

  • Structured: stable table schemas
  • Semi-structured: JSON, nested records, arrays
  • Unstructured references: documents, media, logs around metadata
File Formats
  • CSV: simple, universal, weak typing
  • JSON: flexible, nested, common for APIs
  • Parquet: columnar, compressed, analytics-friendly
  • Avro: row-oriented, schema-aware; common for event transport
  • Iceberg / Delta: schema evolution, partition evolution, snapshots, ACID semantics
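The "CSV: weak typing" point is easy to see with the standard library alone (Parquet or Avro would keep a real schema, but need extra packages, so this sketch sticks to csv and json):

```python
import csv, io, json

# CSV round-trip loses types: everything comes back as a string.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "amount", "active"])
writer.writerow([1, 12.5, True])

buf.seek(0)
row = next(csv.DictReader(buf))
print(row)  # {'id': '1', 'amount': '12.5', 'active': 'True'} — all strings

# JSON preserves basic types (though not dates or decimals).
decoded = json.loads(json.dumps({"id": 1, "amount": 12.5, "active": True}))
print(type(decoded["amount"]))
```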

Storage Patterns

  • Raw landing zone
  • Cleaned canonical layer
  • Curated serving layer
  • Archival / cold storage
  • Metadata catalog

Data Modeling

Core Concepts

  • Understand business grain first
  • Operational schema vs analytical schema
  • Fact tables, dimension tables, star schemas
  • Slowly changing dimensions, surrogate keys, conformed dimensions
  • Snapshotting, incremental models, data marts
Deep Dive
  • One fact row = one business event or one snapshot
  • Dimensions describe who / what / where / when
  • Degenerate and junk dimensions awareness
  • SCD Type 1 vs Type 2
  • Bridge tables, aggregate tables, partitioning large facts
  • Avoid wide unmodeled chaos in the serving layer
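A minimal star-schema sketch, runnable in SQLite (table names, keys, and the sale grain are illustrative): one fact row per sale event, a dimension keyed by a surrogate key, and a typical aggregate query over the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY,   -- surrogate key
        customer_id TEXT,                  -- natural / business key
        region      TEXT
    );
    CREATE TABLE fact_sales (
        customer_sk INTEGER,               -- FK into dim_customer
        sale_date   TEXT,
        amount      REAL                   -- grain: one row per sale
    );
    INSERT INTO dim_customer VALUES (1, 'c-100', 'EU'), (2, 'c-200', 'US');
    INSERT INTO fact_sales VALUES (1, '2024-01-01', 10.0),
                                  (1, '2024-01-02', 20.0),
                                  (2, '2024-01-01', 5.0);
""")

totals = conn.execute("""
    SELECT d.region, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer d USING (customer_sk)
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
print(totals)  # [('EU', 30.0), ('US', 5.0)]
```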

Source Systems

  • OLTP databases
  • APIs
  • Files from vendors
  • Event streams
  • SaaS tools
  • Logs and clickstream
  • IoT / telemetry
  • CDC awareness

Ingestion

  • Batch ingestion, streaming ingestion, micro-batch
  • Change data capture, full refresh, incremental extract
  • Schema registry awareness and catalog registration
Batch
  • Scheduled pulls
  • File drops
  • Warehouse loads
Streaming
  • Kafka topics, partitions, offsets, consumer groups
  • Insert / update / delete semantics
  • Deduplication strategy
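One common deduplication strategy for at-least-once delivery, sketched in plain Python (the event shape and key choice are illustrative assumptions): the broker may redeliver, so the consumer drops anything whose stable event key was already seen.

```python
# Deduplicate redelivered events on a stable (entity, event) key before
# applying insert / update / delete writes downstream.
def dedupe(events):
    seen = set()
    for event in events:
        key = (event["entity_id"], event["event_id"])
        if key in seen:
            continue  # redelivered duplicate — skip
        seen.add(key)
        yield event

events = [
    {"entity_id": "a", "event_id": 1, "op": "insert"},
    {"entity_id": "a", "event_id": 1, "op": "insert"},  # duplicate delivery
    {"entity_id": "a", "event_id": 2, "op": "update"},
]
print(list(dedupe(events)))  # two unique events survive
```

In production the "seen" set is usually bounded (a TTL store or a windowed state), since an unbounded set grows forever.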

Metadata & Quality

Catalog
  • Table names, schemas, owners, lineage
  • Classification / sensitivity
  • Freshness
  • Glue Catalog or Unity Catalog awareness
Data Quality
  • Schema validation
  • Null, uniqueness, and referential-integrity checks
  • Accepted values and business-rule validation
  • Freshness tests and volume anomaly checks
  • Quarantine bad records
  • Checkpoint validation
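The quality checks above can be sketched as a tiny validation pass (rules and row shape are example assumptions): failing rows are routed to a quarantine list with their error reasons instead of failing the whole load.

```python
rows = [
    {"id": 1,    "status": "active",  "amount": 10.0},
    {"id": 2,    "status": "unknown", "amount": 5.0},   # bad accepted value
    {"id": None, "status": "active",  "amount": 3.0},   # null key
]

ACCEPTED_STATUSES = {"active", "inactive"}

good, quarantine = [], []
seen_ids = set()
for row in rows:
    errors = []
    if row["id"] is None:
        errors.append("null id")
    elif row["id"] in seen_ids:
        errors.append("duplicate id")
    if row["status"] not in ACCEPTED_STATUSES:
        errors.append("unexpected status")
    if errors:
        quarantine.append({"row": row, "errors": errors})
    else:
        seen_ids.add(row["id"])
        good.append(row)

print(len(good), len(quarantine))  # 1 2
```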

Spark

  • Spark SQL and DataFrames
  • Transformations vs actions
  • Lazy execution
  • Catalyst / Tungsten awareness
  • Partitions, shuffle, caching, broadcast joins
  • UDF caution
  • PySpark
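The transformations-vs-actions idea can be shown without a cluster. This is a toy illustration, not Spark: `map` and `filter` only record a plan, and nothing executes until an action like `collect()` runs it.

```python
class LazyDataset:
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []          # recorded transformations

    def map(self, fn):                  # transformation: extends the plan
        return LazyDataset(self.data, self.plan + [("map", fn)])

    def filter(self, fn):               # transformation: extends the plan
        return LazyDataset(self.data, self.plan + [("filter", fn)])

    def collect(self):                  # action: executes the whole plan
        out = self.data
        for kind, fn in self.plan:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(len(ds.plan))   # plan recorded, nothing computed yet
print(ds.collect())   # [20, 30, 40]
```

Laziness is what lets Spark's Catalyst optimizer see the whole plan before running it; this sketch skips optimization entirely.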
Plan Debugging
  • Start with EXPLAIN or df.explain("formatted")
  • Use Spark UI to inspect stages, tasks, shuffle read/write, spill, and skewed partitions
  • AQE can change join strategy and coalesce partitions at runtime
  • Check exchange nodes, file scans, and broadcast decisions before tuning blindly
Structured Streaming
  • Sources, sinks, triggers
  • Watermarks
  • Stateful ops
  • Checkpointing
  • Late / out-of-order events
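Watermarking can be sketched in a few lines (this shows the idea behind Spark's `withWatermark`, not its implementation): track the max event time seen, and drop events that fall behind it by more than the allowed lateness.

```python
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(minutes=10)  # illustrative threshold

def process(events):
    max_event_time = datetime.min
    kept, dropped = [], []
    for ts, payload in events:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - ALLOWED_LATENESS
        (kept if ts >= watermark else dropped).append(payload)
    return kept, dropped

t0 = datetime(2024, 1, 1, 12, 0)
events = [
    (t0, "a"),
    (t0 + timedelta(minutes=30), "b"),        # advances the watermark
    (t0 + timedelta(minutes=5), "too-late"),  # now behind the watermark
]
kept, dropped = process(events)
print(kept, dropped)  # ['a', 'b'] ['too-late']
```

The tradeoff is visible here too: a larger allowed lateness keeps more late events but forces the engine to hold more state.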
Common Performance Issues
  • Small files
  • Skew
  • Too many shuffles
  • Bad partitioning
  • Wide transformations
  • Unbounded state

Databricks

  • Workspace basics, notebooks, clusters / compute, jobs, repos
  • Delta Lake
  • Unity Catalog awareness
  • Databricks SQL
  • Lakehouse concepts
  • Data engineering + warehousing on one platform
  • Governance and sharing awareness
Medallion Architecture
  • Bronze: raw ingest, append, preserve fidelity
  • Silver: clean, dedupe, standardize, conform
  • Gold: business-ready marts and aggregates
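A toy medallion flow over in-memory records (the layer rules and record shape are illustrative): Bronze preserves raw fidelity including duplicates, Silver standardizes and dedupes, Gold aggregates for consumption.

```python
bronze = [  # raw, append-only, fidelity preserved (including duplicates)
    {"user": "u1", "country": " us ", "amount": "10"},
    {"user": "u1", "country": " us ", "amount": "10"},  # duplicate ingest
    {"user": "u2", "country": "DE",   "amount": "5"},
]

# Silver: clean, standardize types and casing, dedupe on the full record.
silver, seen = [], set()
for r in bronze:
    rec = {"user": r["user"],
           "country": r["country"].strip().upper(),
           "amount": float(r["amount"])}
    key = tuple(rec.items())
    if key not in seen:
        seen.add(key)
        silver.append(rec)

# Gold: business-ready aggregate (revenue per country).
gold = {}
for r in silver:
    gold[r["country"]] = gold.get(r["country"], 0.0) + r["amount"]

print(gold)  # {'US': 10.0, 'DE': 5.0}
```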
Production Ingestion
  • Auto Loader for cloud-file ingest, schema evolution, and managed discovery
  • Streaming tables for continuous ingestion and incremental transforms
  • Materialized views for read-heavy serving and repeated aggregations
  • Lakeflow / DLT-style pipelines for declarative dependencies and quality expectations
Execution Choice
  • Notebooks for exploration
  • Jobs for repeatable, operational workloads
  • Declarative pipelines for standardized ingestion and quality guardrails

Delta Operations

  • MERGE: upserts, late-arriving changes, dedupe, and SCD Type 1 / 2 flows
  • OPTIMIZE: compaction and layout tuning when small files slow scans
  • VACUUM: old-file cleanup after retention and recovery needs are understood
  • History: snapshot isolation, transaction log, time travel, and table history checks
  • Rule: keep merge logic deterministic and file maintenance intentional
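MERGE semantics can be sketched with SQLite's UPSERT (Delta's MERGE syntax differs, but the upsert-with-a-deterministic-rule idea is the same; names are illustrative): updates win only when they are newer, so replays stay safe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id TEXT PRIMARY KEY, val TEXT, updated TEXT)")
conn.execute("INSERT INTO target VALUES ('a', 'v1', '2024-01-01')")

changes = [("a", "v2", "2024-01-02"),   # late-arriving update
           ("b", "v1", "2024-01-02")]   # brand-new key
conn.executemany("""
    INSERT INTO target VALUES (?, ?, ?)
    ON CONFLICT(id) DO UPDATE SET
        val = excluded.val,
        updated = excluded.updated
    WHERE excluded.updated > target.updated  -- deterministic: newer wins
""", changes)

rows = conn.execute("SELECT * FROM target ORDER BY id").fetchall()
print(rows)  # [('a', 'v2', '2024-01-02'), ('b', 'v1', '2024-01-02')]
```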

Redshift

  • Provisioned vs Serverless awareness
  • Leader / compute concepts
  • Columnar warehouse mindset
  • Loading data, COPY, and bulk-load patterns
  • Staging tables
  • Distribution styles and sort keys
  • Compression / encoding awareness
  • WLM / concurrency awareness
  • Spectrum awareness
  • Vacuum / analyze awareness
  • Security and access control
Modern Features
  • Materialized views for repeated aggregates and serving layers
  • Data sharing for governed cross-team or cross-account consumption without extra copies
  • Zero-ETL integrations for supported operational sources and lower-latency landings
  • Choose warehouse serving deliberately; do not force near-real-time product APIs onto warehouse patterns
Tuning
  • Filter early
  • Avoid unnecessary joins
  • Use subqueries when appropriate
  • Understand table-design impact

Orchestration

Airflow
  • DAGs, tasks, operators, dependencies
  • Scheduling, retries, backfills, SLA awareness
  • Sensors
  • Secrets / connections
  • Task idempotency
  • Parameterized jobs
  • Failure notifications
  • Run metadata
Scheduling Semantics
  • logical_date marks the start of the data interval, not the wall-clock launch time
  • Use data_interval_start and data_interval_end for partition filters and backfills
  • Set time zones explicitly, usually UTC end to end
  • catchup=True vs False changes whether missed historical intervals are scheduled automatically
  • Asset-aware and event-driven scheduling fit upstream-driven pipelines better than cron alone
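The interval semantics above reduce to simple datetime math. A pure-Python sketch of Airflow's daily-interval convention (no Airflow imports; the helper name is made up): the run that fires shortly after an interval closes carries a logical_date equal to the interval's start, not the wall-clock launch time.

```python
from datetime import datetime, timedelta

schedule = timedelta(days=1)  # a daily schedule interval

def interval_for(logical_date):
    # logical_date marks the START of the data interval.
    data_interval_start = logical_date
    data_interval_end = logical_date + schedule
    return data_interval_start, data_interval_end

# The run launched just after midnight on Jan 2 processes Jan 1's data:
# its logical_date is Jan 1.
start, end = interval_for(datetime(2024, 1, 1))
print(start, end)  # 2024-01-01 00:00:00 2024-01-02 00:00:00
```

Filtering partitions on `data_interval_start`/`data_interval_end` instead of "now" is what makes backfills reproduce the original runs exactly.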

Transformation

SQL-First
  • Prefer repeatable transformations with lineage and tests
dbt
  • Models, sources, seeds, snapshots
  • Tests and docs
  • Lineage graph
  • Macros / Jinja
  • Incremental models
  • Environments and CI/CD
  • Semantic-layer awareness
Sources & Freshness
  • Define sources with owners, loaded-at fields, and freshness thresholds
  • Treat upstream source contracts as first-class, not informal assumptions
Exposures & Consumers
  • Use exposures to model dashboards, extracts, apps, and ML consumers
  • Expose downstream ownership and dependency impact before changing semantics
Testing & History
  • Generic data tests protect keys, nulls, relationships, and accepted values
  • Unit tests protect transformation logic and edge-case expectations
  • Snapshots capture SCD Type 2 history when the business needs change tracking
Core Tests
  • unique, not_null, relationships, accepted_values
  • Custom business tests
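The sources, freshness, and test ideas above usually live together in a schema file. A minimal sketch of a dbt `schema.yml` (model, source, and column names are made up for illustration):

```yaml
# Illustrative dbt schema.yml — names are hypothetical.
version: 2

sources:
  - name: app
    freshness:
      warn_after: {count: 12, period: hour}
    tables:
      - name: orders
        loaded_at_field: _loaded_at

models:
  - name: dim_customer
    columns:
      - name: customer_sk
        tests: [unique, not_null]
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive']
```

`dbt source freshness` then enforces the threshold, and `dbt test` runs the column checks on every build.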

Serving & Data Products

  • Downstream consumers include BI dashboards, extracts, apps, reverse ETL, and ML features
  • A data product needs an owner, grain, semantic contract, cadence, SLA, and change process
  • Metric definitions, join rules, and null behavior matter as much as storage technology
  • Use exposures, catalog metadata, and runbooks to connect producers to consumers
  • Explicit producer-consumer contracts reduce surprise breakage more than ad hoc fixes do

Streaming & Events

Kafka
  • Topics, partitions, brokers
  • Producers, consumers, consumer groups
  • Offsets, retention, replay
  • Keys and ordering
  • Dead-letter awareness
  • Schema / versioning awareness
Design Decisions
  • Event-time vs processing-time
  • Deduplication strategy
  • Late-event handling
  • State-size control
  • Checkpoint location
  • Sink guarantees
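Why keys buy ordering: a partitioner maps each key to a fixed partition, so all events for one key land in order on one partition. A sketch of the stable-hash idea (Kafka's default partitioner actually uses murmur2; md5 here is only an illustration):

```python
import hashlib

NUM_PARTITIONS = 6  # illustrative topic size

def partition_for(key: str) -> int:
    # Stable hash of the key, modulo the partition count.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# The same key always hits the same partition, so its events stay ordered;
# events for different keys may interleave across partitions.
assert partition_for("user-42") == partition_for("user-42")
print({k: partition_for(k) for k in ["user-1", "user-2", "user-42"]})
```

The flip side is also visible: a hot key concentrates all its traffic on one partition, which is where skew comes from.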

AWS Data Lake Support

Glue
  • Crawlers
  • Data Catalog
  • ETL jobs
  • Connections
  • Triggers and scheduling
  • Monitoring
  • Schema inference
  • Partition metadata
  • Catalog-driven discovery

Security & Governance

  • Least privilege
  • PII classification
  • Encryption at rest and in transit
  • Row / column access awareness
  • Audit logs
  • Catalog ownership
  • Data retention policies
  • Access reviews
  • Compliance awareness
  • Lineage and discoverability

Reliability

  • Idempotent loads
  • Backfill strategy
  • Replay strategy
  • Late-arriving data handling
  • Schema evolution plan
  • Poison-pill / bad-record handling
  • Retry vs fail-fast rules
  • Data contracts awareness
  • Versioned datasets
  • Runbooks
  • Disaster-recovery awareness
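Idempotent loading is the keystone of the list above, and a common shape is replace-the-partition: a rerun or backfill overwrites the whole partition instead of appending, so retries cannot double-count. A sketch with an in-memory store (the store and row shapes are illustrative):

```python
warehouse = {}  # partition_date -> list of rows

def load_partition(partition_date, rows):
    # Overwrite, never append: running the same load twice is a no-op.
    warehouse[partition_date] = list(rows)

day1 = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.0}]
load_partition("2024-01-01", day1)
load_partition("2024-01-01", day1)  # retry of the same run: no duplicates

print(len(warehouse["2024-01-01"]))  # 2, not 4
```

The same property is what makes backfills safe: replaying any historical interval converges to the same state.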

Cost & Performance

  • Choose the right storage format
  • Partition wisely
  • Avoid small files
  • Prune columns
  • Push filters down
  • Cache only when useful
  • Use warehouse sizing intentionally
  • Separate dev / test / prod
  • Watch long-running queries
  • Measure before tuning

How to Choose Tools

  • Need simple BI warehouse? Redshift is often relevant.
  • Need lakehouse + big data + ML adjacency? Databricks is often relevant.
  • Need large-scale distributed processing? Spark is relevant.
  • Need SQL transformations with lineage and tests? dbt is relevant.
  • Need workflow scheduling? Airflow is relevant.
  • Need event streams? Kafka is relevant.
  • Need metadata / catalog in AWS? Glue is relevant.

Learning Order

  1. SQL (go deep)
  2. Python + CLI + Git
  3. Warehouse concepts
  4. Data modeling
  5. Spark basics
  6. Databricks basics
  7. Redshift loading / tuning basics
  8. dbt
  9. Airflow
  10. Kafka + streaming
  11. Quality / governance
  12. Portfolio + interview prep

Portfolio

Build 3 Projects

Project 1 — Batch ELT
  • Source: CSV, API, or Postgres
  • Land raw data
  • Clean and standardize
  • Load warehouse
  • Build star schema
  • dbt tests + docs
  • Airflow schedule
  • Dashboard or SQL analysis
Project 2 — Streaming
  • Kafka topic
  • Spark Structured Streaming consumer
  • Dedup + watermark
  • Write Bronze / Silver
  • Serve Gold aggregates
  • Alerting + checkpointing
Project 3 — Warehouse Optimization
  • Redshift schema
  • Distribution / sort experiments
  • Loading benchmarks
  • Query tuning examples
  • Cost / performance writeup
  • Data quality failures + recovery

Interview Prep

  • Explain medallion architecture in plain English
  • Explain fact vs dimension
  • Explain partitioning and skew
  • Explain idempotency
  • Explain batch vs streaming
  • Explain why Parquet over CSV
  • Explain why the warehouse layer exists
  • Explain dbt lineage and tests
  • Explain dbt sources, freshness, exposures, and snapshots
  • Explain Airflow retries and backfills
  • Explain logical_date, data interval, and catchup
  • Explain Spark EXPLAIN, Spark UI, and AQE at a basic level
  • Explain Delta MERGE, OPTIMIZE, and VACUUM
  • Explain Redshift distribution and sort keys at a basic level
  • Explain Redshift materialized views, data sharing, and zero-ETL at a basic level
  • Talk through the contract between a producer and downstream consumers
  • Talk through one pipeline end to end

Job Readiness

  • Can explain medallion / lakehouse vs warehouse
  • Can write joins, windows, CTEs, and incremental logic
  • Can design fact and dimension tables
  • Can explain Spark partitions and skew at a basic level
  • Can read Spark plans with EXPLAIN and Spark UI at a basic level
  • Can load and query Redshift
  • Can explain Redshift materialized views, data sharing, and zero-ETL tradeoffs
  • Can use Databricks notebooks / jobs / catalog at a basic level
  • Can explain Auto Loader, streaming tables, and declarative Lakeflow / DLT-style pipelines
  • Can explain Delta merge and maintenance operations at a basic level
  • Can define dbt sources, freshness, exposures, and snapshot use cases
  • Can explain Airflow logical dates, data intervals, and catchup behavior
  • Can explain orchestration, retries, idempotency, and SLAs
  • Can talk about schema evolution, data quality, and lineage

Outcome: you are job-ready enough for junior data engineering interviews, then deepen based on stack: AWS/Redshift, Databricks/Spark, Streaming/Kafka, or analytics engineering/dbt.