Auto Loader for cloud-file ingest, schema evolution, and managed discovery
Streaming tables for continuous ingestion and incremental transforms
Materialized views for read-heavy serving and repeated aggregations
Lakeflow / DLT-style pipelines for declarative dependencies and quality expectations
Execution Choice
Notebooks for exploration and iteration
Jobs for repeatable, scheduled workloads
Declarative pipelines standardize ingestion and quality guardrails
Delta Operations
MERGE: upserts, late-arriving changes, dedupe, and SCD Type 1 / 2 flows
OPTIMIZE: compaction and layout tuning when small files slow scans
VACUUM: old-file cleanup after retention and recovery needs are understood
History: snapshot isolation, transaction log, time travel, and table history checks
Rule: keep merge logic deterministic and file maintenance intentional
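The deterministic-merge rule above can be sketched in pure Python. This is a stand-in for what a Delta `MERGE INTO` does in an SCD Type 1 flow (matched rows updated, unmatched rows inserted); the dict plays the role of the target table, and the key and column names are made up for illustration.

```python
# Sketch of deterministic SCD Type 1 upsert semantics (what Delta's
# MERGE INTO does for matched / not-matched rows). The "target" dict
# stands in for a Delta table keyed on customer_id.

def merge_scd1(target: dict, updates: list[dict]) -> dict:
    """Upsert updates into target keyed by 'customer_id'.

    Deterministic rule: when the same key appears more than once in a
    batch, keep the row with the latest updated_at (dedupe before merge),
    so rerunning the same batch produces the same result (idempotent).
    """
    latest = {}
    for row in updates:
        key = row["customer_id"]
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    merged = dict(target)
    merged.update(latest)  # matched keys updated, new keys inserted
    return merged

target = {1: {"customer_id": 1, "email": "a@x.com", "updated_at": 1}}
batch = [
    {"customer_id": 1, "email": "a@y.com", "updated_at": 3},
    {"customer_id": 1, "email": "a@z.com", "updated_at": 2},  # older duplicate
    {"customer_id": 2, "email": "b@x.com", "updated_at": 1},
]
result = merge_scd1(target, batch)
```

Without the dedupe step, a MERGE source with duplicate keys is nondeterministic (and Delta rejects it outright), which is exactly why the rule says to keep merge logic deterministic.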
Redshift
Provisioned vs Serverless awareness
Leader node / compute node concepts
Columnar warehouse mindset
Loading data, COPY, and bulk-load patterns
Staging tables
Distribution styles and sort keys
Compression / encoding awareness
WLM / concurrency awareness
Spectrum awareness
Vacuum / analyze awareness
Security and access control
Modern Features
Materialized views for repeated aggregates and serving layers
Data sharing for governed cross-team or cross-account consumption without extra copies
Zero-ETL integrations for supported operational sources and lower-latency landings
Choose warehouse serving deliberately; do not force near-real-time product APIs onto warehouse patterns
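Distribution and sort-key choices show up directly in table DDL. A hypothetical example, kept as a SQL string the way pipeline code often stores it; the table, columns, and key choices are illustrative, not a recommendation for every schema.

```python
# Hypothetical Redshift DDL: DISTKEY on the common join column co-locates
# matching rows across slices, and a compound SORTKEY on the common filter
# column lets scans skip blocks via zone maps.
ddl = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sold_at     TIMESTAMP,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)          -- join column to the customers dimension
COMPOUND SORTKEY (sold_at);    -- most queries filter on a date range
""".strip()
print(ddl)
```

DISTSTYLE ALL or AUTO may beat KEY for small dimensions; the point is that the choice is made per table, deliberately.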
Tuning
Filter early
Avoid unnecessary joins
Use subqueries when appropriate
Understand table-design impact
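"Filter early" in miniature, as a pure-Python stand-in for a query plan: applying the predicate before the join shrinks the probe side so the join touches fewer rows. Data and column names are made up.

```python
# 100 orders spread across 3 customers; only one customer is in the EU.
orders = [{"order_id": i, "customer_id": i % 3, "amount": i * 10} for i in range(100)]
customers = [{"customer_id": c, "region": "EU" if c == 1 else "US"} for c in range(3)]

# Filter first: reduce the customer set to EU, then join orders against it.
eu_ids = {c["customer_id"] for c in customers if c["region"] == "EU"}
eu_orders = [o for o in orders if o["customer_id"] in eu_ids]
```

A good optimizer pushes the filter down for you, but writing the predicate where it can be pushed (and checking the plan) is the habit this section is pointing at.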
Orchestration
Airflow
DAGs, tasks, operators, dependencies
Scheduling, retries, backfills, SLA awareness
Sensors
Secrets / connections
Task idempotency
Parameterized jobs
Failure notifications
Run metadata
Scheduling Semantics
logical_date marks the start of the data interval, not the wall-clock launch time
Use data_interval_start and data_interval_end for partition filters and backfills
Set time zones explicitly, usually UTC end to end
catchup=True vs False changes whether missed historical intervals are scheduled automatically
Asset-aware and event-driven scheduling fit upstream-driven pipelines better than cron alone
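The interval semantics above can be made concrete with the standard library. A sketch for a daily UTC schedule: the run that launches just after midnight on June 2 carries a logical_date of June 1, because it covers the June 1 interval.

```python
# Airflow-style data-interval semantics for a daily schedule, stdlib only.
from datetime import datetime, timedelta, timezone

def daily_interval(logical_date: datetime) -> tuple[datetime, datetime]:
    """data_interval_start == logical_date; data_interval_end is one day later."""
    return logical_date, logical_date + timedelta(days=1)

start, end = daily_interval(datetime(2024, 6, 1, tzinfo=timezone.utc))

# The interval bounds, not the wall-clock launch time, drive partition filters:
predicate = f"event_date >= '{start.date()}' AND event_date < '{end.date()}'"
```

Backfills then fall out naturally: each missed interval gets its own (start, end) pair, and the same predicate template reprocesses exactly that slice.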
Transformation
SQL-First
Prefer repeatable transformations with lineage and tests
dbt
Models, sources, seeds, snapshots
Tests and docs
Lineage graph
Macros / Jinja
Incremental models
Environments and CI/CD
Semantic-layer awareness
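The incremental-model idea can be sketched without dbt itself: a pure-Python stand-in for the `is_incremental()` Jinja branch, where a full refresh selects everything and an incremental run appends only rows newer than the target's high-water mark. The source, model, and column names are hypothetical.

```python
# Mimics dbt's is_incremental() branch: same model file, two rendered queries.
BASE = "SELECT id, updated_at, payload FROM {{ source('app', 'events') }}"
INCREMENTAL_FILTER = (
    "\nWHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})"
)

def render(is_incremental: bool) -> str:
    """Full refresh: the base query. Incremental run: base + high-water filter."""
    return BASE + (INCREMENTAL_FILTER if is_incremental else "")

full_refresh_sql = render(False)
incremental_sql = render(True)
```

In real dbt the branch is Jinja in the model file and `{{ this }}` resolves to the already-built target table; the sketch only shows why the two runs produce different SQL from one definition.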
Sources & Freshness
Define sources with owners, loaded-at fields, and freshness thresholds
Treat upstream source contracts as first-class, not informal assumptions
Exposures & Consumers
Use exposures to model dashboards, extracts, apps, and ML consumers
Expose downstream ownership and dependency impact before changing semantics
Testing & History
Generic data tests protect keys, nulls, relationships, and accepted values
Unit tests protect transformation logic and edge-case expectations
Snapshots capture SCD Type 2 history when the business needs change tracking
Core Tests
unique, not_null, relationships, accepted_values
Custom business tests
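What the four core generic tests actually check, expressed as plain assertions over sample rows. Column names and accepted values are illustrative; dbt generates equivalent SQL against the real table.

```python
# unique / not_null / relationships / accepted_values, as plain checks.
rows = [
    {"order_id": 1, "status": "shipped", "customer_id": 10},
    {"order_id": 2, "status": "pending", "customer_id": 11},
]
valid_customers = {10, 11, 12}          # keys present in the parent table
accepted = {"pending", "shipped", "cancelled"}

ids = [r["order_id"] for r in rows]
assert len(ids) == len(set(ids)), "unique: order_id must not repeat"
assert all(r["order_id"] is not None for r in rows), "not_null: order_id"
assert all(r["customer_id"] in valid_customers for r in rows), "relationships"
assert all(r["status"] in accepted for r in rows), "accepted_values"
```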
Serving & Data Products
Downstream consumers include BI dashboards, extracts, apps, reverse ETL, and ML features
A data product needs an owner, grain, semantic contract, cadence, SLA, and change process
Metric definitions, join rules, and null behavior matter as much as storage technology
Use exposures, catalog metadata, and runbooks to connect producers to consumers
Explicit producer-consumer contracts reduce surprise breakage more than ad hoc fixes do
Streaming & Events
Kafka
Topics, partitions, brokers
Producers, consumers, consumer groups
Offsets, retention, replay
Keys and ordering
Dead-letter awareness
Schema / versioning awareness
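Why keys control ordering: Kafka hashes the record key to pick a partition, so every event for one key lands on one partition and is consumed in order. A pure-Python sketch of that routing; real clients use murmur2, not CRC32, and the partition count here is arbitrary.

```python
# Stable key -> partition routing, the mechanism behind per-key ordering.
import zlib

NUM_PARTITIONS = 6

def partition_for(key: str) -> int:
    # A stable hash means the same key always maps to the same partition.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

p1 = partition_for("customer-42")
p2 = partition_for("customer-42")
p3 = partition_for("customer-7")
```

The corollary is the usual skew warning: a hot key concentrates load on one partition, so key choice is a throughput decision as well as an ordering one.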
Design Decisions
Event-time vs processing-time
Deduplication strategy
Late-event handling
State-size control
Checkpoint location
Sink guarantees
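Several of the decisions above interact, which a small simulation makes concrete: dedupe by event id, and drop events older than a watermark (max event time seen, minus allowed lateness). This mimics what Structured Streaming's `withWatermark` plus `dropDuplicates` manage, including why state stays bounded — ids older than the watermark can be expired. Event shapes and the lateness budget are made up.

```python
# Event-time dedup with a watermark, stdlib only. Events are (id, event_time).
ALLOWED_LATENESS = 10  # seconds of lateness the pipeline agrees to absorb

def process(events):
    seen_ids = set()
    max_event_time = 0
    out = []
    for event_id, event_time in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_LATENESS
        if event_time < watermark:
            continue  # too late: accepting it would need unbounded state
        if event_id in seen_ids:
            continue  # duplicate delivery (at-least-once upstream)
        seen_ids.add(event_id)
        out.append((event_id, event_time))
    return out

result = process([("a", 100), ("a", 100), ("b", 105), ("c", 90), ("d", 104)])
```

Event "c" arrives with time 90 after the watermark has advanced to 95, so it is dropped — that is the late-event tradeoff the lateness budget encodes.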
AWS Data Lake Support
Glue
Crawlers
Data Catalog
ETL jobs
Connections
Triggers and scheduling
Monitoring
Schema inference
Partition metadata
Catalog-driven discovery
Security & Governance
Least privilege
PII classification
Encryption at rest and in transit
Row / column access awareness
Audit logs
Catalog ownership
Data retention policies
Access reviews
Compliance awareness
Lineage and discoverability
Reliability
Idempotent loads
Backfill strategy
Replay strategy
Late-arriving data handling
Schema evolution plan
Poison-pill / bad-record handling
Retry vs fail-fast rules
Data contracts awareness
Versioned datasets
Runbooks
Disaster-recovery awareness
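Idempotent loads and backfills often reduce to one pattern: write a batch by replacing its partition wholesale, so rerunning the same day (retry or backfill) yields the same table state instead of duplicated rows. A minimal sketch, with a dict standing in for partitioned storage.

```python
# Overwrite-by-partition: the load for a date is safe to rerun.
def load_partition(table: dict, partition_date: str, rows: list) -> None:
    # Replace, never append: a retry for the same date is a no-op in effect.
    table[partition_date] = list(rows)

table = {}
batch = [{"id": 1}, {"id": 2}]
load_partition(table, "2024-06-01", batch)
load_partition(table, "2024-06-01", batch)  # retry: same state, no dupes
```

The same shape covers replay and late-arriving data: recompute the affected partitions from source and overwrite them, rather than patching rows in place.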
Cost & Performance
Choose the right storage format
Partition wisely
Avoid small files
Prune columns
Push filters down
Cache only when useful
Use warehouse sizing intentionally
Separate dev / test / prod
Watch long-running queries
Measure before tuning
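Partitioning and filter pushdown pay off because a predicate on the partition column selects a handful of files instead of scanning all of them. A sketch with made-up S3 paths; pruning here is pure path matching, no file is opened.

```python
# One file per day for June; a date predicate prunes 30 files down to 1.
files = [f"s3://lake/events/dt=2024-06-{d:02d}/part-0.parquet" for d in range(1, 31)]

def prune(paths: list, dt: str) -> list:
    # Partition pruning: match on the path's partition value, read nothing.
    return [p for p in paths if f"dt={dt}" in p]

hit = prune(files, "2024-06-15")
```

The "avoid small files" line is the flip side: if each partition holds thousands of tiny files, pruning still leaves an expensive listing-and-open cost, so compaction matters too.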
How to Choose Tools
Need simple BI warehouse? Redshift is often relevant.
Need lakehouse + big data + ML adjacency? Databricks is often relevant.
Need large-scale distributed processing? Spark is relevant.
Need SQL transformations with lineage and tests? dbt is relevant.
Need workflow scheduling? Airflow is relevant.
Need event streams? Kafka is relevant.
Need metadata / catalog in AWS? Glue is relevant.
Learning Order
SQL, studied hard (the core skill)
Python + CLI + Git
Warehouse concepts
Data modeling
Spark basics
Databricks basics
Redshift loading / tuning basics
dbt
Airflow
Kafka + streaming
Quality / governance
Portfolio + interview prep
Portfolio
Build 3 Projects
Project 1 — Batch ELT
Source: CSV, API, or Postgres
Land raw data
Clean and standardize
Load warehouse
Build star schema
dbt tests + docs
Airflow schedule
Dashboard or SQL analysis
Project 2 — Streaming
Kafka topic
Spark Structured Streaming consumer
Dedup + watermark
Write Bronze / Silver
Serve Gold aggregates
Alerting + checkpointing
Project 3 — Warehouse Optimization
Redshift schema
Distribution / sort experiments
Loading benchmarks
Query tuning examples
Cost / performance writeup
Data quality failures + recovery
Interview Prep
Explain medallion architecture in plain English
Explain fact vs dimension
Explain partitioning and skew
Explain idempotency
Explain batch vs streaming
Explain why Parquet over CSV
Explain why the warehouse layer exists
Explain dbt lineage and tests
Explain dbt sources, freshness, exposures, and snapshots
Explain Airflow retries and backfills
Explain logical_date, data interval, and catchup
Explain Spark EXPLAIN, Spark UI, and AQE at a basic level
Explain Delta MERGE, OPTIMIZE, and VACUUM
Explain Redshift distribution and sort keys at a basic level
Explain Redshift materialized views, data sharing, and zero-ETL at a basic level
Talk through the contract between a producer and downstream consumers
Talk through one pipeline end to end
Job Readiness
Can explain medallion / lakehouse vs warehouse
Can write joins, windows, CTEs, and incremental logic
Can design fact and dimension tables
Can explain Spark partitions and skew at a basic level
Can read Spark plans with EXPLAIN and Spark UI at a basic level
Can load and query Redshift
Can explain Redshift materialized views, data sharing, and zero-ETL tradeoffs
Can use Databricks notebooks / jobs / catalog at a basic level
Can explain Auto Loader, streaming tables, and declarative Lakeflow / DLT-style pipelines
Can explain Delta merge and maintenance operations at a basic level
Can define dbt sources, freshness, exposures, and snapshot use cases
Can explain Airflow logical dates, data intervals, and catchup behavior
Can explain orchestration, retries, idempotency, and SLAs
Can talk about schema evolution, data quality, and lineage
Outcome: you are job-ready for junior data engineering interviews; then deepen based on your target stack: AWS/Redshift, Databricks/Spark, streaming/Kafka, or analytics engineering/dbt.