Foundations, platform concepts, interview patterns, and project roadmap

Mindset

  • Think in data systems, not just code.
  • Trace every flow as input → processing → storage → serving → monitoring.
  • Optimize for reliability, lineage, cost, and trust.
  • Learn one cloud deeply and map the patterns elsewhere.
Always Ask
  • What is the source?
  • What is the grain?
  • What is the schema?
  • What is the SLA?
  • Who consumes it?

Core Tools

  • SQL: joins, windows, CTEs, incremental logic
  • Python: packaging, file I/O, APIs, local transforms
  • Linux + CLI: bash, env vars, cron, SSH, permissions
  • Git: branching, PRs, rebase/merge, CI basics
  • Cloud: compute, object storage, IAM, secrets, logs, cost
  • Data modeling: grain, facts, dimensions, marts
  • Distributed systems: partitions, shuffles, retries, idempotency
  • Observability: logs, metrics, freshness, row counts, runbooks

SQL

Fundamentals

  • SELECT, WHERE, GROUP BY, HAVING, ORDER BY
  • INNER, LEFT, and FULL joins
  • UNION vs UNION ALL
  • CASE WHEN, subqueries, CTEs, date logic, NULL handling
  • DDL vs DML, views, materialized views, temp tables
Window Functions
  • ROW_NUMBER, RANK, DENSE_RANK
  • LAG, LEAD, running totals
  • Understand partitioning vs ordering inside windows
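The partition-vs-order distinction can be sketched with SQLite's built-in window functions (table and column names here are illustrative, not from the outline): PARTITION BY resets the window per group, ORDER BY sequences rows inside it.

```python
import sqlite3

# Hypothetical "orders" table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("a", "2024-01-01", 10.0), ("a", "2024-01-05", 20.0),
     ("b", "2024-01-02", 15.0), ("b", "2024-01-03", 5.0)],
)

# PARTITION BY customer restarts numbering per customer; ORDER BY
# order_date drives ROW_NUMBER, LAG, and the running SUM inside it.
rows = conn.execute("""
    SELECT customer,
           order_date,
           ROW_NUMBER() OVER (PARTITION BY customer ORDER BY order_date) AS rn,
           LAG(amount)  OVER (PARTITION BY customer ORDER BY order_date) AS prev_amount,
           SUM(amount)  OVER (PARTITION BY customer ORDER BY order_date) AS running_total
    FROM orders
""").fetchall()
for r in rows:
    print(r)
```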

SQL Interview Patterns

  • Deduplicate data
  • Top N per group
  • Sessionization awareness
  • Funnel and cohort logic awareness
  • Slowly changing joins
  • Incremental upserts / merge logic
  • Data reconciliation queries
  • Root-cause debugging queries
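Deduplication and top-N-per-group share one interview-staple shape: rank inside a partition, then filter on the rank. A minimal sketch in SQLite (table and column names are made up):

```python
import sqlite3

# Keep the latest row per key with ROW_NUMBER over a descending order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event_ts TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("u1", "2024-01-01", "old"), ("u1", "2024-01-02", "new"),
     ("u2", "2024-01-01", "only")],
)

latest = conn.execute("""
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY event_ts DESC
               ) AS rn
        FROM events
    )
    SELECT user_id, event_ts, payload FROM ranked WHERE rn = 1
""").fetchall()
print(latest)
```

Swap `WHERE rn = 1` for `WHERE rn <= 3` and the same query becomes top-3 per group.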

Python

  • Functions, classes, packages
  • Virtual environments
  • File I/O
  • JSON, CSV, Parquet basics
  • Requests / APIs
  • Pandas for local prototyping
  • PySpark basics later

Data Shapes & Formats

  • Structured: stable table schemas
  • Semi-structured: JSON, nested records, arrays
  • Unstructured references: documents, media, logs around metadata
File Formats
  • CSV: simple, universal, weak typing
  • JSON: flexible, nested, common for APIs
  • Parquet: columnar, compressed, analytics-friendly
  • Avro: row-oriented, schema-aware; common for event transport
  • Iceberg / Delta: schema evolution, partition evolution, snapshots, ACID semantics
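The "CSV: weak typing" point is easy to see with the standard library alone (Parquet or Avro would keep a real schema, but need extra packages, so this sketch sticks to csv and json):

```python
import csv, io, json

# CSV round-trip loses types: everything comes back as a string.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "amount", "active"])
writer.writerow([1, 12.5, True])

buf.seek(0)
row = next(csv.DictReader(buf))
print(row)  # {'id': '1', 'amount': '12.5', 'active': 'True'} — all strings

# JSON preserves basic types (though not dates or decimals).
decoded = json.loads(json.dumps({"id": 1, "amount": 12.5, "active": True}))
print(type(decoded["amount"]))
```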

Storage Patterns

  • Raw landing zone
  • Cleaned canonical layer
  • Curated serving layer
  • Archival / cold storage
  • Metadata catalog

Data Modeling

Core Concepts

  • Understand business grain first
  • Operational schema vs analytical schema
  • Fact tables, dimension tables, star schemas
  • Slowly changing dimensions, surrogate keys, conformed dimensions
  • Snapshotting, incremental models, data marts
Deep Dive
  • One fact row = one business event or one snapshot
  • Dimensions describe who / what / where / when
  • Degenerate and junk dimensions awareness
  • SCD Type 1 vs Type 2
  • Bridge tables, aggregate tables, partitioning large facts
  • Avoid wide unmodeled chaos in the serving layer
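A minimal star-schema sketch, runnable in SQLite (table names, keys, and the sale grain are illustrative): one fact row per sale event, a dimension keyed by a surrogate key, and a typical aggregate query over the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY,   -- surrogate key
        customer_id TEXT,                  -- natural / business key
        region      TEXT
    );
    CREATE TABLE fact_sales (
        customer_sk INTEGER,               -- FK into dim_customer
        sale_date   TEXT,
        amount      REAL                   -- grain: one row per sale
    );
    INSERT INTO dim_customer VALUES (1, 'c-100', 'EU'), (2, 'c-200', 'US');
    INSERT INTO fact_sales VALUES (1, '2024-01-01', 10.0),
                                  (1, '2024-01-02', 20.0),
                                  (2, '2024-01-01', 5.0);
""")

totals = conn.execute("""
    SELECT d.region, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer d USING (customer_sk)
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
print(totals)  # [('EU', 30.0), ('US', 5.0)]
```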

Source Systems

  • OLTP databases
  • APIs
  • Files from vendors
  • Event streams
  • SaaS tools
  • Logs and clickstream
  • IoT / telemetry
  • CDC awareness

Ingestion

  • Batch ingestion, streaming ingestion, micro-batch
  • Change data capture, full refresh, incremental extract
  • Schema registry awareness and catalog registration
Batch
  • Scheduled pulls
  • File drops
  • Warehouse loads
Streaming
  • Kafka topics, partitions, offsets, consumer groups
  • Insert / update / delete semantics
  • Deduplication strategy
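One common deduplication strategy for at-least-once delivery, sketched in plain Python (the event shape and key choice are illustrative assumptions): the broker may redeliver, so the consumer drops anything whose stable event key was already seen.

```python
# Deduplicate redelivered events on a stable (entity, event) key before
# applying insert / update / delete writes downstream.
def dedupe(events):
    seen = set()
    for event in events:
        key = (event["entity_id"], event["event_id"])
        if key in seen:
            continue  # redelivered duplicate — skip
        seen.add(key)
        yield event

events = [
    {"entity_id": "a", "event_id": 1, "op": "insert"},
    {"entity_id": "a", "event_id": 1, "op": "insert"},  # duplicate delivery
    {"entity_id": "a", "event_id": 2, "op": "update"},
]
print(list(dedupe(events)))  # two unique events survive
```

In production the "seen" set is usually bounded (a TTL store or a windowed state), since an unbounded set grows forever.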

Metadata & Quality

Catalog
  • Table names, schemas, owners, lineage
  • Classification / sensitivity
  • Freshness
  • Glue Catalog or Unity Catalog awareness
Data Quality
  • Schema validation
  • Null, uniqueness, and referential-integrity checks
  • Accepted values and business-rule validation
  • Freshness tests and volume anomaly checks
  • Quarantine bad records
  • Checkpoint validation
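The quality checks above can be sketched as a tiny validation pass (rules and row shape are example assumptions): failing rows are routed to a quarantine list with their error reasons instead of failing the whole load.

```python
rows = [
    {"id": 1,    "status": "active",  "amount": 10.0},
    {"id": 2,    "status": "unknown", "amount": 5.0},   # bad accepted value
    {"id": None, "status": "active",  "amount": 3.0},   # null key
]

ACCEPTED_STATUSES = {"active", "inactive"}

good, quarantine = [], []
seen_ids = set()
for row in rows:
    errors = []
    if row["id"] is None:
        errors.append("null id")
    elif row["id"] in seen_ids:
        errors.append("duplicate id")
    if row["status"] not in ACCEPTED_STATUSES:
        errors.append("unexpected status")
    if errors:
        quarantine.append({"row": row, "errors": errors})
    else:
        seen_ids.add(row["id"])
        good.append(row)

print(len(good), len(quarantine))  # 1 2
```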

Spark

  • Spark SQL and DataFrames
  • Transformations vs actions
  • Lazy execution
  • Catalyst / Tungsten awareness
  • Partitions, shuffle, caching, broadcast joins
  • UDF caution
  • PySpark
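The transformations-vs-actions idea can be shown without a cluster. This is a toy illustration, not Spark: `map` and `filter` only record a plan, and nothing executes until an action like `collect()` runs it.

```python
class LazyDataset:
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []          # recorded transformations

    def map(self, fn):                  # transformation: extends the plan
        return LazyDataset(self.data, self.plan + [("map", fn)])

    def filter(self, fn):               # transformation: extends the plan
        return LazyDataset(self.data, self.plan + [("filter", fn)])

    def collect(self):                  # action: executes the whole plan
        out = self.data
        for kind, fn in self.plan:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(len(ds.plan))   # plan recorded, nothing computed yet
print(ds.collect())   # [20, 30, 40]
```

Laziness is what lets Spark's Catalyst optimizer see the whole plan before running it; this sketch skips optimization entirely.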
Plan Debugging
  • Start with EXPLAIN or df.explain("formatted")
  • Use Spark UI to inspect stages, tasks, shuffle read/write, spill, and skewed partitions
  • AQE can change join strategy and coalesce partitions at runtime
  • Check exchange nodes, file scans, and broadcast decisions before tuning blindly
Structured Streaming
  • Sources, sinks, triggers
  • Watermarks
  • Stateful ops
  • Checkpointing
  • Late / out-of-order events
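Watermarking can be sketched in a few lines (this shows the idea behind Spark's `withWatermark`, not its implementation): track the max event time seen, and drop events that fall behind it by more than the allowed lateness.

```python
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(minutes=10)  # illustrative threshold

def process(events):
    max_event_time = datetime.min
    kept, dropped = [], []
    for ts, payload in events:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - ALLOWED_LATENESS
        (kept if ts >= watermark else dropped).append(payload)
    return kept, dropped

t0 = datetime(2024, 1, 1, 12, 0)
events = [
    (t0, "a"),
    (t0 + timedelta(minutes=30), "b"),        # advances the watermark
    (t0 + timedelta(minutes=5), "too-late"),  # now behind the watermark
]
kept, dropped = process(events)
print(kept, dropped)  # ['a', 'b'] ['too-late']
```

The tradeoff is visible here too: a larger allowed lateness keeps more late events but forces the engine to hold more state.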
Common Performance Issues
  • Small files
  • Skew
  • Too many shuffles
  • Bad partitioning
  • Wide transformations
  • Unbounded state

Databricks

  • Workspace basics, notebooks, clusters / compute, jobs, repos
  • Delta Lake
  • Unity Catalog awareness
  • Databricks SQL
  • Lakehouse concepts
  • Data engineering + warehousing on one platform
  • Governance and sharing awareness
Medallion Architecture
  • Bronze: raw ingest, append, preserve fidelity
  • Silver: clean, dedupe, standardize, conform
  • Gold: business-ready marts and aggregates
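A toy medallion flow over in-memory records (the layer rules and record shape are illustrative): Bronze preserves raw fidelity including duplicates, Silver standardizes and dedupes, Gold aggregates for consumption.

```python
bronze = [  # raw, append-only, fidelity preserved (including duplicates)
    {"user": "u1", "country": " us ", "amount": "10"},
    {"user": "u1", "country": " us ", "amount": "10"},  # duplicate ingest
    {"user": "u2", "country": "DE",   "amount": "5"},
]

# Silver: clean, standardize types and casing, dedupe on the full record.
silver, seen = [], set()
for r in bronze:
    rec = {"user": r["user"],
           "country": r["country"].strip().upper(),
           "amount": float(r["amount"])}
    key = tuple(rec.items())
    if key not in seen:
        seen.add(key)
        silver.append(rec)

# Gold: business-ready aggregate (revenue per country).
gold = {}
for r in silver:
    gold[r["country"]] = gold.get(r["country"], 0.0) + r["amount"]

print(gold)  # {'US': 10.0, 'DE': 5.0}
```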
Production Ingestion
  • Auto Loader for cloud-file ingest, schema evolution, and managed discovery
  • Streaming tables for continuous ingestion and incremental transforms
  • Materialized views for read-heavy serving and repeated aggregations
  • Lakeflow / DLT-style pipelines for declarative dependencies and quality expectations
Execution Choice
  • Notebooks for exploration
  • Jobs for repeatable, operational workloads
  • Declarative pipelines for standardized ingestion and quality guardrails

Delta Operations

  • MERGE: upserts, late-arriving changes, dedupe, and SCD Type 1 / 2 flows
  • OPTIMIZE: compaction and layout tuning when small files slow scans
  • VACUUM: old-file cleanup after retention and recovery needs are understood
  • History: snapshot isolation, transaction log, time travel, and table history checks
  • Rule: keep merge logic deterministic and file maintenance intentional
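MERGE semantics can be sketched with SQLite's UPSERT (Delta's MERGE syntax differs, but the upsert-with-a-deterministic-rule idea is the same; names are illustrative): updates win only when they are newer, so replays stay safe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id TEXT PRIMARY KEY, val TEXT, updated TEXT)")
conn.execute("INSERT INTO target VALUES ('a', 'v1', '2024-01-01')")

changes = [("a", "v2", "2024-01-02"),   # late-arriving update
           ("b", "v1", "2024-01-02")]   # brand-new key
conn.executemany("""
    INSERT INTO target VALUES (?, ?, ?)
    ON CONFLICT(id) DO UPDATE SET
        val = excluded.val,
        updated = excluded.updated
    WHERE excluded.updated > target.updated  -- deterministic: newer wins
""", changes)

rows = conn.execute("SELECT * FROM target ORDER BY id").fetchall()
print(rows)  # [('a', 'v2', '2024-01-02'), ('b', 'v1', '2024-01-02')]
```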

Redshift

  • Provisioned vs Serverless awareness
  • Leader / compute concepts
  • Columnar warehouse mindset
  • Loading data, COPY, and bulk-load patterns
  • Staging tables
  • Distribution styles and sort keys
  • Compression / encoding awareness
  • WLM / concurrency awareness
  • Spectrum awareness
  • Vacuum / analyze awareness
  • Security and access control
Modern Features
  • Materialized views for repeated aggregates and serving layers
  • Data sharing for governed cross-team or cross-account consumption without extra copies
  • Zero-ETL integrations for supported operational sources and lower-latency landings
  • Choose warehouse serving deliberately; do not force near-real-time product APIs onto warehouse patterns
Tuning
  • Filter early
  • Avoid unnecessary joins
  • Use subqueries when appropriate
  • Understand table-design impact

Orchestration

Airflow
  • DAGs, tasks, operators, dependencies
  • Scheduling, retries, backfills, SLA awareness
  • Sensors
  • Secrets / connections
  • Task idempotency
  • Parameterized jobs
  • Failure notifications
  • Run metadata
Scheduling Semantics
  • logical_date marks the start of the data interval, not the wall-clock launch time
  • Use data_interval_start and data_interval_end for partition filters and backfills
  • Set time zones explicitly, usually UTC end to end
  • catchup=True vs False changes whether missed historical intervals are scheduled automatically
  • Asset-aware and event-driven scheduling fit upstream-driven pipelines better than cron alone
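The interval semantics above reduce to simple datetime math. A pure-Python sketch of Airflow's daily-interval convention (no Airflow imports; the helper name is made up): the run that fires shortly after an interval closes carries a logical_date equal to the interval's start, not the wall-clock launch time.

```python
from datetime import datetime, timedelta

schedule = timedelta(days=1)  # a daily schedule interval

def interval_for(logical_date):
    # logical_date marks the START of the data interval.
    data_interval_start = logical_date
    data_interval_end = logical_date + schedule
    return data_interval_start, data_interval_end

# The run launched just after midnight on Jan 2 processes Jan 1's data:
# its logical_date is Jan 1.
start, end = interval_for(datetime(2024, 1, 1))
print(start, end)  # 2024-01-01 00:00:00 2024-01-02 00:00:00
```

Filtering partitions on `data_interval_start`/`data_interval_end` instead of "now" is what makes backfills reproduce the original runs exactly.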

Transformation

SQL-First
  • Prefer repeatable transformations with lineage and tests
dbt
  • Models, sources, seeds, snapshots
  • Tests and docs
  • Lineage graph
  • Macros / Jinja
  • Incremental models
  • Environments and CI/CD
  • Semantic-layer awareness
Sources & Freshness
  • Define sources with owners, loaded-at fields, and freshness thresholds
  • Treat upstream source contracts as first-class, not informal assumptions
Exposures & Consumers
  • Use exposures to model dashboards, extracts, apps, and ML consumers
  • Expose downstream ownership and dependency impact before changing semantics
Testing & History
  • Generic data tests protect keys, nulls, relationships, and accepted values
  • Unit tests protect transformation logic and edge-case expectations
  • Snapshots capture SCD Type 2 history when the business needs change tracking
Core Tests
  • unique, not_null, relationships, accepted_values
  • Custom business tests
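The sources, freshness, and test ideas above usually live together in a schema file. A minimal sketch of a dbt `schema.yml` (model, source, and column names are made up for illustration):

```yaml
# Illustrative dbt schema.yml — names are hypothetical.
version: 2

sources:
  - name: app
    freshness:
      warn_after: {count: 12, period: hour}
    tables:
      - name: orders
        loaded_at_field: _loaded_at

models:
  - name: dim_customer
    columns:
      - name: customer_sk
        tests: [unique, not_null]
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive']
```

`dbt source freshness` then enforces the threshold, and `dbt test` runs the column checks on every build.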

Serving & Data Products

  • Downstream consumers include BI dashboards, extracts, apps, reverse ETL, and ML features
  • A data product needs an owner, grain, semantic contract, cadence, SLA, and change process
  • Metric definitions, join rules, and null behavior matter as much as storage technology
  • Use exposures, catalog metadata, and runbooks to connect producers to consumers
  • Explicit producer-consumer contracts reduce surprise breakage more than ad hoc fixes do

Streaming & Events

Kafka
  • Topics, partitions, brokers
  • Producers, consumers, consumer groups
  • Offsets, retention, replay
  • Keys and ordering
  • Dead-letter awareness
  • Schema / versioning awareness
Design Decisions
  • Event-time vs processing-time
  • Deduplication strategy
  • Late-event handling
  • State-size control
  • Checkpoint location
  • Sink guarantees
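Why keys buy ordering: a partitioner maps each key to a fixed partition, so all events for one key land in order on one partition. A sketch of the stable-hash idea (Kafka's default partitioner actually uses murmur2; md5 here is only an illustration):

```python
import hashlib

NUM_PARTITIONS = 6  # illustrative topic size

def partition_for(key: str) -> int:
    # Stable hash of the key, modulo the partition count.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# The same key always hits the same partition, so its events stay ordered;
# events for different keys may interleave across partitions.
assert partition_for("user-42") == partition_for("user-42")
print({k: partition_for(k) for k in ["user-1", "user-2", "user-42"]})
```

The flip side is also visible: a hot key concentrates all its traffic on one partition, which is where skew comes from.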

AWS Data Lake Support

Glue
  • Crawlers
  • Data Catalog
  • ETL jobs
  • Connections
  • Triggers and scheduling
  • Monitoring
  • Schema inference
  • Partition metadata
  • Catalog-driven discovery

Security & Governance

  • Least privilege
  • PII classification
  • Encryption at rest and in transit
  • Row / column access awareness
  • Audit logs
  • Catalog ownership
  • Data retention policies
  • Access reviews
  • Compliance awareness
  • Lineage and discoverability

Reliability

  • Idempotent loads
  • Backfill strategy
  • Replay strategy
  • Late-arriving data handling
  • Schema evolution plan
  • Poison-pill / bad-record handling
  • Retry vs fail-fast rules
  • Data contracts awareness
  • Versioned datasets
  • Runbooks
  • Disaster-recovery awareness
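Idempotent loading is the keystone of the list above, and a common shape is replace-the-partition: a rerun or backfill overwrites the whole partition instead of appending, so retries cannot double-count. A sketch with an in-memory store (the store and row shapes are illustrative):

```python
warehouse = {}  # partition_date -> list of rows

def load_partition(partition_date, rows):
    # Overwrite, never append: running the same load twice is a no-op.
    warehouse[partition_date] = list(rows)

day1 = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.0}]
load_partition("2024-01-01", day1)
load_partition("2024-01-01", day1)  # retry of the same run: no duplicates

print(len(warehouse["2024-01-01"]))  # 2, not 4
```

The same property is what makes backfills safe: replaying any historical interval converges to the same state.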

Cost & Performance

  • Choose the right storage format
  • Partition wisely
  • Avoid small files
  • Prune columns
  • Push filters down
  • Cache only when useful
  • Use warehouse sizing intentionally
  • Separate dev / test / prod
  • Watch long-running queries
  • Measure before tuning

How to Choose Tools

  • Need simple BI warehouse? Redshift is often relevant.
  • Need lakehouse + big data + ML adjacency? Databricks is often relevant.
  • Need large-scale distributed processing? Spark is relevant.
  • Need SQL transformations with lineage and tests? dbt is relevant.
  • Need workflow scheduling? Airflow is relevant.
  • Need event streams? Kafka is relevant.
  • Need metadata / catalog in AWS? Glue is relevant.

Learning Order

  1. SQL (go deep)
  2. Python + CLI + Git
  3. Warehouse concepts
  4. Data modeling
  5. Spark basics
  6. Databricks basics
  7. Redshift loading / tuning basics
  8. dbt
  9. Airflow
  10. Kafka + streaming
  11. Quality / governance
  12. Portfolio + interview prep

Portfolio

Build 3 Projects

Project 1 — Batch ELT
  • Source: CSV, API, or Postgres
  • Land raw data
  • Clean and standardize
  • Load warehouse
  • Build star schema
  • dbt tests + docs
  • Airflow schedule
  • Dashboard or SQL analysis
Project 2 — Streaming
  • Kafka topic
  • Spark Structured Streaming consumer
  • Dedup + watermark
  • Write Bronze / Silver
  • Serve Gold aggregates
  • Alerting + checkpointing
Project 3 — Warehouse Optimization
  • Redshift schema
  • Distribution / sort experiments
  • Loading benchmarks
  • Query tuning examples
  • Cost / performance writeup
  • Data quality failures + recovery

Interview Prep

  • Explain medallion architecture in plain English
  • Explain fact vs dimension
  • Explain partitioning and skew
  • Explain idempotency
  • Explain batch vs streaming
  • Explain why Parquet over CSV
  • Explain why the warehouse layer exists
  • Explain dbt lineage and tests
  • Explain dbt sources, freshness, exposures, and snapshots
  • Explain Airflow retries and backfills
  • Explain logical_date, data interval, and catchup
  • Explain Spark EXPLAIN, Spark UI, and AQE at a basic level
  • Explain Delta MERGE, OPTIMIZE, and VACUUM
  • Explain Redshift distribution and sort keys at a basic level
  • Explain Redshift materialized views, data sharing, and zero-ETL at a basic level
  • Talk through the contract between a producer and downstream consumers
  • Talk through one pipeline end to end

Job Readiness

  • Can explain medallion / lakehouse vs warehouse
  • Can write joins, windows, CTEs, and incremental logic
  • Can design fact and dimension tables
  • Can explain Spark partitions and skew at a basic level
  • Can read Spark plans with EXPLAIN and Spark UI at a basic level
  • Can load and query Redshift
  • Can explain Redshift materialized views, data sharing, and zero-ETL tradeoffs
  • Can use Databricks notebooks / jobs / catalog at a basic level
  • Can explain Auto Loader, streaming tables, and declarative Lakeflow / DLT-style pipelines
  • Can explain Delta merge and maintenance operations at a basic level
  • Can define dbt sources, freshness, exposures, and snapshot use cases
  • Can explain Airflow logical dates, data intervals, and catchup behavior
  • Can explain orchestration, retries, idempotency, and SLAs
  • Can talk about schema evolution, data quality, and lineage

Outcome: you are job-ready enough for junior data engineering interviews, then deepen based on stack: AWS/Redshift, Databricks/Spark, Streaming/Kafka, or analytics engineering/dbt.