Repeatable operating patterns for common day-to-day engineering tasks

New Source Onboarding

Goal
  • Bring a source system into the platform with clear ownership, schema, and validation.
Gather
  • Owner and escalation path
  • Business purpose and consumers
  • Grain, schema, update frequency, SLA
  • Access method: API, DB, files, CDC, events
  • Sensitive fields and retention requirements
Steps
  1. Document source contract and sample records.
  2. Choose full-refresh, incremental, CDC, or streaming pattern.
  3. Land raw data with minimal transformation.
  4. Register metadata and ownership.
  5. Add schema and freshness checks.
  6. Build cleaned / canonical layer.
Done When
  • Source has owner, tests, docs, lineage, and a recovery plan.
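
Step 5 above (schema and freshness checks) can be sketched in plain Python. This is a minimal illustration, not a real framework: the column names in EXPECTED_SCHEMA and the six-hour SLA are hypothetical, and in production these checks would live in your test or orchestration layer.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical source contract: expected columns/types and a freshness SLA.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "updated_at": datetime}
FRESHNESS_SLA = timedelta(hours=6)

def check_schema(record: dict) -> list:
    """Return a list of schema violations for one landed record."""
    problems = []
    for col, typ in EXPECTED_SCHEMA.items():
        if col not in record:
            problems.append(f"missing column: {col}")
        elif not isinstance(record[col], typ):
            problems.append(f"bad type for {col}: {type(record[col]).__name__}")
    return problems

def check_freshness(latest_ts: datetime, now: datetime = None) -> bool:
    """True when the newest landed record is within the freshness SLA."""
    now = now or datetime.now(timezone.utc)
    return now - latest_ts <= FRESHNESS_SLA
```

Running both checks on every load gives the "tests" part of the done-when criteria a concrete starting point.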

Incremental Pipeline

Choose
  • Primary key or business key
  • Watermark column or change version
  • Append-only vs merge / upsert
  • Late-data policy
Steps
  1. Load only new or changed rows from source.
  2. Stage raw extract separately.
  3. Validate record counts and duplicate risk.
  4. Apply deterministic transforms.
  5. Merge or append into target table.
  6. Persist watermark only after success.
Watch For
  • Clock drift, null watermarks, duplicate keys, replays, partial writes.
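
The steps above can be sketched as one function. This is a toy sketch under simplifying assumptions: the target is a dict keyed by primary key, `id` and `updated_at` are hypothetical key and watermark columns, and real extract/merge I/O is elided.

```python
def load_incremental(source_rows, target: dict, watermark):
    """Merge rows newer than the watermark; advance it only after success."""
    # Step 1: pull only new or changed rows (watermark filter at the source).
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    # Steps 3-4: dedupe to the latest version per key so the merge is deterministic.
    latest = {}
    for r in sorted(new_rows, key=lambda r: r["updated_at"]):
        latest[r["id"]] = r
    # Step 5: upsert into the target keyed on the primary key.
    target.update(latest)
    # Step 6: return the new watermark only after the merge succeeded.
    return max((r["updated_at"] for r in new_rows), default=watermark)
```

Note that re-running with the persisted watermark is a no-op, which is what makes replays after partial failures safe.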

Safe Backfill

Before
  • Define exact time window and impacted tables.
  • Estimate compute, runtime, and blast radius.
  • Decide whether to isolate into temp tables.
  • Pause or coordinate downstream jobs if needed.
During
  • Process in bounded chunks.
  • Track progress by partition or date window.
  • Validate each chunk before marking complete.
After
  • Reconcile counts, freshness, and business metrics.
  • Document what was reprocessed and why.
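
The "bounded chunks with tracked progress" pattern can be sketched as follows; `process` and `validate` are stand-ins for your actual reprocessing and reconciliation logic, and the seven-day chunk size is an arbitrary assumption.

```python
from datetime import date, timedelta

def backfill_chunks(start: date, end: date, days_per_chunk: int = 7):
    """Yield bounded [chunk_start, chunk_end) windows covering the backfill range."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days_per_chunk), end)
        yield cur, nxt
        cur = nxt

def run_backfill(start, end, process, validate, progress: set):
    """Process each chunk, validate it, and record progress so reruns can resume."""
    for lo, hi in backfill_chunks(start, end):
        if lo in progress:            # already completed in a previous attempt
            continue
        result = process(lo, hi)
        if not validate(result):      # stop before marking a bad chunk complete
            raise RuntimeError(f"validation failed for window {lo}..{hi}")
        progress.add(lo)
```

Persisting `progress` somewhere durable is what lets an interrupted backfill resume instead of restarting from scratch.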

Pipeline Failure Triage

First Questions
  • Did extract fail, transform fail, or load fail?
  • Is this schema drift, bad data, permissions, compute, or dependency failure?
  • Is this a one-off or a systemic issue?
Triage Order
  1. Read the failing step and timestamp.
  2. Check upstream freshness and schema changes.
  3. Check recent deploys and config changes.
  4. Inspect row counts and anomaly spikes.
  5. Decide: retry, patch forward, quarantine, or rollback.
Preventive Actions
  • Add test, alert, guardrail, or runbook update.

Bad Data Investigation

Check
  • Freshness
  • Row counts vs normal baseline
  • Duplicate-key spikes
  • Null-rate spikes
  • Values outside the accepted set
  • Changed field semantics from source team
Compare
  • Today vs yesterday
  • Raw vs cleaned layer
  • Source totals vs warehouse totals
  • One partition vs neighboring partitions
Outcome
  • Fix transform, fix contract, quarantine data, or backfill corrected history.
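
The checks above reduce to a few comparisons against a baseline. A minimal sketch, assuming rows arrive as dicts and a ±20% tolerance band (both arbitrary choices):

```python
def volume_ok(today: int, baseline: int, tolerance: float = 0.2) -> bool:
    """Row count within +/- tolerance of the normal baseline."""
    return baseline > 0 and abs(today - baseline) / baseline <= tolerance

def null_rate(rows: list, column: str) -> float:
    """Fraction of rows where the column is missing or null."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)

def duplicate_keys(rows: list, key: str) -> set:
    """Business keys that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        k = r[key]
        (dupes if k in seen else seen).add(k)
    return dupes
```

Comparing today's `null_rate` and volume against yesterday's is usually enough to separate a transform bug from an upstream contract change.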

Slow Query Tuning

Quick Wins
  • Filter early
  • Select fewer columns
  • Reduce large joins
  • Use staging / temp tables for complex steps
Warehouse Checks
  • Partition pruning working?
  • Sort key / distribution key design helping or hurting?
  • Small files or skew causing scan bloat?
  • Broadcast / shuffle decisions sensible?
Done When
  • You measured before and after, and can explain why it improved.

Schema Change Handling

Classify
  • Additive
  • Breaking rename
  • Type change
  • Semantic change with same name
Steps
  1. Identify downstream consumers.
  2. Decide compatibility strategy.
  3. Update transforms and tests.
  4. Version docs or contracts if needed.
  5. Backfill if semantics changed historically.
Rule
  • Semantic changes are more dangerous than column additions.
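
The classification step can be automated for the structural cases. A sketch, assuming schemas are available as simple column-to-type maps; note it deliberately cannot catch the most dangerous case, a semantic change behind an unchanged name:

```python
def classify_change(old: dict, new: dict) -> str:
    """Classify a schema change given {column: type} maps for old and new.
    Semantic changes keep the same name and type, so they are invisible
    here -- they need the source team's changelog, not a schema diff."""
    removed = set(old) - set(new)
    added = set(new) - set(old)
    retyped = {c for c in set(old) & set(new) if old[c] != new[c]}
    if retyped:
        return "type change"
    if removed:
        return "breaking rename or drop"
    if added:
        return "additive"
    return "no structural change"
```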

Airflow Scheduling Semantics

Anchor Rules
  • logical_date is the start of the run's data interval, not the wall-clock launch time.
  • Use data_interval_start and data_interval_end for partition filters and backfill boundaries.
  • Set timezone explicitly, usually UTC, before any date logic spreads through the DAG.
  • Choose catchup=True only when historical intervals should be scheduled automatically.
Use Event Scheduling When
  • Upstream availability matters more than a cron cadence.
  • Assets or dataset events are a cleaner contract than clock time alone.
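
One way to picture the interval rule: partition filters should be built from the half-open data interval, not wall-clock time. A sketch, where `event_ts` is a hypothetical partition column; in a real task the two datetimes come from the Airflow context (`context["data_interval_start"]` / `context["data_interval_end"]`):

```python
from datetime import datetime, timezone

def partition_filter(data_interval_start: datetime, data_interval_end: datetime) -> str:
    """Build a half-open [start, end) predicate from the run's data interval.
    Half-open boundaries mean adjacent runs never double-count a row."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        f"event_ts >= '{data_interval_start.strftime(fmt)}' "
        f"AND event_ts < '{data_interval_end.strftime(fmt)}'"
    )
```

Because the same interval is passed to a backfill run, backfills and scheduled runs produce identical predicates for identical windows.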

Ship a New dbt Model

  1. Define business grain.
  2. Write model SQL from trusted sources only.
  3. Add tests: unique, not_null, relationships, accepted_values.
  4. Use generic tests for table shape and unit tests for transformation logic when edge cases matter.
  5. Add documentation and ownership.
  6. Check lineage impact.
  7. Run locally or in CI.
  8. Validate row counts and sample records.
  9. Promote through environment flow.
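
The generic tests in step 3 can be mirrored in plain Python when spot-checking model output locally (step 8) before the dbt tests run in CI. A sketch; `order_id` and `status` in the usage below are hypothetical columns:

```python
def assert_unique(rows: list, column: str):
    """Mirror of dbt's unique test on a sample of rows."""
    vals = [r[column] for r in rows]
    assert len(vals) == len(set(vals)), f"{column} is not unique"

def assert_not_null(rows: list, column: str):
    """Mirror of dbt's not_null test."""
    assert all(r.get(column) is not None for r in rows), f"{column} has nulls"

def assert_accepted_values(rows: list, column: str, accepted: set):
    """Mirror of dbt's accepted_values test."""
    bad = {r[column] for r in rows} - set(accepted)
    assert not bad, f"{column} has unexpected values: {bad}"
```

These are spot checks only; the authoritative versions belong in the model's schema YAML so they run on every build.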

dbt Sources & Exposures

Define
  • Source owner, schema, loader, and freshness thresholds.
  • Exposures for dashboards, extracts, apps, and ML consumers.
  • Consumer-facing descriptions for the models that matter operationally.
Rule
  • Source freshness warnings are upstream contract signals, not optional noise.
  • Every important downstream consumer should be represented somewhere explicit.

Snapshot / SCD Type 2 History

Use When
  • Source rows update in place and the business needs historical state.
Choose
  • Unique key, change detection strategy, and hard-delete handling.
  • How current rows and effective date ranges will be validated.
Rule
  • Do not snapshot noisy churn unless someone actually consumes the history.
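
The core SCD Type 2 mechanics (close the old version, open the new one) can be sketched from a change log. A toy illustration with hypothetical `id`/`ts` columns, assuming timestamp-based change detection and no hard deletes:

```python
def build_scd2(changes: list, key: str, ts: str) -> dict:
    """Turn a change log into SCD Type 2 rows with effective ranges.
    Each output row carries valid_from / valid_to; the current row
    for a key has valid_to=None (i.e. still effective)."""
    history = {}
    for row in sorted(changes, key=lambda r: (r[key], r[ts])):
        versions = history.setdefault(row[key], [])
        if versions:
            versions[-1]["valid_to"] = row[ts]   # close the previous version
        versions.append({**row, "valid_from": row[ts], "valid_to": None})
    return history
```

The validation mentioned above amounts to checking that ranges per key are contiguous and exactly one row per key has an open `valid_to`.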

Spark Skew Debugging

Symptoms
  • One stage hangs forever
  • One task massively slower than others
  • Out-of-memory on one executor
Checks
  • Hot keys
  • Join type and data distribution
  • Partition count too low or too high
  • Explosive wide transformation
Fixes
  • Salting hot keys
  • Broadcast smaller side
  • Repartition intentionally
  • Pre-aggregate before join
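
The salting fix can be illustrated without Spark: append a random suffix to the hot key on the large side, and replicate the small side once per suffix so every salted variant still joins. A pure-Python sketch of the idea; in Spark you would express the same thing with column expressions and an explode:

```python
import random

def salt_key(key, num_salts: int, rng=random):
    """Spread a hot key across num_salts shuffle buckets by appending a salt."""
    return f"{key}_{rng.randrange(num_salts)}"

def explode_for_join(key, num_salts: int) -> list:
    """Replicate the smaller join side once per salt so every salted
    variant of the hot key still finds its match."""
    return [f"{key}_{i}" for i in range(num_salts)]
```

The tradeoff is explicit: the small side grows by a factor of `num_salts` in exchange for the hot key no longer landing on a single task.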

Spark Query Plan Debugging

Start With
  • EXPLAIN or df.explain("formatted") before changing code blindly.
  • Check join strategy, exchange nodes, filter pushdown, and file scans.
Spark UI
  • Read stages, tasks, shuffle read/write, spill, skewed tasks, and input sizes.
  • Compare slow stages with the parts of the plan that produced them.
AQE
  • Look for runtime join rewrites and partition coalescing before assuming the static plan tells the whole story.

Databricks Ingestion Choices

Choose
  • Auto Loader for cloud-file arrival, schema drift, and managed discovery.
  • Streaming tables for continuous ingest and incremental transforms.
  • Materialized views for repeated serving queries and aggregates.
  • Lakeflow / DLT-style pipelines when declarative quality rules and managed dependencies help the team.
Rule
  • Notebooks explore, jobs operate, declarative pipelines standardize production expectations.

Delta Table Maintenance

Before
  • Inspect table history, file counts, query patterns, and active streaming readers.
Operations
  • MERGE with deterministic keys and deduped source rows.
  • OPTIMIZE when small files or repeated predicates are hurting scans.
  • VACUUM only after retention, recovery, and time-travel expectations are clear.
After
  • Compare file counts, runtimes, and downstream query behavior.
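
The "deduped source rows" requirement for MERGE matters because Delta raises an error when multiple source rows match the same target row. A minimal pre-MERGE dedupe sketch, assuming hypothetical `id` key and `v` version columns:

```python
def dedupe_for_merge(source_rows: list, key: str, version: str) -> list:
    """Keep only the latest version per key so a MERGE sees at most
    one source row per target row."""
    latest = {}
    for row in source_rows:
        cur = latest.get(row[key])
        if cur is None or row[version] > cur[version]:
            latest[row[key]] = row
    return list(latest.values())
```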

Redshift Modern Features

Use
  • Materialized views for repeated aggregates and read-heavy serving layers.
  • Data sharing when multiple consumers need the same curated data without copy sprawl.
  • Zero-ETL when supported source integrations and lower-latency ingestion are worth the tradeoff.
Rule
  • Redshift is strong for warehouse serving, but operational APIs and OLTP systems still need their own home.

Downstream Data Product Contract

Define
  • Owner, grain, semantic meaning, freshness SLA, and downstream consumers.
  • Dashboards, extracts, apps, and ML features that depend on the dataset.
Protect
  • Metric definitions, join keys, and null behavior before changing schemas.
  • Expose dependencies through dbt exposures, catalogs, or runbooks.

Production Release Checklist

  • Schema docs updated
  • Ownership clear
  • Alerts configured
  • Data quality tests green
  • Rollback or recovery plan known
  • Cost impact understood
  • Downstream consumers notified if needed
  • Runbook updated

Daily Data Quality Checklist

  • Freshness within SLA
  • Expected partitions arrived
  • No duplicate business keys
  • Null rates within normal range
  • Volume within tolerance band
  • Business-rule checks passing
  • No silent schema drift

Incident Response

Stabilize
  • Identify severity and affected data products.
  • Stop further damage if the pipeline is corrupting data.
Communicate
  • Tell stakeholders what is known, unknown, and next update time.
Recover
  • Restore good state, replay safely, validate metrics, then reopen downstream access.
Afterward
  • Write RCA, add guardrails, and improve runbook.

Reduce Cost

  • Stop reading columns you do not need.
  • Fix partition pruning.
  • Compact small files.
  • Lower unnecessary refresh cadence.
  • Cache less, not more.
  • Separate exploratory from scheduled workloads.
  • Delete stale dev artifacts.
  • Measure query and cluster patterns before resizing.