Repeatable operating patterns for common day-to-day engineering tasks

New Source Onboarding

Goal
  • Bring a source system into the platform with clear ownership, schema, and validation.
Gather
  • Owner and escalation path
  • Business purpose and consumers
  • Grain, schema, update frequency, SLA
  • Access method: API, DB, files, CDC, events
  • Sensitive fields and retention requirements
Steps
  1. Document source contract and sample records.
  2. Choose full-refresh, incremental, CDC, or streaming pattern.
  3. Land raw data with minimal transformation.
  4. Register metadata and ownership.
  5. Add schema and freshness checks.
  6. Build cleaned / canonical layer.
Done When
  • Source has owner, tests, docs, lineage, and a recovery plan.
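
Step 5 above (schema and freshness checks) can be sketched in plain Python. This is a minimal illustration, not a real framework: the column names in EXPECTED_SCHEMA and the six-hour SLA are hypothetical, and in production these checks would live in your test or orchestration layer.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical source contract: expected columns/types and a freshness SLA.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "updated_at": datetime}
FRESHNESS_SLA = timedelta(hours=6)

def check_schema(record: dict) -> list:
    """Return a list of schema violations for one landed record."""
    problems = []
    for col, typ in EXPECTED_SCHEMA.items():
        if col not in record:
            problems.append(f"missing column: {col}")
        elif not isinstance(record[col], typ):
            problems.append(f"bad type for {col}: {type(record[col]).__name__}")
    return problems

def check_freshness(latest_ts: datetime, now: datetime = None) -> bool:
    """True when the newest landed record is within the freshness SLA."""
    now = now or datetime.now(timezone.utc)
    return now - latest_ts <= FRESHNESS_SLA
```

Running both checks on every load gives the "tests" part of the done-when criteria a concrete starting point.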

Incremental Pipeline

Choose
  • Primary key or business key
  • Watermark column or change version
  • Append-only vs merge / upsert
  • Late-data policy
Steps
  1. Load only new or changed rows from source.
  2. Stage raw extract separately.
  3. Validate record counts and duplicate risk.
  4. Apply deterministic transforms.
  5. Merge or append into target table.
  6. Persist watermark only after success.
Watch For
  • Clock drift, null watermarks, duplicate keys, replays, partial writes.
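
The steps above can be sketched as one function. This is a toy sketch under simplifying assumptions: the target is a dict keyed by primary key, `id` and `updated_at` are hypothetical key and watermark columns, and real extract/merge I/O is elided.

```python
def load_incremental(source_rows, target: dict, watermark):
    """Merge rows newer than the watermark; advance it only after success."""
    # Step 1: pull only new or changed rows (watermark filter at the source).
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    # Steps 3-4: dedupe to the latest version per key so the merge is deterministic.
    latest = {}
    for r in sorted(new_rows, key=lambda r: r["updated_at"]):
        latest[r["id"]] = r
    # Step 5: upsert into the target keyed on the primary key.
    target.update(latest)
    # Step 6: return the new watermark only after the merge succeeded.
    return max((r["updated_at"] for r in new_rows), default=watermark)
```

Note that re-running with the persisted watermark is a no-op, which is what makes replays after partial failures safe.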

Safe Backfill

Before
  • Define exact time window and impacted tables.
  • Estimate compute, runtime, and blast radius.
  • Decide whether to isolate into temp tables.
  • Pause or coordinate downstream jobs if needed.
During
  • Process in bounded chunks.
  • Track progress by partition or date window.
  • Validate each chunk before marking complete.
After
  • Reconcile counts, freshness, and business metrics.
  • Document what was reprocessed and why.
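
The "bounded chunks with tracked progress" pattern can be sketched as follows; `process` and `validate` are stand-ins for your actual reprocessing and reconciliation logic, and the seven-day chunk size is an arbitrary assumption.

```python
from datetime import date, timedelta

def backfill_chunks(start: date, end: date, days_per_chunk: int = 7):
    """Yield bounded [chunk_start, chunk_end) windows covering the backfill range."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days_per_chunk), end)
        yield cur, nxt
        cur = nxt

def run_backfill(start, end, process, validate, progress: set):
    """Process each chunk, validate it, and record progress so reruns can resume."""
    for lo, hi in backfill_chunks(start, end):
        if lo in progress:            # already completed in a previous attempt
            continue
        result = process(lo, hi)
        if not validate(result):      # stop before marking a bad chunk complete
            raise RuntimeError(f"validation failed for window {lo}..{hi}")
        progress.add(lo)
```

Persisting `progress` somewhere durable is what lets an interrupted backfill resume instead of restarting from scratch.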

Pipeline Failure Triage

First Questions
  • Did extract fail, transform fail, or load fail?
  • Is this schema drift, bad data, permissions, compute, or dependency failure?
  • Is this a one-off or a systemic issue?
Triage Order
  1. Read the failing step and timestamp.
  2. Check upstream freshness and schema changes.
  3. Check recent deploys and config changes.
  4. Inspect row counts and anomaly spikes.
  5. Decide: retry, patch forward, quarantine, or rollback.
Preventive Actions
  • Add test, alert, guardrail, or runbook update.

Bad Data Investigation

Check
  • Freshness
  • Row counts vs normal baseline
  • Duplicate-key spikes
  • Null-rate spikes
  • Values outside the accepted set
  • Changed field semantics from source team
Compare
  • Today vs yesterday
  • Raw vs cleaned layer
  • Source totals vs warehouse totals
  • One partition vs neighboring partitions
Outcome
  • Fix transform, fix contract, quarantine data, or backfill corrected history.
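
The checks above reduce to a few comparisons against a baseline. A minimal sketch, assuming rows arrive as dicts and a ±20% tolerance band (both arbitrary choices):

```python
def volume_ok(today: int, baseline: int, tolerance: float = 0.2) -> bool:
    """Row count within +/- tolerance of the normal baseline."""
    return baseline > 0 and abs(today - baseline) / baseline <= tolerance

def null_rate(rows: list, column: str) -> float:
    """Fraction of rows where the column is missing or null."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)

def duplicate_keys(rows: list, key: str) -> set:
    """Business keys that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        k = r[key]
        (dupes if k in seen else seen).add(k)
    return dupes
```

Comparing today's `null_rate` and volume against yesterday's is usually enough to separate a transform bug from an upstream contract change.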

Slow Query Tuning

Quick Wins
  • Filter early
  • Select fewer columns
  • Reduce large joins
  • Use staging / temp tables for complex steps
Warehouse Checks
  • Partition pruning working?
  • Sort key / distribution key design helping or hurting?
  • Small files or skew causing scan bloat?
  • Broadcast / shuffle decisions sensible?
Done When
  • You measured before and after, and can explain why it improved.

Schema Change Handling

Classify
  • Additive
  • Breaking rename
  • Type change
  • Semantic change with same name
Steps
  1. Identify downstream consumers.
  2. Decide compatibility strategy.
  3. Update transforms and tests.
  4. Version docs or contracts if needed.
  5. Backfill if semantics changed historically.
Rule
  • Semantic changes are more dangerous than column additions.
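
The classification step can be automated for the structural cases. A sketch, assuming schemas are available as simple column-to-type maps; note it deliberately cannot catch the most dangerous case, a semantic change behind an unchanged name:

```python
def classify_change(old: dict, new: dict) -> str:
    """Classify a schema change given {column: type} maps for old and new.
    Semantic changes keep the same name and type, so they are invisible
    here -- they need the source team's changelog, not a schema diff."""
    removed = set(old) - set(new)
    added = set(new) - set(old)
    retyped = {c for c in set(old) & set(new) if old[c] != new[c]}
    if retyped:
        return "type change"
    if removed:
        return "breaking rename or drop"
    if added:
        return "additive"
    return "no structural change"
```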

Airflow Scheduling Semantics

Anchor Rules
  • logical_date is the start of the run's data interval, not the wall-clock launch time.
  • Use data_interval_start and data_interval_end for partition filters and backfill boundaries.
  • Set timezone explicitly, usually UTC, before any date logic spreads through the DAG.
  • Choose catchup=True only when historical intervals should be scheduled automatically.
Use Event Scheduling When
  • Upstream availability matters more than a cron cadence.
  • Assets or dataset events are a cleaner contract than clock time alone.
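
One way to picture the interval rule: partition filters should be built from the half-open data interval, not wall-clock time. A sketch, where `event_ts` is a hypothetical partition column; in a real task the two datetimes come from the Airflow context (`context["data_interval_start"]` / `context["data_interval_end"]`):

```python
from datetime import datetime, timezone

def partition_filter(data_interval_start: datetime, data_interval_end: datetime) -> str:
    """Build a half-open [start, end) predicate from the run's data interval.
    Half-open boundaries mean adjacent runs never double-count a row."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        f"event_ts >= '{data_interval_start.strftime(fmt)}' "
        f"AND event_ts < '{data_interval_end.strftime(fmt)}'"
    )
```

Because the same interval is passed to a backfill run, backfills and scheduled runs produce identical predicates for identical windows.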

Ship a New dbt Model

  1. Define business grain.
  2. Write model SQL from trusted sources only.
  3. Add tests: unique, not_null, relationships, accepted_values.
  4. Use generic tests for table shape and unit tests for transformation logic when edge cases matter.
  5. Add documentation and ownership.
  6. Check lineage impact.
  7. Run locally or in CI.
  8. Validate row counts and sample records.
  9. Promote through environment flow.
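
The generic tests in step 3 can be mirrored in plain Python when spot-checking model output locally (step 8) before the dbt tests run in CI. A sketch; `order_id` and `status` in the usage below are hypothetical columns:

```python
def assert_unique(rows: list, column: str):
    """Mirror of dbt's unique test on a sample of rows."""
    vals = [r[column] for r in rows]
    assert len(vals) == len(set(vals)), f"{column} is not unique"

def assert_not_null(rows: list, column: str):
    """Mirror of dbt's not_null test."""
    assert all(r.get(column) is not None for r in rows), f"{column} has nulls"

def assert_accepted_values(rows: list, column: str, accepted: set):
    """Mirror of dbt's accepted_values test."""
    bad = {r[column] for r in rows} - set(accepted)
    assert not bad, f"{column} has unexpected values: {bad}"
```

These are spot checks only; the authoritative versions belong in the model's schema YAML so they run on every build.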

dbt Sources & Exposures

Define
  • Source owner, schema, loader, and freshness thresholds.
  • Exposures for dashboards, extracts, apps, and ML consumers.
  • Consumer-facing descriptions for the models that matter operationally.
Rule
  • Source freshness warnings are upstream contract signals, not optional noise.
  • Every important downstream consumer should be represented somewhere explicit.

Snapshot / SCD Type 2 History

Use When
  • Source rows update in place and the business needs historical state.
Choose
  • Unique key, change detection strategy, and hard-delete handling.
  • How current rows and effective date ranges will be validated.
Rule
  • Do not snapshot noisy churn unless someone actually consumes the history.
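
The core SCD Type 2 mechanics (close the old version, open the new one) can be sketched from a change log. A toy illustration with hypothetical `id`/`ts` columns, assuming timestamp-based change detection and no hard deletes:

```python
def build_scd2(changes: list, key: str, ts: str) -> dict:
    """Turn a change log into SCD Type 2 rows with effective ranges.
    Each output row carries valid_from / valid_to; the current row
    for a key has valid_to=None (i.e. still effective)."""
    history = {}
    for row in sorted(changes, key=lambda r: (r[key], r[ts])):
        versions = history.setdefault(row[key], [])
        if versions:
            versions[-1]["valid_to"] = row[ts]   # close the previous version
        versions.append({**row, "valid_from": row[ts], "valid_to": None})
    return history
```

The validation mentioned above amounts to checking that ranges per key are contiguous and exactly one row per key has an open `valid_to`.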

Spark Skew Debugging

Symptoms
  • One stage hangs forever
  • One task massively slower than others
  • Out-of-memory on one executor
Checks
  • Hot keys
  • Join type and data distribution
  • Partition count too low or too high
  • Explosive wide transformation
Fixes
  • Salting hot keys
  • Broadcast smaller side
  • Repartition intentionally
  • Pre-aggregate before join
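
The salting fix can be illustrated without Spark: append a random suffix to the hot key on the large side, and replicate the small side once per suffix so every salted variant still joins. A pure-Python sketch of the idea; in Spark you would express the same thing with column expressions and an explode:

```python
import random

def salt_key(key, num_salts: int, rng=random):
    """Spread a hot key across num_salts shuffle buckets by appending a salt."""
    return f"{key}_{rng.randrange(num_salts)}"

def explode_for_join(key, num_salts: int) -> list:
    """Replicate the smaller join side once per salt so every salted
    variant of the hot key still finds its match."""
    return [f"{key}_{i}" for i in range(num_salts)]
```

The tradeoff is explicit: the small side grows by a factor of `num_salts` in exchange for the hot key no longer landing on a single task.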

Spark Query Plan Debugging

Start With
  • EXPLAIN or df.explain("formatted") before changing code blindly.
  • Check join strategy, exchange nodes, filter pushdown, and file scans.
Spark UI
  • Read stages, tasks, shuffle read/write, spill, skewed tasks, and input sizes.
  • Compare slow stages with the parts of the plan that produced them.
AQE
  • Look for runtime join rewrites and partition coalescing before assuming the static plan tells the whole story.

Databricks Ingestion Choices

Choose
  • Auto Loader for cloud-file arrival, schema drift, and managed discovery.
  • Streaming tables for continuous ingest and incremental transforms.
  • Materialized views for repeated serving queries and aggregates.
  • Lakeflow / DLT-style pipelines when declarative quality rules and managed dependencies help the team.
Rule
  • Notebooks explore, jobs operate, declarative pipelines standardize production expectations.

Delta Table Maintenance

Before
  • Inspect table history, file counts, query patterns, and active streaming readers.
Operations
  • MERGE with deterministic keys and deduped source rows.
  • OPTIMIZE when small files or repeated predicates are hurting scans.
  • VACUUM only after retention, recovery, and time-travel expectations are clear.
After
  • Compare file counts, runtimes, and downstream query behavior.
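
The "deduped source rows" requirement for MERGE matters because Delta raises an error when multiple source rows match the same target row. A minimal pre-MERGE dedupe sketch, assuming hypothetical `id` key and `v` version columns:

```python
def dedupe_for_merge(source_rows: list, key: str, version: str) -> list:
    """Keep only the latest version per key so a MERGE sees at most
    one source row per target row."""
    latest = {}
    for row in source_rows:
        cur = latest.get(row[key])
        if cur is None or row[version] > cur[version]:
            latest[row[key]] = row
    return list(latest.values())
```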

Redshift Modern Features

Use
  • Materialized views for repeated aggregates and read-heavy serving layers.
  • Data sharing when multiple consumers need the same curated data without copy sprawl.
  • Zero-ETL when supported source integrations and lower-latency ingestion are worth the tradeoff.
Rule
  • Redshift is strong for warehouse serving, but operational APIs and OLTP systems still need their own home.

Downstream Data Product Contract

Define
  • Owner, grain, semantic meaning, freshness SLA, and downstream consumers.
  • Dashboards, extracts, apps, and ML features that depend on the dataset.
Protect
  • Metric definitions, join keys, and null behavior before changing schemas.
  • Expose dependencies through dbt exposures, catalogs, or runbooks.

Production Release Checklist

  • Schema docs updated
  • Ownership clear
  • Alerts configured
  • Data quality tests green
  • Rollback or recovery plan known
  • Cost impact understood
  • Downstream consumers notified if needed
  • Runbook updated

Daily Data Quality Checklist

  • Freshness within SLA
  • Expected partitions arrived
  • No duplicate business keys
  • Null rates within normal range
  • Volume within tolerance band
  • Business-rule checks passing
  • No silent schema drift

Incident Response

Stabilize
  • Identify severity and affected data products.
  • Stop further damage if the pipeline is corrupting data.
Communicate
  • Tell stakeholders what is known, unknown, and next update time.
Recover
  • Restore good state, replay safely, validate metrics, then reopen downstream access.
Afterward
  • Write RCA, add guardrails, and improve runbook.

Reduce Cost

  • Stop reading columns you do not need.
  • Fix partition pruning.
  • Compact small files.
  • Lower unnecessary refresh cadence.
  • Cache less, not more.
  • Separate exploratory from scheduled workloads.
  • Delete stale dev artifacts.
  • Measure query and cluster patterns before resizing.