Project · Data Engineering · ELT Pipeline

A four-stage medallion data lake pulling Spotify's global playlist data to Power BI — on a near-zero infrastructure budget.

End-to-end ELT pipeline built on DuckDB, Motherduck, dbt, and Astronomer Airflow. Spotify API data flows through a Raw → Bronze → Silver → Gold medallion architecture on AWS S3, transformed by dbt into analytics-ready dimensional models, and surfaced in Power BI — with Terraform provisioning the entire infrastructure.

Type: ELT Pipeline

Architecture: Medallion · 4-layer

Orchestration: Airflow · Daily

Source: Spotify API

AWS · Python · Airflow · DuckDB · dbt · Terraform · Motherduck · S3 · Power BI

GitHub Repository ↗

01 — Context & Architecture

A near-zero-cost data lake built on embedded analytics, not standing servers.

The “poor man's data lake” premise

Most data lake architectures require standing up a cluster — Spark, EMR, Redshift — that burns compute cost whether it's processing data or not. DuckDB breaks this constraint: it runs in-process, reads Parquet directly from S3, and costs nothing when idle.
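As a rough illustration of the in-process model, the sketch below builds a `read_parquet` query against an S3 path and runs it through the `duckdb` Python package with the `httpfs` extension loaded. The bucket path, the function names, and the row-count query are assumptions made for this example; the project's actual queries and credential setup are not shown in this write-up.

```python
def parquet_scan_sql(s3_uri: str) -> str:
    """Build a row-count query over Parquet files at an S3 URI (globs allowed)."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI")
    escaped = s3_uri.replace("'", "''")  # keep the SQL string literal well-formed
    return f"SELECT count(*) AS row_count FROM read_parquet('{escaped}')"


def count_rows(s3_uri: str) -> int:
    """Run the scan in-process. Needs the duckdb package and AWS credentials."""
    import duckdb  # imported lazily so parquet_scan_sql works without it

    con = duckdb.connect()  # in-memory database, no server process to keep alive
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    # Credential configuration is intentionally omitted from this sketch.
    return con.execute(parquet_scan_sql(s3_uri)).fetchone()[0]
```

Calling `count_rows("s3://my-bucket/bronze/tracks/*.parquet")` (a hypothetical path) would start DuckDB inside the calling Python process, scan the files, and exit with nothing left running.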

The goal was to prove that a serious, production-grade ELT pipeline with proper transformation layers, automated testing, and BI integration could be built without paying for idle infrastructure. Motherduck extends DuckDB into the cloud, giving Power BI a persistent query endpoint without the overhead of a traditional warehouse.

What the pipeline had to do

Extract at scale. The Spotify API exposes global playlist data — tracks, albums, artists — in nested JSON. The pipeline needed to handle pagination, rate limits, and schema variance across resource types.
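One way to handle pagination and rate limits together is sketched below. It assumes the paged response shape the Spotify Web API uses (an `items` list and a `next` URL that is null on the last page) and a `429` response carrying a `Retry-After` header. The `fetch` callable, the function name, and the retry count are hypothetical; the HTTP client is injected so the loop can be tested without network access.

```python
import time
from typing import Callable, Iterator, Optional

# A fetcher takes a URL and returns (status_code, headers, json_body).
Fetcher = Callable[[str], tuple[int, dict, dict]]


def iter_items(
    first_url: str,
    fetch: Fetcher,
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield every item across all pages, backing off on HTTP 429."""
    url: Optional[str] = first_url
    while url:
        for attempt in range(max_retries + 1):
            status, headers, body = fetch(url)
            if status == 200:
                break
            if status == 429 and attempt < max_retries:
                # Honour Retry-After when present, else back off exponentially.
                sleep(float(headers.get("Retry-After", 2 ** attempt)))
                continue
            raise RuntimeError(f"request failed with status {status}: {url}")
        yield from body.get("items", [])
        url = body.get("next")  # None on the last page ends the loop
```

Passing `sleep` as a parameter keeps the back-off observable in tests and avoids real delays there.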

Each raw payload then moves through a structured four-stage medallion — Raw preserves the original API response, Bronze converts to Parquet, Silver applies business-rule cleaning, and Gold delivers analytics-ready dimensional models directly queryable by Power BI.
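To make the Raw-to-Bronze step concrete, here is a minimal sketch of flattening one nested playlist item into a flat row that could be written to Parquet. The input field names (`added_at`, `track`, `album`, `artists`, `duration_ms`) follow the Spotify playlist-item response; the output column names and the function itself are assumptions, since the project's real Bronze schema is not given here.

```python
def flatten_playlist_item(item: dict) -> dict:
    """Turn one nested playlist item into a flat, column-friendly row."""
    # "track" can be null in the API response, so fall back to empty dicts.
    track = item.get("track") or {}
    album = track.get("album") or {}
    artists = track.get("artists") or []
    return {
        "added_at": item.get("added_at"),
        "track_id": track.get("id"),
        "track_name": track.get("name"),
        "duration_ms": track.get("duration_ms"),
        "album_id": album.get("id"),
        "album_name": album.get("name"),
        "artist_ids": [a.get("id") for a in artists],
    }
```

Keeping the artist ids as a list at this stage defers the decision about how to model the track-to-artist relationship to the later layers.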

Infochart 01 · Medallion Architecture

Four-stage data lake: Raw → Bronze → Silver → Gold

Each stage has one job: Raw preserves the original API response, Bronze standardises the format as Parquet, Silver cleans and enriches, and Gold delivers analytics-ready models.

02 — System Topology

Nine tools, one coherent pipeline — from API call to dashboard.

Each component has a single, well-defined job. The Spotify API is the sole data source. Airflow orchestrates but does not transform. DuckDB processes data without running a server. Terraform provisions the infrastructure without manual setup. Because the architecture is modular, a component can be replaced without rewriting the others.

Infochart 02 · End-to-End Data Flow

Source → Orchestration → Storage → Analytics

Left-to-right flow: data originates at the Spotify API, is orchestrated by Airflow, staged across four S3 zones, processed by DuckDB, warehoused in Motherduck, and queried by Power BI.

03 — Orchestration

Four modular task groups, each independently testable and retryable.

The Airflow DAG is structured so that each medallion stage maps to exactly one task group. If the silver transform fails, the bronze data is already landed and safe — the retry only re-runs what failed. This is the key resilience property the modular design provides.

Infochart 03 · Airflow DAG Structure

Task-group dependency chain · Daily @ 06:00 UTC

Each group is a self-contained Python module — it can be unit-tested independently and retried in isolation without re-running upstream stages.
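The task-group chain described above could look roughly like the following sketch, assuming Airflow 2.4 or later (where `DAG` accepts a `schedule` argument). The group names come from the deliverables list; the `dag_id`, the retry settings, the start date, and the placeholder `run_stage` callable are assumptions, not the project's actual code.

```python
from datetime import datetime, timedelta

STAGES = ["extract", "bronze", "silver", "dbt_run"]
DEFAULT_ARGS = {"retries": 2, "retry_delay": timedelta(minutes=5)}  # assumed values


def run_stage(stage: str) -> str:
    """Placeholder for the real stage logic in each module."""
    return f"{stage} done"


def build_dag():
    # Airflow is imported here so the constants above can be used without it.
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.utils.task_group import TaskGroup

    with DAG(
        dag_id="spotify_medallion",  # hypothetical name
        schedule="0 6 * * *",  # daily at 06:00 UTC
        start_date=datetime(2024, 1, 1),
        catchup=False,
        default_args=DEFAULT_ARGS,
    ) as dag:
        previous = None
        for stage in STAGES:
            with TaskGroup(group_id=stage) as group:
                PythonOperator(
                    task_id="run",
                    python_callable=run_stage,
                    op_kwargs={"stage": stage},
                )
            if previous is not None:
                previous >> group  # each group waits for the one before it
            previous = group
    return dag
```

In a real DAG file the result would be assigned at module level (for example `dag = build_dag()`) so the scheduler can discover it. Because each group only depends on the one before it, clearing a failed `silver` group re-runs `silver` and `dbt_run` without touching `extract` or `bronze`.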

04 — Transformation

Three dbt layers, each with automated test coverage.

dbt is the transformation engine for the Gold stage. Every model is tested — sources have schema tests, staging models have not-null and unique constraints, and marts carry referential integrity checks. Both dev and prod targets hit Motherduck, with full documentation generated on each run.

Infochart 04 · dbt Model Architecture

Sources → Staging → Intermediate → Marts · with automated test coverage
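dbt's generic tests compile to SQL queries that pass when they return zero rows. The sketch below imitates the three kinds of check named above (not-null, unique, and referential integrity) with hand-written queries. It uses Python's built-in `sqlite3` purely as a stand-in for Motherduck, and the table names, column names, and deliberately bad rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_artist (artist_id TEXT, name TEXT);
    CREATE TABLE fct_track (track_id TEXT, artist_id TEXT);
    INSERT INTO dim_artist VALUES ('a1', 'Artist One'), ('a2', 'Artist Two');
    INSERT INTO fct_track VALUES
        ('t1', 'a1'), ('t2', 'a2'), ('t2', 'a1'), (NULL, 'a9');
""")


def failing_rows(sql: str) -> list:
    """Return the rows that violate a check; an empty list means it passes."""
    return con.execute(sql).fetchall()


# not-null: any row with a missing key fails.
not_null = failing_rows("SELECT * FROM fct_track WHERE track_id IS NULL")

# unique: any key that appears more than once fails.
unique = failing_rows(
    "SELECT track_id FROM fct_track WHERE track_id IS NOT NULL "
    "GROUP BY track_id HAVING count(*) > 1"
)

# referential integrity: any child key with no matching parent fails.
relationships = failing_rows(
    "SELECT f.artist_id FROM fct_track f "
    "LEFT JOIN dim_artist d ON f.artist_id = d.artist_id "
    "WHERE f.artist_id IS NOT NULL AND d.artist_id IS NULL"
)
```

With the sample rows, each check returns one offending row, which is the situation in which the corresponding dbt test would fail the run.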

05 — Metrics & Cost Model

Near-zero infrastructure cost — without sacrificing engineering quality.

The “poor man's data lake” is not a shortcut — it's a deliberate architectural decision to eliminate idle compute costs. DuckDB runs only during DAG execution. S3 charges only for storage consumed. Motherduck charges only for queries run.

Medallion stages: 4

Raw → Bronze → Silver → Gold, each independently retryable and testable

dbt automated tests: 68

Schema, not-null, unique, and referential integrity checks across all model layers

Standing server cost: none

No cluster, no running instance. DuckDB only burns compute during the daily DAG run

Infochart 05 · Infrastructure Cost Model

This architecture vs. a traditional data warehouse stack

The cost advantage is structural, not incidental — it comes from eliminating idle compute, not from cutting features or quality.
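The structural argument can be put as simple arithmetic: an always-on cluster bills for every hour of the month, while an embedded engine bills only for the time the DAG is running. The sketch below compares billed hours instead of prices, so no vendor rates are assumed. The function names and the 10-minute daily run in the usage note are hypothetical placeholders, not measurements from this project.

```python
def on_demand_hours(runs_per_day: float, minutes_per_run: float, days: int = 30) -> float:
    """Compute hours consumed when the engine runs only during DAG execution."""
    return runs_per_day * minutes_per_run * days / 60


def always_on_hours(days: int = 30) -> float:
    """Compute hours billed by a cluster that never shuts down."""
    return 24.0 * days


def idle_fraction(runs_per_day: float, minutes_per_run: float, days: int = 30) -> float:
    """Share of always-on hours during which no pipeline work is happening."""
    return 1 - on_demand_hours(runs_per_day, minutes_per_run, days) / always_on_hours(days)
```

For a hypothetical single 10-minute run per day, `on_demand_hours(1, 10)` is 5 hours against `always_on_hours()` of 720, so more than 99% of the always-on hours would be idle.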

06 — Deliverables

Six artifacts — every one production-grade and reproducible.

The pipeline is open-source. Each component is self-contained, documented, and independently deployable. Terraform ensures the infrastructure is reproducible from a single terraform apply.

Airflow DAG (Python)

Four modular task groups (extract → bronze → silver → dbt_run) with XCom-based data passing between groups, retry logic, and independent unit tests per group.

dbt Project (staging → intermediate → marts)

68 automated tests across sources and the three model layers (staging, intermediate, marts). Docs generated on every run. Dev and prod Motherduck targets configured in profiles.yml.

Terraform Module (IaC)

S3 bucket definitions with versioning and lifecycle rules. Motherduck external table DDL auto-generated for all four medallion zones.

Power BI Dashboard (3 report pages)

Playlist trend analysis, artist reach by market, and track-level streaming metrics — all querying Motherduck directly via DirectQuery.

README & Architecture Decision Log

Step-by-step local setup, environment variable reference, Spotify API credential configuration, and the rationale behind every major design choice.

DuckDB Extraction Layer (Python)

Reusable Python modules for each Spotify resource type — handles pagination, rate-limit back-off, and schema variance across API versions.

“The best data lake is the one you can actually afford to run. Embed the engine, eliminate the server, pay only for what you process, and spend the savings on engineering quality instead.”