Back to Portfolio
Project · Data Engineering · ELT Pipeline

A four-stage medallion data lake pulling Spotify's global playlist data to Power BI — on a near-zero infrastructure budget.

End-to-end ELT pipeline built on DuckDB, Motherduck, dbt, and Astronomer Airflow. Spotify API data flows through a Raw → Bronze → Silver → Gold medallion architecture on AWS S3, transformed by dbt into analytics-ready dimensional models, and surfaced in Power BI — with Terraform provisioning the entire infrastructure.

TypeELT Pipeline
ArchitectureMedallion · 4-layer
OrchestrationAirflow · Daily
SourceSpotify API
AWSPythonAirflowDuckDBdbtTerraformMotherduckS3Power BI
01 — Context & Architecture

A cost-zero data lake built on embedded analytics, not standing servers.

The “poor man's data lake” premise

Most data lake architectures require standing up a cluster — Spark, EMR, Redshift — that burns compute cost whether it's processing data or not. DuckDB breaks this constraint: it runs in-process, reads Parquet directly from S3, and costs nothing when idle.

The goal was to prove that a serious, production-grade ELT pipeline with proper transformation layers, automated testing, and BI integration could be built without paying for idle infrastructure. Motherduck extends DuckDB into the cloud, giving Power BI a persistent query endpoint without the overhead of a traditional warehouse.

What the pipeline had to do

Extract at scale. The Spotify API exposes global playlist data — tracks, albums, artists — in nested JSON. The pipeline needed to handle pagination, rate limits, and schema variance across resource types.

Each raw payload then moves through a structured four-stage medallion — Raw preserves the original API response, Bronze converts to Parquet, Silver applies business-rule cleaning, and Gold delivers analytics-ready dimensional models directly queryable by Power BI.

Infochart 01 · Medallion Architecture
Four-stage data lake: Raw → Bronze → Silver → Gold
Each stage transforms fidelity — preserving the original at Raw, standardising format at Bronze, enriching at Silver, and delivering analytics-ready models at Gold.
ELT Medallion Pipeline · AWS S3 Storage Zones · Daily Batch via AirflowRawFormatJSON (raw API response)ToolSpotify API extractorStorageS3 · raw/ prefixTracks · Albums · ArtistsPlaylists — immutable landingBronzeFormatParquet (columnar)ToolDuckDB · JSON → ParquetStorageS3 · bronze/ prefixSchema enforced, typed,compressed — read-readySilverFormatParquet (enriched)ToolDuckDB · business rulesStorageS3 · silver/ prefixDeduped, normalised,null-handled, join-readyGoldFormatDimensional tablesTooldbt + MotherduckStorageMotherduck + S3 gold/Power BI reads directlyfrom Motherduck endpoint→ Power BI
02 — System Topology

Nine tools, one coherent pipeline — from API call to dashboard.

Each component has a single, well-defined job. Spotify API is the only source of truth. Airflow orchestrates without being the transformation engine. DuckDB processes without running a server. Terraform provisions without manual state. The architecture is modular: any component can be swapped independently.

Infochart 02 · End-to-End Data Flow
Source → Orchestration → Storage → Analytics
Left-to-right flow: data originates at the Spotify API, is orchestrated by Airflow, staged across four S3 zones, processed by DuckDB, warehoused in Motherduck, and queried by Power BI.
SourceOrchestrationStorage (AWS S3)WarehouseAnalyticsSpotifyAPITracksAlbumsArtists · PlaylistsAstronomerAirflowDaily · 06:00 UTCextract_spotifybronze_transformsilver_transformdbt_runDAG retries: 2 · timeout: 2hModular Python task groupswith independent testingS3 RawJSON · by dateS3 BronzeParquet · typedS3 SilverParquet · enrichedDuckDBIn-process · zero serverJSON → Parquet (bronze)Reads from S3Clean + enrich (silver)No idle costMotherduckCloud DuckDB · External tablesdim_artist · dim_trackfct_stream · fct_playlistTerraform (IaC)Provisions S3 buckets+ Motherduck ext. tablesPower BIAnalytics Layer · 3 report pagesPlaylist trendsArtist reach · Track metrics
03 — Orchestration

Four modular task groups, each independently testable and retryable.

The Airflow DAG is structured so that each medallion stage maps to exactly one task group. If the silver transform fails, the bronze data is already landed and safe — the retry only re-runs what failed. This is the key resilience property the modular design provides.

Infochart 03 · Airflow DAG Structure
Task-group dependency chain · Daily @ 06:00 UTC
Each group is a self-contained Python module — it can be unit-tested independently and retried in isolation without re-running upstream stages.
spotify_daily_dagschedule: @daily · start_date: 2024-01-01 · retries: 2 · retry_delay: 5minAstronomer · Airflow 2.8extract_spotifyTask Group 01get_tracks()get_albums()get_artists()get_playlists()→ raw JSON to S3 raw/bronze_transformTask Group 02read_raw_s3()json_to_parquet()enforce_schema()write_bronze_s3()DuckDB · in-processsilver_transformTask Group 03read_bronze_s3()deduplicate()normalise_nulls()write_silver_s3()DuckDB · enrichment rulesdbt_runTask Group 04 · Gold Stagedbt run --select stagingdbt run --select intermediatedbt run --select martsdbt testMotherduck · dev + prod targetsRetry & failure isolation• Each task group runs only after its upstream group succeeds — enforced by XCom passing, no implicit data coupling.• On failure, Airflow retries twice. If exhausted, prior zone data remains safe — only the failed stage replays on the next run.
04 — Transformation

Three dbt layers, each with automated test coverage.

dbt is the transformation engine for the Gold stage. Every model is tested — sources have schema tests, staging models have not-null and unique constraints, and marts carry referential integrity checks. Both dev and prod targets hit Motherduck, with full documentation generated on each run.

Infochart 04 · dbt Model Architecture
Sources → Staging → Intermediate → Marts · with automated test coverage
dbt-core · Motherduck target · dev + prod environmentsLayerModelsTest CoverageTestsSourcesParquet via Motherduckexternal tablessource_spotify.trackssource_spotify.albumssource_spotify.artistssource_spotify.playlistsSchema · not-null6 testsFreshness checks4 testsStaging1-to-1 sourcemodelsstg_trackstype casting · column renaming · null defaultsstg_albums · stg_artists · stg_playlistsstandardised grain · ready for joiningnot_null · unique (all models)18 testsaccepted_values8 testsIntermediatebusiness-ruleenrichmentint_tracks_enrichedjoins track + album + artist · adds popularity decileint_playlist_tracksflattens nested playlist ↔ track relationshipnot_null · unique · relationships12 testsMartsanalytics-readydimensional modelsdim_artist · dim_track · dim_albumslowly-changing dimensions · SCD type 1fct_stream · fct_playlist_snapshotfact grain: one row per stream event / snapshot datenot_null · unique · relationships · accepted_values24 testsDev + Prod targetsprofiles.yml configures two Motherduck targets — dev for iteration, prod for Power BI.
05 — Metrics & Cost Model

Near-zero infrastructure cost — without sacrificing engineering quality.

The “poor man's data lake” is not a shortcut — it's a deliberate architectural decision to eliminate idle compute costs. DuckDB runs only during DAG execution. S3 charges only for storage consumed. Motherduck charges only for queries run.

4
Medallion stages
Raw → Bronze → Silver → Gold, each independently retryable and testable
68
dbt automated tests
Schema, not-null, unique, and referential integrity checks across all model layers
$0
Standing server cost
No cluster, no running instance — DuckDB only burns compute during the daily DAG run
Infochart 05 · Infrastructure Cost Model
This architecture vs. a traditional data warehouse stack
The cost advantage is structural, not incidental — it comes from eliminating idle compute, not from cutting features or quality.
Estimated monthly infrastructure cost (USD)Traditional Data Warehouse StackRedshift + EMR + RDS + Airflow EC2Redshift (2-node)~ $360/moEMR Spark (small cluster)~ $240/moRDS PostgreSQL (t3.medium)~ $140/moAirflow on EC2 (m5.large)~ $70/moTotal$810per month — whether or not data is being processedThis ArchitectureDuckDB + S3 + Motherduck + AstronomerAWS S3 (10 GB data)~ $2/moDuckDB (in-process)$0/moMotherduck~ $2/moAstronomer (dev tier)~ $0/moTotal~ $4per month — scales with data volumeWhy the cost gap is structural, not incidentalTraditional stacks pay for idle capacity 24/7. DuckDB, S3, and Motherduckcharge only for what runs — same transformation quality, 200× lower cost.200× cost reduction$810/mo traditional → ~$4/moat equivalent engineering quality
06 — Deliverables

Six artifacts — every one production-grade and reproducible.

The pipeline is open-source. Each component is self-contained, documented, and independently deployable. Terraform ensures the infrastructure is reproducible from a single terraform apply.

Airflow DAG (Python)
Four modular task groups (extract → bronze → silver → dbt_run) with XCom-based dependency passing, retry logic, and independent unit tests per group.
dbt Project (staging → intermediate → marts)
68 automated tests across four model layers. Docs generated on every run. Dev and prod Motherduck targets configured in profiles.yml.
Terraform Module (IaC)
S3 bucket definitions with versioning and lifecycle rules. Motherduck external table DDL auto-generated for all four medallion zones.
Power BI Dashboard (3 report pages)
Playlist trend analysis, artist reach by market, and track-level streaming metrics — all querying Motherduck directly via DirectQuery.
README & Architecture Decision Log
Step-by-step local setup, environment variable reference, Spotify API credential configuration, and the rationale behind every major design choice.
DuckDB Extraction Layer (Python)
Reusable Python modules for each Spotify resource type — handles pagination, rate-limit back-off, and schema variance across API versions.

The best data lake is the one you can actually afford to run. Embed the engine, eliminate the server, charge only for what you process — and spend the savings on engineering quality instead.