Project · Data Engineering · ETL Pipeline

A modular Python ETL that extracts, validates, and sinks e-commerce data to three configurable destinations.

Built on BigQuery's public thelook_ecommerce dataset, this pipeline enforces type safety via Pydantic models, processes data in-process via DuckDB, and routes results to CSV, Amazon S3, or MotherDuck based on runtime configuration — with every component independently testable.

TypeETL Pipeline

SourceBigQuery Public

EngineDuckDB (in-process)

Destinations3 · CSV · S3 · MotherDuck

PythonBigQueryPydanticDuckDBAmazon S3MotherDuckdbt

GitHub Repository ↗

01 — Context & Architecture

Extract once, validate strictly, sink anywhere.

The design problem

Most ETL scripts couple the extraction, transformation, and loading steps into a single procedural blob — making each step impossible to test independently and every destination change a surgery. The goal here was to build a pipeline where each stage is a self-contained Python module: the BigQuery extractor doesn't know about DuckDB, the Pydantic validator doesn't know about S3, and the sink layer is selected at runtime, not hardcoded.

The e-commerce dataset from BigQuery's public catalog provides a realistic multi-table workload — orders, order items, users, products, inventory, and events — at a volume that exercises real ingestion patterns without requiring private credentials.

Why Pydantic validation in the middle

Data contracts at the boundary. The BigQuery extractor returns unvalidated dicts. Pydantic models intercept each record before it enters DuckDB, enforcing types, rejecting nulls in non-nullable columns, and flagging business-rule violations — all before bad data can corrupt the analytical layer.

Validation failures are logged with row context, not swallowed silently. The pipeline can be configured to fail-fast on the first invalid record (strict mode) or quarantine invalid rows and continue (tolerant mode), depending on the sink destination's sensitivity.

Infochart 01 · End-to-End ETL Flow

BigQuery → Extract → Validate → DuckDB → Sink

Each stage is an independent Python module — no stage has a direct import from the next. The sink destination is the only runtime-variable component.

02 — Source Schema

Six tables from BigQuery's thelook_ecommerce public dataset.

The pipeline is table-agnostic — any table name is valid input to the BigQuery extractor. The six tables below represent the core e-commerce domain: transactional (orders, order_items), dimensional (users, products), operational (inventory_items), and behavioural (events).

Infochart 02 · Source Schema — thelook_ecommerce

Six e-commerce tables · key columns · row counts

The BigQuery extractor accepts any table name from this dataset. Pydantic model classes correspond 1-to-1 with each table.

03 — Pydantic Validation Layer

Type safety at the ingestion boundary — before a single record enters DuckDB.

Pydantic v2 models act as the contract between the BigQuery extractor and the DuckDB loader. Every record is passed through the relevant model class — invalid records are intercepted, logged, and either rejected or quarantined before they can corrupt the analytical layer.

Infochart 03 · Pydantic Validation — Order Model Example

Input dict → OrderModel → valid record OR validation error

Each source table has a corresponding Pydantic model class. Validation runs per-record at ingestion time, not batch-after-load.

04 — Multi-Destination Sink

One pipeline, three output targets — selected at runtime.

After DuckDB processes the validated records, the sink adapter writes the output to whichever destination is specified at runtime. Each adapter is a separate Python class implementing a common SinkAdapter interface — swapping destinations requires no changes to the extraction or validation layers.

Infochart 04 · Sink Destination Comparison

CSV · Amazon S3 · MotherDuck — when to use each

Each sink is independently configurable via environment variables. The pipeline exports DuckDB query results — the output is always post-validation and post-DuckDB processing.

05 — Component Architecture

Four independent modules, zero cross-imports between stages.

Each component is a Python module with a single public interface. The orchestrator (main.py) is the only file that imports from multiple modules — ensuring that each component can be unit-tested in isolation, and that a change in one module cannot break another.

Python modules

BigQuery extractor, Pydantic validator, DuckDB loader, and multi-sink adapter — independently testable

Output destinations

CSV, Amazon S3, MotherDuck — selected at runtime via --sink flag, no code change required

Source tables

orders, order_items, users, products, inventory_items, events — each with a corresponding Pydantic model class

Infochart 05 · Module Dependency Map

main.py orchestrates — modules never import each other

The dependency inversion keeps each module independently testable. Unit tests for the BigQuery extractor mock the BQ client; tests for the Pydantic validator use hardcoded dicts; tests for sinks use in-memory DuckDB relations.

06 — Deliverables

A reproducible, independently-testable ETL codebase.

Every component ships with unit tests, environment variable configuration, and a README covering local setup and sink-specific credential requirements.

BigQuery Extractor (bq_client.py)

Table-agnostic extractor using the BigQuery Python client. Returns validated list[dict] output. Auth via service account JSON or ADC.

Pydantic Models (models.py)

One BaseModel subclass per source table. Type coercion, enum validation, nullability enforcement, and strict/tolerant mode configuration.

DuckDB Loader (duckdb_loader.py)

In-process analytical engine. Receives validated records, creates in-memory DuckDB tables, exposes DuckDBRelation for sink consumption.

Multi-Sink Adapters (sinks.py)

CSVSink, S3Sink, and MotherDuckSink — all implementing a common SinkAdapter protocol. Runtime-selectable via --sink CLI flag.

Unit Test Suite (pytest)

Independent tests per module. BigQuery mocked with unittest.mock. Validator tests use hardcoded valid/invalid dicts. Sink tests use in-memory DuckDB.

README + .env template

Local setup guide, GCP service account configuration, AWS credential setup, MotherDuck token setup, and example CLI invocations for each sink.

“

A pipeline is only as trustworthy as the data that enters it. Validate at the boundary, process in-memory, and make the output destination a runtime decision — not a hardcoded assumption.