A modular Python ETL that extracts, validates, and sinks e-commerce data to three configurable destinations.
Built on BigQuery's public thelook_ecommerce dataset, this pipeline enforces type safety via Pydantic models, processes data in-process via DuckDB, and routes results to CSV, Amazon S3, or MotherDuck based on runtime configuration — with every component independently testable.
Extract once, validate strictly, sink anywhere.
The design problem
Most ETL scripts couple the extraction, transformation, and loading steps into a single procedural blob — making each step impossible to test independently and every destination change a surgery. The goal here was to build a pipeline where each stage is a self-contained Python module: the BigQuery extractor doesn't know about DuckDB, the Pydantic validator doesn't know about S3, and the sink layer is selected at runtime, not hardcoded.
The e-commerce dataset from BigQuery's public catalog provides a realistic multi-table workload — orders, order items, users, products, inventory, and events — at a volume that exercises real ingestion patterns without requiring private credentials.
Why Pydantic validation in the middle
Data contracts at the boundary. The BigQuery extractor returns unvalidated dicts. Pydantic models intercept each record before it enters DuckDB, enforcing types, rejecting nulls in non-nullable columns, and flagging business-rule violations — all before bad data can corrupt the analytical layer.
Validation failures are logged with row context, not swallowed silently. The pipeline can be configured to fail-fast on the first invalid record (strict mode) or quarantine invalid rows and continue (tolerant mode), depending on the sink destination's sensitivity.
Six tables from BigQuery's thelook_ecommerce public dataset.
The pipeline is table-agnostic — any table name is valid input to the BigQuery extractor. The six tables below represent the core e-commerce domain: transactional (orders, order_items), dimensional (users, products), operational (inventory_items), and behavioural (events).
Type safety at the ingestion boundary — before a single record enters DuckDB.
Pydantic v2 models act as the contract between the BigQuery extractor and the DuckDB loader. Every record is passed through the relevant model class — invalid records are intercepted, logged, and either rejected or quarantined before they can corrupt the analytical layer.
One pipeline, three output targets — selected at runtime.
After DuckDB processes the validated records, the sink adapter writes the output to whichever destination is specified at runtime. Each adapter is a separate Python class implementing a common SinkAdapter interface — swapping destinations requires no changes to the extraction or validation layers.
Four independent modules, zero cross-imports between stages.
Each component is a Python module with a single public interface. The orchestrator (main.py) is the only file that imports from multiple modules — ensuring that each component can be unit-tested in isolation, and that a change in one module cannot break another.
A reproducible, independently-testable ETL codebase.
Every component ships with unit tests, environment variable configuration, and a README covering local setup and sink-specific credential requirements.
A pipeline is only as trustworthy as the data that enters it. Validate at the boundary, process in-memory, and make the output destination a runtime decision — not a hardcoded assumption.