Back to Portfolio
Project · Data Engineering · ETL Pipeline

A modular Python ETL that extracts, validates, and sinks e-commerce data to three configurable destinations.

Built on BigQuery's public thelook_ecommerce dataset, this pipeline enforces type safety via Pydantic models, processes data in-process via DuckDB, and routes results to CSV, Amazon S3, or MotherDuck based on runtime configuration — with every component independently testable.

TypeETL Pipeline
SourceBigQuery Public
EngineDuckDB (in-process)
Destinations3 · CSV · S3 · MotherDuck
PythonBigQueryPydanticDuckDBAmazon S3MotherDuckdbt
01 — Context & Architecture

Extract once, validate strictly, sink anywhere.

The design problem

Most ETL scripts couple the extraction, transformation, and loading steps into a single procedural blob — making each step impossible to test independently and every destination change a surgery. The goal here was to build a pipeline where each stage is a self-contained Python module: the BigQuery extractor doesn't know about DuckDB, the Pydantic validator doesn't know about S3, and the sink layer is selected at runtime, not hardcoded.

The e-commerce dataset from BigQuery's public catalog provides a realistic multi-table workload — orders, order items, users, products, inventory, and events — at a volume that exercises real ingestion patterns without requiring private credentials.

Why Pydantic validation in the middle

Data contracts at the boundary. The BigQuery extractor returns unvalidated dicts. Pydantic models intercept each record before it enters DuckDB, enforcing types, rejecting nulls in non-nullable columns, and flagging business-rule violations — all before bad data can corrupt the analytical layer.

Validation failures are logged with row context, not swallowed silently. The pipeline can be configured to fail-fast on the first invalid record (strict mode) or quarantine invalid rows and continue (tolerant mode), depending on the sink destination's sensitivity.

Infochart 01 · End-to-End ETL Flow
BigQuery → Extract → Validate → DuckDB → Sink
Each stage is an independent Python module — no stage has a direct import from the next. The sink destination is the only runtime-variable component.
BigQueryPublic Datasetordersorder_items · usersproducts · eventsBigQueryExtractorbq_client.pyTable name as inputReturns list[dict]Pagination handledAuth via service accountPydanticValidatormodels.pyType enforcementNot-null checksBusiness-rule validationStrict or tolerant modeDuckDBIn-process engineValidated records loadedIn-memory analytical DBNo idle server costLocal CSVfilesystem sink · dev / testingAmazon S3object storage · data lake integrationMotherDuckcloud DuckDB · analytical query endpointDestination selected at runtime · --sink csv | s3 | motherduck
02 — Source Schema

Six tables from BigQuery's thelook_ecommerce public dataset.

The pipeline is table-agnostic — any table name is valid input to the BigQuery extractor. The six tables below represent the core e-commerce domain: transactional (orders, order_items), dimensional (users, products), operational (inventory_items), and behavioural (events).

Infochart 02 · Source Schema — thelook_ecommerce
Six e-commerce tables · key columns · row counts
The BigQuery extractor accepts any table name from this dataset. Pydantic model classes correspond 1-to-1 with each table.
ordersorder_id (PK)user_id (FK)status · created_atnum_of_itemgender · traffic_source~125K rowsorder_itemsid (PK)order_id (FK) · user_id (FK)product_id (FK)sale_price · statusshipped_at · returned_at~250K rowsusersid (PK)first_name · last_nameemail · gender · agecountry · city · statetraffic_source · created_at~100K rowsproductsid (PK)name · brand · categorydepartment · skuretail_price · costdistribution_center_id~29K rowsinventory_itemsid (PK)product_id (FK)product_categorycost · sold_atcreated_at · sold_at~500K rowseventsid (PK)user_id (FK)event_type · session_idbrowser · os · ip_addressuri · created_at~2.5M rowsLegendPrimary extractor targetSecondary / dimensionFK join relationship
03 — Pydantic Validation Layer

Type safety at the ingestion boundary — before a single record enters DuckDB.

Pydantic v2 models act as the contract between the BigQuery extractor and the DuckDB loader. Every record is passed through the relevant model class — invalid records are intercepted, logged, and either rejected or quarantined before they can corrupt the analytical layer.

Infochart 03 · Pydantic Validation — Order Model Example
Input dict → OrderModel → valid record OR validation error
Each source table has a corresponding Pydantic model class. Validation runs per-record at ingestion time, not batch-after-load.
Raw dict from BigQuery"order_id": "12345""user_id": "789""status": "Complete""num_of_item": "3""created_at": null⚠ null in non-nullable fieldstrings not yet coerced to typesOrderModel (Pydantic v2)class OrderModel(BaseModel): order_id: int user_id: int status: OrderStatus num_of_item: int created_at: datetime gender: str | None = NoneValidation rules▸ str → int coercion (order_id, user_id)▸ status must be in OrderStatus enum▸ created_at: non-nullable — FAILS here▸ gender: optional, default Nonemodel_config = ConfigDict(coerce_numbers_to_str=False)Valid record → DuckDBorder_id: 12345 (int)status: OrderStatus.COMPLETE (enum)num_of_item: 3 (int) · gender: NoneValidationError → logged + actionValidationError: created_at Input should be a valid datetime [null]strict mode: pipeline haltstolerant mode: row quarantined, pipeline continuesRuntime flag: --mode strict | tolerant
04 — Multi-Destination Sink

One pipeline, three output targets — selected at runtime.

After DuckDB processes the validated records, the sink adapter writes the output to whichever destination is specified at runtime. Each adapter is a separate Python class implementing a common SinkAdapter interface — swapping destinations requires no changes to the extraction or validation layers.

Infochart 04 · Sink Destination Comparison
CSV · Amazon S3 · MotherDuck — when to use each
Each sink is independently configurable via environment variables. The pipeline exports DuckDB query results — the output is always post-validation and post-DuckDB processing.
Sink adapter · output format · latency · use case · configurationSinkFormatLatencyBest forConfigCSVfilesystem sink.csv (UTF-8)via DuckDB COPYImmediateno network I/OLocal development · testingSchema validation checksOffline environmentsOUTPUT_PATH=./data/--sink csvAmazon S3object storageParquet (snappy)partitioned by date~2–5 secper batch uploadData lake integrationDownstream Glue / AthenaLong-term archivalAWS_BUCKET · AWS_KEY--sink s3MotherDuckcloud DuckDBDuckDB tablepersistent · queryable~1–3 secper ATTACH + INSERTAnalytical query endpointBI tool connection (Power BI)Shareable persistent resultsMOTHERDUCK_TOKEN=...--sink motherduckSinkAdapter interface— all three sinks implement write(df: duckdb.DuckDBPyRelation, table_name: str) → None. Adding a fourth sink requires no changes to the extractor or validator.
05 — Component Architecture

Four independent modules, zero cross-imports between stages.

Each component is a Python module with a single public interface. The orchestrator (main.py) is the only file that imports from multiple modules — ensuring that each component can be unit-tested in isolation, and that a change in one module cannot break another.

4
Python modules
BigQuery extractor, Pydantic validator, DuckDB loader, and multi-sink adapter — independently testable
3
Output destinations
CSV, Amazon S3, MotherDuck — selected at runtime via --sink flag, no code change required
6
Source tables
orders, order_items, users, products, inventory_items, events — each with a corresponding Pydantic model class
Infochart 05 · Module Dependency Map
main.py orchestrates — modules never import each other
The dependency inversion keeps each module independently testable. Unit tests for the BigQuery extractor mock the BQ client; tests for the Pydantic validator use hardcoded dicts; tests for sinks use in-memory DuckDB relations.
main.pyOrchestrator · CLI entry pointargparse: --table · --sink · --modebq_client.pyBigQuery Extractorfetch_table(table_name) → list[dict]BigQueryClient(credentials)mocked in unit testsmodels.pyPydantic ValidatorOrderModel, UserModel, …validate(records, model_class)tested with hardcoded dictsduckdb_loader.pyDuckDB Loaderload(records) → DuckDBRelationIn-memory table creationtested with in-memory DBsinks.pySink AdaptersCSVSink · S3Sink · MotherDuckSinkwrite(relation, table_name)tested with temp DuckDB relationDependency rule— bq_client, models, duckdb_loader, and sinks never import from each other. Only main.py holds cross-module dependencies — enforced via import linting in CI.
06 — Deliverables

A reproducible, independently-testable ETL codebase.

Every component ships with unit tests, environment variable configuration, and a README covering local setup and sink-specific credential requirements.

BigQuery Extractor (bq_client.py)
Table-agnostic extractor using the BigQuery Python client. Returns validated list[dict] output. Auth via service account JSON or ADC.
Pydantic Models (models.py)
One BaseModel subclass per source table. Type coercion, enum validation, nullability enforcement, and strict/tolerant mode configuration.
DuckDB Loader (duckdb_loader.py)
In-process analytical engine. Receives validated records, creates in-memory DuckDB tables, exposes DuckDBRelation for sink consumption.
Multi-Sink Adapters (sinks.py)
CSVSink, S3Sink, and MotherDuckSink — all implementing a common SinkAdapter protocol. Runtime-selectable via --sink CLI flag.
Unit Test Suite (pytest)
Independent tests per module. BigQuery mocked with unittest.mock. Validator tests use hardcoded valid/invalid dicts. Sink tests use in-memory DuckDB.
README + .env template
Local setup guide, GCP service account configuration, AWS credential setup, MotherDuck token setup, and example CLI invocations for each sink.

A pipeline is only as trustworthy as the data that enters it. Validate at the boundary, process in-memory, and make the output destination a runtime decision — not a hardcoded assumption.