BLACKLAKE
REF-ANAL
Project artifact

Market data pipeline modernisation

A quantitative analytics platform had unpredictable refresh cycles and escalating warehouse costs. Downstream models missed their refresh windows. We restructured the pipeline for predictable time-to-data.

Latency · Cost · Reliability
Data Engineering · BigQuery · Analytics
Industry
Financial analytics
Timeline
3 months
Executive skim
Three measured signals
Pipeline runtime
12hr → 20min
97% reduction in end-to-end runtime
Warehouse cost
Stabilised
Predictable monthly spend through incremental processing
Interactive queries
<5 seconds
On-demand slices for ad-hoc investigation
System sketch

Context

A production analytics pipeline ingested and transformed daily market data for time-sensitive quantitative models.

Constraint

Time-to-data had to become predictable without increasing scan cost or breaking downstream data contracts.

Intervention

Reshaped the pipeline into staged transforms with incremental processing. Aligned partitioning and clustering to access patterns. Replaced deeply nested queries with materialised intermediate steps.
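
A minimal sketch of that shape, assuming the BigQuery Python client and hypothetical dataset, table, and column names (analytics.market_ticks_raw feeding analytics.market_ticks_staged); the point is the partition-aligned, incremental step, not the client's actual schema.

```python
from datetime import date

from google.cloud import bigquery

client = bigquery.Client()

# Materialised intermediate table, partitioned by trade date and clustered on
# the columns that interactive queries filter by most often.
create_staged = """
CREATE TABLE IF NOT EXISTS analytics.market_ticks_staged
PARTITION BY trade_date
CLUSTER BY instrument_id, venue
AS SELECT * FROM analytics.market_ticks_raw WHERE FALSE
"""
client.query(create_staged).result()

# Daily incremental step: touch only the current partition instead of
# rescanning history, which is what keeps scan cost flat and predictable.
incremental_load = """
INSERT INTO analytics.market_ticks_staged
SELECT *
FROM analytics.market_ticks_raw
WHERE trade_date = @run_date
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("run_date", "DATE", date.today())
    ]
)
client.query(incremental_load, job_config=job_config).result()
```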

Key decisions

  • Partitioning aligned to access patterns
  • Staged transforms replacing nested queries
  • Incremental processing for cost control
  • Idempotent ingestion handling (sketched after this list)
  • Orchestration with retry visibility
  • Automated data quality checks
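
One way to make ingestion idempotent, for example, is a MERGE keyed on a natural key, so a retried or re-run day is a no-op rather than a duplicate load. Table, key, and column names below are again illustrative, not the client's schema.

```python
from datetime import date

from google.cloud import bigquery

client = bigquery.Client()

# Re-running a day's load cannot duplicate rows: MERGE only inserts source
# rows whose natural key (trade_date, instrument_id, ts) is not yet present.
merge_sql = """
MERGE analytics.market_ticks_staged AS target
USING (
  SELECT *
  FROM analytics.market_ticks_raw
  WHERE trade_date = @run_date
) AS source
ON  target.trade_date = source.trade_date
AND target.instrument_id = source.instrument_id
AND target.ts = source.ts
WHEN NOT MATCHED THEN
  INSERT ROW
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("run_date", "DATE", date.today())
    ]
)
client.query(merge_sql, job_config=job_config).result()
```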

Outcomes

Batch critical path dropped from ~4 hours to ~35 minutes. On-demand slices returned in seconds. Scan costs stabilised.

Why it matters

Fresher model inputs, fewer missed refresh windows, and predictable cloud spend—without increasing operator burden.

Implementation

Practical technology choices that matched the constraints.

BigQuery · Python · dbt · Airflow · Pub/Sub · Terraform · Dataform
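
As a sketch of how the orchestration and quality-check decisions could be wired together, assuming a recent Airflow 2.x with the Google provider installed: the DAG id, schedule, and inline SQL are placeholders rather than the production definitions.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCheckOperator,
    BigQueryInsertJobOperator,
)

default_args = {
    # Retries surface in the scheduler UI instead of hiding in ad-hoc reruns.
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="market_data_daily",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",  # once per day, after the raw feed has landed
    catchup=False,
    default_args=default_args,
) as dag:
    # Incremental load for the run's logical date; pairing this with the
    # idempotent MERGE pattern above keeps retries safe.
    stage = BigQueryInsertJobOperator(
        task_id="stage_ticks",
        configuration={
            "query": {
                "query": (
                    "INSERT INTO analytics.market_ticks_staged "
                    "SELECT * FROM analytics.market_ticks_raw "
                    "WHERE trade_date = '{{ ds }}'"
                ),
                "useLegacySql": False,
            }
        },
    )

    # Automated data quality gate: fail the run if the partition came up empty.
    row_count_check = BigQueryCheckOperator(
        task_id="row_count_check",
        sql=(
            "SELECT COUNT(*) > 0 FROM analytics.market_ticks_staged "
            "WHERE trade_date = '{{ ds }}'"
        ),
        use_legacy_sql=False,
    )

    stage >> row_count_check
```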

Discuss a similar system

If this resembles your constraints, share a short description of what you run today and what needs to change.