Banking & Financial Services

UAE Banking Institution

Medallion Lakehouse with Config-Driven Data Quality

Databricks · Kafka · Delta Lake · Terraform · Data Quality

Key Results

85% reduction in data quality incidents; config-driven rules managed by business analysts

The Challenge

The bank operated multiple disconnected data systems feeding regulatory reporting, customer analytics, and operational dashboards. Data quality issues were caught late — often by downstream consumers — and fixing them required engineering intervention for every new rule.

They needed a unified platform with:

  • Real-time CDC ingestion from core SQL Server systems
  • A medallion architecture for progressive data refinement
  • A data quality framework flexible enough for business analysts to manage
  • Quarantine patterns for regulatory compliance

Our Solution

Medallion Lakehouse on Databricks

We designed and built a three-layer medallion architecture on Databricks:

  • Bronze layer: Raw CDC events from SQL Server via Kafka (Confluent Cloud) landed as Delta tables with full event metadata
  • Silver layer: Validated, deduplicated, and conformed records with schema enforcement and SCD Type 2 tracking
  • Gold layer: Business-ready aggregations for regulatory reporting and analytics
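The silver layer's SCD Type 2 tracking keeps a full version history of each record: when a tracked attribute changes, the current row is closed and a new row is opened. In production this is a Delta Lake MERGE; the pure-Python sketch below (with hypothetical `key`/`value`/`valid_from`/`valid_to` field names) shows only the versioning logic.

```python
from datetime import date

def apply_scd2(history, incoming, today=None):
    """Illustrative SCD Type 2 upsert: close the current version of a
    changed record and append the new version. The real pipeline does
    this with a Delta Lake MERGE; this sketch shows the logic only."""
    today = today or date.today().isoformat()
    # Index the current (open) version of each business key
    current = {r["key"]: r for r in history if r["valid_to"] is None}
    for row in incoming:
        existing = current.get(row["key"])
        if existing is None:
            # New key: insert the first version
            history.append({**row, "valid_from": today, "valid_to": None})
        elif existing["value"] != row["value"]:
            # Changed: close the old version, open a new one
            existing["valid_to"] = today
            history.append({**row, "valid_from": today, "valid_to": None})
        # Unchanged rows are left untouched
    return history
```

Replaying the same batch is a no-op, which matters when upstream CDC events are redelivered.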

Kafka CDC Pipeline

Change Data Capture from the bank’s SQL Server estate flowed through Confluent Cloud into the lakehouse. We implemented checkpoint-based Spark Structured Streaming for exactly-once processing guarantees — critical for financial data.
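The exactly-once guarantee rests on Spark Structured Streaming committing source offsets to the checkpoint together with each micro-batch's output, so a restarted job resumes past what it already wrote. The toy pure-Python analogue below (all names hypothetical, far simpler than Spark's actual checkpoint/WAL machinery) illustrates why replaying the same events after a failure produces no duplicates.

```python
def process_batch(events, checkpoint, sink):
    """Toy analogue of checkpoint-based exactly-once processing:
    skip any offset at or below the last committed checkpoint, so
    replaying the same events after a restart adds no duplicates."""
    last = checkpoint.get("offset", -1)
    for offset, payload in events:
        if offset <= last:
            continue  # already processed in a previous (possibly failed) run
        sink.append(payload)
        checkpoint["offset"] = offset  # commit the offset alongside the output
    return sink
```

Running the function twice over the same event list leaves the sink unchanged the second time, which is the property the bank's financial pipelines depend on.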

Config-Driven Data Quality Engine

The centerpiece of the platform is a data quality (DQ) engine whose validation rules are defined in YAML configuration files. Rule types include null checks, range validations, regex patterns, referential integrity checks, and custom business rules.
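A rule file might look like the following sketch. The field names and schema here are illustrative, not the engine's actual configuration format.

```yaml
# Illustrative rule definitions -- field names are hypothetical,
# not the production engine's schema.
table: silver.customers
rules:
  - name: customer_id_not_null
    type: null_check
    column: customer_id
    severity: critical        # critical failures quarantine the record
  - name: balance_in_range
    type: range
    column: account_balance
    min: 0
    max: 10000000
    severity: warning
  - name: iban_format
    type: regex
    column: iban
    pattern: "^AE\\d{21}$"    # UAE IBAN: 'AE' followed by 21 digits
    severity: critical
```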

Records that pass validation flow to the silver layer. Records that fail are routed to quarantine tables with full metadata: the original record, the rule that failed, the timestamp, and the severity level. Business analysts manage rules without code changes — new rules take effect on the next pipeline run.
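The pass/quarantine split can be sketched in a few lines of Python. This is a minimal illustration under assumed record and rule shapes (dicts matching the hypothetical YAML above), not the production engine's API.

```python
import re
from datetime import datetime, timezone

# Check functions per rule type -- a small subset, for illustration only
CHECKS = {
    "null_check": lambda rec, r: rec.get(r["column"]) is not None,
    "range": lambda rec, r: r["min"] <= rec.get(r["column"], r["min"] - 1) <= r["max"],
    "regex": lambda rec, r: bool(re.match(r["pattern"], str(rec.get(r["column"], "")))),
}

SEVERITY_RANK = {"warning": 1, "critical": 2}

def validate(records, rules):
    """Route each record to the clean set or to quarantine, keeping
    the original record plus failure metadata on the quarantine side."""
    clean, quarantine = [], []
    for rec in records:
        failures = [r for r in rules if not CHECKS[r["type"]](rec, r)]
        if failures:
            quarantine.append({
                "record": rec,                                   # original record
                "failed_rules": [r["name"] for r in failures],   # which rules failed
                "severity": max((r["severity"] for r in failures),
                                key=SEVERITY_RANK.get),          # worst severity
                "quarantined_at": datetime.now(timezone.utc).isoformat(),
            })
        else:
            clean.append(rec)
    return clean, quarantine
```

Because the rules arrive as data rather than code, loading a new YAML file is enough for a new rule to apply on the next run.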

Infrastructure as Code

The entire platform was provisioned with Terraform: Databricks workspaces, Confluent Cloud clusters, network configuration, IAM roles, and monitoring infrastructure.

Results

  • 85% reduction in data quality incidents reaching downstream consumers
  • Zero-code rule management — business analysts add and modify DQ rules independently
  • Real-time ingestion — CDC events reach the bronze layer within minutes
  • Full auditability — quarantine tables provide a complete record of every data quality decision

Technologies Used

Databricks, Delta Lake, Apache Spark Structured Streaming, Confluent Cloud (Kafka), SQL Server CDC, Terraform, Python, YAML-based configuration

Deep Dive

How We Built a Config-Driven Data Quality Engine with Quarantine Tables →

A deep dive into the architecture of a flexible, YAML-driven data quality engine we built for a UAE banking institution. The system routes failed records to quarantine tables for review while clean data flows forward, all without requiring code changes for new rules.

Ready to Build Your Data Platform?

Let's discuss how proven architecture and engineering can solve your specific challenges.

Schedule a Consultation