How We Built a Config-Driven Data Quality Engine with Quarantine Tables
A deep dive into the architecture of a flexible, YAML-driven data quality engine we built for a UAE banking institution. The system routes failed records to quarantine tables for review while clean data flows forward, all without requiring code changes for new rules.
Data quality in banking is not optional. When a single malformed transaction record can cascade into reconciliation failures, regulatory reporting errors, and ultimately financial loss, the cost of bad data is measured in both dollars and trust. Yet the traditional approach to data quality — hardcoding validation rules directly into pipeline code — creates a bottleneck that is equally damaging: every new rule requires a developer, a code review, a deployment cycle, and a prayer that nothing else broke.
This is the story of how we built a config-driven data quality engine for a major banking institution in the UAE, one that lets business analysts and data stewards define and modify validation rules through YAML configuration files while engineering teams focus on platform reliability rather than rule maintenance.
The Problem: Rigid Rules in a Regulated World
When we first engaged with the client, their data platform was processing transaction data from multiple core banking systems, payment gateways, and regulatory feeds. Data quality checks existed, but they were scattered across dozens of notebooks, each with its own validation logic baked directly into the transformation code.
The consequences were predictable. Adding a new validation rule — say, ensuring that a newly introduced transaction type carried a mandatory reference field — required a Jira ticket, developer assignment, code changes across multiple notebooks, testing, and deployment. The turnaround time was measured in weeks. Meanwhile, bad data continued flowing downstream, triggering manual reconciliation efforts that consumed entire teams.
The data stewards knew exactly what rules they needed. They just had no way to express them without going through the engineering backlog. We needed to invert this relationship entirely.
The Architecture: Dual-Stream Processing with Quarantine
The core insight behind our design was simple: separate the “what to check” from the “how to check it.” The engine itself is a generic PySpark application that reads rule definitions from configuration files and applies them to incoming data. Clean records continue down the pipeline. Records that fail any rule get routed to quarantine tables along with metadata about exactly which rule they violated and why.
At a high level, the architecture looks like this:
- Ingestion Layer: Raw data lands in bronze Delta tables via structured streaming from Kafka topics and batch loads from SFTP drops.
- DQ Engine: A standalone PySpark module that sits between bronze and silver layers. It reads rule configurations, applies them to each micro-batch, and produces two output streams.
- Clean Stream: Records that pass all applicable rules flow into silver Delta tables for downstream consumption.
- Quarantine Stream: Records that fail one or more rules are written to dedicated quarantine Delta tables with full failure metadata attached.
- Reconciliation Layer: A separate process that monitors quarantine tables, presents failed records through a review interface, and allows corrected records to be resubmitted.
This dual-stream pattern means the pipeline never stops. Bad data does not block good data. And the quarantine tables provide a complete audit trail of every data quality issue, which turned out to be invaluable for regulatory compliance.
The Configuration Format
We chose YAML for rule definitions after evaluating JSON, TOML, and even a custom DSL. YAML won because of its readability — data stewards who had never written code could understand and modify rule files after a brief training session.
A rule file is organized by source system and entity. Under each entity, you define an array of rules. Each rule carries a unique identifier, a human-readable description, the target table and columns, the rule type (such as "not_null," "range," "regex," "referential_integrity," or "custom_sql"), parameters specific to that rule type, a severity level, an owner, and an active flag that allows rules to be toggled without deletion.
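A minimal rule file for a hypothetical transactions entity might look like this (the field names and values here are illustrative, not the exact production schema):

```yaml
# dq_rules/core_banking/transactions.yml  (illustrative path and names)
source_system: core_banking
entity: transactions
rules:
  - rule_id: TXN-001
    description: Transaction reference must be present
    type: not_null
    columns: [transaction_ref]
    severity: CRITICAL
    owner: data-stewardship
    active: true

  - rule_id: TXN-002
    description: Amount must be positive and below the single-transaction cap
    type: range
    columns: [amount]
    params: {min: 0.01, max: 10000000}
    severity: HIGH
    owner: data-stewardship
    active: true

  - rule_id: TXN-003
    description: Beneficiary IBAN must match the UAE format
    type: regex
    columns: [beneficiary_iban]
    params: {pattern: "^AE\\d{21}$"}
    severity: HIGH
    owner: payments-team
    active: true

  - rule_id: TXN-004
    description: Account must exist in the account master
    type: referential_integrity
    columns: [account_id]
    params: {ref_table: silver.account_master, ref_column: account_id}
    severity: CRITICAL
    owner: data-stewardship
    active: true
```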
We implemented the following rule types in the initial release:
- not_null: Checks that specified columns contain non-null values. The simplest rule, but responsible for catching roughly 40% of all data quality issues.
- range: Validates that numeric values fall within a defined minimum and maximum. Used extensively for transaction amounts, interest rates, and date ranges.
- regex: Applies regular expression patterns to string columns. Essential for validating account numbers, SWIFT codes, IBAN formats, and other structured identifiers.
- referential_integrity: Verifies that foreign key values exist in a reference table. This catches orphaned records — transactions referencing accounts that do not exist in the master account table, for example.
- custom_sql: Allows arbitrary SQL expressions for complex business rules that do not fit the other types. For instance, checking that the sum of sub-transaction amounts equals the parent transaction amount.
Technical Implementation
The engine is implemented as a PySpark module that can run in both batch and streaming modes. At its core, the processing flow follows these steps.
First, the RuleLoader class reads all YAML files from a configured path in the Databricks workspace (or from a mounted Azure Blob Storage container). It parses them into a list of Rule objects, filtering out any rules where the active flag is set to false. The loader also validates the configuration itself — a malformed rule file should fail loudly at startup rather than silently at runtime.
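A condensed sketch of the loading step, using PyYAML and dataclasses (class and field names are illustrative, not the production API):

```python
from dataclasses import dataclass, field

import yaml  # PyYAML

REQUIRED_KEYS = {"rule_id", "type", "columns"}
KNOWN_TYPES = {"not_null", "range", "regex", "referential_integrity", "custom_sql"}


@dataclass
class Rule:
    rule_id: str
    type: str
    columns: list
    params: dict = field(default_factory=dict)
    severity: str = "MEDIUM"
    active: bool = True


def load_rules(yaml_text: str) -> list:
    """Parse one rule file, fail loudly on malformed rules, drop inactive ones."""
    doc = yaml.safe_load(yaml_text) or {}
    rules = []
    for raw in doc.get("rules", []):
        missing = REQUIRED_KEYS - raw.keys()
        if missing:
            raise ValueError(f"rule {raw.get('rule_id', '?')} missing keys: {missing}")
        if raw["type"] not in KNOWN_TYPES:
            raise ValueError(f"rule {raw['rule_id']} has unknown type {raw['type']!r}")
        rule = Rule(
            rule_id=raw["rule_id"],
            type=raw["type"],
            columns=raw["columns"],
            params=raw.get("params", {}),
            severity=raw.get("severity", "MEDIUM"),
            active=raw.get("active", True),
        )
        if rule.active:  # inactive rules stay in the file but are never applied
            rules.append(rule)
    return rules
```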
Second, the RuleEngine class takes a DataFrame and a list of Rule objects and applies them. Each rule type has a corresponding validator function that returns a boolean column indicating whether each row passes that rule. The key design decision here was to apply all rules to every row and collect all failures, rather than short-circuiting on the first failure. This means the quarantine record for a given row might list three or four violations, giving the data steward full context for correction.
The validation functions are implemented as PySpark column expressions, not UDFs. This is critical for performance. A UDF-based approach would serialize data to Python for each row, destroying any benefit of Spark’s Catalyst optimizer. By expressing rules as native column operations, the engine benefits from predicate pushdown, whole-stage code generation, and all the other optimizations Spark provides.
For the referential_integrity rule type, we use a broadcast join against the reference table. The reference tables are typically small (tens of thousands of rows for account masters, product catalogs, and the like), so broadcasting them avoids expensive shuffle joins. The result is a boolean column indicating whether the foreign key value was found.
For the custom_sql rule type, we use the expr() function to parse and execute the SQL expression. The expression is validated during rule loading to catch syntax errors early.
Third, the Router class splits the DataFrame into two based on the validation results. Rows where all rules passed go to the clean output. Rows where any rule failed go to quarantine, with additional columns appended.
Quarantine Table Design
The quarantine table schema is deliberately denormalized for ease of querying. Each quarantine record contains the entire original row (all source columns preserved), plus these metadata columns:
- dq_run_id: A unique identifier for the DQ engine execution run, linking the record to a specific batch or micro-batch.
- dq_failed_rules: An array of structs, where each struct contains the rule_id, rule_type, column_name, expected_value (what the rule required), and actual_value (what was found in the data).
- dq_failure_timestamp: When the failure was detected.
- dq_source_table: Which bronze table the record came from.
- dq_severity: The highest severity among all failed rules for this record (CRITICAL, HIGH, MEDIUM, LOW).
- dq_status: The lifecycle state of the quarantine record — QUARANTINED, UNDER_REVIEW, CORRECTED, RESUBMITTED, or REJECTED.
We store the failed rules as an array of structs rather than a single concatenated string because downstream consumers need to filter and aggregate by specific rule types and columns. Delta Lake handles complex nested types efficiently, and the Databricks SQL interface renders them cleanly.
The quarantine tables are partitioned by date and severity, reflecting the two most common access patterns: “show me everything that failed today” and “show me all critical failures.”
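In Databricks SQL, the table definition and a typical review query might look like this (table and generated-column names are illustrative; the source columns follow whatever the bronze table carries):

```sql
-- Illustrative DDL; the elided columns mirror the source table's schema.
CREATE TABLE IF NOT EXISTS dq.quarantine_transactions (
  -- ... all original source columns ...
  dq_run_id STRING,
  dq_failed_rules ARRAY<STRUCT<
    rule_id: STRING, rule_type: STRING, column_name: STRING,
    expected_value: STRING, actual_value: STRING>>,
  dq_failure_timestamp TIMESTAMP,
  dq_source_table STRING,
  dq_severity STRING,
  dq_status STRING,
  dq_date DATE GENERATED ALWAYS AS (CAST(dq_failure_timestamp AS DATE))
) USING DELTA
PARTITIONED BY (dq_date, dq_severity);

-- "Show me today's critical failures," one row per violated rule,
-- courtesy of the array-of-structs design.
SELECT t.dq_run_id, f.rule_id, f.column_name, f.actual_value
FROM dq.quarantine_transactions t
LATERAL VIEW explode(t.dq_failed_rules) fr AS f
WHERE t.dq_date = current_date() AND t.dq_severity = 'CRITICAL';
```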
Streaming with Exactly-Once Guarantees
For the streaming workloads, we use Spark Structured Streaming with Delta Lake as both the source and sink, leveraging checkpoint-based processing for exactly-once semantics.
Each streaming DQ job maintains its own checkpoint directory in DBFS. The checkpoint contains the offset information (which Kafka offsets or which Delta table version has been processed) and the state of any stateful operations. If a job fails and restarts, it resumes from the last committed checkpoint, reprocessing only the micro-batches that were in flight at the time of failure.
The foreachBatch sink pattern is what makes the dual-stream routing possible in streaming mode. In each micro-batch, we apply the DQ rules, split the DataFrame, and write both the clean and quarantine outputs before the batch is committed to the checkpoint. Each Delta write is an ACID transaction, and a failed batch is replayed in full on restart; tagging both writes with the micro-batch id (Delta's idempotent-write options) ensures a replay cannot append to one table while skipping the other. This prevents the scenario where clean data advances but the corresponding quarantine records are lost, or vice versa.
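A sketch of the foreachBatch callback (function and path names are illustrative; it assumes an upstream step has already added a boolean `dq_passed` column, and uses Delta's `txnAppId`/`txnVersion` options for idempotent writes):

```python
def make_dq_batch_writer(clean_path, quarantine_path, app_id="dq-engine"):
    """Build a foreachBatch callback that routes one micro-batch to two Delta tables.

    txnAppId/txnVersion make each write idempotent per target table, so a
    replayed micro-batch cannot append the same records twice.
    """
    def process_batch(batch_df, batch_id):
        for condition, path in (("dq_passed", clean_path),
                                ("NOT dq_passed", quarantine_path)):
            (batch_df.filter(condition)
                     .write.format("delta")
                     .mode("append")
                     .option("txnAppId", app_id)
                     .option("txnVersion", batch_id)
                     .save(path))
    return process_batch

# Wiring it into the stream (sketch):
# (validated_stream.writeStream
#     .foreachBatch(make_dq_batch_writer(SILVER_PATH, QUARANTINE_PATH))
#     .option("checkpointLocation", CHECKPOINT_DIR)
#     .start())
```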
We also implemented a dead-letter mechanism for records that fail to write to either destination — a rare edge case, but one that must be handled in a banking context. These go to a separate dead-letter Delta table with the full error stack trace.
Reconciliation: Closing the Loop
Quarantine tables are useless if nobody looks at them. We built a reconciliation workflow that operates on multiple levels.
Automated reconciliation handles the simple cases. Some quarantine records can be corrected programmatically — for example, if a transaction amount is stored as a string with a currency symbol, a regex-based cleanup can extract the numeric value. These auto-correction rules are themselves defined in YAML, creating a correction pipeline that mirrors the validation pipeline.
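An auto-correction rule in that mirrored pipeline might look like this (the format and field names are an illustrative sketch, not the exact production schema):

```yaml
corrections:
  - correction_id: FIX-001
    description: Strip currency symbols from amount before casting to decimal
    applies_to_rule: TXN-002   # fires only for records quarantined by this rule
    type: regex_extract
    columns: [amount]
    params:
      pattern: "([0-9]+\\.?[0-9]*)"
      cast_to: decimal(18,2)
    active: true
```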
For records requiring human judgment, we built a Streamlit-based review application that presents quarantine records grouped by source system, date, and failure type. Data stewards can inspect individual records, apply corrections, and either resubmit them to the pipeline or reject them with a reason code. The application connects directly to the quarantine Delta tables via Databricks SQL, so there is no data duplication.
Resubmitted records enter the pipeline at the bronze layer with a special flag indicating they are reprocessed quarantine records. They go through the full DQ engine again — if the correction was valid, they flow through to silver. If not, they return to quarantine with updated failure metadata. This loop continues until the record either passes or is explicitly rejected.
Monitoring and Alerting
Data quality is only as good as its monitoring. We implemented a metrics layer that tracks DQ engine performance across several dimensions: total records processed per run, pass rate and fail rate by rule type, quarantine table growth rate, mean time to resolution for quarantined records, and rule execution latency.
These metrics are written to a dedicated Delta table and visualized through Databricks SQL dashboards. We configured alerts for anomalous conditions — if the failure rate for any single rule exceeds its historical baseline by more than two standard deviations, an alert fires to the relevant Slack channel. This catches both data source degradation (more failures than usual) and rule misconfiguration (a newly added rule that fails on everything).
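The anomaly condition behind the alert can be expressed directly against the metrics table. Assuming an illustrative schema with one row per run and rule carrying a `fail_rate` column, the alert query might look like:

```sql
-- Rules whose failure rate today is more than 2 standard deviations
-- above their trailing 30-day baseline.
WITH baseline AS (
  SELECT rule_id,
         AVG(fail_rate)    AS mean_rate,
         STDDEV(fail_rate) AS std_rate
  FROM dq.metrics
  WHERE run_date >= current_date() - INTERVAL 30 DAYS
  GROUP BY rule_id
)
SELECT m.rule_id, m.fail_rate, b.mean_rate, b.std_rate
FROM dq.metrics m
JOIN baseline b USING (rule_id)
WHERE m.run_date = current_date()
  AND m.fail_rate > b.mean_rate + 2 * b.std_rate;
```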
Results
The impact was measurable within the first quarter of operation:
- 85% reduction in data quality incidents reaching downstream consumers, as issues are now caught and quarantined at the silver layer boundary.
- Rule deployment time dropped from weeks to hours. Data stewards can now add a rule in YAML, submit a pull request, and have it active after a standard review and merge cycle.
- Full audit trail for every data quality decision, satisfying regulatory requirements that previously required manual documentation.
- Non-engineers managing 70% of the rule base. The data stewardship team now owns and maintains the majority of validation rules, freeing engineering to focus on platform improvements.
Lessons Learned
Not everything went smoothly. Here are the hard-won takeaways from this project.
Start with not_null and regex rules. They catch the vast majority of real-world data quality issues and are the easiest for non-engineers to understand and configure. We initially over-invested in the custom_sql rule type, which turned out to be used far less frequently than we expected.
Version your rule configurations. We store rule YAML files in a Git repository alongside the pipeline code. Every rule change goes through a pull request with review from both a data steward (for business correctness) and an engineer (for performance implications). This Git-based workflow was essential for auditability.
Test rules against production data volumes before activation. A referential_integrity rule that performs a broadcast join against a “small” reference table works beautifully in development. When that reference table turns out to have 50 million rows in production, the broadcast fails and the job crashes. We learned to include a size check that automatically switches from broadcast to shuffle join above a configurable threshold.
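The guard itself is simple. A pure-function sketch of the decision (the threshold value and names are illustrative), with the PySpark wiring shown in comments:

```python
BROADCAST_ROW_THRESHOLD = 5_000_000  # illustrative; tune to executor memory


def join_strategy(ref_row_count: int,
                  threshold: int = BROADCAST_ROW_THRESHOLD) -> str:
    """Choose how to join the reference table for a referential-integrity rule."""
    return "broadcast" if ref_row_count <= threshold else "shuffle"


# PySpark wiring (sketch):
#   ref = spark.table(rule.params["ref_table"])
#   if join_strategy(ref.count()) == "broadcast":
#       ref = F.broadcast(ref)
#   df.join(ref, on=key, how="left")
```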
Quarantine tables grow fast. In a high-volume banking context, even a 2% failure rate produces significant quarantine table volume. We implemented a retention policy that archives quarantine records older than 90 days to cold storage, keeping the active quarantine tables performant for the review application.
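The retention job amounts to a scheduled copy-then-delete (illustrative table names; the archive table lives on a cheaper storage tier):

```sql
-- Copy expired quarantine records to the archive, then remove them.
INSERT INTO dq_archive.quarantine_transactions
SELECT * FROM dq.quarantine_transactions
WHERE dq_failure_timestamp < current_timestamp() - INTERVAL 90 DAYS;

DELETE FROM dq.quarantine_transactions
WHERE dq_failure_timestamp < current_timestamp() - INTERVAL 90 DAYS;

-- Reclaim the deleted files once Delta's retention window has elapsed.
VACUUM dq.quarantine_transactions;
```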
Invest in the reconciliation workflow early. We initially treated reconciliation as a phase-two feature. The result was a quarantine table that grew without bound because nobody had an efficient way to process the records. The Streamlit review application should have been part of the MVP.
The config-driven DQ engine has since become a pattern we replicate across engagements. The specific rule types and quarantine table schemas vary by domain, but the core architecture — externalized rules, dual-stream processing, and structured reconciliation — remains consistent. It is one of those patterns that, once you have built it, you cannot imagine going back to hardcoded validation logic scattered across notebooks.
Related Case Study
UAE Banking Institution: Medallion Lakehouse with Config-Driven Data Quality → 85% reduction in data quality incidents, config-driven rules managed by business analysts