IRiS is developed by Ignition.

Your AI Is Only as Trustworthy as Your Data

30 April 2026


A Practical Framework for Data Quality in the Modern Lakehouse.

In our previous article in this series, we made the case that semantic foundations, taxonomies, ontologies, and Data Vault methodology are the key to building AI solutions that can be trusted. The response from practitioners raised a common and entirely fair challenge: what about data quality?

It is the right question. You can have the most elegantly designed Data Vault in the world, with immaculate Hub-Link-Satellite structures and a beautifully captured business ontology, and if the source data feeding it is dirty, inconsistent, or incomplete, your AI will still produce inaccurate outputs. Semantic structure is a necessary foundation for trusted AI. It is not, on its own, sufficient.

Data quality is the other half of the equation. And in our broad experience delivering enterprise data platforms, it is where many organisations have the least visibility and the most exposure.


Semantic structure tells AI what your data means. Data quality determines whether that data can be believed. You need both.

The Uncomfortable Truth About Data Quality

Most organisations know they have data quality problems. What they typically lack is a clear picture of where those problems are, how serious they are, and what they are costing the business.

This matters enormously in the context of AI. A human analyst looking at a dashboard can often sense when a number looks wrong and apply judgement accordingly. An AI system has no such instinct. It will reason over whatever data it is given, and if that data contains duplicates, nulls in critical fields, referential integrity failures, or values that violate business rules, the AI will incorporate those errors into its outputs as confidently as it incorporates correct data.

The result is AI that is confidently wrong. And confidently wrong AI, particularly in regulated industries like financial services, healthcare, or government, is not just unhelpful. It is a liability.

A common pattern we see: organisations invest significantly in building a modern data platform, deploy AI and analytics on top of it, and then discover, through business user complaints or audit findings, that the underlying data has quality issues that have been silently propagating for months or years. The platform is sound. The data feeding it was not.


The root cause is almost always the same: data quality problems originate in source systems, but they are invisible until something downstream breaks. Without a systematic framework for detecting, measuring, and tracking DQ issues, organisations are flying blind.

Where Data Quality Problems Actually Come From

Before discussing solutions, it is worth being clear about the nature of data quality problems in enterprise environments, because the source of the problem has a direct bearing on where the solution needs to be implemented.

In our experience, data quality issues fall into a small number of recurring patterns: 

  • Completeness failures: Required fields are null or missing. Customer records without contact details. Transactions without reference numbers. Orders without product codes.

  • Validity failures: Values exist but violate business rules. Negative quantities. Dates that precede the founding of the organisation. Status codes that are not in the reference list.

  • Consistency failures: The same concept is represented differently across systems. A customer who is "Active" in the CRM is "Lapsed" in the billing system. Product codes that do not match between the ERP and the warehouse system.

  • Referential integrity failures: Relationships that should exist do not. Orders that reference customers who do not exist. Transactions that reference accounts that have been deleted.

  • Timeliness failures: Data that is correct but stale. Reference tables that have not been updated. Feeds that are running hours or days behind.


What all of these have in common is their origin: they are almost always caused by issues in source systems, data entry processes, system migrations, integration failures, or business processes that have evolved faster than the data models that support them.

This has a critical implication for how data quality should be addressed. Detecting and flagging DQ issues in the data platform is valuable and necessary. But fixing them in the data platform (cleaning, imputing, or transforming around the problem) is at best a temporary patch and at worst a way of hiding a problem that needs to be resolved at its source.

Data quality problems belong to the systems and processes that create them. The data platform's job is to make those problems visible, not to paper over them.

A Practical Framework: Detect, Quantify, Remediate

Ignition's approach to data quality in the Lakehouse environment is built around three interconnected layers, each playing a distinct role in turning data quality from an invisible risk into a managed, measurable programme.

| Layer | Where it lives | Purpose |
| --- | --- | --- |
| DQ Rules | Business Vault | Detect, flag, and score DQ issues at the point of integration |
| DQ Mart | Reporting layer | Track, quantify, and trend DQ issues over time |
| Source Remediation | Upstream systems | Fix issues at origin, the only permanent solution |

Layer One: DQ Rules in the Business Vault

The Business Vault is the natural home for data quality logic in a Data Vault architecture. Sitting above the Raw Vault, which preserves source data exactly as received, the Business Vault is where business rules are applied and interpreted data is produced.

DQ rules implemented in the Business Vault work by evaluating incoming data against defined standards and producing explicit quality flags and scores for each record. Rather than silently dropping, imputing, or transforming records that fail quality checks, the Business Vault approach makes failures visible and traceable.

A well-designed set of Business Vault DQ rules will evaluate:

  • Whether required fields are populated (completeness)

  • Whether field values conform to defined business rules and reference data (validity)
  • Whether relationships to other entities can be successfully resolved (referential integrity)

  • Whether the same entity is represented consistently across source systems (consistency)


Each record that passes through the Business Vault carries with it a quality assessment: not just a binary pass/fail flag, but a scored evaluation that allows downstream consumers to make informed decisions about whether to include, exclude, or caveat records based on their quality profile.
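As an illustrative sketch of what record-level scoring can look like, the snippet below evaluates a record against a handful of completeness, validity, and referential-integrity rules and emits named flags plus a score. The field names, rule names, and equal-weighting scheme are assumptions for illustration, not IRiS-generated code:

```python
from dataclasses import dataclass, field

VALID_STATUSES = {"ACTIVE", "LAPSED", "CLOSED"}  # assumed reference list

@dataclass
class DQResult:
    flags: list = field(default_factory=list)  # names of failed rules
    score: float = 1.0                         # 1.0 = fully clean

def assess_record(record: dict, known_customer_ids: set) -> DQResult:
    """Evaluate one record against defined DQ rules, deducting an equal
    share of the score for each failed check (illustrative weighting)."""
    result = DQResult()
    checks = [
        ("completeness:customer_id", record.get("customer_id") is not None),
        ("completeness:order_ref",   record.get("order_ref") is not None),
        ("validity:quantity",        (record.get("quantity") or 0) >= 0),
        ("validity:status",          record.get("status") in VALID_STATUSES),
        ("ref_integrity:customer",   record.get("customer_id") in known_customer_ids),
    ]
    for name, passed in checks:
        if not passed:
            result.flags.append(name)
            result.score -= 1.0 / len(checks)
    result.score = round(max(result.score, 0.0), 2)
    return result
```

The key design point is that nothing is dropped or altered: a failing record still flows through, carrying its flags and score so downstream layers can decide how to treat it.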

This is where the Data Vault platform's code generation capability, delivered through IRiS and extended by consulting partners, becomes valuable. The structural patterns for implementing Business Vault DQ rules are well established; the effort is in defining the rules themselves, which requires domain knowledge and business engagement rather than engineering creativity. Consulting partners work with business stakeholders to define those rules, and the platform implements them consistently and at scale. Finally, to support end users, the Business Vault entries can be added into the Information Mart layer, either to 'mark up' records that should be used with caution or to exclude them from the mart.

Layer Two: The DQ Mart

Detecting data quality issues at the Business Vault layer is necessary but not sufficient. For DQ to become a managed programme, rather than background noise that practitioners are vaguely aware of, organisations need visibility, measurement, and accountability.

The DQ Mart is a purpose-built reporting and tracking layer that surfaces the outputs of Business Vault DQ rules in a form that is accessible to both technical practitioners and business stakeholders. It answers the questions that matter to each audience: 

  • For business stakeholders: What percentage of our customer records are complete? What proportion of our transactions have referential integrity failures? Is our data quality improving or deteriorating over time?

  • For data engineers: Which source systems are generating the most DQ failures? Which specific fields are problematic? What is the volume and trend of each failure type?

  • For source system owners: Here is the specific evidence of the DQ issues your system is producing. Here is the business impact. Here is what needs to change.


That last point is perhaps the most important. One of the most common failure modes in data quality programmes is the inability to create accountability at source. Data teams know there are problems, but they cannot produce the evidence needed to compel source system owners, who are often in different parts of the organisation with different priorities, to fix them.

The DQ Mart changes this dynamic. It transforms anecdotal frustration into quantified, evidenced, time-trended data that can be presented to leadership. It makes the cost of inaction visible. And it provides a baseline against which remediation progress can be measured.

The DQ Mart also reports where a record has been fixed at source and the DQ issue removed, providing insight into remediation trends and the effectiveness of data quality initiatives.
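The kind of aggregation a DQ Mart surfaces can be sketched as follows: rolling record-level flags up into failure counts per source system per load date, ready for trend reporting. The record shape and field names are hypothetical:

```python
from collections import defaultdict

def failure_trend(dq_records):
    """dq_records: iterable of dicts with 'source_system', 'load_date',
    and 'flags' (list of failed DQ rule names, as produced in the
    Business Vault). Returns {(source, date): failure_count}."""
    trend = defaultdict(int)
    for rec in dq_records:
        trend[(rec["source_system"], rec["load_date"])] += len(rec["flags"])
    return dict(trend)
```

In practice this would be a view or table in the reporting layer rather than application code, but the shape of the output is the same: evidence per source, per day, that can be put in front of system owners.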

Layer Three: Source System Remediation

The DQ Mart creates visibility. What organisations do with that visibility determines whether their data quality improves.

Our position, based on years of working through data quality issues with enterprise clients, is unambiguous: data quality problems are best resolved at source. Cleaning data in the platform, while sometimes necessary as a short-term measure, treats the symptom rather than the disease. The source system will continue producing poor quality data, the cleaning logic will accumulate technical debt, and the organisation will remain dependent on the data team to compensate for problems that belong to the business.

The more productive path is to use the evidence from the DQ Mart to drive remediation conversations with source system owners. This typically involves: 

  • Quantifying the business impact of specific DQ issues, not just "we have null values" but "15% of customer records are missing the field we use for risk scoring, affecting $X of exposure"

  • Prioritising remediation by business impact rather than technical severity, fixing the issues that matter most to outcomes first

  • Tracking progress over time through the DQ Mart, so that improvements at source are visible and rewarded

  • Building DQ standards into source system development and integration processes, so that new data sources are held to quality standards from the start


In practice, this is a consulting and change management engagement as much as it is a technical one. It requires stakeholder alignment, executive sponsorship, and a willingness to have difficult conversations with parts of the business that may not initially welcome scrutiny of their data practices. These are exactly the kinds of engagements where experienced data consultants earn their value.

Quantifying the Business Impact of Data Quality

One of the most powerful shifts an organisation can make in its approach to data quality is moving from qualitative to quantitative assessment. "Our data has quality issues" is easy to dismiss. "Our data quality issues are causing X% of AI model predictions to be based on incomplete customer records, affecting Y decisions per month with an estimated business impact of $Z" is not.

The DQ Mart enables this shift. By capturing DQ metrics at the record level, over time, and linked to specific business entities and processes, it becomes possible to calculate the business cost of data quality failures in terms that resonate with non-technical leadership.

Common business impact metrics we help clients develop include: 

  • Coverage rate: What percentage of business entities (customers, products, accounts) have complete, valid data? What decisions cannot be made, or must be made with reduced confidence, due to incomplete coverage?

  • Error rate by source: Which source systems are contributing the highest volume of DQ failures? What is the trend: are they improving or deteriorating?

  • AI model impact: For organisations already running AI or ML models, what percentage of inference inputs are flagged with DQ issues? How does model performance differ between records with high and low DQ scores?

  • Remediation ROI: What is the measurable improvement in downstream outcomes (model accuracy, reporting confidence, regulatory compliance) as DQ issues are resolved at source?


This last metric is particularly valuable for sustaining investment in data quality programmes over time. When the business can see the direct connection between source system improvements and better AI outcomes, the case for continued investment becomes self-reinforcing. 
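A coverage rate of the kind described above is straightforward to compute from record-level data. A minimal sketch, with illustrative field names:

```python
def coverage_rate(records, required_fields):
    """Share of records in which every required field is populated."""
    complete = sum(
        all(r.get(f) is not None for f in required_fields) for r in records
    )
    return complete / len(records) if records else 0.0
```

The value of the metric comes less from the arithmetic than from computing it consistently over time, so that "15% of customer records are missing the risk-scoring field" becomes a trend line rather than a one-off finding.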

Where Data Quality Sits in the Modern Lakehouse

It is worth being explicit about how this framework integrates with a Data Vault-based Lakehouse architecture, because the positioning of DQ logic within the platform has significant implications for both governance and AI readiness.

In a well-architected Data Vault Lakehouse:

  • The Raw Vault preserves source data exactly as received, including data quality failures. This is critical for auditability. You need to know what data arrived, when, and in what state.

  • The Business Vault applies DQ rules, produces quality flags and scores, and creates business-interpreted views of the data. AI and analytics workloads that require trusted data draw from this layer.

  • The DQ Mart exposes quality metrics for monitoring, reporting, and accountability. It is built from the DQ flags produced in the Business Vault.

  • The Information Mart / Gold layer can be configured to include or exclude records based on DQ scores, giving downstream consumers explicit, documented control over the quality threshold they are willing to accept.
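The Information Mart quality gate described in the last point above can be sketched as a simple filter over DQ scores. The function name, threshold, and `markup` mode are illustrative assumptions, not the IRiS API:

```python
def gold_layer_view(records, min_score=0.8, mode="exclude"):
    """Filter or annotate records by their Business Vault DQ score.
    mode='exclude' drops records below the threshold;
    mode='markup' keeps them but flags them for cautious use."""
    if mode == "exclude":
        return [r for r in records if r["dq_score"] >= min_score]
    return [{**r, "use_with_caution": r["dq_score"] < min_score} for r in records]
```

The point is that the threshold is an explicit, documented choice made by the consumer, rather than an invisible cleaning step buried in transformation logic.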


This architecture means that data quality is not a separate concern bolted on to the platform as an afterthought. It is woven into the structure of the Lakehouse itself: visible, measurable, and actionable at every layer.

The result is an AI-ready platform where the quality of every inference input is known, documented, and traceable. When an AI model produces an output, the data quality profile of the inputs that drove that output is part of the audit trail. When quality improves, that improvement flows through to every downstream consumer automatically.

Data quality is not a data team problem. It is a business problem that we make visible in the data platform. The DQ Mart makes it everyone's problem, which is exactly what it needs to be.

Getting Started

For most organisations, the path to better data quality begins with visibility. Before you can fix the right things, you need to know where the problems are, how serious they are, and what they are costing you.

If you are building or operating a Data Vault-based Lakehouse, implementing a DQ framework in the Business Vault layer, with a DQ Mart for reporting and tracking, is one of the highest-value investments you can make in the trustworthiness of your AI and analytics outputs.

It is also, in our experience, one of the most effective levers for creating the organisational accountability needed to drive source system remediation. The data is there. The evidence is there. The business case almost writes itself.

Ignition works with organisations at every stage of this journey, from initial DQ assessment and framework design through to full Business Vault implementation and DQ Mart deployment. If you are serious about trusted AI, data quality cannot be an afterthought. It needs to be part of the platform from the start.

This article is part of the Data Intelligence Series.
