The Past, Present, and Future of Data Quality Management: Understanding Testing, Monitoring, and Data Observability in 2024 | by Barr Moses | May, 2024

May 25, 2024
in AI Technology


The data estate is evolving, and data quality management needs to evolve right along with it. Here are three common approaches and where the field is heading in the AI era. Image by author.

Data testing, data quality monitoring, data observability: are they different words for the same thing? Unique approaches to the same problem? Something else entirely?

And more importantly — do you really need all three?

Like everything in data engineering, data quality management is evolving at lightning speed. The meteoric rise of data and AI in the enterprise has made data quality a zero-day risk for modern businesses — and THE problem to solve for data teams. With so much overlapping terminology, it’s not always clear how it all fits together — or whether it fits together at all.

But contrary to what some might argue, data quality monitoring, data testing, and data observability aren’t contradictory or even alternative approaches to data quality management — they’re complementary elements of a single solution.

In this piece, I’ll dive into the specifics of these three methodologies, where they perform best, where they fall short, and how you can optimize your data quality practice to drive data trust in 2024.

Before we can understand the current solution, we need to understand the problem — and how it’s changed over time. Let’s consider the following analogy.

Imagine you’re an engineer responsible for a local water supply. When you took the job, the town had a population of just 1,000 residents. But after gold is discovered under the town, your little community of 1,000 transforms into a bona fide city of 1,000,000.

How might that change the way you do your job?

For starters, in a small environment, the fail points are relatively minimal — if a pipe goes down, the root cause can be narrowed to one of a couple of expected culprits (pipes freezing, someone digging into the water line, the usual) and resolved just as quickly with the resources of one or two employees.

With the snaking pipelines of 1 million new residents to design and maintain, the frenzied pace required to meet demand, and the limited capabilities (and visibility) of your team, you no longer have the same ability to locate and resolve every problem you expect to pop up — much less be on the lookout for the ones you don’t.

The modern data environment is the same. Data teams have struck gold, and the stakeholders want in on the action. The more your data environment grows, the more challenging data quality becomes — and the less effective traditional data quality methods will be.

They aren’t necessarily wrong. But they aren’t enough either.

To be very clear, each of these methods attempts to address data quality. So, if that’s the problem you need to build or buy for, any one of these would theoretically check that box. Still, just because these are all data quality solutions doesn’t mean they’ll actually solve your data quality problem.

When and how these solutions should be used is a little more complex than that.

In its simplest terms, you can think of data quality as the problem; testing and monitoring as methods to identify quality issues; and data observability as a different and comprehensive approach that combines and extends both methods with deeper visibility and resolution features to solve data quality at scale.

Or to put it even more simply, monitoring and testing identify problems — data observability identifies problems and makes them actionable.

Here’s a quick illustration that might help visualize where data observability fits in the data quality maturity curve. Image by author.

Now, let’s dive into each method in a bit more detail.

The first of two traditional approaches to data quality is the data test. Data quality testing (or simply data testing) is a detection method that employs user-defined constraints or rules to identify specific known issues within a dataset in order to validate data integrity and ensure specific data quality standards.

To create a data test, the data quality owner would write a series of manual scripts (generally in SQL or leveraging a modular solution like dbt) to detect specific issues like excessive null rates or incorrect string patterns.
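To make this concrete, here is a minimal sketch of such a test in Python (the `orders` table, `customer_id` column, and 1% threshold are all hypothetical); in practice the same check would typically be expressed as a SQL assertion or a dbt test:

```python
import sqlite3

def null_rate(conn, table, column):
    """Fraction of rows in `table` where `column` is NULL."""
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]
    return nulls / total if total else 0.0

# Hypothetical orders table with a user-defined constraint:
# customer_id should be NULL in fewer than 1% of rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10), (2, 11), (3, None), (4, 12)],
)

rate = null_rate(conn, "orders", "customer_id")
if rate > 0.01:
    print(f"FAIL: customer_id null rate {rate:.1%} exceeds threshold")
```

Note that the test only fires for the one issue it was written to catch — which is exactly the limitation discussed below.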

When your data needs — and consequently, your data quality needs — are very small, many teams will be able to get what they need out of simple data testing. However, as your data grows in size and complexity, you’ll quickly find yourself facing new data quality issues — and needing new capabilities to solve them. And that time will come sooner rather than later.

While data testing will continue to be a necessary component of a data quality framework, it falls short in a few key areas:

  • Requires intimate data knowledge — data testing requires data engineers to have enough specialized domain knowledge to define quality, and enough knowledge of how the data might break to set up tests to validate it.
  • No coverage for unknown issues — data testing can only tell you about the issues you expect to find — not the incidents you don’t. If a test isn’t written to cover a specific issue, testing won’t find it.
  • Not scalable — writing 10 tests for 30 tables is quite a bit different from writing 100 tests for 3,000.
  • Limited visibility — Data testing only tests the data itself, so it can’t tell you if the issue is really a problem with the data, the system, or the code that’s powering it.
  • No resolution — even if data testing detects an issue, it won’t get you any closer to resolving it; or understanding what and who it impacts.

At any level of scale, testing becomes the data equivalent of yelling “fire!” in a crowded street and then walking away without telling anyone where you saw it.

Another traditional — if somewhat more sophisticated — approach to data quality, data quality monitoring is an ongoing solution that continually monitors and identifies unknown anomalies lurking in your data through either manual threshold setting or machine learning.

For example, is your data coming in on time? Did you get the number of rows you were expecting?

The primary benefit of data quality monitoring is that it provides broader coverage for unknown unknowns, and frees data engineers from writing or cloning tests for each dataset to manually identify common issues.

In a sense, you could consider data quality monitoring more holistic than testing because it compares metrics over time and enables teams to uncover patterns they wouldn’t see from a single unit test of the data for a known issue.
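As an illustration of how such a monitor works under the hood, here is a toy sketch — not any vendor’s actual implementation — that flags a day’s row count as anomalous when it deviates from recent history by more than three standard deviations; the row counts are made up:

```python
from statistics import mean, stdev

def is_anomalous(history, value, z_threshold=3.0):
    """Flag `value` if it deviates from `history` by more than
    `z_threshold` standard deviations (a simple learned threshold)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold

# Hypothetical daily row counts for one table over two weeks.
row_counts = [10_120, 9_980, 10_340, 10_050, 9_870, 10_210, 10_100,
              10_290, 9_940, 10_160, 10_020, 10_230, 9_910, 10_080]

print(is_anomalous(row_counts, 10_150))  # False — within normal range
print(is_anomalous(row_counts, 2_400))   # True — e.g. a partial load
```

Unlike the unit test above, nobody had to anticipate the partial load — the monitor learns "normal" from history. But notice it still says nothing about *why* the count dropped.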

Unfortunately, data quality monitoring also falls short in a few key areas:

  • Increased compute cost — data quality monitoring is expensive. Like data testing, data quality monitoring queries the data directly — but because it’s intended to identify unknown unknowns, it needs to be applied broadly to be effective. That means big compute costs.
  • Slow time-to-value — monitoring thresholds can be automated with machine learning, but you’ll still need to build each monitor yourself first. That means you’ll be doing a lot of coding for each issue on the front end and then manually scaling those monitors as your data environment grows over time.
  • Limited visibility — data can break for all kinds of reasons. Just like testing, monitoring only looks at the data itself, so it can only tell you that an anomaly occurred — not why it happened.
  • No resolution — while monitoring can certainly detect more anomalies than testing, it still can’t tell you what was impacted, who needs to know about it, or whether any of that matters in the first place.

What’s more, because data quality monitoring is more effective at delivering alerts than managing them, your data team is far more likely to experience alert fatigue at scale than it is to actually improve the data’s reliability over time.

That leaves data observability. Unlike the methods mentioned above, data observability refers to a comprehensive vendor-neutral solution that’s designed to provide complete data quality coverage that’s both scalable and actionable.

Inspired by software engineering best practices, data observability is an end-to-end AI-enabled approach to data quality management that’s designed to answer the what, who, why, and how of data quality issues within a single platform. It compensates for the limitations of traditional data quality methods by integrating both testing and fully automated data quality monitoring into a single system, and then extends that coverage into the data, system, and code levels of your data environment.
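To illustrate the resolution side, here is a toy sketch of how table-level lineage makes an anomaly actionable: given an incident on one table, walk the lineage graph downstream to find the impacted assets and the teams that own them. The tables, graph, and owners are all invented for the example:

```python
from collections import deque

# Hypothetical table-level lineage: table -> tables that read from it.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_orders", "fct_revenue"],
    "fct_orders": ["exec_dashboard"],
    "fct_revenue": ["exec_dashboard", "finance_report"],
}
OWNERS = {"exec_dashboard": "analytics-team", "finance_report": "finance-team"}

def downstream_impact(table):
    """Breadth-first walk of the lineage graph to find every asset
    (and owner) affected by an incident on `table`."""
    impacted, queue = set(), deque(LINEAGE.get(table, []))
    while queue:
        node = queue.popleft()
        if node not in impacted:
            impacted.add(node)
            queue.extend(LINEAGE.get(node, []))
    return impacted, {OWNERS[t] for t in impacted if t in OWNERS}

tables, teams = downstream_impact("stg_orders")
print(sorted(tables))  # ['exec_dashboard', 'fct_orders', 'fct_revenue', 'finance_report']
print(sorted(teams))   # ['analytics-team', 'finance-team']
```

This is the step testing and monitoring skip: the same anomaly now arrives with a blast radius and a routing list attached.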

Combined with critical incident management and resolution features (like automated column-level lineage and alerting protocols), data observability helps…


