- the Data Validator is an optional type of Stack Component that needs to be registered as part of your ZenML Stack.
- Data Validators used in ZenML pipelines usually generate data profiles and data quality check reports that are versioned and stored in the Artifact Store. They can be retrieved and inspected using the post-execution workflow API.
## When to use it
Data-centric AI practices are quickly becoming mainstream, and Data Validators are an easy way to incorporate them into your workflow. These are some common scenarios where you may consider using Data Validators in your pipelines:

- early on, even if it's just to keep a log of the quality state of your data and the performance of your models at different stages of development.
- if you have pipelines that regularly ingest new data, you should use data validation to run regular data integrity checks to signal problems before they are propagated downstream.
- in continuous training pipelines, you should use data validation techniques to compare new training data against a data reference and to compare the performance of newly trained models against previous ones.
- when you have pipelines that automate batch inference, or if you regularly collect data used as input in online inference, you should use data validation to detect training-serving skew, data drift and model drift.
## Data Validator Flavors
Data Validators are optional stack components provided by integrations. The following table lists the currently available Data Validators and summarizes their features and the data types and model types that they can be used with in ZenML pipelines:

| Data Validator | Validation Features | Data Types | Model Types | Notes | Flavor/Integration |
|---|---|---|---|---|---|
| Deepchecks | data quality, data drift, model drift, model performance | tabular: `pandas.DataFrame`, CV: `torch.utils.data.dataloader.DataLoader` | tabular: `sklearn.base.ClassifierMixin`, CV: `torch.nn.Module` | Add Deepchecks data and model validation tests to your pipelines | `deepchecks` |
| Evidently | data quality, data drift, model drift, model performance | tabular: `pandas.DataFrame` | N/A | Use Evidently to generate a variety of data quality and data/model drift reports and visualizations | `evidently` |
| Great Expectations | data profiling, data quality | tabular: `pandas.DataFrame` | N/A | Perform data testing, documentation and profiling with Great Expectations | `great_expectations` |
| Whylogs/WhyLabs | data drift | tabular: `pandas.DataFrame` | N/A | Generate data profiles with whylogs and upload them to WhyLabs | `whylogs` |
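To give a taste of the kind of checks these libraries perform, here is a minimal sketch that uses Evidently's `Report` API (available in Evidently 0.2+) directly on two `pandas.DataFrame` objects, outside of ZenML; the column names and data are made up for illustration:

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Hypothetical reference and current datasets sharing the same schema
reference = pd.DataFrame(
    {"age": [23, 35, 47, 52, 38], "income": [40000, 55000, 72000, 68000, 51000]}
)
current = pd.DataFrame(
    {"age": [61, 58, 66, 70, 64], "income": [30000, 29000, 33000, 31000, 28000]}
)

# Compare the current data against the reference to detect data drift
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)

# Save the result as a standalone HTML report that can be opened in a browser
report.save_html("data_drift_report.html")
```

Used inside a ZenML pipeline, the same report would instead be returned from a step and versioned in the Artifact Store.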
## How to use it
Every Data Validator has different data profiling and testing capabilities and uses a slightly different way of analyzing your data and your models, but it generally works as follows:

- first, you have to configure and add a Data Validator to your ZenML stack
- every integration includes one or more built-in data validation steps that you can add to your pipelines. Of course, you can also use the libraries directly in your own custom pipeline steps and simply return the results (e.g. data profiles, test reports) as artifacts that are versioned and stored by ZenML in its Artifact Store.
- you can access the data validation artifacts in subsequent pipeline steps, or you can load them in the post-execution workflow to process or visualize them as needed (see the sketches after this list).
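Putting these steps together, here is a minimal sketch of how this might look with the Evidently integration. The stack component, step and pipeline names are hypothetical, and the example assumes the pre-0.40 ZenML step/pipeline API:

```python
# Register a Data Validator and add it to the active stack first, e.g.:
#   zenml data-validator register evidently_validator --flavor=evidently
#   zenml stack update -dv evidently_validator

import pandas as pd
from zenml.steps import step
from zenml.pipelines import pipeline


@step
def load_data() -> pd.DataFrame:
    """Hypothetical ingestion step producing a toy dataset."""
    return pd.DataFrame({"age": [23, 35, 47], "income": [40000, 55000, 72000]})


@step
def data_quality_check(df: pd.DataFrame) -> str:
    """Custom validation step that uses the Evidently library directly."""
    from evidently.report import Report
    from evidently.metric_preset import DataQualityPreset

    report = Report(metrics=[DataQualityPreset()])
    report.run(reference_data=None, current_data=df)
    # The returned JSON report is versioned and stored in the Artifact Store
    return report.json()


@pipeline
def validation_pipeline(loader, validator):
    df = loader()
    validator(df)


if __name__ == "__main__":
    validation_pipeline(loader=load_data(), validator=data_quality_check()).run()
```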
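After the pipeline has run, the report artifact can be fetched and inspected with the post-execution workflow API. This is only a sketch: it assumes the pre-0.40 `zenml.post_execution` module and the step name used above, and the exact accessor names and run ordering vary across ZenML versions:

```python
from zenml.post_execution import get_pipeline

# Fetch the pipeline and its most recent run (run ordering varies by version)
pipeline = get_pipeline("validation_pipeline")
last_run = pipeline.runs[-1]

# Load the JSON report produced by the validator step defined above
step = last_run.get_step("validator")
report_json = step.output.read()
print(report_json)
```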