Evidently
How to keep your data quality in check and guard against data and model drift with Evidently profiling
The Evidently Data Validator flavor provided with the ZenML integration uses Evidently to perform data quality, data drift, model drift and model performance analyses and generate reports. The reports can be used to implement automated corrective actions in your pipelines or to render interactive representations for further visual interpretation, evaluation and documentation.
When would you want to use it?
Evidently is an open-source library that you can use to monitor and debug machine learning models by analyzing the data that they use through a powerful set of data profiling and visualization features. Evidently currently works with tabular data in pandas.DataFrame or CSV file formats and can handle both regression and classification tasks.
You should use the Evidently Data Validator when you need the following data and/or model validation features that are possible with Evidently:
- Data Quality: provides detailed feature statistics and a feature behavior overview for a single dataset. It can also compare any two datasets, e.g. train and test data, reference and current data, or two subgroups of one dataset.
- Data Drift: helps detect and explore feature distribution changes in the input data by comparing two datasets with identical schema.
- Numerical Target Drift and Categorical Target Drift: help detect and explore changes in the target function and/or model predictions by comparing two datasets where the target and/or prediction columns are available.
- Regression Performance, Classification Performance, or Probabilistic Classification Performance: evaluate the performance of a model by analyzing a single dataset where both the target and prediction columns are available. By providing a second dataset, you can also compare the result to the past performance of the same model or to the performance of an alternative model.
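To make the Data Drift idea more concrete, here is a minimal, standard-library-only sketch of one common way to compare a feature's distribution across two datasets: the two-sample Kolmogorov-Smirnov statistic. It is an illustration only and is not presented as Evidently's implementation; the function name ks_statistic and the sample values are made up for this example.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two numeric samples."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref = sorted(reference)
    cur = sorted(current)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return bisect_right(sorted_sample, x) / len(sorted_sample)

    # The gap can only change at observed values, so checking those suffices.
    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in ref + cur)


# Hypothetical values of one feature in a reference and a current dataset.
reference_ages = [23, 31, 35, 42, 48, 51]
current_ages = [55, 58, 61, 64, 70, 72]

same = ks_statistic(reference_ages, reference_ages)
shifted = ks_statistic(reference_ages, current_ages)
print(f"identical samples: {same:.2f}, shifted samples: {shifted:.2f}")
```

A value near 0 means the two samples are distributed alike, while a value near 1 means they barely overlap. A real drift report would add a significance test and a per-feature summary on top of a statistic like this.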
You should consider one of the other Data Validator flavors if you need a different set of data validation features.
How do you deploy it?
The Evidently Data Validator flavor is included in the Evidently ZenML integration. You need to install the integration on your local machine (for example, by running zenml integration install evidently -y) before you can register an Evidently Data Validator and add it to your stack.
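The Data Validator stack component does not have any configuration parameters, so adding it to a stack only requires registering it with the evidently flavor (for example, zenml data-validator register evidently_data_validator --flavor=evidently) and including it in a stack.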
How do you use it?
Evidently’s profiling functions take in a pandas.DataFrame dataset or a pair of datasets and generate results in the form of a Profile object containing all the relevant information, or as a Dashboard visualization.
One of Evidently’s notable characteristics is that it only requires datasets as input. Even when running model performance comparison analyses, no model needs to be present. However, that does mean that the input data needs to include additional target and prediction columns for some profiling reports, and you have to include additional information about the dataset columns in the form of column mappings. Depending on how your data is structured, you may also need to include additional steps in your pipeline before the data validation step to insert the additional target and prediction columns into your data. This may also require interacting with one or more models.
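As a rough illustration of the preparation described above, the sketch below adds a prediction column to a small dataset and then computes an error metric from the target and prediction columns alone. It uses plain Python lists of dictionaries instead of a pandas.DataFrame to stay dependency-free; hypothetical_model, add_predictions, mean_absolute_error and the sample rows are invented for this example and are not part of ZenML or Evidently.

```python
def hypothetical_model(row):
    """Stand-in for a real model: predicts the target from the size feature."""
    return 2.0 * row["size"]


def add_predictions(rows, model, prediction_column="prediction"):
    """Return copies of the rows with the model output added as a column."""
    return [{**row, prediction_column: model(row)} for row in rows]


def mean_absolute_error(rows, target="target", prediction="prediction"):
    """Computed from the two columns only; no model is needed at this point."""
    errors = [abs(row[target] - row[prediction]) for row in rows]
    return sum(errors) / len(errors)


# Hypothetical raw data that has a target column but no predictions yet.
raw_rows = [
    {"size": 10.0, "target": 21.0},
    {"size": 20.0, "target": 39.0},
    {"size": 30.0, "target": 60.0},
]

# Step before validation: interact with the model to insert predictions.
scored_rows = add_predictions(raw_rows, hypothetical_model)

# Validation-style analysis: works on the dataset alone.
mae = mean_absolute_error(scored_rows)
print(scored_rows[0])
print(round(mae, 3))
```

The split mirrors the pipeline layout described above: one step talks to the model and writes the prediction column, and the validation step afterwards only ever sees data.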
There are three ways you can use Evidently in your ZenML pipelines that allow different levels of flexibility:
- instantiate, configure and insert the standard EvidentlyProfileStep shipped with ZenML into your pipelines. This is the easiest way and the recommended approach, but can only be customized through the supported step configuration parameters.
- call the data validation methods provided by the Evidently Data Validator in your custom step implementation. This method allows for more flexibility concerning what can happen in the pipeline step, but you are still limited to the functionality implemented in the Data Validator.
- use the Evidently library directly in your custom step implementation. This gives you complete freedom in how you are using Evidently’s features.
Outside the pipeline workflow, you can use the ZenML Evidently visualizer to display and interact with the Evidently dashboards generated by your pipelines.
The Evidently standard step
ZenML wraps the Evidently functionality in the form of a standard EvidentlyProfileStep step. You select which reports you want to generate in your step by passing a list of string identifiers into the EvidentlyProfileParameters.
The step can then be inserted into your pipeline, where it takes in two datasets.
Possible report options supported by Evidently are:
- "datadrift"
- "categoricaltargetdrift"
- "numericaltargetdrift"
- "dataquality"
- "classificationmodelperformance"
- "regressionmodelperformance"
- "probabilisticmodelperformance"
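Because the reports are selected by plain strings, a typo in an identifier is easy to make. The following sketch checks a list of identifiers against the options listed above before they are handed to a step configuration. The names SUPPORTED_REPORTS and validate_report_names are hypothetical helpers for this example, not part of ZenML or Evidently.

```python
# Report identifiers copied from the list of supported options.
SUPPORTED_REPORTS = frozenset(
    {
        "datadrift",
        "categoricaltargetdrift",
        "numericaltargetdrift",
        "dataquality",
        "classificationmodelperformance",
        "regressionmodelperformance",
        "probabilisticmodelperformance",
    }
)


def validate_report_names(names):
    """Return the names unchanged, or raise if any identifier is unknown."""
    unknown = [name for name in names if name not in SUPPORTED_REPORTS]
    if unknown:
        raise ValueError(f"unsupported report identifiers: {unknown}")
    return list(names)


selected = validate_report_names(["datadrift", "dataquality"])
print(selected)

try:
    validate_report_names(["data_drift"])  # underscore is not accepted
except ValueError as error:
    print(error)
```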
The step takes in a reference dataset and a comparison dataset, both of which are required for data drift and model comparison reports. It returns an Evidently Profile object and a Dashboard rendered as an HTML string.
If needed, Evidently column mappings can be passed into the step configuration, but they must be given as zenml.integrations.evidently.steps.EvidentlyColumnMapping objects, which have exactly the same structure as evidently.pipeline.column_mapping.ColumnMapping.
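The sketch below shows the kind of information a column mapping carries, together with a simple sanity check against a dataset's columns. ColumnMappingSketch and missing_columns are hypothetical stand-ins written for this example. The field names (target, prediction, numerical_features, categorical_features) are assumed to mirror a subset of the fields in Evidently's ColumnMapping, so check the API docs for the authoritative definition.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ColumnMappingSketch:
    """Hypothetical stand-in for a column mapping; not the real class."""

    target: Optional[str] = "target"
    prediction: Optional[str] = "prediction"
    numerical_features: List[str] = field(default_factory=list)
    categorical_features: List[str] = field(default_factory=list)


def missing_columns(mapping, dataset_columns):
    """List mapped column names that the dataset does not contain."""
    wanted = [mapping.target, mapping.prediction]
    wanted += mapping.numerical_features + mapping.categorical_features
    available = set(dataset_columns)
    return [name for name in wanted if name is not None and name not in available]


# A mapping for a dataset without predictions (prediction=None).
mapping = ColumnMappingSketch(
    target="price",
    prediction=None,
    numerical_features=["size", "rooms"],
    categorical_features=["city"],
)

problems = missing_columns(mapping, ["size", "city", "price"])
print(problems)
```

Running a check like this before the validation step surfaces mapping mistakes early, instead of leaving them to fail inside report generation.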
You should consult the official Evidently documentation for more information on what each report is useful for and what data columns it requires as input.
The EvidentlyProfileParameters step configuration also allows additional profile options and dashboard options to be passed to the Profile and Dashboard constructors.
You can view the complete list of configuration parameters in the API docs.
You can also check out our examples pages for working examples that use the Evidently standard step.
The Evidently Data Validator
The Evidently Data Validator implements the same interface as all other Data Validators. Using it therefore keeps your code compatible with the overall Data Validator abstraction, which makes migration easier if you decide to switch to another Data Validator.
All you have to do is call the Evidently Data Validator methods when you need to interact with Evidently to generate data profiles.
Have a look at the complete list of methods and parameters of the EvidentlyDataValidator in the API docs.
Call Evidently directly
You can use the Evidently library directly in your custom pipeline steps and only leverage ZenML’s capability of serializing, versioning and storing the Profile objects in its Artifact Store.
The Evidently ZenML Visualizer
In the post-execution workflow, you can load and render the Evidently dashboards generated and returned by your pipeline steps by means of the ZenML Evidently Visualizer.
The Evidently dashboards will be opened as tabs in your browser, or displayed inline in your Jupyter notebook, depending on where you are running the code:
[Figure: Evidently Visualization Example 1]
[Figure: Evidently Visualization Example 2]