Step Parameterization and Caching
Iteration is native to ZenML.
Machine learning pipelines are rerun many times over throughout their development lifecycle.
Parameterizing steps
To iterate quickly, you need to be able to tweak pipeline runs by changing the parameters of the steps that make up your pipeline.
You can configure your pipelines at runtime in the following ways:
- BaseParameters: Runtime configuration passed down to steps as parameters.
- BaseSettings: Runtime settings passed down to stack components and pipelines.
In this section, we will focus on BaseParameters, and in the Advanced Guide we will dive deeper into BaseSettings.
You can parameterize a step by creating a subclass of BaseParameters. When such a config object is passed to a step, it is not treated like other artifacts. Instead, it gets passed into the step when the pipeline is instantiated.
The default value for the gamma parameter is set to 0.001. However, when the pipeline is instantiated you can override the default by passing the step a parameters object with a different gamma value.
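To make the default-and-override behaviour concrete, here is a minimal sketch that runs without ZenML installed. It uses a frozen standard-library dataclass as a stand-in for a BaseParameters subclass and a plain function in place of a decorated step. The names SVCTrainerParams and svc_trainer are assumptions chosen for illustration, and the override value 0.01 is arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCTrainerParams:
    """Stand-in for a BaseParameters subclass (hypothetical name)."""

    gamma: float = 0.001  # default used when no override is given


def svc_trainer(params: SVCTrainerParams) -> str:
    """Plain function standing in for a step; it only reports what it received."""
    return f"training with gamma={params.gamma}"


# No override: gamma keeps its declared default.
default_run = svc_trainer(SVCTrainerParams())

# Override when wiring things together by passing a different value.
tuned_run = svc_trainer(SVCTrainerParams(gamma=0.01))

print(default_run)
print(tuned_run)
```

In a real pipeline the parameters object would be handed to the step when the pipeline is instantiated, rather than called directly as in this sketch.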
Behind the scenes, BaseParameters is implemented as a Pydantic BaseModel. Therefore, any type that Pydantic supports is also supported as an attribute type in BaseParameters.
Try running the above pipeline and changing the gamma parameter through many runs. In essence, each pipeline can be viewed as an experiment, and each run is a trial of the experiment, defined by the BaseParameters. You can always get the parameters again when you fetch pipeline runs, to compare various runs.
Caching in ZenML
When you tweaked the gamma variable above, you may have noticed that the digits_data_loader step does not re-execute for each subsequent run. This is because ZenML understands that nothing has changed for that step between subsequent runs, so it re-uses the output of the last run (the outputs are persisted in the artifact store). This behavior is known as caching.
Prototyping is often a fast and iterative process that benefits a lot from caching, which makes caching a very powerful tool. Check out the ZenML blog post on caching for more context on its benefits, and ZenBytes lesson 1.2 for a detailed example of how to configure and visualize caching.
ZenML comes with caching enabled by default. Since ZenML automatically tracks and versions all inputs, outputs, and parameters of steps and pipelines, ZenML will not re-execute steps within the same pipeline on subsequent pipeline runs as long as there is no change in these three.
Currently, caching does not automatically detect changes within the file system or on external APIs. Make sure to set caching to False on steps that depend on external inputs or if the step should run regardless of caching.
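The following toy example shows why this matters. It is a simplified sketch, not ZenML code: external_api is a plain dictionary standing in for state outside the pipeline, and fetch is a hypothetical step whose tracked inputs and parameters never change.

```python
external_api = {"value": 1}  # stands in for state outside the pipeline
_cache = {}


def fetch(enable_cache: bool = True) -> int:
    """Read the external value, optionally serving a stored result."""
    key = "fetch"  # no tracked input or parameter changes, so the key is constant
    if enable_cache and key in _cache:
        return _cache[key]
    _cache[key] = external_api["value"]
    return _cache[key]


first = fetch()                    # executes and stores the result
external_api["value"] = 2          # the outside world changes
stale = fetch()                    # cache hit: the change goes unnoticed
fresh = fetch(enable_cache=False)  # caching off: the step really runs

print(first, stale, fresh)
```

With caching left on, the second call returns the old value even though the external state has changed; only the call with caching turned off sees the update.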
Configuring caching behavior of your pipelines
Although caching is desirable in many circumstances, one might want to disable it in certain instances, for example when you are quickly prototyping with changing step definitions or when your function depends on an external API whose state changes ZenML does not detect.
There are multiple ways to take control of when and where caching is used:
- Configuring caching for the entire pipeline: Do this if you want to configure caching for all steps of a pipeline.
- Configuring caching for individual steps: Do this to configure caching for individual steps. This is, e.g., useful to disable caching for steps that depend on external input.
- Dynamically configuring caching for a pipeline run: Do this if you want to change the caching behaviour at runtime. This is, e.g., useful to force a complete rerun of a pipeline.
Configuring caching for the entire pipeline
On a pipeline level, the caching policy can be set with the enable_cache parameter of the @pipeline decorator. Setting enable_cache=False there disables caching for all steps in the pipeline, unless a step explicitly sets enable_cache=True (see below).
Configuring caching for individual steps
Caching can also be explicitly configured at a step level via the enable_cache parameter of the @step decorator. Setting enable_cache=False turns caching off for that step only. This is very useful in practice, since you might want to turn off caching for certain steps that take external input (like fetching data from an API or file I/O) without affecting the overall pipeline caching behaviour.
You can get a graphical visualization of which steps were cached using the ZenML Dashboard.
Dynamically configuring caching for a pipeline run
Sometimes you want to have control over caching at runtime instead of defaulting to the hard-coded pipeline and step decorator settings. ZenML offers a way to override all caching settings when you start a run. Disabling caching this way applies to all steps of your pipeline, no matter what you have configured in the @step or @pipeline decorators.
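Taken together, the three levels form a simple precedence order. The function below is a hypothetical helper written for illustration only, not part of ZenML; it encodes the order as described in this section, with None meaning "not set at this level".

```python
from typing import Optional


def resolve_enable_cache(
    pipeline_setting: Optional[bool] = None,
    step_setting: Optional[bool] = None,
    run_override: Optional[bool] = None,
) -> bool:
    """Return the effective caching flag for one step (illustrative only)."""
    if run_override is not None:
        return run_override      # the runtime override wins over everything
    if step_setting is not None:
        return step_setting      # an explicit step setting beats the pipeline
    if pipeline_setting is not None:
        return pipeline_setting  # the pipeline-wide setting comes next
    return True                  # caching is enabled by default


# Nothing configured: caching stays on.
print(resolve_enable_cache())
# Pipeline disables caching, but one step opts back in.
print(resolve_enable_cache(pipeline_setting=False, step_setting=True))
# Runtime override forces a full rerun regardless of decorators.
print(resolve_enable_cache(pipeline_setting=True, step_setting=True, run_override=False))
```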
Code Example
The following example shows caching in action with the code example from the previous section.
For a more detailed example of how caching is used in ZenML and how it works under the hood, check out ZenBytes lesson 1.2!