The Spark integration brings two different step operators:
- Step Operator: The `SparkStepOperator` serves as the base class for all the Spark-related step operators.
- Step Operator: The `KubernetesSparkStepOperator` is responsible for launching ZenML steps as Spark applications with Kubernetes as a cluster manager.
Step Operators: `SparkStepOperator`
The implementation can be summarized in two parts. First, the configuration:
- `master` is the master URL for the cluster where Spark will run. You might see different schemes for this URL with varying cluster managers such as Mesos, YARN, or Kubernetes.
- `deploy_mode` can either be 'cluster' (default) or 'client' and it decides where the driver node of the application will run.
- `submit_args` is the JSON string of a dictionary, which will be used to define additional parameters if required (Spark has a wide variety of parameters, so including them all in a single class was deemed unnecessary).
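To make these parameters concrete, the hypothetical sketch below shows how they might be grouped together. The class name, defaults, and example values are illustrative assumptions, not the actual ZenML configuration class:

```python
import json
from typing import Optional

from pydantic import BaseModel


class SparkOperatorConfigSketch(BaseModel):
    """Illustrative sketch of the three base configuration parameters."""

    # Master URL of the cluster, e.g. "k8s://https://<API_SERVER>:443",
    # "yarn", or "mesos://<HOST>:<PORT>" depending on the cluster manager.
    master: str

    # Where the driver node runs: "cluster" (default) or "client".
    deploy_mode: str = "cluster"

    # JSON string of a dictionary holding any extra Spark parameters.
    submit_args: Optional[str] = None


config = SparkOperatorConfigSketch(
    master="k8s://https://kubernetes.default.svc:443",
    deploy_mode="cluster",
    submit_args=json.dumps({"spark.executor.memory": "4g"}),
)
```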
Second, the `launch` method of the step operator gets additional configuration parameters from the `DockerSettings` and `ResourceSettings`. As a result, the overall configuration happens in four base methods:
- `_resource_configuration` translates the ZenML `ResourceSettings` object to Spark's own resource configuration.
- `_backend_configuration` is responsible for cluster-manager-specific configuration.
- `_io_configuration` is a critical method. Even though we have materializers, Spark might require additional packages and configuration to work with a specific filesystem. This method is used as an interface to provide this configuration.
- `_additional_configuration` takes the `submit_args`, converts them, and appends them to the overall configuration.
Once the configuration is completed, `_launch_spark_job` comes into play. This takes the completed configuration and runs a Spark job on the given master URL with the specified `deploy_mode`. By default, this is achieved by creating and executing a `spark-submit` command.
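As a rough illustration of that last step, the function below assembles and runs a `spark-submit` command from a completed configuration. It is a simplified stand-in, not the actual `_launch_spark_job` implementation; the function name and arguments are assumptions made for this sketch:

```python
import subprocess
from typing import Dict


def launch_spark_job_sketch(
    master: str,
    deploy_mode: str,
    spark_conf: Dict[str, str],
    application: str,
) -> None:
    """Simplified sketch: build and execute a spark-submit command."""
    command = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in spark_conf.items():
        command += ["--conf", f"{key}={value}"]
    command.append(application)
    subprocess.run(command, check=True)


# Example invocation with placeholder values:
launch_spark_job_sketch(
    master="k8s://https://kubernetes.default.svc:443",
    deploy_mode="cluster",
    spark_conf={"spark.executor.instances": "2"},
    application="entrypoint.py",
)
```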
Warning
In its first iteration, the pre-configuration with the `_io_configuration` method is only effective when it is paired with an `S3ArtifactStore` (which has an authentication secret). When used with other artifact store flavors, you might be required to provide additional configuration through the `submit_args`.
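For example, if your artifact store lives on a different filesystem, you could pass the required connector packages and Hadoop settings through `submit_args`. The snippet below only illustrates the shape of that value; the package coordinates and configuration keys are placeholders that depend on your filesystem and connector:

```python
import json

# Purely illustrative: replace the placeholders with the packages and
# configuration keys required by your artifact store's filesystem.
submit_args = json.dumps(
    {
        # Extra connector package(s) to pull in at submit time.
        "spark.jars.packages": "<GROUP>:<ARTIFACT>:<VERSION>",
        # Filesystem-specific Hadoop configuration.
        "spark.hadoop.<FS_SPECIFIC_KEY>": "<VALUE>",
    }
)
```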
Stack Component: `KubernetesSparkStepOperator`
The `KubernetesSparkStepOperator` is implemented by subclassing the base `SparkStepOperator` and uses the `PipelineDockerImageBuilder` class to build and push the required Docker images. On top of the base configuration, it adds two Kubernetes-specific parameters:
- `namespace` is the namespace under which the driver and executor pods will run.
- `service_account` is the service account that will be used by various Spark components (to create and watch the pods).
Additionally, the `_backend_configuration` method is adjusted to handle the Kubernetes-specific configuration.
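To give a rough idea of what this adjustment involves, the sketch below maps the two Kubernetes parameters (plus the built Docker image) onto standard Spark-on-Kubernetes properties. The function is an illustrative stand-in, not the actual `_backend_configuration` override:

```python
from typing import Dict


def kubernetes_backend_configuration_sketch(
    namespace: str,
    service_account: str,
    docker_image: str,
) -> Dict[str, str]:
    """Illustrative mapping of Kubernetes parameters to Spark properties."""
    return {
        # Namespace in which the driver and executor pods will run.
        "spark.kubernetes.namespace": namespace,
        # Service account the driver uses to create and watch executor pods.
        "spark.kubernetes.authenticate.driver.serviceAccountName": service_account,
        # Image built and pushed by the PipelineDockerImageBuilder.
        "spark.kubernetes.container.image": docker_image,
    }
```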
When to use it
You should use the Spark step operator:
- when you are dealing with large amounts of data.
- when you are designing a step which can benefit from distributed computing paradigms in terms of time and resources.
How to deploy it
The `KubernetesSparkStepOperator` requires a Kubernetes cluster in order to run. There are many ways to deploy a Kubernetes cluster using different cloud providers or on your custom infrastructure, and we can't possibly cover all of them, but you can check out the spark example to see how it is deployed on AWS.
How to use it
In order to use the `KubernetesSparkStepOperator`, you need:
- the ZenML `spark` integration. If you haven't installed it already, run `zenml integration install spark`.
- Docker installed and running.
- A remote artifact store as part of your stack.
- A remote container registry as part of your stack.
- A remote secrets manager as part of your stack.
- A Kubernetes cluster deployed.
Once you have registered the step operator and added it to your active stack, you can run individual steps of your pipeline on Spark by specifying it in the `@step` decorator as follows:
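The snippet below is a minimal sketch: the step operator name is a placeholder for whatever name you registered it under, and the import path may differ slightly between ZenML versions.

```python
from zenml import step


@step(step_operator="spark_step_operator")  # placeholder registered name
def step_on_spark() -> None:
    """A step that should be executed on Spark via the step operator."""
    ...
```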
Additional configuration
For additional configuration of the Spark step operator, you can pass `SparkStepOperatorSettings` when defining or running your pipeline. Check out the API docs for a full list of available attributes and this docs page for more information on how to specify settings.
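As a rough sketch of how such settings might be attached to a step (the import path, attribute names, and settings key below are assumptions; the API docs remain the authoritative reference):

```python
from zenml import step
from zenml.integrations.spark.flavors import SparkStepOperatorSettings

# Attribute names are illustrative; check the API docs for the full list.
spark_settings = SparkStepOperatorSettings(deploy_mode="cluster")


@step(
    step_operator="spark_step_operator",  # placeholder registered name
    # Depending on your ZenML version, the settings key may also need the
    # flavor suffix, e.g. "step_operator.<FLAVOR>".
    settings={"step_operator": spark_settings},
)
def step_on_spark() -> None:
    ...
```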
A concrete example of using the Spark step operator can be found here.