In recent months we've updated Kedro documentation to illustrate three different ways of integrating Kedro with Databricks.
You can choose a workflow based on Databricks jobs to deploy a project that has finished development.
For faster iteration on changes, the workflow documented in "Use a Databricks workspace to develop a Kedro project" is for those who prefer to develop and test their projects directly within Databricks notebooks, to avoid the overhead of setting up and syncing a local development environment with Databricks.
Alternatively, you can work locally in an IDE as described by the workflow documented in "Use an IDE, dbx and Databricks Repos to develop a Kedro project". You can use your IDE’s capabilities for faster, error-free development, while testing on Databricks. This is ideal if you’re in the early stages of learning Kedro, or if your project requires constant testing and adjustments. However, the experience is still not perfect: you must sync your work inside Databricks with dbx and run the pipeline inside a notebook. Debugging has a lengthy setup for each change and there is less flexibility than inside an IDE.
In this blog post, Diego Lira, a Specialist Data Scientist and a client-facing member of QuantumBlack, AI by McKinsey, explains how to use Databricks Connect with Kedro for a development experience that works completely inside an IDE. He recommends this as a solution where the data-heavy parts of your pipelines are in PySpark. If part of your workflow is in Python (e.g. Pandas) and not Spark (using PySpark), then you will find that Databricks Connect will download your data frame to your local environment to continue running your workflow. This might cause performance issues and introduce compliance risks because the data has left the Databricks workspace.
What is Databricks Connect?
Databricks Connect is Databricks' official method of interacting with a remote Databricks instance while using a local environment.
To configure Databricks Connect for use with Kedro, follow the official setup to create a .databrickscfg file containing your access token. Databricks Connect can be installed with pip install databricks-connect, and it will substitute your local SparkSession:
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
Spark commands are sent and executed on the cluster, and results are returned to the local environment as needed. In the context of Kedro, this has an amazing effect: as long as you don’t explicitly ask for the data to be collected in your local environment, operations will be executed only when saving the outputs of your node. If you use datasets saved to a Databricks path, there will be no performance hit for transferring data between environments.
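To make the lazy behaviour concrete, here is a minimal sketch of a Kedro node written so that nothing is collected locally. The function name and the "year" and "region" columns are hypothetical; the input is assumed to be a PySpark DataFrame loaded by the catalog.

```python
def summarise_by_region(df, min_year: int):
    """Kedro node sketch: return a lazy Spark DataFrame.

    `df` is assumed to be a pyspark.sql.DataFrame. The column names
    "year" and "region" are hypothetical.
    """
    # filter() and groupBy().count() are transformations: they only extend
    # the query plan and return another DataFrame.
    # No collect(), show() or toPandas() is called here, so no rows are
    # pulled into the local environment by this node.
    return df.filter(f"year >= {int(min_year)}").groupBy("region").count()
```

Because the node returns a DataFrame rather than local Python objects, the work is only triggered when Kedro saves the node's output dataset.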
This tool was recently made available as a thin client for Spark Connect, one of the highlights of Spark 3.4, and configuration is now easier than in earlier versions. If your cluster doesn’t support the current Connect, please refer to the documentation, as previous versions had different limitations.
How can I use a Databricks Connect workflow with Kedro?
Databricks Connect (and Spark Connect) enables us to have a completely local development flow, while all artifacts can be remote objects. Using Delta tables for all our datasets and MLflow for model objects and tracking, nothing needs to be saved locally. Developers can take full advantage of the Databricks stack while maintaining their full IDE usage.
How to use Databricks as your PySpark engine
Kedro supports integration with PySpark through the use of Hooks. To configure and enable your Databricks session through Spark Connect, simply set up your SPARK_REMOTE environment variable with your Databricks configuration. Here is an example implementation:
import configparser
import os
from pathlib import Path

from kedro.framework.hooks import hook_impl
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self) -> None:
        """Initialises a SparkSession using the config
        from Databricks.
        """
        set_databricks_creds()
        _spark_session = SparkSession.builder.getOrCreate()


def set_databricks_creds():
    """
    Pass Databricks credentials as OS variables if using the local machine.
    If you set the DATABRICKS_PROFILE env variable, it will choose the desired
    profile in .databrickscfg, otherwise it will use the DEFAULT profile.
    """
    profile = os.getenv("DATABRICKS_PROFILE", "DEFAULT")
    # Only configure the remote connection when running outside Databricks.
    if os.getenv("SPARK_HOME") != "/databricks/spark":
        config = configparser.ConfigParser()
        config.read(Path.home() / ".databrickscfg")

        # Remove "https://" and any trailing "/" from the host.
        host = config[profile]["host"].strip().split("//", 1)[1].rstrip("/")
        cluster_id = config[profile]["cluster_id"]
        token = config[profile]["token"]

        os.environ[
            "SPARK_REMOTE"
        ] = f"sc://{host}:443/;token={token};x-databricks-cluster-id={cluster_id}"
This example populates SPARK_REMOTE from your local .databrickscfg file. It doesn't set up the remote connection if the project is being run from inside Databricks (that is, if SPARK_HOME points to Databricks), so you're still able to run it in the usual hybrid development flow. Notice that you don’t need to set up a spark.yml file as is common in other PySpark templates; you’re not passing any configuration, just using the cluster that is in Databricks. You also don’t need to load any extra Spark files (e.g. JARs), as you are using a thin Spark Connect client.
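If you want to unit-test the connection string without touching environment variables or your real credentials, one option is to split the string-building step into a pure function. The sketch below is a variant of the hook's logic, not part of Kedro or Databricks Connect; the function name build_spark_remote and all values in EXAMPLE_CFG are placeholders.

```python
import configparser


def build_spark_remote(cfg_text: str, profile: str = "DEFAULT") -> str:
    """Build a Spark Connect URL from the text of a .databrickscfg file."""
    config = configparser.ConfigParser()
    config.read_string(cfg_text)
    section = config[profile]

    # Drop the scheme and any trailing slash, e.g.
    # "https://example.cloud.databricks.com/" -> "example.cloud.databricks.com"
    host = section["host"].strip().split("//", 1)[-1].rstrip("/")
    token = section["token"]
    cluster_id = section["cluster_id"]
    return f"sc://{host}:443/;token={token};x-databricks-cluster-id={cluster_id}"


# Placeholder values only, not real credentials.
EXAMPLE_CFG = """
[DEFAULT]
host = https://example.cloud.databricks.com/
token = dapi-placeholder
cluster_id = 0000-000000-example
"""

spark_remote = build_spark_remote(EXAMPLE_CFG)
```

Using rstrip("/") rather than slicing off the last character means a host saved without a trailing slash is handled the same way as one saved with it.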
Now all your Spark calls in your pipelines will automatically use the remote cluster. There's no need to change anything in your code. However, notebooks might be part of the project. To use your remote cluster without needing to use environment variables, you can use the DatabricksSession:
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
When using the remote cluster, it's preferred to avoid data transfers between the environments, with all catalog entries referencing remote locations. Using kedro_datasets.databricks.ManagedTableDataSet as your dataset type in the catalog also allows you to use Delta table features.
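As a rough illustration of keeping every catalog entry remote, the sketch below represents a catalog as a plain Python dict (mirroring what a catalog.yml would contain) and flags entries that look local. The helper local_entries, the list of remote prefixes, and all dataset names, paths and table names are assumptions made for this example; adjust them to your own storage layout.

```python
# Path prefixes treated as remote. This list is an assumption, not exhaustive.
REMOTE_PREFIXES = ("dbfs:/", "/dbfs/", "s3://", "abfss://", "gs://")


def local_entries(catalog: dict) -> list:
    """Return the names of catalog entries that appear to point at local storage."""
    flagged = []
    for name, entry in catalog.items():
        # Databricks managed tables live in the workspace, so skip them.
        if entry.get("type", "").startswith("databricks."):
            continue
        if not entry.get("filepath", "").startswith(REMOTE_PREFIXES):
            flagged.append(name)
    return flagged


# Hypothetical catalog contents, written as the dict form of a catalog.yml.
EXAMPLE_CATALOG = {
    "model_input": {
        "type": "databricks.ManagedTableDataSet",
        "table": "model_input",
        "database": "demo",
        "write_mode": "overwrite",
    },
    "raw_events": {
        "type": "spark.SparkDataSet",
        "filepath": "dbfs:/mnt/raw/events",
        "file_format": "delta",
    },
    "scratch": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/scratch.csv",
    },
}

flagged_entries = local_entries(EXAMPLE_CATALOG)
```

Here only the "scratch" entry is flagged, since it is the only one with a relative local path.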
How to enable MLflow on Databricks
Using MLflow to save all your artifacts directly to Databricks leads to a powerful workflow. For this you can use kedro-mlflow. Note that kedro-mlflow is built on top of the mlflow library and, although the Databricks configuration is not covered in its documentation, you can read more about it in the MLflow documentation directly.
After doing the basic setup of the library in your project, you should see an mlflow.yml configuration file. In this file, change the following to set up your URI:
server:
  mlflow_tracking_uri: databricks # if null, will use mlflow.get_tracking_uri() as a default
  mlflow_registry_uri: databricks # if null, mlflow_tracking_uri will be used as mlflow default
Set up your experiment name (this should be a valid Databricks path):
experiment:
  name: /Shared/your_experiment_name
By default, all your parameters will be logged, and objects such as models and metrics can be saved as MLflow objects referenced in the catalog.
Limitations of this workflow
Databricks Connect, built on top of Spark Connect, supports only recent versions of Spark. I recommend looking at the detailed limitations in the official documentation for specific guidance, such as the upload limit of only 128MB for dataframes.
You also need to be aware that .toPandas() will move the data to your local pandas environment. Saving results back as MLflow objects is the preferred way to avoid local objects. Examples can be seen in the kedro-mlflow documentation for all types of supported objects.
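When a node genuinely needs a local pandas object, one possible safeguard is to check the row count on the cluster before collecting. The helper below is a hypothetical sketch, not part of Kedro or Databricks Connect, and the default threshold is arbitrary; it relies only on the standard PySpark DataFrame methods limit(), count() and toPandas().

```python
def to_pandas_capped(df, max_rows: int = 100_000):
    """Collect a Spark DataFrame locally only if it has at most `max_rows` rows.

    `df` is assumed to be a pyspark.sql.DataFrame.
    """
    # limit() bounds how many rows the count has to consider, so the check
    # does not require counting the whole table.
    if df.limit(max_rows + 1).count() > max_rows:
        raise ValueError(
            f"DataFrame has more than {max_rows} rows; "
            "aggregate or filter it on the cluster before calling toPandas()."
        )
    return df.toPandas()
```

The check costs one extra Spark job, which is usually small next to an accidental download of a large table.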
We’re always looking for collaborators to write about their experiences using Kedro, particularly if you’re working with Kedro datasets or converting an existing project to use Kedro. Get in touch with us on our Slack workspace to tell us your story.