A Polars exploration into Kedro
One year ago I travelled to Lithuania for the first time to present at PyCon/PyData Lithuania, and I had a great time there. The topic of my talk was an evaluation of some alternative dataframe libraries, including Polars, the one that I ended up enjoying the most.
I enjoyed it so much that this week I’m in Vilnius again, and I’ll be delivering a workshop at PyCon Lithuania 2023 called “Analyze your data at the speed of light with Polars and Kedro”.
In this blog post you will learn how using Polars in Kedro can make your data pipelines much faster, what the current status of Polars in Kedro is, and what to expect in the near future. In case this is the first time you’ve heard about Polars, I have included a short introduction at the beginning.
Let’s dive in!
What is the Polars library?
Polars is an open-source library for Python, Rust, and NodeJS that provides in-memory dataframes, out-of-core processing capabilities, and more. It is based on the Rust implementation of the Apache Arrow columnar data format (you can read more about Arrow in my earlier blog post “Demystifying Apache Arrow”), and it is optimised to be blazing fast.
The interesting thing about Polars is that it does not try to be a drop-in replacement for pandas, like Dask, cuDF, or Modin, and instead has its own expressive API. Despite being a young project, it quickly became popular thanks to its easy installation process and its “lightning fast” performance.
I started experimenting with Polars one year ago, and it has now become my go-to data manipulation library. I have given several talks about it, for example at PyData NYC, where the room was full.
What is Kedro?
Kedro is an open-source Python toolbox that makes it easier for a team to apply software engineering principles to data science code, which reduces the time spent rewriting data science experiments so that they are fit for production.
Kedro was born at QuantumBlack to solve the challenges faced regularly in data science projects and promote teamwork through standardised team workflows. It is now hosted by the LF AI & Data Foundation as an incubating project.
If you want to learn more about Kedro, you can watch a video introduction on our YouTube channel.
How do Polars and Kedro get used together?
Traditionally Kedro has favoured pandas as a dataframe library because of its ubiquity and popularity. This means that, for example, to read a CSV file, you would add a corresponding entry to the catalog, and then you would use that dataset as input for your node functions, which would, in turn, receive pandas DataFrame objects:
# catalog.yml

openrepair-0_3-categories:
  type: pandas.CSVDataSet
  filepath: data/01_raw/OpenRepairData_v0.3_Product_Categories.csv

# nodes.py

import pandas as pd


def join_events_categories(
    events: pd.DataFrame,
    categories: pd.DataFrame,
) -> pd.DataFrame:
    ...
(This is just one of the formats supported by Kedro datasets of course! You can also load Parquet, GeoJSON, images… have a look at the kedro-datasets reference for a list of datasets maintained by the core team, or the #kedro-plugin topic on GitHub for some contributed by the community!)
The idea of this blog post is to teach you how you can use Polars instead of pandas for your catalog entries, which in turn allows you to write all your data transformation pipelines using Polars dataframes. For that, I crafted some examples that use the Open Repair Alliance dataset, containing more than 80 000 records of repair events across Europe.
Let’s go!
Get started with Polars for Kedro
First of all, you will need to add kedro-datasets[polars.CSVDataSet] to your requirements. At the time of writing (May 2023), the code below requires development versions of both kedro and kedro-datasets, which you can declare in your requirements.txt or pyproject.toml as follows:
# requirements.txt

kedro @ git+https://github.com/kedro-org/kedro@3ea7231
kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets

# pyproject.toml

[project]
dependencies = [
    "kedro @ git+https://github.com/kedro-org/kedro@3ea7231",
    "kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets",
]
If you are using the legacy setup.py files, the syntax is very similar:
from setuptools import setup

setup(
    # install_requires is the setuptools keyword for runtime dependencies
    install_requires=[
        "kedro @ git+https://github.com/kedro-org/kedro@3ea7231",
        "kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets",
    ]
)
After you install these dependencies, you can start using the polars.CSVDataSet by using the appropriate type in your catalog entries:
openrepair-0_3-categories:
  type: polars.CSVDataSet
  filepath: data/01_raw/OpenRepairData_v0.3_Product_Categories.csv
and that’s it!
Reading real world CSV files with polars.CSVDataSet
It turns out that reading CSV files is not always that easy. The good news is that you can use the load_args parameter of the catalog entry to pass extra options to the polars.CSVDataSet, which mirror the function arguments of polars.read_csv. For example, if you want to attempt parsing the date columns in the CSV, you can set the try_parse_dates option to true:
openrepair-0_3-categories:
  type: polars.CSVDataSet
  filepath: data/01_raw/OpenRepairData_v0.3_Product_Categories.csv
  load_args:
    # Doesn't make much sense in this case,
    # but serves for demonstration purposes
    try_parse_dates: true
Some of these parameters are required to be Python objects: for example, polars.read_csv takes an optional dtypes parameter that can be used to specify the dtypes of the columns, as follows:
import polars as pl

pl.read_csv(
    "data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv",
    dtypes={
        "product_age": pl.Float64,
        "group_identifier": pl.Utf8,
    }
)
Kedro catalog files only support primitive types. But fear not! You can use more sophisticated configuration loaders in Kedro that allow you to tweak how such files are parsed and loaded.
To pass the appropriate dtypes to read this CSV file, you can use the TemplatedConfigLoader, or alternatively the shiny new OmegaConfigLoader with a custom omegaconf resolver. Such a resolver will take care of parsing the strings in the YAML catalog and transforming them into the objects Polars needs. Place this code in your settings.py:
# settings.py

import polars as pl
from omegaconf import OmegaConf
from kedro.config import OmegaConfigLoader

if not OmegaConf.has_resolver("polars"):
    OmegaConf.register_new_resolver("polars", lambda attr: getattr(pl, attr))

CONFIG_LOADER_CLASS = OmegaConfigLoader
And now you can use the special OmegaConf syntax in the catalog:
openrepair-0_3-events-raw:
  type: polars.CSVDataSet
  filepath: data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv
  load_args:
    dtypes:
      # Notice the OmegaConf resolver syntax!
      product_age: ${polars:Float64}
      group_identifier: ${polars:Utf8}
    try_parse_dates: true
Now you can access Polars data types with ease from the catalog!
Future plans for Polars integration in Kedro
This all looks very promising, but it’s only the tip of the iceberg. First of all, these changes need to land in stable versions of kedro and kedro-datasets. More importantly, we are working on a generic Polars dataset that will be able to read other file formats, for example Parquet, which is faster, more compact, and easier to use than CSV.
Polars makes me so excited about the future of data manipulation in Python, and I hope that all Kedro users are able to leverage this amazing project in their data pipelines very soon!
Find out more about Kedro
There are many ways to learn more about Kedro:
Join our Slack organisation to reach out to us directly if you have a question or want to stay up to date with news. There's an archive of past conversations on Slack too.
Read our documentation or take a look at the Kedro source code on GitHub.
Check out our introductory video titled Refactor your Jupyter Notebooks using Kedro on YouTube.