Pandera - a way to type your data pipelines in Python!
Personally I feel the documentation is really good, so feel free to check it out.
This blog post is a continuous work in progress.
Through decorators it integrates smoothly into the rest of your Python code. Validation happens at run time, which is the only option for data in Python, unlike statically typed languages such as Scala, where types are checked at compile time.
What makes me excited about pandera?
- Robust & Clean Pipelines.
- Reproducible and Testable.
- Integrates with other great ecosystems like pydantic, fastapi & mypy.
What makes me not excited about pandera?
- No polars support yet
- It validates through run-time crashes (because of Python)
  - It helps that pandera has lazy evaluation, but still it’s runtime!
How I’m using pandera in a production setting
Because life is how life is, a lot of our pipelines, especially ML pipelines, are written fully in Python. I’m really happy to have tools like polars, which make things a tad speedier, but in all honesty I’d deeply prefer to use a fully typed language like Kotlin or Scala.
To build maintainable pipelines we need to know what to expect and what to do when the unexpected happens. In Python we have to glue these pieces together ourselves, unlike in strongly typed languages where it’s built into the core itself.
But because life is how life is, we use Python, and pandera shines brightly in making the non-typed world a little bit better.
In our ML pipelines we decorate inputs to make sure we have the right data as expected before training our models. pandera fits really nicely into an organisation that has a structured orchestration tool like dagster or similar.