There has been a lot of buzz around DeepSeek (R1) and their Open Source mission, and lately they have released their full stack for training state-of-the-art LLMs.
One of these tools is a distributed data processing framework named “smallpond”, built on top of DuckDB & Ray.
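To get a feel for the programming model, here is a minimal sketch in the spirit of smallpond’s quick-start: initialize a session (backed by Ray), read Parquet, repartition, and run a DuckDB SQL fragment per partition. The file paths are placeholders, and the exact method names (`init`, `read_parquet`, `repartition`, `partial_sql`, `write_parquet`) should be double-checked against the current smallpond docs.

```python
import smallpond

# Start a session; smallpond schedules the work on Ray under the hood.
sp = smallpond.init()

# Lazily read a Parquet dataset and spread it over 4 hash partitions.
df = sp.read_parquet("prices.parquet")
df = df.repartition(4, hash_by="ticker")

# Run a DuckDB SQL fragment against each partition; {0} refers to df.
df = sp.partial_sql(
    "SELECT ticker, min(price) AS low, max(price) AS high FROM {0} GROUP BY ticker",
    df,
)

# Materialize the result.
df.write_parquet("out/")
print(df.to_pandas())
```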
Mike has an excellent write-up on his blog.
The summary? It’s tooling you couldn’t buy for millions of dollars, insanely valuable Open Source code! The drawback? A lot of setup, and it’s early days with few (if any) guides.
When should I use it? When you start to have more than 10 TB of data to query, and especially above 1 PB.
My thoughts are that
- smallpond brings the “modern data stack” closer to the end user for truly Big Data, but not close enough.
- We see Apache Arrow and Ray (a lot leaner than, say, Apache Airflow) as key technologies, and the engines are interchangeable between DuckDB and Polars; see the sketch after this list.
- There’s other competition attempting something similar, e.g. Polars Cloud (albeit potentially not Open Source, it’s an exciting future)!
- There’s other competition looking at vertical scaling instead, e.g. Motherduck.
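On the interchangeability point: Apache Arrow is what makes swapping engines cheap, since DuckDB and Polars can hand the same in-memory table back and forth without copying it. A small illustrative sketch (the file name and columns are made up):

```python
import duckdb
import polars as pl

# Heavy aggregation in DuckDB, handed over as an Arrow table (zero-copy where possible).
arrow_table = duckdb.sql(
    "SELECT user_id, count(*) AS n_events FROM 'events.parquet' GROUP BY user_id"
).arrow()

# Continue the pipeline in Polars on the same in-memory data.
top_users = pl.from_arrow(arrow_table).sort("n_events", descending=True).head(10)

# DuckDB can also query the Polars frame directly, again via Arrow.
print(duckdb.sql("SELECT avg(n_events) FROM top_users"))
```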
At the end of the day, Motherduck’s approach resonates a lot more with me: by storing data cleverly, we can query huge amounts of data efficiently on a single machine through metadata scanning, especially with vertical scaling. It’s also the simplest approach.
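As a rough illustration of that idea, a single DuckDB process can query a large, Hive-partitioned Parquet dataset and skip everything the filter rules out, using partition values and row-group statistics rather than scanning the whole dataset. The paths and column names below are made up:

```python
import duckdb

# Only the year=2024/month=12 partitions (and matching row groups) are actually
# read; the rest is skipped based on file and row-group metadata.
result = duckdb.sql("""
    SELECT ticker, avg(price) AS avg_price
    FROM read_parquet('trades/year=*/month=*/*.parquet', hive_partitioning = true)
    WHERE year = 2024 AND month = 12
    GROUP BY ticker
""").df()
print(result)
```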
But some days you might need that brute force, because there isn’t the time or the expertise, or because your problem simply requires loading and working with insane amounts of data, e.g. LLM training.
All in all, smallpond and 3FS are great additions to the open source community and extend “distributed truly big data processing”, which is a valuable target. Though I can’t help but think, and hope, that there’ll be even simpler tools moving forward.