This blog was supposed to be more in-depth, but my enthusiasm was drastically cut, so I decided to split it into multiple smaller posts; the daft one is already uploaded.
I started writing a “recipe-book” for `daft`, where I realized it wasn’t as smoothly integrated as a lot of other tools. I believe the `DataFrame` format is both a winning and a losing concept: it’s very helpful, but when you need to use two columns together, the way Ray, HuggingFace Datasets and others map data using `dict`s is a winning concept for both element-by-element and batch mapping. With a `dict`-based way to map a `DataFrame`, I think `daft` might end up the perfect tool.
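To make that point concrete, here is a minimal sketch of the two styles; the `width`/`height`/`area` columns are invented purely for illustration:

```python
import daft
import datasets  # HuggingFace Datasets

# dict-style mapping: each row arrives as a plain dict, so using two columns together is trivial
def add_area(row: dict) -> dict:
    row["area"] = row["width"] * row["height"]
    return row

hf_ds = datasets.Dataset.from_dict({"width": [2, 3], "height": [4, 5]})
hf_ds = hf_ds.map(add_area)  # Ray Data's .map(add_area) looks essentially the same

# expression-style mapping in daft: columns are addressed through expressions instead
df = daft.from_pydict({"width": [2, 3], "height": [4, 5]})
df = df.with_column("area", daft.col("width") * daft.col("height"))
```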
Anyhow, today I’ll compare the developer experience and performance of different tools for data loading.
All of the chosen tools are quite awesome, and HuggingFace and Ray can additionally export to TensorFlow. Ray, however, currently cannot handle `RaggedTensor`, which is required for models with variable-sized output - a letdown!
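For context, a toy example of the kind of structure I mean (not part of the benchmark):

```python
import tensorflow as tf

# Variable-length targets (e.g. a different number of boxes or tokens per sample)
# don't fit in a dense tensor; that's what RaggedTensor is for
labels = tf.ragged.constant([[1, 2], [3], [4, 5, 6]])
print(labels.shape)  # (3, None)
```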
Quick Introduction
Hugging Face Datasets offers easy access to a vast library of datasets, with efficient memory handling through streaming and memory-mapping. Its API simplifies data loading and transformation for direct use with PyTorch and TensorFlow.
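A minimal sketch of what that looks like (“beans” is just a small example dataset, and the local path mirrors the one used later in this post):

```python
import datasets

# Datasets saved to disk are memory-mapped Arrow files, so loading them is cheap
ds = datasets.load_from_disk("./imagenette_full_size")

# Hub datasets can also be streamed, so nothing is fully materialized up front
ds_stream = datasets.load_dataset("beans", split="train", streaming=True)

# Rows can be handed straight to PyTorch (or TensorFlow via to_tf_dataset)
ds_torch = ds["train"].with_format("torch")
```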
Ray Data enables scalable, distributed data processing across multiple nodes, ideal for large datasets. It integrates with Ray’s ML tools for parallel training and distributed transformations. It’s the go-to tool for Large Language Model training, even embraced by OpenAI for ChatGPT.
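As a rough sketch (a tiny local example; the same code is what you would run against a multi-node cluster):

```python
import ray

ray.init()  # connects to an existing cluster if one is running, otherwise starts a local one

# The pipeline definition is the same whether it runs on a laptop or across many nodes
ds = ray.data.from_items([{"x": i} for i in range(8)])
ds = ds.map_batches(lambda batch: {"x": batch["x"] * 2})
print(ds.take(3))
```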
Daft is a high-performance data processing library with lazy evaluation, optimized for structured data formats like Parquet and Arrow. It’s a strong choice for single-node and multi-node data preparation with PyTorch compatibility. It utilizes Ray to achieve multi-node behavior.
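A hedged sketch of the lazy-evaluation flow; the Parquet path and column names are placeholders:

```python
import daft

# Nothing is read yet: daft only builds a lazy query plan over the Parquet files
df = daft.read_parquet("data/*.parquet")
df = df.where(daft.col("label") == 0).select("image", "label")

# Work happens only when the result is materialized
df = df.collect()

# The same plan can run on a Ray cluster instead of a single node:
# daft.context.set_runner_ray()
```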
PyTorch’s Dataset and DataLoader offer a simple and flexible way to load data with minimal memory overhead, ideal for in-memory and custom datasets. It’s lightweight but lacks distributed and lazy loading features.
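A minimal sketch of the in-memory case (shapes and sizes are made up):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# For data that already fits in memory, a TensorDataset plus a DataLoader is all you need
xs = torch.randn(100, 3, 224, 224)
ys = torch.randint(0, 10, (100,))
loader = DataLoader(TensorDataset(xs, ys), batch_size=32, shuffle=True, num_workers=2)

for images, labels in loader:
    pass  # training step goes here
```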
Feature | Hugging Face Datasets | Ray Data | Daft | PyTorch Dataset + DataLoader |
---|---|---|---|---|
Parallel Processing | + | +++ | ++ | + |
Distributed Processing | 0 | +++ | +++ | 0 |
Caching & Memory Mapping | +++ | + | + | 0 |
Lazy Loading | +++ | ++ | +++ | +++ (depends) |
Simple to Use | +++ | + | ++ | +++ |
Built-in Dataset Access | +++ | 0 | 0 | +++ |
Custom Transformations | ++ | +++ | +++ | +++ |
ML Framework Support | +++ | +++ | ++ | ++ |
Mini Benchmark
Tool | Num_worker | Pin_memory | Cache | Configuration | Time |
---|---|---|---|---|---|
HF Element | None | None | False | .map | 6m48s |
 | None | None | True | .with_transform | 3m23s |
HF Batched | None | None | False | .map | 7m14s |
 | None | None | True | .map | 3m22s |
Torch Dataset/Loader | None | None | - | Default | 3m20s |
Daft | - | - | - | daft-default | 14m55s |
 | - | - | - | daft-native | 3m30s |
Ray | - | - | - | Default | 7m41s |
Running on full-sized images, the results get a bit more interesting:
Tool | Num_worker | Pin_memory | Cache | Configuration | Time |
---|---|---|---|---|---|
Additional Tests | - | - | - | torch | 4m19s |
 | - | - | - | hf_with_transf | 4m40s |
 | - | - | - | hf_map | 8m14s, cached: 7m21s |
 | - | - | - | daft | 3m49s |
Developer Experience (DX)
def test_elem_by_elem(num_workers: int | None, pin_memory: bool | None, cache: bool):
    if not cache:
        datasets.disable_caching()

    def _preprocess(data: dict):
        imgs = [utils.PREPROCESS_TRANSFORMS(x.convert("RGB")) for x in data["image"]]
        data["image"] = imgs
        return data

    ds = datasets.load_from_disk("./imagenette_full_size")
    def _augment(data: dict):
        # _preprocess returns the batch dict, so augment each preprocessed tensor
        data = _preprocess(data)
        data["image"] = [utils.AUGMENTATIONS(x) for x in data["image"]]
        return data
= ds["train"].with_transform(_augment)
ds_train = ds["validation"].with_transform(_preprocess)
ds_valid
= dict(
kwargs =num_workers or 0,
num_workers=bool(num_workers),
persistent_workers=pin_memory,
pin_memory=32,
batch_size
)= torch.utils.data.DataLoader(ds_train, **kwargs)
dls_train = torch.utils.data.DataLoader(ds_valid, **kwargs)
dls_valid
class ImagenetteDataset(Dataset):
    def __init__(self, hf_dataset, preprocess=None, augment=None):
        self.hf_dataset = hf_dataset
        self.preprocess = preprocess
        self.augment = augment

    def __len__(self):
        return len(self.hf_dataset)

    def __getitem__(self, idx):
        data = self.hf_dataset[idx]
        image = data["image"].convert("RGB")

        # Apply preprocessing and augmentation if specified
        if self.preprocess:
            image: torch.Tensor = self.preprocess(image)
        if self.augment:
            image = self.augment(image)

        return {
            "image": image,
            "label": data["label"],
        }


train_dataset = ImagenetteDataset(
    ds["train"], preprocess=utils.PREPROCESS_TRANSFORMS
)
print(len(train_dataset))
valid_dataset = ImagenetteDataset(
    ds["validation"], preprocess=utils.PREPROCESS_TRANSFORMS
)

# Create DataLoader instances
kwargs = dict(
    num_workers=num_workers or 0,
    persistent_workers=bool(num_workers),
    pin_memory=pin_memory,
    batch_size=32,
)
dls_train = DataLoader(train_dataset, shuffle=True, **kwargs)
dls_valid = DataLoader(valid_dataset, **kwargs)
def load_imagenette_datasets_daft(dataset_path="./imagenette_full_size"):
    ds = datasets.load_from_disk(dataset_path)

    # Pull the raw image bytes out of the HF image struct column
    extract_img_bytes = daft.col("image").struct.get("bytes").alias("image")
    ds_train = daft.from_arrow(ds["train"].data.table).select(
        "label", extract_img_bytes
    )
    ds_valid = daft.from_arrow(ds["validation"].data.table).select(
        "label", extract_img_bytes
    )

    img_decode_resize = (
        daft.col("image").image.decode(mode="RGB").image.resize(224, 224)
    )

    ds_train = ds_train.with_column("image", img_decode_resize)
    ds_valid = ds_valid.with_column("image", img_decode_resize)

    def to_f32_tensor(ds: daft.DataFrame):
        # Scale to [0, 1] and move channels first (HWC -> CHW)
        return ds.with_column(
            "image",
            daft.col("image").apply(
                lambda x: (x / 255.0).transpose(2, 0, 1),
                return_dtype=daft.DataType.tensor(daft.DataType.float32()),
            ),
        )

    ds_train = to_f32_tensor(ds_train)
    ds_valid = to_f32_tensor(ds_valid)

    ds_train = ds_train.to_torch_iter_dataset()
    ds_valid = ds_valid.to_torch_iter_dataset()

    dls_train = torch.utils.data.DataLoader(ds_train, batch_size=32)
    dls_valid = torch.utils.data.DataLoader(ds_valid, batch_size=32)
    return dls_train, dls_valid
def load_imagenette_datasets_ray(dataset_path="./imagenette_full_size"):
    # Load the Arrow dataset with Ray
    ds = datasets.load_from_disk(dataset_path)

    def extract_img_to_pil(data):
        image = data["image"]["bytes"]
        data["image"] = PIL.Image.open(io.BytesIO(image)).convert("RGB")
        return data

    ds_train = ray.data.from_huggingface(ds["train"]).map(extract_img_to_pil)
    ds_val = ray.data.from_huggingface(ds["validation"]).map(extract_img_to_pil)

    preprocess_transforms = transforms.Compose(
        [
            utils.PREPROCESS_TRANSFORMS,
        ]
    )
    augmentation_transforms = utils.AUGMENTATIONS

    # Apply transformations in Ray
    def preprocess_image(batch):
        batch["image"] = [preprocess_transforms(x) for x in batch["image"]]
        return batch

    def augment_image(elem):
        elem["image"] = augmentation_transforms(elem["image"])
        return elem

    ds_train = ds_train.map_batches(preprocess_image).map(augment_image)
    ds_val = ds_val.map_batches(preprocess_image)

    d_train = ds_train.to_torch(
        label_column="label",
        batch_size=32,
        local_shuffle_buffer_size=512,
        prefetch_batches=5,
    )
    d_valid = ds_val.to_torch(label_column="label", batch_size=32, prefetch_batches=5)

    return d_train, d_valid
I think most of the frameworks end up in a similar place in terms of developer experience.
Quick DX Ranking
- PyTorch & HuggingFace Datasets
- Daft
- Ray (although I believe it to be the most scalable solution, since you can truly tinker with the details)
I enjoyed Daft a lot with its multi-modal syntax, inspired by polars, with namespaces (e.g. `.image.decode()`), which can be phenomenal. Working with DataFrames is a cool addition, and you can drop into Python simply by using `apply`.
Working with Daft more and more, I noticed that the DataFrame syntax sometimes becomes a big blocker, and the simplicity of HF Datasets and Ray using `dict`s in `.map` calls results in easier code and smoother integration with existing libraries.
Additionally, HF Datasets and PyTorch DataLoaders feel more Pythonic, and the latter is really simple. I can’t put my finger on it, but they just seem easier to debug and understand.
It’ll surely be interesting to follow the progress being made, and I’m happy the dust hasn’t settled yet!