Data Loading - HuggingFace Datasets

hf-dataset
recipe
Probably one of the easiest and best ‘getting-started’ dataset libraries. It may not squeeze out the last bit of performance, but it’s good enough, and the simplicity is hard to beat!
Published

December 2, 2024

In my previous blog post I introduced Daft. Now I want to introduce another tool, and later I’ll share both a comparison post and recipes for both libraries on common tasks, such as object detection.

Loading From HuggingFace Datasets

HuggingFace Datasets is one of the biggest dataset hubs out there, so integrating with it matters a lot. Luckily, it’s easy!

import datasets

# One call downloads (and caches) the dataset; the result is a DatasetDict with one Dataset per split.
ds = datasets.load_dataset("detection-datasets/fashionpedia")
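
Before writing any transforms it’s worth peeking at what came back. A minimal sanity check, assuming the usual split names and the image/objects columns this dataset exposes:

print(ds)                         # splits and row counts
print(ds["train"].features)       # column names and types, e.g. image, objects

example = ds["train"][0]          # image decoding happens lazily, on access
print(example["objects"].keys())  # e.g. bbox, category, ...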

Data Transforms

HuggingFace Datasets comes with a lot of nice-to-haves, like being able to call .map once on the whole DatasetDict and have it run over every split.

import albumentations as A
import numpy as np

PREPROCESS_TRANSFORMS = A.Compose(
    [A.Resize(224, 224)],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["category"]),
)

# Optional geometric augmentations (not applied in the .map below). If they are ever
# applied to detection targets, they also need bbox_params so the boxes flip with the image.
AUGMENTATIONS = A.Compose([A.HorizontalFlip(p=0.1), A.VerticalFlip(p=0.1)])


# TODO: also try batched mode and compare.
def transform_images(data: dict) -> dict:
    # Resize the image together with its boxes; Albumentations returns a dict with
    # "image", "bboxes" and "category" keys, which .map adds/updates as columns.
    out = PREPROCESS_TRANSFORMS(
        image=np.array(data["image"]),
        bboxes=data["objects"]["bbox"],
        category=data["objects"]["category"],
    )
    return out

# .map returns a new DatasetDict; keep the result.
ds = ds.map(transform_images, num_proc=4)
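
The TODO above suggests comparing against batched mode. A rough, unbenchmarked sketch of what that could look like: with batched=True each column arrives as a list, so the Albumentations call runs per example inside the function, and batch_size / num_proc become the knobs to compare.

def transform_images_batched(batch: dict) -> dict:
    # With batched=True, every value in `batch` is a list with one entry per example.
    out = {"image": [], "bboxes": [], "category": []}
    for image, objects in zip(batch["image"], batch["objects"]):
        transformed = PREPROCESS_TRANSFORMS(
            image=np.array(image),
            bboxes=objects["bbox"],
            category=objects["category"],
        )
        out["image"].append(transformed["image"])
        out["bboxes"].append(transformed["bboxes"])
        out["category"].append(transformed["category"])
    return out


# Alternative to the per-example call above, meant to run on the freshly loaded dataset:
# ds = ds.map(transform_images_batched, batched=True, batch_size=64, num_proc=4)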