Londogard Blog

ZenML or ClearML? Which MLOps tool strikes best?

Hampus Londögård — Sun, 05 May 2024 00:00:00 GMT

Keeping this to as few words as possible.

Tool	Pro	Con
ClearML	Simple & Everything “fits”	Locked into ClearML, i.e. cannot use best tool for the job
ZenML	Composable & Extendable	Multiple tools to get job done (e.g. MLFlow not visualized in UI)

Similarities:

There are a lot of similarities, and both are quite easy to get started with.

Building Pipelines

Both can use decorators, which makes the code very simple to read. Alas, the ClearML way of doing things is not quite as smooth as ZenML’s.

ZenML builds pipelines and tasks/components in a simpler, better way.

Tracking Experiments

To track experiments I believe both solutions have you covered. ClearML’s experiment tracker is quite good and works as you’d expect, while with ZenML you decide which tool you want to use (I opted for MLFlow).

ZenML supports: Comet, MLFlow, Neptune, WandB, & Custom. ClearML supports: ClearML.

It’s a draw, ZenML supports “better” trackers BUT ClearML has a native integration which makes things a lot easier.

Orchestrators

Both have a simple-to-use orchestrator. Once again ZenML leans on the giants while ClearML uses a built-in native orchestrator that binds everything together.

It’s a draw.

UI

One of the more important parts of a tool is the UI. Here I believe ZenML is strong in a way, as they “off-load” each component’s UI to the component itself, i.e. MLFlow tracing is shown in the MLFlow UI.

The UI itself of each tool, i.e. WandB, is much better than ClearML’s offering in my opinion.
But the integration of ClearML as a “solve all” tool is a HUGE timesaver and I think it could outweigh using the “better” tooling. Integrating everything from Experiment Comparison to Report Building is quite an amazing feat that I think is worth applauding.

Conclusion

First and foremost, I see both Open Source offerings moving more and more towards SaaS. This is clearly visible by locking certain features in the UI (ZenML: the new UI is beautiful but locked down without their Cloud offering). It’s also shown by supplying additional superb features even when self-hosted. I do understand the need to pay your bills, but it’s sad to see Open Source moving to this either way.

See ZenML comparison (Open Source <> Cloud) and ClearML one.

Sometimes the best option is to opt for the “cloud-native” one, i.e. AWS/Azure/GCP tools. But I love open source… :)

Anyhow, to finalize here’s my judgement:

If you prefer to keep your stack as simple as possible: ClearML.
If you prefer to keep your stack customized having the best tool for each part: ZenML.

I cannot pick a winner. ZenML enables a simpler transition and better tooling all in all, but the full-on integration of ClearML with “everything working together” is quite magical and similar to the cloud-native options (AWS SageMaker/Azure ML Studio/GCP Vertex).

Find the code for each framework running MNIST: ….

Thanks for this time, Hampus

ClearML

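# Imports the snippet relies on (assumed; `pl` may be `lightning.pytorch` in newer installs).
import pytorch_lightning as pl
import torch
from clearml import Task
from torch import nn, optim
from torchvision import datasets, transforms

task = Task.init(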
    project_name="MNIST Digit Recognition",
    task_name="Simple NN model with PyTorch Lightning",
    task_type=Task.TaskTypes.training,
    output_uri=None,
)


class SimpleNN(pl.LightningModule):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(784, 512)
        self.dropout = nn.Dropout(0.2)
        self.fc2 = nn.Linear(512, 10)

    def forward(self, x):
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return torch.log_softmax(x, dim=1)

    def training_step(self, batch, batch_idx):
        data, target = batch
        output = self(data)
        loss = nn.functional.cross_entropy(output, target)
        self.log("train_loss", loss)
        return loss

    def test_step(self, batch, batch_idx):
        return self(batch[0])

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=0.001)


params_dictionary = {"epochs": 3}
task.connect(params_dictionary)

transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))]
)

train_dataset = datasets.MNIST(
    root="./data", train=True, transform=transform, download=True
)
test_dataset = datasets.MNIST(root="./data", train=False, transform=transform)

train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset, batch_size=128, shuffle=True
)
test_loader = torch.utils.data.DataLoader(
    dataset=test_dataset, batch_size=128, shuffle=False
)

model = SimpleNN()
trainer = pl.Trainer(max_epochs=params_dictionary["epochs"])
trainer.fit(model, train_loader)
trainer.test(dataloaders=test_loader)

ZenML

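from typing import Annotated, Tuple

import pytorch_lightning as pl
import torch
import zenml

# SimpleNN is the LightningModule defined in the ClearML example above.


@zenml.step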
def load_mnist() -> Tuple[
    Annotated[torch.utils.data.DataLoader, "train_loader"],
    Annotated[torch.utils.data.DataLoader, "test_loader"],
]:
    ...
    return train_loader, test_loader

@zenml.step
def train_model(
    train_loader: torch.utils.data.DataLoader, test_loader: torch.utils.data.DataLoader
):
    model = SimpleNN()
    trainer = pl.Trainer()
    trainer.fit(model, train_loader)
    trainer.test(dataloaders=test_loader)


@zenml.pipeline
def train_pipeline():
    train_loader, test_loader = load_mnist()
    train_model(train_loader, test_loader)


if __name__ == "__main__":
    train_pipeline()

Streamlit Fragments - Make the Dashboard Dream come true

Hampus Londögård — Wed, 17 Apr 2024 00:00:00 GMT

An old coworker gave me a shout-out that Streamlit’s latest release (1.33.0) added Fragments.

Fragments, simply put, enable the creation of independently updated fragments inside your Streamlit application. Further, they add a simple run_every parameter, which simplifies dashboards (continuously fetching data).

As always, the documentation explains a lot of how it works.

Play Around

First I play around with fragments, testing the most simple use-case – and I’m sold!

N.B. this is already possible in other tools such as Solara, which has a better reactive approach, but Streamlit has a bigger user-base and I love to see a solution to this long-standing problem!

The code

import numpy as np
import streamlit as st


def main():
    st.write("# Main Function")
    st.write("Hello, World! (main)")
    st.toggle("Toggle me!")


@st.experimental_fragment()
def first_fragment():
    st.write("## First Fragment")
    random_choice = np.random.choice(["a", "b", "c"])
    st.write(f"Random choice: {random_choice}")
    st.toggle("Toggle me! (1st Fragment)")


@st.experimental_fragment(run_every="2s")
def second_fragment():
    st.write("## Second Fragment")
    st.write("Hello, World! (2nd Fragment)")
    random_choice = np.random.choice(["a", "b", "c"])
    st.write(f"Random choice: {random_choice}")
    st.toggle("Toggle me! (2nd Fragment)")


if __name__ == "__main__":
    main()
    c1, c2 = st.columns(2)

    with c1:
        first_fragment()
    with c2:
        second_fragment()

This enables the following behavior:

Toggling “main” will refresh everything
Toggling a fragment will only refresh that fragment
Second fragment will refresh every 2 seconds

What is refreshed? The Random choice letter is updated to a random letter (a, b, or c).

All in all this is what we’d probably do in a Dashboard. See the following GIFs:

Streamlit Fragments and Toggling (simplest use-case) - note the ‘Random Choice’ changing.

Adding Complexity

As always it’s a lot more fun to test these things in scenarios that are closer to real-life, and that’s what I intend to do!

Fetching data from a data storage
Displaying different graphs
Sharing state from main

In this app we have an Amplitude Multiplier (main) that affects both fragments. The first fragment shows a sine wave whose frequency is editable, and changing it only re-renders (re-computes) that fragment. The second fragment is a Stock Fragment that automatically updates every 2 seconds: unless locked it randomly selects a stock, and if locked we can still change the stock, which only re-renders that fragment.

See the GIF below! 👇

Sine wave and Stocks, with automatic Stock Refresh

Code

import numpy as np
import streamlit as st
import polars as pl
import plotly.express as px


def main() -> float:
    st.write("# Main Function")
    st.write("Hello, World! (main)")
    multiplier = st.slider("Amplitude Multiplier", 0.0, 10.0, 1.0, 0.1)

    return multiplier


@st.cache_resource
def get_stocks() -> pl.DataFrame:
    return pl.read_csv(
        "https://raw.githubusercontent.com/vega/datalib/master/test/data/stocks.csv"
    )


@st.experimental_fragment()
def first_fragment(multiplier: float):
    st.write("## First Fragment")
    sine_frequency = st.slider("Sine Frequency", 0.0, 10.0, 1.0, 0.1)
    # create sine wave with multiplier height and sine_frequency as frequency
    t = np.linspace(0, 2 * np.pi * sine_frequency, 100)
    y = multiplier * np.sin(t)

    df = pl.DataFrame({"t": t, "y": y})
    st.plotly_chart(
        px.line(df, x="t", y="y", title="Sine wave"), use_container_width=True
    )


@st.experimental_fragment(run_every="2s")
def second_fragment(multiplier: float):
    st.write("## Second Fragment")
    c1, c2 = st.columns(2)

    with c1:
        if not st.checkbox("Lock company"):
            st.session_state["ticker_select"] = np.random.choice(
                ["AAPL", "GOOG", "AMZN"]
            )
    with c2:
        ticker = st.selectbox(
            "Company (symbol)", ["AAPL", "GOOG", "AMZN"], key="ticker_select"
        )
    stocks = get_stocks()
    stocks = stocks.filter(pl.col("symbol") == ticker).with_columns(
        pl.col("price") * multiplier
    )

    st.plotly_chart(
        px.line(stocks, x="date", y="price", title=f"Stock price ({ticker})"),
        use_container_width=True,
    )


if __name__ == "__main__":
    multiplier = main()
    c1, c2 = st.columns(2)

    with c1:
        first_fragment(multiplier)
    with c2:
        second_fragment(multiplier)

Drawbacks

This solution doesn’t fit every scenario, and as usual with Streamlit, integrating it introduces complexity via state management. Fragments add another level atop the existing st.session_state, potentially introducing more intricacies and headaches.

Other solutions such as Solara and Panel have this more built into the solution, but then again their entry threshold is a lot higher!

Outro

Any other questions? Please go ahead and ask!

This development is exciting and will for sure give Streamlit new life in “efficiency”. I, for one, am happy to see all new Data Apps fighting!

Finally, all the code is available on this blog’s GitHub under code_snippets.

/ Hampus Londögård

TIL: Pixi by prefix.dev

Hampus Londögård — Wed, 20 Mar 2024 00:00:00 GMT

This is a very short one. Keeping it for myself!

For my recent minor projects I’ve been utilizing Pixi to run my virtual environments and it actually works great!

It’s simple to start and keep going. What’s even better?

Supports multiple environments (e.g. CUDA + CPU)
Supports multiple platforms (e.g. osx-arm64 and linux-64)!
Fast (3x faster than micromamba, 10x faster than conda!)
Integrates better with pypi
Has tasks (e.g. pixi run test or pixi run inference) that you define yourself
Lockfiles: it’s painful to use micromamba’s lockfiles, hence a dual-file system (manifest + lockfile) as in Poetry/Node.js is great!

Helpful right? Indeed!

Simple get-started

pixi init # create toml and lock files
pixi add python polars # add python and polars as dependencies

pixi shell # activates the virtual environment
# Alternatively, "pixi run python ..."

Add Task

Tasks are really awesome to reduce the threshold to enter projects. It’s simpler than a spread of bash-scripts or other things.

One standardized way to do things! :)

pixi task add gui "solara run app.py" # Adds task

pixi run gui # runs Solara App

Outro

Please read the docs to learn more!

TIL: Multiple Git Remotes

Hampus Londögård — Sun, 18 Feb 2024 00:00:00 GMT

It is really simple actually. Simply call the following command:

git remote add <name> <url>

To then push, it is as simple as git push <name>.

I use this to keep my repository in HuggingFace Spaces and GitHub at the same time.

That’s it!

TIL: Building Docker Images with Conda and Custom Users (and devcontainers!)

Hampus Londögård — Sun, 18 Feb 2024 00:00:00 GMT

When building Docker containers there’s a few things to clarify:

You’re building something layer by layer
Things you do in Docker are not kept in docker run unless you specify it through special commands.
- That is, it isn’t a stateful operation to run RUN conda activate highlights.
Docker can be run as multiple users, but is built by default as root
The base image X (FROM X) can ship an environment with its own magic enabled

I found a few problems based on all this when trying to deploy my Solara application using Docker.

conda path is not enabled by default, even if I modified .bashrc.
User permissions were not available, i.e. I wasn’t allowed to create folders.
devcontainer as base image added a “default user” called vscode.

Let’s go through each one step-by-step!

Problem 1. Enabling `conda` in `docker run ..`

When running docker run no shell is applied by default. Additionally Docker defaults to /bin/sh because bash is not available in all images. This means that .bashrc modifications and similar are not applied when starting your container!

There are a few ways to make the conda env available by default in your shell. I opted for what I found simplest: modifying $PATH.

See the following edit in the Dockerfile:

ENV PATH /opt/conda/envs/<ENV_NAME>/bin/:$PATH

With this, my CMD can call solara run (solara is a Python library available in my env).
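To see why prepending the environment's bin directory is enough, here is a minimal Python sketch of how command lookup follows PATH order. The temporary directory and the fake solara script are stand-ins made up for illustration; nothing here touches a real conda env or Docker.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path
from typing import Optional


def resolve_with_prefix(command: str, prefix_dir: str) -> Optional[str]:
    """Resolve `command` the way a shell would if `prefix_dir` came first on PATH."""
    search_path = prefix_dir + os.pathsep + os.environ.get("PATH", "")
    return shutil.which(command, path=search_path)


def make_fake_env_bin(command: str) -> str:
    """Create a throwaway directory holding an executable file named `command`."""
    bin_dir = Path(tempfile.mkdtemp(prefix="fake_env_bin_"))
    script = bin_dir / command
    script.write_text("#!/bin/sh\necho fake\n")
    script.chmod(script.stat().st_mode | stat.S_IXUSR)  # mark it executable
    return str(bin_dir)


# Hypothetical stand-in for /opt/conda/envs/<ENV_NAME>/bin/
env_bin = make_fake_env_bin("solara")
print(resolve_with_prefix("solara", env_bin))
```

The lookup returns the copy inside the prefixed directory, which is the same effect the ENV PATH line has for the container's CMD.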

Problem 2. Enabling permission to do OS changes, e.g. `os.mkdir`

As mentioned, a Docker image is built using the root user. From my understanding it is later run as any user depending on how you call docker run, which can remove the ability to modify the filesystem. One such capability that I use is running mkdir and creating files.

The simplest solution I found was to:

Create a new user
- For HuggingFace Spaces its ID should be set as -u 1000
Allow the new user to control the folder where your app resides (COPY --chown=user)
Default to running as user (USER user)

In a Dockerfile it ends up as follows.

RUN useradd -m -u 1000 user
USER user

COPY --chown=user . /app
...

Voila! That should make it work!

…unless you have a weird base-image that modifies the OS, i.e. devcontainer as base-image. More about this in the section below! 👇

Problem 3. vscode user

As mentioned, in my base image, FROM mcr.microsoft.com/vscode/devcontainers/miniconda:latest, there already is a user with id=1000. This user is apparently called vscode!

I found this by running docker run -it <IMAGE> /bin/bash and poking around in the terminal.
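The same check can be scripted. Below is a minimal sketch, assuming a POSIX system (it uses the standard pwd module), that reports which account owns a given UID and whether the current process may write to the app folder. The function names are made up for this example.

```python
import os
import pwd
import tempfile
from typing import Optional


def user_for_uid(uid: int) -> Optional[str]:
    """Return the account name that owns `uid`, or None if the UID is unused."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None


def describe_runtime_user(app_dir: str) -> dict:
    """Summarise who the process runs as and whether it may write to `app_dir`."""
    uid = os.getuid()
    return {
        "uid": uid,
        "name": user_for_uid(uid),  # None when the UID has no passwd entry
        "can_write_app_dir": os.access(app_dir, os.W_OK),
    }


# UID 1000 is the one this post cares about; the temp dir stands in for /app.
print("uid 1000 belongs to:", user_for_uid(1000))
print(describe_runtime_user(tempfile.gettempdir()))
```

Run inside the container, this would have shown vscode for UID 1000 right away, and the can_write_app_dir flag points at the permission problem from the previous section.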

This was a huge blocker for me and took a long while to understand. To enable my deployment, instead of creating a custom user I hijacked the existing user vscode. See the final Dockerfile.

FROM mcr.microsoft.com/vscode/devcontainers/miniconda:latest

USER vscode

COPY --chown=vscode . /app
WORKDIR /app

RUN conda env create -f env.yml
RUN conda clean -a -y
RUN conda init

EXPOSE 8765

ENV PATH /opt/conda/envs/highlights/bin/:$PATH

CMD ["solara", "run", "sol_app.py", "--host=0.0.0.0"]

And that makes everything work!

TIL: Stlite - running streamlit in WASM through b64-encoded URLs

Hampus Londögård — Sun, 18 Feb 2024 00:00:00 GMT

Amazing one-off tools deployed as a URL that embeds a WASM app is officially here! 🤯

Wow, that was a mouthful! There’s a mini-breakdown of the tech utilized to enable this at the bottom.👇

What does this mean? It means that I can build assistive apps that have the following properties:

Deployed as a URL - no server, no nothing, simply share away 🦸
Runs completely inside browser, on the user-device 💻
Requires NO developer or developer environment to run! 😇

This is what I call truly Serverless! Because code and everything is embedded in the URL and runs inside a contained environment in the browser, it cannot be simpler to share one-off tools! I’m excited to utilize this a lot more to enable my colleagues in sister-teams!

Sample image of internal app I made

Here is a URL for a tool that does real-time image processing in the browser: stlite.net! Here’s where you can work inside the browser to build a script: edit.share.stlite.net.

Tech breakdown that enables everything:

1. Emscripten/WASM enables C code running efficiently in the browser
2. Pyodide is a CPython port to Emscripten/WASM that works brilliantly, including micropip that enables a bunch of libraries
3. Stlite is a Streamlit port to Pyodide (with few caveats)
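To make the "code lives in the URL" idea concrete, here is a toy sketch that packs a script into a URL fragment with base64 and unpacks it again. This is not stlite's actual sharing format: the host, the code= key and the zlib compression step are all assumptions made for illustration.

```python
import base64
import zlib
from urllib.parse import urlsplit

BASE_URL = "https://example.invalid/app"  # hypothetical host, not stlite's
KEY = "code="  # hypothetical fragment key


def script_to_url(source: str) -> str:
    """Pack a Python script into the URL fragment (compressed, URL-safe base64)."""
    packed = base64.urlsafe_b64encode(zlib.compress(source.encode("utf-8")))
    return f"{BASE_URL}#{KEY}{packed.decode('ascii')}"


def url_to_script(url: str) -> str:
    """Recover the script from a URL produced by `script_to_url`."""
    fragment = urlsplit(url).fragment
    if not fragment.startswith(KEY):
        raise ValueError("URL carries no embedded script")
    packed = fragment[len(KEY):]
    return zlib.decompress(base64.urlsafe_b64decode(packed)).decode("utf-8")


app_source = 'import streamlit as st\nst.write("Hello from a URL!")\n'
share_url = script_to_url(app_source)
print(share_url)
```

Since the fragment (everything after #) is not sent to the server in a normal request, the whole app can travel with the link itself.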

P.S. I heard that Solara is potentially working on the same functionality natively 😉

Solara, League of Legends and Deep Learning to extract E-Sport Highlights

Hampus Londögård — Sun, 11 Feb 2024 00:00:00 GMT

Hi all! 👋

I’m back with a New Year’s Resolution to release at least 6 blogs in 2024, focusing on interesting content that’s unique and not simple click-bait.

Today I’m sharing the continuation of a presentation I did at Foo Café. I have built a complex yet user-friendly Data App using solara.

There are two parts that create this complexity.

Heavy processing that requires threading
- Deep Learning & ffmpeg processing
Multi-stage app that requires state
- This will be clear when you view the video of the app

My presentation at Foo Café focused more on the training process and exploring options to deploy my tool. This blog rather focuses on the Data App itself! 🚀

Code is available on GitHub¹.

¹ Code available on github.com/londogard/lol_highlight_detection.

Quick Backstory

My kid brother asked me to help him earn some quick bucks through automating a process to build highlight-videos of E-Sports to upload on YouTube. The emphasis was on a single streamer, TheBausFFS².

² TheBausFFS is a famous Swedish streamer in League of Legends (LoL). He’s focusing on being fun and speaking as much “swenglish” (svengelska) as possible.

When I set out, my brother hoped to earn cash. Myself, I was happy to learn more about new tools and to be able to assist my family with my expertise in Deep Learning. 💪

Choices during my Journey

I was contemplating four (4) options to deploy my resulting model.

Tool	Pro	Con
Streamlit	(I’m) experienced; beautiful; simple	Doesn’t support the dynamic nature of the app without extreme state hacking
Solara	React Data Flow, supports our use-case; complexity grows logarithmically	Uglier; (I’m) less experienced; it’s the “new kid on the block”
Jupyter Notebook	Easy; GPU @ where I co…	Not user-friendly (for non-tech persons); not as dynamic/stateful as needed (i.e. everything is a flow)
Gradio	Simple	Complexity grows exponentially

Other tools such as Panel were briefly considered and rejected due to time and non-composability. For technical details such as the model and more, see my previous blog.

Result

The whole process gave me two results.

1. Knowledge

Most important to me, I learned a lot, especially by playing around with Solara in both complex and simple use-cases.

Intimate knowledge about Solara and complex Data Apps that require high performance and efficiency.
Insights into Video Classification and how it differs from Image Classification.
Experimentation in how to build Data Apps for non-domain-experts who have no technical expertise nor willingness to learn new things.

2. A Data App

I built an exciting Data App that’s simple to use, applies automatic Video Processing and Deep Learning Inference for non-tech users.

Following from here I’ll share some building blocks to create an exciting Data App. To learn more about Solara basics I refer to my previous blog where I compare with Streamlit and introduce how to build a simple app. I’ll share code snippets and videos of the app.

The resulting code covers most of the building blocks required to build everything from basic to complex Data Apps!

App show case

See previous blog to learn more about model training.

Videos

Video: Full App Interaction (Button & Load)

Video: Plotly Callback Display

Video: Build Full Video Button

Screenshots

Inference Page

Inference Page 1

Inference Page 2

Other Pages

Download Twitch/Kick Video Page

Download Model(s) Page

Title Generation (Local LLM) Page

Building Blocks / Code-Snippets

I share some of the more interesting parts of Solara and how it can simplify your Data App.

Building Progress Loaders in Solara

I created a component that wraps Progress on top of the solara.Result returned by use_thread. This is very useful when you’re writing code and want a “prettier” spinner with some text.

The possibility to create re-usable components which are stateful like this is exciting!

@solara.component()
def ProgressDynamic(
    msg: str,
    result: solara.Result[Any],
):
    if result.state == solara.ResultState.RUNNING:
        Progress(msg)

This is a quite simple component, yet it cleans up a lot of code when called like:

res = write_video.use_thread(
    tstamp["start"],
    tstamp["end"],
    selected_vid,
    Path(file_name).stem,
)
ProgressDynamic("Building Clip...", res)

Resulting spinner on loads

Checkpointing / State Management and Parent/Children (Hierarchy)

State in Solara is managed through solara.reactive variables. These enable a clean representation of UI & backend state. Adding hierarchy makes it even better, it’s just like React! State trickles both up and down through the tree.

Trickle Down to Child

One way to use the state hierarchy is to pass a solara.reactive from parent to child, so that the child is updated whenever the parent, or anyone else, updates that state variable.

@solara.component
def CutOffChartSelection(
    cut_off: solara.Reactive[int],
    df: pl.DataFrame,
): # this method is simplified to show-case important parts.
    div = solara.Column()
    solara.SliderInt(
        "Highlight Y-Cutoff",
        cut_off,
        min=df["preds"].min() + 1,
        max=df["preds"].max(),
        thumb_label="always",
        tick_labels="end_points",
    )
    with div:
        fig = px.line(df, x="timestamp", y="preds", line_shape="hv")
        fig.add_hline(y=cut_off.value, line_color="red")
        solara.FigurePlotly(fig)

What’s important here?

Note how I set div = solara.Column(): this lets me re-order the UI and disregard execution flow.
Note cut_off is a solara.Reactive: whenever it is updated in the parent the child is updated as well. And the reverse is also true.
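The two-way flow can be illustrated without Solara at all. Below is a toy observable, written only for this illustration (ToyReactive is not Solara's implementation or API): parent and child hold the same object, so an update from either side reaches the other.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class ToyReactive(Generic[T]):
    """A tiny observable value: every holder shares one object and sees updates."""

    def __init__(self, value: T) -> None:
        self._value = value
        self._listeners: List[Callable[[T], None]] = []

    @property
    def value(self) -> T:
        return self._value

    @value.setter
    def value(self, new_value: T) -> None:
        if new_value == self._value:
            return  # unchanged, so nobody needs to re-render
        self._value = new_value
        for listener in self._listeners:
            listener(new_value)

    def subscribe(self, listener: Callable[[T], None]) -> None:
        self._listeners.append(listener)


# The "parent" owns the state and hands the very same object to the "child".
cut_off = ToyReactive(5)
parent_renders: List[int] = []
child_renders: List[int] = []
cut_off.subscribe(parent_renders.append)
cut_off.subscribe(child_renders.append)

cut_off.value = 7  # parent updates: the child is notified (trickle down)
cut_off.value = 9  # child updates: the parent is notified (trickle up)
print(parent_renders, child_renders)  # → [7, 9] [7, 9]
```

There is no separate "up" and "down" mechanism here: sharing one reactive object is what gives both directions for free.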

Trickle Up to Parent

State does not only trickle down. As the solara.reactive is the same object for every user, we can change state in a child and have it trickle up to the parent and other consumers!

It’s a very simple and clean approach to state, kudos React and Solara!

@solara.component
def ModelFileSelectComponent(
    file: solara.Reactive[str],
    model: solara.Reactive[str],
):
    files = ...
    models = ...
    with solara.Card("Select Video/Model"):
        with solara.Columns():
            solara.Select("Select File", values=files, value=file)
            solara.Select("Select Model", values=models, value=model)

In this component we can see that we pass in file: solara.Reactive, which is edited through the solara.Select. This creates a clean trickle-up flow and allows us to hide details.

⚠️ It would be even cleaner if our component could return this reactive variable, rather than declaring it in parent.

Threading to improve UI experience

When working with UI it’s important to not block the main thread, sometimes called the rendering thread.

If we block the main thread the app becomes frozen and doesn’t respond. This becomes a bad user experience.

It’s solved by using threads, which fortunately Solara makes a breeze. There are two ways to apply threading.

# 1. Memoized function
@solara.memoize
def slow_func(...): ...
result = slow_func.use_thread(...)

# 2. Run thread directly
def slow_func_2(...): ...
result = solara.use_thread(lambda: slow_func_2(...), dependencies=...)

Both approaches result in a solara.Result value that updates itself as the work progresses.

solara.Result[T] has two important attributes:

state, which is one of the following: [..., RUNNING, FINISHED].
value, which is filled with the result T once finished.

The result is a smooth experience through threads. The developer side is not as perfect and feels underdeveloped: I end up writing something that feels almost like an anti-pattern, with a bunch of if-elses to figure out whether the result is done.

I can see a future where you call a .use_thread with something like apply_progress=True and also inject your own solara.Reactive to allow a more reactive approach to the result. The current result is clear code though, that’s explainable even on GitHub - as shared below.

res = write_video.use_thread(...)

if res.state == solara.ResultState.RUNNING:
    Progress("Writing video...")
elif res.state == solara.ResultState.FINISHED:
    show_finished_ui(...)

Again, this is no problem once in a while, but if you’re using threads heavily it becomes an if-else craze.
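One way to tame the if-else chain is a small lookup table from state to render function. The sketch below uses stand-in types (ToyState and ToyResult are hypothetical, not solara.ResultState or solara.Result) so it runs on its own; the same pattern should carry over to the real classes.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Callable, Dict, Optional


class ToyState(Enum):
    # Stand-in for solara.ResultState; only the states named in this post.
    RUNNING = auto()
    FINISHED = auto()


@dataclass
class ToyResult:
    # Stand-in for solara.Result: a state plus the value once finished.
    state: ToyState
    value: Optional[Any] = None


def render_result(
    result: ToyResult,
    handlers: Dict[ToyState, Callable[[ToyResult], str]],
    fallback: Callable[[ToyResult], str] = lambda _r: "",
) -> str:
    """Pick the UI for the current state from a table instead of an if/elif chain."""
    return handlers.get(result.state, fallback)(result)


handlers = {
    ToyState.RUNNING: lambda _r: "Writing video...",
    ToyState.FINISHED: lambda r: f"Done: {r.value}",
}

print(render_result(ToyResult(ToyState.RUNNING), handlers))  # → Writing video...
print(render_result(ToyResult(ToyState.FINISHED, "clip.mp4"), handlers))  # → Done: clip.mp4
```

In a real app the handlers would call components rather than return strings, but the table keeps each state's UI in one place.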

All in all it’s a good experience to include threading though, something unimaginable in Streamlit!

Plotly Callbacks

I implemented a really cool Plotly Callback through Solara’s integration. It allows the user to select a subset of the video by “zooming”/“dragging”. It’s a clean approach to selecting sub-parts of videos to build the full video, i.e. one game at a time and not the full stream! See Figure 1.

def update_vals(relayout_dict: dict[str, Any] | None):
    if relayout_dict is not None:
        layout = relayout_dict["relayout_data"]
        start_stop.value = [
            parser.parse(layout["xaxis.range[0]"], ignoretz=True),
            parser.parse(layout["xaxis.range[1]"], ignoretz=True),
        ]

solara.FigurePlotly(fig, on_relayout=update_vals)

Video

Figure 1: Plotly Callback Display
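Not every relayout event carries an explicit x-range (a double-click reset, for example, reports autorange instead), so a defensive parser can be handy. The sketch below is a standalone variant of the callback's parsing step. It swaps dateutil for the standard library's datetime.fromisoformat, which assumes timestamps in a form that function accepts, and parse_x_range is a name made up for this example.

```python
from datetime import datetime
from typing import Any, Dict, Optional, Tuple


def parse_x_range(
    relayout_data: Optional[Dict[str, Any]],
) -> Optional[Tuple[datetime, datetime]]:
    """Return (start, stop) if the event carries an explicit x-range, else None."""
    if not relayout_data:
        return None
    start = relayout_data.get("xaxis.range[0]")
    stop = relayout_data.get("xaxis.range[1]")
    if start is None or stop is None:
        # e.g. a double-click reset, which carries no explicit range
        return None
    return datetime.fromisoformat(start), datetime.fromisoformat(stop)


zoom_event = {
    "xaxis.range[0]": "2024-02-11 12:00:00",
    "xaxis.range[1]": "2024-02-11 12:05:30",
}
print(parse_x_range(zoom_event))
print(parse_x_range({"xaxis.autorange": True}))  # → None
```

Returning None for events without a range lets the callback simply keep the previous selection.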

How To Run App

I have added two ways to run the app. Either through Docker/podman containers or through Python invocation.

For Python I opted for a conda env using micromamba, which is really awesome and a simple way to get an environment that’s reproducible.

Solara findings

`solara.reactive` is awesome

This simple tool enables things that are essentially impossible or creates unmaintainable code in Streamlit. To do similar things you’d need to hack the state.

Having a multi-stage app like this one, where you checkpoint each step into the state, would be an impossibility.

Solara’s UI/logic separation is better but not perfect

This might be on myself as I started using Streamlit before Solara. But Solara states that they separate UI and logic better than competitors through reactive state. My experience is that it’s almost as easy to create the same interdependent mess as in Streamlit.

Perhaps React’s “clean” separation is more because of having a clear boundary between backend/frontend rather than React itself.

Solara Issues

I found myself having big problems with “resetting” the state of variables that depend on the Operating System. In Streamlit you can use st.rerun() to rerun the whole UI; Solara does not have something similar.

My problem was that model/video download is part of the UI, and I have solara.reactive variables that hold the downloaded models/videos. It’s not really possible to update these when changing tabs.
For now there is a decent work-around: rather than using Tabs I use Select to change page. By changing page this way the solara.reactive variables are reloaded.

Outro

This is it for today! I hope you enjoyed and potentially became intrigued to test Solara.

I hope to be back within 2 months, to make sure I keep my New Year’s Resolution 😉.

~Hampus Londögård

Automatic Highlight Detection of League of Legends Streams

Hampus Londögård — Sun, 10 Sep 2023 00:00:00 GMT

I received a quest from my little brother: creating a League of Legends (LoL) Highlight Extraction tool. It seemed simple enough and I was happy to be able to use my Deep Learning skills for something that could be used by a non-tech person! 🤓

Where do you even start in such a quest? Data!

Have my little brother annotate at minimum 2 streaming sessions of TheBauss (6 hours each)
- It’s some data, not enough or varied enough but it’s something.
Create balanced dataset with good splits
- We cannot leak data between train and validation
  - This is solved by chunking videos into 30s segments which are then split between train/validation.
- Balancing the data to have a little higher highlight distribution than reality, to generalize better.
  - Reality is ~13% highlights, I rebalance the dataset to be ~30% highlights in training.
Start training!
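The rebalancing step can be sketched as follows. The post does not say how the ~30% ratio was reached, so downsampling the non-highlight frames is an assumption, and downsample_negatives is a hypothetical helper. It should only be applied to the training split, after the chunk-based train/validation split.

```python
import random
from typing import List, Sequence


def downsample_negatives(
    labels: Sequence[int], target_pos_ratio: float = 0.30, seed: int = 42
) -> List[int]:
    """Return sorted indices keeping all positives and enough negatives to hit the ratio."""
    if not 0.0 < target_pos_ratio < 1.0:
        raise ValueError("target_pos_ratio must be strictly between 0 and 1")
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # positives / (positives + kept_negatives) = ratio  =>  solve for kept_negatives
    wanted_neg = round(len(pos) * (1.0 - target_pos_ratio) / target_pos_ratio)
    kept_neg = random.Random(seed).sample(neg, min(wanted_neg, len(neg)))
    return sorted(pos + kept_neg)


labels = [1] * 13 + [0] * 87  # ~13% highlights, as in the raw data
kept = downsample_negatives(labels)
ratio = sum(labels[i] for i in kept) / len(kept)
print(len(kept), round(ratio, 3))  # → 43 0.302
```

Keeping every highlight and only dropping non-highlights means no rare positive example is thrown away.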

Thebausffs
Simon “Thebausffs” Hofverberg (born October 3, 1999) is a Swedish player who last played for G2 Esports. He’s one of the biggest League of Legends streaming personas on Twitch right now (2023).

Known for:

Highest ranked AD Sion player on EUW.
“Inting mode” where he dies but wins in gold based on calculated trades and “good deaths”.

Model Architectures

I chose to go with two different architectures:

Image Classification
- KISS: use ResNet (testing both pretrained & scratch)
- Potentially use some other architecture in the future
Video Classification
- KISS: Use RNN on top of “Image Classifier” that has no head, i.e. Feature Extractor
- RNNs are naturally good with sequences as they have an internal state

In the future I’d like to try a VisionTransformer (ViT), but for now I’ll keep my compute low and see where I can go! As Jeremy Howard says, start simple and scale when required - it’s better to have something than nothing.

Implementation

The implementation exists in a single Notebook (runnable in Google Colab with a free GPU) - keeping it simple to use as a non-programmer (my brother).

There’s a few steps:

Download Twitch Stream (twitch-dl)
Convert into frames at a set FPS (3, ffmpeg)
Upload to Cloudflare R2 (persist)
Build DataLoader
Select and Train Model
Load model and run inference

FFMPEG + CUDA (step 2)

The ffmpeg conversion from Video to Frame was incredibly slow on Colab, I found two issues:

The CPU is much slower than an M1 MacBook’s (~30x)
Mounting storage (GDrive) makes it not work at all (100x)

The second issue is simple to solve: keep all data on Colab’s local storage during conversion, then copy it to the mounted storage (GDrive) for persistence when done.

The first is a bit harder, but I found that I could accelerate conversion using CUDA! 🤓

The result? Beautiful!
There’s some delay from moving data from RAM to GPU, but it gave an order-of-magnitude speed-up, which is awesome!

And it’s not that hard, I was confused by multiple places to enable CUDA, but found the following to be the best:

ffmpeg -hwaccel cuda -i <FILE> -preset faster -vf fps=3 -q:v 25 img%d.jpg

This command uses CUDA as hwaccel, which means that it accelerates certain workloads by moving data to the GPU (expensive) and then computing on the GPU (fast). It’ll run with fps=3 in this setup. I found that on average it sped up my frame-creation by 14-30x.
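Since the conversion runs from a notebook, it can help to assemble the same command in Python. This is a minimal sketch that only reuses the flags shown above; build_frame_extract_cmd and extract_frames are made-up helper names, and ffmpeg itself must be installed for the second function to work.

```python
import subprocess
from typing import List


def build_frame_extract_cmd(
    video_path: str, out_pattern: str = "img%d.jpg", fps: int = 3, use_cuda: bool = True
) -> List[str]:
    """Assemble the ffmpeg argv from the post; -hwaccel is an input option, so it precedes -i."""
    cmd = ["ffmpeg"]
    if use_cuda:
        cmd += ["-hwaccel", "cuda"]
    cmd += ["-i", video_path, "-preset", "faster", "-vf", f"fps={fps}", "-q:v", "25", out_pattern]
    return cmd


def extract_frames(video_path: str, **kwargs) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_frame_extract_cmd(video_path, **kwargs), check=True)


print(" ".join(build_frame_extract_cmd("stream.mp4")))
```

Passing the arguments as a list avoids shell quoting issues with file names, and use_cuda=False gives a CPU fallback for machines without a GPU.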

Small to Medium Dataset Speedups

In my professional worklife I’ve found that smaller datasets (<= RAM) get huge performance gains from keeping tensors in-memory; the gain is much larger than I ever anticipated.
For this dataset I could only keep labels in-memory, but doing that makes sense as well. 😉

Analysis

I analyzed my results in a few different ways:

Metrics (😅)
LIME
Viewing data that model predicts as non-highlight but is highlight
Timeline chart
- This was the best way to review an “unknown” clip: do the highlight intensities overlap real highlights? Boy they do!
- It’s also how it’ll be used by my brother.

Result

The results are better than expected! I hit ~80% accuracy which is pretty good based on image-to-image classification, if we did an average of the last few frames I’d bet it’d be better. All in all having non-perfect score in such a subjective task as Highlight Extraction is a strength. We don’t want to overfit the data but rather be able to find highlights in new clips.

Validation Accuracy

And to build my highlights I did a rolling_sum on the last 30 seconds of predictions, giving a highlight intensity chart as shown in Figure 1.
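To make the rolling sum concrete, here is a minimal plain-Python sketch. It is a stand-in, not the code I ran (a DataFrame `rolling_sum` does the same job); the `FPS` and `WINDOW` names and the toy predictions are assumptions for illustration, with 3 frames per second taken from the ffmpeg command.

```python
def rolling_sum(predictions: list[int], window: int) -> list[int]:
    """Trailing rolling sum: entry i sums the last `window` predictions up to and including i."""
    out = []
    running = 0
    for i, pred in enumerate(predictions):
        running += pred
        if i >= window:
            # Drop the prediction that just fell out of the window.
            running -= predictions[i - window]
        out.append(running)
    return out


FPS = 3            # frames extracted per second
WINDOW = 30 * FPS  # 30 seconds of frames

# Toy per-frame predictions (1 = highlight); real ones come from the classifier.
toy_predictions = [0, 1, 1, 0, 1]
intensity = rolling_sum(toy_predictions, window=2)
print(intensity)  # [0, 1, 2, 1, 1]
```

On real data the call would be `rolling_sum(predictions, WINDOW)`, and peaks in the resulting series are the candidate highlights.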

Figure 1: Highlight Intensity Chart

After manual validation of the Intensity Chart it looked really good! It had a lower score on some “easier 1v1 kills”, but all the cool fights and ults were caught by the model! I’m a bit impressed by how well a simple Image Classifier solves this task; applying Video Classification didn’t really change the results that much and introduced unwarranted complexity.

The result is that the model predicts ~8% of all frames as highlights, which is a pretty darn good number as the “real” one is ~13%!

Appendix: Code

Data Tools

Data Splitter

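import numpy as np
from sklearn.model_selection import train_test_split

def chunk_splitter(total_size: int, chunk_size: int, split: int | float) -> np.ndarray:
    # Split whole chunks (not single frames) so neighbouring frames end up in the same set.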
    _, val_idxs = train_test_split(np.arange(total_size // chunk_size), test_size=split, random_state=42) # ignoring final unsized chunk
    is_valid = np.zeros(total_size, dtype="int")

    for index in val_idxs:
        index *= chunk_size
        is_valid[index:index+chunk_size] = 1

    return is_valid

FrameDataset

class FrameDataset(Dataset):
  def __init__(self,
               df: pl.DataFrame,
               augments: Compose,
               frames_per_clip: int,
               stride: int | None = None,
               is_train: bool = True,
               ):
        super().__init__()
        self.paths = df["path"].to_list()
        self.is_train = is_train
        if is_train:
          self.y = torch.tensor(df["label"])
        self.frames_per_clip = frames_per_clip
        self.augments = augments
        self.stride = stride or frames_per_clip

  def __len__(self):
      return len(self.paths) // self.stride

  def __getitem__(self, idx):
      start = idx * self.stride
      stop = start + self.frames_per_clip
      if stop-start<=1:
        path = self.paths[start]
        frames_tr = self._open_augment_img(path)
        if self.is_train:
          y = self.y[start]
      else:
        frames = [self._open_augment_img(path) for path in self.paths[start:stop]]
        frames_tr = torch.stack(frames)
        if self.is_train:
          y = self.y[start:stop].max()
      if self.is_train:
        return frames_tr, y
      else:
        return frames_tr

  def _open_augment_img(self, path):
      img = default_loader(path)
      img = self.augments(img)
      return img

Lightning DataModule

class FrameDataModule(L.LightningDataModule):
    def __init__(self,
                 dataset: Dataset,
                 batch_size: int = 32,
                 chunk_size_for_splitting: int = 3 * 30,
                 num_workers: int = 2,
                 pin_memory: bool = False
                 ):
        super().__init__()
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.pin_memory = pin_memory
        self.chunk_size_for_splitting = chunk_size_for_splitting
        split = chunk_splitter(len(self.dataset), chunk_size=self.chunk_size_for_splitting, split=.15)  # use the dataset passed in, not a global
        val_indices = np.where(split)[0]
        train_indices = np.where(split == 0)[0]
        self.ds_train = Subset(self.dataset, train_indices)
        self.ds_val = Subset(self.dataset, val_indices)

    def train_dataloader(self):
        return DataLoader(self.ds_train, shuffle=True, batch_size=self.batch_size, num_workers=self.num_workers, pin_memory=self.pin_memory)

    def val_dataloader(self):
        return DataLoader(self.ds_val, batch_size=self.batch_size, num_workers=self.num_workers, pin_memory=self.pin_memory)

Classifiers

Image Classifier (ResNet)

class ResNetClassifier(nn.Module):
    def __init__(self, model: ResNet, num_classes: int = 2):
        super().__init__()
        self.num_classes = num_classes
        self.model = model
        self.model.fc = nn.Linear(self.model.fc.in_features, num_classes)

    def forward(self, x):
        return self.model(x)

Video Classifier (RNN+ResNet)

class RNNClassifier(nn.Module):
    def __init__(self, model: ResNet, num_classes: int = 2):
        super().__init__()
        self.num_classes = num_classes
        self.feature_extractor = model # repeat thrice
        self.feature_extractor.fc = nn.Linear(512, 512) # New fc layer
        self.rnn = nn.LSTM(
            input_size=512,
            hidden_size=256,
            num_layers=1,
            batch_first=True
          )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
      features = []

      for i in range(x.shape[1]):
          frame_feat = self.feature_extractor(x[:, i])
          features.append(frame_feat)

      x = torch.stack(features, dim=1)  # (batch, frames, features); stacking on dim 0 and reshaping would mix frames across samples

      out, _ = self.rnn(x)
      out = out[:, -1, :]

      out = self.classifier(out)

      return out

LightningWrapper

from typing import Any
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L
import torchmetrics

class LightningWrapper(L.LightningModule):
    def __init__(self, model: nn.Module, learning_rate=1e-3):
        super().__init__()
        self.model = model
        self.lr = learning_rate
        metrics = torchmetrics.MetricCollection({
            "accuracy": torchmetrics.Accuracy(task="multiclass", num_classes=self.model.num_classes)
            })
        self.train_metrics = metrics.clone(prefix="train_")
        self.val_metrics = metrics.clone(prefix="val_")

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        self.log('train_loss', loss)
        self.train_metrics(logits, y)
        self.log_dict(self.train_metrics, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        self.log('val_loss', loss)
        self.val_metrics(logits, y)
        self.log_dict(self.val_metrics, prog_bar=True)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
        return optimizer

Training Loop

Lightning Loop

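import mlflow
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers import MLFlowLogger
from torchvision import models, transforms

mlf_logger = MLFlowLogger(log_model=True)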
mlflow.pytorch.autolog()

BATCH_SIZE = 128
# Define the duration of each chunk in seconds
chunk_duration_s = 30
chunk_duration_frames = 3 * chunk_duration_s
transform = transforms.Compose([
  transforms.Resize((224, 224)),
  transforms.ToTensor()
])
ds = FrameDataset(labeled_df_rebalanced, transform, 1)  # frames_per_clip: 1 for the image classifier, set higher for the video classifier
data = FrameDataModule(ds, BATCH_SIZE, chunk_duration_frames, pin_memory=True)

model = ResNetClassifier(models.resnet50()) # image classifier
# model = RNNClassifier(models.resnet18()) # video classifier
model = LightningWrapper(model, learning_rate=1e-4)
trainer = L.Trainer(max_epochs=5, logger=mlf_logger, callbacks=[ModelCheckpoint(".")])


with mlflow.start_run():
  trainer.fit(model, datamodule=data)

Solara - ‘A new Reactive Streamlit’

Hampus Londögård — Fri, 30 Jun 2023 00:00:00 GMT

Solara (solara.dev) is a fresh and exciting web framework that enables React-style state-handling while keeping UI (almost) as easy as Streamlit (streamlit.io). Further by design it seems to have a bigger industrial potential than Streamlit.
In this post I’ll introduce Solara and compare it to the more well known Streamlit that can be used to build web apps and interactive Data Dashboards.

Solara and Streamlit are web frameworks that make it very easy to write a full-stack app, which can be anything from a small Proof-of-Concept to a big complicated Data Dashboard. There exist other competitors such as Dash, Panel and Voila, which we’ll not cover in depth as comparisons between those and Streamlit have been done previously.

I’m very excited to see where Solara ends up in the future. I believe it has a bright future ahead. During my initial tests it seems like a solid framework, there’s still some rough edges to fix but all in all it’s really good!

Here’s my quick introduction and comparison with Streamlit!

Streamlit and why it’s awesome

I’m a huge Streamlit fan. There’s a lot to love; the first time I tried it everything clicked. It’s simple and good looking, and the User Experience (UX) and Developer Experience (DX) are exceptional. Using it I can easily build beautiful web apps in no time, as well as efficient Data Dashboards and small Proof-of-Concepts (PoC).

With that said, everything has a drawback, and Streamlit’s is performance once the app grows large. Streamlit works by rerunning the entire script top-down whenever something changes in the web app. This makes it incredibly simple to reason about and reduces the number of bugs, but it can also become a huge bottleneck that slows down the app.

Streamlit mitigates this with a cache, which allows reusing results from expensive computations, and in the UI it reuses components to make the flow smoother if nothing changes; this is done using the internal state. Once you start modifying state and cache, the complexity grows quickly and the app becomes much harder to reason about.
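The rerun-plus-cache model can be illustrated without Streamlit at all. The sketch below is a plain-Python simulation under stated assumptions: `script` stands in for the whole app file, `expensive_stats` for a slow computation, and `functools.lru_cache` plays the role that a decorator such as `st.cache_data` plays in a real Streamlit app. All names are hypothetical.

```python
from functools import lru_cache

calls = {"expensive": 0}


@lru_cache(maxsize=None)
def expensive_stats(dataset_name: str) -> int:
    """Stand-in for a slow computation; the result is cached per argument."""
    calls["expensive"] += 1
    return len(dataset_name) * 1000


def script(slider_value: int) -> str:
    """Stand-in for the app script, which is rerun top-down on every interaction."""
    stats = expensive_stats("starbucks")  # only computed on the first run
    return f"stats={stats}, slider={slider_value}"


# Three simulated interactions rerun the whole script three times...
for value in (1, 2, 3):
    script(value)

# ...but the expensive part ran only once.
print(calls["expensive"])  # 1
```

The cache keeps reruns cheap, but it is also extra state to reason about, which is where the complexity mentioned above comes from.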

Streamlit could solve this by offering better APIs that allow better “data-flow” choices, much like st.form keeps views from being recalculated until submitted.

For now though… let’s try Solara, a new exciting framework that does exactly this, but without the same simplicity.

Solara Introduction

Solara is a similar framework to Streamlit, but rather than rerunning everything top-down every time, it uses a reactive approach through reacton (github.com/widgetti/reacton), which is a pure Python port of React to ipywidgets.

This means that only components that use the reactive variable are rerun, which is very exciting! The performance improvements become great at the cost of a more declarative state-handling.
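To show the idea of “only what depends on the value is rerun”, here is a toy reactive variable in plain Python. It is a sketch of the general pattern only; `ToyReactive` and the render functions are hypothetical and this is not how Solara or reacton are implemented.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class ToyReactive(Generic[T]):
    """Holds a value and notifies only the listeners subscribed to it."""

    def __init__(self, value: T) -> None:
        self._value = value
        self._listeners: list[Callable[[T], None]] = []

    @property
    def value(self) -> T:
        return self._value

    def subscribe(self, listener: Callable[[T], None]) -> None:
        self._listeners.append(listener)

    def set(self, value: T) -> None:
        if value == self._value:
            return  # unchanged value: nothing is re-rendered
        self._value = value
        for listener in self._listeners:
            listener(value)


renders = {"title": 0, "counter_view": 0}


def render_title() -> None:
    renders["title"] += 1


def render_counter_view(_: int) -> None:
    renders["counter_view"] += 1


count = ToyReactive(0)
render_title()                        # rendered once, does not depend on `count`
count.subscribe(render_counter_view)  # depends on `count`
count.set(10)
count.set(10)  # same value, ignored
count.set(20)
print(renders)  # {'title': 1, 'counter_view': 2}
```

Compared with a top-down rerun, the title is never touched again; the price is that the dependency has to be declared explicitly.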

In Streamlit state is handled for you; in Solara state is separated from the component, like in React, which means you handle it explicitly and further reduce the hidden magic that sometimes leads to a fragile, complex app as it grows.
shiny.rstudio.com shares examples of how complex state-handling can become in Streamlit. What cannot be shared easily is how badly this scales with app size and complexity. The global state can, and probably sometimes will, lead to hard-to-find bugs and make high-performance apps hard to achieve.

An example app and deeper introduction can be found in a later section.

🥊Solara versus Streamlit

In here I’ll compare Solara and Streamlit, for a more in-depth usage see later sections.

TL;DR

Code is on GitHub lundez/solara_app / Section 1, and I added suggested improvements via PR#180 and Issue#177 to Solara.

Con: Steeper learning curve (higher floor) as you handle state manually (no magic), but with better DX I see it beating Streamlit because of the higher ceiling that brings new possibilities.
- Bonus (Pro): we learn the React paradigm, which is widely used in frontend! A win-win for developers.
Pro: Better industrial thinking
- Testing is actually handled E2E by using playwright – docs.
- Routing is a first-class citizen – docs.
- Embeddable, in flask/fastAPI/… – docs.
Con: UX is not as good as Streamlit.
Pro: Embeddable in Notebooks, making it simpler to go really fast.
Pro: Access to all ipywidgets, which is an incredible ecosystem.

Additionally, compared to another interesting framework like Panel, it’s much simpler IMO.

Solara most certainly has a bright future if development and maintenance are kept up; with additional marketing it can become a great competitor in the space to Gradio, Streamlit and others.

Solara Introduction

All code is available on GitHub lundez/solara_app and Section 1.

A simple app

Using Solara to build a simple app is pretty clean, and we can clearly see the React-style hook (use_state) being used for file, which is cool!

The entry-point in Solara is defined as a component named Page. This is automatically picked up and rendered. To show in a notebook you simply use display(Page()).

import solara
import polars as pl

@solara.component
def Page():
    file, set_file = solara.use_state(None)
    solara.Markdown("# Solara Example App (Starbucks Data)")
    solara.FileDrop(on_file=set_file, lazy=False)
    if file is not None:
        df = pl.read_csv(file["data"], null_values="-").drop_nulls()
        solara.DataFrame(df.to_pandas()) # currently does not support polars
    else:
        solara.Text("Make sure to upload a file")

And run it through

solara run app.py
# use solara run --host localhost app.py on WSL until PR#180 or other is merged

Quite simple right? It’s incredibly similar to how you run Streamlit 😉.

Working more with state

In the previous code we only used the state inside a single component. That’s all fine, but it’s not a very interesting use-case.

Solara introduces 2 other types of state, reactive and use_reactive. The two are very similar, but use_reactive can only be used locally inside a component. The reactive function should only be used outside of components, for application-wide state. If you use it inside a component it’ll be reset whenever a re-render happens, which is not what you’d expect.

As such we define a reactive variable sodium.

sodium = solara.reactive(0)

Further, we bind this to a slider

solara.SliderInt("Sodium", value=sodium)  # the label is the first argument

where it’ll automatically update whenever a user changes the slider.
We can then send this to a child-component which would be updated as well when it’s changed.

ChildPage(sodium)

See lundez/solara_app for a slightly more complex scenario.

And that’s my quick and dirty introduction to Solara!

Solara and Streamlit comparison

A comparison between the two is available in Section 1, where I use multiple components and share state between them. If you add some logging you can see that Solara doesn’t rerun code unnecessarily.

Quick Facts

Measurement	Solara	Streamlit
#lines of code (LOC)	53	49
Simplicity (0-5)	3	4
Performance (0-5)	4	3
UX (0-5) - images in Section 0.3.2.3	3	5

Streamlit wins most categories, but with the performance and possibilities of Solara I still see it as a very capable contender. With, hopefully soon, improved UX and DX, Solara can grow to be really big!

In the TL;DR section you can see some other niceties of Solara such as Testing, Embeddability and more!

Deeper Comparison

State and ‘pythonicism’

The React paradigm, while cool, is certainly not pythonic at all! I believe the initial “PoC”-stage implementation should be simpler to get started with. Being able to “turn the knobs” and squeeze out as much performance as possible is great, and should certainly be available for the more performance-intensive sections.

The positive of all this is that we learn the “React-paradigm” and we handle state explicitly, i.e. no Streamlit magic!

It’s a tie.

Embeddability

On embeddability Solara is the clear winner; being able to include it in our FastAPI backend, or to build an app initially directly in a Jupyter Notebook, is insanely good.

Solara wins.

Components & UX

Streamlit’s components are better and more beautiful, but Solara is not far behind, and the possibility to use all components from the ipywidgets ecosystem makes it incredibly powerful.

I see this as a huge boon.

Image Comparisons of Components

Solara No File – bad design and too raw

Streamlit No File – clear in both drag & select files

Solara With File – we cannot remove file clearly

Streamlit With File – file can be removed easily

Solara Slider – we don’t know the values

Streamlit Slider – values are clear

Solara Moving Slider – we now see values

Streamlit Moving Slider

Solara DataFrame – great with defaulted pagination, no sorting though.

Streamlit DataFrame – we can sort and the columns are clearly separated, it’s also searchable

It’s a small win for Streamlit. Streamlit clearly wins on ‘simplicity’, ‘design’ and ‘clarity’, but Solara gets a bonus for the number of widgets available through the ipywidgets ecosystem.

Readability

Because Solara has less magic I believe it is easier to reason about in a complex app, but for PoCs and simple apps Streamlit is just as simple.

It’s a tie, but Solara wins as the app grows.

Outro

I believe that Streamlit is still the best framework to get started with, but 6 months from now I can see Solara as the better option.

If you’re developing a dashboard or app that needs high performance and industrial strength, I can see Solara as a better choice.

I’ll happily try to help Solara grow!

~Hampus Londögård

P.S. thanks to @maartenbreddels and @Gordon#1568 (Solara’s Discord) for all the help.

Appendix

I include full code here as well. These are taken from lundez/solara_app the 29/6 2023.

Solara App

Run using solara run .

from dataclasses import dataclass
import solara
import polars as pl
import solara.express as px

@solara.component
def Page():
    file, set_file = solara.use_state(None)
    
    solara.Markdown("# Solara Example App (Starbucks Data)")
    solara.FileDrop(on_file=set_file, lazy=False)
    if file is not None:
        df = pl.read_csv(file["data"], null_values="-").drop_nulls()
        DFViews(df)
    else:
        solara.Text("Make sure to upload a file")

@dataclass
class FilterValues:
    sodium: tuple[int, int]
    carb: tuple[int, int]

@solara.component
def Filters(df: pl.DataFrame, filters: solara.Reactive[FilterValues]):
    with solara.Card("Filter DataFrame"):
        carbs = solara.use_reactive((df["Carb. (g)"].min(), df["Carb. (g)"].max()))
        sodium = solara.use_reactive((df["Sodium"].min(), df["Sodium"].max()))
        solara.SliderRangeInt("Carbs (g)", value=carbs, min=df["Carb. (g)"].min(), max=df["Carb. (g)"].max())
        solara.SliderRangeInt("Sodium", value=sodium, min=df["Sodium"].min(), max=df["Sodium"].max())
        
        with solara.CardActions():
            solara.Button("Submit", on_click=lambda: filters.set(FilterValues(sodium.value, carbs.value)))

@solara.component
def FilteredPage(df: pl.DataFrame, filter_values: solara.Reactive[FilterValues]):
    df = df.filter(pl.col("Sodium").is_between(filter_values.value.sodium[0], filter_values.value.sodium[1]) &
                   pl.col("Carb. (g)").is_between(filter_values.value.carb[0], filter_values.value.carb[1]))
    DFVis(df)

@solara.component
def DFVis(df: pl.DataFrame):
    solara.Markdown(f"## DataFrame")
    solara.DataFrame(df.to_pandas(), items_per_page=5)
    px.histogram(df, x=["Carb. (g)", "Sodium"])

@solara.component
def DFViews(df: pl.DataFrame):
    filter_values = solara.use_reactive(FilterValues(
        sodium=(df["Sodium"].min(), df["Sodium"].max()),
        carb=(df["Carb. (g)"].min(), df["Carb. (g)"].max()),
    ))  # keyword arguments so the sodium/carb ranges cannot be swapped
    Filters(df, filter_values)
    with solara.Columns():
        DFVis(df)
        FilteredPage(df, filter_values)

Streamlit App

Run using streamlit run .

from dataclasses import dataclass
import streamlit as st
import polars as pl
import plotly.express as px

def Page():
    st.markdown("# Streamlit Example App (Starbucks Data)")
    file = st.file_uploader("Upload a file")
    if file is not None:
        df = pl.read_csv(file, null_values="-").drop_nulls()
        DFViews(df)
    else:
        st.write("Make sure to upload a file")

@dataclass
class FilterValues:
    sodium: tuple[int, int]
    carb: tuple[int, int]

def Filters(df: pl.DataFrame) -> FilterValues:
    with st.form("df_filer"):
        st.write("**Filter DataFrame**")
        carbs = (df["Carb. (g)"].min(), df["Carb. (g)"].max())
        sodium = (df["Sodium"].min(), df["Sodium"].max())
        carbs = st.slider("Carb. (g)", min_value=carbs[0], max_value=carbs[1], value=carbs)
        sodium = st.slider("Sodium", min_value=sodium[0], max_value=sodium[1], value=sodium)
        
        st.form_submit_button("Submit")
    return FilterValues(sodium, carbs)

def FilteredPage(df: pl.DataFrame, filter_values: FilterValues):
    df = df.filter(pl.col("Sodium").is_between(filter_values.sodium[0], filter_values.sodium[1]) &
                   pl.col("Carb. (g)").is_between(filter_values.carb[0], filter_values.carb[1]))
    DFVis(df)

def DFVis(df: pl.DataFrame):
    st.markdown(f"## DataFrame")
    st.dataframe(df.to_pandas())
    st.write(px.histogram(df, x=["Carb. (g)", "Sodium"]))

def DFViews(df: pl.DataFrame):
    filters = Filters(df)
    c1, c2 = st.columns(2)
    with c1:
        DFVis(df)
    with c2:
        FilteredPage(df, filters)

Page()

Kotlin DataFrame vs Polars DataFrame

Hampus Londögård — Sat, 06 May 2023 00:00:00 GMT

N.B. added dataset and link to Datalore Notebooks.

Benchmarking is notoriously hard, hence I know these results do not fully showcase the possibilities of the JVM. Nonetheless, they’re results.

Benchmark Details

Pre-downloaded CSV (dataset: Plotly All Stocks 5 Years)
Use Eager-API as Kotlin DataFrame does not have a Lazy API (this would help polars further)
Run 10k times to make sure the JVM isn’t a slow starter (one should do this even better using JMH and their API to benchmark)
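The “run it many times” setup can be sketched as a tiny harness. This is a minimal sketch of the approach, not the benchmark code from the notebooks: the `benchmark` helper, its `warmup` parameter and the toy workload are assumptions for illustration, and a proper JVM benchmark should still use JMH.

```python
import time
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 100, runs: int = 10_000) -> float:
    """Return total wall-clock seconds for `runs` calls, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # untimed: lets caches (or a JIT, on the JVM) settle first
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return time.perf_counter() - start


# Hypothetical workload standing in for "load the CSV and transform it".
elapsed = benchmark(lambda: sum(range(100)), warmup=10, runs=1_000)
print(f"{elapsed:.4f}s for 1,000 runs")
```

Reporting the total over many runs, as done here, smooths out noise from individual calls but hides the distribution; per-run timings would be needed for percentiles.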

Results

The results speak clearly.

Kotlin DataFrame (5.4s) Datalore Notebook

polars DataFrame (2.6s) Datalore Notebook

Figure 1: DataFrame Comparison of 10k runs on Plotly All Stocks 5 Years.

polars is 2x faster (!).
polars uses 1GB less RAM.
polars actually downloaded the same CSV file 12x faster, and caches the result internally unlike Kotlin for later instant usage.

Thoughts

I think it’s interesting to see how much faster polars is, even if I use eager API and don’t use any fancy feature(s) like groupBy that’s optimized like crazy.

It really showcases what a powerhouse Rust is to run intensive applications with, and now I’m left wondering if perhaps one should wrap polars on the JVM. 🤓 This has been done for other platforms, such as NodeJS, R & Elixir.

Wrapping Rust from the JVM isn’t easy today, but with the progress in Project Panama it should become easier. Project Panama introduces a simpler, safer and more efficient way to call native code from the JVM through the Foreign Function & Memory API. I expect it to become even better as it’s currently only in preview… 😉

That’s all for now.
~Hampus

DuckDB - Quacking Data (WIP)

Hampus Londögård — Wed, 03 May 2023 00:00:00 GMT

DuckDB is a relational database management system (RDBMS) that is used for managing data stored in relations, which are essentially tables. DuckDB is an in-process OLAP DBMS written in C++. It has no dependencies, is easy to set up, and is optimized to perform queries on data. It is similar to SQLite for Analytics. DuckDB is designed to run on a single machine, and it is not a multi-tenant database. It is perfect for data practitioners that want to perform local analytical workloads easily and quickly.

Warning

This blog is a continuous work and WIP.

DuckDB - Quacking Data

DuckDB is the OLAP-version of SQLite. It’s simple, feature rich, fast and most importantly free. It really is an incredible tool for quick local analytical workloads. It has very impressive results in the H2O-benchmark, making it a jack-of-all-trades for analytics.

It has neat integration into:

pandas
polars
Arrow, PostgreSQL, parquet, deltalake, …

You simply query on top of the existing data structure, which makes the query incredibly efficient even if the data isn’t in “DuckDB” format. It all becomes painless and efficient through the Apache Arrow format. Arrow is a cross-language development platform for in-memory analytics which is very fast; because the underlying structure stays the same, we never copy data between DuckDB and polars, as an example. Rather, DuckDB operates directly on polars’ Arrow data. Cool? Yes! 🤓

Lately they’ve added superb Spatial Support where they leverage GDAL.

Recently Motherduck gave a talk on how Big Data is Dead, where a BigQuery founding engineer laid out that the average customer doesn’t have Big Data. Jordan Tigani notes the following trends.

Data Sizes at Companies

Query Intensity on Data by Age

Figure 1: Jordan Tigani’s ‘Feel of the Industry’

Further Jordan notes that from 2006 to 2023 we see a huge increase in how powerful single nodes are. In 2006 you could get 2GB RAM, today a standard instance has 256 GB RAM and if you slap your wallet at the problem you can get 24TB (!) RAM.

Thanks to the power of a single node, more than 99.9% of all problems should be solvable on a single machine, where DuckDB can reign supreme.

How am I using DuckDB in my professional life?

DuckDB is integral to my workflow. I use it when polars is not enough. Because DuckDB can easily query parquet we’re good to go in most Machine Learning setups.

Connecting directly to S3 is as simple as using the official httpfs extension which is done in the following way:

INSTALL httpfs; -- run once

LOAD httpfs;

How I want to move forward with DuckDB

I have so many plans on what I wish to achieve.

Poor Mans Data Lake

I have been thinking a lot lately about building my own Data Lake using DuckDB, polars & dagster. This would be greatly inspired by dagster’s blog - ‘Poor Mans Data Lake’.

Best Practices Write-Up

Additionally it’d be very interesting to set up a “Best Practices” using “Local First” tooling such as DuckDB & polars. When to use what and how they differentiate.

Pandera - Type your Data Pipelines (WIP)

Hampus Londögård — Sun, 30 Apr 2023 00:00:00 GMT

Pandera - a way to type your data pipelines in Python!
Personally I feel like the documentation is really good, feel free to check it out.

Warning

This blog is a continuous work and WIP.

Through decorators it smoothly integrates into your other Python code. The validation is done at run-time, the only possible way in Python, unlike statically typed languages, e.g. Scala.

What makes me excited about pandera?

Robust & Clean Pipelines.
Reproducible and Testable.
Integrates with other great ecosystems like pydantic, fastapi & mypy.

What makes me not excited about pandera?

No polars support yet
It validates through run-time crashes (because of Python)
- It helps that pandera has lazy evaluation, but still it’s runtime!

How I’m using `pandera` in production setting

Because life is how life is, a lot of our pipelines, especially ML pipelines, are written fully in Python. I’m really happy to have tools like polars which make it a tad bit speedier, but in all honesty I’d deeply prefer to use a fully typed language like Kotlin or Scala.

To build maintainable pipelines we need to know what to expect and what to do in the unexpected; in Python we glue these pieces together ourselves, unlike strongly typed languages where it’s built into the core itself.

But because life is how life is, we use Python, and pandera shines brightly in making the non-typed world a little bit better.

In our ML pipelines we decorate inputs to make sure we have the right data as expected before training our models. pandera fits really nicely into an organisation that has a structured orchestration tool like dagster or similar.
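The idea of decorating inputs so a step fails early on bad data can be shown in plain Python. This is a toy sketch of the pattern only, not pandera’s API: `check_columns`, `Row` and `train` are hypothetical names, and the data is a list of dicts rather than a DataFrame so the example runs without extra dependencies.

```python
from functools import wraps
from typing import Callable

Row = dict[str, object]


def check_columns(expected: dict[str, type]) -> Callable:
    """Decorator factory: validate the first argument (a list of rows) at call time."""

    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(rows: list[Row], *args, **kwargs):
            for i, row in enumerate(rows):
                for column, expected_type in expected.items():
                    if column not in row:
                        raise ValueError(f"row {i}: missing column {column!r}")
                    if not isinstance(row[column], expected_type):
                        raise TypeError(
                            f"row {i}: column {column!r} is not {expected_type.__name__}"
                        )
            # Only reached when every row matches the expected schema.
            return fn(rows, *args, **kwargs)

        return wrapper

    return decorator


@check_columns({"text": str, "label": int})
def train(rows: list[Row]) -> int:
    """Stand-in for a training step; returns the number of rows consumed."""
    return len(rows)
```

Calling `train([{"text": "good", "label": 1}])` passes, while a row with `"label": "1"` raises before any training happens. A real schema library adds value checks, coercion and collected error reports on top of this.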

Building Actionable Insights from Yelp Reviews using Setfit

Hampus Londögård — Fri, 31 Mar 2023 00:00:00 GMT

Installing Python Packages

We install a few libraries to assist us.

NLP: sentence-transformers (simple efficient sentence embeddings), setfit (few-shot training) & bpemb (fast efficient subword embeddings).
Data: polars (fast and sane DataFrame) & ydata-profiling (data profiling).
Other: pigeonXT-jupyter (annotation/labeling inside jupyter) & yake (keyword extraction).

from IPython.display import clear_output

# ⚠️Add the Kaggle lines if not running on Kaggle⚠️
#!pip install kaggle --upgrade
#!kaggle kernels pull lundet/yelp-reviews
!pip install -U pyarrow sentence-transformers plotly polars ydata-profiling setfit bpemb yake pigeonXT-jupyter
clear_output()

Introduction

This notebook-blog is adapted from my Yelp Reviews notebook on Kaggle.

In this notebook I showcase a way to easily extract real, valuable insights from Yelp Reviews using Few-Shot Learning. The dataset contains 5.2 million reviews and 174,000 businesses and is available here.

To create value from reviews we need to structurally find things not cleanly available in the data, i.e. classifying the “Stars” (1-5) isn’t that valuable as the data already exists, unless we wish to find invalid reviews.

I chose to extract actionable feedback to improve a restaurant business, which is valuable to an owner.

I had a few other ideas in mind, but to keep the blog short & to the point of SetFit I moved the remaining tasks and Data Analysis into Appendices at the end. The other tasks include Topic Classification/Tagging to extract patterns and Keyword Extraction.

`SetFit` what is it?

SetFit Training

SetFit is an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers. It achieves high accuracy with little labeled data - for instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples 🤯!

Compared to other few-shot learning methods, SetFit has several unique features:

🗣 No prompts or verbalisers: Current techniques for few-shot fine-tuning require handcrafted prompts or verbalisers to convert examples into a format that’s suitable for the underlying language model. SetFit dispenses with prompts altogether by generating rich embeddings directly from text examples.

🏎 Fast to train: SetFit doesn’t require large-scale models like T0 or GPT-3 to achieve high accuracy. As a result, it is typically an order of magnitude (or more) faster to train and run inference with.

🌎 Multilingual support: SetFit can be used with any Sentence Transformer on the Hub, which means you can classify text in multiple languages by simply fine-tuning a multilingual checkpoint.

Above is excerpt from github.com/huggingface/setfit

Explain like I’m Five (ELI5)

SetFit builds a larger dataset of sentence similarity by using pairs of sentences (permutations), we fine-tune our embeddings based on the similarity.
Finally we fine-tune a classification head using the fine-tuned embeddings on the original data.

ELI10

SetFit works by first finetuning a pretrained sentence-transformer (ST) on a small number of text pairs, in a contrastive Siamese manner. The resulting model is then used to generate rich text embeddings, which are used to train a classification head.
This is a simpler competitor to PEFT, which requires complicated prompts and very large LLMs.

During the first phase of fine-tuning the ST we enlarge the few-shot dataset into a bigger one using a contrastive approach which means that the dataset is expanded using the following method:

Sample set of Positive triplets, {a, b, 1}
- a & b are sentences with class C.
Sample set of Negative triplets, {c, d, 0}
- c is class C, d is another class
Produce dataset T by concatenating the positive and negative triplets

Here C denotes a class, and the number of triplets sampled per class is a hyperparameter chosen in the paper.

The grand idea is that we generate sentence pairs labeled as Positive or Negative in similarity, which we can then fine-tune the model on using the Siamese style of network and sentence similarity.

The second step then fine-tunes a Classification Head on the original labeling, embedding the sentences using the ST fine-tuned in the first step.
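The pair generation from the first step can be sketched in a few lines of plain Python. This is a simplified illustration and not SetFit’s implementation: `build_pairs` is a hypothetical helper, and instead of sampling a fixed number of triplets per class it enumerates every same-class pair and draws one random negative for each.

```python
import random
from itertools import combinations


def build_pairs(texts: list[str], labels: list[int], seed: int = 42) -> list[tuple[str, str, int]]:
    """Expand a few-shot dataset into (sentence_a, sentence_b, is_similar) triplets."""
    rng = random.Random(seed)

    # Group the sentences by class label.
    by_class: dict[int, list[str]] = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    triplets = []
    for label, members in by_class.items():
        others = [
            text
            for other_label, other_texts in by_class.items()
            if other_label != label
            for text in other_texts
        ]
        for a, b in combinations(members, 2):
            triplets.append((a, b, 1))  # positive: both sentences share a class
            if others:
                triplets.append((a, rng.choice(others), 0))  # negative: different classes
    return triplets


texts = ["slow service", "long wait", "great food", "lovely staff"]
labels = [1, 1, 0, 0]
pairs = build_pairs(texts, labels)
print(len(pairs))  # 4
```

Even this toy version shows why a handful of labels goes a long way: the number of positive pairs grows quadratically with the number of examples per class.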

Classifying Review as Helpful or Not

I’m excited to get started! Finding whether a review is Helpful or Not, that is whether it contains actionable feedback, is an interesting task.

The data is not available out of the box, but using SetFit we only need 8+ labels per class to achieve State-of-the-Art performance!

Keeping things simple, and lazy, I make it a binary classifier with the classes Improvements & None.

To see the Data Analysis go to Appendix.

How many reviews do we have?

import polars as pl

df_review_all = pl.scan_ndjson("/kaggle/input/yelp-dataset/yelp_academic_dataset_review.json") # scan, once again lazy.

"Number of reviews", df_review_all.select(pl.count()).collect(streaming=True)[0,0]

('Number of reviews', 6990280)

I select the most reviewed business, which is easily done in a lazy manner that keeps RAM usage low.

max_reviewed_business = df_review_all.groupby("business_id").count().sort("count", descending=True).limit(1).collect()[0, "business_id"]
df_review = df_review_all.filter(pl.col("business_id") == max_reviewed_business).collect()

Annotating/Labeling the data

This is easily done using pigeonXT. Please note that the labeling widgets are no longer visible, because the widgets stop working once the notebook is shut down.
I work around this by keeping the manual labels in a list in the cell after.

Anyhow, I sample 40 items and label each one depending on whether the feedback contains helpful improvements or not.

import pigeonXT as pixt

review_sample = df_review["text"].sample(n=40, seed=42).to_list()

labels = ['Improvements', 'None', ]

annotations = pixt.annotate(
    review_sample,
    options=labels
)

Unfortunately widgets cannot be saved in a notebook’s state, so you can’t see my labeling.
Additionally, to make the notebook easily re-executable, I add the manual labels based on the result just underneath 👇

# Saving manually because each session removes annotations
annotation_labels = ['None', 'None', 'None', 'None', 'Improvements', 'None', 'Improvements', 'None', 'None', 'None', 'None', 'None', 'None', 'Improvements', 'None', 'Improvements', 'None', 'Improvements', 'Improvements', 'Improvements', 'None', 'None', 'None', 'None', 'Improvements', 'Improvements', 'None', 'None', 'None', 'None', 'Improvements', 'None', 'None', 'None', 'None', 'Improvements', 'None', 'Improvements', 'None', 'None']
annotation_labels = [0 if x == 'None' else 1 for x in annotation_labels]

Train/Eval split

We extract seven (7) reviews as a holdout (eval_df) to evaluate the model on. This eval_df has a few samples of each class.

import pandas as pd
df = pd.DataFrame({"text": review_sample, "label": annotation_labels})
train_df = df[:-7]
eval_df = df[-7:]
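A quick way to check that the holdout really contains both classes is to count the labels in each split. The sketch below uses only the standard library; the index set is transcribed by hand from the annotation list above (0-based positions of 'Improvements'), so treat it as an illustration rather than part of the original notebook.

```python
from collections import Counter

# Positions of the 'Improvements' label in the manual annotation list above (0-based)
improvement_idx = {4, 6, 13, 15, 17, 18, 19, 24, 25, 30, 35, 37}
labels = [1 if i in improvement_idx else 0 for i in range(40)]

# Same split as the DataFrame slicing: last 7 items are the holdout
train_labels, eval_labels = labels[:-7], labels[-7:]

print("train:", Counter(train_labels))
print("eval: ", Counter(eval_labels))
```

Both classes show up in the holdout, and the training part keeps more than 8 examples per class, which is what sample_dataset needs for its default of 8 samples per class.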

Training using SetFit

To train our model using SetFit we need to instantiate a Trainer, just like you do with HuggingFace Transformers.

We also need a sentence-transformer model; I choose MiniLM because it’s very capable while small. Additionally we need to sample our dataset (sample_dataset) to make sure we have an equal number of samples of each class, the default being n=8 per class.

Finally I annotated the params directly in the code 🤓

from IPython.display import clear_output
from setfit import SetFitModel, SetFitTrainer, sample_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from datasets import Dataset

train_ds = Dataset.from_pandas(train_df)
train_ds_sampled = sample_dataset(train_ds, label_column="label")
eval_ds = Dataset.from_pandas(eval_df)

model = SetFitModel.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2",
)

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds_sampled,
    eval_dataset=eval_ds,
    loss_class=CosineSimilarityLoss, # CosineSimilarity as a loss function on ST fine-tuning
    batch_size=16,
    # contrastive learning is explained earlier
    num_iterations=20, # Number of text pairs to generate for contrastive learning
    num_epochs=1 # Number of epochs to use for contrastive learning
)
clear_output()
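To make loss_class=CosineSimilarityLoss less of a black box, here is a rough NumPy sketch of what a loss of this kind computes for a batch of sentence pairs: the cosine similarity of the two embeddings is compared with the 1/0 pair label using a mean squared error. The function name and the toy vectors are made up for illustration; the real loss works on the model's embeddings during training.

```python
import numpy as np

def cosine_similarity_loss(emb_a, emb_b, targets):
    """Mean squared error between pairwise cosine similarity and the target similarity."""
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    cos = np.sum(emb_a * emb_b, axis=1) / (
        np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
    )
    return float(np.mean((cos - np.asarray(targets, dtype=float)) ** 2))

# Two toy pairs: identical vectors and orthogonal vectors
a = [[1.0, 0.0], [1.0, 0.0]]
b = [[1.0, 0.0], [0.0, 1.0]]

print(cosine_similarity_loss(a, b, [1, 0]))  # 0.0, the embeddings already agree with the labels
print(cosine_similarity_loss(a, b, [0, 1]))  # 1.0, the embeddings contradict the labels
```

Minimizing this pushes sentences of the same class together and sentences of different classes apart in embedding space.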

Now training is as simple as trainer.train() and then we can evaluate the model using trainer.evaluate()!

trainer.train(batch_size=8)
metrics = trainer.evaluate();metrics

***** Running training *****
  Num examples = 640
  Num epochs = 1
  Total optimization steps = 80
  Total train batch size = 8
***** Running evaluation *****

{'accuracy': 1.0}

Wow! 🤯

1.0 accuracy on the 7 reviews in the evaluation dataset, that’s amazing!
…How about a custom test?

model([
    "I'd like the portions to be larger and the tables are very small. The food otherwise is pretty good!",
    "The food is simply amazing! A must go!"
])

tensor([1, 0])

1 = Improvement, 0 = None

Evaluation

This model performs incredibly well! We have 100% accuracy on the eval dataset (keep in mind that it only holds 7 reviews) and it correctly predicts handwritten reviews! 🤯

It’s an amazing feat that the fine-tuning requires this few labels to perform this well, which additionally means it trains super fast! All in all, with this tool we can easily:

Label few samples (8+ per class)
Train model
Run model inference

Resulting in a custom classifier that can extract high value data from unstructured data, in an instant!

In this case we make it possible to extract all reviews that contain Helpful Improvements/Actionable Feedback suggested by Reviewers.
My mind is blown away!

Ending Thoughts

I believe we can make the application more powerful by doing few-shot Named Entity Recognition (NER) instead. This way we’d tag the exact span that has actionable feedback, rather than the full review.

This would of course require token-by-token labeling which I was too lazy to do in my free time… 😅
Even without NER I think the results are impressive and have good impact.

All in all I think this exercise shows how far Large Language Models (LLMs) have gotten outside the ChatGPT bubble, and they’re darn powerful. Using a smaller LLM (MiniLM is 91 MB (!)) also provides new possibilities like running the model on-the-edge, directly in the user’s hands, which simplifies questions about uploading data, among others.

Appendix

Here I add the Data Analysis & other tasks I scratched (keyword & topic clustering).

Appendix A: Data Understanding & Analysis

To understand the data we should read the data documentation at yelp.com. The data is split into 5 files.

1. Review, 2. Check-In, 3. Business, 4. User & 5. Tip

And file-names (yelp_academic_dataset=*): ['Dataset_User_Agreement.pdf','*_review.json', '*_checkin.json', '*_business.json', '*_tip.json', '*_user.json']

To understand the data I think we’d like to know its size, what the data looks like, and run some other quick analysis.

Using LLMs we don’t need to clean it as much, but it’s still a valuable exercise to understand your data deeper.

Data Analysis: Reviews

import polars as pl

df_review_all = pl.scan_ndjson("/kaggle/input/yelp-dataset/yelp_academic_dataset_review.json") # scan, once again lazy.

"Number of reviews", df_review_all.select(pl.count()).collect(streaming=True)[0,0]

('Number of reviews', 6990280)

That’s a lot of reviews, we should probably sample fewer to examine.
Making use of pl.LazyFrame we make sure not to read the full dataset; instead we use take_every, which means we skip loading 999 out of every 1000 rows, and then sample from the result.

df_review_sample = df_review_all.take_every(1_000).collect().sample(n=1_000, seed=42)
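As a mental model, take_every(1_000) keeps row 0, row 1000, row 2000 and so on. The same idea on a plain Python range (no polars needed; the range merely stands in for the row indices of the review file):

```python
# Toy stand-in for the row indices of the full review file (6,990,280 rows)
rows = range(6_990_280)

# Keep every 1000th row, starting with the first one
every_1000th = rows[::1_000]

print(len(every_1000th))       # 6991
print(list(every_1000th[:3]))  # [0, 1000, 2000]
```

From those roughly 7k rows, the cell above then draws 1,000 at random with sample(n=1_000, seed=42).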

print(df_review_sample.schema)
"Review: ", df_review_sample[0, "text"]

{'review_id': Utf8, 'user_id': Utf8, 'business_id': Utf8, 'stars': Float64, 'useful': Int64, 'funny': Int64, 'cool': Int64, 'text': Utf8, 'date': Utf8}

('Review: ',
 'First timer, came with my boyfriend and they took a while to seat us. Same problem with getting our orders, very slow service. Although the food was great. Found this place because of the Axxcess card. Brought the bill without us asking, felt like I was being rushed. Ignored that we wanted to us the offer from the card, had to get our bill red printed. Overall, the food was good, cannot say the same about the service.')

The review isn’t the nicest, but it’s at least the full text and very informative. That’s great for text applications!

In the sample of data I personally examined ~50 to make sure I understand what makes a review Helpful or Not.

Data Analysis: Business

We should view the other files too, to validate what the data contains and so on.
Let’s use pl.LazyFrame.limit to make sure we only read 3 samples and never open the rest of the data.

df_business = pl.scan_ndjson("/kaggle/input/yelp-dataset/yelp_academic_dataset_business.json")
df_business.limit(3).collect()

shape: (3, 14)

business_id	name	address	city	state	postal_code	latitude	longitude	stars	review_count	is_open	attributes	categories	hours
str	str	str	str	str	str	f64	f64	f64	i64	i64	struct[33]	str	struct[7]
"Pns2l4eNsfO8kk…	"Abby Rappoport…	"1616 Chapala S…	"Santa Barbara"	"CA"	"93101"	34.426679	-119.711197	5.0	7	0	{"True",null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null}	"Doctors, Tradi…	{null,null,null,null,null,null,null}
"mpf3x-BjTdTEA3…	"The UPS Store"	"87 Grasso Plaz…	"Affton"	"MO"	"63123"	38.551126	-90.335695	3.0	15	1	{null,"True",null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null}	"Shipping Cente…	{"0:0-0:0","8:0-18:30","8:0-18:30","8:0-18:30","8:0-18:30","8:0-14:0",null}
"tUFrWirKiKi_TA…	"Target"	"5255 E Broadwa…	"Tucson"	"AZ"	"85711"	32.223236	-110.880452	3.5	22	0	{"False","True","True","2","False","False","False","False","u'no'","{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}","True","False","False","False","False","False",null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null}	"Department Sto…	{"8:0-22:0","8:0-22:0","8:0-22:0","8:0-22:0","8:0-23:0","8:0-23:0","8:0-22:0"}

Interesting, there’s a lot of things we don’t really care about. But stars & review_count are relevant.

So which restaurant has the most reviews? What is their count?

df_business.sort("review_count", descending=True).select(["business_id", "stars", "review_count"]).limit(3).collect()

shape: (3, 3)

business_id	stars	review_count
str	f64	i64
"_ab50qdWOk0DdB…	4.0	7568
"ac1AeYqs8Z4_e2…	4.0	7400
"GXFMD0Z4jEVZBC…	4.5	6093

The most reviewed business has 7.5k reviews (!) and is averaging 4 stars, which is pretty great!
I’ll later choose this restaurant as the basis of my “Helpful” or “Not Helpful” review.

Ydata-Profiler Analysis

The business data is pretty tabular, as such I’d like to try ydata-profiling which is an automated data analysis tool which extracts some good statistics. It’s like DataFrame.describe() on steroids!

from ydata_profiling import ProfileReport
report = ProfileReport(df_business.drop(["name", "city", "is_open", "categories", "latitude", "longitude", "attributes", "hours", "business_id", "address"]).take_every(500).collect().to_pandas())
report.to_widgets()

Because widgets are removed we can’t see the report, but trust me when I say that the only thing we find is a high correlation between review_count and state, which doesn’t tell us a lot since we sampled the dataset.

Let’s extract our review dataset for the most reviewed restaurant!

b_id = df_business.sort("review_count", descending=True).select(["business_id"]).limit(1).collect()[0,0]
df_review = df_review_all.filter(pl.col("business_id") == b_id).collect(streaming=True)
df_review.head()

shape: (5, 9)

review_id	user_id	business_id	stars	useful	funny	cool	text	date
str	str	str	f64	i64	i64	i64	str	str
"vHLTOsdILT7xgT…	"417HF4q8ynnWtu…	"_ab50qdWOk0DdB…	5.0	0	0	0	"This place has…	"2016-07-25 04:…
"I90lP6oPICTkrh…	"1UAb3zZQeGX6fz…	"_ab50qdWOk0DdB…	5.0	0	0	0	"OH MY!! A must…	"2016-12-19 20:…
"469eAl2fB069YT…	"p2kXD3gNu3N776…	"_ab50qdWOk0DdB…	5.0	0	0	0	"The fried seaf…	"2018-08-23 20:…
"aPpHBDs7Jiiq0s…	"7cDhfvTSH1wTxE…	"_ab50qdWOk0DdB…	5.0	0	0	0	"I love this pl…	"2013-06-24 18:…
"k9OG5kA5ebruSx…	"7QTh-fkw9Nr2lO…	"_ab50qdWOk0DdB…	3.0	0	0	0	"Loved the char…	"2010-10-06 08:…

Data Analysis: User Data

In the original document I also started investigating the user data to see if we could find bots in it. I did this with the following code snippets:

df_user = pl.scan_ndjson("/kaggle/input/yelp-dataset/yelp_academic_dataset_user.json")
df_user.limit(3).collect()

low_rating = pl.col("average_stars") <= 1.5
many_reviews = pl.col("review_count") > 5
df_user.filter(low_rating & many_reviews).limit(3).collect(streaming=True)

high_rating = pl.col("average_stars") >= 5
df_user.filter(high_rating & many_reviews).limit(3).collect(streaming=True)

This show-cases the modularity of polars to build queries, which is awesome!

Unfortunately I didn’t have the time to dive deeper, as such I removed outputs too.

Appendix B: Additional Tasks

Back to the actual action!

As I said I had a few different ideas. If we’d do a simple classification (e.g. Stars) I’d start with the following approach:

TF-IDF + SVC
Embeddings + SVC (e.g. BERT or GloVe)
RNN’s (e.g. ULMFit)
LLM (e.g. BERT)

The models grow in complexity as we move down the list and the preprocessing would change.
TF-IDF and SVC requires removal of stopwords and using stemming or lemmatization to reduce the feature-space/dimensionality of data.
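To illustrate why stopword removal and stemming shrink the feature space, here is a tiny standard-library sketch. The stopword list, the `crude_stem` helper and the two example reviews are all made up for illustration; a real project would use a proper stemmer or lemmatizer and a full stopword list.

```python
import re

STOPWORDS = {"the", "is", "was", "and", "a", "to", "very"}  # tiny illustrative list

def crude_stem(token):
    """Very rough suffix stripping, only meant to show the effect on vocabulary size."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

reviews = [
    "The waiter was waiting and the food was served",
    "Waiters served the food and the tables were cleaned",
]

# Vocabulary before and after stopword removal + stemming
raw_vocab = {tok for review in reviews for tok in tokenize(review)}
reduced_vocab = {
    crude_stem(tok)
    for review in reviews
    for tok in tokenize(review)
    if tok not in STOPWORDS
}

print(len(raw_vocab), "->", len(reduced_vocab))  # 11 -> 7
```

Every distinct term becomes one TF-IDF feature, so a smaller vocabulary directly means fewer dimensions for the SVC to handle.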

Most likely TF-IDF would yield great results, but potentially we’d have to upgrade into RNNs or LLMs.
An LLM is great in the way that we only need to remove outliers; otherwise no preprocessing is required because it captures the semantic meaning of a review. As mentioned though, this wasn’t implemented because of low value.

Task: Keyword Extraction

Extracting keywords sometimes yields very interesting results, sometimes not.
Reviews should have some things in common, such as opening hours, pricing and taste.

I see two (2) simple approaches:

Algorithmically using something like yake that uses statistical attributes
Using LLM/GPT with prompt-engineering

Because I don’t fully focus on this task I use an algorithmic approach on a sample of the reviews (every 25,000th review, capped at 10k).

import yake
all_reviews = ' '.join(df_review_all.select("text").take_every(25_000).limit(10_000).collect(streaming=True)["text"])

kw_extractor = yake.KeywordExtractor(lan="en", 
                                     n=3, 
                                     dedupLim=0.9, 
                                     dedupFunc="seqm", 
                                     windowsSize=1, 
                                     top=20)

keywords = kw_extractor.extract_keywords(all_reviews)
keywords

[('hours from beginning', 0.020546851659972578),
 ('beginning to end', 0.020546851659972578),
 ('decide to eat', 0.025264111118808684),
 ('hours', 0.12825615159841589),
 ('end', 0.12825615159841589),
 ('decide', 0.15697631639850676),
 ('eat', 0.15697631639850676),
 ('aware', 0.15697631639850676),
 ('beginning', 0.15697631639850676),
 ('multiple times', 0.19363023454619058),
 ('long', 0.24804685754303113),
 ('long time', 0.25772883998254764),
 ('bad experience', 0.3044312085113188),
 ('food is good', 0.34348020376910804),
 ('long waiting', 0.36025637540118816),
 ('multiple', 0.3927272948795476),
 ('times', 0.41305917611316423),
 ('time', 0.41305917611316423),
 ('good', 0.47708038648245615),
 ('experience', 0.48107913691662785)]

My hypothesis seems to hold! This is interesting and I believe we could move forward with this track, but for now I’ll move back to the other topics I approached.

Anyhow, time, experience and hours seem to be the contenders for most important keyword!

Task: Find Topic Themes by Stars

Let’s see if we can find a common theme between reviews of different stars by embedding our sentences and coloring by stars. First we need to figure out how many stars we have and how to move them.

def get_star_perc(df: pl.DataFrame):
    return (df
            .groupby("stars")
            .count()
            .with_columns(perc=pl.col("count") / pl.sum("count")))

lhs = get_star_perc(df_review)
rhs = get_star_perc(df_review_all).collect()  # lazy.collect()

df_stars = lhs.join(rhs, on="stars")
df_stars = df_stars.select(["stars", pl.col("perc").alias("single_restaurant"), pl.col("perc_right").alias("sample_of_all")])
df_stars.sort("stars").to_pandas().plot.bar(x="stars", y=["single_restaurant", "sample_of_all"])

We can clearly see that the share of 1-star vs 4-star reviews is unbalanced between the two datasets; the selected restaurant has a lot more positivity attached than reviews in general!

Embedding and thematics of ‘Stars’

Let’s review if there’s a simple theme to find using a basic embedding.

BPEmb is built on Byte-Pair Encoding and gives us Subword Embeddings in 275 (!) languages. It’s really great!

The creators share that it has the same performance using an 11MB file as a 6GB FastText file.

from bpemb import BPEmb
bpemb_en = BPEmb(lang="en", dim=50) # low dimensionality to keep memory low

downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d50.w2v.bin.tar.gz

100%|██████████| 400869/400869 [00:01<00:00, 358780.98B/s]
100%|██████████| 1924908/1924908 [00:01<00:00, 1139119.63B/s]

import numpy as np

def simple_avg_embeddings(text: str) -> np.ndarray:
    """Would prefer to do USif or a similar more advanced sentence embedding"""
    return np.mean(np.array(bpemb_en.embed(text)), 0)

simple_avg_embeddings("hello world")

array([-3.77557367e-01, -3.78913283e-02,  5.37733175e-03,  3.01063985e-01,
        1.22853994e-01, -1.16067998e-01, -1.96355328e-01,  2.24513650e-01,
        1.52528659e-01,  1.50793329e-01, -1.32929996e-01,  2.45790675e-01,
        2.88366582e-02,  1.45034671e-01, -4.60664422e-04, -2.41063997e-01,
        1.13379657e-01,  2.01904342e-01, -1.51245669e-01,  7.02560022e-02,
        1.38975337e-01,  1.45603001e-01,  1.67376995e-01, -3.75553995e-01,
        8.85626674e-02, -1.05586670e-01, -1.04991339e-01,  1.67683307e-02,
       -3.47706318e-01,  7.66509920e-02,  4.86541659e-01, -5.46200061e-03,
        3.15280318e-01, -1.35019004e-01, -8.56519938e-02,  2.60051340e-01,
       -1.04355663e-01, -3.84614974e-01, -6.59673288e-02,  1.19441666e-01,
       -1.55402347e-01, -3.78577620e-01,  1.48357674e-01,  8.83906707e-02,
       -6.47209957e-02,  3.22343677e-01, -3.02187651e-01,  1.48631334e-01,
        2.30536342e-01,  1.86697006e-01], dtype=float32)
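What simple_avg_embeddings does can be seen on made-up 3-dimensional subword vectors instead of real BPEmb output: each subword gets one row, and the mean over axis 0 collapses them into a single sentence vector.

```python
import numpy as np

# Hypothetical embeddings for three subword tokens, one row per token
subword_vectors = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
])

# Averaging over axis 0 collapses the token dimension into one sentence vector
sentence_vector = np.mean(subword_vectors, axis=0)
print(sentence_vector)  # [2. 2. 1.]
```

The average ignores word order and weights every subword equally, which is one reason the docstring mentions preferring a more advanced sentence embedding.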

df_review = df_review.with_columns(pl.col("text").apply(simple_avg_embeddings).alias("emb"))

df_review.head(1)

shape: (1, 10)

review_id	user_id	business_id	stars	useful	funny	cool	text	date	emb
str	str	str	f64	i64	i64	i64	str	str	object
"vHLTOsdILT7xgT...	"417HF4q8ynnWtu...	"_ab50qdWOk0DdB...	5.0	0	0	0	"This place has...	"2016-07-25 04:...	[-0.19162591 0.15060863 -0.078258 0.15518992 -0.12187849 0.11793589 -0.07643473 0.1317546 -0.1048084 -0.03124337 0.05094633 0.12367388 -0.00363489 -0.2647824 -0.06410762 0.04014542 0.07875077 -0.278056 0.00739759 -0.07181404 -0.04439502 0.01154997 0.03282384 -0.0340792 0.15889876 0.0838833 0.01600609 0.10613644 -0.31489417 0.03023791 0.10793628 -0.03118962 -0.19363998 0.20003138 -0.15389948 0.03025126 0.00232455 -0.10293934 0.05226081 0.06783426 -0.22778073 0.12673096 -0.18912943 0.25403503 0.22409369 0.01150362 -0.01854107 0.07260673 0.21675265 0.22890206]

from sklearn.manifold import TSNE

tsne = TSNE()
result = tsne.fit_transform(np.array(df_review["emb"].to_list()))
del tsne
result

/opt/conda/lib/python3.7/site-packages/sklearn/manifold/_t_sne.py:783: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  FutureWarning,
/opt/conda/lib/python3.7/site-packages/sklearn/manifold/_t_sne.py:793: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  FutureWarning,

array([[ 10.191183  , -30.586485  ],
       [ 50.212452  ,  -0.08102626],
       [-23.563566  , -43.164993  ],
       ...,
       [ -6.7240796 ,  31.503109  ],
       [ 26.029615  ,  34.519146  ],
       [ 38.290615  ,   9.39187   ]], dtype=float32)

import plotly.express as px
import pandas as pd

df_tsne = pd.DataFrame({"x": result[:,0], "y": result[:,1], "rating": df_review["stars"].cast(str).to_list()})
del result

px.scatter(df_tsne, x="x", y="y", color="rating")

del df_tsne

This didn’t work out either really…

How about an LLM?

from sentence_transformers import SentenceTransformer
from IPython.display import clear_output

model = SentenceTransformer('all-MiniLM-L6-v2')

clear_output()

encoding = model.encode(df_review["text"].to_list())
del model
tsne = TSNE()
tsne_enc = tsne.fit_transform(encoding)
del encoding

df_tsne = pd.DataFrame({"x": tsne_enc[:,0], "y": tsne_enc[:,1], "rating": df_review["stars"].cast(str).to_list()})
del tsne_enc

px.scatter(df_tsne, x="x", y="y", color="rating")

Sentence BERT produces a slightly better result; we can see that 1-star reviews cluster nicely by themselves.

The other reviews are pretty similar.

How to improve the clustering
Make use of smarter tooling which builds topics rather than raw clustering based on embeddings. Tools like BERTopic, Top2Vec and simply sentence-transformers (script) can achieve this in a simple manner.

Automated Data Validation & Exploration

Hampus Londögård — Mon, 20 Mar 2023 00:00:00 GMT

To run this as slides use the following command in the terminal:

jupyter nbconvert posts/2023-03-20-deepchecks/index.ipynb --to slides --post serve

N.B. this blog is originally a presentation, hence it’s not really written in the best way. Potentially I’ll rewrite into a true blog in the future.

Data Validation & Exploration

Today we’ll dive into automated Data Validation and Data Exploration.

Every day we work through a multitude of data using heuristics, statistics and many tools. But are there better tools out there? Is there a way to automate some of the process to put greater emphasis on the important things?

Data Validation Tools

There are a few tools.

Deepchecks Tests for Continuous Validation of ML Models & Data
ydata-profiling (previously pandas-profiling) Create HTML profiling reports from pandas DataFrame objects
greatexpectations Always know what to expect from your data.
pandera A Statistical Data Testing Toolkit

We’ll focus on a few discussion points today

When does it make sense to introduce this type of tool?
How do you use this type of tool today?
How can it be improved?
Can it be used as part of Data Analysis?
Can it be used in any other part of the process?

(This was for the presentation/‘journal circle’)

Introduction

As we all know to be true data is incredibly important when developing Machine Learning Applications.

Shit in, shit out

First we’ll make a quick introduction to each tool and their strengths.

~~Second I’ll share a few use-case examples.~~ (didn’t have time to complete)

Finally we’ll end up discussing how we can use, or already use, these tools.

Deepchecks

Figure 1: Deepchecks Checks

Deepchecks is a Python package for comprehensively validating your machine learning models and data with minimal effort. This includes checks related to various types of issues, such as model performance, data integrity, distribution mismatches, and more.

Data Formats

Deepchecks supports the following formats:

Tabular
Computer Vision
NLP (text)

Example

Video of a Deepcheck Evaluation Suite

Types of checks

The types of checks are divided into 3 variants,

Deepchecks Types and where they run

Running a Deepcheck

Either you run a full suite or a single check. You choose!

Full Evaluation Suite

from deepchecks.tabular.suites import model_evaluation
suite = model_evaluation()
result = suite.run(train_dataset=train_dataset, test_dataset=test_dataset, model=model)
result.save_as_html() # replace this with result.show() or result.show_in_window() to see results inline or in window

Single Validation

from deepchecks.tabular.checks import FeatureDrift
import pandas as pd

train_df = pd.read_csv('train_data.csv')
test_df = pd.read_csv('test_data.csv')
# Initialize and run desired check
FeatureDrift().run(train_df, test_df)

ydata-profiling

ydata-profiling, previously pandas-profiling, is a tool that lets you quickly profile a dataset and grok the data.

Key features

Type inference: automatic detection of columns’ data types (Categorical, Numerical, Date, etc.)
Warnings: A summary of the problems/challenges in the data that you might need to work on (missing data, inaccuracies, skewness, etc.)
Univariate analysis: including descriptive statistics (mean, median, mode, etc) and informative visualizations such as distribution histograms
Multivariate analysis: including correlations, a detailed analysis of missing data, duplicate rows, and visual support for variables pairwise interaction
Time-Series: including different statistical information relative to time dependent data such as auto-correlation and seasonality, along ACF and PACF plots.
Text analysis: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
File and Image analysis: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata
Compare datasets: one-line solution to enable a fast and complete report on the comparison of datasets
Flexible output formats: all analysis can be exported to an HTML report that can be easily shared with different parties, as JSON for an easy integration in automated systems and as a widget in a Jupyter Notebook.

The report contains three additional sections:

Overview: mostly global details about the dataset (number of records, number of variables, overall missingness and duplicates, memory footprint)
Alerts: a comprehensive and automatic list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, among others)
Reproduction: technical details about the analysis (time, version and configuration)

How to use

ydata-profiling is incredibly simple to use!

All that needs to be done is

from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Profiling Report")

Examples found on github.
For a specific example see titanic.

Great Expectations (GX)

great expectations

Great Expectations (GX) helps data teams build a shared understanding of their data through quality testing, documentation, and profiling.

GX is a well-known tool with a huge community. This means that there are multiple plugins in other tools to support this framework.

It supports things like Snowflake, BigQuery, Spark, Pandas, ..!

It’s easy to use and generates Data Documentation of the tests, which can be saved in S3 or other places, giving everyone the possibility to view and share it!

Example of GX

# great expectations check example
# can also be JSON
expect_column_values_to_be_between(
    column="passenger_count",
    min_value=1,
    max_value=6
)

automated data docs

It even has a Data Assistant to build automated checks based on a Golden Dataset!

There are > 50 built-in expectations and > 300 including community-added ones!

Our stakeholders would notice data issues before we did – which eroded trust in our data

pandera

Define a schema once and use it to validate different dataframe types.
Check the types and properties of columns/values.
Perform more complex statistical validation like hypothesis testing.
Seamlessly integrate with existing data analysis/processing pipelines via function decorators.
Define dataframe models with the class-based API with pydantic-style syntax and validate dataframes using the typing syntax.
Synthesize data from schema objects for property-based testing with pandas data structures.
Lazily Validate dataframes so that all validation rules are executed before raising an error.
Integrate with a rich ecosystem of python tools like pydantic, fastapi and mypy.

Pandera Dictionary Schema

import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],  # required by the class-based Schema further down
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema(df)
print(validated_df)

Pandera (Pydantic) Class Schema

from pandera.typing import Series

class Schema(pa.DataFrameModel):
    column1: Series[int] = pa.Field(le=10)
    column2: Series[float] = pa.Field(lt=-1.2)
    column3: Series[str] = pa.Field(str_startswith="value_")

    @pa.check("column3")
    def column_3_check(cls, series: Series[str]) -> Series[bool]:
        """Check that column3 values have two elements after being split with '_'"""
        return series.str.split("_", expand=True).shape[1] == 2

Schema.validate(df)

Final Comparison Table

Comparing the tools, this is how they can be used, and whether I really like using them 😉

Tool	Data Stores (Pandas, Spark, DB, Other)	Steps (Analysis, Training, Production, Non-ML)	Drift	Hypothesis	Data Generation	Data Types	Personal Favorite(s)
deepchecks	✅😐❌✅	✅✅✅✅	✅	❌	❌	❌	✅
ydata-profiling	✅✅✅❌	✅❌❌✅	✅	❌	❌	✅	✅
greatexpectations	✅✅✅✅	❌❌✅✅	❌	❌	❌	✅	😐
pandera	✅✅⏳✅	❌✅✅✅	❌	✅	✅	✅	✅

Bonus: Additional Great Frameworks

Fairlearn Fairlearn is an open-source, community-driven project to help data scientists improve fairness of AI systems.
Torchdrift
alibi-detect
Evidently (superb)

Bonus 2: PyGWalker

I found a new tool lately called PyGWalker which was really cool! It cannot handle really large data, but it’s excellent for smaller datasets :)

Turn your pandas dataframe into a Tableau-style User Interface for visual analysis

PyGWalker

PyGWalker GIF

Best Practices Are Contextual

Hampus Londögård — Sun, 19 Feb 2023 00:00:00 GMT

Today Best Practices are commonly presented as the One True Solution™ without considering the context(s) in which the practice is to be applied. I believe that with today’s frameworks & methodologies it’s getting more and more common that people follow a kind of “hive-mind”, and who’s the driver? Usually the ones selling it…

Of course there are multiple other angles such as:

“No one ever got fired for buying IBM” (or following MANGA best practices)
Career Climbing
- Who wouldn’t want to make their CV more interesting by applying a well-known technique or framework? Even if it doesn’t make sense for the application.

Anyhow, this is the story of how we/I got caught in a best practice which actually didn’t apply in my context. Ironically, that gives me new advice to hand out! 😅 Who knows, perhaps it’ll end up being a trap in the future.. 🤔

Background

We started a project where we had a lot of data, not Big Data, but a decent chunk of > 100 GB.
Being active in the industry and following the latest trends, I knew that streaming data rather than loading it all makes a lot of sense and is recommended in multiple libraries such as HuggingFace Datasets.

Further I knew that there are certain file formats which work exceptionally well when streaming, or using mmap, for very cheap data loading and usage.

mmap

Memory-mapping (mmap) is a solution to stream data into memory from disk, and back. This enables us to:

Transform a bigger-than-memory dataset and write to disk again.
Train on a bigger-than-memory dataset by streaming it into memory

Using mmap combined with a good file format such as Apache Arrow is widely praised by companies and libraries.
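A minimal sketch of memory-mapping with Python's standard library, to make the idea concrete. The file name and its contents are made up; the point is only that slicing the mapped file reads the touched region instead of loading the whole file.

```python
import mmap
import os
import tempfile

# Write a small binary file that stands in for a large dataset on disk
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 4096)  # 1 MiB of repeating bytes

with open(path, "rb") as f:
    # Map the file into virtual memory; pages are read from disk only when touched
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
        total_size = len(mm)
        chunk = mm[1000:1008]  # slicing reads just this region

print(total_size)   # 1048576
print(list(chunk))  # [232, 233, 234, 235, 236, 237, 238, 239]
```

Libraries built on Arrow files apply the same mechanism to whole columns, so a dataset larger than RAM can still be sliced and iterated.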

Apache Arrow

Apache Arrow is a column-based format which is saved in a deserialized form, i.e. it is the same on disk as it is in-memory.

This results in incredibly efficient mmap where we can stream data into memory without deserializing/serializing! Further, by being an OLAP (column-based) format you can slice the columns you use and not stream anything else. Exceptional!

I’ve first-hand experience of the gains of using OLAP-based file formats such as parquet, which additionally supply column compression that is very efficient in analytics. How many rows contain the same date repeated? Now it’s cheap! 😉
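Why repeated dates become cheap in a column-based layout can be shown with a toy run-length encoder. This is a standard-library illustration of the general idea, not Parquet's actual encoding; the function name and the dates are made up.

```python
from itertools import groupby

def run_length_encode(column):
    """Collapse consecutive repeated values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

# A date column where many consecutive rows share the same date
dates = ["2023-02-17"] * 4 + ["2023-02-18"] * 3 + ["2023-02-19"] * 5

encoded = run_length_encode(dates)
print(encoded)
# [('2023-02-17', 4), ('2023-02-18', 3), ('2023-02-19', 5)]
print(len(dates), "values stored as", len(encoded), "runs")
```

In a row-based file the dates would be interleaved with all other fields, so such runs never form; storing each column contiguously is what makes this kind of compression effective.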

Based on my history of using mmap algorithmically (low-level) and OLAP files, we ended up using 🤗 Datasets, which is a library to work with datasets.

🤗 (HuggingFace)

HuggingFace is a company that helps other companies deploy State-of-the-Art text and image models while providing a huge Open Source community with a lot of datasets, models and much more.

Their dataset API must be based on best practice, right? What we later learned is that best practice really is contextual. This is the story.

Problem Identification

As a primer, how do you even identify this type of issue? After all it’s very easy to hide such problems by using powerful compute in the cloud.

The answer is rather simple: if training is equally fast on a CPU (M1) as on a GPU, you’re thoroughly under-utilizing your GPU.

There are many cases where one would never notice, as a lot of practitioners tend to run directly on cloud compute, but if you found this to be true you’d surely pause and reflect on your task, right?

Step 1: Don’t Trust Environments

Don’t make your cloud compute solve all problems from the get-go, allow yourself to gradually move into the cloud by utilizing a Local-First approach. Allowing yourself to run locally just as easily as in the cloud opens a lot of possibilities such as:

Quick iterations by running subset-training
Improved Circular Data Analysis
A certain satisfaction of simplicity

I’ll write a blog on Local First-approach and all the bonuses of such workflow later.

As we started our journey to find the problem(s) in our “best-practice” training pipeline we needed to understand what was actually happening - introducing debuggers & profilers!

Step 2: Embrace Debuggers & Profilers

PyCharm and VS Code include great debuggers which allow us to step into functions and execute different logic to further understand what’s happening under the hood. Further, there are great tools to track what happened post-run, called profilers.

One such profiler is the PyTorch Profiler, which we embraced. Using a profiler we found that we spent a lot of time inside the DataLoader - which is not a good sign! We’d like to optimize GPU usage.

We built a hypothesis: the MacBook Pro M1 has a much faster SSD, and because training is bottlenecked in the DataLoader it trains just as fast as the Azure VM. The much faster I/O operations lead to equal performance even though its mathematical operations are slower.

To make sure that Azure had a fair challenge we validated the following:

We’re not using a mounted storage but a downloaded dataset on the VM
We’re using the correct VM

All seemed correct, what else could we do to improve? 🤔

We decided to rethink our “best practice” pipeline. Where could we save time? Which part of the DataLoader is actually slow?

Rethinking our pipeline

We found the biggest bottleneck pretty fast.

🎯 Random Access Read
Random access reads are slow, with high latency, even on the best SSDs, and this is why Random Access Memory (RAM) exists! It has improved substantially over the last decade, but nonetheless it’s slow.

We built a system that retrieves data from columnar storage, but randomly. Our batches are sequential, which helps a little, but we draw each batch’s starting point randomly, see Figure 1. We’re solving a time series forecasting problem, which also means we expand one data point into a window of the last X points to predict the future Y points. Rolling over data like this isn’t cheap either.
Keeping a batch internally intact is very important for some models, such as Recurrent Neural Networks (RNNs), which keep an internal state that is reset for each batch.

Figure 1: RandomSubsequenceSampler that allows randomizing batches but keeping batches internally intact
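The sampling idea in Figure 1 can be sketched in a few lines of plain Python. This is not the actual RandomSubsequenceSampler from the project; the function name, the random offset and the parameters are assumptions made for illustration. Each batch is a run of consecutive row indices, while the order of the batches is shuffled.

```python
import random


def random_subsequence_batches(n_rows, batch_size, seed=None):
    """Yield batches of consecutive row indices, in random batch order."""
    rng = random.Random(seed)
    # A random offset shifts the batch boundaries between epochs.
    offset = rng.randrange(batch_size) if n_rows > batch_size else 0
    starts = list(range(offset, n_rows - batch_size + 1, batch_size))
    rng.shuffle(starts)  # randomize which batch comes first
    for start in starts:
        # Indices inside a batch stay sequential, so windows remain intact.
        yield list(range(start, start + batch_size))


batches = list(random_subsequence_batches(n_rows=20, batch_size=4, seed=0))
for batch in batches:
    print(batch)
```

Reading a batch then means one random seek followed by a sequential read, which is exactly the access pattern that hurts when the data lives on a slow disk rather than in RAM.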

This means that by keeping more data in-memory (RAM) we can reduce our latency and bottleneck! Especially on VM’s with slower SSD’s such as Azure.🦸‍♂️

With this realization we decided to optimize our pipeline by sidestepping best-practice and building a simple but custom batch-operation.

Optimizing Preprocessing

As we applied one optimization after another, it turned into a beautiful onion: each layer we peeled off revealed new opportunities on top of the new base.

Preprocess by batch rather than streaming data (by batch)
- One batch being one file

This sped up our preprocessing enough that we don’t need to cache it and thereby no need to do it on all data. This means that we could apply our second optimization.

Preprocess by a sliced batch, i.e. only columns used

This sped up our pipeline and substantially reduced memory requirements leading us to our third and final optimization in pre-processing.

Scale the scalers on a sample of the data.

All in all we had huge speedups in our preprocessing, as follows:

1. ~10x faster
2. ~10x faster
3. ~2x faster

All in all our pipeline went from ~20 minutes to seconds!
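The third optimization, fitting the scalers on a sample, can be illustrated with a small standard-library sketch. The synthetic column, the sample size and the helper names are assumptions for illustration; the project's real scalers are not shown in this post.

```python
import random
import statistics


def fit_standard_scaler(values, sample_size, seed=0):
    """Estimate mean and std from a random sample instead of the full column."""
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    return statistics.fmean(sample), statistics.stdev(sample)


def transform(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical column: 100k points drawn around mean 10 with std 2.
rng = random.Random(42)
column = [rng.gauss(10.0, 2.0) for _ in range(100_000)]

mean, std = fit_standard_scaler(column, sample_size=5_000)
scaled = transform(column[:5], mean, std)
print(round(mean, 2), round(std, 2))
```

With a reasonably sized random sample the estimated statistics land close to the full-data values, so the scaler no longer requires a pass over the whole dataset. Whether a sample is representative enough depends on the data, so this is worth validating per project.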

Optimizing Training Loop

With preprocessing completing in seconds rather than minutes we could move ahead to improve our training loop.

Based on our learnings from the preprocessing iterations we knew that we could essentially load all data into memory if we sliced it, which we usually did, resulting in only using 1/8th or 1/16th of the dataset. Additionally we learned that we could get cloud compute with 200-300 GB of RAM at Azure.

Using this knowledge we applied the following optimization:

Load sliced data into RAM on-demand rather than reading by streaming it into memory

Applying this change we saw huge efficiency gains, but we still spent a lot of time in the DataLoader, why? We found that on each batch load we converted our data into a torch.tensor, which should be pretty fast, but it still ended up being a bottleneck. The next optimization became clear: why not keep it as a tensor from the get-go?

Load data into RAM as a dictionary {"column": torch.tensor()} with each column being key

Thus we achieved a really efficient (Deep Learning) pipeline, training being cut from hours to minutes! 🤯

What we learned

Best practices are contextual.
- Custom “dumb” code could end up much more efficient.
Start Simple.
- Simpler is often better (apply KISS).
Custom code is not always more complex than a library; libraries merely hide their complexity.
Balance complexity and efficiency delicately.

By batching data smarter and keeping a lot of it as tensors in-memory we had an incredible amount of gains.

It’s simple, stupid and wonderful.

~Hampus Londögård

Baby Monitor pt 2

Hampus Londögård — Mon, 06 Feb 2023 00:00:00 GMT

Back in action and finalizing the baby monitor!👶

TL;DR Built a baby monitor that included the following features:

Bidirectional Audio & Unidirectional Video (Night & Day Vision!)
Temperature Sensor
Motor (Servo) to move left/right & up/down

The project was born the day I met an old friend and saw his expensive baby monitor that he had been gifted, I needed to match it! 🤓

Result: I’m very happy about the results, my wife asked me to draw a smile on the creepy monitor, hence the smile! 😜 Video of it running live can be found at the end!

Implementation Details

To implement and build this camera I had to combine both hardware and software into a package.

Hardware Details

Most of my hardware was bought through Aliexpress, with a few parts being from an old Pi.

Hardware	Functionality	Software Required/Used	Notes
Raspberry Pi 3B+	The Brain which powers everything	Raspberry Pi OS Lite (Bullseye)	This OS uses the new Open Source camera-stack, Libcamera!
DS18B20	Temperature Sensor	W1ThermSensor	I wish I found this earlier, at first I parsed the raw file myself. And it was hard to find set-up instructions!
Nylon FPV Servo	Servo Motor (moving the camera)	gpiozero	A brilliant library. Note that this servo works through Pulse Width Modulation (PWM), and to keep the servos quiet we need to set `servo.value=None` after setting it to a value, which complicates the configuration a little.
Raspberry Pi 4 Camera 5MP	Camera with IR-cut (IR on/off via hardware automatically)	libcamera / picamera2	Very simple to use overall. Tricky that you need to focus it yourself, I thought it was broken at first! 😆
Microphone from Google AIY v1	Record sound		This is tricky because of the HAT, requires custom installation.
Speaker from Google AIY v1	Play sound
Pi HAT from Google AIY v1	Combine sensors, microphone & speakers

Software Stack

To make use of my beautiful hardware I need software! Keeping things simple (KISS) I decided to use a Python backend and show it through a simple webapp. That way I can view the baby monitor from my PC, Smartphone & anything that has a browser really.

The end result became as follows 👇

Webapp Client

Overall I really enjoyed playing around with Svelte. It felt very straight-forward and simple, although there’s less community and fewer libraries compared to React. All in all I’d rank it one step above React because of its simplicity, but I’m just an ordinary Backend Dev / Data Engineer+Scientist.

Server/Backend

FastAPI as always is a blessing to work with! The auto-generated swagger page, superb type integration and much more makes me feel right at home as someone who’s really a Scala-dev.😉 FastAPI has its drawbacks though, the streaming component definitely showed some rather large overhead. I had to fall back to raw http to have good performance 😰

The end result became two backends, but I tried to keep the responsibilities clear and it worked out fine!

End Result

And a video to show how real-time it is!

Video

Test Video

I’m very happy about the results!

Images of the Building Process

And some images of when I built the monitor!

What	Image
Building the Camera
Connecting the final piece of Camera
Building Temperature Sensor
Connecting Temperature, Pi & Camera
Manual Temperature Validation
Testing the Servo
Connecting all in a paper box
First Wooden Baby Monitor Prototype
Final Wooden Baby Monitor

A sad ending

The servo motors turned out to be too weak, which interestingly means they’re too strong. When they try to move the housing it works slowly until everything moves at once, which creates a force stronger than the grip of the pad that the monitor was standing on.

The end result was… Sweet release of machine breakage 😢

That’s it for this time! Now I look forward to becoming a father! 👨‍👩‍👧‍👦 ~Hampus Londögård

Polars - A Refreshingly Great DataFrame Library

Hampus Londögård — Wed, 30 Nov 2022 00:00:00 GMT

While working at AFRY we’ve noted that performance-intensive applications that aren’t really Big Data end up being slow when using pandas.

Coming from languages such as Scala, Kotlin & golang we knew there had to be more to it. There was a lot of performance to be squeezed! 🏎️

Cherry on the top? The pandas API is a constant source of confusion and thereby not very satisfying. I end up having to read/search the documentation more times than I care to admit. All in all a cleaner and more efficient tool was needed to handle our data & model training pipelines.

One day I stumbled upon polars - a blazing fast DataFrame library written in Rust. Plenty of buzzwords, documentation and a user guide later I was ready to trial it in a personal project. 🤠

It was a smooth addition because of the pandas integration, pl.from_pandas & df.to_pandas(), which in turn made it a gradual adoption. The trial was an instant success: the cost of moving DataFrames to and from polars was dwarfed by how much polars sped up my pipeline. And the code was clean and the API more natural. The only downgrade was slightly fewer reading options - otherwise only upgrades! 🤯

I was ready to trial it at work, and boy was I in for a wonderful journey!

After gradually adopting it in one of our client projects we saw huge speedups (some parts being more than 3 orders of magnitude (!) faster) and our code became a lot simpler. Additionally, something I didn’t expect: we decoupled our code in a more efficient way, producing leaner code that’s more testable! 🦸‍♂️

Then… what the actual fudge is polars?

Polars

Polars is a DataFrame library written purely in Rust, i.e. no runtime overhead, and uses Arrow as its foundation. The Python/JS bindings are simply a thin wrapper around the functionality in the core library. It is very similar to pandas, with a few major differences.

Why is it fast?

polars achieves its speed by utilizing available cores and being smart. It goes to great lengths to:

Reduce redundant copies
Traverse memory cache efficiently
Minimize contention in parallelism

polars has a lazy API reminiscent of SQL and Spark; this lazy API is automatically applied for certain operations such as groupby. Using the lazy API, polars enables query optimizations which improve performance and reduce memory pressure. polars tracks your query in a logical plan, which is then optimized.

Here’s a list from pola-rs.github.io on how it achieves its performance:

Copy-on-write (COW) semantics
- “Free” clones
- Cheap appends
Appending without clones
Column oriented data storage
- No block manager (i.e. predictable performance)
Missing values indicated with bitmask
- NaN are different from missing
- Bitmask optimizations
Efficient algorithms
Very fast IO
- Its csv and parquet readers are among the fastest in existence
Query optimizations
- Predicate pushdown
  - Filtering at scan level
- Projection pushdown
  - Projection at scan level
- Aggregate pushdown
  - Aggregations at scan level
- Simplify expressions
- Parallel execution of physical plan
- Cardinality based groupby dispatch
  - Different groupby strategies based on data cardinality
SIMD vectorization
NumPy universal functions

Side-note: one ugly hack I remember doing in pandas was to slice columns used in a groupBy aggregation before applying groupBy to make it faster. In polars this operation is lazy and automatically does this optimization in its query optimizer! Boy is it beautiful to see! 😍

Why does it make sense?

I’m not sure this title makes sense, but polars sure does! 🤓

Did you ever ask yourself why numpy and pandas require an indexing array (a boolean mask) to filter a list? E.g. x[x > 10] to return the list with all values >10.

I did, and the answer is vectorization which makes code incredibly fast. But we should be able to achieve this in a simpler and more efficient way right? Because it’s ugly and stupid… so let’s achieve it more efficiently!

polars uses semantics more familiar from other languages, with its .filter(pl.col("x") > 10).

Side-note: pl.col("x") > 10 is a pl.Expr which is not executed until queried via a DataFrame or Series!

This way it’s incredibly easy to combine filters and even more importantly, decouple code.

import polars as pl

def filter_age(age: int) -> pl.Expr:
  return pl.col("age") > age

# df is assumed to be an existing pl.DataFrame with an "age" column
df.filter(filter_age(13))

To me this is really cool! 🤓

In Production

We use polars extensively in production and after evaluating we found:

Pipelines to be 2x-20x faster, averaging about 7x
Simpler pipelines
Easier testing of pipelines

Which is some pretty fantastic gains!

Future

I see a bright future with polars as it enables workloads which previously required to run in the cloud to be able to run locally, because the efficiency is so high.

Bonus

polars is more than a “simpler API” and “faster pandas” with its additional functionality. Ever heard of over? Not? Let me tell you a cool story!

`pl.Over`

pl.col("age").mean().over("gender") is like df.groupby("gender")["age"].transform("mean") in pandas, but way more expressive and powerful!

It can be used to build columns, filter DataFrame and anything really. We can combine multiple of them in the same select:

df.select([
  pl.col("age").mean().over("gender"),
  pl.col("height").mean().over("gender"),
  pl.col(["age", "height"]).over(["gender", "species"])
])

The first and second rows of the select use the same grouper, query optimizer yay!

The third line does it over multiple columns, combining all this into one single select is some pretty powerful stuff! 🦸‍♂️

`pl.Fold`

Yet another incredibly powerful piece of operation is the fold which most Scala- or FP-programmers will know and love. fold is a more powerful reduce as it allows us to define what type we’d like to accumulate.

The simplest example is using fold as a reduce to calculate the sum, e.g.

out = df.select(
    pl.fold(acc=pl.lit(0), f=lambda acc, x: acc + x, exprs=pl.col("*")).alias("sum"),
)

Which is an obvious overkill solution, but being able to aggregate expressions with conditionals is an incredibly powerful concept which can yield the best types of expressions.

out = df.filter(
    pl.fold(
        acc=pl.lit(True),
        f=lambda acc, x: acc & x,
        exprs=pl.all() > 1,
    )
)

In this expression we keep only the rows where every column is larger than 1.
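The difference between reduce and fold can also be seen without polars. The sketch below uses Python's `functools.reduce` on a single hypothetical row (the dictionary and its values are made up for illustration) and mirrors the two polars examples above: a sum, and an "every column is larger than 1" check where the accumulator is a boolean rather than a number.

```python
from functools import reduce

# One hypothetical row with three numeric columns.
row = {"a": 3, "b": 5, "c": 2}

# reduce without a start value: the accumulator has the same type as the items.
total = reduce(lambda acc, x: acc + x, row.values())

# fold is reduce with an explicit start value, so the accumulator
# can have another type than the items (here a bool).
all_above_one = reduce(lambda acc, x: acc and x > 1, row.values(), True)

print(total, all_above_one)
```

In polars the same shape appears as `acc=pl.lit(...)` for the start value and a lambda that combines the accumulator with each expression.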

That’s it for this small article. ~Hampus Londögård

Probabilistic Forecasting Made Simple

Hampus Londögård — Mon, 28 Nov 2022 00:00:00 GMT

Probabilistic Forecasting is something very cool, but it is not approachable in the current state of affairs.

While researching probabilistic forecasting in a client project I managed to find a paper which opens the door to any neural network with dropout - which is the majority. That is, we can do probabilistic forecasting with essentially any network!

Darts, a brilliant timeseries library, includes very competent probabilistic forecasting, but it’s not really applicable to all models. This is the reason that I started diving into the whole space of probabilistic forecasting. A probabilistic model outputs not only a raw prediction value but a distribution of possible points, which ends up with a prediction like:

Probabilistic Model by unit8/darts

Additionally, models like ARIMA and ExponentialSmoothing make this kind of thing very easy: simply run simulations of their state-space models with a bit of randomly sampled error. To solve this for their deep learning models, darts decided to model the distribution using a Likelihood class. What does this mean?
The model does not actually predict a value but a distribution; using a Gaussian we’d predict two values, mean and std.

How to do probabilistic forecasting on any deep learning model

By combining the knowledge in Deep and Confident Prediction for Time Series at Uber by L. Zhu & N. Laptev (2017) with What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? by A. Kendall & Y. Gal (2017) one can conclude that it’s possible to model distributions using dropout during inference. In the Uber paper they use a special variant they call “Monte Carlo dropout”, which I don’t believe is required to achieve interesting results. It is enough to use the plain dropout module, which randomly zeroes some elements with a given probability by sampling from a Bernoulli distribution.

How do we do this?

Activate Dropout during Inference.
Do x predictions with a given dropout probability.
Based on these x predictions we have a distribution of data.

Build a confidence interval from the points.

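  import torch

  # x stochastic predictions (dropout must still be active), stacked to shape (x, horizon),
  # assuming each prediction is a 1-D tensor over the forecast horizon
  outs = torch.vstack([model.predict(in_data) for _ in range(x)])
  # Two-sided z-scores keyed by confidence level
  Z_TABLE = {0.8: 1.28, 0.85: 1.44, 0.9: 1.65, 0.95: 1.96, 0.99: 2.58, 0.999: 3.29, 0.9999: 3.89}

  # Confidence interval per forecast step, with the mean as the line
  mean = outs.mean(dim=0)
  std = outs.std(dim=0)
  lower = mean - Z_TABLE[confidence] * std
  upper = mean + Z_TABLE[confidence] * std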
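The z-values in the table come from the standard normal distribution, so they can also be computed rather than looked up. Here is a small standard-library sketch of that, for a single forecast step. The sample values and function names are made up for illustration, and like the snippet above it assumes the repeated predictions are roughly normally distributed.

```python
from statistics import NormalDist, fmean, stdev


def z_score(confidence):
    """Two-sided z-score for a confidence level, e.g. 0.95 -> ~1.96."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def confidence_interval(samples, confidence=0.95):
    """Return (lower, mean, upper) for repeated predictions of one time step."""
    mean = fmean(samples)
    spread = z_score(confidence) * stdev(samples)
    return mean - spread, mean, mean + spread


# Hypothetical: 8 dropout-enabled predictions for the same time step.
samples = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 9.7, 10.0]
lower, mean, upper = confidence_interval(samples, confidence=0.95)
print(round(lower, 3), round(mean, 3), round(upper, 3))
```

Computing the z-score this way supports any confidence level, not only the ones listed in the table.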

The possibilities

There are a lot of possibilities, I’ll share two of our biggest ones.

1. Model Understanding (Weakness/Strength)

By returning a probabilistic forecast, i.e. a distribution/confidence interval, we can learn more about the model and its strengths/weaknesses.

In our project(s) we’ve seen that it opens a door to really figuring out how to improve our models by focusing on the areas where the model is the most uncertain. This has improved performance by a substantial amount, which makes the effort worth it.

2. Downstream Consumer Happiness

We see that our clients trust the model more when they are able to see how confident it is. Building trust between model and downstream consumer is really important to deliver an actual successful project, which once again makes the effort totally worth it!

Bonus: we also found that it opens new possibilities for chaining tasks off the inference if you keep it in production, as your downstream tasks can now make use of a confidence interval rather than a raw data point. But the inference is very expensive compared to the usual (remember we do x predictions per prediction)!

Sources

Deep and Confident Prediction for Time Series at Uber by L. Zhu & N. Laptev (2017) - https://arxiv.org/pdf/1709.01907.pdf

What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? by A. Kendall & Y. Gal (2017) - https://arxiv.org/pdf/1703.04977.pdf

Timeseries Learnings at AFRY

Hampus Londögård — Wed, 23 Nov 2022 00:00:00 GMT

Timeseries Learnings

Intro

This is a blog based on a presentation I did at Foo Café, where we shared a lot of our hard-earned wisdom at AFRY X, mainly based on working with timeseries.

There are both simple and more advanced learnings, hopefully presented in an easy-to-understand fashion!

Background

We’re currently working with two (2) assignments involving timeseries.

Helping a telecom company to embrace Data Driven Testing of 5G-antennas
Helping an automotive company to forecast weird brake behavior.

So what have we learned? Let’s see…

1. Learning to embrace KISS 💋

We learned this by a few different experiences.

MLOps Tooling

Yes, it’s real. We all say it, all the time. But when the dust settles a lot of us ended up optimizing some part of our code anyways or even adding the “better tool” which involves just a “tiny bit” more complexity and code.

Well, we did end up adding that “better tool” and it bit us back by clearly reducing our innovation pace. The story…

To build our MLOps pipeline we used a tool (DVC) which is way better than the “default” of MLFlow. It has multiple bonuses:

Data Version Control
Reproducible Pipelines
CI/CD custom-made for Machine Learning
Cloud & Tool Agnostic
Simply git

There are no real drawbacks, except the KISS principle. Don’t get me wrong, DVC is an excellent tool, but we ended up having to write a lot of code to download datasets, track models and keep things tidy in git. To top it off? It was cumbersome to test a new task inside the same repository.

The biggest drawback was the additional code. Albeit not complex in itself, it added to the total complexity: each time we wanted to update something in the pipeline we had to make sure to keep DVC working.
And to get new features it was a little bit of adding lego bricks, which I love but ended up adding more complexity again.
I want to note that the complexity was from lines of code rather than complicated code and this complexity made it harder for us to innovate.

Biting the bullet and migrating to MLFlow, an inferior tool on paper, made our pipelines leaner and easier to innovate upon.
MLFlow has magic integrations into all cloud providers and libraries which makes it very easy to add with basically 0 code.

This simplification led to further gains which are hard to show on paper.

In the end, with our small team-size and project-size, the gains of DVC aren’t worth the cost; as such the KISS approach leads us to MLFlow.

Local ↔︎ Cloud

Cloud compute brings a ton of goodies such as defaulting to containerization, which should really be done locally too, and having powerful computers at your fingertips.

But what is forgotten is that with great powers comes great responsibility. Cloud enables training heavy models but with that it hides a lot of problems.

Inefficient pipelines
Hard to debug
…

We’ve learned that using a Local First Approach gives us the best of both worlds. Our pipelines are able to run fully local, including unit testing, but are just as simple to run on a cluster.

This is enabled by using a local pipeline rather than fully embracing the ecosystem.

We are able to run:

pipeline.py
pipeline_on_azure.py --exp_name --compute
pipeline_mlflow_azure.py
- Allows local experiments to be tracked on Azure MLFlow server rather than locally

Which makes our life incredibly easy!

Automate Boring Checks

Of course code reviews should be, and are, mandatory. Based on our software engineering principles we’ve made sure to also add CI/CD verifications which validate that everything is nice and tidy with no breaking changes. It reduces our cognitive load, which is awesome.

Our current set-up:

pre-commit: local validation on each commit
- [flake8](https://github.com/pycqa/flake8): Code Style Checker
  - Validates that we don’t break code-styles such as unused imports, unused variables, too complicated functions etc
- [black](https://github.com/psf/black): Code Formatter
  - It’s uncompromising and makes sure our repository has a standard style with correct indenting and much more
- [mypy](https://github.com/python/mypy): Static Type Checker
  - Makes sure that our types are valid and we’re not simply lucky in the duck-typing 🦆 world of Python 🐍!
- ^ All of the above also run in CI/CD
CI/CD
- pytest: unit tests
  - Test that your functions, neural networks etc works as expected
- pre-commit - see above
- cypress: E2E frontend testing
  - Only for a user-facing analysis tool

This is running on both GitHub Actions and GitLab Pipelines.

We deploy our containers through Azure and experiment through the Azure ML Studio or locally.

2. Interactive Validation 👨‍💻

This is the killer deal. A lot of people out there make heavy use of what I call “Static Analysis”, where metrics and static images are viewed.

Viewing static results isn’t enough, it barely scratches the surface and we’re Data * right? So why not dive into the data!? 🕵️

We’re extending our analyses to include custom built Streamlit Application(s) which allow us to work exploratively. Static validation is still important to keep track of actual objective metrics, but exploration is not only very satisfying but also awesome as you learn so much more about the model(s) and data.

Interactive analysis gives the following: Promote Explorability, Gives End-User a Great Experience, Increase Explainability, Efficient Feedback-Loop. Awesome? Hell yeah! 😎

Static Validation:

Interactive Validation:

3. Additional Wisdom 🤓

Probabilistic Forecasting

This is a powerful tool to have in your arsenal, and with clever tricks you can have it with all neural networks that use dropout! 🤯

The idea is that rather than forecast/predict point by point we forecast a confidence interval! This way we capture the uncertainties of the model. This assists debugging your model to fix its weaknesses! 🔍

Custom Loss Functions

Using a custom loss function makes a lot of sense in almost any real problem. It’s very rare that you care about the overall error; rather, specific regions are what actually matter in time series. In our case we don’t care about forecasting noise, nor do we care about forecasting the bad behavior after it has happened; we need to find it before it happens, i.e. forecasting.

In our case it’s an oscillating behavior that builds amplitude over time. As we want the model to do the same, we need to penalize undershooting heavily so that it does not fit to the noise.
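As an illustration of penalizing undershooting, here is a minimal sketch of an asymmetric squared error in plain Python. It is not the loss used in the project; the function name and the weight of 4 are assumptions picked only to show the mechanism.

```python
def asymmetric_squared_error(y_true, y_pred, under_weight=4.0):
    """Mean squared error where undershooting (y_pred < y_true) costs more."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = p - t
        # Errors below the target are multiplied by `under_weight`.
        weight = under_weight if err < 0 else 1.0
        total += weight * err * err
    return total / len(y_true)


y_true = [1.0, 2.0, 3.0]
under = asymmetric_squared_error(y_true, [0.5, 1.5, 2.5])  # every point undershoots
over = asymmetric_squared_error(y_true, [1.5, 2.5, 3.5])   # every point overshoots
print(under, over)
```

Both predictions are off by the same amount, yet the undershooting one is penalized four times harder, which nudges the model towards following the growing amplitude instead of flattening it.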

This ended up being a big improvement, but what lessons did we learn?

Always invert scale before calculating test-metrics
- To not allow scaling functions to impact the final results
Never optimize a loss function that you use as a metric
- Otherwise the model will game the metric
Decide on validation metrics before you start
- A moving target is impossible to hit and compare against
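The first lesson, inverting the scale before calculating test metrics, is easy to demonstrate. The sketch below uses a hand-written min-max scaler and made-up numbers (all names and values are assumptions for illustration) to show that the same predictions give a scaler-dependent error in scaled space but an error in real units once inverted.

```python
def fit_min_max(values):
    """Return the (min, max) used for scaling."""
    return min(values), max(values)


def scale(values, lo, hi):
    return [(v - lo) / (hi - lo) for v in values]


def invert(values, lo, hi):
    return [v * (hi - lo) + lo for v in values]


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [100.0, 150.0, 200.0]
lo, hi = fit_min_max(y_true)
y_true_scaled = scale(y_true, lo, hi)  # [0.0, 0.5, 1.0]
y_pred_scaled = [0.1, 0.5, 0.9]        # hypothetical model output in scaled space

mae_scaled = mae(y_true_scaled, y_pred_scaled)              # depends on the scaler
mae_original = mae(y_true, invert(y_pred_scaled, lo, hi))   # in real units
print(round(mae_scaled, 4), round(mae_original, 4))
```

Only the second number can be compared between experiments that use different scalers, or explained to someone who thinks in the original units.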

4. Unexpected Learnings 🤔

Model weights badly instantiated

We found this to be true in both papers and (5k⭐️) Open Source libraries.

Make sure that you validate code you’re using in-depth and don’t expect it to work as expected from the get-go!

Best Practices are contextual

Best Practices are certainly not true for any example or user. This was shown in multiple cases, such as our Local ↔︎ Cloud story.

Make sure to validate that what you’re doing is actually required for your use-case. A lot of the tools out there are built for huge scale while we’re working on small-medium scale. To use the same tooling is to introduce a large overhead and disconnect from the developer.

Generally winning concepts

Type hints, Type Hints everywhere
- It really assists you greatly. As a Scala/Kotlin and FP-enthusiast I’d like to talk even further about it, but I might grow boring.
Plotly/Altair rather than matplotlib – interactivity is king.
- I cannot emphasize enough how much is learnt by the kindergarten-style panning, zooming and playing around
- It gives data and model understanding

Polars – efficient speedy DataFrame’s with a logical API

I can’t be alone thinking that whoever designed pandas API must’ve been a masochist
- Polars makes sense, includes lazy and it’s fast! 🏎️
- Con: it’s not lingua franca like pandas and thereby isn’t supported automatically by all different libraries
  - Solved by using to_pandas()

  >>> df.sort("fruits").select(
  ...     [
  ...         "cars",
  ...         pl.col("B").filter(pl.col("cars") == "beetle").sum(),
  ...         pl.col("A").filter(pl.col("B") > 2).sum().over("cars").alias("sum_A_by_cars"),
  ...         pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),
  ...         pl.col("A").reverse().over("fruits").alias("rev_A_by_fruits"),
  ...         pl.col("A").sort_by("B").over("fruits").alias("sort_A_by_B_by_fruits"),
  ...     ]
  ... )

Streamlit – quick interactive apps that make a huge difference
- This is how we build our interactive validation and analysis tools.
CI/CD on all projects
Quarto reports
- They’re amazing, think markdown + code cells + all the goodies of LaTeX in a package 😍
Use pathlib from the get-go
- Don’t waste a full working day of headache to help that Windows-user to run your project
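For the last point on paths, a short sketch of why this helps: paths are built from parts instead of strings with a hard-coded separator, so the same code renders correctly on each operating system. The file and folder names below are hypothetical.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Build paths with "/" instead of concatenating strings with a fixed separator.
data_file = Path("data") / "raw" / "brakes.parquet"  # hypothetical project layout

# The same parts render with the right separator for each OS flavour.
parts = ("data", "raw", "brakes.parquet")
posix = str(PurePosixPath(*parts))
windows = str(PureWindowsPath(*parts))

print(posix)
print(windows)
print(data_file.suffix)
```

`Path` picks the right flavour for the machine it runs on, so the Windows user gets backslashes without anyone touching the code.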

Summarizing

Follow KISS
Embrace the systems you’re working in, it’ll pay off to not generalize too much
- Whenever you change cloud-provider you’ll have to update a lot of code anyways because of auth and more
Interactive Analysis is incredible and opens new doors
Always be mindful
- Continuously evaluate, iterate and execute – be agile.

That’s it for now

~Hampus Londögård

Babymonitor #1

Hampus Londögård — Sun, 06 Nov 2022 00:00:00 GMT

Hi 👋

I’m building a babymonitor. It’s not gonna be anything novel, neither the first nor the last in history. But it’s a relevant project to me, and it makes me happy! 🤓

In this blog I’ll walk through different ways to stream video from the raspberry pi to the network, capturing it in a client browser.

Background

It all started when I talked to an old friend and he said that his in-laws gifted them the “Rolls-Royce” of babymonitors. The monitor has:

Bidirectional audio.
Unidirectional video.
1. Night Vision.
2. Rotation Horizontally.
3. Zoom.
Temperature
Play soothing bedtime songs.

Incredibly cool!

This led to a simple equation:
awesome + expensive + programmer = Do It Yourself (DIY)

Obviously I need to have the same specifications and a little better, while “cheaper” (my hours are “free” 🤓).

The greatest part? I have a natural deadline which forces me to finish the project!

Goals

KISS - Keep It Simple Stupid

I’ve collected the following equipment:

Raspberry Pi 3B (had one already)
5 MP camera with mechanical IR filter
2 servos (rotating camera)
Temperature / Humidity Sensor
Microphone Array
Speaker

My bottleneck is the Raspberry Pi’s performance really. And with performance comes optimisations, which I love! It makes following the KISS principle a tad harder! 😅

I have settled on one of three languages to write the video streaming in, either golang, rust or python.
My initial idea is that the simpler parts will be a FastAPI (Python) server, like temperature and moving servos. Python really is lingua franca on the Raspberry Pi and the support is amazing.

From my initial experimentation I found Python to require a shit ton of CPU power to livestream video, as such I believe rust or golang will be the way to go. 🚀

Live Streaming: Initial Experimentation

I’ve tried multiple things: HTTP, HLS, Websocket & WebRTC. Each step proves a more complex, albeit more optimal, solution, each with its trade-offs.

Another solution worth mentioning is Real Time Streaming Protocol (RTSP).

Protocols / Variants

Describing the protocol and how it’s implemented, in a very general way.

Hypertext Transfer Protocol, the lingua franca protocol of the internet, is a way to stream both video and audio. It’s easy, but not efficient.

How: Stream chunks using HTTP messages and let the HTML media element handle the consumption of the stream.

HTTP Live Streaming (HLS) is a simple and pretty efficient way to stream both video and audio over the internet.

How: Split the stream into small video/audio files (chunks) listed in a .m3u8 playlist, which are then picked up by the client. The chunks are built from a “live” stream, e.g. a webcam, or from a file.
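To make the .m3u8 part a bit more concrete, here is a minimal sketch of how a live playlist could be generated. The function name `build_live_playlist` and the segment file names are made up for illustration, and in practice a tool such as ffmpeg usually writes both the chunks and the playlist for you.

```python
import math


def build_live_playlist(segments, media_sequence=0):
    """Build the text of an HLS media playlist.

    `segments` is a list of (uri, duration_in_seconds) pairs, oldest first.
    `media_sequence` is the sequence number of the first listed segment.
    """
    # The target duration must be an integer that is at least as long as
    # the longest segment in the playlist.
    target = max(1, math.ceil(max(duration for _, duration in segments)))
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{target}",
        f"#EXT-X-MEDIA-SEQUENCE:{media_sequence}",
    ]
    for uri, duration in segments:
        lines.append(f"#EXTINF:{duration:.3f},")
        lines.append(uri)
    # A live playlist has no #EXT-X-ENDLIST tag, so the client keeps polling.
    return "\n".join(lines) + "\n"


# Hypothetical segment names, a sliding window of the three newest chunks.
playlist = build_live_playlist(
    [("seg10.ts", 2.0), ("seg11.ts", 2.0), ("seg12.ts", 1.5)],
    media_sequence=10,
)
print(playlist)
```

The client re-downloads this small text file every few seconds and fetches any new chunks it lists, which is also where most of the HLS latency comes from.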

WebSocket is a communication protocol with much lower overhead than raw HTTP that also provides bi-directional (full-duplex) communication. It’s a great fit for streaming things that move in real time, such as a video or audio stream.

How: Stream your byte array in real time through a WebSocket, like you would over HTTP. There are fewer headers involved, but it’s much harder to consume on the client: you need a special JS media player which can decode the WebSocket stream into video/audio.
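To show how small the per-message overhead is, here is a minimal sketch of the WebSocket framing for a binary message sent from server to client, following the frame layout in RFC 6455. The function name `encode_binary_frame` is made up, and a real server would use a WebSocket library (which also handles the handshake, masking of client frames, pings and so on) rather than hand-rolling this.

```python
import struct


def encode_binary_frame(payload: bytes) -> bytes:
    """Wrap `payload` in a single unmasked WebSocket binary frame."""
    first_byte = 0x80 | 0x02  # FIN bit set + opcode 0x2 (binary data)
    length = len(payload)
    if length < 126:
        # Short payloads: the length fits in the second header byte.
        header = struct.pack("!BB", first_byte, length)
    elif length < 65536:
        # Medium payloads: marker 126 followed by a 16-bit length.
        header = struct.pack("!BBH", first_byte, 126, length)
    else:
        # Large payloads: marker 127 followed by a 64-bit length.
        header = struct.pack("!BBQ", first_byte, 127, length)
    return header + payload


chunk = bytes(4096)  # stand-in for a chunk of encoded video
frame = encode_binary_frame(chunk)
print(len(frame) - len(chunk))  # header overhead in bytes
```

For a 4096-byte chunk the header is only 4 bytes, whereas a plain HTTP response carries a status line and several text headers per message.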

WebRTC is an open framework that enables Real-Time Communications (RTC) inside the browser. From the get-go it supports bidirectional video/audio, with encryption built in.
It really is the protocol for real-time streaming, and it’s used in a lot of the tools you know today.

How: Set up a WebRTC server and then stream your byte arrays directly after connecting with a client.

Each protocol comes with positives and negatives.

	HTTP	HLS	Websocket	WebRTC
Pros	+ Easy to implement. + Simple protocol.	+ Easy to implement. + CPU efficient. + Easy to do “live” streams.	+ Low latency. + CPU efficient.	+ Supports all my use-cases. + Low latency. + CPU efficient.
Cons	- CPU inefficient (HTTP header overhead).	- High latency (5-10s+).	- Hard to consume on client. - Bi-directional streaming is also hard.	- Not a straightforward implementation. - Less documentation than HLS/HTTP.

Implementations

Using MJPEG.
The MJPEG server example provided with picamera2 is an excellent showcase of how to stream video. It sets up a simple HTML page with an <img> element which streams new frames using MJPEG (Motion JPEG).

The performance is pretty OK considering it’s Python & MJPEG, although H.264 is much more efficient.
We see the CPU hovering around 130-150%, but the largest drawback is the network bandwidth: ~50 Mb/s compared to ~3.5 Mb/s for H.264.
This is because MJPEG sends the full frame every time, while H.264 sends a full frame followed by a number of delta frames (only the changes) until it sends a full frame again. This has both positives and drawbacks: the bandwidth is low, but quality can suffer.
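A rough back-of-the-envelope sketch shows why the difference ends up being an order of magnitude. The frame sizes, frame rate and keyframe interval below are made-up illustrative numbers, not measurements from my setup; they are only picked to land in the same ballpark as the bandwidth figures above.

```python
def mjpeg_mbps(frame_kb: float, fps: float) -> float:
    """Bitrate when every frame is sent as a full JPEG (1 KB = 1000 bytes)."""
    return frame_kb * fps * 8 / 1000


def h264_mbps(key_kb: float, delta_kb: float, fps: float, gop: int) -> float:
    """Average bitrate when each group of `gop` frames holds one full frame
    followed by `gop - 1` much smaller delta frames."""
    kb_per_group = key_kb + (gop - 1) * delta_kb
    groups_per_second = fps / gop
    return kb_per_group * groups_per_second * 8 / 1000


# Assumed values: 200 KB full frames, 8 KB delta frames, 30 fps,
# and one full frame every 30 frames.
print(mjpeg_mbps(frame_kb=200, fps=30))
print(h264_mbps(key_kb=200, delta_kb=8, fps=30, gop=30))
```

With these assumptions MJPEG lands at 48 Mb/s and H.264 at roughly 3.5 Mb/s, simply because 29 out of every 30 frames are tiny.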

Code
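A minimal sketch of the framing an MJPEG-over-HTTP server does, assuming the frames are already JPEG-encoded. This is not the picamera2 example itself; the names `BOUNDARY`, `STREAM_HEADERS`, `mjpeg_part` and `mjpeg_stream` are made up for illustration, and the camera capture and HTTP server are left out.

```python
BOUNDARY = b"FRAME"  # arbitrary boundary string, must match the header below

# Response headers sent once when the client requests the stream.
STREAM_HEADERS = {
    "Content-Type": "multipart/x-mixed-replace; boundary=FRAME",
    "Cache-Control": "no-cache",
}


def mjpeg_part(jpeg: bytes) -> bytes:
    """Wrap one JPEG image as a part of a multipart/x-mixed-replace body."""
    return b"".join([
        b"--", BOUNDARY, b"\r\n",
        b"Content-Type: image/jpeg\r\n",
        b"Content-Length: ", str(len(jpeg)).encode("ascii"), b"\r\n",
        b"\r\n",
        jpeg,
        b"\r\n",
    ])


def mjpeg_stream(frames):
    """Yield the bytes to write to the response for each incoming frame."""
    for jpeg in frames:
        yield mjpeg_part(jpeg)


# Fake "frames" just to show the shape of the output.
fake_frames = [b"\xff\xd8fake\xff\xd9", b"\xff\xd8more\xff\xd9"]
for part in mjpeg_stream(fake_frames):
    print(part[:40])
```

Every part carries a complete image plus its own headers, and the browser replaces the picture in the <img> element each time a new part arrives. That is also why the bandwidth is so high.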
