I received a quest from my little brother: create a League of Legends (LoL) Highlight Extraction tool. It seemed simple enough, and I was happy to be able to use my Deep Learning skills for something that could be used by a non-tech person! 🤓
Where do you even start in such a quest? Data!
- Have my little brother annotate at minimum 2 streaming sessions of TheBauss (~6 hours each)
- It’s some data; not enough, nor varied enough, but it’s something.
- Create balanced dataset with good splits
- We cannot leak data between train and validation
- This is solved by chunking videos into 30s segments which are then split between train/validation.
- Balance the data to have a slightly higher highlight ratio than reality, to generalize better.
- Reality is ~13% highlights; I rebalance the dataset to ~30% highlights in training (a minimal sketch follows this list).
- Start training!
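A minimal sketch of that rebalancing step, assuming the frame labels live in a polars DataFrame with path/label columns as in the appendix; the rebalance helper and TARGET_RATIO are my own names, not code from the notebook:

import polars as pl

TARGET_RATIO = 0.30  # ~30% highlights in training versus ~13% in reality

def rebalance(df: pl.DataFrame, target_ratio: float = TARGET_RATIO, seed: int = 42) -> pl.DataFrame:
    # Downsample non-highlight frames until highlights make up roughly target_ratio.
    highlights = df.filter(pl.col("label") == 1)
    background = df.filter(pl.col("label") == 0)
    n_background = int(len(highlights) * (1 - target_ratio) / target_ratio)
    background = background.sample(n=min(n_background, len(background)), seed=seed)
    return pl.concat([highlights, background]).sort("path")  # restore temporal order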
Thebausffs
Simon “Thebausffs” Hofverberg (born October 3, 1999) is a Swedish player who last played for G2 Esports. He’s one of the biggest League of Legends streaming personas on Twitch right now (2023).
Known for:
- Highest ranked AD Sion player on EUW.
- “Inting mode” where he dies but wins in gold based on calculated trades and “good deaths”.
Model Architectures
I chose to go with two different architectures:
- Image Classification
- KISS: use ResNet (testing both pretrained & scratch)
- Potentially use some other architecture in the future
- Video Classification
- KISS: Use RNN on top of “Image Classifier” that has no head, i.e. Feature Extractor
- RNNs are naturally good with sequences as they have an internal state
In the future I’d like to try a VisionTransformer (ViT), but for now I’ll keep my compute low and see where I can go! As Jeremy Howard says, start simple and scale when required - it’s better to have something than nothing.
Implementation
The implementation exists in a single Notebook (runnable in Google Colab with a free GPU), keeping it simple to use for a non-programmer (my brother).
There are a few steps:
- Download Twitch Stream (twitch-dl)
- Convert into frames at a set FPS (3, ffmpeg)
- Upload to Cloudflare R2 to persist the data (see the sketch after this list)
- Build DataLoader
- Select and Train Model
- Load model and run inference
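For the Cloudflare R2 step, R2 is S3-compatible, so a plain boto3 client pointed at the account endpoint works; this is only a sketch of how I wire it up, and the bucket name, endpoint and credential placeholders are not the notebook's real values:

from pathlib import Path
import boto3

# Cloudflare R2 speaks the S3 API; point boto3 at the account-specific endpoint.
r2 = boto3.client(
    "s3",
    endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com",  # placeholder
    aws_access_key_id="<R2_ACCESS_KEY_ID>",          # placeholder
    aws_secret_access_key="<R2_SECRET_ACCESS_KEY>",  # placeholder
)

def upload_frames(frame_dir: str, bucket: str = "lol-highlights") -> None:
    # Persist extracted frames so the throwaway Colab VM can be discarded.
    for frame in Path(frame_dir).glob("*.jpg"):
        r2.upload_file(str(frame), bucket, f"frames/{frame.name}")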
FFMPEG + CUDA (step 2)
The ffmpeg conversion from video to frames was incredibly slow on Colab; I found two issues:
- The Colab CPU is much slower than an M1 MacBook (~30x)
- Running the conversion against mounted storage (GDrive) makes it barely work at all (~100x slower)
The second issue is simple to fix: keep all data on the Colab VM's local storage, do the conversion there, and only copy the results to mounted storage (GDrive) for persistence once done.
The first is a bit harder, but I found that I could accelerate conversion using CUDA! 🤓
The result? Beautiful!
There's some overhead from moving data from RAM to the GPU, but it gave an order-of-magnitude speed-up, which is awesome!
And it's not that hard. I was confused by the multiple places where CUDA can be enabled, but found the following to work best:
ffmpeg -hwaccel cuda -i <FILE> -preset faster -vf fps=3 -q:v 25 img%d.jpg
This command uses hwaccel with CUDA, which means it accelerates certain workloads by moving data to the GPU (expensive) and then computing on the GPU (fast). It extracts frames at fps=3 in this setup. On average it sped up my frame creation by 14-30x.
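In the notebook this is just a shell cell, but the same thing from Python looks roughly like the sketch below; the extract_frames helper and its paths are my own wrapper around the exact flags above:

import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, fps: int = 3) -> None:
    # Decode a VOD into JPEG frames with CUDA-accelerated ffmpeg (same flags as above).
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg",
            "-hwaccel", "cuda",   # decode on the GPU instead of the slow Colab CPU
            "-i", video_path,
            "-preset", "faster",
            "-vf", f"fps={fps}",  # 3 frames per second is plenty for highlight labelling
            "-q:v", "25",
            f"{out_dir}/img%d.jpg",
        ],
        check=True,
    )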
Small to Medium Dataset Speedups
In my professional work I've found that smaller datasets (those that fit in RAM) see huge performance gains from keeping tensors in memory; the gain is much larger than I ever anticipated.
For this dataset I could only keep the labels in memory, but even that helps. 😉
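A minimal sketch of what I mean, assuming the samples fit in RAM; this CachedDataset wrapper is illustrative and not the FrameDataset from the appendix:

from torch.utils.data import Dataset

class CachedDataset(Dataset):
    # Wrap any map-style dataset and keep every decoded sample in RAM after first access.
    def __init__(self, dataset: Dataset):
        self.dataset = dataset
        self.cache = {}

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        if idx not in self.cache:
            self.cache[idx] = self.dataset[idx]  # pay the disk + decode cost only once
        return self.cache[idx]

Note that with num_workers > 0 each DataLoader worker holds its own copy of the cache, so this pays off most when the data (or, as here, just the labels) is small enough to keep in the main process.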
Analysis
I analyzed my results in a few different ways:
- Metrics (😅)
- LIME
- Viewing data that the model predicts as non-highlight but that is actually a highlight
- Timeline chart
- This was the best way to review an “unknown” clip: do the highlight intensities overlap the real highlights? Boy, do they!
- It’s also how my brother will use the tool.
Result
The results are better than expected! I hit ~80% accuracy, which is pretty good for per-frame image classification; if we averaged the predictions over the last few frames I'd bet it'd be better. All in all, a non-perfect score on a task as subjective as Highlight Extraction is a strength. We don't want to overfit the data but rather be able to find highlights in new clips.
And to build my highlights I did a rolling_sum over the last 30 seconds of predictions, giving a highlight intensity chart as shown in Figure 1.
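A minimal sketch of that post-processing, assuming per-frame 0/1 predictions at 3 FPS in a polars DataFrame; the pred column name and highlight_intensity helper are mine:

import polars as pl

FPS = 3
WINDOW = 30 * FPS  # rolling window covering the last 30 seconds of predictions

def highlight_intensity(preds: pl.DataFrame) -> pl.DataFrame:
    # Turn per-frame highlight predictions into a smoothed intensity curve.
    return preds.with_columns(
        pl.col("pred").rolling_sum(window_size=WINDOW).alias("intensity")
    )

Plotting intensity over time gives the timeline chart from Figure 1 that my brother uses to find clips.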
After manually validating the intensity chart it looked really good! It had lower scores on some easier 1v1 kills, but all the cool fights and ults were caught by the model! I'm a bit impressed by how well a simple image classifier solves this task; applying video classification didn't really change the results much and introduced unwarranted complexity.
The result is that the model predicts ~8% of all frames as highlights, which is a pretty darn good number given that the “real” one is ~13%!
Appendix: Code
Data Tools
import numpy as np
import polars as pl
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader, Subset
from torchvision.datasets.folder import default_loader
from torchvision.transforms import Compose
import lightning as L

def chunk_splitter(total_size: int, chunk_size: int, split: int | float) -> np.array:
    # Split whole chunks between train/validation so neighbouring frames never leak
    # across the split (the final undersized chunk is ignored).
    _, val_idxs = train_test_split(np.arange(total_size // chunk_size), test_size=split, random_state=42)
    is_valid = np.zeros(total_size, dtype="int")
    for index in val_idxs:
        index *= chunk_size
        is_valid[index:index + chunk_size] = 1  # mark every frame of a validation chunk
    return is_valid
class FrameDataset(Dataset):
    def __init__(
        self,
        df: pl.DataFrame,
        augments: Compose,
        frames_per_clip: int,
        stride: int | None = None,
        is_train: bool = True,
    ):
        super().__init__()
        self.paths = df["path"].to_list()
        self.is_train = is_train
        if is_train:
            self.y = torch.tensor(df["label"])
        self.frames_per_clip = frames_per_clip
        self.augments = augments
        self.stride = stride or frames_per_clip

    def __len__(self):
        return len(self.paths) // self.stride

    def __getitem__(self, idx):
        start = idx * self.stride
        stop = start + self.frames_per_clip
        if stop - start <= 1:
            # Single frame: image classifier path
            path = self.paths[start]
            frames_tr = self._open_augment_img(path)
            if self.is_train:
                y = self.y[start]
        else:
            # Multiple frames: video classifier path, stack into a clip tensor
            frames = [self._open_augment_img(path) for path in self.paths[start:stop]]
            frames_tr = torch.stack(frames)
            if self.is_train:
                y = self.y[start:stop].max()
        if self.is_train:
            return frames_tr, y
        else:
            return frames_tr

    def _open_augment_img(self, path):
        img = default_loader(path)
        img = self.augments(img)
        return img
class FrameDataModule(L.LightningDataModule):
    def __init__(
        self,
        dataset: Dataset,
        batch_size: int = 32,
        chunk_size_for_splitting: int = 3 * 30,
        num_workers: int = 2,
        pin_memory: bool = False,
    ):
        super().__init__()
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.pin_memory = pin_memory
        self.chunk_size_for_splitting = chunk_size_for_splitting
        # Chunk-aware split so frames from the same 30s segment never end up in both sets
        split = chunk_splitter(len(self.dataset), chunk_size=self.chunk_size_for_splitting, split=.15)
        val_indices = np.where(split)[0]
        train_indices = np.where(split == 0)[0]
        self.ds_train = Subset(self.dataset, train_indices)
        self.ds_val = Subset(self.dataset, val_indices)

    def train_dataloader(self):
        return DataLoader(self.ds_train, shuffle=True, batch_size=self.batch_size, num_workers=self.num_workers, pin_memory=self.pin_memory)

    def val_dataloader(self):
        return DataLoader(self.ds_val, batch_size=self.batch_size, num_workers=self.num_workers, pin_memory=self.pin_memory)
Classifiers
class ResNetClassifier(nn.Module):
    def __init__(self, model: ResNet, num_classes: int = 2):
        super().__init__()
        self.num_classes = num_classes
        self.model = model
        # Swap the ImageNet head for a highlight/non-highlight head
        self.model.fc = nn.Linear(self.model.fc.in_features, num_classes)

    def forward(self, x):
        return self.model(x)
class RNNClassifier(nn.Module):
    def __init__(self, model: ResNet, num_classes: int = 2):
        super().__init__()
        self.num_classes = num_classes
        self.feature_extractor = model
        self.feature_extractor.fc = nn.Linear(512, 512)  # new fc layer replacing the ResNet head
        self.rnn = nn.LSTM(
            input_size=512,
            hidden_size=256,
            num_layers=1,
            batch_first=True,
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        # x: (batch, frames, channels, height, width)
        features = []
        for i in range(x.shape[1]):
            frame_feat = self.feature_extractor(x[:, i])
            features.append(frame_feat)
        x = torch.reshape(torch.stack(features), [x.shape[0], x.shape[1], -1])
        out, _ = self.rnn(x)
        out = out[:, -1, :]  # last hidden state of the sequence
        out = self.classifier(out)
        return out
from typing import Any

import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L
import torchmetrics

class LightningWrapper(L.LightningModule):
    def __init__(self, model: nn.Module, learning_rate=1e-3):
        super().__init__()
        self.model = model
        self.lr = learning_rate
        metrics = torchmetrics.MetricCollection({
            "accuracy": torchmetrics.Accuracy(task="multiclass", num_classes=self.model.num_classes)
        })
        self.train_metrics = metrics.clone(prefix="train_")
        self.val_metrics = metrics.clone(prefix="val_")

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        self.log('train_loss', loss)
        self.train_metrics(logits, y)
        self.log_dict(self.train_metrics, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        self.log('val_loss', loss)
        self.val_metrics(logits, y)
        self.log_dict(self.val_metrics, prog_bar=True)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
        return optimizer
Training Loop
import mlflow
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers import MLFlowLogger
from torchvision import models, transforms

mlf_logger = MLFlowLogger(log_model=True)
mlflow.pytorch.autolog()

BATCH_SIZE = 128

# Define the duration of each chunk in seconds (and in frames, at 3 FPS)
chunk_duration_s = 30
chunk_duration_frames = 3 * chunk_duration_s

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

ds = FrameDataset(labeled_df_rebalanced, transform, 1)  # frames_per_clip=1 for the image classifier, higher for the video classifier
data = FrameDataModule(ds, BATCH_SIZE, chunk_duration_frames, pin_memory=True)

model = ResNetClassifier(models.resnet50())  # image classifier
# model = RNNClassifier(models.resnet18())  # video classifier
model = LightningWrapper(model, learning_rate=1e-4)
trainer = L.Trainer(max_epochs=5, logger=mlf_logger, callbacks=[ModelCheckpoint(".")])

with mlflow.start_run():
    trainer.fit(model, datamodule=data)