Transformers From Scratch
In this post I walk through SelfAttention Transformers from scratch with demos at the end for Text Classification & Generation, where the PyTorchcode is wrapped by fast.ai to simplify end2end.
 Transformers From Scratch
 Appendix: Demos (Text Classification & Text Generation)
Transformers From Scratch
(Open directly in Google Colab here.)
The video from the workshop (in Swedish) can be found on YouTube.
Today I'm taking huge inspiration from Peter Bloem
(Vrije Universiteit Amsterdam) which is a great blog post & video lecture. It's essentially a 8090 % copy of the blog, with minor modifications and reworked code into TODOs to make it more like a workshop using the Jupyter format (fitting nicely with Google Colab).
As demonstrations in the end I integrate PyTorchcode with fast.ai framework.
These demos are done to have a foundation for you to play around with as the models are not deep enough and does not train enough to really showcase the power of Transformers.
I want to note that the original blog by Peter Bloems is a work of art and does a better job at visualizing graphics in combination with text.
Finally this blog was done to support a 'Competence Night' in Machine Learning that I did @ AFRY.
Transformers explained
Transformers are a very exciting family of machine learning architectures. Many good tutorials exist but in the last few years, transformers have mostly become simpler, so that it is now much more straightforward to explain how modern architectures work. This post (read:Peter Bloems blog) is an attempt to explain directly how modern transformers work, and why, without some of the historical baggage. It's assumed a basic understanding of neural networks and backpropagation, for those that are not sure you can either brush up your knowledge in this video and learn how they're used today here.
Further we'll use PyTorch, with some fast.ai for the demos in appendix, to implement the full self attention & transformer module(s).
Selfattention
The fundamental operation of any transformer architecture is the selfattention operation.
Selfattention is a sequencetosequence operation, that is we input a sequence and we have a sequence returned.
Let's say we've got x1..xn
as input and y1..yn
as output where each vector is dimensioned k
.
To produce output vector 𝐲i, the self attention operation simply takes a weighted average over all the input vectors
$y_i = \sum_{j=0}^{j=k}{w_{ij}x_j}$
Where the weights summed over $j=0..k$ is equal to $1$. $w_{ij}$ is not a parameter but is derived from a function over $x_i$ and $x_j$.
The simplest being the dotproduct.
$w_{ij}^{'} = x_i^Tx_j$
Where $x_i$ is at the same position as $y_i$ and $x_j$ traverses through all k values. For $x_{i+1}$ we get a completely different output!
Because the dot product has no bounds we apply softmax to the result, so that the whole sequence sum to $1$.
$w_{ij} = \frac{\text{exp } w_{ij}^{'}}{\sum_{j} \text{exp } w_{ij}^{'}}$
And that's really it.
A visual illustration of basic selfattention. Note that the softmax operation over the weights is not illustrated.
Obviously we need to apply a few more things to create a Transformer, but we'll get to that.
As you see, negative + negative = positive, positive + positive = positive.
The magnitude of the score increases the final outcome.
So combining these two vectors together will work out real nice!
This might not be so practical in reality and as such we make the movie and user features parameters of the model. Then we ask the user for for a small number of movies they like and optimize the features so that their dot product matches the known likes.
Even when not manually giving any features meaningful data is extracted.
Even though we don’t tell the model what any of the features should mean, in practice, it turns out that after training the features do actually reflect meaningful semantics about the movie content.
This is in essence the same as selfattention.
Using word embeddings on a sentence as
$v_{the}, v_{cat}, v_{walk}$
and feed it into self attention we get the outputvectors y,
$y_{the}, y_{cat}, y_{walk}$
where yvectors are the weighted sum over all embedding vectors in the first sequence, weighted by their (normalized) dotproduct with $v_{cat}$.
E.g. $y_{the}$ = $v_{the} * v_{the} + v_{the} * v_{cat} + v_{the} * v_{walk}$.
Because v is learned this will continually be updated to work better. While updating how v is created most likely walk and cat will have a high dotproduct as those are correlated.
This is the basic intuition. Please note:
 No parameters (yet) so the upstream mechanism creating the embeddings fully drives the selfattention by learning representations with particular dotproducts
 Selfattention see the input as a bag, i.e. not a continuous input as it is, e.g. it is permutation equivariant.
In Pytorch: basic selfattention
What I cannot create, I do not understand  Feynman
The first thing we need to implement is matrix multiplications, which will be done through torch as native python loops are too slow.
Input:$t$ vectors, dimension $k$ and $b$ minibatches (tensor: $b, t, k$)
We’ll represent the input, a sequence of t vectors of dimension k as a t by k matrix 𝐗. Including a minibatch dimension b, gives us an input tensor of size (b,t,k).
The set of all raw dot products w′ij forms a matrix, which we can compute simply by multiplying 𝐗 by its transpose:
Let's code!
class TODO(Exception):
"""Raised when there is something TODO"""
pass
import torch
import torch.nn.functional as F
# assume we have some tensor x with size (b=1, t=3, k=3)
x = torch.array([[1.,2.,3.,
1.,2.,3.,
1.,2.,3.]])
raw_weights = raise TODO("Use torch batchmatrixmultiply (bmm) to do x*x^T. Remember there's 3 dimensions!")
# Apply rowwise Softmax
weights = raise TODO("Use functional softmax to apply softmax rowwise (along which dim?)")
# creating y
y = torch.bmm(weights, x)
And that's it. We have created basic selfattention which is the basis of Transformers (stateoftheart).
Additional Tricks
To implement Transformers fully we need to add a few tricks.
1) Queries, keys & values
Each inputvector $x_i$ is used 3 times
 Creating its own weight for $y_i$
 Creating others weight for $y$ ($y_j$)
 Used as a part of the weighted sum for each output $y$
This is often called the query, the key, and the value (explained later).
We update the network to instead use three weightmatrices, one for each task, making it more controllable. Adding $W_k, W_q, W_v$ of size $k*k$, we've got
$q_i = W_qx_i, k_i = W_kx_i, v_i = W_vx_i$
$w_{ij}^{'} = q_i^Tk_j$
$w_ {ij} = softmax(w_{ij}^{'})$
$y_i = \sum_j w_{ij}v_j$
We've now given the Self Attention some controllable parameters & allows modification to the input vector to fit the task at hands.
Illustration of the selfattention with key (red), query (query) and value (green) transformations. Remember old image and compare!
2) Scaling the Dot Product
Softmax can be very sensitive to large values, exploding gradient or making it slow to learn.
I don't recall where but ~ 8 years ago someone figured out that scaling the value by $\frac{1}{\sqrt{k}}$ where $k$ is the embedding dimension.
$w_{ij}^{'} = \frac{q_i^Tk_j}{\sqrt{k}}$
Why $\sqrt{k}$? Imagine a vector in $ℝk$ with values all $c$. Its Euclidean length is $\sqrt{k}c$. Therefore, we are dividing out the amount by which the increase in dimension increases the length of the average vectors.
3) Multihead Attention
The final improvement is to allow word to have different meanings with different neighbours (basically what ngram achieves).
By adding multiple, indexed r, self attention mechanisms with different matrices, $W_q^r$ etc. These are called attention heads.
For input $x_i$ each attention head produces a different output vector $y_i^r$. We concatenate these, and pass them through a linear transformation to reduce the dimension back to $k$. Remember what is a linear transformation?
Narrow and wide selfattention. There's two ways to apply MultiHead SelfAttention.
 (narrow) Cut embedding vector into chunks
 8 heads & $k=256$ > 8 chunks of size 32
 Each chunk gets Q, K, V matrices ($W_q^r$,...) ($32\times32$)
 (wide) Make matrices $256\times256$ and apply each head to the whole 256vector
 First (narrow) = faster & less memory
 Second (wider) = better result
Only second (wider) variant described.
import torch
from torch import nn
import torch.nn.functional as F
class SelfAttention():
def __init__(self, k, heads=8):
raise TODO("Make SelfAttention a torch module (nn.Module)")
super().__init__()
self.k, self.heads = k, heads
Combining three attention heads into one matrix multiplication (for the queries).
h attentionheads considered as h separate sets of $W_q^r,W_k^r,W_v^r$, but as shown in image above a more efficient approach is possible.
Combine all heads into three single $k\times hk$ matrices.
This means that we can compute the concatenated queries, keys & values in a single matrix multiplication.
class SelfAttention(nn.Module):
def __init__(self, k, heads=8):
super().__init__()
self.k, self.heads = k, heads
# These compute the queries, keys and values for all
# heads (as a single concatenated vector)
self.tokeys = raise TODO("Create a linear layer of k * hk size, no bias")
self.toqueries = raise TODO("Create a linear layer of k * hk size, no bias")
self.tovalues = raise TODO("Create a linear layer of k * hk size, no bias")
# This unifies the outputs of the different heads into a single kvector
self.unifyheads = raise TODO("Create a linear layer of k * hk size, WITH bias")
From here we can implement the computation of the selfattention (forward
function). Where we first calculate queries, keys & values.
class SelfAttention(nn.Module):
def __init__(self, k, heads=8):
super().__init__()
self.k, self.heads = k, heads
# These compute the queries, keys and values for all
# heads (as a single concatenated vector)
self.tokeys = nn.Linear(k, k * heads, bias=False)
self.toqueries = nn.Linear(k, k * heads, bias=False)
self.tovalues = nn.Linear(k, k * heads, bias=False)
# This unifies the outputs of the different heads into
# a single kvector
self.unifyheads = nn.Linear(heads * k, k)
def forward(self, x):
b, t, k = x.size()
h = self.heads
queries = raise TODO("Create queries and then reshape into h separate matrices, using `view`")
keys = raise TODO("Create keys and then reshape into h separate matrices, using `view`")
values = raise TODO("Create values and then reshape into h separate matrices, using `view`")
Having reshaped queries etc from $(b,t, h*k)$ into $(b,t,h,k)$ each head has its own dimension.
The step now is to compute dotproduct for each head.
We can batch this if we reshape the matrices into something that's possible to batch (transposing; as head/batch is not next to eachother). costly
class SelfAttention(nn.Module):
def __init__(self, k, heads=8):
super().__init__()
self.k, self.heads = k, heads
# These compute the queries, keys and values for all
# heads (as a single concatenated vector)
self.tokeys = nn.Linear(k, k * heads, bias=False)
self.toqueries = nn.Linear(k, k * heads, bias=False)
self.tovalues = nn.Linear(k, k * heads, bias=False)
# This unifies the outputs of the different heads into
# a single kvector
self.unifyheads = nn.Linear(heads * k, k)
def forward(self, x):
b, t, k = x.size()
h = self.heads
queries = self.toqueries(x).view(b, t, h, k)
keys = self.tokeys(x) .view(b, t, h, k)
values = self.tovalues(x) .view(b, t, h, k)
#  fold heads into the batch dimension ... contiguous = reshapes matrix in memory
keys = keys.transpose(1, 2).contiguous().view(b * h, t, k)
queries = queries.transpose(1, 2).contiguous().view(b * h, t, k)
values = values.transpose(1, 2).contiguous().view(b * h, t, k)
queries = raise TODO("Scale queries by k^(1/4)")
keys = raise TODO("Scale keys by k^(1/4)")
#  get dot product of queries and keys, and scale
dot = torch.bmm(queries, keys.transpose(1, 2))
#  dot has size (b*h, t, k) containing raw weights
dot = raise TODO("Normalize rowwise using F.softmax")
#  dot now contains rowwise normalized weights
# apply the self attention to the values
out = torch.bmm(dot, values).view(b, h, t, k)
# swap h, t back, unify heads
out = out.transpose(1, 2).contiguous().view(b, t, h * k)
return raise TODO("Unify the heads into the classes again using unifyheads")
And there you have it: multihead, scaled dotproduct self attention.
Building Transformers
A transformer is not just a selfattention layer, it is an architecture.
Any architecture designed to process a connected set of units—such as the tokens in a sequence or the pixels in an image—where the only interaction between units is through selfattention.
Like most mechanism, e.g. convolutions, a standard approach as emerged. The first step is to wrap the selfattention into a block that we can repeat.
The transformer block
There are some variations on how to build a basic transformer block, but most of them are structured roughly like this:
 MultiLayer Perceptron = MLP = Basic Feedforward Neural Network
 Blue lines = Residual Connection (allow the gradient to flow through the network on a kind of "highway", making the training faster and reducing "blown up gradients")
class TransformerBlock(nn.Module):
def __init__(self, k, heads):
super().__init__()
self.attention = raise TODO("make a SelfAttention")
self.norm1 = raise TODO("make a nn.LayerNorm of size k")
self.norm2 = raise TODO("make a nn.LayerNorm of size k")
# k * 4 is arbitrary, but needs to be larger than input/output layer (k)
self.ff = raise TODO("create a MLP: Sequential(Linear(k, 4*k), ReLU, Linear(4*k, k))")
def forward(self, x):
attended = raise TODO("call your created attention on x")
x = raise TODO("Use the first normalizer to normalizer attented+x")
fedforward = raise TODO("Call feedforward (ff) on new x")
return raise TODO("Finally normalize with 2nd on the addition of feedforward & x")
And that is really it! We have now built a Transformer Block & Self Attention.
Now we want to use it :)
Classification transformer
The simplest classification task is a Sequence Classifier.
The IMDB Review dataset is a great contender to try things out with, where each review is positive
or negative
.
Essentially we'll create a chain of Transformers Block and input our embedded vectors of words, and in the end transforming the output into a single value (true/false).
Output: producing a classifier
Most common way: Global Average Pooling on final output sequence and map result to a softmaxed class vector.
Overview of a simple sequence classification transformer. The output sequence is averaged to produce a single vector representing the whole sequence. This vector is projected down to a vector with one element per class and softmaxed to produce probabilities.
Input:Using Positions (think: ngram)We've already discussed that the current model uses embedding layer but has no sense of sequence time slotting (~ ngram).
We want our StateoftheArt model to have a sense of order so we need to fix it.
Add a second vector of the same length as word embedding, that represent the position of the sentence+word, and add it to the word embedding. There's two ways to do this.

Position Embedding We simply embed the positions like we did the words.
 Easy to implement
 Works pretty good
 Drawback is that we have to see sequences of every length during training, otherwise the position is not trained!

Position Encoding Position encodings work in the same way as embeddings, except that we don't learn the position vectors, we just choose some function $f:ℕ→ℝk$ to map the positions to real valued vectors, and let the network figure out how to interpret these encodings.
 E.g. sin/cos
 Works with longer sequences than seen (might not work well, but it works!)
 Drawback: choice of encoding function is a complicated hyperparameter, and more complicated implementation.
For this tutorial the Position Embedding is used.
PyTorch
Let's implement this!
class Transformer(nn.Module):
def __init__(self, k, heads, depth, seq_length, num_tokens, num_classes):
super().__init__()
self.num_tokens = num_tokens
self.token_emb = raise TODO("Create a Embedding layer (nn.X) with num_tokens & k")
self.pos_emb = raise TODO("Create Embedding Layer with seq_length & k")
# The sequence of transformer blocks that does all the
# heavy lifting
blocks = []
for i in range(depth):
raise TODO("Append a TransformerBlock we recently created for each loop; and why not use listcomprehension?")
self.t_blocks = raise TODO("Now make them a Sequential layer (*list = spread)")
# Maps the final output sequence to class logits
self.to_probs = raise TODO("To get class logits we simply use a Linear layer of k * num_classes")
def forward(self, x):
"""
:param x: A (b, t) tensor of integer values representing
words (in some predetermined vocabulary).
:return: A (b, c) tensor of logprobabilities over the
classes (where c is the nr. of classes).
"""
# generate token embeddings
tokens = raise TODO("Embedd the tokens")
b, t, k = tokens.size()
# generate position embeddings
positions = torch.arange(t)
positions = self.pos_emb(positions)[None, :, :].expand(b, t, k)
x = tokens + positions
x = raise TODO("Run the network through the tranformer blocks")
# Averagepool over the t dimension and project to class
# probabilities
x = raise TODO("Use the probability function, but first take the mean over dim=1 to average")
return raise TODO("Use the F.log_softmax on dim=1 to normalize output!")
At depth 6, with a maximum sequence length of 512, this transformer achieves an accuracy of about 85%, competitive with results from RNN models, and much faster to train. To see the real nearhuman performance of transformers, we’d need to train a much deeper model on much more data. More about how to do that later.
Text generation transformer
Let's move on!
In Text Generation we are not allowed to know the future during training, how else are we going to predict it? This means that we need to use an autoregressive model.
For Text Generation we'll train a charactertocharacter prediction, the input is a sequence of characters and output is the input shifted one charafter to the left.
For the usual RNN this is all that is needed, but as mentioned now we need to make our model autoregressive, meaning that it can't look ahead.
This is done by applying a mask which disables all elements that are ahead of current index, as in image below.
class SelfAttentionAutoRegressive(nn.Module):
def __init__(self, k, heads=8):
super().__init__()
self.k, self.heads = k, heads
# These compute the queries, keys and values for all
# heads (as a single concatenated vector)
self.tokeys = nn.Linear(k, k * heads, bias=False)
self.toqueries = nn.Linear(k, k * heads, bias=False)
self.tovalues = nn.Linear(k, k * heads, bias=False)
# This unifies the outputs of the different heads into
# a single kvector
self.unifyheads = nn.Linear(heads * k, k)
def forward(self, x):
b, t, k = x.size()
h = self.heads
queries = self.toqueries(x).view(b, t, h, k)
keys = self.tokeys(x) .view(b, t, h, k)
values = self.tovalues(x) .view(b, t, h, k)
#  fold heads into the batch dimension ... contiguous = reshapes matrix in memory
keys = keys.transpose(1, 2).contiguous().view(b * h, t, k)
queries = queries.transpose(1, 2).contiguous().view(b * h, t, k)
values = values.transpose(1, 2).contiguous().view(b * h, t, k)
queries = queries / (k ** (1/4))
keys = keys / (k ** (1/4))
#  get dot product of queries and keys, and scale
dot = torch.bmm(queries, keys.transpose(1, 2))
# === START OF CHANGES ===
indices = torch.triu_indices(t, t, offset=1)
dot[:, indices[0], indices[1]] = raise TODO("inf; and think off what we are doing. triu_indices is shown below")
# === END OF CHANGES ===
#  dot has size (b*h, t, t) containing raw weights
dot = F.softmax(dot, dim=2)
#  dot now contains rowwise normalized weights
# apply the self attention to the values
out = torch.bmm(dot, values).view(b, h, t, k)
# swap h, t back, unify heads
out = out.transpose(1, 2).contiguous().view(b, t, h * k)
return self.unifyheads(out)
??torch.triu_indices
Once the model has been 'handicapped' as this we're ready to go!
If you'd like to learn the historical aspect of Transformers, some design consideration & more details please visit the original blog by Peter Bloem and finish that one!
I hope you found this helpful!
Please take a good look in Appendix for demos with Text Classification & Text Generation.
Do note that the depth should probably be increased as should the amount of data. Transformers improve a lot with time, depth & data (especially data!).
Thanks
~Hampus
Appendix: Demos (Text Classification & Text Generation)
This appendix includes one demo of each type where I've integrated the PyTorch with fast.ai to smoothen the process from model to actually using it.
DISCLAIMER:
 We should train on more data for a longer while to really showcase their prowess
 Use the full IMDB data to at least get a little better performance
 The demos are more wrapped code for you to play around with
With that in mind, please play around with this. Tweak parameters, add your own data and have fun!
%%capture
!pip install U fastai
from fastai.text.all import *
from functools import partial # Can use partial to preload a transformer block
class SelfAttention(Module):
def __init__(self, k, heads=8):
self.k, self.heads = k, heads
# These compute the queries, keys and values for all
# heads (as a single concatenated vector)
self.tokeys = nn.Linear(k, k * heads, bias=False)
self.toqueries = nn.Linear(k, k * heads, bias=False)
self.tovalues = nn.Linear(k, k * heads, bias=False)
# This unifies the outputs of the different heads into
# a single kvector
self.unifyheads = nn.Linear(heads * k, k)
def forward(self, x):
b, t, k = x.size()
h = self.heads
assert self.k == k, f'Input embedding dim ({e}) should match layer embedding dim ({self.emb})'
queries = self.toqueries(x).view(b, t, h, k)
keys = self.tokeys(x) .view(b, t, h, k)
values = self.tovalues(x) .view(b, t, h, k)
#  fold heads into the batch dimension ... contiguous = reshapes matrix in memory
keys = keys.transpose(1, 2).contiguous().view(b * h, t, k)
queries = queries.transpose(1, 2).contiguous().view(b * h, t, k)
values = values.transpose(1, 2).contiguous().view(b * h, t, k)
queries = queries / (k ** (1/4))
keys = keys / (k ** (1/4))
#  get dot product of queries and keys, and scale
dot = torch.bmm(queries, keys.transpose(1, 2))
assert dot.size() == (b*h, t, t)
#  dot has size (b*h, t, t) containing raw weights
dot = F.softmax(dot, dim=2)
#  dot now contains rowwise normalized weights
# apply the self attention to the values
out = torch.bmm(dot, values).view(b, h, t, k)
# swap h, t back, unify heads
out = out.transpose(1, 2).contiguous().view(b, t, h * k)
return self.unifyheads(out)
class TransformerBlock(Module):
def __init__(self, k, heads):
super().__init__()
self.attention = SelfAttention(k, heads=heads)
self.norm1 = nn.LayerNorm(k)
self.norm2 = nn.LayerNorm(k)
self.ff = nn.Sequential(
nn.Linear(k, 4 * k),
nn.ReLU(),
nn.Linear(4 * k, k)
)
def forward(self, x):
attended = self.attention(x)
x = self.norm1(attended + x)
fedforward = self.ff(x)
return self.norm2(fedforward + x)
class Transformer(Module):
def __init__(self, k, heads, depth, seq_length, num_tokens, num_classes, device):
super().__init__()
self.device = device
self.seq_length = seq_length
self.num_tokens = num_tokens
self.token_emb = nn.Embedding(num_tokens, k)
self.pos_emb = nn.Embedding(seq_length, k)
# The sequence of transformer blocks that does all the
# heavy lifting
tblocks = [TransformerBlock(k=k, heads=heads) for x in range(depth)]
self.tblocks = nn.Sequential(*tblocks)
# Maps the final output sequence to class logits
self.toprobs = nn.Linear(k, num_classes)
def forward(self, x):
"""
:param x: A (b, t) tensor of integer values representing
words (in some predetermined vocabulary).
:return: A (b, c) tensor of logprobabilities over the
classes (where c is the nr. of classes).
"""
# generate token embeddings
# x = x.split(' ')
x = x[:,:self.seq_length,]
tokens = self.token_emb(x)
b, t, k = tokens.size()
# print(b, t, k)
# generate position embeddings
positions = torch.arange(t, device=self.device)
positions = self.pos_emb(positions)[None, :, :].expand(b, t, k)
x = tokens + positions
x = self.tblocks(x)
# Averagepool over the t dimension and project to class
# probabilities
x = self.toprobs(x.mean(dim=1))
return F.log_softmax(x, dim=1)
path = untar_data(URLs.IMDB)
path.ls()
# df = pd.read_csv(path/'texts.csv');df.head(2)
# dls = TextDataLoaders.from_df(df, path=path, text_col='text', label_col='label', valid_col='is_valid')
db = DataBlock(
blocks = (TextBlock.from_folder(path, max_vocab=10_000, seq_len=256), CategoryBlock),
get_items=get_text_files,
splitter=GrandparentSplitter(valid_name='test'),
get_y=parent_label
)
dls = db.dataloaders(path, path = path)
dls.show_batch(max_n=2)
len(dls.vocab[0]),dls.device,dls.one_batch()[0].size()
learn = Learner(dls, Transformer(k=256, heads=8, depth=4, seq_length=256, num_tokens=len(dls.vocab[0]), num_classes=2, device=dls.device), metrics=[accuracy])
learn.lr_find()
learn.fit_one_cycle(6, 5.7e5) # We can increase depth & more to improve the result
Fast.AI approach
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4, 1e2)
learn.show_results()
class SelfAttentionAutoRegressive(Module):
def __init__(self, k, heads=8):
self.k, self.heads = k, heads
# These compute the queries, keys and values for all
# heads (as a single concatenated vector)
self.tokeys = nn.Linear(k, k * heads, bias=False)
self.toqueries = nn.Linear(k, k * heads, bias=False)
self.tovalues = nn.Linear(k, k * heads, bias=False)
# This unifies the outputs of the different heads into
# a single kvector
self.unifyheads = nn.Linear(heads * k, k)
def forward(self, x):
b, t, k = x.size()
h = self.heads
queries = self.toqueries(x).view(b, t, h, k)
keys = self.tokeys(x) .view(b, t, h, k)
values = self.tovalues(x) .view(b, t, h, k)
#  fold heads into the batch dimension ... contiguous = reshapes matrix in memory
keys = keys.transpose(1, 2).contiguous().view(b * h, t, k)
queries = queries.transpose(1, 2).contiguous().view(b * h, t, k)
values = values.transpose(1, 2).contiguous().view(b * h, t, k)
queries = queries / (k ** (1/4))
keys = keys / (k ** (1/4))
#  get dot product of queries and keys, and scale
dot = torch.bmm(queries, keys.transpose(1, 2))
indices = torch.triu_indices(t, t, offset=1, device='cuda') # OBS this also changed
dot[:, indices[0], indices[1]] = float('inf')
#  dot has size (b*h, t, t) containing raw weights
dot = F.softmax(dot, dim=2)
#  dot now contains rowwise normalized weights
# apply the self attention to the values
out = torch.bmm(dot, values).view(b, h, t, k)
# swap h, t back, unify heads
out = out.transpose(1, 2).contiguous().view(b, t, h * k)
return self.unifyheads(out)
class TransformerBlock(Module):
def __init__(self, k, heads):
super().__init__()
self.attention = SelfAttentionAutoRegressive(k, heads=heads)
self.norm1 = nn.LayerNorm(k)
self.norm2 = nn.LayerNorm(k)
self.ff = nn.Sequential(
nn.Linear(k, 4 * k),
nn.ReLU(),
nn.Linear(4 * k, k)
)
def forward(self, x):
attended = self.attention(x)
x = self.norm1(attended + x)
fedforward = self.ff(x)
return self.norm2(fedforward + x)
class Transformer(Module):
def __init__(self, k, heads, depth, seq_length, num_tokens, device):
super().__init__()
self.device = device
self.seq_length = seq_length
self.num_tokens = num_tokens
self.token_emb = nn.Embedding(num_tokens, k)
self.pos_emb = nn.Embedding(seq_length, k)
# The sequence of transformer blocks that does all the
# heavy lifting
tblocks = [TransformerBlock(k=k, heads=heads) for x in range(depth)]
self.tblocks = nn.Sequential(*tblocks)
# Maps the final output sequence to class logits
self.toprobs = nn.Linear(k, num_tokens)
def forward(self, x):
"""
:param x: A (b, t) tensor of integer values representing
words (in some predetermined vocabulary).
:return: A (b, c) tensor of logprobabilities over the
classes (where c is the nr. of classes).
"""
# generate token embeddings
# print(x.size())
tokens = self.token_emb(x)
b, t, k = tokens.size()
# print(b, t, k)
# generate position embeddings
positions = torch.arange(t, device=self.device)
positions = self.pos_emb(positions)[None, :, :].expand(b, t, k)
x = tokens + positions
x = self.tblocks(x)
# Averagepool over the t dimension and project to class
# probabilities
x = self.toprobs(x.view(b*t, k)).view(b, t, self.num_tokens)
# print(x.size())
return F.log_softmax(x, dim=2)
path = untar_data(URLs.IMDB_SAMPLE)
path.ls()
df = pd.read_csv(path/'texts.csv')
df.head(1)
dls = TextDataLoaders.from_df(df, path=path, text_col='text', is_lm=True, valid_col='is_valid', seq_len=256)
dls.show_batch(max_n=1)
len(dls.vocab),dls.one_batch()[0].size()
learn = Learner(dls, Transformer(k=256, heads=8, depth=3, seq_length=256, num_tokens=len(dls.vocab), device=dls.device), loss_func=CrossEntropyLossFlat())
learn.freeze()
learn.lr_find()
learn.unfreeze()
learn.fit_one_cycle(4) # Add suggested LR if you'd like
learn.fit_one_cycle(4)
learn.fit_one_cycle(4)
def predict(self, text, n_words=1, no_unk=True, temperature=1., min_p=None, no_bar=False,
decoder=decode_spec_tokens, only_last_word=False):
"Return `text` and the `n_words` that come after"
idxs = idxs_all = self.dls.test_dl([text]).items[0].to(self.dls.device)
if no_unk: unk_idx = self.dls.vocab.index(UNK)
for _ in (range(n_words) if no_bar else progress_bar(range(n_words), leave=False)):
with self.no_bar(): preds,_ = self.get_preds(dl=[(idxs[None],)])
# print(preds.size())
res = preds[0][1]
if no_unk: res[unk_idx] = 0.
if min_p is not None:
if (res >= min_p).float().sum() == 0:
warn(f"There is no item with probability >= {min_p}, try a lower value.")
else: res[res < min_p] = 0.
if temperature != 1.: res.pow_(1 / temperature)
idx = torch.multinomial(res, 1).item()
idxs = idxs_all = torch.cat([idxs_all, idxs.new([idx])])
if only_last_word: idxs = idxs[1][None]
num = self.dls.train_ds.numericalize
tokens = [num.vocab[i] for i in idxs_all if num.vocab[i] not in [BOS, PAD]]
sep = self.dls.train_ds.tokenizer.sep
return sep.join(decoder(tokens))
@delegates(Learner.get_preds)
def get_preds(self, concat_dim=1, **kwargs):
return super().get_preds(concat_dim=concat_dim, **kwargs)
predict(learn, "I think its a very", n_words=20, temperature=1.)
learn.save('gen.pkl')
!wget https://cs.stanford.edu/people/karpathy/charrnn/shakespear.txt
FastAI way
predict(language_model_learner(dls, AWD_LSTM), "I think its", n_words=2, temperature=1.)