§ Self attention? not really


The code is taken from The annotated transformer which explains the "attention is all you need paper".
On skimming the code, one sees the delightful line of code:
class EncoderLayer(nn.Module):
  "Encoder is made up of self-attn and feed forward (defined below)"
  def __init__(self, size, self_attn, feed_forward, dropout):
     super(EncoderLayer, self).__init__()
     self.self_attn = self_attn
     self.feed_forward = feed_forward
     self.sublayer = clones(SublayerConnection(size, dropout), 2)
     self.size = size
  def forward(self, x, mask):
    "Follow Figure 1 (left) for connections."
    x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
    return self.sublayer[1](x, self.feed_forward)

where the line:
x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))

seems to imply that we are, indeed, performing a self attention with the same value x as the query, key, and value.
However, reading the code of the self-attention (or the paper) leads one to realise:
class MultiHeadedAttention(nn.Module):
  def __init__(self, h, d_model, dropout=0.1):
    "Take in model size and number of heads."
    super(MultiHeadedAttention, self).__init__()
    assert d_model % h == 0
    # We assume d_v always equals d_k
    self.d_k = d_model // h
    self.h = h
    self.linears = clones(nn.Linear(d_model, d_model), 4)
    self.attn = None
    self.dropout = nn.Dropout(p=dropout)

  def forward(self, query, key, value, mask=None):
    "Implements Figure 2"
    if mask is not None:
        # Same mask applied to all h heads.
        mask = mask.unsqueeze(1)
    nbatches = query.size(0)

    # 1) Do all the linear projections in batch from d_model => h x d_k
    query, key, value = \
        [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
         for l, x in zip(self.linears, (query, key, value))]

    # 2) Apply attention on all the projected vectors in batch.
    x, self.attn = attention(query, key, value, mask=mask,
                             dropout=self.dropout)

    # 3) "Concat" using a view and apply a final linear.
    x = x.transpose(1, 2).contiguous() \
         .view(nbatches, -1, self.h * self.d_k)
    return self.linears[-1](x)

where we notice:
# 1) Do all the linear projections in batch from d_model => h x d_k
query, key, value = \
  [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
   for l, x in zip(self.linears, (query, key, value))]

# 2) Apply attention on all the projected vectors in batch.
x, self.attn = attention(query, key, value, mask=mask,
                         dropout=self.dropout)

where we see that query, key, value are being linearly transformed before being used. Hence, an input of (x,x,x)(x, x, x) is transformed to (q,k,v)=(Qx,Kx,Vx)(q', k', v') = (Qx, Kx, Vx) where Q,K,VQ, K, V are arbitrary matrices.
Next, when we pass these into attention, the output we get is:
softmax(q,kT)v=(Qx)(Kx)T(Vx)=QxxTKTVx \text{softmax}(q', k'^T) v = (Q x) (K x)^T (V x) = Q x x^T K^T V x

the code below is the same thing, spelled out:
def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

So It's not really self attention: it's more like: modulated attention to self :)