§ Word2Vec C code implements gradient descent really weirdly

I'll be posting snippets of the original source code, along with a link to the Github sources. We are interested in exploring the skip-gram implementation of Word2Vec, with negative sampling, without hierarchical softmax. I assume basic familiarity with word embeddings and the skip-gram model.

§ Construction of the sigmoid lookup table


// https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L708

expTable = (real *)malloc((EXP_TABLE_SIZE + 1) * sizeof(real));
for (i = 0; i < EXP_TABLE_SIZE; i++) {
  expTable[i] = exp((i / (real)EXP_TABLE_SIZE * 2 - 1) *
                    MAX_EXP);  // Precompute the exp() table
  expTable[i] =
      expTable[i] / (expTable[i] + 1);  // Precompute f(x) = x / (x + 1)
}
Here, the code constructs a lookup table which maps [0...EXP_TABLE_SIZE-1] to [sigmoid(-MAX_EXP)...sigmoid(MAX_EXP)]. The index i first gets mapped to (i / EXP_TABLE_SIZE) * 2 - 1, which sends 0 to -1 and EXP_TABLE_SIZE to 1. This is then rescaled by MAX_EXP.

§ Layer initialization



// https://github.com/imsky/word2vec/blob/master/word2vec.c#L341
a = posix_memalign((void **)&syn0, 128,
               (long long)vocab_size * layer1_size * sizeof(real));
...

// https://github.com/imsky/word2vec/blob/master/word2vec.c#L355
for (a = 0; a < vocab_size; a++)
        for (b = 0; b < layer1_size; b++) {
            next_random = next_random * (unsigned long long)25214903917 + 11;
            syn0[a * layer1_size + b] =
                (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
        }


// https://github.com/imsky/word2vec/blob/master/word2vec.c#L350
a = posix_memalign((void **)&syn1neg, 128,
                   (long long)vocab_size * layer1_size * sizeof(real));
...
for (a = 0; a < vocab_size; a++)
    for (b = 0; b < layer1_size; b++) syn1neg[a * layer1_size + b] = 0;


// https://github.com/imsky/word2vec/blob/master/word2vec.c#L370
real *neu1e = (real *)calloc(layer1_size, sizeof(real));

§ Backpropogation


Throughout word2vec, no 2D arrays are used. Indexing of the form arr[word][ix] is manually written as arr[word * layer1_size + ix]. So, I will call word * layer1_size as the "base address", and ix as the "offset of the array index expression henceforth.
Here, l1 is the base address of the word at the center of window (the focus word). l2 is the base address of either the word that is negative sampled from the corpus, or the word that is a positive sample from within the context window.
label tells us whether the sample is a positive or a negative sample. label = 1 for positive samples, and label = 0 for negative samples.
// zero initialize neu1e
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L419
for (c = 0; c < layer1_size; c++) neu1e[c] = 0;
...
// loop through each negative sample
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L508
if (negative > 0)  for (d = 0; d < negative + 1; d++) {
  ...
  // https://github.com/imsky/word2vec/blob/master/word2vec.c#L521
  // take the dot product: f=  syn0[focus] . syn1neg[context]
  for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];

  // compute: g = (label - sigmoid(2f - 1)) * alpha
  // g is computed using lookups into a lookup table and clamping for
  // efficiency.
  if (f > MAX_EXP) g = (label - 1) * alpha;
  else if (f < -MAX_EXP) g = (label - 0) * alpha;
  else
  g = (label - expTable[(int)((f + MAX_EXP) *
                              (EXP_TABLE_SIZE /
                               MAX_EXP / 2))]) * alpha;
  // Now that we have computed the gradient:
  // `g = (label - output) * learningrate`,
  // we need to perform backprop. This is where the code gets weird.

  for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
  for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
  } // end loop through negative samples
// Learn weights input -> hidden
for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];




The paper does not mentioned these implementation details, and neither does any blog post that I've read . I don't understand what's going on, and I plan on updating this section when I understand this better.