The setup is incredibly well done in the C code:
- An array called syn0 holds the vector embedding of a word when it occurs as a focus word. This is randomly initialized.
// syn0: each component is drawn uniformly from [-0.5, 0.5) / layer1_size,
// using word2vec's hand-rolled linear congruential generator.
for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) {
  next_random = next_random * (unsigned long long)25214903917 + 11;
  syn0[a * layer1_size + b] =
      (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
}
- Another array called syn1neg holds the vector of a word when it occurs as a context word. This is zero-initialized.
// syn1neg: every context vector starts out as exactly zero.
for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++)
  syn1neg[a * layer1_size + b] = 0;
- During training (skip-gram with negative sampling; the other cases are similar), we first pick a focus word. This is held constant throughout the positive and negative sample training. The gradients of the focus vector are accumulated in a buffer, and are applied to the focus word only after it has been affected by both the positive and the negative samples. (A condensed sketch of this update follows the code below.)
if (negative > 0) for (d = 0; d < negative + 1; d++) {
  if (d == 0) {
    // d == 0: the positive sample, i.e. the word actually observed in context
    target = word;
    label = 1;
  } else {
    // d > 0: draw a negative sample from the unigram table
    next_random = next_random * (unsigned long long)25214903917 + 11;
    target = table[(next_random >> 16) % table_size];
    if (target == 0) target = next_random % (vocab_size - 1) + 1;
    if (target == word) continue;
    label = 0;
  }
  l2 = target * layer1_size;
  // f = dot product of the focus vector (syn0) and the context vector (syn1neg)
  f = 0;
  for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
  // g = (label - sigmoid(f)) * alpha; expTable holds precomputed sigmoid values
  if (f > MAX_EXP) g = (label - 1) * alpha;
  else if (f < -MAX_EXP) g = (label - 0) * alpha;
  else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
  // accumulate the focus-word gradient in neu1e; update the context vector in place
  for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
  for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
}
// only now is the accumulated gradient applied to the focus word
for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
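For readability, here is a condensed sketch of the same per-sample update, with the exp-table lookup replaced by a direct sigmoid; the function name train_pair, the parameter dim, and the plain float types are my own, not word2vec's:

#include <math.h>

/* Condensed sketch (my naming, not word2vec's): train one (focus, target)
 * pair, with label 1 for the positive sample and 0 for a negative sample.
 * syn0 and syn1neg are the same flat vocab_size * dim arrays as above. */
static void train_pair(float *syn0, float *syn1neg, float *neu1e,
                       long long focus, long long target, int label,
                       int dim, float alpha) {
  float *v_focus = syn0 + focus * dim;      /* focus-word vector (syn0 row)      */
  float *v_ctx   = syn1neg + target * dim;  /* context-word vector (syn1neg row) */

  float f = 0;                              /* dot product of the two vectors */
  for (int c = 0; c < dim; c++) f += v_focus[c] * v_ctx[c];

  float sigma = 1.0f / (1.0f + expf(-f));   /* what expTable approximates */
  float g = (label - sigma) * alpha;        /* scaled gradient */

  /* accumulate the focus-word gradient; it is applied only after the
   * positive and all negative samples have been processed */
  for (int c = 0; c < dim; c++) neu1e[c] += g * v_ctx[c];
  /* the context-word vector, by contrast, is updated immediately */
  for (int c = 0; c < dim; c++) v_ctx[c] += g * v_focus[c];
}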
§ Why random and zero initialization?
Once again, since none of this is actually explained in the original papers or anywhere on the web, I can only hypothesize.
My hypothesis is that since the negative samples come from all over the text and are not really weighted by frequency, you can wind up picking any word, and more often than not a word whose vector has hardly been trained at all. If that vector already had a value, it could move the actually important focus word around at random. The solution is to start all context vectors at zero, so that only the vectors of words that have occurred somewhat frequently can affect the representation of another word: a zero context vector contributes exactly nothing to the focus word's accumulated gradient.
It's quite ingenious, really, and until I saw this, I had never given much thought to how important initialization strategies are.
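To make the mechanism concrete, here is a tiny self-contained demo (my own construction, not part of word2vec) of what happens when a never-trained word is drawn as a negative sample:

#include <math.h>
#include <stdio.h>

/* Demo: an untrained negative sample cannot move the focus word. */
int main(void) {
  enum { dim = 4 };
  float focus[dim] = {0.01f, -0.03f, 0.02f, 0.04f}; /* some focus vector            */
  float ctx[dim]   = {0, 0, 0, 0};                  /* zero-initialized syn1neg row */
  float neu1e[dim] = {0};                           /* focus-word gradient buffer   */
  float alpha = 0.025f, f = 0, g;

  for (int c = 0; c < dim; c++) f += focus[c] * ctx[c];  /* f == 0, so sigmoid(f) == 0.5 */
  g = (0 - 1.0f / (1.0f + expf(-f))) * alpha;            /* label 0, so g == -alpha / 2  */

  for (int c = 0; c < dim; c++) neu1e[c] += g * ctx[c];  /* adds 0: focus word untouched */
  for (int c = 0; c < dim; c++) ctx[c] += g * focus[c];  /* ctx moves away from focus    */

  printf("neu1e[0] = %f\n", neu1e[0]);                   /* prints 0.000000 */
  return 0;
}

Only once a word has actually been seen as a context word, and its syn1neg row has moved away from zero, can it pull other focus words around.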
§ Why I'm writing this
I spent two months of my life trying to reproduce word2vec, following the paper exactly, reading countless articles, and simply not succeeding. I was unable to reach the same scores that word2vec did, and it was not for lack of trying.
I could not have imagined that the paper would have literally fabricated an
algorithm that doesn't work, while the implementation does something completely
different.
Eventually, I decided to read the sources, and spent three whole days convinced
I was reading the code wrong since literally everything on the internet told me
otherwise.
I don't understand why the original paper and the internet contain zero explanations of the actual mechanism behind word2vec, so I decided to put it up myself.
This also explains GloVe's radical choice of having a separate vector for the negative context --- they were just doing what word2vec does, but they told people about it :).
Is this academic dishonesty? I don't know the answer, and that's a heavy question. But I'm frankly incredibly pissed, and this is probably the last time I take a machine learning paper's explanation of an algorithm seriously --- from now on, I read the source first.