§ Everything you know about word2vec is wrong
The classic explanation of word2vec
, in skip-gram, with negative sampling,
in the paper and countless blog posts on the internet is as follows:
while(1) {
1. vf = vector of focus word
2. vc = vector of context word
3. train such that (vc . vf = 1)
4. for(0 <= i < negative samples):
vneg = vector of word *not* in context
train such that (vf . vneg = 0)
}
Indeed, if I google "word2vec skipgram", the results I get are:
the list goes on. However, every single one of these implementations is wrong .
The original word2vec C
implementation does not do what's explained above,
and is drastically different . Most serious users of word embeddings, who use
embeddings generated from word2vec
do one of the following things:
- They invoke the original C implementation directly.
- They invoke the
gensim
implementation, which is transliterated from the C source to the extent that the variables names are the same.
Indeed, the gensim
implementation is the
only one that I know of which is faithful to the C implementation .
§ The C implementation
The C implementation in fact maintains two vectors for each word , one where
it appears as a focus word, and one where it appears as a context word.
(Is this sounding familiar? Indeed, it appears that GloVe actually took this
idea from word2vec
, which has never mentioned this fact!)
The setup is incredibly well done in the C code:
- An array called
syn0
holds the vector embedding of a word when it occurs as a focus word . This is random initialized .
https:
for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) {
next_random = next_random * (unsigned long long)25214903917 + 11;
syn0[a * layer1_size + b] =
(((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
}
- Another array called
syn1neg
holds the vector of a word when it occurs as a context word . This is zero initialized .
https:
for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++)
syn1neg[a * layer1_size + b] = 0;
- During training (skip-gram, negative sampling, though other cases are also similar), we first pick a focus word. This is held constant throughout the positive and negative sample training. The gradients of the focus vector are accumulated in a buffer, and are applied to the focus word after it has been affected by both positive and negative samples .
if (negative > 0) for (d = 0; d < negative + 1; d++) {
if (d == 0) {
target = word;
label = 1;
} else {
next_random = next_random * (unsigned long long)25214903917 + 11;
target = table[(next_random >> 16) % table_size];
if (target == 0) target = next_random % (vocab_size - 1) + 1;
if (target == word) continue;
label = 0;
}
l2 = target * layer1_size;
f = 0;
for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
if (f > MAX_EXP) g = (label - 1) * alpha;
else if (f < -MAX_EXP) g = (label - 0) * alpha;
else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
}
https:
for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
§ Why random and zero initialization?
Once again, since none of this actually explained in the original papers
or on the web , I can only hypothesize.
My hypothesis is that since the negative samples come from all over the text
and are not really weighed by frequency, you can wind up picking any word ,
and more often than not, a word whose vector has not been trained much at all .
If this vector actually had a value, then it could move the actually important
focus word randomly.
The solution is to set all negative samples to zero, so that
only vectors that have occurred somewhat frequently will affect the representation
of another vector.
It's quite ingenious, really, and until this, I'd never really thought of
how important initialization strategies really are.
§ Why I'm writing this
I spent two months of my life trying to reproduce word2vec
, following
the paper exactly, reading countless articles, and simply not succeeding.
I was unable to reach the same scores that word2vec
did, and it was not
for lack of trying.
I could not have imagined that the paper would have literally fabricated an
algorithm that doesn't work, while the implementation does something completely
different.
Eventually, I decided to read the sources, and spent three whole days convinced
I was reading the code wrong since literally everything on the internet told me
otherwise.
I don't understand why the original paper and the internet contain zero
explanations of the actual mechanism behind word2vec
, so I decided to put
it up myself.
This also explains GloVe's radical choice of having a separate vector
for the negative context --- they were just doing what word2vec
does, but
they told people about it :)
.
Is this academic dishonesty? I don't know the answer, and that's a heavy
question. But I'm frankly incredibly pissed, and this is probably the last
time I take a machine learning paper's explanation of the algorithm
seriously again --- from next time, I read the source first .