to 1. This is then rescaled by MAX_EXP.
§ Layer initialization
- syn0 is a global variable, initialized with small random weights: a uniform value in [-0.5...0.5] divided by the layer size, so each entry lies in [-0.5/HIDDEN...0.5/HIDDEN]. It has dimensions VOCAB x HIDDEN. This layer holds the hidden representations of word vectors. (A small range-check sketch follows the snippet below.)
// Allocate the input embedding matrix: VOCAB x HIDDEN floats, 128-byte aligned.
a = posix_memalign((void **)&syn0, 128,
                   (long long)vocab_size * layer1_size * sizeof(real));
...
// Fill syn0 using word2vec's inline LCG: (next_random & 0xFFFF) / 65536 is
// roughly uniform in [0, 1), so each weight lands in [-0.5/layer1_size, 0.5/layer1_size).
for (a = 0; a < vocab_size; a++)
  for (b = 0; b < layer1_size; b++) {
    next_random = next_random * (unsigned long long)25214903917 + 11;
    syn0[a * layer1_size + b] =
        (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
  }
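To make that range concrete, here is a small, self-contained sketch (not from the original source; layer1_size is a hypothetical value) that runs the same generator and prints the observed minimum and maximum weight:

#include <stdio.h>

int main(void) {
  unsigned long long next_random = 1;
  const int layer1_size = 100;  /* hypothetical HIDDEN size */
  float lo = 1.0f, hi = -1.0f;
  for (int i = 0; i < 1000000; i++) {
    next_random = next_random * 25214903917ULL + 11;
    float w = (((next_random & 0xFFFF) / (float)65536) - 0.5f) / layer1_size;
    if (w < lo) lo = w;
    if (w > hi) hi = w;
  }
  /* prints values close to -0.005 and 0.005, i.e. +-0.5 / layer1_size */
  printf("min = %g, max = %g\n", lo, hi);
  return 0;
}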
- syn1neg is a global variable that is zero-initialized. It has dimensions VOCAB x HIDDEN. This layer also holds hidden representations of word vectors, used when a word appears as a negative sample.
a = posix_memalign((void **)&syn1neg, 128,
(long long)vocab_size * layer1_size * sizeof(real));
...
for (a = 0; a < vocab_size; a++)
for (b = 0; b < layer1_size; b++) syn1neg[a * layer1_size + b] = 0;
- neu1e is a temporary per-thread buffer (remember that the word2vec C code uses CPU threads for parallelism), which is zero-initialized. It has dimensions 1 x HIDDEN. (A minimal sketch of the per-thread pattern follows the allocation below.)
real *neu1e = (real *)calloc(layer1_size, sizeof(real));
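For context, here is a minimal, self-contained sketch of that per-thread pattern. The names (train_thread, HIDDEN, NUM_THREADS) are hypothetical; the original source uses its own thread function, but the point is the same: each worker owns its own scratch buffer, so no locking is needed.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define HIDDEN 100
#define NUM_THREADS 4

/* Each worker allocates its own neu1e, so accumulation needs no synchronization. */
static void *train_thread(void *id) {
  float *neu1e = (float *)calloc(HIDDEN, sizeof(float));
  /* ... a training loop would accumulate error terms into neu1e here ... */
  printf("thread %ld owns buffer %p\n", (long)id, (void *)neu1e);
  free(neu1e);
  return NULL;
}

int main(void) {
  pthread_t pt[NUM_THREADS];
  for (long a = 0; a < NUM_THREADS; a++)
    pthread_create(&pt[a], NULL, train_thread, (void *)a);
  for (long a = 0; a < NUM_THREADS; a++) pthread_join(pt[a], NULL);
  return 0;
}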
§ Backpropagation
Throughout word2vec, no 2D arrays are used. Indexing of the form arr[word][ix] is written manually as arr[word * layer1_size + ix]. So I will call word * layer1_size the "base address" and ix the "offset" of the array index expression henceforth.
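A small self-contained sketch (illustrative only, with hypothetical VOCAB and HIDDEN sizes) showing that the flattened index addresses exactly the same element as true 2D indexing:

#include <assert.h>

#define VOCAB  3
#define HIDDEN 4

int main(void) {
  float flat[VOCAB * HIDDEN];
  float (*as2d)[HIDDEN] = (float (*)[HIDDEN])flat;  /* same memory viewed as 2D */
  for (int i = 0; i < VOCAB * HIDDEN; i++) flat[i] = (float)i;
  for (int word = 0; word < VOCAB; word++)
    for (int ix = 0; ix < HIDDEN; ix++)
      /* "base address" word * HIDDEN plus "offset" ix */
      assert(flat[word * HIDDEN + ix] == as2d[word][ix]);
  return 0;
}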
Here, l1 is the base address of the word at the center of the window (the focus word). l2 is the base address of either a word that is negative-sampled from the corpus, or the word that is a positive sample from within the context window. label tells us whether the sample is positive or negative: label = 1 for positive samples, and label = 0 for negative samples. (A self-contained paraphrase of a single sample's update follows the excerpt below.)
// Clear the per-thread error accumulator.
for (c = 0; c < layer1_size; c++) neu1e[c] = 0;
...
// One pass over the positive sample plus `negative` negative samples.
if (negative > 0) for (d = 0; d < negative + 1; d++) {
  ...
  // f = dot product of the focus vector and the sample's syn1neg vector.
  for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
  // g = (label - table lookup for f) * alpha, with f clamped to [-MAX_EXP, MAX_EXP].
  if (f > MAX_EXP) g = (label - 1) * alpha;
  else if (f < -MAX_EXP) g = (label - 0) * alpha;
  else g = (label - expTable[(int)((f + MAX_EXP) *
                                   (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
  // Accumulate the update for syn0; update syn1neg immediately.
  for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
  for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
}
// Apply the accumulated update to the focus word only after all samples.
for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
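To keep the roles of l1, l2, and label straight, here is a self-contained paraphrase of a single sample's update. It is an illustrative sketch, not the author's code: ns_update is a hypothetical helper, HIDDEN stands in for layer1_size, and an exact sigmoid replaces the expTable lookup.

#include <math.h>

#define HIDDEN 100

/* focus plays the role of l1 / layer1_size, target the role of l2 / layer1_size;
 * label is 1 for the positive (context) sample and 0 for a negative sample. */
void ns_update(float *syn0, float *syn1neg, long focus, long target,
               int label, float alpha, float *neu1e) {
  long l1 = focus * HIDDEN, l2 = target * HIDDEN;
  float f = 0.0f, g;
  for (int c = 0; c < HIDDEN; c++) f += syn0[l1 + c] * syn1neg[l2 + c];
  g = ((float)label - 1.0f / (1.0f + expf(-f))) * alpha;
  for (int c = 0; c < HIDDEN; c++) neu1e[c] += g * syn1neg[l2 + c];  /* deferred syn0 update */
  for (int c = 0; c < HIDDEN; c++) syn1neg[l2 + c] += g * syn0[l1 + c];
}

A caller would invoke this once with label = 1 for the context word and once per negative sample with label = 0, then add neu1e into syn0[l1 + _] afterwards, mirroring the last line of the excerpt above.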
- We have two vectors for each word, one called syn0[l1 + _] and the other syn1neg[l2 + _]. The syn1neg word embedding is used whenever a word is used as a negative sample, and is not used anywhere else. Also, the syn1neg vector is zero-initialized, while the syn0 vectors are randomly initialized.
- The values we backprop with, g * syn1neg[l2 + _] and g * syn0[l1 + _], are not the correct gradients of the error term! The derivative of a sigmoid is dsigmoid(x)/dx = sigmoid(x) [1 - sigmoid(x)]. The [1 - sigmoid(x)] factor is nowhere to be seen, let alone the fact that we are using sigmoid(2x - 1) and not the regular sigmoid. Very weird.
- We hold the value of syn0 constant throughout all the negative samples, which was not mentioned in any tutorial I've read.
The paper does not mention these implementation details, and neither does any blog post that I've read. I don't understand what's going on, and I plan on updating this section when I understand this better.