I am looking at the Keras GloVe word embedding example, and it is not clear to me why the first row of the embedding matrix is populated with zeros.
First, the embeddings index is created, mapping each word to its vector of coefficients:
import os
import numpy as np

embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) as f:
    for line in f:
        # each line holds a word followed by its space-separated coefficients
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, 'f', sep=' ')
        embeddings_index[word] = coefs
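For context, a quick sanity check on the loaded index (a small sketch; I'm assuming 'the' is in the GloVe 6B vocabulary, which has 400,000 entries):

print(len(embeddings_index))          # 400000 for glove.6B
print(embeddings_index['the'].shape)  # (100,), matching glove.6B.100d.txt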
Then the embedding matrix is created by looking up each word from the index built by the tokenizer:
# prepare embedding matrix
num_words = min(MAX_NUM_WORDS, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
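As a quick check against the matrix built above (a minimal sketch), row 0 is indeed never assigned by the loop:

print(np.count_nonzero(embedding_matrix[0]))  # 0, so row 0 stays all zeros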
Since the word indices produced by the Tokenizer start at i = 1 (see the quick check below), the loop never writes row 0, so the first row contains only zeros (or random numbers, if the matrix were initialized differently). Is there a reason for skipping the first row?
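Here is the indexing behavior I am referring to (a toy sketch; I'm using tensorflow.keras, but the standalone Keras Tokenizer behaves the same):

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(['the cat sat on the mat'])
print(tokenizer.word_index)
# {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5} -- no word gets index 0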