Word Vectors and Word2Vec in NLP

Aman S
9 min read · Dec 12, 2022

Table of Contents
- Word vectors
- Word2vec
- Idea behind the word2vec
- Understand the difference between Center word and Context word
- Word2vec algorithm family: Skip-gram model, Skip-gram model with negative sampling, Continuous Bag of Words
- Stemming, Lemmatization.

  1. A central question in NLP: how do we build computational systems that get better and better at predicting words?
  2. GPT-3, or Generative Pretrained Transformer 3, is a language processing AI model developed by OpenAI. It is one of the most powerful language processing models currently available, with the ability to generate human-like text and perform a wide range of natural language processing tasks.
  3. It uses its vast knowledge of the English language and its ability to process and understand the context of text to predict what words are likely to come next in a given sequence. This allows it to generate coherent and fluent text that reads like it was written by a human.
Image from Stanford's NLP YouTube video.

How do we have usable meaning in a computer?

Answer:
Common NLP solution: use WordNet, a thesaurus containing lists of synonym sets and hypernyms ("is a" relationships). This has several problems:

  • Synonym sets are only correct in some contexts; words listed as synonyms rarely mean exactly the same thing.
  • Misses new meanings and slang senses of words (e.g., wicked, badass, nifty, wizard, genius, ninja, bombest).
  • Subjective.
  • Requires human labor to create and adapt.
  • Can’t compute accurate word similarity.

Word vectors

  • In natural language processing, word vectors are mathematical representations of words that capture the meaning and context of the word in a numerical form.
  • These vectors are typically created by training a model on a large corpus of text, and they capture the relationships between words based on how they are used in the text.
  • For example, the word vectors for the words “cat” and “kitten” would be similar, because they are both related to the concept of a small, furry animal.
  • Word vectors are useful for a variety of natural language processing tasks, such as language translation, text summarization, and sentiment analysis.

Examples:

  • “cat” and “kitten” might have similar word vectors, because they are both related to small, furry animals.
  • “dog” and “cat” might have different word vectors, because they are different types of animals.
  • “happy” and “sad” represent opposite emotional states, although in practice their vectors are often still close, because the two words appear in very similar contexts.
  • “apple” and “fruit” might have similar word vectors, because they are both related to food.

These are just examples, and the exact relationships between different words will depend on the specific model and training data used to create the word vectors. However, the general idea is that word vectors capture the meaning and context of words in a numerical form, allowing them to be used for various natural language processing tasks.
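As a toy illustration of what “similar vectors” means, cosine similarity is the usual closeness measure. The 4-dimensional vectors below are invented purely for illustration; real word vectors come from a trained model and typically have 50–300 dimensions.

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 = very similar, close to 0 = unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical vectors, made up purely for illustration.
cat    = np.array([0.9, 0.8, 0.1, 0.0])
kitten = np.array([0.85, 0.75, 0.2, 0.05])
car    = np.array([0.0, 0.1, 0.9, 0.8])

print(cosine_similarity(cat, kitten))  # high: related words
print(cosine_similarity(cat, car))     # low: unrelated words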

Word2vec

  • Word2vec is a method for creating word vectors, which are mathematical representations of words used in natural language processing.
  • In word2vec, each word is represented by two vectors: one used when the word is the center word, and one used when it appears as a context word. These vectors are learned during training and capture the meaning and context of the words.

Idea behind the word2vec

  • To use a neural network to learn the relationships between words based on the contexts in which they appear in a large corpus of text.
  • This allows the model to capture the meaning and context of words in a numerical form, known as word vectors.

Example:

  • Suppose we have a corpus of text that contains the following sentence: “The cat sat on the mat.” We want to use word2vec to create word vectors for this sentence.
  • First, we train a neural network on the corpus using the continuous bag-of-words or skip-gram algorithm. The network learns to represent words in a high-dimensional space, where semantically similar words are mapped to nearby points.
  • Next, for each word in the sentence “The cat sat on the mat,” we simply look up the vector the trained network has learned for it; the word vectors are a by-product of training rather than something computed fresh for each sentence.
  • For example, the word vector for “cat” might capture the fact that it is a small, furry animal. The word vector for “mat” might capture the fact that it is a flat object that is used to sit or lie on. These word vectors capture the meaning and context of the words, allowing them to be used for various natural language processing tasks.
  • This is a very simplified example, but it illustrates the basic idea behind how word2vec works. By training a neural network on a large corpus of text, word2vec is able to learn the relationships between words and generate word vectors that capture their meaning and context.
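As a practical illustration, here is a sketch using the gensim library (this assumes gensim version 4 or later is installed; the tiny corpus below is far too small to produce meaningful vectors and is only meant to show the API shape).

from gensim.models import Word2Vec

# A toy corpus: each sentence is a list of tokens (a real corpus would have millions of words).
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "kitten", "sat", "on", "the", "rug"],
    ["the", "dog", "lay", "on", "the", "floor"],
]

# sg=1 selects the skip-gram algorithm; sg=0 would select CBOW.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["cat"][:5])           # first few dimensions of the learned vector for "cat"
print(model.wv.most_similar("cat"))  # words whose vectors are closest to "cat"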

Word2vec: objective function

The exact form of the objective function will depend on the specific implementation of word2vec, but in general it will involve measuring the accuracy of the model’s predictions and using this information to adjust the model’s parameters. The goal is to find the values of the model’s parameters that result in the most accurate word vectors.

In other words, the objective function of word2vec defines how well the model is able to capture the meaning and context of words in its word vectors. By optimizing this function, the model can learn to generate more accurate and useful word vectors.
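For concreteness, the standard skip-gram objective (as presented in the Stanford lecture this post follows) is the average negative log-likelihood of the context words within a window of size m around each position t in a corpus of T words:

J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t ; \theta)

where the probability of a context (outside) word o given a center word c is a softmax over the vocabulary V:

P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}

Here v_c is the vector of the center word and u_o is the vector of the context word, which is why each word carries two vectors.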

Understand the difference between Center word and Context word

  • In the context of word2vec, the center word is the word at the current position in the text, and the context words are the words that appear in a window around it.
  • For example, if the sentence is “The cat sat on the mat,” the center word might be “sat” and the context words would be “The,” “cat,” “on,” “the,” and “mat.” In the CBOW variant the context words are used to predict the center word, while in skip-gram the center word is used to predict the context words; either way, the model learns to represent the relationships between words in numerical form.
  • The center word and context words are an important part of the word2vec model, as they are used to train the model and generate word vectors that capture the meaning and context of words. By using the context words to predict the center word, the model is able to learn the relationships between words and generate accurate word vectors.
  • The general form of the softmax function is as follows:

softmax(x)_i = exp(x_i) / sum_j exp(x_j)

  • where x is a vector of inputs, and x_i is the ith element of the vector. The function maps each element of the input vector to a value between 0 and 1, such that the sum of all the values is equal to 1. This makes it useful for generating class probabilities in classification tasks.
  • Overall, the softmax function is a powerful tool for mapping the outputs of machine learning models to probabilities, and it is widely used in a variety of applications.
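A minimal sketch of the softmax in code (this assumes NumPy, which is not used elsewhere in the post but keeps the example short):

import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))        # roughly [0.659, 0.242, 0.099]
print(softmax(scores).sum())  # 1.0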

Word2vec algorithm Family

The word2vec family of algorithms includes the continuous bag-of-words (CBOW) model and the skip-gram model, which are trained using different objectives but share a similar architecture.

1. Skip-gram model

  • The skip-gram model is a neural network-based technique for natural language processing (NLP) tasks, such as language modeling and word representation learning.
  • It takes a word as input and predicts the surrounding words in the context of the input word. The skip-gram model uses a shallow, two-layer neural network with one input layer and one output layer.
  • The input layer represents the input word, and the output layer represents the surrounding context words. The model uses a technique called “word embedding” to map words to low-dimensional, continuous vector representations, which capture the semantic and syntactic similarities between words.
  • During training, the model uses a softmax activation function on the output layer to predict the probabilities of the context words given the input word, and it minimizes the cross-entropy loss between the predicted probabilities and the true context words.
  • However, the main problem is the denominator of the softmax, a normalizing factor that has to be computed over the entire vocabulary. Since the vocabulary can reach hundreds of thousands or even several million words, this computation becomes very expensive. This is where negative sampling comes into play and makes training feasible (the sketch below shows why the full softmax is costly).
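To make the cost of that denominator concrete, here is a rough sketch (not an actual word2vec implementation) of computing P(o | c) with the full softmax over randomly initialized center and context matrices; the vocabulary size and word indices are invented for illustration.

import numpy as np

vocab_size, dim = 10_000, 100           # toy sizes; real vocabularies are far larger
rng = np.random.default_rng(0)
V = rng.normal(size=(vocab_size, dim))  # center-word vectors, one row per word
U = rng.normal(size=(vocab_size, dim))  # context-word vectors, one row per word

center_id, context_id = 42, 7           # arbitrary word indices
scores = U @ V[center_id]               # dot product of the center vector with EVERY context vector
probs = np.exp(scores - scores.max())
probs /= probs.sum()                    # normalizing over the whole vocabulary: the expensive step
print(probs[context_id])                # P(context word | center word)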

2. Skip-gram model with negative sampling

  • It is an extension of the original skip-gram model, which is a neural network that takes a word as input and predicts the surrounding words in the context of the input word. The skip-gram model with negative sampling uses a sample of “negative” words, which are not related to the input word, to improve the efficiency and performance of the model.
  • Instead of predicting a full probability distribution over the vocabulary with a softmax, the skip-gram model with negative sampling turns each training step into a small set of binary classification problems scored with the logistic (sigmoid) function.
  • For every observed (center word, context word) pair, a handful of “negative” words are drawn at random from the vocabulary; these are unrelated to the input word. The model is trained to give a high score to the true pair and low scores to the negative pairs, minimizing a binary cross-entropy loss. Because only the true context word and the few sampled negatives are updated at each step, the expensive normalization over the whole vocabulary is avoided.
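A rough sketch of the negative-sampling loss for a single (center, context) pair, again with invented sizes and indices; this illustrates the objective, not gensim's actual implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100
V = rng.normal(scale=0.1, size=(vocab_size, dim))   # center-word vectors
U = rng.normal(scale=0.1, size=(vocab_size, dim))   # context-word vectors

center_id, context_id = 42, 7                       # a true (center, context) pair
negative_ids = rng.integers(0, vocab_size, size=5)  # 5 randomly sampled negative words

v_c = V[center_id]
pos_score = sigmoid(U[context_id] @ v_c)            # pushed towards 1 during training
neg_scores = sigmoid(-(U[negative_ids] @ v_c))      # pushed towards 1, i.e. negatives scored low

loss = -np.log(pos_score) - np.log(neg_scores).sum()
print(loss)  # only 1 + 5 dot products, instead of one per vocabulary word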

3. Continuous Bag of Words

  • The continuous bag-of-words (CBOW) model is the mirror image of skip-gram: it takes the context words in a window as input, averages (or sums) their vectors, and uses the result to predict the center word. It is typically faster to train than skip-gram and works well for frequent words.
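A minimal sketch of the CBOW prediction step described above (context vectors are averaged and then scored against every vocabulary word; all sizes and indices are invented for illustration).

import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100
W_in = rng.normal(size=(vocab_size, dim))   # input (context) vectors
W_out = rng.normal(size=(vocab_size, dim))  # output (center) vectors

context_ids = [3, 17, 256, 1024]            # words surrounding the position being predicted
h = W_in[context_ids].mean(axis=0)          # average the context-word vectors

scores = W_out @ h                          # score every vocabulary word as the candidate center word
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(probs.argmax())                       # index of the predicted center word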

Stemming

  • Stemming is an NLP technique that reduces inflected words to their root forms.
  • It is the process of cutting inflected words down to their word stem.
  • The stem may not be a legitimate word in the language.
  • For example, the stem of trouble, troubling, troubled, and troubles is troubl, which is not a recognized word.
import nltk
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

nltk.download('punkt')      # sentence/word tokenizer models
nltk.download('stopwords')  # stopword lists

## We remove stopwords (the, of, them, ...) because they are very frequent
## but carry little meaning for this kind of processing.

# `paragraph` is assumed to hold the input text; a small example is used here.
paragraph = "The cat sat on the mat. The troubled kitten was troubling the dogs."

## Tokenize
sentences = nltk.sent_tokenize(paragraph)  # split the text into sentences
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

## Stemming
for i in range(len(sentences)):                   # for all the sentences
    words = nltk.word_tokenize(sentences[i])      # word tokenize each sentence
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)

pd.DataFrame(sentences)

Lemmatization

  • The purpose of lemmatization is the same as that of stemming, but it overcomes the drawbacks of stemming.
  • Lemmatization returns a meaningful word (the lemma) rather than a possibly invalid stem.
  • Lemmatization takes more time than stemming because it looks up a meaningful dictionary form.
  • Lemmatization is used to get valid words, since the actual word is returned, giving a meaningful representation.
import nltk
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# `paragraph` is assumed to hold the input text, as in the stemming example.
paragraph = "The cat sat on the mat. The troubled kitten was troubling the dogs."

sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)

pd.DataFrame(sentences)
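To see the difference between stemming and lemmatization on concrete words, here is a quick comparison; the outputs in the comments are what NLTK's PorterStemmer and WordNetLemmatizer typically return (given the downloads from the snippet above).

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("troubles"))                  # 'troubl'  -> not a real word
print(lemmatizer.lemmatize("troubles"))          # 'trouble' -> a valid dictionary word

print(stemmer.stem("running"))                   # 'run'
print(lemmatizer.lemmatize("running", pos="v"))  # 'run' (a part-of-speech hint helps the lemmatizer)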

More of this NLP series will be covered in upcoming blogs. These blogs are part of my learning process, so feel free to comment if you find any mistakes.

Check out my other blogs
