Text matching or finding text similarity gives the meaning of finding how similar two text words are. This can be in matching the semantic meaning of the texts or just the letters in words. There are different methods to find the semantic similarity between words.
1.Semantic textual similarity
Two words are semantically similar means that two words gives the similar meaning (synonyms in more or less).
Ex: Adjective tepid and lukewarm have very similar meaning. Therefore, tepid water and lukewarm water gives the same meaning that we can say tepid and lukewarm are semantically similar.
But the question is how we can we teach this thing to a computer machine such that it can identify the semantically similar meanings like our brain does…????
For that, first we need to understand how the computer works. Computer understand only the numbers and it cannot understand the word strings. Therefore, we need to find a method to convert the word strings into number. Here the Word embeddings comes into play!!
1.1 Word Embeddings
In the simplest form, Word embedding is a number/vector representation of a word. These word embeddings can be generated from a shallow neural network trained on large data corpuses which are available as pre-trained word embeddings or using the method of bag of words. Word2vec by Google, Glove word embeddings by Stanford university and fastText by Facebook are the most popular pre-trained word embeddings among the models that are currently available.
1.1.1 Bag of words
Bag of words represents the frequency of the occurrence of a word in a corpus (tally the occurrence of the word).
Ex. first few lines of text from the book “A Tale of Two Cities” by Charles Dickens, taken from Project Gutenberg.
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
Then, extract the unique words from the above sentences as follows.
Then tally the occurrence of words in each sentence. For the first sentence (“It was the best of times”) it is like,
- “it” = 1
- “was” = 1
- “the” = 1
- “best” = 1
- “of” = 1
- “times” = 1
- “worst” = 0
- “age” = 0
- “wisdom” = 0
- “foolishness” = 0
But the problem of this method is when the vocabulary size increases, the vector size increase and the vector may contain lot of zeros which create a sparse matrix (matrix with lot of zeros). This will cost highly and the solution for that may use n- gram method. In n- gram method we get the count of occurrence of n words instead of one word.
Another problem of this method is that the frequency of stop words in English dictionary (words like ‘’a’’, ‘’the’’ etc) may dominate while rare words get less frequency. This effect can be avoided by using Tf-Idf method.
TF-IDF stands for Term-frequency-inverse document frequency. Tf-IDF method is very powerful method when we want to judge or understand the topic of a document/news by using the words it contains. This method is mostly use in recommendation engines as well.
This method also gets the frequency of word counts but, the frequency of stop words is systematically discounted by using inverse-document frequency.
Wx,y = tfx,y * log(N/dfx)
Here tfx,y – frequency of word x in the document y
N – total number of documents
Dfx – number of documents that contain word x
This is the most powerful method and it captures the semantic relationship between words in a satisfactory level. The word embeddings from this method can be obtained using 2 methods as skip gram and common bag of words (CBOW) and both uses a shallow neural network.
In CBOW, neural network learns the context and using the learned knowledge it tries to predict the next word of the context.
Ex. If the context is Have a great day and say if we want the vector representation of word day, then neural network tries to predict the word day after the input word great where the input word is gives as one hot encoded vector. While trying this process for many rounds, neural network learns the vector representation of the word day.
In skip gram method, for an input a word neural network predicts the context. This is totally opposite of what CBOW model does.
Note: The word embeddings from word2vec, glove and FstText is a vector in multidimensional space.
After getting the word embeddings for the words that needs to check the similarity, a similarity measure is needed to quantify that value. Cosine similarity is one such method that uses the angle/orientation between these word embeddings in multidimensional space.
cosine similarity can be used instead of Euclidean distance, because even if the two similar documents are far apart by the Euclidean distance (due to the size of the document), chances are they may still be oriented closer together.
The smaller the angle, higher the cosine similarity. Here, as word embedding B and C are in the same orientation with A. so they should be semantically similar. Therefore, using Euclidean distance in such situations may give wrong results.
2.Other text similarity methods
FuzzyWuzzy is a python library which uses Levenshtein Distance to find the text similarity. Levenshtein distance is the number of minimum transformations needed to convert one text word (Usually the word with smaller length) to other.
These transformations are of 3 types as,
Note: Spacy is also a python package that helps to get text similarity. But for this you need to download their word dictionary (follow this link https://spacy.io/models/en to download the vocabulary) and for more details refer https://spacy.io/usage/vectors-similarity
Pabodha is an undergraduate at University of Colombo (Faculty of Science) major in Applied Statistics. Currently being an Intern at DeepDataInsight, she is an enthusiastic Data Science Learner who is interested in using statistical and Machine learning concepts to make human life easier.