
Using SpaCy to Generate Synonyms and Grammatical Variations

Recently, I was working on a Natural Language Processing (NLP) project where I needed variations and synonyms for specified words or phrases. Effectively, I needed to create a scored list, where each item was either a variation or a synonym of a specified word/term, along with a score indicating how closely that synonym matched the primary word. As I had several thousand words to create synonym lists for, the selection process needed to be automated. While NLP techniques have been an area of rapid research progress in recent years, most of the advances have been at the sentence and document level, not single words/phrases. In fact, this project may have been easier if it did involve sentences or documents instead of these shorter terms. My first thought for this project was to use a good dictionary/corpus or to look at very large, common-usage "Word2Vec" (word-to-vector) models.

SpaCy is one of my favorite Natural Language Processing packages and, for that reason, I started my search there. SpaCy provides a wide range of useful tools and, at this point, in my opinion, has surpassed NLTK in speed, performance, ease of use, and range of tools. Unfortunately, after looking around for a while, I did not find any predefined system that automatically tackled my specific problem. The closest built-in option was a "similarity" score between two different words/phrases.
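
A comparison along these lines might look like the following. This is a minimal sketch, assuming the "en_core_web_lg" model discussed below has already been downloaded (e.g. with "python -m spacy download en_core_web_lg"):

import spacy

# Load SpaCy's large English model, which includes word vectors
nlp = spacy.load("en_core_web_lg")

# Wrapping each word in the nlp object attaches its vector from the model
print(nlp("cat").similarity(nlp("dog")))  # roughly 0.8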

SpaCy offers several different English corpora, each providing vectors from a predefined Word2Vec algorithm trained on the Common Crawl. Having tested a number of general-purpose Word2Vec systems, I have found that SpaCy's "en_core_web_lg" corpus actually provides the best results, even better than the infamous Google News 300-dimensional vectors. The code above shows the similarity of "cat" and "dog" to be about 0.8, where 0 means least similar and 1 means essentially identical. Thus, we can assume "cat" is used in a fairly similar manner to "dog", which shouldn't be surprising.

A Word2Vec model is a large but shallow neural network that takes every word in the desired corpus as input, uses a single large hidden layer (commonly 300 dimensions), and then attempts to predict the correct word from a softmax output layer, depending on the type of Word2Vec model (CBOW or Skip-Gram). For our purposes, the hidden layer acts as a vector space for all words, where words that occupy nearby positions in the vector space also have similar usage/meaning. It has also been shown that relationships between words can even be expressed through arithmetic on these vectors.
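
As an illustration of those vector relationships, the classic "king - man + woman is close to queen" analogy can be reproduced directly from SpaCy's vectors. This is a minimal sketch, again assuming en_core_web_lg is installed; the candidate words are purely illustrative:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")

# Vector arithmetic: the result should land near "queen" in the space
result = nlp.vocab["king"].vector - nlp.vocab["man"].vector + nlp.vocab["woman"].vector

def cosine(a, b):
    # Cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for candidate in ["queen", "prince", "table", "dog"]:
    print(candidate, round(cosine(result, nlp.vocab[candidate].vector), 3))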

When using the SpaCy corpora, each word becomes a SpaCy NLP object. These objects have many assigned attributes, like a root lemma and a part of speech, but if the word is part of the corpus, the object also contains the 300-dimensional vector from SpaCy's Word2Vec model. The SpaCy similarity function uses these vectors to determine word similarity, which is why both words in the code above need to be wrapped in the NLP object. By wrapping both "cat" and "dog", we are able to quickly index the correct vectors and rapidly determine similarity.
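
For example, the attributes described above can be inspected directly on a wrapped word. This is a small sketch using the standard SpaCy token API, with the example word and printed attributes chosen purely for illustration:

import spacy

nlp = spacy.load("en_core_web_lg")
token = nlp("running")[0]   # the first (and only) token in the Doc

print(token.lemma_)         # root lemma, e.g. "run"
print(token.pos_)           # part of speech, e.g. "VERB"
print(token.has_vector)     # True if the word is in the model's vocabulary
print(token.vector.shape)   # (300,) for en_core_web_lg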

However, as my problem required getting a large list of similar words, I could not simply suggest a few words to compare with the SpaCy similarity function. What I really needed to do was determine the distance from my input words to every word in the corpus and then find which words/phrases are closest. It's simply Pythagoras' theorem with 300 dimensions and tens of millions of words! Luckily, SpaCy makes it convenient to acquire all words and their associated vectors, so this is possible. Unfortunately, computing the distance from a desired term to every word in the corpus is not very time- or processing-efficient. This is particularly an issue as the project I am working on will be used frequently and should preferably return nearly instant results. Therefore, waiting for the mass calculation was unacceptable.
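
For reference, a brute-force version of that calculation might look something like the sketch below. The target word "cat" and the ten-neighbor cutoff are illustrative choices, and scanning every vector in the vocabulary like this is exactly the slow part described above:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")
target = nlp.vocab["cat"].vector

# Gather every word that has a vector in the model's vocabulary
keys = list(nlp.vocab.vectors.keys())
words = [nlp.vocab.strings[k] for k in keys]
vectors = np.array([nlp.vocab.vectors[k] for k in keys])

# Euclidean (Pythagorean) distance from the target to every word at once
distances = np.linalg.norm(vectors - target, axis=1)
closest = np.argsort(distances)[:10]
print([(words[i], round(float(distances[i]), 3)) for i in closest])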

To solve this problem, I turned to more data science! If there is too much data to process at once, break it into smaller chunks. So, I decided to use HDBSCAN to cluster the data into partitions. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) groups sets of data points that are closely packed together. Using this method, areas with high density get clustered together, ensuring that words end up alongside their best synonyms, whereas unrelated terms are set aside as noise, an "other" group. With the corpus broken into smaller parts and each word assigned a cluster, I was able to determine the distances between words more efficiently. This significantly speeds up processing time, as the calculations are only performed on a small fraction of the original corpus.
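
A rough version of that clustering step is sketched below, using the hdbscan package alongside SpaCy. The min_cluster_size value is an illustrative choice, and filtering to lowercase alphabetic words is simply to keep the example manageable:

import numpy as np
import spacy
import hdbscan  # pip install hdbscan

nlp = spacy.load("en_core_web_lg")

# Collect lowercase alphabetic words that have vectors in the vocabulary
words, rows = [], []
for key in nlp.vocab.vectors.keys():
    lex = nlp.vocab[nlp.vocab.strings[key]]
    if lex.has_vector and lex.is_lower and lex.is_alpha:
        words.append(lex.text)
        rows.append(lex.vector)
vectors = np.array(rows)

# Density-based clustering; label -1 marks noise, i.e. the "other" group
clusterer = hdbscan.HDBSCAN(min_cluster_size=20, metric="euclidean")
labels = clusterer.fit_predict(vectors)

# Map each cluster label to its member words so that later distance
# calculations only need to search one small partition
clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)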
