A Practical Intro to Using Spark-NLP BERT Word Embeddings
September 27, 2019
The seemingly endless possibilities of Natural Language Processing are limited only by your imagination... and compute power. What good are ground-breaking word vectors if it takes days to preprocess your data and train a model? Or maybe you already know PySpark but don't know Scala. In this post we'll explore combining PySpark, Spark ML, and Spark-NLP. The assumption throughout the rest of this post is that you have some familiarity with Spark and Spark ML.
What are Word vectors and why is BERT a big deal?
Word vectors (or embeddings) are mappings from words to vectors of numbers. There are several approaches to representing words as vectors. Good vectorization approaches create vectors where similar words have similar vectors. This allows for nearest neighbor searches as well as something known as word vector algebra. The classic word vector algebra example is that in many embeddings vector(“King”) - vector(“man”) + vector(“woman”) ≈ vector(“queen”). Pre-trained vectors can also be used as features in machine learning models, transferring the learning to another domain.
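As a toy illustration of vector algebra and nearest-neighbor search, here is a sketch with made-up 3-dimensional vectors (the vectors and the `nearest` helper are invented for illustration; real embeddings are learned from text and have hundreds of dimensions):

```python
import numpy as np

# Made-up 3-d "embeddings" for illustration only
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.1, 0.1]),
    "queen": np.array([0.9, 0.0, 0.2]),
}

def nearest(target, vocab):
    """Nearest-neighbor search by cosine similarity."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vocab, key=lambda w: cos(vocab[w], target))

# vector("king") - vector("man") + vector("woman") lands closest to "queen"
# (the word "king" itself is excluded from the candidates, as is conventional)
query = vecs["king"] - vecs["man"] + vecs["woman"]
result = nearest(query, {w: v for w, v in vecs.items() if w != "king"})
# result == "queen"
```

With real embeddings the arithmetic is the same, only the dimensionality and the size of the vocabulary grow.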
Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art word embedding technique introduced by researchers at Google in 2018.
Bidirectional → left to right and right to left contexts of words
Encoder → encodes data to a vector
Representations → vectors of real numbers that stand in for words
Transformer → novel model architecture
One of the main advantages of techniques such as BERT, or an earlier similar technique ELMo, is that the vector of a word changes depending on how it is used in a sentence. This allows for much richer meanings of embedded words.
Using Spark-NLP With Pyspark
Check your dependencies and make sure the distributions match. For this tutorial we're using spark-nlp 2.2.1, so both the PyPI package and the Maven package should be version 2.2.1. If the versions don't match, you'll get unexpected behavior.
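A minimal sketch of guarding against a version mismatch (the `coordinate_version` helper is our own invention, not part of Spark-NLP, and the Maven coordinate below is the one we'd expect for the Scala 2.11 build):

```python
# Hypothetical guard: fail fast if the Maven coordinate's version does
# not match the version installed from PyPI (assumed here to be 2.2.1)
PYPI_VERSION = "2.2.1"
MAVEN_COORDINATE = "com.johnsnowlabs.nlp:spark-nlp_2.11:2.2.1"

def coordinate_version(coordinate):
    # Maven coordinates have the form group:artifact:version
    return coordinate.rsplit(":", 1)[-1]

assert coordinate_version(MAVEN_COORDINATE) == PYPI_VERSION

# In a real session you would pass the coordinate when building the
# SparkSession (sketch, not run here):
# spark = (SparkSession.builder
#          .config("spark.jars.packages", MAVEN_COORDINATE)
#          .getOrCreate())
```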
Spark-NLP supports a large number of annotators; the documentation lists them all. There are also a number of different pretrained models you can use if you want GloVe vectors, models trained in other languages, or even a multilingual BERT model.
Combining Spark-NLP with SparkML
Let’s demonstrate a simple example of using document vectors built from a piece of text to train a classifier. Our process will be to:
Average all the vectors in a text into one document vector
Convert that vector to a DenseVector that SparkML can train against
So in just a few steps we’ve managed to write a classifier using state of the art word embeddings. We’ve also managed to do it in a way that can scale to billions of data points, because we’ve used Spark. At Sigma, we’re using the power of word embeddings and Spark to build advanced topic models used to extract incredible insights from news and social media.
What previously required multiple tools and countless man hours, Sigma executes in under 10 seconds, giving your team faster-than-ever, unparalleled insights.