A Practical Intro to Using Spark-NLP BERT Word Embeddings
Sigma Ratings

The seemingly endless possibilities of Natural Language Processing are limited only by your imagination... and your compute power. What good are groundbreaking word vectors if it takes days to preprocess your data and train a model? Or maybe you already know PySpark but don't know Scala. Let's explore combining PySpark, Spark ML, and Spark-NLP. The rest of this post assumes some familiarity with Spark and Spark ML.

What are word vectors, and why is BERT a big deal?

Word vectors (or embeddings) are words mapped to vectors of numbers. There are several approaches to representing words as vectors. Good vectorization approaches create vectors where similar words have similar vectors. This allows for nearest-neighbor searches as well as something known as word vector algebra. The classic example is that in many embeddings, vector(“king”) - vector(“man”) + vector(“woman”) ≈ vector(“queen”). Pre-trained vectors can also be used as features in machine learning models, transferring the learning to another domain.
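The king/queen analogy can be sketched with a few lines of NumPy. The vectors below are tiny made-up examples, not real embeddings (real ones have hundreds of dimensions), but the arithmetic and nearest-neighbor lookup work the same way:

```python
import numpy as np

# Toy 3-dimensional "embeddings" -- invented purely to illustrate the
# algebra; real embeddings (GloVe, BERT) are learned from large corpora.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman should land closest to queen
target = vectors["king"] - vectors["man"] + vectors["woman"]
nearest = max(vectors, key=lambda w: cosine(vectors[w], target))
# nearest == "queen"
```

The nearest-neighbor step is exactly what libraries do over a full vocabulary, just with an index instead of a brute-force `max`.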

Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art word-embedding technique developed by researchers at Google and released in late 2018.

  • Bidirectional → left to right and right to left contexts of words
  • Encoder → encodes data to a vector
  • Representations → vectors of real numbers that stand in for words
  • Transformer → novel model architecture

One of the main advantages of techniques such as BERT, or the earlier, similar technique ELMo, is that the vector for a word changes depending on how it is used in a sentence. This allows for much richer representations of word meaning.

Using Spark-NLP with PySpark

Check your dependencies and make sure the distributions match. For this tutorial we're using spark-nlp 2.2.1, so the PyPI distribution and the Maven distribution should both be 2.2.1. If they don't match, you'll get unexpected behavior.
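Concretely, matching the two sides might look like this (the Maven coordinate below is the John Snow Labs artifact for the Scala 2.11 builds used with Spark 2.4.x; adjust versions to your cluster):

```shell
# Python side: install the matching PyPI release
pip install spark-nlp==2.2.1

# JVM side: pull the matching Maven artifact when starting Spark
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.2.1
```

The version after the last colon in the `--packages` coordinate is the one that must agree with the pip-installed version.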

Pretrained pipelines offer general NLP functionality out of the box.


You can also create your own pipelines using both SparkML and Spark-NLP transformers. 


  • DocumentAssembler → A transformer that converts raw text into a document annotation that other annotators can process
  • Tokenizer → An Annotator that identifies tokens
  • BertEmbeddings → An annotator that outputs BERT word embeddings

Spark-NLP supports many annotators; the full list is in the official documentation. There are also a number of different models available if you want to use GloVe vectors, models trained in other languages, or even a multilingual BERT model.

Combining Spark-NLP with SparkML

Let’s walk through a simple example that uses document vectors built from a piece of text to train a classifier. Our process will be to:

  • Average all the vectors in a text into one document vector
  • Convert that vector to a DenseVector that SparkML can train against
  • Train a logistic regression model


So in just a few steps we’ve written a classifier using state-of-the-art word embeddings, and we’ve done it in a way that can scale to billions of data points because we’ve used Spark. At Sigma, we’re using the power of word embeddings and Spark to build advanced topic models that extract insights from news and social media.


