Learning to talk about wine using JavaScript

Elchanan Shor
Published in Frontend Weekly
8 min read · Jan 17, 2018

Wine is a big part of my life. My family has owned a winery for over 170 years, and my main income comes from the wine business. You could say wine flows in my veins. So when I was looking for interesting data sets to demonstrate data science using JavaScript and stumbled upon the wine magazine data set, I latched onto it.

My goal in these posts is to demonstrate using JavaScript for data science. In my previous post I showed how to use Jupyter notebooks with JavaScript to generate visualizations and analyze data. In this post I will focus on textual analysis, trying to understand which terms are used to describe wine in general and specific varieties in particular. The tutorial is intended for JavaScript programmers with no data science knowledge. I rely here on the dstools NPM module for handling the data and on Jupyter notebooks for interaction and visualization. However, it is possible to execute the code in plain JavaScript files or just in the REPL. For more information about using Jupyter notebooks and the dstools package, have a look at my previous post. The notebook I used for this post is on GitHub.

The Wine Reviews Data Set

Kaggle is a great place to find interesting data sets and analysis of them. If you are into data science you should have a close look at the site. The Wine Reviews data set is a collection of wine reviews scraped from the site of Wine Enthusiast, one of the leading wine magazines. The data contains several files. We will use the file winemag-data-130k-v2.csv. You can download it from here.

Loading the Data

In this tutorial, we will use dstools, a package I developed to help with managing and analyzing data. The package is described in my first post. In short, it is a chaining-based library for managing and visualizing data. We will start by loading the library:

const Collection = require('dstools').Collection;

Next, we will load the data using the dstools function loadCSV:

data = Collection().loadCSV('/path/to/file/data/winemag-data-130k-v2.csv')

It is always a good approach to first have a look at the data. The head function selects the first n rows (default 5). The show function displays a table with the first 5 rows in the Jupyter notebook.

data.head().show()

This is a screenshot of the table. It has 14 fields, so some scrolling is needed to see them all.

We will focus on the description field and the variety field. If you are not working with the Jupyter notebook, you can generate HTML for the table, save it to a file and then open the file.

data.head().table().save('my-table-html.html');

Here is how you get the list of fields on Jupyter:

So, let’s start answering the question: which terms are used to describe wine?

Simple Term Count

The easiest and most straightforward analysis we can do is to count the terms in the description and choose the terms with the highest count. The dstools function terms is used to “break” the description field into the terms it contains. It uses a bag of words model, representing the description field as a bag of words in which we ignore the order of the terms and their syntactic and semantic context. We only consider which terms appear in the description and how often they appear. The output of this function is a collection of terms with their counts. The only other thing we will do is omit stop words: function words such as “it”, “the” and “for” that do not add any information about the described wine. Here is the code for the described query:

data
  .terms({field:'description'}) // extract terms in field description
  .dropStopwords('term') // remove stop words
  .sortDesc('count') // sort by count of terms
  .head(5) // choose top 5 terms
  .show(); // show it

Top terms used to describe wine
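
To make the bag of words idea concrete, here is a minimal sketch in plain JavaScript. It is not how dstools implements the terms function. The sample descriptions, the tiny stop word list and the names tokenize and countTerms are all made up for illustration.

```javascript
// Hypothetical sample reviews, not taken from the data set.
const sampleDescriptions = [
  'Ripe fruit flavors and firm tannins on the palate.',
  'The palate shows bright acidity and citrus fruit.',
  'Soft tannins, dark fruit and a long finish.'
];

// Tiny illustrative stop word list; real lists are much longer.
const STOP_WORDS = new Set(['a', 'and', 'on', 'the', 'it', 'for']);

// Split text into lowercase terms, dropping punctuation.
// Word order is ignored from here on (bag of words).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z']+/g) || [];
}

// Count how often each non-stop-word term appears across all descriptions.
function countTerms(descriptions) {
  const counts = new Map();
  for (const description of descriptions) {
    for (const term of tokenize(description)) {
      if (STOP_WORDS.has(term)) continue;
      counts.set(term, (counts.get(term) || 0) + 1);
    }
  }
  // Convert to {term, count} rows sorted by descending count.
  return [...counts.entries()]
    .map(([term, count]) => ({ term, count }))
    .sort((a, b) => b.count - a.count);
}

const topTerms = countTerms(sampleDescriptions).slice(0, 3);
console.log(topTerms);
```

Even on three made-up sentences, a generic term like “fruit” comes out on top, which is the same pattern we see in the real data.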

Word Cloud Visualization

In data science, we always like to visualize the data. A great visualization tool for a list of terms with quantity values is a word cloud. The dstools package has a word cloud function that uses the highcharts library (you can use dstools with any highcharts visualization). We will change the head argument to 50 and append the wordCloud function.

data.terms({field:'description'}).dropStopwords('term')
.sortDesc('count').head(50)
.wordCloud('term','count')//arguments are label and measure
.show();

You can see typical “wine” terms used to describe wine — such as flavor, palate, acidity and tannins. The larger the term, the more frequent it is in wine descriptions.

How to Describe a Cabernet Sauvignon

So we now have a good idea of the terms used to describe wines, but how do the different varietals differ from each other? How do we describe “Cabernet Sauvignon” as opposed to “Chardonnay”? The most obvious method would be to count terms in descriptions of “Cabernet Sauvignon”. The terms function takes a groupBy option. It then counts the terms appearing in each group separately. We will group the terms by “variety”.

These are the most frequent terms used to describe “Cabernet Sauvignon”. The problem: some of these terms can be used to describe any variety. Terms such as “flavors”, “wine” and “fruit” are too generic. We need a method to filter out terms that are not specific enough.
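
The grouped count can be sketched in plain JavaScript as follows. This is only an illustration of the idea, not the dstools implementation of groupBy. The rows in sampleReviews and the function name countTermsByGroup are hypothetical, and stop words are left in to keep the sketch short.

```javascript
// Hypothetical rows mimicking the variety and description fields.
const sampleReviews = [
  { variety: 'Cabernet Sauvignon', description: 'Blackberry and cassis fruit with firm tannins.' },
  { variety: 'Cabernet Sauvignon', description: 'Dark fruit, currant and oak.' },
  { variety: 'Chardonnay', description: 'Tropical fruit, pineapple and lemon.' }
];

// Count terms separately for each value of the grouping field.
function countTermsByGroup(rows, textField, groupField) {
  const groups = new Map(); // group value -> Map(term -> count)
  for (const row of rows) {
    if (!groups.has(row[groupField])) groups.set(row[groupField], new Map());
    const counts = groups.get(row[groupField]);
    const terms = row[textField].toLowerCase().match(/[a-z']+/g) || [];
    for (const term of terms) {
      counts.set(term, (counts.get(term) || 0) + 1);
    }
  }
  return groups;
}

const byVariety = countTermsByGroup(sampleReviews, 'description', 'variety');
// "fruit" shows up in both groups, so it says little about either variety.
console.log(byVariety.get('Cabernet Sauvignon').get('fruit'));
console.log(byVariety.get('Chardonnay').get('fruit'));
```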

TFIDF to the Rescue

In document retrieval, TFIDF is a common measure used to evaluate how important a term is to a document. It consists of two parts: TF, short for term frequency, and IDF, short for inverse document frequency. We will not get into the details of this measure, but in short, TF represents how many times the term appears in the document, and IDF represents how specific the term is: the fewer documents a term appears in, the higher its IDF. Terms with high IDF are more meaningful. In our case, we can use TFIDF as a measure of the relationship between a term and a variety. Going back to our example, the terms function can calculate the TFIDF and IDF measures for each term.

data
.terms({field:'description',groupBy:'variety',calc:'tfidf,idf'}) //calculate tfidf and idf
.dropStopwords('term')
.filterEqual('variety','Cabernet Sauvignon')
.sortDesc('tfidf') //sort by tfidf
.head(30)//top 30 terms
.show()
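
For readers who want to see the arithmetic, here is a minimal sketch of one common TFIDF formulation in plain JavaScript. I am assuming that each variety is treated as one document and that idf = ln(N / df), where N is the number of varieties and df is the number of varieties containing the term. The exact formula dstools uses may differ (smoothed variants are common), and the counts and the function name computeTfidf are made up for illustration.

```javascript
// Hypothetical term counts per variety (each variety is treated as one "document").
const termCountsByVariety = {
  'Cabernet Sauvignon': { fruit: 4, blackberry: 3, cassis: 2 },
  'Chardonnay': { fruit: 5, pineapple: 3, lemon: 2 },
  'Pinot Noir': { fruit: 3, cherry: 4 }
};

// Compute tf, idf and tfidf for every (variety, term) pair.
// Assumption: idf = ln(N / df); libraries often use smoothed variants.
function computeTfidf(countsByGroup) {
  const groupNames = Object.keys(countsByGroup);
  const documentFrequency = {}; // term -> number of groups containing it
  for (const name of groupNames) {
    for (const term of Object.keys(countsByGroup[name])) {
      documentFrequency[term] = (documentFrequency[term] || 0) + 1;
    }
  }
  const rows = [];
  for (const name of groupNames) {
    for (const [term, tf] of Object.entries(countsByGroup[name])) {
      const idf = Math.log(groupNames.length / documentFrequency[term]);
      rows.push({ variety: name, term, tf, idf, tfidf: tf * idf });
    }
  }
  return rows;
}

const cabernetRows = computeTfidf(termCountsByVariety)
  .filter((row) => row.variety === 'Cabernet Sauvignon')
  .sort((a, b) => b.tfidf - a.tfidf);
console.log(cabernetRows);
```

In this toy example “fruit” appears in every variety, so its IDF is zero and it drops to the bottom of the list, while “blackberry” and “cassis” rise to the top.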

We can now see both measures, tfidf and idf, for each of the top terms used to describe “Cabernet Sauvignon”. We can still see some generic terms. The IDF column shows the specificity of each term, so we can filter the generic terms out by keeping only terms with an IDF above 2. We will chain that to a word cloud visualization (with the TFIDF measure) and get the following:

data
.terms({field:'description',groupBy:'variety',calc:'tfidf,idf'})
.dropStopwords('term')
.filterEqual('variety','Cabernet Sauvignon')
.filter((term)=>term.idf>2)
.sortDesc('tfidf')
.head(50)
.wordCloud('term','tfidf',{title:'Word Cloud for Cabernet Sauvignon'})
.show()

Terms such as blackberry, cassis and currant are typically used when describing Cabernet Sauvignon. We will repeat the query for Chardonnay and get a different set of terms.

In the Chardonnay word cloud we can see terms such as “tropical”, “pineapple” and “lemon”, which are typical of Chardonnay descriptions and did not appear in the Cabernet word cloud.

Finding Similar Terms with Word2vec

Word2vec is a model developed by researchers at Google to represent words using vectors (word embeddings). As you probably know, computers understand numbers; they do not understand abstract concepts. Given two terms, the computer does not know if they are similar in meaning. Word2vec converts the terms into numerical vectors, which turns finding similarities between terms into a mathematical problem. Given two vectors, we can calculate the distance between them (cosine distance) and assume that terms whose vector representations are close are also close in meaning. The logic behind the word2vec model is as follows: terms that appear in similar contexts are similar. In our example, the terms “Cabernet” and “cab” (short for Cabernet) will appear in similar contexts. Therefore, their vector representations should be close. We will use the word2vec algorithm to find similarities among terms used to describe wine.
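
To show what “close vectors” means, here is a small sketch of cosine similarity in plain JavaScript (cosine distance is simply one minus this value). The three-dimensional vectors are invented for illustration. Real word2vec vectors have many more dimensions and are learned from the text, and the function name cosineSimilarity is my own.

```javascript
// Hypothetical 3-dimensional vectors; real word2vec vectors are much longer.
const vectors = {
  cabernet: [0.9, 0.1, 0.3],
  cab: [0.8, 0.2, 0.3],
  pineapple: [0.1, 0.9, 0.2]
};

// Cosine similarity: dot product divided by the product of the vector lengths.
// Values near 1 mean the vectors point in almost the same direction.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const cabernetVsCab = cosineSimilarity(vectors.cabernet, vectors.cab);
const cabernetVsPineapple = cosineSimilarity(vectors.cabernet, vectors.pineapple);
console.log(cabernetVsCab, cabernetVsPineapple);
```

With these made-up numbers, “cabernet” is far more similar to “cab” than to “pineapple”, which is the kind of relationship we hope the trained model will capture.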

First, we will need to install the NPM package word2vec. This module is a wrapper around C-based executables, and the package only works on UNIX systems. For information about the executable and relevant arguments, have a look over here.

As a first step, we need to save the descriptions into a text file:

data
.column('description')//get a vector of wine's description field
.toLowerCase()//turn the descriptions into lower case
.merge()//merge all descriptions into one string
.save('wine-descriptions.txt')//save them into a file

Next, we need to train the model, i.e. generate the vector representations for each term. For our purposes, we will use the default values for all model arguments. The code looks like this:

word2vec = require( 'word2vec' );
word2vec.word2vec('wine-descriptions.txt','wine-model.txt');

The function loads the text in the file “wine-descriptions.txt”, generates the vectors and stores them in the output file “wine-model.txt”.

Lastly, we will load the model and find the terms closest in meaning to a few input terms taken from the “Cabernet Sauvignon” and “Chardonnay” description terms:


// load the model from file
word2vec.loadModel('wine-model.txt', function (err, model) {
  if (err) throw err; // stop if the model file could not be loaded
  ['blackberry', 'chocolate', 'tropical', 'mineral', 'green'] // base terms
    .forEach((base) => console.log(base + ': ' +
      // mostSimilar returns the terms most similar to base
      model.mostSimilar(base, 10)
        .map((term) => term.word).join())); // show terms as a list
});

Output from the word2vec similarity function

You can see that word2vec did a pretty good job of identifying the similarity between the different berries (blueberry, raspberry, blackberry). We can learn from it which terms are used to describe green aromas (did you know celery is used to describe wine?) and which tropical fruits are most commonly used to describe wine (passion fruit and kiwi).

Final Words

There are a lot of insights hiding in your text data. Machine learning can help us understand how people talk about and describe any topic. There is a lot yet to be learned by slicing and dicing the data and applying the textual analysis tools described here. We can learn things like how French Cabernet differs from its Californian counterparts, or compare cheap Chardonnays with expensive ones. It is just a matter of defining the research question and applying these data tools. And, of course, a lot of trial and error.
