The Power of Words: Natural Language Processing
Let’s say that you are reading through the sports section of your favorite newspaper or online publication and you come upon a strange article about a sunflower farmer in Arizona. The article has nothing to do with sports, so how did it end up in this section? You start reading out of curiosity and notice that many of the words in the article are words you might also see in an article about baseball.
“Field” appears quite a few times in describing the fields of sunflowers he harvests during the summer. “Diamondback” appears a couple of times because a western diamondback rattlesnake has been a problem for the few chickens he raises for eggs. “Pitch” is mentioned once because he accidentally pitched his lucky harvest-season boots into the trash a couple of days earlier. By sheer coincidence, the word “baseball” appears because he mentions that he always wanted to be a famous baseball player growing up but ended up farming instead. With all of these words combined, the algorithm the newspaper uses to sort articles into sections misclassified this article as sports when it should have gone to local news.
The Power of Words
The situation I just described is one of the many applications of natural language processing (NLP), where words, rather than discrete or continuous numbers, are used in the classification process. The words and their definitions are not fed directly into the machine learning algorithm, because a computer only sees the ASCII representation of each letter that makes up the larger word. A computer can start to make sense of words through techniques like stemming, lemmatization, bag of words, term frequency-inverse document frequency (TF-IDF), and much more. Let’s explore some of these and see how they can be used for natural language processing.
Bag of Words
A bag of words represents a document or piece of text by the frequency of each unique word in that document. Let’s say you have the following document:
The red fox jumped over the other red fox.
You could represent this as the following dictionary:
{the: 2, red: 2, fox: 2, jumped: 1, over: 1, other: 1}
That’s it: that is a bag-of-words representation of a text. It becomes useful for machine learning when you build one for many documents, because you can then identify the most common words in each type of document. Why would this be helpful? Let’s say you are working at a newspaper and you need to label each article that comes through as comedy, local news, entertainment, or one of many other categories. You basically have two options for performing this task.
The first approach is to read each document and take your best guess at which category it belongs to. The other approach is to run each document through a function that creates a bag of words for each article and then make a judgment call based on the most frequent words. One could assume that sports articles will contain the words “ball”, “stadium”, and “player” most often, whereas a political article might contain the words “president”, “law”, and “government” most frequently. This can, however, lead to problems like the one with the sunflower farmer, where words mean different things in different contexts.
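To make this concrete, here is a minimal sketch (my own, not from any newspaper’s system) of building a bag of words in Python with the standard library’s collections.Counter, applied to the example sentence above:

```python
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, drop periods, split into words, and count each unique word."""
    words = text.lower().replace(".", "").split()
    return Counter(words)

document = "The red fox jumped over the other red fox."
print(bag_of_words(document))
# Counter({'the': 2, 'red': 2, 'fox': 2, 'jumped': 1, 'over': 1, 'other': 1})
```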
Stemming and Lemmatization
With bag of words we run into the problem that different forms of a word have essentially the same meaning but are not registered as the same word. For example, “baseball” and “baseballs” are the same word with different suffixes, yet the bag of words treats them as completely different words, which weakens how well the bag of words represents the text. A document that contains the word “baseball” 10 times is almost certainly about baseball, but if it contains “baseball” 5 times and “baseballs” 5 times, each word sits lower on the list of most frequent words and gives less information.
Stemming is one way around that problem: it reduces each word to its root by stripping prefixes and suffixes, leaving just the stem of the word, even if the stem is not a real word. Words like “change”, “changing”, “changes”, and “changer” will be stemmed to “chang”. This does, however, cause problems for words with more complex derivations. For example, “business” and “busy” can be stemmed to the same root, “bus”, even though neither word has anything to do with a bus or with each other. A more difficult but more intuitive method is lemmatization.
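As a quick illustration, here is a small sketch using NLTK’s PorterStemmer, one of several existing stemmers; the exact roots you get back depend on which stemmer you choose, so treat the example as illustrative rather than definitive:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["change", "changing", "changes", "changer", "business", "busy"]:
    # Strip the suffix and print whatever root this particular stemmer produces
    print(word, "->", stemmer.stem(word))
```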
Lemmatization does essentially what stemming does, but with a better understanding of the English language and of how words change from one context to another. In the case of “changing” and “changes”, both are reduced to the real word “change”. For lemmatization to work, you must provide the algorithm with a set of rules it can follow to decide what is the root and what is fluff. That rule set can be extremely complicated, because the English language is quite complicated, so there are many existing lemmatization implementations that I would recommend over building one yourself.
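One of those existing implementations is NLTK’s WordNetLemmatizer. The sketch below assumes the WordNet data can be downloaded and that you supply a part-of-speech hint for verb forms; it is just one possible choice of library:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # dictionary data the lemmatizer relies on

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("changes"))            # treated as a noun by default -> "change"
print(lemmatizer.lemmatize("changing", pos="v"))  # hint that the word is a verb -> "change"
```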
Stop Words
After you have completed your bag of words, stemming, and lemmatization, you will have a dictionary (it can also be a list) of all the unique words in the text and the frequency of each one. There is a problem with this, and it comes from the way English text is written. Words like “a”, “the”, and “it” will occur more frequently than almost any other words in nearly every text. These words are known as stop words, and for a bag of words to convey any information they must be removed. There are predefined stop word lists containing hundreds of words that add no meaning to the text. (In some cases the number of stop words can itself be a useful feature for prediction.)
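As an illustrative sketch, the earlier word counts can be filtered against NLTK’s predefined English stop word list; the filtering logic here is my own minimal example:

```python
import nltk
from collections import Counter
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))  # a predefined list of low-information words

document = "The red fox jumped over the other red fox."
words = document.lower().replace(".", "").split()

# Count only the words that are not stop words
counts = Counter(word for word in words if word not in stop_words)
print(counts)  # Counter({'red': 2, 'fox': 2, 'jumped': 1})
```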
Term Frequency — Inverse Document Frequency (TF-IDF)
Now that we have processed our documents and have them in a machine readable form, we can start to get actionable insight. TF-IDF is a way to identify the words in a document that can convey meaning in that document. This is used when you have a collection of documents (also known as a corpus) that will be classified by theme, author, or any other tag.
The basic idea behind TF-IDF is that if a word appears frequently in the document you are analyzing but infrequently in the other documents, it must be important to that document’s meaning. For example, if the word “baseball” appears 10 times in the current document but on average appears 0–1 times in the other documents, the article is most likely about baseball. If the word “temperature” appears only a few times in the current document, and appears about the same number of times on average in the other documents, it does not give much information. The calculations are:
TF(t, d) = (# of occurrences of term t in document d) / (total # of terms in document d)
IDF(t, D) = log_e((total # of documents in corpus D) / (# of documents containing term t))
TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)
where:
- t is the term or word you want to analyze
- d is a single document
- D is the corpus, the collection of all documents
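The formulas above translate almost directly into code. The sketch below is a minimal, assumed implementation run on a made-up three-document corpus; in practice a library such as scikit-learn’s TfidfVectorizer (which applies extra smoothing) is usually preferred:

```python
import math

def tf(term, document):
    """TF(t, d): occurrences of the term divided by the total number of terms in the document."""
    words = document.lower().split()
    return words.count(term) / len(words)

def idf(term, corpus):
    """IDF(t, D): natural log of (total documents) over (documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

# A tiny, invented corpus, just for illustration
corpus = [
    "the pitcher threw the baseball past the batter",
    "the actress thanked the director on stage",
    "the farmer checked the temperature of the field",
]

print(tf_idf("baseball", corpus[0], corpus))  # rare across the corpus -> higher score
print(tf_idf("the", corpus[0], corpus))       # appears in every document -> score of 0
```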
Time for Classification
The way all of this is used for classification is fairly simple compared to other machine learning algorithms. The algorithm identifies which important words (words found to convey meaning using TF-IDF) are associated with each label and uses this as a guide. If it comes upon an article containing “pitcher” or “baseball”, it classifies it as sports. If it comes upon an article containing “actress” or “director”, it classifies it as entertainment. This is how words are used to classify the documents they are part of.
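To sketch how this might look end to end, here is a small, assumed example using scikit-learn’s TfidfVectorizer together with a Naive Bayes classifier; the articles and labels below are invented purely for illustration and a real system would train on many more documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set: each article is paired with its section label
articles = [
    "the pitcher threw the baseball into the stadium crowd",
    "the player hit the ball over the field fence",
    "the actress and the director walked the red carpet",
    "the new movie entertained the audience at the premiere",
]
labels = ["sports", "sports", "entertainment", "entertainment"]

# TF-IDF turns each article into weighted word counts; Naive Bayes learns
# which weighted words are associated with each label
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(articles, labels)

# "actress" and "director" were tied to the entertainment label during training
print(model.predict(["the director praised the actress after the screening"]))
```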
As with almost everything in machine learning, there are many more aspects of NLP that were not discussed here, such as part-of-speech tagging, sentiment analysis, syntax trees, and much more. I encourage you to read more in-depth articles if you want a much deeper understanding of what is possible with NLP. This article was just an introduction to some of the more basic ideas used for classification. Here are some articles I suggest if you want to read more:
- https://www.lexalytics.com/lexablog/machine-learning-natural-language-processing
- https://www.kdnuggets.com/2019/01/solve-90-nlp-problems-step-by-step-guide.html
- https://machinelearningmastery.com/natural-language-processing/
If you have any questions or comments please let me know. Have a great day!