Natural Language Processing

I recently started looking into some natural language processing (NLP) techniques, largely as a consumer of such research rather than as a producer. With the large amount of textual data available (10-K MD&A sections, the management discussion sections of mutual fund Form N-CSR filings, analyst reports, news articles, earnings calls, etc.), this seems to be fertile ground for new research.

My sense is that the earlier work in this area largely revolved around word counts: treating text as a “bag of words” and then counting how many times certain types of words appeared in these bags. For example, for sentiment analysis, a common technique was to count the number of positive words (where positive words were given by some dictionary, e.g. this one), count the number of negative words, and then take a ratio of positive to negative words to determine the overall sentiment of a piece of text. Some work extended this by creating custom dictionaries to address the unique vocabulary of finance and accounting.
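To make that concrete, here is a minimal sketch of the dictionary approach; the word lists and the sentence are made up for illustration, and adding one to each count is just one simple way to avoid dividing by zero.

# toy word lists and sentence, invented for illustration
pos_words = {"gain", "growth", "strong"}
neg_words = {"loss", "decline", "weak"}

text = "Strong revenue growth offset a small decline in margins"
words = text.lower().split()

positive_count = sum(word in pos_words for word in words)
negative_count = sum(word in neg_words for word in words)

# a simple sentiment score: ratio of positive to negative counts,
# with +1 on each side to avoid division by zero
sentiment = (positive_count + 1) / (negative_count + 1)
print(positive_count, negative_count, sentiment)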

Newer work seems more tech-ed up and generally considers the relationship between words (for example, “board” means very different things in “being on board” and “board of directors”). This type of work uses constructs that are harder to capture with dictionaries, and generally uses some type of machine learning to link blocks of text to a measurable variable. For example, a researcher might train a computer by providing a few thousand sentences, along with the researcher’s classification of those sentences as positive, negative, or neutral. After this, the computer can generally classify sentences quite accurately out-of-sample.
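As a rough illustration of that kind of supervised approach, here is a minimal sketch assuming the scikit-learn library; the handful of labeled sentences below are invented placeholders, whereas in practice a researcher would supply a few thousand hand-labeled examples.

# a minimal supervised text-classification sketch using scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# invented training sentences with the researcher's labels
train_sentences = [
    "earnings beat expectations and guidance was raised",
    "the company reported a large loss and cut its dividend",
    "the board of directors met on tuesday",
]
train_labels = ["positive", "negative", "neutral"]

# turn each sentence into word and two-word-phrase features,
# then fit a simple classifier on the labeled examples
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_sentences, train_labels)

# classify a new, unlabeled sentence out-of-sample
print(model.predict(["revenue grew sharply this quarter"]))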

I toyed around with the simplest version of this (bag of words, positive vs negative counts, etc.) and wrote some code that takes a news article and returns the number of positive words, negative words, and total words. The code is below.


# imports: requests fetches the article, html2text converts the HTML to plain text,
# and punctuation is used to strip punctuation from the text
import requests
import html2text
from string import punctuation

# This bit gets positive and negative words from your dictionaries
# (each file is assumed to have one word per line)
positive_words = open("positive.txt").read().splitlines()
negative_words = open("negative.txt").read().splitlines()

# this defines a function that takes a requests response and three running counters,
# and returns the updated counts of positive words, negative words, and total words
def parsenews(response, positive_counter, negative_counter, total_words):
    # convert the raw HTML of the response into plain text
    txt = response.text
    simpletxt = html2text.html2text(txt)
    simpletxt_processed = simpletxt.lower()
    # remove punctuation
    for p in punctuation:
        simpletxt_processed = simpletxt_processed.replace(p, '')
    # split the cleaned text on whitespace
    words = simpletxt_processed.split()
    # tally positive and negative words (ignoring very short tokens)
    for word in words:
        if word in positive_words and len(word) > 2:
            positive_counter = positive_counter + 1
        if word in negative_words and len(word) > 2:
            negative_counter = negative_counter + 1
    total_words = total_words + len(words)
    return positive_counter, negative_counter, total_words
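
To try the function on a single article, the call might look something like this; the URL is just a placeholder, and the three counters start at zero.

# hypothetical usage: fetch an article and tally its sentiment counts
# (the URL is a placeholder, not a real article)
response = requests.get("https://example.com/some-news-article")
pos, neg, total = parsenews(response, 0, 0, 0)
print(pos, neg, total)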

It seemed relatively straightforward to do the “bag of words” positive vs negative sentiment counts. At some point, I might try the more complicated stuff, but for now, I just look forward to seeing more cool studies using these techniques.