Homework 3: Markov Chains for Text Processing (46 Points)

Chris Tralie

Overview / Logistics

The purpose of this assignment is to introduce you to natural language processing via Markov chains, which are incredibly powerful tools for representing sequences mathematically, and which will form the foundation of several other topics we'll see in the class (hidden Markov models, Markov decision processes, and diffusion generative models).

First, you will use Markov chains to classify text, seeing how well they can tell whether Clinton or Trump was speaking during the 2016 presidential debates. This is our first application of supervised learning, in which we learn from labeled examples how to predict labels for new, unknown data. In particular, you will learn statistics on the first two debates and predict the identity of the speaker in the third debate.

Beyond learning, you will also use Markov chains as a generative model to synthesize new text that statistically matches a style. For example, here is some text I synthesized based on the statistics trained on a bunch of Star Wars sith lord quotes, as well as Ursinus tweets, together:

  • "the cpd has yet another tip for undergraduates to explore the dark side"
  • "check out this scholarship alert: before your eyes. i can drive you mad with fear, shred your sanity, and leave you a raving lunatic"
  • "thomas j. watson fellowship, a yearlong grant that anyone who knows the words and scientific committee of the force"
  • "kill the spring! 78 days until opening day!"
  • "vanessa wilson-marshall '02 recalls the words and actions of significance is the result of conquest"

As funny as these are, I have to admit that I cherry-picked them; most of the examples I generated were garbled nonsense. Later in the course, we'll learn how to use neural networks to create better generative models, but this is a good foundation for what's to come.

Learning Objectives

  • Implement a statistical technique for supervised learning
  • Implement a basic generative technique for natural language
  • Follow the train set / test set paradigm for evaluating the performance of an algorithm, and learn how to "wrangle datasets" (i.e. load files and prepare them for testing).
  • Search for optimal model parameters, and explore how these parameters vary across datasets.
  • Use jupyter notebooks to organize experiments into a computational essay.

What to submit

Submit your markov.py file and the Jupyter notebook you created in the experimental section on Canvas.

Assignment Tasks

Click here to review the notes on Markov chains for document processing. These notes will help in the tasks below.

Starter Code

Click here to download the starter code for this assignment. The main file you will be editing is markov.py. This contains bare-bones code for a Markov class that encapsulates all of the data in a model, as well as some (though not necessarily all) of the instance methods that you need to write, as explained below.

The folder also contains a number of text documents you will use to test your code.

Data Structures / Adding Text (10 Points)

Your Task

Choose appropriate data structures to represent your Markov chain (dictionaries are highly recommended), and initialize them as instance variables in the constructor. Then, fill in the method add_string to add prefixes and character counts to your data structures for a string that's passed in. You should loop through all possible prefixes in the string passed as a parameter and update the counts of the prefixes (or add a new prefix if this is the first time you're seeing it). You should also update the counts of the characters that follow each prefix. Two things to watch out for here:
  • To prevent your model from running into dead ends when you go to synthesize new text, you should loop around once you run out of room to make a prefix at the end. For example, if you had the text

    CS 477 rocks

    And you chose to use a prefix length of 5, you should have all of these prefixes:

    	"CS 47"
    	"S 477"
    	" 477 "
    	"477 r"
    	"77 ro"
    	"7 roc"
    	" rock"
    

    But also all of these prefixes once you start looping around:

    	"rocks"
    	"ocksC"
    	"cksCS"
    	"ksCS "
    	"sCS 4"
    
    In code, you'd construct a Markov model and add this string as shown in the sketch just after this list.
  • Do not add strings with a length less than the chosen prefix length. Simply do nothing if such a string is passed.
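For example, here's a minimal sketch of what this could look like in code (assuming the Markov constructor takes the prefix length as a parameter; adjust to match your own design):

    from markov import Markov  # assuming the class lives in markov.py

    model = Markov(5)                 # Markov model with prefix length K = 5
    model.add_string("CS 477 rocks")  # adds all 12 of the prefixes listed above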

As an example to check that this is working properly: if you add all of the lines from the spongebob file using 5-prefixes, you should have 4032 unique prefixes, and the character counts for the prefix " you " should be as follows:

Computing Probabilities (10 Pts)

Given a model and some text, we can compute the likelihood that this text was drawn from the model by following the procedure discussed in the notes.

Your Task

Fill in the method get_log_probability to return the log probability of a sequence according to your model.

You should first test your code on the simple example from the notes to make sure your numbers agree.

For a longer example, if you run the following code
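Something along these lines should work (the spongebob file name and the exact query string here are assumptions; swap in whatever matches your setup):

    from markov import Markov

    model = Markov(5)
    # hypothetical file name; use the spongebob file from the starter folder
    with open("text/spongebobquotes.txt") as f:
        for line in f.readlines():
            model.add_string(line.rstrip())
    print(model.get_log_probability("i'm ready, i'm ready"))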

You should get a log probability of -40.5

If you then say
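Again a sketch, with the query string inferred from the comparison below:

    print(model.get_log_probability("artificial intelligence"))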

You should get a log probability of -69.6. Note how much lower this probability is, even though the sequences are of roughly the same length. This is our model telling us that Spongebob is much more likely to have said "I'm ready, I'm ready" than "Artificial intelligence".


2016 US Presidential Debate (10 Pts)

You will now perform an experiment in the "train/test" paradigm, which is common in machine learning; that is, you will build models on some labeled data that is designated as "training data," and you will see how well your models predict new labeled examples in the "test data" set. You should perform the experiment and write your reflections all in a single Jupyter notebook file.

The first experiment you will do is inspired by a very similar assignment I did over 15 years ago, which involved figuring out whether a quote from the 2004 presidential debates was said by Bush or Kerry. To freshen this up, I have obtained the debate transcripts from the 2016 election from Politico. You will train your models on the first two debates and see if you can predict who said a particular quote from the third.

Task 1: Classification for a fixed K (5 Points)

Initialize separate Trump and Clinton models for some prefix length K; you can start with K=5.

Then, loop through 40 different segments from the third debate and classify each one as having come from Trump or Clinton. Here's an example of how to load all of the segments from Trump as strings:
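Here's a sketch of what that loading could look like, under the (hypothetical) assumption that each segment lives in its own numbered file; match the paths to the files that actually ship in the starter folder:

    import glob

    # hypothetical layout: text/trump3-0.txt, text/trump3-1.txt, ... (one segment per file)
    trump_test = []
    for fname in sorted(glob.glob("text/trump3-*.txt")):
        with open(fname) as f:
            trump_test.append(f.read())

You'd load the Clinton segments the same way, and build trump_model and clinton_model by calling add_string on the corresponding text from the first two debates.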

To classify each segment from both Trump and Clinton, compute the log probability under both models and see which one has a higher probability.

  • If the correct model has a higher probability, this is a correct classification. If not, then it is a mistake.
  • You can report the accuracy as the total number of correct classifications over the total number of items in the test set. For example, if I got 60 correct and there were 80 items, the accuracy would be 60/80 = 0.75.

Below is pseudocode for how you might check the Clinton debate examples to see which ones your models got correct. You'd do a similar thing for the Trump debate examples.
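In Python, the sketch might look like this (clinton_test is assumed to hold the Clinton segments from the third debate, loaded like the Trump segments above):

    num_correct = 0
    for segment in clinton_test:
        logp_trump = trump_model.get_log_probability(segment)
        logp_clinton = clinton_model.get_log_probability(segment)
        if logp_clinton > logp_trump:
            num_correct += 1  # the Clinton model wins on a Clinton quote: correct
    # in the confusion matrix below, w = num_correct and z = len(clinton_test) - num_correct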

To summarize the results, print out a confusion matrix as follows:

                          Guessed that Trump said it    Guessed that Clinton said it
                          (prob. of Trump is higher)    (prob. of Clinton is higher)
    Quote from Trump                   x                             y
    Quote from Clinton                 z                             w

The elements along the diagonals represent correct classifications, so the proportion of correctly guessed examples would be

\[ \frac{x+w}{x+y+z+w} \]

Task 2: Hyperparameter Optimization (3 Points)

It's unclear a priori what prefix length K would be best for this task. This is what's known as a "hyperparameter," and we should experiment some to tune it properly.

Your Task

Create a loop where you report accuracies on the test data over a range of K values from 1 to 20, and plot the accuracy versus K. You can use matplotlib to create such a plot. What trends do you notice in the plot? Given what you know about these models, can you explain these trends? Answer this in a markdown cell in your notebook directly under your code.
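Here's a minimal sketch of such a loop, assuming you've wrapped the Task 1 training and classification into a hypothetical helper get_accuracy(K):

    import matplotlib.pyplot as plt

    ks = list(range(1, 21))
    accs = [get_accuracy(K) for K in ks]  # train both models at this K, classify the test set
    plt.plot(ks, accs)
    plt.xlabel("Prefix length K")
    plt.ylabel("Accuracy on third debate")
    plt.show()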

NOTE: Normally when we do hyperparameter optimization, we have 3 datasets: a training set (in this case, the first two debates), a test set (in this case, the final debate), and a third dataset called the validation set. We pick the best hyperparameter based on the validation set, and then we report the results on the test set. This is a more honest way of choosing the best parameter, since we're not tweaking it based on the final dataset on which we report results. Doing things this way also leads to better "generalization" of the models. We'll discuss this more in the next couple of units of the course, but I decided to keep things simple for now and just use a training and a test set.

Task 3: Explainability (3 Points)

For the best-performing K in your experiment above, print out the text segments that the models got incorrect. Do these make sense to you? Write about your thoughts in the notebook.

Task 4: Most Prominent Differences (5 Points)

For the best-performing K, print the correctly classified strings with the 10 highest absolute differences in normalized log probability between the models; that is, divide the log probability of each string s under the Clinton model by len(s)-K, do the same for the Trump model, and then compute the absolute value of the difference. Do these make sense to you? Write about your thoughts in the notebook.

Hint: numpy's argsort method may come in handy here.

NOTE: The normalization controls for the length of the sequences and gives us an idea of how different the log probabilities of each prefix are on average.
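For instance, a sketch along these lines (correct_segments is a hypothetical list holding the correctly classified strings, with the models and K from above in scope):

    import numpy as np

    # absolute difference in normalized log probability for each string
    diffs = np.array([abs(clinton_model.get_log_probability(s) / (len(s) - K)
                          - trump_model.get_log_probability(s) / (len(s) - K))
                      for s in correct_segments])
    for i in np.argsort(-diffs)[0:10]:  # indices of the 10 largest differences
        print(correct_segments[i])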

Interpretable models

In the interest of time, I decided to cut a task that used to be here about digging into the models to see why they're doing what they're doing. One way to do this is to narrow in on which prefixes lead to the largest disparities in probability for the sequences that the two models score most differently. If you did this, you'd see that prefixes involving "Hillary", "Clinton", "disaster", "invest", "i will tell you", "Sanders", "you think", "horrible", "millions of dollars", and "unbelievable" were much more probable under Trump's model than under Clinton's model, for example.

Because we can actually dig into our model and find out meaningful things about how it's working, we refer to it as interpretable. This is a nice feature for a model to have, because it lets us explain more easily why it is or isn't working in practice. It will not be true of all of the models we examine in this class, but it's often an important property for building trust in a model.


Synthesizing Text (10 Pts)

We will finish by building a basic generative model for text. To accomplish this, you should choose a prefix at random from the training text to start with, and then synthesize the rest of the text one character at a time, updating the current prefix to be the last K characters of the string you're synthesizing as you go along. Be sure that the next character you choose respects the probabilities in the model. You may want to refer back to our class exercise to see how we were able to draw characters respecting probabilities.
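For the character-drawing step, numpy's random.choice accepts a probability array. Below is a sketch, assuming (hypothetically) that counts is a dictionary mapping each character that follows the current prefix to the number of times it occurred:

    import numpy as np

    chars = list(counts.keys())                   # characters that can follow this prefix
    probs = np.array([counts[c] for c in chars], dtype=float)
    probs /= probs.sum()                          # convert counts to probabilities
    next_char = np.random.choice(chars, p=probs)  # draw a character respecting the model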

Your Task

Fill in the method synthesize_text to create new text of a particular length drawn with probabilities from your model.

For example, if I run the code
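The call might look something like this (the tweet file name, the loading step, and the 200-character length are assumptions):

    model = Markov(5)
    with open("text/ursinus_tweets.txt") as f:  # hypothetical file name
        for line in f.readlines():
            model.add_string(line.rstrip())
    print(model.synthesize_text(200))  # synthesize 200 new characters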

I might get

s birthday, to apply.The safety featured on @BartramsGarden, the event art prof. Carlita Favero. #MLKDay of single fathers participate in hanging to #Ursinus alum Phil DeSimone ’12 of @Carbon back to

Of course, this is only one of many, many possibilities, but it should look roughly like an amalgamation of things that were in the training data, just spat back with no long-term memory.

For fun

In addition to the presidential debates, I have also provided a spam vs. ham dataset and a dataset of positive and negative movie reviews. We'll return to the movie reviews in a later assignment with a different technique, but you can try the Markov chain technique on both of these datasets now if you're bored.

Beyond that, you can try synthesizing "Markov hybrids" like I did with the Star Wars and Ursinus quotes. Please share what you get on Discord with the rest of the class!