First, we'll loop through all of the training data: all of the words from the first two debates for both Trump and Clinton. We'll keep track of the count of each word using a dictionary whose keys are words and whose values are counts. As suggested by Matt S. in class, we'll use the empty string "" to store the total number of words.
import numpy as np

counts = {}
for speaker in ["clinton", "trump"]:
    counts[speaker] = {"": 0}  # "" holds this speaker's total word count
    for debate in [1, 2]:
        with open("text/2016Debates/{}{}.txt".format(speaker, debate)) as fin:
            text = fin.read()
        # Lowercase so the training counts match the lowercased test paragraphs below
        for word in text.lower().split():  # split() already strips whitespace and line breaks
            if word not in counts[speaker]:
                counts[speaker][word] = 0
            counts[speaker][word] += 1
            counts[speaker][""] += 1
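As a quick sanity check on the counting loop, Python's built-in collections.Counter does the same per-word bookkeeping in one call; this sketch assumes the same file layout as above:

from collections import Counter

with open("text/2016Debates/clinton1.txt") as fin:
    sample = Counter(fin.read().lower().split())
print(sample.most_common(5))  # the five most frequent words in Clinton's first debate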
## Total number of words for each speaker
print("clinton total words:", counts['clinton'][""])
print("trump total words:", counts['trump'][""])
clinton total words: 12330
trump total words: 15143
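These totals double as the class priors in the classification step below: Clinton's prior is 12330 / (12330 + 15143) ≈ 0.449, and Trump's is ≈ 0.551. A minimal sketch of that computation, reusing the counts dictionary built above:

clinton_total = counts["clinton"][""]
trump_total = counts["trump"][""]
print("clinton prior: {:.3f}".format(clinton_total / (clinton_total + trump_total)))  # ~0.449
print("trump prior: {:.3f}".format(trump_total / (clinton_total + trump_total)))      # ~0.551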
Now we'll loop through each paragraph from the third debate and compute its log probability under the Clinton model and under the Trump model.
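Concretely, for each candidate speaker we evaluate

    log p(speaker | words) ∝ log p(speaker) + sum over words of log p(word | speaker)

where p(speaker) is that speaker's share of the training words and p(word | speaker) is the add-one-smoothed word frequency; whichever speaker scores higher is our guess.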
import glob

for ground_truth in ["clinton", "trump"]:
    # Loop through the clinton paragraphs, then the trump paragraphs
    for i, f in enumerate(glob.glob("text/2016Debates/{}3*.txt".format(ground_truth))):
        with open(f) as fin:
            test = fin.read()
        ## Compute the naive Bayes log posterior likelihood under each class
        results = []
        for speaker_class in ["clinton", "trump"]:
            # Denominator for add-one smoothing: total words plus vocabulary size
            # (minus 1 so the "" total-count key isn't counted as a word)
            denom = counts[speaker_class][""] + len(counts[speaker_class]) - 1
            # Start with the log prior: this speaker's share of all training words
            p = np.log(counts[speaker_class][""] / (counts["clinton"][""] + counts["trump"][""]))
            for word in test.lower().split():
                if word in counts[speaker_class]:
                    # The probability is estimated as the count of this particular
                    # word over the total number of words the speaker said,
                    # with 1 added for add-one (Laplace) smoothing
                    p_word = (1 + counts[speaker_class][word]) / denom
                else:
                    # If this speaker never said this word in the training data,
                    # give it a small but nonzero probability
                    p_word = 1 / denom
                # Under the naive Bayes assumption, each word is a new, independent
                # observation, so we multiply its probability into the running
                # product. To prevent numerical underflow we work in log space,
                # where multiplication becomes addition
                p += np.log(p_word)
            results.append(p)
        # Pick the class with the maximum log posterior likelihood
        speaker_guess = ["clinton", "trump"][np.argmax(results)]
        # Check whether the guess is correct
        result = "Correct!"
        if speaker_guess != ground_truth:
            result = "Incorrect :("
        print("{} Debate {}: {}".format(ground_truth, i, result))
clinton Debate 0: Correct!
clinton Debate 1: Correct!
clinton Debate 2: Correct!
clinton Debate 3: Correct!
clinton Debate 4: Correct!
clinton Debate 5: Correct!
clinton Debate 6: Correct!
clinton Debate 7: Correct!
clinton Debate 8: Correct!
clinton Debate 9: Correct!
clinton Debate 10: Correct!
clinton Debate 11: Correct!
clinton Debate 12: Correct!
clinton Debate 13: Correct!
clinton Debate 14: Correct!
clinton Debate 15: Correct!
clinton Debate 16: Correct!
clinton Debate 17: Correct!
clinton Debate 18: Correct!
clinton Debate 19: Correct!
clinton Debate 20: Correct!
clinton Debate 21: Correct!
clinton Debate 22: Correct!
clinton Debate 23: Correct!
clinton Debate 24: Correct!
clinton Debate 25: Correct!
clinton Debate 26: Correct!
clinton Debate 27: Correct!
clinton Debate 28: Correct!
clinton Debate 29: Correct!
clinton Debate 30: Correct!
clinton Debate 31: Correct!
clinton Debate 32: Correct!
clinton Debate 33: Correct!
clinton Debate 34: Correct!
clinton Debate 35: Correct!
clinton Debate 36: Correct!
clinton Debate 37: Correct!
clinton Debate 38: Correct!
clinton Debate 39: Correct!
trump Debate 0: Correct!
trump Debate 1: Correct!
trump Debate 2: Incorrect :(
trump Debate 3: Incorrect :(
trump Debate 4: Incorrect :(
trump Debate 5: Correct!
trump Debate 6: Correct!
trump Debate 7: Correct!
trump Debate 8: Correct!
trump Debate 9: Correct!
trump Debate 10: Correct!
trump Debate 11: Correct!
trump Debate 12: Correct!
trump Debate 13: Correct!
trump Debate 14: Correct!
trump Debate 15: Correct!
trump Debate 16: Correct!
trump Debate 17: Correct!
trump Debate 18: Correct!
trump Debate 19: Correct!
trump Debate 20: Incorrect :(
trump Debate 21: Correct!
trump Debate 22: Correct!
trump Debate 23: Correct!
trump Debate 24: Correct!
trump Debate 25: Correct!
trump Debate 26: Correct!
trump Debate 27: Correct!
trump Debate 28: Correct!
trump Debate 29: Correct!
trump Debate 30: Correct!
trump Debate 31: Correct!
trump Debate 32: Correct!
trump Debate 33: Correct!
trump Debate 34: Correct!
trump Debate 35: Correct!
trump Debate 36: Correct!
trump Debate 37: Correct!
trump Debate 38: Correct!
trump Debate 39: Correct!
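That's 40/40 on the Clinton paragraphs and 36/40 on the Trump paragraphs, for 76/80 (95%) overall, with every error on the Trump side. If we wanted the tally computed rather than counted by eye, one option is to collect each (guess, truth) pair inside the loop above; the predictions list below is a hypothetical stand-in seeded with placeholder values, not something built by the code above:

# Hypothetical: inside the loop above we would do
#     predictions.append((speaker_guess, ground_truth))
# The three tuples here are illustrative placeholders only
predictions = [("clinton", "clinton"), ("trump", "trump"), ("clinton", "trump")]
n_correct = sum(guess == truth for guess, truth in predictions)
print("accuracy: {}/{} = {:.1%}".format(n_correct, len(predictions), n_correct / len(predictions)))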