We discussed the multi-armed bandit problem in class today. This is the starting point for **reinforcement learning**, or learning how to behave in some **environment** over time.

In the basic version of the multi-armed bandit problem, we have $N$ slot machines, each with a different *fixed* probability distribution over the amount of payout we get each time we "pull" on them. We don't know how they behave going in, and our job is to maximize the payout over some number of pulls of the bandits. In particular, we need to come up with a **policy** (a set of actions to take over time) that maximizes the **reward** (in this case, the cumulative sum of all amounts of money we get at each pull). The trick is, since we don't know anything about the bandits, we have to learn and update our policy over time as we explore and get more information about the **value** (long term behavior) of each machine.

Before we dive in, we should note that this is much more generally applicable than just slot machines! Here are some other examples where exploring an environment and learning a "multi-armed bandit" policy would apply, including some that people came up with in class:

| | Environment | Policy over time | Reward |
| --- | --- | --- | --- |
| Slot machines | Slot machines with different payouts | Which machine to pull | Money accrued |
| Fishing | Different densities of fish in different places | Where to fish | Total fish caught |
| Efficient driving | Performance of car | How fast to drive | Miles per gallon |

But for the rest of this example, we'll stick to the "one-armed bandit" analogy of a slot machine. First, let's define what a bandit is in code, and then we'll talk more about how to learn different policies for this problem.

We're going to use a class called "Bandit" to encapsulate what it means to be a bandit. For the probability distributions, we'll pick numbers uniformly at random between 0 and some maximum value for each bandit. The maximum value is the hidden parameter. We'll "seed" the maximum value so that it seems to be randomly different for each bandit, but so that we all get the same results each time we run it. This is known as pseudorandomness. Let's see how we'd define such a class below:
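Before defining the class, here's a tiny sketch of what seeding buys us (the seed value 7 is an arbitrary choice for illustration): re-seeding NumPy's generator with the same value makes the "random" numbers come out the same every time.

```
import numpy as np

np.random.seed(7)         # seed the pseudorandom number generator
print(np.random.rand())   # some "random" number in [0, 1)
np.random.seed(7)         # re-seed with the same value...
print(np.random.rand())   # ...and we get exactly the same number again
```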

In [1]:

```
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd
```

In [2]:

```
class Bandit:
    def __init__(self, seed):
        """
        The constructor for bandit objects

        Parameters
        ----------
        seed: int
            A seed for the pseudorandom number generator. Different seeds
            will give different values of the max this bandit can give at
            any pull (self.mx)
        """
        self.seed = seed
        np.random.seed(seed)
        self.mx = np.random.rand()*10

    def pull(self):
        """
        Returns
        -------
        One random pull of this bandit, which returns a number in [0, self.mx]
        """
        return np.random.rand()*self.mx

    def __repr__(self):
        return "Bandit seed {}, Max Payout {:.3f}".format(self.seed, self.mx)

# Setup a list of 10 bandits and print them out
bandits = [Bandit(i*2) for i in range(10)]
for b in bandits:
    print(b)
results = {}
```

In the above list of 10 bandits, we see the one at index 2 has the maximum payout. So ideally, our policy would be to just pull on that one over and over. But we don't necessarily know that information going in, so we're going to have to learn it *online* as we're pulling on the bandits.

Below I'll discuss several strategies for automating good policy discovery. In each case, the algorithm will trade off **exploration** vs **exploitation**. The more the algorithm tries out the different bandit arms, the more it learns about each one, but the less time it has to exploit (pull on) the one that seems to be doing the best. By contrast, we can explore very little (e.g. some people in class only pulled on each bandit once at the beginning and then went with the one that paid the most that one time). But we might then not have correct information about which bandit is the best.

Our first strategy is known as an $\epsilon$-greedy strategy. The idea is that an $\epsilon$ proportion of the time, we will explore by randomly choosing a bandit to pull on, and the rest of the time we will pull on the one that has the greatest *average* payout so far. The average payout gives our **value estimate** of the machine; that is, how much it rewards us *over time*, even if it has some better than average or worse than average pulls at any particular instant.

One strategy for figuring out when to explore is akin to throwing a dart uniformly at random into the unit interval $[0, 1]$ at each pull. If the dart lands in the interval $[0, \epsilon]$, then we will choose to pull a bandit at random. Otherwise, we will pull on the one that has the highest average payout so far. The picture below depicts this, where the red dots mark pulls where we choose to explore at random, and the blue dots mark pulls where we exploit the bandit that looks best so far:
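In code, that decision step looks roughly like the minimal sketch below; the `values` numbers here are hypothetical average payouts just for illustration, and the full version appears in `learn_bandit` further down.

```
import numpy as np

eps = 0.1                             # probability of exploring on this pull
values = np.array([2.1, 4.8, 0.7])    # hypothetical average payouts seen so far
dart = np.random.rand()               # throw a dart uniformly at random into [0, 1]
if dart <= eps:
    idx = np.random.randint(len(values))  # explore: pick a bandit uniformly at random
else:
    idx = np.argmax(values)               # exploit: pick the bandit that looks best so far
print("Pulling bandit", idx)
```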

Let's write some code to execute this strategy on some list of bandits for some number of pulls, and we'll test it out for $\epsilon = 0.1$: exploring 10% of the time and exploiting 90% of the time. Actually, I'm going to write the code in a very general-purpose way so we can use it for other strategies, as explained in the comments.

For a particular bandit that's been pulled $n$ times with payouts $x_1, x_2, ..., x_n$, the average payout $\mu_n$ is

$$\mu_n = \frac{x_1 + x_2 + \ldots + x_n}{n}$$

We could update this in constant time by keeping track of two variables: one that accumulates the sum of all pull rewards (the numerator), and one that keeps track of how many pulls have been made (the denominator). This is the way I did it in class to save time. But there's a more elegant and illuminating way to compute this (which also avoids the divide by zero issue we saw when no pulls had happened yet). Let's re-arrange the average equation to understand the impact of a single new observation and how it "reinforces" our previous value estimate. Note that we can write the following recurrence:

$$\mu_n = \frac{(n-1)\mu_{n-1} + x_n}{n}$$

If we distribute through the $(n-1)$ in the numerator, we can simplify this to

$$\mu_n = \mu_{n-1} + \frac{1}{n}\left(x_n - \mu_{n-1}\right)$$

An even more general way to write this is:

$$\mu_n = \mu_{n-1} + \alpha\left(x_n - \mu_{n-1}\right)$$

where $\alpha$ is referred to as the "learning rate" in machine learning parlance, or the "step size" in optimization parlance. We will see this paradigm often in this class: we update some estimate given new data, and we have to decide how much importance to place on that data. If $\alpha$ is higher, then we pull our estimate towards the new data more rapidly. In our case, since we insisted on computing the mean over time, $\alpha$ decays as we accumulate more pulls; the more data we have, the less a single additional observation can affect our value estimate. Eventually, in the limit, a new pull barely moves the estimate at all, which is consistent with the law of large numbers.
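As a quick sanity check with some made-up payouts, the update $\mu_n = \mu_{n-1} + \frac{1}{n}(x_n - \mu_{n-1})$ really does reproduce the ordinary running mean at every step:

```
import numpy as np

payouts = [3.0, 5.0, 4.0, 8.0]  # made-up rewards from four pulls of one bandit
value = 0.0                     # our running value estimate
for n, x in enumerate(payouts, start=1):
    alpha = 1/n                          # step size that recovers the plain average
    value = value + alpha*(x - value)    # reinforce the estimate with the new pull
    print(value, np.mean(payouts[:n]))   # the two numbers agree at every step
```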

In the code below, I will leave $\alpha$ open as a function, which defaults to $\frac{1}{n}$, but we'll see a reason later why we might want to change this.

In [3]:

```
def learn_bandit(bandits, total_pulls, eps_fn, alpha_fn=lambda n: 1/n, bias=0):
    """
    A general purpose method for learning policies for the multi-armed bandit problem

    Parameters
    ----------
    bandits: list of Bandit
        The bandits that we can pull on
    total_pulls: int
        Number of pulls to execute
    eps_fn: float OR method pull -> pull probability
        The probability that we will explore at a particular pull
        or exploit at a particular pull, where pull is the number pull we're on.
        * If a float is passed along, this is a constant, and we refer to this
          as an "epsilon_greedy" strategy
        * If a method is passed along, then this can vary as additional pulls happen
    alpha_fn: method pull_num -> float
        A method that gives us the "step size" or "learning rate" when integrating a new
        pull's reward into our value estimate. If we're taking an average of all pulls on
        that machine over time, this will simply be 1/pull_num.
    bias: float
        The amount of bias to add to the initial estimate of each average value

    Returns
    -------
    total_reward: float
        The total reward we got after all pulls
    choices: ndarray(total_pulls, dtype=int)
        The index of the bandit chosen at each pull over time
    """
    from types import FunctionType
    if not isinstance(eps_fn, FunctionType):
        # If the user passed along a constant for the eps_fn, wrap
        # a method around it that always returns that constant
        eps = 1.0*eps_fn
        eps_fn = lambda x: eps
    if not isinstance(alpha_fn, FunctionType):
        # If the user passed along a constant for the alpha_fn, wrap
        # a method around it that always returns that constant
        alpha = 1.0*alpha_fn
        alpha_fn = lambda x: alpha
    n_bandits = len(bandits)
    total_reward = 0
    # Create an array of all 0's which stores how many pulls have been taken at each bandit
    pull_nums = np.zeros(n_bandits)
    # Create an array of all 0's which stores the average payout of each bandit so far,
    # which we consider our "value estimate" of that bandit
    values = np.zeros(n_bandits)
    # Create an array of all 0's which stores the index of the bandit chosen at each pull
    choices = np.zeros(total_pulls)
    # If there's a bias, pretend every bandit had one pull with this amount when
    # we start out
    if bias > 0:
        pull_nums[:] = 1
        values[:] = bias
    for pull in range(total_pulls):
        ## Step 1: Choose which bandit to pull
        dart = np.random.rand() # Choose a number in [0, 1]
        if dart <= eps_fn(pull):
            # If this number is in [0, eps_fn(pull)], choose a random bandit
            idx = np.random.randint(n_bandits)
        else:
            # Otherwise, choose the bandit which has the maximum average payout so far
            idx = np.argmax(values)
        ## Step 2: Pull the bandit that we intend to pull, and update our
        ## value estimate for that bandit
        this_pull = bandits[idx].pull()
        total_reward += this_pull
        pull_nums[idx] += 1
        # Our update equation: value = value + alpha*(this_pull - value)
        values[idx] = values[idx] + alpha_fn(pull_nums[idx])*(this_pull - values[idx])
        choices[pull] = idx
    return total_reward, choices

# We'll start by trying an "epsilon-greedy" strategy with a fixed epsilon of 0.1
plt.figure(figsize=(12, 4))
np.random.seed(0)
total_reward, choices = learn_bandit(bandits, total_pulls=1000, eps_fn=0.1)
print(total_reward)
plt.subplot(121)
plt.plot(choices)
plt.xlabel("Pull Number")
plt.ylabel("Bandit index")
plt.subplot(122)
plt.hist(choices)
plt.xticks(np.arange(len(bandits)))
plt.xlabel("Bandit Index")
plt.ylabel("Total Pulls on Bandit")
plt.title("Bandit Choices")
```

4547.278733183436

Out[3]:

Text(0.5, 1.0, 'Bandit Choices')

In [4]:

```
# Try these different numbers of pulls. We'll space the numbers out more as we get larger
# to save time
total_pulls = np.array(np.round(20000*np.linspace(0.01, 1, 500)**2), dtype=int)
# How many experiments to run for each epsilon for each number of pulls.
# We'll take the average over all of these experiments
n_experiments = 200
for eps in [0, 0.01, 0.1, 0.4]:
    # Start out with an array of zeros to keep track of the average payout at each number
    # (use floats so the averages don't get truncated to integers)
    avg_reward = np.zeros_like(total_pulls, dtype=float)
    # We'll also save the bandits that are chosen at each step
    all_choices = []
    for i, pulls in enumerate(total_pulls):
        ipd.clear_output()
        print(eps, total_pulls[i])
        # For each number of pulls, run some number of experiments
        reward = 0
        choices = []
        for j in range(n_experiments):
            reward_j, choices_j = learn_bandit(bandits, total_pulls=pulls, eps_fn=eps)
            reward += reward_j
            choices = np.concatenate((choices, choices_j))
        avg_reward[i] = reward/n_experiments
        all_choices.append(choices)
    results["Greedy $\\epsilon = {}$".format(eps)] = {"reward":avg_reward, "choices":all_choices}
```

0.4 20000

In [5]:

```
def plot_avg_rewards(total_pulls, results, mn, mx):
    """
    Make a plot of the average rewards vs number of pulls in
    some range
    mn: min number of pulls
    mx: max number of pulls
    """
    for key in results.keys():
        idx = (total_pulls <= mx)*(total_pulls >= mn)
        plt.plot(total_pulls[idx], results[key]["reward"][idx])
    plt.legend(list(results.keys()))
    plt.xlabel("Number of pulls")
    plt.ylabel("Average reward over {} trials".format(n_experiments))

plt.figure(figsize=(12, 4))
plt.subplot(121)
plot_avg_rewards(total_pulls, results, 0, 1000)
plt.subplot(122)
plot_avg_rewards(total_pulls, results, 0, np.max(total_pulls))
```

Let's also look at the choices that were made over time

In [6]:

```
def plot_choices(total_pulls, results, key):
    """
    Make a plot of the proportion of times each bandit was chosen
    """
    prop = [[np.sum(results[key]["choices"][p] == k)
                for p in range(len(total_pulls))]
                    for k in range(len(bandits))]
    prop = np.array(prop)
    prop = prop/np.sum(prop, axis=0, keepdims=True)
    for k in range(len(bandits)):
        plt.plot(prop[k, :])
    plt.legend(["{}".format(i) for i in range(len(bandits))])
    plt.title(key)
    plt.xlabel("Number of pulls")
    plt.ylabel("Proportion chosen")
    plt.ylim([0, 1])

plt.figure(figsize=(12, 6))
plt.subplot(221)
plot_choices(total_pulls, results, "Greedy $\\epsilon = 0.01$")
plt.subplot(222)
plot_choices(total_pulls, results, "Greedy $\\epsilon = 0.1$")
plt.subplot(223)
plot_choices(total_pulls, results, "Greedy $\\epsilon = 0.4$")
plt.tight_layout()
```

We notice a few interesting things here in this particular example:

- For a smaller number of pulls, $\epsilon = 0.01$ isn't exploring enough and does the worst. It is also the noisiest, because we explore so little that our data about which bandit is the best is unreliable
- $\epsilon = 0.4$ has very good estimates of which bandit is the best (bandit at index 2) because it explores the most (and is the smoothest curve as a result), but it doesn't exploit enough.
- For a smaller number of pulls, $\epsilon = 0.1$ seems to perform the best, and its noisiness falls in between the other two
- Eventually, with enough pulls, $\epsilon = 0.01$ catches up to $\epsilon=0.1$

So we want to explore just enough but not too much. What is the optimal $\epsilon$? It depends on the number of bandits and the number of pulls, but to get an idea for this specific example, we'll loop through a range of $\epsilon$ values and run a bunch of experiments for 1000 pulls, 5000 pulls, and 10000 pulls:

In [7]:

```
fixed_pulls = [1000, 5000, 10000]
epsilons = np.linspace(0, 0.5, 51)
n_experiments_eps = 400
avg_rewards_eps = {}
for pulls in fixed_pulls:
    avg_rewards = np.zeros_like(epsilons)
    for i, eps in enumerate(epsilons):
        ipd.clear_output()
        print(pulls, eps)
        reward = 0
        for j in range(n_experiments_eps):
            reward_j, _ = learn_bandit(bandits, total_pulls=pulls, eps_fn=eps)
            reward += reward_j
        avg_rewards[i] = reward/n_experiments_eps
    avg_rewards_eps[pulls] = avg_rewards
```

10000 0.5

In [8]:

```
plt.figure(figsize=(12, 4))
for p in fixed_pulls:
    # Normalize each curve by its maximum so the three pull counts are comparable
    y = avg_rewards_eps[p]/np.max(avg_rewards_eps[p])
    plt.plot(epsilons, y)
plt.legend(["{} Pulls".format(p) for p in fixed_pulls])
plt.xlabel("$\\epsilon$")
plt.ylabel("Ratio of Average Reward To Max Reward")
plt.title("Normalized Rewards for $\\epsilon$-Greedy Strategies")
plt.xticks(np.linspace(0, 0.5, 11))
print(epsilons[np.argmax(avg_rewards)])
```

0.04

A few theoretical notes:

- For any nonzero $\epsilon$, by the law of large numbers, the estimates of the means will eventually converge to the true means as the number of pulls grows.
- By the above logic, in the limit as the number of pulls goes to infinity, smaller nonzero values of $\epsilon$ will come out ahead, since they spend a greater proportion of their pulls exploiting the bandit that (correctly, in the limit) looks best.
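These notes suggest another way to use the general-purpose `eps_fn` parameter of `learn_bandit`: rather than a fixed $\epsilon$, we can pass a *decaying* schedule that explores heavily at the start and exploits almost exclusively later on. Below is a minimal sketch; the particular schedule $\epsilon(p) = 1/(1 + p/100)$ is just an arbitrary choice for illustration, not something we tuned.

```
# A decaying exploration schedule: epsilon starts near 1 and shrinks as the
# pull number p grows, so late pulls are almost always exploitation
decaying_eps = lambda p: 1/(1 + p/100)

np.random.seed(0)
total_reward, choices = learn_bandit(bandits, total_pulls=10000, eps_fn=decaying_eps)
print(total_reward)
```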