Abstract

My primary interest lies in reinforcement learning. For both research and learning purposes, I decided to write programs in Java to experiment with various reinforcement learning algorithms. This page includes records of these experiments and any additions to my Java library that resulted from them.

Markov Decision Processes

An infinite horizon Markov decision process is represented by a tuple M = (S, A, R, P, μ, γ), where S is the state space, A is the action space, R is the reward function, P is the transition function, μ is the initial state distribution, and γ is the discount factor.

To interact with a Markov decision process, first sample the initial state s_0 using μ. Then, repeat the following for t = 0, 1, 2, …

  1. Choose an action a_t using a policy π(s_t).
  2. Receive a reward r_t = R(s_t, a_t) based on the current state and action.
  3. Determine the next state s_{t+1} by sampling from P(s_t, a_t).
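
In code, this interaction loop can be sketched as follows. This is only an illustration of the definitions above; the interface and method names (MarkovDecisionProcess, Policy, sampleInitialState, and so on) are placeholders rather than the actual API of my library, and the rollout is truncated at a finite horizon.

```java
import java.util.Random;

/** Illustrative sketch of an infinite horizon MDP and one truncated interaction with it. */
interface MarkovDecisionProcess<S, A> {
    S sampleInitialState(Random rng);                  // draw s_0 from mu
    double reward(S state, A action);                  // R(s, a)
    S sampleNextState(S state, A action, Random rng);  // draw s_{t+1} from P(s_t, a_t)
    double discountFactor();                           // gamma
}

interface Policy<S, A> {
    A chooseAction(S state);                           // pi(s)
}

final class Rollout {
    /** Runs the interaction loop above for a finite number of steps and returns the discounted return. */
    static <S, A> double run(MarkovDecisionProcess<S, A> mdp, Policy<S, A> pi,
                             int horizon, Random rng) {
        double ret = 0.0;
        double discount = 1.0;
        S state = mdp.sampleInitialState(rng);
        for (int t = 0; t < horizon; t++) {
            A action = pi.chooseAction(state);               // step 1: choose a_t
            ret += discount * mdp.reward(state, action);     // step 2: receive r_t
            state = mdp.sampleNextState(state, action, rng); // step 3: sample s_{t+1}
            discount *= mdp.discountFactor();
        }
        return ret;
    }
}
```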

The goal of reinforcement learning is to find a policy π that maximizes the total expected discounted reward received during this interaction when actions are chosen using π. This quantity is called the value, and it depends on the chosen policy. Note that the discount factor is required to keep the following summation finite, because the interaction never terminates.

$$V^\pi = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

The value function is a core concept in reinforcement learning. Some variations of the value function condition the expectation on additional information. The state value function is the total expected discounted reward given the current state. The action value function is the total expected discounted reward given both the current state and the first action taken.
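
Written out, these conditioned variants take the standard forms:

$$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right], \qquad Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\, a_0 = a\right]$$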

Planning

Planning algorithms determine an optimal policy when given full information about the Markov decision process. Four standard algorithms exist for solving an infinite horizon Markov decision process; the experiments below compare two of them, value iteration and policy iteration.

Results from value iteration and policy iteration. Policies were extracted after each iteration. The plot shows the value of these policies.

Although policy iteration appears to perform much better than value iteration, each iteration of policy iteration is more computationally expensive. Note also that even though the value function estimate used by value iteration had not converged, an optimal policy was still extracted. As a result, value iteration generally runs faster and is used more commonly in practice.
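
For reference, a value iteration sweep over a tabular MDP looks roughly like the following. The array-based representation and the fixed iteration count are simplifications for illustration, not the exact structure of my implementation.

```java
/**
 * Value iteration sketch for a finite MDP given as arrays (illustrative only).
 * reward[s][a] is R(s, a); transition[s][a][s2] is P(s2 | s, a).
 */
final class ValueIteration {
    static int[] solve(double[][] reward, double[][][] transition,
                       double gamma, int iterations) {
        int numStates = reward.length;
        int numActions = reward[0].length;
        double[] value = new double[numStates];
        int[] policy = new int[numStates];

        for (int k = 0; k < iterations; k++) {
            double[] newValue = new double[numStates];
            for (int s = 0; s < numStates; s++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int a = 0; a < numActions; a++) {
                    // Bellman backup: R(s, a) + gamma * sum over s' of P(s'|s,a) * V(s')
                    double q = reward[s][a];
                    for (int s2 = 0; s2 < numStates; s2++) {
                        q += gamma * transition[s][a][s2] * value[s2];
                    }
                    if (q > best) {
                        best = q;
                        policy[s] = a; // greedy policy extraction
                    }
                }
                newValue[s] = best;
            }
            value = newValue;
        }
        return policy;
    }
}
```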

Visualization of the policy generated by value iteration. Click on the demonstration to reset the simulation.

The visualization demonstrates the policy generated by value iteration. The environment consists of movement on a grid. The agent can choose to move up, down, left, or right. However, there is a chance for the agent to "slip," causing the resulting movement to be perpendicular to the desired movement. There is a single position in the center of the grid where the agent receives a reward, so the agent tends to move towards the center.
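
The dynamics just described can be sketched as follows; the slip probability, the boundary handling, and the reward value are placeholders rather than the exact settings used in the demonstration.

```java
import java.util.Random;

/** Sketch of the gridworld transition: with some probability the move slips sideways. */
final class GridWorldStep {
    // dx, dy offsets for up, down, left, right
    static final int[][] MOVES = {{0, -1}, {0, 1}, {-1, 0}, {1, 0}};

    static int[] step(int x, int y, int action, int size,
                      double slipProbability, Random rng) {
        int[] move = MOVES[action];
        if (rng.nextDouble() < slipProbability) {
            // Slip: move perpendicular to the intended direction instead.
            boolean intendedVertical = move[0] == 0;
            int[][] perpendicular = intendedVertical
                    ? new int[][]{MOVES[2], MOVES[3]}   // left or right
                    : new int[][]{MOVES[0], MOVES[1]};  // up or down
            move = perpendicular[rng.nextInt(2)];
        }
        // Clamp to the grid so the agent cannot leave the board.
        int newX = Math.max(0, Math.min(size - 1, x + move[0]));
        int newY = Math.max(0, Math.min(size - 1, y + move[1]));
        return new int[]{newX, newY};
    }

    /** Reward is given only at the center cell. */
    static double reward(int x, int y, int size) {
        return (x == size / 2 && y == size / 2) ? 1.0 : 0.0;
    }
}
```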

Multi-Armed Bandits

Multi-armed bandits are a specific type of Markov decision process in which the state space is a singleton. However, in bandit settings, the reward received after taking an action is usually not deterministic. The agent would therefore like to perform the action that maximizes the expected reward.

In a planning setting, the solution to this problem is trivial. Instead, we care about the agent's policy when it is given no information about the environment. In these cases, the agent must balance exploration and exploitation.

The challenge is to balance exploration and exploitation so that the agent learns which action is best while still collecting as much reward as possible along the way. To quantify this trade-off, we use the notion of regret.

$$\text{Regret}(a_i) = \mu^* - \mu_i$$

The regret associated with taking an action is the difference between the expected reward of the optimal action and the expected reward of that action. If the regret for an action is zero, then that action is optimal. For online learning algorithms, we consider the total regret accumulated across all actions taken over the course of the episode. As more actions are taken, we would like the total regret to grow sublinearly with respect to the number of actions taken.
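
Concretely, if a_t denotes the action taken at step t, the total regret after T actions and the sublinearity requirement can be written as:

$$\text{TotalRegret}(T) = \sum_{t=1}^{T} \left(\mu^* - \mu_{a_t}\right), \qquad \frac{\text{TotalRegret}(T)}{T} \to 0 \;\text{ as }\; T \to \infty$$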

For the bandit setting, we have the Upper Confidence Bound (UCB) algorithm. When choosing an action, UCB considers the empirical average reward for each action, along with how many times each action has been played. It then forms a pseudo-reward that favors actions with either a high empirical average reward or a low sample count.

$$\tilde{r}(a_i) = \hat{\mu}_i + c \cdot \sqrt{\frac{1}{N_i}}$$

If the constant c is chosen properly, theory provides a guarantee that with high probability, the total regret is bounded by a square root curve.
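
A small, self-contained simulation of this rule might look like the following; the arm means, the Gaussian reward noise, the value of c, and the step count are arbitrary choices for illustration and do not correspond to the experiment plotted below.

```java
import java.util.Arrays;
import java.util.Random;

/** Sketch of UCB: play the arm with the highest empirical mean plus exploration bonus. */
final class UcbDemo {
    public static void main(String[] args) {
        double[] trueMeans = {0.2, 0.5, 0.7, 0.4}; // unknown to the agent; used only to simulate rewards
        double c = 1.0;                            // exploration constant
        int steps = 10_000;
        Random rng = new Random(0);

        int numArms = trueMeans.length;
        double[] sumRewards = new double[numArms];
        int[] counts = new int[numArms];
        double bestMean = Arrays.stream(trueMeans).max().getAsDouble();
        double totalRegret = 0.0;

        for (int t = 0; t < steps; t++) {
            int arm;
            if (t < numArms) {
                arm = t; // play each arm once so every estimate is defined
            } else {
                arm = 0;
                double bestScore = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < numArms; i++) {
                    double empiricalMean = sumRewards[i] / counts[i];
                    double bonus = c * Math.sqrt(1.0 / counts[i]); // favors rarely played arms
                    double score = empiricalMean + bonus;          // pseudo-reward from above
                    if (score > bestScore) {
                        bestScore = score;
                        arm = i;
                    }
                }
            }
            double reward = trueMeans[arm] + 0.1 * rng.nextGaussian(); // noisy reward
            sumRewards[arm] += reward;
            counts[arm]++;
            totalRegret += bestMean - trueMeans[arm]; // regret incurred by this action
        }
        System.out.printf("Total regret after %d steps: %.2f%n", steps, totalRegret);
    }
}
```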

Graph of the total regret and average regret. The total regret is sublinear, so the average regret decreases as more actions are played. The total regret plot strongly resembles the square root function, matching the theoretical guarantees for the UCB algorithm.

Markov Games

Partially Observable Environments

Attack Research