Abstract
My primary interest lies in reinforcement learning. For both research and learning purposes, I decided to write programs in Java to experiment with various reinforcement learning algorithms. This page includes records of these experiments and any additions to my Java library that resulted from them.
Markov Decision Processes
An infinite horizon Markov decision process is represented by a collection $(S, A, R, P, \mu, \gamma)$.
- $S$ represents the state space.
- $A$ represents the action space.
- $R(s, a)$ specifies a reward for taking a given action $a$ in a given state $s$.
- $P(\cdot \mid s, a)$ provides a probability distribution over the state space. Given the current state $s$ and the action $a$ taken, the next state is sampled from this probability distribution.
- $\mu$ is the probability distribution from which the initial state is sampled.
- $\gamma \in [0, 1)$ is a discount factor applied to future rewards.
To interact with a Markov decision process, first sample the initial state $s_0 \sim \mu$. Then, repeat the following for every time step $t \geq 0$ (a Java sketch of this loop appears after the list).
- Choose an action $a_t$ using a policy $\pi$.
- Receive the reward $R(s_t, a_t)$ based on the current state and action.
- Determine the next state by sampling $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
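As a concrete illustration, here is a minimal Java sketch of this interaction loop. The `Mdp` and `Policy` interfaces are hypothetical stand-ins rather than the actual interfaces in my library, the policy is deterministic for simplicity, and the infinite interaction is truncated after a fixed number of steps.

```java
import java.util.Random;

/** Minimal sketch of an MDP interaction loop (interfaces are hypothetical, not library code). */
public class MdpRollout {

    /** Hypothetical environment interface mirroring (S, A, R, P, mu, gamma). */
    interface Mdp {
        int sampleInitialState(Random rng);                      // s_0 ~ mu
        double reward(int state, int action);                    // R(s, a)
        int sampleNextState(int state, int action, Random rng);  // s' ~ P(. | s, a)
    }

    /** A policy maps a state to an action (deterministic for simplicity). */
    interface Policy {
        int action(int state);
    }

    /** Runs the interaction for a fixed number of steps and returns the discounted return. */
    static double rollout(Mdp mdp, Policy pi, double gamma, int steps, Random rng) {
        int state = mdp.sampleInitialState(rng);
        double discountedReturn = 0.0;
        double discount = 1.0;
        for (int t = 0; t < steps; t++) {
            int action = pi.action(state);                             // a_t = pi(s_t)
            discountedReturn += discount * mdp.reward(state, action);  // gamma^t * R(s_t, a_t)
            state = mdp.sampleNextState(state, action, rng);           // s_{t+1} ~ P(. | s_t, a_t)
            discount *= gamma;
        }
        return discountedReturn;
    }
}
```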
The goal of reinforcement learning is to find a policy $\pi$ that maximizes the total expected discounted reward received during this interaction when actions are determined using $\pi$. This quantity, $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right]$, is called the value and depends on the chosen policy. Note that the discount factor $\gamma < 1$ is required to make this summation finite because the interaction never terminates.
The value function is a core concept in reinforcement learning. Some variations of the value function add a condition to the expectation. The state value function $V^{\pi}(s)$ is the total expected discounted reward given the current state. The action value function $Q^{\pi}(s, a)$ is the total expected discounted reward given the current state and the first action taken.
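Written out in standard notation, with the expectation taken over $a_t \sim \pi$ and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, these conditioned values are

$$V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s\right], \qquad Q^{\pi}(s, a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right].$$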
Planning
Planning algorithms determine an optimal policy when given full information about the Markov decision process. Four standard algorithms exist for solving an infinite horizon Markov decision process.
- Value iteration iteratively computes an estimate of the action value function by repeatedly applying the Bellman optimality operator. After a set number of iterations, a policy can be extracted from the action value estimate by taking an action that maximizes the action value in each state (a Java sketch of this procedure appears after the list).
- Policy iteration uses a closed form linear equation, $V^{\pi} = R^{\pi} + \gamma P^{\pi} V^{\pi}$, to find the state value function associated with a policy. A new policy is created by finding the best action for each state using the calculated value function. This process repeats until the value function does not change across iterations.
- The remaining two algorithms use linear programming. Although I did perform experiments with these algorithms, my linear program solver was highly unstable and could not handle the complexity of the problem.
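As referenced above, here is a minimal Java sketch of tabular value iteration over the action value function. The array-based MDP representation and the fixed iteration count are assumptions for illustration; this is not the code from my library.

```java
/** Minimal sketch of tabular value iteration (illustrative, not library code). */
public class ValueIteration {

    /**
     * @param reward     reward[s][a] = R(s, a)
     * @param transition transition[s][a][s2] = P(s2 | s, a)
     * @param gamma      discount factor in [0, 1)
     * @param iterations number of Bellman operator applications
     * @return greedy policy extracted from the final action value estimate
     */
    static int[] solve(double[][] reward, double[][][] transition, double gamma, int iterations) {
        int numStates = reward.length;
        int numActions = reward[0].length;
        double[][] q = new double[numStates][numActions];

        for (int iter = 0; iter < iterations; iter++) {
            double[][] next = new double[numStates][numActions];
            for (int s = 0; s < numStates; s++) {
                for (int a = 0; a < numActions; a++) {
                    // Bellman operator: Q(s, a) <- R(s, a) + gamma * sum_s' P(s' | s, a) * max_a' Q(s', a')
                    double expectedFuture = 0.0;
                    for (int s2 = 0; s2 < numStates; s2++) {
                        expectedFuture += transition[s][a][s2] * max(q[s2]);
                    }
                    next[s][a] = reward[s][a] + gamma * expectedFuture;
                }
            }
            q = next;
        }

        // Extract a greedy policy: in each state, pick an action maximizing the action value.
        int[] policy = new int[numStates];
        for (int s = 0; s < numStates; s++) {
            policy[s] = argmax(q[s]);
        }
        return policy;
    }

    private static double max(double[] values) {
        double best = values[0];
        for (double v : values) best = Math.max(best, v);
        return best;
    }

    private static int argmax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) if (values[i] > values[best]) best = i;
        return best;
    }
}
```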
Although policy iteration appears to converge in far fewer iterations than value iteration, it should be noted that each iteration of policy iteration is more computationally expensive because it solves a linear system. It should also be noted that although the action value estimate used for value iteration had not fully converged within the set number of iterations, an optimal policy was still extracted. As a result, value iteration generally runs faster and is used more commonly in practice.
The visualization demonstrates the policy generated by value iteration. The environment consists of movement on a grid. The agent can choose to move up, down, left, or right. However, there is a chance for the agent to "slip," causing the resulting movement to be perpendicular to the desired movement. There is a single position in the center of the grid where the agent receives a reward, so the agent tends to move towards the center.
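For reference, a small Java sketch of the slip dynamics might look like the following. The slip probability, the boundary handling, and the equal split between the two perpendicular directions are assumptions, since the visualization does not spell them out.

```java
import java.util.Random;

/** Illustrative sketch of the "slippery" grid movement described above (not the exact library code). */
public class SlipperyGrid {
    // Actions: 0 = up, 1 = right, 2 = down, 3 = left.
    private static final int[] DR = {-1, 0, 1, 0};
    private static final int[] DC = {0, 1, 0, -1};

    /** With probability slipProb, the agent moves perpendicular to the intended direction. */
    static int[] step(int row, int col, int action, int size, double slipProb, Random rng) {
        int actual = action;
        if (rng.nextDouble() < slipProb) {
            // Slip to one of the two perpendicular directions with equal probability.
            actual = (action + (rng.nextBoolean() ? 1 : 3)) % 4;
        }
        // Clamp to the grid so the agent cannot leave the environment.
        int newRow = Math.min(size - 1, Math.max(0, row + DR[actual]));
        int newCol = Math.min(size - 1, Math.max(0, col + DC[actual]));
        return new int[] {newRow, newCol};
    }
}
```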
Multi-Armed Bandits
Multi-armed bandits are a specific type of Markov decision process in which the state space is a singleton. However, in bandit settings, the reward given to the agent after an action is taken is usually not deterministic. The agent would therefore like to perform the action that maximizes the expected reward.
In a planning setting, the solution to this problem is trivial: always play the action whose reward distribution has the highest expectation. Instead, we care about the agent's policy when it is given no information about the environment. In these cases, the agent must balance exploration and exploitation.
- Exploration consists of trying various different actions to gain information about the reward distributions. Although exploration does not attempt to maximize the rewards in the current time step, the information gain may help the agent determine optimal future actions.
- Exploitation uses past information to determine the action that maximizes the expected reward for the current time step.
The delicate balance lies in combining exploration and exploitation so that the total expected reward accumulated along the way is maximized. To quantify this, we use the notion of regret.
The regret associated with taking an action is the difference between the expected reward for the optimal action and the expected reward for that action. If the regret for an action is zero, then that action is optimal. For online learning algorithms, we consider the total regret across all actions taken over the course of the episode. As more actions are taken, we would like the total regret to be sublinear with respect to the number of actions taken.
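In symbols, using standard bandit notation where $\mu_a$ denotes the expected reward of action $a$ and $\mu^{*} = \max_a \mu_a$, the total regret after $T$ actions is

$$\mathrm{Regret}(T) = \sum_{t=1}^{T} \left(\mu^{*} - \mu_{a_t}\right),$$

and sublinear regret means $\mathrm{Regret}(T)/T \to 0$, so the actions taken become nearly optimal on average.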
For the bandit setting, we have the Upper Confidence Bound (UCB) algorithm. When choosing an action, UCB considers the empirical average reward for each action, along with how many times each action was played. It then creates a pseudo-reward that favors actions with a high empirical reward expectation or a low sample size.
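The usual form of this pseudo-reward, often called the UCB index, is

$$\mathrm{UCB}_t(a) = \hat{\mu}_t(a) + c \sqrt{\frac{\ln t}{n_t(a)}},$$

where $\hat{\mu}_t(a)$ is the empirical average reward of action $a$, $n_t(a)$ is the number of times $a$ has been played so far, and $c$ is an exploration constant.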
If the constant $c$ is chosen properly, theory provides a guarantee that, with high probability, the total regret is bounded by a square root curve in the number of actions taken.
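A minimal Java sketch of UCB action selection might look like the following. The exploration constant, the play-each-arm-once initialization, and the tie-breaking behavior are assumptions for illustration; this is not the code from my library.

```java
/** Minimal UCB sketch for a K-armed bandit (illustrative, not library code). */
public class Ucb {
    private final double[] totalReward;   // sum of observed rewards per action
    private final int[] pullCount;        // number of times each action was played
    private final double c;               // exploration constant (assumed tunable)
    private int totalPulls = 0;

    Ucb(int numActions, double c) {
        this.totalReward = new double[numActions];
        this.pullCount = new int[numActions];
        this.c = c;
    }

    /** Returns the action with the highest UCB index; unplayed actions are tried first. */
    int selectAction() {
        int best = -1;
        double bestIndex = Double.NEGATIVE_INFINITY;
        for (int a = 0; a < pullCount.length; a++) {
            if (pullCount[a] == 0) {
                return a; // play every action at least once
            }
            double empiricalMean = totalReward[a] / pullCount[a];
            double bonus = c * Math.sqrt(Math.log(totalPulls) / pullCount[a]);
            double index = empiricalMean + bonus; // the pseudo-reward described above
            if (index > bestIndex) {
                bestIndex = index;
                best = a;
            }
        }
        return best;
    }

    /** Records the observed reward for the chosen action. */
    void update(int action, double reward) {
        totalReward[action] += reward;
        pullCount[action]++;
        totalPulls++;
    }
}
```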