Exploration RL Lab: From Bandits to Gridworld

This lab introduces exploration versus exploitation in the simplest setting first, then carries that intuition into a compact reinforcement learning environment where state, delayed consequences, and path quality all matter.

Learning arc

Stage 1: one state, many actions

In a bandit, every decision starts from the same situation. The only question is which arm to trust next.

Stage 2: feedback becomes delayed

Once actions move the agent through space, each choice changes what can happen next and how rewards accumulate.

Stage 3: policy quality is a path property

In the gridworld, efficient exploration discovers a reliable route, avoids traps, and improves long-run returns.

Explore action choice before state enters the picture

The bandit setting isolates the exploration problem: there is no map, no future state, and no path planning. Each pull simply asks whether your current estimates are good enough to exploit or whether you should keep probing alternatives.

Simple Math

How the bandit strategies decide which arm to pull

Each method keeps a running estimate for every arm, then balances value and uncertainty in a different way when it picks the next action.

epsilon-greedy
Explore with probability epsilon.
Otherwise choose arg max_a q_hat(a).

Most steps exploit the best estimate so far, while a fixed random-exploration rate prevents early lock-in.
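The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's actual implementation; the incremental-mean update is one standard way to maintain q_hat.

```python
import random

def epsilon_greedy(q_hat, epsilon, rng=random):
    """Explore uniformly with probability epsilon, else exploit arg max q_hat."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_hat))
    # Exploit: pick the arm with the highest current estimate.
    return max(range(len(q_hat)), key=lambda a: q_hat[a])

def update_estimate(q_hat, counts, arm, reward):
    """Incremental mean: q_hat[arm] tracks the average reward observed so far."""
    counts[arm] += 1
    q_hat[arm] += (reward - q_hat[arm]) / counts[arm]
```

With epsilon = 0 this is pure exploitation; raising epsilon trades some immediate reward for information about the other arms.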

UCB
q_hat(a) + c * sqrt(log t / n(a))

Arms with less data receive a larger confidence bonus, so uncertainty itself becomes a reason to sample them.
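The score can be computed directly from the formula. A sketch, assuming every arm has been pulled at least once so n(a) > 0 (a common initialization is to pull each arm once before scoring):

```python
import math

def ucb_pick(q_hat, counts, t, c=1.0):
    """Pick the arm maximizing estimate + confidence bonus.

    t is the total number of pulls so far; counts[a] must be >= 1.
    Rarely pulled arms get a larger bonus, so uncertainty drives sampling.
    """
    scores = [q_hat[a] + c * math.sqrt(math.log(t) / counts[a])
              for a in range(len(q_hat))]
    return max(range(len(scores)), key=lambda a: scores[a])
```

Note how an arm with an equal estimate but fewer pulls wins the comparison purely on its bonus term.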

Thompson sampling
Sample theta_a ~ Beta(alpha_a, beta_a)
Choose arg max_a theta_a

This strategy explores by acting on a plausible reward-rate sample from each arm's posterior rather than by adding an explicit bonus.
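For Bernoulli rewards this reduces to Beta sampling. A sketch, assuming a uniform Beta(1, 1) prior on each arm:

```python
import random

def thompson_pick(alpha, beta, rng=random):
    """Sample a plausible reward rate from each arm's Beta posterior, pick the max."""
    samples = [rng.betavariate(alpha[a], beta[a]) for a in range(len(alpha))]
    return max(range(len(samples)), key=lambda a: samples[a])

def thompson_update(alpha, beta, arm, reward):
    """Bernoulli reward: a success bumps alpha, a failure bumps beta."""
    if reward:
        alpha[arm] += 1
    else:
        beta[arm] += 1
```

Early on, wide posteriors make every arm a plausible winner; as counts accumulate, the samples concentrate and exploration tapers off on its own.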

Live run metrics: pulls, cumulative reward, regret, average reward, and the arm currently estimated best.

Arm estimates and pull counts

Each row compares the learned estimate with the hidden reward probability the simulator is currently using.

Legend: estimated value vs. latent mean.

Reward trail

Compare realized reward with an oracle baseline that always picks the current best arm.

Legend: cumulative reward vs. oracle reward, plotted over the pull history shown.

Oracle gap summary

This summary reports the oracle-minus-learner reward gap over the whole run and over the most recent 50 pulls.
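The gap itself is a simple difference of sums. A sketch (the function name and window default are illustrative):

```python
def oracle_gap(rewards, oracle_rewards, window=50):
    """Oracle-minus-learner reward gap over the whole run and the last `window` pulls."""
    total = sum(oracle_rewards) - sum(rewards)
    recent = sum(oracle_rewards[-window:]) - sum(rewards[-window:])
    return total, recent
```

A shrinking recent gap indicates the policy has converged toward the oracle's choices even if early mistakes left a large total gap.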

Recent decisions

Use the log to spot when the policy commits or deliberately re-explores.


From isolated action choice to sequential decision making

A bandit can be read as reinforcement learning with only one state: there are multiple actions, but every pull returns you to the same decision point. A gridworld adds many states, delayed consequences, and transitions, so the cost of a choice can propagate several steps into the future.

Bandit = one state, many actions

Exploration asks: which option seems best when feedback is immediate and the decision context never changes?

Gridworld = many states, delayed consequences

Now each move changes the next state. A poor exploratory step can lead toward a trap or away from the goal.

Exploration still matters

The same tension remains, but decisions now shape future opportunities. Learning a policy means learning where each action leads.

Transfer the exploration lesson into a small RL environment

The gridworld keeps the idea compact but introduces state, path dependence, and delayed rewards. Watch the agent trade off immediate moves against long-run return while the learned policy map gradually sharpens.

Simple Math

How the gridworld learner updates its action values

In reinforcement learning the estimate is attached to a state-action pair. After each move, the learner nudges that estimate toward a target built from the immediate reward and the next state's value.

Q-learning
Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]

Q-learning looks ahead to the best possible next action, even if the behavior policy is still exploring.
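A minimal tabular form of this update (the state/action encodings and the `done` flag for terminal states are illustrative choices, not the lab's exact code):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions,
                      alpha=0.1, gamma=0.95, done=False):
    """Off-policy TD update: bootstrap from the best next action's value.

    Q maps (state, action) -> value; `done` zeroes the bootstrap at terminal states.
    """
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

Because the target uses max over next actions, the learned values track the greedy policy even while behavior is exploratory.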

SARSA
Q(s,a) <- Q(s,a) + alpha [r + gamma Q(s',a') - Q(s,a)]

SARSA updates toward the action the agent actually expects to take next, so it learns under its own exploratory behavior.
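The on-policy variant needs the action actually chosen at s'. A sketch under the same illustrative conventions:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next,
                 alpha=0.1, gamma=0.95, done=False):
    """On-policy TD update: bootstrap from the action the agent will actually take."""
    next_value = 0.0 if done else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * next_value - Q[(s, a)])
```

The only difference from Q-learning is the bootstrap term: Q(s', a') for the sampled next action instead of max over all next actions, which is why SARSA's values reflect the exploratory policy itself.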

Exploration policy
Choose the greedy action most of the time, but explore with epsilon.

Optional epsilon decay makes the learner search broadly at first and then become more stable once it has mapped the environment.
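A common decay schedule for this is multiplicative with a floor; the rate and floor below are illustrative defaults:

```python
def decay_epsilon(epsilon, rate=0.995, floor=0.05):
    """Multiplicative epsilon decay with a floor, applied once per episode."""
    return max(floor, epsilon * rate)
```

The floor keeps a small residual exploration rate so the policy never freezes completely in a possibly changing environment.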

Live training metrics: episodes trained, current epsilon, last episode's reward and length, and goal rate over the last 20 episodes.

Policy and value map

Cell shading reflects the current value estimate. Icons show the greedy action, and numbered badges mark the latest trajectory.

Legend: greedy-action icons and path markers.

Episode reward history

Returns should rise as the policy discovers a safer, shorter route to the goal.


Most recent trajectory

This makes the last episode legible as a path rather than only as a scalar reward.
