Exploration RL Lab: From Bandits to Gridworld
This lab introduces exploration versus exploitation in the simplest setting first, then carries that intuition into a compact reinforcement learning environment where state, delayed consequences, and path quality all matter.
Learning arc
In a bandit, every decision starts from the same situation. The only question is which arm to trust next.
Once actions move the agent through space, each choice changes what can happen next and how rewards accumulate.
In the gridworld, efficient exploration discovers a reliable route, avoids traps, and improves long-run returns.
Explore action choice before state enters the picture
The bandit setting isolates the exploration problem: there is no map, no future state, and no path planning. Each pull simply asks whether your current estimates are good enough to exploit or whether you should keep probing alternatives.
How the bandit strategies decide which arm to pull
Each method keeps a running estimate for every arm, then balances value and uncertainty in its own way when it picks the next action.
Epsilon-greedy: most steps exploit the best estimate so far, while a fixed random-exploration rate prevents early lock-in.
Upper Confidence Bound (UCB): arms with less data receive a larger confidence bonus, so uncertainty itself becomes a reason to sample them.
Thompson sampling: this strategy explores by acting on a plausible reward-rate sample from each arm's posterior rather than by adding an explicit bonus.
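The three selection rules can be sketched in a few lines each. This is a minimal illustration, not the lab's implementation; the function names and the UCB exploration constant `c` are assumptions.

```python
import math
import random

def pick_epsilon_greedy(values, epsilon=0.1):
    """With probability epsilon pick a random arm; otherwise exploit the best estimate."""
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])

def pick_ucb(values, counts, t, c=2.0):
    """Add a confidence bonus that shrinks as an arm accumulates pulls."""
    def score(a):
        if counts[a] == 0:
            return float("inf")  # unpulled arms are sampled first
        return values[a] + c * math.sqrt(math.log(t) / counts[a])
    return max(range(len(values)), key=score)

def pick_thompson(successes, failures):
    """Sample a plausible reward rate from each arm's Beta posterior and act on it."""
    samples = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])
```

Note how uncertainty enters differently: epsilon-greedy explores blindly at a fixed rate, UCB adds a deterministic bonus, and Thompson sampling lets posterior randomness do the exploring.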
Arm estimates and pull counts
Each row compares the learned estimate with the current hidden reward chance used by the simulator.
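The learned estimate in each row can be maintained incrementally, without storing past rewards. A minimal sketch (the function name is illustrative, not from the lab):

```python
def update_estimate(estimate, count, reward):
    """Incremental mean: nudge the old estimate toward the new reward by 1/count."""
    count += 1
    estimate += (reward - estimate) / count
    return estimate, count
```

After n pulls this equals the sample mean of the n rewards, so the table's estimates converge toward each arm's hidden reward chance as its pull count grows.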
Reward trail
Compare realized reward with an oracle baseline that always picks the current best arm.
Oracle gap summary
This summarizes the oracle-minus-learner reward gap over the whole run and over the most recent 50 pulls.
Recent decisions
Use the log to spot when the policy commits or deliberately re-explores.
From isolated action choice to sequential decision making
A bandit can be read as reinforcement learning with only one state: there are multiple actions, but every pull returns you to the same decision point. A gridworld adds many states, delayed consequences, and transitions, so the cost of a choice can propagate several steps into the future.
Bandit = one state, many actions
Exploration asks: which option seems best when feedback is immediate and the decision context never changes?
Gridworld = many states, delayed consequences
Now each move changes the next state. A poor exploratory step can lead toward a trap or away from the goal.
Exploration still matters
The same tension remains, but decisions now shape future opportunities. Learning a policy means learning where each action leads.
Transfer the exploration lesson into a small RL environment
The gridworld keeps the idea compact but introduces state, path dependence, and delayed rewards. Watch the agent trade off immediate moves against long-run return while the learned policy map gradually sharpens.
How the gridworld learner updates its action values
In reinforcement learning the estimate is attached to a state-action pair. After each move, the learner nudges that estimate toward a target built from the immediate reward and the next state's value.
Q-learning looks ahead to the best possible next action, even if the behavior policy is still exploring.
SARSA updates toward the action the agent actually expects to take next, so it learns under its own exploratory behavior.
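The two update rules differ only in how the target bootstraps from the next state. A minimal sketch with `Q` as a table indexed by state then action; function names and the default `alpha`/`gamma` values are assumptions, not the lab's settings.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the best next action, whatever the agent actually does."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action the agent will actually take next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

Because SARSA's target includes the exploratory action actually chosen, it tends to learn more conservative routes near traps, while Q-learning learns the greedy-optimal path directly.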
Optional epsilon decay makes the learner search broadly at first and then become more stable once it has mapped the environment.
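One common way to implement such a schedule is exponential decay toward a floor; this is a sketch of that pattern, and the name and default constants are illustrative rather than the lab's actual values.

```python
def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Exponential decay toward a floor: broad search early, mostly greedy later."""
    return max(end, start * decay ** episode)
```

The floor keeps a small amount of exploration alive so the agent can still notice changes after it has settled on a route.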
Policy and value map
Cell shading reflects the current value estimate. Icons show the greedy action, and numbered badges mark the latest trajectory.
Episode reward history
Returns should rise as the policy discovers a safer, shorter route to the goal.
Most recent trajectory
This makes the last episode legible as a path rather than only as a scalar reward.