Tunnelling: Escaping Local Optima
How quantum tunnelling enables exploration beyond energy barriers
Understanding Quantum Tunnelling
In classical physics, a particle needs enough energy to cross a barrier. If a ball doesn't have enough energy to roll over a hill, it stays on one side. But in quantum mechanics, particles can "tunnel" through barriers even when they don't have enough energy.
This happens because quantum particles are described by wave functions that extend into classically forbidden regions. There's a non-zero probability of finding the particle on the other side of the barrier, even when it "shouldn't" be able to get there.
The Problem in Reinforcement Learning
In optimization and RL, we often get stuck in local optima. Consider a gridworld with a reward landscape:
- Local optimum: A nearby position with reward +5
- Global optimum: A distant position with reward +20, but separated by a valley of negative rewards (-10)
Classical RL agents (like ε-greedy or softmax) will converge to the local optimum and rarely explore the valley. They need to "tunnel through" the negative reward barrier to reach the global optimum.
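To make this concrete, below is a minimal sketch of such a landscape: a 1-D corridor with a nearby +5 cell, a distant +20 cell, and a -10 valley in between, plus a myopic greedy rollout that settles next to the +5 cell. The layout, reward values, and helper names are illustrative assumptions, not part of any QiRL implementation.

# Illustrative 1-D corridor: index 1 is the local optimum (+5), index 6 the
# global optimum (+20), index 5 the negative-reward valley (-10).
REWARDS = [0, 5, 0, 0, 0, -10, 20]
START = 3

def greedy_rollout(start, steps=10):
    """Follow immediate reward greedily: move to whichever neighbour pays more."""
    pos, total = start, 0
    for _ in range(steps):
        left = REWARDS[pos - 1] if pos > 0 else float("-inf")
        right = REWARDS[pos + 1] if pos < len(REWARDS) - 1 else float("-inf")
        pos = pos - 1 if left >= right else pos + 1
        total += REWARDS[pos]
    return pos, total

# The myopic agent walks to the +5 cell and shuttles around it; reaching the
# +20 cell requires first accepting the -10 step, which greedy behaviour (and
# epsilon-greedy with a small epsilon) almost never does.
print(greedy_rollout(START))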
Optimization Landscape Example: Imagine optimizing a neural network policy. The loss landscape has many local minima, and gradient descent gets stuck in one of them. Quantum tunnelling allows the agent to explore beyond the local basin, potentially finding better solutions.
Chess Example: In chess, a move might look bad in the short term (losing a piece) but lead to a winning position later. Classical RL might avoid this move because it crosses through a "valley" of negative immediate rewards. Quantum tunnelling allows exploring these apparently suboptimal paths.
Why the Agent Got Stuck
Why local exploration gets trapped in suboptimal solutions
The Problematic Approach
Classical RL agents use standard exploration strategies: small perturbations around the current policy and short-horizon reward signals. This creates a fundamental problem: any attempt to explore beyond the local optimum requires passing through regions of lower immediate reward, which the agent's gradient-based updates actively avoid.
Critical Issues:
- Local perturbations only explore a small neighbourhood of the current policy.
- Short-horizon reward signals penalize any path that dips into the negative-reward valley.
- Gradient-based updates actively steer away from lower-reward regions, so the far side of the barrier is never sampled.
Our Tunnelling QiRL Design
Occasional non-local jumps in policy space
The Quantum-Inspired Solution
Quantum tunnelling provides a mechanism for escaping local optima: instead of being trapped by energy barriers (negative reward valleys), the agent can "tunnel" through them to discover globally optimal solutions. We implement this by complementing local exploration with rare, structured "tunnel jumps" to alternative policy basins.
[Diagram: a tunnel jump from the current policy basin, through the reward barrier, to an alternative policy basin with better long-term reward.]
- Policy Basin Exploration: The agent maintains alternative policy configurations that represent different exploration strategies, enabling jumps between policy basins.
- Temporal Commitment: The agent commits to an alternative policy for a full evaluation period, allowing it to escape local optima and explore globally better solutions.
- Long-Horizon Evaluation: Policy basins are compared using long-horizon value estimates, not just immediate rewards, enabling discovery of globally optimal strategies.
If the new basin proves better, we shift the main policy towards it. If not, we tunnel back. Either way, we gain evidence for or against that exploration strategy, which over time steers the agent towards globally optimal policies.
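One simple way to realise the "shift the main policy towards it" step is linear interpolation in parameter space. The sketch below assumes the policy is a flat NumPy parameter vector; the class name and the step size tau are hypothetical, serving only to illustrate the move_towards update used in the production view that follows.

import numpy as np

class TunnellingAgent:
    """Minimal stand-in for the agent: the policy is a flat parameter vector."""

    def __init__(self, params: np.ndarray):
        self.params = params

    def move_towards(self, scenario_params: np.ndarray, tau: float = 0.5) -> None:
        # tau = 1.0 jumps all the way into the new basin (a full tunnel);
        # smaller values blend the current behaviour with the discovered one.
        self.params = (1.0 - tau) * self.params + tau * scenario_params

agent = TunnellingAgent(np.zeros(4))
agent.move_towards(np.array([1.0, -2.0, 0.5, 3.0]))
print(agent.params)  # halfway towards the better basin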
Production View: Scenario Tunnelling
How the agent jumps and evaluates
Tunnelling Implementation
We extend the training loop with a mechanism that occasionally samples scenario policies, runs them for a full evaluation horizon, and compares their performance to the current policy.
for epoch in range(num_epochs):
    if should_tunnel(epoch):
        # Sample a scenario policy from the library
        scenario = scenario_library.sample()
        returns = run_policy_for_horizon(scenario, env, horizon=H)
        scenario_value = aggregate_returns(returns)
        # Compare with current policy basin
        baseline_returns = run_policy_for_horizon(agent.policy, env, horizon=H)
        baseline_value = aggregate_returns(baseline_returns)
        if scenario_value > baseline_value + delta:
            agent.move_towards(scenario)  # update parameters towards better basin
    else:
        # Standard local RL update
        batch = collect_experience(agent.policy, env)
        agent.update_locally(batch)
scenarios:
  - name: exploratory
    description: "High exploration rate, prioritizes long-term value"
  - name: conservative
    description: "Low exploration, focuses on immediate rewards"
  - name: balanced
    description: "Moderate exploration with adaptive exploration rate"
Key Features:
- Rare but structured exploration: tunnelling events are special, not random noise.
- Barrier-aware exploration: tunnelling probability depends on barrier height and width, enabling principled exploration beyond local optima (a sketch follows this list).
- Long-horizon comparison: decisions are grounded in multi-step value.
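As a concrete sketch of the barrier-aware bullet, here is a variant of should_tunnel(epoch) that also receives estimates of the barrier's height and width. The extra arguments, the warm-up period, the base rate, and the scale constant k are assumptions for illustration; only the exponential decay with sqrt(height) × width mirrors the physical transmission probability.

import math
import random

def should_tunnel(epoch, barrier_height, barrier_width,
                  base_rate=0.05, k=0.5, warmup=10):
    """Decide whether this epoch triggers a tunnel jump."""
    if epoch < warmup:
        return False  # no tunnelling while the local policy is still forming
    # Probability decays exponentially with sqrt(height) * width and is capped
    # by a small base rate, so tunnelling stays rare and structured.
    p = base_rate * math.exp(-k * math.sqrt(max(barrier_height, 0.0)) * barrier_width)
    return random.random() < p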
Summary: Quantum Tunnelling for Global Optimization
How quantum tunnelling enables escape from local optima
Key Insights
Quantum tunnelling provides a mechanism for RL agents to explore beyond local optima. Unlike classical methods that get trapped by energy barriers (negative reward valleys), quantum-inspired algorithms can tunnel through these barriers to discover globally optimal solutions.
Chess Example: A move that sacrifices material (negative immediate reward) but leads to a winning endgame (positive long-term reward) requires "tunnelling" through the valley of negative rewards. Classical RL might avoid this, but quantum tunnelling enables exploration of such paths.
- Barrier Penetration: Quantum wave functions extend into classically forbidden regions, enabling exploration of states separated by negative reward barriers.
- Global Optima Discovery: Agents can discover globally optimal policies even when separated from local optima by valleys of poor performance.
- Efficient Exploration: Tunnelling provides a principled way to explore beyond local basins without relying solely on random exploration.
Quantum Insights
Insight 1: Wave Function Tunnelling
In quantum mechanics, the probability of tunnelling through a barrier of height V and width d decreases exponentially with √(V − E) × d, where E is the particle energy. In RL, this translates to exploring policies separated by reward barriers, with probability decreasing with barrier height and width.
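A numerical reading of that scaling, with the particle mass and ħ absorbed into a single constant k that is set to 1 purely for illustration:

import math

def tunnelling_probability(V, E, d, k=1.0):
    """Approximate transmission through a barrier: exp(-k * sqrt(V - E) * d)."""
    if E >= V:
        return 1.0  # above the barrier the particle passes classically
    return math.exp(-k * math.sqrt(V - E) * d)

# Doubling the barrier width squares the (already small) transmission probability:
print(tunnelling_probability(V=10.0, E=2.0, d=1.0))  # ~0.059
print(tunnelling_probability(V=10.0, E=2.0, d=2.0))  # ~0.0035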
Insight 2: Temperature and Tunnelling
Quantum tunnelling probability increases with temperature (thermal energy). In RL, this suggests that the exploration temperature should be tuned to enable tunnelling through reward barriers while maintaining reasonable sample efficiency.
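One way to read this in code is to let the exploration temperature scale down the effective barrier, so a hot agent tunnels often and a cooling schedule restores sample efficiency late in training. The schedule and all constants below are illustrative assumptions.

import math

def tempered_tunnel_probability(barrier, width, temperature):
    # Higher temperature lowers the effective barrier, making tunnelling more likely.
    effective_barrier = barrier / max(temperature, 1e-6)
    return math.exp(-math.sqrt(effective_barrier) * width)

# A simple linear cooling schedule over 100 epochs:
for epoch in (0, 50, 100):
    temperature = max(0.1, 1.0 - epoch / 100)
    print(epoch, round(tempered_tunnel_probability(barrier=8.0, width=1.0, temperature=temperature), 5))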
Continue the Quantum Saga
Tunnelling is the fourth story in the quantum saga. The final story is mixed states: how QiRL acts under irreducible uncertainty about the environment itself.