🚇

Tunnelling: Escaping Local Optima

How quantum tunnelling enables exploration beyond energy barriers

Understanding Quantum Tunnelling

In classical physics, a particle needs enough energy to cross a barrier. If a ball doesn't have enough energy to roll over a hill, it stays on one side. But in quantum mechanics, particles can "tunnel" through barriers even when they don't have enough energy.

This happens because quantum particles are described by wave functions that extend into classically forbidden regions. There's a non-zero probability of finding the particle on the other side of the barrier, even when it "shouldn't" be able to get there.
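
To make the barrier picture concrete, here is a minimal sketch (our own illustration, not part of the original text) of the standard WKB estimate for a rectangular barrier, where the transmission probability falls off as exp(-2*kappa*d) with kappa = sqrt(2m(V - E))/hbar:

import numpy as np

HBAR = 1.054571817e-34   # reduced Planck constant (J*s)

def tunnelling_probability(V, E, d, m):
    """WKB estimate of the probability of tunnelling through a rectangular
    barrier of height V (J) and width d (m) for a particle of mass m (kg)
    and energy E < V: T ~ exp(-2*kappa*d) with kappa = sqrt(2m(V-E))/hbar."""
    kappa = np.sqrt(2.0 * m * (V - E)) / HBAR
    return np.exp(-2.0 * kappa * d)

# Example: an electron with 0.5 eV of energy meeting a 1 eV barrier 0.5 nm wide.
eV, m_e = 1.602176634e-19, 9.1093837e-31
print(tunnelling_probability(V=1.0 * eV, E=0.5 * eV, d=0.5e-9, m=m_e))  # roughly a few percent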

The Problem in Reinforcement Learning

In optimization and RL, we often get stuck in local optima. Consider a gridworld with a reward landscape:

  • Local optimum: A nearby position with reward +5
  • Global optimum: A distant position with reward +20, but separated by a valley of negative rewards (-10)

Classical RL agents (using ε-greedy or softmax exploration) will converge to the local optimum and rarely explore the valley. They need to "tunnel through" the negative-reward barrier to reach the global optimum.
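
As a toy illustration of why this happens, the sketch below (hypothetical, with an arbitrary 1-D landscape matching the numbers above) shows a purely greedy hill-climber stopping at the +5 cell, because every route towards the +20 cell first passes through -10 cells:

import numpy as np

# 1-D reward landscape from the bullets above: a +5 local optimum,
# a valley of -10 rewards, and a distant +20 global optimum.
rewards = np.array([0, 2, 5, -10, -10, -10, 10, 20])

def greedy_hill_climb(rewards, start=0):
    """Step to whichever neighbouring cell has the higher reward; stop as
    soon as no neighbour improves on the current cell."""
    pos = start
    while True:
        neighbours = [p for p in (pos - 1, pos + 1) if 0 <= p < len(rewards)]
        best = max(neighbours, key=lambda p: rewards[p])
        if rewards[best] <= rewards[pos]:
            return pos   # stuck: every escape route lowers the immediate reward
        pos = best

print(greedy_hill_climb(rewards))   # -> 2, the +5 local optimum; index 7 (+20) is never reached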

Example: Optimization Landscape - Imagine optimizing a neural network policy. The loss landscape has many local minima. Gradient descent gets stuck in one. Quantum tunnelling allows the agent to explore beyond the local basin, potentially finding better solutions.

Chess Example: In chess, a move might look bad in the short term (losing a piece) but lead to a winning position later. Classical RL might avoid this move because it crosses through a "valley" of negative immediate rewards. Quantum tunnelling allows exploring these apparently suboptimal paths.

💥

Why the Agent Got Stuck

Why local exploration gets trapped in suboptimal solutions

The Problematic Approach

Classical RL agents use standard exploration strategies: small perturbations around the current policy and short-horizon reward signals. This creates a fundamental problem: any attempt to explore beyond the local optimum requires passing through regions of lower immediate reward, which the agent's gradient-based updates actively avoid.

โŒ Local Exploration Only
Current Policy
โ†’
Small Variations
โ†’
Local Optimum

Critical Issues:

🚨 Short horizon: value estimates are dominated by immediate rewards, making the agent myopic and unable to see long-term benefits beyond local optima.
🚨 No scenario jumps: the agent never tries radically different policies in a coordinated way.
🚨 Human beliefs ignored: strategy scenarios are not represented in the RL design.
🎯

Our Tunnelling QiRL Design

Occasional non-local jumps in policy space

The Quantum-Inspired Solution

Quantum tunnelling provides a mechanism for escaping local optima: instead of being trapped by energy barriers (negative reward valleys), the agent can "tunnel" through them to discover globally optimal solutions. We implement this by complementing local exploration with rare, structured "tunnel jumps" to alternative policy basins.

✅ Tunnelling Between Policy Basins: Local Optimum (current policy) ⇢ Tunnel Jump (alternative policy) ⇢ Global Optimum (better long-term reward)
🧪

Policy Basin Exploration

The agent maintains alternative policy configurations that represent different exploration strategies, enabling jumps between policy basins.

โณ

Temporal Commitment

The agent commits to an alternative policy for a full evaluation period, allowing it to escape local optima and explore globally better solutions.

📊

Long-Horizon Evaluation

Policy basins are compared using long-horizon value estimates, not just immediate rewards, enabling discovery of globally optimal strategies.

If the new basin proves better, we shift the main policy towards it. If not, we tunnel back. Either way, we gain evidence for or against that exploration strategy, and over time this is what enables the discovery of globally better policies.
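
One way these pieces could look in code is sketched below; the ScenarioLibrary class and move_towards helper are illustrative assumptions that mirror the names used in the training loop in the next section, not a prescribed implementation:

import numpy as np

class ScenarioLibrary:
    """Holds alternative policy parameter vectors, one per candidate basin,
    and returns one of them when a tunnel jump is triggered."""
    def __init__(self, scenarios):
        self.scenarios = list(scenarios)
        self.rng = np.random.default_rng()

    def sample(self):
        return self.scenarios[self.rng.integers(len(self.scenarios))]

def move_towards(current_params, scenario_params, step=0.5):
    """Shift the main policy part of the way towards the better basin
    instead of replacing it outright; step controls how aggressive the move is."""
    return current_params + step * (scenario_params - current_params)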

⚡

Production View: Scenario Tunnelling

How the agent jumps and evaluates

Tunnelling Implementation

We extend the training loop with a mechanism that occasionally samples scenario policies, runs them for a full evaluation horizon, and compares their performance to the current policy.

qirl_tunnelling.py
for epoch in range(num_epochs):
    if should_tunnel(epoch):
        # Sample a scenario policy from the library
        scenario = scenario_library.sample()
        returns = run_policy_for_horizon(scenario, env, horizon=H)
        scenario_value = aggregate_returns(returns)

        # Compare with current policy basin
        baseline_returns = run_policy_for_horizon(agent.policy, env, horizon=H)
        baseline_value = aggregate_returns(baseline_returns)

        if scenario_value > baseline_value + delta:
            agent.move_towards(scenario)  # update parameters towards better basin
    else:
        # Standard local RL update
        batch = collect_experience(agent.policy, env)
        agent.update_locally(batch)
scenario_policies.yaml
scenarios:
  - name: exploratory
    description: "High exploration rate, prioritizes long-term value"
  - name: conservative
    description: "Low exploration, focuses on immediate rewards"
  - name: balanced
    description: "Moderate exploration with adaptive exploration rate"

Key Features:

  • Rare but structured exploration: tunnelling events are special, not random noise.
  • Barrier-aware exploration: tunnelling probability depends on barrier height and width, enabling principled exploration beyond local optima (see the sketch after this list).
  • Long-horizon comparison: decisions are grounded in multi-step value.
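
The training loop above leaves should_tunnel and aggregate_returns undefined. A minimal sketch of both, assuming a barrier-aware jump probability shaped like the quantum transmission estimate and a discounted long-horizon value (the exact functional form and parameters are our assumptions), might look like this:

import numpy as np

def should_tunnel(epoch, warmup=10, base_rate=0.05,
                  barrier_height=1.0, barrier_width=1.0):
    """Trigger a rare tunnel jump after a warm-up period. The jump probability
    decays with sqrt(barrier_height) * barrier_width, echoing the exponential
    suppression of quantum tunnelling through high, wide barriers."""
    if epoch < warmup:
        return False
    p = base_rate * np.exp(-np.sqrt(barrier_height) * barrier_width)
    return np.random.random() < p

def aggregate_returns(returns, gamma=0.99):
    """Long-horizon value estimate: discounted sum of the per-step returns
    collected over the evaluation horizon."""
    return sum(gamma ** t * r for t, r in enumerate(returns))
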
🔮

Summary: Quantum Tunnelling for Global Optimization

How quantum tunnelling enables escape from local optima

Key Insights

Quantum tunnelling provides a mechanism for RL agents to explore beyond local optima. Unlike classical methods that get trapped by energy barriers (negative reward valleys), quantum-inspired algorithms can tunnel through these barriers to discover globally optimal solutions.

Chess Example: A move that sacrifices material (negative immediate reward) but leads to a winning endgame (positive long-term reward) requires "tunnelling" through the valley of negative rewards. Classical RL might avoid this, but quantum tunnelling enables exploration of such paths.

🚇

Barrier Penetration

Quantum wave functions extend into classically forbidden regions, enabling exploration of states separated by negative reward barriers.

🎯

Global Optima Discovery

Agents can discover globally optimal policies even when separated from local optima by valleys of poor performance.

⚡

Efficient Exploration

Tunnelling provides a principled way to explore beyond local basins without relying solely on random exploration.

🔮 Quantum Insights

Insight 1: Wave Function Tunnelling

๐Ÿ‘๏ธ Click to reveal

In quantum mechanics, the probability of tunnelling through a barrier of height V and width d falls off roughly as exp(-2d√(2m(V - E))/ℏ), where E is the particle's energy and m its mass; that is, it decreases exponentially with √(V - E) × d. In RL, this translates to exploring policies separated by reward barriers, with jump probability decreasing as the barriers get higher and wider.

Insight 2: Temperature and Tunnelling

๐Ÿ‘๏ธ Click to reveal

In physical systems, escape from a potential well becomes more likely as temperature rises, because thermal activation over the barrier complements tunnelling through it. In RL, this suggests that the exploration temperature should be tuned so the agent can cross reward barriers while maintaining reasonable sample efficiency.
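
One standard way to operationalize this insight is a simulated-annealing-style acceptance rule, where the exploration temperature controls how readily the agent accepts a jump into a basin that currently looks worse. The sketch below is our own illustration, not part of the original design:

import numpy as np

def accept_jump(scenario_value, baseline_value, temperature=1.0):
    """Metropolis-style acceptance: always accept a better basin, and accept a
    worse one with probability exp(-(value gap) / temperature), so a higher
    exploration temperature makes crossings of reward barriers more likely."""
    if scenario_value >= baseline_value:
        return True
    return np.random.random() < np.exp((scenario_value - baseline_value) / temperature)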

Continue the Quantum Saga

Tunnelling is the fourth story in the quantum saga. The final story is mixed states – how QiRL acts under irreducible uncertainty about the environment itself.