Entanglement: Correlated Decisions in Multi-Step Problems

How quantum entanglement enables coordinated learning across sequential decisions

Understanding Quantum Entanglement

Quantum entanglement is one of the most mysterious phenomena in physics. When two particles become entangled, measuring one instantly determines the state of the other, regardless of distance. Einstein called this "spooky action at a distance."

The classic example: two entangled electrons with anti-correlated spins. If you measure one and find it spinning "up," the other is instantly "down"—even if they're light-years apart. The correlation is built in from the moment the pair is created: the particles share a single quantum state.

The Problem in Reinforcement Learning

In multi-step RL problems, decisions are often correlated. Consider a gridworld where an agent must collect keys in a specific order, or a game where early moves affect late-game options. Classical RL treats each decision independently, optimizing actions locally without considering their global correlations.

Example: Key-Door Gridworld - An agent must collect a key (action at step t₁) before opening a door (action at step t₂). These actions are entangled: the value of collecting the key depends on whether we'll use it to open the door, and the value of opening the door depends on whether we have the key. Classical Q-learning might optimize each action separately, missing the correlation.

💥 The Classical RL Limitation

Why independent optimization fails for correlated decisions

The Problematic Approach

Classical RL often treats actions at different time steps as independent. In Q-learning, we learn Q(s, a) - the value of taking action a in state s. The Bellman equation updates Q-values based on immediate rewards and future values, but doesn't explicitly model correlations between distant actions.
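
For reference, here is a minimal sketch of that standard per-step update (illustrative only, not the agent implemented below): each (s, a) entry is pulled toward its own bootstrapped target, and nothing ties the action chosen at t₁ to the action chosen at t₂ beyond the value of the next state.

classical_q_update.py (illustrative)
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One classical tabular Q-learning update for a single (state, action) pair."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrapped one-step target
    Q[s, a] += alpha * (td_target - Q[s, a])    # purely local TD correction
    return Q

# Each time step is updated in isolation; the correlation between the action
# at t₁ and the action at t₂ is never represented explicitly.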

❌ Independent Action Optimization (diagram): action a₁ in state s₁ and action a₂ in state s₂ are each optimized independently of one another.

Fundamental Limitations:

  • 🚨 No correlation modeling: Q-learning assumes actions are conditionally independent given states. It doesn't capture that action a₁ and action a₂ might be fundamentally linked.
  • 🚨 Local optimization: Each Q-value is updated based on local rewards, missing global correlations that span multiple time steps.
  • 🚨 Sample inefficiency: Without modeling correlations, the agent needs many samples to learn that certain action sequences work well together.

Example: In a multi-armed bandit with delayed rewards, pulling arm A at time t₁ might only pay off if we pull arm B at time t₂. Classical bandit algorithms optimize each pull independently, missing this correlation.
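
A toy sketch makes this concrete (illustrative; it extends the example above so that two complementary pulls pay off): the per-step marginal value of every arm is identical, so independent optimization has no signal, while the joint table identifies the good pairs immediately.

correlated_bandit.py (illustrative)
import numpy as np

# Toy two-step bandit with arms {A, B} at each step.
# Reward 1 only for the complementary pairs (A then B) and (B then A).
R = np.array([[0.0, 1.0],   # first pull A: pays only if the second pull is B
              [1.0, 0.0]])  # first pull B: pays only if the second pull is A

# Per-step marginal values under a uniform partner policy carry no signal:
print(R.mean(axis=1))   # [0.5 0.5] -> first-step arms look identical
print(R.mean(axis=0))   # [0.5 0.5] -> second-step arms look identical

# The joint table identifies the correlated pairs directly:
print(np.argwhere(R == R.max()))   # [[0 1] [1 0]] -> (A, B) and (B, A)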

🎯 Our Entangled QiRL Design

A shared latent “relationship state” across decisions

The Quantum-Inspired Solution

In quantum mechanics, entangled particles share a quantum state. Measuring one particle instantly determines the state of the other, even at a distance. We apply this to RL by representing correlated actions as an entangled quantum state.

Mathematically, we represent a sequence of actions as an entangled state:

|a₁, a₂⟩ = Σᵢⱼ αᵢⱼ |a₁ᵢ⟩ ⊗ |a₂ⱼ⟩

where |a₁ᵢ⟩ and |a₂ⱼ⟩ are action states at different time steps, and αᵢⱼ are amplitudes that encode correlations. The key property: when the amplitudes do not factor as αᵢⱼ = βᵢγⱼ, the state cannot be written as a product of independent states |a₁⟩ ⊗ |a₂⟩—the actions are entangled.

✅ Entangled Action Representation (diagram): state s₁ feeds a shared entangled state |a₁, a₂⟩; action a₂ is selected in correlation with action a₁.

Example: Key-Door Problem - In a gridworld, collecting a key (action a₁) and opening a door (action a₂) are entangled. We represent this as |key, door⟩ where the amplitude αᵢⱼ is high only when i="collect" and j="open". The Q-value for this entangled action pair is learned jointly, not separately.
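
The factorization test can be checked numerically. Below is a minimal numpy sketch (illustrative only, separate from the agent implemented later; the action labels are hypothetical placeholders): a bipartite amplitude matrix αᵢⱼ describes a product state exactly when it has rank 1, so a Bell-like pattern spread over two correlated pairs cannot be split into independent per-step action states.

separability_check.py (illustrative)
import numpy as np

# Rows: first-step actions ("collect key", "wander"); columns: second-step
# actions ("open door", "skip door"). Labels are hypothetical placeholders.
# Bell-like amplitudes: weight only on the two correlated pairs.
alpha = np.array([[1.0, 0.0],
                  [0.0, 1.0]]) / np.sqrt(2)

# A product state |a₁⟩ ⊗ |a₂⟩ corresponds to a rank-1 amplitude matrix.
print(np.linalg.matrix_rank(alpha))   # 2 -> entangled, cannot be factored

# Compare with a separable (outer-product) amplitude matrix:
beta = np.outer([1.0, 0.0], [0.6, 0.8])
print(np.linalg.matrix_rank(beta))    # 1 -> separable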

🔗 Entangled Representation

Actions are represented as entangled quantum states, capturing correlations that cannot be factored into independent components.

Joint Optimization

Q-values for action pairs are learned jointly, ensuring correlated actions are optimized together rather than independently.

📊 Sample Efficiency

By modeling correlations explicitly, the agent learns faster which action sequences work well together.

Production View: Multi-Head QiRL Agent

A shared encoder with entangled policy heads

Implementing Entanglement: Key-Door Gridworld Example

Let's implement entangled actions using a Key-Door gridworld. The agent must collect a key (action at step t₁) before opening a door (action at step t₂). These actions are entangled: the value of collecting the key depends on whether we'll use it to open the door.

We represent this as an entangled quantum state where actions at different time steps share a quantum correlation. The Q-function learns Q(s₁, a₁, s₂, a₂) jointly, not Q(s₁, a₁) and Q(s₂, a₂) separately.

qirl_entangled_agent.py
import numpy as np

class EntangledQRLAgent:
    """
    Quantum-inspired RL agent with entangled action representation.
    Example: Key-Door gridworld where collecting key and opening door are entangled.
    """
    def __init__(self, state_dim, action_dim, horizon=2):
        self.state_dim = state_dim
        self.action_dim = action_dim
        
        # Entangled Q-function: Q(s₁, a₁, s₂, a₂) for action pairs
        # This captures correlations between actions at different time steps
        self.entangled_q = np.zeros((state_dim, action_dim, state_dim, action_dim))
        
        # Shared encoder for entangled representation
        self.shared_encoder = self._init_encoder(state_dim)
        
        # Policy heads for different time steps (but sharing encoder)
        self.policy_head_t1 = self._init_policy_head(state_dim, action_dim)
        self.policy_head_t2 = self._init_policy_head(state_dim, action_dim)
        
        self.horizon = horizon
        self.learning_rate = 0.1
        self.gamma = 0.99
    
    def _init_encoder(self, state_dim):
        """Shared encoder that produces entangled latent state"""
        # In practice, this would be a neural network
        # For simplicity, we use a linear transformation
        return np.random.randn(state_dim, state_dim) * 0.1
    
    def _init_policy_head(self, state_dim, action_dim):
        """Policy head that reads from the shared encoder"""
        # action_dim = 4 in the Key-Door gridworld: up, down, left, right
        return np.random.randn(state_dim, action_dim) * 0.1
    
    def encode_state(self, state):
        """Encode a flat state index using the shared encoder (creates the entangled representation)"""
        one_hot = np.zeros(self.state_dim)
        one_hot[state] = 1.0
        return np.dot(self.shared_encoder, one_hot)
    
    def act_entangled(self, state1, state2=None):
        """
        Select entangled action pair (a₁, a₂).
        The actions are correlated through the shared encoder.
        """
        z1 = self.encode_state(state1)
        
        if state2 is None:
            # For first action, use policy head
            logits = np.dot(z1, self.policy_head_t1)
            action1 = self._sample_action(logits)
            return action1
        else:
            # For second action, consider entanglement with first action
            z2 = self.encode_state(state2)
            
            # Entangled Q-value: depends on both states and their correlation
            # Q(s₁, a₁, s₂, a₂) = f(encoder(s₁), encoder(s₂), correlation)
            correlation = np.dot(z1, z2)  # Quantum correlation
            
            # Select action pair that maximizes entangled Q-value
            best_pair = None
            best_value = -np.inf
            
            for a1 in range(self.action_dim):
                for a2 in range(self.action_dim):
                    # Entangled value depends on correlation
                    value = self.entangled_q[state1, a1, state2, a2] + correlation
                    if value > best_value:
                        best_value = value
                        best_pair = (a1, a2)
            
            return best_pair
    
    def _sample_action(self, logits):
        """Sample action from logits (softmax policy)"""
        exp_logits = np.exp(logits - np.max(logits))
        probs = exp_logits / np.sum(exp_logits)
        return np.random.choice(len(probs), p=probs)
    
    def update_entangled(self, s1, a1, s2, a2, reward, next_s1, next_s2, done):
        """
        Update entangled Q-values using quantum-inspired TD learning.
        Updates capture correlations between action pairs.
        """
        if done:
            target = reward
        else:
            # Bellman update for entangled Q-function
            # Q*(s₁, a₁, s₂, a₂) = r + γ max_{a₁', a₂'} Q*(s₁', a₁', s₂', a₂')
            next_max = np.max(self.entangled_q[next_s1, :, next_s2, :])
            target = reward + self.gamma * next_max
        
        # TD error
        current_q = self.entangled_q[s1, a1, s2, a2]
        td_error = target - current_q
        
        # Update with quantum-inspired correlation term
        correlation = np.dot(self.encode_state(s1), self.encode_state(s2))
        self.entangled_q[s1, a1, s2, a2] += self.learning_rate * (td_error + 0.1 * correlation)
        
        # Also update shared encoder to strengthen entanglement
        self._update_encoder(s1, s2, td_error)
    
    def _update_encoder(self, s1, s2, td_error):
        """Update shared encoder to strengthen correlations"""
        # Gradient-based update (simplified)
        z1 = self.encode_state(s1)
        z2 = self.encode_state(s2)
        gradient = td_error * np.outer(z1, z2)
        self.shared_encoder += 0.01 * gradient

# Example: Key-Door Gridworld
class KeyDoorGridworld:
    """Gridworld where agent must collect key before opening door"""
    def __init__(self, size=5):
        self.size = size
        self.agent_pos = (0, 0)
        self.key_pos = (2, 2)
        self.door_pos = (4, 4)
        self.has_key = False
        self.door_open = False
    
    def reset(self):
        self.agent_pos = (0, 0)
        self.has_key = False
        self.door_open = False
        return self._get_state()
    
    def _get_state(self):
        """Flat state index encoding position, key status, and door status"""
        pos_index = self.agent_pos[0] * self.size + self.agent_pos[1]
        return pos_index * 4 + 2 * int(self.has_key) + int(self.door_open)
    
    def step(self, action):
        """Actions: 0=up, 1=down, 2=left, 3=right"""
        # Move agent
        moves = [(-1,0), (1,0), (0,-1), (0,1)]
        new_pos = (self.agent_pos[0] + moves[action][0],
                  self.agent_pos[1] + moves[action][1])
        
        if 0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size:
            self.agent_pos = new_pos
        
        # Check for key
        if self.agent_pos == self.key_pos and not self.has_key:
            self.has_key = True
            reward = 10
        
        # Check for door
        elif self.agent_pos == self.door_pos:
            if self.has_key and not self.door_open:
                self.door_open = True
                reward = 50  # Big reward for opening door with key
            elif not self.has_key:
                reward = -10  # Penalty for trying without key
            else:
                reward = 0
        else:
            reward = -1  # Small penalty for each step
        
        done = self.door_open
        return self._get_state(), reward, done
training_loop.py
# Training loop for entangled Key-Door agent
env = KeyDoorGridworld(size=5)
# 5x5 positions × 2 key states × 2 door states = 100 flat state indices
agent = EntangledQRLAgent(state_dim=100, action_dim=4, horizon=2)

for episode in range(1000):
    state1 = env.reset()
    done = False
    total_reward = 0
    
    # First action: try to collect key
    action1 = agent.act_entangled(state1)
    state2, reward1, done = env.step(action1)
    total_reward += reward1
    
    if not done:
        # Second action: try to open door (entangled with first action)
        action2 = agent.act_entangled(state2)
        state3, reward2, done = env.step(action2)
        total_reward += reward2
        
        # Update entangled Q-values
        # The update captures that (collect_key, open_door) is a good pair
        agent.update_entangled(
            s1=state1, a1=action1,
            s2=state2, a2=action2,
            reward=total_reward,
            next_s1=state2, next_s2=state3,
            done=done
        )
    
    if episode % 100 == 0:
        print(f"Episode {episode}, Total reward: {total_reward}")

# The agent learns that collecting key and opening door are entangled:
# Q(state_near_key, collect_key, state_near_door, open_door) >> 
#   Q(state_near_key, collect_key, state_near_door, other_action)
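
After training, a quick way to sanity-check the learned correlations (illustrative; the state indices below are hypothetical placeholders) is to compare the entangled Q-values of all action pairs for a pair of fixed states:

inspect_entangled_q.py (illustrative)
# Hypothetical inspection: which pair of moves does the joint table prefer
# for a state near the key and a state near the door?
s_near_key, s_near_door = 54, 95   # placeholder flat state indices

pair_values = agent.entangled_q[s_near_key, :, s_near_door, :]   # shape (4, 4)
best_a1, best_a2 = np.unravel_index(np.argmax(pair_values), pair_values.shape)
print(f"Best entangled pair: a1={best_a1}, a2={best_a2}, value={pair_values.max():.2f}")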

Key Concepts:

  • Entangled Q-function: Q(s₁, a₁, s₂, a₂) captures correlations between action pairs, not just individual actions.
  • Shared encoder: Creates entangled representation where actions at different time steps share quantum correlations through the latent state.
  • Joint optimization: The Q-values for action pairs are learned jointly, ensuring correlated actions (like collect_key + open_door) are optimized together.
  • Quantum correlation: The correlation term in the update strengthens the entanglement between actions that work well together.
🔮 Summary: The Power of Quantum Entanglement in RL

How entangled actions enable coordinated learning

Key Insights

Quantum entanglement in reinforcement learning allows us to model correlations between actions that span multiple time steps. Unlike classical RL, which treats actions independently, entangled QiRL captures that certain action sequences work well together.

🔗 Correlated Learning

Actions at different time steps are learned jointly, not independently. The agent discovers that (collect_key, open_door) is a good pair.

Sample Efficiency

By modeling correlations explicitly, the agent learns faster which action sequences work well together, reducing the number of samples needed.

🎯 Long-Term Dependencies

Entanglement naturally captures long-term dependencies, like in chess where early moves influence late-game options.

🔮 Quantum Insights

Insight 1: Non-Separable States

Entangled quantum states cannot be factored into independent components. In RL, this means action pairs like (collect_key, open_door) form a non-separable state that must be learned jointly, not as separate Q(s₁, a₁) and Q(s₂, a₂).

Insight 2: Bell Inequalities

Quantum entanglement violates Bell inequalities, meaning correlations are stronger than any classical theory allows. In RL, this translates to action correlations that cannot be captured by independent optimization but emerge naturally from entangled representations.
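
For the physics-curious, a minimal numerical check of the CHSH form of the Bell inequality (illustrative, independent of the RL code): for the singlet state, the correlation between spin measurements at angles θₐ and θᵦ is −cos(θₐ − θᵦ), and the CHSH combination reaches 2√2, beyond the classical bound of 2.

chsh_check.py (illustrative)
import numpy as np

# CHSH check for the singlet state: E(a, b) = -cos(a - b)
E = lambda a, b: -np.cos(a - b)

a, a_prime = 0.0, np.pi / 2            # first party's two measurement angles
b, b_prime = np.pi / 4, 3 * np.pi / 4  # second party's two measurement angles

S = E(a, b) - E(a, b_prime) + E(a_prime, b) + E(a_prime, b_prime)
print(abs(S))   # ≈ 2.828 = 2√2 > 2, violating the classical CHSH bound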

Continue the Quantum Saga

Entanglement is the second story in the quantum saga. Next is interference – how conflicting learning signals can reinforce or cancel each other instead of averaging out.