Entanglement: Correlated Decisions in Multi-Step Problems
How quantum entanglement enables coordinated learning across sequential decisions
Understanding Quantum Entanglement
Quantum entanglement is one of the most mysterious phenomena in physics. When two particles become entangled, measuring one instantly determines the state of the other, regardless of distance. Einstein called this "spooky action at a distance."
The classic example: two entangled electrons have correlated spins. If you measure one and find it spinning "up," the other is instantly "down"—even if they're light-years apart. This correlation exists before measurement; the particles share a quantum state.
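As a quick aside, here is a minimal numpy sketch of the singlet Bell state (illustrative only; the state and its outcome labels are standard textbook physics and are not part of the RL code later in this article). The joint probability table puts all of its weight on the two anti-correlated outcomes.
import numpy as np

# Singlet Bell state |ψ⟩ = (|↑↓⟩ - |↓↑⟩) / √2 over the basis (↑↑, ↑↓, ↓↑, ↓↓).
# Amplitudes are real here for simplicity.
psi = np.array([0.0, 1.0, -1.0, 0.0]) / np.sqrt(2)

# Joint measurement probabilities: |amplitude|² for each outcome pair.
probs = np.abs(psi) ** 2
for outcome, p in zip(["up,up", "up,down", "down,up", "down,down"], probs):
    print(f"P({outcome}) = {p:.2f}")
# Only the anti-correlated outcomes have nonzero probability: finding one spin
# "up" means the other is found "down", and vice versa.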
The Problem in Reinforcement Learning
In multi-step RL problems, decisions are often correlated. Consider a gridworld where an agent must collect keys in a specific order, or a game where early moves affect late-game options. Classical RL treats each decision independently, optimizing actions locally without considering their global correlations.
Example: Key-Door Gridworld - An agent must collect a key (action at step t₁) before opening a door (action at step t₂). These actions are entangled: the value of collecting the key depends on whether we'll use it to open the door, and the value of opening the door depends on whether we have the key. Classical Q-learning might optimize each action separately, missing the correlation.
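To see why a single per-action value is ill-defined here, consider the toy return table below (the numbers are invented for illustration): the return of collecting the key swings from strongly positive to slightly negative depending on the second action, so only the joint pair has a stable value.
import numpy as np

# Hypothetical episode returns for (first action, second action) pairs in the
# Key-Door task. Rows: collect_key / skip_key. Columns: open_door / wander.
returns = np.array([
    [50.0, -2.0],   # collect_key then {open_door, wander}
    [-10.0, 0.0],   # skip_key   then {open_door, wander}
])

# The "value of collecting the key" depends entirely on the second action:
print("collect_key followed by open_door:", returns[0, 0])   # 50.0
print("collect_key followed by wander:   ", returns[0, 1])   # -2.0

# Only the joint pair has a well-defined value; the best choice comes from
# maximizing over both actions at once.
best = np.unravel_index(np.argmax(returns), returns.shape)
print("best joint pair (row, col):", best)                    # (0, 0)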
The Classical RL Limitation
Why independent optimization fails for correlated decisions
The Problematic Approach
Classical RL often treats actions at different time steps as independent. In Q-learning, we learn Q(s, a) - the value of taking action a in state s. The Bellman equation updates Q-values based on immediate rewards and future values, but doesn't explicitly model correlations between distant actions.
The fundamental limitation: in a multi-armed bandit with delayed rewards, pulling arm A at time t₁ might only pay off if we also pull arm B at time t₂. Classical bandit algorithms optimize each pull independently, missing this correlation.
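For contrast, this is the update the rest of the article argues against, sketched minimally (the table sizes are arbitrary): standard tabular Q-learning stores one value per (state, action) and adjusts each entry on its own, with no explicit term for how the current action interacts with a later one. The entangled pairwise table introduced below replaces exactly this.
import numpy as np

n_states, n_actions = 100, 4
alpha, gamma = 0.1, 0.99

# One value per (state, action): no explicit model of how the action chosen
# now interacts with the action chosen at a later step.
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, done):
    """Standard one-step TD update; each (s, a) entry is adjusted independently."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])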
Our Entangled QiRL Design
A shared latent “relationship state” across decisions
The Quantum-Inspired Solution
In quantum mechanics, entangled particles share a quantum state. Measuring one particle instantly determines the state of the other, even at a distance. We apply this to RL by representing correlated actions as an entangled quantum state.
Mathematically, we represent a sequence of actions as an entangled state:
|ψ⟩ = Σᵢⱼ αᵢⱼ |a₁ᵢ⟩ ⊗ |a₂ⱼ⟩
where |a₁ᵢ⟩ and |a₂ⱼ⟩ are action states at different time steps, and αᵢⱼ are amplitudes that encode correlations. The key: this state cannot be factored into independent states |a₁⟩ ⊗ |a₂⟩—the actions are entangled.
Example: Key-Door Problem - In a gridworld, collecting a key (action a₁) and opening a door (action a₂) are entangled. We represent this as |key, door⟩ where the amplitude αᵢⱼ is high only when i="collect" and j="open". The Q-value for this entangled action pair is learned jointly, not separately.
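Here is a small numpy sketch of that representation (the amplitude values are invented for illustration): write αᵢⱼ as a matrix over first and second actions. A rank-1 matrix factors into independent action states, so any higher rank signals the entangled, non-factorable case described above.
import numpy as np

actions_t1 = ["collect", "ignore"]
actions_t2 = ["open", "wander"]

# Amplitude matrix alpha[i, j] for the pair (a1_i, a2_j); weight is concentrated
# on ("collect", "open"), with a small amplitude on ("ignore", "wander").
alpha = np.array([
    [0.95, 0.00],
    [0.00, 0.31],
])
alpha /= np.linalg.norm(alpha)   # normalize so the squared amplitudes sum to 1

# A separable state factors as alpha = outer(beta, gamma), i.e. rank 1.
print("rank:", np.linalg.matrix_rank(alpha))   # 2 -> cannot be written as |a1⟩ ⊗ |a2⟩

best_i, best_j = np.unravel_index(np.argmax(np.abs(alpha)), alpha.shape)
print("dominant pair:", actions_t1[best_i], "->", actions_t2[best_j])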
Entangled Representation
Actions are represented as entangled quantum states, capturing correlations that cannot be factored into independent components.
Joint Optimization
Q-values for action pairs are learned jointly, ensuring correlated actions are optimized together rather than independently.
Sample Efficiency
By modeling correlations explicitly, the agent learns faster which action sequences work well together.
Production View: Multi-Head QiRL Agent
A shared encoder with entangled policy heads
Implementing Entanglement: Key-Door Gridworld Example
Let's implement entangled actions using a Key-Door gridworld. The agent must collect a key (action at step t₁) before opening a door (action at step t₂). These actions are entangled: the value of collecting the key depends on whether we'll use it to open the door.
We represent this as an entangled quantum state where actions at different time steps share a quantum correlation. The Q-function learns Q(s₁, a₁, s₂, a₂) jointly, not Q(s₁, a₁) and Q(s₂, a₂) separately.
import numpy as np

class EntangledQRLAgent:
    """
    Quantum-inspired RL agent with an entangled action representation.
    Example: Key-Door gridworld where collecting the key and opening the door are entangled.
    """
    def __init__(self, state_dim, action_dim, horizon=2):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.horizon = horizon
        self.learning_rate = 0.1
        self.gamma = 0.99
        # Entangled Q-function: Q(s₁, a₁, s₂, a₂) for action pairs.
        # This captures correlations between actions at different time steps.
        self.entangled_q = np.zeros((state_dim, action_dim, state_dim, action_dim))
        # Shared encoder for the entangled representation.
        self.shared_encoder = self._init_encoder(state_dim)
        # Policy heads for different time steps (both read the shared encoder).
        self.policy_head_t1 = self._init_policy_head()
        self.policy_head_t2 = self._init_policy_head()

    def _init_encoder(self, state_dim):
        """Shared encoder that produces the entangled latent state."""
        # In practice this would be a neural network; for simplicity we use
        # a random linear transformation.
        return np.random.randn(state_dim, state_dim) * 0.1

    def _init_policy_head(self):
        """Policy head that reads from the shared encoder."""
        return np.random.randn(self.state_dim, self.action_dim) * 0.1

    def encode_state(self, state):
        """Encode an integer state index into the shared (entangled) latent space."""
        one_hot = np.zeros(self.state_dim)
        one_hot[state] = 1.0
        return self.shared_encoder @ one_hot

    def act_entangled(self, state1, state2=None):
        """
        Select the first action (when state2 is None) or an entangled action
        pair (a₁, a₂). The actions are correlated through the shared encoder.
        """
        z1 = self.encode_state(state1)
        if state2 is None:
            # First action: sample from the t₁ policy head.
            logits = z1 @ self.policy_head_t1
            return self._sample_action(logits)
        # Second decision point: consider the entanglement with the first state.
        z2 = self.encode_state(state2)
        # Entangled Q-value: depends on both states and their correlation,
        # Q(s₁, a₁, s₂, a₂) = f(encoder(s₁), encoder(s₂), correlation).
        correlation = float(z1 @ z2)  # quantum-inspired correlation term
        # Select the action pair that maximizes the entangled Q-value.
        best_pair = None
        best_value = -np.inf
        for a1 in range(self.action_dim):
            for a2 in range(self.action_dim):
                value = self.entangled_q[state1, a1, state2, a2] + correlation
                if value > best_value:
                    best_value = value
                    best_pair = (a1, a2)
        return best_pair

    def _sample_action(self, logits):
        """Sample an action from logits (softmax policy)."""
        exp_logits = np.exp(logits - np.max(logits))
        probs = exp_logits / np.sum(exp_logits)
        return np.random.choice(len(probs), p=probs)

    def update_entangled(self, s1, a1, s2, a2, reward, next_s1, next_s2, done):
        """
        Update entangled Q-values using quantum-inspired TD learning.
        The update captures correlations between action pairs.
        """
        if done:
            target = reward
        else:
            # Bellman update for the entangled Q-function:
            # Q*(s₁, a₁, s₂, a₂) = r + γ max_{a₁', a₂'} Q*(s₁', a₁', s₂', a₂')
            next_max = np.max(self.entangled_q[next_s1, :, next_s2, :])
            target = reward + self.gamma * next_max
        # TD error for the joint (pairwise) Q-value.
        current_q = self.entangled_q[s1, a1, s2, a2]
        td_error = target - current_q
        # Update with a quantum-inspired correlation term.
        correlation = float(self.encode_state(s1) @ self.encode_state(s2))
        self.entangled_q[s1, a1, s2, a2] += self.learning_rate * (td_error + 0.1 * correlation)
        # Also update the shared encoder to strengthen the entanglement.
        self._update_encoder(s1, s2, td_error)

    def _update_encoder(self, s1, s2, td_error):
        """Update the shared encoder to strengthen correlations (simplified gradient step)."""
        z1 = self.encode_state(s1)
        z2 = self.encode_state(s2)
        gradient = td_error * np.outer(z1, z2)
        self.shared_encoder += 0.01 * gradient
# Example: Key-Door Gridworld
class KeyDoorGridworld:
    """Gridworld where the agent must collect a key before opening a door."""
    def __init__(self, size=5):
        self.size = size
        self.agent_pos = (0, 0)
        self.key_pos = (2, 2)
        self.door_pos = (4, 4)
        self.has_key = False
        self.door_open = False

    def reset(self):
        self.agent_pos = (0, 0)
        self.has_key = False
        self.door_open = False
        return self._get_state()

    def _get_state(self):
        """Flat state index encoding position, key status, and door status."""
        pos_index = self.agent_pos[0] * self.size + self.agent_pos[1]
        return pos_index * 4 + int(self.has_key) * 2 + int(self.door_open)

    def step(self, action):
        """Actions: 0=up, 1=down, 2=left, 3=right."""
        # Move the agent (stay in place if the move would leave the grid).
        moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        new_pos = (self.agent_pos[0] + moves[action][0],
                   self.agent_pos[1] + moves[action][1])
        if 0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size:
            self.agent_pos = new_pos
        # Check for the key.
        if self.agent_pos == self.key_pos and not self.has_key:
            self.has_key = True
            reward = 10
        # Check for the door.
        elif self.agent_pos == self.door_pos:
            if self.has_key and not self.door_open:
                self.door_open = True
                reward = 50   # big reward for opening the door with the key
            elif not self.has_key:
                reward = -10  # penalty for trying without the key
            else:
                reward = 0
        else:
            reward = -1       # small penalty for each step
        done = self.door_open
        return self._get_state(), reward, done
# Training loop for the entangled Key-Door agent
env = KeyDoorGridworld(size=5)
agent = EntangledQRLAgent(state_dim=5 * 5 * 4, action_dim=4, horizon=2)  # 25 cells x 2 key flags x 2 door flags

for episode in range(1000):
    state1 = env.reset()
    total_reward = 0
    # First action: try to move toward the key.
    action1 = agent.act_entangled(state1)
    state2, reward1, done = env.step(action1)
    total_reward += reward1
    if not done:
        # Second action: chosen jointly with the first from the entangled Q-values.
        _, action2 = agent.act_entangled(state1, state2)
        state3, reward2, done = env.step(action2)
        total_reward += reward2
        # Update the entangled Q-values.
        # The update captures that (collect_key, open_door) is a good pair.
        agent.update_entangled(
            s1=state1, a1=action1,
            s2=state2, a2=action2,
            reward=total_reward,
            next_s1=state2, next_s2=state3,
            done=done,
        )
    if episode % 100 == 0:
        print(f"Episode {episode}, Total reward: {total_reward}")

# The agent learns that collecting the key and opening the door are entangled:
# Q(state_near_key, collect_key, state_near_door, open_door) >>
# Q(state_near_key, collect_key, state_near_door, other_action)
Key Concepts:
- Entangled Q-function: Q(s₁, a₁, s₂, a₂) captures correlations between action pairs, not just individual actions.
- Shared encoder: Creates entangled representation where actions at different time steps share quantum correlations through the latent state.
- Joint optimization: The Q-values for action pairs are learned jointly, ensuring correlated actions (like collect_key + open_door) are optimized together.
- Quantum correlation: The correlation term in the update strengthens the entanglement between actions that work well together.
Summary: The Power of Quantum Entanglement in RL
How entangled actions enable coordinated learning
Key Insights
Quantum entanglement in reinforcement learning allows us to model correlations between actions that span multiple time steps. Unlike classical RL, which treats actions independently, entangled QiRL captures that certain action sequences work well together.
Correlated Learning
Actions at different time steps are learned jointly, not independently. The agent discovers that (collect_key, open_door) is a good pair.
Sample Efficiency
By modeling correlations explicitly, the agent learns faster which action sequences work well together, reducing the number of samples needed.
Long-Term Dependencies
Entanglement naturally captures long-term dependencies, like in chess where early moves influence late-game options.
🔮 Quantum Insights
Insight 1: Non-Separable States
Entangled quantum states cannot be factored into independent components. In RL, this means action pairs like (collect_key, open_door) form a non-separable state that must be learned jointly, not as separate Q(s₁, a₁) and Q(s₂, a₂).
Insight 2: Bell Inequalities
Quantum entanglement violates Bell inequalities, meaning correlations are stronger than any classical theory allows. In RL, this translates to action correlations that cannot be captured by independent optimization but emerge naturally from entangled representations.
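For the physics side of this insight, here is a short self-contained check (standard textbook values, unrelated to the RL code above): the singlet-state correlation for measurement angles a and b is E(a, b) = -cos(a - b), and the CHSH combination of four such correlations reaches 2√2 ≈ 2.83, beyond the classical bound of 2.
import numpy as np

def E(a, b):
    """Singlet-state correlation for measurement angles a and b (radians)."""
    return -np.cos(a - b)

# Standard CHSH measurement angles.
a, a_prime = 0.0, np.pi / 2
b, b_prime = np.pi / 4, 3 * np.pi / 4

S = abs(E(a, b) - E(a, b_prime) + E(a_prime, b) + E(a_prime, b_prime))
print(f"S = {S:.3f}  (classical bound: 2, quantum value: {2 * np.sqrt(2):.3f})")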
Continue the Quantum Saga
Entanglement is the second story in the quantum saga. Next is interference – how conflicting learning signals can reinforce or cancel each other instead of averaging out.