5. Epsilon (ε), Epsilon-Greedy Policy and Epsilon Decay
Epsilon-Greedy Policy
In reinforcement learning, the epsilon-greedy policy is a strategy used to balance exploration and exploitation:
- Exploration: The agent selects actions at random to gather new information about the environment.
- Exploitation: The agent selects the action its current estimates rate as best in order to maximize reward.
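To make this concrete, here is a minimal sketch of epsilon-greedy action selection. The function name select_action and the representation of action values as a plain list are illustrative assumptions, not taken from the original post:

```python
import random

def select_action(q_values, epsilon):
    """Pick an action index from a list of estimated action values.

    With probability epsilon, choose uniformly at random (exploration);
    otherwise choose the action with the highest estimate (exploitation).
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```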
Epsilon (ε)
- Epsilon (ε): A parameter that determines the probability of choosing a random action (exploration) versus the best-known action (exploitation).
- When ε is high, the agent explores more.
- When ε is low, the agent exploits more.
Epsilon Decay
- Epsilon Decay: To ensure the agent initially explores the environment but gradually shifts to exploiting its knowledge, ε is decreased over time.
- self.epsilon_decay: A factor by which ε is multiplied after each episode to reduce its value gradually.
- self.epsilon_min: The minimum value of ε to ensure that the agent always retains a small probability of exploring.
Purpose of the Code
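The snippet itself is not reproduced here. Based on the attribute names mentioned above, it most likely resembles the following reconstruction; the class name Agent, the method name decay_epsilon, and the default hyperparameter values are assumptions for illustration:

```python
class Agent:
    def __init__(self, epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.05):
        self.epsilon = epsilon              # current exploration rate
        self.epsilon_decay = epsilon_decay  # multiplicative decay factor
        self.epsilon_min = epsilon_min      # exploration floor

    def decay_epsilon(self):
        # The step discussed in this section: shrink epsilon after an
        # episode, but only while it is still above the minimum threshold.
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
```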
This snippet checks whether the current value of ε is still greater than the minimum threshold (self.epsilon_min). If it is, ε is multiplied by the decay factor (self.epsilon_decay), gradually reducing its value. This ensures:
- Initial Exploration: At the start, the agent explores the environment widely due to a higher ε.
- Gradual Shift to Exploitation: Over time, as the agent learns, ε decreases, leading the agent to exploit its learned policy more frequently.
- Prevent Stagnation: By ensuring ε never goes below a certain minimum value (self.epsilon_min), the agent retains some degree of exploration to avoid getting stuck in local optima.
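Using the hypothetical Agent sketch above, a short loop shows the decay schedule in action; the episode count and hyperparameter values are arbitrary:

```python
agent = Agent(epsilon=1.0, epsilon_decay=0.99, epsilon_min=0.05)
for episode in range(300):
    # ... run one episode and update the agent's value estimates ...
    agent.decay_epsilon()
print(agent.epsilon)  # roughly 0.05: decayed from 1.0 down to the floor
```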