AI Deep Dive: The Q-Learning Model

The intelligence of the final boss in "Train Your Foes" is not based on pre-programmed scripts or decision trees. Instead, it is powered by a custom-built Q-Learning agent. This document provides a detailed technical breakdown of the algorithm, its implementation, and the specific model architecture that allows the boss to learn and apply winning strategies.


1. The Q-Learning Algorithm

Q-Learning is a model-free, value-based reinforcement learning algorithm. Its goal is to learn the quality of an action in a particular state. It does this by building up a "cheat sheet," called a Q-Table, which stores a value (the Q-value) for every possible state-action pair.

The core of the algorithm is the Bellman equation, which is used to iteratively update the Q-values after each action. The update rule is as follows:

$$ Q(s, a) \leftarrow Q(s, a) + \alpha [R + \gamma \max_{a'} Q(s', a') - Q(s, a)] $$

Where:

- $s$ is the current state.
- $a$ is the action taken.
- $s'$ is the new state after the action.
- $R$ is the immediate reward received.
- $a'$ ranges over all possible actions in the new state.
- $\alpha$ (Alpha) is the learning rate.
- $\gamma$ (Gamma) is the discount factor.
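In code, one update step reduces to a single line of arithmetic. The sketch below is a minimal illustration of the rule, not the shipped QLearning class; the function and parameter names are ours, and the default `alpha`/`gamma` values match the hyperparameters described later in this document.

```csharp
using System;

// One Q-Learning update step:
// Q(s,a) <- Q(s,a) + alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))
float UpdateQ(float[,] q, int s, int a, float reward, int sNext,
              float alpha = 0.01f, float gamma = 0.2f)
{
    // Find the best Q-value reachable from the next state s'.
    float maxNext = float.MinValue;
    for (int a2 = 0; a2 < q.GetLength(1); a2++)
        maxNext = Math.Max(maxNext, q[sNext, a2]);

    // Apply the temporal-difference update in place and return the new value.
    q[s, a] += alpha * (reward + gamma * maxNext - q[s, a]);
    return q[s, a];
}

// Example: a single update on a tiny 2-state, 3-action table.
// One step: 0.5 + 0.01 * (0.1 + 0.2 * 0.4 - 0.5) ≈ 0.4968
var q = new float[2, 3];
q[0, 0] = 0.5f; q[1, 1] = 0.4f;
Console.WriteLine(UpdateQ(q, s: 0, a: 0, reward: 0.1f, sNext: 1));
```

Note how small the change is: with $\alpha = 0.01$, even a meaningful surprise only moves the Q-value by a fraction of a percent per step.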


2. Model Architecture

Our implementation is a self-contained QLearning class written in C#, designed to be cleanly integrated with the Battle_System manager.

The Q-Table

The "brain" of our agent is the Q-Table, which stores all learned values.

- Data Structure: It is implemented as a 3D List<List<List<float>>> in C#.
- Dimensions: The table is initialized with dimensions of 81 States × 1 (unused) × 3 Actions, reflecting the entire possibility space of the battle.
- Initialization: The system is designed for persistence. On startup, it first checks for a saved qtable.txt file.
  - If a file exists, it loads the Q-Table from the file, resuming its learned state.
  - If no file is found, it initializes the Q-Table with small random values. This differs from zero-initialization and helps break ties early in the training process, encouraging more varied exploration.
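The initialization logic above can be sketched as follows. This is our reconstruction, not the shipped class: the one-value-per-line file format, the 0.01 scale for the random init, and the helper names are all assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

const int States = 81, Unused = 1, Actions = 3;
var rng = new Random();

// Build the 81 x 1 x 3 table. Small random values (rather than zeros)
// break ties between actions early in training.
List<List<List<float>>> NewTable()
{
    var table = new List<List<List<float>>>();
    for (int s = 0; s < States; s++)
    {
        var mid = new List<List<float>>();
        for (int u = 0; u < Unused; u++)
        {
            var row = new List<float>();
            for (int a = 0; a < Actions; a++)
                row.Add((float)(rng.NextDouble() * 0.01)); // small random init (assumed scale)
            mid.Add(row);
        }
        table.Add(mid);
    }
    return table;
}

// Persistence: resume from qtable.txt if it exists, otherwise start fresh.
// (One value per line is an assumed format; the real save format may differ.)
List<List<List<float>>> LoadOrCreate(string path)
{
    var table = NewTable();
    if (!File.Exists(path)) return table;
    var values = File.ReadAllLines(path);
    int i = 0;
    foreach (var mid in table)
        foreach (var row in mid)
            for (int a = 0; a < Actions; a++)
                row[a] = float.Parse(values[i++]);
    return table;
}

var qTable = LoadOrCreate("qtable.txt");
Console.WriteLine(qTable.Count); // prints 81
```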

Hyperparameters

The learning process is governed by two key hyperparameters defined in the code:

- Learning Rate ($\alpha$): 0.01. This very low value means the agent learns slowly and cautiously, making small updates to its Q-Table. A low learning rate is often used to ensure stable learning over a long training period.
- Discount Factor ($\gamma$): 0.2. This low value makes the agent very "short-sighted": it heavily prioritizes immediate rewards over potential future rewards, which can lead to a more aggressive, reactive playstyle.

Exploration Strategy

This model employs a purely greedy policy during gameplay.

- An epsilon-greedy strategy (which allows for random, exploratory actions) was not used.
- The agent always chooses the action with the highest currently known Q-value for its given state. The GetMaxArg function handles tie-breaking by randomly selecting among actions that share the highest Q-value.
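Greedy selection with random tie-breaking can be sketched like this. GetMaxArg is the function named above, but this body is our reconstruction of its described behavior, not the shipped code:

```csharp
using System;
using System.Collections.Generic;

var rng = new Random();

// Return the index of the highest Q-value; exact ties are broken
// uniformly at random among the tied actions.
int GetMaxArg(IReadOnlyList<float> qValues)
{
    float best = float.MinValue;
    var candidates = new List<int>();
    for (int a = 0; a < qValues.Count; a++)
    {
        if (qValues[a] > best)
        {
            best = qValues[a];       // new strict maximum: restart candidate list
            candidates.Clear();
            candidates.Add(a);
        }
        else if (qValues[a] == best) // exact float equality is intentional here
        {
            candidates.Add(a);
        }
    }
    return candidates[rng.Next(candidates.Count)];
}

Console.WriteLine(GetMaxArg(new[] { 0.1f, 0.7f, 0.3f })); // prints 1
```

This is why the random initialization of the Q-Table matters: with zeros everywhere, early ties would be everywhere, and the tie-breaker alone would determine the boss's opening behavior.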


3. State Space Definition

The State Space is the set of all possible situations the agent can be in. To make this manageable for a Q-Table, the game's continuous variables (like HP and Aura) are discretized into a single integer from 0 to 80.

This state is calculated from four key variables, each broken down into 3 levels (e.g., Low, Medium, High):

1. Player Health: The player's current HP percentage.
2. Boss Health: The boss's current HP percentage.
3. Player Aura: The player's current Aura Meter value.
4. Boss Mana: The boss's current Mana value.

These four 3-level variables are then encoded into a single unique state index using a base-3 calculation:

State Index = (PlayerHP_level × 27) + (BossHP_level × 9) + (PlayerAura_level × 3) + (BossMana_level)

This results in a total of $3^4 = 81$ unique states, providing a comprehensive yet compact state space for the agent to learn from.
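The encoding reduces to a one-liner. Level values run 0–2 (Low, Medium, High); the parameter names below are illustrative:

```csharp
using System;

// Pack four 3-level variables (each 0, 1, or 2) into one index in [0, 80],
// treating them as the digits of a base-3 number.
int EncodeState(int playerHp, int bossHp, int playerAura, int bossMana) =>
    playerHp * 27 + bossHp * 9 + playerAura * 3 + bossMana;

Console.WriteLine(EncodeState(0, 0, 0, 0)); // prints 0  (everything Low)
Console.WriteLine(EncodeState(2, 2, 2, 2)); // prints 80 (everything High)
```

Because each weight is the next power of 3, every combination of levels maps to a distinct index, and the mapping is trivially invertible by repeated division and remainder.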


4. Action Space Definition

The Action Space is the set of all possible moves the agent can make. The boss AI has three distinct actions it can choose from on its turn.

| Action Index | Action Name | Description |
|---|---|---|
| 0 | Attack | A standard, reliable attack that deals a moderate amount of damage. |
| 1 | Defend | Reduces all incoming damage from the player's next attack. |
| 2 | Special Attack | A high-damage attack that consumes a large amount of the boss's mana. If mana is insufficient, it defaults to a standard attack. |
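The mana-gated fallback for the Special Attack can be sketched as below. The cost of 30 is a placeholder we made up for illustration; the real cost lives in the Battle_System manager.

```csharp
using System;

const int SpecialCost = 30; // assumed placeholder; actual cost defined in Battle_System

// Map the agent's chosen action index to the move actually executed:
// action 2 (Special Attack) degrades to action 0 (Attack) without enough mana.
int ResolveAction(int chosenAction, int currentMana) =>
    (chosenAction == 2 && currentMana < SpecialCost) ? 0 : chosenAction;

Console.WriteLine(ResolveAction(2, 10)); // prints 0: not enough mana, falls back to Attack
Console.WriteLine(ResolveAction(2, 50)); // prints 2: Special Attack goes through
```

One design consequence: the Q-Table still records the update against the *chosen* action, so the agent can gradually learn that picking the Special Attack in low-mana states yields only ordinary-attack rewards.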

5. Reward Function

The Reward Function defines the goal for the agent. Our system uses a combination of a final terminal reward and smaller intermediate rewards.

Terminal Rewards

A large, definitive reward is given at the very end of the battle. The GetReward function takes a boolean lost, which is true if the player was defeated.

- Agent Wins (Player Loses): +1.0
- Agent Loses (Player Wins): -1.0

Intermediate Rewards (Reward Shaping)

To help the agent learn faster, small rewards and penalties are given after each turn based on the immediate outcome. The updated solve() function provides this feedback based on two factors:

- A small positive reward (+0.1) for decreasing the Player's Health, and a penalty (-0.1) if the Player's Health increased.
- A small positive reward (+0.05) for increasing the Boss's Health (or preventing it from decreasing), and a penalty (-0.05) if the Boss's Health decreased.

This "reward shaping" provides more frequent feedback, allowing the agent to learn the value of advantageous situations without having to wait for the battle to end.
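The full reward signal described in this section can be sketched as two small functions. This is a reconstruction of the feedback computed inside solve() and GetReward(), not the shipped code; in particular, we assume no player-health reward when the player's HP is unchanged, since the text only specifies the decrease and increase cases.

```csharp
using System;

// Per-turn shaping reward from the health deltas of one turn.
float ShapingReward(int playerHpBefore, int playerHpAfter, int bossHpBefore, int bossHpAfter)
{
    float r = 0f;
    if (playerHpAfter < playerHpBefore) r += 0.1f;       // damaged the player
    else if (playerHpAfter > playerHpBefore) r -= 0.1f;  // player healed
    if (bossHpAfter >= bossHpBefore) r += 0.05f;         // kept or gained boss HP
    else r -= 0.05f;                                     // boss took damage
    return r;
}

// Terminal reward: lost == true means the player was defeated (agent wins).
float GetReward(bool lost) => lost ? 1f : -1f;

// Example turn: boss hits the player for 20 and takes no damage.
// 0.1 (player damaged) + 0.05 (boss HP kept) = 0.15
Console.WriteLine(ShapingReward(100, 80, 50, 50));
```

Note the asymmetry in magnitudes: hurting the player is worth twice as much as protecting the boss's own HP, and both are tiny next to the ±1.0 terminal reward, so shaping guides learning without overriding the win/loss objective.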