Reinforcement Learning

Reinforcement Learning (RL): an agent learns to make decisions by interacting with an environment. The goal of the agent is to take actions that maximize the cumulative reward.

The core components of RL are:

  • Agent: The learner or decision-maker.

  • Environment: The world the agent interacts with.

  • State S: A snapshot of the environment at a particular time.

  • Action A: A move the agent can make.

  • Reward R: Feedback from the environment that tells the agent how good its action was.

  • Discount rate γ: Weighs future rewards relative to immediate ones.

  • Policy π(a|s): The function that tells the agent which action to take in a given state; it defines the agent’s behavior.
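
These components come together in the standard agent–environment loop: the agent observes a state, takes an action, and receives a reward and the next state. Below is a minimal sketch of that loop, assuming the Gymnasium API and a random placeholder policy (the environment name CartPole-v1 is only an example).

```python
import gymnasium as gym

# Minimal agent–environment loop: observe a state, take an action,
# receive a reward and the next state, repeat until the episode ends.
env = gym.make("CartPole-v1")

state, info = env.reset()
total_reward = 0.0

for t in range(500):
    action = env.action_space.sample()                 # placeholder policy: random action
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:                        # terminal state reached
        break

print(f"Episode finished after {t + 1} steps, total reward = {total_reward}")
env.close()
```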


Common Reinforcement Learning Algorithms

RL algorithms are often categorized into two main types:

  • Value-Based Methods:

    • These algorithms learn a value function that estimates the expected future reward from being in a particular state.
    • The agent then chooses the action that leads to the state with the highest value.
    • A prominent example is Q-Learning, which learns a Q-value for each state-action pair.
  • Policy-Based Methods:

    • These algorithms directly learn a policy, which is a mapping from a state to an action.
    • The policy essentially tells the agent what action to take in a given state.
    • A common example is the REINFORCE algorithm.

This policy is the function we want to learn. The goal is to find the optimal policy π, the one that maximizes the expected return when the agent acts according to it. We find this π through training, using value-based or policy-based approaches.
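
As a concrete illustration of the value-based route, here is a minimal sketch of the tabular Q-Learning update; the toy state/action counts and the learning rate alpha are assumptions made for the example, not fixed parts of the algorithm.

```python
import numpy as np

# Tabular Q-Learning sketch: Q[s, a] estimates the expected discounted return
# of taking action a in state s and acting greedily afterwards.
n_states, n_actions = 16, 4          # assumed sizes for a toy environment
alpha, gamma = 0.1, 0.99             # learning rate and discount rate (assumed values)

Q = np.zeros((n_states, n_actions))

def q_update(state, action, reward, next_state, done):
    """One Q-Learning backup: move Q[s, a] toward the TD target."""
    target = reward if done else reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

def greedy_policy(state):
    """The policy derived from the learned values: pick the highest-Q action."""
    return int(np.argmax(Q[state]))
```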


Benefits of RL (generally)

  • Learns by interaction with delayed rewards (not just labeled data).

  • Optimizes sequential decision-making and long-term objectives.

  • Naturally handles non-i.i.d. data (data that is not independent and identically distributed) and closed-loop control.


Discount Rate γ

What is the Discount Rate?

  • The discount rate (commonly denoted γ, between 0 and 1) is used to weigh future rewards relative to immediate ones.(Hugging Face)

  • The discounted return from time t onward is:

    G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + …

    so that more distant rewards get multiplied by higher powers of γ and thus contribute less if γ < 1.(Hugging Face)
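
To make the formula concrete, the short sketch below computes the discounted return for an assumed reward sequence under two values of γ; the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """G_t = r_{t+1} + gamma * r_{t+2} + gamma**2 * r_{t+3} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1, 1, 1, 1, 10]                 # assumed rewards: a big payoff at the end
print(discounted_return(rewards, 0.99))    # ~13.55: the final reward still matters
print(discounted_return(rewards, 0.50))    # 2.5: the final reward contributes only 0.625
```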

Influence on Reward Valuation

  • When γ is close to 1 (e.g., 0.99):

    • The agent cares about long-term rewards, attributing nearly equal weight to near and far future.
    • This encourages planning for the long horizon, even if achieving those future rewards involves delays or intermediate negative steps.
  • When γ is low (e.g., 0.5):

    • Rewards far in the future get heavily discounted (exponentially diminished).
    • The agent prioritizes immediate or short-term payoffs; long-term consequences matter far less.

The Hugging Face course uses a mouse-and-cheese analogy: a reward (cheese) far away or behind danger (cat) will be heavily discounted and thus less likely to drive behavior if γ is low.(Hugging Face)
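
The effect is easy to quantify: a reward that arrives k steps in the future is weighted by γ^k, which shrinks very differently for high and low γ (the step counts below are just illustrative).

```python
# Weight gamma**k given to a reward that arrives k steps in the future.
for gamma in (0.99, 0.5):
    weights = {k: round(gamma ** k, 4) for k in (1, 5, 10, 20)}
    print(gamma, weights)
# gamma=0.99 -> roughly {1: 0.99, 5: 0.951, 10: 0.9044, 20: 0.8179}
# gamma=0.5  -> roughly {1: 0.5,  5: 0.0312, 10: 0.001, 20: 0.0}
```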

Influence on Actions and Policy Behavior

Because the agent’s value or Q-function reflects expected discounted return, γ directly shapes which states and actions appear valuable.

  1. Short vs Long-term Focus

    • Low γ → agent greedy for immediate reward, potentially myopic, ignoring beneficial but delayed outcomes.
    • High γ → agent strategically plans, possibly delaying gratification to reach bigger eventual gains.
  2. Risk Sensitivity and Stability

    • With high γ, the agent considers more future steps, which increases the variance and complexity of value estimation; this can make training less stable or slower to converge and requires longer horizons for correct backups.
    • With low γ, training is more localized in time, which is often more stable and faster, but it may fail on tasks that require long-term planning.
  3. Exploration Impacts

    • Long-term objectives might require exploration of risky or indirect paths. If γ is too low, the agent may never explore these paths because immediate rewards dominate.
    • If γ is too high, the agent may overvalue uncertain long-horizon returns, making learning more difficult or prone to overfitting to noisy future estimates.

Choosing the Right Discount Rate

γ Range | Agent’s Behavior Focus | Training Considerations
≈ 0–0.5 | Immediate rewards | Faster training, but myopic
≈ 0.9–0.99 | Long-horizon, strategic planning | Requires longer training, more variance, careful tuning
Exactly 1 | Undiscounted return | May diverge on infinite-horizon tasks if returns are unbounded

In practice, γ is a hyperparameter that must be tuned based on:

  • The environment’s episode length (and whether tasks are episodic or continue indefinitely).
  • Reward delays (how long before consequential rewards appear).
  • The stability of training and exploration strategy.

Summary

  • Discount rate γ modulates how much the agent values future rewards.
  • High γ → long-term weight, encouraging planning but increasing complexity and potential variance.
  • Low γ → short-term focus, simpler and faster learning but potential failure in tasks requiring foresight.
  • The choice of γ significantly influences the policy learned, the quality of exploration, and training stability.

Episodic vs Continuing tasks

A task is an instance of a Reinforcement Learning problem. Basically, a task describes the type of problem the agent is solving. We can have two types of tasks: episodic and continuing.

Episodic task

  • In this case, we have a starting point and an ending point (a terminal state).
  • After the terminal state, the environment resets, and the agent starts fresh.
  • This creates an episode: a list of States, Actions, Rewards, and new States.

For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario level and ends when you are killed or you reach the end of the level.
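
Here is a minimal sketch of collecting one such episode as a list of (state, action, reward, next state) tuples, again assuming the Gymnasium API with a placeholder environment and random policy:

```python
import gymnasium as gym

# Collect a single episode as a list of (state, action, reward, next_state) tuples.
env = gym.make("CartPole-v1")
episode = []

state, _ = env.reset()
done = False
while not done:
    action = env.action_space.sample()                         # placeholder policy
    next_state, reward, terminated, truncated, _ = env.step(action)
    episode.append((state, action, reward, next_state))
    state = next_state
    done = terminated or truncated                             # terminal state ends the episode

print(f"Episode length: {len(episode)} transitions")
env.close()
```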

Continuing tasks

  • These are tasks that continue forever (no terminal state).
  • In this case, the agent must learn how to choose the best actions and simultaneously interact with the environment.
  • Discounting (γ) is especially important here to ensure the total return stays finite.

For instance, consider an agent that does automated stock trading or controls a power grid. For these tasks, there is no starting point or terminal state; the agent keeps running until we decide to stop it.


Exploration vs Exploitation trade-off

  • Exploration means trying actions (often at random) to gather more information about the environment.
  • Exploitation means using the information the agent already has in order to maximize the reward.

A key challenge in RL is the exploration vs. exploitation trade-off. The agent must decide whether to exploit its current knowledge to get a known reward or to explore new actions in the hope of discovering an even better reward.
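
A simple and common way to balance the two is an ε-greedy strategy: explore with probability ε, otherwise exploit the best-known action. The sketch below assumes a vector of Q-values like the Q-table sketched earlier; the decay schedule is only an example.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))    # explore: random action
    return int(np.argmax(q_values))                # exploit: best current estimate

# A typical schedule: start fully exploratory, then decay toward mostly exploiting.
epsilon = 1.0
for step in range(10_000):
    epsilon = max(0.05, epsilon * 0.999)           # decay toward a small floor
```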

Other key tensions include: on-policy (learn from the data your current policy generates) vs off-policy (learn from data generated by other or older policies); model-free (learn values or a policy directly from experience) vs model-based (learn a model of the environment’s dynamics and plan with it).