Mountain Car and Pendulum Policy Learning - Control Systems

Q-Learning for Mountain Car and Pendulum Control

This project implements tabular Q-learning to solve two reinforcement learning benchmarks: the Mountain Car and Inverted Pendulum environments. Key components included state-action space discretization, ε-greedy exploration, and Bellman equation updates, allowing the agents to learn optimal policies without prior knowledge of the environment dynamics. The system reached its target rewards through iterative Q-table optimization and adaptive learning-rate schedules.

Objectives

  • Implement Q-learning with discrete state-action spaces for continuous control tasks.
  • Achieve target average rewards of ≥ 90 (Mountain Car) and ≥ −300 (Pendulum) over 100 test episodes.
  • Design adaptive ε-greedy policies to balance exploration/exploitation.
  • Develop state discretization methods for continuous observations (position, velocity, angle).
  • Validate algorithm robustness through learning rate tuning and reward convergence analysis.

Project Process

  1. Environment Configuration: Integrated Gymnasium environments with the following state bounds (setup sketch below):
    • Mountain Car: Position [-1.2, 0.6], Velocity [-0.07, 0.07]
    • Pendulum: Angle [-π, π], Angular Velocity [-8, 8]
  2. State Discretization: Built linear-partition grids of 50 bins per dimension for Mountain Car (50 × 50 = 2,500 states) and 20 bins per dimension for Pendulum (20 × 20 = 400 states); see the discretization sketch below.
  3. Q-Learning Core: Implemented Bellman updates with terminal-state handling (update sketch below):
    Q(s,a) ← Q(s,a) + α [r + γ max_a′ Q(s′,a′) − Q(s,a)]
    Used α = 0.1 (decaying) and γ = 0.99 for discounted returns.
  4. Exploration Strategy: Deployed ε decay from 1.0 to 0.01 over training episodes (schedule sketch below):
    • Mountain Car: Exponential decay to encourage early exploration
    • Pendulum: Linear decay for steady policy refinement
  5. Training & Evaluation: Ran 5,000 training episodes with periodic testing (training-loop sketch below):
    • Mountain Car: 1,000-step episode cap during evaluation
    • Pendulum: Fixed 200-step episodes
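
The sketches below illustrate each step in Python; they are minimal reconstructions under stated assumptions, not the project's exact code. This first one wires up the environments and records the step 1 state bounds. The environment IDs are assumptions (the continuous Mountain Car variant is inferred from the ≥ 90 reward target; Pendulum-v1 is Gymnasium's current ID), as is the ENV_BOUNDS table.

    import gymnasium as gym
    import numpy as np

    # State bounds from step 1, one (low, high) row per state variable.
    ENV_BOUNDS = {
        "MountainCarContinuous-v0": np.array([[-1.2, 0.6],      # position
                                              [-0.07, 0.07]]),  # velocity
        "Pendulum-v1": np.array([[-np.pi, np.pi],               # angle
                                 [-8.0, 8.0]]),                 # angular velocity
    }

    env = gym.make("MountainCarContinuous-v0")
    obs, info = env.reset(seed=0)

Note that Pendulum-v1 observations are (cos θ, sin θ, θ̇); the angle would be recovered with np.arctan2(sin θ, cos θ) before binning.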
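
Step 2's linear partitioning reduces to clipping each observation to its bounds and scaling into integer bins. A minimal sketch; the helper name discretize is illustrative:

    def discretize(obs, bounds, n_bins):
        """Map a continuous observation to a tuple of bin indices (linear partitioning)."""
        lo, hi = bounds[:, 0], bounds[:, 1]
        ratios = (np.clip(obs, lo, hi) - lo) / (hi - lo)
        # Scale into [0, n_bins) and fold the upper boundary into the last bin.
        idx = np.minimum((ratios * n_bins).astype(int), n_bins - 1)
        return tuple(idx)

With 50 bins per dimension this yields the Mountain Car grid; the returned tuple indexes the Q-table directly.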
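
The step 3 update, with the terminal-state handling noted above: terminal transitions bootstrap from zero rather than from max_a′ Q(s′,a′). Names and the fixed-α default are illustrative (the project also decayed α over training):

    def q_update(Q, s, a, r, s_next, terminated, alpha=0.1, gamma=0.99):
        # Tabular Bellman update: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a)).
        # Terminal transitions have no successor value to bootstrap from.
        target = r if terminated else r + gamma * np.max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])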
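
Step 4's two schedules and the ε-greedy rule itself. The exponential decay rate is an assumption chosen so ε falls to roughly 0.01 by episode 5,000:

    def eps_exponential(episode, eps0=1.0, eps_min=0.01, decay=0.999):
        # Mountain Car schedule: multiplicative decay, floored at eps_min.
        return max(eps_min, eps0 * decay ** episode)

    def eps_linear(episode, n_episodes=5000, eps0=1.0, eps_min=0.01):
        # Pendulum schedule: straight-line interpolation from eps0 down to eps_min.
        frac = min(1.0, episode / n_episodes)
        return eps0 + frac * (eps_min - eps0)

    def select_action(Q, s, eps, n_actions, rng):
        # epsilon-greedy: explore with probability eps, otherwise act greedily.
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))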
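
Putting it together for Mountain Car: a minimal training loop over 5,000 episodes with the 1,000-step cap from step 5. The 5-way action grid over the continuous force range is an assumption, since the write-up discretizes actions but does not state the count:

    n_bins, n_actions = 50, 5                    # 5 actions is an assumption
    bounds = ENV_BOUNDS["MountainCarContinuous-v0"]
    actions = np.linspace(-1.0, 1.0, n_actions)  # discretized force values
    Q = np.zeros((n_bins, n_bins, n_actions))
    rng = np.random.default_rng(0)

    for episode in range(5000):
        obs, _ = env.reset()
        s = discretize(obs, bounds, n_bins)
        eps = eps_exponential(episode)
        for _ in range(1000):                    # step cap from step 5
            a = select_action(Q, s, eps, n_actions, rng)
            obs, r, terminated, truncated, _ = env.step(np.array([actions[a]]))
            s_next = discretize(obs, bounds, n_bins)
            q_update(Q, s, a, r, s_next, terminated)
            s = s_next
            if terminated or truncated:
                break

Periodic greedy evaluation (ε = 0) over 100 episodes, as in the objectives, would then be run every few hundred training episodes to check the reward targets.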

Conclusion and Future Improvements

The Q-learning agents achieved average rewards of 94 (Mountain Car) and −280 (Pendulum), exceeding the course targets. Future work could implement Deep Q-Networks (DQN) to handle continuous states directly, add prioritized experience replay, or integrate double Q-learning to mitigate maximization bias. Finer state discretization for the Pendulum could further reduce torque oscillations.

Project Information

  • Category: Design/Hardware
  • Client: Rensselaer Polytechnic Institute
  • Project date: 7 November 2024