Mountain Car and Pendulum Policy Learning - Control Systems
Q-Learning for Mountain Car and Pendulum Control
This project implements Q-learning algorithms to solve two reinforcement learning benchmarks: the Mountain Car and Inverted Pendulum environments. Key deliverables included state-action space discretization, epsilon-greedy exploration, and Bellman equation updates to learn optimal policies without prior knowledge of environment dynamics. The system achieved target rewards through iterative Q-table optimization and adaptive learning rate strategies.
Objectives
- Implement Q-learning with discrete state-action spaces for continuous control tasks.
- Achieve target average rewards: ≥ 90 (Mountain Car) and ≥ -300 (Pendulum) over 100 test episodes.
- Design adaptive ε-greedy policies to balance exploration/exploitation.
- Develop state discretization methods for continuous observations (position, velocity, angle).
- Validate algorithm robustness through learning rate tuning and reward convergence analysis.
Project Process
Environment Configuration:
Integrated Gymnasium environments with state bounds:
- Mountain Car: Position [-1.2, 0.6], Velocity [-0.07, 0.07]
- Pendulum: Angle [-π, π], Angular Velocity [-8, 8]
- State Discretization: Created 50-bin grids for Mountain Car (3,125 states) and 20-bin grids for Pendulum (1,600 states) using linear partitioning.
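A minimal sketch of the linear-partition discretization under these bounds; the `make_bins`/`discretize` helper names are illustrative, not the project's actual identifiers:

```python
import numpy as np

# State bounds per dimension (Mountain Car values from the list above).
MC_BOUNDS = [(-1.2, 0.6), (-0.07, 0.07)]   # position, velocity

def make_bins(bounds, n_bins):
    """Evenly spaced interior bin edges (linear partitioning) for each state dimension."""
    return [np.linspace(low, high, n_bins + 1)[1:-1] for low, high in bounds]

def discretize(obs, bins):
    """Map a continuous observation to a tuple of bin indices, usable as a Q-table key."""
    return tuple(int(np.digitize(x, edges)) for x, edges in zip(obs, bins))

mc_bins = make_bins(MC_BOUNDS, n_bins=50)
print(discretize(np.array([-0.5, 0.01]), mc_bins))   # e.g. (19, 28)
```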
Q-Learning Core:
Implemented Bellman updates with terminal state handling:
Q(s,a) ← Q(s,a) + α[r + γ maxₐ’ Q(s’,a’) - Q(s,a)]
Used α=0.1 (decaying) and γ=0.99 for discounted returns.
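A minimal sketch of this tabular update with terminal-state handling (the bootstrap term is dropped when the episode terminates); the function and table names are illustrative:

```python
import numpy as np
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, terminated, alpha=0.1, gamma=0.99):
    """One tabular Q-learning (Bellman) update; skip the bootstrap term at terminal states."""
    target = reward if terminated else reward + gamma * np.max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])

# Q-table keyed by discretized state tuples, one value per discrete action (3 for Mountain Car).
Q = defaultdict(lambda: np.zeros(3))
```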
Exploration Strategy:
Deployed ε-decay from 1.0 to 0.01 over episodes:
- Mountain Car: Exponential decay to encourage early exploration
- Pendulum: Linear decay for steady policy refinement
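A sketch of the two decay schedules plus the ε-greedy action rule; the `decay_rate` constant and function names are assumptions, not the exact course values:

```python
import numpy as np

def epsilon_exponential(episode, eps_start=1.0, eps_end=0.01, decay_rate=0.001):
    """Exponential schedule (Mountain Car): heavy early exploration, long low-epsilon tail."""
    return eps_end + (eps_start - eps_end) * np.exp(-decay_rate * episode)

def epsilon_linear(episode, n_episodes=5000, eps_start=1.0, eps_end=0.01):
    """Linear schedule (Pendulum): steady reduction from eps_start to eps_end over training."""
    frac = min(episode / n_episodes, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, epsilon):
    """Epsilon-greedy selection over one row of Q-values."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))
```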
Training & Evaluation:
Ran 5,000 episodes with periodic testing:
- Mountain Car: 1,000-step episode cap during evaluation
- Pendulum: Fixed 200-step episodes
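Putting the pieces together, a self-contained sketch of the Mountain Car training loop described above (α, γ, and the episode count follow the values stated in this section; the decay constant and evaluation cadence are illustrative, and the Pendulum run follows the same pattern with fixed 200-step episodes and a discretized torque action set):

```python
import gymnasium as gym
import numpy as np
from collections import defaultdict

env = gym.make("MountainCar-v0")
n_actions = env.action_space.n
# Interior bin edges for position and velocity (50 bins per dimension).
edges = [np.linspace(-1.2, 0.6, 51)[1:-1], np.linspace(-0.07, 0.07, 51)[1:-1]]
Q = defaultdict(lambda: np.zeros(n_actions))

def to_state(obs):
    """Discretize a continuous observation into a Q-table key."""
    return tuple(int(np.digitize(x, e)) for x, e in zip(obs, edges))

for episode in range(5000):
    obs, _ = env.reset()
    state = to_state(obs)
    epsilon = max(0.01, 0.999 ** episode)          # illustrative decay schedule
    done = False
    while not done:
        # Epsilon-greedy action selection.
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(Q[state]))
        obs, reward, terminated, truncated, _ = env.step(action)
        next_state = to_state(obs)
        # Bellman update with terminal-state handling.
        target = reward if terminated else reward + 0.99 * np.max(Q[next_state])
        Q[state][action] += 0.1 * (target - Q[state][action])
        state, done = next_state, terminated or truncated
    # Periodic greedy-policy evaluation (e.g. 100 test episodes) would go here.
```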
Conclusion and Future Improvements
The Q-learning agent achieved average rewards of 94 (Mountain Car) and -280 (Pendulum), exceeding the course targets. Future work could implement Deep Q-Networks (DQN) for continuous state handling, add prioritized experience replay, or integrate double Q-learning to mitigate maximization bias. Increasing the state-discretization granularity for the Pendulum could further reduce torque oscillations.
Project Information
- Category: Design/Hardware
- Client: Rensselaer Polytechnic Institute
- Project date: 7 November 2024