Group Sequence Policy Optimization
en construction
Définition
XXXXXXXXX
Français
XXXXXXXXX
Anglais
Group Sequence Policy Optimization
GSPO
A new reinforcement learning algorithm for training large language models that addresses critical stability issues in existing methods. Current state-of-the-art algorithms like GRPO exhibit severe stability issues when training gigantic language model that can lead to catastrophic model collapse. GSPO resolves these issues by performing optimization at the sequence level rather than the token level, leading to more stable and efficient training. GSPO, a solution to improve the stability of current RL training methods for large language models. By aligning the optimization approach with the sequence-level nature of rewards and avoiding problematic token-level importance weighting, GSPO provides a more stable and efficient foundation for scaling RL training.
Source
Contributeurs: wiki
