Group Sequence Policy Optimization

Under construction

Definition

XXXXXXXXX

French

XXXXXXXXX

English

Group Sequence Policy Optimization

GSPO

A reinforcement learning algorithm for training large language models that addresses critical stability issues in existing methods. State-of-the-art algorithms such as Group Relative Policy Optimization (GRPO) become severely unstable when training very large language models, which can lead to catastrophic model collapse. GSPO resolves these issues by performing optimization at the sequence level rather than the token level, which makes training more stable and efficient.
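
The difference between token-level and sequence-level optimization can be written down as two importance ratios. The notation below is an assumption made for illustration and does not come from this entry: x is a prompt, y_i is the i-th sampled response of length |y_i|, and π_θ and π_θ_old are the current and the sampling policy. The sequence-level ratio is shown in its length-normalized form, as GSPO is commonly formulated.

```latex
% Token-level importance ratio (GRPO-style): one weight per token
w_{i,t}(\theta) =
  \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}
       {\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}

% Sequence-level importance ratio (GSPO-style): one weight per response,
% length-normalized so that long and short responses are comparable
s_i(\theta) =
  \left(
    \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}
  \right)^{1/|y_i|}
  = \exp\!\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \log w_{i,t}(\theta) \right)
```

Because s_i is the geometric mean of the token ratios, a single extreme token ratio has less influence on the update than it would under per-token weighting.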

GSPO is a way to improve the stability of current RL training methods for large language models. By aligning the optimization approach with the sequence-level nature of rewards and avoiding problematic token-level importance weighting, it provides a more stable and efficient foundation for scaling RL training.
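
To make the idea of sequence-level weighting concrete, here is a minimal sketch in plain Python. It is not a reference implementation. The function names (`sequence_ratio`, `group_advantages`, `gspo_objective`) are hypothetical, the clipping range `eps=0.2` is an arbitrary illustrative value, and the formulation (a length-normalized sequence ratio, group-normalized rewards as advantages, and a clipped surrogate) is an assumption about how GSPO is usually described rather than something stated in this entry.

```python
import math


def sequence_ratio(new_logps, old_logps):
    """Length-normalized sequence-level importance ratio.

    new_logps / old_logps: per-token log-probabilities of ONE response
    under the current policy and the sampling (old) policy.
    """
    assert len(new_logps) == len(old_logps) and len(new_logps) > 0
    mean_log_ratio = sum(n - o for n, o in zip(new_logps, old_logps)) / len(new_logps)
    return math.exp(mean_log_ratio)


def group_advantages(rewards, tiny=1e-8):
    """Normalize the rewards of one group of responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + tiny) for r in rewards]


def gspo_objective(new_logps_group, old_logps_group, rewards, eps=0.2):
    """Clipped surrogate objective with ONE ratio per response.

    eps is an illustrative clipping range (assumption), not a recommended value.
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps_group, old_logps_group, advantages):
        s = sequence_ratio(new_lp, old_lp)
        s_clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        # The whole response is kept or clipped together, matching the fact
        # that the reward is assigned to the response as a whole.
        total += min(s * adv, s_clipped * adv)
    return total / len(rewards)


if __name__ == "__main__":
    # Two sampled responses to one prompt; values are made up for illustration.
    new = [[-0.5, -0.5], [-1.0, -1.0]]
    old = [[-1.0, -1.0], [-1.0, -1.0]]
    print(gspo_objective(new, old, rewards=[1.0, 0.0]))
```

In a real training loop the log-probabilities would come from the model and the objective would be maximized by gradient ascent; the sketch only shows where the sequence-level ratio replaces per-token weights.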

Source

Source: Hugging Face