"Group Sequence Policy Optimization": difference between revisions


(Page created with "==under construction== == Definition == XXXXXXXXX == French == ''' XXXXXXXXX ''' == English == '''Group Sequence Policy Optimization''' '''GSPO''' A new reinforcement learning algorithm for training large language models that addresses critical stability issues in existing methods. Current state-of-the-art algorithms like GRPO exhibit severe stability issues when training gigantic language model that can lead to catastrophic model collapse. GSPO resolves t... ")
 
No edit summary
 
Line 2: Line 2:


== Definition ==
== Definition ==
XXXXXXXXX
'''[[apprentissage par renforcement|Reinforcement learning]]''' '''[[Algorithme|algorithm]]''' that improves '''[[entraînement|training]]''' efficiency and the performance of '''[[Grand modèle de langues (GML)|large language models]]''' by using sequence-level importance ratios (?) and operations.


== French ==
== French ==
Line 12: Line 12:
'''GSPO'''
'''GSPO'''


A new reinforcement learning algorithm for training large language models that addresses critical stability issues in existing methods. Current state-of-the-art algorithms such as GRPO exhibit severe instability when training very large language models, which can lead to catastrophic model collapse. GSPO resolves these issues by performing optimization at the sequence level rather than the token level, leading to more stable and efficient training.
'' Reinforcement learning algorithm that improves training efficiency and performance of large language models by using sequence-level importance ratios and operations.''
GSPO is a solution for improving the stability of current RL training methods for large language models. By aligning the optimization approach with the sequence-level nature of rewards and avoiding problematic token-level importance weighting, GSPO provides a more stable and efficient foundation for scaling RL training.


== Source ==
== Source ==

Latest revision as of 6 October 2025, at 11:27

under construction

Definition

Reinforcement learning algorithm that improves training efficiency and the performance of large language models by using sequence-level importance ratios (?) and operations.
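
To make "sequence-level importance ratios" more concrete, here is a minimal sketch in plain Python. It assumes, following the published GSPO description as we understand it, that the ratio for a response is the length-normalized likelihood ratio of the whole sequence under the new and old policies, that advantages are normalized within a group of responses to the same prompt, and that clipping is applied once per sequence rather than per token. The function names, the `eps` default, and the toy inputs are hypothetical and chosen only for illustration; this is not a reference implementation.

```python
import math


def sequence_importance_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence-level ratio (assumed form).

    Computes exp(mean_t(log pi_new(y_t) - log pi_old(y_t))), i.e. the
    ratio of whole-sequence likelihoods raised to the power 1/length.
    """
    if not new_logprobs or len(new_logprobs) != len(old_logprobs):
        raise ValueError("log-prob lists must be non-empty and equal length")
    total = sum(n - o for n, o in zip(new_logprobs, old_logprobs))
    return math.exp(total / len(new_logprobs))


def group_advantages(rewards, tiny=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # `tiny` avoids division by zero when all rewards are equal.
    return [(r - mean) / (std + tiny) for r in rewards]


def clipped_sequence_objective(ratios, advantages, eps=0.2):
    """Average clipped surrogate, with one ratio per sequence.

    `eps` is an arbitrary illustrative value, not a recommended setting.
    """
    terms = []
    for s, a in zip(ratios, advantages):
        clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        terms.append(min(s * a, clipped * a))
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Toy group of two responses with per-token log-probabilities.
    new = [[-1.0, -0.5, -0.2], [-2.0, -1.0]]
    old = [[-1.2, -0.6, -0.3], [-1.5, -0.9]]
    ratios = [sequence_importance_ratio(n, o) for n, o in zip(new, old)]
    advs = group_advantages([1.0, 0.0])
    print(ratios, advs, clipped_sequence_objective(ratios, advs))
```

The point of the sketch is the granularity: each response contributes a single ratio and a single clipping decision, which matches the sequence-level nature of the reward, whereas a token-level method would compute and clip one ratio per token.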

French

XXXXXXXXX

English

Group Sequence Policy Optimization

GSPO

Reinforcement learning algorithm that improves training efficiency and performance of large language models by using sequence-level importance ratios and operations.

Source

Source: huggingface

Contributors: Arianne Arel, wiki