Optimisation directe des préférences - Revision history

Pitpitt on 27 August 2024 at 13:43

2024-08-27T13:43:17Z

← Previous revision		Revision as of 27 August 2024 at 09:43
Line 20:		Line 20:
	[https://arxiv.org/abs/2305.18290 Source : arxiv ]		[https://arxiv.org/abs/2305.18290 Source : arxiv ]

	[[Catégorie:~~vocabulary~~]]		[[Catégorie:GRAND LEXIQUE FRANÇAIS]]

Pitpitt: Pitpitt moved page Direct Preference Optimization to Optimisation directe des préférences

2024-08-27T13:41:49Z

Pitpitt moved page Direct Preference Optimization to Optimisation directe des préférences

← Previous revision	Revision as of 27 August 2024 at 09:41
(No difference)

Bouchard on 26 August 2024 at 16:57

2024-08-26T16:57:23Z

← Previous revision		Revision as of 26 August 2024 at 12:57
Line 13:		Line 13:
	== English ==		== English ==
	''' Direct Preference Optimization '''		''' Direct Preference Optimization '''

	''' DPO '''		''' DPO '''

Bouchard on 26 August 2024 at 16:57

2024-08-26T16:57:08Z

← Previous revision		Revision as of 26 August 2024 at 12:57
Line 1:		Line 1:
	==~~under construction~~==		== Definition ==
			While large-scale unsupervised language models acquire broad world knowledge and some reasoning skills, precise control of their behavior is difficult to achieve because their training is completely unsupervised.

			Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised language model to align with these preferences, often with reinforcement learning from human feedback (RLHF).

			However, RLHF is a complex and often unstable procedure: it first fits a reward model that reflects human preferences, then fine-tunes the large unsupervised language model with reinforcement learning to maximize this estimated reward without drifting too far from the original model.

	~~== Definition ==~~		Direct Preference Optimization (DPO) is a parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, so the standard RLHF problem can be solved with only a simple classification loss. The resulting algorithm is stable, performant, and computationally lightweight: it eliminates the need to sample from the language model during fine-tuning and to perform significant hyperparameter tuning.
	~~XXXXXXXXX~~

	== French ==		== French ==
	''' ~~XXXXXXXXX~~ '''		''' optimisation directe des préférences '''

	== English ==		== English ==
	''' Direct Preference Optimization'''		''' Direct Preference Optimization '''
			''' DPO '''
	While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

	==Sources==		==Sources==


	[https://arxiv.org/abs/2305.18290 Source : arxiv ]		[https://arxiv.org/abs/2305.18290 Source : arxiv ]



	[[Catégorie:vocabulary]]		[[Catégorie:vocabulary]]
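
The definition in the revision above describes DPO as reducing the standard RLHF problem to a simple classification loss. As a rough illustration only, the sketch below computes that loss for a single preference pair, following the form given in the cited arXiv paper: the negative log-sigmoid of a scaled difference of log-probability ratios between the fine-tuned policy and a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example and do not come from the wiki page.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper, illustration only).

    Each argument is the summed log-probability of a full response under
    either the policy being fine-tuned or the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # beta scales the margin; it plays the role of the strength of the
    # constraint that keeps the policy close to the reference model.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), written in a
    # numerically stable way for both signs of the margin.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference has zero margin.
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# A policy that raises the preferred response and lowers the other
# gets a positive margin and therefore a smaller loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Because the loss depends only on log-probabilities of responses that are already in the preference dataset, no sampling from the language model is needed during fine-tuning, which matches the claim in the definition.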

Pitpitt: Text replacement: « ↵↵↵↵ » with «   »

2024-01-29T17:34:43Z

Text replacement: « ↵↵↵↵ » with «   »

← Previous revision		Revision as of 29 January 2024 at 13:34
Line 16:		Line 16:

	[https://arxiv.org/abs/2305.18290 Source : arxiv ]		[https://arxiv.org/abs/2305.18290 Source : arxiv ]




	[[Catégorie:vocabulary]]		[[Catégorie:vocabulary]]

Pitpitt: Text replacement: « ↵↵↵==Sources== » with «  ==Sources== »

2024-01-29T15:54:01Z

Text replacement: « ↵↵↵==Sources== » with «  ==Sources== »

← Previous revision		Revision as of 29 January 2024 at 11:54
Line 11:		Line 11:

	While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.		While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. 
Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.



	==Sources==		==Sources==

Pitpitt: Text replacement: « ↵<small> » with «  ==Sources==  »

2024-01-28T02:33:30Z

Text replacement: « ↵<small> » with «  ==Sources==  »

← Previous revision		Revision as of 27 January 2024 at 22:33
Line 13:		Line 13:


	~~<small>~~
			==Sources==


	[https://arxiv.org/abs/2305.18290 Source : arxiv ]		[https://arxiv.org/abs/2305.18290 Source : arxiv ]

Pitpitt: Page created with « ==under construction== == Definition == XXXXXXXXX == French == ''' XXXXXXXXX ''' == English == ''' Direct Preference Optimization''' While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model g... »

2023-12-30T14:38:58Z

Page created with « ==under construction== == Definition == XXXXXXXXX == French == ''' XXXXXXXXX ''' == English == ''' Direct Preference Optimization''' While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model g... »

New page

==under construction==

== Definition ==
XXXXXXXXX

== French ==
''' XXXXXXXXX '''

== English ==
''' Direct Preference Optimization'''

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

<small>

[https://arxiv.org/abs/2305.18290 Source : arxiv ]

[[Catégorie:vocabulary]]
