Encodage par paires d'octets (Byte Pair Encoding)
under construction
Definition
A subword tokenization algorithm derived from a simple data compression technique: the most frequent pair of adjacent symbols in the training corpus is repeatedly replaced by a new symbol, producing a vocabulary of subword units in which common words appear as single tokens.
See also segment, natural language processing and Vocabulary (NLP)
French
Encodage par paires d'octets
English
Byte Pair Encoding
BPE
Byte Pair Encoding is a simple form of data compression algorithm and one of the most widely used subword tokenization algorithms. It replaces the most frequent pair of bytes in the data with a new byte that was not contained in the initial dataset. In Natural Language Processing, BPE is used to represent a large vocabulary with a small set of subword units, and the most common words are represented in the vocabulary as single tokens.
It is used in all GPT versions, RoBERTa, XLM, FlauBERT and more.
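A minimal sketch of the BPE merge loop, in the spirit of the classic formulation by Sennrich et al.; the toy corpus, function names and merge count below are invented for illustration and do not come from the source cited on this page.

import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by each word's frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Rewrite every word, fusing the chosen pair into a single symbol.
    # The lookarounds keep the match aligned on symbol boundaries.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Hypothetical corpus: each word is a space-separated symbol sequence
# (characters to start) mapped to its frequency.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for _ in range(10):  # the number of merges controls the subword vocabulary size
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)

At tokenization time, the learned merges are applied in order to new text, so frequent words collapse into single tokens while rare words decompose into subword units, matching the description above.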
Source

Source: GeeksforGeeks, https://www.geeksforgeeks.org/byte-pair-encoding-bpe-in-nlp/