LongVILA
Under construction
Definition
XXXXXXXXX
French
LongVILA
English
LongVILA
A framework for scaling vision-language models to long videos using reinforcement learning, supported by a specialized training infrastructure. It addresses the challenge of understanding hour-long videos, which requires temporal, spatial, goal-oriented, and narrative reasoning, and it achieves strong performance on various reasoning tasks.
Source
Contributors: wiki
