NVLM 1.0

NVIDIA · October 2024

Active · Open Weight · Decoder-only · Multimodal
Parameters: 72B
Context Window: 4K tokens

Description

A family of multimodal language models from NVIDIA that can process both text and images. Notably, adding vision capabilities improved the model's text performance, which is unusual: most multimodal models sacrifice some text quality when learning to handle images.

Key Innovations

Multimodal: Processing multiple types of input (text, images, audio, video) in a single model.
