NVIDIA releases new Mistral-NeMo-Minitron 8B LLM; leads the 8B category in 9 benchmarks
NVIDIA has announced the new Mistral-NeMo-Minitron 8B LLM and released it to the public.
Following last month's Mistral NeMo 12B, Team Green applied width pruning and light retraining via knowledge distillation to the larger model to create the new 8B version.
The methodology has already proven effective with other LLMs such as NVIDIA Minitron 8B and 4B and Llama-3.1-Minitron 4B, and it traces back to NVIDIA's own paper, 'Compact Language Models via Pruning and Knowledge Distillation', where the approach was first proposed.
As for how they did it, NVIDIA shares a fair amount of detail. First, the unpruned Mistral NeMo 12B teacher model was fine-tuned on 127B tokens. Then, width-only pruning reduces both the embedding (hidden) and MLP intermediate dimensions while keeping the number of attention heads and layers unchanged.
In concrete figures, the MLP intermediate dimension shrinks from 14,336 to 11,520, and the hidden size drops from the original 5,120 to 4,096.
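To make the idea concrete, below is a minimal PyTorch sketch of what width-only pruning to those dimensions could look like for a single MLP block. The module layout, importance scores, and function names here are illustrative assumptions, not NVIDIA's actual Minitron code; in practice the importance scores come from activation statistics on a calibration set, and the attention heads and layer count are left untouched, exactly as described above.

```python
# A minimal sketch of width-only pruning on one transformer MLP block.
# Importance scores and module names are illustrative assumptions, not
# NVIDIA's Minitron implementation.
import torch
import torch.nn as nn

TEACHER_HIDDEN, STUDENT_HIDDEN = 5120, 4096   # embedding (hidden) dimension
TEACHER_FFN, STUDENT_FFN = 14336, 11520       # MLP intermediate dimension

class MLP(nn.Module):
    def __init__(self, hidden, ffn):
        super().__init__()
        self.up_proj = nn.Linear(hidden, ffn, bias=False)
        self.down_proj = nn.Linear(ffn, hidden, bias=False)

    def forward(self, x):
        return self.down_proj(torch.relu(self.up_proj(x)))

def top_channels(scores: torch.Tensor, keep: int) -> torch.Tensor:
    """Indices of the `keep` most important channels, kept in original order."""
    return torch.topk(scores, keep).indices.sort().values

def prune_mlp(teacher_mlp: MLP, hidden_scores, ffn_scores) -> MLP:
    """Slice teacher weights down to the student widths (retraining comes later)."""
    keep_h = top_channels(hidden_scores, STUDENT_HIDDEN)
    keep_f = top_channels(ffn_scores, STUDENT_FFN)
    student = MLP(STUDENT_HIDDEN, STUDENT_FFN)
    with torch.no_grad():
        # up_proj weight has shape (ffn, hidden): keep selected rows and columns
        student.up_proj.weight.copy_(teacher_mlp.up_proj.weight[keep_f][:, keep_h])
        # down_proj weight has shape (hidden, ffn)
        student.down_proj.weight.copy_(teacher_mlp.down_proj.weight[keep_h][:, keep_f])
    return student

if __name__ == "__main__":
    teacher = MLP(TEACHER_HIDDEN, TEACHER_FFN)
    # Placeholder importance scores; real ones would be gathered from activations.
    h_scores = torch.rand(TEACHER_HIDDEN)
    f_scores = torch.rand(TEACHER_FFN)
    student = prune_mlp(teacher, h_scores, f_scores)
    print(student.up_proj.weight.shape)  # torch.Size([11520, 4096])
```

Because the hidden dimension itself is pruned, the same channel slicing also has to be applied to the embedding table and the attention projection matrices along their hidden axis so that shapes stay consistent across the network.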
Next, the pruned model is distilled with a peak learning rate of 1e-4, a minimum learning rate of 4.5e-7, 60 linear warm-up steps, a cosine decay schedule, and a global batch size of 768, using 380 billion tokens (the same dataset used for teacher fine-tuning).
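For readers curious what that distillation step might look like in code, here is a rough sketch of logit-level knowledge distillation using the schedule quoted above. The model interfaces, the total step count, and the exact loss formulation are assumptions for illustration, not NVIDIA's recipe.

```python
# A minimal sketch of logit-level knowledge distillation with the schedule
# described above (peak LR 1e-4, min LR 4.5e-7, 60-step linear warm-up,
# cosine decay). Model and batch interfaces are placeholders, not NVIDIA's code.
import math
import torch
import torch.nn.functional as F

PEAK_LR, MIN_LR = 1e-4, 4.5e-7
WARMUP_STEPS = 60
TOTAL_STEPS = 100_000   # assumption: in reality set by the 380B-token budget
GLOBAL_BATCH = 768      # sequences per optimizer step, summed across all GPUs

def lr_at(step: int) -> float:
    """Linear warm-up to PEAK_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

def distill_step(student, teacher, batch, optimizer, step, temperature=1.0):
    """One optimizer step of forward-KL distillation on the output logits."""
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    with torch.no_grad():
        teacher_logits = teacher(batch["input_ids"]).logits
    student_logits = student(batch["input_ids"]).logits
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * temperature ** 2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

With a global batch of 768 sequences, the 380-billion-token budget is what ultimately fixes the total number of optimizer steps, which is why TOTAL_STEPS above is only a placeholder.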
In short, the new Mistral-NeMo-Minitron 8B now leads its size class in accuracy across nine benchmarks, which also demonstrates that the pruning and knowledge distillation method described above is generally effective.
NVIDIA will also incorporate these techniques into the NVIDIA NeMo framework for generative AI.
More information about the general idea behind the pruning and knowledge distillation method can be found here.