To no one's surprise, NVIDIA is the one behind Meta's Llama 3 LLM inference acceleration
AI overlord NVIDIA is not done expanding its futuristic empire just yet, because Meta's latest LLM, Llama 3, is fully optimized for NVIDIA GPUs for both training and inference.
According to the official announcement, Meta's Llama team trained the model on a giant cluster of 24,576 H100 Tensor Core GPUs linked over RoCE and NVIDIA Quantum-2 InfiniBand networks.
But that figure doesn't seem to satisfy Mark Zuckerberg's ambitions just yet, because Meta is planning to harness 350,000 H100s in the near future. Yikes.
As for developers eyeing the open-source model and planning an integration, Llama 3 is currently available in the cloud, at the edge, in data centers, and on PCs. One may even try it out here, powered by the recently announced NVIDIA NIM microservices.
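For those who would rather poke at it from code than from a web demo, NIM endpoints speak an OpenAI-compatible API, so a minimal sketch of querying hosted Llama 3 might look like the following. The base URL, model name, and API key variable are assumptions based on NVIDIA's public API catalog, not details from the announcement itself:

```python
import os
from openai import OpenAI  # pip install openai

# Minimal sketch of calling Llama 3 through an NVIDIA NIM endpoint.
# Assumptions: the integrate.api.nvidia.com base URL and the
# "meta/llama3-70b-instruct" model name from NVIDIA's API catalog,
# plus an API key stored in the NVIDIA_API_KEY environment variable.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

completion = client.chat.completions.create(
    model="meta/llama3-70b-instruct",
    messages=[{"role": "user", "content": "Summarize Llama 3 in two sentences."}],
    max_tokens=256,
)
print(completion.choices[0].message.content)
```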
And knowing Team Green, you can be sure it is tuned for inference across the company's other platforms beyond the usual professional RTX and gaming GeForce RTX GPUs, such as the Jetson Orin robotics and edge computing modules.
Not forgetting the industrial folks, NVIDIA is also setting some sensible cost expectations, noting that a single H200 Tensor Core GPU can push around 3,000 TPS (tokens per second), enough to serve roughly 300 simultaneous users, when running the 70B-parameter version of Llama 3.
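As a quick sanity check on what that claim means for each person in the queue, a back-of-envelope calculation using only the two figures NVIDIA quotes works out like this:

```python
# Back-of-envelope math on NVIDIA's H200 serving figures for Llama 3 70B:
# ~3,000 aggregate tokens/second shared across ~300 simultaneous users.
aggregate_tps = 3_000    # total tokens/second from one H200 (NVIDIA's figure)
concurrent_users = 300   # simultaneous users served (NVIDIA's figure)

per_user_tps = aggregate_tps / concurrent_users
print(f"~{per_user_tps:.0f} tokens/second per user")  # ~10 tok/s, comfortable for chat
```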
Spinning back to Jetson, the 8B-parameter version can be tackled fairly efficiently at the edge, with the AGX Orin generating 40 TPS and the Orin Nano managing 15 TPS.
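If you want to see what numbers your own hardware puts up, a minimal throughput measurement with Hugging Face transformers might look like the sketch below. Assumptions: you have access to the gated meta-llama/Meta-Llama-3-8B-Instruct weights and a CUDA device; note that NVIDIA's Jetson figures come from its own optimized stack, so a plain fp16 run like this one won't reproduce them exactly.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local benchmark: measures decode throughput (tokens/second)
# for Llama 3 8B Instruct on whatever CUDA device is present (e.g. a
# Jetson AGX Orin). Assumes the gated meta-llama weights are downloaded.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Explain RoCE in one paragraph.", return_tensors="pt").to(model.device)

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/second")
```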