AMD ROCm 6.2 update elevates AI and HPC performance to a new level
AMD has announced the release of the new ROCm 6.2 software stack, aimed at giving users of AMD CPUs and GPUs the best possible performance and convenience in AI and HPC environments.
vLLM support expansion for AMD Instinct Accelerators
First on the list is broader vLLM support for AMD Instinct accelerators, which improves multi-GPU computational efficiency and reduces memory usage, among other improvements.
Users can take advantage of new features such as multi-GPU execution and an FP8 KV cache, as well as experimental capabilities including FP8 GEMMs and custom decode paged attention.
To reap the benefits of these new integrations, consult the ROCm documentation for the full details and for the steps needed to enable the experimental features.
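As a rough illustration of what this looks like in practice (a minimal sketch, not taken from the announcement): vLLM's Python API exposes tensor parallelism and the FP8 KV cache through constructor arguments. The model name and parallelism degree below are placeholders, and the exact options supported may vary between vLLM and ROCm releases.

```python
# Minimal sketch: serving a model with vLLM across multiple AMD Instinct GPUs.
# Model name, tensor_parallel_size, and kv_cache_dtype are illustrative choices;
# check the ROCm/vLLM documentation for the options your build supports.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",   # placeholder model
    tensor_parallel_size=4,             # split the model across 4 GPUs
    kv_cache_dtype="fp8",               # store the KV cache in FP8 to save memory
)

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What does ROCm 6.2 add for vLLM?"], params)
print(outputs[0].outputs[0].text)
```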
Better memory efficiency through Bitsandbytes Quantization Support
For the memory department, AMD has added support for the bitsandbytes quantization library, which uses 8-bit optimizers to cut memory usage and thus improves the capabilities and performance of large-model training on limited hardware.
Specifically, LLM.Int8() quantization shrinks model weights so LLMs can be deployed on systems with less memory, while lower-bit quantization accelerates both AI training and inference, increasing overall efficiency and productivity.
By lowering memory and computational demands, Bitsandbytes makes advanced AI capabilities accessible to a wider range of users, offers cost savings, democratizes AI development, and expands innovation opportunities.
It also supports scalability by enabling efficient management of larger models within existing hardware constraints while maintaining accuracy close to 32-bit precision versions.
Instructions on how to integrate this library with ROCm are available in the ROCm documentation.
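For a concrete picture of the two main use cases, here is a minimal sketch (not from the announcement; the model name and layer sizes are placeholders, and it assumes a ROCm-enabled build of bitsandbytes installed per AMD's instructions):

```python
# Minimal sketch of the two main bitsandbytes use cases.
import bitsandbytes as bnb
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# (1) LLM.Int8() inference: weights are quantized to 8 bits at load time,
# roughly halving memory versus FP16 for the linear layers.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",                                # placeholder model
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# (2) 8-bit optimizer: Adam's moment estimates are stored in 8 bits, cutting
# optimizer memory when training an ordinary full-precision model.
trainable = torch.nn.Linear(4096, 4096).cuda()          # stand-in for a trainable model
optimizer = bnb.optim.Adam8bit(trainable.parameters(), lr=1e-5)
```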
No more Internet dependency with ROCm Offline Installer Creator
Want to deploy AMD ROCm within an air-gapped system for security's sake? Then be sure to use the ROCm Offline Installer Creator to create an installer that includes all necessary dependencies to get it up and running without needing an online connection.
It also simplifies tasks like large-scale local deployment by automating post-installation steps such as user group management and driver handling, ensuring correct and consistent installations.
New Omnitrace and Omniperf Profiler Tools
A couple of new analysis tools are now available to the public: Omnitrace gives a holistic view of system performance across CPUs, GPUs, NICs, and network fabrics, helping developers identify and address bottlenecks, while Omniperf offers detailed GPU kernel analysis for fine-tuning.
Used together, the two profilers cover both application-wide and compute-kernel-specific performance, helping developers make informed optimization decisions more accurately, especially in the context of AI training and inference as well as HPC simulations.
Bigger and Better FP8 Support
ROCm 6.2 expands FP8 support across the stack, delivering a significant boost to AI inference performance by targeting memory bottlenecks and high latency.
Because FP8 values take half the space of FP16, larger models or batches can be handled within the same hardware constraints, enabling more efficient training and inference. Reduced-precision FP8 arithmetic also lowers the latency of data transfers and computations.
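To put a rough number on the memory argument, the back-of-the-envelope sketch below (illustrative figures, not from the announcement) estimates the KV-cache footprint of a Llama-2-7B-sized configuration in FP16 versus FP8:

```python
# Back-of-the-envelope KV-cache size for a Llama-2-7B-like configuration
# (illustrative dimensions): two tensors (K and V) per layer.
layers, kv_heads, head_dim = 32, 32, 128
seq_len, batch = 4096, 8

def kv_cache_gib(bytes_per_value: int) -> float:
    total = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value
    return total / 1024**3

print(f"FP16 KV cache: {kv_cache_gib(2):.1f} GiB")  # ~16.0 GiB
print(f"FP8  KV cache: {kv_cache_gib(1):.1f} GiB")  # ~8.0 GiB
```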
Here are some of the other FP8-related enhancements across the stack:
- Transformer Engine: Adds FP8 GEMM support in PyTorch and JAX via hipBLASLt, maximizing throughput and reducing latency compared to FP16/BF16 (see the sketch after this list).
- XLA FP8: JAX and Flax now support FP8 GEMM through XLA to improve performance.
- vLLM Integration: Further optimizes vLLM with FP8 capabilities.
- FP8 RCCL: RCCL now handles FP8-specific collective operations, expanding its versatility.
- MIOpen: Supports FP8-based Fused Flash attention, boosting efficiency.
- Unified FP8 Header: Standardizes FP8 headers across libraries, simplifying development and integration.
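As an illustration of the Transformer Engine item above, the sketch below shows FP8 GEMMs through Transformer Engine's PyTorch API. It is a minimal sketch, not from the announcement: the layer sizes are arbitrary, and it assumes the ROCm build of Transformer Engine follows the upstream transformer_engine.pytorch interface.

```python
# Minimal sketch: running a linear layer's GEMMs in FP8 via Transformer Engine.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Arbitrary layer size; FP8 GEMMs generally require dimensions divisible by 16.
layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# Inside fp8_autocast, the GEMMs in te.Linear execute in FP8 (backed by hipBLASLt
# on ROCm), with scaling factors managed by the DelayedScaling recipe.
recipe = DelayedScaling(fp8_format=Format.HYBRID)
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)

print(y.shape)  # torch.Size([16, 4096])
```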