This is one of the “smartest” models you can fit on a 24GB GPU now, with no offloading and very little quantization loss. It feels big and insightful, like a better (albeit dry) Llama 3.3 70B with thinking, and with more STEM world knowledge than QwQ 32B, but it comfortably fits thanks to the new exl3 quantization!
You need to use a backend that supports exl3, like (at the moment) text-gen-web-ui or (soon) TabbyAPI. Once the quant is loaded, you can hit it from scripts through the backend's OpenAI-compatible API; see the sketch below.
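A minimal sketch of querying a locally served exl3 quant over the OpenAI-compatible endpoint that both backends expose. The port, model name, and API key here are assumptions (placeholders) — use whatever your backend prints on startup.

```python
# Sketch: chat with a locally loaded exl3 quant via an OpenAI-compatible API.
# Assumptions: backend listens on localhost:5000 (adjust to your config),
# "your-exl3-quant" is the folder name of the loaded model, and the API key
# is whatever your backend issued (some local setups don't require one).
import requests

BASE_URL = "http://localhost:5000/v1"   # assumed default port; change if needed
API_KEY = "your-api-key"                # placeholder

payload = {
    "model": "your-exl3-quant",          # hypothetical model name
    "messages": [
        {"role": "user", "content": "Summarize exl3 quantization in one paragraph."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```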
AFAIK ROCm isn’t yet supported: https://github.com/turboderp-org/exllamav3
I hope the word “yet” means that it might come at some point, but for now it doesn’t seem to be developed in any form or fashion.
There’s a “What’s missing” section there that lists ROCm, so I’m pretty sure it’s planned to be added.
That, and exl2 has ROCm support.
There was always the bugaboo of uttering a prayer to get ROCm flash attention working (come on, AMD…), but exl3 has plans to switch to flashinfer, which should eliminate that issue.