This is one of the “smartest” models you can fit on a 24GB GPU right now, with no offloading and very little quantization loss. It feels big and insightful, like a better (albeit dry) Llama 3.3 70B with thinking, and with more STEM world knowledge than QwQ 32B, and it fits comfortably thanks to the new exl3 quantization!
You need a backend that supports exl3, like (at the moment) text-gen-web-ui or (soon) TabbyAPI.
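If you end up going the TabbyAPI route once exl3 support lands there, it serves an OpenAI-compatible endpoint, so you can talk to it with the standard openai client pointed at localhost. Rough sketch only; the port, API key, and model name below are placeholders for whatever your own config uses:

```python
from openai import OpenAI

# TabbyAPI exposes an OpenAI-compatible /v1 endpoint locally.
# Port and key depend on your config (api_tokens etc.) -- these are placeholders.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="your-tabby-key")

resp = client.chat.completions.create(
    model="your-exl3-quant",          # whichever exl3 quant you loaded
    messages=[{"role": "user", "content": "Explain the Carnot cycle briefly."}],
    temperature=0,                     # greedy/zero-temp, same as the answer above
)
print(resp.choices[0].message.content)
```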
Oh my, this worked much better than I expected. Thanks
Yeah, it’s an Nvidia model trained for STEM, and it’s really good at that for a “3090-sized” model. For reference, this was a zero-temperature answer.
exllamav3 is a game changer. 70Bs sorta fit, but I think 50B is the new sweet spot (or 32B with tons of context).
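Back-of-envelope math on why 50B lands nicely on 24GB (assuming roughly a 3 bpw quant, and ignoring KV cache and activation overhead):

```python
# Rough VRAM estimate for quantized weights only -- bpw is an assumption, not a recommendation.
params = 50e9       # 50B parameters
bpw = 3.0           # bits per weight of the quant you pick
weights_gb = params * bpw / 8 / 1e9
print(f"~{weights_gb:.1f} GB of weights")  # ~18.8 GB, leaving ~5 GB for KV cache/context on a 24 GB card
```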