What is a reasonable setup to run models locally (CPU, GPU, RAM)?
LLMs use a lot of memory. So if you’re doing inference on a GPU, you’re going to want one with enough VRAM, like 16GB or 24GB. I’ve heard lots of people like the NVIDIA 3090 Ti because that graphics card could (and maybe still can) be bought used at a good price for something with 24GB of VRAM. The 4060 Ti comes in a 16GB variant and (I think) is the newest generation. And AFAIK the 4090 is the newest consumer / gaming GPU with 24GB of VRAM. The gaming performance of those cards isn’t really the deciding factor; all the somewhat newer models are fast enough. It’s mostly the amount of VRAM on them that matters for AI. (And pay attention: an NVIDIA card with the same model name can come in variants with different amounts of VRAM.)
I think the 7B / 13B parameter models run fine on a 16GB GPU, at least when quantized. But at around 30B parameters, 16GB isn’t enough anymore. The software will start “offloading” layers to the CPU and it’ll get slow. With a 24GB card you can still load quantized models of that size.
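To make the numbers above a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It only multiplies the parameter count by the bits per weight. The 4.5 bits per weight for a "4-bit" quantization and the 1.2 overhead factor (context, buffers) are my own assumptions for illustration, not measured values, so treat the result as a ballpark.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, overhead: float = 1.2) -> bool:
    """True if the weights times a fudge factor fit into the given VRAM.

    overhead=1.2 is an assumed allowance for context and buffers,
    not a number taken from any particular backend.
    """
    return weights_gb(params_billion, bits_per_weight) * overhead <= vram_gb


if __name__ == "__main__":
    # 16 bits = unquantized fp16, 4.5 bits = assumed typical "4-bit" quant
    for size in (7, 13, 30):
        for bits in (16, 4.5):
            for vram in (16, 24):
                verdict = "fits" if fits(size, bits, vram) else "too big"
                print(f"{size}B @ {bits} bit on {vram}GB: "
                      f"{weights_gb(size, bits):.1f} GB weights, {verdict}")
```

Under these assumptions a quantized 30B model is too big for 16GB but fits into 24GB, which matches the rule of thumb above.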
(NVIDIA’s professional equipment dedicated to AI includes cards with 40GB, 48GB or 80GB. But that’s not sold for gaming and is also really expensive.)
Here is a VRAM calculator:
You can also buy an AMD graphics card in that range. But most of the machine learning stuff is designed around NVIDIA and their CUDA toolkit. So with AMD’s ROCm you’ll have to do some extra work and it’s probably not that smooth to get everything running. And there are fewer tutorials and fewer people around with that setup. But NVIDIA is sometimes a pain on Linux. If that’s a concern, have a look at ROCm and AMD before blindly buying NVIDIA.
With some video cards you can also put more than one into a computer, combine them and thus have more VRAM to run larger models.
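As an illustration of what "combining" cards means in practice: the model's layers get distributed over the cards, so each card only holds its share. The sketch below splits a number of layers proportionally to each card's VRAM. The function name and the proportional rule are hypothetical and only meant to show the idea; real backends have their own settings and logic for this.

```python
def split_layers(n_layers: int, vram_gb: list[float]) -> list[int]:
    """Distribute n_layers over several GPUs, proportional to their VRAM.

    Hypothetical helper for illustration. Uses the largest-remainder
    method so the per-GPU counts always add up to n_layers.
    """
    total = sum(vram_gb)
    if n_layers < 0 or total <= 0:
        raise ValueError("need a non-negative layer count and some VRAM")
    shares = [n_layers * v / total for v in vram_gb]
    counts = [int(s) for s in shares]  # round every share down first
    leftover = n_layers - sum(counts)
    # hand the remaining layers to the GPUs with the largest fractional part
    by_fraction = sorted(range(len(shares)),
                         key=lambda i: shares[i] - counts[i],
                         reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # e.g. a 60-layer model on one 24GB and one 16GB card
    print(split_layers(60, [24, 16]))
```

So two cards don't act like one big card; each one just carries part of the model.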
The CPU doesn’t matter too much in those scenarios, since the computation is done on the graphics card. But if you also want to game on the machine, you should consider getting a proper CPU for that. And you want at least as much system RAM as you have VRAM. So probably 32GB. But RAM is cheap anyway.
The Apple M2 and M3 are also liked by the llama.cpp community for their excellent speed. You could also get a MacBook or iMac. But buy one with enough RAM, 32GB or more.
It all depends on what you want to do with it, what size of models you want to run, how much you’re willing to quantize them. And your budget.
If you’re new to the hobby, I’d recommend trying it first. For example, kobold.cpp and text-generation-webui with the llama.cpp backend (and a few others) can do inference on the CPU (or on the CPU with part of the model on the GPU). You can load a model on your current PC with that and see if you like it. Get a feeling for what kind of models you prefer and what size. It won’t be very fast, but it’ll do. Lots of people try chatbots and don’t really like them. Or they find it too complicated to set up. Or you’re like me and figure out you don’t mind waiting a bit for the response and your current PC is still somewhat fine.
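For the "CPU plus some of it on GPU" case, the setting you usually end up tweaking is how many layers go onto the GPU. Here is a rough sketch of how you could guess a starting value. It assumes all layers are about the same size and reserves 1.5 GB of VRAM for context and the desktop; both are simplifications I made up for the example, so expect to adjust the number by trial and error.

```python
def layers_that_fit(model_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Guess how many layers of a model fit into VRAM.

    Assumes equally sized layers and keeps reserve_gb free for context
    and other VRAM users (an assumed value, not a measured one).
    The layers that don't fit stay on the CPU.
    """
    if model_gb <= 0 or n_layers <= 0:
        raise ValueError("model size and layer count must be positive")
    per_layer_gb = model_gb / n_layers
    usable_gb = vram_gb - reserve_gb
    if usable_gb <= 0:
        return 0  # nothing fits, run everything on the CPU
    return min(n_layers, int(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # e.g. a 20 GB model file with 60 layers on an 8GB card
    n = layers_that_fit(model_gb=20, n_layers=60, vram_gb=8)
    print(f"try offloading about {n} of 60 layers to the GPU")
```

If the result equals the total number of layers, the whole model should fit on the GPU and you get full speed.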
Thanks for the detailed reply! I need to upgrade my hardware this year anyway (it is 6+ years old).