I’ve installed KoboldCpp on a ThinkPad X1 with 32 GB RAM and an i7-1355U, no GPU. Sure, it’s only around 1 token/s, but for chat it’s still usable (about 15 s per reply). The setup was easier than expected.
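If you want to check the tokens/s yourself, here’s a minimal sketch against KoboldCpp’s local HTTP API, assuming the default port (5001) and its KoboldAI-compatible `/api/v1/generate` endpoint:

```python
import time
import requests  # pip install requests

# Assumes KoboldCpp is running locally on its default port (5001)
# with the KoboldAI-compatible HTTP API (enabled by default).
URL = "http://localhost:5001/api/v1/generate"

payload = {
    "prompt": "User: Hello, how are you?\nAssistant:",
    "max_length": 16,  # roughly the reply length discussed here
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=300)
resp.raise_for_status()
elapsed = time.time() - start

text = resp.json()["results"][0]["text"]
# max_length only caps the output, so the reply may be shorter;
# this gives a ballpark tokens/s, not an exact benchmark.
print(f"Reply: {text!r}")
print(f"{elapsed:.1f} s elapsed, ~{payload['max_length'] / elapsed:.2f} tokens/s")
```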
15 seconds per reply at just 1 token/s?! How short are the replies? And what context size actually gets processed? I get about 5 tokens per second on my GPU and still need 1-2 minutes per reply at 4k context.
Context size is the default 4096, and replies are around 16 tokens.
I mean the actual context size to be processed for the message, based on chat history, character cards, world info, etc. And which model?
Nice! KoboldCpp is also my software of choice. It’s easy to install, all-in-one and has a good amount of features.
What model size do you use to get 1 token/s? I’m in the same ballpark, though my old desktop PC is a bit faster than my laptop, probably because it has dual-channel memory and doesn’t throttle.
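For CPU inference the rough ceiling is memory bandwidth divided by the bytes read per token, which is roughly the model file size. A back-of-the-envelope sketch; the bandwidth figures here are my assumptions, not measurements:

```python
# CPU inference is usually memory-bandwidth-bound, so
# tokens/s is roughly bandwidth / bytes-read-per-token (~ model size).
# Bandwidth numbers below are assumed for illustration.
model_size_gb = 23.0  # FP16 12B model, as mentioned in this thread

configs = {
    "laptop, throttled / slow RAM (~25 GB/s)": 25.0,
    "desktop, dual-channel DDR4 (~50 GB/s)": 50.0,
}

for name, bandwidth_gbps in configs.items():
    print(f"{name}: ~{bandwidth_gbps / model_size_gb:.1f} tokens/s ceiling")
```

Which would put the laptop’s ceiling right around the 1 token/s being reported, and the desktop around twice that.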
I think that’s the point where it gets usable, at least for consecutive chat. If I feed in longer text, or KoboldCpp decides to recalculate large portions of the context, it takes several minutes until I get a reply. And that’s less fun for use cases like dialogue.
My first test was with Starcannon-Unleashed-12B-v1.0-f16, a 23 GB model. I did not expect the laptop to be usable at all.
I think doing the calculations at full precision (FP16) is a waste. You should try something between Q4_K_M and Q6_K (or at least Q8_0, which is supposed to be nearly the same quality as FP16). That way it should be considerably faster, at least twice as fast; rough size math below.
(The GGUF page of that model has a list of recommended quantization levels.)
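Here’s the rough size math for a 12B model; the bits-per-weight figures are approximate averages for each llama.cpp quant type, so treat the results as estimates:

```python
# Approximate GGUF file sizes for a 12B model at common quant levels.
# Bits-per-weight values are rough llama.cpp averages (assumed here).
params_billion = 12.0

bpw = {
    "F16":     16.0,
    "Q8_0":     8.5,
    "Q6_K":     6.56,
    "Q4_K_M":   4.85,
}

for quant, bits in bpw.items():
    size_gb = params_billion * bits / 8  # GB, ignoring small overheads
    print(f"{quant:8s} ~{size_gb:5.1f} GB")
```

Since CPU inference is mostly memory-bound, a Q4_K_M file that’s roughly a third the size of FP16 means roughly a third of the bytes read per token, which is where the “at least twice as fast” estimate comes from.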
thanks for the tips!