I’ve installed KoboldCpp on a ThinkPad X1 with 32 GB RAM and an i7-1355U, no GPU. Sure, it’s only around 1 token/s, but for chat that’s still usable (about 15 s per reply). The setup was easier than expected.
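
For anyone who wants to reproduce the timing: a minimal sketch that times one reply through KoboldCpp’s KoboldAI-compatible HTTP API (assuming the default port 5001; the prompt and the ~4-characters-per-token estimate are rough placeholders):

```python
import time
import requests  # pip install requests

payload = {
    "prompt": "User: Hi there!\nAssistant:",
    "max_context_length": 4096,  # the default context size
    "max_length": 32,            # cap the reply length
}

start = time.time()
resp = requests.post("http://localhost:5001/api/v1/generate",
                     json=payload, timeout=600)
elapsed = time.time() - start

text = resp.json()["results"][0]["text"]
approx_tokens = max(1, len(text) // 4)  # crude ~4 chars/token estimate
print(f"{elapsed:.1f} s, ~{approx_tokens / elapsed:.2f} tok/s")
```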

  • @KinkyThoughts
    13 days ago

    15 seconds per reply with just 1 token/s?! How short are they? What’s the context size to be processed? I get like 5 tokens per second on my GPU and need 1–2 minutes per reply at 4k context.

    • @raffaOP
      13 days ago

      Context size is the default 4096; replies are like 16 tokens or so.

      • @KinkyThoughts
        13 days ago

        I mean the actual context size to be processed for the message, based on chat history, character cards, world info, etc. And which model?

  • @magn418M
    3 days ago

    Nice! KoboldCpp is also my software of choice. It’s easy to install, all-in-one and has a good amount of features.

    What model size do you use to arrive at 1 token/s? I’m in the same ballpark, though my old desktop PC is a bit faster than my laptop, probably because it has dual-channel memory and doesn’t throttle.

    I think that’s the point where it gets usable, at least for consecutive chat. If I feed in longer text, or KoboldCpp decides to recalculate large portions of the context, it takes several minutes until I get a reply. And that’s less fun for use-cases like dialogue.
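
    A rough latency model shows why (the prefill and generation rates below are illustrative assumptions, not measurements): a reply costs prompt processing (prefill) plus generation, and the two run at very different speeds on CPU.

    ```python
    # Total reply time = prompt processing (prefill) + token generation.
    def reply_latency(prompt_tokens, reply_tokens, prefill_tps=10.0, gen_tps=1.0):
        """Assumed CPU rates: ~10 tok/s prefill, ~1 tok/s generation."""
        return prompt_tokens / prefill_tps + reply_tokens / gen_tps

    # Follow-up turn with the context cached: only the new tokens count.
    print(reply_latency(30, 16))    # ~19 s
    # Whole 4k context gets recalculated: prefill dominates.
    print(reply_latency(4096, 16))  # ~426 s, i.e. several minutes
    ```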

    • @raffaOP
      23 days ago

      My first test was with Starcannon-Unleashed-12B-v1.0-f16, a 23 GB model. I didn’t expect the laptop to be usable at all.
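
      That figure is about what a memory-bandwidth estimate predicts: on CPU, each generated token streams essentially the whole model through RAM once. Assuming ~25 GB/s of effective dual-channel bandwidth (a guess for this machine):

      ```python
      model_bytes = 23e9  # f16 GGUF file size mentioned above
      ram_bw = 25e9       # assumed effective RAM bandwidth, bytes/s
      print(f"~{ram_bw / model_bytes:.1f} tok/s upper bound")  # ~1.1
      ```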

      • @magn418M
        3 days ago

        I think doing the calculations at full precision (FP16) is a waste. You should try somewhere between Q4_K_M and Q6_K (or at least Q8_0, which is supposed to be the same quality as FP16). That way it should be considerably faster… at least twice as fast.

        (The GGUF page of that model has a list of recommended quantization levels.)
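
        To put rough numbers on that (approximate bits per weight for common GGUF quants; exact file sizes vary per model): CPU generation is roughly memory-bandwidth-bound, so speed scales about inversely with file size.

        ```python
        # Approximate GGUF sizes and bandwidth-bound speedups for a 12B model.
        PARAMS = 12e9
        BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.8}

        for name, bpw in BITS_PER_WEIGHT.items():
            size_gb = PARAMS * bpw / 8 / 1e9
            speedup = 16.0 / bpw  # speed ~ 1 / bytes read per token
            print(f"{name:7s} ~{size_gb:4.1f} GB  ~{speedup:.1f}x vs F16")
        ```

        Q4_K_M comes out around 7 GB and roughly 3x the F16 speed, consistent with the “at least twice as fast” estimate.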

        • @raffaOP
          23 days ago

          thanks for the tips!