How to calculate cost-per-tokens output of local model compared to enterprise model API access

SmokeyDope@lemmy.world · edit-2 9 days ago

slacktoid@lemmy.ml · 8 days ago

Damn! Thank you so much. This is very helpful and a great starting point for me to mess about to make the most of my LLM setup. Appreciate it!!

brucethemoose@lemmy.world · edit-2 5 days ago

Late reply, but if you are looking into this, ik_llama.cpp is explicitly optimized for expert offloading. I can get like 16 t/s with a Hunyuan 70B on a 3090.
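
To tie a throughput number like that back to the cost question in the title, here is a rough sketch of electricity cost per million output tokens. The 350 W power draw and $0.15/kWh price are placeholder assumptions, not measurements from this thread, and hardware purchase cost is ignored entirely.

```python
def local_cost_per_million_tokens(tokens_per_second, power_watts, price_per_kwh):
    """Electricity cost of generating one million tokens locally.

    Ignores hardware amortization, idle power, and prompt processing time.
    """
    seconds = 1_000_000 / tokens_per_second
    kwh = power_watts * seconds / 3_600_000  # watt-seconds -> kWh
    return kwh * price_per_kwh


# Assumed inputs: 16 t/s (from the comment above), 350 W and $0.15/kWh (made up).
cost = local_cost_per_million_tokens(16, 350, 0.15)
print(f"~${cost:.2f} per million output tokens")
```

Compare the result against the per-million-token output price your API provider lists; plug in your own measured wall power and local electricity rate before trusting the number.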

If you want long context for models that fit in VRAM, your last stop is TabbyAPI. I can squeeze in 128K context from a 32B in 24GB VRAM, easy… I could probably do 96K with 2 parallel slots, though unfortunately most models are pretty terrible past 32K.
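
For a feel of why long context is a VRAM squeeze, this is a back-of-the-envelope KV cache estimate. The layer count, KV head count, and head dimension below are illustrative assumptions for a 32B-class model with grouped-query attention, not the specs of any particular model, and the 4-bit cache is likewise an assumption about how the cache is stored.

```python
def kv_cache_gib(context_tokens, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Approximate KV cache size in GiB for a given context length."""
    # Each token stores one key and one value vector per layer per KV head.
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    total_bytes = context_tokens * values_per_token * bits_per_value / 8
    return total_bytes / 2**30


# Hypothetical architecture: 64 layers, 8 KV heads, head dimension 128.
ctx = 128 * 1024
print(kv_cache_gib(ctx, 64, 8, 128, 16))  # 16-bit cache
print(kv_cache_gib(ctx, 64, 8, 128, 4))   # 4-bit quantized cache
```

Under those assumptions the 16-bit cache alone would not fit in 24GB, while a 4-bit cache leaves room for quantized weights, which is roughly the regime being described.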

slacktoid@lemmy.ml · 4 days ago

I need to mess with TabbyAPI. Doesn’t help that there are like 2 Tabbys: one is TabbyAPI and the other is TabbyML. I am guessing tool support is still in its infancy.

brucethemoose@lemmy.world · edit-2 4 days ago

TabbyAPI supports tool usage. It’s all just prompting to the underlying LLM, so you can get some frontend to hit the API and do whatever is needed, but I think it does have some kind of native prompt wrapper too.

It is confusing because there are 2 TabbyAPI formats now: exl2, which is older and more mature (but now unsupported) and optimal around 4-5bpw, and exl3, which is optimal down to ~3bpw (and usable even below) but slower on some GPUs.
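
As a quick way to read those bpw figures, here is a minimal sketch that converts bits per weight into approximate weight size. It assumes every parameter is stored at the same average bpw and ignores any per-format overhead, so treat the output as a ballpark.

```python
def weights_gb(n_params_billion, bits_per_weight):
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    # params (billions) * bits per weight / 8 bits per byte = gigabytes
    return n_params_billion * bits_per_weight / 8


for bpw in (3.0, 4.0, 4.5):
    print(f"32B at {bpw} bpw: ~{weights_gb(32, bpw):.1f} GB")
```

Whatever is left of your VRAM after the weights is what you have for context cache and activations.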

slacktoid@lemmy.ml · 3 days ago

Thank you for all your insight!! This is really helpful