I mean it in the sense that I can upload a low-quality phone photo of a page from a Chinese cookbook and it will OCR it, translate it into English, and give me a summary of the ingredients.
I’ve been looking into vision models, but they seem daunting to set up, and the specs say stuff like 384x384 image resolution, so it doesn’t seem like they would be able to do what I’m looking for. Am I even searching in the right direction right now?
Sounds like what I’m looking for! What do you use for inference?
OK, it turned out to be as simple as downloading the llama.cpp binaries, a GGUF of Gemma 3, and an mmproj file, and running it all like this:
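Roughly this command (the file names are just examples, use whichever quant and matching mmproj you downloaded; mine were the 4B instruct ones):

```
llama-server -m gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-gemma-3-4b-it-f16.gguf --port 5002
```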
(Could be even easier if I’d let it download the weights itself and just used the -hf option instead of -m and --mmproj.)
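For the -hf route, I believe something like this would do it (the repo name is the example from the llama.cpp docs, substitute whatever model you want; as I understand it, it pulls the mmproj from Hugging Face automatically too):

```
llama-server -hf ggml-org/gemma-3-4b-it-GGUF --port 5002
```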
And now I can use it from my browser at localhost:5002; llama.cpp already serves a web interface there that supports image uploads!
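If you’d rather script it than use the browser UI, the same server also exposes an OpenAI-style chat endpoint, and something along these lines should work for sending a photo (I haven’t gone much beyond the web UI myself, and note that base64 -w0 is the GNU flag; on macOS it’s base64 -i photo.jpg):

```
# encode the photo as a data URL and post it to the chat endpoint
curl http://localhost:5002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "OCR this page and translate it into English."},
            {"type": "image_url",
             "image_url": {"url": "data:image/jpeg;base64,'"$(base64 -w0 photo.jpg)"'"}}
          ]
        }]
      }'
```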
Tested high-resolution images, and it seems to either downscale them or cut them into chunks, or both, but the main thing is that 20-megapixel photos work fine, even on my laptop with no GPU; they just take a couple of minutes to process. And while the 4B model is not very smart (especially quantized), it could still read and translate text for me.
Need to test more with other models, but I just wanted to leave this here in case someone stumbles upon this question and wants to do it themselves. It turned out to be much more accessible than I expected.
Check out Open WebUI, 10/10, do recommend.