Faster Ollama alternative

RandomlyRight@sh.itjust.works · 16 hours ago

Faster Ollama alternative

cmgvd3lw@discuss.tchncs.de · 8 hours ago

Try llamafile from Mozilla.

hendrik@palaver.p3x.de · 8 hours ago

I’m also aware of LocalAI, which has automatic model swapping and an OpenAI-compatible API.

But unless I’m mistaken, they all use ggml behind the scenes? So you might want to look for something based on vllm or exllama if you want a completely different backend.

Daughter3546@lemmy.world · 4 hours ago

I would not recommend LocalAI. Their documentation is somewhat lacking, and it’s an all-in-one utility with many moving parts. The parts also tend to break quite often.

CaptnBook@feddit.org · 8 hours ago

Vllm unfortunately doesn’t support switching the model without a restart.

Possibly linux@lemmy.zip · 12 hours ago

I don’t think you are going to find anything faster. Ollama is pretty much as fast as it gets.

RandomlyRight@sh.itjust.works · 9 hours ago

There are many projects out there that improve on the speed significantly. Ollama is unbeaten in convenience, though.

CaptnBook@feddit.org · edit-2 9 hours ago

It’s not, by far. But vllm or SGLang don’t support switching the model… such a shame.

theunknownmuncher@lemmy.world · edit-2 15 hours ago

Ummm… did you try /set parameter num_ctx # and /set parameter num_predict #? Are you using a model that actually supports the context length that you desire…?

RandomlyRight@sh.itjust.works · 9 hours ago

Yeah, but there are many open issues on GitHub related to these settings not working right. I’m using the API, and just couldn’t get it to work. I used a request to generate a JSON file, and it never generated one longer than about 500 lines. With the same model on vllm, it worked instantly and generated about 2000 lines.
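
For anyone following along, here is a minimal sketch of how num_ctx and num_predict can be passed per request through Ollama's /api/generate endpoint, as opposed to the interactive /set commands. The URL assumes a default local install, and the default values and helper names (build_generate_payload, generate) are made up for illustration; this is not a claim about what the request in question looked like.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(model, prompt, num_ctx=8192, num_predict=4096):
    """Build the JSON body for /api/generate (hypothetical helper).

    num_ctx sets the context window size and num_predict caps the number
    of generated tokens; both go inside the "options" object.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream
        "options": {
            "num_ctx": num_ctx,
            "num_predict": num_predict,
        },
    }


def generate(model, prompt, **kwargs):
    """Send the request to a running Ollama server and return the text."""
    payload = build_generate_payload(model, prompt, **kwargs)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

If output still gets cut off with values like these, it is worth checking that the options object really is nested as shown and that the chosen model supports the requested context length.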

theunknownmuncher@lemmy.world · edit-2 4 hours ago

Are you using a tiny model (1.5B-7B parameters)? ollama pulls a 4-bit quant by default. It looks like vllm does not use quantized models by default, so this is likely the difference. Tiny models are impacted more by quantization.

I have no problems with changing num_ctx or num_predict

RandomlyRight@sh.itjust.works · 3 hours ago

It was multiple models, mainly 32-70B

theunknownmuncher@lemmy.world · edit-2 56 minutes ago

Can you try setting the num_ctx and num_predict using a Modelfile with ollama? https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter
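
A rough sketch of what such a Modelfile could contain, generated from Python so the values are easy to change. FROM and PARAMETER are the directives described in the linked docs; the base model name and the numbers below are placeholders, and render_modelfile is a hypothetical helper.

```python
def render_modelfile(base_model, num_ctx, num_predict):
    """Return Modelfile text that pins num_ctx and num_predict.

    base_model is whatever model is already pulled locally; the name
    passed in the example below is a placeholder.
    """
    lines = [
        f"FROM {base_model}",
        f"PARAMETER num_ctx {num_ctx}",
        f"PARAMETER num_predict {num_predict}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
print(render_modelfile("some-base-model", 16384, 8192))
```

Saving the printed text as a file named Modelfile and running ollama create my-long-context -f Modelfile (the model name is arbitrary) should give a model variant with those parameters baked in.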

Arehandoro@lemmy.ml · 11 hours ago

I don’t think it’s OpenAI compatible, but deepseek is faster.

hendrik@palaver.p3x.de · 9 hours ago

Btw, Ollama is software for running AI models. Deepseek is just a company, or a model file, or a service. But that’s not what OP is looking for. They want to run a model, and that needs software like Ollama.

just_another_person@lemmy.world · 15 hours ago

😂