[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗 #15313
Replies: 37 comments 45 replies
-
From this, IMO it only misses a Linux+CUDA bundle to be usable as a download-and-run option. If we want better packaging on Linux, we could also work on a snap or a bash installer for people trying to use the pre-built packages.
-
It's high time for HuggingFace to copy Ollama's packaging and GTM strategy, but this time give credit to llama.cpp. Ideally, we should retain llama.cpp as the core component.
-
Is the barrier the installation process, or the need to use a complex command line to launch llama.cpp?
-
For me the biggest thing is I'd love to see more emphasis placed on … My ideal would be for the … Maybe include systray integration and a simple UI for selecting and downloading models too. At that point …
-
It would be cool if llama-server had an auto-configuration option that adapts to the machine/model, like ollama does.
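Something along these lines can already be approximated with a small wrapper script. A minimal sketch, assuming an NVIDIA GPU and using made-up VRAM thresholds and layer counts purely as placeholders:

```bash
#!/usr/bin/env bash
# Pick --n-gpu-layers from free VRAM before starting llama-server.
# The thresholds and layer counts are illustrative placeholders, not tuned values.
MODEL="$1"
FREE_MB=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits 2>/dev/null | head -n1)
if [ -z "$FREE_MB" ]; then
  NGL=0        # no NVIDIA GPU detected: run on CPU
elif [ "$FREE_MB" -gt 20000 ]; then
  NGL=99       # plenty of VRAM: offload all layers
else
  NGL=24       # partial offload
fi
exec llama-server -m "$MODEL" -ngl "$NGL"
```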
-
For Windows, maybe Chocolatey and the Windows Store would be a good idea? 🤔
-
I created an RPM spec to manage installation, though I think Flatpaks might be more user-friendly and distribution-agnostic.
-
The released Windows builds are available via Scoop. Updates happen automatically. Old installed versions are kept, and the current one is symlinked into a "current" folder, which puts the executables on the PATH.
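For reference, that flow looks roughly like this (assuming the package is published as llama.cpp in your configured Scoop bucket):

```bash
scoop install llama.cpp   # install the latest release and add it to PATH
scoop update llama.cpp    # later: pull the newest release; old versions are kept
scoop reset llama.cpp     # re-point the "current" symlink/shims if needed
```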
-
Is it feasible to have a single release per OS that includes all the backends?
-
For Linux I just install the Vulkan binaries and run the server from there. Maybe we could have an install script, like ollama's, that detects the system and launches the server, which can be controlled from an app as well as the CLI? The user would then get basic command-line utilities like run, start, stop, load, list, etc.
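A minimal sketch of what the detection step of such a script could look like; the commented-out download URL and asset name are hypothetical and would need to be mapped to the real release artifacts:

```bash
#!/usr/bin/env bash
# Sketch: choose a llama.cpp build flavour based on what the system exposes.
if command -v nvidia-smi >/dev/null 2>&1; then
  FLAVOUR=cuda
elif command -v rocminfo >/dev/null 2>&1; then
  FLAVOUR=rocm
elif command -v vulkaninfo >/dev/null 2>&1; then
  FLAVOUR=vulkan
else
  FLAVOUR=cpu
fi
echo "Detected flavour: $FLAVOUR"
# Hypothetical asset name; real GitHub release artifact names differ per tag:
# curl -L -o llama.zip "https://github.com/ggml-org/llama.cpp/releases/latest/download/llama-bin-linux-$FLAVOUR-x64.zip"
```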
-
On Mac, the easiest way (also arguably the safest way) from a user's perspective is to find it in the App Store and install it from there. Because apps from the App Store are sandboxed, installing or uninstalling is simple and clean from a user's point of view. Creating a build and passing the App Store review might take some effort (due to the sandbox constraints), but it should be a one-time thing.
-
It's my understanding that none of the automated installs support GPU acceleration. I might be wrong, but it's definitely the case for Windows, which makes it useless to install via winget.
-
To me the biggest advantage ollama currently has is that the optimal settings for a model are bundled with it. The GGUF spec would allow for this too, since it's versatile enough to carry this as a metadata field inside the model. People could load the settings from a GGUF, and frontends could extract them and adapt them as they see fit. I think that part is going to be more valuable than obtaining the binary, since downloading the binary from GitHub is not that hard.
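As a sketch of the idea only: if a key such as general.recommended_args existed (it is hypothetical, not part of the current GGUF spec), a frontend could already inspect the metadata with the gguf Python package:

```bash
# Requires `pip install gguf`; "recommended_args" is a hypothetical key name
# used purely to illustrate the proposal, it is not defined by the GGUF spec.
gguf-dump model.gguf | grep recommended_args
```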
-
My personal wishlist: …
-
I was about to open a similar issue and ask when CUDA Linux builds, like …, would be available. Another thing is to address the …
-
Using the Docker image, I have tested with different UIs and also Ollama, and as mentioned by other users, …
-
Create a separate installer/launcher per platform that checks the CPU/GPU/iGPU against a database and downloads the right executable. The same executable could be used to update. Advanced settings for configuring server options and model parameters. Have a curated list of quantized models to download and launch for that hardware. Have a "custom" option that prompts the user to save a well-commented batch file with several examples.
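The "well-commented batch file" piece could be as simple as a preset script like this; model paths and parameter values are just illustrative, not curated recommendations:

```bash
#!/usr/bin/env bash
# Example launch presets for llama-server; uncomment the one you want.

# Small model, fully offloaded to the GPU, 8k context:
llama-server -m ./models/small-model.gguf -ngl 99 -c 8192 --port 8080

# Larger model, partial GPU offload:
# llama-server -m ./models/big-model.gguf -ngl 30 -c 4096 --port 8080

# Pull a model straight from Hugging Face instead of using a local file:
# llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```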
-
You might want to look into my Mmojo-Server project: one binary built with Cosmopolitan that runs on ARM/x86 and macOS/Linux/Windows. I have a Mmojo-Server prebuilt on HuggingFace. I use generic CPU inference, which isn't awful for small models: 12B on a good i7/i9, 4B on a Raspberry Pi. While GPU support would be great, it mostly confuses users who are a little challenged by the Terminal. I even sell a Pi-based appliance if you want something plug-and-play. https://Mmojo.net
-
llama.cpp on NixOS is very easy in my experience. Simply adding the following to your config gets you llama.cpp:

```nix
environment.systemPackages = [
  pkgs.llama-cpp
];
```

Depending on your config it may already have CUDA enabled. If not, you can do:

```nix
environment.systemPackages = [
  (pkgs.llama-cpp.override { cudaSupport = true; })
];
```

For ROCm:

```nix
environment.systemPackages = [
  (pkgs.llama-cpp.override { rocmSupport = true; })
];
```

Vulkan:

```nix
environment.systemPackages = [
  (pkgs.llama-cpp.override { vulkanSupport = true; })
];
```

Etc etc. You can also try combining multiple, but your mileage may vary.
-
Note that CUDA builds exist for Windows and Linux, and macOS builds are optimized for Accelerate and Metal on … This distribution can easily be updated by anyone to newer versions and refined for specific targets if needed.
-
I have a good idea for a self-configuring front end; I've just been a little consumed with my new code-companion project to start on it. For Windows users, my Llama.Cpp-Toolbox is probably a good place to get up and running fast. I'm going to be updating it soon with more functionality for power users. The new idea would basically map out the functions after each build and generate a GUI from them, so it should never go out of date. I haven't thought hard about how it would look, but pulling the options and the executables seems easy enough. The scripts and examples could be annoying, but if I can determine how to pinpoint them and their instructions or options, it could work. I'll get around to it when my new Ai-Code-Companion can scan it all, and I'll see what's possible. eye rolls
-
I had to install extra packages to get curl support to work when compiling the CUDA version on Ubuntu 25.04 today; it was pretty obscure: sudo apt-get install curl libssl-dev libcurl4-openssl-dev
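With those packages in place, the build itself is along these lines (a sketch; the exact CMake options are worth double-checking against the current repo docs):

```bash
# Configure with the CUDA backend; curl support lets the tools download models over HTTPS.
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j "$(nproc)"
```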
-
@slaren if you build with GGML_BACKEND_DL and GGML_CPU_ALL_VARIANTS on, and have all backends installed, is there a way at runtime to switch between, say, Vulkan and ROCm, or Vulkan and CUDA?
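For context, such a build is configured roughly like this; whether a given binary then lets you pick the device at runtime (recent versions expose --list-devices / --device, but check llama-server --help for your build) is worth verifying:

```bash
# Build the GPU backends as separately loadable modules, plus all CPU variants.
cmake -B build -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON \
      -DGGML_CUDA=ON -DGGML_VULKAN=ON
cmake --build build --config Release -j "$(nproc)"
```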
-
I just went through this, so it's fresh in my mind. I had not used Linux since about 2012, when I changed our in-house servers to Windows Server. Running local LLMs I have used Text-Generation-WebUI, Ollama, and LM Studio. With the release of GPT-OSS-120b, most of the wrappers did not keep up with the new changes as quickly as I wanted. I run a 5090, and I also wanted to move to the newer torch release. I was doing some small training stuff that needed Triton anyway, so: Linux.

OP lists using winget in Windows, but the precompiled llama.cpp is Vulkan. The binary distributions stop at CUDA 12.4, so if you need the newer copies of torch (> 12.4), it's source code. Having been away from Linux a long time, setting up WSL and creating everything for a build was a bit of a struggle. I actually had GPT-OSS write me a step-by-step walkthrough with copy-and-paste commands so I wouldn't screw it up. Once everything was running, it's great, and a significant speedup. I've moved my dev work to Linux now.

Then I wrote a script (really, me and GPT-OSS wrote a script) to fire up llama.cpp, so I could just pick models from an old-fashioned script menu. It's crude, but it's never out of date. I posted the script-creation script that GPT wrote below because I thought the idea of a self-creating script was nice.
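A minimal sketch of that kind of menu, using bash's built-in select (the model directory is a placeholder):

```bash
#!/usr/bin/env bash
# Crude model picker: list local *.gguf files and start llama-server on the chosen one.
MODEL_DIR="${1:-$HOME/models}"    # placeholder location
select MODEL in "$MODEL_DIR"/*.gguf; do
  [ -n "$MODEL" ] && break
done
exec llama-server -m "$MODEL" -ngl 99 --port 8080
```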
-
IMHO the UI and the packaging could be a dedicated separate repo under the ggml umbrella. llama.cpp should remain true to its manifesto of "inference at the edge" and continue to be a fast and feature-full backend for edge devices. The impedance mismatch of being both a developer and a consumer product is probably not easy to manage.
-
llama.cpp is packaged well in Gentoo, no need to do anything fancy.
-
When will llama.cpp have the ability to search the internet? Without the ability to search for information online, relying solely on existing training data, its answers are always outdated or biased, and it can only be considered a toy with no practical value.
-
I keep my llama.cpp in a Podman container, and I've set up scripts to build it in a container as well. I'm using CUDA. Before I picked up llama.cpp I used Ollama, but found it a bit too excessive and not suited for me. I also saw that most libraries just package up llama.cpp and run my load there, so I ended up using llama.cpp directly through the C API with Python bindings.

I figure it's nice to build llama.cpp, but it's a bit difficult to pick up because there are so many options and things to look at. Non-technical people might have difficulty picking it up just for that. But I personally am here for it.

I find llama.cpp could use improvement in a variety of places. One major one is the chat format interfaces. I think the current way chat templates work is too generic: in practice you have an append-only log, and you pay a high cost if you splice it, while chat templates pretend you can add stuff in the middle and are essentially for-loops over a messages list. I'd suggest completely changing this design: …

In practice that covers all message formats, because they're forced to be constrained in the same way as long as they generate tokens in sequence. Even if that constraint were lifted some day, I think this makes a good format.
-
I think we should focus on what it is now; non-tech users will use Ollama or LM Studio anyway.
-
llama.cpp as a project has made LLMs accessible to countless developers and consumers, including me. The project has also consistently become faster over time, as has its coverage beyond LLMs to VLMs, AudioLMs, and more.
One piece of feedback we keep getting from the community is how difficult it is to use llama.cpp directly. Oftentimes users end up using Ollama or GUIs like LMStudio or Jan (there are many more that I'm missing). However, it'd be great to also offer end consumers a path to use llama.cpp in a more friendly and easy way.
Currently, if someone wanted to use llama.cpp directly:
- brew install llama.cpp works
- …
This adds a barrier for non-technically inclined people, especially since with all of the above methods users would have to reinstall llama.cpp to get upgrades (and llama.cpp makes a release per commit: not a bad thing, but it becomes an issue since you need to upgrade more frequently).
Opening this issue to discuss what could be done to package llama.cpp better and allow users to maybe download an executable and be on their way.
More so, are there people in the community interested in taking this up?