Llama.cpp: A cheap method for inference with LLMs

With llama.cpp, you will be using GGUF files (formerly GGML). I would recommend not offloading to the GPU, but instead finding models small enough that you can tick the "no mmap" box. Keeping the model fully loaded in RAM around the clock is much more reliable in terms of speed.
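Here is a minimal sketch of that setup using the llama-cpp-python bindings (one of several ways to drive llama.cpp). The model path and parameter values are placeholders, so adjust them for your own hardware and GGUF file:

```python
# Minimal sketch with llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder; tune n_ctx / n_threads for your machine.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/my-model.Q4_K_M.gguf",  # placeholder path to your GGUF file
    n_gpu_layers=0,   # keep everything on the CPU (no GPU offload)
    use_mmap=False,   # equivalent to ticking "no mmap": load the whole model into RAM
    use_mlock=True,   # optionally pin the weights so the OS can't swap them out
    n_ctx=4096,       # context window
)

output = llm("Q: What is GGUF? A:", max_tokens=64)
print(output["choices"][0]["text"])
```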

To quantize to GGUF yourself, the llama.cpp GitHub repo is fairly simple to follow.
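As a rough sketch, the usual flow in the repo is two steps: convert the Hugging Face checkpoint to a full-precision GGUF, then quantize it down. The script and binary names below (convert_hf_to_gguf.py, llama-quantize) come from the current repo layout, but the paths and the Q4_K_M target are placeholders:

```python
# Sketch of the two-step convert-then-quantize flow, driven from Python.
# Paths and the quantization type are placeholders -- adjust to taste.
import subprocess

hf_model_dir = "./my-hf-model"           # directory with the original HF weights
f16_gguf = "./my-model-f16.gguf"         # intermediate full-precision GGUF
quant_gguf = "./my-model-Q4_K_M.gguf"    # final quantized GGUF

# Step 1: convert the Hugging Face checkpoint to a GGUF file.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", hf_model_dir, "--outfile", f16_gguf],
    check=True,
)

# Step 2: quantize the GGUF down to the chosen format.
subprocess.run(
    ["./llama-quantize", f16_gguf, quant_gguf, "Q4_K_M"],
    check=True,
)
```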

Or... you can use gguf-my-repo, a Hugging Face Space that will do the quantization for you online: https://huggingface.co/spaces/ggml-org/gguf-my-repo