Llama.cpp: A cheap method for inference with LLMs

With llama.cpp, you will be using GGUF files (formerly GGML). I would recommend not offloading to the GPU, but instead finding models small enough that you can tick the "no mmap" box. Keeping the model fully loaded in RAM around the clock is much more reliable in terms of speed.
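Here is a minimal sketch of that setup using the llama-cpp-python bindings (one of several ways to drive llama.cpp). The model path and parameter values are placeholders, so adjust them for your own hardware and GGUF file:

```python
# Minimal sketch with llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder; tune n_ctx / n_threads for your machine.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/my-model.Q4_K_M.gguf",  # placeholder path to your GGUF file
    n_gpu_layers=0,   # keep everything on the CPU (no GPU offload)
    use_mmap=False,   # equivalent to ticking "no mmap": load the whole model into RAM
    use_mlock=True,   # optionally pin the weights so the OS can't swap them out
    n_ctx=4096,       # context window
)

output = llm("Q: What is GGUF? A:", max_tokens=64)
print(output["choices"][0]["text"])
```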

To quantize to GGUF yourself, the llama.cpp GitHub repo is fairly simple to follow.
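As a rough sketch, the usual flow in the repo is two steps: convert the Hugging Face checkpoint to a full-precision GGUF, then quantize it down. The script and binary names below (convert_hf_to_gguf.py, llama-quantize) come from the current repo layout, but the paths and the Q4_K_M target are placeholders:

```python
# Sketch of the two-step convert-then-quantize flow, driven from Python.
# Paths and the quantization type are placeholders -- adjust to taste.
import subprocess

hf_model_dir = "./my-hf-model"           # directory with the original HF weights
f16_gguf = "./my-model-f16.gguf"         # intermediate full-precision GGUF
quant_gguf = "./my-model-Q4_K_M.gguf"    # final quantized GGUF

# Step 1: convert the Hugging Face checkpoint to a GGUF file.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", hf_model_dir, "--outfile", f16_gguf],
    check=True,
)

# Step 2: quantize the GGUF down to the chosen format.
subprocess.run(
    ["./llama-quantize", f16_gguf, quant_gguf, "Q4_K_M"],
    check=True,
)
```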

Or... you can use gguf-my-repo, a Hugging Face Space that will do the quantization for you online: https://huggingface.co/spaces/ggml-org/gguf-my-repo