Local LLMs · TUI · CLI · llama.cpp · Rust
Fast local LLMs with zero overhead.
Launch llama.cpp workloads from a beautiful terminal UI, without the setup drag.
curl -fsSL https://llamastash.dev/install.sh | sh brew install llamastash/llamastash/llamastash cargo install llamastash --locked llamastash init
Features
The whole local-model loop, not just a launcher.
Install, discover, launch, smoke-test, script, and proxy local models through one tool. It stays transparent about llama.cpp while still smoothing off the annoying parts.
-
✓
Zero-to-chat init wizard
Run
llamastash initonce and it handles the annoying first-run work: detect hardware, install the right llama-server build, download a starter GGUF, write config, and smoke-launch it. -
✓
Scans what you already have
Walks HuggingFace, Ollama, and LM Studio caches plus user paths. Reads GGUF metadata, dedupes symlinks and split files, and watches the catalog for new models without a restart.
-
✓
One binary, three roles
The TUI, CLI, and daemon are the same binary. The daemon auto-spawns when needed, running models survive UI exit, and the same launch primitives show up in both human and agent workflows.
-
✓
Hardware-aware launches
Built-in arch defaults, per-model ports, health-probed lifecycle, and intelligent context auto-fit mean fewer bad launches and fewer manual llama.cpp flags for every machine you touch.
-
✓
OpenAI + Ollama proxy
A built-in localhost proxy on
127.0.0.1:11435/v1routes by model name, auto-starts the model you ask for, and can impersonate Ollama on11434when you need drop-in compatibility. -
✓
Agent-ready CLI
Stable
--jsonoutput, documented exit codes,pullandrecommendsubcommands, plus an installable AgentSkills bundle for Claude Code, OpenClaw, OpenCode, and other harnesses.
Why local
The cloud is great until it isn't.
Local models are catching up faster than most people realize. A modern laptop with 16+ GB of unified memory or a mid-range desktop GPU runs models that genuinely earn their keep — for code, writing, brainstorming, and tool-using agents.
LlamaStash exists to make that path low-friction: one binary, an init wizard that gets you to first response fast, and the same runtime surface in the TUI, CLI, and proxy.
-
Your data, your machine
Prompts, context windows, completions — all stay in RAM on the box you're sitting at. There is no third party in the chain who could log them, leak them, or change their mind about retention policy tomorrow.
-
No surprise bills
Local inference is free at the margin. The 10,000-token brainstorm at 2 a.m. costs the same as 10,000 tokens at noon: nothing. Sunsetting a model is your decision, not a vendor's.
-
Offline by default
Once a model is on disk, it works on a plane, in a SCIF, in a coffee shop with sketchy WiFi. The CLI surfaces it the same way every time — no cold-start latency from a remote API.
-
Works with the tools you already use
LlamaStash exposes a local OpenAI-compatible proxy on loopback. OpenCode, Pi, Cline, the OpenAI SDKs, or your own scripts can all talk to one stable URL while LlamaStash handles model routing and auto-start.
FAQ
Common questions
-
What does LlamaStash actually do?
It's a terminal-native TUI and CLI for launching local models through llama.cpp. It scans the GGUF files you already have, helps you pick the right one for your hardware, starts and supervises llama-server, and exposes a local OpenAI-compatible proxy for tools and agents. -
Does it send any data to a server?
Inference traffic stays on your machine. The main runtime surface is local-only: Unix-socket IPC for the daemon and a loopback-only proxy for OpenAI-compatible clients. `init`, `pull`, and parts of `doctor` do use the network when they need to download or verify artifacts, but LlamaStash itself has no telemetry or analytics pipeline. -
What platforms does it support?
macOS (Apple Silicon + Intel) and Linux (x86_64 + aarch64) are first-class. Windows is still on the roadmap; it is not part of the first release contract. -
Do I need llama.cpp installed already?
Optional. If you already have `llama-server` on your PATH, LlamaStash will use it. If not, the init wizard offers to install a recommended build via brew or by downloading a prebuilt binary from llama.cpp's releases. -
How is this different from Ollama, LM Studio, or jan?
Ollama is opinionated around its own model packaging and daemon workflow. LM Studio and jan are heavier GUIs. LlamaStash is the transparent middle ground: one binary, daemon on demand, works directly against the GGUF files you already have, and gives you the same primitives in the TUI, CLI, and local proxy. -
Is the install script safe to pipe into sh?
The script is served from this site as a content-verified mirror of the asset published with each GitHub Release, with a SHA-256 sidecar verified at deploy time. If you want the most paranoid path, download it, read it, and run it yourself. Or skip the script entirely and use `cargo install llamastash --locked` or `brew install llamastash/llamastash/llamastash`. -
Can I use LlamaStash with non-GGUF models?
Not in 0.0.1 — llama.cpp is the runtime, and llama.cpp consumes GGUF. Other runtimes (vLLM, mlx-lm) are on the roadmap as opt-in backends once the GGUF + llama.cpp path is solid. -
Can I point agents or editors at it?
Yes. LlamaStash ships a local OpenAI-compatible proxy at `http://127.0.0.1:11435/v1` by default, with optional Ollama-compat mode on `11434`. It also ships an AgentSkills bundle under `skills/llamastash/` and a Claude Code plugin manifest for install flows that want a repo-packaged skill. -
Is it open source?
Yes — MIT licensed. Source at github.com/llamastash/llamastash. Issues and PRs welcome.