ra-yavuz / hydra-llm

hydra-llm v0.2.1 - RAG
Docker for language models, with retrieval baked in.
One CLI to download, run, chat with, and search local LLMs. Each model runs in its own Docker container, so docker ps shows what is actually running. Hardware-aware curated GGUF catalog with anonymous Hugging Face downloads. OpenAI-compatible endpoints on stable local ports. Retrieval-augmented generation is a first-class feature: index any folder, query it, or add --rag <path> to any chat. Bundle a model, a persona, and a corpus into a single alias via hydra-llm create, then chat with it by name. Optional KDE Plasma 6 panel widget. No cloud, no API key, no telemetry.
Open source under the MIT License. Provided as is, no warranty: read the full disclaimer at the bottom of the page before installing.
Why use this instead of plain Ollama or llama.cpp?
Three answers, in increasing order of "you didn't know you wanted that":
- Transparent Docker over llama.cpp. Each model runs in its own container with a stable name and a reserved port. `docker ps` shows you exactly what's running. No wrapper daemon you can't see into. hydra-llm is a thin layer over real, inspectable infrastructure.
- Hardware-aware curated catalog with anonymous downloads. `hydra-llm list-online` filters community GGUFs to what your machine can actually run, scored against the tier from `hydra-llm doctor`. No Hugging Face account or token required; `HF_TOKEN` is honored but never demanded.
- RAG built in. Index any folder with one command. Query it with another. Or add `--rag <path>` to `hydra-llm chat` and your model retrieves relevant chunks at every turn. Bundle a model + a persona + a corpus into one declarative alias and just say `hydra-llm chat my-bot`. Nobody else does that.
If you've used Ollama, this will feel familiar. The difference: hydra-llm doesn't ask you to write modelfiles, runs in plain Docker (no opaque background daemon), exposes the same OpenAI endpoint shape, and ships a real KDE panel widget. And it has retrieval. Ollama doesn't.
Why Docker
Every model server (chat or embedder) runs in its own Docker container. The reasons:
- No host pollution. The llama.cpp engine, its CUDA/Vulkan runtime, and every model live inside a container. Nothing gets installed on your host beyond the `hydra-llm` CLI itself and a Python dependency or two.
- Trivial removal. `sudo apt remove hydra-llm` uninstalls the CLI. `hydra-llm wipe` additionally tears down the engine image, every cached model, every embedder, and every chat session. Your host is back to where it started.
- Reproducibility. The same image runs the same way on a Strix iGPU laptop, a workstation with a discrete GPU, a NUC, or an Apple Silicon Mac via Docker Desktop. Two engine variants ship: Vulkan for GPU-equipped boxes and a plain CPU build for everything else; the right one is auto-selected per machine.
- Inspectable. `docker ps` tells you exactly which models are running and on which port. `docker logs hydra-<alias>` gives you the raw llama-server output. No background daemon hiding state behind an opaque API.
- Per-model isolation. Two models running side by side cannot interfere with each other's address space or file handles; they are separate processes in separate containers. Killing one never destabilises another.
Engine updates and pinning
Each model server runs in a Docker container hydra builds locally from a Dockerfile shipped with the deb. The Dockerfile pins a specific llama.cpp commit, so the engine you get is reproducible: every user on the same hydra-llm version runs the same llama.cpp.
When a hydra-llm release bumps the pinned commit (which we do whenever there is a meaningful llama.cpp fix or a model family that needs newer parser support), the deb's postinstall script notices the mismatch between the Dockerfile's ref and the label baked into your existing engine image, and rebuilds the image automatically. This works on upgrades and downgrades. apt upgrade hydra-llm keeps your engine current with no extra step. The rebuild log lands at /var/log/hydra-llm-engine-rebuild.log if you ever want to see what happened.
If a rebuild fails (transient network, Docker daemon down, an upstream llama.cpp build break), the postinstall logs a warning and continues; your previous engine still works. You can retry any time with:
hydra-llm setup --rebuild
That is also the way to force a rebuild manually for any other reason (corrupted layers, debugging a build).
Embedder auto-recover
When an embedder sidecar fails to start because the GGUF on disk looks corrupt (a partial download, a re-uploaded file with a different SHA, etc.), hydra detects the parse failure in the container logs, deletes the local file, re-downloads from the catalog URL, and retries once. Transparent to the user; no --force flag needed. If the second attempt also fails, the original error bubbles up.
Autokill: idle models stop themselves
Loaded chat models hold VRAM and the kernel page cache for the GGUF mmap. Forgetting to hydra-llm stop a model after you are done is the easy mistake. Hydra ships a small user-level systemd timer (hydra-llm-reaper.timer) that wakes once a minute, asks Docker which hydra containers are running, and stops the ones that have been idle longer than chat_idle_ttl_seconds (default: 600s, ten minutes). The timer is enabled automatically the first time hydra-llm setup finishes; manage it explicitly with hydra-llm reaper {status,enable,disable}. Disable autokill entirely by setting chat_idle_ttl_seconds: 0 in ~/.config/hydra-llm/config.yaml.
"Idle" combines two signals so the autokill works for hydra-llm chat users and external API clients alike: every chat turn refreshes a per-alias touch file, and each reap cycle also samples docker stats and treats any container above reap_cpu_busy_percent (default 1%) as in use. Aider, curl, the Plasma widget, lillycoder, anything sending real requests at the model's port keeps it alive purely by using it. The timer itself is dormant between ticks; one cycle is a docker ps, a docker stats --no-stream, and a few stat() calls. Cost is dominated by docker and stays a small fraction of one core for a couple hundred milliseconds per minute. Run hydra-llm reap to trigger one cycle by hand.
The same TTL mechanism has long applied to embedder sidecars (embedder_idle_ttl_seconds, default 60s); chat-model autokill brings the two surfaces under one timer.
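For the curious, one reap cycle amounts to roughly the following done by hand (a sketch of the behaviour described above, not the implementation; where exactly the per-alias touch files live is an assumption):

docker ps --filter name=hydra- --format '{{.Names}}'         # which hydra containers are up
docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}'    # who is actually busy right now
# a container is stopped only if its CPU% is below reap_cpu_busy_percent AND its
# per-alias touch file (assumed to sit somewhere under ~/.local/state/hydra-llm/)
# is older than chat_idle_ttl_seconds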
Where it runs
hydra-llm targets Linux first (Debian, Ubuntu, Fedora, Arch, etc.) but the architecture is portable wherever Docker runs:
- Linux: primary target. `.deb` for Debian/Ubuntu derivatives; the bash one-liner installer works on every distro with bash + Docker.
- WSL2 on Windows: works. Run hydra-llm inside an Ubuntu WSL distro with Docker Desktop integration, or with the native `docker.io` package inside the WSL distro itself. The Vulkan engine uses WSLg's GPU passthrough on supported hardware; otherwise the CPU engine is used. The KDE Plasma widget needs Plasma 6, so it does not apply on Windows desktops; the CLI works fully.
- macOS: the CLI runs under Docker Desktop. The Vulkan engine variant does not apply on Apple Silicon (use the CPU engine, which still benefits from Apple's accelerators inside Docker's HVF layer); model speeds are competitive on M-series Pro/Max chips.
- Headless servers: the CLI is the only surface needed. The Plasma widget and other GUI hooks degrade gracefully; nothing blocks on a missing display.
Install
sudo bash -c 'set -e; install -m 0755 -d /etc/apt/keyrings && curl -fsSL https://ra-yavuz.github.io/apt/pubkey.gpg -o /etc/apt/keyrings/ra-yavuz.gpg && echo "deb [arch=amd64,arm64 signed-by=/etc/apt/keyrings/ra-yavuz.gpg] https://ra-yavuz.github.io/apt stable main" > /etc/apt/sources.list.d/ra-yavuz.list && apt update && apt install -y hydra-llm'
On KDE, also install the panel widget: sudo apt update && sudo apt install hydra-llm-plasma. The sudo apt update step is required even if you already have the repo: without it apt will not see new packages or new versions.
Step by step (manual repo setup)
# 1. Trust the signing key
sudo install -d -m 0755 /etc/apt/keyrings
curl -fsSL https://ra-yavuz.github.io/apt/pubkey.gpg \
| sudo tee /etc/apt/keyrings/ra-yavuz.gpg >/dev/null
# 2. Add the apt source
echo "deb [arch=amd64,arm64 signed-by=/etc/apt/keyrings/ra-yavuz.gpg] https://ra-yavuz.github.io/apt stable main" \
| sudo tee /etc/apt/sources.list.d/ra-yavuz.list
# 3. Refresh the package index, then install
sudo apt update
sudo apt install hydra-llm hydra-llm-plasma # plasma widget is optional
hydra-llm runs every model in a Docker container. If you do not have Docker yet: sudo apt install docker.io && sudo usermod -aG docker "$USER", then log out and back in.
Alternative: the bash one-liner installer (user mode, lives under ~/.local). It installs the CLI to ~/.local/bin, builds the engine image, downloads a starter model, and runs a smoke test:
curl -fsSL https://raw.githubusercontent.com/ra-yavuz/hydra-llm/main/get.sh | bash
Do not run with sudo. The installer asks for sudo only for missing system packages.
Quickstart, end to end
Zero to chatting with retrieval over your own folder. Every command below is real, in the order you run them. Deep-dive sections later on this page explain each piece.
1. First-run engine setup
Build the llama.cpp Docker image and pull a starter model so you can confirm the install works:
hydra-llm doctor # confirm hardware detection (tier, RAM, GPU)
hydra-llm setup # build engine image (5-10 min) + starter model + smoke test
hydra-llm setup is the only step that takes real time. It builds the Vulkan and CPU engine images locally; the deb itself does not ship them. On a typical laptop this is a one-time 5-10 minute investment.
2. Pick and download a chat model
hydra-llm list-online # browse the catalog, filtered to what fits
hydra-llm download gemma-2-2b # or any id from list-online
Pick something that actually fits your hardware; hydra-llm doctor already told you your tier. Want to register your own GGUFs from somewhere else (Ollama, LM Studio, a manual download)?
hydra-llm addlocal /path/to/your/gguf/folder/ --link --tier laptop
Recursively registers every .gguf under the folder.
3. Chat with the model
Just type hydra-llm chat. As of v0.2.3, no alias is needed:
hydra-llm chat
# - exactly one model running -> attaches to it
# - several running -> lists with port numbers, prompts
# - none running, one downloaded -> auto-starts that one
# - none running, several installed -> lists, prompts
# - nothing installed -> tells you to download one
Or be explicit:
hydra-llm chat gemma-2-2b
hydra-llm chat 4 # numeric # from `hydra-llm list`
Inside the REPL, slash commands matter: /help, /quit, /reset, /params, /set temperature 0.5.
Sessions persist by default. Each (folder, model) pair gets its own session at ~/.local/state/hydra-llm/sessions/<hash>-<alias>.json, so chats in different folders never share history. Override with --session foo for a named cross-folder session, or ./chat.json to pin to a project.
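Both override forms, as concrete commands:

hydra-llm chat gemma-2-2b --session research    # named session, shared across folders
hydra-llm chat gemma-2-2b ./chat.json           # pin the session file to this project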
4. Set up RAG (one-time)
Now the fun part: index a folder so the model can answer questions about your code or notes.
hydra-llm rag setup
You will see a numbered menu like:
Recommended embedder for your tier: qwen3-embed-4b (2.5 GB) (NOT installed)
Options:
1. Download recommended (qwen3-embed-4b, 2.5 GB)
2. Use already-installed nomic-embed-text (prose)
3. Pick a different embedder from the catalog
4. Cancel; do nothing
Pick 1 for the recommendation, or 2 if you already have a smaller embedder you would rather use (especially if your folder is mostly prose: a novel, notes, docs). The "What is an embedder" section below explains the choice in detail.
5. Index a folder
cd ~/path/to/your/project
hydra-llm index .
Walks the folder, classifies each file as code or prose, chunks it, embeds every chunk, and stores everything in <folder>/.hydra-index/ (a LanceDB database, ~20-200 MB depending on folder size). Re-running is idempotent: it diffs (mtime, size) against the previous index and only re-embeds changed files. ~1 second for a no-op refresh.
Useful flags: --exclude '*.test.js', --include 'fixtures/important.json', --depth 2, --max-file-size-mb 0.5, --full, --dry-run, --tag work.
6. Sanity-check retrieval (no model needed)
hydra-llm query "where do we handle auth tokens"
Top-K chunks come back with file path and line range. No chat model involved. If results look off, your index is the problem (try a different embedder, or index . --full); if they look right, you are ready to chat.
7. Chat with retrieval
hydra-llm chat --rag . # use the index in cwd
hydra-llm chat gemma-2-2b --rag . --rag-show-chunks # also echo retrieved locations
Per turn, hydra embeds your message, fetches the top-K chunks from the index, and prepends them to the prompt as a <context>...</context> block before sending to the model. --rag-show-chunks prints which file paths were retrieved so you can see what the model is being given.
Slash commands inside the REPL: /rag on|off, /rag-show on|off, /rag-chunks on|off, /rag <text> for one-off retrieval without a model call.
8. Bundle a model + persona + corpus into one alias
If you find yourself doing the same chat <model> --rag <path> repeatedly:
hydra-llm create gemma-2-2b ~/personas/code-helper.md cool-app-bot \
--rag-index ~/projects/cool-app
hydra-llm chat cool-app-bot # no flags. Retrieval just works.
The bundle is a single declarative entry in ~/.config/hydra-llm/catalog.yaml. Move that file across machines and the bundle moves with it.
9. Federate: query across every indexed folder
hydra-llm rag stores # list every folder you have indexed
hydra-llm query "..." --all # federated search across all of them
hydra-llm query "..." --tag work # only stores tagged with 'work'
hydra-llm chat <model> --rag-all # chat with retrieval across all stores
That is the whole arc, end to end. Everything below is depth on each piece.
What you actually get
Stable OpenAI-compatible endpoints
Every running model exposes POST /v1/chat/completions on its own local port. Point Aider, Continue.dev, Open Interpreter, lillycoder, or your own scripts at http://localhost:18080/v1; rotate which model is behind that port with stop A && start B. No client config changes, no API keys.
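For example, with a model on port 18080 the request shape is the standard OpenAI one (the model field value here is illustrative):

curl -s http://localhost:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "gemma-2-2b", "messages": [{"role": "user", "content": "Summarise this repo in one sentence."}]}'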
Container lifecycle without docker-fu
start, stop, stop-all, status, api. Two engine images (Vulkan + CPU) build on first setup and auto-select per hardware. CPU fallback if Vulkan misbehaves. Each model gets a stable container name and a reserved port.
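A typical round trip, using names from this page (the exact container name is illustrative, following the hydra-<alias> pattern mentioned earlier):

hydra-llm start gemma-2-2b       # container comes up on its reserved port
hydra-llm status                 # ready state, port
docker logs hydra-gemma-2-2b     # raw llama-server output
hydra-llm api gemma-2-2b         # prints the endpoint URL for clients
hydra-llm stop gemma-2-2b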
Hardware-aware curated catalog
list-online filters community GGUFs (Bartowski, lmstudio-community, mradermacher) to what your machine can run. Anonymous downloads, no HF account needed. Tiers cover 4 GB Pi-class up to 70B-on-iGPU Strix-class boxes.
RAG over any folder (new)
Walk + classify code-vs-prose + line-aware chunk + embed each chunk with the right embedder + store in <folder>/.hydra-index/ (LanceDB). Re-runs are incremental: only changed files re-embed. Federated query across every folder you've indexed. RRF fusion across code and prose.
Catalog-bound bundles (new)
A chat-catalog entry can carry system_prompt, params, and a rag_index: path. create <model> <persona.md> <id> --rag-index <path> bakes all three into one alias. Then chat <alias> runs everything, no flags. Move the model+persona+corpus across machines as one declarative unit.
KDE Plasma 6 panel widget
Per-row Start/Stop, Console launcher, inline log pane, prompt/params editor. HAL-eye tray indicator that breathes faster as system load rises and turns solid red once a model is healthy. Reads the same config as the CLI; addlocal entries appear automatically.
Personas, prompts, params
Three layers, narrowest wins. Persona files in personas/, per-alias prompts in prompts/, per-alias sampling params in params/. Inline catalog values override files. create <model> <persona.md> <id> bakes a persona's body into a new alias as inline system_prompt.
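A persona file is a Markdown file; as the create section later on this page describes, optional YAML front-matter carries sampling params and the body becomes the system prompt. A minimal sketch (the front-matter keys shown are illustrative assumptions, not a documented schema):

~/personas/code-helper.md
---
temperature: 0.4
top_p: 0.9
---
You are a concise senior engineer. Prefer minimal, correct patches over long explanations.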
Persistent chat sessions
Sessions saved as JSON in ~/.local/state/hydra-llm/sessions/. Resume by name with --session <name>, or pin the session file to any path: chat gemma-2-2b ./project-notes.json.
RAG, in depth
RAG (retrieval-augmented generation) means: when you ask a question, hydra retrieves relevant chunks from a corpus and prepends them to the prompt, so the model answers based on text it didn't have to memorise.
What is an embedder
An embedder is a small specialised model that does one thing: take a piece of text in, output a fixed-size list of numbers (a vector) out. Different texts that mean similar things produce vectors that are mathematically close. That is what makes "find the chunk most similar to this question" possible: the question becomes a vector, every chunk in your folder is already a vector, and finding nearest neighbours is just arithmetic.
Embedders are not chat models. They do not generate text. They run in their own llama-server --embeddings containers (separate port range from chat models, default 19080-19099) and live in a separate catalog at ~/.config/hydra-llm/embedders.yaml. Six curated embedders ship: nomic-embed-text (lightweight, prose-leaning), qwen3-embed-{0.6b,4b,8b} (instruction-aware, strong on code), bge-m3 (multilingual), nomic-embed-code (code-tuned).
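To see what an embedder actually returns, you can hit a running sidecar directly. A sketch, assuming llama-server's OpenAI-compatible embeddings route and the first default sidecar port:

curl -s http://localhost:19080/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"input": "where do we handle auth tokens"}'
# -> {"data": [{"embedding": [0.0123, -0.0456, ...]}], ...}   a vector, not text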
Single-embedder vs dual-index mode
By default, hydra runs single-embedder mode: one embedder serves all your chunks (both code and prose). One GGUF on disk, one container at query time, one LanceDB index per folder. Cheap, simple, fine for most personal-scale projects.
There is also an opt-in dual-index mode that runs two embedders (one tuned for code, one for prose), maintains a separate index per kind, and fuses query results via Reciprocal Rank Fusion at retrieval time. The theory: code questions surface from the code table, prose questions surface from the prose table, fusion gives you the best of both.
Most users should stay on single. The honest numbers: dual-index buys measurable retrieval quality on very large mixed corpora (think a monorepo with thousands of source files plus hundreds of design docs), but for personal-scale folders the marginal quality difference is small. The cost is two embedder downloads (often 4+ GB combined), two embedder containers running concurrently, two indexes per folder, and the fusion logic at query time.
Turn dual mode on later with:
hydra-llm rag setup --dual # one-off
# or persist in ~/.config/hydra-llm/config.yaml:
# rag:
# dual_index: true
Switching modes requires a one-time re-index of any existing folders.
1. First-run for RAG
hydra-llm rag setup detects your hardware tier and presents a numbered menu. The menu surfaces already-installed embedders as alternatives so you don't get pushed into a 2.5 GB download you don't need:
Recommended embedder for your tier: qwen3-embed-4b (2.5 GB) (NOT installed)
Default code embedder for halo+ tier. Instruction-aware ...
Options:
1. Download recommended (qwen3-embed-4b, 2.5 GB)
2. Use already-installed nomic-embed-text (prose)
3. Pick a different embedder from the catalog
4. Cancel; do nothing
Number [1-4, default 1]:
Or browse and pick manually:
hydra-llm rag list-online # catalog, filtered to hardware
hydra-llm rag list # what you have installed
hydra-llm rag download <id> # pull one
hydra-llm rag info <id> # dimensions, pooling, prefix, running state
2. Indexing classifies code vs prose
The walker uses .gitignore (via python-pathspec) plus a builtin blacklist (node_modules, .venv, target, build, lockfiles, binaries, archives, media, weights, files >1 MB). Each file is classified code or prose by extension first (.py, .sh, .md, .rst, ...), then by canonical basenames (Makefile, Dockerfile, README, LICENSE), then by shebang sniff. The chunker is line-aware (1500 chars target, 200 overlap, never splits mid-line).
What happens with the classification depends on the mode: in single-embedder mode every chunk goes to one shared LanceDB table with its kind recorded as a column (so --code-only / --prose-only at query time still filter correctly); in dual-index mode code chunks go to the code table embedded with the code embedder, prose chunks go to the prose table embedded with the prose embedder.
cd ~/projects/cool-app
hydra-llm index . # full index on first run
hydra-llm index . # incremental: diffs by (mtime, size)
hydra-llm index . --tag work # tag this store (--tag is repeatable)
hydra-llm index . --exclude '*.test.js' --include 'fixtures/important.json'
hydra-llm index . --depth 2 --max-file-size-mb 0.5
hydra-llm index . --full # force a from-scratch rebuild
hydra-llm index . --dry-run # print plan, don't embed
3. Storage is per-folder LanceDB
Each indexed folder grows a .hydra-index/ containing its LanceDB table(s) (one shared table in the default single-embedder mode; code.lance and prose.lance in dual-index mode), a meta.yaml recording which embedder(s) the index was built with, and a files.json that drives incremental refresh. The .hydra-index/ moves with the folder: copy a project to another machine, and the index comes along.
4. Retrieval uses Reciprocal Rank Fusion
At query time in dual-index mode, the question is embedded with both embedders, top-K hits come back from each table, and the lists are fused by RRF (k=60). This is the 2026 best practice for code+prose corpora; it avoids the failure mode where a code embedder mangles README prose, while still surfacing the right code blocks first. (In single-embedder mode there is nothing to fuse: the shared table is searched directly, and --code-only / --prose-only filter on the recorded kind.)
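Concretely, RRF scores each chunk as the sum of 1/(k + rank) over every result list it appears in. With k = 60, a chunk ranked 1st by the code embedder and 4th by the prose embedder scores 1/61 + 1/64 ≈ 0.032, while a chunk ranked 2nd in only one list scores 1/62 ≈ 0.016: chunks that do well in both lists float to the top, without the two embedders' raw similarity scores ever having to be comparable.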
hydra-llm query "where do we handle auth tokens?" --in .
hydra-llm query "..." --top-k 10 --code-only
hydra-llm query "..." --tag work # federated across all tagged stores
hydra-llm query "..." --all # federated across every store
Default scope: if cwd has a .hydra-index/, query just that store. Otherwise, query every registered store. Override with --in, --stores, or --tag.
5. Chat-side RAG augments per turn
With --rag <path>, every user message is embedded, top-3 chunks are retrieved, and the user message becomes
<context>
--- src/auth/middleware.go:42-78 ---
func authenticate(r *http.Request) ...
--- README.md:54-72 ---
Auth flow: ...
</context>
(your original question)
Saved sessions keep the original (un-augmented) text so resumes don't carry stale context. New REPL slash commands: /rag on|off, /rag-show on|off, /rag-chunks on|off, /rag <text> for one-off retrieval without a model call.
hydra-llm chat llama-3.1-8b --rag . # single store
hydra-llm chat llama-3.1-8b --rag-all # every store
hydra-llm chat llama-3.1-8b --rag-tag work --rag-top-k 5 # by tag
hydra-llm chat my-bundle # catalog-bound, no flags
Catalog-bound bundles (the headline feature)
A chat-catalog entry can carry a rag_index: field. create bakes that field plus a persona's body plus its front-matter params into a new alias.
hydra-llm create llama-3.1-8b ~/personas/senior-engineer.md cool-app-bot \
--rag-index ~/projects/cool-app
# Persisted to ~/.config/hydra-llm/catalog.yaml as a single declarative entry:
# id: cool-app-bot
# filename: Llama-3.1-8B-Instruct-Q4_K_M.gguf (shared with the base, no extra download)
# system_prompt: "You are a senior engineer ..."
# rag_index: /home/yavuz/projects/cool-app
# tags: [persona-baked, rag-bound]
#
# Now:
hydra-llm chat cool-app-bot
This is what makes hydra-llm distinctive. Other local-RAG CLIs let you index a folder and chat with retrieval. Nobody else treats model + persona + corpus as a single declarative unit you can refer to by name. Move the catalog file across machines, and the bundle moves with it.
Hardware tiers
| Tier | Spec example | Recommended chat models | Recommended embedders |
|---|---|---|---|
| tiny | 4-8 GB RAM, no dGPU | SmolLM2, Phi-3.5-mini, Gemma-2-2B | nomic-embed-text |
| laptop | 16-32 GB RAM, integrated GPU | Llama-3.1-8B Q4, Mistral-7B Q4 | nomic-embed-text + qwen3-embed-0.6b |
| halo | 48+ GB unified RAM, big iGPU (Strix Point/Halo, Apple Silicon Pro/Max) | Gemma-2-27B Q4, Qwen-2.5-32B Q4 | nomic-embed-text + qwen3-embed-4b |
| workstation | 24+ GB dGPU | Llama-3.3-70B Q4 | + qwen3-embed-8b for SOTA code retrieval |
| server | multi-GPU or 64+ GB RAM | Llama-3.3-70B Q5, MoEs | any of the above |
hydra-llm doctor classifies your machine. hydra-llm rag setup picks the recommended embedder pair for that tier and downloads after a yes/no.
Bring your own models
Already have a folder of GGUFs from Ollama, LM Studio, or a manual download? One command:
# Single file:
hydra-llm addlocal /path/to/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--tier laptop --vram-gb 6 --ram-gb 12
# Whole folder, recursively. Bulk-registers every .gguf:
hydra-llm addlocal /path/to/your/gguf/folder/ --link --tier laptop
Folder mode auto-derives the id, name, and port from each file. Group flags (--tier, --family, --license, --gpu-layers, --ram-gb, --vram-gb) apply to every entry. --link creates symlinks under the configured models_dir if the source is elsewhere.
Where things live by default:
- Chat models: `~/.local/share/hydra-llm/models/` (override with `models_dir`)
- Embedders: `~/.local/share/hydra-llm/embedders/`
- Configs, personas, prompts, params, catalog overrides: `~/.config/hydra-llm/`
- Chat sessions: `~/.local/state/hydra-llm/sessions/`
- RAG store registry: `~/.local/state/hydra-llm/rag-stores.json`
- Per-folder index: `<folder>/.hydra-index/`
- Cache: `~/.cache/hydra-llm/`
Pairs with lillycoder
lillycoder is the sibling project: a local-first coder REPL with file and shell tools that talks to any OpenAI-compatible /v1 endpoint. hydra-llm provides exactly that, on a stable port. Composed:
# hydra-llm: pick something good at code
hydra-llm start qwen2.5-32b
hydra-llm api qwen2.5-32b # prints the URL
# in your project directory:
lillycoder --api http://localhost:18087/v1
# (lilly auto-detects common local LLM ports too, so just `lillycoder` often works)
hydra-llm manages the model server. lillycoder is the agent that sits in front of it: reads/writes files, runs shell commands, greps your codebase, all under a permission gate. No cloud, no API key, no telemetry on either end.
Privacy
- No telemetry, no analytics, no auto-update calls.
- All inference runs locally, in Docker, on your machine.
- Chat sessions stay on your disk; the CLI never uploads them.
- Model and embedder downloads work without a Hugging Face account. `HF_TOKEN` is honored but never required.
- RAG indexes never leave your machine. The chunk text is stored in the LanceDB at `<folder>/.hydra-index/`; embeddings are computed locally by the embedder container; queries never go to a third party.
Command reference (cheat sheet)
Chat-model lifecycle
hydra-llm doctor detect hardware tier
hydra-llm setup first-run for chat models
hydra-llm list-online browse the chat catalog
hydra-llm list catalogued models, with downloaded/running/RAG state
hydra-llm download <id> pull a chat model GGUF
hydra-llm addlocal <file|dir> register your own GGUFs
hydra-llm start <id> launch the chat-model container
hydra-llm stop <id> stop one
hydra-llm stop-all stop every chat-model container
hydra-llm status running containers, ready state
hydra-llm api <id> print the OpenAI endpoint URL
hydra-llm chat <id> interactive REPL
hydra-llm autostart <id> start at login (user systemd unit)
hydra-llm config <id> [k] [v] read/write per-alias server settings
hydra-llm create <m> <p.md> <id> [--rag-index <path>]
bake a persona (and optional corpus) into an alias
hydra-llm persona list|show|path
RAG pipeline
hydra-llm rag setup interactive first-run for RAG
hydra-llm rag list-online browse the embedder catalog
hydra-llm rag list installed embedders, with running state
hydra-llm rag download <id> pull an embedder GGUF
hydra-llm rag addlocal <file> register your own embedder GGUF
hydra-llm rag remove <id> delete an installed embedder
hydra-llm rag info <id> details, runtime status
hydra-llm rag stores [--prune] folders that have been indexed
hydra-llm rag stop <id> stop one embedder sidecar
hydra-llm rag stop-all stop every embedder sidecar
hydra-llm index [path] build/refresh a per-folder RAG index
hydra-llm query "<text>" search; defaults to cwd, federates if needed
hydra-llm chat <m> --rag <path> chat with retrieval (single store)
hydra-llm chat <m> --rag-all chat with retrieval across every store
hydra-llm chat <m> --rag-tag <t> chat with retrieval across tagged stores
hydra-llm chat <bundle-alias> chat with a catalog-bound bundle (no flags)
Inside the chat REPL
/help show all slash commands
/quit, /exit leave the chat
/reset clear history but keep the system prompt
/params show current sampling params
/set <k> <v> change a param for this session only
/thoughts on|off show or hide reasoning_content blocks
/rag on|off toggle retrieval for this session
/rag-show on|off show/hide the [rag: N chunks] line
/rag-chunks on|off echo retrieved chunk locations to the terminal
/rag <text> one-off retrieval, prints hits without sending to model
Uninstall
If you installed via apt:
sudo apt remove hydra-llm hydra-llm-plasma # add --purge to also drop config
If you installed with the one-liner (user mode, lives under ~/.local):
hydra-llm uninstall # keeps configs and downloaded models
hydra-llm wipe # also deletes models, embedders, sessions, engine image
Both paths stop running model containers and remove the Plasma widget files. The user-mode uninstaller refreshes plasmashell so the tray icon clears immediately. After apt remove hydra-llm-plasma, log out and back in (or run kquitapp6 plasmashell && kstart plasmashell) to refresh the panel.
Full disclaimer
This software runs large language models on your machine, manages Docker containers on your behalf, downloads multi-gigabyte model files from third-party hosts (primarily Hugging Face community mirrors), exposes HTTP APIs on local ports, and (when you use the RAG features) reads files inside any directory you index and stores embeddings of them on disk. It is provided as is, without warranty of any kind, express or implied, including but not limited to merchantability, fitness for a particular purpose, and noninfringement.
By installing or running this software you accept that:
- You alone are responsible for any damage to your hardware, data, network, or system.
- The author(s) and contributors are not liable for any harm, data loss, hardware failure, security incident, model output, voided warranty, or other damages, however caused.
- LLM weights and embedder weights downloaded via this tool are governed by their own upstream licenses (Llama, Gemma, Mistral, Qwen, Nomic, BGE, etc.). You are responsible for complying with each model's license. Some models prohibit specific uses; read the model card before using one in production.
- LLM outputs are unreliable. They will hallucinate, repeat training data, give incorrect medical/legal/financial advice, and produce harmful or biased content. Do not rely on them for safety-critical decisions. RAG reduces hallucination but does not eliminate it; the model can still misquote retrieved chunks.
- The RAG pipeline reads files in any directory you point `hydra-llm index` at and stores chunked text plus embeddings of those files at `<directory>/.hydra-index/`. If a directory contains secrets, credentials, or sensitive personal data, those will be embedded into a local LanceDB index. The index never leaves your machine, but anyone with read access to that directory can read the index. Audit what you index.
- Running large models stresses CPU, RAM, and GPU. Sustained high utilisation can cause thermal throttling, fan wear, or, in poorly cooled systems, hardware damage. Monitor your machine.
- The CLI shells out to `docker`. A misconfigured Docker setup, or config files writable by an attacker, could be abused to run arbitrary containers as your user.
If you do not accept these terms, do not install or run this software.
Full legal license: MIT.
Source
- Code: github.com/ra-yavuz/hydra-llm
- Releases: github.com/ra-yavuz/hydra-llm/releases
- Other ra-yavuz projects: ra-yavuz.github.io
© 2026 Ramazan Yavuz. MIT-licensed. No warranty.
This page does not use cookies, tracking, or analytics. Outbound links lead to third-party websites outside our control; we are not liable for their content or privacy practices.