ra-yavuz / hydra-llm

hydra-llm v0.2.1 - RAG
Docker for language models, with retrieval baked in.
One CLI to download, run, chat with, and search local LLMs. Each model runs in its own Docker container, so docker ps shows what is actually running. Hardware-aware curated GGUF catalog with anonymous Hugging Face downloads. OpenAI-compatible endpoints on stable local ports. Retrieval-augmented generation is a first-class feature: index any folder, query it, or add --rag <path> to any chat. Bundle a model, a persona, and a corpus into a single alias via hydra-llm create, then chat with it by name. Optional KDE Plasma 6 panel widget. No cloud, no API key, no telemetry.
Open source under the MIT License. Provided as is, no warranty: read the full disclaimer at the bottom of the page before installing.
Why use this instead of plain Ollama or llama.cpp?
Three answers, in increasing order of "you didn't know you wanted that":
- Transparent Docker over llama.cpp. Each model runs in its own container with a stable name and a reserved port. `docker ps` shows you exactly what's running. No wrapper daemon you can't see into. hydra-llm is a thin layer over real, inspectable infrastructure.
- Hardware-aware curated catalog with anonymous downloads. `hydra-llm list-online` filters community GGUFs to what your machine can actually run, scored against the tier from `hydra-llm doctor`. No Hugging Face account or token required; `HF_TOKEN` is honored but never demanded.
- RAG built in. Index any folder with one command. Query it with another. Or add `--rag <path>` to `hydra-llm chat` and your model retrieves relevant chunks at every turn. Bundle a model + a persona + a corpus into one declarative alias and just say `hydra-llm chat my-bot`. Nobody else does that.
If you've used Ollama, this will feel familiar. The difference: hydra-llm doesn't ask you to write modelfiles, runs in plain Docker (no opaque background daemon), exposes the same OpenAI endpoint shape, and ships a real KDE panel widget. And it has retrieval. Ollama doesn't.
Why Docker
Every model server (chat or embedder) runs in its own Docker container. The reasons:
- No host pollution. The llama.cpp engine, its CUDA/Vulkan runtime, and every model live inside a container. Nothing gets installed on your host beyond the `hydra-llm` CLI itself and a Python dependency or two.
- Trivial removal. `sudo apt remove hydra-llm` uninstalls the CLI. `hydra-llm wipe` additionally tears down the engine image, every cached model, every embedder, and every chat session. Your host is back to where it started.
- Reproducibility. The same image runs the same way on a Strix iGPU laptop, a workstation with a discrete GPU, a NUC, or an Apple Silicon Mac via Docker Desktop. Two engine variants ship: Vulkan for GPU-equipped boxes and a plain CPU build for everything else; the right one is auto-selected per machine.
- Inspectable. `docker ps` tells you exactly which models are running and on which port. `docker logs hydra-<alias>` gives you the raw llama-server output. No background daemon hiding state behind an opaque API.
- Per-model isolation. Two models running side by side cannot interfere with each other's address space or file handles; they are separate processes in separate containers. Killing one never destabilises another.
Engine updates and pinning
Each model server runs in a Docker container hydra builds locally from a Dockerfile shipped with the deb. The Dockerfile pins a specific llama.cpp commit, so the engine you get is reproducible: every user on the same hydra-llm version runs the same llama.cpp.
When a hydra-llm release bumps the pinned commit (which we do whenever there is a meaningful llama.cpp fix or a model family that needs newer parser support), the deb's postinstall script notices the mismatch between the Dockerfile's ref and the label baked into your existing engine image, and rebuilds the image automatically. This works on upgrades and downgrades. apt upgrade hydra-llm keeps your engine current with no extra step. The rebuild log lands at /var/log/hydra-llm-engine-rebuild.log if you ever want to see what happened.
If a rebuild fails (transient network, Docker daemon down, an upstream llama.cpp build break), the postinstall logs a warning and continues; your previous engine still works. You can retry any time with:
hydra-llm setup --rebuild
That is also the way to force a rebuild manually for any other reason (corrupted layers, debugging a build).
Embedder auto-recover
When an embedder sidecar fails to start because the GGUF on disk looks corrupt (a partial download, a re-uploaded file with a different SHA, etc.), hydra detects the parse failure in the container logs, deletes the local file, re-downloads from the catalog URL, and retries once. Transparent to the user; no --force flag needed. If the second attempt also fails, the original error bubbles up.
Autokill: idle models stop themselves
Loaded chat models hold VRAM and the kernel page cache for the GGUF mmap. Forgetting to hydra-llm stop a model after you are done is the easy mistake. Hydra ships a small user-level systemd timer (hydra-llm-reaper.timer) that wakes once a minute, asks Docker which hydra containers are running, and stops the ones that have been idle longer than chat_idle_ttl_seconds (default: 600s, ten minutes). The timer is enabled automatically the first time hydra-llm setup finishes; manage it explicitly with hydra-llm reaper {status,enable,disable}. Disable autokill entirely by setting chat_idle_ttl_seconds: 0 in ~/.config/hydra-llm/config.yaml.
"Idle" combines two signals so the autokill works for hydra-llm chat users and external API clients alike: every chat turn refreshes a per-alias touch file, and each reap cycle also samples docker stats and treats any container above reap_cpu_busy_percent (default 1%) as in use. Aider, curl, the Plasma widget, lillycoder, anything sending real requests at the model's port keeps it alive purely by using it. The timer itself is dormant between ticks; one cycle is a docker ps, a docker stats --no-stream, and a few stat() calls. Cost is dominated by docker and stays a small fraction of one core for a couple hundred milliseconds per minute. Run hydra-llm reap to trigger one cycle by hand.
The same TTL mechanism has long applied to embedder sidecars (embedder_idle_ttl_seconds, default 60s); chat-model autokill brings the two surfaces under one timer.
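For the curious, one reap cycle amounts to roughly the following done by hand (a sketch of the behaviour described above, not the implementation; where exactly the per-alias touch files live is an assumption):

docker ps --filter name=hydra- --format '{{.Names}}'         # which hydra containers are up
docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}'    # who is actually busy right now
# a container is stopped only if its CPU% is below reap_cpu_busy_percent AND its
# per-alias touch file (assumed to sit somewhere under ~/.local/state/hydra-llm/)
# is older than chat_idle_ttl_seconds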
Where it runs
hydra-llm targets Linux first (Debian, Ubuntu, Fedora, Arch, etc.) but the architecture is portable wherever Docker runs:
- Linux: primary target. `.deb` for Debian/Ubuntu derivatives; the bash one-liner installer works on every distro with bash + Docker.
- WSL2 on Windows: works. Run hydra-llm inside an Ubuntu WSL distro with Docker Desktop integration, or with the native `docker.io` package inside the WSL distro itself. The Vulkan engine uses WSLg's GPU passthrough on supported hardware; otherwise the CPU engine is used. The KDE Plasma widget needs Plasma 6, so it does not apply on Windows desktops; the CLI works fully.
- macOS: the CLI runs under Docker Desktop. The Vulkan engine variant does not apply on Apple Silicon (use the CPU engine, which still benefits from Apple's accelerators inside Docker's HVF layer); model speeds are competitive on M-series Pro/Max chips.
- Headless servers: the CLI is the only surface needed. The Plasma widget and other GUI hooks degrade gracefully; nothing blocks on a missing display.
Install
sudo bash -c 'set -e; install -m 0755 -d /etc/apt/keyrings && curl -fsSL https://ra-yavuz.github.io/apt/pubkey.gpg -o /etc/apt/keyrings/ra-yavuz.gpg && echo "deb [arch=amd64,arm64 signed-by=/etc/apt/keyrings/ra-yavuz.gpg] https://ra-yavuz.github.io/apt stable main" > /etc/apt/sources.list.d/ra-yavuz.list && apt update && apt install -y hydra-llm'
On KDE, also install the panel widget: sudo apt update && sudo apt install hydra-llm-plasma. The sudo apt update step is required even if you already have the repo: without it apt will not see new packages or new versions.
Step by step (manual repo setup)
# 1. Trust the signing key
sudo install -d -m 0755 /etc/apt/keyrings
curl -fsSL https://ra-yavuz.github.io/apt/pubkey.gpg \
| sudo tee /etc/apt/keyrings/ra-yavuz.gpg >/dev/null
# 2. Add the apt source
echo "deb [arch=amd64,arm64 signed-by=/etc/apt/keyrings/ra-yavuz.gpg] https://ra-yavuz.github.io/apt stable main" \
| sudo tee /etc/apt/sources.list.d/ra-yavuz.list
# 3. Refresh the package index, then install
sudo apt update
sudo apt install hydra-llm hydra-llm-plasma # plasma widget is optional
hydra-llm runs every model in a Docker container. If you do not have Docker yet: sudo apt install docker.io && sudo usermod -aG docker "$USER", then log out and back in.
Alternative: the bash one-liner installer (user mode, lives under ~/.local). It installs the CLI to ~/.local/bin, builds the engine image, downloads a starter model, and runs a smoke test:
curl -fsSL https://raw.githubusercontent.com/ra-yavuz/hydra-llm/main/get.sh | bash
Do not run with sudo. The installer asks for sudo only for missing system packages.
Quickstart, end to end
Zero to chatting with retrieval over your own folder. Every command below is real, in the order you run them. Deep-dive sections later on this page explain each piece.
1. First-run engine setup
Build the llama.cpp Docker image and pull a starter model so you can confirm the install works:
hydra-llm doctor # confirm hardware detection (tier, RAM, GPU)
hydra-llm setup # build engine image (5-10 min) + starter model + smoke test
hydra-llm setup is the only step that takes real time. It builds the Vulkan and CPU engine images locally; the deb itself does not ship them. On a typical laptop this is a one-time 5-10 minute investment.
2. Pick and download a chat model
hydra-llm list-online # browse the catalog, filtered to what fits
hydra-llm download gemma-2-2b # or any id from list-online
Pick something that actually fits your hardware; hydra-llm doctor already told you your tier. Want to register your own GGUFs from somewhere else (Ollama, LM Studio, a manual download)?
hydra-llm addlocal /path/to/your/gguf/folder/ --link --tier laptop
Recursively registers every .gguf under the folder.
3. Chat with the model
Just type hydra-llm chat. As of v0.2.3, no alias is needed:
hydra-llm chat
# - exactly one model running -> attaches to it
# - several running -> lists with port numbers, prompts
# - none running, one downloaded -> auto-starts that one
# - none running, several installed -> lists, prompts
# - nothing installed -> tells you to download one
Or be explicit:
hydra-llm chat gemma-2-2b
hydra-llm chat 4 # numeric # from `hydra-llm list`
Inside the REPL, slash commands matter: /help, /quit, /reset, /params, /set temperature 0.5.
Sessions persist by default. Each (folder, model) pair gets its own session at ~/.local/state/hydra-llm/sessions/<hash>-<alias>.json, so chats in different folders never share history. Override with --session foo for a named cross-folder session, or ./chat.json to pin to a project.
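Both override forms, as concrete commands:

hydra-llm chat gemma-2-2b --session research    # named session, shared across folders
hydra-llm chat gemma-2-2b ./chat.json           # pin the session file to this project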
4. Set up RAG (one-time)
Now the fun part: index a folder so the model can answer questions about your code or notes.
hydra-llm rag setup
You will see a numbered menu like:
Recommended embedder for your tier: qwen3-embed-4b (2.5 GB) (NOT installed)
Options:
1. Download recommended (qwen3-embed-4b, 2.5 GB)
2. Use already-installed nomic-embed-text (prose)
3. Pick a different embedder from the catalog
4. Cancel; do nothing
Pick 1 for the recommendation, or 2 if you already have a smaller embedder you would rather use (especially if your folder is mostly prose: a novel, notes, docs). The "What is an embedder" section below explains the choice in detail.
5. Index a folder
cd ~/path/to/your/project
hydra-llm index .
Walks the folder, classifies each file as code or prose, chunks it, embeds every chunk, and stores everything in <folder>/.hydra-index/ (a LanceDB database, ~20-200 MB depending on folder size). Re-running is idempotent: it diffs (mtime, size) against the previous index and only re-embeds changed files. ~1 second for a no-op refresh.
Useful flags: --exclude '*.test.js', --include 'fixtures/important.json', --depth 2, --max-file-size-mb 0.5, --full, --dry-run, --tag work.
6. Sanity-check retrieval (no model needed)
hydra-llm query "where do we handle auth tokens"
Top-K chunks come back with file path and line range. No chat model involved. If results look off, your index is the problem (try a different embedder, or index . --full); if they look right, you are ready to chat.
7. Chat with retrieval
hydra-llm chat --rag . # use the index in cwd
hydra-llm chat gemma-2-2b --rag . --rag-show-chunks # also echo retrieved locations
Per turn, hydra embeds your message, fetches the top-K chunks from the index, and prepends them to the prompt as a <context>...</context> block before sending to the model. --rag-show-chunks prints which file paths were retrieved so you can see what the model is being given.
Slash commands inside the REPL: /rag on|off, /rag-show on|off, /rag-chunks on|off, /rag <text> for one-off retrieval without a model call.
8. Bundle a model + persona + corpus into one alias
If you find yourself doing the same chat <model> --rag <path> repeatedly:
hydra-llm create gemma-2-2b ~/personas/code-helper.md cool-app-bot \
--rag-index ~/projects/cool-app
hydra-llm chat cool-app-bot # no flags. Retrieval just works.
The bundle is a single declarative entry in ~/.config/hydra-llm/catalog.yaml. Move that file across machines and the bundle moves with it.
9. Federate: query across every indexed folder
hydra-llm rag stores # list every folder you have indexed
hydra-llm query "..." --all # federated search across all of them
hydra-llm query "..." --tag work # only stores tagged with 'work'
hydra-llm chat <model> --rag-all # chat with retrieval across all stores
That is the whole arc, end to end. Everything below is depth on each piece.
What you actually get
Stable OpenAI-compatible endpoints
Every running model exposes POST /v1/chat/completions on its own local port. Point Aider, Continue.dev, Open Interpreter, lillycoder, or your own scripts at http://localhost:18080/v1; rotate which model is behind that port with stop A && start B. No client config changes, no API keys.
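For example, with a model on port 18080 the request shape is the standard OpenAI one (the model field value here is illustrative):

curl -s http://localhost:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "gemma-2-2b", "messages": [{"role": "user", "content": "Summarise this repo in one sentence."}]}'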
Container lifecycle without docker-fu
start, stop, stop-all, status, api. Two engine images (Vulkan + CPU) build on first setup and auto-select per hardware. CPU fallback if Vulkan misbehaves. Each model gets a stable container name and a reserved port.
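A typical round trip, using names from this page (the exact container name is illustrative, following the hydra-<alias> pattern mentioned earlier):

hydra-llm start gemma-2-2b       # container comes up on its reserved port
hydra-llm status                 # ready state, port
docker logs hydra-gemma-2-2b     # raw llama-server output
hydra-llm api gemma-2-2b         # prints the endpoint URL for clients
hydra-llm stop gemma-2-2b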
Hardware-aware curated catalog
list-online filters community GGUFs (Bartowski, lmstudio-community, mradermacher) to what your machine can run. Anonymous downloads, no HF account needed. Tiers cover 4 GB Pi-class up to 70B-on-iGPU Strix-class boxes.
RAG over any folder (new)
Walk + classify code-vs-prose + line-aware chunk + embed each chunk with the right embedder + store in <folder>/.hydra-index/ (LanceDB). Re-runs are incremental: only changed files re-embed. Federated query across every folder you've indexed. RRF fusion across code and prose.
Catalog-bound bundles (new)
A chat-catalog entry can carry system_prompt, params, and a rag_index: path. create <model> <persona.md> <id> --rag-index <path> bakes all three into one alias. Then chat <alias> runs everything, no flags. Move the model+persona+corpus across machines as one declarative unit.
KDE Plasma 6 panel widget
Per-row Start/Stop, Console launcher, inline log pane, prompt/params editor. HAL-eye tray indicator that breathes faster as system load rises and turns solid red once a model is healthy. Reads the same config as the CLI; addlocal entries appear automatically.
Personas, prompts, params
Three layers, narrowest wins. Persona files in personas/, per-alias prompts in prompts/, per-alias sampling params in params/. Inline catalog values override files. create <model> <persona.md> <id> bakes a persona's body into a new alias as inline system_prompt.
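A persona file is a Markdown file; as the create section later on this page describes, optional YAML front-matter carries sampling params and the body becomes the system prompt. A minimal sketch (the front-matter keys shown are illustrative assumptions, not a documented schema):

~/personas/code-helper.md
---
temperature: 0.4
top_p: 0.9
---
You are a concise senior engineer. Prefer minimal, correct patches over long explanations.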
Persistent chat sessions
Sessions saved as JSON in ~/.local/state/hydra-llm/sessions/. Resume by name with --session <name>, or pin the session file to any path: chat gemma-2-2b ./project-notes.json.
RAG, in depth
RAG (retrieval-augmented generation) means: when you ask a question, hydra retrieves relevant chunks from a corpus and prepends them to the prompt, so the model answers based on text it didn't have to memorise.
What is an embedder
An embedder is a small specialised model that does one thing: take a piece of text in, output a fixed-size list of numbers (a vector) out. Different texts that mean similar things produce vectors that are mathematically close. That is what makes "find the chunk most similar to this question" possible: the question becomes a vector, every chunk in your folder is already a vector, and finding nearest neighbours is just arithmetic.
Embedders are not chat models. They do not generate text. They run in their own llama-server --embeddings containers (separate port range from chat models, default 19080-19099) and live in a separate catalog at ~/.config/hydra-llm/embedders.yaml. Six curated embedders ship: nomic-embed-text (lightweight, prose-leaning), qwen3-embed-{0.6b,4b,8b} (instruction-aware, strong on code), bge-m3 (multilingual), nomic-embed-code (code-tuned).
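To see what an embedder actually returns, you can hit a running sidecar directly. A sketch, assuming llama-server's OpenAI-compatible embeddings route and the first default sidecar port:

curl -s http://localhost:19080/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"input": "where do we handle auth tokens"}'
# -> {"data": [{"embedding": [0.0123, -0.0456, ...]}], ...}   a vector, not text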
Single-embedder vs dual-index mode
By default, hydra runs single-embedder mode: one embedder serves all your chunks (both code and prose). One GGUF on disk, one container at query time, one LanceDB index per folder. Cheap, simple, fine for most personal-scale projects.
There is also an opt-in dual-index mode that runs two embedders (one tuned for code, one for prose), maintains a separate index per kind, and fuses query results via Reciprocal Rank Fusion at retrieval time. The theory: code questions surface from the code table, prose questions surface from the prose table, fusion gives you the best of both.
Most users should stay on single. The honest numbers: dual-index buys measurable retrieval quality on very large mixed corpora (think a monorepo with thousands of source files plus hundreds of design docs), but for personal-scale folders the marginal quality difference is small. The cost is two embedder downloads (often 4+ GB combined), two embedder containers running concurrently, two indexes per folder, and the fusion logic at query time.
Turn dual mode on later with:
hydra-llm rag setup --dual # one-off
# or persist in ~/.config/hydra-llm/config.yaml:
# rag:
# dual_index: true
Switching modes requires a one-time re-index of any existing folders.
1. First-run for RAG
hydra-llm rag setup detects your hardware tier and presents a numbered menu. The menu surfaces already-installed embedders as alternatives so you don't get pushed into a 2.5 GB download you don't need:
Recommended embedder for your tier: qwen3-embed-4b (2.5 GB) (NOT installed)
Default code embedder for halo+ tier. Instruction-aware ...
Options:
1. Download recommended (qwen3-embed-4b, 2.5 GB)
2. Use already-installed nomic-embed-text (prose)
3. Pick a different embedder from the catalog
4. Cancel; do nothing
Number [1-4, default 1]:
Or browse and pick manually:
hydra-llm rag list-online # catalog, filtered to hardware
hydra-llm rag list # what you have installed
hydra-llm rag download <id> # pull one
hydra-llm rag info <id> # dimensions, pooling, prefix, running state
2. Indexing classifies code vs prose
The walker uses .gitignore (via python-pathspec) plus a builtin blacklist (node_modules, .venv, target, build, lockfiles, binaries, archives, media, weights, files >1 MB). Each file is classified code or prose by extension first (.py, .sh, .md, .rst, ...), then by canonical basenames (Makefile, Dockerfile, README, LICENSE), then by shebang sniff. The chunker is line-aware (1500 chars target, 200 overlap, never splits mid-line).
What happens with the classification depends on the mode: in single-embedder mode every chunk goes to one shared LanceDB table with its kind recorded as a column (so --code-only / --prose-only at query time still filter correctly); in dual-index mode code chunks go to the code table embedded with the code embedder, prose chunks go to the prose table embedded with the prose embedder.
cd ~/projects/cool-app
hydra-llm index . # full index on first run
hydra-llm index . # incremental: diffs by (mtime, size)
hydra-llm index . --tag work # tag this store (--tag is repeatable)
hydra-llm index . --exclude '*.test.js' --include 'fixtures/important.json'
hydra-llm index . --depth 2 --max-file-size-mb 0.5
hydra-llm index . --full # force a from-scratch rebuild
hydra-llm index . --dry-run # print plan, don't embed
3. Storage is per-folder LanceDB
Each indexed folder grows a .hydra-index/ containing its LanceDB table(s) (one shared table in the default single-embedder mode; code.lance and prose.lance in dual-index mode), a meta.yaml recording which embedder(s) the index was built with, and a files.json that drives incremental refresh. The .hydra-index/ moves with the folder: copy a project to another machine, and the index comes along.
4. Retrieval uses Reciprocal Rank Fusion
At query time in dual-index mode, the question is embedded with both embedders, top-K hits come back from each table, and the lists are fused by RRF (k=60). This is the 2026 best practice for code+prose corpora; it avoids the failure mode where a code embedder mangles README prose, while still surfacing the right code blocks first. (In single-embedder mode there is nothing to fuse: the shared table is searched directly, and --code-only / --prose-only filter on the recorded kind.)
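Concretely, RRF scores each chunk as the sum of 1/(k + rank) over every result list it appears in. With k = 60, a chunk ranked 1st by the code embedder and 4th by the prose embedder scores 1/61 + 1/64 ≈ 0.032, while a chunk ranked 2nd in only one list scores 1/62 ≈ 0.016: chunks that do well in both lists float to the top, without the two embedders' raw similarity scores ever having to be comparable.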
hydra-llm query "where do we handle auth tokens?" --in .
hydra-llm query "..." --top-k 10 --code-only
hydra-llm query "..." --tag work # federated across all tagged stores
hydra-llm query "..." --all # federated across every store
Default scope: if cwd has a .hydra-index/, query just that store. Otherwise, query every registered store. Override with --in, --stores, or --tag.
5. Chat-side RAG augments per turn
With --rag <path>, every user message is embedded, top-3 chunks are retrieved, and the user message becomes
<context>
--- src/auth/middleware.go:42-78 ---
func authenticate(r *http.Request) ...
--- README.md:54-72 ---
Auth flow: ...
</context>
(your original question)
Saved sessions keep the original (un-augmented) text so resumes don't carry stale context. New REPL slash commands: /rag on|off, /rag-show on|off, /rag-chunks on|off, /rag <text> for one-off retrieval without a model call.
hydra-llm chat llama-3.1-8b --rag . # single store
hydra-llm chat llama-3.1-8b --rag-all # every store
hydra-llm chat llama-3.1-8b --rag-tag work --rag-top-k 5 # by tag
hydra-llm chat my-bundle # catalog-bound, no flags
Catalog-bound bundles (the headline feature)
A chat-catalog entry can carry a rag_index: field. create bakes that field plus a persona's body plus its front-matter params into a new alias.
hydra-llm create llama-3.1-8b ~/personas/senior-engineer.md cool-app-bot \
--rag-index ~/projects/cool-app
# Persisted to ~/.config/hydra-llm/catalog.yaml as a single declarative entry:
# id: cool-app-bot
# filename: Llama-3.1-8B-Instruct-Q4_K_M.gguf (shared with the base, no extra download)
# system_prompt: "You are a senior engineer ..."
# rag_index: /home/yavuz/projects/cool-app
# tags: [persona-baked, rag-bound]
#
# Now:
hydra-llm chat cool-app-bot
This is what makes hydra-llm distinctive. Other local-RAG CLIs let you index a folder and chat with retrieval. Nobody else treats model + persona + corpus as a single declarative unit you can refer to by name. Move the catalog file across machines, and the bundle moves with it.
Hardware tiers
| Tier | Spec example | Recommended chat models | Recommended embedders |
|---|---|---|---|
| tiny | 4-8 GB RAM, no dGPU | SmolLM2, Phi-3.5-mini, Gemma-2-2B | nomic-embed-text |
| laptop | 16-32 GB RAM, integrated GPU | Llama-3.1-8B Q4, Mistral-7B Q4 | nomic-embed-text + qwen3-embed-0.6b |
| halo | 48+ GB unified RAM, big iGPU (Strix Point/Halo, Apple Silicon Pro/Max) | Gemma-2-27B Q4, Qwen-2.5-32B Q4 | nomic-embed-text + qwen3-embed-4b |
| workstation | 24+ GB dGPU | Llama-3.3-70B Q4 | + qwen3-embed-8b for SOTA code retrieval |
| server | multi-GPU or 64+ GB RAM | Llama-3.3-70B Q5, MoEs | any of the above |
hydra-llm doctor classifies your machine. hydra-llm rag setup picks the recommended embedder pair for that tier and downloads after a yes/no.
Bring your own models
Already have a folder of GGUFs from Ollama, LM Studio, or a manual download? One command:
# Single file:
hydra-llm addlocal /path/to/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--tier laptop --vram-gb 6 --ram-gb 12
# Whole folder, recursively. Bulk-registers every .gguf:
hydra-llm addlocal /path/to/your/gguf/folder/ --link --tier laptop
Folder mode auto-derives the id, name, and port from each file. Group flags (--tier, --family, --license, --gpu-layers, --ram-gb, --vram-gb) apply to every entry. --link creates symlinks under the configured models_dir if the source is elsewhere.
Where things live by default:
- Chat models: `~/.local/share/hydra-llm/models/` (override with `models_dir`)
- Embedders: `~/.local/share/hydra-llm/embedders/`
- Configs, personas, prompts, params, catalog overrides: `~/.config/hydra-llm/`
- Chat sessions: `~/.local/state/hydra-llm/sessions/`
- RAG store registry: `~/.local/state/hydra-llm/rag-stores.json`
- Per-folder index: `<folder>/.hydra-index/`
- Cache: `~/.cache/hydra-llm/`
Pairs with lillycoder
lillycoder is the sibling project: a local-first coder REPL with file and shell tools that talks to any OpenAI-compatible /v1 endpoint. hydra-llm provides exactly that, on a stable port. Composed:
# hydra-llm: pick something good at code
hydra-llm start qwen2.5-32b
hydra-llm api qwen2.5-32b # prints the URL
# in your project directory:
lillycoder --api http://localhost:18087/v1
# (lilly auto-detects common local LLM ports too, so just `lillycoder` often works)
hydra-llm manages the model server. lillycoder is the agent that sits in front of it: reads/writes files, runs shell commands, greps your codebase, all under a permission gate. No cloud, no API key, no telemetry on either end.
Privacy
- No telemetry, no analytics, no auto-update calls.
- All inference runs locally, in Docker, on your machine.
- Chat sessions stay on your disk; the CLI never uploads them.
- Model and embedder downloads work without a Hugging Face account. `HF_TOKEN` is honored but never required.
- RAG indexes never leave your machine. The chunk text is stored in the LanceDB at `<folder>/.hydra-index/`; embeddings are computed locally by the embedder container; queries never go to a third party.
Command reference (cheat sheet)
Chat-model lifecycle
hydra-llm doctor detect hardware tier
hydra-llm setup first-run for chat models
hydra-llm list-online browse the chat catalog
hydra-llm list catalogued models, with downloaded/running/RAG state
hydra-llm download <id> pull a chat model GGUF
hydra-llm addlocal <file|dir> register your own GGUFs
hydra-llm start <id> launch the chat-model container
hydra-llm stop <id> stop one
hydra-llm stop-all stop every chat-model container
hydra-llm status running containers, ready state
hydra-llm api <id> print the OpenAI endpoint URL
hydra-llm chat <id> interactive REPL
hydra-llm autostart <id> start at login (user systemd unit)
hydra-llm config <id> [k] [v] read/write per-alias server settings
hydra-llm create <m> <p.md> <id> [--rag-index <path>]
bake a persona (and optional corpus) into an alias
hydra-llm persona list|show|path
RAG pipeline
hydra-llm rag setup interactive first-run for RAG
hydra-llm rag list-online browse the embedder catalog
hydra-llm rag list installed embedders, with running state
hydra-llm rag download <id> pull an embedder GGUF
hydra-llm rag addlocal <file> register your own embedder GGUF
hydra-llm rag remove <id> delete an installed embedder
hydra-llm rag info <id> details, runtime status
hydra-llm rag stores [--prune] folders that have been indexed
hydra-llm rag stop <id> stop one embedder sidecar
hydra-llm rag stop-all stop every embedder sidecar
hydra-llm index [path] build/refresh a per-folder RAG index
hydra-llm query "<text>" search; defaults to cwd, federates if needed
hydra-llm chat <m> --rag <path> chat with retrieval (single store)
hydra-llm chat <m> --rag-all chat with retrieval across every store
hydra-llm chat <m> --rag-tag <t> chat with retrieval across tagged stores
hydra-llm chat <bundle-alias> chat with a catalog-bound bundle (no flags)
Inside the chat REPL
/help show all slash commands
/quit, /exit leave the chat
/reset clear history but keep the system prompt
/params show current sampling params
/set <k> <v> change a param for this session only
/thoughts on|off show or hide reasoning_content blocks
/rag on|off toggle retrieval for this session
/rag-show on|off show/hide the [rag: N chunks] line
/rag-chunks on|off echo retrieved chunk locations to the terminal
/rag <text> one-off retrieval, prints hits without sending to model
Uninstall
If you installed via apt:
sudo apt remove hydra-llm hydra-llm-plasma # add --purge to also drop config
If you installed with the one-liner (user mode, lives under ~/.local):
hydra-llm uninstall # keeps configs and downloaded models
hydra-llm wipe # also deletes models, embedders, sessions, engine image
Both paths stop running model containers and remove the Plasma widget files. The user-mode uninstaller refreshes plasmashell so the tray icon clears immediately. After apt remove hydra-llm-plasma, log out and back in (or run kquitapp6 plasmashell && kstart plasmashell) to refresh the panel.
Full disclaimer
This software runs large language models on your machine, manages Docker containers on your behalf, downloads multi-gigabyte model files from third-party hosts (primarily Hugging Face community mirrors), exposes HTTP APIs on local ports, and (when you use the RAG features) reads files inside any directory you index and stores embeddings of them on disk. It is provided as is, without warranty of any kind, express or implied, including but not limited to merchantability, fitness for a particular purpose, and noninfringement.
By installing or running this software you accept that:
- You alone are responsible for any damage to your hardware, data, network, or system.
- The author(s) and contributors are not liable for any harm, data loss, hardware failure, security incident, model output, voided warranty, or other damages, however caused.
- LLM weights and embedder weights downloaded via this tool are governed by their own upstream licenses (Llama, Gemma, Mistral, Qwen, Nomic, BGE, etc.). You are responsible for complying with each model's license. Some models prohibit specific uses; read the model card before using one in production.
- LLM outputs are unreliable. They will hallucinate, repeat training data, give incorrect medical/legal/financial advice, and produce harmful or biased content. Do not rely on them for safety-critical decisions. RAG reduces hallucination but does not eliminate it; the model can still misquote retrieved chunks.
- The RAG pipeline reads files in any directory you point `hydra-llm index` at and stores chunked text plus embeddings of those files at `<directory>/.hydra-index/`. If a directory contains secrets, credentials, or sensitive personal data, those will be embedded into a local LanceDB index. The index never leaves your machine, but anyone with read access to that directory can read the index. Audit what you index.
- Running large models stresses CPU, RAM, and GPU. Sustained high utilisation can cause thermal throttling, fan wear, or, in poorly cooled systems, hardware damage. Monitor your machine.
- The CLI shells out to `docker`. A misconfigured Docker setup, or config files writable by an attacker, could be abused to run arbitrary containers as your user.
If you do not accept these terms, do not install or run this software.
Full legal license: MIT.
Source
- Code: github.com/ra-yavuz/hydra-llm
- Releases: github.com/ra-yavuz/hydra-llm/releases
- Other ra-yavuz projects: ra-yavuz.github.io
© 2026 Ramazan Yavuz. MIT-licensed. No warranty.
This page does not use cookies, tracking, or analytics. Outbound links lead to third-party websites outside our control; we are not liable for their content or privacy practices.