llmfit helps you pick local LLMs that actually fit

llmfit is a cross-platform terminal tool that detects your CPU, RAM, and GPU to rank which local LLMs will run well. It ships with a slick TUI, a CLI, and an optional REST API, plus a plan mode that estimates the hardware needed for a target model.

TL;DR

  • llmfit: Cross-platform terminal tool matching local LLMs to detected CPU/RAM/GPU; interactive TUI, CLI, optional REST API
  • TUI model table: composite score, estimated tok/s, best quantization, run mode, memory usage, use-case category; search/sort/filter/download
  • Scoring system: Quality/Speed/Fit/Context (0–100); weighted by category (General, Coding, Reasoning, Chat, Multimodal, Embedding)
  • Dynamic quantization: tries Q8_0→Q2_K to fit memory; retries at half context if full context fails
  • Plan mode: estimates minimum/recommended VRAM/RAM/CPU cores; GPU/CPU-offload/CPU-only paths; CLI llmfit plan with optional JSON
  • Integrations & serve mode: Ollama, llama.cpp, MLX; llmfit serve endpoints /health, /api/v1/system, selection queries (min_fit, runtime, use_case, sort)

llmfit is a cross-platform terminal tool built around a simple promise: match local LLM choices to the hardware that’s actually available. It auto-detects CPU, RAM, and GPU resources, then ranks models by whether they’ll run well—factoring in memory constraints, estimated throughput, and context needs. The project ships as both an interactive TUI and a more traditional CLI, with an optional REST API layer for programmatic use.

A hardware-first model picker (with a TUI as the default)

At its core, llmfit “right-sizes” models to a machine by combining hardware detection with a bundled database of models. Launching llmfit opens its interactive terminal UI, showing system specs up top and a sortable table of models below. Each row includes composite score, estimated tok/s, best quantization fit, run mode, memory usage, and use-case category.

Navigation and filtering are designed for quick iteration: search, cycle fit thresholds (Runnable/Perfect/Good/Marginal), filter by availability (including installed models), switch sort columns, and download models directly from within the interface.

Scoring: quality, speed, fit, context—weighted by use case

llmfit scores each model across four dimensions (0–100): Quality, Speed, Fit, and Context. Those scores roll into a weighted composite that changes by category (General, Coding, Reasoning, Chat, Multimodal, Embedding). The result is a ranking that reflects tradeoffs: for example, Chat weights speed higher, while Reasoning weights quality higher.
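The weighting idea can be sketched as follows. The weight values below are illustrative assumptions, not llmfit's actual numbers; only the four dimensions and the per-category weighting come from the project's description.

```python
# Illustrative category weights (assumed values, not llmfit's internals).
CATEGORY_WEIGHTS = {
    "chat":      {"quality": 0.25, "speed": 0.40, "fit": 0.20, "context": 0.15},
    "reasoning": {"quality": 0.45, "speed": 0.15, "fit": 0.20, "context": 0.20},
}

def composite_score(scores: dict[str, float], category: str) -> float:
    """Weighted sum of the four 0-100 dimension scores for one category."""
    weights = CATEGORY_WEIGHTS[category]
    return sum(scores[dim] * w for dim, w in weights.items())

scores = {"quality": 80, "speed": 60, "fit": 90, "context": 70}
print(composite_score(scores, "chat"))       # speed-heavy: 72.5
print(composite_score(scores, "reasoning"))  # quality-heavy: 77.0
```

The same model ranks differently per category purely because of the weights, which is exactly the tradeoff behavior described above.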

Underneath, dynamic quantization selection tries higher-quality quantizations first (Q8_0 down to Q2_K), picking the best that fits available memory. If nothing fits at full context, llmfit retries at half context.
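A minimal sketch of that fallback ladder, assuming approximate bytes-per-weight figures for each GGUF quantization (the exact constants and memory model are assumptions, not llmfit's code):

```python
# Approximate bytes per weight for common GGUF quantizations (assumed values).
QUANT_BYTES = {"Q8_0": 1.07, "Q6_K": 0.82, "Q5_K_M": 0.69,
               "Q4_K_M": 0.58, "Q3_K_M": 0.45, "Q2_K": 0.35}

def pick_quant(params_b: float, ctx_gb: float, avail_gb: float):
    """Best-quality quant that fits; retry at half context, else None.

    params_b: parameter count in billions; ctx_gb: KV-cache size at full context.
    """
    for ctx_frac in (1.0, 0.5):                 # full context first, then half
        for quant, bpw in QUANT_BYTES.items():  # highest quality first
            need_gb = params_b * bpw + ctx_gb * ctx_frac
            if need_gb <= avail_gb:
                return quant, ctx_frac
    return None

print(pick_quant(params_b=7, ctx_gb=2.0, avail_gb=8.0))  # ('Q6_K', 1.0)
```

For a 7B model in 8 GB, Q8_0 with full context would overflow, so the sketch degrades to Q6_K before ever touching the half-context retry.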

“What can I run?” and “What would it take?” with Plan mode

Beyond simple recommendations, llmfit includes a TUI Plan mode (activated with p) that flips the question around: estimate the hardware needed for a chosen model configuration. Plan mode provides estimate-based minimum and recommended VRAM/RAM/CPU cores, feasible run paths (GPU, CPU offload, CPU-only), and upgrade deltas to reach better targets.

On the CLI side, the same idea is available via llmfit plan ..., with optional JSON output intended for tooling and automation.
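The shape of such an estimate can be sketched as below; the constants (bytes per weight, 20% headroom) are illustrative assumptions and not llmfit's internal model:

```python
# Hedged sketch of a plan-style estimate; all constants are assumptions.
def plan_estimate(params_b: float, bytes_per_weight: float, ctx_gb: float) -> dict:
    """Rough minimum and recommended VRAM (GB) for a model configuration."""
    weights_gb = params_b * bytes_per_weight
    minimum = weights_gb + ctx_gb        # model weights + KV cache
    recommended = minimum * 1.2          # assumed ~20% headroom
    return {"min_vram_gb": round(minimum, 1),
            "recommended_vram_gb": round(recommended, 1)}

print(plan_estimate(params_b=8, bytes_per_weight=0.58, ctx_gb=1.5))
# {'min_vram_gb': 6.1, 'recommended_vram_gb': 7.4}
```

JSON output of exactly this kind of structure is what makes the CLI variant useful for automation.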

Provider integrations: Ollama, llama.cpp, and MLX

llmfit includes integrations with local runtime providers:

  • Ollama: detects installed models via GET /api/tags and can download via POST /api/pull, with support for remote instances through OLLAMA_HOST.
  • llama.cpp: supports direct GGUF downloads from Hugging Face and local cache detection, with runtime detection via llama-cli or llama-server in PATH.
  • MLX: targets Apple Silicon workflows, integrating with mlx-community caches and optional server mode.

A notable practical detail: llmfit maintains a mapping between Hugging Face model IDs and Ollama naming, so install detection and pulls resolve to the intended variants.
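A sketch of what that detection looks like against Ollama's real GET /api/tags endpoint. The ID mapping below is a hypothetical one-entry illustration, not llmfit's actual table:

```python
import json
import urllib.request

# Hypothetical HF-to-Ollama mapping for illustration only.
HF_TO_OLLAMA = {"meta-llama/Llama-3.1-8B-Instruct": "llama3.1:8b"}

def installed_names(tags_response: dict) -> set:
    """Model names from an Ollama /api/tags response body."""
    return {m["name"] for m in tags_response.get("models", [])}

def is_installed(hf_id: str, tags_response: dict) -> bool:
    return HF_TO_OLLAMA.get(hf_id) in installed_names(tags_response)

def fetch_tags(host: str = "http://localhost:11434") -> dict:
    """Query a local (or OLLAMA_HOST-style remote) Ollama instance."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        return json.load(resp)

sample = {"models": [{"name": "llama3.1:8b"}, {"name": "qwen2.5-coder:7b"}]}
print(is_installed("meta-llama/Llama-3.1-8B-Instruct", sample))  # True
```

Without such a mapping, an HF ID would never match Ollama's `name:tag` scheme, and installed models would look missing.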

Estimation details: memory bandwidth and MoE-aware fit

For speed estimation, llmfit treats token generation as memory-bandwidth-bound and uses a formula based on (bandwidth_GB_s / model_size_GB) × efficiency_factor, with fallback constants by backend when GPU identification isn’t available. On the memory side, fit analysis categorizes run modes (GPU, MoE, CPU+GPU, CPU) and fit levels (Perfect, Good, Marginal, Too Tight).

MoE models get special handling: llmfit detects architectures like Mixtral and DeepSeek variants and accounts for the smaller active subset of experts per token when estimating effective VRAM needs.
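Both ideas combine into one small formula: generation speed scales with how many bytes must be read per token, and for MoE models that is only the active-expert subset. The efficiency factor and active fraction below are assumed example values:

```python
# Memory-bandwidth-bound speed sketch; constants are illustrative assumptions.
def estimate_tok_s(bandwidth_gb_s: float, model_size_gb: float,
                   efficiency: float = 0.6,
                   active_fraction: float = 1.0) -> float:
    """tok/s ~= (bandwidth / bytes read per token) * efficiency.

    For MoE models, only the active experts' weights are read per token,
    so the effective model size shrinks by active_fraction.
    """
    effective_gb = model_size_gb * active_fraction
    return (bandwidth_gb_s / effective_gb) * efficiency

dense = estimate_tok_s(400, 8.0)                       # dense 8 GB model
moe = estimate_tok_s(400, 26.0, active_fraction=0.27)  # Mixtral-style subset
print(dense, moe)
```

This is why a large MoE model can decode faster than its on-disk size suggests: the per-token memory traffic is governed by active parameters, not total parameters.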

A local REST API for schedulers and tooling

For environments that want model selection as a service, llmfit serve exposes HTTP endpoints including /health, /api/v1/system, and model listing/top-selection routes with query parameters like min_fit, runtime, use_case, and sort. This positions llmfit as a potential building block for node-level scheduling or cluster aggregators.
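A client-side sketch of composing such a query. The /api/v1/system path and the query parameters come from the description above, but the model-listing path used here (`/api/v1/models`) is an assumed placeholder, since the exact route name isn't stated:

```python
import json
import urllib.parse
import urllib.request

def build_url(host: str, path: str, **params) -> str:
    """Compose a llmfit serve URL, dropping unset query parameters."""
    query = urllib.parse.urlencode({k: v for k, v in params.items() if v is not None})
    return f"{host}{path}?{query}" if query else f"{host}{path}"

def get_json(url: str) -> dict:
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# NOTE: /api/v1/models is an assumed listing path for illustration.
url = build_url("http://localhost:8080", "/api/v1/models",
                min_fit="good", runtime="ollama", use_case="coding", sort="score")
print(url)
```

A scheduler could call an endpoint like this per node and merge the JSON results, which is the cluster-aggregation role the paragraph above hints at.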

Model database and project structure

The bundled model list is generated from the Hugging Face API via scripts/scrape_hf_models.py, writing to data/hf_models.json (embedded at compile time). The repository also includes a standalone update flow (make update-models or scripts/update_models.sh), plus documentation files like API.md, AGENTS.md, and MODELS.md.

There’s also an OpenClaw skill (skills/llmfit-advisor) that uses llmfit’s JSON outputs to drive hardware-aware recommendations and provider configuration.

Source: https://github.com/AlexsJones/llmfit
