The problem is obvious

Every local LLM session starts the same way: scrolling through shell history, fixing a typo, forgetting a flag.

Before

$ llama-server \
    -m ~/models/qwen2.5-7b-q4_k_m.gguf \
    --ctx-size 65536 \
    --n-gpu-layers 99 \
    --batch-size 2048 \
    --ubatch-size 512 \
    --flash-attn \
    --host 0.0.0.0 \
    --port 8080

Every. Single. Time.

After

$ infai
# select model, press enter. that's it.

Profiles remember everything. You just pick and run.

Configure once, launch forever

Name your profile, set context size, GPU layers, batch parameters, flash attention, quantization type. Save it once and relaunch in seconds.

Watch inference happen live

A built-in viewport streams logs alongside system and model metrics in real time. Stop, restart, or switch models without leaving the TUI.

What you get

Persistent configurations

The system remembers your complex flags, not you. Configure once, reuse instantly.

Model auto-discovery

Point at your directories. infai indexes GGUF files and sets them up instantly.

One-click management

Switch between context sizes, quantizations, or backends in seconds. Run directly on your hardware.

Live logs + metrics

Real-time viewport with process and system telemetry. No tmux splits or second monitor needed.

Terminal themes

Tokyonight, Everforest, One Dark, Rose Pine, Gruvbox. Match your terminal's vibe.

SQLite config

One database. No scattered dotfiles, no YAML, no env vars. Everything in one place.

One-key launch

Select model. Press enter. Server starts. That's the whole workflow.

45M+ GGUF downloads on HuggingFace, 2025

70% of local LLM users run on personal hardware

$2K-15K typical hardware spend on local inference

You invested in the hardware. Stop wasting time on the flags.

Get it

Homebrew (recommended)

brew install dipankardas011/tap/infai

macOS & Linux. Pre-compiled. Zero dependencies.

Install script

curl -sL https://raw.githubusercontent.com/dipankardas011/infai/main/install.sh | bash

Linux. Installs to /usr/local/bin. One-liner.

Binary

Grab it from GitHub Releases: Linux & macOS, amd64 & arm64.

From source

go install github.com/dipankardas011/infai@latest

Requires Go 1.23+ and a C compiler (SQLite).

What's next

llama.cpp today. Your single control plane for all local inference tomorrow.

llama.cpp (shipped)

GGUF auto-detect, launch profiles, live logs, 5 themes

Resource monitoring (shipped)

Live system and model CPU, memory, and GPU telemetry in the run screen

SafeTensors & MLX

HuggingFace SafeTensors and Apple MLX architectures

vLLM backend

Production-grade batched inference management