lightweight inference harness

You bought the GPU. Now use it.

infai is a lightweight harness for local inference. It auto-detects your models, manages complex configurations, and launches fast.
One-click profiles. Persistent settings. Real-time metrics. No flag soup.

infai
~/models

  qwen2.5-7b-instruct-q4_k_m.gguf 7.2 GB
  deepseek-r1-8b-q5_k_s.gguf 5.8 GB
  llama-3.3-70b-q2_k.gguf 26.4 GB
  mistral-7b-v0.3-q6_k.gguf 5.5 GB

enter: select · /: filter · a: all · f: folders · q: quit

The problem is obvious

Every local LLM session starts the same way: scrolling through shell history, fixing a typo, forgetting a flag.

Before
$ llama-server \
    -m ~/models/qwen2.5-7b-q4_k_m.gguf \
    --ctx-size 65536 \
    --n-gpu-layers 99 \
    --batch-size 2048 \
    --ubatch-size 512 \
    --flash-attn \
    --host 0.0.0.0 \
    --port 8080

Every. Single. Time.

vs
After
$ infai # select model, press enter. that's it.

Profiles remember everything. You just pick and run.

01

Configure once,
launch forever

Name your profile, then set context size, GPU layers, batch parameters, flash attention, and quantization type. Save it once and relaunch in seconds.

infai profile configuration
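To make the idea concrete, here is a minimal sketch of how a saved profile could be turned back into the llama-server command from the "Before" example. The Profile type, its field names, and the Args and CommandLine methods are hypothetical and are not taken from infai's source; only the flags themselves come from the command shown earlier.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Profile is a hypothetical saved launch profile. The field names are
// illustrative assumptions, not infai's actual schema.
type Profile struct {
	Name       string
	ModelPath  string
	CtxSize    int
	GPULayers  int
	BatchSize  int
	UBatchSize int
	FlashAttn  bool
	Host       string
	Port       int
}

// Args renders the profile as the llama-server flags used in the
// "Before" command.
func (p Profile) Args() []string {
	args := []string{
		"-m", p.ModelPath,
		"--ctx-size", strconv.Itoa(p.CtxSize),
		"--n-gpu-layers", strconv.Itoa(p.GPULayers),
		"--batch-size", strconv.Itoa(p.BatchSize),
		"--ubatch-size", strconv.Itoa(p.UBatchSize),
	}
	// Boolean options only appear when enabled.
	if p.FlashAttn {
		args = append(args, "--flash-attn")
	}
	return append(args, "--host", p.Host, "--port", strconv.Itoa(p.Port))
}

// CommandLine joins the binary name and flags for display purposes.
func (p Profile) CommandLine() string {
	return "llama-server " + strings.Join(p.Args(), " ")
}

func main() {
	p := Profile{
		Name:       "qwen-long-context",
		ModelPath:  "~/models/qwen2.5-7b-q4_k_m.gguf", // "~" is expanded by the shell, not by Go
		CtxSize:    65536,
		GPULayers:  99,
		BatchSize:  2048,
		UBatchSize: 512,
		FlashAttn:  true,
		Host:       "0.0.0.0",
		Port:       8080,
	}
	fmt.Println(p.CommandLine())
}
```

Keeping the flags as a slice rather than one string means the same value could be handed to a process launcher without any shell quoting.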
02

Watch inference
happen live

Built-in viewport streams logs plus system/model metrics in real-time. Stop, restart, or switch models without leaving the TUI.

infai live server logs
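For a rough picture of what sits behind a live log viewport, the sketch below starts a child process, reads its output line by line, and keeps only the most recent lines. It is an illustration built on Go's standard library, not infai's implementation: StreamLines, RunAndStream, and LogBuffer are hypothetical names, and the command to run is whatever you pass on the command line.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"os/exec"
)

// LogBuffer keeps the last max lines, roughly what a scrolling viewport shows.
type LogBuffer struct {
	max   int
	lines []string
}

func NewLogBuffer(max int) *LogBuffer { return &LogBuffer{max: max} }

// Add appends a line and drops the oldest lines beyond the limit.
func (b *LogBuffer) Add(line string) {
	b.lines = append(b.lines, line)
	if len(b.lines) > b.max {
		b.lines = b.lines[len(b.lines)-b.max:]
	}
}

func (b *LogBuffer) Lines() []string { return b.lines }

// StreamLines passes each line read from r to sink until EOF.
func StreamLines(r io.Reader, sink func(line string)) error {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		sink(sc.Text())
	}
	return sc.Err()
}

// RunAndStream starts the command and forwards stdout and stderr to sink.
func RunAndStream(sink func(string), name string, args ...string) error {
	cmd := exec.Command(name, args...)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	cmd.Stderr = cmd.Stdout // send stderr down the same pipe
	if err := cmd.Start(); err != nil {
		return err
	}
	streamErr := StreamLines(stdout, sink)
	if streamErr != nil {
		cmd.Process.Kill() // stop the child so Wait cannot block on a full pipe
	}
	waitErr := cmd.Wait()
	if streamErr != nil {
		return streamErr
	}
	return waitErr
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: stream <command> [args...]")
		return
	}
	buf := NewLogBuffer(200)
	err := RunAndStream(func(l string) {
		buf.Add(l)
		fmt.Println("log |", l)
	}, os.Args[1], os.Args[2:]...)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```

A real TUI would redraw from the buffer instead of printing, but the shape is the same: one reader loop per process, one bounded buffer per view.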

What you get

//

Persistent configurations

The system remembers your complex flags, not you. Configure once, reuse instantly.

//

Model auto-discovery

Point infai at your model directories. It indexes GGUF files and sets them up instantly.

//

One-click management

Switch between context sizes, quantizations, or backends in seconds. Run directly on your hardware.

//

Live logs + metrics

Real-time viewport with process and system telemetry. No tmux splits or second monitor needed.

//

Terminal themes

Tokyonight, Everforest, One Dark, Rose Pine, Gruvbox. Match your terminal's vibe.

//

SQLite config

One database. No scattered dotfiles, no YAML, no env vars. Everything in one place.

//

One-key launch

Select model. Press enter. Server starts. That's the whole workflow.
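The model auto-discovery described above can be pictured as a recursive directory walk that collects every file ending in .gguf along with its size. This is a sketch under that assumption, using only Go's standard library. Model, IsGGUF, and DiscoverGGUF are hypothetical names, and infai's real indexing may work differently.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// Model is a hypothetical index entry for one discovered file.
type Model struct {
	Path string
	Size int64 // bytes
}

// IsGGUF reports whether the file name has a .gguf extension (any case).
func IsGGUF(name string) bool {
	return strings.EqualFold(filepath.Ext(name), ".gguf")
}

// DiscoverGGUF walks root recursively and returns every GGUF file found.
// It stops at the first unreadable directory; a real tool might skip it instead.
func DiscoverGGUF(root string) ([]Model, error) {
	var models []Model
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || !IsGGUF(path) {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		models = append(models, Model{Path: path, Size: info.Size()})
		return nil
	})
	return models, err
}

func main() {
	root := "."
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	models, err := DiscoverGGUF(root)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	for _, m := range models {
		// Sizes are printed in decimal gigabytes.
		fmt.Printf("%s %.1f GB\n", filepath.Base(m.Path), float64(m.Size)/1e9)
	}
}
```

Run it with a directory argument to list that tree, or with no argument to scan the current directory.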

45M+ GGUF downloads on HuggingFace, 2025
70% of local LLM users run on personal hardware
$2-15K typical hardware spend on local inference

You invested in the hardware. Stop wasting time on the flags.

Get it

Binary

Grab a prebuilt binary from GitHub Releases: Linux and macOS, amd64 and arm64.

From source
go install github.com/dipankardas011/infai@latest

Requires Go 1.23+ and a C compiler (needed to build the SQLite dependency).

What's next

llama.cpp today. Your single control plane for all local inference tomorrow.

shipped

llama.cpp

GGUF auto-detect, launch profiles, live logs, 5 themes

shipped

Resource monitoring

Live system and model CPU, memory, and GPU telemetry in the run screen