
XandLLM: High-Performance LLM Inference in Rust with Knowledge Distillation

Introducing XandLLM - a production-grade LLM inference engine written in Rust. OpenAI-compatible API, GPU acceleration, and built-in knowledge distillation for creating smaller, faster models.

Tags: XandAI · Rust · LLM · AI · Open Source · Knowledge Distillation


Today I'm excited to introduce XandLLM - the newest addition to the XandAI ecosystem: a production-grade, high-performance LLM inference engine written in Rust that brings together everything I've learned about running AI locally.

XandLLM is functionally comparable to vLLM and llama.cpp, but with a modern Rust architecture, OpenAI-compatible HTTP API, and something unique: built-in knowledge distillation for compressing large models into smaller, faster ones.

What is XandLLM?

XandLLM is a complete LLM serving solution that includes:

  • High-performance inference engine (Rust + CUDA)
  • OpenAI-compatible REST API (drop-in replacement)
  • Interactive CLI for local inference and chat
  • Web UI for browser-based interaction
  • Knowledge distillation pipeline (teacher → student compression)
  • Docker Compose deployment with GPU passthrough

Key differentiator: Unlike Ollama (which focuses on ease of use) or vLLM (which focuses on throughput), XandLLM focuses on customization and model compression - making it ideal for creating specialized, efficient models.

Why Another LLM Tool?

The LLM inference space has excellent options:

| Tool | Strength | Weakness |
|------|----------|----------|
| Ollama | Easy setup | Limited customization |
| vLLM | High throughput | Complex deployment |
| llama.cpp | Edge devices | Slower on GPUs |
| Text Generation Inference | Production features | Resource heavy |

XandLLM fills the gap for users who want:

  1. Full control over model architecture
  2. Knowledge distillation to create custom small models
  3. Modern Rust codebase (memory safety, performance)
  4. Simple deployment (single binary or Docker)
  5. OpenAI compatibility (existing tools work seamlessly)

Architecture & Features

Core Components

XandLLM Architecture
├── xandllm-core (Rust)
│   ├── Model loading (GGUF, Safetensors)
│   ├── Tokenization
│   ├── KV-cache management
│   └── CUDA kernels (GPU acceleration)
│
├── xandllm-api (Rust + Axum)
│   ├── OpenAI-compatible routes
│   ├── Streaming (SSE)
│   └── Health checks
│
├── xandllm-cli (Rust)
│   ├── serve (API server)
│   ├── run (single inference)
│   ├── chat (interactive)
│   ├── pull (model download)
│   └── distill (knowledge distillation)
│
├── xandllm-hub (Rust)
│   └── HuggingFace integration
│
└── Frontend (React + TypeScript)
    └── Streaming chat UI

Key Features

Performance:

  • GPU acceleration via CUDA (automatic CPU fallback)
  • Efficient KV-cache management
  • Streaming via Server-Sent Events
  • Concurrent request handling
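To see why KV-cache management matters, here's a dependency-free Python sketch (illustrative only - XandLLM's actual cache lives in the Rust core). Without a cache, every decode step re-projects keys and values for the entire sequence; with a cache, only the newest token is projected:

```python
# Illustrative cost model for KV-caching, counting key/value projection ops.
# This is NOT XandLLM's implementation - just the idea behind it.

def decode_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Projection ops when nothing is cached: every step redoes the prefix."""
    ops = 0
    seq_len = prompt_len
    for _ in range(new_tokens):
        seq_len += 1
        ops += seq_len          # re-project every position, every step
    return ops

def decode_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Projection ops when past keys/values are cached."""
    ops = prompt_len            # prefill: project the prompt once
    ops += new_tokens           # then exactly one projection per new token
    return ops

if __name__ == "__main__":
    no_cache = decode_without_cache(prompt_len=512, new_tokens=128)
    cached = decode_with_cache(prompt_len=512, new_tokens=128)
    print(f"without cache: {no_cache} ops, with cache: {cached} ops")
```

For a 512-token prompt and 128 generated tokens, caching cuts the projection work by roughly two orders of magnitude - which is why per-token latency stays flat as the conversation grows.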

Compatibility:

  • OpenAI API format (/v1/chat/completions, /v1/models)
  • Multiple model formats (GGUF, Safetensors)
  • Various architectures (LLaMA, Qwen, Gemma, Phi)
  • Chat template auto-detection
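As an illustration of what a chat template does, here is a minimal Python rendering of the ChatML format listed above (a sketch - real engines read the template and special tokens from the model's own metadata):

```python
# Minimal ChatML renderer (illustrative; actual templates come from the
# model's metadata, which is what template auto-detection resolves).

def render_chatml(messages: list[dict]) -> str:
    """Turn OpenAI-style messages into a ChatML prompt string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>")
    # Leave the assistant turn open so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Auto-detection simply means you never hand-build strings like this: the engine picks llama2, llama3, chatml, gemma, or phi3 framing for you based on the loaded model.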

Knowledge Distillation:

  • Compress teacher models into smaller students
  • Fine-tune existing models on custom datasets
  • Export to Safetensors or GGUF formats

Supported Models

Architectures

| Architecture | Formats | Chat Templates |
|--------------|---------|----------------|
| LLaMA | GGUF, Safetensors | llama2, llama3 |
| Qwen2 | GGUF | chatml |
| Qwen3 | GGUF | chatml, chatml-thinking |
| Gemma3 | GGUF | gemma |
| Phi-3 | GGUF | phi3 |
| ChatML-compatible | GGUF | chatml |

Tested & Verified Models

| Model | Best For | Size |
|-------|----------|------|
| Qwen2.5-Coder-7B | Code generation | 7B |
| Qwen3-4B-Thinking | Reasoning tasks | 4B |
| Gemma-3-4b-it | General instruction | 4B |
| Llama-3.1-8B | General purpose | 8B |

Installation & Setup

Prerequisites

# Rust 1.76+ (install via rustup)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# (Optional) CUDA 12.x for GPU support
# Download from NVIDIA developer site

Build from Source

Linux (CPU-only):

git clone https://github.com/XandAI-project/XandLLM.git
cd XandLLM
bash scripts/build-linux.sh

Linux (with CUDA GPU):

bash scripts/build-linux-cuda.sh

Windows (with CUDA):

scripts\build-cuda.bat

Manual build:

# CPU-only
cargo install --path crates/xandllm-cli --locked

# With CUDA
cargo install --path crates/xandllm-cli --features cuda --locked

The xandllm binary will be available in your PATH.

Quick Start Guide

1. Pull a Model

XandLLM uses HuggingFace Hub for model management. Models auto-download on first use.

# Set HuggingFace token (for gated models)
export HUGGING_FACE_HUB_TOKEN=hf_...

# Pull a model (downloads and caches)
xandllm pull Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0

# List cached models
xandllm list

Popular models to try:

# Code generation
xandllm pull Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0

# Reasoning with thinking blocks
xandllm pull TeichAI/Qwen3-4B-Thinking-2507-Claude-4.5-Opus-High-Reasoning-Distill-GGUF

# Compact general purpose
xandllm pull unsloth/gemma-3-4b-it-GGUF:Q6_K

# Llama 3.1 (requires HF token for gated models)
xandllm pull meta-llama/Llama-3.1-8B-Instruct

2. Run Local Inference

Single prompt:

xandllm run \
  --model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
  --prompt "Explain quantum entanglement in one paragraph."

With performance stats:

xandllm run \
  --model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
  --prompt "Write a haiku about Rust programming." \
  --stats

Output:

System:
  Memory: 8192 MB allocated
  Device: CUDA (NVIDIA GeForce RTX 3060)
  CUDA Compute: 8.6

Model:
  Architecture: llama
  Parameters: 7.24B
  Quantization: Q4_0

Performance:
  Tokens/second: 42.3
  Total time: 1.23s
  Generated: 52 tokens

Memory safety,
Zero-cost abstractions,
Fearless concurrency.

3. Interactive Chat

xandllm chat --model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF --gpu

Session example:

> /help
Available commands:
  /help, /h     - Show this help
  /quit, /q     - Exit chat
  /clear, /c    - Clear conversation
  /system <msg> - Set system prompt
  /stats        - Show performance stats

> /system You are a helpful coding assistant.

> Write a Python function to calculate fibonacci numbers.

def fibonacci(n):
    """Calculate fibonacci number at position n."""
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

# Or more efficiently:
def fibonacci_iterative(n):
    if n <= 0:
        return 0
    a, b = 0, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return b

> /stats
Session statistics:
  Total tokens: 1,247
  Avg speed: 38.5 tok/sec
  Peak memory: 4.2 GB

4. Start the API Server

This is where XandLLM shines - OpenAI-compatible API for integration with existing tools.

xandllm serve \
  --model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
  --port 11435 \
  --gpu

Server output:

🚀 XandLLM Server starting...

Configuration:
  Host: 0.0.0.0
  Port: 11435
  Model: Qwen/Qwen2.5-Coder-7B-Instruct-GGUF
  Device: CUDA (NVIDIA GeForce RTX 3060)
  Max batch size: 8
  Max sequence length: 4096

Routes:
  POST /v1/chat/completions
  POST /v1/completions
  GET  /v1/models
  GET  /health

Server ready at http://0.0.0.0:11435
Press Ctrl+C to stop

Using the Web Frontend

XandLLM includes a React-based streaming chat interface.

Setup

cd frontend

# Install dependencies
pnpm install

# Configure API endpoint
# Edit .env.local:
VITE_API_URL=http://localhost:11435

# Start development server
pnpm dev

Access at: http://localhost:5173

Features

  • Streaming responses - See tokens appear in real-time
  • Model selection - Switch between loaded models
  • Parameter tuning - Temperature, top_p, max_tokens
  • Chat history - Save and revisit conversations
  • System prompts - Customize behavior per session
  • Export - Download conversations as Markdown

Using with the Server

  1. Start the API server:

    xandllm serve --model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF --port 11435
    
  2. In another terminal, start the frontend:

    cd frontend && pnpm dev
    
  3. Open browser to http://localhost:5173

  4. The UI automatically connects to the API server and shows:

    • Available models
    • Connection status
    • GPU/CPU indicator

API Usage Examples

Chat Completions

Request:

curl -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Rust's ownership system?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1708000000,
  "model": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Rust's ownership system is a set of rules that the compiler checks at compile time..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 128,
    "total_tokens": 152
  }
}

Streaming (SSE)

Request:

curl -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Response (Server-Sent Events):

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"},"index":0}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":" there"},"index":0}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"!"},"index":0}]}

data: [DONE]
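Each `data:` line carries one JSON chunk; a client only has to strip the prefix, stop at `[DONE]`, and concatenate the deltas. A minimal stdlib-only parser (a sketch, independent of any SDK):

```python
import json

def collect_sse_content(lines) -> str:
    """Accumulate assistant text from OpenAI-style SSE chunk lines."""
    out = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                        # skip blanks and keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break                           # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            out.append(delta["content"])
    return "".join(out)

stream = [
    'data: {"choices":[{"delta":{"content":"Hello"},"index":0}]}',
    'data: {"choices":[{"delta":{"content":" there"},"index":0}]}',
    'data: {"choices":[{"delta":{"content":"!"},"index":0}]}',
    'data: [DONE]',
]
print(collect_sse_content(stream))  # Hello there!
```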

List Models

curl http://localhost:11435/v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
      "object": "model",
      "created": 1708000000,
      "owned_by": "xandllm"
    }
  ]
}

Using with OpenAI SDK

Since XandLLM is OpenAI-compatible, existing tools work with just a base URL change:

Python:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="not-needed"  # XandLLM doesn't require auth
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
    messages=[
        {"role": "user", "content": "Explain lifetimes in Rust"}
    ],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

JavaScript/TypeScript:

import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'http://localhost:11435/v1',
  apiKey: 'not-needed'
});

const stream = await openai.chat.completions.create({
  model: 'Qwen/Qwen2.5-Coder-7B-Instruct-GGUF',
  messages: [{ role: 'user', content: 'Hello!' }],
  stream: true
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}

Knowledge Distillation (Unique Feature)

This is XandLLM's killer feature - compressing large teacher models into efficient student models.

What is Knowledge Distillation?

A technique to train a smaller "student" model to mimic a larger "teacher" model:

┌──────────────┐      Training Data      ┌──────────────┐
│   Teacher    │ ──────────────────────▶ │   Student    │
│  (7B params) │                         │  (1B params) │
│    (Slow)    │                         │ (4x Faster)  │
└──────────────┘                         └──────────────┘

Benefits:

  • 4-7x faster inference
  • 1/7th the memory usage
  • ~85-90% quality retention
  • Runs on cheaper hardware
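At its core, distillation trains the student to match the teacher's softened output distribution, most commonly by minimizing a KL-divergence loss over temperature-scaled logits. A dependency-free sketch of that loss (illustrative; XandLLM's actual training loop is in Rust, and production setups usually blend this with a standard cross-entropy term):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                         # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened next-token distributions.

    A higher temperature exposes the teacher's 'dark knowledge': the
    relative probabilities it assigns to wrong-but-plausible tokens.
    """
    p = softmax(teacher_logits, temperature)   # teacher targets
    q = softmax(student_logits, temperature)   # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.5, 0.2]    # hypothetical next-token logits
student = [3.0, 2.0, 1.0]
print(f"distillation loss: {distill_loss(teacher, student):.4f}")
```

The loss is zero when the student exactly matches the teacher's distribution and grows as they diverge - that gradient signal is what `xandllm distill` drives down over the epochs below.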

Create a Fresh Student Model

Train a new 1B parameter model from scratch:

# Prepare training data (JSONL format)
mkdir -p my_dataset
cat > my_dataset/train.jsonl << 'EOF'
{"prompt": "Explain recursion in programming.", "completion": "Recursion is when a function calls itself to solve smaller instances of the same problem..."}
{"prompt": "What is a closure?", "completion": "A closure is a function that remembers the environment in which it was created..."}
EOF

# Run distillation
xandllm distill \
  --model-from Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0 \
  --dataset ./my_dataset \
  --model-to ./output/XandLM-1B \
  --size 1b \
  --epochs 3 \
  --batch-size 4 \
  --learning-rate 1e-4 \
  --gpu

What happens:

  1. Teacher model (7B) generates responses to your dataset
  2. Student model (1B) learns to mimic those responses
  3. After training, student achieves ~85-90% of teacher quality
  4. Student is 7x smaller and 4x faster
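If you build the dataset programmatically rather than by hand, the JSONL layout used above is just one JSON object per line with `prompt` and `completion` keys - a hypothetical helper:

```python
import json
import os

def to_jsonl(examples: list[dict]) -> str:
    """Serialize prompt/completion pairs to the JSONL layout shown above."""
    lines = []
    for ex in examples:
        if not {"prompt", "completion"} <= ex.keys():
            raise ValueError("each example needs 'prompt' and 'completion'")
        lines.append(json.dumps(ex, ensure_ascii=False))
    return "\n".join(lines) + "\n"

examples = [
    {"prompt": "Explain recursion in programming.",
     "completion": "Recursion is when a function calls itself..."},
    {"prompt": "What is a closure?",
     "completion": "A closure is a function that remembers its environment..."},
]
os.makedirs("my_dataset", exist_ok=True)
with open("my_dataset/train.jsonl", "w", encoding="utf-8") as f:
    f.write(to_jsonl(examples))
```

Going through `json.dumps` (rather than string templates) keeps newlines and quotes inside completions correctly escaped, which matters for code-heavy datasets.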

Serve the distilled model:

xandllm serve --model ./output/XandLM-1B --port 11435

Fine-tune an Existing Small Model

Instead of training from scratch, fine-tune a small base model:

xandllm distill \
  --model-from Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0 \
  --dataset ./my_dataset \
  --model-to ./output/MyFineTuned-3B \
  --student-base Qwen/Qwen2.5-3B-Instruct \
  --epochs 5 \
  --batch-size 2 \
  --learning-rate 5e-5 \
  --gpu

Options:

  • --type safetensor (default) - HuggingFace format
  • --type gguf - llama.cpp compatible (requires llama.cpp tools)

Convenience Scripts

Pre-configured distillation scripts for common setups:

# Distill 7B teacher → 1B student (fresh)
bash scripts/distill-1b.sh

# Distill 7B teacher → 3B student (fresh)
bash scripts/distill-3b.sh

# Distill 8B teacher → 7B student (fresh)
bash scripts/distill-7b.sh

# Fine-tune existing 1B model
bash scripts/distill-finetune-1b.sh

# Fine-tune existing 3B model
bash scripts/distill-finetune-3b.sh

# Fine-tune existing 7B model
bash scripts/distill-finetune-7b.sh

Override defaults:

bash scripts/distill-1b.sh \
  --dataset ./my_data \
  --output ./my-1b-model \
  --no-gpu

Real-World Use Case: Code Assistant

Create a specialized coding model:

# 1. Collect code examples
cat > code_dataset/train.jsonl << 'EOF'
{"prompt": "Write a Python function to reverse a string.", "completion": "def reverse_string(s):\n    return s[::-1]"}
{"prompt": "Explain async/await in JavaScript.", "completion": "Async/await is syntactic sugar over Promises..."}
{"prompt": "Create a Rust struct for a user.", "completion": "struct User {\n    name: String,\n    email: String,\n    age: u32,\n}"}
EOF

# 2. Distill from Qwen2.5-Coder-7B to 3B student
xandllm distill \
  --model-from Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0 \
  --dataset ./code_dataset \
  --model-to ./XandCoder-3B \
  --student-base Qwen/Qwen2.5-3B-Instruct \
  --epochs 5 \
  --gpu

# 3. Serve your custom model
xandllm serve --model ./XandCoder-3B --port 11435

Result: A 3B parameter model specialized for code that runs at 60+ tokens/sec on an RTX 3060.

Docker Deployment

Quick Start with Docker Compose

# Clone repository
git clone https://github.com/XandAI-project/XandLLM.git
cd XandLLM

# Configure environment
cp .env .env.local
echo "HUGGING_FACE_HUB_TOKEN=hf_..." >> .env.local

# Build and start with GPU
sudo docker compose --env-file .env.local -f docker/docker-compose.yml up --build

Services started:

  • API server: http://localhost:11435
  • Web UI: http://localhost:5173

Testing the Docker Setup

# Health check
curl http://localhost:11435/health
# Response: {"status": "ok"}

# Chat completion
curl -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
    "messages": [{"role": "user", "content": "Hello from Docker!"}]
  }'

Configuration

Config File (config/default.toml)

[server]
host = "0.0.0.0"
port = 11435
request_timeout_secs = 120

[inference]
max_batch_size = 8
max_sequence_length = 4096
default_max_new_tokens = 512
temperature = 0.7
top_p = 0.9

[model]
cache_dir = "~/.cache/xandllm"

[device]
prefer_gpu = true
cuda_device_id = 0

Environment Variables

# Logging
export RUST_LOG=info  # trace, debug, info, warn, error

# HuggingFace (for gated models)
export HUGGING_FACE_HUB_TOKEN=hf_...

# Server overrides
export XANDLLM_SERVER_PORT=11435
export XANDLLM_DEVICE_PREFER_GPU=true
export XANDLLM_MODEL_CACHE_DIR=/custom/cache/path

Comparison: XandLLM vs Alternatives

| Feature | XandLLM | Ollama | vLLM | llama.cpp |
|---------|---------|--------|------|-----------|
| Language | Rust | Go | Python | C++ |
| Knowledge Distillation | ✅ Built-in | ❌ No | ❌ No | ❌ No |
| OpenAI API | ✅ Native | ✅ Native | ✅ Native | ⚠️ Via proxy |
| Web UI | ✅ Included | ⚠️ Community | ❌ No | ⚠️ Various |
| GPU Support | ✅ CUDA | ✅ CUDA | ✅ CUDA | ✅ CUDA/Metal/Vulkan |
| GGUF | ✅ Yes | ✅ Yes | ⚠️ Limited | ✅ Best |
| Safetensors | ✅ Yes | ❌ No | ✅ Yes | ⚠️ Via conversion |
| Deployment | Binary/Docker | Binary/Docker | Python/Docker | Binary |
| Memory Safety | ✅ Rust | ✅ Go | ❌ Python | ⚠️ C++ |
| Setup Complexity | Medium | Easy | Hard | Medium |

Use Cases

1. Personal AI Assistant

Setup for home use:

# Small, fast model for daily use
xandllm serve --model unsloth/gemma-3-4b-it-GGUF:Q6_K --port 11435

# Access via:
# - Web UI (localhost:5173)
# - API from any app
# - CLI chat

2. Code Development Assistant

# Code-specialized model
xandllm serve \
  --model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
  --port 11435 \
  --gpu

# Integrate with IDE via OpenAI-compatible plugin

3. Custom Model Creation

Create specialized models for your domain:

# Distill from general to domain-specific
xandllm distill \
  --model-from meta-llama/Llama-3.1-8B-Instruct \
  --dataset ./medical_domain_data \
  --model-to ./Medical-Assistant-3B \
  --size 3b \
  --gpu

# Serve specialized model
xandllm serve --model ./Medical-Assistant-3B

4. API Service for Apps

# Production deployment
xandllm serve \
  --model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
  --host 0.0.0.0 \
  --port 11435 \
  --gpu

# Load balanced behind nginx
# Multiple model instances
# Monitoring via /health

Troubleshooting

CUDA Not Detected

Problem: GPU not being used despite CUDA being installed.

Solutions:

# Check CUDA availability
nvidia-smi

# Verify environment
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

# Rebuild with CUDA
cargo install --path crates/xandllm-cli --features cuda --locked

Out of Memory

Problem: Model doesn't fit in GPU memory.

Solutions:

# Use smaller quantization
xandllm pull Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0

# Or use CPU
xandllm serve --model MODEL --no-gpu

# Or use smaller model
xandllm serve --model unsloth/gemma-3-4b-it-GGUF:Q4_0

Model Loading Errors

Problem: "Failed to load model" or tokenizer errors.

Solutions:

# Clear cache and re-download
rm -rf ~/.cache/xandllm
xandllm pull MODEL

# Check model compatibility
xandllm list

Frontend Connection Issues

Problem: Web UI can't connect to API.

Solutions:

# Verify server is running
curl http://localhost:11435/health

# Check frontend env
cat frontend/.env.local
# Should be: VITE_API_URL=http://localhost:11435

# Check CORS (if accessing from different host)
# Server allows all origins by default

Roadmap

Upcoming Features

Mixture of Experts (MoE):

  • Mixtral support
  • Qwen-MoE
  • DeepSeek-MoE

Additional Architectures:

  • Phi (Microsoft)
  • Falcon
  • Mamba (state-space models)
  • RWKV

Multi-Modal:

  • Vision-Language (LLaVA, Qwen-VL)
  • Audio models (Whisper)

Performance:

  • Continuous batching
  • PagedAttention
  • Tensor/pipeline parallelism
  • AWQ/GPTQ quantization
  • LoRA/QLoRA adapters

API:

  • Function calling
  • JSON mode
  • Logprobs
  • Multiple choices

Conclusion

XandLLM brings together:

  • High-performance Rust implementation
  • OpenAI-compatible API
  • Knowledge distillation (unique feature)
  • Multiple model formats (GGUF, Safetensors)
  • Easy deployment (binary or Docker)
  • Complete toolchain (CLI, API, Web UI)

Get started in 5 minutes:

# 1. Build
git clone https://github.com/XandAI-project/XandLLM.git
cd XandLLM && bash scripts/build-linux.sh

# 2. Pull model
xandllm pull Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0

# 3. Serve
xandllm serve --model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF --port 11435

# 4. Chat
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-Coder-7B-Instruct-GGUF","messages":[{"role":"user","content":"Hello!"}]}'

Try knowledge distillation:

xandllm distill \
  --model-from Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0 \
  --dataset ./my_data \
  --model-to ./MyModel-1B \
  --size 1b \
  --gpu

Star the repository, try it out, and let me know what you think! 🦀🚀


Found this helpful? Share your thoughts on GitHub