XandLLM: High-Performance LLM Inference in Rust with Knowledge Distillation
Introducing XandLLM - a production-grade LLM inference engine written in Rust. OpenAI-compatible API, GPU acceleration, and built-in knowledge distillation for creating smaller, faster models.
Today I'm excited to introduce XandLLM, the newest addition to the XandAI ecosystem: a production-grade, high-performance LLM inference engine written in Rust that brings together everything I've learned about running AI locally.
XandLLM is functionally comparable to vLLM and llama.cpp, but with a modern Rust architecture, OpenAI-compatible HTTP API, and something unique: built-in knowledge distillation for compressing large models into smaller, faster ones.
What is XandLLM?
XandLLM is a complete LLM serving solution that includes:
- High-performance inference engine (Rust + CUDA)
- OpenAI-compatible REST API (drop-in replacement)
- Interactive CLI for local inference and chat
- Web UI for browser-based interaction
- Knowledge distillation pipeline (teacher → student compression)
- Docker Compose deployment with GPU passthrough
Key differentiator: Unlike Ollama (which focuses on ease of use) or vLLM (which focuses on throughput), XandLLM focuses on customization and model compression - making it ideal for creating specialized, efficient models.
Why Another LLM Tool?
The LLM inference space has excellent options:
| Tool | Strength | Weakness |
|---|---|---|
| Ollama | Easy setup | Limited customization |
| vLLM | High throughput | Complex deployment |
| llama.cpp | Edge devices | Slower on GPUs |
| Text Generation Inference | Production features | Resource heavy |
XandLLM fills the gap for users who want:
- Full control over model architecture
- Knowledge distillation to create custom small models
- Modern Rust codebase (memory safety, performance)
- Simple deployment (single binary or Docker)
- OpenAI compatibility (existing tools work seamlessly)
Architecture & Features
Core Components
XandLLM Architecture
├── xandllm-core (Rust)
│ ├── Model loading (GGUF, Safetensors)
│ ├── Tokenization
│ ├── KV-cache management
│ └── CUDA kernels (GPU acceleration)
│
├── xandllm-api (Rust + Axum)
│ ├── OpenAI-compatible routes
│ ├── Streaming (SSE)
│ └── Health checks
│
├── xandllm-cli (Rust)
│ ├── serve (API server)
│ ├── run (single inference)
│ ├── chat (interactive)
│ ├── pull (model download)
│ └── distill (knowledge distillation)
│
├── xandllm-hub (Rust)
│ └── HuggingFace integration
│
└── Frontend (React + TypeScript)
└── Streaming chat UI
Key Features
Performance:
- GPU acceleration via CUDA (automatic CPU fallback)
- Efficient KV-cache management
- Streaming via Server-Sent Events
- Concurrent request handling
Compatibility:
- OpenAI API format (/v1/chat/completions, /v1/models)
- Multiple model formats (GGUF, Safetensors)
- Various architectures (LLaMA, Qwen, Gemma, Phi)
- Chat template auto-detection
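To make the chat-template idea concrete, here is a rough sketch of chatml-style prompt assembly. This is illustrative only: `apply_chatml` is a hypothetical helper, and XandLLM's internal template strings may differ in detail.

```python
def apply_chatml(messages):
    """Render a list of {"role", "content"} dicts into a chatml prompt."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    ]
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "\n".join(parts)

prompt = apply_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Auto-detection means you hand the engine plain role/content messages and it picks the right template (chatml, llama3, gemma, ...) for the loaded model.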
Knowledge Distillation:
- Compress teacher models into smaller students
- Fine-tune existing models on custom datasets
- Export to Safetensors or GGUF formats
Supported Models
Architectures
| Architecture | Formats | Chat Templates |
|---|---|---|
| LLaMA | GGUF, Safetensors | llama2, llama3 |
| Qwen2 | GGUF | chatml |
| Qwen3 | GGUF | chatml, chatml-thinking |
| Gemma3 | GGUF | gemma |
| Phi-3 | GGUF | phi3 |
| ChatML-compatible | GGUF | chatml |
Tested & Verified Models
| Model | Best For | Size |
|---|---|---|
| Qwen2.5-Coder-7B | Code generation | 7B |
| Qwen3-4B-Thinking | Reasoning tasks | 4B |
| Gemma-3-4b-it | General instruction | 4B |
| Llama-3.1-8B | General purpose | 8B |
Installation & Setup
Prerequisites
# Rust 1.76+ (install via rustup)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# (Optional) CUDA 12.x for GPU support
# Download from NVIDIA developer site
Build from Source
Linux (CPU-only):
git clone https://github.com/XandAI-project/XandLLM.git
cd XandLLM
bash scripts/build-linux.sh
Linux (with CUDA GPU):
bash scripts/build-linux-cuda.sh
Windows (with CUDA):
scripts\build-cuda.bat
Manual build:
# CPU-only
cargo install --path crates/xandllm-cli --locked
# With CUDA
cargo install --path crates/xandllm-cli --features cuda --locked
The xandllm binary will be available in your PATH.
Quick Start Guide
1. Pull a Model
XandLLM uses HuggingFace Hub for model management. Models auto-download on first use.
# Set HuggingFace token (for gated models)
export HUGGING_FACE_HUB_TOKEN=hf_...
# Pull a model (downloads and caches)
xandllm pull Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0
# List cached models
xandllm list
Popular models to try:
# Code generation
xandllm pull Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0
# Reasoning with thinking blocks
xandllm pull TeichAI/Qwen3-4B-Thinking-2507-Claude-4.5-Opus-High-Reasoning-Distill-GGUF
# Compact general purpose
xandllm pull unsloth/gemma-3-4b-it-GGUF:Q6_K
# Llama 3.1 (requires HF token for gated models)
xandllm pull meta-llama/Llama-3.1-8B-Instruct
2. Run Local Inference
Single prompt:
xandllm run \
--model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
--prompt "Explain quantum entanglement in one paragraph."
With performance stats:
xandllm run \
--model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
--prompt "Write a haiku about Rust programming." \
--stats
Output:
System:
  Memory: 8192 MB allocated
  Device: CUDA (NVIDIA GeForce RTX 3060)
  CUDA Compute: 8.6

Model:
  Architecture: llama
  Parameters: 7.24B
  Quantization: Q4_0

Performance:
  Tokens/second: 42.3
  Total time: 1.23s
  Generated: 52 tokens
Memory safety,
Zero-cost abstractions,
Fearless concurrency.
3. Interactive Chat
xandllm chat --model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF --gpu
Session example:
> help
Available commands:
/help, /h - Show this help
/quit, /q - Exit chat
/clear, /c - Clear conversation
/system <msg> - Set system prompt
/stats - Show performance stats
> /system You are a helpful coding assistant.
> Write a Python function to calculate fibonacci numbers.
def fibonacci(n):
    """Calculate fibonacci number at position n."""
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)

# Or more efficiently:
def fibonacci_iterative(n):
    if n <= 0:
        return 0
    a, b = 0, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return b
> /stats
Session statistics:
Total tokens: 1,247
Avg speed: 38.5 tok/sec
Peak memory: 4.2 GB
4. Start the API Server
This is where XandLLM shines - OpenAI-compatible API for integration with existing tools.
xandllm serve \
--model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
--port 11435 \
--gpu
Server output:
🚀 XandLLM Server starting...
Configuration:
Host: 0.0.0.0
Port: 11435
Model: Qwen/Qwen2.5-Coder-7B-Instruct-GGUF
Device: CUDA (NVIDIA GeForce RTX 3060)
Max batch size: 8
Max sequence length: 4096
Routes:
POST /v1/chat/completions
POST /v1/completions
GET /v1/models
GET /health
Server ready at http://0.0.0.0:11435
Press Ctrl+C to stop
Using the Web Frontend
XandLLM includes a React-based streaming chat interface.
Setup
cd frontend
# Install dependencies
pnpm install
# Configure API endpoint
# Edit .env.local:
VITE_API_URL=http://localhost:11435
# Start development server
pnpm dev
Access at: http://localhost:5173
Features
- Streaming responses - See tokens appear in real-time
- Model selection - Switch between loaded models
- Parameter tuning - Temperature, top_p, max_tokens
- Chat history - Save and revisit conversations
- System prompts - Customize behavior per session
- Export - Download conversations as Markdown
Using with the Server
1. Start the API server:
   xandllm serve --model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF --port 11435
2. In another terminal, start the frontend:
   cd frontend && pnpm dev
3. Open your browser to http://localhost:5173
4. The UI automatically connects to the API server and shows:
- Available models
- Connection status
- GPU/CPU indicator
API Usage Examples
Chat Completions
Request:
curl -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Rust'\''s ownership system?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
Response:
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1708000000,
  "model": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Rust's ownership system is a set of rules that the compiler checks at compile time..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 128,
    "total_tokens": 152
  }
}
Streaming (SSE)
Request:
curl -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
Response (Server-Sent Events):
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"},"index":0}]}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":" there"},"index":0}]}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"!"},"index":0}]}
data: [DONE]
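If you'd rather not pull in an SDK, the stream is straightforward to consume by hand. A minimal stdlib-only Python sketch, assuming the server is running as above (`parse_sse_line` and `stream_chat` are hypothetical helpers, not part of XandLLM):

```python
import json
import urllib.request

def parse_sse_line(line):
    """Extract the delta text from one 'data: ...' SSE line, or None."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

def stream_chat(base_url="http://localhost:11435"):
    """POST a streaming chat request and print tokens as they arrive."""
    body = {
        "model": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            text = parse_sse_line(raw.decode("utf-8").strip())
            if text:
                print(text, end="", flush=True)

# Parsing one of the chunks shown above:
sample = 'data: {"choices":[{"delta":{"content":"Hello"},"index":0}]}'
print(parse_sse_line(sample))  # Hello
```

With the server from the previous section running, `stream_chat()` prints the reply token by token.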
List Models
curl http://localhost:11435/v1/models
Response:
{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
      "object": "model",
      "created": 1708000000,
      "owned_by": "xandllm"
    }
  ]
}
Using with OpenAI SDK
Since XandLLM is OpenAI-compatible, existing tools work with just a base URL change:
Python:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="not-needed"  # XandLLM doesn't require auth
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
    messages=[
        {"role": "user", "content": "Explain lifetimes in Rust"}
    ],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
JavaScript/TypeScript:
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'http://localhost:11435/v1',
  apiKey: 'not-needed'
});

const stream = await openai.chat.completions.create({
  model: 'Qwen/Qwen2.5-Coder-7B-Instruct-GGUF',
  messages: [{ role: 'user', content: 'Hello!' }],
  stream: true
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
Knowledge Distillation (Unique Feature)
This is XandLLM's killer feature: compressing large teacher models into efficient student models.
What is Knowledge Distillation?
A technique to train a smaller "student" model to mimic a larger "teacher" model:
┌──────────────┐ Training Data ┌──────────────┐
│ Teacher │ ──────────────────────▶ │ Student │
│ (7B params)│ │ (1B params) │
│ (Slow) │ │ (4x Faster) │
└──────────────┘ └──────────────┘
Benefits:
- 4-7x faster inference
- 1/7th the memory usage
- Roughly 85-90% accuracy retention
- Run on cheaper hardware
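XandLLM's pipeline handles the training for you, but the underlying idea is easy to state: the student is trained to match the teacher's softened output distribution. A toy single-token sketch of the classic Hinton-style objective (illustrative math, not XandLLM's actual training code):

```python
import math

def softmax(logits, temperature=2.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions for one token."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1]
print(distill_loss(teacher, teacher))          # 0.0 -- perfect agreement
print(distill_loss(teacher, [0.1, 1.0, 2.0]))  # positive -- student disagrees
```

Minimizing this loss across many token positions pushes the small model toward the large model's behavior, which is why quality holds up far better than training the small model from scratch on the raw data alone.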
Create a Fresh Student Model
Train a new 1B parameter model from scratch:
# Prepare training data (JSONL format)
mkdir -p my_dataset
cat > my_dataset/train.jsonl << 'EOF'
{"prompt": "Explain recursion in programming.", "completion": "Recursion is when a function calls itself to solve smaller instances of the same problem..."}
{"prompt": "What is a closure?", "completion": "A closure is a function that remembers the environment in which it was created..."}
EOF
# Run distillation
xandllm distill \
--model-from Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0 \
--dataset ./my_dataset \
--model-to ./output/XandLM-1B \
--size 1b \
--epochs 3 \
--batch-size 4 \
--learning-rate 1e-4 \
--gpu
What happens:
- Teacher model (7B) generates responses to your dataset
- Student model (1B) learns to mimic those responses
- After training, student achieves ~85-90% of teacher quality
- Student is 7x smaller and 4x faster
Serve the distilled model:
xandllm serve --model ./output/XandLM-1B --port 11435
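Before committing to a multi-hour training run, it's worth sanity-checking the dataset. A small validator for the JSONL shape used above (`validate_jsonl` is a hypothetical helper; the `prompt`/`completion` field names follow the example dataset):

```python
import json
from pathlib import Path

def validate_jsonl(path):
    """Return a list of problems found in a prompt/completion JSONL file."""
    errors = []
    for i, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if not line.strip():
            continue  # allow blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {i}: invalid JSON ({exc})")
            continue
        for field in ("prompt", "completion"):
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                errors.append(f"line {i}: missing or empty '{field}'")
    return errors

# Example: the second record has an empty prompt and gets flagged.
Path("train_check.jsonl").write_text(
    '{"prompt": "What is a closure?", "completion": "A function that..."}\n'
    '{"prompt": "", "completion": "orphan answer"}\n'
)
print(validate_jsonl("train_check.jsonl"))
```

An empty list means the file is structurally sound; anything else tells you exactly which line to fix.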
Fine-tune an Existing Small Model
Instead of training from scratch, fine-tune a small base model:
xandllm distill \
--model-from Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0 \
--dataset ./my_dataset \
--model-to ./output/MyFineTuned-3B \
--student-base Qwen/Qwen2.5-3B-Instruct \
--epochs 5 \
--batch-size 2 \
--learning-rate 5e-5 \
--gpu
Options:
- --type safetensor (default) - HuggingFace format
- --type gguf - llama.cpp compatible (requires llama.cpp tools)
Convenience Scripts
Pre-configured distillation scripts for common setups:
# Distill 7B teacher → 1B student (fresh)
bash scripts/distill-1b.sh
# Distill 7B teacher → 3B student (fresh)
bash scripts/distill-3b.sh
# Distill 8B teacher → 7B student (fresh)
bash scripts/distill-7b.sh
# Fine-tune existing 1B model
bash scripts/distill-finetune-1b.sh
# Fine-tune existing 3B model
bash scripts/distill-finetune-3b.sh
# Fine-tune existing 7B model
bash scripts/distill-finetune-7b.sh
Override defaults:
bash scripts/distill-1b.sh \
--dataset ./my_data \
--output ./my-1b-model \
--no-gpu
Real-World Use Case: Code Assistant
Create a specialized coding model:
# 1. Collect code examples
cat > code_dataset/train.jsonl << 'EOF'
{"prompt": "Write a Python function to reverse a string.", "completion": "def reverse_string(s):\n return s[::-1]"}
{"prompt": "Explain async/await in JavaScript.", "completion": "Async/await is syntactic sugar over Promises..."}
{"prompt": "Create a Rust struct for a user.", "completion": "struct User {\n name: String,\n email: String,\n age: u32,\n}"}
EOF
# 2. Distill from Qwen2.5-Coder-7B to 3B student
xandllm distill \
--model-from Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0 \
--dataset ./code_dataset \
--model-to ./XandCoder-3B \
--student-base Qwen/Qwen2.5-3B-Instruct \
--epochs 5 \
--gpu
# 3. Serve your custom model
xandllm serve --model ./XandCoder-3B --port 11435
Result: A 3B parameter model specialized for code that runs at 60+ tokens/sec on an RTX 3060.
Docker Deployment
Quick Start with Docker Compose
# Clone repository
git clone https://github.com/XandAI-project/XandLLM.git
cd XandLLM
# Configure environment
cp .env .env.local
echo "HUGGING_FACE_HUB_TOKEN=hf_..." >> .env.local
# Build and start with GPU
sudo docker compose --env-file .env.local -f docker/docker-compose.yml up --build
Services started:
- API server: http://localhost:11435
- Web UI: http://localhost:5173
Testing the Docker Setup
# Health check
curl http://localhost:11435/health
# Response: {"status": "ok"}
# Chat completion
curl -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
    "messages": [{"role": "user", "content": "Hello from Docker!"}]
  }'
Configuration
Config File (config/default.toml)
[server]
host = "0.0.0.0"
port = 11435
request_timeout_secs = 120
[inference]
max_batch_size = 8
max_sequence_length = 4096
default_max_new_tokens = 512
temperature = 0.7
top_p = 0.9
[model]
cache_dir = "~/.cache/xandllm"
[device]
prefer_gpu = true
cuda_device_id = 0
Environment Variables
# Logging
export RUST_LOG=info # trace, debug, info, warn, error
# HuggingFace (for gated models)
export HUGGING_FACE_HUB_TOKEN=hf_...
# Server overrides
export XANDLLM_SERVER_PORT=11435
export XANDLLM_DEVICE_PREFER_GPU=true
export XANDLLM_MODEL_CACHE_DIR=/custom/cache/path
Comparison: XandLLM vs Alternatives
| Feature | XandLLM | Ollama | vLLM | llama.cpp |
|---|---|---|---|---|
| Language | Rust | Go | Python | C++ |
| Knowledge Distillation | ✅ Built-in | ❌ No | ❌ No | ❌ No |
| OpenAI API | ✅ Native | ✅ Native | ✅ Native | ⚠️ Via proxy |
| Web UI | ✅ Included | ⚠️ Community | ❌ No | ⚠️ Various |
| GPU Support | ✅ CUDA | ✅ CUDA | ✅ CUDA | ✅ CUDA/Metal/Vulkan |
| GGUF | ✅ Yes | ✅ Yes | ⚠️ Limited | ✅ Best |
| Safetensors | ✅ Yes | ❌ No | ✅ Yes | ⚠️ Via conversion |
| Deployment | Binary/Docker | Binary/Docker | Python/Docker | Binary |
| Memory Safety | ✅ Rust | ✅ Go | ❌ Python | ⚠️ C++ |
| Setup Complexity | Medium | Easy | Hard | Medium |
Use Cases
1. Personal AI Assistant
Setup for home use:
# Small, fast model for daily use
xandllm serve --model unsloth/gemma-3-4b-it-GGUF:Q6_K --port 11435
# Access via:
# - Web UI (localhost:5173)
# - API from any app
# - CLI chat
2. Code Development Assistant
# Code-specialized model
xandllm serve \
--model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
--port 11435 \
--gpu
# Integrate with IDE via OpenAI-compatible plugin
3. Custom Model Creation
Create specialized models for your domain:
# Distill from general to domain-specific
xandllm distill \
--model-from meta-llama/Llama-3.1-8B-Instruct \
--dataset ./medical_domain_data \
--model-to ./Medical-Assistant-3B \
--size 3b \
--gpu
# Serve specialized model
xandllm serve --model ./Medical-Assistant-3B
4. API Service for Apps
# Production deployment
xandllm serve \
--model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
--host 0.0.0.0 \
--port 11435 \
--gpu
# Load balanced behind nginx
# Multiple model instances
# Monitoring via /health
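For deployments like this, a deploy script can gate traffic on the /health endpoint shown in the route list. A stdlib-only readiness poller (the helper itself is illustrative, not part of XandLLM):

```python
import json
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url="http://localhost:11435",
                     timeout=120.0, interval=2.0):
    """Poll /health until the server reports ok, or give up after timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as r:
                if json.loads(r.read()).get("status") == "ok":
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # not up yet (connection refused, bad response, ...)
        time.sleep(interval)
    return False
```

Call `wait_until_ready()` before pointing nginx at a freshly started instance; it returns False if model loading takes longer than the timeout.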
Troubleshooting
CUDA Not Detected
Problem: GPU not being used despite CUDA being installed.
Solutions:
# Check CUDA availability
nvidia-smi
# Verify environment
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
# Rebuild with CUDA
cargo install --path crates/xandllm-cli --features cuda --locked
Out of Memory
Problem: Model doesn't fit in GPU memory.
Solutions:
# Use smaller quantization
xandllm pull Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0
# Or use CPU
xandllm serve --model MODEL --no-gpu
# Or use smaller model
xandllm serve --model unsloth/gemma-3-4b-it-GGUF:Q4_0
Model Loading Errors
Problem: "Failed to load model" or tokenizer errors.
Solutions:
# Clear cache and re-download
rm -rf ~/.cache/xandllm
xandllm pull MODEL
# Check model compatibility
xandllm list
Frontend Connection Issues
Problem: Web UI can't connect to API.
Solutions:
# Verify server is running
curl http://localhost:11435/health
# Check frontend env
cat frontend/.env.local
# Should be: VITE_API_URL=http://localhost:11435
# Check CORS (if accessing from different host)
# Server allows all origins by default
Roadmap
Upcoming Features
Mixture of Experts (MoE):
- Mixtral support
- Qwen-MoE
- DeepSeek-MoE
Additional Architectures:
- Phi (Microsoft)
- Falcon
- Mamba (state-space models)
- RWKV
Multi-Modal:
- Vision-Language (LLaVA, Qwen-VL)
- Audio models (Whisper)
Performance:
- Continuous batching
- PagedAttention
- Tensor/pipeline parallelism
- AWQ/GPTQ quantization
- LoRA/QLoRA adapters
API:
- Function calling
- JSON mode
- Logprobs
- Multiple choices
Conclusion
XandLLM brings together:
✅ High-performance Rust implementation
✅ OpenAI-compatible API
✅ Knowledge distillation (unique feature)
✅ Multiple model formats (GGUF, Safetensors)
✅ Easy deployment (binary or Docker)
✅ Complete toolchain (CLI, API, Web UI)
Get started in 5 minutes:
# 1. Build
git clone https://github.com/XandAI-project/XandLLM.git
cd XandLLM && bash scripts/build-linux.sh
# 2. Pull model
xandllm pull Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0
# 3. Serve
xandllm serve --model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF --port 11435
# 4. Chat
curl http://localhost:11435/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen/Qwen2.5-Coder-7B-Instruct-GGUF","messages":[{"role":"user","content":"Hello!"}]}'
Try knowledge distillation:
xandllm distill \
--model-from Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_0 \
--dataset ./my_data \
--model-to ./MyModel-1B \
--size 1b \
--gpu
Links:
- GitHub: github.com/XandAI-project/XandLLM
- XandAI CLI: github.com/XandAI-project/Xandai-CLI
- Contact: av.souza2018@gmail.com
Star the repository, try it out, and let me know what you think! 🦀🚀