Why Self-Host Your LLM?
Every call to OpenAI's API sends your data to someone else's servers. Every token you generate costs money you can't predict. Every rate limit slows your users down. Self-hosting solves all three problems.
- Privacy. Your prompts, documents, and user data never leave your server. Critical for healthcare, legal, finance, and any application handling PII. GDPR compliance becomes trivial when the data physically stays in your EU data center.
- Cost. At scale, self-hosted inference is 80–95% cheaper than API providers. A $199/mo GPU server can handle the same workload as $2,000–5,000/mo in OpenAI API calls.
- Control. Choose your model, quantization, context length, and system prompt. No content filters you didn't ask for. No model deprecations breaking your application overnight. No vendor deciding your model is "end-of-life."
- No rate limits. Your server, your throughput. Process 1,000 requests per minute if your hardware can handle it. No waiting, no 429 errors, no "please retry in 30 seconds."
- Latency. A GPU server in the same region as your application serves responses in 50–200ms for first token. No round-trip to a third-party API. No variable queue times.
Hardware Requirements for Popular Models
The key constraint is VRAM — the GPU's dedicated memory. A model must fit in VRAM to run at GPU speed. If it spills to system RAM, inference becomes 10–100× slower.
| Model | Parameters | VRAM (Q4) | VRAM (FP16) | Min GPU |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | ~5 GB | ~16 GB | RTX 4000 SFF (20 GB) |
| Mistral 7B v0.3 | 7B | ~4.5 GB | ~14 GB | RTX 4000 SFF (20 GB) |
| DeepSeek-R1 8B | 8B | ~5 GB | ~16 GB | RTX 4000 SFF (20 GB) |
| Llama 3.1 70B | 70B | ~40 GB | ~140 GB | RTX 6000 Ada (48 GB) |
| DeepSeek-V3 671B | 671B (37B active) | ~340 GB | ~1.3 TB | Multi-GPU cluster |
| Mixtral 8x7B | 46.7B (12.9B active) | ~26 GB | ~93 GB | RTX 6000 Ada (48 GB) |
| Qwen2.5 72B | 72B | ~42 GB | ~144 GB | RTX 6000 Ada (48 GB) |
Q4 = 4-bit GGUF quantization. FP16 = half precision. VRAM estimates include ~2 GB overhead for KV cache at 4K context. Note that MoE models like DeepSeek-V3 and Mixtral still need all expert weights resident in memory, even though only a fraction are active per token.
For most production use cases, Q4 quantization provides 95%+ of FP16 quality at a fraction of the VRAM cost. Unless you're running evals or need maximum accuracy, Q4 is the right choice.
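The table's figures follow a simple rule of thumb: weight bytes plus a fixed overhead for the KV cache. A rough sketch (real usage varies with quantization format, context length, and batch size, so treat this as an estimate, not a guarantee):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int, overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: model weights plus ~2 GB for KV cache at small context."""
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return weights_gb + overhead_gb

print(estimate_vram_gb(8, 4))    # -> 6.0  (Llama 3.1 8B at Q4)
print(estimate_vram_gb(70, 4))   # -> 37.0 (Llama 3.1 70B at Q4)
```

If the estimate lands close to your card's VRAM limit, leave headroom: longer contexts and more concurrent requests grow the KV cache well past the 2 GB assumed here.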
Step 1: Deploy a GPU Server on RAW
Go to rawhq.io/pricing and select the GPU tab. Choose your plan based on the model you want to run:
- RTX 4000 SFF ($199/mo) — 20 GB VRAM. Runs any 7B–13B model at Q4 or FP16.
- RTX 6000 Ada ($899/mo) — 48 GB VRAM. Runs 70B models at Q4 quantization.
- RTX PRO 6000 ($999/mo) — 96 GB VRAM. Runs 70B-class models at 8-bit quantization, or mid-size models at FP16 with long contexts and heavy batching.
GPU servers are dedicated hardware — your entire machine, not a shared VM. Provisioning takes 1–48 hours. You'll receive an email with your IP address and SSH credentials.
Once you have access:
# SSH into your server
$ ssh root@your-server-ip
# Verify GPU is available
$ nvidia-smi
# Expected output: your GPU model, driver version, CUDA version
# NVIDIA drivers come pre-installed on RAW GPU servers
Step 2: Install and Run Ollama
Ollama is the simplest way to run LLMs locally. It handles model downloads, quantization selection, and serves an API — all from a single binary.
# Install Ollama (one command)
$ curl -fsSL https://ollama.com/install.sh | sh
# Pull and run Llama 3.1 8B
$ ollama run llama3.1
# The model downloads (~4.7 GB) and starts immediately.
# Type a message and get a response.
>>> What is the capital of France?
The capital of France is Paris.
# Exit with /bye
To run Ollama as a background service and expose its API:
# Start the Ollama server on port 11434
# (skip this if the installer already registered Ollama as a systemd service)
$ ollama serve &
# Pull models you want available
$ ollama pull llama3.1
$ ollama pull mistral
$ ollama pull deepseek-r1:8b
# Test the API
$ curl http://localhost:11434/api/chat -d '{
"model": "llama3.1",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false
}'
Ollama is great for development, prototyping, and low-to-medium traffic workloads. For production with concurrent users, use vLLM.
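The same /api/chat call works from application code with nothing but the Python standard library. A minimal sketch, assuming Ollama is serving on its default port 11434 (the payload builder is split out so it can be tested without a running server):

```python
import json
import urllib.request

def build_chat_payload(model: str, user_message: str, stream: bool = False) -> dict:
    """Build the request body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": stream,
    }

def ollama_chat(model: str, user_message: str, host: str = "http://localhost:11434") -> str:
    """Send a non-streaming chat request to a running Ollama server and return the reply."""
    body = json.dumps(build_chat_payload(model, user_message)).encode()
    req = urllib.request.Request(
        f"{host}/api/chat", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With `stream` left at its default of false, Ollama returns one JSON object whose `message.content` field holds the full reply, which is what `ollama_chat` extracts.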
Step 3: Install and Run vLLM for Production Inference
vLLM is the gold standard for production LLM serving. Its key advantages over Ollama for production use:
- Continuous batching — handles multiple concurrent requests efficiently, dynamically batching them for maximum GPU utilization.
- PagedAttention — manages VRAM like an operating system manages RAM, enabling 2–4× more concurrent requests than naive serving.
- OpenAI-compatible API — drop-in replacement for any application using the OpenAI SDK. Change the base URL, done.
- Higher throughput — at 10+ concurrent users, vLLM can serve 2–5× more tokens/second than Ollama.
# Install vLLM
$ pip install vllm
# Llama models are gated on Hugging Face — accept the license on the model page
# and provide an access token before the first download
$ export HF_TOKEN=your-hf-token
# Serve Llama 3.1 8B with OpenAI-compatible API
$ vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
# The model downloads from Hugging Face and starts serving.
# API is available at http://your-server-ip:8000
Query it exactly like OpenAI:
# Chat completions (same format as OpenAI)
$ curl http://your-server-ip:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in 3 sentences."}
],
"max_tokens": 256,
"temperature": 0.7
}'
To use it from Python with the OpenAI SDK:
# Python — drop-in replacement for OpenAI
from openai import OpenAI
client = OpenAI(
base_url="http://your-server-ip:8000/v1",
api_key="not-needed" # vLLM doesn't require auth by default
)
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=256
)
print(response.choices[0].message.content)
To run vLLM as a persistent systemd service:
# Create systemd service file
$ cat > /etc/systemd/system/vllm.service << 'EOF'
[Unit]
Description=vLLM Inference Server
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 --port 8000 --max-model-len 8192
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
$ systemctl daemon-reload
$ systemctl enable --now vllm
Performance Benchmarks
Real-world throughput depends on model size, quantization, batch size, and context length. Here are representative numbers for single-user and concurrent workloads:
| Model | GPU | Engine | 1 User (tok/s) | 10 Users (tok/s total) |
|---|---|---|---|---|
| Llama 3.1 8B (Q4) | RTX 4000 SFF | Ollama | ~70 | ~90 |
| Llama 3.1 8B (Q4) | RTX 4000 SFF | vLLM | ~65 | ~250 |
| Mistral 7B (Q4) | RTX 4000 SFF | vLLM | ~75 | ~280 |
| Llama 3.1 70B (Q4) | RTX 6000 Ada | vLLM | ~25 | ~120 |
| Llama 3.1 8B (FP16) | RTX PRO 6000 | vLLM | ~100 | ~450 |
Benchmarks at 4K context length, 256 max output tokens. "10 Users" shows aggregate throughput with continuous batching in vLLM.
The key insight: with continuous batching, aggregate throughput keeps climbing as concurrent users are added. Per-user speed dips slightly, but total tokens/second rises. At 10 concurrent users, vLLM's ~250 tok/s is roughly 3× Ollama's ~90 tok/s on the same hardware. For production workloads, that is the difference between needing 1 server and needing 4.
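Why batching scales this way can be shown with a deliberately simplified throughput model (a toy illustration, not a benchmark): a sequential engine serves one request at a time, so its aggregate rate stays flat, while a batched engine lets each extra user contribute a fraction of full single-user speed. The 0.32 efficiency factor below is fitted to this article's numbers, not a property of vLLM:

```python
def sequential_throughput(single_user_tps: float, users: int) -> float:
    """One request at a time: aggregate throughput never exceeds single-user speed."""
    return single_user_tps  # independent of `users`

def batched_throughput(single_user_tps: float, users: int, efficiency: float = 0.32) -> float:
    """Continuous batching: each added user contributes a fraction of full speed."""
    return single_user_tps * (1 + efficiency * (users - 1))

# With the article's vLLM single-user figure (~65 tok/s on the RTX 4000 SFF):
print(round(batched_throughput(65, 10)))   # -> 252, close to the measured ~250
```

The real curve flattens once the GPU saturates, but the shape is the point: sequential serving is a horizontal line, batched serving keeps rising until hardware limits kick in.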
Cost Comparison: OpenAI API vs Self-Hosted
Let's do the math for a real scenario: a customer support chatbot handling 50,000 conversations per month, averaging 500 input tokens and 300 output tokens per conversation.
| Provider | Model | Monthly Cost | Cost per 1K Conversations |
|---|---|---|---|
| OpenAI | GPT-4o-mini | ~$45/mo | $0.90 |
| OpenAI | GPT-4o | ~$575/mo | $11.50 |
| Anthropic | Claude 3.5 Haiku | ~$60/mo | $1.20 |
| RAW (self-hosted) | Llama 3.1 8B | $199/mo (flat) | $3.98 |
At 50K conversations/month, GPT-4o-mini is cheaper. But here's where self-hosting wins:
- At 200K conversations/month: OpenAI GPT-4o-mini costs ~$180/mo. Self-hosted stays at $199/mo. Break-even.
- At 500K conversations/month: OpenAI GPT-4o-mini costs ~$450/mo. Self-hosted still $199/mo. You save $250/mo.
- At 1M conversations/month: OpenAI GPT-4o-mini costs ~$900/mo. Self-hosted still $199/mo. You save $700/mo.
The economics flip once you hit consistent volume. And the non-monetary benefits — privacy, zero rate limits, no vendor dependency — apply from day one.
For GPT-4o-class workloads: If you need 70B-class model quality, compare the RTX 6000 Ada ($899/mo) against GPT-4o ($575/mo at 50K conversations). At 200K+ conversations/month, GPT-4o costs $2,300/mo while self-hosted stays at $899/mo. The larger the volume, the bigger the savings.
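The break-even arithmetic above is easy to reproduce. A small sketch using the per-1K-conversation costs from this article's table ($0.90 for GPT-4o-mini, $11.50 for GPT-4o):

```python
def api_monthly_cost(conversations: int, cost_per_1k_conv: float) -> float:
    """Monthly API spend at a given conversation volume."""
    return conversations / 1000 * cost_per_1k_conv

def breakeven_volume(flat_monthly: float, cost_per_1k_conv: float) -> int:
    """Conversations/month at which a flat-rate server matches the API bill."""
    return round(flat_monthly / cost_per_1k_conv * 1000)

print(api_monthly_cost(200_000, 0.90))   # -> 180.0  (GPT-4o-mini at 200K conversations)
print(breakeven_volume(199, 0.90))       # -> 221111 (8B server vs GPT-4o-mini)
print(breakeven_volume(899, 11.50))      # -> 78174  (70B server vs GPT-4o)
```

Swap in your own token counts and current API pricing; the crossover point moves, but the structure of the comparison (linear API cost vs flat server cost) does not.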
Common Architectures
Here are three proven setups depending on your use case:
Chatbot / Customer Support
- vLLM serving Llama 3.1 8B on RTX 4000 SFF
- Nginx reverse proxy with SSL termination
- Your application talks to vLLM's OpenAI-compatible API
- Cost: $199/mo flat, handles 100K+ conversations/month
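The Nginx layer in this setup can be a single server block. A hypothetical sketch (the domain and certificate paths are placeholders to replace with your own; buffering is disabled so streamed tokens reach clients immediately):

```nginx
server {
    listen 443 ssl;
    server_name llm.example.com;                 # placeholder domain

    ssl_certificate     /etc/letsencrypt/live/llm.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.example.com/privkey.pem;

    location /v1/ {
        proxy_pass http://127.0.0.1:8000;        # vLLM's OpenAI-compatible API
        proxy_set_header Host $host;
        proxy_buffering off;                     # don't buffer streamed responses
        proxy_read_timeout 300s;                 # allow long generations
    }
}
```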
RAG Pipeline (Retrieval-Augmented Generation)
- vLLM serving Llama 3.1 8B for generation
- Nomic Embed or BGE for embedding generation (runs on same GPU)
- PostgreSQL + pgvector for vector storage (runs on same server)
- Cost: $199/mo for the complete stack
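The vector-storage half of this stack is a few lines of SQL. An illustrative sketch assuming 768-dimensional embeddings (the output size of Nomic Embed) and the pgvector extension:

```sql
-- One-time setup
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(768)        -- must match the embedding model's output size
);

-- Retrieval: 5 nearest chunks by cosine distance
-- ($1 is the query embedding, supplied as a parameter by your application)
SELECT content
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;
```

The retrieved chunks are then concatenated into the prompt sent to the vLLM endpoint, which is the whole RAG loop on one server.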
Multi-Model Serving
- RTX 6000 Ada with 48 GB VRAM
- vLLM serving Llama 3.1 8B (~5 GB) + Whisper (~3 GB) + embedding model (~0.5 GB)
- Still 40 GB VRAM headroom for batch processing or a second LLM
- Cost: $899/mo for a full AI infrastructure stack
Deploy Your First Self-Hosted LLM
Self-hosting an LLM isn't a weekend project anymore. It's one server, one install command, and a production-ready API. Full privacy, predictable costs, zero rate limits.
GPU servers starting at $199/mo. Pre-installed NVIDIA drivers. Full root access.
View GPU Plans →