Why Self-Host Your LLM?
Every call to OpenAI's API sends your data to someone else's servers. Every token you generate costs money you can't predict. Every rate limit slows your users down. Self-hosting solves all three problems.
- Privacy. Your prompts, documents, and user data never leave your server. Critical for healthcare, legal, finance, and any application handling PII. GDPR compliance becomes trivial when the data physically stays in your EU data center.
- Cost. At scale, self-hosted inference is 80–95% cheaper than API providers. A $199/mo GPU server can handle the same workload as $2,000–5,000/mo in OpenAI API calls.
- Control. Choose your model, quantization, context length, and system prompt. No content filters you didn't ask for. No model deprecations breaking your application overnight. No vendor deciding your model is "end-of-life."
- No rate limits. Your server, your throughput. Process 1,000 requests per minute if your hardware can handle it. No waiting, no 429 errors, no "please retry in 30 seconds."
- Latency. A GPU server in the same region as your application serves responses in 50–200ms for first token. No round-trip to a third-party API. No variable queue times.
Hardware Requirements for Popular Models
The key constraint is VRAM — the GPU's dedicated memory. A model must fit in VRAM to run at GPU speed. If it spills to system RAM, inference becomes 10–100× slower.
| Model | Parameters | VRAM (Q4) | VRAM (FP16) | Min GPU |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | ~5 GB | ~16 GB | RTX 4000 SFF (20 GB) |
| Mistral 7B v0.3 | 7B | ~4.5 GB | ~14 GB | RTX 4000 SFF (20 GB) |
| DeepSeek-R1 8B | 8B | ~5 GB | ~16 GB | RTX 4000 SFF (20 GB) |
| Llama 3.1 70B | 70B | ~40 GB | ~140 GB | RTX 6000 Ada (48 GB) |
| DeepSeek-V3 671B | 671B (37B active) | ~340 GB | ~1.3 TB | Multi-GPU cluster |
| Mixtral 8x7B | 46.7B (12.9B active) | ~26 GB | ~93 GB | RTX 6000 Ada (48 GB) |
| Qwen2.5 72B | 72B | ~42 GB | ~144 GB | RTX 6000 Ada (48 GB) |
Q4 = 4-bit GGUF quantization. FP16 = half precision. VRAM estimates include ~2 GB overhead for KV cache at 4K context. Note that MoE models like DeepSeek-V3 and Mixtral still need all expert weights resident in memory, even though only a fraction are active per token.
For most production use cases, Q4 quantization provides 95%+ of FP16 quality at a fraction of the VRAM cost. Unless you're running evals or need maximum accuracy, Q4 is the right choice.
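The table's figures follow a simple rule of thumb: weight bytes plus a fixed overhead for the KV cache. A rough sketch (real usage varies with quantization format, context length, and batch size, so treat this as an estimate, not a guarantee):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int, overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: model weights plus ~2 GB for KV cache at small context."""
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return weights_gb + overhead_gb

print(estimate_vram_gb(8, 4))    # -> 6.0  (Llama 3.1 8B at Q4)
print(estimate_vram_gb(70, 4))   # -> 37.0 (Llama 3.1 70B at Q4)
```

If the estimate lands close to your card's VRAM limit, leave headroom: longer contexts and more concurrent requests grow the KV cache well past the 2 GB assumed here.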
Step 1: Deploy a GPU Server on RAW
Go to rawhq.io/pricing and select the GPU tab. Choose your plan based on the model you want to run:
- RTX 4000 SFF ($199/mo) — 20 GB VRAM. Runs any 7B–13B model at Q4 or FP16.
- RTX 6000 Ada ($899/mo) — 48 GB VRAM. Runs 70B models at Q4 quantization.
- RTX PRO 6000 ($999/mo) — 96 GB VRAM. Runs 70B-class models at 8-bit quantization, or mid-size models at FP16 with long contexts and heavy batching.
GPU servers are dedicated hardware — your entire machine, not a shared VM. Provisioning takes 1–48 hours. You'll receive an email with your IP address and SSH credentials.
Once you have access:
# SSH into your server
$ ssh root@your-server-ip
# Verify GPU is available
$ nvidia-smi
# Expected output: your GPU model, driver version, CUDA version
# NVIDIA drivers come pre-installed on RAW GPU servers
Step 2: Install and Run Ollama
Ollama is the simplest way to run LLMs locally. It handles model downloads, quantization selection, and serves an API — all from a single binary.
# Install Ollama (one command)
$ curl -fsSL https://ollama.com/install.sh | sh
# Pull and run Llama 3.1 8B
$ ollama run llama3.1
# The model downloads (~4.7 GB) and starts immediately.
# Type a message and get a response.
>>> What is the capital of France?
The capital of France is Paris.
# Exit with /bye
To run Ollama as a background service and expose its API:
# Start the Ollama server on port 11434
# (skip this if the installer already registered Ollama as a systemd service)
$ ollama serve &
# Pull models you want available
$ ollama pull llama3.1
$ ollama pull mistral
$ ollama pull deepseek-r1:8b
# Test the API
$ curl http://localhost:11434/api/chat -d '{
"model": "llama3.1",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false
}'
Ollama is great for development, prototyping, and low-to-medium traffic workloads. For production with concurrent users, use vLLM.
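The same /api/chat call works from application code with nothing but the Python standard library. A minimal sketch, assuming Ollama is serving on its default port 11434 (the payload builder is split out so it can be tested without a running server):

```python
import json
import urllib.request

def build_chat_payload(model: str, user_message: str, stream: bool = False) -> dict:
    """Build the request body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": stream,
    }

def ollama_chat(model: str, user_message: str, host: str = "http://localhost:11434") -> str:
    """Send a non-streaming chat request to a running Ollama server and return the reply."""
    body = json.dumps(build_chat_payload(model, user_message)).encode()
    req = urllib.request.Request(
        f"{host}/api/chat", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With `stream` left at its default of false, Ollama returns one JSON object whose `message.content` field holds the full reply, which is what `ollama_chat` extracts.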
Step 3: Install and Run vLLM for Production Inference
vLLM is the gold standard for production LLM serving. Its key advantages over Ollama for production use:
- Continuous batching — handles multiple concurrent requests efficiently, dynamically batching them for maximum GPU utilization.
- PagedAttention — manages VRAM like an operating system manages RAM, enabling 2–4× more concurrent requests than naive serving.
- OpenAI-compatible API — drop-in replacement for any application using the OpenAI SDK. Change the base URL, done.
- Higher throughput — at 10+ concurrent users, vLLM can serve 2–5× more tokens/second than Ollama.
# Install vLLM
$ pip install vllm
# Llama models are gated on Hugging Face — accept the license on the model page
# and provide an access token before the first download
$ export HF_TOKEN=your-hf-token
# Serve Llama 3.1 8B with OpenAI-compatible API
$ vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
# The model downloads from Hugging Face and starts serving.
# API is available at http://your-server-ip:8000
Query it exactly like OpenAI:
# Chat completions (same format as OpenAI)
$ curl http://your-server-ip:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in 3 sentences."}
],
"max_tokens": 256,
"temperature": 0.7
}'
To use it from Python with the OpenAI SDK:
# Python — drop-in replacement for OpenAI
from openai import OpenAI
client = OpenAI(
base_url="http://your-server-ip:8000/v1",
api_key="not-needed" # vLLM doesn't require auth by default
)
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=256
)
print(response.choices[0].message.content)
To run vLLM as a persistent systemd service:
# Create systemd service file
$ cat > /etc/systemd/system/vllm.service << 'EOF'
[Unit]
Description=vLLM Inference Server
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 --port 8000 --max-model-len 8192
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
$ systemctl daemon-reload
$ systemctl enable --now vllm
Performance Benchmarks
Real-world throughput depends on model size, quantization, batch size, and context length. Here are representative numbers for single-user and concurrent workloads:
| Model | GPU | Engine | 1 User (tok/s) | 10 Users (tok/s total) |
|---|---|---|---|---|
| Llama 3.1 8B (Q4) | RTX 4000 SFF | Ollama | ~70 | ~90 |
| Llama 3.1 8B (Q4) | RTX 4000 SFF | vLLM | ~65 | ~250 |
| Mistral 7B (Q4) | RTX 4000 SFF | vLLM | ~75 | ~280 |
| Llama 3.1 70B (Q4) | RTX 6000 Ada | vLLM | ~25 | ~120 |
| Llama 3.1 8B (FP16) | RTX PRO 6000 | vLLM | ~100 | ~450 |
Benchmarks at 4K context length, 256 max output tokens. "10 Users" shows aggregate throughput with continuous batching in vLLM.
The key insight: with continuous batching, aggregate throughput keeps climbing as concurrent users are added. Per-user speed dips slightly, but total tokens/second rises. At 10 concurrent users, vLLM's ~250 tok/s is roughly 3× Ollama's ~90 tok/s on the same hardware. For production workloads, that is the difference between needing 1 server and needing 4.
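Why batching scales this way can be shown with a deliberately simplified throughput model (a toy illustration, not a benchmark): a sequential engine serves one request at a time, so its aggregate rate stays flat, while a batched engine lets each extra user contribute a fraction of full single-user speed. The 0.32 efficiency factor below is fitted to this article's numbers, not a property of vLLM:

```python
def sequential_throughput(single_user_tps: float, users: int) -> float:
    """One request at a time: aggregate throughput never exceeds single-user speed."""
    return single_user_tps  # independent of `users`

def batched_throughput(single_user_tps: float, users: int, efficiency: float = 0.32) -> float:
    """Continuous batching: each added user contributes a fraction of full speed."""
    return single_user_tps * (1 + efficiency * (users - 1))

# With the article's vLLM single-user figure (~65 tok/s on the RTX 4000 SFF):
print(round(batched_throughput(65, 10)))   # -> 252, close to the measured ~250
```

The real curve flattens once the GPU saturates, but the shape is the point: sequential serving is a horizontal line, batched serving keeps rising until hardware limits kick in.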
Cost Comparison: OpenAI API vs Self-Hosted
Let's do the math for a real scenario: a customer support chatbot handling 50,000 conversations per month, averaging 500 input tokens and 300 output tokens per conversation.
| Provider | Model | Monthly Cost | Cost per 1K Conversations |
|---|---|---|---|
| OpenAI | GPT-4o-mini | ~$45/mo | $0.90 |
| OpenAI | GPT-4o | ~$575/mo | $11.50 |
| Anthropic | Claude 3.5 Haiku | ~$60/mo | $1.20 |
| RAW (self-hosted) | Llama 3.1 8B | $199/mo (flat) | $3.98 |
At 50K conversations/month, GPT-4o-mini is cheaper. But here's where self-hosting wins:
- At 200K conversations/month: OpenAI GPT-4o-mini costs ~$180/mo. Self-hosted stays at $199/mo. Break-even.
- At 500K conversations/month: OpenAI GPT-4o-mini costs ~$450/mo. Self-hosted still $199/mo. You save $250/mo.
- At 1M conversations/month: OpenAI GPT-4o-mini costs ~$900/mo. Self-hosted still $199/mo. You save $700/mo.
The economics flip once you hit consistent volume. And the non-monetary benefits — privacy, zero rate limits, no vendor dependency — apply from day one.
For GPT-4o-class workloads: If you need 70B-class model quality, compare the RTX 6000 Ada ($899/mo) against GPT-4o ($575/mo at 50K conversations). At 200K+ conversations/month, GPT-4o costs $2,300/mo while self-hosted stays at $899/mo. The larger the volume, the bigger the savings.
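The break-even arithmetic above is easy to reproduce. A small sketch using the per-1K-conversation costs from this article's table ($0.90 for GPT-4o-mini, $11.50 for GPT-4o):

```python
def api_monthly_cost(conversations: int, cost_per_1k_conv: float) -> float:
    """Monthly API spend at a given conversation volume."""
    return conversations / 1000 * cost_per_1k_conv

def breakeven_volume(flat_monthly: float, cost_per_1k_conv: float) -> int:
    """Conversations/month at which a flat-rate server matches the API bill."""
    return round(flat_monthly / cost_per_1k_conv * 1000)

print(api_monthly_cost(200_000, 0.90))   # -> 180.0  (GPT-4o-mini at 200K conversations)
print(breakeven_volume(199, 0.90))       # -> 221111 (8B server vs GPT-4o-mini)
print(breakeven_volume(899, 11.50))      # -> 78174  (70B server vs GPT-4o)
```

Swap in your own token counts and current API pricing; the crossover point moves, but the structure of the comparison (linear API cost vs flat server cost) does not.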
Common Architectures
Here are three proven setups depending on your use case:
Chatbot / Customer Support
- vLLM serving Llama 3.1 8B on RTX 4000 SFF
- Nginx reverse proxy with SSL termination
- Your application talks to vLLM's OpenAI-compatible API
- Cost: $199/mo flat, handles 100K+ conversations/month
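The Nginx layer in this setup can be a single server block. A hypothetical sketch (the domain and certificate paths are placeholders to replace with your own; buffering is disabled so streamed tokens reach clients immediately):

```nginx
server {
    listen 443 ssl;
    server_name llm.example.com;                 # placeholder domain

    ssl_certificate     /etc/letsencrypt/live/llm.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.example.com/privkey.pem;

    location /v1/ {
        proxy_pass http://127.0.0.1:8000;        # vLLM's OpenAI-compatible API
        proxy_set_header Host $host;
        proxy_buffering off;                     # don't buffer streamed responses
        proxy_read_timeout 300s;                 # allow long generations
    }
}
```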
RAG Pipeline (Retrieval-Augmented Generation)
- vLLM serving Llama 3.1 8B for generation
- Nomic Embed or BGE for embedding generation (runs on same GPU)
- PostgreSQL + pgvector for vector storage (runs on same server)
- Cost: $199/mo for the complete stack
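The vector-storage half of this stack is a few lines of SQL. An illustrative sketch assuming 768-dimensional embeddings (the output size of Nomic Embed) and the pgvector extension:

```sql
-- One-time setup
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(768)        -- must match the embedding model's output size
);

-- Retrieval: 5 nearest chunks by cosine distance
-- ($1 is the query embedding, supplied as a parameter by your application)
SELECT content
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;
```

The retrieved chunks are then concatenated into the prompt sent to the vLLM endpoint, which is the whole RAG loop on one server.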
Multi-Model Serving
- RTX 6000 Ada with 48 GB VRAM
- vLLM serving Llama 3.1 8B (~5 GB) + Whisper (~3 GB) + embedding model (~0.5 GB)
- Still 40 GB VRAM headroom for batch processing or a second LLM
- Cost: $899/mo for a full AI infrastructure stack
Deploy Your First Self-Hosted LLM
Self-hosting an LLM isn't a weekend project anymore. It's one server, one install command, and a production-ready API. Full privacy, predictable costs, zero rate limits.
GPU servers starting at $199/mo. Pre-installed NVIDIA drivers. Full root access.
View GPU Plans →