Private LLM API Reference
OpenAI-compatible. Drop-in replacement for GPT-4o. Your data stays on our hardware.
Base URL
https://llm.ro2-labs.ai

Authentication
All authenticated endpoints require a Bearer token:
Authorization: Bearer ro2_...

Request a free key to get started. No credit card required.
Endpoints
/health                 Public          API status and uptime
/v1/models              Auth required   List available models
/v1/chat/completions    Auth required   Chat completion (OpenAI-compatible)
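A quick way to verify connectivity and a key before wiring up an application is to hit /health anonymously and then list models with the SDK. A minimal sketch; the shape of the /health response body is not specified above, so it is printed raw:

import requests
from openai import OpenAI

BASE = "https://llm.ro2-labs.ai"

# /health is public -- no token needed. The body format isn't
# documented here, so just print the raw text.
print(requests.get(f"{BASE}/health", timeout=10).text)

# /v1/models requires a Bearer token; the OpenAI SDK sends it for us.
client = OpenAI(base_url=f"{BASE}/v1", api_key="ro2_...")
for model in client.models.list():
    print(model.id)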
Quick Start

curl https://llm.ro2-labs.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ro2_..." \
  -d '{
    "model": "llama3:70b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Summarize GLBA Section 501(b)."}
    ],
    "temperature": 0.7,
    "max_tokens": 2048
  }'

OpenAI SDK (Python)
from openai import OpenAI
client = OpenAI(
    base_url="https://llm.ro2-labs.ai/v1",
    api_key="ro2_...",
)

response = client.chat.completions.create(
    model="llama3:70b",
    messages=[
        {"role": "user", "content": "Summarize ITAR 22 CFR 120."}
    ],
)

print(response.choices[0].message.content)
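Streaming is listed as supported under the Free tier below. Assuming the OpenAI-compatibility claim extends to streaming semantics (a reasonable but unconfirmed assumption), the standard stream=True pattern applies to the same call:

from openai import OpenAI

client = OpenAI(base_url="https://llm.ro2-labs.ai/v1", api_key="ro2_...")

# stream=True yields chunks as tokens are generated, instead of one
# final response object.
stream = client.chat.completions.create(
    model="llama3:70b",
    messages=[{"role": "user", "content": "Summarize ITAR 22 CFR 120."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()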
Response Format

Responses include data residency fields in the JSON body confirming on-prem processing:
{
  "id": "chatcmpl-ro2-a7f3c2b1",
  "model": "llama3:70b",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "..."
    },
    "finish_reason": "stop"
  }],
  "x-ro2-data-residency": "on-prem-austin-tx",
  "x-ro2-third-party-routing": "none"
}
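If compliance tooling needs to assert on these fields, a plain HTTP call makes them easy to inspect. A sketch assuming the x-ro2-* fields arrive in the JSON body exactly as shown above (not as HTTP response headers):

import requests

resp = requests.post(
    "https://llm.ro2-labs.ai/v1/chat/completions",
    headers={"Authorization": "Bearer ro2_..."},
    json={
        "model": "llama3:70b",
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=60,
)
body = resp.json()

# Fail loudly if the response doesn't carry the expected residency markers.
assert body.get("x-ro2-third-party-routing") == "none"
print(body["x-ro2-data-residency"])  # e.g. "on-prem-austin-tx"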
Pricing

Free
10 req/hr · 100 req/mo
- Full 70B model access
- Streaming + batch
- Data residency headers
Pro
300 req/hr · 15,000 req/mo
- Everything in Free
- Priority inference queue
- Email support
B2B Pilot
Tailored to your workload
- Everything in Pro
- Dedicated capacity windows
- Pilot agreement + SLA
- Direct Slack/email support
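The per-hour caps above mean clients should expect throttling. Assuming the service returns standard HTTP 429 responses when a cap is hit (not stated above), the OpenAI SDK's built-in exponential-backoff retry covers it without extra code:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.ro2-labs.ai/v1",
    api_key="ro2_...",
    max_retries=5,  # default is 2; retries 429s and transient errors
)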
Infrastructure Controls
Our architecture provides properties that support teams operating under regulatory constraints. Compliance is a shared responsibility between provider and customer.
- Single-tenant, dedicated hardware under U.S. jurisdiction; no shared GPU pools.
- No foreign-national data access and no offshore sub-processors.
- Data never leaves the inference environment; no third-party processors or sub-processors in the request path.