Skip to main content

LLM Data Sources

LLM data sources connect hugr to AI model providers for text generation and chat completion. Three provider types are supported, each implementing the same unified GraphQL interface via the core.models runtime module.

Provider Types

llm-openai — OpenAI-Compatible

Covers: OpenAI, Azure OpenAI, Ollama, LM Studio, vLLM, Mistral, Qwen, LiteLLM, and any OpenAI-compatible endpoint.

mutation {
core {
insert_data_sources(data: {
name: "my_llm"
type: "llm-openai"
prefix: "my_llm"
path: "http://localhost:1234/v1/chat/completions?model=gemma-4&max_tokens=4096&timeout=120s"
}) { name }
}
}

llm-anthropic — Anthropic

mutation {
core {
insert_data_sources(data: {
name: "claude"
type: "llm-anthropic"
prefix: "claude"
path: "https://api.anthropic.com/v1/messages?model=claude-sonnet-4-20250514&api_key=${secret:ANTHROPIC_KEY}&max_tokens=4096"
}) { name }
}
}

llm-gemini — Google Gemini

mutation {
core {
insert_data_sources(data: {
name: "gemini"
type: "llm-gemini"
prefix: "gemini"
path: "https://generativelanguage.googleapis.com/v1beta?model=gemini-2.5-flash&api_key=${secret:GEMINI_KEY}&max_tokens=4096&thinking_budget=2048"
}) { name }
}
}

Gemini 2.5+ models with tool calling return a thought_signature that must be included when sending tool results back. See Tool Call Round-trip for details.

Path Parameters

ParameterRequiredDefaultDescription
modelyesModel identifier (e.g., gpt-4o, claude-sonnet-4-20250514, gemini-2.5-flash)
api_keynoAPI key. Supports ${secret:ENV_VAR} syntax
api_key_headernoAuthorization: BearerCustom auth header (OpenAI only)
max_tokensno4096Default max tokens per request
timeoutno60sRequest timeout
rpmnoMax requests per minute (0 = unlimited)
tpmnoMax tokens per minute (0 = unlimited)
thinking_budgetno0Maximum thinking tokens for extended reasoning (0 = disabled). Per-request thinking_budget argument is capped at this value
rate_storenoName of a redis data source for shared rate limit counters. If unset, counters are in-memory per-node

Configuration Examples

# OpenAI cloud
https://api.openai.com/v1/chat/completions?model=gpt-4o&api_key=${secret:OPENAI_API_KEY}

# Ollama local
http://localhost:11434/v1/chat/completions?model=mistral&timeout=120s

# LM Studio local
http://localhost:1234/v1/chat/completions?model=gemma-4&timeout=120s

# Azure OpenAI
https://myorg.openai.azure.com/openai/deployments/gpt4/chat/completions?model=gpt-4o&api_key=${secret:AZURE_KEY}&api_key_header=api-key

# vLLM / any OpenAI-compatible
http://gpu-server:8000/v1/chat/completions?model=llama3-70b&timeout=180s

Usage

LLM sources are accessed through the core.models module:

# Simple completion
{ function { core { models {
completion(model: "my_llm", prompt: "Explain GraphQL", max_tokens: 200) {
content finish_reason prompt_tokens completion_tokens latency_ms
}
} } } }

# Chat with tool calling
{ function { core { models {
chat_completion(
model: "claude",
messages: [
"{\"role\":\"system\",\"content\":\"You are a helpful assistant.\"}",
"{\"role\":\"user\",\"content\":\"What is the weather?\"}"
],
tools: ["{\"name\":\"get_weather\",\"description\":\"Get weather\",\"parameters\":{\"type\":\"object\",\"properties\":{\"city\":{\"type\":\"string\"}}}}"],
max_tokens: 200
) {
content finish_reason tool_calls
}
} } } }

Response Normalization

All providers return the same llm_result structure:

FieldDescription
contentGenerated text
modelModel that responded
finish_reasonstop, tool_use, or length
prompt_tokensInput token count
completion_tokensOutput token count
total_tokensTotal tokens
provideropenai, anthropic, or gemini
latency_msRequest latency
tool_callsJSON string: [{"id":"...","name":"...","arguments":{...}}]
thought_signatureGemini 2.5+: thought signature for tool call round-trip (see Tool Call Round-trip)

Rate Limiting

LLM sources support per-minute rate limits for both requests and tokens. Limits are enforced before making API calls.

In-Memory (Single Node)

# Limit to 100 requests/min and 100k tokens/min
http://api.openai.com/v1/chat/completions?model=gpt-4o&api_key=${secret:OPENAI_KEY}&rpm=100&tpm=100000

Counters are per-node only. Suitable for single-node deployments.

Shared (Redis)

For multi-node clusters, use a Redis data source for shared counters:

# First, register a Redis data source
# Then reference it in the LLM path:
http://api.openai.com/v1/chat/completions?model=gpt-4o&api_key=${secret:OPENAI_KEY}&rpm=100&tpm=100000&rate_store=redis

Redis keys auto-expire after 2 minutes. Key format: ratelimit:{source_name}:rpm:{minute_window}.

When rate limits are exceeded, requests return an error: rate limit exceeded: N requests per minute exceeded for source_name.

Streaming

All LLM providers support streaming completions via GraphQL subscriptions. When a completion or chat_completion is invoked as a subscription, the server opens a streaming connection to the provider and delivers tokens incrementally to the client using Server-Sent Events (SSE).

# Example URL with thinking enabled
https://api.anthropic.com/v1/messages?model=claude-sonnet-4-20250514&api_key=${secret:ANTHROPIC_KEY}&max_tokens=4096&thinking_budget=2048

See Streaming Completions in the AI Models documentation for the subscription query format and event types.

See Also

  • AI Models Module — full core.models API reference
  • Embeddings — vector embedding sources
  • Redis — Redis data source for key-value storage and shared counters