LLM Data Sources

LLM data sources connect hugr to AI model providers for text generation and chat completion. Three provider types are supported, each implementing the same unified GraphQL interface via the core.models runtime module.

Provider Types

`llm-openai` — OpenAI-Compatible

Covers: OpenAI, Azure OpenAI, Ollama, LM Studio, vLLM, Mistral, Qwen, LiteLLM, and any OpenAI-compatible endpoint.

mutation {
  core {
    insert_data_sources(data: {
      name: "my_llm"
      type: "llm-openai"
      prefix: "my_llm"
      path: "http://localhost:1234/v1/chat/completions?model=gemma-4&max_tokens=4096&timeout=120s"
    }) { name }
  }
}

`llm-anthropic` — Anthropic

mutation {
  core {
    insert_data_sources(data: {
      name: "claude"
      type: "llm-anthropic"
      prefix: "claude"
      path: "https://api.anthropic.com/v1/messages?model=claude-sonnet-4-20250514&api_key=${secret:ANTHROPIC_KEY}&max_tokens=4096"
    }) { name }
  }
}

`llm-gemini` — Google Gemini

mutation {
  core {
    insert_data_sources(data: {
      name: "gemini"
      type: "llm-gemini"
      prefix: "gemini"
      path: "https://generativelanguage.googleapis.com/v1beta?model=gemini-2.5-flash&api_key=${secret:GEMINI_KEY}&max_tokens=4096&thinking_budget=2048"
    }) { name }
  }
}

Gemini 2.5+ models with tool calling return a thought_signature that must be included when sending tool results back. See Tool Call Round-trip for details.

Path Parameters

Parameter	Required	Default	Description
`model`	yes	—	Model identifier (e.g., `gpt-4o`, `claude-sonnet-4-20250514`, `gemini-2.5-flash`)
`api_key`	no	—	API key. Supports `${secret:ENV_VAR}` syntax
`api_key_header`	no	`Authorization: Bearer`	Custom auth header (OpenAI only)
`max_tokens`	no	`4096`	Default max tokens per request
`timeout`	no	`60s`	Request timeout
`rpm`	no	—	Max requests per minute (0 = unlimited)
`tpm`	no	—	Max tokens per minute (0 = unlimited)
`thinking_budget`	no	`0`	Maximum thinking tokens for extended reasoning (0 = disabled). Per-request `thinking_budget` argument is capped at this value
`rate_store`	no	—	Name of a `redis` data source for shared rate limit counters. If unset, counters are in-memory per-node

Configuration Examples

# OpenAI cloud
https://api.openai.com/v1/chat/completions?model=gpt-4o&api_key=${secret:OPENAI_API_KEY}

# Ollama local
http://localhost:11434/v1/chat/completions?model=mistral&timeout=120s

# LM Studio local
http://localhost:1234/v1/chat/completions?model=gemma-4&timeout=120s

# Azure OpenAI
https://myorg.openai.azure.com/openai/deployments/gpt4/chat/completions?model=gpt-4o&api_key=${secret:AZURE_KEY}&api_key_header=api-key

# vLLM / any OpenAI-compatible
http://gpu-server:8000/v1/chat/completions?model=llama3-70b&timeout=180s

Usage

LLM sources are accessed through the core.models module:

# Simple completion
{ function { core { models {
  completion(model: "my_llm", prompt: "Explain GraphQL", max_tokens: 200) {
    content finish_reason prompt_tokens completion_tokens latency_ms
  }
} } } }

# Chat with tool calling
{ function { core { models {
  chat_completion(
    model: "claude",
    messages: [
      "{\"role\":\"system\",\"content\":\"You are a helpful assistant.\"}",
      "{\"role\":\"user\",\"content\":\"What is the weather?\"}"
    ],
    tools: ["{\"name\":\"get_weather\",\"description\":\"Get weather\",\"parameters\":{\"type\":\"object\",\"properties\":{\"city\":{\"type\":\"string\"}}}}"],
    max_tokens: 200
  ) {
    content finish_reason tool_calls
  }
} } } }

Response Normalization

All providers return the same llm_result structure:

Field	Description
`content`	Generated text
`model`	Model that responded
`finish_reason`	`stop`, `tool_use`, or `length`
`prompt_tokens`	Input token count
`completion_tokens`	Output token count
`total_tokens`	Total tokens
`provider`	`openai`, `anthropic`, or `gemini`
`latency_ms`	Request latency
`tool_calls`	JSON string: `[{"id":"...","name":"...","arguments":{...}}]`
`thought_signature`	Gemini 2.5+: thought signature for tool call round-trip (see Tool Call Round-trip)

Rate Limiting

LLM sources support per-minute rate limits for both requests and tokens. Limits are enforced before making API calls.

In-Memory (Single Node)

# Limit to 100 requests/min and 100k tokens/min
http://api.openai.com/v1/chat/completions?model=gpt-4o&api_key=${secret:OPENAI_KEY}&rpm=100&tpm=100000

Counters are per-node only. Suitable for single-node deployments.

Shared (Redis)

For multi-node clusters, use a Redis data source for shared counters:

# First, register a Redis data source
# Then reference it in the LLM path:
http://api.openai.com/v1/chat/completions?model=gpt-4o&api_key=${secret:OPENAI_KEY}&rpm=100&tpm=100000&rate_store=redis

Redis keys auto-expire after 2 minutes. Key format: ratelimit:{source_name}:rpm:{minute_window}.

When rate limits are exceeded, requests return an error: rate limit exceeded: N requests per minute exceeded for source_name.

Streaming

All LLM providers support streaming completions via GraphQL subscriptions. When a completion or chat_completion is invoked as a subscription, the server opens a streaming connection to the provider and delivers tokens incrementally to the client using Server-Sent Events (SSE).

# Example URL with thinking enabled
https://api.anthropic.com/v1/messages?model=claude-sonnet-4-20250514&api_key=${secret:ANTHROPIC_KEY}&max_tokens=4096&thinking_budget=2048

See Streaming Completions in the AI Models documentation for the subscription query format and event types.

Provider Types​

llm-openai — OpenAI-Compatible​

llm-anthropic — Anthropic​

llm-gemini — Google Gemini​

Path Parameters​

Configuration Examples​

Usage​

Response Normalization​

Rate Limiting​

In-Memory (Single Node)​

Shared (Redis)​

Streaming​

See Also​