Hugr IPC Protocol
The Hugr IPC (Inter-Process Communication) protocol is a custom HTTP-based protocol designed for efficient transfer of large datasets between the hugr server and clients. It uses HTTP Multipart format and Apache Arrow for high-performance data serialization, making it ideal for analytics and machine learning pipelines.
Overview
What is Hugr IPC?
Hugr IPC is a specialized protocol that:
- Uses HTTP Multipart (`multipart/mixed`) for structured data transfer
- Employs Apache Arrow IPC format for efficient binary serialization
- Supports streaming of large datasets via WebSocket
- Provides Python-compatible output (`pandas.DataFrame` and `geopandas.GeoDataFrame`)
- Enables efficient transfer of tabular and geospatial data
Why Use a Separate Protocol?
While hugr provides a standard GraphQL API over HTTP with JSON responses, the IPC protocol offers significant advantages:
Performance Benefits:
- 5-10x faster serialization/deserialization compared to JSON
- Zero-copy reads with Apache Arrow format
- Lower memory footprint for large datasets
- Efficient streaming for datasets that don't fit in memory
Type Preservation:
- Native support for complex types (timestamps, decimals, nested structures)
- Precise numeric types without JSON float precision issues
- Direct geometry encoding (WKB for top-level, GeoJSON for nested)
Integration:
- Direct compatibility with pandas and geopandas
- Seamless integration with Apache Arrow ecosystem
- Efficient data pipelines for analytics and ML
When to Use Hugr IPC
Use Hugr IPC when:
- Working with large datasets (>1000 rows)
- Building data science/ML pipelines in Python
- Processing geospatial data
- Streaming data in real-time
- Performance is critical
Use GraphQL JSON API when:
- Working with small datasets
- Building web applications (JavaScript/TypeScript)
- Need human-readable responses
- Rapid prototyping and debugging
Protocol Architecture
HTTP Multipart Format
The protocol uses HTTP Multipart (multipart/mixed) to combine multiple response parts in a single HTTP response. Each part contains:
- Headers - Metadata about the part (type, format, path)
- Body - Actual data (Arrow IPC stream or JSON)
This structure allows the server to:
- Return data and metadata together
- Stream multiple result sets efficiently
- Include extensions and error information
Apache Arrow Data Format
Apache Arrow is a columnar memory format designed for efficient analytics:
Key Features:
- Columnar storage - Optimized for analytical queries
- Zero-copy reads - Direct memory access without deserialization
- Language-agnostic - Works with Python, R, Java, C++, JavaScript
- Efficient compression - Built-in compression support
- Type-rich - Preserves data types precisely
Arrow IPC Stream Format:
- Self-describing binary format
- Includes schema information
- Supports streaming (multiple RecordBatch messages)
- Efficient for network transfer
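As an illustration of the stream being self-describing, here is a minimal PyArrow round trip (generic Arrow usage, not Hugr-specific): the reader recovers the schema directly from the bytes, with no out-of-band type information.

```python
import pyarrow as pa

# Build a small batch and serialize it as an Arrow IPC stream.
batch = pa.record_batch(
    [pa.array([1, 2, 3], pa.int64()), pa.array(["a", "b", "c"])],
    names=["id", "name"],
)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)  # a stream may carry many batches

# The reader reconstructs the schema from the stream itself.
reader = pa.ipc.open_stream(sink.getvalue())
print(reader.schema)
print(reader.read_all().num_rows)  # -> 3
```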
Request Format
HTTP Method and Endpoint
Endpoint: /ipc or /graphql (server-dependent)
Method: POST
Request Headers
Required Headers
POST /ipc HTTP/1.1
Host: hugr-server:15001
Content-Type: application/json
Optional Authentication Headers
Authorization: Bearer <token>
X-API-Key: <api-key>
X-User-Role: <role-name>
Header names can be customized via server configuration.
Request Body
The request body is a JSON object containing the GraphQL query:
{
"query": "query($limit: BigInt) { users(limit: $limit) { id name email } }",
"variables": {
"limit": 100
},
"operationName": "GetUsers"
}
Fields:
- query (string, required) - GraphQL query text
- variables (object, optional) - Query variables
- operationName (string, optional) - Operation name for named queries
Complete Request Example
curl -X POST http://localhost:15001/ipc \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_TOKEN" \
-d '{
"query": "query { devices { id name status geom } }",
"variables": {}
}'
Response Format
Response Headers
HTTP/1.1 200 OK
Content-Type: multipart/mixed; boundary=HUGR
Transfer-Encoding: chunked
The boundary=HUGR parameter defines the delimiter between multipart parts.
Multipart Structure
A typical response consists of multiple parts separated by the boundary marker:
--HUGR
<Part 1 Headers>
<Part 1 Body>
--HUGR
<Part 2 Headers>
<Part 2 Body>
--HUGR--
The final boundary (--HUGR--) marks the end of the response.
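To make the framing concrete, the sketch below splits a synthetic body (hand-made here, not actual server output) on the boundary. Real clients should use a proper multipart parser, as in the Direct HTTP Requests section below.

```python
# Hand-rolled sketch of the multipart framing with a synthetic body.
raw = (b"--HUGR\r\n"
       b"Content-Type: application/json\r\n"
       b"X-Hugr-Part-Type: extensions\r\n"
       b"X-Hugr-Format: object\r\n"
       b"\r\n"
       b'{"timing": {"query_ms": 12}}\r\n'
       b"--HUGR--\r\n")

for part in raw.split(b"--HUGR"):
    part = part.strip(b"\r\n")
    if not part or part == b"--":  # preamble or closing delimiter
        continue
    header_block, _, body = part.partition(b"\r\n\r\n")
    headers = dict(line.split(b": ", 1) for line in header_block.split(b"\r\n"))
    print(headers[b"X-Hugr-Part-Type"], body)
```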
Part Headers
Each part includes custom headers that describe its content:
X-Hugr-Part-Type
Identifies the type of content in this part:
- `data` - Query result data
- `extensions` - GraphQL extensions
- `error` - Error information
X-Hugr-Part-Type: data
X-Hugr-Format
Specifies the encoding format:
- `table` - Arrow IPC RecordBatch (for tabular data)
- `object` - JSON (for scalar/object results)
X-Hugr-Format: table
X-Hugr-Path
The GraphQL query path for this data:
X-Hugr-Path: data.users
For nested data, the path uses dot notation:
X-Hugr-Path: data.companies.departments.employees
X-Hugr-Chunk (Optional)
For streaming responses, identifies the chunk number:
X-Hugr-Chunk: 0
X-Hugr-Empty (Optional)
Boolean flag indicating if a table part is empty:
X-Hugr-Empty: true
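Together, these headers let a client route each part without inspecting the body first. A plausible dispatch routine might look like the sketch below, where the `headers` dict stands in for whatever your multipart parser produces (it is an assumption, not part of the protocol):

```python
import io
import json
import pyarrow as pa

def handle_part(headers: dict, body: bytes):
    """Sketch: route one multipart part based on its X-Hugr-* headers."""
    part_type = headers.get("X-Hugr-Part-Type")
    fmt = headers.get("X-Hugr-Format")
    path = headers.get("X-Hugr-Path", "")

    if part_type == "error":
        raise RuntimeError(json.loads(body)["error"])
    if part_type == "data" and fmt == "table":
        if headers.get("X-Hugr-Empty") == "true":
            return path, None  # empty table: no Arrow body to decode
        return path, pa.ipc.open_stream(io.BytesIO(body)).read_all()
    # extensions and scalar/object results arrive as JSON
    return path, json.loads(body)
```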
Complete Response Example
HTTP/1.1 200 OK
Content-Type: multipart/mixed; boundary=HUGR
Transfer-Encoding: chunked
--HUGR
Content-Type: application/vnd.apache.arrow.stream
X-Hugr-Part-Type: data
X-Hugr-Path: data.devices
X-Hugr-Format: table
<Binary Arrow IPC RecordBatch data>
--HUGR
Content-Type: application/json
X-Hugr-Part-Type: extensions
X-Hugr-Path: extensions
X-Hugr-Format: object
{"timing": {"query_ms": 45, "total_ms": 52}}
--HUGR--
Error Response
When an error occurs, the response includes an error part:
HTTP/1.1 400 Bad Request
Content-Type: multipart/mixed; boundary=HUGR
--HUGR
Content-Type: application/json
X-Hugr-Part-Type: error
X-Hugr-Format: object
{"error": "Invalid query syntax: unexpected token at line 2"}
--HUGR--
Alternatively, simple errors may return plain JSON:
HTTP/1.1 500 Internal Server Error
Content-Type: application/json
{"error": "Internal server error"}
Data Types
GraphQL to Arrow Type Mapping
DuckDB exports data using its Arrow C-API. The following table shows how Hugr GraphQL types map to Arrow types via DuckDB:
| GraphQL Type | DuckDB Type | Arrow Type | Notes |
|---|---|---|---|
| ID | BIGINT or VARCHAR | Int64 or Utf8 | Depends on actual value type |
| String | VARCHAR | Utf8 | UTF-8 encoded strings |
| Int | INTEGER | Int32 | 32-bit signed integer |
| BigInt | BIGINT | Int64 | 64-bit signed integer |
| Float | DOUBLE | Float64 | 64-bit floating point |
| Boolean | BOOLEAN | Boolean | True/false values |
| Timestamp | TIMESTAMP | Timestamp(Microsecond, None) | Microsecond precision, no timezone |
| Date | DATE | Date32 | Days since epoch (1970-01-01) |
| Time | TIME | Time64(Microsecond) | Microseconds since midnight |
| Interval | INTERVAL | Interval(MonthDayNano) | Months, days, and nanoseconds components |
| JSON | VARCHAR | Utf8 | JSON as string |
| H3Cell | UBIGINT | UInt64 | H3 cell index as unsigned 64-bit integer |
| Vector | FLOAT[] | List<Float32> | Array of floats for embeddings |
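As a concrete sketch of this mapping, a hypothetical selection set { id name active created_at tags } would surface in PyArrow as a schema along these lines (the field names are illustrative):

```python
import pyarrow as pa

# Sketch of the Arrow schema a client would see for a hypothetical
# selection { id name active created_at tags }, per the table above.
schema = pa.schema([
    ("id", pa.int64()),                  # BigInt    -> Int64
    ("name", pa.utf8()),                 # String    -> Utf8
    ("active", pa.bool_()),              # Boolean   -> Boolean
    ("created_at", pa.timestamp("us")),  # Timestamp -> Timestamp(us), no tz
    ("tags", pa.list_(pa.utf8())),       # [String]  -> List<Utf8>
])
print(schema)
```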
Nullable vs Non-Nullable
Arrow preserves GraphQL nullability:
- Non-null fields (`String!`) → Non-nullable Arrow field
- Nullable fields (`String`) → Nullable Arrow field (with validity bitmap)
Lists and Arrays
GraphQL lists map to Arrow List types:
- `[String]` → `List<Utf8>`
- `[Int!]!` → `List<Int32>` (non-null list of non-null ints)
Nested Structures
Nested GraphQL objects become Arrow Struct types:
type User {
id: ID!
name: String!
address: Address
}
type Address {
street: String
city: String
}
Arrow representation:
Struct<
id: Int64,
name: Utf8,
address: Struct<
street: Utf8,
city: Utf8
>
>
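Expressed with PyArrow, the same shape would be (illustrative):

```python
import pyarrow as pa

# The nested User shape above as a PyArrow struct type.
user_type = pa.struct([
    ("id", pa.int64()),
    ("name", pa.utf8()),
    ("address", pa.struct([
        ("street", pa.utf8()),
        ("city", pa.utf8()),
    ])),
])
```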
Geometry Types
Hugr handles geospatial data with special encoding rules:
Top-Level Table Geometry
Geometry fields at the top level of a table are encoded as WKB (Well-Known Binary):
{
devices {
id
name
geom # WKB format in Arrow Binary column
}
}
Arrow type: Binary (contains WKB bytes)
This format is efficient and directly compatible with geopandas.
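If you are not going through geopandas, the column values can be decoded with any WKB-capable library. A minimal sketch using shapely, where the `Point` is a stand-in for a real column value:

```python
from shapely import wkb
from shapely.geometry import Point

# Stand-in for one value from the Arrow Binary `geom` column.
wkb_bytes = wkb.dumps(Point(30.5, 50.4))

# Decoding WKB by hand; geopandas performs the same conversion
# when loading the column (see Example 3 below).
print(wkb.loads(wkb_bytes))  # POINT (30.5 50.4)
```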
Nested Table Geometry
Geometry fields in nested structures within tables use GeoJSON strings:
{
drivers {
id
name
devices {
id
geom # GeoJSON string
}
}
}
Arrow type: Utf8 (contains GeoJSON string)
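These values can be parsed back into geometry objects with any GeoJSON-aware library. A minimal sketch using shapely, where the literal string stands in for a column value:

```python
import json
from shapely.geometry import shape

# Stand-in for one value from the nested Utf8 `geom` column.
geojson_str = '{"type": "Point", "coordinates": [30.5, 50.4]}'
print(shape(json.loads(geojson_str)))  # POINT (30.5 50.4)
```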
Single Object Geometry
For non-tabular (object) results, all geometry is GeoJSON regardless of nesting:
{
device(id: 1) {
id
geom # GeoJSON
}
}
Response format: JSON with GeoJSON geometry
Type Preservation Example
JSON response (loses precision):
{
"timestamp": "2025-01-15T10:30:45.123456", // String representation
"value": 123.456789012345678 // May lose precision in JSON
}
Arrow IPC (preserves types and precision):
timestamp: Timestamp(Microsecond, None) = 1736938245123456 // DuckDB TIMESTAMP (microseconds)
value: Float64 = 123.456789012345678 // Full precision maintained
Key differences:
- Arrow preserves exact microsecond timestamp values as INT64
- Arrow maintains full Float64 precision without JSON serialization loss
- DuckDB TIMESTAMP type does not include timezone information (stored as UTC)
WebSocket Streaming Protocol
For real-time streaming of large datasets, Hugr IPC supports WebSocket connections.
Connection
Endpoint: ws://hugr-server:15001/stream (or /ipc/stream)
Subprotocol: hugr-ipc-ws
const ws = new WebSocket('ws://localhost:15001/stream', 'hugr-ipc-ws');
Message Types
Client → Server Messages
1. Query Object
Execute a query and stream results:
{
"type": "query_object",
"data_object": "devices",
"selected_fields": ["id", "name", "status"],
"variables": {
"limit": 1000
}
}
2. Query
Execute a GraphQL query string:
{
"type": "query",
"query": "query { devices { id name } }"
}
3. Cancel
Cancel the current query:
{
"type": "cancel"
}
Server → Client Messages
1. Data (Binary)
Binary WebSocket message containing Arrow IPC RecordBatch.
The client receives one or more binary frames, each containing an Arrow RecordBatch that can be directly processed.
2. Error
{
"type": "error",
"error": "Query execution failed: table not found"
}
3. Complete
Indicates that the query has finished:
{
"type": "complete"
}
Connection Management
Ping/Pong:
- Server sends WebSocket ping frames every 30 seconds
- Client must respond with pong to maintain connection
Query Limit:
- Only one active query per connection
- Must wait for `complete` or send `cancel` before starting a new query
Reconnection:
- Connection drops are not automatically recovered
- Client must implement reconnection logic (a sketch follows below)
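Because a dropped query cannot be resumed, reconnecting means re-issuing the query from scratch. A rough asyncio sketch, assuming the websockets library and a caller-supplied on_message callback (both assumptions, not part of the protocol):

```python
import asyncio
import json

import websockets

async def stream_with_retry(uri, query, on_message, max_attempts=3):
    """Rough sketch: re-send the query from the beginning after a drop."""
    for attempt in range(max_attempts):
        try:
            async with websockets.connect(uri, subprotocols=["hugr-ipc-ws"]) as ws:
                await ws.send(json.dumps({"type": "query", "query": query}))
                async for message in ws:
                    if isinstance(message, bytes):
                        on_message(message)  # Arrow IPC RecordBatch bytes
                        continue
                    msg = json.loads(message)
                    if msg.get("type") in ("complete", "error"):
                        return msg
        except websockets.ConnectionClosed:
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError("query did not complete after retries")
```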
WebSocket Example (Python)
import asyncio
import json
import websockets
import pyarrow as pa
async def stream_data():
uri = "ws://localhost:15001/stream"
async with websockets.connect(uri, subprotocols=['hugr-ipc-ws']) as ws:
# Send query
await ws.send(json.dumps({
"type": "query",
"query": "query { devices { id name status } }"
}))
# Receive data
while True:
message = await ws.recv()
if isinstance(message, bytes):
# Binary Arrow data
reader = pa.ipc.open_stream(message)
for batch in reader:
df = batch.to_pandas()
print(f"Received {len(df)} rows")
else:
# JSON message
msg = json.loads(message)
if msg.get('type') == 'complete':
break
elif msg.get('type') == 'error':
print(f"Error: {msg['error']}")
break
asyncio.run(stream_data())
WebSocket Example (JavaScript)
const ws = new WebSocket('ws://localhost:15001/stream', 'hugr-ipc-ws');
ws.onopen = () => {
// Send query
ws.send(JSON.stringify({
type: 'query',
query: '{ devices { id name status } }'
}));
};
ws.onmessage = (event) => {
if (event.data instanceof Blob) {
// Binary Arrow data
event.data.arrayBuffer().then(buffer => {
// Parse with the Apache Arrow JS library (tableFromIPC from apache-arrow)
const table = arrow.tableFromIPC(new Uint8Array(buffer));
console.log(`Received ${table.numRows} rows`);
});
} else {
// JSON message
const msg = JSON.parse(event.data);
if (msg.type === 'complete') {
console.log('Query complete');
ws.close();
} else if (msg.type === 'error') {
console.error('Error:', msg.error);
}
}
};
Using with Clients
Python Client (hugr-client)
The recommended way to use Hugr IPC in Python is via the hugr-client library:
import hugr
# Synchronous client
client = hugr.Client("http://localhost:15001/ipc")
data = client.query("{ devices { id name geom } }")
df = data.df('data.devices')
# Streaming client
from hugr.stream import connect_stream
async def stream():
client = connect_stream("http://localhost:15001/ipc")
async with await client.stream("{ devices { id name } }") as stream:
async for batch in stream.chunks():
df = batch.to_pandas()
print(f"Batch: {len(df)} rows")
See the Python Client documentation for complete usage details.
Other Arrow-Compatible Clients
Any client that can:
- Parse HTTP Multipart responses
- Decode Apache Arrow IPC streams
can use the Hugr IPC protocol.
Languages with Arrow support:
- Python (PyArrow)
- R (arrow package)
- JavaScript/TypeScript (apache-arrow)
- Java (Apache Arrow Java)
- C++ (Apache Arrow C++)
- Rust (arrow-rs)
- Go (arrow/go)
Direct HTTP Requests
You can also use the protocol directly with standard HTTP libraries:
import requests
import pyarrow as pa
import io
# Send request
response = requests.post(
'http://localhost:15001/ipc',
headers={'Content-Type': 'application/json'},
json={'query': '{ devices { id name } }'},
stream=True
)
# Parse multipart response
from requests_toolbelt.multipart import decoder
multipart_data = decoder.MultipartDecoder.from_response(response)
for part in multipart_data.parts:
part_type = part.headers.get(b'X-Hugr-Part-Type').decode()
format_type = part.headers.get(b'X-Hugr-Format').decode()
path = part.headers.get(b'X-Hugr-Path').decode()
if format_type == 'table':
# Decode Arrow IPC
reader = pa.ipc.open_stream(io.BytesIO(part.content))
table = reader.read_all()
df = table.to_pandas()
print(f"Table at {path}: {len(df)} rows")
elif format_type == 'object':
# Decode JSON
import json
obj = json.loads(part.content)
print(f"Object at {path}: {obj}")
Examples
Example 1: Simple Query
Request:
curl -X POST http://localhost:15001/ipc \
-H "Content-Type: application/json" \
-d '{
"query": "{ users(limit: 10) { id name email } }"
}'
Response:
HTTP/1.1 200 OK
Content-Type: multipart/mixed; boundary=HUGR
--HUGR
Content-Type: application/vnd.apache.arrow.stream
X-Hugr-Part-Type: data
X-Hugr-Path: data.users
X-Hugr-Format: table
<Arrow IPC stream with 10 rows>
--HUGR
Content-Type: application/json
X-Hugr-Part-Type: extensions
X-Hugr-Path: extensions
X-Hugr-Format: object
{"timing": {"query_ms": 12}}
--HUGR--
Example 2: Query with Variables
Request:
import requests
response = requests.post(
'http://localhost:15001/ipc',
headers={'Content-Type': 'application/json'},
json={
'query': '''
query GetActiveDevices($status: String!) {
devices(where: {status: {_eq: $status}}) {
id
name
last_seen
}
}
''',
'variables': {'status': 'active'}
}
)
Example 3: Geospatial Query
Request:
{
"query": "{ devices { id name geom } }"
}
Response (conceptual):
--HUGR
Content-Type: application/vnd.apache.arrow.stream
X-Hugr-Part-Type: data
X-Hugr-Path: data.devices
X-Hugr-Format: table
Arrow Table:
- id: Int64 [1, 2, 3]
- name: Utf8 ["Device A", "Device B", "Device C"]
- geom: Binary [<WKB bytes>, <WKB bytes>, <WKB bytes>]
--HUGR--
The geom column contains WKB-encoded geometry that can be directly loaded into geopandas:
import hugr
client = hugr.Client("http://localhost:15001/ipc")
data = client.query("{ devices { id name geom } }")
gdf = data.gdf('data.devices', 'geom') # Automatically decodes WKB
Example 4: Nested Data
Request:
{
"query": "{ companies { id name departments { id name } } }"
}
Response: Multiple multipart sections for different paths:
--HUGR
Content-Type: application/vnd.apache.arrow.stream
X-Hugr-Part-Type: data
X-Hugr-Path: data.companies
X-Hugr-Format: table
<Arrow table with companies>
--HUGR
Content-Type: application/vnd.apache.arrow.stream
X-Hugr-Part-Type: data
X-Hugr-Path: data.companies.departments
X-Hugr-Format: table
<Arrow table with departments (flattened)>
--HUGR--
Example 5: Streaming Large Dataset
Python:
from hugr.stream import connect_stream
import asyncio
async def process_large_dataset():
client = connect_stream("http://localhost:15001/ipc")
query = """
query {
large_table {
id
data
timestamp
}
}
"""
async with await client.stream(query) as stream:
row_count = 0
async for batch in stream.chunks():
df = batch.to_pandas()
row_count += len(df)
# Process batch
print(f"Processing batch: {len(df)} rows")
# ... your processing logic ...
print(f"Total rows processed: {row_count}")
asyncio.run(process_large_dataset())
Example 6: Error Handling
Request:
{
"query": "{ invalid_table { id name } }"
}
Response:
HTTP/1.1 400 Bad Request
Content-Type: application/json
{"error": "Table 'invalid_table' not found in schema"}
Performance
Arrow IPC vs JSON Comparison
| Metric | JSON | Arrow IPC | Improvement |
|---|---|---|---|
| Serialization time | 1000ms | 150ms | 6.7x faster |
| Deserialization time | 800ms | 100ms | 8x faster |
| Transfer size | 10MB | 4MB | 2.5x smaller |
| Memory usage | 25MB | 12MB | 2x less |
| Type precision | Lost | Preserved | 100% |
Benchmark: 100,000 rows with mixed types (strings, numbers, timestamps)
Performance Characteristics
Strengths:
- Extremely fast for tabular data (columnar format)
- Zero-copy reads reduce CPU and memory overhead
- Efficient compression (can enable dictionary encoding)
- Scales well with large datasets
Considerations:
- Small overhead for very small datasets (less than 100 rows)
- Requires client-side Arrow library
- Binary format (not human-readable)
Benchmark Results
Query: 1 million rows with 10 columns (mix of types)
| Format | Response Time | Response Size | Client Parse Time | Total Time |
|---|---|---|---|---|
| JSON | 8.5s | 180MB | 4.2s | 12.7s |
| Arrow IPC (HTTP) | 1.2s | 65MB | 0.4s | 1.6s |
| Arrow IPC (WebSocket) | 1.0s | 65MB | 0.4s | 1.4s |
Improvement: Arrow IPC is approximately 8x faster end-to-end.
Best Practices for Performance
- Use Streaming for Large Datasets:

# Good: Process data in chunks
async with await client.stream(query) as stream:
async for batch in stream.chunks():
process(batch)

# Avoid: Load everything into memory
data = client.query(huge_query) # May cause OOM

- Select Only Needed Fields:

# Good: Specific fields
query { devices { id name status } }

# Avoid: Selecting all fields
query { devices { id name status geom metadata created_at ... } }

- Use Server-Side Filtering:

# Good: Filter on server
query {
devices(where: {status: {_eq: "active"}}, limit: 1000) {
id name
}
}

# Avoid: Fetching all data then filtering client-side

- Enable Compression (if supported by client):

Accept-Encoding: gzip, deflate

- Use Connection Pooling:

# Reuse client instance
client = hugr.Client("http://localhost:15001/ipc")
for query in queries:
data = client.query(query) # Reuses connection
Limitations and Known Issues
Current Limitations
- Single Query per WebSocket: Only one active query allowed per WebSocket connection
- No Query Resumption: If connection drops, query must be restarted from beginning
- Binary Format: Responses are not human-readable (use GraphQL JSON API for debugging)
- Client Libraries: Requires Arrow-compatible client library
Known Issues
- Large Nested Structures: Very deeply nested GraphQL queries may have performance overhead
- Mixed Geometry Types: Mixing different geometry types in same field requires client handling
- Compression: Not all clients support Arrow compression codecs
Future Enhancements
Planned improvements include:
- Query resumption after connection drop
- Multiple concurrent queries per WebSocket
- Built-in compression negotiation
- Enhanced error reporting with structured error codes
See Also
- Python Client - Using hugr-client Python library
- GraphQL API - Standard JSON GraphQL endpoint
- Overview - IPC protocol overview
- Protocol Specification - Technical specification
- Apache Arrow Documentation - Arrow format details
- Apache Arrow IPC Format - IPC stream specification