Iceberg

Apache Iceberg is an open table format for large-scale analytic datasets. Hugr supports Iceberg catalogs as data sources through DuckDB's iceberg extension, providing automatic table discovery, time-travel queries via snapshots, and standard DML operations (INSERT, UPDATE, DELETE).

Hugr works with any catalog that implements the Iceberg REST API, including Apache Polaris and Lakekeeper. It also supports AWS Glue and S3 Tables catalogs.

To set up an Iceberg data source, add a data source record to the data_sources table through the GraphQL API.

Connection Formats

The Iceberg data source supports five connection path formats, specified in the path field:

1. REST Catalog (HTTPS)

Connect to an Iceberg REST catalog over HTTPS with OAuth2 authentication:

mutation addIcebergSource($data: data_sources_mut_input_data! = {}) {
  core {
    insert_data_sources(data: $data) {
      name
      type
      path
      prefix
      self_defined
      as_module
    }
  }
}

Variables:

{
  "data": {
    "name": "ice_catalog",
    "type": "iceberg",
    "path": "iceberg://catalog.example.com/warehouse?client_id=my_client&client_secret=my_secret&oauth2_server_uri=https://catalog.example.com/v1/oauth/tokens",
    "prefix": "ice",
    "self_defined": true,
    "as_module": true
  }
}
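When the path is assembled programmatically, the query string is the part most likely to go wrong. The Python sketch below rebuilds the variables shown above. `rest_catalog_path` is a hypothetical helper, not part of hugr, and it assumes that hugr percent-decodes query values the way a standard URL parser does.

```python
import json
from urllib.parse import quote, urlencode


def rest_catalog_path(host, warehouse, tls=True, **params):
    """Build an Iceberg REST catalog path (hypothetical helper, not a hugr API)."""
    scheme = "iceberg" if tls else "iceberg+http"
    path = f"{scheme}://{host}/{warehouse}"
    if params:
        # Keep ':' and '/' readable so URLs look like the documented examples;
        # everything else that is unsafe in a query string is percent-encoded.
        path += "?" + urlencode(params, quote_via=quote, safe=":/")
    return path


variables = {
    "data": {
        "name": "ice_catalog",
        "type": "iceberg",
        "path": rest_catalog_path(
            "catalog.example.com",
            "warehouse",
            client_id="my_client",
            client_secret="my_secret",
            oauth2_server_uri="https://catalog.example.com/v1/oauth/tokens",
        ),
        "prefix": "ice",
        "self_defined": True,
        "as_module": True,
    }
}

print(json.dumps(variables, indent=2))
```

Passing `tls=False` switches the scheme to `iceberg+http://` for a local catalog without TLS.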

2. REST Catalog (HTTP)

For local or development REST catalogs without TLS, use the iceberg+http:// scheme:

{
  "data": {
    "name": "ice_local",
    "type": "iceberg",
    "path": "iceberg+http://localhost:8181/warehouse",
    "prefix": "ice",
    "self_defined": true,
    "as_module": true
  }
}

When no authentication parameters are provided, hugr connects without OAuth2 (AUTHORIZATION_TYPE 'none').

Endpoint Path Prefix

Some catalogs serve the REST API at a non-root path. For example, Apache Polaris uses /api/catalog as a prefix. In this case, include the prefix in the path — the last segment is treated as the warehouse name, and everything before it becomes the endpoint:

{
  "data": {
    "name": "ice_polaris",
    "type": "iceberg",
    "path": "iceberg+http://polaris:8181/api/catalog/iceberg_warehouse?client_id=root&client_secret=s3cr3t&oauth2_server_uri=http://polaris:8181/api/catalog/v1/oauth/tokens&oauth2_scope=PRINCIPAL_ROLE:ALL",
    "prefix": "ice",
    "self_defined": true,
    "as_module": true
  }
}

This sets the DuckDB ENDPOINT to http://polaris:8181/api/catalog and the warehouse to iceberg_warehouse.
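The split described above can be sketched in Python. This only illustrates the stated rule (the last path segment is the warehouse, everything before it is the endpoint). `split_rest_path` is a hypothetical name, and hugr's actual parsing may differ in edge cases.

```python
from urllib.parse import parse_qs, urlsplit

# Path scheme -> transport scheme, as described in the two REST catalog sections.
SCHEMES = {"iceberg": "https", "iceberg+http": "http"}


def split_rest_path(path):
    """Return (endpoint, warehouse, params) for a REST catalog path."""
    parts = urlsplit(path)
    if parts.scheme not in SCHEMES:
        raise ValueError(f"not a REST catalog path: {path!r}")
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        raise ValueError("path has no warehouse segment")
    *prefix, warehouse = segments
    endpoint = f"{SCHEMES[parts.scheme]}://{parts.netloc}"
    if prefix:
        endpoint += "/" + "/".join(prefix)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return endpoint, warehouse, params


endpoint, warehouse, params = split_rest_path(
    "iceberg+http://polaris:8181/api/catalog/iceberg_warehouse"
    "?client_id=root&client_secret=s3cr3t"
    "&oauth2_server_uri=http://polaris:8181/api/catalog/v1/oauth/tokens"
    "&oauth2_scope=PRINCIPAL_ROLE:ALL"
)
print(endpoint)   # http://polaris:8181/api/catalog
print(warehouse)  # iceberg_warehouse
```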

3. AWS Glue Catalog

Connect to an AWS Glue catalog:

{
  "data": {
    "name": "ice_glue",
    "type": "iceberg",
    "path": "iceberg+glue://123456789?region=us-east-1",
    "prefix": "ice",
    "self_defined": true,
    "as_module": true
  }
}

4. AWS S3 Tables

Connect to AWS S3 Tables (Iceberg-managed tables in S3):

{
  "data": {
    "name": "ice_s3t",
    "type": "iceberg",
    "path": "iceberg+s3tables://arn:aws:s3tables:us-east-1:123456789:bucket/my-bucket?region=us-east-1",
    "prefix": "ice",
    "self_defined": true,
    "as_module": true
  }
}

5. Secret Reference

If you have already created an Iceberg secret in DuckDB, reference it by name:

{
  "data": {
    "name": "ice_catalog",
    "type": "iceberg",
    "path": "my_iceberg_secret",
    "prefix": "ice",
    "self_defined": true,
    "as_module": true
  }
}

Query Parameters

Parameter               Description                                              Example
client_id               OAuth2 client ID                                         admin
client_secret           OAuth2 client secret                                     password
oauth2_server_uri       OAuth2 token endpoint URL                                https://catalog.example.com/v1/oauth/tokens
oauth2_scope            OAuth2 scope                                             PRINCIPAL_ROLE:ALL
token                   Bearer token (alternative to OAuth2)                     eyJhbG...
region                  AWS region (for Glue/S3 Tables)                          us-east-1
access_delegation_mode  Credential delegation mode (see Catalog-Specific Notes)  vended_credentials
schema_filter           Regexp to filter namespaces                              ^default$
table_filter            Regexp to filter tables                                  ^(users|orders)$
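The two filter examples are anchored with ^ and $. The Python sketch below shows why that matters, assuming the filters use unanchored "search" semantics. That is an assumption, since the regexp dialect hugr uses is not stated here. `apply_filter` and the table names are illustrative only.

```python
import re


def apply_filter(names, pattern=None):
    """Keep the names that match pattern; no pattern keeps everything."""
    if pattern is None:
        return list(names)
    rx = re.compile(pattern)
    # search() matches anywhere in the name, so anchors decide exactness.
    return [name for name in names if rx.search(name)]


tables = ["users", "orders", "users_archive", "order_items"]

print(apply_filter(tables, r"^(users|orders)$"))  # ['users', 'orders']
print(apply_filter(tables, r"users|orders"))      # ['users', 'orders', 'users_archive']
```

Without the anchors, users_archive is also selected because it contains users.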

Data Source Options

Option        Type     Description
self_defined  Boolean  When true, auto-generates GraphQL schema from catalog metadata
as_module     Boolean  When true, exposes as a top-level GraphQL module
read_only     Boolean  When true, blocks all DML mutations

S3 Storage Access

If the Iceberg catalog stores data files on S3-compatible storage (such as MinIO or AWS S3), you must register the storage credentials in hugr before loading the Iceberg data source:

mutation {
  function {
    core {
      storage {
        register_object_storage(
          type: "S3"
          name: "my_s3"
          scope: "s3://warehouse"
          key: "access_key"
          secret: "secret_key"
          region: "us-east-1"
          endpoint: "s3.amazonaws.com"
          use_ssl: true
          url_style: "path"
        ) { success message }
      }
    }
  }
}

For MinIO or other local S3-compatible services, set use_ssl: false and endpoint to the service address.

Tip: The S3 scope must match the bucket prefix used by the Iceberg catalog. For example, if the catalog stores data in s3://iceberg-warehouse/, set scope: "s3://iceberg-warehouse".
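A scope mismatch is easy to miss, so a quick pre-flight check can help. The Python sketch below tests whether a table location reported by the catalog falls under a registered scope. `scope_covers` is a hypothetical helper and deliberately stricter than a plain string-prefix test; it does not reproduce how hugr or DuckDB actually resolve credentials.

```python
def scope_covers(scope, location):
    """True if location is the scope itself or a path underneath it."""
    scope = scope.rstrip("/")
    return location == scope or location.startswith(scope + "/")


location = "s3://iceberg-warehouse/default/sensors/data/part-0.parquet"

print(scope_covers("s3://iceberg-warehouse", location))  # True
print(scope_covers("s3://warehouse", location))          # False
```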

Self-Describing Schema

Iceberg data sources support self-describing schema generation (self_defined: true). The engine introspects the Iceberg catalog's information_schema and automatically generates a GraphQL schema for all discovered tables, including column types and nullability.

Iceberg namespaces are mapped to GraphQL modules. For example, a table default.sensors in an Iceberg catalog with prefix ice produces:

query {
  ice {
    default {
      default_sensors {
        id
        name
        temperature
      }
    }
  }
}

The type mapping from DuckDB to GraphQL follows the same convention as DuckLake:

DuckDB Type                 GraphQL Type
BOOLEAN                     Boolean
TINYINT, SMALLINT, INTEGER  Int
BIGINT, HUGEINT             BigInt
FLOAT, DOUBLE, DECIMAL      Float
VARCHAR, CHAR, UUID         String
BLOB                        String
DATE                        Date
TIME                        Time
TIMESTAMP, TIMESTAMPTZ      Timestamp
JSON                        JSON
GEOMETRY                    Geometry

Time Travel with @at

Iceberg supports time-travel queries via the @at directive, allowing you to query data as it existed at a specific snapshot.

Note: Iceberg snapshot IDs are large random numbers (e.g. 7733883404728353578), not sequential version numbers. You can find snapshot IDs by querying the Iceberg catalog metadata via its REST API or by using a tool such as the DuckDB CLI.

In Queries

Apply time travel at query time using the @at directive on query fields:

query {
  ice {
    default {
      # Query data at a specific snapshot
      at_snapshot: default_sensors @at(version: 7733883404728353578) {
        id
        name
        temperature
      }

      # Query data at a specific timestamp
      at_time: default_sensors @at(timestamp: "2025-01-01T12:00:00Z") {
        id
        name
        temperature
      }
    }
  }
}

Compare Snapshots

Use aliases to compare data across different points in time:

query {
  ice {
    default {
      before: default_sensors @at(version: 7733883404728353578) {
        id
        temperature
      }
      after: default_sensors {
        id
        temperature
      }
    }
  }
}

In Schema Definitions (SDL)

Pin a table to a specific snapshot version or timestamp at the schema level:

type historical_sensors @table(name: "default.sensors") @at(version: 7733883404728353578) {
  id: BigInt! @pk
  name: String
  temperature: Float
}

The @at directive accepts exactly one of:

  • version: Int — snapshot ID
  • timestamp: String — RFC 3339 timestamp (e.g. 2025-01-15T10:30:00Z)

Within operations, the @at directive is valid only on query fields. Using @at on a mutation results in an error.
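Clients that generate queries can enforce the "exactly one of" rule before sending anything. The Python sketch below renders an @at directive string and rejects invalid combinations. `at_directive` is a hypothetical client-side helper, and the timestamp check uses Python's ISO 8601 parser as an approximation of RFC 3339.

```python
from datetime import datetime


def at_directive(version=None, timestamp=None):
    """Render an @at directive from exactly one of version or timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("@at takes exactly one of version or timestamp")
    if version is not None:
        # Snapshot IDs are large integers; keep them as int, never float.
        return f"@at(version: {int(version)})"
    # Raises ValueError if the timestamp is malformed. 'Z' is rewritten
    # because older Python versions do not accept it in fromisoformat().
    datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return f'@at(timestamp: "{timestamp}")'


print(at_directive(version=7733883404728353578))
print(at_directive(timestamp="2025-01-15T10:30:00Z"))
```

Keeping the snapshot ID as an integer matters because it exceeds the range a 64-bit float represents exactly.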

Mutations

Iceberg tables support standard DML mutations (INSERT, UPDATE, DELETE) — the same as DuckDB tables. Each mutation creates a new Iceberg snapshot.

Note: DuckDB's Iceberg extension does not yet support targeted inserts (that is, inserting into a subset of columns). INSERT mutations must provide all columns. This is a DuckDB limitation and is expected to be lifted in a future DuckDB release.

To make an Iceberg source read-only (blocking all DML), set read_only: true when registering the data source.

DuckLake Bridge

If you use DuckLake, you can import Iceberg catalog metadata into a DuckLake catalog using the iceberg_to_ducklake mutation. This allows you to query Iceberg data through DuckLake's management and versioning infrastructure.

mutation {
  function {
    core {
      ducklake {
        iceberg_to_ducklake(
          iceberg_catalog: "ice_catalog"
          ducklake_catalog: "my_lake"
        ) { success message }
      }
    }
  }
}

The function also accepts two parameters not shown above: clear resets the DuckLake catalog before import, and skip_tables takes a comma-separated list of tables to exclude.

Note: The iceberg_to_ducklake function requires the target DuckLake catalog to be empty. Create a dedicated DuckLake data source for the bridge.

Catalog-Specific Notes

Apache Polaris

Apache Polaris serves its REST API at /api/catalog/v1/ (not /v1/). Use the endpoint path prefix in the connection URI:

iceberg+http://polaris:8181/api/catalog/iceberg_warehouse?...

Polaris distributes the configured S3 storage endpoint to DuckDB clients. Ensure the endpoint is reachable from the hugr process. When running hugr in Docker alongside Polaris and MinIO, use Docker-internal hostnames (e.g. minio:9000).

Polaris supports vended credentials — where the catalog provides temporary S3 credentials to clients instead of static endpoints. To enable this, add access_delegation_mode=vended_credentials to the connection URI. This requires Polaris to be configured with an STS-capable storage backend (e.g. AWS S3 with IAM roles).

Polaris requires oauth2_scope=PRINCIPAL_ROLE:ALL (or a specific principal role) for catalog access.

Lakekeeper

Lakekeeper serves the REST API at the root path (/v1/), so no endpoint prefix is needed:

iceberg+http://lakekeeper:8181/warehouse?client_id=...&client_secret=...&oauth2_server_uri=http://lakekeeper:8181/v1/oauth/tokens

AWS Glue

AWS Glue catalogs use SigV4 authentication. Hugr automatically sets AUTHORIZATION_TYPE 'sigv4' when the iceberg+glue:// scheme is used. Ensure the AWS credentials are available via the standard AWS credential chain (environment variables, instance profile, etc.).

AWS S3 Tables

S3 Tables use the same authentication as AWS Glue. The connection path embeds the table bucket ARN in the following format:

iceberg+s3tables://arn:aws:s3tables:REGION:ACCOUNT:bucket/BUCKET_NAME?region=REGION
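The format above can be checked before registering the data source. The Python sketch below splits an S3 Tables path into its ARN components. `parse_s3tables_path` is a hypothetical helper, not a hugr function, and it validates only the shape shown above.

```python
def parse_s3tables_path(path):
    """Split an iceberg+s3tables:// path into ARN, region, account and bucket."""
    scheme, sep, rest = path.partition("://")
    if scheme != "iceberg+s3tables" or not sep:
        raise ValueError(f"not an S3 Tables path: {path!r}")
    arn, _, _query = rest.partition("?")
    # arn : partition : service : region : account : resource
    fields = arn.split(":", 5)
    if len(fields) != 6 or fields[0] != "arn" or fields[2] != "s3tables":
        raise ValueError(f"malformed S3 Tables ARN: {arn!r}")
    region, account, resource = fields[3:]
    kind, _, bucket = resource.partition("/")
    if kind != "bucket" or not bucket:
        raise ValueError(f"ARN does not name a table bucket: {arn!r}")
    return {"arn": arn, "region": region, "account": account, "bucket": bucket}


info = parse_s3tables_path(
    "iceberg+s3tables://arn:aws:s3tables:us-east-1:123456789:bucket/my-bucket"
    "?region=us-east-1"
)
print(info["region"], info["account"], info["bucket"])  # us-east-1 123456789 my-bucket
```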

Limitations

DDL Not Supported

Iceberg data sources do not support DDL operations (CREATE TABLE, ALTER TABLE, DROP TABLE) through hugr. Tables must be created and managed externally via the Iceberg catalog or tools like Spark, DuckDB CLI, or Trino.

No MERGE Support

DuckDB's Iceberg extension does not support MERGE operations. Use separate INSERT, UPDATE, and DELETE statements instead.

Targeted Inserts Not Supported

DuckDB's Iceberg extension requires all columns to be specified in INSERT operations. Partial column inserts (targeted inserts) are not yet supported.

No Incremental Schema Compilation

Unlike DuckLake, Iceberg data sources do not support incremental schema compilation. Schema changes in the Iceberg catalog trigger a full re-introspection. The schema version is based on a content hash of the discovered tables and columns.
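The idea of a content-hash version can be illustrated with a short Python sketch. The canonical form and the choice of SHA-256 are assumptions for illustration; hugr's actual hashing scheme is not described here. `schema_version` is a hypothetical name.

```python
import hashlib
import json


def schema_version(tables):
    """Hash a {table: {column: type}} mapping into a stable version string."""
    # Sort tables and columns so discovery order does not change the hash.
    canonical = json.dumps(
        {name: sorted(columns.items()) for name, columns in tables.items()},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


before = {"default.sensors": {"id": "BIGINT", "name": "VARCHAR", "temperature": "DOUBLE"}}
after = {"default.sensors": {"id": "BIGINT", "name": "VARCHAR", "temperature": "FLOAT"}}

print(schema_version(before) == schema_version(after))  # False
```

Any change to a discovered table or column yields a different hash, which is the signal for a full re-introspection.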

CREATE OR REPLACE Not Supported

DuckDB's Iceberg extension does not support CREATE OR REPLACE TABLE. Use DROP TABLE IF EXISTS followed by CREATE TABLE when recreating tables externally.

Catalog Authentication

When connecting to a secured REST catalog, ensure the OAuth2 or bearer token credentials are valid and have sufficient permissions to list namespaces and tables. Authentication errors during attach will prevent the data source from loading.

Storage Endpoint Visibility

The Iceberg REST catalog distributes its configured S3 storage endpoint to clients. If hugr runs outside the catalog's network (e.g., on the host while the catalog is in Docker), the storage endpoint may not be reachable. Ensure the S3 endpoint configured in the catalog is accessible from the hugr process.