Skip to main content

DuckDB

The DuckDB data source allows you to connect to a DuckDB database file, which can be accessed by multiple users. It can be placed in the file system or in object storage like S3 or GCS.

To set up a DuckDB data source, you need to add a data source record to the data_sources table through the GraphQL API.

The path can point to a DuckDB file in an accessible file system location (e.g., /workplace/my-duckdb-file.duckdb) or in object storage (e.g., s3://my-bucket/my-duckdb-file.duckdb). It also supports using an in-memory DuckDB database by passing :memory:.

mutation addDuckDBDataSet($data: data_sources_mut_input_data! = {}) {
core {
insert_data_sources(data: $data) {
name
description
as_module
disabled
path
prefix
read_only
self_defined
type
catalogs {
name
description
path
type
}
}
}
}

Variables:

{
"data": {
"name": "quack",
"type": "duckdb",
"path": "/workplace/my-duckdb-file.duckdb",
"prefix": "ds1",
"description": "My DuckDB data set",
"read_only": false,
"self_defined": false,
"as_module": true,
"disabled": false,
"catalogs": [
{
"name": "quack_catalog",
"description": "My DuckDB catalog source",
"path": "s3://my-bucket/my-duckdb-catalog.graphql",
"type": "uriFile"
}
]
}
}

DuckDB definition

As a database source, the DuckDB data source can be marked as read_only or self_defined. The read_only flag means that the data source is read-only and cannot be modified. The self_defined flag means that the data source is self-defined and the GraphQL schema definition will be generated based on the table and view metadata from the DuckDB database.

You can also add multiple catalogs to the DuckDB data source. All data source catalogs will be merged into one catalog, along with the self-defined generated definitions.

DuckDB as a dataset

DuckDB databases with the same structure can be used as datasets in Hugr. You can set up multiple DuckDB databases as data sources (modules), so that all queries will have the same signatures, and you only need to change the root module name to switch between different datasets.

DuckDB file views

You can also use DuckDB to view data in files stored in object storage like S3 or GCS. DuckDB can read files directly from object storage, and you can use it as a data source in Hugr.

You can create views in the DuckDB database and use them as a dataset in hugr, or you can define them in the catalog sources using the @view directive with the sql parameter. In the latter case, you can use an in-memory DuckDB database to query the files without creating a physical DuckDB database file.

Limitations

DuckDB in cluster environment

DuckDB does not support write operations to the same database file from different processes at the same time. So, if you are using DuckDB in a cluster environment, you should set the data source read_only to true, because the DuckDB database file can be accessed by multiple users, but it does not support writes to the same file from different processes simultaneously.

DuckDB on object storage

Database files that are stored in object storage should be accessible only in read-only mode.

DuckDB views

Currently, DuckDB does not support materialized views, and querying across views can have side effects during query optimization. Therefore, it is recommended to use DuckDB views only with relations that are not included in the base query of the view.

Self-defined data source

Self-defined data sources can have side effects and support auto-generated relationships only for the DuckDB data source, because it uses DuckDB metadata queries to generate the schema. For other data sources, you need to define the relationships manually.

Performance

Files like CSV are very slow to query, so it is recommended to load the data into the DuckDB database or Parquet files to improve performance. DuckDB can read Parquet files directly from object storage, so you can use them as a data source in Hugr.

Examples

You can find examples of the DuckDB data source: