Overview

The hugr query engine supports multiple data sources, allowing you to connect to various databases and data stores. Each data source can be configured with its own connection settings and schema definitions (catalogs). This flexibility enables you to work with different types of data and integrate them into your applications seamlessly.

Supported Data Sources

The following data sources are supported by the hugr query engine:

  • DuckDB: A lightweight, embedded database engine (see DuckDB).
  • PostgreSQL: A powerful, open-source relational database system (see PostgreSQL).
  • MySQL: A widely used open-source relational database management system (see MySQL).
  • DuckLake: A data lake solution that supports various storage systems, including cloud storage and distributed file systems. DuckLake is designed to handle large volumes of data, provides efficient querying capabilities, and can manage data and schema changes through snapshots (in development).
  • HTTP RESTful API: Allows you to connect to any RESTful API endpoint (see HTTP RESTful API).
  • Extension: A special data source type that allows you to extend GraphQL data objects (tables and views) with extra subquery (join) and function call fields. This is useful for adding custom logic or aggregations that use data from other sources. An extension data source can also define cross-data-source views, which combine data from multiple data sources into a single view; this is useful for complex queries that span several data sources.

File views (spatial formats, CSV, Parquet, etc.) are supported through the DuckDB data source, which can read and write various file formats and locations directly.

Configuration

System tables

Data sources are configured through a unified GraphQL API. This allows for a consistent and streamlined approach to managing data sources across different environments and use cases. The following system tables are available for managing data sources:

  • data_sources: Contains information about all configured data sources.
  • catalog_sources: Contains information about schema definitions (catalogs).
  • catalogs: Contains information about which catalogs are assigned to which data sources.
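
For example, the catalog_sources table can be listed in the same way as the data_sources table shown later on this page; this is a minimal sketch, assuming catalog_sources is exposed as a query under the core module:

query getCatalogSources {
  core {
    catalog_sources {
      name
      type
      description
      path
    }
  }
}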

Data Sources Table

The data_sources table contains the following fields:

Field        | Type    | Description
-------------|---------|------------
name         | String  | Unique name of the data source.
type         | String  | Type of the data source (e.g., postgres, mysql, etc.).
prefix       | String  | Prefix for the data source schema types and queries to avoid naming conflicts.
description  | String  | Description of the data source.
as_module    | Boolean | Indicates whether the data source should be attached to the GraphQL schema as a module.
disabled     | Boolean | Indicates whether the data source is disabled and should not be loaded at startup.
path         | String  | Connection string or path; depends on the data source type.
read_only    | Boolean | Indicates whether the data source is read-only; GraphQL mutations will not be generated for it.
self_defined | Boolean | Indicates whether the data source is self-defined; some data source types can return their schema definition without external catalogs.

The path can contain environment variables, which will be resolved at runtime. This allows for flexible configuration without hardcoding sensitive information like passwords. The name of the environment variable should be passed in the format [$ENV_VAR_NAME].
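
For example, a connection string that takes the database password from an environment variable could look like the following (the PG_PASSWORD variable name is only an illustration):

postgres://hugr:[$PG_PASSWORD]@postgres:5432/northwind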

Catalog Sources Table

The catalog_sources table contains the following fields:

Field       | Type   | Description
------------|--------|------------
name        | String | Unique name of the catalog.
type        | String | Type of the catalog (e.g., uri, localFS, etc.).
description | String | Description of the catalog.
path        | String | Path to the catalog file or URI.

System functions

To manually load, reload, or unload data sources, you can use the load_data_sources and unload_data_sources mutations. These mutations allow you to dynamically manage data sources without restarting the engine.

mutation loadDataSources {
  core {
    load_data_sources(name: "northwind") {
      success
      message
    }
  }
}

mutation unloadDataSources {
  core {
    unload_data_sources(name: "northwind") {
      success
      message
    }
  }
}

All GraphQL queries and mutations for managing data sources are available in the core module of the root query and mutation types.

In clustered environments, use the load_data_sources and unload_data_sources mutations in the core.cluster module to load, reload, or unload data sources. These mutations ensure that data sources are synchronized across all nodes in the cluster.

mutation loadDataSources {
  core {
    cluster {
      load_data_sources(name: "northwind") {
        success
        message
      }
    }
  }
}

mutation unloadDataSources {
  core {
    cluster {
      unload_data_sources(name: "northwind") {
        success
        message
      }
    }
  }
}

You can get the list of all data sources by querying the data_sources table:

query getDataSources {
  core {
    data_sources {
      name
      description
      as_module
      disabled
      path
      prefix
      read_only
      self_defined
      type
      catalogs {
        name
        description
        path
        type
      }
    }
  }
}

Example Configuration

Here is an example of how to configure a PostgreSQL data source:

mutation addNorthwindDataSet($data: data_sources_mut_input_data! = {}) {
  core {
    insert_data_sources(data: $data) {
      name
      description
      as_module
      disabled
      path
      prefix
      read_only
      self_defined
      type
      catalogs {
        name
        description
        path
        type
      }
    }
  }
}

Variables:

{
  "data": {
    "name": "northwind",
    "type": "postgres",
    "prefix": "nw",
    "description": "The Northwind database example",
    "read_only": false,
    "as_module": true,
    "path": "postgres://hugr:hugr_password@postgres:5432/northwind",
    "catalogs": [
      {
        "name": "northwind",
        "type": "uri",
        "description": "Northwind database schema",
        "path": "/workspace/get-started/schema"
      }
    ]
  }
}

To learn how mutations work, see the GraphQL Operations - Mutations section.

Self-Describing Data Sources

The data sources can be marked as self-describing (self_defined: true), which means that the data source will provide its own schema definition. This is useful for data sources that can generate their schema dynamically or have a predefined schema that does not require external catalogs.

When a data source is self-defined, it automatically generates the GraphQL schema. All relational data sources (DuckDB, PostgreSQL, MySQL) support self-defined schema generation based on the metadata of database objects (tables and views). You can use the following query to retrieve the schema:

query schema {
  function {
    core {
      describe_data_source_schema(name: "pg", self: true, log: true)
    }
  }
}

The self: true argument indicates that the schema should be generated from the data source metadata; if it is not set, the schema is generated based on the linked catalogs and the self_defined flag. The log: true argument logs the generated schema to stdout, which can be useful for debugging.

Relational databases

hugr uses relational database data sources (DuckDB, PostgreSQL, MySQL, DuckLake) as databases attached to the DuckDB query engine. You can explore the internal structure of a data source (tables, views, constraints) by querying the metadata views in the core.meta module.

For example, this query is used to define the schema for the self-defined data sources:

query dbMeta($name: String!) {
  core {
    meta {
      databases_by_name(name: $name) {
        name
        description: comment
        type
        schemas(
          filter: {
            _or: [
              { internal: { eq: false } }
              { name: { eq: "public" } }
              { name: { eq: "main" } }
            ]
            _not: {
              _or: [
                { name: { like: "_timescaledb%" } }
                { name: { eq: "timescaledb_experimental" } }
                { name: { eq: "timescaledb_information" } }
              ]
            }
          }
        ) {
          name
          description: comment
          tables(filter: { internal: { eq: false } }) {
            name
            description: comment
            schema_name
            columns(filter: { internal: { eq: false } }) {
              name
              description: comment
              data_type
              default
              is_nullable
            }
            constraints(filter: { database_name: { eq: $name } }) {
              name
              type
              references_schema_name: schema_name
              references_table_name
              columns
              references_columns
            }
          }
          views(filter: { internal: { eq: false } }) {
            name
            description: comment
            schema_name
            columns(filter: { internal: { eq: false } }) {
              name
              description: comment
              data_type
              default
              is_nullable
            }
          }
        }
      }
    }
  }
}

Service data sources

The hugr query engine also supports service data sources, which are used to connect to external services or APIs (e.g., REST). These data sources can be used to fetch data from RESTful APIs or other external services.
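
As a rough sketch, registering a RESTful API data source could follow the same insert_data_sources flow as the PostgreSQL example above. The type value "http" and the endpoint path used here are assumptions for illustration; see the HTTP RESTful API page for the actual configuration options:

mutation addRestDataSource($data: data_sources_mut_input_data!) {
  core {
    insert_data_sources(data: $data) {
      name
      type
      path
    }
  }
}

Variables (illustrative only):

{
  "data": {
    "name": "example_api",
    "type": "http",
    "prefix": "api",
    "description": "Example RESTful API data source",
    "path": "https://api.example.com/v1"
  }
}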