Key concepts

The hugr is a powerful tool designed to simplify access to all your data.

The key concept of hugr is the data source. A data source in hugr represents a physical or logical source of data, such as a database, an API, or a data lake. Each data source can contain tables, views, or functions, which are the fundamental building blocks of data organization in hugr.

The tables, views, and functions within a data source can be accessed and manipulated through a unified GraphQL API and are described using the GraphQL schema definition language (SDL). The files that contain the schema definitions can be stored in a file system or in object storage, such as S3.

Data Sources

The following types of data sources are currently supported:

DuckDB: A high-performance SQL database designed for analytics, which supports SQL queries and provides a unified interface for data access. DuckDB files can be stored in a file system or in object storage, such as S3.
PostgreSQL: A relational database management system that uses SQL for querying.
HTTP REST API: A data source that allows you to access data from any RESTful API, enabling integration with various web services and applications. It supports various authentication methods, including API keys, OAuth2, and basic authentication.
MySQL: Another relational database management system that uses SQL for querying.
DuckLake: A data lake solution that supports various storage systems, including cloud storage and distributed file systems. DuckLake is designed to handle large volumes of data, provides efficient querying capabilities, and can manage data and schema changes through snapshots (in development).
Extension: A special data source that allows you to extend schema data objects (tables and views) with additional subquery fields and function calls. This is useful for creating custom logic or aggregations using data from other sources. The extension data source can also define cross-data-source views, which combine data from multiple data sources into a single view. This is useful for creating complex queries that span multiple data sources.

The PostgreSQL data source supports pushing aggregations and joins down to the database level, allowing for efficient data retrieval and manipulation. The DuckDB data source supports SQL queries and provides a unified interface for data access, allowing for efficient data retrieval and manipulation.

Other types of data sources, such as files (CSV, Parquet, JSON, or spatial data formats), are supported through the DuckDB data source: you can define views over them.

The hugr contains built-in data sources, called runtime data sources. These data sources are used to access the metadata and configuration of the hugr itself, such as the list of available data sources, tables, views, and functions. The runtime data sources are not intended for direct user interaction but provide essential information about the hugr environment.

CoreDB

The data source definitions are stored in the CoreDB runtime data source. The CoreDB can be a DuckDB file or a PostgreSQL database, depending on the configuration of the hugr. The CoreDB also contains roles and permissions tables, which are used to manage access to the data sources, tables, views, and functions.

Data sources are loaded at the startup of the hugr.

Catalog Sources

The catalog source can be used to define the schema of tables, views, and functions in the data source. Each data source has a unique name and a number of catalogs. The catalog source defines where the schema definition files are stored. Currently, the following catalog source types are supported:

File System: A local or network file system where the schema definition files are stored.
uri: A URI that points to the schema definition files, such as a file system, HTTP, or S3 path.
uriFile: A URI that points to a single schema definition file, such as a file system, HTTP, or S3 path.

One catalog source can be used to describe the data schema of multiple data sources.

Schema Definition Language (SDL)

The hugr uses the GraphQL schema definition language (SDL) to define the schema of tables, views, and functions in the data source. Schema objects, relations, functions, and so on are defined with hugr-specific directives, such as @table, @view, @function, @field_references, @pk, and @join. All directives are described in the Query Engine Configuration section.

Tables and views are defined as GraphQL types, and functions are defined as GraphQL queries or mutations in special Function and MutationFunction types.

The general GraphQL schema is generated from the catalogs defined for the attached data sources. Generation performs the following steps:

Data source schema compilation: The hugr compiles the schema definition files from the catalog sources into a unified GraphQL schema. This includes validation, prefixing types and fields, and generating queries, mutations, and functions.
Merging data source schemas: The hugr merges the schemas of all attached data sources into a single GraphQL schema. This allows for querying data from multiple data sources in a single query.
Apply extensions: The hugr applies the extensions to the merged schema, which makes it possible to extend the schema with additional subquery fields and function calls. This is useful for creating custom logic or aggregations using data from other sources.
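
The three steps above can be pictured with a minimal Python sketch. It is not the hugr implementation: a schema is modeled as a plain dictionary from type name to field names, the function names compile_source, merge_schemas, and apply_extensions are hypothetical, and the prefix_name naming format is an assumption made for illustration.

```python
from typing import Dict, List

Schema = Dict[str, List[str]]  # type name -> field names (toy model, an assumption)


def compile_source(prefix: str, types: Schema) -> Schema:
    """Step 1: validate one data source and prefix its type names."""
    for name, fields in types.items():
        if not fields:
            raise ValueError(f"type {name!r} has no fields")
    # The "<prefix>_<name>" format is assumed here, not taken from hugr.
    return {f"{prefix}_{name}" if prefix else name: list(fields)
            for name, fields in types.items()}


def merge_schemas(schemas: List[Schema]) -> Schema:
    """Step 2: merge per-source schemas, rejecting duplicate type names."""
    merged: Schema = {}
    for schema in schemas:
        for name, fields in schema.items():
            if name in merged:
                raise ValueError(f"duplicate type {name!r}")
            merged[name] = list(fields)
    return merged


def apply_extensions(schema: Schema, extra_fields: Schema) -> Schema:
    """Step 3: add extension fields to types that already exist."""
    result = {name: list(fields) for name, fields in schema.items()}
    for name, fields in extra_fields.items():
        if name not in result:
            raise KeyError(f"cannot extend unknown type {name!r}")
        result[name].extend(fields)
    return result


pg = compile_source("pg", {"orders": ["id", "customer_id"]})
crm = compile_source("crm", {"customers": ["id", "name"]})
schema = apply_extensions(merge_schemas([pg, crm]), {"pg_orders": ["customer"]})
```

In this toy model the extension adds a customer field to pg_orders only after both sources have been merged, which mirrors why extensions are applied last.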

Data source schema compilation

At data source loading time, the hugr combines all catalog sources into a single source, then validates and compiles it into the GraphQL data source schema.

Validation

The schema definition files are validated at loading time to ensure that they conform to the GraphQL SDL syntax and the hugr-specific directives. The validation process checks for syntax errors, missing required fields, and other inconsistencies in the schema definition files. If any errors are found, the hugr will not start until the issues are resolved.

At this stage, the hugr also adds a prefix to the types and fields to avoid name collisions between different data sources. If the prefix is not specified in the schema definition file, the hugr does not add any prefix to the types and fields.

Compilation

At the compilation stage, the hugr processes the schema definition files and generates a GraphQL schema that represents the data source.

For the data objects (tables and views), the hugr generates GraphQL types with fields that correspond to the columns in the data source. The functions are defined as fields in the Function and MutationFunction types, so they already have the form of queries.

All generated queries, mutations, and functions are split into a module hierarchy, which makes it possible to organize the schema in a modular way. The modules are defined using the @module directive, which groups related queries, mutations, and functions together.

Compilation flow

1. References subquery fields

The hugr generates reference subquery fields for the data objects (tables and views) based on the relations defined in the schema. The subquery fields are added to both objects involved in the relation, allowing for easy access to related data.

The relations are defined using the @field_references or @relation directives on one of the two objects.
M2M relations are defined using the @field_references or @relation directives for both sides in the m2m table, which is marked as m2m (@table(name: "some_m2m_table", is_m2m: true)). For m2m relations, the hugr generates subquery fields on both related objects, each named and typed after the object on the other side.

2. Include query time join fields

The hugr generates query-time join fields for the data objects (tables and views). These fields make it possible to join selected data objects at query time, without the need to define the join in the schema. The query-time join fields are added to the data objects that are involved in the relation, allowing for easy access to related data.

3. Spatial joins fields

If a data object contains a field of the type Geometry, the hugr generates a spatial join field, which makes it possible to join the data object with other data objects that contain spatial fields. The spatial join field is added to the data object and can be used in queries to filter or aggregate data based on spatial relationships.

4. Filter input types

The hugr generates filter input types for the data objects (tables and views) based on the defined fields in the schema. The filter input types are used to filter the data when querying the data objects.

A filter input type contains all table or view fields except the join subquery fields; the reference subquery fields are included.

For scalar fields, the input type contains predefined operators, such as eq, in, gt, and lt. The available operators depend on the field type (e.g., String, Int, Float).
For a reference subquery field, the filter input type gets an input field with the same filter input type as the related object, which makes it possible to filter by the related object's fields. If the original subquery field is a list, the filter input type provides the operators any_of, all_of, and none_of, which filter by the related object's fields across the list.
For m2m subquery fields, the filter input type contains the same filter input type as the related object, which makes it possible to filter by the related object's fields. If the original subquery field is a list, the filter input type provides the operators any_of, all_of, and none_of, which filter by the related object's fields across the list.
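
To make the filter structure concrete, the following Python sketch evaluates a nested filter against an in-memory row. It only illustrates the likely meaning of the operators named above; it is not how the hugr executes filters (those are translated into data source queries), and the exact semantics of any_of, all_of, and none_of are assumed from their names. The function matches and the sample data are hypothetical.

```python
from typing import Any, Dict

# Scalar operators named in the text; the real set depends on the field type.
SCALAR_OPS = {
    "eq": lambda value, arg: value == arg,
    "in": lambda value, arg: value in arg,
    "gt": lambda value, arg: value > arg,
    "lt": lambda value, arg: value < arg,
}

# List operators for subquery fields that return a list (assumed semantics).
LIST_OPS = {
    "any_of": any,
    "all_of": all,
    "none_of": lambda results: not any(results),
}


def matches(row: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Return True when the row satisfies every condition in the filter."""
    for field, condition in flt.items():
        value = row[field]
        if isinstance(value, list):
            # Related rows: {"any_of": {...filter of the related object...}}
            for op, nested in condition.items():
                if not LIST_OPS[op](matches(r, nested) for r in value):
                    return False
        elif isinstance(value, dict):
            # Single related object: nested filter of the related type
            if not matches(value, condition):
                return False
        else:
            for op, arg in condition.items():
                if not SCALAR_OPS[op](value, arg):
                    return False
    return True


customer = {"name": "Alice", "age": 34, "orders": [{"total": 120}, {"total": 15}]}
flt = {"age": {"gt": 30}, "orders": {"any_of": {"total": {"gt": 100}}}}
is_match = matches(customer, flt)
```

Here the filter on orders reuses the filter shape of the related object, which is the point of giving the reference field the related object's filter input type.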

5. Data queries

If a data object contains a primary key field, a query to get a single row from the table or view is generated. The query is named object_name_by_pk, where object_name is the name of the data object. The query accepts an argument that corresponds to the primary key field and returns a single row from the table or view.

If unique field constraints are defined for the data object, a query to get a single row from the table or view by the unique fields is generated. The query is named object_name_by_unique_suffix, where object_name is the name of the data object and unique_suffix is a predefined suffix or the list of unique fields joined by _. The query accepts arguments that correspond to the unique fields and returns a single row from the table or view.
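
The naming rules for the single-row queries can be written down as a small Python sketch. The helper names by_pk_query_name and by_unique_query_name are hypothetical and exist only to illustrate the patterns described above; they are not part of the hugr.

```python
from typing import Optional, Sequence


def by_pk_query_name(object_name: str) -> str:
    """Name of the single-row query by primary key."""
    return f"{object_name}_by_pk"


def by_unique_query_name(object_name: str, unique_fields: Sequence[str],
                         suffix: Optional[str] = None) -> str:
    """Name of the single-row query by a unique constraint.

    Uses the predefined suffix when given, otherwise the unique
    field names joined by "_".
    """
    unique_suffix = suffix if suffix else "_".join(unique_fields)
    return f"{object_name}_by_{unique_suffix}"


pk_name = by_pk_query_name("customers")
unique_name = by_unique_query_name("customers", ["country", "tax_id"])
```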

A data query is generated with the name object_name, where object_name is the name of the data object. The query accepts the following arguments:

filter: an input type that corresponds to the filter input type for the data object, allowing to filter the data when querying the data object.
order_by: a list of objects with two fields: field, the name of the field to sort by, and direction, the direction of the sort (ASC or DESC). The sort fields must be among the requested fields.
distinct_on: a list of field names to apply distinct on. These fields must be among the requested fields.
limit: an integer that specifies the maximum number of rows to return.
offset: an integer that specifies the number of rows to skip before returning the results.
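
As an illustration of how these arguments fit together, the sketch below builds the text of a data query in Python. The helpers EnumValue, to_graphql, and build_query are hypothetical, and so are the customers object and its fields. Treating direction as an unquoted GraphQL enum value and field as a string are assumptions; check the generated schema for the actual types.

```python
import json
from typing import Any, List


class EnumValue(str):
    """Marks a string that is rendered without quotes (a GraphQL enum value)."""


def to_graphql(value: Any) -> str:
    """Render a Python value as a GraphQL input literal."""
    if isinstance(value, EnumValue):
        return str(value)
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, dict):
        inner = ", ".join(f"{k}: {to_graphql(v)}" for k, v in value.items())
        return "{" + inner + "}"
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(to_graphql(v) for v in value) + "]"
    return json.dumps(value)  # strings, numbers, and None (rendered as null)


def build_query(object_name: str, fields: List[str], **args: Any) -> str:
    """Build a data query with the given arguments and requested fields."""
    rendered = ", ".join(f"{k}: {to_graphql(v)}" for k, v in args.items())
    arg_part = f"({rendered})" if rendered else ""
    return f"query {{ {object_name}{arg_part} {{ {' '.join(fields)} }} }}"


query = build_query(
    "customers",
    ["id", "name"],  # "name" is requested because it is used in order_by
    filter={"name": {"eq": "Alice"}},
    order_by=[{"field": "name", "direction": EnumValue("ASC")}],
    limit=10,
    offset=20,
)
```

The resulting string can be sent to the GraphQL endpoint with any HTTP client; in real code, GraphQL variables are usually preferable to rendering literals into the query text.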

Views can be parameterized, which means that they accept arguments when queried. For such views, an arguments input object must be defined in the schema, and the hugr adds an argument called args to the query, whose type is the arguments input object.

6. Aggregation types

The hugr generates aggregation types for the data objects (tables and views) based on the defined fields in the schema. The aggregation types are used to perform aggregations on the data when querying the data objects. An aggregation type contains all table or view fields, including query time join and spatial join fields, relation subquery fields, joins and function calls. For the scalar fields, the aggregation type will contain predefined aggregations, such as count, sum, avg, min, max, etc.

Like the base object type, the aggregation type contains subquery fields for the relations, joins, and function calls, which makes it possible to perform aggregations on the related data.

To perform bucket aggregations, the hugr generates bucket aggregation types for the data objects with two fields:

key: the type of the data object, used to select the fields to group by.
aggregations: an aggregation type that contains the aggregation results for the data object, which is the same as the aggregation type for the data object. It can accept the arguments filter and order_by, which are the same as for the data query.
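
The key and aggregations pair can be pictured with a small in-memory Python sketch. It illustrates the idea of bucket aggregation only: the hugr performs the grouping in the data source, and the flat count and sum entries used here are a simplification, not the generated aggregation type. The function bucket_aggregation and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence


def bucket_aggregation(rows: List[Dict[str, Any]], key_fields: Sequence[str],
                       sum_field: str) -> List[Dict[str, Any]]:
    """Group rows by the key fields and compute count and sum per bucket."""
    groups: Dict[tuple, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in key_fields)].append(row)
    return [
        {
            "key": dict(zip(key_fields, key)),  # fields selected to group by
            "aggregations": {
                "count": len(bucket),
                "sum": sum(r[sum_field] for r in bucket),
            },
        }
        for key, bucket in groups.items()
    ]


orders = [
    {"status": "paid", "total": 100},
    {"status": "paid", "total": 50},
    {"status": "new", "total": 30},
]
buckets = bucket_aggregation(orders, ["status"], "total")
```

Each element of buckets corresponds to one row of a bucket aggregation result: the fields chosen under key define the group, and aggregations holds the values computed for it.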

7. Aggregation queries

The hugr generates aggregation queries for the data objects (tables and views):

aggregation: a single-row aggregation query named object_name_aggregation. It has the same arguments as the data query but returns a single row with the aggregation results, as an object of the aggregation type for the data object.
bucket aggregation: a bucket aggregation query named object_name_bucket_aggregation. It has the same arguments as the data query but returns a list of rows with bucket aggregation results.

The queries are added to the schema and, as subquery fields, to the data object types that contain subqueries for the relations and joins, as well as to the query-time join and spatial join fields.

8. Mutation input types

The hugr generates mutation input types for the table mutation queries: insert and update.

The insert mutation type contains all fields of the table except join subquery fields and function calls. The relation subquery fields, including those for m2m relations, are added to the insert mutation type, which makes it possible to insert related data.

9. Mutation queries

The hugr generates mutation queries for tables for the insert, update, and delete operations:

insert: named insert_object_name. It accepts an argument of the insert mutation input type for the data object and returns a single row with the inserted data if the table has a primary key field, or the OperationResult type if it does not.
update: named update_object_name. It accepts a required argument data of the update mutation input type for the data object and an optional argument filter. It returns the OperationResult type.
delete: named delete_object_name. It accepts an optional filter argument and returns the OperationResult type.
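
The naming and return-type rules above can be summarized in a short Python sketch. The function mutation_signatures is hypothetical and only restates the rules from the list; it is not part of the hugr.

```python
from typing import Dict


def mutation_signatures(object_name: str, has_primary_key: bool) -> Dict[str, str]:
    """Map each generated mutation name to the name of the type it returns."""
    return {
        # insert returns the inserted row only when the table has a primary key
        f"insert_{object_name}": object_name if has_primary_key else "OperationResult",
        f"update_{object_name}": "OperationResult",
        f"delete_{object_name}": "OperationResult",
    }


with_pk = mutation_signatures("customers", has_primary_key=True)
without_pk = mutation_signatures("events", has_primary_key=False)
```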

The OperationResult type contains the following fields:

affected_rows: an integer that specifies the number of rows affected by the operation.
success: a boolean that specifies whether the operation was successful or not.
message: a string that contains the message about the operation result.
last_id: an integer that specifies the last inserted row id, if the operation was successful and the table has a primary key field.
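
On the client side, a response of this type could be mapped to a small Python data class, as in the sketch below. The field names come from the list above; the class, the from_response helper, the defaults for missing fields, and the sample payload are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class OperationResult:
    affected_rows: int
    success: bool
    message: str = ""
    last_id: Optional[int] = None  # only set when the table has a primary key

    @classmethod
    def from_response(cls, payload: Dict[str, Any]) -> "OperationResult":
        """Build the result from the fields selected in a mutation response."""
        return cls(
            affected_rows=int(payload.get("affected_rows", 0)),
            success=bool(payload.get("success", False)),
            message=payload.get("message") or "",
            last_id=payload.get("last_id"),
        )


# Hypothetical payload, as it might appear under the mutation name in "data".
result = OperationResult.from_response(
    {"affected_rows": 3, "success": True, "message": "ok"}
)
```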
