Materialisations¶
Materialisations dictate how a model is created in the database.
dbt comes with 4 standard materialisations:
- Table
- View
- Incremental
- Ephemeral
Read more about materialisations here.
For AutomateDV, we have created some custom materialisations which support Data Vault 2.0 specific patterns which are documented below.
For normal use, you should use the standard incremental materialisation in most cases. Please refer to our best practices for recommended materialisations.
Warning
These are not to be used for daily loads or even first-time (base) loads, as they perform poorly. Their main use case is low-volume tables for testing or development purposes; users are not advised to use these in production.
vault_insert_by_period (Insert by Period)¶
This materialisation is based on the insert_by_period materialisation developed by dbt Labs for the dbt-utils package.
We have re-purposed it and provided support for Snowflake, as well as added a number of convenience features.
Features include:
- Idempotent loading
- Inference of the date range to load
- Manually configurable date range to load
- Configurable granularity (day, month, year, etc.)
The purpose of the materialisation is to insert data into a table iteratively and periodically, using a configured date range. More detail on how this works is below.
Usage¶
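A minimal sketch of a model using this materialisation with the `eff_sat` macro (the stage model and column names are illustrative, not prescribed):

```jinja
{{ config(materialized='vault_insert_by_period',
          timestamp_field='LOAD_DATETIME',
          period='day',
          start_date='2020-01-30',
          stop_date='2020-04-30') }}

{# Illustrative effectivity satellite metadata #}
{{ automate_dv.eff_sat(src_pk='CUSTOMER_ORDER_HK', src_dfk='CUSTOMER_HK', src_sfk='ORDER_HK',
                       src_start_date='START_DATE', src_end_date='END_DATE',
                       src_eff='EFFECTIVE_FROM', src_ldts='LOAD_DATETIME',
                       src_source='RECORD_SOURCE', source_model='MY_STAGE') }}
```

Omitting `stop_date` would load up to the current date; supplying `date_source_models` instead infers the range automatically (see Configuring the load date range, below).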
Initial/Base Load vs. Incremental Load¶
Due to the way materialisations currently work in dbt, a model using the `vault_insert_by_period` materialisation must be run twice to complete a full load.

The first time a model with the materialisation applied is run, a BASE LOAD is executed. This loads all data for the first period in the load date range (e.g. the first day's data). All subsequent runs of the same model will execute incremental loads for each consecutive period. The first period's load will be repeated, but no duplicates should be inserted when using AutomateDV macros.
Run Output¶
Examples of output for dbt runs using the eff_sat macro and this materialisation.
Configuring the load date range¶
The start and finish date of the load can be configured in a number of different ways. Depending on how the materialisation is configured, the start and end of the load will get defined differently, as shown in the table below.
| Configuration | Outcome | Usage |
|---|---|---|
| `start_date` | The load will start at `start_date`, and the `stop_date` will be set to the current date. | Manual Load range #1 |
| `start_date` and `stop_date` | The load will start at `start_date` and stop at `stop_date`. | Manual Load range #2 |
| `date_source_models` | The models will be unioned together, and the minimum and maximum dates extracted from the data in the `timestamp_field` column. | Inferred Load range |
| All three config options | Manually provided configuration acts as an override. The load will start at `start_date` and stop at `stop_date`. | Manual Load range #3 |
Please refer to the Usage section above to see examples.
Configuration Options¶
| Configuration | Description | Type | Default | Required? |
|---|---|---|---|---|
| `timestamp_field` | The name of the column containing the timestamps to load over | String | None | |
| `period` | Time period to load over | String | day | |
| `start_date` | The date to start the load from | String (YYYY-MM-DD) | None | See: Configuring the load date range (above) |
| `stop_date` | The date to stop the load on | String (YYYY-MM-DD) | None | See: Configuring the load date range (above) |
| `date_source_models` | A list of models containing the `timestamp_field` | List[String]/String | None | See: Configuring the load date range (above) |
Period¶
The period configuration option allows us to configure the granularity of the load.
The naming varies per platform, though some common examples are:
- hour
- day
- month
- year
See below for the platform-specific documentation.
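For example, to load one month of data per iteration rather than one day, the config could look like this (a sketch; the stage model name is illustrative):

```jinja
{{ config(materialized='vault_insert_by_period',
          timestamp_field='LOAD_DATETIME',
          period='month',
          date_source_models='MY_STAGE') }}
```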
Automatic load range inference¶
Providing a list of models with the `date_source_models` configuration option will automatically load all data from the source with dates or date-times between the minimum and maximum values contained in the `timestamp_field` column.

When using the AutomateDV table template macros, `date_source_models` should be the same as the `source_model` attribute in the macro. This does not necessarily have to be the case, however, and it is possible to create a waterlevel-like table as follows:
| TYPE | LOAD_DATE |
|---|---|
| Start | 2020-01-30 |
| Stop | 2020-04-30 |
Where `LOAD_DATE` is provided to the materialisation as the `timestamp_field`, and `date_source_models` is provided as `waterlevel` (the model name).
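As a sketch, assuming the `waterlevel` model above, the configuration would be:

```jinja
{{ config(materialized='vault_insert_by_period',
          timestamp_field='LOAD_DATE',
          period='day',
          date_source_models='waterlevel') }}
```

The load range of 2020-01-30 to 2020-04-30 would then be inferred from the minimum and maximum `LOAD_DATE` values.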
Using the materialisation with non-AutomateDV SQL¶
Every table template macro includes a `__PERIOD_FILTER__` string in its SQL when used in conjunction with this materialisation. At runtime, this string is replaced with SQL which applies conditions to filter the dates contained in the `timestamp_field` to those specified in the load date range. If you are only using AutomateDV table template macros with this materialisation, then there is no need for any additional work.

However, if you are writing your own models and wish to use this materialisation, you must include a `WHERE __PERIOD_FILTER__` somewhere appropriate in your model. A CTE which selects from your source model and includes the placeholder should provide the best results.
See the hub source code for a demonstration of this.
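A minimal sketch of a custom (non-AutomateDV) model using the placeholder, assuming an illustrative `MY_STAGE` source model:

```jinja
{{ config(materialized='vault_insert_by_period',
          timestamp_field='LOAD_DATETIME',
          period='day',
          date_source_models='MY_STAGE') }}

WITH source_data AS (
    SELECT *
    FROM {{ ref('MY_STAGE') }}
    -- Replaced at runtime with conditions on LOAD_DATETIME for the current period
    WHERE __PERIOD_FILTER__
)

SELECT * FROM source_data
```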
Idempotent loads¶
This materialisation supports idempotent loads when used with AutomateDV macros. When calculating the start and stop dates of the load, a `COALESCE` function is applied. This `COALESCE` call compares the maximum timestamp contained in the `timestamp_field` with the provided or inferred `start_date`, and sets the `start_date` to whichever is larger (more recent). This means that any aborted loads will continue where they left off, and any duplicate loads will have no effect (if using AutomateDV macros).
If you wish to support idempotent loads in your own models which use this materialisation without AutomateDV macros, the best approach is to use a `LEFT OUTER JOIN` in your incremental logic to ensure duplicate records do not get loaded. Below is example incremental logic containing a `LEFT OUTER JOIN`, a simplified sketch of the pattern used in AutomateDV's hub macro (the stage and hash key names are illustrative):
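```jinja
WITH stage AS (
    SELECT *
    FROM {{ ref('MY_STAGE') }}
    WHERE __PERIOD_FILTER__
),

records_to_insert AS (
    -- Keep only records whose hash key is not already in the target table,
    -- so re-running the same period inserts no duplicates
    SELECT stage.*
    FROM stage
    LEFT OUTER JOIN {{ this }} AS target
        ON stage.CUSTOMER_HK = target.CUSTOMER_HK
    WHERE target.CUSTOMER_HK IS NULL
)

SELECT * FROM records_to_insert
```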
vault_insert_by_rank (Insert by Rank)¶
The `vault_insert_by_rank` custom materialisation provides the means to iteratively load raw vault structures from an arbitrary rank column created in the staging layer.

The `RANK()` window function is used to rank a row (using an `ORDER BY` clause) within the current 'window' of the function, which is defined by the `PARTITION BY` clause.
The custom materialisation uses this value as the value to iterate over when loading; a row with rank 1 will be loaded prior to a row with rank 2, and so on.
This materialisation can be used to correctly load temporal structures (such as satellites) where records may have millisecond timings between them, by partitioning by the primary/hash key of a table, and ordering by the timestamp column.
Usage¶
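A minimal sketch using the `hub` macro, assuming a rank column `AUTOMATE_DV_RANK` has been created in the stage (the hub metadata is illustrative):

```jinja
{{ config(materialized='vault_insert_by_rank',
          rank_column='AUTOMATE_DV_RANK',
          rank_source_models=['MY_STAGE']) }}

{{ automate_dv.hub(src_pk='CUSTOMER_HK', src_nk='CUSTOMER_ID',
                   src_ldts='LOAD_DATETIME', src_source='RECORD_SOURCE',
                   source_model='MY_STAGE') }}
```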
Configuration Options¶
| Configuration | Description | Type | Default | Required? |
|---|---|---|---|---|
| `rank_column` | The column name containing the rank values | String | None | |
| `rank_source_models` | A list of model names containing the `rank_column` | List[String] | None | |
Creating a rank column¶
A rank column can be created in one of three ways:

1. Manually creating it in a model prior to the staging layer, and using this model as the stage's `source_model`.

2. Using the `ranked_columns` configuration of the stage macro:

```yaml
source_model: "MY_STAGE"
ranked_columns:
  AUTOMATE_DV_RANK:
    partition_by: "CUSTOMER_HK"
    order_by: "LOAD_DATETIME"
```

3. Using the `derived_columns` configuration of the stage macro:

```yaml
source_model: "MY_STAGE"
derived_columns:
  AUTOMATE_DV_RANK: "RANK() OVER(PARTITION BY CUSTOMER_HK ORDER BY LOAD_DATETIME)"
```
Note
Read more about defining ranked columns.
Which option?¶
- Method #2 is recommended, as it makes it easier for rank columns to use user-defined derived or hashed columns created in the same staging layer.
- Method #3 is similar, except it will not have hashed or derived column definitions available to it.
Check your rank
Once a rank column is created, it should be sense-checked for correct and expected ordering. If your ranking is incorrect according to the business, loading will not be executed correctly.
pit_incremental¶
The `pit_incremental` custom materialisation is the required materialisation for the PIT table, as it allows for continuous reconstruction of the PIT table.
Since PITs are not historized, but query helper tables, they have to be reconstructed (at least) once every reporting cycle.
This materialisation simply ensures that the old contents of the PIT table are removed before the new version populates the target table, for each run of the PIT model.
Usage¶
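A minimal sketch of a PIT model (the `pit` macro arguments are assumed to be set elsewhere in the model's metadata; see the pit macro documentation):

```jinja
{{ config(materialized='pit_incremental') }}

{{ automate_dv.pit(src_pk=src_pk, as_of_dates_table=as_of_dates_table,
                   satellites=satellites, stage_tables_ldts=stage_tables_ldts,
                   src_ldts=src_ldts, source_model=source_model) }}
```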
bridge_incremental¶
The `bridge_incremental` custom materialisation is the required materialisation for the Bridge table, as it allows for continuous reconstruction of the Bridge table.
Since Bridges are not historized, but query helper tables, they have to be reconstructed (at least) once every reporting cycle.
This materialisation simply ensures that the old contents of the Bridge table are removed before the new version populates the target table, for each run of the Bridge model.
Usage¶
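A minimal sketch of a Bridge model (the `bridge` macro arguments are assumed to be set elsewhere in the model's metadata; see the bridge macro documentation):

```jinja
{{ config(materialized='bridge_incremental') }}

{{ automate_dv.bridge(src_pk=src_pk, as_of_dates_table=as_of_dates_table,
                      bridge_walk=bridge_walk, stage_tables_ldts=stage_tables_ldts,
                      src_ldts=src_ldts, source_model=source_model) }}
```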