Materialisations¶
Materialisations dictate how a model is created in the database.
dbt comes with 4 standard materialisations:
- Table
- View
- Incremental
- Ephemeral
Read more about materialisations here.
For AutomateDV, we have created some custom materialisations which support Data Vault 2.0 specific patterns which are documented below.
For normal use, you should use the standard incremental materialisation in most cases. Please refer to our best practices for recommended materialisations.
Warning
These are not to be used for daily loads or even first-time (base) loads, as they perform poorly. Their main use case is low-volume tables for testing or development purposes; users are not advised to use these in production.
vault_insert_by_period (Insert by Period)¶
This materialisation is based on the insert_by_period materialisation developed by dbt Labs for the dbt-utils package.
We have re-purposed it and provided support for Snowflake, as well as added a number of convenience features.
Features include:
- Idempotent loading
- Inference of the date range to load
- Manually configurable date range to load
- Configurable granularity (day, month, year, etc.)
The purpose of the materialisation is to insert data into a table iteratively and periodically, using a configured date range. More detail on how this works is below.
Usage¶
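A minimal sketch of a model using this materialisation with the `eff_sat` macro (the stage model and column names are illustrative, not prescribed):

```jinja
{{ config(materialized='vault_insert_by_period',
          timestamp_field='LOAD_DATETIME',
          period='day',
          start_date='2020-01-30',
          stop_date='2020-04-30') }}

{# Illustrative effectivity satellite metadata #}
{{ automate_dv.eff_sat(src_pk='CUSTOMER_ORDER_HK', src_dfk='CUSTOMER_HK', src_sfk='ORDER_HK',
                       src_start_date='START_DATE', src_end_date='END_DATE',
                       src_eff='EFFECTIVE_FROM', src_ldts='LOAD_DATETIME',
                       src_source='RECORD_SOURCE', source_model='MY_STAGE') }}
```

Omitting `stop_date` would load up to the current date; supplying `date_source_models` instead infers the range automatically (see Configuring the load date range, below).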
Initial/Base Load vs. Incremental Load¶
Due to the way materialisations currently work in dbt, a model using the `vault_insert_by_period` materialisation must be run twice to complete a full load.

The first time a model with the materialisation applied is run, a BASE LOAD is executed. This loads all data for the first period in the load date range (e.g. the first day's data). All subsequent runs of the same model will execute incremental loads for each consecutive period. The first period's load will be repeated, but no duplicates should be inserted when using AutomateDV macros.
Run Output¶
Examples of output for dbt runs using the eff_sat macro and this materialisation.
Configuring the load date range¶
The start and finish date of the load can be configured in a number of different ways. Depending on how the materialisation is configured, the start and end of the load will get defined differently, as shown in the table below.
| Configuration | Outcome | Usage |
|---|---|---|
| `start_date` | The load will start at `start_date`, and the `stop_date` will be set to the current date. | Manual Load range #1 |
| `start_date` and `stop_date` | The load will start at `start_date` and stop at `stop_date`. | Manual Load range #2 |
| `date_source_models` | The models will be unioned together, and the minimum and maximum dates extracted from the data in the `timestamp_field` column. | Inferred Load range |
| All three config options | Manually provided configuration acts as an override. The load will start at `start_date` and stop at `stop_date`. | Manual Load range #3 |
Please refer to the Usage section above to see examples.
Configuration Options¶
| Configuration | Description | Type | Default | Required? |
|---|---|---|---|---|
| `timestamp_field` | The name of the column containing the timestamps to load over | String | None | |
| `period` | Time period to load over | String | day | |
| `start_date` | The date to start the load from | String (YYYY-MM-DD) | None | See: Configuring the load date range (above) |
| `stop_date` | The date to stop the load on | String (YYYY-MM-DD) | None | See: Configuring the load date range (above) |
| `date_source_models` | A list of models containing the `timestamp_field` | List[String]/String | None | See: Configuring the load date range (above) |
Period¶
The period configuration option allows us to configure the granularity of the load.
The naming varies per platform, though some common examples are:
- hour
- day
- month
- year
See below for the platform-specific documentation.
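For example, to load one month of data per iteration rather than one day, the config could look like this (a sketch; the stage model name is illustrative):

```jinja
{{ config(materialized='vault_insert_by_period',
          timestamp_field='LOAD_DATETIME',
          period='month',
          date_source_models='MY_STAGE') }}
```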
Automatic load range inference¶
Providing a list of models with the `date_source_models` configuration option will automatically load all data from the source with dates or date-times between the minimum and maximum values contained in the `timestamp_field` column.

When using the AutomateDV table template macros, `date_source_models` should be the same as the `source_model` attribute in the macro. This does not necessarily have to be the case, however, and it is possible to create a waterlevel-like table as follows:
| TYPE | LOAD_DATE |
|---|---|
| Start | 2020-01-30 |
| Stop | 2020-04-30 |
Where `LOAD_DATE` is provided to the materialisation as the `timestamp_field`, and `date_source_models` is provided as `waterlevel` (the model name).
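As a sketch, assuming the `waterlevel` model above, the configuration would be:

```jinja
{{ config(materialized='vault_insert_by_period',
          timestamp_field='LOAD_DATE',
          period='day',
          date_source_models='waterlevel') }}
```

The load range of 2020-01-30 to 2020-04-30 would then be inferred from the minimum and maximum `LOAD_DATE` values.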
Using the materialisation with non-AutomateDV SQL¶
Every table template macro includes a `__PERIOD_FILTER__` string in its SQL when used in conjunction with this materialisation. At runtime, this string is replaced with SQL which applies conditions to filter the dates contained in the `timestamp_field` to those specified in the load date range. If you are only using AutomateDV table template macros with this materialisation, then there is no need for any additional work.

However, if you are writing your own models and wish to use this materialisation, you must include a `WHERE __PERIOD_FILTER__` somewhere appropriate in your model. A CTE which selects from your source model and includes the placeholder should provide the best results.
See the hub source code for a demonstration of this.
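A minimal sketch of a custom (non-AutomateDV) model using the placeholder, assuming an illustrative `MY_STAGE` source model:

```jinja
{{ config(materialized='vault_insert_by_period',
          timestamp_field='LOAD_DATETIME',
          period='day',
          date_source_models='MY_STAGE') }}

WITH source_data AS (
    SELECT *
    FROM {{ ref('MY_STAGE') }}
    -- Replaced at runtime with conditions on LOAD_DATETIME for the current period
    WHERE __PERIOD_FILTER__
)

SELECT * FROM source_data
```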
Idempotent loads¶
This materialisation supports idempotent loads when used with AutomateDV macros. When calculating the start and stop dates of the load, a `COALESCE` function is applied. This `COALESCE` call compares the maximum timestamp contained in the `timestamp_field` with the provided or inferred `start_date`, and sets the `start_date` to whichever is larger (more recent). This means that any aborted loads will continue where they left off, and any duplicate loads will have no effect (if using AutomateDV macros).
If you wish to support idempotent loads in your own models which use this materialisation without AutomateDV macros, the best approach is to use a `LEFT OUTER JOIN` in your incremental logic to ensure duplicate records do not get loaded. Below is example incremental logic containing a `LEFT OUTER JOIN`, a simplified sketch of the pattern used in AutomateDV's hub macro (the stage and hash key names are illustrative):
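```jinja
WITH stage AS (
    SELECT *
    FROM {{ ref('MY_STAGE') }}
    WHERE __PERIOD_FILTER__
),

records_to_insert AS (
    -- Keep only records whose hash key is not already in the target table,
    -- so re-running the same period inserts no duplicates
    SELECT stage.*
    FROM stage
    LEFT OUTER JOIN {{ this }} AS target
        ON stage.CUSTOMER_HK = target.CUSTOMER_HK
    WHERE target.CUSTOMER_HK IS NULL
)

SELECT * FROM records_to_insert
```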
vault_insert_by_rank (Insert by Rank)¶
The `vault_insert_by_rank` custom materialisation provides the means to iteratively load raw vault structures from an arbitrary rank column created in the staging layer.

The `RANK()` window function is used to rank a row (using an `ORDER BY` clause) within the current 'window' of the function, which is defined by the `PARTITION BY` clause.
The custom materialisation uses this value as the value to iterate over when loading; a row with rank 1 will be loaded prior to a row with rank 2, and so on.
This materialisation can be used to correctly load temporal structures (such as satellites) where records may have millisecond timings between them, by partitioning by the primary/hash key of a table, and ordering by the timestamp column.
Usage¶
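A minimal sketch using the `hub` macro, assuming a rank column `AUTOMATE_DV_RANK` has been created in the stage (the hub metadata is illustrative):

```jinja
{{ config(materialized='vault_insert_by_rank',
          rank_column='AUTOMATE_DV_RANK',
          rank_source_models=['MY_STAGE']) }}

{{ automate_dv.hub(src_pk='CUSTOMER_HK', src_nk='CUSTOMER_ID',
                   src_ldts='LOAD_DATETIME', src_source='RECORD_SOURCE',
                   source_model='MY_STAGE') }}
```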
Configuration Options¶
| Configuration | Description | Type | Default | Required? |
|---|---|---|---|---|
| `rank_column` | The column name containing the rank values | String | None | |
| `rank_source_models` | A list of model names containing the `rank_column` | List[String] | None | |
Creating a rank column¶
A rank column can be created in one of three ways:

1. Manually creating it in a model prior to the staging layer, and using this model as the stage's `source_model`.

2. Using the `ranked_columns` configuration of the stage macro:

```yaml
source_model: "MY_STAGE"
ranked_columns:
  AUTOMATE_DV_RANK:
    partition_by: "CUSTOMER_HK"
    order_by: "LOAD_DATETIME"
```

3. Using the `derived_columns` configuration of the stage macro:

```yaml
source_model: "MY_STAGE"
derived_columns:
  AUTOMATE_DV_RANK: "RANK() OVER(PARTITION BY CUSTOMER_HK ORDER BY LOAD_DATETIME)"
```
Note
Read more about defining ranked columns.
Which option?¶
- Method #2 is recommended, as it makes it easier for rank columns to use user-defined derived or hashed columns created in the same staging layer.
- Method #3 is similar, except it will not have hashed or derived column definitions available to it.
Check your rank
Once a rank column is created, it should be sense-checked for correct and expected ordering. If your ranking is incorrect according to the business, loading will not be executed correctly.
pit_incremental¶
The `pit_incremental` custom materialisation is the required materialisation for the PIT table, as it allows for continuous reconstruction of the PIT table.
Since PITs are not historized, but query helper tables, they have to be reconstructed (at least) once every reporting cycle.
This materialisation simply ensures that the old contents of the PIT table are removed before the new version populates the target table, for each run of the PIT model.
Usage¶
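A minimal sketch of a PIT model (the `pit` macro arguments are assumed to be set elsewhere in the model's metadata; see the pit macro documentation):

```jinja
{{ config(materialized='pit_incremental') }}

{{ automate_dv.pit(src_pk=src_pk, as_of_dates_table=as_of_dates_table,
                   satellites=satellites, stage_tables_ldts=stage_tables_ldts,
                   src_ldts=src_ldts, source_model=source_model) }}
```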
bridge_incremental¶
The `bridge_incremental` custom materialisation is the required materialisation for the Bridge table, as it allows for continuous reconstruction of the Bridge table.
Since Bridges are not historized, but query helper tables, they have to be reconstructed (at least) once every reporting cycle.
This materialisation simply ensures that the old contents of the Bridge table are removed before the new version populates the target table, for each run of the Bridge model.
Usage¶
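A minimal sketch of a Bridge model (the `bridge` macro arguments are assumed to be set elsewhere in the model's metadata; see the bridge macro documentation):

```jinja
{{ config(materialized='bridge_incremental') }}

{{ automate_dv.bridge(src_pk=src_pk, as_of_dates_table=as_of_dates_table,
                      bridge_walk=bridge_walk, stage_tables_ldts=stage_tables_ldts,
                      src_ldts=src_ldts, source_model=source_model) }}
```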