Hashing
The drawbacks of using MD5¶
By default, AutomateDV uses MD5 hashing to calculate hashes using hash and hash_columns. If your table contains more than a few billion rows, then there is a chance of a collision: two different values generating the same hash value (see Collision vulnerabilities).
For the same reason, MD5 should not be used for cryptographic purposes either.
You can, however, choose between MD5 (md5), SHA-1 (sha1) and SHA-256 (sha) in AutomateDV (read below), which helps reduce the possibility of collisions in larger data sets.
Personally Identifiable Information (PII)¶
Although we do not use hashing for the purposes of security (but rather for optimisation and uniqueness), using unsalted MD5 or SHA-256 could still pose a security risk for your organisation. If any of your presentation layer (marts) tables or views contain hashed PII data, an attacker may be able to brute-force the hashes to gain access to the PII. For this reason, we highly recommend concatenating a salt to your hashed columns in the staging layer using the stage macro.
It's generally ill-advised to store this salt in the database alongside your hashed values, so we recommend injecting it as an environment variable for dbt to access via the env_var jinja context macro.
This salt must be a constant, as we still need to ensure that the same value produces the same hash every time, so that we can reliably look up and reference hashes. The salt could be an (initially) randomly generated 128-bit string, for example, which is then never changed and is stored securely in a secrets manager.
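As an illustration only, a salted hash could be produced in the stage macro by first creating a derived column that appends the salt, and then hashing that derived column. The column names and the HASH_SALT environment variable below are hypothetical, and the exact metadata structure may vary between AutomateDV versions:

```yaml
# Hypothetical staging metadata for the stage macro (illustrative names)
derived_columns:
  # Append a constant salt, injected via an environment variable, to the PII value
  CUSTOMER_EMAIL_SALTED: "CONCAT(CUSTOMER_EMAIL, '{{ env_var(\"HASH_SALT\") }}')"

hashed_columns:
  # Hash the salted value instead of the raw PII column
  CUSTOMER_EMAIL_HK: CUSTOMER_EMAIL_SALTED
```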
In the future, we plan to develop a helper macro for achieving these salted hashes, to cater to this use case.
Why do we hash?¶
Data Vault uses hashing for two different purposes.
Primary Key Hashing¶
A hash of the primary key. This creates a surrogate key, but one that is calculated consistently across the database: because it is a single column of a consistent data type, it supports pattern-based loading.
Hashdiffs¶
Used to fingerprint the payload of a Satellite (similar to a checksum), so that it is easier to detect whether the payload has changed; a change triggers the load of a new Satellite record. This simplifies the SQL, as otherwise we would have to compare each column in turn, handling NULLs, to see if a change had occurred.
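To illustrate the simplification, here is a hedged sketch of change detection with and without a hashdiff. The table and column names are made up, and this is not the exact SQL AutomateDV generates:

```sql
-- Without a hashdiff: compare every payload column, handling NULLs explicitly
SELECT stg.*
FROM stg_customer AS stg
JOIN sat_customer AS sat
  ON stg.CUSTOMER_HK = sat.CUSTOMER_HK
WHERE COALESCE(stg.CUSTOMER_NAME, '^^')    <> COALESCE(sat.CUSTOMER_NAME, '^^')
   OR COALESCE(stg.CUSTOMER_PHONE, '^^')   <> COALESCE(sat.CUSTOMER_PHONE, '^^')
   OR COALESCE(stg.CUSTOMER_ADDRESS, '^^') <> COALESCE(sat.CUSTOMER_ADDRESS, '^^');

-- With a hashdiff: a single comparison detects any change in the payload
SELECT stg.*
FROM stg_customer AS stg
JOIN sat_customer AS sat
  ON stg.CUSTOMER_HK = sat.CUSTOMER_HK
WHERE stg.HASHDIFF <> sat.HASHDIFF;
```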
Hashing is sensitive to column ordering. If you provide the is_hashdiff: true flag in your column specification in the stage macro, AutomateDV will automatically sort the provided columns alphabetically. Columns are sorted by their alias.
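For example, a hashdiff could be specified in the stage macro's hashed_columns configuration along these lines (the column names are illustrative):

```yaml
hashed_columns:
  CUSTOMER_HASHDIFF:
    is_hashdiff: true
    columns:
      # AutomateDV sorts these alphabetically by alias before hashing
      - CUSTOMER_ADDRESS
      - CUSTOMER_NAME
      - CUSTOMER_PHONE
```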
How do we hash?¶
Our hashing approach has been designed to standardise the hashing process, and ensure hashing has been kept consistent across a data warehouse.
Single-column hashing¶
When we hash single columns, we take the following approach:
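As an illustration, on Snowflake the compiled SQL follows roughly this shape; the column names are examples, and the exact output depends on your platform and configuration:

```sql
CAST(
    MD5_BINARY(                                   -- 5. hash the cleaned string
        NULLIF(                                   -- 4. treat empty strings as NULL
            UPPER(                                -- 3. normalise casing
                TRIM(                             -- 2. strip leading/trailing whitespace
                    CAST(CUSTOMER_ID AS VARCHAR)  -- 1. treat all values as strings
                )
            ),
            ''
        )
    ) AS BINARY(16)                               -- 6. store as BINARY
) AS CUSTOMER_HK
```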
Single-column hashing step by step:
1. CAST to VARCHAR: First, we ensure that all data gets treated the same way in the next steps by casting everything to strings (VARCHAR). For example, this means that the number 1001 and the string '1001' will always hash to the same value.
2. TRIM: We trim whitespace from strings to ensure that values with arbitrary leading or trailing whitespace will always hash to the same value, for example ' 1001 ' and '1001'.
3. UPPER: Next, we eliminate problems where the casing of a string would cause a different hash value to be generated for the same word, for example AUTOMATE_DV and automate_dv.
4. NULLIF '': At this point we ensure that if an empty string has been provided, it will be considered NULL. This kind of problem can arise when data is ingested into your warehouse from semi-structured formats such as JSON or CSV, where NULL values can sometimes be encoded as empty strings.
5. MD5_BINARY: Having cleaned and normalised the string, we are ready to hash it. MD5_BINARY will not necessarily be used: if you have chosen to use SHA, the SHA2_BINARY function will be used instead.
6. CAST AS BINARY: We then store the result as a BINARY datatype.
Multi-column hashing¶
When we hash multiple columns, we take the following approach:
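Again as an illustration only, a multi-column hash key on Snowflake follows roughly this shape. The column names are examples, and the outer NULLIF against '^^||^^' is the extra check described in step 5.5 below:

```sql
CAST(
    MD5_BINARY(
        NULLIF(
            CONCAT_WS('||',    -- 6. concatenate with the '||' separator
                IFNULL(NULLIF(UPPER(TRIM(CAST(CUSTOMER_ID AS VARCHAR))), ''), '^^'),
                IFNULL(NULLIF(UPPER(TRIM(CAST(CUSTOMER_NAME AS VARCHAR))), ''), '^^')
            ),
            '^^||^^'           -- 5.5. if every component is NULL, the whole key is NULL
        )
    ) AS BINARY(16)
) AS CUSTOMER_HK
```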
This is similar to single-column hashing, aside from the use of IFNULL and concatenation (CONCAT_WS). The step-by-step process is described below.
1. Steps 1-4 are described in single-column hashing above and are performed on each column which comprises the multi-column hash.
5. IFNULL: If steps 1-4 resolve to a NULL value (in the case of an empty string, or a true NULL), then we output a double-hat string (^^) by default. This ensures that we can detect changes in columns between NULL and non-NULL values. This is particularly important for HASHDIFFS.
5.5. NULLIF: When is_hashdiff = false and multiple columns get hashed, an extra NULLIF check gets executed. This ensures that if ALL components of a composite hash key are NULL, then the whole key evaluates as NULL. When loading Hubs, for example, we do not want to load NULL records; evaluating the whole key as NULL allows these records to be filtered out.
6. CONCAT_WS: Next, we concatenate the column values using a double-pipe string (||) by default. This ensures we have consistent concatenation, using a string which is unlikely to be contained in the columns we are concatenating. Concatenating in this way means we can be more confident that a given combination of columns will always generate the same hash value, particularly where NULLs are concerned.
7. Steps 7 and 8 are identical to steps 5 and 6 described in single-column hashing.
Hashdiff components¶
As per Data Vault 2.0 standards, HASHDIFF columns should contain the natural key (the column(s) a PK/HK is calculated from) of the record, as well as the payload of the record.
Note
Prior to AutomateDV v0.7.4, hashdiffs were REQUIRED to contain the natural keys of the record. In AutomateDV v0.7.4, the macros were updated to include logic which checks the primary key in addition to the hashdiff when detecting new records. It is still best practice to include the natural keys, however.
Hashing best practices¶
Best practices for hashing include:
- Alpha-sort Hashdiff columns. As mentioned, AutomateDV can do this for us, so no worries! Refer to the stage docs for details on how to do this.
- Ensure all Hub columns used to calculate a primary key hash are presented in the same order across all staging tables.
Note
Some tables may use different column names for primary key components, so you generally should not use the sorting functionality for primary keys.
- For Links, columns must be sorted by the primary key of each Hub and arranged alphabetically by Hub name. The order must also be the same as the order used in each Hub.
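For illustration, a Link hash key for hypothetical CUSTOMER and ORDER Hubs might be specified like this in the stage macro's hashed_columns configuration, with the Hub keys listed alphabetically by Hub name (the names and syntax details are illustrative):

```yaml
hashed_columns:
  CUSTOMER_ORDER_HK:
    - CUSTOMER_ID   # CUSTOMER Hub key first (Hubs ordered alphabetically by name)
    - ORDER_ID      # ORDER Hub key second
```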
Hashdiff Aliasing¶
HASHDIFF columns should be called HASHDIFF, as per Data Vault 2.0 standards. Because we have a shared staging layer for the Raw Vault, we cannot have multiple columns sharing the same name, so we have to name each of our HASHDIFF columns differently.
Below is an example YAML config from a Satellite model:
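A hedged sketch of such a config is shown below; the payload and metadata column names are illustrative, and the parameter names may vary between AutomateDV versions:

```yaml
source_model: stg_customer_details_hashed
src_pk: CUSTOMER_HK
src_hashdiff:
  source_column: CUSTOMER_HASHDIFF
  alias: HASHDIFF
src_payload:
  - CUSTOMER_NAME
  - CUSTOMER_PHONE
  - CUSTOMER_ADDRESS
src_eff: EFFECTIVE_FROM
src_ldts: LOAD_DATETIME
src_source: RECORD_SOURCE
```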
The hashdiff configuration above shows the syntax required to alias a column named CUSTOMER_HASHDIFF (present in the stg_customer_details_hashed staging layer) as HASHDIFF.
Choosing a hashing algorithm¶
You may choose between different types of hashing algorithm. Using a SHA-type algorithm is an option for users who wish to reduce the hashing collision rates in larger data sets.
Note
If a hashing algorithm configuration is missing or invalid, AutomateDV will use MD5 by default.
Tip
SHA-1 Hashing is recommended by the Data Vault Alliance
Configuring the hashing algorithm which will be used by AutomateDV is simple: add a global variable to your dbt_project.yml as follows:
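A minimal sketch is shown below; it assumes the global variable is named hash (check the AutomateDV global variables documentation for your version), and the project details are placeholders:

```yaml
# dbt_project.yml
name: my_dbt_project
version: '1.0.0'
profile: my_profile

vars:
  hash: SHA    # md5 (default), sha1 or sha -- case-insensitive, see the options below
```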
Hashing by Platform¶
Supported algorithms and their output types by platform:

| Platform | MD5 | SHA-1 | SHA-256 |
|---|---|---|---|
| Snowflake | BINARY(16) | BINARY(20) | BINARY(32) |
| Databricks | BINARY** | BINARY** | BINARY** |
| BigQuery | BYTES** | BYTES** | BYTES** |
| SQLServer | BINARY(16) | BINARY(20) | BINARY(32) |
| Postgres | BYTEA | * | BYTEA |
Info
*SHA-1 hashing is not supported on Postgres, as it requires the pgcrypto extension to implement fully.
**Since v0.11.0, with enable_native_hashing enabled. Without this config enabled, STRING/VARCHAR hashes are used, for backwards compatibility purposes. See Global variables for more information.
Options¶
| MD5 | SHA-1 | SHA-256 |
|---|---|---|
| md5 | sha1 | sha |
Info
These values are case-insensitive, e.g. both sha1 and SHA1 will work.
Configuring specific models¶
It is possible to configure the hashing algorithm at a more targeted scope, rather than globally, using the hierarchical structure of dbt's yaml configurations. This is not recommended; as per best practice, keep the hash algorithm consistent across all models.
Read the dbt documentation for further information on variable scoping.
Warning
Stick with your chosen algorithm unless you can afford to full-refresh and you still have access to the source data. Changing the hashing configuration after data has already been loaded will require a full refresh of your models in order to re-calculate all hashes. In production, migration is possible but more complex and must be handled with care.
Configuring hash strings¶
As previously described, the default hashing strings are as follows:
- concat_string is ||
- null_placeholder_string is ^^
These strings can be changed by the user, in the same way as configuring the hashing algorithm:
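For example, a hedged sketch that overrides both strings in dbt_project.yml; the replacement values here are arbitrary examples, not recommendations:

```yaml
# dbt_project.yml
vars:
  concat_string: '~~'          # replaces the default '||' separator
  null_placeholder_string: '--'  # replaces the default '^^' NULL placeholder
```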
The future of hashing in AutomateDV¶
We plan to provide users with the ability to disable hashing entirely.
The intent behind our hashing approach is to provide a robust method of ensuring consistent hashing (the same input always gives the same output). Until we provide more configuration options, you are welcome to modify our macros for your needs, as long as you stick to a standard that makes sense for you or your organisation. If you need advice, feel free to join our Slack and ask our developers!