Surrogate keys are pretty common between Data Warehouse designers, and they are normally used to link facts to dimensional tables.
Surrogates keys define the dimensional tables primary key, and are always implemented as meaningless sequence integers.
The usage of surrogate keys provides the following benefits:
Time dimensional tables also make use of surrogate keys. Those surrogate keys are usually not completely "meaningless" (for example, the date_id often results of the concatenation of the year, month and day of month fields).
The logical structure of a datamart describes every fact table with a primary key as univoc composition of the dimension table surrogate foreign keys.
However, sometimes a fact table surrogate key may be helpful also at the physical level.
The advantages of implementing surrogate keys in the fact tables are the following:
However, using a surrogate key also in fact tables forbids the ETL process from the usage of any UPSERT logic ("Insert or Update" or "Update or Insert") at SQL level, since all upsert operations always require the presence of a primary key, defined on the natural columns that univocally define each table row.
ETL processes usually deal with huge set of data, UPSERT-like operations are rarely used. However, in many circumstances they can be required (ex. - in presence of late arriving facts).
Every ETL tool know this issue and offers a solution. In this post we explore the Talend capabilities of the tPostgresOutput component (the same logic can be used in all the components available for all the RDBMSs supported by Talend, as well).
Choosing the "Advanced Options" of the "Component" panel:
In this example, the fact table represents a series of measures caught from a environmental monitoring station. Each measure is univocally defined by the time of the measure itself (timestamp "ts") and an "id", which refers to the measuring station (SCD 2 dimension table, since the station could be moved, and in case the GPS would be modified as well).
We check the "ts" and "station_id" columns, the natural keys of our fact table, as "Update Key" and "Deletion Key", as we would manually writing the SQL statement.
All the other columns, including the dimensions whose combination don't univocally identify the fact row (and not only the measure columns), have to be checked as "Updatable" and "Insertable".
Talend Open Studio will now implement our UPSERT logic at application level. Different and more performant logics which avoid updates or sequential operations, can (and should) be used in parallel to this technique.
The usage of surrogate keys provides the following benefits:
- Support for type 2 slowly changing dimensions attributes: this allows unlimited history preservation of records;
- Improved partitioning, indexing and query performance (most of the RDBMS engines tuning options are much more efficient, if in presence of pure-integer primary keys);
- Indipendence from the natural keys of the source systems, and therefore "protection" for istance in case of unpredictable changes of the data sources themselves.
Time dimensional tables also make use of surrogate keys. Those surrogate keys are usually not completely "meaningless" (for example, the date_id often results of the concatenation of the year, month and day of month fields).
![]() |
A traditional star schema. The primary key is composed by the combination of a subset of the foreign surrogate keys, pointing to the dimensions. |
The advantages of implementing surrogate keys in the fact tables are the following:
- You cannot predict how your system will evolve in the future. Choosing a surrogate meaningless integer as primary key, will let you -as designer- to change the structure of the fact table simply adding or removing columns, and with a few updates at the ETL system logic level;
- Improved load performance, since composite primary keys are heavy to manage for the RDDMS engine;
- Improved recovery capability: in case of faiulure of the ETL loading process is easier to determine exactly where a load job was suspended, and to resume loading or back put the job entirely;
- Improved audit capabilities: since the audit foreign key is not part of any combined primary key, updates on the audit key can be easly performed without any consequence.
However, using a surrogate key also in fact tables forbids the ETL process from the usage of any UPSERT logic ("Insert or Update" or "Update or Insert") at SQL level, since all upsert operations always require the presence of a primary key, defined on the natural columns that univocally define each table row.
ETL processes usually deal with huge set of data, UPSERT-like operations are rarely used. However, in many circumstances they can be required (ex. - in presence of late arriving facts).
Every ETL tool know this issue and offers a solution. In this post we explore the Talend capabilities of the tPostgresOutput component (the same logic can be used in all the components available for all the RDBMSs supported by Talend, as well).
The Talend tPostgresqlOutput component. |
Choosing the "Advanced Options" of the "Component" panel:
In this example, the fact table represents a series of measures caught from a environmental monitoring station. Each measure is univocally defined by the time of the measure itself (timestamp "ts") and an "id", which refers to the measuring station (SCD 2 dimension table, since the station could be moved, and in case the GPS would be modified as well).
We check the "ts" and "station_id" columns, the natural keys of our fact table, as "Update Key" and "Deletion Key", as we would manually writing the SQL statement.
All the other columns, including the dimensions whose combination don't univocally identify the fact row (and not only the measure columns), have to be checked as "Updatable" and "Insertable".
Talend Open Studio will now implement our UPSERT logic at application level. Different and more performant logics which avoid updates or sequential operations, can (and should) be used in parallel to this technique.
No comments:
Post a Comment