ETL stands for Extract, Transform & Load. In today's Data Warehousing world, this term should be extended to E-MPAC-TL or Extract, Monitor, Profile, Analyze, Cleanse, Transform & Load. In other words: ETL with the necessary focus on data quality & metadata.
The figure below depicts each component's place in the overall architecture.
The main goal of Extracting is to off-load the data from the source systems as fast as possible and with as little burden as possible on these source systems, their development teams and their end-users. This implies that the type of source system and its characteristics - OLTP system, OLTP legacy data, multiple instances, an old Data Warehouse, archives, fixed & variable external data, spreadsheets - should be taken into account as much as possible. Furthermore, it implies that the most applicable extraction method should be chosen - source date/time stamps, database triggers, database log tables, various delta mechanisms, full versus incremental refresh, delta parameterization, controlled overwrite, or a hybrid approach - depending on the situation.
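As an illustration, below is a minimal sketch of a timestamp-based delta extract in Python, assuming DB-API style connections with %s parameter style and a hypothetical customers table carrying a last_modified column; all table and column names are placeholders, not a prescribed design.

```python
from datetime import datetime


def extract_delta(source_conn, staging_conn, last_run_ts: datetime) -> int:
    """Pull only rows changed since the previous run (timestamp-based delta)."""
    cursor = source_conn.cursor()
    # Restrict the extract to changed rows so the OLTP source is touched as lightly as possible.
    cursor.execute(
        "SELECT customer_id, name, email, last_modified "
        "FROM customers WHERE last_modified > %s",
        (last_run_ts,),
    )
    rows = cursor.fetchall()

    staging_cursor = staging_conn.cursor()
    staging_cursor.executemany(
        "INSERT INTO stg_customers (customer_id, name, email, last_modified) "
        "VALUES (%s, %s, %s, %s)",
        rows,
    )
    staging_conn.commit()
    return len(rows)
```

The same structure works for other delta mechanisms (log tables, triggers): only the selection predicate changes, while the load into staging stays identical.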
Transforming & Loading the data is about integrating the data and finally moving it to the presentation area, which can be accessed via front-end tools by the end-user community. Here the emphasis should be on really using the functionality offered by the chosen ETL tool, and using it in the most effective way. It is not enough to simply have an ETL tool while still relying on various backdoors that do not maximize its usage. Secondly, in a medium to large scale data warehouse environment it is essential to standardize the dimension and fact mappings as much as possible instead of going for customization. This will reduce the throughput time of the different source-to-target development activities, which form the bulk of the traditional ETL effort. A side-effect in a later stage is the generation of so-called ETL scripts based on this standardization and pre-defined metadata. Finally, the different individual mappings or jobs should aim for re-usability and understandability, resulting in pieces small enough to be easily debugged and tested.
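As a sketch of what such standardization can look like, the mapping below is driven purely by declarative metadata rather than hand-written per dimension; the mapping dictionary, column names and helper function are hypothetical and only illustrate the idea.

```python
# Hypothetical mapping metadata: one entry per target dimension column,
# expressed as "source.<column>" so the same loader can serve every dimension.
DIM_CUSTOMER_MAPPING = {
    "customer_key": "source.customer_id",
    "customer_name": "source.name",
    "email_address": "source.email",
}


def apply_mapping(source_row: dict, mapping: dict) -> dict:
    """Build a target row from a source row using declarative column mappings."""
    target_row = {}
    for target_col, source_expr in mapping.items():
        source_col = source_expr.split(".", 1)[1]
        target_row[target_col] = source_row.get(source_col)
    return target_row
```

Because the mapping is data rather than code, the same pattern can later be used to generate the ETL scripts mentioned above.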
Monitoring of the data enables verification of the data that is moved throughout the entire ETL process, and has two main objectives. Firstly, the data should be screened. A proper balance should be reached between screening the incoming data as much as possible and not slowing down the overall ETL process with excessive checking. Here an inside-out approach, as defined in Ralph Kimball's screening technique, can be used. This technique captures all errors consistently, based on a pre-defined set of metadata business rules, and enables reporting on them through a simple star schema, giving a view on data quality evolution over time. Secondly, a close eye should be kept on the performance of the ETL process, meaning that the right operational metadata is captured. This metadata embraces start and end timings for ETL processes on different levels (overall, by stage/sub-level & by individual ETL mapping/job). It should also capture information on the treated records (records presented, inserted, updated, discarded, failed). This metadata answers questions on data completeness and ETL performance. It can be plugged into all dimension & fact tables as a so-called audit dimension, and as such it can be queried like any other dimension.
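A minimal sketch of how such operational metadata could be captured per job is shown below; it assumes the counters are filled in by the surrounding ETL code, and the class and field names are illustrative rather than any specific tool's API.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobAudit:
    """Operational metadata for one ETL mapping/job run."""
    job_name: str
    started_at: float = field(default_factory=time.time)
    finished_at: Optional[float] = None
    presented: int = 0
    inserted: int = 0
    updated: int = 0
    discarded: int = 0
    failed: int = 0

    def finish(self) -> None:
        self.finished_at = time.time()

    def as_row(self) -> dict:
        """One row for the audit dimension, joinable from every dimension/fact load."""
        end = self.finished_at or time.time()
        return {
            "job_name": self.job_name,
            "started_at": self.started_at,
            "finished_at": self.finished_at,
            "duration_s": end - self.started_at,
            "records_presented": self.presented,
            "records_inserted": self.inserted,
            "records_updated": self.updated,
            "records_discarded": self.discarded,
            "records_failed": self.failed,
        }
```

Writing one such row per run into an audit dimension is enough to report on both timings and record counts over time.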
One step further, quality assurance processes between the different stages can be defined. Depending on the need, these processes can check the completeness of values: do we still have the same number of records, or the same totals for certain measures, between the different ETL stages? Of course this information should also be captured as metadata. Finally, data lineage should be foreseen throughout the entire ETL process, including the error records produced. This embraces transparent back-pointers to the source system or the original source interface files. File names, line numbers, business keys, source system identifiers etc. should be carried along properly and even be made available in the front-end tools, in order to reassure the end-users of the correctness and completeness of the information.
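A simple between-stage completeness check might look like the sketch below; it assumes a DB-API connection and trusted table/measure names supplied from metadata (the names are interpolated directly and are placeholders, not user input).

```python
def reconcile_stages(conn, source_table: str, target_table: str, measure: str) -> dict:
    """Compare row counts and one measure total between two ETL stages."""
    cursor = conn.cursor()
    # Table and measure names come from the ETL metadata, not from end-users.
    cursor.execute(f"SELECT COUNT(*), COALESCE(SUM({measure}), 0) FROM {source_table}")
    src_count, src_total = cursor.fetchone()
    cursor.execute(f"SELECT COUNT(*), COALESCE(SUM({measure}), 0) FROM {target_table}")
    tgt_count, tgt_total = cursor.fetchone()
    return {
        "source_table": source_table,
        "target_table": target_table,
        "count_match": src_count == tgt_count,
        "total_match": src_total == tgt_total,
        "source_count": src_count,
        "target_count": tgt_count,
    }
```

The resulting dictionary should itself be stored as metadata, so completeness can be reported on over time just like the screening results.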
Next to the operational metadata, it is essential that the data model metadata is properly integrated with the source-to-target detail information, together with all other associated characteristics such as the slowly changing dimension type of a dimension, the calculation of derived measures, etc.
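For illustration only, such integrated metadata could be represented as in the hypothetical source-to-target entries below; the table, column and rule names are made up.

```python
# Hypothetical source-to-target metadata combining data model characteristics
# (SCD type, derived measures) with mapping detail per target table.
SOURCE_TO_TARGET = [
    {
        "target_table": "dim_customer",
        "scd_type": 2,  # slowly changing dimension type of this dimension
        "columns": {
            "customer_name": {"source": "crm.customers.name", "rule": "trim"},
            "segment": {"source": "crm.customers.segment_code", "rule": "lookup(segment_codes)"},
        },
    },
    {
        "target_table": "fact_sales",
        "scd_type": None,
        "columns": {
            # Derived measure: calculated rather than copied from a single column.
            "net_amount": {"source": "erp.orders", "rule": "gross_amount - discount"},
        },
    },
]
```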
Data Profiling is used to generate statistics about the sources; the objective here is to understand the sources. It uses analytical techniques to discover the true content, structure and quality of the data, by deciphering and validating data patterns & formats and by identifying and validating redundant data across data sources. It is important that the correct tooling is put forward to automate this process, given the huge amounts and variety of data.
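A basic column profile can already reveal patterns and formats. The sketch below assumes plain Python lists of values and is only an illustration of the idea, not a replacement for dedicated profiling tooling.

```python
import re
from collections import Counter


def profile_column(values: list) -> dict:
    """Basic column profile: null rate, cardinality and dominant character patterns."""
    non_null = [v for v in values if v is not None and v != ""]
    # Reduce each value to a pattern: digits -> 9, letters -> A, punctuation kept.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))) for v in non_null
    )
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "top_patterns": patterns.most_common(3),
    }
```

For example, profiling a date column would surface a dominant "9999-99-99" pattern plus any deviating formats that need attention downstream.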
Data Analysis analyzes the results of the profiled data. In this way it becomes easier to identify data quality problems such as missing data, invalid data, inconsistent data, redundant data, constraint problems, parent/child issues such as orphans, duplicates, etc. It is important to correctly capture the results of this assessment, since it becomes the communication medium between the source and the data warehouse teams for tackling all outstanding issues. Furthermore, the earlier mentioned transformation activities will also benefit from this analysis in terms of pro-actively coping with the perceived data quality. It is obvious that the source-to-target mapping activities depend highly on the quality of the source analysis.
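Two of the checks mentioned above, orphan detection and duplicate detection, could be sketched as follows, assuming the profiled data is available as plain Python dictionaries; the function and key names are hypothetical.

```python
from collections import Counter


def find_orphans(child_rows: list, parent_keys: set, fk: str) -> list:
    """Child records whose foreign key has no matching parent (parent/child issue)."""
    return [row for row in child_rows if row.get(fk) not in parent_keys]


def find_duplicates(rows: list, business_key: str) -> dict:
    """Business-key values that occur more than once."""
    counts = Counter(row.get(business_key) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}
```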
Within the source analysis the focus should not only be on the sources "as is", but also on their surroundings: obtaining proper source documentation and the future roadmap of the source applications, getting insight into the current (data) issues of the source, obtaining the corresponding data models (E/R diagrams) and metadata repositories, and receiving a walk-through of the source data model & business rules by the source owners. Finally, it is crucial to set up frequent meetings with the source owners to detect early any changes that might impact the data warehouse and the associated ETL processes.
In the Data Cleansing step, the errors found can be fixed based on a pre-defined set of metadata rules. Here a distinction needs to be made between completely or partly rejecting the record, enabling a manual correction of the issue, or fixing the data by, for example, completing the record, correcting inaccurate data fields, adjusting the data formatting, etc.
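A minimal sketch of such metadata-driven cleansing is given below, assuming hypothetical rule entries with default, format and required attributes; real rule sets would of course be richer and would also log the manual-correction path.

```python
def cleanse_record(record: dict, rules: dict):
    """Apply metadata-driven cleansing rules; return (fixed_record, issues), or (None, issues) if rejected."""
    issues = []
    fixed = dict(record)
    for field_name, rule in rules.items():
        value = fixed.get(field_name)
        if value in (None, "") and rule.get("default") is not None:
            fixed[field_name] = rule["default"]       # complete the record
            issues.append(f"{field_name}: defaulted")
        elif value is not None and rule.get("format") == "upper":
            fixed[field_name] = str(value).upper()    # adjust the data formatting
        if rule.get("required") and fixed.get(field_name) in (None, ""):
            issues.append(f"{field_name}: missing required value")
            return None, issues                       # reject the record
    return fixed, issues
```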
E-MPAC-TL is an extended ETL concept which tries to properly balance the requirements with the realities of the systems, tools, metadata, technical issues & constraints and, above all, the data (quality) itself.
Feel free to contact us for more information on Best Practice ETL Architectures!