ETL stands for Extract, Transform & Load. In today's Data Warehousing world, this term should be extended to E-MPAC-TL or Extract, Monitor, Profile, Analyze, Cleanse, Transform & Load. In other words: ETL with the necessary focus on data quality & metadata.
The figure below depicts each component's place in the overall architecture.
The main goal of Extracting is to off-load the data from the source systems as fast as possible and with as little burden as possible on these source systems, their development teams and their end-users. This implies that the type of source system and its characteristics - OLTP system, OLTP legacy data, multiple instances, an old Data Warehouse, archives, fixed & variable external data, spreadsheets - should be taken into account as much as possible. Furthermore, it implies that the most applicable extraction method should be chosen - source date/time stamps, database triggers, database log tables, various delta mechanisms, full versus incremental refresh, delta parameterization, controlled overwrite, or a hybrid approach - depending on the situation.
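As an illustration, below is a minimal sketch of a timestamp-based delta extract in Python, assuming DB-API style connections with %s parameter style and a hypothetical customers table carrying a last_modified column; all table and column names are placeholders, not a prescribed design.

```python
from datetime import datetime


def extract_delta(source_conn, staging_conn, last_run_ts: datetime) -> int:
    """Pull only rows changed since the previous run (timestamp-based delta)."""
    cursor = source_conn.cursor()
    # Restrict the extract to changed rows so the OLTP source is touched as lightly as possible.
    cursor.execute(
        "SELECT customer_id, name, email, last_modified "
        "FROM customers WHERE last_modified > %s",
        (last_run_ts,),
    )
    rows = cursor.fetchall()

    staging_cursor = staging_conn.cursor()
    staging_cursor.executemany(
        "INSERT INTO stg_customers (customer_id, name, email, last_modified) "
        "VALUES (%s, %s, %s, %s)",
        rows,
    )
    staging_conn.commit()
    return len(rows)
```

The same structure works for other delta mechanisms (log tables, triggers): only the selection predicate changes, while the load into staging stays identical.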
Transforming & Loading the data is about integrating the data and finally moving it to the presentation area, which can be accessed via front-end tools by the end-user community. Here the emphasis should be on really using the functionality offered by the chosen ETL tool, and using it in the most effective way. It is not enough to simply have an ETL tool while still relying on various backdoors that do not maximize its usage. Secondly, in a medium to large scale data warehouse environment it is essential to standardize the dimension and fact mappings as much as possible instead of going for customization. This will reduce the throughput time of the different source-to-target development activities, which form the bulk of the traditional ETL effort. A side-effect in a later stage is the generation of so-called ETL scripts based on this standardization and pre-defined metadata. Finally, the different individual mappings or jobs should aim for re-usability and understandability, resulting in pieces small enough to be easily debugged and tested.
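As a sketch of what such standardization can look like, the mapping below is driven purely by declarative metadata rather than hand-written per dimension; the mapping dictionary, column names and helper function are hypothetical and only illustrate the idea.

```python
# Hypothetical mapping metadata: one entry per target dimension column,
# expressed as "source.<column>" so the same loader can serve every dimension.
DIM_CUSTOMER_MAPPING = {
    "customer_key": "source.customer_id",
    "customer_name": "source.name",
    "email_address": "source.email",
}


def apply_mapping(source_row: dict, mapping: dict) -> dict:
    """Build a target row from a source row using declarative column mappings."""
    target_row = {}
    for target_col, source_expr in mapping.items():
        source_col = source_expr.split(".", 1)[1]
        target_row[target_col] = source_row.get(source_col)
    return target_row
```

Because the mapping is data rather than code, the same pattern can later be used to generate the ETL scripts mentioned above.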
Monitoring of the data enables verification of the data that is moved throughout the entire ETL process, and has two main objectives. Firstly, the data should be screened. A proper balance should be reached between screening the incoming data as much as possible and not slowing down the overall ETL process with excessive checking. Here an inside-out approach, as defined in Ralph Kimball's screening technique, can be used. This technique captures all errors consistently, based on a pre-defined set of metadata business rules, and enables reporting on them through a simple star schema, giving a view on data quality evolution over time. Secondly, a close eye should be kept on the performance of the ETL process, meaning that the right operational metadata is captured. This metadata embraces start and end timings for ETL processes on different levels (overall, by stage/sub-level & by individual ETL mapping/job). It should also capture information on the treated records (records presented, inserted, updated, discarded, failed). This metadata answers questions on data completeness and ETL performance. It can be plugged into all dimension & fact tables as a so-called audit dimension, and as such it can be queried like any other dimension.
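A minimal sketch of how such operational metadata could be captured per job is shown below; it assumes the counters are filled in by the surrounding ETL code, and the class and field names are illustrative rather than any specific tool's API.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobAudit:
    """Operational metadata for one ETL mapping/job run."""
    job_name: str
    started_at: float = field(default_factory=time.time)
    finished_at: Optional[float] = None
    presented: int = 0
    inserted: int = 0
    updated: int = 0
    discarded: int = 0
    failed: int = 0

    def finish(self) -> None:
        self.finished_at = time.time()

    def as_row(self) -> dict:
        """One row for the audit dimension, joinable from every dimension/fact load."""
        end = self.finished_at or time.time()
        return {
            "job_name": self.job_name,
            "started_at": self.started_at,
            "finished_at": self.finished_at,
            "duration_s": end - self.started_at,
            "records_presented": self.presented,
            "records_inserted": self.inserted,
            "records_updated": self.updated,
            "records_discarded": self.discarded,
            "records_failed": self.failed,
        }
```

Writing one such row per run into an audit dimension is enough to report on both timings and record counts over time.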
One step further, quality assurance processes between the different stages can be defined. Depending on the need, these processes can check the completeness of values: do we still have the same number of records, or the same totals for certain measures, between the different ETL stages? Of course this information should also be captured as metadata. Finally, data lineage should be foreseen throughout the entire ETL process, including the error records produced. This embraces transparent back-pointers to the source system or the original source interface files. File names, line numbers, business keys, source system identifiers etc. should be carried along properly and even be made available in the front-end tools, in order to reassure the end-users of the correctness and completeness of the information.
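A simple between-stage completeness check might look like the sketch below; it assumes a DB-API connection and trusted table/measure names supplied from metadata (the names are interpolated directly and are placeholders, not user input).

```python
def reconcile_stages(conn, source_table: str, target_table: str, measure: str) -> dict:
    """Compare row counts and one measure total between two ETL stages."""
    cursor = conn.cursor()
    # Table and measure names come from the ETL metadata, not from end-users.
    cursor.execute(f"SELECT COUNT(*), COALESCE(SUM({measure}), 0) FROM {source_table}")
    src_count, src_total = cursor.fetchone()
    cursor.execute(f"SELECT COUNT(*), COALESCE(SUM({measure}), 0) FROM {target_table}")
    tgt_count, tgt_total = cursor.fetchone()
    return {
        "source_table": source_table,
        "target_table": target_table,
        "count_match": src_count == tgt_count,
        "total_match": src_total == tgt_total,
        "source_count": src_count,
        "target_count": tgt_count,
    }
```

The resulting dictionary should itself be stored as metadata, so completeness can be reported on over time just like the screening results.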
Next to the operational metadata, it is essential that the data model metadata is properly integrated with the source-to-target detail information, together with all other associated characteristics such as the slowly changing dimension type of a dimension, the calculation of derived measures, etc.
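For illustration only, such integrated metadata could be represented as in the hypothetical source-to-target entries below; the table, column and rule names are made up.

```python
# Hypothetical source-to-target metadata combining data model characteristics
# (SCD type, derived measures) with mapping detail per target table.
SOURCE_TO_TARGET = [
    {
        "target_table": "dim_customer",
        "scd_type": 2,  # slowly changing dimension type of this dimension
        "columns": {
            "customer_name": {"source": "crm.customers.name", "rule": "trim"},
            "segment": {"source": "crm.customers.segment_code", "rule": "lookup(segment_codes)"},
        },
    },
    {
        "target_table": "fact_sales",
        "scd_type": None,
        "columns": {
            # Derived measure: calculated rather than copied from a single column.
            "net_amount": {"source": "erp.orders", "rule": "gross_amount - discount"},
        },
    },
]
```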
Data Profiling is used to generate statistics about the sources; the objective here is to understand the sources. It uses analytical techniques to discover the true content, structure and quality of the data, by deciphering and validating data patterns & formats and by identifying and validating redundant data across data sources. It is important that the correct tooling is put forward to automate this process, given the huge amounts and variety of data.
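A basic column profile can already reveal patterns and formats. The sketch below assumes plain Python lists of values and is only an illustration of the idea, not a replacement for dedicated profiling tooling.

```python
import re
from collections import Counter


def profile_column(values: list) -> dict:
    """Basic column profile: null rate, cardinality and dominant character patterns."""
    non_null = [v for v in values if v is not None and v != ""]
    # Reduce each value to a pattern: digits -> 9, letters -> A, punctuation kept.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))) for v in non_null
    )
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "top_patterns": patterns.most_common(3),
    }
```

For example, profiling a date column would surface a dominant "9999-99-99" pattern plus any deviating formats that need attention downstream.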
Data Analysis analyzes the results of the profiled data. In this way it becomes easier to identify data quality problems such as missing data, invalid data, inconsistent data, redundant data, constraint problems, parent/child issues such as orphans, duplicates, etc. It is important to correctly capture the results of this assessment, since it becomes the communication medium between the source and the data warehouse teams for tackling all outstanding issues. Furthermore, the earlier mentioned transformation activities will also benefit from this analysis in terms of pro-actively coping with the perceived data quality. It is obvious that the source-to-target mapping activities depend highly on the quality of the source analysis.
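Two of the checks mentioned above, orphan detection and duplicate detection, could be sketched as follows, assuming the profiled data is available as plain Python dictionaries; the function and key names are hypothetical.

```python
from collections import Counter


def find_orphans(child_rows: list, parent_keys: set, fk: str) -> list:
    """Child records whose foreign key has no matching parent (parent/child issue)."""
    return [row for row in child_rows if row.get(fk) not in parent_keys]


def find_duplicates(rows: list, business_key: str) -> dict:
    """Business-key values that occur more than once."""
    counts = Counter(row.get(business_key) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}
```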
Within the source analysis the focus should not only be on the sources "as is", but also on their surroundings: obtaining proper source documentation and the future roadmap of the source applications, getting insight into the current (data) issues of the source, obtaining the corresponding data models (E/R diagrams) and metadata repositories, and receiving a walk-through of the source data model & business rules by the source owners. Finally, it is crucial to set up frequent meetings with the source owners to detect early any changes that might impact the data warehouse and the associated ETL processes.
In the Data Cleansing step, the errors found can be fixed based on a pre-defined set of metadata rules. Here a distinction needs to be made between completely or partly rejecting the record, enabling a manual correction of the issue, or fixing the data by, for example, completing the record, correcting inaccurate data fields, adjusting the data formatting, etc.
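A minimal sketch of such metadata-driven cleansing is given below, assuming hypothetical rule entries with default, format and required attributes; real rule sets would of course be richer and would also log the manual-correction path.

```python
def cleanse_record(record: dict, rules: dict):
    """Apply metadata-driven cleansing rules; return (fixed_record, issues), or (None, issues) if rejected."""
    issues = []
    fixed = dict(record)
    for field_name, rule in rules.items():
        value = fixed.get(field_name)
        if value in (None, "") and rule.get("default") is not None:
            fixed[field_name] = rule["default"]       # complete the record
            issues.append(f"{field_name}: defaulted")
        elif value is not None and rule.get("format") == "upper":
            fixed[field_name] = str(value).upper()    # adjust the data formatting
        if rule.get("required") and fixed.get(field_name) in (None, ""):
            issues.append(f"{field_name}: missing required value")
            return None, issues                       # reject the record
    return fixed, issues
```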
E-MPAC-TL is an extended ETL concept which tries to properly balance the requirements with the realities of the systems, tools, metadata, technical issues & constraints and, above all, the data (quality) itself.
Feel free to contact us for more information on Best Practice ETL Architectures!