Kettle cuts 80% off data extraction, transformation and loading

After working with Pentaho Data Integration (PDI), also known as Kettle, for more than nine months, I am very impressed with what it can do. Although I mandated its adoption at Allied Express in early 2007 for a Business Intelligence (BI) project, I did not actually use it myself until nine months ago. For more than 12 years I have written scripts to export, translate, transform and import data between different systems: Perl scripts, Windows batch scripts, C programs, Bash scripts and so on. None of them was as easy to develop as the equivalent (or more capable) transformation in PDI.

In early 2007, after investigating BI and evaluating many different BI tools, I found that Pentaho had the most comprehensive toolset and was one of the few mature BI systems available. However, because Pentaho's business partner in Sydney was late in replying to our tender, TM1 was chosen instead, along with a TM1 consulting company. We soon found that TM1's ETL tool, Turbo Integrator, lacks many of PDI's capabilities, including a graphical design tool. We decided to use PDI to migrate the corporate data onto a staging area from a complex web of data sources: in-house systems, text files, Excel spreadsheets, PostgreSQL databases, MS SQL Server databases, MySQL databases, Ingres databases and more. Even the developers from the consulting company were impressed with PDI's ease of use. Without PDI, we would have suffered a major delay in extracting and understanding the complex corporate data.
Personally, I have been using PDI for about nine months now. I have written many complex transformations to extract, transform and load data across different databases. The time required to write these transformations is no more than one fifth of what the same processes would take in Perl or C.
PDI works in the same way as the old pipe-and-filter architectural style that many UNIX shell script developers are familiar with. Data is read from one or more sources to produce data streams that can be filtered, manipulated, transformed, translated, migrated, looked up, mapped and so on, and eventually loaded into a data output. That output can be anything from a simple text file to the cubes and dimensions of a data warehouse.
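
To illustrate the style (this is plain Python, not PDI's actual API), here is a minimal pipe-and-filter sketch: each "step" is a generator that consumes a stream of rows and yields a new one, and chaining them together mirrors how PDI wires steps into a transformation. The field names and sample data are invented for the example.

```python
# A minimal pipe-and-filter sketch, illustrating the style PDI follows.
# Each "step" consumes a stream of rows and yields a new stream.
import csv
import io

SAMPLE_CSV = """id,name,amount
1,alice,10.50
2,bob,3.20
3,carol,7.75
"""

def read_rows(fileobj):
    """Input step: read rows from a CSV source."""
    for row in csv.DictReader(fileobj):
        yield row

def filter_rows(rows, predicate):
    """Filter step: pass through only rows matching the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def transform_rows(rows):
    """Transform step: derive a new field on each row."""
    for row in rows:
        row["amount_cents"] = int(float(row["amount"]) * 100)
        yield row

def write_rows(rows):
    """Output step: load the stream into a target (here, stdout)."""
    for row in rows:
        print(row)

# Wire the steps into a pipeline: source -> filter -> transform -> output.
stream = read_rows(io.StringIO(SAMPLE_CSV))
stream = filter_rows(stream, lambda r: float(r["amount"]) > 5)
stream = transform_rows(stream)
write_rows(stream)
```
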
It is important, though, to adjust your mindset a little before attempting to use PDI. Every step within a PDI transformation runs in parallel (as expected in a pipe-and-filter architecture), and this creates pitfalls that new users may fall into. For example, if a stream of data is split into two parallel paths and the resulting streams are later merged back together on keys, there is no guarantee that the rows arrive at the merge step at the same time or in the same order. If the merge step requires its inputs to be ordered, a sort step must therefore be performed on each stream first.
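
The sketch below (again plain Python, not PDI itself) shows why the ordering matters: a naive key-based merge join, like PDI's Merge Join step, assumes both input streams are already sorted by the join key. Feed it the unsorted output of two parallel paths and it silently misses matches; sorting each stream first, the equivalent of adding a Sort rows step before the merge, restores the correct result. The row data here is made up for illustration.

```python
def merge_join(left, right, key):
    """Naive sorted-merge join: only correct when both streams are
    already ordered by `key`."""
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[key] == r[key]:
            yield {**l, **r}
            l, r = next(left, None), next(right, None)
        elif l[key] < r[key]:
            l = next(left, None)
        else:
            r = next(right, None)

# Two streams arriving from parallel paths, in different key orders.
orders = [{"id": 2, "item": "pipe"},  {"id": 1, "item": "filter"}]
names  = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

# Unsorted input: the join silently misses the id=1 match.
print(list(merge_join(orders, names, "id")))

# Sorting each stream first (a "Sort rows" step) restores correctness.
sorted_orders = sorted(orders, key=lambda r: r["id"])
sorted_names  = sorted(names,  key=lambda r: r["id"])
print(list(merge_join(sorted_orders, sorted_names, "id")))
```
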
