Evaluating Apache Hadoop* Software for Big Data ETL Functions

Intel IT recently evaluated Apache Hadoop* software for ETL (extract, transform, and load) functions.

The traditional ETL process extracts data from multiple sources, joins it with other relevant data, transforms it for analytical needs, and loads it into a data warehouse for subsequent analysis. Many organizations, including Intel, use a third-party ETL tool to perform this process. The rising cost of moving data and the growing size of the datasets to be moved prompted us to evaluate whether we could improve performance and reduce costs by replacing our third-party ETL tool with our implementation of Hadoop, the Intel® Data Platform.

We first studied industry sources to learn the advantages and disadvantages of using Hadoop for big data ETL functions. We then tested what we learned against a real business use case that involved analyzing system logs, and we compared the cost of Hadoop with that of our third-party ETL tool.

We determined that using Hadoop for ETL functions works well for datasets that are coming from, passing through, or resting in Hadoop. Specifically, Hadoop makes sense for simple extract and load operations performed on those datasets. For non-Hadoop data, we do not recommend using Hadoop for ETL functions for these main reasons:

• Development, troubleshooting, and operational support for Hadoop-based features are still evolving and are not as mature as those of our third-party ETL tool.

• Enterprise-grade features of Hadoop, specifically in the areas of performance, security, and quality of service (QoS), are not yet available.
