GlueData Success and Learnings with Hadoop Enterprise Data Warehouse Leveraging SAP Data Services

GlueData recently succeeded in integrating SAP with a Hadoop Enterprise Data Warehouse. The project yielded some valuable learnings, which may benefit other customers embarking on similar projects. The tight deadlines and the relatively new technology posed some interesting challenges. The key findings were:

* Additional planning time is required in the design and architecture phases due to the complex integration and new technology in the landscape.

* The technology was so new that there was very little reference material or best practice documentation available. As a result, the GlueData team needed to develop and test rapidly.

* A thorough analysis is required to assess the current situation and pain points.

* After identifying requirements, multiple CDC solutions need to be measured and compared against those requirements.

* The most critical considerations are the data latency requirements and the sizes of the delta and complete data sets:

* If the changed data doesn’t need to be available in a target system more frequently than daily, and budget constraints are an issue, then a target-based or timestamp-based CDC solution may be appropriate.

* If the changed data needs to be sent to a target system much more frequently, then a source-based CDC solution is probably appropriate.

* Additionally, if there are many downstream subscriber systems, and the data is mission critical, then implementing CDC using a replication server may be appropriate.

* The type of database must be investigated as each database server may have its own native CDC capabilities or might not be compatible with some of the CDC methods described. Third-party CDC applications may have limitations when using certain databases.
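The selection guidance in the findings above can be read as a simple decision rule. The following Python sketch is illustrative only: the `CdcRequirements` fields, the `suggest_cdc_approach` function, the 24-hour cut-off for "daily" and the subscriber threshold of 5 are all assumptions made for this example, not values from the project.

```python
from dataclasses import dataclass


@dataclass
class CdcRequirements:
    """Hypothetical summary of the requirements gathered during analysis."""
    max_latency_hours: float   # how stale the data in the target may be
    budget_constrained: bool   # whether budget is a limiting factor
    subscriber_count: int      # number of downstream subscriber systems
    mission_critical: bool     # whether the data is mission critical


def suggest_cdc_approach(req: CdcRequirements, many_subscribers: int = 5) -> str:
    """Map requirements to a candidate CDC approach.

    The threshold for "many" subscribers is an assumed placeholder.
    """
    # Many subscribers plus mission-critical data: consider a replication server.
    if req.subscriber_count >= many_subscribers and req.mission_critical:
        return "replication-server"
    # Daily (or slower) refresh with a tight budget: simpler methods may suffice.
    if req.max_latency_hours >= 24 and req.budget_constrained:
        return "target-based or timestamp-based"
    # Changes needed more often than daily: capture them at the source.
    if req.max_latency_hours < 24:
        return "source-based"
    # Anything else is not covered by the guidance and needs a closer look.
    return "needs further analysis"


if __name__ == "__main__":
    nightly = CdcRequirements(24, True, 1, False)
    print(suggest_cdc_approach(nightly))
```

A rule like this only narrows the candidates. The database-specific checks in the last finding still have to be made before settling on a method.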

The project journey

GlueData delivered this project for a US client in the manufacturing space that had invested in Cloudera Data Hub (CDH) with the purpose of creating a central repository for business data. Their existing reports aggregated data from a number of different sources and the aim of the project was to consolidate the data within the Cloudera Data Hub and to refine the reports to leverage the strengths inherent to Hadoop.

“Running Hadoop concurrently with SAP ERP provides a platform that will adapt as the clients’ needs change. The future ingestion of unstructured and semi-structured data (like customer service notes, manufacturing floor data, and external macroeconomic data) will open new possibilities for providing critical information to management. In particular, the client anticipated analytic models leveraging machine learning techniques to assist in business needs such as forecasting, equipment maintenance, and optimization of processes,” stated Paul McCormick – Solution Architect at GlueData Master Data Solutions.

GlueData used SAP Data Services as the ETL tool. In the first phase of the project, all relevant data was moved from SAP ERP to an ODS structure within the CDH, as agreed in the client’s brief. GlueData developed an SAP Data Services component to transfer flat files to file directories on HDFS. “This component uses the WebHDFS protocol and this allowed the solution to remove the dependencies between specific Hadoop, SAP Data Services and operating system compatibility issues, whilst providing a simplified interface for developers to use,” states McCormick.
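To illustrate why WebHDFS removes client-side dependencies, the sketch below uploads a flat file using nothing but plain HTTP. It is not GlueData's Data Services component, only a minimal Python illustration of the protocol's two-step file creation. The function names, host, port, user and paths are hypothetical.

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def webhdfs_create_url(host, port, hdfs_path, user, overwrite=True):
    """Build the NameNode URL for the first step of a WebHDFS CREATE."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": "true" if overwrite else "false",
    })
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


def upload_file(host, port, local_path, hdfs_path, user):
    """Upload a local flat file to HDFS over WebHDFS (illustrative helper)."""
    # Step 1: PUT to the NameNode without a body; it replies with a
    # 307 redirect whose Location header points at a DataNode.
    url = urlsplit(webhdfs_create_url(host, port, hdfs_path, user))
    conn = http.client.HTTPConnection(url.hostname, url.port)
    conn.request("PUT", f"{url.path}?{url.query}")
    resp = conn.getresponse()
    resp.read()
    location = resp.getheader("Location")
    conn.close()
    if resp.status != 307 or not location:
        raise RuntimeError(f"expected a 307 redirect, got {resp.status}")

    # Step 2: PUT the file contents to the DataNode URL from the redirect.
    target = urlsplit(location)
    with open(local_path, "rb") as fh:
        data = fh.read()
    conn = http.client.HTTPConnection(target.hostname, target.port)
    conn.request(
        "PUT",
        f"{target.path}?{target.query}",
        body=data,
        headers={"Content-Type": "application/octet-stream"},
    )
    resp = conn.getresponse()
    resp.read()
    conn.close()
    if resp.status != 201:
        raise RuntimeError(f"upload failed with HTTP status {resp.status}")


if __name__ == "__main__":
    # Hypothetical cluster details; running upload_file needs a reachable NameNode.
    print(webhdfs_create_url("namenode", 50070, "/data/ods/mara.csv", "etl"))
```

Because the client needs only an HTTP library, no Hadoop client libraries have to be installed or version-matched on the ETL server. This sketch also assumes simple `user.name` authentication; a secured cluster would need more.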

The second phase of the project was to transform the data in the ODS, build dimension and fact tables, and write these tables into Apache Kudu. (Kudu is a new addition to the open source Apache Hadoop ecosystem. It completes Hadoop’s storage layer to enable fast analytics on fast data.)

The last phase of this project was to implement Change Data Capture (CDC) for all SAP Data Services jobs that transferred data from the source (SAP ERP) to the end target (Apache Kudu). CDC is a process that identifies and processes only the records that have changed between a source table and a target table. The applications are wide ranging, but the benefits of CDC include significant savings in the time and cost of processing large amounts of data from a source. “GlueData developed various CDC methodologies to deal with the different SAP tables. For the tables holding a small amount of data in SAP we used the Data Services table compare transform. For the larger tables, we used a combination of source based CDC and Timestamp CDC methods,” says McCormick.
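The difference between comparing tables and filtering on timestamps can be sketched in a few lines of Python. This is a conceptual illustration, not the Data Services table compare transform or GlueData's jobs: the function names, the row layout (dictionaries with an `id` key) and the `changed_at` field are assumptions made for the example.

```python
from datetime import datetime


def table_compare(source_rows, target_rows, key):
    """Target-based CDC: diff the full source against the target by key.

    Reads every row on both sides, so it suits small tables.
    """
    target_by_key = {row[key]: row for row in target_rows}
    source_keys = set()
    inserts, updates = [], []
    for row in source_rows:
        source_keys.add(row[key])
        existing = target_by_key.get(row[key])
        if existing is None:
            inserts.append(row)      # key not yet in the target
        elif existing != row:
            updates.append(row)      # key exists but the values differ
    # Rows still in the target whose key no longer appears in the source.
    deletes = [r for k, r in target_by_key.items() if k not in source_keys]
    return inserts, updates, deletes


def changed_since(source_rows, last_run, ts_field="changed_at"):
    """Timestamp-based CDC: keep only rows modified after the previous run.

    Relies on the source maintaining a change timestamp and, on its own,
    cannot see rows that were physically deleted.
    """
    return [row for row in source_rows if row[ts_field] > last_run]


if __name__ == "__main__":
    source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b2"}, {"id": 3, "v": "c"}]
    target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 4, "v": "d"}]
    inserts, updates, deletes = table_compare(source, target, "id")
    print(inserts, updates, deletes)

    stamped = [
        {"id": 1, "changed_at": datetime(2020, 1, 1)},
        {"id": 2, "changed_at": datetime(2020, 1, 3)},
    ]
    print(changed_since(stamped, datetime(2020, 1, 2)))
```

The full comparison reads every row on both sides, which is affordable only for small tables. The timestamp filter touches just the delta, which is why it is the more practical choice as tables grow.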

The project was delivered over six months, with all of the work taking place remotely from South Africa by a team of experienced SAP data consultants and architects. “The success of this project places GlueData in an excellent position to offer future clients guidance and expertise in their projects of a similar nature,” states Brett Schreuder – Managing Director at GlueData Master Data Solutions.