Authors: Sumeet-INFA

Big Data Management


Today, I’m excited to announce the latest innovation from the Cloudera and Informatica partnership. Both companies are focused on helping customers adopt data lakes in the cloud, and we are working together to dramatically simplify their delivery.

A few months ago, Cloudera announced Altus, its new platform-as-a-service offering for data lakes in the cloud. Today, I’m pleased to announce a unique integration between Informatica Big Data Management and Cloudera Altus. This integrated solution enables customers to easily deploy large-scale data workloads in the cloud by reducing the operational overhead of managing a Hadoop cluster.

Cloudera Altus is a platform-as-a-service with services that enable you to analyze and process large-scale data sets in the cloud. Altus provisions clusters quickly and manages Hadoop clusters cost-effectively.

Informatica Big Data Management (BDM) provides the most advanced data integration platform for Hadoop.

With Big Data Management on Altus, users can focus on building the data pipeline logic without worrying about cluster management. For example, organizations that want better visibility into their data while eliminating data silos can use this approach to deliver data swiftly for analytics. Together, Big Data Management and Altus significantly simplify creating a data lake solution in the cloud.

Creating a Data Lake Solution using BDM and Altus

Use Informatica Big Data Management and Cloudera Altus to build and quickly deploy data lakes in the cloud, eliminate data silos, and process and analyze data more productively.


The following illustration shows a typical data lake solution implementation using BDM on Altus:

 

Step 1. Offload infrequently used data from the Enterprise Data Warehouse and load raw data in batches to a defined landing zone in Amazon S3. This frees up space in the Enterprise Data Warehouse.

Step 2. Collect and stream data generated by machines and sensors, including application and weblog files, directly to Amazon S3. Note that staging the data in a temporary file system or the data warehouse is no longer required.

Step 3. Discover and profile data stored on Amazon S3. Profile the data to better understand its structure and context. Easily add requirements for enterprise accountability, control, and governance for compliance with corporate and governmental regulations and business service level agreements.

Step 4. Parse and prepare data from weblogs, application server logs, or sensor data. Typically, these data types are in multi-structured or unstructured format, which can be parsed to extract features and entities and to apply data quality techniques. You can then execute pre-built transformations, as well as data quality and matching rules, in Cloudera Altus to prepare the data for analysis.

Step 5. After cleansing and transforming data on Cloudera Altus, move high-value curated data to Amazon S3 or to Redshift. From S3 or Redshift, Business Intelligence reports and applications can access the data directly.
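To make the parsing in Step 4 more concrete, here is a minimal Python sketch of extracting fields from a single weblog line and rejecting malformed rows. It is not how Big Data Management implements parsing; BDM uses its own pre-built transformations. The sketch assumes the logs are in Apache Common Log Format, which the solution description does not specify, and the names `LOG_PATTERN` and `parse_weblog_line` are hypothetical.

```python
import re
from typing import Optional

# Assumption: input lines follow Apache Common Log Format.
# The pattern and field names are illustrative only.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_weblog_line(line: str) -> Optional[dict]:
    """Extract named fields from one log line, or return None if it is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        # A very simple data quality rule: reject rows that do not parse.
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent; treat it as zero bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = '203.0.113.7 - - [10/Oct/2017:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(parse_weblog_line(sample))
```

Returning `None` for unparseable lines keeps the rejection decision with the caller, which can count or quarantine bad rows before the curated data moves on in Step 5.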

 

 

Prototyping a Data Lake Solution

The prototype in this section illustrates how to deploy a data lake solution using Cloudera Altus and Informatica Big Data Management. The example runs Cloudera Altus in the Amazon ecosystem and starts an on-demand Spark job with Altus.

To reduce the cluster management cost and operational overhead, use Big Data Management to create and terminate the Altus cluster on demand. To create the cluster, specify the cluster configuration details, including the instance type and the number of worker nodes.
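The cluster configuration details could be captured in a small, validated structure before the cluster is requested. The Python sketch below is only an illustration: `ClusterConfig` and its field names are hypothetical and do not come from the Altus or Big Data Management APIs, the values are placeholders, and the rule that a cluster needs at least one worker node is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical container for on-demand cluster settings (not a real Altus type)."""

    cluster_name: str
    instance_type: str  # for example, an Amazon EC2 instance type
    worker_nodes: int

    def __post_init__(self) -> None:
        # Fail early, before any cluster is requested.
        if not self.cluster_name:
            raise ValueError("cluster_name must not be empty")
        if not self.instance_type:
            raise ValueError("instance_type must not be empty")
        if self.worker_nodes < 1:
            # Assumption: at least one worker node is required.
            raise ValueError("worker_nodes must be at least 1")


# Placeholder values for illustration only.
config = ClusterConfig(
    cluster_name="datalake-demo",
    instance_type="m4.xlarge",
    worker_nodes=3,
)
print(config)
```

Validating the settings up front means a typo in the configuration surfaces immediately rather than after a cluster request has been submitted.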

Creating a Workflow

Create a workflow in Informatica Big Data Management to implement the data lake solution.

The following image shows a typical workflow for the data lake solution:

 

The workflow contains command and mapping tasks as described in the following steps:

Step 1. Creates an Altus cluster on demand, using cluster configuration parameters supplied by the user.


The following illustration shows an Altus cluster:

Step 2. Ingests data to Amazon S3. This mapping task runs on the newly created Altus cluster.

Step 3. Prepares the data by cleansing and integrating with other datasets. This mapping task runs on the newly created Altus cluster.

The mapping tasks are fully integrated with Cloudera Altus and run natively on the Spark engine.


The following image shows the mapping for data preparation:

Step 4. Terminates the Altus cluster after the mapping tasks finish processing.
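The four workflow steps follow a create, run, terminate pattern. The Python sketch below shows that control flow only. It calls no Informatica or Altus API: `run_workflow` and the `fake_*` callables are hypothetical stand-ins that just record what happened. Whether the real workflow terminates the cluster when a mapping task fails is not stated here; the `try`/`finally` is a design choice of this sketch so that an on-demand cluster is not left running after an error.

```python
from typing import Callable, List


def run_workflow(
    create_cluster: Callable[[], str],
    mapping_tasks: List[Callable[[str], None]],
    terminate_cluster: Callable[[str], None],
) -> None:
    """Create a cluster, run each mapping task on it in order, then terminate it."""
    cluster_id = create_cluster()          # Step 1
    try:
        for task in mapping_tasks:         # Steps 2 and 3
            task(cluster_id)
    finally:
        terminate_cluster(cluster_id)      # Step 4, runs even if a task raised


# Stand-in callables for illustration; they contact no real service.
events: List[str] = []


def fake_create() -> str:
    events.append("create")
    return "cluster-1"


def fake_ingest(cluster_id: str) -> None:
    events.append(f"ingest on {cluster_id}")


def fake_prepare(cluster_id: str) -> None:
    events.append(f"prepare on {cluster_id}")


def fake_terminate(cluster_id: str) -> None:
    events.append(f"terminate {cluster_id}")


run_workflow(fake_create, [fake_ingest, fake_prepare], fake_terminate)
print(events)
```

Passing the steps in as callables keeps the ordering logic separate from whatever mechanism actually creates the cluster or runs the mappings.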

Monitoring Spark Jobs

The Informatica monitoring console can be used to monitor Spark jobs that run on the Altus cluster.


The following image shows the Informatica monitoring console with Spark jobs running on Altus:

The following image shows Informatica Spark jobs running on the Altus “Jobs” page:


 

The following illustration shows a completed Informatica Spark job on Altus:

Video: Data Lake Solution using BDM and Altus

To learn more, watch Informatica Big Data Management in action solving the “Data Lake on Altus” use case: https://network.informatica.com/videos/1113

Looking Ahead

As a strategic partner to Cloudera, Informatica is delighted to announce this new solution, which showcases Informatica Big Data Management and Cloudera Altus technologies working together. Integrating Big Data Management with Altus will reduce the cost and operational overhead of managing Hadoop clusters for data engineers and IT administrators alike.

To learn more, visit informatica.com/big-data-ready/