
Introduction

Informatica® Big Data Management allows users to build big data pipelines that can be seamlessly ported to any big data ecosystem, such as Amazon AWS, Azure HDInsight, and so on. A pipeline built in Big Data Management (BDM) is known as a mapping and typically defines a data flow from one or more sources to one or more targets, with optional transformations in between. The mappings and other associated data objects are stored in a Model Repository via a Model Repository Service (MRS). In the design-time environment, mappings are often organized into folders within projects. A mapping can refer to objects across projects and folders. Mappings can be grouped together into a workflow for orchestration. A workflow defines the sequence of execution of various objects, including mappings.

 

Deployment process overview

For mappings and workflows to be deployed and executed in the run-time, they are grouped into applications. An application is a container that holds executable objects such as mappings and workflows. Applications are defined in the Developer tool and deployed to a Data Integration Service for execution. Once deployed, the Data Integration Service persists a copy of the application. An application can also be deployed to a file known as an Informatica application archive (.iar) file, which can subsequently be deployed to a Data Integration Service in the same or a different domain. The overall process flow for deployment in BDM is as shown here:

BDM Deployment Process

Automation

The process of deploying a design-time application to an Informatica application archive (.iar) file can be executed via the infacmd CLI with the Object Import Export (oie) plugin. A sample of the deploy application command is as follows:

infacmd.sh oie deployApplication -dn $infaDomainName -un $infaUserName -pd $infaPassword -sdn $infaSecurityDomain -rs $designTimeMRSName -ap $applicationPath -od $Output_Directory

 

The above example uses several user-defined environment variables, which can be named according to individual organization standards. The password provided is case sensitive. Alternatively, an encrypted password string can be stored in the predefined environment variable INFA_DEFAULT_DOMAIN_PASSWORD. When an encrypted password is used, the -pd option is not required. This command is documented in detail in the Informatica documentation at Command Reference Guide → infacmd OIE Command Reference → Deploy Application.

 

Once the application archive file is created, it can optionally be checked into GIT or another version control system for audit and tracking purposes.

 

Subsequently, the application archive file can be deployed to a Data Integration Service in the same or a different domain. Typically, the application archive file is created out of a development domain and is eventually deployed into QA, UAT, and Production domains. This can be achieved via the infacmd CLI with the Data Integration Service (dis) plugin. A sample of such a deployment command is as follows:

infacmd.sh dis deployApplication -dn $infaDomainName -un $infaUserName -pd $infaPassword -sdn $infaSecurityDomain -sn $dataIntegrationServiceName -a $applicationName -f $applicationArchiveFileName

 

This command is documented in detail in the Informatica documentation at Command Reference Guide → infacmd DIS Command Reference → Deploy Application. Once deployment is successful, the listApplications and listApplicationObjects commands in the dis plugin can be used to list the deployed applications and their contents, respectively. This information can be used for post-deployment verification and sanity checks.
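As an illustration, the post-deployment check can be scripted. The following is a minimal sketch, assuming infacmd.sh is on the PATH and the same environment variables as in the samples above are set; the exact listApplications options should be confirmed against the infacmd DIS Command Reference for your version.

# Minimal sketch of a post-deployment sanity check. Assumes infacmd.sh is on the
# PATH and the same environment variables as in the samples above; confirm the
# listApplications options against the infacmd DIS Command Reference.
import os
import subprocess

env = os.environ
result = subprocess.run(
    ["infacmd.sh", "dis", "listApplications",
     "-dn", env["infaDomainName"], "-un", env["infaUserName"],
     "-pd", env["infaPassword"], "-sdn", env["infaSecurityDomain"],
     "-sn", env["dataIntegrationServiceName"]],
    capture_output=True, text=True, check=True)

# Fail the pipeline if the expected application is not listed on the service.
assert env["applicationName"] in result.stdout, "application not found after deployment"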

 

Integration with Jenkins

The CLI described above can be used to initiate the deployment process from within a Jenkins task. A "Build Step" of type "Execute Shell" can be added to the Jenkins job, and the step can be configured to execute one of the infacmd commands as shown in the example below.

 

BDM deployment in Jenkins

 

A sample template file for Jenkins is attached (Jenkins-Template-App-Deployment). The template contains the commands to perform the following steps; a minimal scripted sketch of the same flow follows the list:

  1. Create an Informatica Application Archive (.iar) file
  2. Commit the application archive file to GIT
  3. Deploy the application into DIS
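The sketch below outlines the same three steps in a Python wrapper that a Jenkins "Execute Shell" step could invoke. It is illustrative only: it assumes the same environment variables as the infacmd samples above, a hypothetical IAR_REPO_DIR variable pointing at a local Git working copy, and that the generated archive follows an <applicationName>.iar naming convention. The attached template remains the reference.

# Illustrative outline of the three template steps. IAR_REPO_DIR is a hypothetical
# variable pointing at a local Git working copy used to track the .iar files.
import os
import subprocess

env = os.environ
repo_dir = env.get("IAR_REPO_DIR", ".")

def run(cmd, cwd=None):
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)

# 1. Create the Informatica application archive (.iar) file from the design-time MRS.
run(["infacmd.sh", "oie", "deployApplication",
     "-dn", env["infaDomainName"], "-un", env["infaUserName"], "-pd", env["infaPassword"],
     "-sdn", env["infaSecurityDomain"], "-rs", env["designTimeMRSName"],
     "-ap", env["applicationPath"], "-od", repo_dir])

# 2. Commit the archive file to version control for audit and tracking.
run(["git", "add", "."], cwd=repo_dir)
run(["git", "commit", "-m", "Application archive for deployment"], cwd=repo_dir)

# 3. Deploy the archive to the target Data Integration Service.
iar_file = os.path.join(repo_dir, env["applicationName"] + ".iar")  # assumed naming convention
run(["infacmd.sh", "dis", "deployApplication",
     "-dn", env["infaDomainName"], "-un", env["infaUserName"], "-pd", env["infaPassword"],
     "-sdn", env["infaSecurityDomain"], "-sn", env["dataIntegrationServiceName"],
     "-a", env["applicationName"], "-f", iar_file])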

 

Summary

Informatica BDM jobs can be deployed using Jenkins without any need for third-party plugins. infacmd CLI commands can be used directly in Jenkins, just as they can be used in an enterprise scheduling tool.

 

Contributors

  • Keshav Vadrevu, Principal Product Manager
  • Paul Siddal, Big Data Presales Specialist

 

 

 

Dear Customer,

 

The Informatica Global Customer Support Team is excited to announce an all-new technical webinar and demo series, Meet the Experts, in partnership with our technical product experts and Product Management. These technical sessions are designed to encourage interaction and knowledge gathering around some of our latest innovations and capabilities across Data Integration, Data Quality, Big Data, etc. In these sessions, we will strive to provide you with as much technical detail as possible, including new features and functionality, and where relevant, show you a demo or product walk-through as well.

 

Topic and Agenda

 

Topic: Meet the Experts Webinar - Sizing and Tuning for Spark in Informatica Big Data 10.2.1

Date: 22 August 2018

Time: 8:00 AM PST

Duration: 1 Hour

 

Informatica Big Data Management is the industry’s best solution for faster, more flexible, and more repeatable data ingestion and integration on Hadoop. Hundreds of organizations have adopted Informatica Big Data Management to take advantage of the power of Hadoop without the risks and delays of manual and specialized approaches. To help you get the most out of Big Data Management, join this webinar to learn best practices for high-performance tuning, sizing, and security.

 

Learn about:

 

  • Sizing & Capacity Planning for Informatica’s platform and the underlying Hadoop cluster
  • Special Sizing Guidelines for Cloud environments like AWS and Azure
  • Optimal Deployment Architectures
  • Performance Tuning Tips for getting the most out of engines like Apache Spark

 

Speaker: Vishal Kamath, Senior Manager, Performance

 

-------------------------------------------------------

To register for this meeting

-------------------------------------------------------

1. Go to https://informatica-events.webex.com/informatica-events/j.php?RGID=r3184b4bb4fb135c8bc85c3f88874273c

2. Register for the meeting.

3. Check for the confirmation email with instructions on how to join

 

To view in other time zones or languages, please click the link:

https://informatica-events.webex.com/informatica-events/j.php?RGID=rd12dbd328d01a6f900a9533009559810

 

 

-------------------------------------------------------

For assistance

-------------------------------------------------------

1. Go to https://informatica-events.webex.com/informatica-events/mc

2. On the left navigation bar, click "Support".

 

You can also contact us at:

network@informatica.com

 

Regards,

MeetTheExperts Team

This blog post shows how to call a webservice in BDM using Spark.

We will be using the Python transformation introduced in BDM 10.2.1 to call the webservice.

 

The Java transformation is another option for calling the webservice.

 

Pre-requisites

 

Python and the jep package need to be installed on the BDM DIS server; refer to the installation documentation to configure the Python transformation with BDM.

 

After the Python installation, edit the Hadoop connection by going to Window --> Preferences --> Connections and clicking on your Hadoop connection.

 

 

 

Edit the Hadoop connection and go to the Spark tab.

 

 

 

Under the Spark tab, go to Advanced properties and click the Edit button. The first three properties in the screenshot are the Python properties that come by default; fill in the values for those three properties according to your Python installation.

 

 

Web-service Details

 

 

We will be using the following webservice to get the states for any given country

 

http://services.groupkt.com/state/get

 

For example, if we pass the country “USA” to the above URL, the webservice returns all the state information for the USA along with other details such as area, capital, largest city, etc.

 

To test the webservice for the USA, open the following URL in your browser and it will return JSON output.

 

http://services.groupkt.com/state/get/USA/all
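The same check can also be done from any machine with Python and the requests package; this is just a quick connectivity test, separate from the mapping itself.

# Quick connectivity check of the webservice, outside of BDM (not part of the mapping).
import requests

response = requests.get("http://services.groupkt.com/state/get/USA/all")
response.raise_for_status()   # raise an error if the service is unreachable or fails
print(response.json())        # JSON payload with the state details for the USA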

 

 

 

Calling the web-service in BDM Mapping

 

 

We will create 2 mappings in BDM

 

In the first mapping we will pass the country names from an input file, then use the Python transformation to call the webservice, and finally write the output to an HDFS file. The output will be a JSON file.

 

In the second mapping we will parse the JSON output from the first mapping and write it to Hive.

 

 

Mapping 1:

 

We have an input file on HDFS with the following contents. We will pass the country names from this input file and get the states.

 

Create a flat file data object in the Developer client; in the advanced properties, go to the Read section and point to the HDFS connection and directory.

 

 

 

Create a new mapping, drag the flat file object into the mapping, and choose the Read operation.

 

 

 

Add a Python transformation to the mapping and drag country_name to the input of the Python transformation.

Create an output port for the Python transformation and call it states_data_json. The Python transformation ports should look like below.

 

 

Go to the Python tab of the Python transformation and add the following code:

 

 

import requests
import json

# country_name is the input port of the Python transformation;
# states_data_json is its output port.
input_string = country_name
input_url = "http://services.groupkt.com/state/get/"

# Call the REST webservice and serialize the JSON response to the output port.
states = requests.get(input_url + input_string + "/all")
states_data = states.json()
states_data_json = json.dumps(states_data)

 

 

Connect the output port of the Python transformation to a flat file data object writing to HDFS.

 

 

 

The target data object properties look like below

 

 

Change the execution mode of the mapping to run on Spark.

 

 

Execute the mapping and verify the status of the mapping in the admin console.

 

 

 

Verify the output of the mapping on HDFS; you will see the output in JSON format.

 

 

 

Mapping 2:

In this mapping we will parse the JSON output file from the previous mapping and convert it to structured (relational) data.

 

Create a complex file data object by right-clicking on Physical Data Objects -> New -> Physical Data Object.

 

 

Choose complex file data object and click Next

 

 

Name the complex file data object “cfr_states”, click the Browse button under Connection and choose your HDFS connection, then under “Selected Resources” click the Add button.

 

 

In the Add Resource dialog, navigate to the HDFS file location (the output file location from the previous mapping), click on the JSON file, and click OK.

 

 

 

 

Click finish on the next step

 

 

Now create a Data Processor transformation by right-clicking on Transformations -> New -> Transformation.

 

 

Choose data processor transformation from the list of transformations

 

 

Name the Data Processor transformation “dp_ws_state” and choose the “Create a data processor using a wizard” option.

 

 

 

Since the input to the Data Processor transformation is JSON, choose JSON in the next step and click Next.

 

 

 

Make sure you have sample output from the first mapping on the Developer machine, choose the “Sample JSON file” option, browse to the sample JSON file, and click Next.

 

 

 

Choose relational output and click finish

 

 

 

After you click on the finish button the data processor transformation will look like below

 

Create a Hive table using the following DDL in your target database and import the Hive table as a relational data object into the Developer client:

 

CREATE TABLE infa_pushdown.ws_states (
    FKey_states BIGINT,
    id DOUBLE,
    country STRING,
    name STRING,
    abbr STRING,
    area STRING,
    largest_city STRING,
    capital STRING
);

 

 

Now drag the complex file reader, the Data Processor transformation, and the Hive target into the mapping. Then connect the data port from the complex file reader to the input of the Data Processor, and the output of the Data Processor to the Hive target.

 

The final mapping should look like below.

 

 

The mapping was tested in BDM 10.2.1; in this version the Data Processor transformation is not supported in Spark mode, so we will run the second mapping using Blaze. Once Data Processor support is added on Spark, the second mapping can be eliminated by adding the Data Processor transformation to the first mapping. The screenshot shows Blaze as the execution engine.

 

 

Execute the mapping and verify the output of the target table by running the data viewer on target data object

 

The Past

Introduced in the early days of big data, MapReduce is a framework that can be used to develop applications that process large amounts of data in a distributed computing environment.

A typical MapReduce job splits the input data set into chunks that are processed in parallel by map tasks. The framework sorts the output of the map tasks and passes the output to reduce tasks. The result is stored in a file system such as Hadoop Distributed File System (HDFS).

In 2012, the Informatica Big Data Management (BDM) product introduced the ability to push down mapping logic to Hadoop clusters by leveraging the MapReduce framework. Big Data Management translated mappings into HiveQL, and then into MapReduce programs, which were executed on a Hadoop cluster.

By using this technique to convert mapping logic to HiveQL and pushing its processing to the Hadoop cluster, Informatica was the first (and still leading) vendor to offer the ability to push down processing logic to Hadoop without having to learn MapReduce. Developers simply had to select the "Hive" checkbox in the runtime properties of a mapping to run the mapping in MapReduce mode. This enabled hundreds of BDM customers to reuse their traditional Data Integration jobs and onboard them to the Hadoop ecosystem.

MapReduce as execution engine for BDM Mappings

The Present

With time, the Hadoop ecosystem evolved. For starters, MapReduce is no longer the only job processing framework. Tez, Spark and other processing frameworks are used throughout the industry as viable alternatives for MapReduce.

Recently, Spark has been widely adopted by vendors and customers alike. Several ecosystems such as Microsoft Azure use Spark as their default processing framework.

UPDATE: Hadoop distribution vendors have started to move away from MapReduce. As of HDP 3.0, MapReduce is no longer supported as an execution engine. Please refer to this link for more details on Hortonworks' recommendations on execution engines: Apache Hive 3 architectural overview. The Hive execution engine (including MapReduce) was deprecated in the Big Data Management 2018 Spring release and reached End of Life (EOL) in the Big Data Management 2019 Spring release (10.2.2). Hive will continue to be supported as a source and target in other execution modes such as Blaze and Spark.

Big Data Management adopted Spark several years ago and currently supports the latest versions of Spark. Please refer to Product Availability Matrix for the Spark version support. Big Data Management supports running Data Integration, Data Quality and Data Masking transformations using Spark.

For the most part, developers do not have to make any changes to mappings to leverage Spark. To use Spark to run mappings, they simply change the execution engine from Hive to Spark as shown in the following screenshot from Big Data Management 2018 Spring Release (BDM 10.2.1)

Migration from MapReduce to Spark

As a result of this simple change, the Data Integration Service generates Spark Scala code instead of MapReduce and executes it on the cluster.

The deprecation and End of Life of MapReduce

To accommodate the evolution of the Hadoop ecosystem, Informatica announced the deprecation of MapReduce in Big Data Management in the 2018 Spring release and announced End of Life in the 2019 Spring release (BDM 10.2.2). Customers previously leveraging MapReduce are recommended to migrate to Spark. Customers currently on older versions of Big Data Management (including the 2018 Spring release / BDM 10.2.1) are strongly recommended to migrate to the Spark execution engine by selecting the Spark checkbox in the Run-time properties of the mapping.

Hive will continue to be supported as Source and Target in other execution modes such as Blaze and Spark.

Informatica's EOL of MapReduce applies only to Big Data Management mappings that use MapReduce as the run-time engine. It will not affect any Hadoop components (such as SQOOP) that internally rely on MapReduce or other third-party components. For example, when a customer uses SQOOP as a source in a Big Data Management mapping, Big Data Management will invoke SQOOP, which internally invokes MapReduce for processing. This will continue to be the case even after End of Life for the MapReduce execution mode.

Migration from MapReduce

To mitigate the continual evolution of big data ecosystems, Informatica recommends that developers practice an inclusive strategy for running mappings. In the Run-time mappings properties, select all Hadoop run-time engines, as shown in the following screenshot:

Polyglot computing in Big Data Management (BDM) including Spark

When all Hadoop run-time engines are selected, Informatica chooses the right execution engine at runtime for processing. Beginning with Big Data Management version 2018 Spring release (BDM 10.2.1), mappings default to Spark when Spark is selected with other execution engines. When customers use this inclusive strategy, the mappings with all Hadoop engines selected will automatically run in Spark mode.

Mappings that have only Hive (MapReduce) selected can be changed in bulk to leverage Spark. Several infacmd commands allow you to change the execution engine for mappings. Mappings that already exist as objects in the Model repository can be migrated to Spark by using one of the following commands:

  • infacmd mrs enableMappingValidationEnvironment
  • infacmd mrs setMappingExecutionEnvironment

These commands receive the Model repository and project names as input and change the execution engine for all mappings in the given project. The MappingNamesFilter property can be used to provide a comma-separated list of mappings to change, and wildcard characters can be used to define mapping names. For more information about using these commands, see the Informatica Command Reference Guide. Similarly, for mappings that have been deployed to the Data Integration Service as part of an application, you can use the following commands to change the execution engine for multiple mappings:

  • infacmd dis enableMappingValidationEnvironment
  • infacmd dis setMappingExecutionEnvironment

 

Summary

Starting with the Big Data Management 2019 Spring release (BDM 10.2.2), Hive execution mode (including MapReduce) is no longer supported. Customers are recommended to migrate to Spark.

This blog is a short overview of Apache Airflow and shows how to integrate BDM with it. I also have a sample template to orchestrate BDM mappings. The same concept can be extended to PowerCenter and non-BDM mappings.

 

Apache Airflow overview

 

Airflow is a platform to programmatically author, schedule and monitor workflows.

 

Airflow is not a data streaming solution. Tasks do not move data from one task to another (though tasks can exchange metadata). Airflow is not in the Spark Streaming or Storm space; it is more comparable to Oozie or Azkaban.

 

Generally, Airflow works in a distributed environment, as you can see in the diagram below. The airflow scheduler schedules jobs according to the dependencies defined in directed acyclic graphs (DAGs), and the airflow workers pick up and run jobs with their loads properly balanced. All job information is stored in the meta DB, which is updated in a timely manner. The users can monitor their jobs via a shiny Airflow web UI and/or the logs.

 

 

Installing Apache Airflow

 

The following installation method is for non-production use. Refer to the Airflow documentation for production deployments.

 

Apache Airflow has various operators; the full list is available at the link below. An operator describes a single task in a workflow.

 

https://github.com/apache/incubator-airflow/tree/master/airflow/operators

 

To trigger Informatica BDM mappings we will be using the BashOperator, i.e., triggering the mappings through the command line.

 

  1. If Apache Airflow is running on a machine other than the Informatica node, install the Informatica command line utilities on the Airflow worker nodes
  2. Python

 

 

Create a directory /opt/infa/airflow

 

 

The easy way to install Airflow is to run the following command. Pip is a Python utility for installing Python packages.

 

pip install apache-airflow

Set the AIRFLOW_HOME environment variable to this directory (/opt/infa/airflow in this example).

 

Create a folder called “dags” inside the AIRFLOW_HOME folder.

 

Initialize the Airflow DB by typing the command “airflow initdb”. This is where the metadata will be stored. We will be using the default SQLAlchemy (SQLite) database that comes with Airflow; if needed, the configuration can be modified to use MySQL or PostgreSQL as the backend for Airflow.

 

 

If initdb shows any errors, it is most likely because of missing Airflow packages; a complete list of packages and the commands to install them is at the link below.

 

https://airflow.apache.org/installation.html

 

Start the Airflow web UI using the command “airflow webserver” (it listens on port 8080 by default).

 

Start the Airflow scheduler using the command “airflow scheduler”.

 

 

Log in to the Airflow UI using the URL http://hostname:8080. If you have installed the examples, you should see the example DAGs listed in the UI.

 

 

Creating a DAG for BDM Mappings

 

 

For the demo we deployed the following three BDM mappings (as applications) to the DIS.

 

Application_m_01_Get_States_Rest_Webservice

Application_m_02_Parse_Webservice_Output

Application_m_Read_Oracle_Customers_Write_Hive_Python

 

 

The 3 applications need to be orchestrated in the following way.

 

  1. Application_m_01_Get_States_Rest_Webservice and Application_m_Read_Oracle_Customers_Write_Hive_Python can run in parallel
  2. Application_m_02_Parse_Webservice_Output will run only if Application_m_01_Get_States_Rest_Webservice is successful

 

 

 

Save the following code as airflow_bdm_sample.py under the /opt/infa/airflow/dags folder.

There are different ways to call the infacmd RunMapping command; for example, the command can be put in a shell script and the script can be called from the DAG.

 

 

#Start Code

import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


# these args will get passed on to each operator
# you can override them on a per-task basis during operator initialization
default_args = {
    'owner': 'infa',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'start_date': datetime.now() - timedelta(seconds=10),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
    # 'wait_for_downstream': False,
    # 'dag': dag,
    # 'adhoc':False,
    # 'sla': timedelta(hours=2),
    # 'execution_timeout': timedelta(seconds=300),
    # 'on_failure_callback': some_function,
    # 'on_success_callback': some_other_function,
    # 'on_retry_callback': another_function,
    # 'trigger_rule': u'all_success'
}

dag = DAG(
    'Informatica_Bigdata_Demo',
    default_args=default_args,
    description='A simple Informatica BDM DAG')

# Print the start date and time of the DAG run
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

# Run the mapping that calls the REST webservice
t2 = BashOperator(
    task_id='mapping_calling_webservice',
    depends_on_past=False,
    bash_command='infacmd.sh ms RunMapping -dn infa_dom_1021 -sn dis_bdm_cdh -un Administrator -m m_01_Get_States_Rest_Webservice -a Application_m_01_Get_States_Rest_Webservice -pd admin',
    dag=dag)

# Run the mapping that parses the JSON output of t2
t3 = BashOperator(
    task_id='mapping_parsing_json',
    depends_on_past=True,
    bash_command='infacmd.sh ms RunMapping -dn infa_dom_1021 -sn dis_bdm_cdh -un Administrator -m m_02_Parse_Webservice_Output -a Application_m_02_Parse_Webservice_Output',
    dag=dag)

# Run the Oracle-to-Hive mapping in parallel with t2
t4 = BashOperator(
    task_id='Read_Oracle_load_Hive',
    depends_on_past=False,
    bash_command='infacmd.sh ms RunMapping -dn infa_dom_1021 -sn dis_bdm_cdh -un Administrator -m m_Read_Oracle_Customers_Write_Hive_Python -a Application_m_Read_Oracle_Customers_Write_Hive_Python',
    dag=dag)

# t2 and t4 run after t1 (in parallel); t3 runs only after t2 succeeds
t1.set_downstream(t2)
t2.set_downstream(t3)
t4.set_upstream(t1)

 

 

# End code

 

 

Restart the Airflow webserver and the Informatica_Bigdata_Demo DAG will appear in the list of DAGs.

 

 

 

Click on the DAG and go to Graph View; it gives a better view of the orchestration.

 

 

 

Run the DAG and you will see the status of the DAG runs in the Airflow UI as well as in the Informatica Monitor.

 

 

 

The above DAG code can be extended to get the mapping logs and the status of the runs.
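For example, the commented-out callback hooks in default_args can be used to react to task failures. A minimal sketch of such an extension is shown below; the notification logic is a placeholder and would need to be replaced with whatever alerting or log-collection mechanism you use.

# Minimal sketch: a failure callback for the DAG above. Reference it from
# default_args by replacing the commented 'on_failure_callback' line.
def notify_failure(context):
    """Called by Airflow when a task instance fails."""
    ti = context["task_instance"]
    print("Task %s failed for run %s" % (ti.task_id, context["ds"]))
    # Placeholder: send an email/Slack alert or collect the Informatica mapping log here.

# In default_args:
#     'on_failure_callback': notify_failure,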

What are we announcing?

Informatica Big Data Release 10.2.1

 

Who would benefit from this release?

This release is for all customers and prospects who want to take advantage of the latest Big Data Management, Big Data Quality, Big Data Streaming, Enterprise Data Catalog, and Enterprise Data Lake capabilities.

 

What’s in this release?

This update provides the latest ecosystem support, security, connectivity, cloud, and performance while improving the user experience.

 

Big Data Management (BDM)

 

Enterprise Class

 

  • Zero client configuration: Developers can now import the metadata from Hadoop clusters without configuring Kerberos Keytabs and configuration files on individual workstations by leveraging the Metadata Access Service
  • Mass ingestion: Data analysts can now ingest relational data into HDFS and Hive with a simple point and click interface and without having to develop individual mappings. Mass Ingestion simplifies ingestion of thousands of objects and operationalizes them via a non-technical interface
  • CLAIRE integration: Big Data Management now integrates with Intelligent Structure Discovery (that is part of Informatica Intelligent Cloud Services) to provide machine learning capabilities in parsing the complex file formats such as Weblogs
  • SQOOP enhancements: SQOOP connector has been re-architected to support high concurrency and performance
  • Simplified server configuration: Cluster configuration object and Hadoop connections are enhanced to improve the usability and ability to perform advanced configurations from the UI
  • Increased developer productivity: Developers can now use the "Run mapping using advanced options" menu to execute undeployed mappings by providing parameter file/sets, tracing level and optimizer levels in the Developer tool. Developers can also view optimized mappings after the parameter binding is resolved using the new "Show mapping with resolved parameters" option.
  • PowerCenter Reuse enhancements: Import from PowerCenter functionality has been enhanced to support import of PowerCenter workflows into Big Data Management
  • GIT Support: Big Data Management administrators can now configure GIT (in addition to Perforce and SVN) as the external versioning repository

 

Advanced Spark Support

 

  • End to end functionality: End to end Data Integration and Data Quality use-cases can now be executed on the Spark Engine. New and improved functionality includes Sequence Generator transformation, Pre/Post SQL support for Hive, support for Hive ACID Merge statement on supported distributions, Address Validation and Data Masking.
  • Data science integration: Big Data customers can now integrate pre-trained data science models with Big Data Management mappings using our new Python transformation.
  • Enhanced hierarchical data processing support: With support for Map data types and support of Arrays, Structs and Maps in Java transformations, customers can now build complex hierarchical processing mappings to run on the Spark engine. Enhancements in gestures and UI enable customers to leverage this functionality in a simple yet effective manner
  • Spark 2.2 support: Big Data Management now uses Spark 2.2 on supported Hadoop distributions

 

Cloud

 

  • Ephemeral cluster support: With out-of-the-box ephemeral cluster support for AWS and Azure ecosystems, customers can now auto deploy and auto scale compute clusters from a BDM workflow and push the mapping for processing to the automatically deployed clusters
  • Cloudera Altus support: Cloudera customers can now push the processing to Cloudera Altus compute clusters.
  • Improved AWS connectivity: Amazon S3 and Redshift connectors have received several functional, usability and performance updates
  • Enhanced Azure connectivity: Azure WASB/Blob and SQL DW connectors have received several functional, usability and performance updates

 

Platform PAM Update

 

  • Oracle 12cR2: Added
  • SQL Server 2017: Added
  • Azure SQL DB (PaaS / DBaaS, single database model): Added
  • SQL Server 2008 R2 & 2012 (EOL): Dropped
  • IBM DB2 9.7 & 10.1 (EOL): Dropped
  • SUSE 12 SP2: Added
  • SUSE 12 SP0: Dropped
  • AIX: Not supported
  • Solaris: Not supported
  • Windows Server: Not supported

Model Repository - Version Controlled

  • Git: Added

  • Oracle Java 1.8.0_162: Updated
  • IBM JDK: NA
  • Tomcat 7.0.84: Updated

 

Big Data Quality (BDQ)

 

Capabilities

  • Enable data quality processing on Spark
  • Updated Address Verification Engine (AddressDoctor 5.12)
  • Support for custom schemas for reference tables
  • Updated workflow engine

Benefits

  • Support Spark scale and execution with Big Data Management
  • Enhanced Address Verification engine with world-wide certifications
  • Flexible use of reference data with enterprise DB procedures
  • Faster start times for workflow engine

 

Big Data Streaming (BDS)

 

Change in Product Name: The product name has changed from "Informatica Intelligent Streaming" to "Big Data Streaming"

 

Azure Cloud Ecosystem Support

 

  • Endpoint Support: Azure EventHub as source & target and ADLS as target
  • Cloud deployment: Run streaming jobs in Azure cluster on HDInsight

 

Enhanced Streaming Processing and Analytics

 

  • Stateful computing support on streaming data
  • Support for masking streaming data
  • Support for normalizer transformation
  • Support for un-cached lookup on HBase tables in streaming
  • Kafka Enhancements - Kafka 1.0 support & support for multiple Kafka versions

 

New Connectivity and PAM support

 

  • Spark Engine Enhancements - Spark 2.2.1 support in streaming, Truncate table, Spark concurrency
  • Relational DB as target - SQL Server and Oracle
  • New PAM - HDInsight
  • Latest version support on Cloudera, Hortonworks, EMR

 

Enterprise Data Lake (EDL)

 

Change in Product Name: The product name has changed from "Intelligent Data Lake" to "Enterprise Data Lake"

 

Core Data Preparation

 

  • Data Preparation for JSON Lines (JSONL) Files: Users can add JSONL files to a project and structure the hierarchical data in row-column form. They can extract specific attributes from the hierarchy and can expand (or explode) arrays into rows in the worksheet.
  • Pivot and UnPivot: Users can pivot or unpivot columns in a worksheet to transpose/reshape the row and column data for advanced aggregation and analysis.
  • Categorize and One-hot-encoding functions: Users can easily categorize similar values into fewer values to make analysis easier. With one-hot-encoding, the user can convert categorical values in a worksheet to numeric values suitable for machine learning algorithms.
  • Column Browser with Quality Bar: A new panel for browsing columns is added to the left panel in the worksheet. This easy to use column browser interface allows users to show/hide columns, search for columns, highlight columns in the worksheet, etc. The panel also has a Quality bar that shows unique, duplicate and blank value count percentages within the column. The panel can also show any associated glossary terms.
  • Project Level Graphical View: For a project with a large number of assets, the graphical view helps users understand the relationships between input data sources, sheets created, assets published, and Apache Zeppelin notebooks created. Users can navigate to the asset, notebook or the worksheet directly.
  • Insert recipe step, add a filter to an existing step: Users can insert a new step at any location in the recipe. They can also add/modify existing filters for any recipe step.
  • Data Type Inferencing optimization: Users can revert undesired inferencing done by data preparation engine and apply appropriate functions. They can revert or re-infer types as needed.
  • Show where the data in a column comes from: The column overview in the bottom panel now has a Source property that shows if the column corresponds to a physical input source column, another worksheet or a step in the recipe. If the user hovers over a data source name, the application shows details of the formula when available and highlights the appropriate recipe step.
  • UX Improvements in Filter-in-effect, Sampling, Join and Apply Rule panels: The user interface has been improved for clarity of icons and language used, visibility of information and button and better user flow for these panels. Users can also input constant values as inputs in the Apply Rule panel for text based user inputs.

 

Self-service and Collaboration

 

  • Self-service scheduling: Data Analysts now have the ability to schedule import, publish and export activities. The Import/Publish/Export wizard offers the choice to perform the activity now, or to schedule it. For publish, a “snapshot” of recipes is saved for execution at the scheduled time. Users can continue to work on the project and modify recipes without affecting scheduled activity.
    The “My Scheduled Activities” tab provides details of upcoming activities. The “Manage My Schedules” tab provides details of schedules and enables users to modify schedules.
    Scheduled activities can be monitored on the My Activities page. Functionally it has the same effect as running the activity manually. All the schedules created in Enterprise Data Lake and activities scheduled in Enterprise Data Lake are also visible in the Administrator Console tool.
  • Project History: Users (and IT/Governance staff) can view the important events that happened within a given project. These include events related to Project, Collaborators, Assets, Worksheets, Publications, Scheduled Publications, Notebook etc.
  • Copy-Paste Recipe Steps: Users can copy specific steps or the whole recipe and paste into another sheet in the same project or another project. There is also a way to map the input columns used in the source sheet to the columns present in the target sheet. This enables reuse of each other’s or their own work in the creation of repetitive steps.
  • Quick Filters for asset search in the data lake: In the search results, users have a single-click filter to get all the assets in the data lake that match the search criteria.
  • Recommendation Card UX Improvements: The Recommendation cards in the Project view now show the reason an asset was recommended for inclusion in the project, and what action user should take.
  • Details of Source Filters during Publish: During Publication, the Publish Wizard shows the details of "Source Filters" so the user understands the impact of including or not including the filters.

 

Enterprise Focus

 

  • Single Installer for Big Data Management, Enterprise Data Catalog and Enterprise Data Lake: The installation and upgrade flows have been improved and simplified with a single installer. Enterprise Data Lake customers can now install all three products in a single install. The total size of the single installer is just ~7GB due to better compression, as compared to the previous combined size of ~13GB. The process requires fewer domain restarts, and additional configurations can also be enabled in the same single flow.
  • Blaze as Default Execution Engines for Enterprise Data Lake: All Enterprise Data Lake processes using Big Data Management mapping execution now use Blaze as the default engine. This has improved performance and consistency.
  • SAML based SSO: Enterprise Data Lake now supports SAML based Single-Sign-On.
  • Lake Resource Management UI: Administrators can manage the Enterprise Data Catalog resources that represent the external data sources and metadata repositories from which scanners extract metadata for use in the data lake. The Lake Resource Management page also verifies the validity of resources, the presence of at least one Hive resource, etc. so that Enterprise Data Lake functionality is usable. Changes done through the Lake Resource Management page do not require a service restart.
  • Data Encryption for Data Preparation Service node: The temporary data created on Data Preparation Service nodes is encrypted for better security.
  • Demo Version of IT Monitoring Dashboard: A dashboard created in Apache Zeppelin allows administrators to monitor Enterprise Data Lake user activities. The dashboard is not a product feature, but an example to show what is possible with the audit information. The dashboard is an Apache Zeppelin Notebook built on top of the Enterprise Data Lake user event auditing database. The Zeppelin Notebook and associated content are available on request, but it is unsupported. The Audit mechanism has been changed and improved now to support direct queries using JDBC. 
  • Performance Improvement in Import process using CLAIRE: Using the profiling metadata information available in CLAIRE, the import process optimizes the number of sub-processes created thereby improving the overall performance of Import

 

Enterprise Data Catalog (EDC)

 

  • Intelligence
    • Enhanced Smart Discovery: By clustering similar columns from across data sources, EDC enables users to quickly associate business terms as well as classify data elements. Unsupervised clustering of similar columns is now based on names, unique values and patterns in addition to the existing data overlap similarity.
    • Enhanced Unstructured Data Discovery (Tech Preview): Enhanced unstructured data support for accurate domain discovery using NLP and new file system connectivity.
    • New Data Domain Rules: Override rules and new scan options for more granular control on rule based data domain inference.
  • Connectivity
    • New Filesystems: Added support for cataloging of Sharepoint, Onedrive, Azure Data Lake Store(ADLS), Azure Blob and MapRFS
    • New File Formats: Avro and Parquet support added in 10.2.1.
    • Remote File Access Scanner: Mounting folders on Hadoop nodes not required for Linux and Windows filesystem, instead the new remote file access scanner uses SMB for Windows and SFTP for Linux for cataloging.
    • Deep Dive Lineage support for BDM: End to End data lineage from Big Data Management with transformation logic and support for dynamic mappings
    • Data Integration Hub: Users can now scan DIH to access metadata for all objects and its subscriptions and publications.
    • Data Lineage from SQL Scripts(Tech Preview): End to End data lineage from hand coded SQL scripts to understand column level data flows and data transformations- includes support for Oracle PLSQL, DB2 PLSQL, Teradata BTEQ, HiveQL. Stored Procedures are not supported in this release.
    • Qlikview: Scan reports and report lineage from Qlikview.
  • User Experience Improvements
    • Manage business context with in-place editing of the wiki pages of data assets. A business-user friendly data asset overview page provides all the business context about the data asset. Inherit descriptions from Axon associations or type your own.
    • SAML Support: For Single Sign-On.
    • Multiple Business Term Linking: Allows custom attribute creation with Axon or BG term type to allow users to link multiple business terms with a single asset.
    • Search Facet Reordering: Catalog Administrators can now reorder the default facet orders making business facets show up higher than the technical facets.
    • New Missing Asset Link Report: To help users identify linked and unlinked data assets for a lineage-type source.
  • Open and Extensible Platform
    • New REST APIs for starting and monitoring scan jobs
    • S@S Interop: Shared Infrastructure, Metadata Repository, Data Domain Definitions and Curation Results shared across EDC and S@S. Users can now scan a resource once to see it in both EDC and S@S.
    • Reduced Sizing: Up to 3X reduction in computation cores required on the Hadoop cluster across all sizing categories
    • Ease of Deployment: Improved validation utilities, updated distro (HDP v2.6) for the embedded cluster.

 

Release Notes & Product Availability Matrix (PAM)

 

PAM for Informatica 10.2.1

 

Informatica 10.2.1 Release Notes

 

Summary

 

Performance issues are seen when processing huge EBCDIC files in Hive pushdown mode. The mapping has a complex file data object as the source to read the EBCDIC file in binary mode, followed by a Data Processor streamer to chunk the input data and convert the data to relational format, and finally writes the data to a flat file in HDFS.

 

We are not able to leverage Hadoop's parallel distributed computing, since only one map job is spawned to read the entire EBCDIC binary file.
This document discusses some performance tuning steps when processing such huge EBCDIC files in Hadoop pushdown mode. The EBCDIC files assumed in this article contain fixed-length records based on COBOL copybooks.

 

Suggestions to Improve performance

 

The mapping assumed here has a complex file data object as the source to read the EBCDIC file in binary mode, followed by a Data Processor streamer to chunk the input data and convert the data to relational format, and finally writes the data to a flat file in HDFS.

 

 

Some options to improve performance:

  1. In the streamer Data Processor, look for the “count” property when you segment the binary input under repeating_segment. Set the count property to define the number of records that the Data Integration Service must treat as a batch. When you set the count property, the Data Processor engine is called once for each batch of records instead of being called for every record. In short, batch processing improves performance.
  2. Use “org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat” as the custom input format to split the binary records into equal lengths. This can be configured as the custom input format under the Complex File Reader, so that the EBCDIC file is splittable on multiples of a single record length; that helps create a map job per split. This helps only if your data has fixed-length records in EBCDIC format; if the records are variable length, this approach would not help.
  3. Configure the maximum and minimum input split sizes so that multiple input splits (and therefore multiple map jobs) are created.
  4. There is also com.informatica.hadoop.reader.RegexInputFormat available as a custom input format value to help with the split, but it is unclear whether you can construct a regex given that the data is in EBCDIC format.

 

Steps to Improve the performance by spawning multiple map jobs.

 

We will be using the custom input format class “org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat” to split the input. Note that the class file for this input format is already part of the various Hadoop distribution vendor jars, so you need not worry about copying it to the services/shared/hadoop/<distro>/infaLib directory.

 

Here is the proof from class finder utility

 

Now, the detailed Steps …

 

1. Add the below snippet to the core-site.xml file under the services/shared/hadoop/<your distro>/conf directory. As you can see, this is where the fixed-length record size, 1026 in my case, is specified.

  
<property>
    <name>fixedlengthinputformat.record.length</name>
    <value>1026</value>
    <final>true</final>
</property>

2. Open hadoopEnv.properties under the services/shared/hadoop/<your distro>/InfaConf directory and add the core-site.xml file to the infapdo.aux.jars.path property as shown below:

infapdo.aux.jars.path=file://$DIS_HADOOP_DIST/infaLib/hive1.0.0-infa-boot.jar,file://$DIS_HADOOP_DIST/infaLib/profiling-hive0.14.0-udf.jar,file://$DIS_HADOOP_DIST/infaLib/hive-infa-plugins-interface.jar,file://$DIS_HADOOP_DIST/lib/sqoop-1.4.6-hadoop200.jar,file://$DIS_HADOOP_DIST/infaLib/sqoop-1.4.6-serde.jar,file://$DIS_HADOOP_DIST/infaLib/sqoop-1.4.6-client.jar,file://$DIS_HADOOP_DIST/infaLib/azuredw-2.6.0-distcp.jar,file://$DIS_HADOOP_DIST/infaLib/redshift-2.6.0-distcp.jar,file://$DIS_HADOOP_DIST/infaLib/spark-boot.jar,file://$DIS_HADOOP_DIST/conf/hbase-site.xml,file://$DIS_HADOOP_DIST/conf/core-site.xml

 

3. In the mapping run-time properties, override the input split size so you can create multiple map jobs. In my case, the dfs block size is 128 MB, so in order to set the input split size to 64 MB, I set the below values in the mapping run-time properties.

 

The split size is calculated by the formula:

max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))

mapred.min.split.size: 33554432
mapred.max.split.size: 67108864
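As a quick check of the formula with these values and a 128 MB dfs.block.size, the resulting split size works out to 64 MB:

# Quick check of the input split size formula with the values above.
min_split = 33554432            # mapred.min.split.size (32 MB)
max_split = 67108864            # mapred.max.split.size (64 MB)
block_size = 134217728          # dfs.block.size (128 MB)

split_size = max(min_split, min(max_split, block_size))
print(split_size)               # 67108864 bytes = 64 MB, so each 128 MB block yields two splits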

 

I have also set the number of mappers and reducers as shown below.

 

 

4. The Complex File Reader with a custom input format, as of version 10.1.1, prepends the size of the input record to the buffer that it sends out, so we need to skip it in the parser. You can see the highlighted section below where I have skipped the record size in bytes (4 bytes) under the repeating group in the Data Processor script generated using the COBOL-to-relational wizard.

 

 

5. Open the streamer Data Processor and set the offset to split as (fixed record length + bytes needed to store the size of the record) = (1026 + 4) = 1030 in my case. Set it as shown below.

 

 

6. Set the custom input format under the complex file reader to “org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat”

 

 

7. Adjust port precision depending on your record size. I am attaching my sample mapping here.

 

8. You can also set the count to greater than 1 to enable batch processing by Streamer Data Processor. 

 

9. Run the mapping in Hadoop pushdown mode using the Hive engine and check whether multiple map jobs are spawned.

 

10. Tune the performance by adjusting the input split size and also the batch processing count in the streamer.

 

Limitation

 

In 10.1.1, the record length needs to be set in the core-site.xml file. So, if you need to process multiple EBCDIC files of different record lengths, there are only crude workarounds to accomplish this currently. You can either create multiple Data Integration Services, or use fixed-length binary record format code available on the Internet, compile it, and place it under the server/shared/hadoop/<distro>/infaLib directory with different package names for different hard-coded record lengths. Sample code: https://gist.github.com/freeman-lab/98d9096695e794391ab9. This custom input format code is derived from GitHub and is not owned by Informatica; GCS will not be responsible for any issues or bug fixes with this format.

 

 

Tested in Product & Version: BDM 10.1.1 Update2

 

Author Name : Sugi

Overview

PowerCenter customers can now build mappings that can read from and write to Hive. Hive connectivity is now supported through ODBC.

 

Features

Some salient features include:

  • Reading from both internally and externally managed Hive tables
  • Extensive data type support
  • Easily ingest data from relational tables, flat files and other sources into Hive, or vice versa
  • Query override support for complex and custom HiveQLs
  • Read support for partitioned and bucketed Hive tables
  • ANSI SQL-92 support

 

Known limitations

Following are the known limitations of the driver*:

  • Write to partitioned Hive tables
  • Write to bucketed Hive tables

 

* Support may vary between Hadoop distributions

 

References

Here are the KB articles that describe the simple steps that allow PowerCenter to read from and write to Hive.

Under very specific conditions and datatype usage, an Informatica Data Service mapping with a Joiner transformation that has a single join condition, a string data type, and non-Unicode data could cause potential data loss. This issue is being tracked as Change Request PLAT-19257 and has been isolated to mappings that have the following characteristics:

 

Joiner transformations that meet ALL of the following criteria:

 

  • Join type: Equi-Join,
  • Datatype: String
  • Number of conditions: Single
  • Data Transferred: ASCII (dynamically determined based on connection codepages)
  • Engine Type: DataIntegrationService (DIS)
  • Mode: Native

 

Affected Software

 

Informatica Data Quality 10.0/Informatica Big Data Management Edition 10.0

Informatica Data Quality 10.1.0/Informatica Big Data Management Edition 10.1.0

Informatica Data Quality 10.1.1/Informatica Big Data Management Edition 10.1.1

Informatica Data Quality 10.1.1 HotFix 1/Informatica Big Data Management 10.1.1 Edition HotFix 1

 

Suggested Actions

 

Refer to Knowledge Base (KB) article 520801, which has more details on how to check whether mappings meet the criteria.

 

If impacted, apply the patch or Emergency Bug Fix (EBF) that is available for download from the FTP location provided below for 10.1.1 HotFix 1.

For all other versions, a workaround is available as mentioned in the KB article.

 

Informatica strongly recommends that customers apply the patch or workaround suggested in the KB article if they have mappings that fall into the scope defined above.

 

Server: https://tsftp.informatica.com

Location: updates/Informatica10/10.1.1 HotFix1/EBF-10298

 

Please refer to the attached document for more information.

Today, I’m really excited to announce the latest innovation from Cloudera and Informatica’s partnership.  With both companies focusing on helping customers adopt data lakes in the cloud, we are working together to dramatically simplify the delivery of data lakes in the cloud. 

A few months ago, Cloudera announced its new platform-as-a-service offering for data lakes in the cloud, known as Altus. And today, I’m pleased to announce a unique integration between Informatica Big Data Management and Cloudera Altus. This unique integrated solution will enable customers to easily deploy large-scale data workloads in the cloud by reducing the operational overhead of managing a Hadoop cluster.

Cloudera Altus is a platform-as-a-service with services that enable you to analyze and process large-scale data sets in the cloud. Altus provisions clusters quickly and manages Hadoop clusters cost-effectively.

Informatica Big Data Management (BDM) provides the most advanced data integration platform for Hadoop

With Big Data Management on Altus, users can focus on building the data pipeline logic without worrying about cluster management. For example, organizations that wish to gain better visibility into data while eliminating data silos can use this approach to deliver data swiftly for analytics. Creating a data lake solution in the cloud using Big Data Management and Altus has been significantly simplified.

Creating a Data Lake Solution using BDM and Altus

Use Informatica Big Data Management and Cloudera Altus to build and quickly deploy data lakes in the cloud while eliminating data silos and increasing productivity to quickly process and analyze data.


The following illustration shows a typical data lake solution implementation using BDM on Altus:

 

Step 1. Offload infrequently used data from the Enterprise Data Warehouse and load raw data in batches to a defined landing zone in Amazon S3. This frees up space in the Enterprise Data Warehouse.

Step 2. Collect and stream data generated by machines and sensors, including application and weblog files, directly to Amazon S3. Note that staging the data in a temporary file system or the data warehouse is no longer required.

Step 3. Discover and profile data stored on Amazon S3. Profile the data to better understand its structure and context. Easily add requirements for enterprise accountability, control, and governance for compliance with corporate and governmental regulations and business service level agreements.

Step 4. Parse and prepare data from weblogs, application server logs, or sensor data. Typically, these data types are in multi-structured or unstructured format, which can be parsed to extract features and entities and to apply data quality techniques. This allows one to easily execute pre-built transformations as well as data quality and matching rules in Cloudera Altus to prepare data for analysis.

Step 5. After cleansing and transforming data on Cloudera Altus, move high-value curated data to Amazon S3 or to Redshift. From S3 or Redshift, Business Intelligence reports and applications can access the data directly.

 

 

Prototyping a Data Lake Solution

During this next step, a prototype will illustrate how to deploy a data lake solution using Cloudera Altus and Informatica Big Data Management.  The example below demonstrates how to run Cloudera Altus on an Amazon ecosystem while starting an on-demand Spark job with Altus.

To reduce the cluster management cost and operational overhead, use Big Data Management to create and terminate the Altus cluster on demand. To create the cluster, specify the cluster configuration details, including the instance type and the number of worker nodes.

Creating a Workflow

Create a workflow in Informatica Big Data Management to implement the data lake solution.

The following image shows a typical workflow for the data lake solution:

 

The workflow contains command and mapping tasks as described in the following steps:

Step 1. Creates an Altus cluster on demand, using cluster configuration parameters supplied by the user.


The following illustration demonstrates an Altus cluster:

Step 2. Ingests data to Amazon S3. This mapping task runs on the newly created Altus cluster.

Step 3. Prepares the data by cleansing and integrating with other datasets. This mapping task runs on the newly created Altus cluster.

The mapping tasks are fully integrated with Cloudera Altus. The mapping tasks will run natively on the Spark engine.


The following image shows the mapping for data preparation:

Step 4. Terminates the Altus cluster after the mapping processing completes.

Monitoring Spark Jobs

The Informatica monitoring console can be used to monitor Spark jobs that run on the Altus cluster.


The following image demonstrates the Informatica monitoring console running Spark jobs on Altus:

The Altus “Jobs” page below shows the Informatica Spark job(s) running:


 

The following illustration demonstrates a completed Informatica Spark job on Altus:

Video: Data Lake Solution using BDM and Altus

To learn more, watch Informatica Big Data Management in action solving the “Data Lake on Altus” use case: https://network.informatica.com/videos/1113

Looking Ahead

As a strategic partner to Cloudera, Informatica is delighted to announce this new solution that showcases Informatica Big Data Management and Cloudera Altus technologies working together. Integrating Big Data Management with Altus will reduce the cost and operational overhead of managing Hadoop clusters for data engineers and IT administrators alike.

To learn more, visit informatica.com/big-data-ready/

The attached Informatica Big Data Edition Tools package is distributed as a compressed zip folder. It includes five tests that help validate Hadoop cluster services connectivity from an Informatica BDE installation for supported Hadoop distributions.

 

The five tools in summary are:

  • HiveJDBCTest: Validates the JDBC connection to HiveServer2 and runs basic DML queries
  • HiveCLITest: Validates if the client can submit a Hive job to the Hadoop cluster using the Hive CLI driver and runs basic Hive queries
  • DisplayClientServerVersion: Displays the versions of the Informatica BDE Hadoop client libraries and the Hadoop cluster/server libraries
  • HDFSConnectionTest: Validates the ability to connect/read/write to the Hadoop File System
  • HBaseConnectionTest: Validates the ability to connect/read/write to HBase

 

Refer to the Big Data Edition Tools User Guide (within the zip folder) for more information. This guide is written for the Informatica administrator who is responsible for installing and configuring Informatica and related tools. It assumes you are familiar with the Hadoop ecosystem, including MapReduce, YARN, HDFS, Hive, and HBase.

This release includes:

 

BDE Update 3 features

  • Hadoop PAM : IBM 4.1, CDH 5.5
  • Enhancement in Big Data Config utility
  • HiveServer2 Integration with Big Data Edition

BDE Update3 PAM additions

  • Cloudera CDH 5.5
  • IBM BigInsights 4.1

 

Release Notes:

https://kb.informatica.com/proddocs/Product%20Documentation/4/IN_BDE961HF3Update3_ReleaseNotes_en.pdf

 

Note: This EBF contains the above Update 3 features and PAM support and can be applied directly on top of 9.6.1 HotFix 3 or on top of 9.6.1 HotFix 3 Update 2.

BDE Update2 Features

  • Hive 14 feature support: Support for the Update Strategy transformation in Hadoop mode of execution.
  • Active Directory KDC support: Support for an Active Directory-based Kerberos domain controller for Hortonworks and Cloudera.
  • Merge of the 9.6.1 HF2 Update1 features
  • EBF merges mentioned in the release notes section.

 

EBF16193-specific feature, in addition to the above:

  • Hortonworks 2.3 support for Informatica Big Data Edition (BDE)

Informatica Big Data Edition - 9.6.1 HotFix 3 Update 2 - Release Notes - (English)

PAM for Informatica 9.6.1 Hotfix 3 (Update 2) - Big Data Edition (Hadoop)

The release includes the following

 

  • Hive 14 feature support: Support for the Update Strategy transformation in Hadoop mode of execution.
  • Active Directory KDC support: Support for an Active Directory-based Kerberos domain controller for Hortonworks and Cloudera.
    • Release on top of 9.6.1 HF3 (all the HF3 fixes, such as MM performance enhancements, are now available for BDE)
    • Merge of the 9.6.1 HF2 Update1 features
    • EBF merges mentioned in the release notes

 

Informatica Big Data Edition - 9.6.1 HotFix 3 Update 2 - Release Notes - (English)

PAM for Informatica 9.6.1 Hotfix 3 (Update 2) - Big Data Edition (Hadoop)

This release includes:

  • Hadoop Distributions on prem
    • Support for new versions: CDH 5.4, MapR 4.0.2
  • Hadoop Distributions on Cloud
    • Support for Cloudera and Hortonworks on Microsoft Azure and Amazon EC2.
  • Performance
    • HBase writer Performance Enhancement.
    • Big Data Edition on Tez.
  • Ease of Big Data Edition Configuration (Phase1)

Informatica Big Data Edition - 9.6.1 HotFix 2 Update 1 - Release Notes - (English)