
Big Data Management

5 Posts authored by: Keshav Vadrevu

Introduction

Team-based development refers to the capabilities in Informatica Big Data Management (BDM) that allow developers to access, share, collaborate on, and reuse objects developed by others within the team. BDM has several capabilities that let developers work collectively without stepping on each other's work and without accidental overwrites.

Integration with version control system

The Model Repository Service (MRS) can be integrated with any supported version control system such as GIT, SVN and Perforce. The Model repository completely abstracts the complex version control operations from Informatica developers. As developers check in or check out objects, MRS seamlessly translates these actions into the necessary operations for the underlying version control system. Integrating the Model Repository Service with an external version control system is a single-step process, as demonstrated in the screenshot here:

BDM integration with external version control system such as GIT

 

When integrated with a version control system, MRS preserves the latest (current) version of an object in the model repository and all other versions in the external version control system.

Versioned objects

Developers can check objects into, or check them out of, the version control system. They can perform these operations on multiple objects at a time.

Versioned objects

Version History

All the historical versions stored in the external version control system can be directly accessed from the Developer tool itself. The View Version History menu opens the Version History pane.

Collaboration

Multiple Informatica developers can work and collaborate with each other. Developers can edit and operate on multiple related objects in parallel. For example, consider a mapping with some reusable and non-reusable transformations. While one developer (developer-1) is editing the mapping, another (developer-2) can edit the data object, and a third (developer-3) can update the reusable transformations used within the mapping. Depending on the complexity of the mapping, multiple users can edit several of its components at the same time. Users can also edit mapplets, workflows and other related objects simultaneously.

Collaboration in BDM

 

Administrators and other super users can view edits in progress from the Administrator console, as described in the next section.

Collaboration Locks in Admin Console

 

Intent based object locking

BDM has a built-in capability to acquire write locks on objects that developers edit. A classic lock acquisition mechanism would acquire the write lock on an object as soon as a user opens it in the workspace. While this eliminates accidental overwrites, it often becomes an administrative overhead when large teams are involved. A developer may just wish to keep a read-only copy of an object open in the workspace as a reference for something else they are working on. With developer teams spread across the globe, acquiring a write lock for every user who opens an object makes collaboration a nightmare. BDM instead uses intent-based object locking to provide a more seamless collaboration experience: it acquires a write lock on an object only on the first attempt to edit it. This way, many users can have the object open in the workspace without interfering with the active editor. Intent-based locking is available for all top-level objects, including mappings, profiles and workflows. Locks acquired by developers are automatically released when the objects are closed.

 

Developers have complete visibility into the objects that are locked by other developers, the time since each lock was acquired, and other details.

Locked objects in Developer

Administrators and other users with elevated privileges can use the Administrator console to manage object locks in a similar way and release locks that are no longer valid or active.

Intent based locks in Administrator

Summary

Informatica Big Data Management has capabilities that allow multiple developers to work in parallel in a version-control-enabled model repository. Developers can check objects into and out of the model repository, and these operations are seamlessly propagated to an external version control system such as GIT. Big Data Management automatically maintains locks on objects while allowing users to contribute and collaborate.

Introduction

Informatica® Big Data Management allows users to build big data pipelines that can be seamlessly ported onto any big data ecosystem such as Amazon AWS, Azure HDInsight and so on. A pipeline built in Big Data Management (BDM) is known as a mapping and typically defines a data flow from one or more sources to one or more targets, with optional transformations in between. Mappings and other associated data objects are stored in a Model repository via a Model Repository Service (MRS). In the design-time environment, mappings are often organized into folders within projects, and a mapping can refer to objects across projects and folders. Mappings can be grouped into a workflow for orchestration; a workflow defines the sequence of execution of various objects, including mappings.

Deployment overview

Mappings, workflows and other objects developed by Informatica developers are stored in the model repository that the MRS is integrated with. These design-time objects are deployed to a run-time Data Integration Service (DIS) for execution. In a typical enterprise there is more than one Informatica environment, and code developed in the Development domain is deployed to several non-production environments, such as QA and UAT, before being deployed to Production. While the Development environment contains both design-time and run-time services, the subsequent environments do not need to be configured with both. To deploy objects from one environment to another, the objects must be added to containers called applications. Applications can be deployed to a run-time DIS or to an application archive (.iar) file. The application archive file can subsequently be deployed to Data Integration Services in the same or a different domain, as depicted below.

BDM Deployment Process

 

There are two recommended deployment models: the classic deployment model and the agile (CI/CD) deployment model. Both are described below.

Classic deployment

In classic deployment model, the following process is followed:

  1. Metadata and objects that need to be deployed are first deployed to the run-time DIS of the Development environment
  2. Once unit testing is complete, the objects are migrated to the next environment's MRS (such as QA) via XML export/import or via application export
  3. From the QA MRS, the application is rebuilt and deployed to the QA DIS
  4. Once functional testing is complete, the objects are migrated from the QA MRS to the Production MRS via XML export/import or via application export
  5. From the Production MRS, the application is rebuilt and deployed to the Production DIS

 

Classic deployment model in BDM

 

In this approach, a design-time copy of the mappings and workflows is maintained in the MRS of every environment. The application is rebuilt in each environment and deployed to the corresponding DIS. During migration of objects from one MRS to another, one of the available replacement strategies can be selected, such as replacing objects from the source upon conflict or reusing the objects in the target repository. Upon conflicts, if the objects in the target repository are not replaced from the source, the application built in each environment may not match that of the others, because dependency resolution can happen against different versions of the objects or against different objects altogether.

Agile deployment

In the agile deployment model, the following process is followed:

  1. An application archive is built from the Development repository
  2. The application archive (.iar) file is uploaded to a version control system such as GIT
  3. The application archive (.iar) file is then downloaded from the version control system and deployed to the Development DIS using the infacmd CLI
  4. Once unit testing is complete, the same step is repeated to deploy the application to the QA DIS
  5. Once functional testing is complete, the same step is repeated to deploy the application to the Production DIS

 

Agile deployment in BDM

In this approach, a single application archive file is used across all environments, and hence consistency is assured. Though not common, the application archive can optionally be imported into an MRS to maintain a design-time copy of the objects.

Automation

The infacmd CLI can be used to perform deployment in an automated manner. Both of the deployment models described above can be automated using the CLI. Automation servers such as Jenkins can be used to automate the overall deployment process, as described in the blog: Continuous delivery with Informatica BDM.
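For example, the agile model comes down to two infacmd calls per target environment. The following is a minimal sketch, assuming user-defined environment variables such as $infaDomainName, $applicationPath and $applicationArchiveFileName already hold the domain connection details and application locations:

# 1. Build the application archive (.iar) file from the design-time repository
infacmd.sh oie deployApplication -dn $infaDomainName -un $infaUserName -pd $infaPassword -sdn $infaSecurityDomain -rs $designTimeMRSName -ap $applicationPath -od $Output_Directory

# 2. Deploy the archive file to the target Data Integration Service (Development, QA or Production)
infacmd.sh dis deployApplication -dn $infaDomainName -un $infaUserName -pd $infaPassword -sdn $infaSecurityDomain -sn $dataIntegrationServiceName -a $applicationName -f $applicationArchiveFileName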

Summary

In Big Data Management, there are many ways to migrate and deploy objects from one environment to another, and customers can choose the approach that best suits their needs. All approaches can be automated using the infacmd CLI and automation tools such as Jenkins.

Introduction

Informatica® Big Data Management allows users to build big data pipelines that can be seamlessly ported onto any big data ecosystem such as Amazon AWS, Azure HDInsight and so on. A pipeline built in Big Data Management (BDM) is known as a mapping and typically defines a data flow from one or more sources to one or more targets, with optional transformations in between. Mappings and other associated data objects are stored in a Model repository via a Model Repository Service (MRS). In the design-time environment, mappings are often organized into folders within projects, and a mapping can refer to objects across projects and folders. Mappings can be grouped into a workflow for orchestration; a workflow defines the sequence of execution of various objects, including mappings.

 

Deployment process overview

For mappings and workflows to be deployed and executed at run time, they are grouped into applications. An application is a container that holds executable objects such as mappings and workflows. Applications are defined in the Developer tool and deployed to a Data Integration Service for execution. Once deployed, the Data Integration Service persists a copy of the application. An application can also be deployed to a file known as an Informatica application archive (.iar) file, which can subsequently be deployed to a Data Integration Service in the same or a different domain. The overall deployment process flow in BDM is shown here:

BDM Deployment Process

Automation

The process of deploying a design-time application to an Informatica application archive (.iar) file can be executed via the infacmd CLI with the Object Import Export (oie) plugin. A sample deploy application command is as follows:

infacmd.sh oie deployApplication -dn $infaDomainName -un $infaUserName -pd $infaPassword -sdn $infaSecurityDomain -rs $designTimeMRSName -ap $applicationPath -od $Output_Directory

 

The above example uses several user-defined environment variables; they can be named as per individual organization standards. The password provided is case sensitive. Alternatively, an encrypted password string can be stored in the predefined environment variable INFA_DEFAULT_DOMAIN_PASSWORD; when an encrypted password is used, the -pd option is not required. This command is documented in detail in the Informatica documentation at Command Reference Guide → infacmd OIE Command Reference → Deploy Application.
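For example, here is a sketch of the same command with the clear-text password replaced by the predefined variable (generating the encrypted password string is covered in the Informatica documentation):

# Encrypted password stored in the predefined variable; -pd is then omitted
export INFA_DEFAULT_DOMAIN_PASSWORD=<encrypted password string>

infacmd.sh oie deployApplication -dn $infaDomainName -un $infaUserName -sdn $infaSecurityDomain -rs $designTimeMRSName -ap $applicationPath -od $Output_Directory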

 

Once the application archive file is created, it can optionally be checked into GIT or another version control system for audit and tracking purposes.
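A minimal sketch of that step, assuming the archive was written to $Output_Directory and the current working directory is a clone of the target GIT repository:

# Version the generated application archive for audit and tracking
cp $Output_Directory/*.iar .
git add *.iar
git commit -m "Application archive for $applicationName"
git push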

 

Subsequently, the application archive file can be deployed to a Data Integration Service of the same or a different domain. Typically, the application archive file is created out of a Development domain and is eventually deployed to the QA, UAT and Production domains. This can be achieved via the infacmd CLI with the Data Integration Service (dis) plugin. A sample deployment command is as follows:

infacmd.sh dis deployApplication -dn $infaDomainName -un $infaUserName -pd $infaPassword -sdn $infaSecurityDomain -sn $dataIntegrationServiceName -a $applicationName -f $applicationArchiveFileName

 

This command is documented in detail in the Informatica documentation at Command Reference Guide → infacmd DIS Command Reference → Deploy Application. Once deployment is successful, the listApplications and listApplicationObjects commands in the dis plugin can be used to list the deployed applications and their contents, respectively. This information can be used for post-deployment verification and sanity checks.
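A sketch of such a sanity check, assuming the list commands accept the same connection options as deployApplication (verify the exact options in the Command Reference Guide):

# List the applications deployed to the target Data Integration Service
infacmd.sh dis listApplications -dn $infaDomainName -un $infaUserName -pd $infaPassword -sdn $infaSecurityDomain -sn $dataIntegrationServiceName

# List the objects contained in the newly deployed application
infacmd.sh dis listApplicationObjects -dn $infaDomainName -un $infaUserName -pd $infaPassword -sdn $infaSecurityDomain -sn $dataIntegrationServiceName -a $applicationName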

 

Integration with Jenkins

The CLI commands described above can be used to initiate the deployment process from within a Jenkins job. A "Build Step" of type "Execute Shell" can be added to the Jenkins job and configured to execute one of the infacmd commands, as shown in the example below.

 

BDM deployment in Jenkins

 

A sample template file for Jenkins is attached (Jenkins-Template-App-Deployment). The template contains the commands to perform the following steps (a rough shell sketch follows the list):

  1. Create an Informatica Application Archive (.iar) file
  2. Commit the application archive file to GIT
  3. Deploy the application into DIS
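For illustration only (not the attached template itself), an Execute Shell build step that chains these three steps together might look like the following sketch, reusing the environment variables from the earlier samples and assuming the Jenkins workspace is a clone of the target GIT repository:

# 1. Create the Informatica application archive (.iar) file
infacmd.sh oie deployApplication -dn $infaDomainName -un $infaUserName -pd $infaPassword -sdn $infaSecurityDomain -rs $designTimeMRSName -ap $applicationPath -od $Output_Directory

# 2. Commit the application archive file to GIT
cp $Output_Directory/*.iar .
git add *.iar
git commit -m "Jenkins build: application archive for $applicationName"
git push

# 3. Deploy the application into the target DIS
infacmd.sh dis deployApplication -dn $infaDomainName -un $infaUserName -pd $infaPassword -sdn $infaSecurityDomain -sn $dataIntegrationServiceName -a $applicationName -f $applicationArchiveFileName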

 

Summary

Informatica BDM jobs can be deployed using Jenkins without any need for third-party plugins. infacmd CLI commands can be used directly in Jenkins, just as they can be used in an enterprise scheduling tool.

 

Contributors

  • Keshav Vadrevu, Principal Product Manager
  • Paul Siddal, Big Data Presales Specialist

 

 

 

The Past

Introduced in the early days of big data, MapReduce is a framework that can be used to develop applications that process large amounts of data in a distributed computing environment.

A typical MapReduce job splits the input data set into chunks that are processed in parallel by map tasks. The framework sorts the output of the map tasks and passes the output to reduce tasks. The result is stored in a file system such as Hadoop Distributed File System (HDFS).

In 2012, the Informatica Big Data Management (BDM) product introduced the ability to push down mapping logic to Hadoop clusters by leveraging the MapReduce framework. Big Data Management translated mappings into HiveQL, and then into MapReduce programs, which were executed on a Hadoop cluster.

By converting mapping logic to HiveQL and pushing its processing to the Hadoop cluster, Informatica was the first (and still leading) vendor to offer the ability to push down processing logic to Hadoop without having to learn MapReduce. Developers simply had to select the "Hive" checkbox in the runtime properties of a mapping to run it in MapReduce mode. This enabled hundreds of BDM customers to reuse their traditional Data Integration jobs and onboard them to the Hadoop ecosystem.

MapReduce as execution engine for BDM Mappings

The Present

With time, the Hadoop ecosystem evolved. For starters, MapReduce is no longer the only job processing framework: Tez, Spark and other processing frameworks are used throughout the industry as viable alternatives to MapReduce.

Recently, Spark has been widely adopted by vendors and customers alike. Several ecosystems such as Microsoft Azure use Spark as their default processing framework.

UPDATE: Hadoop distribution vendors have started to move away from MapReduce. As of HDP 3.0, MapReduce is no longer supported as an execution engine. Refer to this link for more details on Hortonworks' recommendations for execution engines: Apache Hive 3 architectural overview. The Hive execution engine (including MapReduce) was deprecated in the Big Data Management 2018 Spring release and reached End of Life (EOL) in the Big Data Management 2019 Spring Release (10.2.2). Hive will continue to be supported as a source and target in other execution modes such as Blaze and Spark.

Big Data Management adopted Spark several years ago and currently supports the latest versions of Spark. Refer to the Product Availability Matrix for supported Spark versions. Big Data Management supports running Data Integration, Data Quality and Data Masking transformations on Spark.

For the most part, developers do not have to make any changes to mappings to leverage Spark. To run mappings on Spark, they simply change the execution engine from Hive to Spark, as shown in the following screenshot from the Big Data Management 2018 Spring Release (BDM 10.2.1).

Migration from MapReduce to Spark

As a result of this simple change, the Data Integration Service generates Spark Scala code instead of MapReduce and executes it on the cluster.

The deprecation and End of Life of MapReduce

To accommodate the evolution of the Hadoop ecosystem, Informatica announced the deprecation of MapReduce in Big Data Management in the 2018 Spring release and announced its End of Life in the 2019 Spring release (BDM 10.2.2). Customers previously leveraging MapReduce are recommended to migrate to Spark. Customers currently on older versions of Big Data Management (including the 2018 Spring release / BDM 10.2.1) are strongly recommended to migrate to the Spark execution engine by selecting the Spark checkbox in the Run-time properties of the mapping.

Hive will continue to be supported as Source and Target in other execution modes such as Blaze and Spark.

Informatica's EOL of MapReduce applies only to Big Data Management mappings that use MapReduce as the run-time engine. It will not affect any Hadoop components (such as SQOOP) that internally rely on MapReduce or other third-party components. For example, when a customer uses SQOOP as a source in a Big Data Management mapping, Big Data Management will invoke SQOOP, which internally invokes MapReduce for processing. This will continue to be the case even after End of Life for the MapReduce execution mode.

Migration from MapReduce

To keep pace with the continual evolution of big data ecosystems, Informatica recommends that developers adopt an inclusive strategy for running mappings: in the mapping Run-time properties, select all Hadoop run-time engines, as shown in the following screenshot:

Polyglot computing in Big Data Management (BDM) including Spark

When all Hadoop run-time engines are selected, Informatica chooses the right execution engine at runtime. Beginning with the Big Data Management 2018 Spring release (BDM 10.2.1), mappings default to Spark when Spark is selected along with other execution engines. When customers use this inclusive strategy, mappings with all Hadoop engines selected automatically run in Spark mode.

Mappings that have only Hive (MapReduce) selected can be changed in bulk to leverage Spark. Several infacmd commands allow you to change the execution engine for mappings. Mappings that already exist as objects in the Model repository can be migrated to Spark by using one of the following commands:

  • infacmd mrs enableMappingValidationEnvironment
  • infacmd mrs setMappingExecutionEnvironment

These commands receive the Model repository and project names as input and change the execution engine for all mappings in the given project. The MappingNamesFilter property can be used to provide a comma-separated list of mappings to change, and wildcard characters can be used to define mapping names. For more information about using these commands, see the Informatica Command Reference Guide. Similarly, for mappings that have been deployed to the Data Integration Service as part of an application, you can use the following commands to change the execution engine for multiple mappings:

  • infacmd dis enableMappingValidationEnvironment
  • infacmd dis setMappingExecutionEnvironment

 

Summary

Starting with the Big Data Management 2019 Spring release (BDM 10.2.2), Hive execution mode (including MapReduce) is no longer supported. Customers are recommended to migrate to Spark.

Overview

PowerCenter customers can now build mappings that read from and write to Hive. Hive connectivity is supported through ODBC.

 

Features

Some salient features include:

  • Reading from both internally and externally managed Hive tables
  • Extensive data type support
  • Easily ingest data from relational tables, flat files and other sources into Hive, or vice versa
  • Query override support for complex and custom HiveQLs
  • Read support for partitioned and bucketed Hive tables
  • ANSI SQL-92 support

 

Known limitations

Following are the known limitations of the driver*:

  • Write to partitioned Hive tables
  • Write to bucketed Hive tables

 

* Support may vary between Hadoop distributions

 

References

Here are the KB articles that describe the simple steps that allow PowerCenter to read from and write to Hive.