Introduced in the early days of big data, MapReduce is a framework that can be used to develop applications that process large amounts of data in a distributed computing environment.
A typical MapReduce job splits the input data set into chunks that are processed in parallel by map tasks. The framework sorts the output of the map tasks and passes the output to reduce tasks. The result is stored in a file system such as Hadoop Distributed File System (HDFS).
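The flow described above can be illustrated with a minimal single-process sketch in Python. This is not Hadoop code: the function names (map_task, shuffle_and_sort, reduce_task, run_job) and the word-count workload are assumptions chosen for illustration, and the chunks are processed one after another rather than in parallel on a cluster.

```python
from collections import defaultdict


def map_task(chunk):
    """Emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle_and_sort(mapped_outputs):
    """Group values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())


def reduce_task(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines, chunk_size=2):
    # Split the input into chunks; a real cluster would run one map task
    # per chunk in parallel instead of this sequential loop.
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    mapped = [map_task(chunk) for chunk in chunks]
    # The sorted, grouped map output is handed to the reduce step.
    return [reduce_task(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    sample = ["spark tez mapreduce", "spark hive", "hive spark"]
    print(run_job(sample))
```

In a real job the final list would be written to a file system such as HDFS instead of being printed.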
In 2012, the Informatica Big Data Management (BDM) product introduced the ability to push down mapping logic to Hadoop clusters by leveraging the MapReduce framework. Big Data Management translated mappings into HiveQL, and then into MapReduce programs, which were executed on a Hadoop cluster.
By using this technique to convert mapping logic to HiveQL and pushing its processing to the Hadoop cluster, Informatica was the first (and still leading) vendor to offer the ability to push down processing logic to Hadoop without requiring developers to learn MapReduce. Developers simply had to select the "Hive" checkbox in the run-time properties of a mapping to run the mapping in MapReduce mode. This enabled hundreds of BDM customers to reuse their traditional Data Integration jobs and onboard them to the Hadoop ecosystem.
With time, the Hadoop ecosystem evolved. For starters, MapReduce is no longer the only job processing framework. Tez, Spark, and other processing frameworks are used throughout the industry as viable alternatives to MapReduce.
Recently, Spark has been widely adopted by vendors and customers alike. Several ecosystems such as Microsoft Azure use Spark as their default processing framework.
UPDATE: Hadoop distribution vendors have started to move away from MapReduce. As of HDP 3.0, MapReduce is no longer supported as an execution engine. For more details on the execution engines that Hortonworks recommends, see the Apache Hive 3 architectural overview. The Hive execution engine (including MapReduce) was deprecated in the Big Data Management 2018 Spring release and reached End of Life (EOL) in the Big Data Management 2019 Spring release (10.2.2). Hive will continue to be supported as Source and Target in other execution modes such as Blaze and Spark.
Big Data Management adopted Spark several years ago and currently supports the latest versions of Spark. Refer to the Product Availability Matrix for the supported Spark versions. Big Data Management supports running Data Integration, Data Quality, and Data Masking transformations using Spark.
For the most part, developers do not have to make any changes to mappings to leverage Spark. To use Spark to run mappings, they simply change the execution engine from Hive to Spark, as shown in the following screenshot from the Big Data Management 2018 Spring release (BDM 10.2.1).
As a result of this simple change, the Data Integration Service generates Spark Scala code instead of MapReduce and executes it on the cluster.
The deprecation and End of Life of MapReduce
To accommodate the evolution of the Hadoop ecosystem, Informatica announced the deprecation of MapReduce in Big Data Management in the 2018 Spring release and its End of Life in the 2019 Spring release (BDM 10.2.2). Informatica recommends that customers who previously used MapReduce migrate to Spark. Customers on older versions of Big Data Management (including the 2018 Spring release / BDM 10.2.1) are strongly advised to migrate to the Spark execution engine by selecting the Spark checkbox in the Run-time properties of the mapping.
Hive will continue to be supported as Source and Target in other execution modes such as Blaze and Spark.
Informatica's EOL of MapReduce applies only to Big Data Management mappings that use MapReduce as the run-time engine. It will not affect any Hadoop components (such as SQOOP) that internally rely on MapReduce or other third-party components. For example, when a customer uses SQOOP as a source in a Big Data Management mapping, Big Data Management will invoke SQOOP, which internally invokes MapReduce for processing. This will continue to be the case even after End of Life for the MapReduce execution mode.
Migration from MapReduce
To reduce the impact of the continual evolution of big data ecosystems, Informatica recommends that developers adopt an inclusive strategy for running mappings. In the Run-time properties of the mapping, select all Hadoop run-time engines, as shown in the following screenshot:
When all Hadoop run-time engines are selected, Informatica chooses the right execution engine at runtime for processing. Beginning with Big Data Management version 2018 Spring release (BDM 10.2.1), mappings default to Spark when Spark is selected with other execution engines. When customers use this inclusive strategy, the mappings with all Hadoop engines selected will automatically run in Spark mode.
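The effect of the inclusive strategy can be sketched in a few lines of Python. This is a hypothetical helper, not Informatica code: resolve_engine and the plain-string engine names are assumptions, and only the "Spark wins when it is selected" rule is taken from the text above. How other combinations are resolved is not stated here, so the sketch refuses to guess.

```python
def resolve_engine(selected_engines):
    """Return the engine a mapping would run on under the Spark-first rule.

    Hypothetical helper: only the rule that Spark is the default when it is
    selected together with other engines (BDM 10.2.1 and later) comes from
    the article.
    """
    selected = set(selected_engines)
    if not selected:
        raise ValueError("no Hadoop run-time engine selected")
    if "Spark" in selected:
        return "Spark"
    if len(selected) == 1:
        # A single selected engine leaves nothing to choose between.
        return next(iter(selected))
    # The article does not describe how other combinations are resolved.
    raise NotImplementedError("resolution rule not described for: %s" % sorted(selected))


if __name__ == "__main__":
    print(resolve_engine(["Hive", "Blaze", "Spark"]))  # inclusive strategy
    print(resolve_engine(["Hive"]))                    # Hive-only mapping
```

A mapping with every engine selected resolves to Spark without further edits, whereas a Hive-only mapping stays on Hive and therefore still needs to be migrated.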
Mappings that have only Hive (MapReduce) selected can be changed in bulk to leverage Spark. Several infacmd commands allow you to change the execution engine for mappings. Mappings that already exist as objects in the Model repository can be migrated to Spark by using one of the following commands:
- infacmd mrs enableMappingValidationEnvironment
- infacmd mrs setMappingExecutionEnvironment
These commands receive the Model repository and project names as input and change the execution engine for all mappings in the given project. The MappingNamesFilter property can be used to provide a comma-separated list of mappings to change. You can use wildcard characters to define mapping names. For more information about using these commands, see the Informatica Command Reference Guide.
Similarly, for mappings that have been deployed to the Data Integration Service as part of an application, you can use the following commands to change the execution engine for multiple mappings:
- infacmd dis enableMappingValidationEnvironment
- infacmd dis setMappingExecutionEnvironment
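To preview which mappings a MappingNamesFilter value would cover before running a bulk change, the filter behavior described above can be approximated in Python. This is a sketch under stated assumptions: it treats the wildcard as a shell-style `*` and matches case-sensitively, which may differ from the actual infacmd behavior documented in the Command Reference Guide. The function names and the sample mapping names are hypothetical.

```python
from fnmatch import fnmatchcase


def parse_filter(mapping_names_filter):
    """Split a comma-separated filter string into trimmed, non-empty patterns."""
    return [p.strip() for p in mapping_names_filter.split(",") if p.strip()]


def select_mappings(all_mappings, mapping_names_filter=None):
    """Return the mappings matched by the filter, in their original order.

    With no filter, every mapping in the project is selected, mirroring the
    default of changing all mappings in the given project.
    """
    if not mapping_names_filter:
        return list(all_mappings)
    patterns = parse_filter(mapping_names_filter)
    # Assumption: '*' behaves like a shell-style wildcard.
    return [m for m in all_mappings if any(fnmatchcase(m, p) for p in patterns)]


if __name__ == "__main__":
    # Hypothetical mapping names, used only for illustration.
    project = ["m_load_customers", "m_load_orders", "m_mask_customers", "m_dq_addresses"]
    print(select_mappings(project, "m_load_*, m_dq_addresses"))
```

Checking the selection offline like this is only a convenience; the commands themselves remain the authority on which mappings are changed.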
Starting with the Big Data Management 2019 Spring release (BDM 10.2.2), Hive execution mode (including MapReduce) is no longer supported. Informatica recommends that customers migrate to Spark.