
This blog post shows how to call a web service in BDM using Spark.

We will use the Python transformation introduced in BDM 10.2.1 to call the web service.

 

The Java transformation is another option for calling the web service.

 

Prerequisites

 

Python and the jep package need to be installed on the BDM Data Integration Service (DIS) server. Refer to the installation documentation to configure the Python transformation with BDM.

 

After the Python installation, edit the Hadoop connection by going to Window --> Preferences --> Connections and clicking on your Hadoop connection.

 

 

 

Edit the Hadoop connection and go to the Spark tab.

 

 

 

Under the Spark tab, go to Advanced Properties and click the Edit button. The first three properties in the screenshot are the Python properties that are present by default; fill in the values for those three properties according to your Python installation.
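For reference, the Python-related Spark properties typically look like the following. The property names come from the Informatica documentation for the Python transformation on Spark, but treat both the names and the paths as illustrative; verify them against the defaults shown in your Hadoop connection and your own Python installation.

infaspark.pythontx.exec = /usr/bin/python
infaspark.pythontx.executorEnv.PYTHONHOME = /usr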

 

 

Web Service Details

 

 

We will use the following web service to get the states for any given country:

 

http://services.groupkt.com/state/get

 

For example, if we pass the country “USA” to the above URL, the web service returns all the state information for the USA, along with other details such as area, capital, and largest city.

 

To test the web service for USA, open the following URL in your browser; it will return JSON output.

 

http://services.groupkt.com/state/get/USA/all
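If you prefer to test the service from Python rather than the browser, a minimal sketch (using the same requests package that the mapping code below relies on) looks like this:

import requests
import json

# Call the state lookup service for the USA and pretty-print the JSON response
response = requests.get("http://services.groupkt.com/state/get/USA/all")
print(json.dumps(response.json(), indent=2))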

 

 

 

Calling the Web Service in a BDM Mapping

 

 

We will create two mappings in BDM.

 

In the first mapping, we will pass the country names from an input file, use the Python transformation to call the web service, and write the output to an HDFS file. The output will be a JSON file.

 

In the second mapping, we will parse the JSON output from the first mapping and write it to Hive.

 

 

Mapping 1:

 

We have an input file on HDFS with the following contents. We will pass the country names from this input file and get the states.
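For illustration, the input file is simply a list of country values, one per line, in the form the service expects. Only USA is taken from this post; the second value is a hypothetical example:

USA
IND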

 

Create a flat file data object in the Developer client. In the advanced properties, go to the Read section and point to the HDFS connection and directory.

 

 

 

Create a new mapping, drag the flat file object into the mapping, and choose the Read operation.

 

 

 

Add a Python transformation to the mapping and drag country_name to the input of the Python transformation.

Create an output port for the Python transformation and call it states_data_json. The Python transformation ports should look like below.

 

 

Go to the Python tab of the Python transformation and add the following code. Note that input ports (country_name) and output ports (states_data_json) are referenced directly as Python variables in this code.

 

 

import requests
import json

# country_name is the input port of the Python transformation
input_string = country_name
input_url = "http://services.groupkt.com/state/get/"

# Call the web service and serialize the JSON response to a string
states = requests.get(input_url + input_string + "/all")
states_data = states.json()

# states_data_json is the output port of the Python transformation
states_data_json = json.dumps(states_data)

 

 

Connect the output port of the Python transformation to a flat file data object writing to HDFS.

 

 

 

The target data object properties look like below.

 

 

Change the execution mode of the mapping to run on Spark.

 

 

Execute the mapping and verify the status of the mapping in the admin console.

 

 

 

Verify the output of the mapping on HDFS; you will see the output in JSON format.

 

 

 

Mapping 2:

In this mapping we will parse the JSON output file from the previous mapping to make it structured.

 

Create a complex file data object by right-clicking on Physical Data Objects -> New -> Physical Data Object.

 

 

Choose the complex file data object type and click Next.

 

 

Name the complex file data object “cfr_states”, click the Browse button under Connection and choose your HDFS connection, and under “Selected Resources” click the Add button.

 

 

In the Add Resource dialog, navigate to the HDFS file location (this is the output file location we specified in the previous mapping), click on the JSON file, and click OK.

 

 

 

 

Click Finish on the next step.

 

 

Now create a Data Processor transformation by right-clicking on Transformations -> New -> Transformation.

 

 

Choose the Data Processor transformation from the list of transformations.

 

 

Name the Data Processor transformation “dp_ws_state” and choose “Create a data processor using a wizard”.

 

 

 

Since the input to the Data Processor transformation is coming in as JSON, choose JSON in the next step and click Next.

 

 

 

Make sure you have sample output from the first mapping on the Developer machine, choose the “Sample JSON file” option, browse to the sample JSON file, and click Next.
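For orientation, the sample JSON written by the first mapping holds the state records with the fields referenced in the Hive DDL later in this post. A purely illustrative, abridged sketch of its shape (the actual wrapper elements and values depend on the web service response) is:

{
  "RestResponse": {
    "messages": ["..."],
    "result": [
      {"id": 1, "country": "USA", "name": "...", "abbr": "...", "area": "...", "largest_city": "...", "capital": "..."}
    ]
  }
}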

 

 

 

Choose relational output and click Finish.

 

 

 

After you click the Finish button, the Data Processor transformation will look like below.

 

Create a Hive table using the following DDL in your target database and import the Hive table as a relational data object into the Developer client.

 

CREATE TABLE infa_pushdown.ws_states (
    FKey_states BIGINT,
    id DOUBLE,
    country STRING,
    name STRING,
    abbr STRING,
    area STRING,
    largest_city STRING,
    capital STRING
);

 

 

Now drag the complex file reader, the Data Processor transformation, and the Hive target into the mapping. Then connect the data port from the complex file reader (CFR) to the input of the Data Processor, and the output of the Data Processor to the Hive target.

 

The final mapping should look like below.

 

 

The mapping was tested in BDM 10.2.1, and in this version the Data Processor transformation is not supported in Spark mode, so we will run the second mapping using Blaze. Once Data Processor support is added for Spark, the second mapping can be eliminated by adding the Data Processor transformation to the first mapping. The screenshot below shows Blaze as the execution engine.

 

 

Execute the mapping and verify the output of the target table by running the Data Viewer on the target data object.

 

The Past

Introduced in the early days of big data, MapReduce is a framework that can be used to develop applications that process large amounts of data in a distributed computing environment.

A typical MapReduce job splits the input data set into chunks that are processed in parallel by map tasks. The framework sorts the output of the map tasks and passes the output to reduce tasks. The result is stored in a file system such as Hadoop Distributed File System (HDFS).

In 2012, the Informatica Big Data Management (BDM) product introduced the ability to push down mapping logic to Hadoop clusters by leveraging the MapReduce framework. Big Data Management translated mappings into HiveQL, and then into MapReduce programs, which were executed on a Hadoop cluster.

By using this technique to convert mapping logic to HiveQL and push its processing to the Hadoop cluster, Informatica was the first (and still leading) vendor to offer the ability to push processing logic down to Hadoop without having to learn MapReduce. Developers simply had to select the "Hive" checkbox in the runtime properties of a mapping to run the mapping in MapReduce mode. This enabled hundreds of BDM customers to reuse their traditional data integration jobs and onboard them to the Hadoop ecosystem.

MapReduce as execution engine for BDM Mappings

The Present

With time, the Hadoop ecosystem evolved. For starters, MapReduce is no longer the only job processing framework. Tez, Spark, and other processing frameworks are used throughout the industry as viable alternatives to MapReduce.

Recently, Spark has been widely adopted by vendors and customers alike. Several ecosystems such as Microsoft Azure use Spark as their default processing framework.

UPDATE: Hadoop distribution vendors have started to move away from MapReduce. As of HDP 3.0, MapReduce is no longer supported as an execution engine. Please refer to this link for more details on Hortonworks' recommendations for execution engines: Apache Hive 3 architectural overview. The Hive execution engine (including MapReduce) was deprecated in the Big Data Management 2018 Spring release and reached End of Life (EOL) in the Big Data Management 2019 Spring release (10.2.2). Hive will continue to be supported as a source and target in other execution modes such as Blaze and Spark.

Big Data Management adopted Spark several years ago and currently supports the latest versions of Spark. Please refer to the Product Availability Matrix for Spark version support. Big Data Management supports running Data Integration, Data Quality, and Data Masking transformations using Spark.

For the most part, developers do not have to make any changes to mappings to leverage Spark. To run mappings on Spark, they simply change the execution engine from Hive to Spark, as shown in the following screenshot from the Big Data Management 2018 Spring release (BDM 10.2.1).

Migration from MapReduce to Spark

As a result of this simple change, the Data Integration Service generates Spark Scala code instead of MapReduce and executes it on the cluster.

The deprecation and End of Life of MapReduce

To accommodate the evolution of the Hadoop ecosystem, Informatica announced the deprecation of MapReduce in Big Data Management in the 2018 Spring release and announced its End of Life in the 2019 Spring release (BDM 10.2.2). Customers previously leveraging MapReduce are recommended to migrate to Spark. Customers currently on older versions of Big Data Management (including the 2018 Spring release / BDM 10.2.1) are strongly recommended to migrate to the Spark execution engine by selecting the Spark checkbox in the Run-time properties of the mapping.

Hive will continue to be supported as Source and Target in other execution modes such as Blaze and Spark.

Informatica's EOL of MapReduce applies only to Big Data Management mappings that use MapReduce as the run-time engine. It will not affect any Hadoop components (such as SQOOP) that internally rely on MapReduce or other third-party components. For example, when a customer uses SQOOP as a source in a Big Data Management mapping, Big Data Management will invoke SQOOP, which internally invokes MapReduce for processing. This will continue to be the case even after End of Life for the MapReduce execution mode.

Migration from MapReduce

To keep pace with the continual evolution of big data ecosystems, Informatica recommends that developers adopt an inclusive strategy for running mappings. In the mapping's Run-time properties, select all Hadoop run-time engines, as shown in the following screenshot:

Polyglot computing in Big Data Management (BDM) including Spark

When all Hadoop run-time engines are selected, Informatica chooses the right execution engine at runtime for processing. Beginning with Big Data Management version 2018 Spring release (BDM 10.2.1), mappings default to Spark when Spark is selected with other execution engines. When customers use this inclusive strategy, the mappings with all Hadoop engines selected will automatically run in Spark mode.

Mappings that have only Hive (MapReduce) selected can be changed in bulk to leverage Spark. Several infacmd commands allow you to change the execution engine for mappings. Mappings that already exist as objects in the Model repository can be migrated to Spark by using one of the following commands:

  • infacmd mrs enableMappingValidationEnvironment
  • infacmd mrs setMappingExecutionEnvironment

These commands receive the Model repository and project names as input and change the execution engine for all mappings in the given project. The MappingNamesFilter property can be used to provide a comma-separated list of mappings to change, and you can use wildcard characters to define mapping names. For more information about using these commands, see the Informatica Command Reference Guide. Similarly, for the mappings that have been deployed to the Data Integration Service as part of an application, you can use the following commands to change the execution engine for multiple mappings:

  • infacmd dis enableMappingValidationEnvironment
  • infacmd dis setMappingExecutionEnvironment

 

Summary

Starting with the Big Data Management 2019 Spring release (BDM 10.2.2), Hive execution mode (including MapReduce) is no longer supported. Customers are recommended to migrate to Spark.