Big Data Management : 2018 : May

This blog gives a short overview of Apache Airflow and shows how to integrate Informatica Big Data Management (BDM) with it. It also includes a sample template for orchestrating BDM mappings. The same concept can be extended to PowerCenter and non-BDM mappings.

 

Apache Airflow overview

 

Airflow is a platform to programmatically author, schedule and monitor workflows.

 

Airflow is not a data streaming solution. Tasks do not move data from one task to another (though tasks can exchange metadata!). Airflow is not in the Spark Streaming or Storm space; it is more comparable to Oozie or Azkaban.

 

Generally, Airflow works in a distributed environment, as shown in the diagram below. The Airflow scheduler schedules jobs according to the dependencies defined in directed acyclic graphs (DAGs), and the Airflow workers pick up and run the jobs with their loads properly balanced. All job information is stored in the metadata DB, which is updated in a timely manner. Users can monitor their jobs via the Airflow web UI and/or the logs.

 

 

Installing Apache Airflow

 

The following installation method is intended for non-production use. Refer to the Airflow documentation for production deployments.

 

Apache Airflow provides a variety of operators, listed at the link below. An operator describes a single task in a workflow.

 

https://github.com/apache/incubator-airflow/tree/master/airflow/operators

 

To trigger Informatica BDM mappings we will be using the BashOperator, i.e., triggering the mappings through the command line. The following are required on the Airflow machine:

 

  1. If Apache Airflow is running on a machine other than the Informatica node, install the Informatica command line utilities on the Airflow worker nodes
  2. Python

 

 

Create a directory /opt/infa/airflow
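For example:

mkdir -p /opt/infa/airflow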

 

 

The easiest way to install Airflow is to run the following command. pip is a Python utility for installing Python packages.

 

pip install apache-airflow

Set the AIRFLOW_HOME environment variable.
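For example, pointing it at the directory created above:

export AIRFLOW_HOME=/opt/infa/airflow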

 

Create a folder called “dags” inside the AIRFLOW_HOME folder.
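For example:

mkdir -p $AIRFLOW_HOME/dags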

 

Initialize the Airflow database by typing the command “airflow initdb”. This is where the metadata will be stored. We will use the default SQLAlchemy-managed database that comes with Airflow; if needed, the configuration can be modified to use MySQL or PostgreSQL as the Airflow backend.
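If you do change the backend, the connection is configured through the sql_alchemy_conn setting in the airflow.cfg file under AIRFLOW_HOME. The connection string below is only an illustration, assuming a local PostgreSQL database named airflow:

# airflow.cfg, [core] section
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow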

 

 

If initdb shows any errors, it is most likely because of missing Airflow packages; a complete list of packages and the commands to install them is available at the link below.

 

https://airflow.apache.org/installation.html

 

Start the Airflow web UI using the following command:
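airflow webserver -p 8080

(The -p option sets the web server port; 8080 matches the URL used below.)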

 

Start the Airflow scheduler using the following command:
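airflow scheduler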

 

 

Log in to the Airflow UI using the URL http://hostname:8080. If you have installed the examples, you should see the example DAGs listed in the UI.

 

 

Creating a DAG for BDM Mappings

 

 

For the demo, we deployed the following three BDM mappings (as applications) into the Data Integration Service (DIS).

 

Application_m_01_Get_States_Rest_Webservice

Application_m_02_Parse_Webservice_Output

Application_m_Read_Oracle_Customers_Write_Hive_Python

 

 

The 3 applications need to be orchestrated in the following way.

 

  1. Application_m_01_Get_States_Rest_Webservice and Application_m_Read_Oracle_Customers_Write_Hive_Python can run in parallel
  2. Application_m_02_Parse_Webservice_Output will run only if Application_m_01_Get_States_Rest_Webservice is successful

 

 

 

Save the following code as airflow_bdm_sample.py under the /opt/infa/airflow/dags folder.

There are different ways to call the infacmd RunMapping command; for example, the command can be put in a shell script and the script can be called from the DAG, as sketched below.
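As a minimal sketch of that alternative (the script path /opt/infa/scripts/run_bdm_mapping.sh is hypothetical, and the trailing space after the script name keeps Airflow from treating a bash_command ending in .sh as a Jinja template file):

# Hypothetical variant of the t2 task below: the infacmd RunMapping call lives in a
# shell script and the BashOperator simply invokes that script.
run_via_script = BashOperator(
    task_id='mapping_calling_webservice_via_script',
    bash_command='/opt/infa/scripts/run_bdm_mapping.sh ',  # trailing space is intentional
    dag=dag)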

 

 

# Start Code

import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


# These args will get passed on to each operator.
# You can override them on a per-task basis during operator initialization.
default_args = {
    'owner': 'infa',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'start_date': datetime.now() - timedelta(seconds=10),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
    # 'wait_for_downstream': False,
    # 'dag': dag,
    # 'adhoc':False,
    # 'sla': timedelta(hours=2),
    # 'execution_timeout': timedelta(seconds=300),
    # 'on_failure_callback': some_function,
    # 'on_success_callback': some_other_function,
    # 'on_retry_callback': another_function,
    # 'trigger_rule': u'all_success'
}

dag = DAG(
    'Informatica_Bigdata_Demo',
    default_args=default_args,
    description='A simple Informatica BDM DAG')

# Print the start date and time of the DAG run
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

# Run the mapping that calls the REST web service
t2 = BashOperator(
    task_id='mapping_calling_webservice',
    depends_on_past=False,
    bash_command='infacmd.sh ms RunMapping -dn infa_dom_1021 -sn dis_bdm_cdh -un Administrator -m m_01_Get_States_Rest_Webservice -a Application_m_01_Get_States_Rest_Webservice -pd admin',
    dag=dag)

# Run the mapping that parses the web service output
t3 = BashOperator(
    task_id='mapping_parsing_json',
    depends_on_past=True,
    bash_command='infacmd.sh ms RunMapping -dn infa_dom_1021 -sn dis_bdm_cdh -un Administrator -m m_02_Parse_Webservice_Output -a Application_m_02_Parse_Webservice_Output',
    dag=dag)

# Run the mapping that reads Oracle customers and writes to Hive
t4 = BashOperator(
    task_id='Read_Oracle_load_Hive',
    depends_on_past=False,
    bash_command='infacmd.sh ms RunMapping -dn infa_dom_1021 -sn dis_bdm_cdh -un Administrator -m m_Read_Oracle_Customers_Write_Hive_Python -a Application_m_Read_Oracle_Customers_Write_Hive_Python',
    dag=dag)

# Dependencies: t2 and t4 run in parallel after t1; t3 runs only after t2 succeeds
t1.set_downstream(t2)
t2.set_downstream(t3)
t4.set_upstream(t1)

 

 

# End code

 

 

Restart the Airflow webserver, and the Informatica_Bigdata_Demo DAG will appear in the list of DAGs.

 

 

 

Click on the DAG and go to the Graph View; it gives a better view of the orchestration.

 

 

 

Run the DAG, and you will see the status of the DAG runs in the Airflow UI as well as in the Informatica Monitor.

 

 

 

The above DAG code can be extended to retrieve the mapping logs and the status of the runs.
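As a minimal sketch of one such extension (assuming Airflow 1.x, where BashOperator accepts xcom_push=True and pushes the last line of the command's stdout to XCom for downstream tasks to read):

# Variant of t2 that keeps the RunMapping output so that a downstream task can
# pull it with ti.xcom_pull(task_ids='mapping_calling_webservice_xcom') and,
# for example, parse the run identifier or status from it.
t2_with_output = BashOperator(
    task_id='mapping_calling_webservice_xcom',
    bash_command='infacmd.sh ms RunMapping -dn infa_dom_1021 -sn dis_bdm_cdh -un Administrator -m m_01_Get_States_Rest_Webservice -a Application_m_01_Get_States_Rest_Webservice -pd admin',
    xcom_push=True,
    dag=dag)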

What are we announcing?

Informatica Big Data Release 10.2.1

 

Who would benefit from this release?

This release is for all customers and prospects who want to take advantage of the latest Big Data Management, Big Data Quality, Big Data Streaming, Enterprise Data Catalog, and Enterprise Data Lake capabilities.

 

What’s in this release?

This update provides the latest ecosystem support, security, connectivity, cloud, and performance while improving the user experience.

 

Big Data Management (BDM)

 

Enterprise Class

 

  • Zero client configuration: Developers can now import the metadata from Hadoop clusters without configuring Kerberos Keytabs and configuration files on individual workstations by leveraging the Metadata Access Service
  • Mass ingestion: Data analysts can now ingest relational data into HDFS and Hive with a simple point and click interface and without having to develop individual mappings. Mass Ingestion simplifies ingestion of thousands of objects and operationalizes them via a non-technical interface
  • CLAIRE integration: Big Data Management now integrates with Intelligent Structure Discovery (part of Informatica Intelligent Cloud Services) to provide machine learning capabilities for parsing complex file formats such as weblogs
  • SQOOP enhancements: The SQOOP connector has been re-architected to support high concurrency and performance
  • Simplified server configuration: Cluster configuration object and Hadoop connections are enhanced to improve the usability and ability to perform advanced configurations from the UI
  • Increased developer productivity: Developers can now use the "Run mapping using advanced options" menu to execute undeployed mappings by providing parameter file/sets, tracing level and optimizer levels in the Developer tool. Developers can also view optimized mappings after the parameter binding is resolved using the new "Show mapping with resolved parameters" option.
  • PowerCenter Reuse enhancements: Import from PowerCenter functionality has been enhanced to support import of PowerCenter workflows into Big Data Management
  • GIT Support: Big Data Management administrators can now configure GIT (in addition to Perforce and SVN) as the external versioning repository

 

Advanced Spark Support

 

  • End to end functionality: End to end Data Integration and Data Quality use-cases can now be executed on the Spark Engine. New and improved functionality includes Sequence Generator transformation, Pre/Post SQL support for Hive, support for Hive ACID Merge statement on supported distributions, Address Validation and Data Masking.
  • Data science integration: Big Data customers can now integrate pre-trained data science models with Big Data Management mappings using our new Python transformation.
  • Enhanced hierarchical data processing support: With support for Map data types and support of Arrays, Structs and Maps in Java transformations, customers can now build complex hierarchical processing mappings to run on the Spark engine. Enhancements in gestures and UI enable customers to leverage this functionality in a simple yet effective manner
  • Spark 2.2 support: Big Data Management now uses Spark 2.2 on supported Hadoop distributions

 

Cloud

 

  • Ephemeral cluster support: With out-of-the-box ephemeral cluster support for AWS and Azure ecosystems, customers can now auto deploy and auto scale compute clusters from a BDM workflow and push the mapping for processing to the automatically deployed clusters
  • Cloudera Altus support: Cloudera customers can now push the processing to Cloudera Altus compute clusters.
  • Improved AWS connectivity: Amazon S3 and Redshift connectors have received several functional, usability and performance updates
  • Enhanced Azure connectivity: Azure WASB/Blob and SQL DW connectors have received several functional, usability and performance updates

 

Platform PAM Update

 

  • Oracle 12cR2: Added
  • SQL Server 2017: Added
  • Azure SQL DB (PaaS/DBaaS, single database model): Added
  • SQL Server 2008 R2 & 2012 (EOL): Dropped
  • IBM DB2 9.7 & 10.1 (EOL): Dropped
  • SUSE 12 SP2: Added
  • SUSE 12 SP0: Dropped
  • AIX: Not Supported
  • Solaris: Not Supported
  • Windows Server: Not Supported

Model Repository - Version Control

  • Git: Added

  • Oracle Java 1.8.0_162: Updated
  • IBM JDK: NA
  • Tomcat 7.0.84: Updated

 

Big Data Quality (BDQ)

 

Capabilities

  • Enable data quality processing on Spark
  • Updated Address Verification Engine (AddressDoctor 5.12)
  • Support for custom schemas for reference tables
  • Updated workflow engine

Benefits

  • Support Spark scale and execution with Big Data Management
  • Enhanced Address Verification engine with world-wide certifications
  • Flexible use of reference data with enterprise DB procedures
  • Faster start times for workflow engine

 

Big Data Streaming (BDS)

 

Change in Product Name: The product name has changed from "Informatica Intelligent Streaming" to "Big Data Streaming"

 

Azure Cloud Ecosystem Support

 

  • Endpoint Support: Azure EventHub as source & target and ADLS as target
  • Cloud deployment: Run streaming jobs in Azure cluster on HDInsight

 

Enhanced Streaming Processing and Analytics

 

  • Stateful computing support on streaming data
  • Support for masking streaming data
  • Support for normalizer transformation
  • Support for un-cached lookup on HBase tables in streaming
  • Kafka Enhancements - Kafka 1.0 support & support for multiple Kafka versions

 

New Connectivity and PAM support

 

  • Spark Engine Enhancements - Spark 2.2.1 support in streaming, Truncate table, Spark concurrency
  • Relational DB as target - SQL Server and Oracle
  • New PAM - HDInsight
  • Latest version support on Cloudera, Hortonworks, EMR

 

Enterprise Data Lake (EDL)

 

Change in Product Name: The product name has changed from "Intelligent Data Lake" to "Enterprise Data Lake"

 

Core Data Preparation

 

  • Data Preparation for JSON Lines (JSONL) Files: Users can add JSONL files to a project and structure the hierarchical data in row-column form. They can extract specific attributes from the hierarchy and can expand (or explode) arrays into rows in the worksheet.
  • Pivot and UnPivot: Users can pivot or unpivot columns in a worksheet to transpose/reshape the row and column data in a worksheet for advanced aggregation and analysis.
  • Categorize and One-hot-encoding functions: Users can easily categorize similar values into fewer values to make analysis easier. With one-hot-encoding, the user can convert categorical values in a worksheet to numeric values suitable for machine learning algorithms.
  • Column Browser with Quality Bar: A new panel for browsing columns is added to the left panel in the worksheet. This easy to use column browser interface allows users to show/hide columns, search for columns, highlight columns in the worksheet, etc. The panel also has a Quality bar that shows unique, duplicate and blank value count percentages within the column. The panel can also show any associated glossary terms.
  • Project Level Graphical View: For a project with a large number of assets, the graphical view helps users understand the relationships between input data sources, sheets created, assets published, and Apache Zeppelin notebooks created. Users can navigate to the asset, notebook or the worksheet directly.
  • Insert recipe step, add a filter to an existing step: Users can insert a new step at any location in the recipe. They can also add/modify existing filters for any recipe step.
  • Data Type Inferencing optimization: Users can revert undesired inferencing done by data preparation engine and apply appropriate functions. They can revert or re-infer types as needed.
  • Show where the data in a column comes from: The column overview in the bottom panel now has a Source property that shows if the column corresponds to a physical input source column, another worksheet or a step in the recipe. If the user hovers over a data source name, the application shows details of the formula when available and highlights the appropriate recipe step.
  • UX Improvements in Filter-in-effect, Sampling, Join and Apply Rule panels: The user interface has been improved for clarity of icons and language, visibility of information and buttons, and better user flow in these panels. Users can also enter constant values as inputs in the Apply Rule panel for text-based user inputs.

 

Self-service and Collaboration

 

  • Self-service scheduling: Data Analysts now have the ability to schedule import, publish and export activities. The Import/Publish/Export wizard offers the choice to perform the activity now, or to schedule it. For publish, a “snapshot” of recipes is saved for execution at the scheduled time. Users can continue to work on the project and modify recipes without affecting scheduled activity.
    The “My Scheduled Activities” tab provides details of upcoming activities. The “Manage My Schedules” tab provides details of schedules and enables users to modify schedules.
    Scheduled activities can be monitored on the My Activities page. Functionally it has the same effect as running the activity manually. All the schedules created in Enterprise Data Lake and activities scheduled in Enterprise Data Lake are also visible in the Administrator Console tool.
  • Project History: Users (and IT/Governance staff) can view the important events that happened within a given project. These include events related to Project, Collaborators, Assets, Worksheets, Publications, Scheduled Publications, Notebook etc.
  • Copy-Paste Recipe Steps: Users can copy specific steps or the whole recipe and paste into another sheet in the same project or another project. There is also a way to map the input columns used in the source sheet to the columns present in the target sheet. This enables reuse of each other’s or their own work in the creation of repetitive steps.
  • Quick Filters for asset search in the data lake: In the search results, users have a single-click filter to get all the assets in the data lake that match the search criteria.
  • Recommendation Card UX Improvements: The Recommendation cards in the Project view now show the reason an asset was recommended for inclusion in the project, and what action user should take.
  • Details of Source Filters during Publish: During Publication, the Publish Wizard shows the details of "Source Filters" so the user understands the impact of including or not including the filters.

 

Enterprise Focus

 

  • Single Installer for Big Data Management, Enterprise Data Catalog and Enterprise Data Lake: The installation and upgrade flows have been improved and simplified with a single installer. Enterprise Data Lake customers can now install all three products in a single install. The total size of the single installer is just ~7GB due to better compression, as compared to the previous combined size of ~13GB. The process requires fewer domain restarts, and additional configurations can also be enabled in the same single flow.
  • Blaze as Default Execution Engines for Enterprise Data Lake: All Enterprise Data Lake processes using Big Data Management mapping execution now use Blaze as the default engine. This has improved performance and consistency.
  • SAML based SSO: Enterprise Data Lake now supports SAML based Single-Sign-On.
  • Lake Resource Management UI: Administrators can manage the Enterprise Data Catalog resources that represent the external data sources and metadata repositories from which scanners extract metadata for use in the data lake. The Lake Resource Management page also verifies the validity of resources, the presence of at least one Hive resource, etc. so that Enterprise Data Lake functionality is usable. Changes done through the Lake Resource Management page do not require a service restart.
  • Data Encryption for Data Preparation Service node: The temporary data created on Data Preparation Service nodes is encrypted for better security.
  • Demo Version of IT Monitoring Dashboard: A dashboard created in Apache Zeppelin allows administrators to monitor Enterprise Data Lake user activities. The dashboard is not a product feature, but an example to show what is possible with the audit information. The dashboard is an Apache Zeppelin Notebook built on top of the Enterprise Data Lake user event auditing database. The Zeppelin Notebook and associated content are available on request, but it is unsupported. The Audit mechanism has been changed and improved now to support direct queries using JDBC. 
  • Performance Improvement in Import process using CLAIRE: Using the profiling metadata information available in CLAIRE, the import process optimizes the number of sub-processes created thereby improving the overall performance of Import

 

Enterprise Data Catalog (EDC)

 

  • Intelligence
    • Enhanced Smart Discovery: By clustering similar columns from across data sources, EDC enables users to quickly associate business terms as well as classify data elements. Unsupervised clustering of similar columns is now based on names, unique values and patterns in addition to the existing data overlap similarity.
    • Enhanced Unstructured Data Discovery (Tech Preview): Enhanced unstructured data support for accurate domain discovery using NLP and new file system connectivity.
    • New Data Domain Rules: Override rules and new scan options for more granular control on rule based data domain inference.
  • Connectivity
    • New Filesystems: Added support for cataloging of SharePoint, OneDrive, Azure Data Lake Store (ADLS), Azure Blob and MapRFS
    • New File Formats: Avro and Parquet support added in 10.2.1.
    • Remote File Access Scanner: Mounting folders on Hadoop nodes is no longer required for Linux and Windows filesystems; instead, the new remote file access scanner uses SMB for Windows and SFTP for Linux for cataloging.
    • Deep Dive Lineage support for BDM: End to End data lineage from Big Data Management with transformation logic and support for dynamic mappings
    • Data Integration Hub: Users can now scan DIH to access metadata for all objects and its subscriptions and publications.
    • Data Lineage from SQL Scripts (Tech Preview): End-to-end data lineage from hand-coded SQL scripts to understand column-level data flows and data transformations; includes support for Oracle PLSQL, DB2 PLSQL, Teradata BTEQ and HiveQL. Stored procedures are not supported in this release.
    • Qlikview: Scan reports and report lineage from Qlikview.
  • User Experience Improvements
    • Manage business context with in-place editing of wiki pages for data assets. A business-user-friendly data asset overview page provides all the business context about the data asset. Inherit descriptions from Axon associations or type your own.
    • SAML Support: For Single Sign-On.
    • Multiple Business Term Linking: Allows custom attribute creation with Axon or BG term type to allow users to link multiple business terms with a single asset.
    • Search Facet Reordering: Catalog Administrators can now reorder the default facet orders making business facets show up higher than the technical facets.
    • New Missing Asset Link Report: To help users identify linked and unlinked data assets for a lineage-type source.
  • Open and Extensible Platform
    • New REST APIs for starting and monitoring scan jobs
    • S@S Interop: Shared Infrastructure, Metadata Repository, Data Domain Definitions and Curation Results shared across EDC and S@S. Users can now scan a resource once to see it in both EDC and S@S.
    • Reduced Sizing: Up to 3X reduction in computation cores required on the Hadoop cluster across all sizing categories
    • Ease of Deployment: Improved validation utilities, updated distro (HDP v2.6) for embedded cluster.

 

Release Notes & Product Availability Matrix (PAM)

 

PAM for Informatica 10.2.1

 

Informatica 10.2.1 Release Notes