
Data Engineering Integration


Informatica 10.4.1.2 has been released to shipping and is available for download.

What are we announcing?

Informatica 10.4.1.2

Who would benefit from this release?

This release is for all Data Engineering, Data Catalog, Data Preparation & Data Privacy Management customers and prospects who want to take advantage of updated Hadoop distribution support as well as fixes to the core platform and other functionality. You can apply this service pack after you install or upgrade to Informatica 10.4.1. If you are already on 10.4.1 or 10.4.1.1, you can directly install 10.4.1.2.

What’s in this release?

Data Engineering

PAM

  • Cloudera CDP Public Cloud 7.2 - AWS & Azure
  • Cloudera CDP Private Cloud Base 7.1.3
  • Amazon EMR 6.1 - Tech Preview

Developer Client

  • Performance improvements when browsing large numbers of physical data objects.

Enterprise Data Catalog

Scanners

  • Amazon S3 scanner enhancements:
    • For enhanced security, administrators can configure the S3 scanner to use a temporary session token.
    • Extract metadata from an Amazon S3-compatible storage system such as Scality RING.
  • Axon scanner enhancements:
    • Extract metadata from additional assets such as attributes, policies, systems, and datasets and view the relationship between these assets.
    • Reconcile domain users between Informatica Axon and Enterprise Data Catalog.
  • Data domain discovery enhancements:
    • Configure data domains with the match data or column name rule.
    • Automatically reject all other inferred data domains after you approve a data domain associated with an asset.
  • Data Asset Analytics enhancements:
    • Asset Usage report type to view the usage of assets in Enterprise Data Catalog.
    • New Top Assets Viewed chart in User Adoption dashboard to view the list of the most-viewed assets in the catalog.
    • New Feature Usage Value chart in the Data Value dashboard to view the value of Enterprise Data Catalog features based on their usage.
    • Enhanced Resource Value calculation in the Data Value dashboard.
    • Support for different currency types
    • Performance improvements

Application

  • Perform a bulk export or import of business terms and Axon Glossaries for curation.

PAM

  • Deployment:
    • HDInsight 4.0 with WASB
    • Data Asset Analytics repository database updates:
      • Azure SQL Database as DBaaS
      • PostgreSQL 11.7
      • Oracle RAC
      • Amazon RDS for Oracle and Microsoft SQL Server
  • Scanners:
    • SAP S/4HANA versions 1610 and 1909
    • SAP BW/4HANA version 2.0
    • Apache Hive on CDP Private Cloud Base 7.1
    • HDFS on Cloudera CDP Private Cloud Base 7.1

Enterprise Data Preparation

  • Support for Cloudera CDP Private Cloud Base 7.1.3

Data Privacy Management

Unstructured Scans through the Informatica Discovery Agent

  • Modified keywords for certain data domains to improve accuracy.
  • Modified 9 data domains to improve accuracy.
  • Improved Scan performance when scanning very large unstructured files.

Subject Registry Enhancements

  • Support for incremental scans of unstructured sources

PAM

  • ADLS Gen 2 support through the Informatica Discovery Agent

Release Notes & Product Availability Matrix (PAM)

Informatica 10.4.1.2 Release Notes: Informatica 10.4.1.2 Release Notes

PowerExchange Adapters for Informatica Release Notes: PowerExchange Adapters for Informatica Release Notes

Informatica Release Guide:  https://docs.informatica.com/data-integration/shared-content-for-data-integration/10-4-1/release-guide--10-4-1-2-/preface.html

PAM for Informatica 10.4.1.2: https://network.informatica.com/docs/DOC-18676#comment-77540

 

You can download the service pack from here.

What are we announcing?

Informatica 10.4.1.1

Who would benefit from this release?

This release is for all Data Engineering, Data Catalog, Data Preparation & Data Privacy Management customers and prospects who want to take advantage of updated Hadoop distribution support as well as fixes to the core platform and other functionality. You can apply this service pack after you install or upgrade to Informatica 10.4.1. If you are already on 10.4.1, you can directly install 10.4.1.1.

What’s in this release?

Data Engineering Integration

 

PAM

  • CDP Datacenter 7.1.1 GA with Spark pushdown
  • EMR 5.29 (Also includes support for Glue with Non-Kerberos clusters)
  • WASBS support for HDI 4.0
  • DLM enabled clusters for HDP 3.1.5

Data Engineering Streaming

PAM

Added support for:

  • Cloudera Data Platform (CDP) Data Center 7.1.1
  • HD Insight (HDI) 4.0 on Clearlake/Microsoft Distribution of Hadoop (MDH) with ADLS Gen2
  • Amazon Elastic MapReduce (EMR) 5.29 with EMRFS (default Hive metastore) and Kerberos
  • Amazon Elastic MapReduce (EMR) 5.29 with AWS Glue (Hive metastore), non-Kerberos
  • WANdisco with Cloudera Distribution Including Apache Hadoop (CDH) 6.3
  • WANdisco with Hortonworks Data Platform (HDP) 3.1.4

Enterprise Data Catalog

  • Data similarity enhancements
  • Scanner: Azure Active Directory authentication support and proxy connection support enhancements for ADLS Gen2 scanner
  • Scanner: Support for project level scans for Tableau
  • Scanner: Better handling of schema-level connections in addition to database-level connections for the MicroStrategy scanner
  • Catalog UI: Duplicate asset type issue in asset filters is addressed
  • Catalog UI: Hierarchical display of Axon/BG terms

PAM

  • Deployment on RHEL 7.8
  • F5 Big IP with SAML 2.0
  • NetScaler as IdP

Enterprise Data Preparation

  • Preparation projects can now be copied, helping central data teams prepare template projects (with datasets and recipe steps) to share with their data consumers.
  • EDP is now available in French and German
  • Audit and collaboration feature enhancements
  • Performance improvements in Data Preparation Service
  • Enhanced support for partitioned parquet files

PAM

  • F5 Big IP with SAML 2.0
  • NetScaler as IdP
  • HD Insights 4.0 Clearlake
  • EMR 5.29

Data Privacy Management

  • Improved operational efficiencies for Subject Registry:
    • Incremental scans for Subject Registry
    • Separation of Index & Search configuration
    • New implementation for "Exact Match" to improve performance
    • Handling of nulls and blanks in indexes

PAM

  • Cassandra for Domain discovery
  • Unstructured Scan support through Remote Agent
    • HDFS
    • Azure Blob Storage
    • ADLS Gen 1
    • WASB
    • Google Drive
  • File types support added through Remote Agent
    • AVRO
    • PARQUET
  • Windows install of Remote Agent
  • Additional Metadata support for unstructured files through unstructured agent scans

Release Notes & Product Availability Matrix (PAM)

Informatica 10.4.1.1 Release Notes: https://docs.informatica.com/data-catalog/enterprise-data-catalog/10-4-1/release-notes--10-4-1-1-/preface.html

PowerExchange Adapters for Informatica Release Notes: https://docs.informatica.com/data-integration/powerexchange-adapters-for-informatica/10-4-1/powerexchange-adapters-for-informatica-release-notes--10-4-1-1-/abstract.html

 

Informatica Release Guide: https://docs.informatica.com/data-engineering/shared-content-for-data-engineering/10-4-1/release-guide--10-4-1-1-/preface.html

PAM for Informatica 10.4.1.1: https://network.informatica.com/docs/DOC-18676#comment-72889

You can download the Service Packs from here.

The Informatica Global Customer Support Team is excited to announce an all-new technical webinar and demo series – Meet the Experts, in partnership with our technical product experts and Product management. These technical sessions are designed to encourage interaction and knowledge gathering around some of our latest innovations and capabilities across Data Integration, Data Quality, Big Data and so on. In these sessions, we will strive to provide you with as many technical details as possible, including new features and functionalities, and where relevant, show you a demo or product walk-through as well.

 

Topic and Agenda

 

  • Topic: Meet the Experts Webinar - "AI-Powered Streaming Analytics for Real-Time Customer Experience  – Deep Dive and Demo"
  • Date: Wednesday, 4 December 2019
  • Time: 10:00 AM Pacific Daylight Time (PDT)
  • Duration: 1 Hour
  • Webinar Registration Link: https://www.informatica.com/about-us/webinars.html?commid=377797&utm_source=support
  • Speakers:
    • Preetam Kumar, Product Marketing Manager, Data Engineering Streaming

    • Vishwa Belur, Director Product Management, Data Engineering Streaming

 

What’s the best way for businesses to differentiate themselves today? By delivering a unique, real-time customer experience across all touchpoints—one that is based on a solid, connected business strategy driven by data and analytics insights. 

 

However, interacting with customers in real-time and in a relevant, meaningful way can be challenging for organizations with disparate sources of data, including at the edge, on-premises, and multi-cloud.

 

To get maximum value and insight from customer data, you need a data management framework that can do three things:

  1. Sense - Capture event data and streaming data from sources such as social media, web logs, machine logs, and IoT sensors.
  2. Reason - Automatically process, correlate, and analyze this data to add context and meaningful insights.
  3. Act - Respond and take automated action in a reliable, timely, and consistent way.

 

Join this webinar to learn how to:

  • Manage the entire end-to-end sense-reason-act process at any latency with an AI-powered streaming solution
  • Automate data management processes and guide user behavior for real-time streaming analytics with AI
  • Gain a clear understanding of how to use Spark Structured Streaming for data engineering using an intelligent data streaming solution that unifies streaming and batch data to deliver next best actions that improve customer experience

What are we announcing?

Informatica 10.2.2 HotFix 1 Service Pack 2

 

Who would benefit from this release?

This release is for all Big Data customers and prospects who want to take advantage of updated Hadoop distribution support as well as fixes to the core platform, connectivity, and other functionality. You can apply this service pack after you install or upgrade to Informatica 10.2.2 HotFix 1. If you are already on 10.2.2 HotFix 1 Service Pack 1, you can install 10.2.2 HotFix 1 Service Pack 2 directly.

 

What’s in this release?

Big Data PAM

Distribution Support:

  • Cloudera CDH: 6.2, 6.1, 5.16, 5.15, 5.14, 5.13
  • Hortonworks HDP: 2.6.x, 3.1.x
  • MapR: 6.0.1 with MEP 5.0, 6.1 with MEP 6.0
  • Azure HDInsight: 3.6.x WASB, ADLS Gen1, ADLS Gen2
  • Amazon EMR: 5.16.x, 5.20
  • Google Cloud Dataproc 1.3
  • Databricks 5.1, 5.3
  • WANdisco enabled CDH 5.16 and HDP 2.6.5 on RH7

Big Data Streaming

  • Support for latest Cloudera, Hortonworks, MapR, HDInsight and EMR versions
  • Bug fixes and improvements

Enterprise Data Catalog

  • This update provides bug fixes for functional and performance improvements. Informatica recommends that Enterprise Data Catalog customers on 10.2.2 HotFix 1 / 10.2.2 HotFix 1 Service Pack 1 apply this service pack.

Enterprise Data Preparation

  • Functional and performance improvements
  • EDP now supports WANdisco enabled HDP 2.6.5 on RH

 

Release Notes & Product Availability Matrix (PAM)

The Informatica Global Customer Support Team is excited to announce an all-new technical webinar and demo series – Meet the Experts, in partnership with our technical product experts and Product management. These technical sessions are designed to encourage interaction and knowledge gathering around some of our latest innovations and capabilities across Data Integration, Data Quality, Big Data and so on. In these sessions, we will strive to provide you with as many technical details as possible, including new features and functionalities, and where relevant, show you a demo or product walk-through as well.

 

Topic and Agenda

 

  • Topic: Meet the Experts Webinar - "End-to-End Data Engineering for AI & Analytics on Microsoft Azure"
  • Date: Wednesday, 23 October 2019
  • Time: 9:00 AM Pacific Daylight Time (PDT)
  • Duration: 1 Hour
  • Webinar Registration Link: Webinars – Webcasts | Informatica Talks | Informatica
  • Speakers:
    • Sumeet Agrawal, Product Management, Informatica 
    • Vamshi Sriperumbudur, Product Marketing, Informatica

 

Successful next-generation AI and analytics projects require you to ingest, process, and govern all types of data at all latencies. It can be a challenge—but it doesn't have to be.

 

Our complimentary webinar, "End-to-End Data Engineering for AI & Analytics on Microsoft Azure," shows you how Informatica's Data Engineering portfolio can help you ingest all types of data into the Azure Data Lake Store (ADLS), batched or in real-time. You will learn about:

 

  • Building data pipelines to feed ADLS with data engineering integration in a Spark serverless mode
  • Leveraging Spark Structured Streaming for real-time analytics that extend to the edge
  • Finding, preparing, and operationalizing trusted data with Informatica Enterprise Data Preparation
  • Performing powerful search, data lineage, and impact analysis with Enterprise Data Catalog

 

Whether you've just started to consider Azure for analytics and AI or you're already using it, you won't want to miss this webinar and demo.

What are we announcing?

Informatica 10.2.2 HotFix 1 Service Pack 1

Who would benefit from this release?

The release is for all Big Data customers and prospects who want to take advantage of updated Hadoop distribution support as well as fixes to core platform, connectivity and other functionality. You can apply this service pack after you install or upgrade to Informatica 10.2.2 HotFix 1.

What’s in this release?

Big Data PAM

Distribution Support:

  • Cloudera CDH: 6.2, 6.1, 5.16, 5.15, 5.14, 5.13
  • Hortonworks HDP: 2.6.x, 3.1.x
  • MapR: 6.0.1 with MEP 5.0, 6.1 with MEP 6.0
  • Azure HDInsight: 3.6.x WASB, ADLS Gen1, ADLS Gen2
  • Amazon EMR: 5.16.x, 5.20
  • Google Cloud Dataproc 1.3
  • Databricks 5.1, 5.3

Big Data Streaming

  • Apache Kafka version 2.3.x support

Enterprise Data Catalog

  • The update provides bug fixes for functional and performance improvements. Informatica recommends that Enterprise Data Catalog customers on 10.2.2 HF1 apply this service pack.

Enterprise Data Preparation

  • Ability to change delimiters and text qualifiers during file preparation

Release Notes & Product Availability Matrix (PAM)

The Informatica Global Customer Support Team is excited to announce an all-new technical webinar and demo series – Meet the Experts, in partnership with our technical product experts and Product management. These technical sessions are designed to encourage interaction and knowledge gathering around some of our latest innovations and capabilities across Data Integration, Data Quality, Big Data and so on. In these sessions, we will strive to provide you with as many technical details as possible, including new features and functionalities, and where relevant, show you a demo or product walk-through as well.

 

Topic and Agenda

 

 

To get the most value from your data, you need to maintain a robust, stable production environment. Informatica Big Data Management (BDM) has been enhanced to help you do that with integrated DevOps and DataOps.

 

Join our complimentary webinar, "Operationalize Big Data Management With Integrated DevOps and DataOps," to learn more about what's new in BDM.

 

You will learn about:

  • Leveraging version control systems like Git
  • Invoking Informatica BDM processes from open source technologies like Jenkins
  • Using concurrency, stability, and other operationalization enhancements

 

Don't miss this opportunity to operationalize your big data management and extract more value from your big data.

What are we announcing?

Informatica 10.2.2 HotFix 1

 

Who would benefit from this release?

This release is for all Big Data and Enterprise Data Catalog customers and prospects who want to take advantage of the new capabilities and updated Hadoop distribution support. The release also includes fixes to core platform and connectivity. It includes support for new environments as well as fixes to support stable deployments.

 

What’s in this release?

This release includes Big Data Management, Big Data Quality, Big Data Streaming, Enterprise Data Catalog, and Enterprise Data Preparation capabilities.

  • Big Data PAM Update
    • Distribution Support:
      • Cloudera CDH: 6.2, 6.1, 5.16, 5.15, 5.14, 5.13
      • Hortonworks HDP: 2.6.x, 3.1 (Tech Preview)
      • MapR: 6.0.1 with MEP 5.0, 6.1 with MEP 6.0
      • Azure HDInsight: 3.6.x WASB, ADLS Gen1
      • Amazon EMR: 5.16.x, 5.20
      • Databricks 5.1
    • New Relational Systems:
      • Oracle 18c (Source/Target)
  • Enterprise Data Catalog Updates
    • Scanners
      • SAP HANA (Metadata Only): New scanner for SAP HANA that can extract object and lineage metadata. Lineage metadata includes calculation view to table lineage. Profiling is not supported.
      • ADLS Gen 2 (Metadata Only): New scanner for Azure Data Lake Store Gen 2 to extract metadata from files and folders. All formats supported by the ADLS Gen 1 scanner are supported for Gen 2. Profiling is not supported.
      • Profiling Warehouse Scanner: Extract profiling and domain discovery statistics from an IDQ or a BDQ profiling warehouse. Users who have already run profiling and enterprise discovery in IDQ/BDQ can now extract these profiling results and visualize them in EDC.
      • SAP PowerDesigner: Extract database model objects from physical diagrams, including internal lineage. Model objects can be linked to physical objects from other scanners.
      • (Tech Preview) Lineage Extraction from Stored Procedures: Ability to extract data lineage at the column level for stored procedures in Oracle and SQL Server.
      • (Tech Preview) Oracle Data Integrator: Ability to extract data lineage at the column level with transformation logic from Oracle Data Integrator.
      • (Tech Preview) IBM DataStage: Ability to extract data lineage at the column level with transformation logic from IBM DataStage jobs.
      • Enhanced MS SQL Server scanner: Support for Windows-based authentication using the EDC agent
    • Scanner Framework
      • Case insensitive linking: Ability to mark resources as case sensitive/insensitive. A new link ID is now generated for every object based on the above property. Useful for automatic lineage linking where ETL/BI tools refer to the object using a different case compared to the data source.
      • Offline scanner: support added for Sybase, IBM DB2 LUW, IBM DB2 z/OS, Netezza, MySQL, JDBC, PowerCenter, Informatica Platform, File Systems, Tableau, MicroStrategy, Hive, HDFS, Cloudera Navigator, Atlas
      • Custom Scanner Enhancements: The following new features are available for users of custom scanners:
        • Pre-Scripts: Users can now configure pre-scripts that are run before scanner execution. This allows running any custom extraction jobs or setup tasks.
        • File Path: Users can now configure a file path from which to pick up the scanner assets CSV, instead of uploading it to the catalog. The file should be either mounted or copied to the Informatica Domain machine with read permissions. This helps with automating and scheduling custom scanner runs.
      • Custom Scanner Framework Enhancements: The following new features are available for the developers of custom scanners:
        • Assign icons:  Ability to assign icons to types in the custom model
        • Detailed Lineage: Custom scanners can now include detailed lineage views which are rendered like transformation lineage from any native scanner.
        • Custom relationships: Ability to add custom relationships to be displayed in the relationship diagrams
    • Business User Experience
      • Search Operators: New search operators - AND, OR, NOT, double quotes, title:, and description: - for advanced search queries.
      • Search Tabs: Administrators can now create “Search Tabs” designed to personalize the search experience for user groups and individual users. These search tabs are created with pre-selected facets that apply to a set of users/groups. EDC creates the following search tabs by default: “All”, “Data Objects”, “Data Elements”, “Business Terms”, “Reports” and “Resources”.
    • EDC Plug-In
      • Enterprise Data Catalog Tableau Extension: The Enterprise Data Catalog Extension is a native extension for Tableau dashboards that you can use within Tableau Desktop, Tableau Server, and all web browsers supported by Tableau version 2018.2.x onwards.
    • Supportability
      • Progress logs for re-index and bulk import/export
      • Log collection utility expanded to support Cloudera in 10.2.2 HF1
    • (Tech Preview) Data Provisioning
      • (Tech Preview) Data Provisioning: After discovery, users can now move data to a target where it can be analyzed. EDC works with IICS to provision data for end users. Credentials are supplied by the users for both the source and the target.
        • Supported Sources in this release: Oracle, SQL Server
        • Supported Targets in this release: Amazon S3, Tableau Online, Oracle, Azure SQL DB
      • (Tech Preview) Live Data Preview: Users can now preview source data at the table level by providing source credentials.
    • CLAIRE
      • Intelligent Glossary Associations: The tech preview capability in 10.2.2 for linking glossaries to technical metadata is now GA. Additionally, EDC now supports auto-association of glossaries to objects at the table/file level.
    • PAM
      • Deployment Support
        • Cloudera: CDH 6.2, 6.1, 5.16, 5.15, 5.14
        • Hortonworks: HDP 2.6.x, (Tech Preview) HDP 3.1
      • Source Support
        • Hive, HDFS on CDH 6.1, 6.2
        • Hive, HDFS on HDP 3.1
        • Oracle Data Integrator 11g, 12c
        • Profile Warehouse on Oracle, SQL Server and IBM DB2 for Informatica 10.1.1 HF1, 10.2, 10.2.1, 10.2.2
        • SAP PowerDesigner 7.5.x to 16.x
        • SAP HANA DB 2.0
  • EDP Updates
    • Hadoop distribution support (aligned with Big Data Management)
    • Performance improvement in Preparation
    • Search alignment with Enterprise Data Catalog: Alignment with EDC in terms of search results and user experience (example: search tabs)
  • Connectivity Updates
    • Support for Sqoop mappings with override queries that use aliases in Spark mode
    • PAM certification for HBase across ecosystems: Cloudera, Hortonworks, MapR, AWS, and Azure
    • Support for the Sqoop "--boundary-query" argument to specify a custom SQL query for Sqoop imports (see the example after this list)
  • Platform PAM Update
    • Oracle 18c - added
    • JVM support update: Azul OpenJDK 1.8.0_212 – updated
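
As a hedged illustration of the Sqoop --boundary-query argument noted above (the connection string, table, columns, and paths are hypothetical), an import can supply its own boundary SQL instead of letting Sqoop compute the minimum and maximum of the split column:

# Example Sqoop import that overrides the default boundary calculation
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username etl_user \
  --password-file /user/etl/.sqoop_pwd \
  --table ORDERS \
  --split-by ORDER_ID \
  --boundary-query "SELECT MIN(ORDER_ID), MAX(ORDER_ID) FROM ORDERS WHERE ORDER_DATE >= DATE '2019-01-01'" \
  --target-dir /data/raw/orders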

 

Release Notes & Product Availability Matrix (PAM)

 

Informatica 10.2.2 HotFix 1 Release Notes: https://docs.informatica.com/big-data-management/shared-content-for-big-data/10-2-2-hotfix-1/big-data-release-notes/abstract.html

 

PowerExchange Adapters 10.2.2 HotFix 1 Release Notes: https://docs.informatica.com/data-integration/powerexchange-adapters-for-informatica/10-2-2-hotfix-1/powerexchange-adapters-for-informatica-release-notes/abstract.html

 

PAM for Informatica 10.2.2 HotFix 1: https://network.informatica.com/docs/DOC-18280

 

You can download the Hotfixes from here.

The Informatica Global Customer Support Team is excited to announce an all-new technical webinar and demo series – Meet the Experts, in partnership with our technical product experts and Product management. These technical sessions are designed to encourage interaction and knowledge gathering around some of our latest innovations and capabilities across Data Integration, Data Quality, Big Data and so on. In these sessions, we will strive to provide you with as many technical details as possible, including new features and functionalities, and where relevant, show you a demo or product walk-through as well.

 

Topic and Agenda

 

 

If you need to integrate and ingest large amounts of data at speed and scale, Informatica has two new big data cloud services to help.

 

Join our complimentary Meet the Experts webinar on July 16 to discover the capabilities of Informatica Intelligent Cloud Services (IICS) Integration at Scale and IICS Ingestion at Scale. You will learn:

  • How to lower overall TCO with CLAIRE-based auto scaling and provisioning of serverless Spark support
  • How to manage streaming and IoT data with real-time monitoring and lifecycle management
  • How to accelerate AI/ML and advanced analytics projects with Informatica Enterprise Data Preparation and DataRobot

 

If you want to create a proof of concept for a big data project in just six weeks, turn your data lake into a modern data marketplace, and more, you won't want to miss this deep dive and demo.

The Informatica Global Customer Support Team is excited to announce an all-new technical webinar and demo series – Meet the Experts, in partnership with our technical product experts and Product management. These technical sessions are designed to encourage interaction and knowledge gathering around some of our latest innovations and capabilities across Data Integration, Data Quality, Big Data and so on. In these sessions, we will strive to provide you with as many technical details as possible, including new features and functionalities, and where relevant, show you a demo or product walk-through as well.

 

Topic and Agenda

 

 

Once the host approves your request, you will receive a confirmation email with instructions for joining the meeting.

 

Here is the agenda for the webinar:

  • Spark Architecture
    • Spark Integration with BDM
    • Spark shuffle
    • Spark dynamic allocation
  • Journey from Hive, Blaze, to Spark
  • Spark troubleshooting and self-service
  • Spark Monitoring
  • References
  • Q & A

 

Speaker Details

 

The session will be presented by Vijay Vipin and Ramesh Jha, both Informatica BDM SMEs. They have been supporting our customers for over 5 years and have developed deep expertise across all aspects of the BDM product portfolio.

What are we announcing?

Informatica 10.2.1 Service Pack 2

Who would benefit from this release?

This release is for all Big Data customers and prospects who want to take advantage of updated Hadoop distribution support as well as fixes to core platform, connectivity, and other functionality. You can apply this service pack after you install or upgrade to Informatica 10.2.1.

What’s in this release?

Big Data PAM Update

Applies to Big Data Management, Big Data Quality, and Big Data Streaming

  • Distribution Support
    • Cloudera CDH: 5.11.x, 5.12.x, 5.13.x, 5.14.x, 5.15.x
    • Hortonworks HDP: 2.5.x, 2.6.x
    • MapR 6.0 with MEP 5.x
    • Amazon EMR 5.14.x
    • Azure HDInsight 3.6.x

Enterprise Data Lake

  • Bug fixes for functional and performance improvements

Enterprise Data Catalog

  • Bug fixes for functional and performance improvements

Informatica recommends that all Enterprise Data Catalog customers on 10.2.1 apply this service pack.

Informatica 10.2.1 SP2 Release Notes

PAM for Informatica 10.2.1 SP2

You can download the Hotfixes from here.

What are we announcing?

Informatica 10.2.2 Service Pack 1

 

Who would benefit from this release?

This release is for all Big Data customers and prospects who want to take advantage of updated compute cluster support, updated streaming capabilities and security enhancements as well as fixes to core platform, connectivity, and other functionality. You can apply this service pack after you install or upgrade to Informatica 10.2.2.

 

What’s in this release?

 

Big Data PAM Update

Applies to Big Data Management, Big Data Quality, and Big Data Streaming

 

  • Distribution Support:
    • Cloudera CDH: 5.15, 5.16, 6.1
    • Hortonworks HDP: 2.6.5, 3.1 (Tech Preview)
    • MapR: 6.0.1 with MEP 5.0, 6.1 MEP 6.0
    • Azure HDInsight: 3.6.x WASB
    • Amazon EMR 5.16.x, EMR 5.20
    • Databricks 5.1
  • Security Enhancements:
    • Security enhancements for AWS. The following security mechanisms on AWS are now supported:
      • At rest:
        • SSE-S3
        • SSE-KMS
        • CSE-KMS
      • In transit:
        • SSE-SE
        • SSE-KMS

 

Big Data Streaming

  • Connectivity and Cloud
    • New connectivity: Native connectivity to Amazon S3 targets
    • Connectivity enhancements: Filename port support for HDFS targets
  • Stream processing and analytics
    • Message header support in streaming sources: JMS standard headers support
    • Enhanced MapR distribution support:
      • Support for Kafka in MapR distributions
      • Support for secured MapR Streams

Connectivity

  • Security Enhancements:
    • Certified SQL Server for SSL support with Sqoop

 

Enterprise Data Lake - now renamed to Enterprise Data Preparation

  • Product Rename:
    • With this release, Informatica Enterprise Data Lake is now renamed to Informatica Enterprise Data Preparation.
  • Distribution Support:
    • Cloudera CDH: 5.15, 5.16, 6.1
    • Hortonworks HDP: 2.6.5, 3.1 (Tech Preview)
    • MapR: 6.0.1 with MEP 5.0, 6.1 MEP 6.0
    • Azure HDInsight: 3.6.x WASB
    • Amazon EMR 5.16.x, EMR 5.20
  • Functional Improvements:
    • Users can preview and prepare Avro and Parquet files in the data lake.
    • Users can revert all data type inferencing within a single worksheet during data preparation.
    • Administrators can disable automatic data type inferencing for all worksheets in all projects.

 

Enterprise Data Catalog

  • Distribution Support Updates for EDC External Cluster Support:
    • Cloudera CDH: 6.1
    • Hortonworks HDP: 3.1 (Tech Preview)

 

Release Notes & Product Availability Matrix (PAM)

 

Informatica 10.2.2 SP1 Release Notes: https://docs.informatica.com/big-data-management/shared-content-for-big-data/10-2-2-service-pack-1/big-data-release-notes.html

 

PAM for Informatica 10.2.2 SP1:   https://network.informatica.com/docs/DOC-18072#comment-37896

Executive Summary:

 

Informatica Big Data Management (BDM) and Informatica Big Data Quality (BDQ) mappings that perform Decimal manipulations on source data larger than 256 MB can potentially produce data inconsistencies in decimal port values when executed on the Blaze engine. This issue is being tracked as bug # BDM-24814 and is known to manifest under the following conditions:

 

  1. Active transformations with Decimal ports - potential data loss and dropped rows:
    • Filter and Router transformations with Decimal datatype ports in the filter condition
    • Joiner transformations with join condition ports of Decimal datatype

  2. Passive transformations with Decimal ports - potential data inconsistency, with Decimal columns changed to NULL:
    • Expression transformations with Decimal manipulation

 

Affected Software

 

Informatica BDM/BDQ 10.0.x

Informatica BDM/BDQ 10.1.x

Informatica BDM/BDQ 10.2.0, 10.2.0 HF1, 10.2.0 HF2

Informatica BDM/BDQ 10.2.1, 10.2.1 SP1

Informatica BDM/BDQ 10.2.2

 

Suggested Actions

 

Step 1: Refer to the Executive Summary and knowledge base article KB-575249 to identify whether you are impacted.

Step 2: If impacted, perform the applicable task to resolve the issue:

  • BDM/BDQ 10.2.1 – Apply Service Pack 2 (tentative release date is mid-May)
  • BDM/BDQ 10.2.2 – Apply Service pack 1 (tentative release date is mid-May)
  • BDM/BDQ 10.1.1 HF1 - Apply Emergency Bug Fix (EBF) that is available for download from https://tsftp.informatica.com/updates/Informatica10/10.1.1 HotFix1/EBF-14519
  • Other BDM/BDQ versions: Please reach out to Informatica Global Customer Support

 

Informatica strongly recommends applying this patch for all Informatica environments that fall into the problem scope defined in the executive summary.

 

Frequently Asked Questions (FAQs) related to this advisory:

 

Q1: What is the scope of this advisory?

A: This advisory applies to Informatica Big Data Management 10.0, 10.1.0, 10.1.1, 10.2.0, 10.2.1, and 10.2.2, and only when running mappings in Hadoop pushdown mode using the Blaze engine. This advisory does not apply to other Informatica platforms such as Informatica Data Quality and PowerCenter.

 

Q2: I am using one of the affected product versions and also have other Emergency Bug Fixes (EBFs) applied. What should I do?

A: You might need a combination EBF that includes the previous fix(es) as well as the fix for the issue covered in this advisory. Please contact Informatica Support to confirm if you would need a combination EBF.

 

Q3: Whom should I contact for additional questions?
A: For all questions related to this advisory, please contact your nearest Informatica Global Customer Support center.

https://www.informatica.com/services-and-training/support-services/contact-us.html

 

Disclaimer

INFORMATICA LLC PROVIDES THIS INFORMATION ‘AS IS’ WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT.

 

Revisions

V1.0 (April 22, 2019): Customer advisory published

The Informatica Global Customer Support Team is excited to announce an all-new technical webinar and demo series – Meet the Experts, in partnership with our technical product experts and Product management. These technical sessions are designed to encourage interaction and knowledge gathering around some of our latest innovations and capabilities across Data Integration, Data Quality, Big Data and so on. In these sessions, we will strive to provide you with as many technical details as possible, including new features and functionalities, and where relevant, show you a demo or product walk-through as well.

 

Topic and Agenda

 

 

Once the host approves your request, you will receive a confirmation email with instructions for joining the meeting.

 

Here is the agenda for the webinar.

 

1. Overview of Blaze architecture and components

2. Blaze configuration (hadoopEnv.properties and beyond)

3. Logs location and collection

4. Common issues and troubleshooting

5. Tips and Tricks

 

This session is intended for BDM customers who execute their mappings, profiles, and scorecards using the Blaze execution engine. At the end of this session, customers will have insight into the Blaze architecture, the components and services associated with Blaze, how to troubleshoot the most common issues, and how to access and provide the logs that GCS requires for troubleshooting.

 

Speaker Details

 

The presenter for this session is Sujata, an Informatica GCS veteran who has handled IDQ and BDM products for the past 5 years and has developed deep expertise in troubleshooting Blaze-related issues.



Text Classification in BDM using NLP

 

This document shows how to do text classification in BDM using NLP. We will use a PredictionIO server to run our classification engine. The demo covers how to install PredictionIO, build, train, and deploy a text classification template, and use it in BDM.

 

Apache PredictionIO Overview

 

Apache PredictionIO is an open source Machine Learning Server built on top of a state-of-the-art open source stack for developers and data scientists to create predictive engines for any machine learning task.

 

It lets you:

 

  • Quickly build and deploy an engine as a web service on production with customizable templates;
  • Respond to dynamic queries in real time once deployed as a web service;
  • Evaluate and tune multiple engine variants systematically;
  • Unify data from multiple platforms in batch or in real time for comprehensive predictive analytics;
  • Speed up machine learning modeling with systematic processes and pre-built evaluation measures;
  • Support machine learning and data processing libraries such as Spark MLlib and OpenNLP;
  • Implement your own machine learning models and seamlessly incorporate them into your engine;
  • Simplify data infrastructure management.

Apache PredictionIO can be installed as a full machine learning stack, bundled with Apache Spark, MLlib, HBase, Spray and Elasticsearch, which simplifies and accelerates scalable machine learning infrastructure management.

 

 

 

PredictionIO Architecture

 

Apache PredictionIO consists of different components.

 

PredictionIO Platform: An open source machine learning stack built on top of state-of-the-art open source applications such as Apache Spark, Apache Hadoop, Apache HBase and Elasticsearch.

 

Event Server: This continuously gathers data from your web server or mobile application server in real-time mode or batch mode. The gathered data can be used to train the engine or to provide a unified view for data analysis. The event server uses Apache HBase to store the data.

 

Engine Server: The engine server is responsible for making the actual prediction. It reads the training data from the data store and uses one or more machine learning algorithms to build the predictive models. An engine, once deployed as a web service, responds to queries made by a web or mobile app using the REST API or an SDK.

 

Template Gallery: This gallery offers various types of pre-built engine templates. You can choose a template which is similar to your use case and modify it according to your requirements.

 

Prerequisites

 

PredictionIO can also be installed on an existing Hadoop cluster, but for this demo we will install the following standalone components and configure them for PredictionIO:

 

  • Java 1.8
  • Apache Spark
  • Apache Hbase
  • Apache Hadoop
  • Elastic Search

 

 

Installing Apache PredictionIO

 

Make sure Java is installed on the machine, set JAVA_HOME, and add $JAVA_HOME/bin to your PATH.
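
A minimal sketch, assuming an OpenJDK 1.8 package install (the JAVA_HOME path below is illustrative; adjust it to your JDK location). These lines can be added to ~/.bash_profile:

# Point JAVA_HOME at your JDK install (example path only)
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export PATH=$JAVA_HOME/bin:$PATH

# Verify that the expected Java version is on the PATH
java -version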

 

 

 

Download and Install Apache PredictionIO

 

Apache provides PredictionIO source files that can be downloaded and compiled locally. Create a temporary directory in which to compile the source files:

 

mkdir /tmp/pio_sourcefiles

cd /tmp/pio_sourcefiles

 

 

Download the PredictionIO source file archive from any Apache mirror site:

 

wget http://apache.mirror.vexxhost.com/incubator/predictionio/0.12.0-incubating/apache-predictionio-0.12.0-incubating.tar.gz

 

Extract the archive and compile the source to create a distribution of PredictionIO

 

tar -xvf apache-predictionio-0.12.0-incubating.tar.gz

./make-distribution.sh

 

The distribution will be built against the default versions of the dependencies: Scala 2.11.8, Spark 2.1.1, Hadoop 2.7.3, and Elasticsearch 5.5.2. The build takes approximately 10-15 minutes.

 

You can also build PredictionIO using the latest supported versions of Scala, Spark, Hadoop, and Elasticsearch, but you may see some warnings during the build because some functions might be deprecated. To run the build with your own versions, run ./make-distribution.sh -Dscala.version=2.11.11 -Dspark.version=2.1.2 -Dhadoop.version=2.7.4 -Delasticsearch.version=5.5.3, replacing the version numbers as needed.

 

Once the build successfully finishes, you will see the following message at the end.

 

PredictionIO binary distribution created at PredictionIO-0.12.0-incubating.tar.gz

 

The PredictionIO binary files will be saved in the PredictionIO-0.12.0-incubating.tar.gz archive. Extract the archive into the /opt directory and give ownership of it to the current user.

 

sudo tar xf PredictionIO-0.12.0-incubating.tar.gz -C /opt/

sudo chown -R $USER:$USER /opt/PredictionIO-0.12.0-incubating

 

 

Set the PIO_HOME environment variable.

 

echo "export PIO_HOME=/opt/PredictionIO-0.12.0-incubating" >> ~/.bash_profile

source ~/.bash_profile

 

 

Install Required Dependencies

 

Create a new directory to install PredictionIO dependencies such as HBase, Spark and Elasticsearch.

 

mkdir /opt/PredictionIO-0.12.0-incubating/vendors

 

Download Scala version 2.11.8 and extract it into the vendors directory.

 

wget https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz

tar xf scala-2.11.8.tgz -C /opt/PredictionIO-0.12.0-incubating/vendors

 

Download Apache Hadoop version 2.7.3 and extract it into the vendors directory.

 

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz

tar xf hadoop-2.7.3.tar.gz -C /opt/PredictionIO-0.12.0-incubating/vendors

 

Apache Spark is the default processing engine for PredictionIO. Download Spark version 2.1.1 and extract it into the vendors directory.

 

wget https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz

tar xf spark-2.1.1-bin-hadoop2.7.tgz -C /opt/PredictionIO-0.12.0-incubating/vendors

 

Download Elasticsearch version 5.5.2 and extract it into the vendors directory.

 

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.2.tar.gz

tar xf elasticsearch-5.5.2.tar.gz -C /opt/PredictionIO-0.12.0-incubating/vendors

 

Finally, download HBase version 1.2.6 and extract it into the vendors directory.

 

wget https://archive.apache.org/dist/hbase/stable/hbase-1.2.6-bin.tar.gz

tar xf hbase-1.2.6-bin.tar.gz -C /opt/PredictionIO-0.12.0-incubating/vendors

 

Open the hbase-site.xml configuration file to configure HBase to work in a standalone environment.

 

vi /opt/PredictionIO-0.12.0-incubating/vendors/hbase-1.2.6/conf/hbase-site.xml

Add the following block to the hbase configuration

 

hbase-site.xml

<configuration>

<property>

<name>hbase.rootdir</name>

<value>file:///opt/PredictionIO-0.12.0-incubating/vendors/hbase-1.2.6/data</value>

</property>

<property>

<name>hbase.zookeeper.property.dataDir</name>

<value>/opt/PredictionIO-0.12.0-incubating/vendors/hbase-1.2.6/zookeepe</value>

</property>

</configuration>

 

Your hbase-site.xml should now contain the configuration block shown above.

 

The data directory will be created automatically by HBase. Edit the HBase environment file to set the JAVA_HOME path.

 

vi /opt/PredictionIO-0.12.0-incubating/vendors/hbase-1.2.6/conf/hbase-env.sh

 

Add JAVA_HOME on line 27, and comment out lines 46 and 47, as they are not needed for Java 8.
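
A minimal sketch of the relevant hbase-env.sh edits, assuming an OpenJDK 1.8 install (the JAVA_HOME path is illustrative, and lines 46 and 47 are typically the JDK 7 PermSize options, which Java 8 ignores):

# Line 27: point HBase at your JDK (example path only)
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk

# Lines 46-47: leave the PermSize options commented out on Java 8
# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"
# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"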

 

 

Configuring Apache PredictionIO

 

The default configuration in the PredictionIO environment file pio-env.sh assumes that we are using PostgreSQL or MySQL. As we have used HBase and Elasticsearch, we will need to modify nearly every configuration in the file. It's best to take a backup of the existing file and create a new PredictionIO environment file.

 

mv /opt/PredictionIO-0.12.0-incubating/conf/pio-env.sh /opt/PredictionIO-0.12.0-incubating/conf/pio-env.sh.bak

 

Create a new file for PredictionIO environment configuration.

 

vi /opt/PredictionIO-0.12.0-incubating/conf/pio-env.sh

The file should look like below

 

pio-env.sh

 

# PredictionIO Main Configuration

#

# This section controls core behavior of PredictionIO. It is very likely that

# you need to change these to fit your site.

 

# SPARK_HOME: Apache Spark is a hard dependency and must be configured.

SPARK_HOME=$PIO_HOME/vendors/spark-2.1.1-bin-hadoop2.7

 

# POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.0.0.jar

# MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar

 

# ES_CONF_DIR: You must configure this if you have advanced configuration for

#              your Elasticsearch setup.

ES_CONF_DIR=$PIO_HOME/vendors/elasticsearch-5.5.2/config

 

# HADOOP_CONF_DIR: You must configure this if you intend to run PredictionIO

#                  with Hadoop 2.

HADOOP_CONF_DIR=$PIO_HOME/vendors/spark-2.1.1-bin-hadoop2.7/conf

 

# HBASE_CONF_DIR: You must configure this if you intend to run PredictionIO

#                 with HBase on a remote cluster.

HBASE_CONF_DIR=$PIO_HOME/vendors/hbase-1.2.6/conf

 

# Filesystem paths where PredictionIO uses as block storage.

PIO_FS_BASEDIR=$HOME/.pio_store

PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines

PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

 

# PredictionIO Storage Configuration

#

# This section controls programs that make use of PredictionIO's built-in

# storage facilities. Default values are shown below.

#

# For more information on storage configuration please refer to

# http://predictionio.incubator.apache.org/system/anotherdatastore/

 

# Storage Repositories

 

# Default is to use PostgreSQL

PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta

PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

 

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event

PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

 

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model

PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS

 

# Storage Data Sources

 

# PostgreSQL Default Settings

# Please change "pio" to your database name in PIO_STORAGE_SOURCES_PGSQL_URL

# Please change PIO_STORAGE_SOURCES_PGSQL_USERNAME and

# PIO_STORAGE_SOURCES_PGSQL_PASSWORD accordingly

# PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc

# PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio

# PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio

# PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio

 

# MySQL Example

# PIO_STORAGE_SOURCES_MYSQL_TYPE=jdbc

# PIO_STORAGE_SOURCES_MYSQL_URL=jdbc:mysql://localhost/pio

# PIO_STORAGE_SOURCES_MYSQL_USERNAME=pio

# PIO_STORAGE_SOURCES_MYSQL_PASSWORD=pio

 

# Elasticsearch Example

PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch

PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost

PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200

PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES=http

PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=pio

PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-5.5.2

 

# Optional basic HTTP auth

# PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME=my-name

# PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD=my-secret

# Elasticsearch 1.x Example

# PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch

# PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=<elasticsearch_cluster_name>

# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost

# PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300

# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-1.7.6

 

# Local File System Example

PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs

PIO_STORAGE_SOURCES_LOCALFS_PATH=$PIO_FS_BASEDIR/models

 

# HBase Example

PIO_STORAGE_SOURCES_HBASE_TYPE=hbase

PIO_STORAGE_SOURCES_HBASE_HOME=$PIO_HOME/vendors/hbase-1.2.6

 

# AWS S3 Example

# PIO_STORAGE_SOURCES_S3_TYPE=s3

# PIO_STORAGE_SOURCES_S3_BUCKET_NAME=pio_bucket

# PIO_STORAGE_SOURCES_S3_BASE_PATH=pio_model

 

Open the Elasticsearch configuration file:

 

cat /opt/PredictionIO-0.12.0-incubating/vendors/elasticsearch-5.5.2/config/elasticsearch.yml

 

Uncomment the cluster.name line and set the cluster name to exactly the same value provided in the PredictionIO environment file. The cluster name is set to pio in the configuration below.

 

elasticsearch.yml

 

# ======================== Elasticsearch Configuration =========================

#

# NOTE: Elasticsearch comes with reasonable defaults for most settings.

#       Before you set out to tweak and tune the configuration, make sure you

#       understand what are you trying to accomplish and the consequences.

#

# The primary way of configuring a node is via this file. This template lists

# the most important settings you may want to configure for a production cluster.

#

# Please consult the documentation for further information on configuration options:

# https://www.elastic.co/guide/en/elasticsearch/reference/index.html

#

# ---------------------------------- Cluster -----------------------------------

#

# Use a descriptive name for your cluster:

#

cluster.name: pio

#

# ------------------------------------ Node ------------------------------------

#

# Use a descriptive name for the node:

#

#node.name: node-1

#

# Add custom attributes to the node:

#

#node.attr.rack: r1

#

# ----------------------------------- Paths ------------------------------------

#

# Path to directory where to store the data (separate multiple locations by comma):

#

#path.data: /path/to/data

#

# Path to log files:

#

#path.logs: /path/to/logs

#

# ----------------------------------- Memory -----------------------------------

#

# Lock the memory on startup:

#

#bootstrap.memory_lock: true

#

# Make sure that the heap size is set to about half the memory available

# on the system and that the owner of the process is allowed to use this

# limit.

#

# Elasticsearch performs poorly when the system is swapping the memory.

#

# ---------------------------------- Network -----------------------------------

#

# Set the bind address to a specific IP (IPv4 or IPv6):

#

#network.host: 192.168.0.1

#

# Set a custom port for HTTP:

#

#http.port: 9200

#

# For more information, consult the network module documentation.

#

# --------------------------------- Discovery ----------------------------------

#

# Pass an initial list of hosts to perform discovery when new node is started:

# The default list of hosts is ["127.0.0.1", "[::1]"]

#

#discovery.zen.ping.unicast.hosts: ["host1", "host2"]

#

# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):

#

#discovery.zen.minimum_master_nodes: 3

#

# For more information, consult the zen discovery module documentation.

#

# ---------------------------------- Gateway -----------------------------------

#

# Block initial recovery after a full cluster restart until N nodes are started:

#

#gateway.recover_after_nodes: 3

#

# For more information, consult the gateway module documentation.

#

# ---------------------------------- Various -----------------------------------

#

# Require explicit names when deleting indices:

#

#action.destructive_requires_name: true

 

Add the $PIO_HOME/bin directory to the PATH variable so that the PredictionIO executables can be run directly.

 

echo "export PATH=$PATH:$PIO_HOME/bin" >> ~/.bash_profile

source ~/.bash_profile

 

 

At this point, PredictionIO is successfully installed on your server.

 

 

Starting PredictionIO

 

 

You can start all the services in PredictionIO, such as Elasticsearch, HBase, and the Event Server, using a single command:

pio-start-all

You will see output similar to the following:

Starting Elasticsearch...
Starting HBase...
starting master, logging to /opt/PredictionIO-0.12.0-incubating/vendors/hbase-1.2.6/bin/../logs/hbase-user-master-vultr.guest.out
   Waiting 10 seconds for Storage Repositories to fully initialize...
   Starting PredictionIO Event Server...

Use the following command to check the status of the PredictionIO server:

pio status

 

Implementing an Engine Template

 

Several ready-to-use engine templates are available in the PredictionIO Template Gallery and can be easily installed on the PredictionIO server. You can browse the list of engine templates to find one that is close to your requirements, or you can write your own engine.

 

In this tutorial, we will implement the Text Classification engine template to demonstrate the functionality of PredictionIO server using some sample data.

 

This engine template takes input such as Twitter, email, or newsgroup text and identifies the topic the text is about: you send a query containing the text, and the output is the topic name.

 

Install Git, as it will be used to clone the repository.

 

sudo yum -y install git

Clone the text classification engine template on your system.

 

git clone https://github.com/amrgit/textclassification.git

cd template-classification-opennlp

 

Create a new PredictionIO application; you can choose any name for it:

 

pio app new docclassification

You can type the following command to list the apps created inside PredictionIO:

 

pio app list

 

Install the PredictionIO Python SDK using pip:

 

pip install predictionio

 

 

Run the Python script to add the sample data to the event server. The Git project already includes some sample datasets that can be used to train the model. We will use the 20 Newsgroups training data set for this demo.

python3 data/import_data.py --access_key 8FhrUWaTIZJLLPcuS0bRu64O4TiZoYjgZFWjWm_Mik3QgoxoZAUO-7Ti4xo59ZcX --file datasets/20ng-train-no-stop.txt

 

 

If the import is successful, you should see a confirmation message.

 

 

The above script imports 11,294 events. To check whether the events were imported, you can run the following query:

 

curl -i -X GET "http://localhost:7070/events.json?accessKey=8FhrUWaTIZJLLPcuS0bRu64O4TiZoYjgZFWjWm_Mik3QgoxoZAUO-7Ti4xo59ZcX"

 

 

The output will show you the list of all the imported events in JSON format.

 

Open the engine.json file in an editor. This file contains the configuration of the engine. Make sure the appId matches the Id shown by the “pio app list” command.

 

 

Build the application using the following command. If you do not want verbose output, remove the --verbose parameter.

 

pio build --verbose

 

You should see a message that the build is successful and the engine is ready for training.

 

 

Train the engine. During the training, the engine analyzes the data set and trains itself according to the provided algorithm.

pio train

If the train command fails with out-of-memory (OOM) errors, use the following command to increase the driver and executor memory:

 

pio train -- --driver-memory 2g --executor-memory 4g

You should see a message that the training is successful.

 

 

Before we deploy the application, we need to open port 8000 so that the status of the application can be viewed on the web GUI. The websites and applications using the deployed engine will also send and receive their queries through this port. You can use a different port in the deploy command if you prefer.
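
If the server uses firewalld (a reasonable assumption for the yum-based system used in this demo; adapt the commands to your own firewall), a minimal sketch for opening the port:

# Open port 8000 for the PredictionIO engine server and reload the firewall rules
sudo firewall-cmd --permanent --add-port=8000/tcp
sudo firewall-cmd --reload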

 

You can deploy the PredictionIO engine using the following command

 

pio deploy

You can increase the driver memory for the deploy command as shown below, and you can use a different port with the --port argument.

 

pio deploy -- --driver-memory 4G &

You will see a message that the engine is deployed and running.
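
As a quick sanity check (the sentence text is arbitrary, and the name of the returned field depends on the engine template), you can send an ad-hoc query to the deployed engine on port 8000:

curl -H "Content-Type: application/json" \
  -d '{ "sentence": "The spacecraft entered orbit around Mars last night" }' \
  http://localhost:8000/queries.json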

 

 

Calling PredictionIO engine through BDM

 

BDM sends data to the PredictionIO server, which passes it to the prediction engine, and the engine sends the results back to BDM.

 

 

 

You can use the sample datasets provided in the GitHub repository for testing through a BDM mapping, or use your own dataset. For this demo, I am using the datasets from GitHub and have copied them onto my Hadoop cluster.

 

 

Create a flat file data object called opennlp_dataset in the Developer client. In the Advanced tab, change the connection type to Hadoop File System and set the connection name to your HDFS connection. Also change the source file directory to the HDFS location.

 

 

 

 

Create a new mapping and call it m_Text_Classification_OpenNLP

 

 

 

 

Drag the flat file object opennlp_dataset created in the above step into the mapping workspace and choose the Read operation.

 

 

 

Add a Python transformation and create an input port called “data” and an output port called “class_output”.

 

 

 

In the Python tab of the Python transformation, add the following code:

 

import predictionio

# Connect to the deployed PredictionIO engine server (replace the host with your server)
engine_client = predictionio.EngineClient(url="http://predictionio_host:8000")

# Send the text from the "data" input port to the engine; the response is a dictionary of result fields
text_class = engine_client.send_query({"sentence": data})

# Copy the predicted topic into the "class_output" output port
for key in text_class:
    class_output = text_class[key]

 

 

 

At this point the mapping contains the flat file read object and the Python transformation.

 

 

 

Add a pass-through Expression transformation and drag the class_output port from the Python transformation to the Expression transformation.

 

Right-click the Expression transformation and click Create Target, choose Relational, choose Hive from the drop-down, and name the Hive table text_class_output.

 

The final mapping reads the flat file source, classifies the text in the Python transformation, and writes the result to the Hive target.

 

 

 

 

In the mapping properties window, choose Spark as the execution engine and select the Hadoop connection.

 

 

Run the mapping and monitor it through the Administrator console.

 

 

 

Once the mapping succeeds, verify the output through Beeline or any Hive client. For this demo, I am using Zeppelin to query the table and view the results as a pie chart.

 

 

 

You can also view the results in tabular format in Zeppelin, showing the count of messages for each topic name.

 

 

PredictionIO has other engine templates which can be deployed in a similar fashion and used in BDM.
