Informatica Big Data 10.2.2 Release

Version 2

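    Original work by Yang Xiaodong, INFA Technical Super Group management committee (QQ group: 309925255). All rights reserved; infringement will be pursued.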

    You are welcome to join the Informatica Technical Super Group (QQ groups: 112443162, 92949669).

     


     

     

    Overview of the 10.2.2 GA Release

     

    Informatica announced the General Availability (GA) of the Big Data 10.2.2 release on February 28, 2019.

    10.2.2 GA delivers many new powerful innovations across Enterprise Data Catalog, Big Data Management, Enterprise Data Lake, and Big Data Streaming.

     

    What are we announcing?

    Informatica Big Data Release 10.2.2 GA

     

    Who would benefit from this release?

    This release is for all customers and prospects who want to take advantage of the latest Big Data Management, Big Data Quality, Big Data Streaming, Enterprise Data Catalog, and Enterprise Data Lake capabilities.

     

    What’s in this release?

    This update delivers the latest ecosystem support together with security, connectivity, cloud, and performance improvements, while improving the end-to-end user experience.

     

     

    Highlights of the 10.2.2 GA Release

     

    Enterprise Data Catalog (EDC) Overview

    • Social curation: SMEs and data owners can certify datasets, and data consumers can rate and review them, maximizing the use of this shared data knowledge.
    • Better insight and collaboration with the ability to follow datasets and get change notifications, plus a Q&A platform where SMEs answer common questions.
    • Customers can eliminate tedious manual associations of business glossary terms to technical data assets through AI-powered automatic association of business glossary terms.

     

    Big Data Management (BDM) Overview

    • Big Data Management (BDM) can now scale up to 10x more for jobs that are pushed down to compute clusters. Customers can now mass ingest incremental loads after the initial load, with automatic tracking of the incremental values per table. BDM can now be set up using Docker containers that can be orchestrated via Kubernetes.
    • Developers can leverage dynamic mappings to build code templates that involve complex types across native, Hadoop, AWS, Azure and Databricks ecosystems. BDM now supports data drift with enhanced CLAIRE integration for many complex file formats across various ecosystems including AWS and Azure. Schema drift is now supported for Hive targets. BDM applications can now be incrementally deployed, minimizing disruption in production.
    • BDM now supports serverless processing with Azure Databricks. With support for dynamic file name generation and wildcard characters on read, developers can build advanced pipelines that can be templatized in AWS and Azure ecosystems. BDM jobs can be pushed down to Redshift for efficient database-level processing.

     

    Big Data Streaming (BDS) Overview

    • Increased performance and the scaling benefits of Spark, with support for advanced data management, mass ingestion, Spark Structured Streaming, and message headers for Kafka.
    • BDS now supports Spark Structured Streaming to process messages from streaming sources based on source event time and to integrate “out of order” data.
    • BDS supports message headers from streaming sources, which enables real-time analytics based on header metadata.
    • BDS now integrates with Intelligent Structure Discovery (part of Informatica Intelligent Cloud Services) to provide machine learning capabilities for parsing complex file formats and supporting dynamically evolving schemas.

     

    Enterprise Data Lake (EDL) Overview

    • Data analysts can now do advanced data preparation with 50+ new functions, including Statistical, Text, Math, Date/Time, and Window functions, and can apply external pre-built active rules for BDQ Fuzzy Matching and Consolidation transformations. Productivity is improved with additional CLAIRE-based recommendations for alternate assets upstream and related by PK-FK, better join keys using PK-FK information, and new data prep suggestions based on data types.
    • Customers can derive more value out of data lakes, especially on AWS and Azure, with direct data preparation for various file formats including Avro and Parquet on S3, ADLS, WASB, HDFS and MapR-FS file systems. They also get better performance with Spark execution and lower total cost of ownership with autoscaling on EMR.
    • Customers can improve data protection and governance with Informatica Dynamic Data Masking integration for all data touch points, such as preview, prepare, publish and download.

     

    Details of the 10.2.2 GA Release

     

    Big Data Management (BDM)

    Enterprise Class

    • Zero client configuration: Developers can now import metadata from Hadoop clusters without configuring Kerberos keytabs and configuration files on individual workstations by leveraging the Metadata Access Service. The Metadata Access Service now supports OS Profiles (when enabled) and can run on multiple nodes as a grid.
    • Mass ingestion: Data analysts can now ingest relational data into HDFS and Hive for both initial and incremental loads. The Mass Ingestion service can now fetch incremental data based on date columns or numeric columns, persist the last values fetched in the previous run, and automatically use them in subsequent runs.
    • Sqoop enhancements: The Sqoop connector now supports high levels of concurrency and the ability to fetch incremental data
    • Bitbucket Support: Big Data Management administrators can now configure Bitbucket (in addition to Perforce, SVN and Git) as the external versioning repository
    • Go-Live assistance: Release managers can now incrementally deploy objects into applications instead of overwriting entire applications
    • Robustness & Concurrency: The Data Integration Service is now more robust and can process 6 times more concurrent requests than in previous releases. The Data Integration Service also starts up 2x faster.
    • Resilience: The Data Integration Service can now automatically reparent to jobs that continue to run on the Hadoop clusters, even after the Data Integration Service experiences a crash or unexpected restart
    • Queuing: The Data Integration Service now queues the jobs submitted to it and persists the requests, so that requests do not have to be resubmitted if the Data Integration Service experiences a crash or unexpected restart
    • REST Operations Hub: Operations teams can now perform REST queries that fetch the job status, row level statistics and other monitoring information for deployed mappings
    • Dynamic mappings: Dynamic mappings can now be used across various data types and various ecosystems including AWS and Azure 
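
    The incremental-load behavior described under Mass ingestion above (persist the last value fetched, then reuse it on the next run) can be illustrated with a minimal Python sketch. This is not the Mass Ingestion service's implementation or API: the function `fetch_incremental`, the JSON state file, and the sample `orders` rows are hypothetical names chosen for illustration only.

```python
import json
import os
import tempfile


def fetch_incremental(rows, column, state_path, table):
    """Return rows whose `column` is greater than the value persisted by
    the previous run, then persist the new high-water mark.

    Hypothetical helper: illustrates the idea only, not an Informatica API.
    """
    state = {}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)

    last_value = state.get(table)  # None on the initial load
    new_rows = [r for r in rows if last_value is None or r[column] > last_value]

    if new_rows:
        # Track one high-water mark per table, as in "per table" tracking.
        state[table] = max(r[column] for r in new_rows)
        with open(state_path, "w") as f:
            json.dump(state, f)
    return new_rows


# Illustrative data: a numeric column drives the incremental fetch.
state_file = os.path.join(tempfile.mkdtemp(), "ingest_state.json")
source = [{"order_id": 1}, {"order_id": 2}]

initial_load = fetch_incremental(source, "order_id", state_file, "orders")
source.append({"order_id": 3})
incremental_load = fetch_incremental(source, "order_id", state_file, "orders")
```

    The same pattern would apply to a date column, provided the stored values compare in chronological order (for example, ISO 8601 date strings).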

     

    Advanced Spark

    • Advanced Data Integration: Spark now supports high precision decimals and executes the Python transformation many times faster than previous releases
    • Dynamic mappings: Complex data types such as Arrays, Structs and Maps can now be used in dynamic mappings
    • Debugging made easy: With the introduction of automatic Spark-based data preview, developers can now debug advanced Spark mappings that contain complex types and stateful functions as easily as they preview native mappings
    • CLAIRE integration: Big Data Management now integrates with Intelligent Structure Discovery (part of Informatica Intelligent Cloud Services) to provide machine learning capabilities for parsing complex file formats such as web logs

     

    Cloud and Connectivity

    • Core connectivity:
      • The Sqoop connector is optimized to run faster and is designed to eliminate staging wherever possible
      • HBase sources and targets can be used with dynamic mappings
      • Schema drift is now supported. Changes in the source systems can now be applied to the Hive targets
      • Data can now be loaded to Hive in Native mode
    • Amazon ecosystem:
      • Developer productivity is increased with the ability to use S3 and Redshift sources and targets in dynamic mappings
      • S3 data objects now support wildcard characters in file names.
      • File names can be dynamically generated for S3 objects using the target-based FileName port
      • Many additional properties in the Redshift data objects can now be parameterized.
    • Azure ecosystem:
      • Developer productivity is increased manyfold with the ability to use ADLS and WASB sources and targets in dynamic mappings
      • Intelligent Structure Discovery is now supported with Azure ADLS and WASB
    • Azure Databricks: BDM now offers support for managed cluster computation on Azure Databricks
    • Containerization: Implementation teams can now build Docker containers of BDM images and deploy them per their enterprise needs
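
    As a rough illustration of the wildcard and dynamic file name features listed for S3 above, the sketch below selects object keys with a shell-style pattern and builds a target file name from a data value. The wildcard syntax that BDM accepts is not described here, so shell-style `*` matching via Python's `fnmatch` is an assumption, and `select_keys`, `target_file_name`, and the sample keys are hypothetical.

```python
from fnmatch import fnmatchcase


def select_keys(keys, pattern):
    """Return the object keys that match a shell-style wildcard pattern.

    Note: in fnmatch, `*` also matches `/`, so it can span key prefixes.
    """
    return [key for key in keys if fnmatchcase(key, pattern)]


def target_file_name(prefix, partition_value, extension="csv"):
    """Build a target file name from a data value (hypothetical scheme)."""
    return f"{prefix}_{partition_value}.{extension}"


# Illustrative keys only; not taken from any real bucket.
keys = ["sales/2019-01.csv", "sales/2019-02.csv", "sales/readme.txt"]
matched = select_keys(keys, "sales/2019-*.csv")
generated = target_file_name("sales", "2019-03")
```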

     

    Platform PAM Update

    • Operating System Update:
      • RHEL - 7.3 & 6.7 - Added
      • RHEL - 7.0, 7.1, 7.2 and 6.5, 6.6 - Dropped
      • SUSE 12 SP2 - Added
      • SUSE 12 SP0 & SP1 - Dropped
      • SUSE 11 SP4 - Added
      • SUSE 11 SP2 & SP3 - Dropped
    • Database support:
      • Azure SQL DB - Added
    • Authentication Support:
      • Windows 2012 R2 & 2016 (LDAP and Kerberos) - Added
      • Azure Active Directory (LDAP only) - Added
    • Tomcat Support:
      • v 7.0.88 (No update)
    • JVM Support Update:
      • Azul OpenJDK 1.8.0_192 - Added
      • Oracle Java - Removed
      • Effective with Informatica 10.2.2, the Informatica platform supports Azul OpenJDK instead of Oracle Java, because Oracle changed its Java licensing policy and ended public updates for Java 8 effective January 2019. Azul OpenJDK comes bundled with the product.
    • Model Repository - Version Control
      • Bitbucket Server 5.16 Linux (hosting service for repositories) - Added
      • Perforce - 2017.2 - Updated
      • Visual SVN - 3.9 - Updated
      • Collabnet Subversion Edge - 5.2.2 - Updated
    • Others
      • Microsoft Edge Browser (Win 10) 40.15 - Updated
      • Internet Explorer - 11.x - Updated
      • Google Chrome - 68.0.x - Updated
      • Safari - 11.1.2 (macOS 10.13 High Sierra) - Updated
      • Adobe Flash Player - 27.x - Updated

     

    Informatica Docker Utility

    • Use the Informatica Docker Utility to create custom Docker container images for Big Data Management, and then run a Docker container image to create an Informatica domain. The Informatica Docker Utility provides a quick and easy process to install the Informatica domain in Docker containers.

     

     

    Big Data Streaming (BDS)

    Enterprise Class Streaming Data Integration

    • CLAIRE integration: Big Data Streaming now integrates with Intelligent Structure Discovery (part of Informatica Intelligent Cloud Services) to provide machine learning capabilities for parsing complex file formats and supporting dynamically evolving schemas
    • Resilience: The Data Integration Service can now automatically reparent to jobs that continue to run on the Hadoop clusters, even after the Data Integration Service experiences a crash or unexpected restart
    • Queuing: The Data Integration Service now queues the jobs submitted to it and persists the requests, so that requests do not have to be resubmitted if the Data Integration Service experiences a crash or unexpected restart
    • Incremental Deployment: Release managers can now incrementally deploy objects into applications instead of overwriting the entire applications
    • Latest Spark Version support: Big Data Streaming now supports Spark 2.3.1
    • Java Transformation support on HType Data

    Advanced Cloud Support

    • Amazon ecosystem:
      • Profile based authentication support for AWS Kinesis service
      • Cross Account authentication support for AWS Kinesis Streams service
      • Support for secure EMR cluster with Kerberos authentication
    • Azure ecosystem:
      • Support for deploying streaming mappings with an Azure EventHub source in Azure cloud with an HDInsight cluster

    Enhanced Streaming Data Processing and Analytics

    • Spark Structured Streaming support
      • Big Data Streaming now supports processing based on event time
      • Big Data Streaming can now integrate “out of order” data and process it in the same order as it was generated at the source
    • Message Header support
      • Better streaming data processing based on message metadata
      • Supports header metadata-based analytics without parsing the complete message
    • Machine Learning and Advanced Analytics support
      • Supports execution of Python scripts in streaming mappings with improved performance
    • Latest Apache Kafka version support
      • Big Data Streaming now supports Kafka 2.0
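
    The event-time and “out of order” handling listed under Spark Structured Streaming support above can be sketched conceptually in plain Python. This is not how Big Data Streaming or Spark implements it, and it uses no Spark API: it only shows the general idea of buffering events until a watermark (the largest event time seen minus an allowed lateness) has passed them. The function name, the integer timestamps, and the `allowed_lateness` value are assumptions for illustration.

```python
import heapq


def reorder_by_event_time(events, allowed_lateness):
    """Emit (event_time, payload) pairs in event-time order.

    Events are buffered until the watermark (largest event time seen minus
    `allowed_lateness`) passes them. Events that arrive behind the watermark
    are returned separately as late.
    """
    buffer = []  # min-heap ordered by event time, then arrival sequence
    ordered, late = [], []
    max_seen = None
    for seq, (event_time, payload) in enumerate(events):
        if max_seen is not None and event_time < max_seen - allowed_lateness:
            late.append((event_time, payload))
            continue
        heapq.heappush(buffer, (event_time, seq, payload))
        max_seen = event_time if max_seen is None else max(max_seen, event_time)
        watermark = max_seen - allowed_lateness
        while buffer and buffer[0][0] <= watermark:
            t, _, p = heapq.heappop(buffer)
            ordered.append((t, p))
    # End of input: flush whatever is still buffered, in event-time order.
    while buffer:
        t, _, p = heapq.heappop(buffer)
        ordered.append((t, p))
    return ordered, late


# Arrival order differs from the order in which events were generated.
arrivals = [(1, "a"), (3, "c"), (2, "b"), (6, "f"), (4, "d"), (1, "too late")]
ordered, late = reorder_by_event_time(arrivals, allowed_lateness=2)
```

    A larger `allowed_lateness` tolerates more disorder at the cost of holding events longer before they are emitted.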

     

    Intelligent Structure Discovery

    Intelligent Structure Discovery is now integrated with Big Data Management and Big Data Streaming on Spark to allow high performance parsing of various file types with data drift handling.

    • Performance enhancements
      • Improved runtime performance for some use cases (JSON and XML) by up to 10X compared to the previous release
    • Improved handling of Data Drift with an Unassigned Port
      • Data not identified by the model will be routed to the unassigned port and not dropped
    • Data Type propagation
      • Intelligent Structure Discovery automatically discovers the model's field names and data types. When the model is imported to the platform, the identified fields are propagated to the transformation with the corresponding data types
    • Handling of “Special” Characters
      • Characters in Intelligent Structure Discovery models that do not comply with the platform naming convention are automatically replaced with compliant characters
    • Enhanced Parsing Engine
      • Improved handling of XML files (attributes and namespaces).
      • Supports discovery and parsing of multi-sheet Excel files
    • Improved Design Time
      • The Intelligent Cloud Design is enhanced with Find functionality and the ability to apply actions (for example, rename) to multiple elements
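
    Two of the behaviors above, routing unidentified data to an unassigned port and replacing non-compliant characters in names, can be sketched as follows. This is an illustration under stated assumptions, not the Intelligent Structure Discovery implementation: the platform naming convention is not spelled out here, so the allowed set (letters, digits, underscore) and the underscore replacement are assumptions, and `compliant_name`, `route_fields`, and the sample record are hypothetical.

```python
import re


def compliant_name(name):
    """Replace characters outside [A-Za-z0-9_] with an underscore.

    The allowed character set is an assumption for illustration only.
    """
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


def route_fields(record, model_fields):
    """Split a parsed record into fields the model knows and the rest.

    Unknown fields go to an 'unassigned' output instead of being dropped.
    """
    assigned = {k: v for k, v in record.items() if k in model_fields}
    unassigned = {k: v for k, v in record.items() if k not in model_fields}
    return assigned, unassigned


# The model was built before the source started sending "device-type".
model_fields = {"user_id", "event"}
record = {"user_id": 7, "event": "login", "device-type": "mobile"}
assigned, unassigned = route_fields(record, model_fields)
clean = compliant_name("device-type")
```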

     

    Enterprise Data Lake (EDL)

    Core Data Preparation

     

    • New Advanced Data Preparation functions: Users can use more than 50 new advanced data preparation functions for Statistical, Text, Math, and Date/Time manipulations. Window functions help with calculations on a data window, such as Rolling Average, Rolling Sum, Lead/Lag, Fill, and Sessionize. The Cluster and Categorize function uses phonetic algorithms to cluster data and then helps users easily standardize it. The Delete duplicate rows function helps remove exact duplicates from the data.
    • Apply Active Rules: Users can apply external pre-built rules with active transformations to support DQ processing such as Fuzzy Matching and Consolidation. Expert users can use the Informatica Developer Tool to build complex rules, including active transformations, and then expose those to the data preparation users. This helps collaboration, standardization, extensibility and reusability.
    • Data Preparation for Avro and Parquet files: Users can add Avro and Parquet files to a project in addition to Hive tables and other file formats such as delimited files and JSONL files. This eliminates the need to create a Hive table on top of files. Users can structure the hierarchical data in row-column form by extracting specific attributes from the hierarchy, and can expand (or explode) arrays into rows in the worksheet.
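
    For readers unfamiliar with the window functions named above, the following plain-Python sketch shows what Rolling Average and Lead/Lag compute over a column of values. It is not Enterprise Data Lake code; the function names, signatures, and the `daily_sales` sample are hypothetical and chosen only to illustrate the calculations.

```python
def rolling_average(values, window):
    """Average of the current value and up to `window - 1` preceding values.

    `window` must be at least 1.
    """
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


def lag(values, offset=1, fill=None):
    """Value from `offset` rows earlier; `fill` where no such row exists."""
    pad = min(offset, len(values))
    return [fill] * pad + values[:len(values) - pad]


def lead(values, offset=1, fill=None):
    """Value from `offset` rows later; `fill` where no such row exists."""
    pad = min(offset, len(values))
    return values[pad:] + [fill] * pad


daily_sales = [10, 20, 30, 40]
avg_2 = rolling_average(daily_sales, window=2)
previous_day = lag(daily_sales)
next_day = lead(daily_sales)
```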

     

    Self-service and Collaboration

    • Functional and UX Improvements: Users can apply conditions during aggregation and reorder sheets as needed. The Recipe panel clearly shows steps that failed during data and recipe refresh. The “Back-in-time” functionality is now more on-demand to improve the user experience for Edit step, Copy/Paste step, and similar operations.
    • Improved CLAIRE-based recommendations: Improved user productivity with additional CLAIRE-based recommendations for alternate assets upstream and related by PK-FK. During a join, users are prompted to review sampling criteria in case of low overlap of join keys. Users also get new data prep suggestions based on data types that are handy shortcuts to frequently used functions.
    • Ability to add recipe comments: Users can add comments to various recipe steps and view comments by other users for better collaboration and auditing.
    • Save mappings for Recipes: Users have the option to save the mappings corresponding to the recipes for a worksheet instead of running a full at-scale execution and creating a new output table. This way, expert IT users can inspect the mapping and execute it at an appropriate time and resource level.

     

    Enterprise Focus

    • Support for S3, ADLS/WASB and MapR-FS files: Users can prepare data files directly from various file systems such as AWS S3, Azure ADLS and WASB and also for MapR-FS in addition to HDFS.
    • Spark Execution: Spark is used as the default execution engine for better performance. It also lets data prep users apply rules built with mapplets that use advanced transformations such as Python transformations.
    • Autoscaling on AWS EMR: Customers can start with a minimal number of EMR nodes and then auto-scale based on rules for resource consumption to lower the overall total cost of ownership for data lakes in AWS.
    • Integration with Informatica Dynamic Data Masking: Data protection and governance are improved using Informatica Dynamic Data Masking. Based on DDM policies, data is masked at various touch points such as preview, prepare, publish and download.
    • Scalability improvements: Performance, scalability and longevity improvements have been made in various services to support enterprise-scale deployments with a large number of users.

     

    Enterprise Data Catalog (EDC)

     

    • Collaboration: Data Analysts, Data Scientists and Line of Business users can now find the most relevant, most trusted datasets for their analytic needs faster with Enterprise Data Catalog (EDC) v10.2.2. EDC 10.2.2 includes both top-down and bottom-up collaboration capabilities that bring to the forefront the otherwise deeply siloed knowledge about the trustworthiness and usefulness of datasets. This new capability will help data consumers save weeks, sometimes months, of effort in finding and using the right dataset.
      • Dataset Certifications: With EDC v10.2.2, Subject Matter Experts, Data Stewards and Data Owners can certify datasets and data elements, adding context information such as data usage and constraints. Using its machine learning based semantic search, EDC surfaces these certified datasets at the top of the search results to guide users to them among all other similarly named datasets in the organization.
      • Reviews and Ratings: Data consumers such as Data Analysts and Data Scientists can now review and rate datasets. EDC pushes highly rated datasets to the top of the search results. New facets are available to narrow search results down to highly rated datasets only.
      • Questions and Answers: Users can use a new question/answer platform that allows subject matter experts to answer the most common questions from data consumers. This helps data consumers find experts, ask questions and see answers in the context of the dataset. For subject matter experts, this means less work and more reuse of information, as they need not respond to multiple emails and phone calls for the same queries on data.

     

    • Change Notifications: With change notifications, EDC gives data consumers an easy way to stay on top of any metadata changes to their data assets. Users can follow any dataset in the catalog, and whenever scanners detect changes to that dataset, both in-app and email notifications are sent to the user. Additionally, super users such as database administrators, stewards and owners can follow entire databases and other metadata resources to be notified of any changes in the database.

     

    • Intelligent Business Glossary Associations: One of the most important and most tedious data governance tasks is associating business glossary terms with physical data assets. In EDC v10.2.2, the glossary association process is much easier. Using the CLAIRE-based AI engine, the right business glossary terms are matched with the right physical assets at the data element level. This method uses the data domain discovery and data similarity capabilities to power automatic glossary associations, with the goal of making the data stewards and business analysts responsible for this task about 2X more productive through these machine learning based assistants.
      • Business Glossary Assignment Report: EDC v10.2.2 includes a new business glossary assignment report at the resource level to help data stewards understand glossary association coverage for a resource in one place. Data Stewards can also curate (accept/reject) all glossary recommendations from this new report.

     

    • Metadata and Profile Filters: Catalog and profile only selected metadata from databases, data warehouses and big data sources. Users will be able to provide both inclusion and exclusion criteria to filter datasets that are cataloged and profiled. The filter criteria can be a list of names or regular expressions that are matched against table/view names.
    • Remote Metadata Scanner: Catalog metadata from data sources that are behind a firewall or are remote with port restrictions. With EDC v10.2.2, a direct network connection from the Catalog to the data source is no longer required. The Remote Metadata Scanner Utility can be downloaded and set up on a server close to the data source or in the same network, and the extracted metadata can be uploaded to the catalog. Currently, only metadata scans are supported, for Oracle, SQL Server and Teradata.
    • New Scanners
      • Workday: Manage Workday metadata for governance, risk/compliance and self-service analytics
      • Google BigQuery: Manage Google BigQuery metadata for governance, risk/compliance and self-service analytics
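
    The inclusion and exclusion criteria described under Metadata and Profile Filters above (lists of names or regular expressions matched against table/view names) can be sketched in Python as follows. This is an illustration only, not EDC configuration syntax: `select_tables` and the sample table names are hypothetical, and letting an exclusion override an inclusion is an assumption about precedence.

```python
import re


def select_tables(names, include=None, exclude=None):
    """Filter table/view names by inclusion and exclusion patterns.

    Each pattern is a regular expression matched against the whole name,
    so a plain name such as "CUSTOMER" matches only itself (names that
    contain regex metacharacters would need re.escape first).
    Assumption: exclusion wins when a name matches both lists.
    """
    def matches(name, patterns):
        return any(re.fullmatch(p, name) for p in patterns)

    selected = []
    for name in names:
        if include and not matches(name, include):
            continue
        if exclude and matches(name, exclude):
            continue
        selected.append(name)
    return selected


tables = ["SALES_2018", "SALES_2019", "TMP_LOAD", "CUSTOMER"]
to_catalog = select_tables(
    tables,
    include=[r"SALES_.*", "CUSTOMER"],
    exclude=[r".*_2018"],
)
```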

     

    • Performance Improvements: EDC v10.2.2 includes a new graph schema that improves the performance of tasks such as parameter assignment (63x faster), resource purge (5x faster) and re-index (2.5x faster). Additionally, scanner performance is improved across the board, including auto connection assignments (340x faster), the SAP Business Objects scanner (1.5x faster), the Oracle scanner (2x faster) and the IBM Cognos scanner (2x faster).

     

    PAM for Informatica 10.2.2 - https://network.informatica.com/docs/DOC-18072

     

    Informatica 10.2.2 Release Notes