Informatica Enterprise Data Catalog Advanced Scanners 10.4.1.0.1 has been released to shipping and is available immediately.
What are we announcing?
The release of Informatica Enterprise Data Catalog Advanced Scanners 10.4.1.0.1
Who would benefit from this release?
This release is for all customers and prospects who want to take advantage of the advanced metadata extraction capabilities with Informatica Enterprise Data Catalog.
What’s in this release?
This release provides the following new features in Informatica Enterprise Data Catalog Advanced Scanners:
IBM InfoSphere DataStage is now GA.
SAP BW is now GA.
SAP BW/4HANA is now GA.
Microsoft SSIS: View detailed lineage for transformations, extract metadata from the SSIS database, view the field-level lineage for flat files, and view the control summary of SSIS assets.
Oracle stored procedures and scripts: Ability to extract detailed data lineage at asset and column level for stored procedures, functions, and packages. PL/SQL and loader scripts for Oracle including static and dynamic SQL statements.
MS SQL Server stored procedures and scripts: Ability to extract detailed data lineage at asset and column level for stored procedures, functions, and packages. Transact-SQL scripts for MS SQL Server including static and dynamic SQL statements.
Teradata stored procedures and scripts: Ability to extract detailed data lineage at asset and column level for stored procedures. Macros, BTEQ, Fastload, MultiLoad, FastExport scripts for Teradata including static and dynamic SQL statements.
IBM DB2 stored procedures and scripts: Ability to extract detailed data lineage at asset and column level for stored procedures, functions, and packages. PL/SQL and loader scripts for IBM DB2 LUW including static and dynamic SQL statements.
Netezza (IBM PureData systems) stored procedures and scripts: Ability to extract detailed data lineage at asset and column level for stored procedures, functions, and packages. NZPLSQL and nzload Scripts for Netezza including static and dynamic SQL statements.
Sybase ASE stored procedures and scripts: Ability to extract detailed data lineage at asset and column level for stored procedures, functions, and packages. Transact-SQL scripts for Sybase ASE including static and dynamic SQL statements.
SAS base: Ability to extract object metadata as well as detailed lineage at asset and column level from SAS programs.
Microsoft SSRS: Extract datasets and report metadata including lineage at field level from Microsoft SSRS.
Microsoft SSAS: Extract data sources, cubes, measures, and dimensions metadata including lineage at field level from Microsoft SSAS.
JCL: Extract JCL program metadata including lineage between data objects (DB2 assets, VSAM Files, etc.)
COBOL: Extract metadata from COBOL programs and copybooks.
Advanced Custom Metadata Loader: Accelerate the creation of custom metadata model and automate load of custom metadata from CSV, XLS, XML, JSON, or databases into the catalog using the EDC Advanced Custom Metadata Loader.
Stored procedures and scripts for Oracle 11gR2, 12c, 12cR1, 12cR2, 18c, 19c
Stored procedures and scripts for MS SQL Server 2012, 2014, 2016, 2017, 2019
Stored procedures and scripts for Teradata 15.10, 16.00, 16.20
Stored procedures and scripts for IBM DB2 10.5, 11.1
Extracting metadata from different systems can become very complex, as each system has its own way of storing and exposing metadata. Many data store systems offer access via JDBC, but JDBC drivers vary in how completely they implement the specification, and not all of the functions necessary for metadata extraction are always implemented.
The EDC JDBC scanner expects a certain number of functions to be implemented, and they need to behave in a certain way so that metadata can be extracted in bulk (for example, calling the getColumns method without specifying a table name).
To work around the limitations of certain JDBC drivers, you can extract the metadata using only the features the JDBC driver does support, and then bring the extracted metadata into the catalog via a custom scanner.
Below is a link to an example of such a metadata extractor that has been tested with Denodo.
This can easily be extended to other systems that are not yet certified with EDC, such as AWS Athena, Alibaba MaxCompute, etc., by providing the right JDBC driver and understanding which JDBC methods are available with the driver.
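To illustrate the bulk-versus-fallback extraction pattern described above, the sketch below uses Python's built-in sqlite3 module as a stand-in for a JDBC connection; the table names, schema, and CSV layout are hypothetical and this is not the actual Denodo extractor. When a driver cannot return the whole schema in one bulk call (the JDBC getColumns case above), the extractor can walk the table list, introspect each table individually, and write the result to a flat file that a custom scanner can load.

```python
import csv
import io
import sqlite3

def extract_column_metadata(conn):
    """Emit one metadata row per column of every table, approximating what a
    bulk JDBC DatabaseMetaData.getColumns() call (no table name) returns."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    rows = []
    for (table,) in tables:
        # Fallback path: introspect one table at a time, for drivers that
        # cannot return the whole schema in a single bulk call.
        for cid, name, col_type, _notnull, _default, _pk in conn.execute(
                f"PRAGMA table_info({table})"):
            rows.append({"table": table, "column": name,
                         "type": col_type, "position": cid + 1})
    return rows

def write_custom_scanner_csv(rows, fh):
    """Write the extracted metadata in a flat CSV layout (hypothetical
    format) that a custom scanner could load into the catalog."""
    writer = csv.DictWriter(fh, fieldnames=["table", "column", "type", "position"])
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical source system, stood up in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")

rows = extract_column_metadata(conn)
out = io.StringIO()
write_custom_scanner_csv(rows, out)
print(out.getvalue())
```

The same shape applies to a real JDBC driver: replace the sqlite3 introspection with whichever metadata calls the driver actually supports, and feed the CSV to the custom scanner.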
The Informatica Global Customer Support team is excited to announce an all-new technical webinar and demo series, Meet the Experts, in partnership with our technical product experts and product management. These technical sessions are designed to encourage interaction and knowledge sharing around some of our latest innovations and capabilities across Data Integration, Data Quality, Big Data, and more. In these sessions, we will strive to provide you with as many technical details as possible, including new features and functionality, and, where relevant, show you a demo or product walk-through as well.
Get a deep-dive into how you can leverage the best of human expertise and AI with EDC 10.2.2. Subject matter expertise and social curation combined with the power of AI deliver a collaborative experience for data discovery, cataloging and management.
Informatica Enterprise Data Catalog 10.2.1 Update 1 has been released to shipping and is available immediately.
“Enterprise Data Catalog 10.2.1 Update 1” addresses several limitations and issues to improve usability and stability of the product.
Cloudera CSD Integration for HTTP Keytab Management: Secure EDC deployment with the creation of an Informatica EDC-specific Cloudera Service Descriptor (CSD), which removes the requirement of provisioning HTTP keytabs on Informatica edge nodes. Starting with 10.2.1 Update 1, the Informatica CSD, when enabled from the Cloudera Manager UI, will automatically copy the HTTP keytab to the ramdisk specified on the node while creating the service.
WANdisco Support: WANdisco Fusion allows continuous availability of data with guaranteed consistency across clusters spanning Dev/Test, Production, and Disaster Recovery environments. Starting with 10.2.1 Update 1, EDC can be deployed on a WANdisco-enabled Cloudera or Hortonworks cluster, giving users all the benefits of WANdisco Fusion, such as data replication, on an EDC cluster.
sudo-less Embedded Cluster Deployment: Catalog Administrators can now turn off sudo access on embedded EDC cluster post deployment. Requires Ambari-server and agent to be up before bringing up the catalog service.
Ambari User Customization for Embedded Cluster Deployment: Embedded Cluster Deployment now includes options to customize users for Ambari instead of using default users created by the service. This will allow administrators to configure local/AD users as Ambari service users.
Scanner Upgrades: A large number of scanner fixes, including an updated MITI distribution (v10), for reliable and high-performance metadata scans of databases, ETL, and business intelligence tools.
Scanner log aggregation: Enhanced scanner error diagnosis with REST API that aggregates the scanner log, resource configuration and catalog service details.
Performance Improvements: Major improvements in performance for the Informatica PowerCenter (2x) and SAP Business Objects (10x) scanners.
Performance Improvements: Major improvements in data profiling performance for high-volume scans (>10k tables or files) across sources, from relational databases (1.5x) to flat files (2.5x), with a 25x-100x reduction in the Profiling Warehouse space required. ~4x performance improvement in Similarity Profiling.
Unstructured Discovery Performance: Improved data domain discovery performance in unstructured files, with a scan throughput of 2.75 GB/hour with 20 data domains.
The ability to accurately label columns, attributes, and fields is a critical requirement for both data discovery and data governance. However, organizations can have millions of datasets and hundreds of millions of columns and fields across various structured and semi-structured data sources, making it impossible to manually curate them one by one. Also, not all columns represent unique business concepts or data elements: a single data element, like a CUSTOMER ID or PRODUCT ID, can be part of multiple datasets. Machine learning can help cluster these “instances” of data elements together based on data similarity. This makes it easier for data stewards and curators to manage these columns, and for data analysts to find the right data assets.
“Similar” is an overloaded term, and similarity between two objects largely depends on context and use case. For a data curator labeling columns and fields with business terms, two columns are similar if they represent the same business concept, such as CUSTOMER ID, AIRLINE NAME, or CITY. For a data scientist searching for sales numbers from Q4 2017, the sales dataset for Q1 2018 is similar.
Enterprise Data Catalog (EDC) uses unsupervised clustering based on multiple factors to cluster similar columns. These factors are:
Data Overlap: For any two columns, this metric determines the percentage of overlapping values in both the columns.
Distinct Value Match: This metric measures the overlap of distinct/unique values between two columns.
Pattern Match: This metric uses data profiling to first identify the dominant data patterns for each column and field. Then it checks for the overlap of these patterns across column pairs.
Name Match: This uses fuzzy string matching to identify columns and fields that are named similarly.
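The four factors above can be sketched with simple set and string operations. The Python below is an illustrative approximation with hypothetical column names and values, not EDC's actual implementation; in particular, EDC derives dominant patterns from data profiling, whereas this sketch compares the patterns of all values.

```python
from difflib import SequenceMatcher

def data_overlap(col_a, col_b):
    """Percentage of values in col_a that also appear in col_b."""
    if not col_a:
        return 0.0
    b_values = set(col_b)
    return sum(1 for v in col_a if v in b_values) / len(col_a)

def distinct_value_match(col_a, col_b):
    """Jaccard overlap of the distinct values of two columns."""
    sa, sb = set(col_a), set(col_b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def value_pattern(value):
    """Coarse character-class pattern: letters -> A, digits -> 9."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c
                   for c in str(value))

def pattern_match(col_a, col_b):
    """Overlap of the value patterns observed in the two columns."""
    return distinct_value_match([value_pattern(v) for v in col_a],
                                [value_pattern(v) for v in col_b])

def name_match(name_a, name_b):
    """Fuzzy string similarity between two column names."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

# Hypothetical columns from two different datasets.
cust_id = ["C-1001", "C-1002", "C-1003"]
customer_id = ["C-1002", "C-1003", "C-9999"]
print(data_overlap(cust_id, customer_id))   # 2 of 3 values overlap
print(pattern_match(cust_id, customer_id))  # all values share pattern A-9999
print(name_match("CUST_ID", "CUSTOMER_ID"))
```

In production, per-factor scores like these would be combined into an overall similarity score; as described below, EDC additionally reports a match likelihood for each factor.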
This clustering is done across multiple data sources and datasets. It also assigns both an overall similarity score and a match likelihood for each factor, to help with different use cases. It is important to mention that the above computation cannot be done pairwise across all columns and fields. As an example, with 100M columns (not uncommon in large enterprises today) there are roughly 5,000 trillion column pairs. If evaluating similarity for each column pair took a millisecond, the calculation would take 5 trillion seconds, or over 150,000 years. Suffice to say, we need a different approach. This is where EDC uses unsupervised machine learning, provided through the CLAIRE™ engine, to dramatically reduce the time taken to derive similarity from years to hours. Once these similar columns are clustered, they can be used for a wide variety of use cases:
Related Assets: Analysts are typically interested in finding data related to a topic. Their productivity can be greatly enhanced by recommending assets with substitutable (similar) or complementary (unionable or joinable) data. “Data Overlap” analysis at the dataset level can provide information about substitutable datasets. “Distinct Value Match” at the column level can provide information about other joinable datasets.
Data Curation: Curators adding business terms and labels to columns and fields can use similarity to identify cluster of similar columns and associate the label to the cluster instead of individual instances. This substantially reduces the amount of manual work needed to label all key data elements accurately in the enterprise.
Identify Duplicates: Identifying duplicate and redundant data can significantly help reduce data storage costs in an organization. Data Overlap analysis helps identify duplicate datasets across data sources.
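The internals of the CLAIRE engine are not public, but one standard unsupervised technique for sidestepping the all-pairs comparison described above is locality-sensitive hashing over MinHash signatures, sketched below on hypothetical column data. Columns that agree on any band of their signatures land in the same bucket, so full similarity scoring runs only on those candidate pairs rather than on all n² column pairs.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def minhash_signature(values, num_hashes=32):
    """MinHash signature: for each seeded hash function, keep the minimum
    hash over the column's distinct values. Two columns agree on a signature
    position with probability equal to their Jaccard similarity."""
    distinct = set(values)
    return [min(int(hashlib.md5(f"{seed}:{v}".encode()).hexdigest(), 16)
                for v in distinct)
            for seed in range(num_hashes)]

def candidate_pairs(columns, num_hashes=32, bands=16):
    """LSH banding: split each signature into bands; columns sharing any
    band bucket become candidates for full similarity scoring."""
    rows_per_band = num_hashes // bands
    buckets = defaultdict(list)
    for name, values in columns.items():
        sig = minhash_signature(values, num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows_per_band:(b + 1) * rows_per_band]))
            buckets[key].append(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs

# Hypothetical columns: two near-duplicate customer-ID columns, one unrelated.
columns = {
    "crm.cust_id":  [f"C-{i}" for i in range(100)],
    "erp.customer": [f"C-{i}" for i in range(5, 105)],  # ~90% value overlap
    "hr.employee":  [f"E-{i}" for i in range(100)],     # disjoint values
}
print(candidate_pairs(columns))  # only the two customer-ID columns pair up
```

Because bucketing is linear in the number of columns, the expensive pairwise factors only ever run on a tiny candidate set, which is how hashing-style approaches bring a years-long computation down to hours.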
Column Similarity is one of the key new features being introduced in Enterprise Data Catalog Spring 2018 release.