Extracting metadata from different systems can become very complex, because each system has its own way of storing and exposing metadata. Data store systems often offer access via JDBC, but driver implementations vary, and the functions needed for metadata extraction are not always implemented.
The EDC JDBC scanner expects a certain number of functions to be implemented, and implemented in a certain way, so that metadata can be extracted in bulk (for example, by calling the getColumns method without specifying a table name).
To work around the limitations of certain JDBC drivers, you can extract the metadata using the features the driver does provide and bring the extracted metadata into the catalog through a custom scanner.
Below is a link to an example of such a metadata extractor, which has been tested with Denodo.
This can easily be extended to other systems that are not yet certified with EDC, such as AWS Athena or Alibaba MaxCompute, by providing the right JDBC driver and understanding which JDBC methods the driver supports.
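The bulk-versus-per-table idea described above can be sketched in a few lines. This is a minimal illustration, not the linked extractor: `list_tables` and `list_columns` are hypothetical wrappers you would write around whatever the driver supports (for example, around JDBC's getTables and getColumns), and the row layout is an assumption made for the example.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# One column record: (schema, table, column, type_name). Layout is illustrative.
ColumnRow = Tuple[str, str, str, str]


def collect_columns(
    list_tables: Callable[[], Iterable[Tuple[str, str]]],
    list_columns: Callable[[Optional[str], Optional[str]], Iterable[ColumnRow]],
) -> List[ColumnRow]:
    """Collect column metadata, preferring a single bulk call.

    Both callables are hypothetical wrappers around the driver:
    list_tables() yields (schema, table) pairs, and
    list_columns(schema, table) yields column rows, where None means
    "no filter", mirroring a getColumns call without a table name.
    """
    try:
        # Bulk path: one call that returns every column in the source.
        return list(list_columns(None, None))
    except Exception:
        # The driver rejected the unfiltered call, so go table by table.
        rows: List[ColumnRow] = []
        for schema, table in list_tables():
            rows.extend(list_columns(schema, table))
        return rows
```

The rows collected this way are what a custom scanner would then convert into the catalog's import format. Catching every exception is deliberately broad for a sketch; a real extractor would catch the driver's specific "not supported" error.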
The Informatica Global Customer Support Team is excited to announce an all-new technical webinar and demo series – Meet the Experts, in partnership with our technical product experts and Product Management. These technical sessions are designed to encourage interaction and knowledge gathering around some of our latest innovations and capabilities across Data Integration, Data Quality, Big Data, and so on. In these sessions, we will strive to provide you with as many technical details as possible, including new features and functionalities, and, where relevant, show you a demo or product walk-through as well.
Get a deep-dive into how you can leverage the best of human expertise and AI with EDC 10.2.2. Subject matter expertise and social curation combined with the power of AI deliver a collaborative experience for data discovery, cataloging and management.
Informatica Enterprise Data Catalog 10.2.1 Update 1 has been released to shipping and is available immediately.
“Enterprise Data Catalog 10.2.1 Update 1” addresses several limitations and issues to improve usability and stability of the product.
Cloudera CSD Integration for HTTP Keytab management: Secure EDC deployment through an Informatica EDC specific Cloudera Service Descriptor (CSD) that removes the requirement of provisioning HTTP keytabs on Informatica Edge nodes. Starting with 10.2.1 Update 1, the Informatica CSD, when enabled from the Cloudera Manager UI, automatically copies the HTTP keytab to the ramdisk specified on the node while creating the service.
WANdisco Support: WANdisco Fusion allows continuous availability of data with guaranteed consistency across clusters spanning Dev/Test, Production and Disaster Recovery environments. Starting with 10.2.1 Update 1, EDC can be deployed on a WANdisco-enabled Cloudera or Hortonworks cluster, so users get the benefits of WANdisco Fusion, such as data replication, on an EDC cluster.
Sudo-less Embedded Cluster Deployment: Catalog administrators can now turn off sudo access on the embedded EDC cluster after deployment. This requires the Ambari server and agent to be running before the catalog service is brought up.
Ambari User Customization for Embedded Cluster Deployment: Embedded Cluster Deployment now includes options to customize users for Ambari instead of using the default users created by the service. This allows administrators to configure local/AD users as Ambari service users.
Scanner Upgrades: A large number of scanner fixes, including an updated MITI distribution (v10), for reliable and high-performance metadata scans of databases, ETL and business intelligence tools.
Scanner log aggregation: Enhanced scanner error diagnosis with a REST API that aggregates the scanner log, resource configuration and catalog service details.
Performance Improvements: Major improvements in performance for the Informatica PowerCenter (2X) and SAP Business Objects (10X) scanners.
Performance Improvements: Major improvements in data profiling performance for high-volume scans (>10k tables or files) across sources, from relational databases (1.5X) to flat files (2.5X), with a 25X-100X reduction in the Profiling Warehouse space required. Similarity Profiling performance improves by roughly 4X.
Unstructured Discovery Performance: Improved domain discovery performance in unstructured files, with a scan throughput of 2.75 GB/hour for 20 data domains.
The ability to accurately label columns, attributes and fields is a critical requirement for both data discovery and data governance. However, organizations can have millions of datasets and hundreds of millions of columns/fields in various structured and semi-structured data sources, making it impossible to manually curate them one by one. Also, not all columns represent unique business concepts/data elements. A single data element, like a CUSTOMER ID or PRODUCT ID, can be a part of multiple datasets. Machine learning can help cluster these “instances” of data elements together based on data similarity. This makes it easier for data stewards and curators to manage these columns, and for data analysts to find the right data assets.
“Similar” is an overloaded term, and similarity between two objects is largely based on context and use case. For a data curator labeling columns/fields with business terms, two columns are similar if they represent the same business concept, like CUSTOMER ID or AIRLINE NAME or CITY. For a data scientist searching for sales numbers from Q4 2017, the sales dataset for Q1 2018 is similar.
Enterprise Data Catalog (EDC) uses unsupervised clustering based on multiple factors to cluster similar columns. These factors are:
Data Overlap: For any two columns, this metric determines the percentage of overlapping values in both the columns.
Distinct Value Match: This metric measures the overlap of distinct/unique values between two columns.
Pattern Match: This metric uses data profiling to first identify the dominant data patterns for each column and field. Then it checks for the overlap of these patterns across column pairs.
Name Match: This uses fuzzy string matching to identify columns and fields that are named similarly.
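To make the four factors concrete, here is a toy Python version of each one. These are illustrative definitions chosen for the example (row-level overlap, Jaccard overlap of distinct values, letter/digit shape patterns, and `difflib` string similarity); they are assumptions for the sketch, not the formulas EDC or CLAIRE actually use.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def _jaccard(x, y):
    """Size of the intersection divided by the size of the union."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def data_overlap(a, b):
    """Share of all rows (in both columns) whose value also occurs in the other column."""
    set_a, set_b = set(a), set(b)
    hits = sum(v in set_b for v in a) + sum(v in set_a for v in b)
    total = len(a) + len(b)
    return hits / total if total else 0.0


def distinct_value_match(a, b):
    """Overlap of the distinct values of the two columns."""
    return _jaccard(set(a), set(b))


def value_pattern(value):
    """Reduce a value to its shape: letters become 'A', digits become '9'."""
    text = re.sub(r"[A-Za-z]", "A", str(value))
    return re.sub(r"[0-9]", "9", text)


def pattern_match(a, b, top=3):
    """Overlap of the `top` most frequent value patterns of each column."""
    dominant_a = {p for p, _ in Counter(map(value_pattern, a)).most_common(top)}
    dominant_b = {p for p, _ in Counter(map(value_pattern, b)).most_common(top)}
    return _jaccard(dominant_a, dominant_b)


def name_match(name_a, name_b):
    """Fuzzy, case-insensitive similarity of two column names, from 0.0 to 1.0."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


customers = ["C001", "C002", "C003", "C003"]
orders = ["C002", "C003", "C999"]
print(distinct_value_match(customers, orders))  # 0.5 (2 shared of 4 distinct values)
print(pattern_match(customers, orders))         # 1.0 (every value has the shape A999)
print(data_overlap(customers, orders))
print(name_match("CUSTOMER_ID", "CUST_ID"))
```

Each function returns a score between 0 and 1, so the scores can be reported per factor or combined into an overall similarity score, however a given implementation chooses to weight them.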
This clustering is done across multiple data sources and datasets. It also assigns both an overall similarity score and a match likelihood for each factor, to help with different use cases.
It is important to mention that the above computation cannot be done pairwise across all columns and fields. As an example, with 100M columns (not uncommon in large enterprises today) there are roughly 5,000 trillion column pairs. If evaluating similarity for each column pair took a millisecond, the calculation would take about 5 trillion seconds, or roughly 160,000 years. Suffice it to say we need a different approach. This is where EDC uses unsupervised machine learning, provided through the CLAIRE™ engine, to reduce the time taken to derive similarity from years to hours.
Once these similar columns are clustered, they can be used for a wide variety of use cases:
Related Assets: Analysts are typically interested in finding data related to a topic. Their productivity can be greatly enhanced by recommending assets with substitutable (similar) or complementary (unionable or joinable) data. “Data Overlap” analysis at the dataset level can provide information about substitutable datasets. “Distinct Value Match” at the column level can provide information about other joinable datasets.
Data Curation: Curators adding business terms and labels to columns and fields can use similarity to identify clusters of similar columns and associate the label with the cluster instead of with individual instances. This substantially reduces the amount of manual work needed to label all key data elements accurately in the enterprise.
Identify Duplicates: Identifying duplicate and redundant data can significantly help reduce data storage costs in an organization. Data Overlap analysis helps identify duplicate datasets across data sources.
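The pairwise-comparison estimate from the discussion above is easy to reproduce. The column count and the one-millisecond cost per pair are the figures from the text; the 365.25-day year is an assumption made for the conversion.

```python
import math

columns = 100_000_000       # 100M columns, as in the text
seconds_per_pair = 0.001    # one millisecond per comparison, as in the text

# Number of unordered column pairs: n * (n - 1) / 2
pairs = math.comb(columns, 2)
total_seconds = pairs * seconds_per_pair
years = total_seconds / (365.25 * 24 * 3600)

print(f"{pairs:,} column pairs")
print(f"{total_seconds:.2e} seconds, about {years:,.0f} years")
```

The pair count grows quadratically with the number of columns, which is why an exhaustive comparison is ruled out long before a catalog reaches this size.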
Column Similarity is one of the key new features being introduced in the Enterprise Data Catalog Spring 2018 release.
Informatica EIC 10.2.0 Update 1 has been released to shipping and is available immediately.
Enhanced Tableau Integration
Informatica Enterprise Information Catalog (EIC) Discover Plugin for Tableau: EIC Discover for Tableau is a Chrome browser plugin that automatically detects the active Tableau report or dashboard and provides rich contextual information catalogued in EIC. This brings all governance context, business classifications, data sources and other important metadata to self-service BI users within the context of the Tableau app. EIC Discover Chrome plugin needs an existing EIC deployment to function.
Tableau Data Extract (TDE) Export: This capability is for data analysts, data scientists and business users who want to access data in Tableau for ad hoc analysis after they find the right data asset in EIC. EIC allows users to securely and quickly get access to data in TDE format, which they can visualize and analyze in Tableau.
Ease of Deployment and Getting Started
Prereq utility for embedded cluster validation: Runs prerequisite checks on the Informatica server and the intended cluster nodes to ensure they meet the requirements for EIC installation.
Utility to generate merged keytabs: This utility creates, validates and merges the keytabs required for EIC installations on Kerberos-enabled clusters.
Import Metadata Manager Resources into EIC: Import existing connections from Metadata Manager into EIC, to help customers quickly kick-start their EIC deployments.
Improved Operations and Administration
Catalog Service HA Support: Catalog Service can now be configured in the Active-Passive HA mode.
Improved Catalog Service Startup: Catalog Service startup performance has been improved by performing on-demand deployment of scanner binaries to the cluster.
Phone Home Support: Built-in capabilities to anonymously provide configuration and usage statistics back to Informatica.