Enterprise Data Catalog: Overview

Version 1

    Introduction

    Data Asset Management and Governance

    With the advent of big data and data lakes, metadata management has become even more crucial and central to how customers work with their data. Discovering, identifying, and governing data assets have also taken on a new dimension. A typical enterprise today has more data silos than ever: as data becomes a first-class citizen in every business segment, every department generates, governs, and guards its own data assets.

    Although managing assets independently gives greater flexibility and autonomy in how they are governed, data assets are seldom useful as individual entities. The power of data is leveraged to its full potential through its relationships and lineage. Asset relationships provide a holistic view of the enterprise, enable self-service capabilities, and accelerate discovery and identification of data assets.

    Metadata plays a central role in managing the data assets an enterprise holds. Once we ingest data into any system, the next logical step in managing it is to index or catalog the metadata of what we have ingested. Having more data is useless and irrelevant if we cannot discover it when we need it.

    Management and governance of data assets is not new to Informatica. Metadata Manager was the tool used for managing metadata for impact analysis and lineage. The tool was sufficient for most use cases in the pre-big-data enterprise ecosystem. Although a great tool, Metadata Manager was conceived when a petabyte-size system was the exception, not the norm, and when most data was neatly arranged in rows and columns inside an RDBMS. The users or consumers of metadata have also changed: metadata is no longer the domain of IT and business analysts alone, and data is now central to how a business functions rather than an augmenting feature. Simply put, the size and nature of data have changed, and Enterprise Data Catalog (EDC) is built to meet the requirements of today’s data.

    Enterprise Metadata

    In a typical enterprise, data and its associated metadata are heterogeneous. We can broadly divide metadata into four distinct categories:

    Technical Metadata

    Metadata generated by technical assets, like

    • Database Name, Table Name, Column Name
    • Connection Details
    • Column Structure
    • Data Type, Precision and Scale
    • Nullability
    • Keys, Joins
    • Data Lineage

    Business Metadata

    Metadata generated by business processes and documentation, like

    • Definition and Glossary
    • Data Owner, Data Steward
    • Organization, Department
    • Privacy Level, Classification
    • Standard Abbreviations, Synonyms
    • Business Rules
    • Business Lineage

    Usage Metadata

    Metadata generated by audit trails and privacy policies, like

    • Who accessed which data asset and when
    • Usage in SQL, BI, ETL and Applications
    • Popularity
    • Relevancy

    Operational Metadata

    Metadata generated by operational schedules and code execution statistics, like

    • Mappings, Jobs, Flows
    • When the mapping was run (started/finished)
    • Whether the mapping run failed or had warnings
    • How many rows were read, written to, or referenced
    • Events that occurred during the run of the mapping

    Each of these metadata silos serves a limited purpose when viewed independently; viewed together, they complete the picture.
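
    To make the idea of combining the four categories concrete, here is a minimal Python sketch of a single catalog entry that holds all four kinds of metadata for one column. Every class, field, and sample value below is hypothetical and chosen only for illustration; none of it reflects EDC's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Optional


# All names below are illustrative assumptions, not EDC's schema.
@dataclass
class TechnicalMetadata:
    database: str
    table: str
    column: str
    data_type: str
    nullable: bool


@dataclass
class BusinessMetadata:
    definition: str
    data_owner: str
    classification: str


@dataclass
class UsageMetadata:
    access_count: int = 0
    last_accessed_by: Optional[str] = None


@dataclass
class OperationalMetadata:
    last_mapping: Optional[str] = None
    last_run_failed: bool = False
    rows_written: int = 0


@dataclass
class CatalogEntry:
    """One data asset seen through all four metadata categories."""
    technical: TechnicalMetadata
    business: Optional[BusinessMetadata] = None
    usage: UsageMetadata = field(default_factory=UsageMetadata)
    operational: OperationalMetadata = field(default_factory=OperationalMetadata)

    def summary(self) -> str:
        # Combine facts that would otherwise live in separate silos.
        t = self.technical
        parts = [f"{t.database}.{t.table}.{t.column} ({t.data_type})"]
        if self.business:
            parts.append(f"owner={self.business.data_owner}")
            parts.append(f"classification={self.business.classification}")
        parts.append(f"accesses={self.usage.access_count}")
        status = "failed" if self.operational.last_run_failed else "ok"
        parts.append(f"last_run={status}")
        return "; ".join(parts)


entry = CatalogEntry(
    technical=TechnicalMetadata("sales_db", "customers", "email", "VARCHAR(255)", True),
    business=BusinessMetadata("Customer contact email", "jane.doe", "Confidential"),
    usage=UsageMetadata(access_count=42, last_accessed_by="analyst_01"),
    operational=OperationalMetadata("m_load_customers", False, 1000),
)
print(entry.summary())
```

    The point of the sketch is only that a single view answers questions no one silo can: where the column lives, who owns it, how often it is used, and whether the last load succeeded.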

    Tribal Knowledge to AI-Powered Catalog

    Informatica has put AI at the center of data asset management. To accelerate discovery, classification, and similarity detection between data assets, EDC uses an engine powered by machine learning.

    The metadata collected by EDC provides a vast trove of information that the CLAIRE machine learning algorithms use to learn about an enterprise’s data landscape. A few examples of how CLAIRE helps EDC:

    • Assisting in making intelligent recommendations
    • Automating development and monitoring, and adapting to intrinsic/extrinsic changes
    • Finding similar data assets faster
    • Applying semantic labels to data and column name values

    You can read more about CLAIRE here.
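
    As a rough intuition for what "finding similar data assets" can mean, the toy Python sketch below scores two column names by the overlap of their word tokens (Jaccard similarity). This is a deliberately simple stand-in and is not how CLAIRE works; the function names and sample column names are made up for illustration.

```python
import re


def tokens(name: str) -> set:
    """Split a column name into lowercase word tokens.

    Handles snake_case, kebab-case, and camelCase boundaries.
    """
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return {t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the token sets of two names (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical column names from two different systems.
print(jaccard("customer_email", "CustomerEmail"))
print(jaccard("email_address", "customer_email"))
print(jaccard("cust_id", "customer_email"))
```

    A name-only comparison like this misses abbreviations such as "cust" versus "customer", which hints at why a catalog would want to learn from data values and patterns in addition to names.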

    EDC Misconceptions

    Enterprise Data Catalog is not a big data/Hadoop-specific tool, just as PowerCenter is not an Oracle/SQL Server-specific tool. EDC uses the Hadoop ecosystem to interact with and store the information it is accountable for. EDC uses Hadoop to catalog the information, but this is transparent to users: EDC comes with a clean search interface that requires no knowledge of the backend technology.

    EDC shows profiling results and allows for data domain discovery, but it is not a replacement for Analyst or IDQ. Unlike DQ/Analyst, it doesn’t store trending or rule-based profiling. For advanced and customized profiling, DQ/Analyst is the right toolset.

    EDC is a standalone tool; customers don’t need a Hadoop cluster or BDM to use EDC. EDC comes with an internal cluster option in which the Hadoop cluster required to store catalog information is created by the Informatica domain.