Enterprise Data Catalog: Architecture

Version 13

    This document describes the overall architecture of Informatica Enterprise Data Catalog (EDC) and details the components the architecture relies on.

     

    Service oriented architecture

    EDC follows the service-oriented architecture (SOA) that is standard across the rest of the Informatica products. The services required for EDC run under a central administrative unit called a domain. Inside an Informatica domain, four primary application services are configured for EDC.

     

    Informatica Domain and Nodes

    EDC services reside in a central administrative unit called an Informatica domain. An Informatica domain is a collection of nodes and services. The domain requires a relational database to store its configuration information: when we install the Enterprise Data Catalog services, we must create the domain configuration repository in a relational database.

    The Informatica domain manages a set of nodes and the application services running on those nodes.

    A node is the logical representation of a machine in a domain. The domain controls a set of application services, each representing a specific server-based functionality, that run on the nodes.

    Application services

    When we create an application service, we designate a node with the service role to run the service process. The service process is the run-time representation of a service running on a node. The service type determines how many service processes can run at a time. With the high availability option, we can run an application service on multiple nodes.
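
    The relationship between a domain, its nodes and its application services can be sketched in a few lines of Python. This is only an illustrative model, not an Informatica API: the class names, the "service" role string and the high_availability flag are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Logical representation of a machine in the domain."""
    name: str
    roles: set = field(default_factory=lambda: {"service"})


@dataclass
class ApplicationService:
    """A server-based functionality that runs on one or more nodes."""
    name: str
    nodes: list = field(default_factory=list)


@dataclass
class Domain:
    """A collection of nodes and application services."""
    name: str
    high_availability: bool = False  # hypothetical flag for the HA option
    nodes: dict = field(default_factory=dict)
    services: dict = field(default_factory=dict)

    def add_node(self, node):
        self.nodes[node.name] = node

    def create_service(self, service_name, node_names):
        """Designate the node(s) that run the service process."""
        if not node_names:
            raise ValueError("a service needs at least one node")
        # Without the high availability option, use a single node.
        if len(node_names) > 1 and not self.high_availability:
            raise ValueError("multiple nodes require the high availability option")
        chosen = [self.nodes[n] for n in node_names]
        for node in chosen:
            # Only nodes with the service role can run a service process.
            if "service" not in node.roles:
                raise ValueError(f"node {node.name} lacks the service role")
        service = ApplicationService(service_name, chosen)
        self.services[service_name] = service
        return service


domain = Domain("Domain_EDC")
domain.add_node(Node("node01"))
domain.add_node(Node("node02"))
catalog = domain.create_service("Catalog_Service", ["node01"])
print([n.name for n in catalog.nodes])
```

    In this sketch, asking for two nodes on a domain without the high availability flag raises an error, which mirrors the rule that running a service on multiple nodes depends on that option.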

     

    From an administrative and logistical point of view, kick-starting an EDC project is not very different from an Informatica PowerCenter, Informatica Big Data Management or Informatica Metadata Manager engagement, apart from slight nuances in hardware sizing and EDC-specific configuration.

    The four application services configured for EDC are as follows:

    • Model Repository Service: The Model Repository Service (MRS) is an application service that manages the Model repository, which stores metadata created by Enterprise Data Catalog. The MRS stores the underlying contents in a relational database to enable collaboration among the tools and services. The Model repository primarily stores the resource configuration and data domain information. When you access a Model repository object from the Enterprise Data Catalog tools or the Data Integration Service, the client or service sends a request to the Model Repository Service. The Model Repository Service process fetches, inserts, and updates the metadata in the Model repository database tables.
    • Data Integration Service: When a user runs scans on resources and views the metadata and profiling statistics in Enterprise Data Catalog, the client tool sends requests to the Data Integration Service to perform the enterprise data discovery.
    • Catalog Service: The Catalog Service is needed to run the Enterprise Data Catalog application and to manage the connections between the Enterprise Data Catalog components. You can configure the general, application service, and security properties of the Catalog Service.
    • Content Management Service: The Content Management Service is an application service that manages reference data. A reference data object contains a set of data values that you can search while performing data quality operations on source data. The Content Management Service also compiles rule specifications into mapplets. It uses the Data Integration Service to run mappings that transfer data between reference tables and external data sources. The Content Management Service also provides transformations, mapping specifications, and rule specifications with the following types of reference data:
      • Address reference data
      • Identity populations
      • Probabilistic models and classifier models

     

    Repository Databases

    The application services use databases to store configuration metadata, runtime data and logs. The databases required for EDC can be deployed on Oracle, Microsoft SQL Server or IBM DB2 UDB.

    • Domain repository database: stores all the configuration of the domain, nodes and services needed to run the EDC application.
    • Model repository database: stores the metadata managed by the Model Repository Service. This is where resource definitions are stored.
    • Profile warehouse database (PWH): used by the Data Integration Service to store profile statistics and value frequencies temporarily, before the Spark ingestion service moves them to the HBase database.
    • Reference database (REF): used by the Content Management Service to store the reference data used for data discovery, such as names, synonyms and location reference data.
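
    As a planning aid, the list above can be written down as a small lookup table and checked against a proposed deployment. The sketch below is hypothetical: the dictionary keys, the validate_plan helper and the plan format are invented for illustration and are not part of any Informatica tool. Only the repository names, their owning services and the supported database types come from the text above.

```python
# Database types named in this document as supported for the EDC repositories.
SUPPORTED_DATABASE_TYPES = ("Oracle", "Microsoft SQL Server", "IBM DB2 UDB")

# Repository -> service that uses it and what it holds (from the list above).
REPOSITORY_DATABASES = {
    "domain": {
        "used_by": "Informatica domain",
        "stores": "configuration of the domain, nodes and services",
    },
    "model_repository": {
        "used_by": "Model Repository Service",
        "stores": "resource definitions and related metadata",
    },
    "profile_warehouse": {
        "used_by": "Data Integration Service",
        "stores": "temporary profile statistics and value frequencies",
    },
    "reference": {
        "used_by": "Content Management Service",
        "stores": "reference data used for data discovery",
    },
}


def validate_plan(plan):
    """Return a list of problems found in a {repository: database_type} plan."""
    problems = []
    for repository in REPOSITORY_DATABASES:
        if repository not in plan:
            problems.append(f"missing repository: {repository}")
    for repository, database_type in plan.items():
        if repository not in REPOSITORY_DATABASES:
            problems.append(f"unknown repository: {repository}")
        elif database_type not in SUPPORTED_DATABASE_TYPES:
            problems.append(f"unsupported database type for {repository}: {database_type}")
    return problems


plan = {
    "domain": "Oracle",
    "model_repository": "Oracle",
    "profile_warehouse": "Microsoft SQL Server",
    "reference": "IBM DB2 UDB",
}
print(validate_plan(plan))
```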

     

    Hadoop Cluster services

     

    EDC relies on Hadoop technologies to achieve the scalability, performance and flexibility of an enterprise-class application that holds millions of metadata objects and serves thousands of requests at the same time as the user community in your company grows. To that end, we support two deployment types:

    • Using an existing cluster: this is the best option if your organisation is already familiar with the technology.
    • Using an Informatica-managed cluster: you only have to provision the required hardware, and the Informatica domain takes care of deploying the cluster for you.

     

    The Hadoop services that are leveraged by the Catalog Service are as follows:

    • ZooKeeper: Maintains the configuration of the Hadoop cluster services across all the cluster nodes.
    • HDFS: EDC stores the catalog information in HDFS. To store the catalog contents, a special HDFS directory is created: /Informatica/LDM/<ServiceClusterName>. The directory must be owned by the service user name (the service principal in the case of Kerberos).
    • YARN: Stands for Yet Another Resource Negotiator. YARN uses a cluster-wide ResourceManager and per-node NodeManager daemons to manage Hadoop resource distribution and usage. When the Catalog Service is started, it starts three applications in YARN (HBase, Solr and the Spark ingestion service); scanners are launched separately as batch jobs.
      • HBase: Manages and provides scalable NoSQL storage on top of HDFS. This is where all the source metadata is stored.
      • Solr: Provides indexing and search capabilities. EDC builds indexes of the metadata stored in HBase to allow fuzzy search and fast retrieval of the metadata.
      • Spark: Ingestion service used to collect and transform data and to update the HBase store and the Solr indexes.
      • Scanners: Applications launched as batch jobs to extract metadata, run profiling and data discovery tasks, and move the results to the catalog queue to be processed by the ingestion service. The scanner application pushes the execution of profiling and data discovery down to the source application whenever possible (for databases, Hadoop clusters, etc.).
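
    The flow described above (scanners put results on the catalog queue, the ingestion service drains the queue, then updates the metadata store and the search index) can be illustrated with a toy in-memory model. This is a sketch under strong assumptions: the dict stands in for HBase, the inverted index stands in for Solr, and the MiniCatalog class, its methods and the sample resource names are all hypothetical. No Hadoop component is actually called.

```python
from collections import defaultdict, deque


def catalog_hdfs_dir(service_cluster_name):
    """Build the catalog directory path from the pattern given in the text."""
    return f"/Informatica/LDM/{service_cluster_name}"


class MiniCatalog:
    """Toy stand-in for the scanner -> queue -> ingestion -> store/index flow."""

    def __init__(self):
        self.queue = deque()           # stand-in for the catalog queue
        self.store = {}                # stand-in for the HBase metadata store
        self.index = defaultdict(set)  # stand-in for the Solr search index

    def scan(self, resource, assets):
        """Scanner step: extract metadata and push it onto the queue."""
        for name, description in assets:
            self.queue.append({
                "id": f"{resource}/{name}",
                "name": name,
                "description": description,
            })

    def ingest(self):
        """Ingestion step: drain the queue, update the store and the index."""
        count = 0
        while self.queue:
            record = self.queue.popleft()
            self.store[record["id"]] = record
            text = f"{record['name']} {record['description']}"
            for token in text.lower().split():
                self.index[token].add(record["id"])
            count += 1
        return count

    def search(self, term):
        """Exact-token lookup (the real index also supports fuzzy search)."""
        return sorted(self.index.get(term.lower(), set()))


catalog = MiniCatalog()
catalog.scan("oracle_hr", [
    ("EMPLOYEES", "employee master table"),
    ("DEPARTMENTS", "department lookup table"),
])
print(catalog.ingest())          # number of records ingested
print(catalog.search("table"))   # ids of assets whose text contains "table"
print(catalog_hdfs_dir("my_cluster"))
```

    Keeping the queue between the scanner and the ingestion step is the point of the sketch: scanners can finish their batch work independently, and the store and the index are only updated by the single ingestion path.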