Enterprise Data Catalog: deployment and sizing considerations

Version 6

    Deployment considerations

     

    When deploying the Enterprise Data Catalog application, you should consider multiple aspects of the deployment, from the physical architecture to the logical implementation of the services, to achieve optimal performance. See the diagram below for an initial overview of the deployment:

     

     

    Informatica domain deployment

     

    While all services needed for the Enterprise Data Catalog application (see the EDC architecture page) can run on a single node with enough resources (CPU, RAM, and disk), running them on multiple separate machines ensures that you can enforce quality of service for the technical and business users who access the Enterprise Data Catalog, by leveraging the high availability (HA) capabilities of the Informatica platform and services.

     

    To that end, the services can be classified into two groups, infrastructure services and metadata processing services, and these two groups can be used to distribute the services across the selected machines.

    • The infrastructure services are:
      • Model Repository Service
      • Cluster Service
      • Catalog Service
      • Analyst Service
    • The metadata processing services are:
      • Content Management Service
      • Data Integration Service

     

     

    Applications deployed on Hadoop

    As described on the EDC architecture page, several services are deployed on a Hadoop cluster as YARN applications to support the growth, performance, and scalability requirements of an application such as the Enterprise Data Catalog. The applications deployed are listed below (a quick verification sketch follows the list):

    • HBase as a Slider application
    • Solr as a Slider application
    • Ingestion service as a Spark application
    • Scanner applications, which run based on a schedule
    • Object similarity jobs
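
     After the Catalog Service starts, one way to confirm that these applications are running is to query the YARN ResourceManager. The following is a minimal sketch, assuming the yarn command-line client is available on the host where it runs; the keyword filters are illustrative only and not official EDC application names.

        # Sketch: list the RUNNING YARN applications and highlight the ones EDC typically submits.
        # Assumes the 'yarn' CLI is on the PATH; the keyword filters below are illustrative only.
        import subprocess

        def list_running_yarn_apps() -> str:
            """Return the raw output of 'yarn application -list' for RUNNING applications."""
            result = subprocess.run(
                ["yarn", "application", "-list", "-appStates", "RUNNING"],
                capture_output=True, text=True, check=True,
            )
            return result.stdout

        if __name__ == "__main__":
            for line in list_running_yarn_apps().splitlines():
                # Slider hosts HBase and Solr; the ingestion service runs as a Spark application.
                if any(keyword in line.lower() for keyword in ("slider", "spark", "hbase", "solr")):
                    print(line)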

     

    There are different ways to meet the cluster requirement for Enterprise Data Catalog: you can either deploy EDC on an embedded (Informatica-managed) Hadoop cluster, or on an existing cluster that you create for EDC or that you already use for other big data use cases. For simplicity and easier maintainability, it is recommended to use a cluster that is separate from your data lake, to avoid downtime for your users when maintenance occurs on the data lake.

     

    Embedded Hadoop Cluster Deployment

    When you install Enterprise Data Catalog on an embedded Hadoop cluster, EDC creates a dedicated cluster for the Catalog. The EDC installer creates an Informatica Cluster Service as an ISP service.

    The Informatica Cluster Service is an application service that runs and manages all the Hadoop services, the Apache Ambari server, and the Apache Ambari agents on a list of machines that you provide to the installer. In a manual deployment, if you choose the embedded cluster deployment mode, you need to create the Informatica Cluster Service before you create the Catalog Service.

    The Informatica Cluster Service distributes the Hortonworks binaries and launches the required Hadoop services on the hosts where the embedded cluster runs. Enterprise Data Catalog uses Apache Ambari to manage and monitor the embedded Hadoop cluster. The Hadoop services include:

    • ZooKeeper
    • HDFS
    • YARN
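
     Because the embedded cluster is managed through Ambari, the state of these services can be checked over the Ambari REST API once the Informatica Cluster Service is up. The following is a minimal sketch; the host, port, cluster name, and credentials are placeholders (assumptions) that must be adapted to your environment.

        # Sketch: query the Ambari REST API for the state of the embedded cluster services.
        # The URL, cluster name, and credentials below are placeholders, not EDC defaults.
        import requests

        AMBARI_URL = "http://ambari-host:8080/api/v1"
        CLUSTER = "edc_embedded_cluster"   # hypothetical cluster name
        AUTH = ("admin", "admin")          # replace with real Ambari credentials

        def service_states() -> dict:
            """Return {service_name: state} for every service Ambari manages in the cluster."""
            response = requests.get(
                f"{AMBARI_URL}/clusters/{CLUSTER}/services",
                params={"fields": "ServiceInfo/state"},
                auth=AUTH,
            )
            response.raise_for_status()
            return {
                item["ServiceInfo"]["service_name"]: item["ServiceInfo"]["state"]
                for item in response.json()["items"]
            }

        if __name__ == "__main__":
            for name, state in sorted(service_states().items()):
                print(f"{name}: {state}")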

     

    The embedded cluster can be deployed on 1, 3 or more machines, up to a total of 8 machines.

     

    Note: the embedded cluster is not intended to be used for any use case other than the storage and processing of the metadata created with EDC.

     

    Existing Hadoop cluster deployment

    You can also deploy Enterprise Data Catalog on an existing Hadoop cluster. In this case, you need to configure the ZooKeeper, HDFS, and YARN specifications, or simply point to the Ambari or Cloudera Manager URL, depending on the distribution you are using. The Catalog Service uses these configuration properties to launch the services and components listed above on the Hadoop cluster as YARN applications.

     

    On an existing cluster, in addition to the applications launched on YARN, if you plan to extract metadata from the existing Hadoop cluster itself, you should consider the impact of the profiling and data discovery jobs pushed down to the cluster as Blaze applications. This applies to Hive and HDFS sources.
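
     Before scheduling large profiling or data discovery runs on a shared cluster, it can be useful to check how much headroom the ResourceManager reports. The sketch below uses the standard YARN ResourceManager REST API; the ResourceManager host and port are assumptions to adjust for your cluster.

        # Sketch: check available YARN capacity before pushing heavy profiling jobs to the cluster.
        # The ResourceManager host and port below are assumptions; adjust them to your cluster.
        import requests

        RM_METRICS_URL = "http://resourcemanager-host:8088/ws/v1/cluster/metrics"

        def cluster_headroom() -> tuple:
            """Return (available_mb, available_vcores) as reported by the ResourceManager."""
            metrics = requests.get(RM_METRICS_URL).json()["clusterMetrics"]
            return metrics["availableMB"], metrics["availableVirtualCores"]

        if __name__ == "__main__":
            available_mb, available_vcores = cluster_headroom()
            print(f"Available memory: {available_mb} MB, available vcores: {available_vcores}")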

     

     

    Sizing considerations

    When sizing the Enterprise Data Catalog environment, start by collecting information about the source systems you intend to extract metadata from. To determine what will be required, you should have a clear picture of the source landscape: the number of sources, the total number of objects (keeping in mind that a database column counts as an object in EDC), the number of objects in the largest source, and the proportion of large metadata sources.

     

    Based on these results, you can use the EDC Sizing Guide to determine which of the following load types you fall into (a rough classification sketch follows the list):

    • minimum
    • low
    • medium
    • large
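
     As a rough illustration of this classification, the sketch below sums the object counts collected per source and maps the total to a load type. The thresholds are placeholders only, not the official values; always take the actual boundaries from the EDC Sizing Guide.

        # Sketch: classify the expected metadata volume into a sizing load type.
        # The thresholds below are placeholders, NOT the official EDC Sizing Guide values.
        LOAD_TIERS = [
            ("minimum", 1_000_000),     # placeholder upper bound
            ("low", 10_000_000),        # placeholder upper bound
            ("medium", 50_000_000),     # placeholder upper bound
            ("large", float("inf")),    # everything above the medium threshold
        ]

        def load_type(total_objects: int) -> str:
            """Return the load type for a total object count (a database column counts as an object)."""
            for tier, upper_bound in LOAD_TIERS:
                if total_objects <= upper_bound:
                    return tier
            return "large"

        if __name__ == "__main__":
            # Hypothetical inventory: object counts per source system to be scanned.
            sources = {"warehouse_db": 2_500_000, "data_lake_hive": 4_000_000, "crm": 300_000}
            total = sum(sources.values())
            print(f"Total objects: {total} -> load type: {load_type(total)}")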

     

    Each of these load types corresponds to sizing requirements and guidelines for the Informatica domain servers, the cluster servers, and the database server, listing for each of them the CPU, RAM, disk space, and disk count requirements. The disk count criterion is often overlooked in sizing but frequently becomes a bottleneck; it should be considered carefully, especially for the metadata processing machines and the Hadoop cluster machines.

     

    If you use an existing cluster, review the cluster parameters with your Hadoop administrator to ensure that the applications started by EDC on the cluster are given the proper resources to execute their tasks. Important parameters to review are listed below (a small sketch that reads these settings follows the list):

    • maximum number of client connections for ZooKeeper
    • maximum client session timeout for ZooKeeper
    • minimum and maximum CPU and memory allocation for YARN containers
    • memory and CPU allocated to each NodeManager (making sure the resources are not over-allocated)
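
     As a starting point for that review, the sketch below reads a few of these settings from the cluster configuration files. The file locations and property names follow common Hadoop and ZooKeeper defaults and are assumptions; they can differ between distributions, so confirm them with your Hadoop administrator.

        # Sketch: collect the YARN and ZooKeeper settings to review with the Hadoop administrator.
        # The file paths follow common defaults and are assumptions; adjust per distribution.
        import xml.etree.ElementTree as ET

        YARN_SITE = "/etc/hadoop/conf/yarn-site.xml"   # assumed location
        ZOO_CFG = "/etc/zookeeper/conf/zoo.cfg"        # assumed location

        YARN_KEYS = [
            "yarn.scheduler.minimum-allocation-mb",
            "yarn.scheduler.maximum-allocation-mb",
            "yarn.scheduler.minimum-allocation-vcores",
            "yarn.scheduler.maximum-allocation-vcores",
            "yarn.nodemanager.resource.memory-mb",
            "yarn.nodemanager.resource.cpu-vcores",
        ]
        ZOO_KEYS = ["maxClientCnxns", "maxSessionTimeout"]

        def read_yarn_site(path: str = YARN_SITE) -> dict:
            """Return {property: value} for the YARN allocation keys of interest."""
            root = ET.parse(path).getroot()
            props = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
            return {key: props.get(key, "<not set>") for key in YARN_KEYS}

        def read_zoo_cfg(path: str = ZOO_CFG) -> dict:
            """Return {key: value} for the ZooKeeper keys of interest."""
            values = {}
            with open(path) as cfg:
                for line in cfg:
                    if "=" in line and not line.strip().startswith("#"):
                        key, _, value = line.partition("=")
                        values[key.strip()] = value.strip()
            return {key: values.get(key, "<not set>") for key in ZOO_KEYS}

        if __name__ == "__main__":
            for key, value in {**read_yarn_site(), **read_zoo_cfg()}.items():
                print(f"{key} = {value}")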