EDC Installation: High Level Overview

Version 2

    Scope of this document

     

    This document intends to provide a high-level overview and checklist for installing the Enterprise Data Catalog. It is not intended as a replacement for the product documentation. For more details, please refer to the documentation (available in Communities) or the online version of the installation guide:

    https://network.informatica.com/onlinehelp/edc/Install_Help/index.htm

     

    Pre-install steps

     

    Before installing the product, please review the Product Availability Matrix (PAM) for the minimum hardware/software requirements and the supported operating system and Hadoop versions.

    Please refer to the following page for instructions on how to find the PAM for the EDC version you are planning to install:

    https://network.informatica.com/community/informatica-network/blog/2015/12/20/update-on-pam-documents

    Prepare databases

     

    Both the domain and the Model Repository Service need an RDBMS to host their contents. Prepare and configure appropriate RDBMS schemas/databases for the domain and the MRS. Please refer to the PAM for the supported database types and versions for EDC.

    HDFS Pre-requisites

    Non-Kerberized cluster

    The following directories must be created on HDFS:

         /Informatica/EDC/<ServiceClusterName>

         /user/<ISPProcessOwner>

    The owner of the above directories must be the user that starts the ISP domain (the ISPProcessOwner).
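
    For example, on a non-Kerberized cluster the directories can be created with the standard HDFS shell (a sketch; substitute your actual service cluster name and domain process owner):

         # Run as a user with HDFS superuser privileges (for example, hdfs)
         hadoop fs -mkdir -p /Informatica/EDC/<ServiceClusterName>
         hadoop fs -mkdir -p /user/<ISPProcessOwner>
         # Hand ownership of both directories to the ISP domain process owner
         hadoop fs -chown -R <ISPProcessOwner> /Informatica/EDC/<ServiceClusterName>
         hadoop fs -chown -R <ISPProcessOwner> /user/<ISPProcessOwner>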

     

    Kerberized cluster

    The following directories must be created on HDFS:

         /Informatica/EDC/<ServiceClusterName>

    where <ServiceClusterName> is the principal user name in Kerberos

         /user/<ServiceClusterName>

    The owner of these HDFS directories must be the Kerberos principal user.
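
    The equivalent sketch for a Kerberized cluster, assuming you first authenticate with the HDFS superuser keytab (the keytab path is illustrative):

         # Authenticate before issuing HDFS commands on a Kerberized cluster
         kinit -kt /path/to/hdfs.keytab hdfs/<hostname>@<REALM>
         hadoop fs -mkdir -p /Informatica/EDC/<ServiceClusterName>
         hadoop fs -mkdir -p /user/<ServiceClusterName>
         # The principal user must own both directories
         hadoop fs -chown -R <ServiceClusterName> /Informatica/EDC/<ServiceClusterName>
         hadoop fs -chown -R <ServiceClusterName> /user/<ServiceClusterName>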

    Kerberos pre-requisites

     

    Most external Hadoop cluster deployments use Kerberos authentication. EDC can communicate with a Kerberos-enabled Hadoop cluster as long as there is a single keytab file containing the user principal name that EDC uses for impersonation. The other entries in the keytab are host specific: for example, if the service cluster name is <ServiceClusterName>, there must be entries for all hosts in the Hadoop cluster (for example, <ServiceClusterName>/<hostname>@<REALM>). Informatica does not support cross-realm or multi-realm Kerberos authentication. The server host, client machines, and Kerberos authentication server must be in the same realm.

     

    The Catalog Service also needs the HDFS and YARN service principal names during setup. Apart from the service principal with the service cluster name, a host-specific HTTP principal is also added to the keytab file. The HTTP principal is used to authenticate client requests to Solr (the search engine), which YARN can launch on any cluster node.

    The client library (SolrJ) uses HTTP to communicate with the Solr application. For a Kerberized Hadoop cluster, three sets of principals are needed.

    EDC expects keys for the following types of principals in the keytab configured with the services (a verification sketch follows the list):

    • <ServiceClusterName>@<REALM>
    • <ServiceClusterName>/_HOST@<REALM>
    • HTTP/_HOST@<REALM>
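
    You can check that a merged keytab contains all three types of entries with the standard klist utility (the keytab path is illustrative):

         # Expect the user principal, one <ServiceClusterName>/<hostname>@<REALM>
         # entry per cluster host, and one HTTP/<hostname>@<REALM> entry per host
         klist -kt /path/to/merged.keytab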

     

    The Catalog Service runs on the edge/Informatica node (let's call it the EDC node) and deploys applications on YARN.

    The <ServiceClusterName>@<REALM> principal is used by the Catalog Service to authenticate itself to ZooKeeper, HDFS, and YARN.

    The other two principals are used by the HBase and Solr applications that the Catalog Service starts. These applications run continuously to retrieve catalog information for EDC users and to facilitate data asset search.

    Kerberos Setup

     

    Informatica provides utilities in the form of JAR files that you can run. You can use the utilities to validate prerequisites or perform mandatory tasks before you install an Informatica product.

    The keytab management utility helps you create, merge, and validate keytab files. Creating, merging, and validating keytab files are prerequisites for deploying Enterprise Data Catalog in a cluster enabled for Kerberos.

    You can use the utility to perform the following steps to manage keytab files:

     

    1. Create a user principal for the HDFS service cluster name in Active Directory.
    2. Create service principals for all the cluster nodes in Active Directory.
    3. Create keytab files for all the principals.
    4. Merge the keytab files and spnego keytab files into a merged keytab output file.
    5. Validate the output file.

    The utility is provided in the EDC-keytab_utility.zip file. You must extract the KeytabUtility.jar file from the ZIP file to use the utility.
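
    For example, a minimal sketch of extracting the utility before running the commands described below:

         unzip EDC-keytab_utility.zip
         # The commands (createKeytab, mergeHttpKeytabs) are run via the JAR,
         # as sketched in the following sections
         java -jar KeytabUtility.jar <command> <options>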

     

    Refer to https://kb.informatica.com/howto/6/Pages/20/522953.aspx?myk=EIC%2010.2%20admin for more information on the Kerberos utility.

     

    Creating a Keytab file

     

    Run the createKeytab command to create the user principal and the service principals for all hosts, generate the keytab files for the principals, and merge the keytab files into an output file. If an embedded cluster is up and running, the command also merges the spnego keytab files into the output file along with the keytab files for the principals; the same applies to an existing cluster once the spnego keytab files have been copied to the keytab location. Before creating a principal or checking whether it exists, the utility checks whether the merged keytab output file already exists at the specified location. If it does, the utility creates principals and keytab files only for the hosts that do not yet have keytab entries in the merged output file.
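
    A hypothetical invocation is sketched below. The actual option names for the createKeytab command vary by EDC version, so treat every flag here as an assumption and confirm the syntax against the KB article referenced above:

         # Illustrative only -- the option names are assumptions, not documented syntax
         java -jar KeytabUtility.jar createKeytab \
             -serviceClusterName <ServiceClusterName> \
             -realm <REALM> \
             -keytabLocation /path/to/keytabs \
             -outputFile merged.keytab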

     

    Merge HTTP keytabs

    You can use the command in the following ways based on the type of cluster you use (a sketch follows the list):

    o   Embedded cluster - If you have not yet set up the cluster when you run the createKeytab command, the command does not merge the spnego keytab files into the output file. You must run the mergeHttpKeytabs command after you set up the cluster and when the cluster is up and running.

    o   Existing cluster - If you have not copied the spnego keytab files from all the host nodes to the keytab location, the createKeytab command does not merge the spnego keytab files into the output file. You must run mergeHttpKeytabs after you copy the spnego keytab files to the keytab location.
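
    A correspondingly hedged sketch of the merge step (again, confirm the real option names against the KB article):

         # Illustrative only -- run after the embedded cluster is up, or after the
         # spnego keytabs have been copied to the keytab location (existing cluster)
         java -jar KeytabUtility.jar mergeHttpKeytabs \
             -keytabLocation /path/to/keytabs \
             -outputFile merged.keytab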

     

    SSL pre-requisites

    The following matrix displays the combinations of services that are supported with SSL turned on. Please refer to the PAM, which is updated regularly, for the latest supported combinations.

    Domain and Catalog Service | Ambari/Cloudera Manager | YARN    | Supported
    ---------------------------|-------------------------|---------|----------
    SSL                        | SSL                     | SSL     | Yes
    Non-SSL                    | Non-SSL                 | Non-SSL | Yes
    SSL                        | Non-SSL                 | Non-SSL | Yes
    SSL                        | SSL                     | Non-SSL | Yes
    SSL                        | Non-SSL                 | SSL     | Yes

    SSL setup

    Service configuration

    • The Informatica domain is configured in SSL mode.
    • The cluster and YARN REST endpoints are Kerberos-enabled.
    • Create a keystore file for the Apache Solr application on all nodes in the cluster. Import the public certificates of the Apache Solr keystore files on all the hosts into all the truststore files configured for HDFS and YARN. This step is required for Apache Spark and scanner jobs to connect to the Apache Solr application.
    • Import the public certificates of the Apache Solr and YARN applications into the truststore file of the Informatica domain. This step is required for the Catalog Service to connect to the YARN and Solr applications (see the keytool sketch after this list).
    • Import the public certificates of the Informatica domain and the Catalog Service into the YARN truststore.
    • Import the public certificate of the Catalog Service into the Informatica domain truststore.
    • If you plan to deploy Enterprise Data Catalog on an existing Hortonworks version 2.5 cluster that does not support SSL authentication, perform the following steps:
      • Configure the following properties in the /etc/hadoop/conf/ssl-client.xml file: ssl.client.truststore.location and ssl.client.truststore.password.
      • Ensure that the ssl.client.truststore.location value points to the /opt directory and not the /etc directory. Verify that you configure the full path to the truststore file for the ssl.client.truststore.location property. For example, you can set a value similar to /opt/truststore/infa_truststore.jks.
      • Export the keystore certificate used in the Informatica domain.
      • Import the keystore certificate into the Informatica domain truststore file.
      • Place the domain truststore file in the /opt directory on all the Hadoop nodes. For example, /opt/truststore/infa_truststore.jks.
      • Open the /etc/hadoop/conf/ssl-client.xml file and modify the ssl.client.truststore.location and ssl.client.truststore.password properties. For more information, please visit https://kb.informatica.com/h2l/HowTo%20Library/1/1096-PrerequisitestoScanningPowerCenterResourcesThatAreOnAnSSLEnabledDo…
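
    The certificate exports and imports in the steps above use the standard JDK keytool. A sketch for the Solr certificate (alias names, file names, and passwords are illustrative):

         # Export the Solr public certificate from its keystore
         keytool -exportcert -alias solr -keystore solr_keystore.jks \
                 -file solr_cert.cer -storepass <keystore_password>
         # Import it into the Informatica domain truststore
         keytool -importcert -alias solr -file solr_cert.cer \
                 -keystore infa_truststore.jks -storepass <truststore_password> -noprompt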

     

    Working with external SSL-enabled sources

    If you are planning to gather metadata from systems that are SSL enabled, you need to ensure that the services have the necessary configuration to connect to those systems.

    For example, for a PowerCenter repository running on an SSL-enabled Informatica domain, the high-level steps are as follows (see the sketch after this list):

    • Export the Informatica domain certificate from the PowerCenter domain.
    • Copy the certificate to the EDC domain host machine.
    • Import the certificate into the EDC domain truststore.
    • Import the certificate into the truststore of each Hadoop node of the cluster used for EDC.
    • Encrypt your truststore password using the pmpasswd command.
    • Add the two environment variables to the YARN configuration:
      • INFA_TRUSTSTORE
      • INFA_TRUSTSTORE_PASSWORD

    Restart YARN.
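
    A sketch of the certificate and password steps, assuming illustrative alias and file names (the pmpasswd utility ships with the Informatica installation; check your version for its exact syntax):

         # On the PowerCenter domain host: export the domain certificate
         keytool -exportcert -alias infadomain -keystore infa_keystore.jks \
                 -file pc_domain_cert.cer -storepass <keystore_password>
         # On the EDC domain host and on every Hadoop node: import the certificate
         keytool -importcert -alias pcdomain -file pc_domain_cert.cer \
                 -keystore infa_truststore.jks -storepass <truststore_password> -noprompt
         # Encrypt the truststore password for use in INFA_TRUSTSTORE_PASSWORD
         pmpasswd <truststore_password>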

     

    Validating embedded cluster pre-requisites

    Informatica provides utilities in the form of JAR files that you can run. You can use the utilities to validate prerequisites or perform mandatory tasks before you install an Informatica product.

    The embedded cluster validation utility helps you validate the prerequisites required to install Enterprise Data Catalog in an embedded Hadoop distribution on Hortonworks. Run the utility before you install Enterprise Data Catalog.

    When you run the utility, it validates the Enterprise Data Catalog prerequisites in your system environment. If your system environment is not compliant with any required prerequisite, the utility stops running and displays an error message. If your system environment is compliant with all the prerequisites, the utility completes the prerequisite validation. The utility also provides a log file after the validation. You can view the log file to identify the requirement that you might need to complete before you install Enterprise Data Catalog.

    The utility is shipped by Informatica and is also included in the installer. It validates whether the prerequisites for the host node, operating system, and Kerberos are complete.

     

    Install Enterprise Data Catalog

    After installing the domain, install the EDC binaries on the Informatica nodes using the console or silent mode.
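
    For example, a hedged sketch of the installer invocation (script and file names can differ between releases; check the installation guide for your version):

         # Console mode: run the installer script and follow the prompts
         ./install.sh
         # Silent mode: edit the properties file shipped with the installer,
         # then run the silent installer
         vi SilentInput.properties
         ./silentinstall.sh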

     

    Create or Join a Domain

    EDC can be installed in the same domain as BDM or, in split-domain mode, in a separate domain. Please refer to section 2 for considerations on choosing a single or split domain for the installation. If you are sharing the domain, select the "Join the existing domain" option during the install; otherwise, create a brand new Informatica domain.