Customers use the Informatica Big Data Management (BDM) product to access metadata (through the developer client tool) and data (through the Data Integration Service) for Hadoop-based sources (HDFS files, HBase, Hive, MapR-DB) as well as non-Hadoop sources. A major pain point with Hadoop data sources is the non-trivial effort of configuring access to the Hadoop systems, including Kerberos configuration with the kinit tool, keytab files, and so on. This document explains how the Metadata Access Service simplifies that configuration and enables more secure metadata access to Hadoop data sources.
Metadata access process without Metadata Access Service
Before the Metadata Access Service was introduced, importing metadata from a Hadoop data source such as HDFS files, HBase, Hive, or MapR-DB involved the following steps. Note that these steps had to be performed on every developer client installation that needed to access metadata from a Hadoop data source.
- The developer needs to run the kinit command on the developer client box to obtain and cache the initial ticket-granting ticket from the KDC server. This requires providing the appropriate keytab and krb5 configuration files to individual developers and asking them to run the command manually before requesting metadata through the developer client. Because keytab files contain sensitive information, distributing them to each developer box and relying on developers to run these commands manually requires careful handling on the customer side.
- If the cluster is SSL-enabled, the developer needs to import the corresponding certificates into the jre folder of each developer client installation using keytool commands.
- The developer also needs to export the cluster configuration XML from the Informatica Admin Console, manually extract the zip file, and place the extracted files in the 'conf' folder under the appropriate Hadoop distribution folder of the developer client installation.
- The developer also needs to update the variable 'INFA_HADOOP_DIST_DIR', defined in the 'developerCore.ini' file of each client installation, when connecting to a Hadoop distribution other than the default Cloudera version.
- Finally, after performing the above steps (needed on each developer client box), the developer can launch the import wizard for the data source in the developer client to import and save the metadata.
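To make the Kerberos and SSL steps above concrete, here is a minimal Python sketch that only builds the kinit and keytool command lines a developer would have had to run on each client box. The helper names, file paths, principal, alias, and password are hypothetical placeholders; the flags shown are the standard keytab options of kinit and the standard certificate-import options of keytool, and the exact invocation in a given environment may differ.

```python
from typing import Dict, List


def kinit_command(keytab_path: str, principal: str) -> List[str]:
    """Build the kinit call that obtains and caches a ticket-granting ticket."""
    # -k authenticates with a keytab instead of a password; -t names the keytab file.
    return ["kinit", "-k", "-t", keytab_path, principal]


def keytool_import_command(cert_path: str, alias: str,
                           truststore_path: str, storepass: str) -> List[str]:
    """Build the keytool call that imports one cluster certificate."""
    return [
        "keytool", "-importcert",
        "-alias", alias,
        "-file", cert_path,
        "-keystore", truststore_path,
        "-storepass", storepass,
        "-noprompt",  # do not ask for interactive confirmation
    ]


def manual_commands_per_box(client_boxes: List[str], keytab_path: str, principal: str,
                            cert_path: str, alias: str, truststore_path: str,
                            storepass: str) -> Dict[str, List[List[str]]]:
    """Map every client box to the commands that had to be run on it by hand."""
    return {
        box: [
            kinit_command(keytab_path, principal),
            keytool_import_command(cert_path, alias, truststore_path, storepass),
        ]
        for box in client_boxes
    }
```

The sketch does not execute anything. It only shows that the same sensitive keytab and the same certificate import had to be repeated once per client box.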
As the steps above show, repeating this configuration on every developer client box is a significant burden, since a customer may have tens or even hundreds of developer client installations. The Metadata Access Service was introduced to ease this configuration and to provide an improved security architecture for metadata access to Hadoop data sources.
Metadata access using Metadata Access Service
The Metadata Access Service enables metadata access to Hadoop data sources (HDFS files, HBase, Hive, and MapR-DB) from the developer client tool. It is a mandatory service that must be created before any metadata can be accessed from the listed Hadoop sources through the developer tool. The service can be created either in the Informatica Admin Console or from the command line with the infacmd tool.
The Metadata Access Service needs to be configured only once by the Informatica Administrator, similar to existing services such as the Data Integration Service, which Informatica mappings use for data access. If a single Metadata Access Service is configured and enabled, developer client installations pick it up automatically. Otherwise, the default Metadata Access Service can be selected once at the developer client level, and that selection is cached until it is changed. The Metadata Access Service offers several features and advantages over accessing metadata directly from the developer client tool.
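The service-selection behavior described above can be sketched as follows. This is a minimal illustration, not the developer client's actual code: the MetadataAccessService class and resolve_service function are hypothetical names, and the sketch assumes that "single" means exactly one enabled service and that a cached default pointing at a disabled service is ignored.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MetadataAccessService:
    name: str
    enabled: bool


def resolve_service(services: List[MetadataAccessService],
                    cached_default: Optional[str]) -> Optional[str]:
    """Return the service name the client should use, or None if the developer must choose."""
    enabled = [s.name for s in services if s.enabled]
    if len(enabled) == 1:
        # A single configured and enabled service is picked up automatically.
        return enabled[0]
    if cached_default in enabled:
        # An earlier developer selection stays in effect until it is changed.
        return cached_default
    # Several (or no) enabled services and no usable cached choice.
    return None
```

For example, with services "MAS_A" and "MAS_B" both enabled and no cached selection, resolve_service returns None, which corresponds to the developer being asked to pick a default once.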
- The Metadata Access Service allows Kerberos-specific attributes, such as the keytab location and principal name, to be configured in a single place. There is no longer any need to run the kinit command on developer client boxes, because the keytab file location and Service Principal Name are provided in the MAS configuration in the Admin Console. This is also more secure, since sensitive information no longer needs to be distributed to developer boxes.
- Multiple Metadata Access Services can be configured, either for load balancing (to reduce the load on a single service process) or to connect to a different Hadoop distribution type (Cloudera, Hortonworks, MapR, etc.) or to a system that requires a different configuration (keytab, Service Principal Name). Hence, steps such as downloading cluster configuration files into the local Hadoop distribution folder on the developer client are no longer needed. When multiple services are configured, the developer can select the appropriate Metadata Access Service in the developer client as the default service for the current developer client session, similar to how the default DIS is selected.
- Configuring access to SSL-enabled clusters is also simplified, because the SSL certificates for the clusters need to be imported (using the keytool command) only on the single node where MAS is configured to run, not on each developer client box.
- As with DIS, the option to use the 'logged in user as impersonation user' can be enabled, so that the credentials of the user currently logged in to the developer client are used when accessing Hadoop resources.
- Centralized logging is supported. Any error message related to metadata access is captured and persisted in a central location, in addition to being shown in a pop-up dialog box as before, just as for other services such as DIS. These messages can be viewed, with features such as filtering, in the service log console of the Informatica Admin Console. Without the Metadata Access Service, error messages were shown only in a pop-up dialog in the developer tool and could not be retrieved once the developer closed the dialog.
- The Metadata Access Service can be configured to communicate with the developer client tool over either HTTP or the more secure HTTPS protocol, just like other Informatica services such as DIS. The appropriate port (HTTP/HTTPS) and, if HTTPS is enabled, the keystore/truststore file locations and passwords can be set as part of the Metadata Access Service configuration through the Admin Console or infacmd.
- A backup node is also supported for high availability of the Metadata Access Service. If the primary node where the service runs goes down, the service should come back up on the backup node automatically.
- The local Hadoop distribution folders on the developer client boxes are now used during metadata access only when metadata is read from a local file on the same box where the developer client is installed, e.g. when a local Avro or Parquet file is used to import metadata without a Hadoop File System connection object. Hence, the variable 'INFA_HADOOP_DIST_DIR' needs to be configured (or updated) only if metadata is imported from a local file. When a connection is used to import metadata, this variable no longer needs to be configured on the developer client side. The Informatica Hadoop distribution folder on the developer client is also significantly smaller (by more than 1 GB), because most Hadoop distribution jars and files now need to be deployed only as part of the Informatica server-side installation.
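The rule in the last point, that 'INFA_HADOOP_DIST_DIR' matters on the client only for imports from local files, can be expressed as a small check. This is an illustrative sketch only: MetadataImport and client_needs_hadoop_dist_dir are hypothetical names, not part of any Informatica API, and an import with no connection is assumed to mean a local file read on the client box.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MetadataImport:
    resource: str                     # file or object whose metadata is imported
    connection: Optional[str] = None  # connection object name; None means a local file


def client_needs_hadoop_dist_dir(imports: List[MetadataImport]) -> bool:
    """True if at least one import reads a local file without a connection object."""
    return any(item.connection is None for item in imports)
```

Under these assumptions, a client that imports only through connection objects gets False and can leave the variable unset, while a single local Avro or Parquet import flips the result to True.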
The Metadata Access Service is a significant architectural improvement that brings better security and easier configuration to the Informatica Big Data Management client-side developer tool. It reduces the time Informatica developers and administrators need to spend configuring connectivity to Hadoop adapters such as HDFS files, HBase, Hive, and MapR-DB for importing metadata into the repository.
Sandeep Kathuria, Senior Staff Engineer