Performance issues are seen when processing huge EBCDIC files in Hive pushdown mode. The mapping has a Complex Data Object as the source to read the EBCDIC file in binary mode, followed by a Data Processor streamer that chunks the input data and converts it to relational format, and finally writes the data to a flat file in HDFS.
Hadoop's parallel distributed computing cannot be leveraged because only one map job is spawned, and it reads the entire EBCDIC binary file.
This document discusses some performance tuning steps for processing such huge EBCDIC files in Hadoop pushdown mode. The EBCDIC files assumed in this article contain fixed-length records based on COBOL copybooks.
Suggestions to Improve Performance
For the mapping described above, the following options can improve performance:
- In the streamer Data Processor, look for the “count” property used when segmenting the binary input under repeating_segment. Set the count property to define the number of records that the Data Integration Service must treat as a batch. When the count property is set, the Data Processor Engine is called once per batch of records instead of once per record, and this batch processing improves performance.
- Use the custom input format “org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat” to split the binary records into equal lengths. It can be configured as the custom input format under the Complex File Reader, so that the EBCDIC file becomes splittable on multiples of the single record length and a separate map job is created for each split. This helps only if your data has fixed-length records in EBCDIC format. If the records are variable length, this approach does not help.
- Configure the maximum and minimum input split size so that the file is divided into multiple input splits, each of which gets its own map job.
- com.informatica.hadoop.reader.RegexInputFormat is also available as a custom input format value to help with the split, but I am not sure whether a regex can be constructed given that the data is in EBCDIC format.
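To illustrate why fixed-length records make a binary file splittable, here is a minimal Python sketch of the general idea. It is not the Hadoop or Informatica implementation. It assumes that the file is cut into plain byte ranges and that each split owns every record that begins inside its range. The 1,000-record file and the 64 KB split size are hypothetical values chosen only to keep the output short.

```python
RECORD_LENGTH = 1026  # fixed record length in bytes (the value used in this article)


def byte_splits(file_size, split_size):
    """Cut the file into plain byte ranges of at most split_size bytes."""
    return [(start, min(start + split_size, file_size))
            for start in range(0, file_size, split_size)]


def records_for_split(start, end, record_length=RECORD_LENGTH):
    """Return (first_record, last_record_exclusive) for one split.

    A split owns every record that begins inside [start, end), so a reader
    skips forward to the next record boundary before it starts reading.
    """
    first = -(-start // record_length)  # ceiling division
    last = -(-end // record_length)
    return first, last


if __name__ == "__main__":
    file_size = 1000 * RECORD_LENGTH  # hypothetical 1,000-record file
    split_size = 64 * 1024            # hypothetical small split, for illustration
    for start, end in byte_splits(file_size, split_size):
        first, last = records_for_split(start, end)
        print(f"bytes [{start}, {end}) -> records [{first}, {last})")
```

Because every record boundary can be computed from the record length alone, each split can be read independently and no record is read twice or skipped. With variable-length records this calculation is not possible, which is why the approach does not help there.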
Steps to Improve Performance by Spawning Multiple Map Jobs
We will use the custom input format class “org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat” to split the input. Note that the class file for this input format is already part of the jars shipped by the various Hadoop distribution vendors, so you need not copy it to the services/shared/hadoop/<distro>/infaLib directory.
Here is the proof from the class finder utility.
Now, the detailed steps:
1. Add the snippet below to the core-site.xml file under the services/shared/hadoop/<your distro>/conf directory. This is where the fixed record length (1026 in my case) is specified.
2. Open hadoopEnv.properties under the services/shared/hadoop/<your distro>/InfaConf directory and add the core-site.xml file to infapdo.aux.jars.path as shown below.
3. In the mapping runtime properties, override the input split size so that multiple map jobs are created. In my case, the dfs block size is 128 MB, so to get an input split size of 64 MB I set the values below in the mapping runtime properties.
The split size is calculated by the formula:
max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))
mapred.min.split.size : value 33554432 (32 MB)
mapred.max.split.size : value 67108864 (64 MB)
I have also set the number of mappers and reducers as shown below.
4. As of version 10.1.1, the Complex File Reader, when used with an input format, prepends the record size to the buffer that it sends out, so the parser must skip it. You can see the highlighted section below, where I have skipped the record size (4 bytes) under the repeating group in the Data Processor script generated by the COBOL to relational wizard.
5. Open the Streamer Data Processor and set the offset to split as (fixed record length + bytes needed to store the size of the record) = (1026 + 4) = 1030 in my case. Set it as shown below.
6. Set the custom input format under the Complex File Reader to “org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat”.
7. Adjust port precision depending on your record size. I am attaching my sample mapping here.
8. You can also set the count to greater than 1 to enable batch processing by Streamer Data Processor.
9. Run the mapping in Hadoop pushdown mode using the Hive engine and check whether multiple map jobs are spawned.
10. Tune the performance by adjusting the input split size and also the batch processing count in the streamer.
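The arithmetic in steps 3 to 5 can be checked with a short Python sketch. It uses the split-size formula and the values given above. The helper names and the 10 GB file size are hypothetical. The split count is only a rough estimate, since the actual number of map jobs depends on how Hadoop rounds the last split. The prefix is skipped without being interpreted, because its byte order is not stated in this article.

```python
import math

MB = 1024 * 1024
RECORD_LENGTH = 1026  # fixed record length from step 1
PREFIX_BYTES = 4      # size field prepended by the Complex File Reader (step 4)


def split_size(min_split, max_split, block_size):
    """max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))"""
    return max(min_split, min(max_split, block_size))


def estimated_splits(file_size, size):
    """Rough number of input splits (and so map jobs) for one file."""
    return math.ceil(file_size / size)


def strip_length_prefix(chunk, prefix_bytes=PREFIX_BYTES):
    """Drop the leading size field so that only the record bytes remain."""
    return chunk[prefix_bytes:]


# Values from step 3: min 32 MB, max 64 MB, dfs block size 128 MB.
size = split_size(33554432, 67108864, 128 * MB)

# Step 5: each streamer segment is the record plus its size prefix.
streamer_offset = RECORD_LENGTH + PREFIX_BYTES

if __name__ == "__main__":
    print(size // MB, "MB split size")
    print(estimated_splits(10 * 1024 * MB, size), "splits for a hypothetical 10 GB file")
    print(streamer_offset, "bytes per streamer segment")
```

Lowering mapred.max.split.size produces more, smaller splits. Raising mapred.min.split.size above the block size produces fewer, larger ones. This is the knob that step 10 refers to.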
In 10.1.1, the record length needs to be set as part of the core-site.xml file. So, if you need to process multiple EBCDIC files with different record lengths, there are currently only crude workarounds. You can either create multiple Data Integration Services, or use fixed-length binary record format code available on the Internet, compile it, and place it under the services/shared/hadoop/<distro>/infaLib directory with different package names for the different hard-coded record lengths. Sample code: https://gist.github.com/freeman-lab/98d9096695e794391ab9. This custom input format code is taken from GitHub and is not owned by Informatica. GCS will not be responsible for any issues or bug fixes with this format.
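If you maintain one core-site.xml per Data Integration Service, the property fragment for each record length can be generated rather than typed by hand. The sketch below is an assumption, not the snippet from step 1. It assumes the key is Hadoop's standard fixedlengthinputformat.record.length, which you should verify against your distribution. The helper name and the second record length (2048) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the standard Hadoop key read by FixedLengthInputFormat.
RECORD_LENGTH_KEY = "fixedlengthinputformat.record.length"


def record_length_property(record_length):
    """Build the <property> fragment that sets the fixed record length."""
    if record_length <= 0:
        raise ValueError("record length must be a positive number of bytes")
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = RECORD_LENGTH_KEY
    ET.SubElement(prop, "value").text = str(record_length)
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    for length in (1026, 2048):  # 2048 is a hypothetical second record length
        print(record_length_property(length))
```

Each generated fragment would go inside the configuration element of the core-site.xml file that belongs to the corresponding Data Integration Service.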
Tested in Product & Version: BDM 10.1.1 Update2
Author Name : Sugi