Metadata Load Performance
If there are many folders containing multi-part files and/or time-partitioned folders, metadata load runtimes can be affected. Tuning the metadata load parameters, such as the maximum number of connections and the metadata extraction scanner memory parameter, is one option. Alternatively, consider a different approach to creating and managing Resources, so that you are not running a full scan at the bucket level or against a single source directory in a bucket.
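Before splitting a bucket-level Resource into narrower ones, it can help to gauge the scan scope: how many distinct data file layouts (parent prefixes) exist, and how many part files sit under each. The sketch below is a hypothetical helper, not an Informatica API; the key list would come from an S3 listing (e.g. boto3's list_objects_v2 paginator):

```python
from collections import defaultdict

def summarize_layouts(keys):
    """Group S3 object keys by their parent prefix (one "data file layout")
    and count the part files under each.

    Multi-part, time-partitioned data typically looks like:
        sales/dt=2020-01-01/part-00000.parquet
        sales/dt=2020-01-01/part-00001.parquet
    """
    layouts = defaultdict(int)
    for key in keys:
        prefix, _, _filename = key.rpartition("/")
        layouts[prefix] += 1
    return dict(layouts)

# Hypothetical key listing, as returned by an S3 bucket inventory.
keys = [
    "sales/dt=2020-01-01/part-00000.parquet",
    "sales/dt=2020-01-01/part-00001.parquet",
    "sales/dt=2020-01-02/part-00000.parquet",
    "events/dt=2020-01-01/part-00000.parquet",
]
summary = summarize_layouts(keys)
# Prefixes with the most part files are candidates for their own,
# more narrowly scoped Resource instead of a full bucket scan.
widest = max(summary, key=summary.get)
```

A summary like this makes it easier to decide which source directories to carve out into separate Resources.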
Are there any metrics for the Amazon S3 Parquet file scanner, to be released with EDC 10.4.1, regarding the combination of the number of multi-part files per data file layout and the number of data file layouts from which the Amazon S3 scanner can extract object metadata before scanner performance degrades?
For XML, JSON, Avro, and Parquet resources, the user must choose the All Rows sampling option. Because of this All Rows sampling restriction, there may not be sufficient resources to complete profiling. Tuning the Data Integration Service parameters, the profiling warehouse parameters, and the Data Integration Service concurrency parameters are possible options.
With regard to Amazon S3 profiling of Parquet files, are there specific thresholds for the number of Parquet multi-part files and time-partitioned folders beyond which performance tuning requires attention?
Please raise a support ticket with Informatica GCS to review the use case and research this question.