
    Amazon S3 Metadata Extraction and Profiling

    John Quillinan Seasoned Veteran

      Metadata load performance

      If there are many folders with multi-part files and/or time-partitioned folders, metadata load runtimes could be impacted. Configuring metadata load parameters, such as the maximum number of connections and the metadata extraction scanner memory, is one option. Alternatively, I might consider a different approach to how I create and manage my Resources, so that I am not undertaking a full scan at the bucket level or of a single source directory in a bucket; one way to carve up a bucket is sketched below.
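
      A minimal sketch of that carve-up, assuming boto3 is installed and configured with credentials that can list the bucket (the bucket name is hypothetical): list the top-level folders so each can be registered as its own Resource instead of scanning the whole bucket.

        import boto3

        # List the top-level "folders" (common prefixes) in a bucket so each
        # can be registered as a separate Resource, rather than running one
        # metadata scan over the entire bucket.
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")

        for page in paginator.paginate(Bucket="my-data-lake", Delimiter="/"):
            for cp in page.get("CommonPrefixes", []):
                print(cp["Prefix"])  # e.g. "sales/" -- a candidate Resource root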

       

      Are there any metrics for the Amazon S3 Parquet file scanner to be released with EDC 10.4.1 regarding the combination of the number of multi-part files per data file layout and the number of data file layouts from which the Amazon S3 scanner can extract object metadata before scanner performance degrades? In the meantime, a rough inventory of the bucket, sketched below, can help gauge scan scope.
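
      A minimal sketch of that inventory, again assuming boto3 (bucket name hypothetical): count part files per parent folder as a rough proxy for multi-part files per data file layout.

        import boto3
        from collections import Counter

        # Count objects per parent "folder" as a rough proxy for the number
        # of multi-part files per data file layout. Folders with very large
        # counts are the likeliest to slow a full metadata scan.
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")

        parts_per_layout = Counter()
        for page in paginator.paginate(Bucket="my-data-lake"):
            for obj in page.get("Contents", []):
                folder, _, _ = obj["Key"].rpartition("/")
                parts_per_layout[folder] += 1

        for folder, count in parts_per_layout.most_common(10):
            print(f"{folder or '(bucket root)'}: {count} files")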

       

      Profiling performance

      For XML, JSON, Avro, and Parquet resources, the user must choose the All Rows sampling option. Because of this All Rows restriction, there may not be sufficient system resources to complete profiling. Configuring the Data Integration Service parameters, the profiling warehouse parameters, and the Data Integration Service concurrency parameters are possible options; the sketch below estimates the data volume such a profile would have to read.
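
      A minimal sketch of that estimate, again assuming boto3 (bucket and prefix hypothetical): total the bytes under a prefix, since an All Rows profile must read every part file in full.

        import boto3

        # Sum object sizes under a prefix to estimate how much data an
        # All Rows profile would have to read. Large totals suggest tuning
        # the Data Integration Service and profiling warehouse first.
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")

        total_bytes = 0
        file_count = 0
        for page in paginator.paginate(Bucket="my-data-lake", Prefix="sales/"):
            for obj in page.get("Contents", []):
                total_bytes += obj["Size"]
                file_count += 1

        print(f"{file_count} files, {total_bytes / 1024**3:.1f} GiB to profile")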

       

      With regard to Amazon S3 profiling of Parquet files, are there any specific thresholds on the number of Parquet multi-part files and time-partitioned folders beyond which performance tuning requires attention? Inspecting the Parquet footers themselves, as sketched below, can also flag wide or heavily row-grouped files that are expensive to profile.
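
      A minimal sketch of that footer check, assuming pyarrow and s3fs are installed (the object path is hypothetical): read one part file's footer to see row, row-group, and column counts without downloading the data.

        import pyarrow.parquet as pq
        import s3fs

        # Read only the Parquet footer of one part file to see how many
        # rows, row groups, and columns a profile run would have to process.
        fs = s3fs.S3FileSystem()
        path = "my-data-lake/sales/dt=2020-06-01/part-00000.parquet"
        with fs.open(path, "rb") as f:
            meta = pq.ParquetFile(f).metadata
            print(f"rows={meta.num_rows}, row_groups={meta.num_row_groups}, "
                  f"columns={meta.num_columns}")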