Microsoft Azure Data Lake Storage Gen2 sources in mappings

In a mapping, you can configure a source transformation to represent a single Microsoft Azure Data Lake Storage Gen2 object.
The following table describes the Microsoft Azure Data Lake Storage Gen2 source properties that you can configure in a source transformation:
Property
Description
Connection
Name of the source connection. Select a source connection or click New Parameter to define a new parameter for the source connection.
When you switch between a non-parameterized and a parameterized Microsoft Azure Data Lake Storage Gen2 connection, the advanced property values are retained.
Source Type
Select Single Object or Parameter.
Object
Name of the source object.
Ensure that the header and the file data do not contain special characters.
Parameter
Select an existing parameter for the source object or click New Parameter to define a new parameter for the source object. The Parameter property appears only if you select Parameter as the source type.
Format
Specifies the file format that the Microsoft Azure Data Lake Storage Gen2 Connector uses to read data from Microsoft Azure Data Lake Storage Gen2.
You can select the following file format types:
- Delimited
- Avro
- Parquet
- JSON
- ORC
- Discover Structure2
Default is None. If you select None as the format type, Microsoft Azure Data Lake Storage Gen2 Connector reads data from Microsoft Azure Data Lake Storage Gen2 files in binary format.
Note: Ensure that the source file is not empty.
Intelligent Structure Model2
Applicable to Discover Structure format type. Select the intelligent structure model.
For more information, see Components.
Formatting Options
Mandatory. Microsoft Azure Data Lake Storage Gen2 format options. Opens the Formatting Options dialog box to define the format of the file.
Configure the following format options:
- Schema Source: Specify the source of the schema. You can select the Read from data file or the Import from schema file option.
  If you select an Avro, JSON, ORC, or Parquet format type and select the Read from data file option, you cannot configure the delimiter, escapeChar, qualifier, and qualifier mode options.
  For any format type, if you select the Import from schema file option, you can only upload a schema file in the Schema File property field. You cannot configure the delimiter, escapeChar, qualifier, and qualifier mode options.
- Data elements to sample2: Applicable only when you read JSON files in elastic mappings. Specify the number of rows to read to find the best match to populate the metadata.
- Memory available to process data2: Applicable only when you read JSON files in elastic mappings. The memory that the parser uses to read the JSON sample schema and process it.
  The default value is 2 MB. If the file size is more than 2 MB, you might encounter an error. Set the value to the size of the file that you want to read.
- Schema File: You can upload a schema file.
- Delimiter: Character used to separate columns of data. You can set the value to a comma, tab, colon, semicolon, or other character.
  You cannot enter a tab directly in the Delimiter field. To set a tab as the delimiter, type the tab character in a text editor, and then copy and paste it into the Delimiter field.
- EscapeChar: Character immediately preceding a column delimiter character embedded in an unquoted string, or immediately preceding the quote character in a quoted string.
- Qualifier: Quote character that defines the boundaries of data. You can set the qualifier to a single quote or a double quote.
- Qualifier Mode1: This property is not applicable when you read data from a Microsoft Azure Data Lake Storage Gen2 source.
- Code Page: Select the code page that the Secure Agent must use to read or write data.
  Microsoft Azure Data Lake Storage Gen2 Connector supports only UTF-8. Ignore the other code pages.
- Header Line Number1: Specify the line number that you want to use as the header when you read data from Microsoft Azure Data Lake Storage Gen2. You can also read data from a file that does not have a header. To read data from a file with no header, set the Header Line Number field to 0.
- First Data Row1: Specify the line number from which you want the Secure Agent to read data. You must enter a value that is greater than or equal to one.
  To read data from the header, the values of the Header Line Number and First Data Row fields must be the same. Default is 1.
- Target Header1: This property is not applicable when you read data from a Microsoft Azure Data Lake Storage Gen2 source.
- Distribution Column1: This property is not applicable when you read data from a Microsoft Azure Data Lake Storage Gen2 source.
- maxRowsToPreview: This property is not applicable when you read data from a Microsoft Azure Data Lake Storage Gen2 source.
- rowDelimiter1: This property is not applicable when you read data from a Microsoft Azure Data Lake Storage Gen2 source.
Note: You cannot preview data for hierarchical data types in elastic mappings.
1Applies only to mappings.
2Applies only to elastic mappings.
The remaining properties are applicable for both mappings and elastic mappings.
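As an illustration of how the Header Line Number and First Data Row options interact, here is a minimal Python sketch. This is hypothetical illustrative code, not part of the connector: lines are 1-indexed, a header line number of 0 means the file has no header, and setting First Data Row equal to Header Line Number reads the header line as data.

```python
import csv

def read_delimited(text, header_line=1, first_data_row=2, delimiter=","):
    """Sketch of Header Line Number / First Data Row semantics.

    Lines are 1-indexed. header_line=0 means the file has no header,
    so placeholder column names are generated.
    """
    rows = list(csv.reader(text.splitlines(), delimiter=delimiter))
    if header_line == 0:
        header = [f"col{i}" for i in range(len(rows[0]))]
    else:
        header = rows[header_line - 1]
    return header, rows[first_data_row - 1:]

sample = "id,name\n1,apple\n2,banana\n"

# Header on line 1, data starting on line 2.
header, data = read_delimited(sample)
# Header Line Number equal to First Data Row: the header line is also read as data.
_, with_header_row = read_delimited(sample, header_line=1, first_data_row=1)
# Header Line Number 0: the file is treated as having no header.
no_header, _ = read_delimited(sample, header_line=0, first_data_row=1)
```
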
The following table describes the Microsoft Azure Data Lake Storage Gen2 source advanced properties:
Property
Description
Concurrent Threads1
Number of concurrent connections used to extract data from Microsoft Azure Data Lake Storage Gen2. When you read a large file or object, you can spawn multiple threads to process data. Configure Block Size to divide a large file into smaller parts.
Default is 4. Maximum is 10.
Filesystem Name Override
Overrides the default file system name.
Source Type
Select the type of source from which you want to read data. You can select the following source types:
- File
- Directory
Default is File.
Allow Wildcard Characters
Indicates whether you want to use wildcard characters for the directory source type.
For more information, see Wildcard characters.
Directory Override
Microsoft Azure Data Lake Storage Gen2 directory that you use to read data. Default is root directory. The directory path specified at run time overrides the path specified while creating a connection.
You can specify an absolute or a relative directory path:
- Absolute path: The Secure Agent searches this directory path in the specified file system.
  Example of an absolute path: Dir1/Dir2
- Relative path: The Secure Agent searches this directory path in the native directory path of the object.
  Example of a relative path: /Dir1/Dir2
  When you use a relative path, the imported object path is added to the file path used during the metadata fetch at run time.
File Name Override
Source object. Select the file from which you want to read data. The file specified at run time overrides the file specified in Object.
Block Size1
Applicable to the flat file format. Divides a large file into smaller parts of the specified block size. When you read a large file, divide the file into smaller parts and configure concurrent connections to spawn the required number of threads to process the data in parallel.
Specify an integer value for the block size.
The default value for a flat file is 8388608 bytes and the maximum value is 104857600 bytes.
Timeout Interval
Not applicable.
Recursive Directory Read
Indicates whether you want to read objects stored in subdirectories in mappings and elastic mappings.
For more information, see Reading files from subdirectories.
Compression Format
Reads compressed data from the source. Select Gzip to read flat files. Select None to read Avro, ORC, and Parquet complex files that use Snappy compression.
Note: Extension for flat files must be .GZ and extension for complex files must be .Snappy to read compressed files.
You cannot read compressed JSON files.
You cannot preview data for a compressed flat file.
Interim Directory1
Optional. Applicable to flat files and JSON files.
Path to the staging directory in the Secure Agent machine.
Specify the staging directory where you want to stage the files when you read data from Microsoft Azure Data Lake Storage Gen2. Ensure that the directory has sufficient space and you have write permissions to the directory.
Default staging directory is /tmp.
You cannot specify an interim directory when you use the Hosted Agent.
Tracing Level
Sets the amount of detail that appears in the log file. You can choose terse, normal, verbose initialization or verbose data. Default is normal.
1Applies only to mappings.
The remaining properties are applicable for both mappings and elastic mappings.
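The Concurrent Threads and Block Size properties describe a common pattern: split a large file into fixed-size blocks and read the blocks on a pool of threads. The following Python sketch illustrates that pattern under the documented defaults (8388608-byte blocks, 4 threads). It is a hypothetical illustration, not the connector's implementation.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_in_blocks(path, block_size=8_388_608, threads=4):
    """Read a file as fixed-size blocks on a thread pool.

    Mirrors the documented defaults: 8388608-byte blocks, 4 concurrent threads.
    """
    size = os.path.getsize(path)

    def read_block(offset):
        # Each thread opens its own handle so seek positions do not interfere.
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(block_size)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map preserves offset order, so joining the blocks rebuilds the file.
        return list(pool.map(read_block, range(0, size, block_size)))

# Demo with a small temporary file and an artificially small block size.
payload = b"0123456789" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
blocks = read_in_blocks(tmp.name, block_size=1024, threads=4)
assert b"".join(blocks) == payload
os.unlink(tmp.name)
```
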

Directory source in Microsoft Azure Data Lake Storage Gen2 sources

You can select the type of source from which you want to read data.
From the Source Type option under the advanced source properties, you can select File or Directory as the source type.
Use the following rules and guidelines to select Directory as the source type:

Wildcard characters

When you read data from an Avro, flat, JSON, ORC, or Parquet file, you can use wildcard characters to specify the source file name.
To use wildcard characters for the source file name, select the source type as Directory and enable the Allow Wildcard Characters option in the advanced source properties.
When you run a mapping or an elastic mapping to read an Avro, JSON, ORC, Parquet, or flat file, you can use the ? and * wildcard characters to define one or more characters in a search.
You can use the following wildcard characters:
? (Question mark)
The question mark character (?) allows one occurrence of any character. For example, if you enter the source file name as a?b.txt, the Secure Agent reads data from files with names such as a1b.txt and axb.txt.
* (Asterisk)
The asterisk character (*) allows zero or more occurrences of any character. If you enter the source file name as a*b.txt, the Secure Agent reads data from files with names such as ab.txt, a1b.txt, and a123b.txt.
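The ? and * semantics described above correspond to standard glob-style matching, which can be illustrated with Python's fnmatch module. The file names below are hypothetical examples, not connector output.

```python
from fnmatch import fnmatch

files = ["ab.txt", "a1b.txt", "axb.txt", "a12b.txt", "abc.txt"]

# ? matches exactly one character, so ab.txt and a12b.txt do not match.
question = [f for f in files if fnmatch(f, "a?b.txt")]

# * matches zero or more characters, so everything ending in b.txt matches.
asterisk = [f for f in files if fnmatch(f, "a*b.txt")]
```

Here `question` contains only a1b.txt and axb.txt, while `asterisk` also picks up ab.txt and a12b.txt; abc.txt matches neither pattern because it does not end in b.txt.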

Rules and guidelines for wildcard characters

Consider the following rules and guidelines when you use wildcard characters:

Reading files from subdirectories

You can read objects stored in subdirectories in Microsoft Azure Data Lake Storage Gen2 in mappings and elastic mappings.
You can use recursive read for flat files in mappings and for complex files in mappings and elastic mappings.
To enable recursive read, select the source type as Directory in the advanced source properties. Enable the Recursive Directory Read advanced source property to read objects stored in subdirectories.
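The difference between a plain directory read and a recursive directory read can be sketched in Python. This is a hypothetical illustration of the behavior, not the connector's implementation: without recursion only the top-level files are listed, and with recursion files in subdirectories are included as well.

```python
import os
import tempfile

def list_files(directory, recursive=False):
    """List files under a directory.

    With recursive=False only the top level is scanned; with recursive=True
    subdirectories are walked too, mirroring Recursive Directory Read.
    """
    if not recursive:
        return sorted(
            name for name in os.listdir(directory)
            if os.path.isfile(os.path.join(directory, name))
        )
    found = []
    for root, _dirs, names in os.walk(directory):
        for name in names:
            found.append(os.path.relpath(os.path.join(root, name), directory))
    return sorted(found)

# Demo: one top-level file and one file inside a subdirectory.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "top.csv"), "w").close()
    os.mkdir(os.path.join(d, "sub"))
    open(os.path.join(d, "sub", "nested.csv"), "w").close()
    flat = list_files(d)
    deep = list_files(d, recursive=True)
```
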

Rules and guidelines for reading from subdirectories

Consider the following rules and guidelines when you read objects stored in subdirectories:

Pushdown optimization

You can enable full pushdown optimization when you want to load data from Microsoft Azure Data Lake Storage Gen2 sources to your data warehouse in Microsoft Azure Synapse SQL. While loading the data to Microsoft Azure Synapse SQL, you can transform the data according to your data warehouse model and requirements. When you enable full pushdown on a mapping task, the mapping logic is pushed to the Azure environment to leverage Azure commands. For more information, see the help for Microsoft Azure Synapse SQL Connector.
If you need to load data to any other supported cloud data warehouse, see the connector help for the applicable cloud data warehouse.