Purpose
A source may contain duplicate records; this template illustrates how to remove duplicates before proceeding through the mapping.
Usage
Remove the duplicate records from a source when the source records contain a functional key. The records must contain a functional "group by" key.
Overview
The mapping illustrates the concept of using the functionality within an Aggregator transformation to remove duplicate records from a source and load this data into a target table of the same structure.
Challenge Addressed
Duplicate records are occasionally found in source data. Due to primary key constraints on a target database, only one version of a duplicate source record should be loaded into the target. This mapping illustrates one alternative for removing duplicate records when the source, flat file, has a functional key that can be used for grouping. For a relational source, you may select distinct records in the source qualifier to remove duplicate records.
Download
The example contains the mapping, session and workflow.
- PowerCenter 9.6.x Mapping XML (See Attachments section below)
- PowerCenter 9.1.0 Mapping XML (See Attachments section below)
- PowerCenter 8.6.1 Mapping XML (See Attachments section below)
Implementation Guidelines
The PowerCenter Aggregator transformation can be used to "group" records based on a logical key. In this example mapping, the source data is Nielsen company data that resides in a flat file that contains duplicates. The functional key to each source record is the STATE_TAX_ID field.
The key concept illustrated by this mapping can be found in the following transformation objects:
Transformation Name | Transformation Type | Description |
sq_NIELSEN | Source Qualifier or Application Source Qualifier | Selects all records from the flat file source. |
srt_STATEID | Sorter | A sorter transformation is used to order all the rows by STATE_TAX_ID (the functional key or group-by-column in the Aggregator transformation) so the “sorted ports” option can be used in the mapping. |
agg_REMOVE_DUP_RECORDS | Expression | In aggregator tranformation, the STATE_TAX_ID port is selected as the "Group By" port in the transformation (the number of "Group By" ports must correspond to the number of sorted ports indicated in the sorter transformation). The Integration Service, by default, returns the last row in a group if no aggregate function is specified. If two records with the same value for STATE_TAX_ID enter the Aggregator, only one record will be returned by the Aggregator. As a result, duplicate source records are eliminated. |
Note that the SORTED INPUT option is chosen in the aggregator so that only one group’s worth of data will be cached in memory by the PowerCenter Integration Service which will increases in Aggregator performance). Please refer to the metadata descriptions in the mapping for more details on this mapping’s functionality.
Note:
- The Mappings were originally built using PowerCenter Designer. If you are using PowerCenter Express (PCX), all the mappings cannot be imported as PCX includes only the Informatica Developer tool and not PowerCenter Designer tool. For example, Informatica Developer doesn’t support mappings that use “Sequence Generator” transformation.
- The objects are based on “UTF-8” codepage. If you have changed you codepage, XML may need to be edited.
Comments