Removing Duplicate Source Records

Version 6

    Purpose

    A source may contain duplicate records; this template illustrates how to remove duplicates before proceeding through the mapping.

    Usage

    Use this template to remove duplicate records from a source when the source records contain a functional "group by" key.

    Overview

    The mapping illustrates how an Aggregator transformation can be used to remove duplicate records from a source and load the data into a target table of the same structure.

     

    [Mapping diagram: remove duplicate source records.png]

    Challenge Addressed

     

    Duplicate records are occasionally found in source data. Because of primary key constraints on the target database, only one version of a duplicate source record should be loaded into the target. This mapping illustrates one alternative for removing duplicate records when the source, a flat file, has a functional key that can be used for grouping. For a relational source, you can instead select distinct records in the Source Qualifier to remove duplicate records.
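    Note the difference between the two approaches: the relational "Select Distinct" option only drops rows that are identical in every column, while this template keeps one row per functional key. A minimal Python sketch of the distinct-rows idea (the file name and column layout below are assumptions for illustration, not part of the template):

        import csv

        # Assumed sample flat file; the template's actual source is a Nielsen flat file.
        with open("nielsen_source.csv", newline="") as f:
            rows = list(csv.DictReader(f))

        # "Select Distinct" analogue: drop rows that are identical in every column.
        distinct = list({tuple(sorted(r.items())): r for r in rows}.values())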

     

    Download

     

    The example contains the mapping, session and workflow.

     

    • PowerCenter 9.6.x Mapping XML (See Attachments section below)
    • PowerCenter 9.1.0 Mapping XML (See Attachments section below)
    • PowerCenter 8.6.1 Mapping XML (See Attachments section below)

    Implementation Guidelines

     

     

    The PowerCenter Aggregator transformation can be used to group records based on a logical key. In this example mapping, the source is Nielsen company data in a flat file that contains duplicates; the functional key for each source record is the STATE_TAX_ID field.
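    As a rough analogy (an illustration only, not part of the shipped template; the file handling below is assumed, while the STATE_TAX_ID field name comes from the mapping), keeping one record per functional key looks like this in Python:

        import csv

        def remove_duplicates(path):
            """Keep one record per STATE_TAX_ID, mirroring an Aggregator with a single
            Group By port and no aggregate functions (the last row in each group wins)."""
            kept = {}
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    kept[row["STATE_TAX_ID"]] = row  # a later duplicate overwrites the earlier one
            return list(kept.values())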

     

    The key concept illustrated by this mapping can be found in the following transformation objects:

     

    Transformation Name: sq_NIELSEN
    Transformation Type: Source Qualifier (or Application Source Qualifier)
    Description: Selects all records from the flat file source.

    Transformation Name: srt_STATEID
    Transformation Type: Sorter
    Description: Orders all rows by STATE_TAX_ID (the functional key, which is the group-by column in the Aggregator transformation) so that the "sorted ports" option can be used in the mapping.

    Transformation Name: agg_REMOVE_DUP_RECORDS
    Transformation Type: Aggregator
    Description: The STATE_TAX_ID port is selected as the "Group By" port in this transformation (the number of "Group By" ports must correspond to the number of sorted ports configured in the Sorter transformation). By default, the Integration Service returns the last row in a group when no aggregate function is specified, so if two records with the same STATE_TAX_ID value enter the Aggregator, only one is returned. As a result, duplicate source records are eliminated.

     

    Note that the Sorted Input option is selected in the Aggregator so that only one group's worth of data is cached in memory by the PowerCenter Integration Service, which improves Aggregator performance. Refer to the metadata descriptions in the mapping for more details on this mapping's functionality.
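    The effect of sorted input can be pictured the same way: once rows arrive ordered by the group key, each group can be emitted as soon as it ends, so only one group is ever buffered. A small Python sketch of that idea (again an analogy, not the template itself):

        from itertools import groupby
        from operator import itemgetter

        def dedupe_sorted(rows):
            """rows must already be ordered by STATE_TAX_ID (the Sorter's job).
            Each group is consumed as it streams past, so only one group's rows are
            held at a time -- the same idea as the Aggregator's Sorted Input option."""
            for _, group in groupby(rows, key=itemgetter("STATE_TAX_ID")):
                *_, last = group  # keep the last row of each group
                yield last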

     

    Note:

    • The mappings were originally built using PowerCenter Designer. If you are using PowerCenter Express (PCX), not all of the mappings can be imported, because PCX includes only the Informatica Developer tool and not the PowerCenter Designer tool. For example, Informatica Developer does not support mappings that use a Sequence Generator transformation.
    • The objects are based on the UTF-8 code page. If you have changed your code page, the XML may need to be edited.