Add a Match Tx (you could also use an Aggregator if you only want to check for exact matches) after the Address Validator.
If using a Match Tx, filter the results by clusterSize and write any records where clusterSize is > 1 to your flatfile.
If using an Aggregator, you want to write any groups with > 1 record to the flatfile.
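Outside the tool, the Aggregator option boils down to grouping on the standardized address and keeping only the groups with more than one record. A minimal Python sketch of that logic, assuming each record is a dict with a `CompleteAddress` field holding the Address Validator output (the field name, function name and sample rows are made up for illustration):

```python
from collections import defaultdict

def find_exact_duplicates(records, field="CompleteAddress"):
    """Group records by an exact field value; return only groups with > 1 record."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[field]].append(rec)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}

# Made-up sample rows standing in for Address Validator output.
rows = [
    {"id": 1, "CompleteAddress": "12 MAIN ST APT 4, SPRINGFIELD 12345"},
    {"id": 2, "CompleteAddress": "12 MAIN ST APT 5, SPRINGFIELD 12345"},
    {"id": 3, "CompleteAddress": "12 MAIN ST APT 4, SPRINGFIELD 12345"},
]
duplicates = find_exact_duplicates(rows)
```

Here only records 1 and 3 share an identical address string, so they form the single group that would be written to the flatfile.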
I tried using a Match Tx as you have explained above but it was not generating the Group keys properly.
If I understood the documentation for the Match Tx correctly, the Group key lets the Match Tx compare and find duplicates only within the same group. But in my case, I need to find duplicate complete addresses across all the rows, not only within a group.
I tried both the String and Soundex strategies for key generation, but it still didn't assign correct group keys, so the clusters and linkscore were not calculated properly.
FYI, I used Bigram Distance as the "match strategy" (weight 0.5) with the Completeaddress field (from the Address Validator) as "Match Fields" in the Match Tx.
Now I am going to try the second option of using an Aggregator.
Your understanding of the Group Key is correct; however, it is aimed at striking a balance between performance and good match results. While it would be ideal to match every record against every other record in the dataset, with millions of records this can take days.
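To make the trade-off concrete, here is a rough count of pairwise comparisons. It assumes every record is compared once with every other record in its group, which is an illustrative assumption rather than a statement about how the Match Tx works internally; the record and group counts are made up as well:

```python
def pair_count(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def grouped_pair_count(group_sizes):
    """Total comparisons when matching only happens inside each group."""
    return sum(pair_count(size) for size in group_sizes)

# Hypothetical dataset: one million records.
all_pairs = pair_count(1_000_000)              # everything in one big group
grouped = grouped_pair_count([100] * 10_000)   # 10,000 groups of 100 records
```

With these numbers, `all_pairs` is 499,999,500,000 while `grouped` is 49,500,000, roughly a ten-thousand-fold reduction in comparisons.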
When choosing a Group Key, it is recommended not to use a field you will then use in your Match strategies. However, in your specific use case you have standardized the addresses using the AV Tx and only want to find exact duplicates. Therefore I'd suggest building the Group Key from one or more of the fields you will match on.
In fact you could probably just use the Key Generator Tx to perform the task of finding duplicates. If you include enough of the address elements in a composite key, each group should only contain identical records and you can filter the output to your flatfile to groups of > 1 record.
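The composite-key idea can be sketched like this in Python. The field names `street`, `sub_building` and `postcode` are hypothetical stand-ins for the Address Validator outputs, and the normalization (trim and uppercase) is an assumption for illustration, not a description of what the Key Generator Tx does internally:

```python
from collections import defaultdict

KEY_FIELDS = ("street", "sub_building", "postcode")  # hypothetical field names

def composite_key(record, fields=KEY_FIELDS):
    """Build a key from the full value of every field, so records that
    differ in any address element end up with different keys."""
    return tuple((record.get(f) or "").strip().upper() for f in fields)

def duplicate_groups(records):
    """Return lists of record ids that share an identical composite key."""
    groups = defaultdict(list)
    for rec in records:
        groups[composite_key(rec)].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

# Made-up sample rows.
rows = [
    {"id": 1, "street": "12 Main St", "sub_building": "Apt 4", "postcode": "12345"},
    {"id": 2, "street": "12 Main St", "sub_building": "Apt 5", "postcode": "12345"},
    {"id": 3, "street": "12 MAIN ST ", "sub_building": "APT 4", "postcode": "12345"},
]
result = duplicate_groups(rows)
```

Because the key uses the whole value of each field rather than a prefix of it, records 1 and 3 group together while the Apt 5 record stays in a group of its own.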
I tried creating the Group key using multiple fields from the Address Validator, i.e. AddressElementsStreetCompleteWithNumber1, AddressElementsSub_buildingComplete1 and LastLineElementsPostcodeComplete, with the String strategy, but it was unable to parse the Apartment/Unit numbers. It put similar addresses with different Apartment/Unit numbers into the same cluster and tagged them as duplicates, which is not what I expected.