4 Replies Latest reply on Mar 15, 2019 11:30 PM by ltrapadoux

    Best practices for data domain curation

    Ken Guyette Guru

      I am curious whether there are any documented best practices, automated methods, or practical experience with curating data domains in EDC 10.2.1.  I am trying to assess the number of curators and the time we may need to allocate for such an exercise.  To date I have found two methods for curating data domains, both of which seem quite time-consuming.


      Please see the attached file for some examples. We are new to this tool and to cataloging in general, which is why we are looking for any tips or tricks to expedite this process, or at least to estimate the resources and time we will need.  Our primary goal is to curate PII-related data domains; eventually we would have dozens of resources and millions of columns across HANA, SQL Server, etc.

        • 1. Re: Best practices for data domain curation
          Scott Lee Active Member

          In terms of your second use case, your PII Steward can navigate to the Data Domain (say, "Last Name") in the catalog through search or browse, and (given correct permissions) they will see a section for Assigned and another for Rejected.  You can do your curation in batches here by multi-selecting within one or the other section, and then clicking the triple-dot icon for that section and choosing to Accept or Reject the set.


          You can also do them one at a time by hovering to the right of each row where a set of check / X icons will appear (this is a non-intuitive interface).


          As far as I can tell, all curation activities are submitted asynchronously, so it can take some time for the UI front-end to reflect your stewardship activities.  But it is way faster to do it on this screen than in the assets.  You cannot sort here, but you can filter by using the little search box.


          In terms of best practice, I like to assign stewards cross-functionally: by Resource where there is a technical steward we can work with (DBA, engineer), and by Domain (or Domain Group) where a business steward (analyst, SME) is available.  The basic job criterion is to ensure all Domain assignments get pushed "up" (Accept) or "out" (Reject).  Every action taken by a steward influences CLAIRE, so take care with your work.


          Finally, if you have the compute power available, I recommend running and saving similarity profiles in the catalog; these distributions work in concert with Domains and Glossary to really make a catalog entry useful to a data seeker.


          Hope this helps...

          • 2. Re: Best practices for data domain curation
            Ken Guyette Guru

            Thanks for the feedback.  I definitely think the first option is better (by data domain) but both are cumbersome at best.


            In terms of the actions influencing CLAIRE, is the effect across the platform or only within the resource being curated?  For example, I have the Age data domain, and I have fields identified as Age across three different resources (HANA, SQL Server, CSV file).  I curate the ones identified in the HANA resource and reject most as false positives.  Will this influence data domain discovery the next time I run it on the SQL Server resource, or will it only impact HANA?


            I am trying to gauge how much curation is necessary for CLAIRE to "learn".  In some of our testing, even after rejecting hundreds of false positives, we tend to pick up more the next time we run discovery, for example on a new schema within HANA.

            • 3. Re: Best practices for data domain curation
              Scott Lee Active Member

              CLAIRE is marketed as the intelligence engine running underneath the entire Informatica platform, not just EDC, not just within a Resource.  However, CLAIRE is a black box, like all ML systems.  It is difficult to predict what any specific curation event will do to it. In my experience though, having Similarity Profiles saved across a range of assets and an up-to-date run of SimilarityDiscovery makes the biggest difference to asset entry quality / completeness.


              I will tell you that tuning the acceptance criteria for the Domains is much more efficient than waiting for Domain Propagation to get smarter.  It is also often more effective to build Domains from Profiles of real data than using the built-in ones; they are pretty, well, stupid.  Personally, I'd rather burn 10 hours of dev-time to save 100 in operations.  We use a lot of reference-table driven Domains with high match thresholds.
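              To illustrate what a reference-table driven Domain with a high match threshold amounts to, here is a minimal Python sketch. It is not EDC code and does not use any Informatica API; the function names, the case-insensitive matching, and the 0.9 threshold are all assumptions made for the example.

```python
def conformance(values, reference):
    """Fraction of non-empty values that appear in the reference table.

    Hypothetical helper: matching is case-insensitive and ignores
    surrounding whitespace. None and blank values are skipped.
    """
    ref = {r.strip().lower() for r in reference}
    cleaned = [v.strip().lower() for v in values if v is not None and v.strip()]
    if not cleaned:
        return 0.0
    hits = sum(1 for v in cleaned if v in ref)
    return hits / len(cleaned)


def matches_domain(values, reference, threshold=0.9):
    """True when the column's conformance meets the (assumed) threshold."""
    return conformance(values, reference) >= threshold


# Illustrative data only.
last_names = ["Smith", "Jones", "Lee"]
column = ["smith", "LEE", "Jones", None, "xyz"]
print(conformance(column, last_names))          # 3 of 4 non-empty values match
print(matches_domain(column, last_names))       # below the 0.9 threshold
print(matches_domain(column, last_names, 0.7))  # passes a looser threshold
```

              The higher the threshold, the fewer false positives reach the stewards, at the cost of missing columns with dirty data.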


              Perhaps someone from the EDC product team could answer your CLAIRE question in more detail.

              • 4. Re: Best practices for data domain curation
                ltrapadoux Guru



                A good way to accelerate data domain curation is the one Scott proposed: when value frequency profiling is enabled, the system can "learn by example". A data steward then has the option to create a data domain that corresponds to a column; this domain (which we call a smart domain) is propagated to other columns that have the same characteristics or "profile". For this propagation, CLAIRE looks at the results of the column similarity jobs, which find similar columns based on the following criteria: matching column names, data patterns, unique values, and value distribution. The weights of the criteria can be adjusted using service-level custom properties, although so far we have seen little need to customize them.
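                As a rough illustration of how four weighted criteria like these could be combined into one similarity score, here is a minimal Python sketch. It is not the CLAIRE algorithm and does not reflect EDC's actual custom properties; the scoring functions, the equal weights, and all names are assumptions made for the example.

```python
from collections import Counter


def pattern(value):
    """Reduce a value to a data pattern: digits become 9, letters become A."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)


def jaccard(a, b):
    """Overlap of two collections as |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def distribution_overlap(a, b):
    """Sum of the smaller relative frequency for each shared value."""
    if not a or not b:
        return 0.0
    ca, cb = Counter(a), Counter(b)
    return sum(min(ca[k] / len(a), cb[k] / len(b)) for k in ca.keys() & cb.keys())


# Hypothetical weights; equal by default.
WEIGHTS = {"name": 0.25, "pattern": 0.25, "unique": 0.25, "distribution": 0.25}


def similarity(name_a, values_a, name_b, values_b, weights=WEIGHTS):
    """Weighted average of the four per-criterion scores, each in [0, 1]."""
    scores = {
        "name": 1.0 if name_a.strip().lower() == name_b.strip().lower() else 0.0,
        "pattern": jaccard(map(pattern, values_a), map(pattern, values_b)),
        "unique": jaccard(values_a, values_b),
        "distribution": distribution_overlap(values_a, values_b),
    }
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in scores) / total


# Illustrative data only: same name (ignoring case), different values.
print(similarity("AGE", ["34", "51", "34"], "age", ["34", "29"]))
```

                Raising one weight relative to the others makes that criterion dominate the score, which is the kind of tuning the custom properties are described as allowing.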

                The out-of-the-box data domains may give results that are considered inaccurate, because they are based on generic rules that may not exactly fit your data set. Using smart domains will improve the accuracy of discovery and reduce the effort needed to curate individual columns.


                For more detail, you can refer to the following sections of the EDC Administrator guide:

                - Column Similarity

                - Data Domains and Data Domain Groups


                Thank you

