To follow up on Brian's question: my understanding is that for Similarity Profiling to work, I must turn it on in the resource settings and then run the system-wide Similarity Discovery job. If I shut this off on an individual resource, would that have an impact on the system-wide Similarity Discovery job?
In our case we have 100+ resources. I have run them all with Similarity Discovery on but have NOT run the Similarity Discovery system job. We need this setting on because we are interested in value frequency, and Similarity Discovery must be on for value frequency to be enabled.
Hypothetically, if I wanted to do similarity discovery for, say, just two of my 100 resources, would I have to shut this off on the other 98 and then run the system Similarity Discovery job?
Does the fact that I have already run all 100 with it on mean that they are essentially queued up and will be compared when the system job runs, even if the setting is now shut off?
I have a similar question, and through testing I have observed the following behavior:
- Once Similarity Discovery is turned on for a resource and the resource is loaded, the resource is in the queue for system Similarity Discovery. Simply turning off Similarity Discovery for the resource, without re-running or purging the resource, does not seem to remove it from future system Similarity Discovery runs. I have not yet tested whether re-running or purging the resource removes it from the queue.
- Purging the Similarity Discovery resource itself does not seem to clear the queued resources out of system Similarity Discovery. I turned off Similarity Discovery in the individual resources (but did not re-run or purge them) and then purged the Similarity Discovery resource. When I re-ran the system Similarity Discovery, the progress tab of monitoring on the resource showed a full set of pair-wise similarity comparisons across all resources that had ever been configured for and run with Similarity Discovery, including those for which the setting had been turned off.
Improved guidance on configuring Similarity Discovery would be appreciated, so that users can more intelligently control which resources will be compared. In my experience, Similarity Discovery tends to run long once a number of resources have accumulated in the queue.
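To put a rough number on why the job runs long as resources accumulate, here is a minimal sketch that counts pair-wise comparisons. It assumes one comparison per pair of queued resources, which is a simplification on my part; the resource names and the comparison_pairs helper are hypothetical and not part of EDC.

```python
from itertools import combinations

def comparison_pairs(resources):
    """Return every unordered pair of resources (hypothetical helper).

    Assumes one comparison per resource pair; this is an illustration,
    not how the product actually schedules its work.
    """
    return list(combinations(resources, 2))

# Hypothetical names standing in for resources queued for the system job.
queued = [f"resource_{i}" for i in range(1, 101)]

print(len(comparison_pairs(queued)))      # all 100 queued resources
print(len(comparison_pairs(queued[:2])))  # only the two of interest
```

Under that assumption, 100 queued resources mean 4950 pairs versus a single pair for two resources, so the growth is roughly quadratic in the number of queued resources.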
At the moment, Value Frequency is the most useful piece of the functionality for my use cases.
Your understanding of the Similarity Discovery system job is correct. When any resource is run with the Similarity preparation and Value frequency steps on, the results are stored in an HDFS staging directory.
When the Similarity Discovery job runs, it picks up all stored results from that HDFS staging location and runs Similarity Discovery against them.
Currently there is no way to limit which resources the Similarity Discovery job compares. Also, manually deleting a resource directory from that HDFS directory is not recommended.
Is there a plan to address this limitation, maybe in a future release? I think Similarity Discovery offers some promise, but trying to run it on 100+ resources makes it almost unworkable from a performance perspective. It would be nice if value frequency could be uncoupled from Similarity Discovery, because we are essentially stuck now, unable to test or run Similarity Discovery on a small scale.
What are the risks of manually deleting the entries in the HDFS directory?
Starting with EDC v10.4.0, there are two JVM settings that can be used to limit the resources that are used by similarityDiscovery.
Both take a comma-separated list of values.
Thanks, this seems to be working. I do have a follow-up question on how SimilarityDiscovery works with Data Domains. I thought that if we added data domains to an asset, they would propagate to similar columns. Can you provide some more detail on how that works?
My understanding was that if you create or apply a 'Data Domain by Example' to a column directly in the Catalog UI, and then include SimilarityDiscovery for that resource on a future load, it will compute a 'characteristic metadata profile' for your 'Data Domains by Example' and use it to find columns with a similar 'characteristic metadata profile'. Subsequently, when the DataDomainPropagation job runs (typically on a schedule unless turned off), the 'Data Domains by Example' should also be propagated to sufficiently similar columns. I don't believe that 'Rules-Based' Data Domains (those created in the Catalog Admin Console or the IDQ Data Domain Glossary) are included in Data Domain Propagation, as they rely on evaluation of their rules to be 'propagated' in an automated way. I believe the key is to understand the difference between 'Data Domain by Example' and 'Rules-Based' Data Domains.