Scanners: Overview and Categories

Version 3

    Enterprise Data Catalog: Building Blocks

    Scanners

    Scanners are pluggable components of Enterprise Data Catalog (EDC) that extract source and profile metadata for storage in the catalog. A scanner is so named because it runs a scan job against a metadata source before the metadata is loaded into the catalog. A scanner typically maps to a single resource type, but a resource type can have more than one scanner; examples are the profiling scanner and the lineage analyzer.

    Think of a scanner as the runtime component of a resource. An EDC user sets up a resource of a specific type to fetch metadata and profile information; the scanner is the runtime component that executes inside Hadoop and does the actual retrieval.
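
    The relationship described above (a resource of a given type, with one or more scanners doing the work at run time) can be pictured with a small sketch. This is an illustrative model only, not the EDC API: the names `Resource`, `SCANNERS`, `register`, and `run_resource`, and the stub scanners, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical names throughout: an illustrative model, not the EDC API.
Metadata = Dict[str, str]


@dataclass
class Resource:
    name: str
    resource_type: str  # e.g. "Oracle", "Hive"
    properties: Dict[str, str] = field(default_factory=dict)


Scanner = Callable[[Resource], List[Metadata]]

# One resource type may map to more than one scanner.
SCANNERS: Dict[str, List[Scanner]] = {}


def register(resource_type: str) -> Callable[[Scanner], Scanner]:
    """Decorator that attaches a scanner to a resource type."""
    def decorator(scanner: Scanner) -> Scanner:
        SCANNERS.setdefault(resource_type, []).append(scanner)
        return scanner
    return decorator


@register("Oracle")
def source_metadata_scanner(resource: Resource) -> List[Metadata]:
    # A real scanner would connect to the source; this stub returns fixed data.
    return [{"kind": "table", "name": "CUSTOMERS", "resource": resource.name}]


@register("Oracle")
def profiling_scanner(resource: Resource) -> List[Metadata]:
    # Stub standing in for a second scanner on the same resource type.
    return [{"kind": "profile", "name": "CUSTOMERS.EMAIL", "resource": resource.name}]


def run_resource(resource: Resource) -> List[Metadata]:
    """Run every scanner registered for the resource's type and collect the results."""
    results: List[Metadata] = []
    for scanner in SCANNERS.get(resource.resource_type, []):
        results.extend(scanner(resource))
    return results


if __name__ == "__main__":
    crm = Resource(name="crm_oracle", resource_type="Oracle", properties={"schema": "SALES"})
    for item in run_resource(crm):
        print(item)
```

    In this sketch the resource carries only configuration; all extraction logic lives in the scanners, which mirrors the split between the resource definition and its runtime component.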

    Types of Scanners

    EDC comes with the following flavors of scanners:

    Preconfigured Scanners (OOTB)

    EDC comes with preconfigured scanners for common RDBMS, Hadoop, and BI applications. The user only needs to provide configuration properties (user credentials, schema name, and so on).

    Partner-Developed and Maintained Scanners

    Informatica partners build their own scanners and integrate them with EDC to load metadata for systems that EDC does not already support. These scanners are supported by the partners and are usually part of a metadata extraction application.

    Custom Scanners

    EDC allows users to ingest and extract metadata from sources for which EDC does not provide support. By default, EDC provides models (referred to as system models) for multiple data sources from which users can extract metadata; the extraction is based on the model for the data source.

     

    Resources

    A resource represents an external data source or metadata repository from which EDC extracts metadata. The resource is created in the EDC administrator, and its contents are stored in the MRS associated with the catalog service. A resource is not very different from a connection in BDM/PowerCenter: the basic operations of extracting and storing metadata are performed at the resource level. EDC has multiple resource types to cover heterogeneous external systems: for example, one resource type for Oracle, one for Hive, and another for MicroStrategy. Resources run to extract metadata from external systems; a run can be manual or scheduled. Think of a resource as the bridge through which metadata from external sources makes its way to the catalog.

    Catalog

    All the searchable information (metadata) is stored in an indexed inventory called the catalog. The catalog organizes the metadata so that the underlying information is easy to search and understand. For faster retrieval and storage, catalog information is stored inside a Hadoop cluster.
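
    To make the idea of an "indexed inventory" concrete, here is a minimal sketch of an inverted index over metadata records. It is a toy, in-memory illustration under stated assumptions and says nothing about how EDC actually builds or stores its index; the class `ToyCatalog` and the record fields are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ToyCatalog:
    """Toy inverted index: maps each lowercase token to the IDs of assets containing it."""

    def __init__(self) -> None:
        self._assets: Dict[int, Dict[str, str]] = {}
        self._index: Dict[str, Set[int]] = defaultdict(set)

    def add(self, asset: Dict[str, str]) -> int:
        """Store an asset and index every token found in its field values."""
        asset_id = len(self._assets)
        self._assets[asset_id] = asset
        for value in asset.values():
            # Split on whitespace, dots, and underscores so "CUSTOMER_EMAIL" matches "email".
            for token in value.lower().replace(".", " ").replace("_", " ").split():
                self._index[token].add(asset_id)
        return asset_id

    def search(self, query: str) -> List[Dict[str, str]]:
        """Return assets that contain every token of the query, in insertion order."""
        tokens = query.lower().split()
        if not tokens:
            return []
        ids = set.intersection(*(self._index.get(t, set()) for t in tokens))
        return [self._assets[i] for i in sorted(ids)]


if __name__ == "__main__":
    catalog = ToyCatalog()
    catalog.add({"name": "CUSTOMER_EMAIL", "type": "column", "resource": "crm_oracle"})
    catalog.add({"name": "CUSTOMER_ORDERS", "type": "table", "resource": "crm_oracle"})
    print(catalog.search("customer email"))
```

    Because lookups go through the token index rather than scanning every record, a search touches only the assets that share at least one term with the query, which is the general reason a catalog is indexed.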