3 Replies Latest reply on Nov 9, 2020 3:06 AM by Bibek Sahoo

    AVRO files produced by CDC Publisher unreadable

    Jim Kolberg New Member

      Hi,

      We have a long-running instance of CDC Publisher that produces human-readable JSON output in the wrapped-generic schema. I tried setting up another instance on the same VM and configured it to produce binary Avro files in the wrapped-flat schema. The Avro files are unreadable in every client I've tried (NiFi, some online tools); I get the error "Not a data file." I repeated the test using the wrapped-nested schema and got the same error.

       

      If I look at a file in a hex viewer, the messages still contain the string "PWX_GENERIC" near the top. This publisher shouldn't be writing the generic schema. Is it getting its wires crossed with the other publisher instance? We tried running the following commands to clear any state shared between the two instances:

       

      PwxCDCAdmin.sh CLEAR=FORMAT TABLE=ALL INSTANCE={new instance name}

      PwxCDCAdmin.sh RESET=FORMAT INSTANCE={new instance name}

      PwxCDCAdmin.sh REPORT=FORMAT TABLE=ALL INSTANCE={new instance name}

       

      We got a few readable Avro files after that, but only a small minority and only for a short time. Now we're back to the "Not a data file" error.

       

      My questions: 1) Should we see PWX_GENERIC in the binary Avro output when it should be a flat or nested schema? 2) Is it even possible to run two publishers with different configurations on the same VM, reading from the same CDC listener?

       

      The original and new instances are the same except for the following config changes.

      cdcPublisherAvro.cfg

      Formatter.avroSchemaFormat=avroFlatSchemaFormatV1

      Formatter.avroEncodingType=binary

      Formatter.avroWrapperSchemaFormat=avroWrapperSchemaFormatV1

      Formatter.formatterAddTimestampColumn=true

       

      cdcPublisherKafka.cfg

      Connector.kafkaTopic=avro_cdcpub_wrapped_flat_redacted

      Connector.kafkaMessageKey=USE_TABLE_NAME

       

      cdcPublisherCommon.cfg – No changes

       

      cdcPublisherPowerExchange.cfg – No changes

       

        • 1. Re: AVRO files produced by CDC Publisher unreadable
          user126898 Guru

          A couple of things to check or validate.

           

          in the cdcPublisherAvro.cfg

           

          Formatter.formatterType

          The type of data serialization formatter to use for messages. The only valid value is

          Avro

           

          Formatter.avroWrapperSchemaFormat: I see you have this set. This wraps the message in the wrapper schema shown below. You may want to double-check that you are parsing the data based on this schema format. From the documentation:

          "To process the data in the messages based on this schema format, the consumer application must parse the messages to get the source mapname_tablename and then find the Avro flat, nested, or generic schema that matches that name value by using their own methods."

          {
            "type" : "record",
            "name" : "InfaAvroWrapperSchema",
            "fields" : [
              {"name" : "INFA_SEQUENCE", "type" : [ "null", "string" ], "default" : null},
              {"name" : "INFA_TABLE_NAME", "type" : [ "null", "string" ], "default" : null},
              {"name" : "INFA_OP_TYPE", "type" : [ "null", "string" ], "default" : null},
              {"name" : "ChildSchema", "type" : [ "null", "string" ], "default" : null}
            ]
          }
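
As a rough illustration of what parsing against this wrapper schema involves, here is a minimal Python sketch that decodes the four nullable string fields by hand, following the binary encoding rules in the Avro specification. It assumes each Kafka message is a bare binary-encoded datum written with the wrapper schema (no container-file header); whether CDC Publisher actually writes messages that way is an assumption, not something confirmed in this thread. The function names are hypothetical.

```python
import io

# Field order taken from the InfaAvroWrapperSchema shown above.
WRAPPER_FIELDS = ("INFA_SEQUENCE", "INFA_TABLE_NAME", "INFA_OP_TYPE", "ChildSchema")


def read_long(buf):
    """Read one Avro long: a zigzag-encoded, variable-length integer."""
    raw = 0
    shift = 0
    while True:
        byte = buf.read(1)
        if not byte:
            raise EOFError("ran out of bytes inside a varint")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear: last byte of the varint
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)  # undo zigzag encoding


def read_nullable_string(buf):
    """Read a ["null", "string"] union: branch index, then the value."""
    branch = read_long(buf)
    if branch == 0:
        return None
    if branch != 1:
        raise ValueError(f"unexpected union branch {branch}")
    length = read_long(buf)
    data = buf.read(length)
    if len(data) != length:
        raise EOFError("string is shorter than its declared length")
    return data.decode("utf-8")


def decode_wrapper(message):
    """Decode one message, assuming it is a bare wrapper-schema datum."""
    buf = io.BytesIO(message)
    return {name: read_nullable_string(buf) for name in WRAPPER_FIELDS}
```

If this assumption holds, the INFA_TABLE_NAME value from the result is what the consumer would use to look up the matching flat, nested, or generic schema.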
          • 2. Re: AVRO files produced by CDC Publisher unreadable
            Jim Kolberg New Member

            Getting back to this issue after a while on other tasks,  but thanks for the tip.

             

            We removed the wrapper schema line to eliminate that variable, so it's just formatterType=avro and Formatter.avroSchemaFormat=avroFlatSchemaFormatV1. We've been deleting all the .rpt files and creating a new topic each time to eliminate any possibility of contamination across experiments. Still no luck.

             

            If I download a single Kafka flow file, there is enough human-readable text that I know I'm getting a specific table, and that it's in the flat format. But NiFi's ConvertRecord processor just chokes with a generic message, and avro-python3 can't handle it either. The schema file I'm using is the contents of the .rpt file generated by the publisher.

             

            Formatter.avroEncodingType=json works correctly for me, but the messages are huge. My assumption is that binary-encoded Avro with the schema stored externally would be the most efficient way to transfer data to our Hadoop cluster. Has anybody got this working? Or do you just pay the JSON tax and move on? Or is my assumption not valid to begin with?

             

            The files I am getting do not start with the "magic" bytes ("Obj" followed by the byte 0x01) defined in the Avro specification under the "Object Container Files" heading. A file I can work with starts like this (Linux od -c output):

            0000000   O   b   j 001 004 024   a   v   r   o   .   c   o   d   e   c

            0000016  \b   n   u   l   l 026   a   v   r   o   .   s   c   h   e   m

            0000032   a 362 005   {   "   t   y   p   e   "   :   "   r   e   c   o ...
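
A quick way to tell the two kinds of file apart is to test the first four bytes for the container-file magic. This is a minimal sketch; the function name is hypothetical and the sample bytes are copied from the od dumps in this post.

```python
# Magic bytes that open every Avro Object Container File: "Obj" then 0x01.
AVRO_MAGIC = b"Obj\x01"


def is_avro_container(data):
    """Return True if the bytes start with the Avro container-file magic."""
    return data[:4] == AVRO_MAGIC


# Leading bytes of a file that readers accept.
readable = b"Obj\x01\x04\x14avro.codec"
# Leading bytes of a file written by the publisher.
from_publisher = b"\x02\x86\x022,PWX_GENERIC"

print(is_avro_container(readable))        # True
print(is_avro_container(from_publisher))  # False
```

Tools that expect a container file typically report "Not a data file" when this check fails, since a file without the magic has no embedded schema or sync markers to read.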

             

            All the avro binary files from publisher start like this

            0000000 002 206 002   2   ,   P   W   X   _   G   E   N   E   R   I   C

            0000020   ,   1   ,   ,   2   ,   3   ,   D   4   0   0   D   9   9   2

            0000040   7   1   0   7   2   D   0   0   0   0   0   0   0   0   0   0 ...
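
One possible reading of those leading bytes, offered as an interpretation rather than a confirmed fact about CDC Publisher: under Avro's binary encoding, a ["null", "string"] union is written as a zigzag varint branch index followed by a zigzag varint string length. The sketch below decodes the first three bytes of the dump on that assumption; the helper name is hypothetical.

```python
# First three bytes of the publisher output, as shown by od -c (octal).
prefix = bytes([0o002, 0o206, 0o002])


def zigzag_varint(data, pos=0):
    """Decode one Avro long starting at pos; return (value, next_pos)."""
    raw = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        raw |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear: last byte of the varint
            return (raw >> 1) ^ -(raw & 1), pos
        shift += 7


branch, pos = zigzag_varint(prefix)       # union branch index
length, pos = zigzag_varint(prefix, pos)  # string length in bytes
print(branch, length)  # 1 131
```

Branch 1 of ["null", "string"] is "string", so these bytes are consistent with a bare datum whose first field is a 131-byte string beginning "2,PWX_GENERIC,...". If that reading is right, the output is a schemaless binary datum rather than a container file, and it needs a reader that is given the writer schema separately.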

             

            If I have to put the schema in every file, then it is what it is. But seems like a waste.

            • 3. Re: AVRO files produced by CDC Publisher unreadable
              Bibek Sahoo Active Member

              For an updated or reformatted schema, you should use the latest Publisher release, 1.3, which introduces "Custom Pattern Formats". With these you can change the format of the data to suit your consumer and even remove columns (such as PWX_GENERIC) from the target topic.

               

              In this case, too, you cannot use the .rpt file as the schema, but you might be able to use the target data as the schema, because it will generate a valid schema.

               

              You might want to refer to the chapter "Appendix C Custom Pattern Formats" in the user guide:

              Preface

               

              And yes, you can also raise a case with Informatica to guide you further on this
