-
1. Re: AVRO files produced by CDC Publisher unreadable
user126898 Oct 13, 2020 2:18 PM (in response to Jim Kolberg)
A couple of things to check or validate.
In the cdcPublisherAvro.cfg:
Formatter.formatterType
The type of data serialization formatter to use for messages. The only valid value is Avro.
Formatter.avroWrapperSchemaFormat
I see you have this set. This wraps the message in the wrapper schema shown below. You may want to double-check that you are parsing the data based on this schema format: "To process the data in the messages based on this schema format, the consumer application must parse the messages to get the sourcemapname_tablename and then find the Avro flat, nested, or generic schema that matches that name value by using their own methods."
{
  "type": "record",
  "name": "InfaAvroWrapperSchema",
  "fields": [
    {"name": "INFA_SEQUENCE", "type": ["null", "string"], "default": null},
    {"name": "INFA_TABLE_NAME", "type": ["null", "string"], "default": null},
    {"name": "INFA_OP_TYPE", "type": ["null", "string"], "default": null},
    {"name": "ChildSchema", "type": ["null", "string"], "default": null}
  ]
}
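For illustration only, here is a minimal consumer-side sketch of parsing a wrapper-format message, assuming each Kafka message body is a single raw (schemaless) Avro datum encoded with the wrapper schema above, and using the fastavro library; decode_wrapper_message is just a placeholder name, and the schema-lookup step is left to the consumer as the documentation describes:

import io
import fastavro

# The wrapper schema exactly as quoted above.
WRAPPER_SCHEMA = fastavro.parse_schema({
    "type": "record",
    "name": "InfaAvroWrapperSchema",
    "fields": [
        {"name": "INFA_SEQUENCE", "type": ["null", "string"], "default": None},
        {"name": "INFA_TABLE_NAME", "type": ["null", "string"], "default": None},
        {"name": "INFA_OP_TYPE", "type": ["null", "string"], "default": None},
        {"name": "ChildSchema", "type": ["null", "string"], "default": None},
    ],
})

def decode_wrapper_message(message_bytes):
    # Assumes the message is a raw Avro datum, i.e. there is no "Obj" + 0x01
    # container-file header in front of it.
    record = fastavro.schemaless_reader(io.BytesIO(message_bytes), WRAPPER_SCHEMA)
    # record["INFA_TABLE_NAME"] carries sourcemapname_tablename; the consumer
    # uses it to look up the matching flat, nested, or generic schema for the
    # actual row data by its own means.
    return record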
-
2. Re: AVRO files produced by CDC Publisher unreadable
Jim Kolberg Nov 2, 2020 2:18 PM (in response to user126898)
Getting back to this issue after a while on other tasks, but thanks for the tip.
We removed the wrapper schema line to eliminate that variable, so it's just formatterType=avro and Formatter.avroSchemaFormat=avroFlatSchemaFormatV1. We've been deleting all the .rpt files and creating a new topic to eliminate any possibility of contamination across experiments. Still no luck.
If I download a single Kafka flowfile, there is enough human-readable text that I know I'm getting a specific table, and it's in the flat format. But NiFi's ConvertRecord processor just chokes with a generic message, and avro-python3 can't handle it either. The schema file I'm using is the contents of the .rpt file generated by Publisher.
Formatter.avroEncodingType=json works correctly for me, but the messages are huge. My assumption is that binary-formatted Avro with the schema stored externally would be the most efficient transfer to our Hadoop cluster. Has anybody got this working? Or do you just pay the JSON tax and move on? Or are my assumptions not valid to begin with?
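(For reference, a rough sketch of what "binary Avro with an external schema" implies on the consumer side, assuming the messages are raw schemaless Avro datums and that the .rpt file contains the flat schema as plain JSON; fastavro and the file name are used here purely for illustration:)

import io
import json
import fastavro

# Assumption: the .rpt file holds the Avro flat schema for the table as JSON.
with open("my_table.rpt") as f:
    flat_schema = fastavro.parse_schema(json.load(f))

def decode_flat_message(message_bytes):
    # Works only if the message is a bare datum; it is not an Object Container
    # File, so readers that expect the "Obj" + 0x01 magic header will reject it.
    return fastavro.schemaless_reader(io.BytesIO(message_bytes), flat_schema)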
The files I am getting do not start with the "magic" number "Obj1" (Obj followed by the byte 0x01) as defined in the Avro standard (Object Container Files heading). A file I can work with starts like this (Linux od -c output):
0000000 O b j 001 004 024 a v r o . c o d e c
0000016 \b n u l l 026 a v r o . s c h e m
0000032 a 362 005 { " t y p e " : " r e c o ...
All the Avro binary files from Publisher start like this:
0000000 002 206 002 2 , P W X _ G E N E R I C
0000020 , 1 , , 2 , 3 , D 4 0 0 D 9 9 2
0000040 7 1 0 7 2 D 0 0 0 0 0 0 0 0 0 0 ...
If I have to put the schema in every file, then it is what it is. But it seems like a waste.
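(One possible middle ground, sketched below with fastavro under the same assumptions as above, i.e. raw flat-format datums with the flat schema available externally, e.g. saved from the .rpt file: decode the datums and batch them into a single Object Container File, so the schema is written once per file rather than once per record, and readers that look for the "Obj" + 0x01 magic header can consume the result. The file names are placeholders.)

import io
import json
import fastavro

# Assumption: the flat schema is available as plain JSON (e.g. saved from the
# .rpt file) and every message on the topic is a raw datum in that format.
with open("flat_schema.avsc") as f:
    flat_schema = fastavro.parse_schema(json.load(f))

def messages_to_container_file(messages, out_path):
    # Decode each raw-datum message, then write all records as one Avro OCF.
    # The output file starts with the "Obj" + 0x01 magic bytes and embeds the
    # schema exactly once, so downstream tools can read it with no external schema.
    records = (fastavro.schemaless_reader(io.BytesIO(m), flat_schema)
               for m in messages)
    with open(out_path, "wb") as out:
        fastavro.writer(out, flat_schema, records)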
-
3. Re: AVRO files produced by CDC Publisher unreadable
Bibek Sahoo Nov 9, 2020 3:06 AM (in response to Jim Kolberg)
For an updated/formatted schema, you should use the latest Publisher release, 1.3, where we have introduced "Custom Pattern Formats". With these you can change the format of the data to match your consumer's needs and even remove columns (e.g., PWX_GENERIC) from the target topic.
With this approach you also cannot use the .rpt file, but you might use the target data as the schema, because it will generate a valid schema.
You might want to refer to the chapter "Appendix C Custom Pattern Formats" in the user guide.
And yes, you can also raise a case with Informatica to guide you further on this.