12 Replies Latest reply on May 24, 2020 5:52 AM by Nico Heinze

    All the Arabic characters are getting loaded as Junk Characters

    Saurabh Saneja New Member

      Hi, PowerCenter experts community,

       

      We are working with Arabic data set currently.

       

      We are facing the following issue:

       

      While loading Arabic data from the source (Oracle) to the target (flat file) through Informatica PowerCenter 10.2.0 Hotfix2 0401 1911, all the Arabic characters end up as junk characters in the flat file.

       

      Below are the further details associated with our dataset and the issue faced:

       

      1. Sample SRC Records –
        In the data below, the SOURCE_VALUE column holds Arabic values in the Oracle database. When these values are loaded into the flat file, they are converted to ‘ãÏíÑ ÎÏãÇÊ ÇáÚãáÇÁ’, as seen in the screenshots below:


      2. Sample TGT Records -



      3. Code page setting:-
        • SRC Relational Connection -
        • TGT Flat File:-
      4. Existing Configuration Code Page setting:-
        • 1. Integration Service – Code page – MS Windows Arabic

        • 2. Repository Service – Code page – UTF8

        • 3. Data movement mode is set to Unicode.

       

      We have also tried the following options, but still had no luck getting this through:

       

       

      Has anyone faced a similar issue before? It would be really helpful if any of you could help us get this resolved. Looking forward to that.

       

       

       

      Thanks & Regards,

      Saurabh Saneja

        • 1. Re: All the Arabic characters are getting loaded as Junk Characters
          Sven Benzing Guru

          Hi,

           

          First of all, we need to find out where the data corruption occurs. To do that, enable verbose data logging for this session and review the session log.

           

          As a second step you need to review your NLS_LANG settings:

           


           

          Hope that helps.

          Kind regards,
          Sven

          • 2. Re: All the Arabic characters are getting loaded as Junk Characters
            Paolo Moretti Seasoned Veteran

            Dealing with code page conversion issues can be difficult. The encoding used by the very tools you view the data with might also play a role and make the troubleshooting process harder.

             

            I would start by reducing the mapping to its minimum (i.e. SRC=>SQ=>TGT) and use a flat file as source, with just a single row. This way you can focus on PowerCenter (and the OS).

             

            If you experience the same issue, then enabling "verbose data" mode at the session level might help you understand where the conversion is happening.

            • 3. Re: All the Arabic characters are getting loaded as Junk Characters
              Smitha HC Seasoned Veteran

              Hi Saurabh,

               

              As requested, please share the session log in verbose data mode for one row, and check the settings below.

               

              1. Check the source database code page and set the NLS_LANG parameter to the same code page.

              2. Check which LC_LANG variable is set.

              3. Check the source connection code page in Workflow Manager.

               

              Thanks,

              Smitha

              • 4. Re: All the Arabic characters are getting loaded as Junk Characters
                Nico Heinze Guru

                There's one more potential pain point which can cause this kind of trouble:

                 

                Suppose you have created a flat file target definition with the code page set to UTF-8.

                This target definition is used in some mapping for which you create a session.

                 

                Later you change the code page in the Target Designer of the PowerCenter Designer to MS Windows Arabic.

                 

                The not so funny thing is: the session will retain the code page as UTF-8.

                 

                Why?

                 

                When a session with a flat file source or target is created, the code page of that file is read from the source / target definition.

                From this time onward, later changes to the target definition will NOT reflect in the session.

                 

                In such a case you always have to manually change the code page of the target in the session in addition to the change in the target definition itself.

                 

                Not everyone is aware of this point, so please bear with me if you didn't have this problem.

                 

                Regards,

                Nico

                • 5. Re: All the Arabic characters are getting loaded as Junk Characters
                  Saurabh Saneja New Member

                  Hi All,

                   

                  Thank you so much Nico, Sven, Paolo, and Smitha!

                   

                  My team has tried all your inputs and here is what they have got:

                   

                  They changed the tracing level to verbose data to check the data loading process. The Arabic values are populated correctly from the Source Qualifier through to the last transformation used in the mapping. However, when the data is inserted into the flat file, the values are loaded as special characters.

                   

                  Attached the session log for your reference.

                   

                  We also tried the MS Windows Arabic option in the Target Designer as well as at the session level in PowerCenter.

                   

                  Kindly Suggest.

                   

                  Thanks & regards,

                  Saurabh Saneja

                  • 6. Re: All the Arabic characters are getting loaded as Junk Characters
                    Smitha HC Seasoned Veteran

                    Hi Saurabh,

                     

                    From the session log, I could see that the row has been pushed to the target properly. Check the screenshot below.

                     

                     

                    So it is possible that the flat file only shows special characters when opened because of the code page settings of the editor. Please cross-check this once.

                     

                    Thanks,

                    Smitha

                    • 7. Re: All the Arabic characters are getting loaded as Junk Characters
                      Nico Heinze Guru

                      The question now is:

                      How exactly did you check that the target file contained junk characters?

                      What tool did you use for this check?

                      What code page of the flat file does this tool expect?

                       

                      For example, if your target flat file was correctly written in MS Windows Arabic but your editor of choice expects UTF-8 encoding, then you will only see junk characters instead of text.

                       

                      In other words: as the "verbose data" log shows seemingly correct data in the last transformation before the target, there are only two possible causes of trouble: either the target writer is set to the wrong code page (hence my suggestion to check the target file setting in the session instead of in the Designer), or the file is checked with some software expecting a different code page than MS Windows Arabic.
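
                      To illustrate the second cause, here is a minimal Java sketch that reverses the viewing mistake. It assumes the junk value quoted in the first post was produced by a viewer that read MS Windows Arabic (windows-1256) bytes as Western European (windows-1252); the class and method names are made up for this example, and it needs a JDK that ships the extended charsets.

```java
import java.nio.charset.Charset;

public class MojibakeCheck {

    /**
     * Turns the junk text back into the bytes the viewer must have read,
     * then decodes those bytes with the code page the writer is assumed
     * to have used.
     */
    static String redecode(String junk, String viewerCharset, String writerCharset) {
        byte[] raw = junk.getBytes(Charset.forName(viewerCharset));
        return new String(raw, Charset.forName(writerCharset));
    }

    public static void main(String[] args) {
        // The junk value from the first post, written with Unicode escapes
        // so the source file encoding does not matter.
        String junk = "\u00e3\u00cf\u00ed\u00d1 \u00ce\u00cf\u00e3\u00c7\u00ca "
                    + "\u00c7\u00e1\u00da\u00e3\u00e1\u00c7\u00c1";

        // Assumption: viewer used windows-1252, writer used windows-1256.
        String fixed = redecode(junk, "windows-1252", "windows-1256");

        // Print code points instead of glyphs, because the console itself
        // may not be able to display Arabic.
        StringBuilder sb = new StringBuilder();
        for (char c : fixed.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        System.out.println(sb.toString().trim());
    }
}
```

                      If the printed code points fall into the Arabic block (U+0600 to U+06FF), the bytes in the file were most likely fine and only the viewer used the wrong code page.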

                       

                      Regards,

                      Nico

                      • 8. Re: All the Arabic characters are getting loaded as Junk Characters
                        user101600 Guru

                        Unfortunately as a general rule when you are working with different Code pages there are a lot of moving parts and a lot of things that could go wrong:

                        For example

                        1- The INFORMATICA engine is not running in UNICODE in a case where it needs to be.

                        This doesn't seem to be the case here, since the session log says the following: (CMN_1569 : Server Mode: [UNICODE])

                        2- The PC Source or target connection might have the wrong code page selected in workflow manager.

                        3- The target database is setup incorrectly (The wrong code page is selected)

                        4- The database client is setup incorrectly (wrong code page is selected)

                        5- The characters that you are trying to write are just not supported by the code page the particular database is configured with.

                        6- The data is being moved properly in the proper codepage but the tools that you are using to view the data cannot display the characters properly.

                         

                        In order to troubleshoot the problem we need to determine exactly where the bottleneck is.

                        Is the problem on the reader, on the writer, within PowerCenter, or is it simply a display issue?

                         

                        Unfortunately there is no recipe book or exact science on what codepage should be used.

                        Therefore you will need to perform multiple tests and rely on trial and error in order to debug this issue.

                        The main idea is to simplify things as much as possible, then slowly add things back into the equation until we can determine the bottleneck.

                         

                        For starters Please check the following configuration settings

                         

                        1-    Make sure that The INFORMATICA engine is running in UNICODE and not in ASCII. - According to the session log that you had sent me it looks like the IS is running in Unicode mode.

                        2-    Check the codepage you have defined on source DB connection settings in workflow manager (make sure you have defined a compatible codepage or set it to UTF-8)

                        3-    Check the codepage you have defined on Target DB connection settings in workflow manager (make sure you have defined a compatible codepage or set it to UTF-8)

                        4-    If you are using Oracle, make sure that the correct NLS_LANG env variable is defined on the PowerCenter box where the IS is running.

                         

                        The first step to debug this would be to check the reader and make sure the PC IS can read the data properly from the source.

                        So using the native DB client tools can you run the query and see the problematic character getting displayed properly?

                        Then the next step would be to take the existing mapping and session and write out to a FLAT FILE. This will help us determine if the problem is on the reader or on the Informatica side since we will be taking the target DB out of the picture.

                        If this still fails, then we would need to create a flat file with these strange characters and then create a simple mapping going from this test flat file to a target that is also a flat file (flatfile.out).

                         

                        Now check to make sure whether or not those bad characters are still there.

                         

                        This will help us determine whether the IS can read and write those characters correctly.

                         

                        If this works and all the data looks good after reading from and writing to a FF, then take this mapping and write to the original target database. If reading from a FF and writing to the original target table works properly, then we know that the issue is on the source DB side. But if this test fails, then the target database (or the DB client) may be set up incorrectly.

                         

                        When you write to a flat file, please make sure you select the correct code page for the flat file in the session properties. After the session runs successfully, take the flatfile.out output, rename it to a .doc extension, and open it with MS Word. This way we can rule out the case where the characters really are in the file but the viewer you are using doesn't support them or doesn't display them properly. If you open this flat file in Word, you will either still see the strange characters, or Word will ask you to select a codepage to use in order to display them properly. Also please keep in mind that in some cases you might not have the appropriate fonts installed on your PC to display the characters, especially when working with Asian language characters.
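
                        If Word is not at hand, the same check can be sketched in a few lines of Java: decode the raw bytes of the output file with each candidate code page and count how many Arabic letters come out. This is only an illustration; the class name, the temporary stand-in file, and the list of candidate code pages are assumptions, not something taken from the session.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class FlatFileCharsetProbe {

    /** Counts code units that fall into the basic Arabic block (U+0600..U+06FF). */
    static int countArabic(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0600 && c <= 0x06FF) {
                n++;
            }
        }
        return n;
    }

    /** Decodes raw file bytes with the given charset and counts Arabic letters. */
    static int arabicCountWhenReadAs(byte[] fileBytes, String charsetName) {
        return countArabic(new String(fileBytes, Charset.forName(charsetName)));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the session output: four Arabic letters stored as
        // MS Windows Arabic (windows-1256) bytes. Point this at the real
        // flatfile.out instead when testing for real.
        Path file = Files.createTempFile("flatfile", ".out");
        Files.write(file, new byte[] {(byte) 0xE3, (byte) 0xCF, (byte) 0xED, (byte) 0xD1});

        byte[] bytes = Files.readAllBytes(file);
        for (String cs : new String[] {"windows-1256", "windows-1252", "UTF-8"}) {
            System.out.println(cs + ": " + arabicCountWhenReadAs(bytes, cs) + " Arabic letters");
        }
        Files.delete(file);
    }
}
```

                        The code page that yields Arabic letters is a good hint at the one the file was actually written in.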

                         

                        Another thing you can do is open your DB client SQL query tool, do a select on the table with the problematic characters, and spool the results to a file. Now try to open that file in Word. If this spooled file is displayed properly in MS Word and the flat file that you wrote with PC before is not, then we know that there is an issue with Informatica.

                         

                        Since the Informatica server leverages the Oracle client in order to talk to an Oracle instance, you might need to define the correct NLS_LANG env variable on the IS side.

                        Please make sure that you define NLS_LANG to exactly match the Oracle codepage.

                         

                         

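                        On UNIX, add a line like the following to the .profile of the user who starts the IS, or add it as a new env variable in the admin console:

                         

                        NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1

                         

                        The format is LANGUAGE_TERRITORY.CHARSET, and the CHARSET part must be an Oracle character set name. Replace it with the one that matches your setup, for example AR8MSWIN1256 (MS Windows Arabic) or AL32UTF8 (UTF-8).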
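
                        As a quick sanity check of the value, here is a small Java sketch that splits NLS_LANG into its LANGUAGE_TERRITORY.CHARSET parts and looks up the character set. The class name is made up, and the lookup table is a deliberately tiny assumption covering only a handful of Oracle character set names; it is not a complete list.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class NlsLangCheck {

    // Small, incomplete lookup from Oracle character set names to Java
    // charset names; extend it for your environment.
    static final Map<String, String> ORACLE_TO_JAVA = new HashMap<>();
    static {
        ORACLE_TO_JAVA.put("AL32UTF8", "UTF-8");
        ORACLE_TO_JAVA.put("AR8MSWIN1256", "windows-1256");
        ORACLE_TO_JAVA.put("AR8ISO8859P6", "ISO-8859-6");
        ORACLE_TO_JAVA.put("WE8MSWIN1252", "windows-1252");
        ORACLE_TO_JAVA.put("WE8ISO8859P1", "ISO-8859-1");
    }

    /** Splits LANGUAGE_TERRITORY.CHARSET into three parts; missing parts become "". */
    static String[] parse(String nlsLang) {
        int dot = nlsLang.lastIndexOf('.');
        String charset = dot >= 0 ? nlsLang.substring(dot + 1) : "";
        String rest = dot >= 0 ? nlsLang.substring(0, dot) : nlsLang;
        int us = rest.indexOf('_');
        String language = us >= 0 ? rest.substring(0, us) : rest;
        String territory = us >= 0 ? rest.substring(us + 1) : "";
        return new String[] {language, territory, charset};
    }

    /** Returns the Java charset name for the CHARSET part, or null if it is not in the table. */
    static String javaCharsetFor(String nlsLang) {
        return ORACLE_TO_JAVA.get(parse(nlsLang)[2].toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String value = System.getenv("NLS_LANG");
        if (value == null) {
            System.out.println("NLS_LANG is not set for this process");
            return;
        }
        String[] p = parse(value);
        System.out.println("language=" + p[0] + " territory=" + p[1]
                + " charset=" + p[2] + " java=" + javaCharsetFor(value));
    }
}
```

                        A null result only means the character set is missing from this small table; it is a prompt to double-check the spelling against the Oracle documentation, not proof that the value is wrong.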

                        • 9. Re: All the Arabic characters are getting loaded as Junk Characters
                          user101600 Guru

                          And as a last resort you can get the hex values for the problematic characters in order to confirm they are correct.
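
                          For reference, a minimal Java sketch of what those hex values look like for the first Arabic word of the sample, once as MS Windows Arabic (windows-1256) and once as UTF-8. The class and method names are made up for this example.

```java
import java.nio.charset.Charset;

public class HexBytes {

    /** Encodes the text with the given charset and returns the bytes as hex pairs. */
    static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Four Arabic letters (meem, dal, yeh, reh), given as Unicode escapes.
        String word = "\u0645\u062f\u064a\u0631";
        System.out.println("windows-1256: " + hex(word, "windows-1256")); // e3 cf ed d1
        System.out.println("UTF-8:        " + hex(word, "UTF-8"));        // d9 85 d8 af d9 8a d8 b1
    }
}
```

                          Comparing these patterns with a hex dump of the target file shows which code page the file was really written in.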

                          You also need to make sure that the target DB codepage is a superset of the source DB codepage.

                          For example, if you have an Arabic codepage in the source and the target DB is ISO-8859-1, obviously this will not work.

                          Make sure that the target DB can hold the Arabic data properly.
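
                          A rough way to test this outside the database is to ask a Java charset encoder whether it can represent the text. The sketch below is only an approximation, assuming the Java charset names stand in for the corresponding database code pages; the class and method names are made up.

```java
import java.nio.charset.Charset;

public class TargetCodePageCheck {

    /** Returns true if every character of the text can be encoded in the given charset. */
    static boolean canHold(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Four Arabic letters, given as Unicode escapes.
        String arabic = "\u0645\u062f\u064a\u0631";
        for (String cs : new String[] {"ISO-8859-1", "windows-1256", "UTF-8"}) {
            System.out.println(cs + " can hold the sample: " + canHold(arabic, cs));
        }
    }
}
```

                          ISO-8859-1 reports false here, while windows-1256 and UTF-8 report true, which matches the rule that the target code page has to cover everything the source can deliver.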

                          Try to manually insert the Arabic data via SQL*Plus from the PC IS box. If you don't have the Arabic font installed on the PC IS box, then you can insert the hex values.

                          • 10. Re: All the Arabic characters are getting loaded as Junk Characters
                            Nico Heinze Guru

                            Gabriel,

                             

                            your response was long enough to be one of mine. Well done, lad!

                             

                            Back to the matter. If you're interested, we can confirm whether PowerCenter transported the data correctly. I have Java transformations at hand which can e.g. write a hex dump of a character string to the session log; or a Java transformation which can produce arbitrary strings from a hex dump; and some more if needed.

                            If you're interested, let me know, then I'll post here what you need.

                             

                            Regards,

                            Nico

                            • 11. Re: All the Arabic characters are getting loaded as Junk Characters
                              Saurabh Saneja New Member

                              Thank you again, Smitha, Nico & Gabriel!

                               

                              Hi Gabriel,

                               

                              We have tried the checks that you had mentioned. Here is what we have got:

                               

                              1. It seems to be a display issue while writing to file as a target.

                              2. If the same values are loaded into the target database, it is showing the correct value.

                              3. We tried converting the flat file to a .doc extension and opened it with MS word, it seems it is still the same:

                               

                               

                              Hi inuser468357

                              We could definitely try using Java transformation as well. Please do share the same.

                               

                              However, we are still working on this. I will get back with our findings soon.

                               

                               

                              Thanks & Regards,

                              Saurabh

                              • 12. Re: All the Arabic characters are getting loaded as Junk Characters
                                Nico Heinze Guru

                                Hi Saurabh,

                                 

                                assuming you're asking for the Java transformation to produce a hex dump of an input string, here's a description of how to set it up.

                                I won't attach an XML export here because my XML file is pretty old, probably from some 9.6 version or so; as I don't want to damage your repository, it's safer for you to set up this JTX yourself.

                                 

                                1. Set up a passive(!) Java transformation in the Transformation Developer with the name JTX_Dump_String; description:

                                 

                                Dumps the given input string character by character in the session log with each character code and character type

                                if and only if the given input value "DisplayFlag" does not equal zero. If zero, nothing happens.

                                 

                                2. In the Input group, define one input-only port DisplayFlag of type Integer.

                                 

                                3. In the Output group, define one input-output port Str2Dump of type String(1048576).

                                 

                                4. Go to the Java Code tab.

                                 

                                5. At the top of the transformation, you find several tabs. Click on the Helper Code tab.

                                 

                                6. In the editor window, remove all existing code and text and insert the following text:

                                 

                                // Default value for characters per output buffer:

                                public static final int MAX_LEN = 64;

                                 

                                char[] hexChars = {'0', '1', '2', '3', '4', '5', '6', '7', '8' ,'9', 'a', 'b', 'c', 'd', 'e', 'f'};

                                 

                                // Text to be output to the session log:

                                StringBuilder text  = new StringBuilder( 100); // text content of "Str2Dump"

                                StringBuilder flags = new StringBuilder( 100); // flags (see explanation in session log)

                                StringBuilder line1 = new StringBuilder( 100); // dump line #1 (hex digit X... of Unicode value)

                                StringBuilder line2 = new StringBuilder( 100); // dump line #2 (hex digit .X.. of Unicode value)

                                StringBuilder line3 = new StringBuilder( 100); // dump line #3 (hex digit ..X. of Unicode value)

                                StringBuilder line4 = new StringBuilder( 100); // dump line #4 (hex digit ...X of Unicode value)

                                 

                                // Length of current input string:

                                int len;

                                // Index of first character in current output buffer:

                                int startOfBuffer;

                                // Index of current character in current output buffer:

                                int currentPos;

                                // Number of characters per output buffer:

                                int max_len;

                                // current character:

                                char ch;

                                // Flag: was previous character a high surrogate?

                                boolean wasPrevCharHighSurrogate;

                                // flag for current character:

                                char    flagCh;

                                 

                                // Has this JTX already been initialised?

                                boolean mustBeInitialised = true;

                                 

                                // Several messages are created using this:

                                String msg;

                                 

                                 

                                /** public boolean hasMore()

                                *  returns "true" if "startOfBuffer + currentPos < len",

                                *  i.e. if at least one more character is available in "Str2Dump".

                                */

                                public boolean hasMore()

                                { return startOfBuffer + currentPos < len;

                                }

                                 

                                7. Now click on the On Every Row tab. Remove all existing text and insert the following text:

                                 

                                // Do we have to print the explanation first?

                                if (mustBeInitialised)

                                {    logInfo( "This module dumps its input strings to the session log in hex dump format.");

                                    logInfo( "Sample input: 'OK: ' || Chr( U+20b9f) || ' . HS: ' || Chr( U+d843) || " +

                                        "' . LS: ' || Chr( U+df92) || ' . Done!' . U+20b9f is a Han character, " +

                                        "U+d843 and U+df92 are invalid (they do not represent valid characters).");

                                    logInfo( "In UTF-16, the Han character U+20b9f is represented by the two 16-bit code units " +

                                        "U+d842 and U+df9f. U+d842 and U+d843 are so-called \"high surrogates\" (introducing " +

                                        "a so-called \"supplemental Unicode character\"), U+df9f and U+df92 are so-called " +

                                        "\"low surrogates\" (finishing a supplemental Unicode character).");

                                    logInfo( "All surrogate characters are \"dumped\" as question marks \"?\". All white-space " +

                                        "characters (line feed, blank space, etc.) are \"dumped\" as a blank space \" \". " +

                                        "All control characters are \"dumped\" as a period \".\".");

                                 

                                    msg = String .format( "%s%n%s%n%s%n%s%n%s%n%s%n%s%n%s%n",

                                        "Output in this sample case:",

                                        "",

                                        "0000000000: OK:  ?? .  HS:  ? .  LS:  ? .  Don e!",

                                        "  Flags:    ...  <> .  ...  [ .  ...  ] .  ... ..",

                                        "            0000 dd00 0000 0d00 0000 0d00 0000 00",

                                        "            0000 8f00 0000 0800 0000 0f00 0000 00",

                                        "            4432 4922 2453 2422 2453 2922 2466 62",

                                        "            fba0 2f0e 083a 030e 0c3a 020e 04fe 51");

                                    logInfo( msg);

                                    msg = String .format( "%s%n%s%n",

                                        "Each line is introduced by its offset from the start of the string. " +

                                            "First line has offset 0, 2nd line offset 64, 3rd line = 128, and so on.",

                                        "The Unicode value of each character (e.g. the K in OK) is read from " +

                                            "top to bottom, yielding the character code U+004b.");

                                    logInfo( msg);

                                    msg = String .format( "%s%n%s%n%s%n%s%n%s%n%s%n%s%n%s%n%s%n%s%n%s%n",

                                        "Legend for flags:",

                                        "  (blank space) = blank space character.",

                                        "w = other white space (such as line feed, tab, and the like).",

                                        "< = high surrogate introducing a valid supplemental character.",

                                        "> = low surrogate finishing a valid supplemental character.",

                                        "[ = high surrogate not introducing a valid supplemental character.",

                                        "] = low surrogate not finishing a valid supplemental character.",

                                        "c = control character.",

                                        "0 = digit.",

                                        ". = other regular Unicode character.",

                                        "? = invalid Unicode character (dumped as a blank space).");

                                    logInfo( msg);

                                 

                                    // Don't forget to keep the fact that this module has been initialised by now:

                                    mustBeInitialised = false;

                                }

                                 

                                if (isNull( "DisplayFlag") || DisplayFlag != 0)

                                {    if (isNull( "Str2Dump"))

                                        len = -1;

                                    else

                                        len = Str2Dump .length();

                                 

                                    switch (len)

                                    {    case -1:    logInfo( "Input string is NULL.");

                                                break;

                                        case 0:    logInfo( "Input string is empty.");

                                                break;

                                        case 1:    logInfo( "Input string contains one code unit (16-bit).");

                                                break;

                                        default:    logInfo( "Input string contains " + len + " code units (16-bit) and " +

                                                    Str2Dump .codePointCount(0, len) + " Unicode characters.");

                                                break;

                                    }

                                 

                                    startOfBuffer = 0;

                                    currentPos    = 0;

                                    wasPrevCharHighSurrogate = false;

                                 

                                    // Main loop per string: is there more text to dump?

                                    while (hasMore())

                                    {    // Initialise max length of output lines:

                                        max_len = len - startOfBuffer;

                                        if (max_len > MAX_LEN)

                                            max_len = MAX_LEN;

                                 

                                        // Initialise the dump strings:

                                        text  .setLength( 0);

                                        text  .append( String .format( "%010d: ", startOfBuffer));

                                        flags .setLength( 0);

                                        flags .append( "  Flags:    ");

                                        line1 .setLength( 0);

                                        line1 .append( "            ");

                                        line2 .setLength( 0);

                                        line2 .append( "            ");

                                        line3 .setLength( 0);

                                        line3 .append( "            ");

                                        line4 .setLength( 0);

                                        line4 .append( "            ");

                                 

                                        while (currentPos < max_len)

                                        {    ch = Str2Dump .charAt( startOfBuffer + currentPos);

                                            currentPos ++;

                                            // determine flag:

                                            if (Character .isHighSurrogate( ch))

                                            {    if (hasMore() && Character .isLowSurrogate(

                                                        Str2Dump .charAt( startOfBuffer + currentPos)))

                                                    flagCh = '<';

                                                else

                                                    flagCh = '[';

                                            }

                                            else if (Character .isLowSurrogate( ch))

                                            {    // A low surrogate is only valid directly after a high surrogate:

                                                if (wasPrevCharHighSurrogate)

                                                    flagCh = '>';

                                                else

                                                    flagCh = ']';

                                            }

                                            else if (Character .isSpaceChar( ch))

                                                flagCh = ' ';

                                            else if (Character .isWhitespace( ch))

                                                flagCh = 'w';

                                            else if (Character .isISOControl( ch))

                                                flagCh = 'c';

                                            else if (Character .isDigit( ch))

                                                flagCh = '0';

                                            else if ( ! Character .isDefined( ch))

                                                flagCh = '?';

                                            else

                                                flagCh = '.';

                                 

                                            // Append correct "dump character" to text line:

                                            if (flagCh == ' ' || flagCh == 'w')

                                                text .append( ' ');

                                            else if (flagCh == '<' || flagCh == '>' || flagCh == '[' || flagCh == ']')

                                                text .append( '?');

                                            else if (flagCh == 'c')

                                                text .append( '.');

                                            else if (flagCh == '?')

                                                text .append( ' ');

                                            else

                                                text .append( ch);

                                 

                                            // Append flag character to flags line:

                                            flags .append( flagCh);

                                 

                                            // Append hex value characters to hex dump lines:

                                            line1 .append( hexChars [((int) ch) / 4096]);

                                            line2 .append( hexChars [(((int) ch) % 4096) / 256]);

                                            line3 .append( hexChars [(((int) ch) / 16) % 16]);

                                            line4 .append( hexChars [((int) ch) % 16]);

                                 

                                            // After each 4th character, insert an extra blank space:

                                            if (currentPos % 4 == 0)

                                            {    text  .append( ' ');

                                                flags .append( ' ');

                                                line1 .append( ' ');

                                                line2 .append( ' ');

                                                line3 .append( ' ');

                                                line4 .append( ' ');

                                            }

                                 

                                            // If this character was a high surrogate, keep this in mind:

                                            wasPrevCharHighSurrogate = Character .isHighSurrogate( ch);

                                 

                                        } // while current_Pos < max_len

                                 

                                        // Print partial dump to session log:

                                        msg = String .format( "%n%s%n%s%n%s%n%s%n%s%n%s%n%n", text, flags, line1, line2, line3, line4);

                                        logInfo( msg);

                                 

                                        // Advance buffer index:

                                        startOfBuffer += max_len;

                                        currentPos     = 0;

                                 

                                    } // while hasMore()

                                 

                                } // if display flag is NULL or != 0

                                 

                                8. Now click on the Compile link to the right below the editor window.

                                The message window of the Java transformation should now display "Java code compilation successful."

                                If this is not the case, please post all error messages here.

                                 

                                The Java transformation prints a short description of the hex dumps at the beginning of the session log. Please ask about anything which is unclear to you.
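
                                If you want to double-check the surrogate explanation outside PowerCenter, here is a small standalone Java sketch. It only uses the standard Character class; the class and method names are made up for this example and are not part of the transformation above.

```java
public class SurrogateDemo {

    /** Returns the UTF-16 code units of a code point as 4-digit hex values. */
    static String codeUnitsHex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The Han character from the explanation needs two code units.
        System.out.println(codeUnitsHex(0x20B9F)); // d842 df9f

        String han = new String(Character.toChars(0x20B9F));
        System.out.println(han.length() + " code units, "
                + han.codePointCount(0, han.length()) + " Unicode character");

        // A high surrogate on its own is not a valid character.
        char lone = (char) 0xD843;
        System.out.println("U+d843 is a high surrogate: " + Character.isHighSurrogate(lone));
    }
}
```

                                Arabic letters all sit in the basic multilingual plane, so in your dumps you should normally see one code unit per character and none of the surrogate flags.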

                                 

                                Regards,

                                Nico