Some non-ASCII characters appear as a thick vertical line in Talend 5.6 - talend

I have a file named "→Ψjohn.txt" and I want to remove the special characters from the file name so that it becomes "john.txt". However, Talend shows those characters as a thick vertical line, so it cannot find the source file in its physical location. Can anyone please suggest a solution?
The file exists both in the database and in a physical location. When I read the file name from the database, the job has to remove the special characters and update the name in the database as well as in the physical location.
In the database the file name looks like this:
[image: database]
When I read it from the database using Talend, it looks like this:
[image: talend]
Thanks in advance

You don't have to specify the exact file name. You can use tFileList to get all files in a specific directory, and you can also use a file mask or regex to restrict the names, for example to iterate over all *john.txt.
Once you have the actual file name, use a regex to remove the unwanted characters (for example \W for non-word characters; note that \W also matches the dot before the extension, so keep the dot out of the pattern) and rename the file through a system command or using tFileCopy.
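Outside Talend, the same idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names clean_name and rename_matching are made up for this example, and the allowed character set (ASCII letters, digits, dot, underscore, hyphen) is a guess at what counts as "wanted". The pattern deliberately keeps the dot, which a bare \W would strip.

```python
import re
from pathlib import Path


def clean_name(name: str) -> str:
    """Drop every character that is not an ASCII letter, digit, '.', '_' or '-'."""
    return re.sub(r"[^A-Za-z0-9._-]", "", name)


def rename_matching(directory: Path, pattern: str = "*john.txt") -> list:
    """Rename files in `directory` that match `pattern` to their cleaned names.

    Returns the new names. Files whose cleaned name is empty, unchanged,
    or already taken by another file are left alone.
    """
    renamed = []
    for path in sorted(directory.glob(pattern)):
        new_name = clean_name(path.name)
        if not new_name or new_name == path.name:
            continue
        target = path.with_name(new_name)
        if target.exists():
            continue  # do not overwrite an existing file
        path.rename(target)
        renamed.append(new_name)
    return renamed
```

The database update would have to use the same cleaned name, so that the stored name and the physical file stay in sync.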

After seeing the pictures: this is not a Talend issue. It has to do with the font used in the Talend (Eclipse) console and the encoding setting of Java.
Those rectangles (easier to see with a bigger font size) show that the font cannot represent your characters: it has no glyphs for them.
Talend (Eclipse) settings
In Talend, navigate to Window / Preferences and choose General / Appearance / Colors and Fonts (as described in the Eclipse help). Check what font you are using for Debug / Console font. I've had good results with the font Consolas, which is set from Talend version 6 onwards. Beforehand it was Courier New.
Java encoding
You should check that Java uses UTF-8 encoding to display the characters. The console encoding has to be set to UTF-8. See this answer for an explanation of how to do this.
Alternative
Alternatively you could store all the log data into a file and open this file in e.g. Notepad++ to see if the output is generated correctly and only displayed wrong.

Related

What are the characters shown on a file after forcefully changing the extension?

Recently I changed the extension of an .apk file to .txt. Despite this, I was able to open it in Notepad, where it showed seemingly random characters, many of which are not available on the keyboard:
org/antlr/runtime/ANTLRFileStream.class…TmOÓP=w[×QËÀ)ê|A…ÑETÔ¢NP¢™ãË—º•Q3ZÓcüþ¿j",£ß4ñGÏmÇñ˽Ïs{žçœçeûùëóW ±¨á0F5d0ÖA˔‹LÈã’ŠËR˜PqEƒ†Iy\•ØkÒºÞÁЂ´¦TL«˜H­95{ÙÚ°2K/­×–Y³Üªù(ð·:%œv\'¸!Гû÷óðª#¢èUܵä¸öòæÆÛ_±^ÔÂt^Ùª­Z¾#ýæc"XwêKž_5-7¨ù¦¿éΆmÞZ^Y*ÍS “ÛÖ¹µ¹7eûUàxn]%µ‘Ð^TÊvË^…kžUˆ;u_àTw<sÁ}µDL%ÛªØ>ùÄš#º…Rø˜¨;o)\,0ǚԞ݇ؓ‡àΪ<ò6ýr³¥GsÃ횪EOÌ_…É =è•Ç¬Ž#8ª£½ú^fùõ˜Ž›¸%pü IT{`Á2þ¶<Š:î`NÇ<î긇A˜èÿïˆ8Ç0Q¥»¨#- Ze7srRÉšíVƒõÐ]0rí&tÀ”O´‡[Y±K ö¬H›¯Ü %÷¬8Ì) r+åšW·ÑÏF†¿,bd—i%h³­ˆá8½YÄiª‘
This does not happen only with .apk files: other files such as .jar and .xapk show similar characters after the extension is changed. Can anyone please explain what these characters are based on, and how the OS decides which characters to show for an unsupported file?
Is there a way to get the original content through this data?
Let's say you created a text editor that can write, save, and open text files. You also defined the encoding it uses to store text as bytes (every file on disk is ultimately binary). Suppose your encoding, shown next to a hypothetical encoding used by Emacs, looks like this:
Your encoding          Emacs encoding
TEXT  BINARY           TEXT  BINARY
A     01000001         ă     01000001
B     01000010         Ћ     01000010
...                    ...
Z     01011010         Ϡ     01011010
Let's say you create a file with 'ABZ' as its contents. When saved, this file contains the value 010000010100001001011010. When you open it with your text editor, the editor reads 010000010100001001011010 and, using the encoding above, knows that this is 'ABZ', so it prints 'ABZ' on the screen.
Now let's say you open the same file in Emacs. Since Emacs uses its own encoding, it displays "ăЋϠ". There is nothing wrong with Emacs; it just doesn't know that the data was written using your custom encoding.
The point is that every file is written in a specific format. For example, the APK format (a ZIP-based package) is meant to be read by the Android system, not by a text editor. When you open an APK file in a text editor, the editor just tries to make sense of the binary data as text, in the same way Emacs does in the example above.
Is there a way to get the original content through this data?
If you know the original encoding (or file format) in which the data was written, you can read the contents of the file using that same encoding.
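The effect is easy to reproduce with real encodings. The following Python sketch is only an illustration: it uses UTF-8 and Latin-1 in place of the made-up encodings from the answer, and the helper name to_bits is invented for this example.

```python
def to_bits(raw: bytes) -> str:
    """Render bytes as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in raw)


# 'ABZ' stored as ASCII gives the same bit patterns as in the table above.
stored = "ABZ".encode("ascii")
print(to_bits(stored))  # 01000001 01000010 01011010

# Text with non-ASCII characters ("→Ψjohn"), written to disk as UTF-8 ...
original = "\u2192\u03a8john"
raw = original.encode("utf-8")

# ... and read back with the right and with a wrong encoding.
right = raw.decode("utf-8")
wrong = raw.decode("latin-1")  # never raises, but produces mojibake

print(right == original)           # True
print(wrong == original)           # False
print(len(original), len(wrong))   # 6 9

# Knowing the original encoding lets you undo the wrong interpretation.
recovered = wrong.encode("latin-1").decode("utf-8")
print(recovered == original)       # True
```

The bytes never change; only the table used to turn them into characters does. For a binary format such as an APK there is no text encoding that gives meaningful text, so the right tool is one that understands the format itself.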

FontLab Glyph Name to Unicode Mapping Not Working

I have created my own font pack with a custom encoding file, which worked great. I am just having trouble using the "Generate Unicodes" feature. I want the glyphs to automatically take on their corresponding Unicodes based on the mapping file.
I created a file called standard.nam and placed it in "FontLab VI.app/Contents/Resources" as suggested. I even created a custom file and pointed the app to it.
No matter what I do, the glyph names are never matched to their corresponding Unicodes. Nothing happens at all when I click the little refresh Unicode button.
How do I get this to work? I want the glyphs that use my custom encoding file to be assigned their Unicodes from the ".nam" database file. Below is my current file:
%FONTLAB NAMETABLE[: Database_name]
0x0020 !visiblespace
0x0020 space
0x0021 exclam
0x0022 quotedbl
0x0023 numbersign
0x0024 dollar
HELP PLEASE!!!
Thanks to the FontLab team I was able to fix it. The only change is in the header line. The updated file is below, and I hope it helps someone else:
%%FONTLAB NAMETABLE: Database Name
0x0020 !visiblespace
0x0020 space
0x0021 exclam
0x0022 quotedbl
0x0023 numbersign
0x0024 dollar

HtmlHelp hhc file doesn't show russian characters

I use Free Pascal's chmcmd command to create a CHM file from an HHP project. The content converts correctly, but the left pane (the contents tree) doesn't show Russian characters. I tried setting the charset in the HHC file to cp1251 and saving the file in Windows-1251 encoding. After that, the tree shows Russian correctly in Cool Reader but not in xCHM. On Windows it still doesn't work and shows only weird symbols. UTF-8 doesn't work at all.
The Microsoft CHM help format is very old and is no longer maintained. It wasn't created with Unicode in mind, and various tricks are needed in order to generate CHM files for certain encodings:
Your Windows must be set up for the target language of the help file
The content HTML pages must be created using the proper charset
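If the source pages are currently UTF-8, they can be converted to the target code page before compiling. The Python sketch below only illustrates the byte-level conversion; the helper name reencode is invented here, and whether chmcmd then renders the tree correctly is not something this snippet can confirm. The charset declared inside each HTML or HHC file has to match the encoding the file is actually saved in.

```python
def reencode(raw: bytes, source: str = "utf-8", target: str = "cp1251") -> bytes:
    """Decode bytes from `source` and encode them again as `target`.

    Raises UnicodeEncodeError if the text contains characters that the
    target code page cannot represent.
    """
    return raw.decode(source).encode(target)


utf8_bytes = "Привет".encode("utf-8")
cp1251_bytes = reencode(utf8_bytes)

# Cyrillic letters take two bytes each in UTF-8 but one byte in cp1251.
print(len(utf8_bytes), len(cp1251_bytes))  # 12 6
```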

Why does the ANSI view differ between Notepad and Notepad++?

I am writing some data to an XML file with ISO-8859 encoding. When I open the file in Notepad++, I can see the 'Â' character that is present in the file. But when I open the file in Notepad, the 'Â' character is missing. I am very new to encodings and don't know why this happens. Please suggest a reason.
The file also shows the 'Â' character when opened in a browser.
Thanks in Advance
Windows Notepad is a very basic editor with quite a number of limitations, one of which is its limited support for encodings other than ANSI, Unicode (UTF-16) and UTF-8. When editing files in other encodings, it can give unreliable or unexpected results.
If you are handling files in different encodings, you are better off avoiding Notepad altogether and using an editor (such as Notepad++) that has better support for multiple encodings.
For more information on how Windows notepad "guesses" at the correct format to use (with varying levels of success) see here
Bear in mind that other editors often use similar techniques to "guess" the format of a file, so it is often a good idea to check/set the encoding for a file manually (where possible) for less common encoding formats to ensure you get the correct results every time.
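To see how much the chosen encoding matters, the same bytes can be decoded under several candidate encodings. The Python sketch below is an illustration built on an assumption, not a diagnosis of the file in the question: it assumes the bytes contain a UTF-8 encoded non-breaking space (0xC2 0xA0), which is one common way a stray 'Â' shows up when a file is viewed as ISO-8859-1. The helper name decode_with is invented for this example.

```python
def decode_with(raw: bytes, encodings) -> dict:
    """Decode the same bytes under each encoding, replacing undecodable bytes."""
    return {enc: raw.decode(enc, errors="replace") for enc in encodings}


# "A", a non-breaking space (U+00A0), then "B", stored as UTF-8.
raw = "A\u00a0B".encode("utf-8")
print(raw)  # b'A\xc2\xa0B'

views = decode_with(raw, ["utf-8", "iso-8859-1"])

# Read as UTF-8, the text has 3 characters; read as ISO-8859-1 it has 4,
# because the byte 0xC2 turns into a separate 'Â' (U+00C2).
print(len(views["utf-8"]), len(views["iso-8859-1"]))  # 3 4
```

An editor that guesses UTF-8 shows no 'Â' at all, while one that uses ISO-8859-1 shows it, which is the kind of difference described in the question.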

Is there a way to get the encoding of a text file in UltraEdit?

Is there a setting in UltraEdit that allows me to see the encoding of the file?
In UltraEdit, the encoding used to display the file is shown towards the right of the status bar, together with the line-ending type in use, for example "U8-UNIX". You can also manually set the encoding in which the file is displayed; in version 10 this is under the menu View -> Set Code Page. You can also convert the actual code page of the file under the menu File -> Conversions.
If the file does not have a BOM (a few bytes at the start of the file that indicate the encoding), the actual encoding of the file can only be guessed. And even if the file has a BOM, there can still be encoding issues.
All text editors guess like this, and some are better at it than others. I haven't done a comparison to see which is best. At the moment (2012), I know UltraEdit fails to detect UTF-8 and other variants in text files of 1000 lines or more if the first non-ASCII (multi-byte UTF-8) character only appears late in the document. It also fails to show the encoding properly when you set it manually.
Notepad++ is also not great at detecting it, but when you know the encoding, you can set it manually.
Sublime Text is, as far as I know, best at detecting the encoding, also in large files.
I think there are also some very good command line tools out there, ported from GNU to Windows, to detect encoding. My bet would be that that's going to be the best option.
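The BOM check mentioned above is simple enough to sketch in Python. This is a minimal illustration, not what UltraEdit or any other editor actually does: the function name sniff_bom and the returned codec names are choices made for this example, and a result of None only means "no BOM found", after which the encoding still has to be guessed.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw: bytes):
    """Return a Python codec name if `raw` starts with a known BOM, else None."""
    for bom, codec_name in BOMS:
        if raw.startswith(bom):
            return codec_name
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))  # utf-8-sig
print(sniff_bom(b"plain ASCII, no BOM"))      # None
```

The returned names can be passed straight to bytes.decode(): "utf-8-sig" strips the BOM, and "utf-16" and "utf-32" use it to pick the byte order. One known ambiguity remains: a UTF-16 LE file whose first character is U+0000 looks like UTF-32 LE to this check.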