I have received this in a name field (so it should be a person's name)
Ð˜Ð³Ð¾Ñ€Ñœ
What could that decode to? Is it UTF-8? What language does that translate to? Russian?
If you can give me a hint, or maybe links to websites that explain how to work out what meaningful letters I should get out of that, it would be helpful, thank you :)
This typically is UTF-8 interpreted as some single-byte Windows encoding.
String s = "Ð˜Ð³Ð¾Ñ€Ñœ"; // the garbled text as received; source encoding UTF-8
byte[] b = s.getBytes("Cp1252");                            // recover the original raw bytes
System.out.println(new String(b, StandardCharsets.UTF_8));  // re-decode them as UTF-8
// Игорќ
Data like this is easily corrupted further. Above I got a sensible result with Windows-1252 (MS Windows Latin-1). The Java source must be compiled with encoding UTF-8 to accept those characters in the string literal.
Since you already pasted the original data into a UTF-8 encoded site such as Stack Overflow, it is now corrupt data perfectly encoded as UTF-8. If you want to find out anything definite about the data's encoding, you need to use a hexadecimal editor or a similar tool on the original raw bytes.
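If a hex editor is not at hand, the same inspection can be done with a few lines of Java (a minimal sketch; the class name and file path are placeholders). Dumping the raw bytes would show the D0/D1 lead bytes that are typical of UTF-8 encoded Cyrillic (for example, "Игорќ" in UTF-8 is D0 98 D0 B3 D0 BE D1 80 D1 9C):

import java.nio.file.Files;
import java.nio.file.Paths;

public class HexDump {
    public static void main(String[] args) throws Exception {
        // Read the file exactly as stored, with no charset conversion at all
        byte[] raw = Files.readAllBytes(Paths.get("received-name.txt"));
        StringBuilder hex = new StringBuilder();
        for (byte b : raw) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(hex);
    }
}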
In any case, if you do this:
Open a text file in some single-byte encoding (possibly the ANSI code page used by your copy of Windows, I used Windows-1252)
Paste the Ð˜Ð³Ð¾Ñ€Ñœ gibberish and save the file
Reload the file as UTF-8
... you get this:
Игорќ
So it's probably valid UTF-8 incorrectly decoded.
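That visual round trip can also be checked programmatically. Below is a minimal sketch (class name made up) that re-encodes the suspect string as Windows-1252 and then attempts a strict UTF-8 decode; if the decode succeeds without error, the text really was valid UTF-8 shown in the wrong encoding:

import java.nio.ByteBuffer;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MojibakeCheck {
    public static void main(String[] args) throws Exception {
        String suspect = "Ð˜Ð³Ð¾Ñ€Ñœ";
        // The bytes behind the text, assuming it was shown to us through Windows-1252
        byte[] raw = suspect.getBytes("windows-1252");
        // Strict decode: REPORT makes malformed input throw instead of becoming U+FFFD
        String decoded = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw))
                .toString();
        System.out.println(decoded); // Игорќ
    }
}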
I am integrating data using some flat files. I'm getting the flat files delivered by FTP as .csv-files out of MS SQL exports from a business partner.
I asked him to encode it as UTF-8 (just using the standard I thought).
Now I can see in his files that a lot of codes such as "&#233;" appear as plain text when I open it in Notepad++ (and also using my "ETL" tool).
Before I ask him to fix it into proper UTF-8, I would like to understand the issue and whether my claim is actually correct?
Shouldn't special characters be shown as special characters when I open them in Notepad++ and not as plain text UTF-8 codes?
Any help is much appreciated :))
Cheers
Martin
&#233; is an HTML entity (it stands for é). For some reason the text is HTML formatted, which I wouldn't count as "plaintext"/flat files. The file may or may not be encoded in UTF-8 in addition to that; we can't tell from the information given.
A file containing "special characters" (meaning non-ASCII characters) encoded in UTF-8 opened in a text editor which correctly interprets the file as UTF-8 looks exactly like the text it should look like, e.g.:
正式名称は、ISO/IEC 10646では “UCS Transformation Format 8”、Unicodeでは “Unicode Transformation Format-8” という。両者はISO/IEC 10646とUnicodeのコード重複範囲で互換性がある。RFCにも仕様がある。
Put this in a file, save it as UTF-8, open it in another application as UTF-8, and this is what the text should look like.
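If the partner cannot change the export right away, one stop-gap (a sketch only, assuming the Apache Commons Text dependency is acceptable; the sample string is made up) is to unescape the numeric character references after reading each field:

import org.apache.commons.text.StringEscapeUtils;

public class UnescapeDemo {
    public static void main(String[] args) {
        String field = "Caf&#233; M&#252;ller";           // as delivered in the .csv
        String text = StringEscapeUtils.unescapeHtml4(field);
        System.out.println(text);                         // Café Müller
    }
}

The proper fix is still for the export to write the characters themselves, encoded as UTF-8, rather than HTML entities.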
I am working on files whose encoding is unknown at first, but I get the encoding with these lines in Java:
InputStream in = new FileInputStream(new File("D:\\lbl2\\1 (26).LBL"));
InputStreamReader inputStreamReader = new InputStreamReader(in);
System.out.print(inputStreamReader.getEncoding());
and I get UTF8 as the output.
But the problem is that when I try to view the file content in a browser or a text editor like Notepad++, I can't see the characters correctly. When I change the encoding to Windows-1256, all of the characters display correctly and are readable.
Am I making a mistake somewhere?
Java does not attempt to detect the encoding of a file. getEncoding returns the encoding that was selected in the InputStreamReader constructor. If you don't use one of the constructors that take a character set parameter, you get the 'platform default charset', according to Oracle's documentation.
This question discusses what the platform default charset is, and how you can change it.
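A quick way to see what that platform default actually is on a given machine (a minimal sketch; class name is a placeholder):

import java.nio.charset.Charset;

public class ShowDefaultCharset {
    public static void main(String[] args) {
        // This is what InputStreamReader uses when no charset is passed in
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
    }
}

Relying on that default is fragile, which is why passing the charset explicitly, as below, is the safer route.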
If you know in advance that this file is Windows-1256, you can use:
InputStreamReader inputStreamReader = new InputStreamReader(in, "Windows-1256");
Attempting to detect the encoding of a file usually fails - see for example the "Bush hid the facts" issue in Windows Notepad.
Unfortunately there is no 100% reliable way to detect the encoding of a file and as the other answer points out Java by default doesn't try. It simply assumes the platform's default encoding.
If you know the files are all in a single encoding then great, you can just specify that encoding and life is good.
If you know that some files are in UTF-8 and some files are in a single legacy encoding then you can generally get away with trying a strict* UTF-8 decode first. If the strict UTF-8 decode errors out then you move on to your legacy encoding.
If you have a wider mix of encodings, things get considerably harder; you may have to resort to some quite complex language processing to sort them out.
* I believe that to get a strict decode in Java you need to first get the Charset, obtain a CharsetDecoder from it, and then use the onMalformedInput method (with CodingErrorAction.REPORT) to put it in strict mode.
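A sketch of that try-strict-UTF-8-then-fall-back idea (class name is a placeholder; the file path is taken from the question above, and Windows-1256 stands in for whatever single legacy encoding you expect):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8OrLegacy {
    static String read(Path file, Charset legacy) throws IOException {
        byte[] raw = Files.readAllBytes(file);
        try {
            // Strict UTF-8: malformed sequences throw instead of becoming U+FFFD
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so assume the known legacy encoding
            return new String(raw, legacy);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(read(Paths.get("D:\\lbl2\\1 (26).LBL"), Charset.forName("Windows-1256")));
    }
}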
I don't want vim to ever interpret my data in any encoding specific way. In other words, when I'm in vim, I want the character that my cursor is on to correspond to the actual byte, not a utf* (etc.) representation of that byte.
I need to use vim to analyze issues caused by Unicode conversion errors made by other people (using other software) so it's important that I see what is actually there.
For example, in Cygwin's vim, I have been able to see UTF-8 BOMs as
ï»¿[START OF FILE DATA]
This is perfect. I recognize this as a UTF-8 BOM and if I want to know what the hex for each character is, I can put the cursor on the characters and use 'ga'.
I recently got a proper Linux machine (Fedora). In /etc/vimrc, this line exists
set fileencodings=ucs-bom,utf-8,latin1
When I look at a UTF-8 BOM on this machine, the BOM is completely hidden.
When I add the following line to ~/.vimrc
set fileencodings=latin1
I see
Ã¯Â»Â¿
The first 3 characters are the BOM (when ga is used against them). I don't know what the last 3 characters are.
At one point, I even saw the UTF-8 BOM represented as "feff" - the UTF-16 BOM.
Anyway, you see my problem. I need to see exactly what is in my file without vim interpreting the bytes for me. I know I could use xxd, od, etc but vim has always been very convenient as an analysis tool. Plus I want to be able to edit the files and save them without any conversion problems.
Thanks for your help.
Use 'binary' mode:
:edit ++bin file
or
vim -b file
From :help 'binary':
The 'fileencoding' and 'fileencodings' options will not be used, the
file is read without conversion.
I get some good mileage from doing :e ++enc=latin1 after loading the file (Vim's initial guess on the encoding isn't important at this stage).
The sequence Ã¯Â»Â¿ is actually the U+FEFF (BOM) encoded UTF-8, decoded latin1, encoded UTF-8, and decoded latin1 again. ï»¿ is the U+FEFF (BOM) encoded as UTF-8 and decoded as latin1. You can't get away from encodings. Those aren't the actual bytes, they are the latin1 characters displayed from an incorrect decoding. If you want bytes, use a hex editor; otherwise, use the correct decoding.
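That chain is easy to reproduce in a few lines of Java (a minimal sketch of the two decodings described above; class name is made up):

import java.nio.charset.StandardCharsets;

public class BomChain {
    public static void main(String[] args) {
        String bom = "\uFEFF";
        // U+FEFF encoded as UTF-8 (EF BB BF), decoded as latin1
        String once = new String(bom.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        // ...encoded as UTF-8 again, decoded as latin1 again
        String twice = new String(once.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(once);   // ï»¿
        System.out.println(twice);  // Ã¯Â»Â¿
    }
}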
I am gathering information from a Hebrew (WINDOWS-1255 / UTF-8 encoded) website using VBScript and the WinHttp.WinHttpRequest.5.1 object.
For Example :
Set objWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
...
'writes the file as unicode (can't use Ascii)
Set Fileout = FSO.CreateTextFile("c:\temp\myfile.xml", true, true)
....
Fileout.WriteLine(objWinHttp.responsetext)
When viewing the file in Notepad / Notepad++, I see the Hebrew as gibberish.
For example :
äìëåú - äøá àáøäí éåñó - îåøùú
I need a VBScript function that returns the Hebrew correctly. It should behave like the converter at http://www.pixiesoft.com/flip/ : choose the 2nd radio button and press the Convert button, and you will see the Hebrew correctly.
Your script is correctly fetching the byte stream and saving it as-is. No problems there.
Your problem is that the local text editor doesn't know that it's supposed to read the file as cp1255, so it tries the default on your machine of cp1252. You can't save the file locally as cp1252, so that Notepad will read it correctly, because cp1252 doesn't include any Hebrew characters.
What is ultimately going to be reading the file or byte stream, that will need to pick up the Hebrew correctly? If it does not support cp1255, you will need to find an encoding that is supported by that tool, and convert the cp1255 string to that encoding. Suggest you might try UTF-8 or UTF-16LE (the encoding Windows misleadingly calls 'Unicode'.)
Converting text between encodings in VBScript/JScript can be done as a side-effect of an ADODB stream. See the example in this answer.
Thanks to Charming Bobince (who posted the answer), I am now able to see the Hebrew correctly (saving windows-1255 encoded text to a .txt file viewed in Notepad) by implementing the following:
Function ConvertFromUTF8(sIn)
    ' Write the string out as raw single-byte ("ANSI") data,
    ' then re-read those same bytes as Windows-1255.
    Dim oIn: Set oIn = CreateObject("ADODB.Stream")
    oIn.Open
    oIn.CharSet = "X-ANSI"
    oIn.WriteText sIn
    oIn.Position = 0
    oIn.CharSet = "WINDOWS-1255"
    ConvertFromUTF8 = oIn.ReadText
    oIn.Close
End Function
I have a .csv file generated by Excel that I got from my customer. My software has to open and parse it in Java. I'm using universalchardet, but it did not detect the encoding from the first 1,000 bytes of the file.
Within these 1,000 first bytes, there is a sequence that should be read as Boîte, however I cannot find the correct encoding to use to convert this file to UTF-8 strings in java.
In the file, Boîte is encoded as 42,6F,94,74,65 (read using a hex editor). B, o, t and e are using the standard latin encoding with 1 byte per character. The î is also encoded on only one byte, 0x94.
I don't know how to guess this charset, none of my searches online yielded relevant results.
I also tried to use file on that file:
$ file export.csv
/Users/bicou/Desktop/export.csv: Non-ISO extended-ASCII text, with CR line terminators
However, when I looked at the extended-ASCII chart, the value 0x94 stands for ö.
Have you got other ideas for guessing the encoding of that file?
This was Mac OS Roman encoding. When using the following Java code, the text was properly decoded:
InputStreamReader isr = new InputStreamReader(new FileInputStream(targetFileName), "MacRoman");
I don't know how to delete my own question. I don't think it is useful anymore...
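Since the original goal was UTF-8 strings, here is a follow-up sketch building on the answer above (class name and output file name are placeholders): it re-reads the CSV as MacRoman and writes a UTF-8 copy.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MacRomanToUtf8 {
    public static void main(String[] args) throws Exception {
        Charset macRoman = Charset.forName("MacRoman"); // same charset name as in the answer
        try (BufferedReader in = Files.newBufferedReader(Paths.get("export.csv"), macRoman);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("export-utf8.csv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);   // "Boîte" comes through intact
                out.newLine();
            }
        }
    }
}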