How to properly store umlauts and accents in a vCard file? - encoding

From the RFC for vCard 4.0 I learned that vCard 4.0 is always UTF-8.
I am using ez-vcard to export contacts into an export.vcf file transferred via HTTP:
response.setContentType("text/vcard; charset=utf-8");
response.setStatus(HttpServletResponse.SC_OK);
PrintWriter writer = response.getWriter();
VCardWriter vCardWriter = new VCardWriter(writer, VCardVersion.V4_0);
// add cards...
vCardWriter.close();
Guess what? Characters are not being encoded properly. If I open the file in a text editor, I see characters are messed up.
Any help?

It may be ignoring the character encoding specified in the content type because you are setting it to something other than text/html.
Try setting the character encoding using setCharacterEncoding() instead (make sure to call it before calling getWriter()).
response.setContentType("text/vcard");
response.setCharacterEncoding("UTF-8");
response.setStatus(HttpServletResponse.SC_OK);
PrintWriter writer = response.getWriter();
Also, make sure your text editor is reading the file correctly. During my testing, I found that Eclipse would not display UTF-8 characters correctly, because it was configured to load the file under a different character set. Try viewing the file contents from the terminal:
cat the-vcard-file.vcf
EDIT: One more thing: Do not close the VCardWriter object. This will close the servlet's PrintWriter object, which you must never close!!
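To see why the writer's charset matters, here is a minimal sketch that runs outside any servlet container. It pushes the same name through a Writer once with ISO-8859-1 (the encoding the servlet API falls back to for getWriter() when none has been set) and once with UTF-8, then prints the raw bytes. The class and method names (WriterCharsetDemo, write, hex) are made up for illustration and are not part of ez-vcard or the servlet API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WriterCharsetDemo {

    // Writes the text through a Writer bound to the given charset
    // and returns the bytes that would end up in the response body.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "M\u00FCller"; // "Müller", escaped so the source encoding does not matter
        System.out.println(hex(write(name, StandardCharsets.ISO_8859_1))); // 4D FC 6C 6C 65 72
        System.out.println(hex(write(name, StandardCharsets.UTF_8)));      // 4D C3 BC 6C 6C 65 72
    }
}
```

A lone FC byte is not valid UTF-8, so an editor that opens the first variant as UTF-8 shows a replacement character or garbage where the umlaut should be.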

Related

How to change the default encoding and decoding type of MobaTextEditor?

I am using MobaXterm to connect remotely to my server. I quickly found MobaTextEditor, the built-in editor of MobaXterm, and used it to edit the code I had uploaded to the server. However, MobaTextEditor uses GBK to decode and encode files, which leads to errors because my code was saved as UTF-8, the default encoding. Is there any way to change the default encoding of MobaTextEditor? Any help will be greatly appreciated.
I asked the MobaXterm team about this and received an answer.
There is currently no way to change the charset in MobaTextEditor.
They told me this feature will be added later, but did not mention an exact date.
Until then, a workaround may help to resolve your problem.
In my case, I need text files on my server that are read and inserted into a MySQL DB.
I write the text files with a BufferedWriter using "utf-8", then read them back and insert the content into the DB.
That way no characters are lost.
To write a text file:
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "utf-8"));
To read a text file:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file), "utf-8"));
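Put together, the two fragments above could look like the following round trip. This is only a sketch: it swaps in Files.newBufferedWriter and Files.newBufferedReader with StandardCharsets.UTF_8 (equivalent to the stream wrappers above), uses a temporary file, and the class and method names are invented for the example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8FileRoundTrip {

    // Writes one line to a temporary file as UTF-8, reads it back as UTF-8,
    // and returns what was read. The temporary file is deleted afterwards.
    static String roundTrip(String line) {
        try {
            Path file = Files.createTempFile("utf8-demo", ".txt");
            try {
                try (BufferedWriter bw = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                    bw.write(line);
                    bw.newLine();
                }
                try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    return br.readLine();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u4E2D\u6587 \u00FC"; // two CJK characters and "ü"
        System.out.println(original.equals(roundTrip(original))); // true
    }
}
```

As long as the writer and the reader name the same charset explicitly, the editor's default (GBK in this case) never enters the picture.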

What does this decode to, and is it UTF? Игорќ

I have received this in a name field (so it should be a person's name)
Игорќ
What could that decode to? Is it UTF-8? What language does that translate to? Russian?
If you can give me a hint or maybe links to websites that explain what meaningful letters I should get out of that would be helpful, thank you :)
This typically is UTF-8 interpreted as some single-byte Windows encoding.
String s = "Игорќ"; // Source encoding UTF-8
byte[] b = s.getBytes("Cp1252");
System.out.println("" + new String(b, StandardCharsets.UTF_8));
// Игорќ
The data might easily get corrupted. Above I got the result using Windows-1252 (MS Windows Latin-1). The Java source must be compiled with encoding UTF-8 to accept those characters.
Since you already pasted the original text into a UTF-8 encoded site such as Stack Overflow, it is now corrupt data perfectly encoded as UTF-8. If you want to find out anything about the data's encoding, you need to use a hexadecimal editor or a similar tool on the original raw bytes.
In any case, if you do this:
Open a text file in some single-byte encoding (possibly the ANSI code page used by your copy of Windows, I used Windows-1252)
Paste the Игорќ gibberish and save the file
Reload the file as UTF-8
... you get this:
Игорќ
So it's probably valid UTF-8 incorrectly decoded.
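The steps above, together with a hex dump of the intermediate bytes, can be sketched in Java as follows. The garbled name is written with Unicode escapes so the example does not depend on the source file's encoding. Windows-1252 is an assumption here, as in the steps above, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {

    // Re-encodes the garbled text with the charset it was wrongly decoded as,
    // which recovers the bytes as they were before the wrong decode.
    static byte[] rawBytes(String garbled, Charset wrongCharset) {
        return garbled.getBytes(wrongCharset);
    }

    // Decodes the recovered bytes as UTF-8.
    static String repair(String garbled, Charset wrongCharset) {
        return new String(rawBytes(garbled, wrongCharset), StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The garbled name from the question, character by character.
        String garbled = "\u00D0\u02DC\u00D0\u00B3\u00D0\u00BE\u00D1\u20AC\u00D1\u0153";
        Charset cp1252 = Charset.forName("windows-1252"); // assumed wrong charset
        System.out.println(hex(rawBytes(garbled, cp1252))); // D0 98 D0 B3 D0 BE D1 80 D1 9C
        System.out.println(repair(garbled, cp1252));        // Cyrillic text, if the console can show it
    }
}
```

The byte pairs starting with D0 and D1 are the typical signature of Cyrillic text encoded as UTF-8.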

File encoding is reported as UTF-8, but the file is only readable as Windows-1256

I am working on files whose encoding is unknown at first, but I get the encoding with these lines in Java:
InputStream in = new FileInputStream(new File("D:\\lbl2\\1 (26).LBL"));
InputStreamReader inputStreamReader = new InputStreamReader(in);
System.out.print(inputStreamReader.getEncoding());
and we get UTF8 as output.
But the problem is that when I try to view the file content with a browser or a text editor like Notepad++, I can't see the characters correctly. When I change the encoding to Windows-1256 instead, all characters are displayed correctly and are readable.
Am I making a mistake?
Java does not attempt to detect the encoding of a file. getEncoding returns the encoding that was selected in the InputStreamReader constructor. If you don't use one of the constructors that take a character set parameter, you get the 'platform default charset', according to Oracle's documentation.
This question discusses what the platform default charset is, and how you can change it.
If you know in advance that this file is Windows-1256, you can use:
InputStreamReader inputStreamReader = new InputStreamReader(in, "Windows-1256");
Attempting to detect the encoding of a file usually fails - see for example the "Bush hid the facts" issue in Windows Notepad.
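A small sketch can make the point about getEncoding concrete: the reader reports whatever charset it was constructed with, even when the bytes are not valid in that charset. The byte values and the class and method names below are chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GetEncodingDemo {

    // Returns the name the reader reports for itself. Nothing is read from
    // the stream, so the content cannot influence the answer.
    static String reportedEncoding(byte[] content, Charset charset) {
        InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(content), charset);
        return reader.getEncoding();
    }

    public static void main(String[] args) {
        // Two bytes that do not form valid UTF-8.
        byte[] content = {(byte) 0xC7, (byte) 0xE1};
        System.out.println(reportedEncoding(content, StandardCharsets.UTF_8));      // UTF8
        System.out.println(reportedEncoding(content, StandardCharsets.ISO_8859_1)); // ISO8859_1
    }
}
```

getEncoding returns the historical charset name, which is why the question saw UTF8 rather than UTF-8.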
Unfortunately there is no 100% reliable way to detect the encoding of a file and, as the other answer points out, Java by default doesn't try. It simply assumes the platform's default encoding.
If you know the files are all in a single encoding then great, you can just specify that encoding and life is good.
If you know that some files are in UTF-8 and some files are in a single legacy encoding then you can generally get away with trying a strict* UTF-8 decode first. If the strict UTF-8 decode errors out then you move on to your legacy encoding.
If you have a wider mix of encodings, things get considerably harder; you may have to resort to some quite complex language processing to sort them out.
* I believe that to get a strict decode in Java you need to first get the "Charset", then get a "CharsetDecoder" from it, and then use the "onMalformedInput" method (with CodingErrorAction.REPORT) to set it to strict mode.
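A minimal sketch of the "strict UTF-8 first, legacy encoding second" approach described above might look like this. The legacy charset is a parameter because it has to be known in advance; the class and method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Fallback {

    // Tries a strict UTF-8 decode; if the bytes are not valid UTF-8,
    // decodes them with the given legacy charset instead.
    static String decode(byte[] data, Charset legacy) {
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of substituting
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            return new String(data, legacy);
        }
    }

    public static void main(String[] args) {
        Charset legacy = StandardCharsets.ISO_8859_1; // stand-in for the known legacy charset

        byte[] validUtf8 = "\u00E9".getBytes(StandardCharsets.UTF_8); // "é" as C3 A9
        byte[] legacyOnly = {(byte) 0xE9};                            // "é" in ISO-8859-1, invalid as UTF-8

        System.out.println(decode(validUtf8, legacy).equals("\u00E9"));  // true
        System.out.println(decode(legacyOnly, legacy).equals("\u00E9")); // true
    }
}
```

For the file in the question, the legacy argument would be Charset.forName("windows-1256"), provided the runtime ships that charset. The approach works because text in a single-byte legacy encoding that contains non-ASCII characters is rarely valid UTF-8 by accident, but it is a heuristic, not a detector.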

Saving outlook mail items in UTF-8 / Unicode using C#

We have created an Outlook Plugin which (amongst other things) can be used to save Mail items in text form to a specific folder. However, the text of the resulting text file is encoded in ANSI and I would like to save it as UTF8. I have already set the Codepage of the mail item like so:
mail = (MailItem)objItem;
mail.InternetCodepage = 65001; // equal UTF8 encoding; see http://msdn.microsoft.com/en-us/library/office/ff860730.aspx
mail.SaveAs(filePath, olSaveAsType);
However, the resulting file is saved as "ANSI as UTF8" and all extended characters (e.g. in Arabic or Russian) come out as question marks.
Does anyone know how I can save the mail item in utf8?
Thanks a lot.
Cheers,
Martin
Instead of trying to set the encoding, try reading InternetCodepage and then using a System.Text.Encoding object to read the saved file into a string. You could then convert and re-save the string as another file in the encoding you prefer.

problem while parsing the CDATA

<text><![CDATA[øCu·l es tu principal reto, objetivo o problema?]]></text>
While parsing the above tag, the parser crashes.
How do I parse the CDATA?
The same line appears on Windows like this:
<text><![CDATA[¿Cuál es tu principal reto, objetivo o problema?]]></text>
The parser crashes because of the special characters.
Why are they converted into special characters on the Mac?
How can I solve this?
Well for one, the string as you post it here looks like something has gone wrong with the encoding. "ø" is not a Spanish character.
What xml parser are you using? I would guess that somewhere in that string is a character, possibly hidden, or maybe it's "ø" which makes your parser crash.
Edit (in response to the OP's comment)
I will try to guess what is happening and hope you can use my guess to resolve what is actually happening. So when you created the xml file you used some editor. This editor used a particular encoding. This means that it transferred the characters on your screen into bytes on your disk using a particular mapping from character into bytes (it encoded the characters into bytes). There are many different encodings, one common encoding is called Latin-1. So let's assume the file was encoded using Latin-1. After creating it, you transferred the file onto another machine where you opened it in a different editor. Now, how does the new editor know the encoding of the file? The answer is that it probably tried to guess the encoding. Now here is where the problem arises: it guessed wrong and interpreted the bytes using an encoding other than Latin-1.
While you have your (garbled) file open in an editor try selecting different encodings from the menu. The one that displays all your special characters correctly is likely to be the one used when the file was created.
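The wrong guess can be reproduced in a few lines. The sketch below assumes, purely as a guess that happens to match the two renderings in the question, that the file was saved as Latin-1 and opened as MacRoman. The charset name "x-MacRoman" is the JDK's name for that encoding and may not be available on every runtime; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EditorGuessDemo {

    // Encodes the text the way the first editor saved it, then decodes the
    // same bytes the way the second editor guessed.
    static String misread(String original, Charset savedAs, Charset openedAs) {
        return new String(original.getBytes(savedAs), openedAs);
    }

    public static void main(String[] args) {
        String original = "\u00BFCu\u00E1l"; // "¿Cuál"
        Charset macRoman = Charset.forName("x-MacRoman"); // assumed wrong guess
        System.out.println(misread(original, StandardCharsets.ISO_8859_1, macRoman)); // øCu·l
    }
}
```

The bytes on disk never change; only the mapping used to display them does, which is why switching the encoding in the editor makes the text readable again.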
Edit 2
But my other question remains: what xml parser are you using?
Edit 3
Ok, so now when you write "crashing", do you actually mean crashing or does it just return? Do you get an error message? If yes, what? Can you do the following:
Remove the funny characters from this line and run your code on the following:
<text><![CDATA[l es tu principal reto, objetivo o problema?]]></text>
Does it still crash?