Saving Outlook mail items in UTF-8 / Unicode using C#

We have created an Outlook plugin which (amongst other things) can be used to save mail items in text form to a specific folder. However, the resulting text file is encoded in ANSI and I would like to save it as UTF-8. I have already set the codepage of the mail item like so:
mail = (MailItem)objItem;
mail.InternetCodepage = 65001; // equals UTF-8 encoding; see http://msdn.microsoft.com/en-us/library/office/ff860730.aspx
mail.SaveAs(filePath, olSaveAsType);
However, the resulting file is saved as "ANSI as UTF8" and all extended characters (e.g. in Arabic or Russian) come out as question marks.
Does anyone know how I can save the mail item in UTF-8?
Thanks a lot.
Cheers,
Martin

Instead of trying to set the encoding, try reading InternetCodepage and then using a System.Text.Encoding object to read the saved file into a string. You could then convert and re-save the string as another file in the encoding you prefer.
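A minimal sketch of that approach in C#, reusing mail and filePath from the question and assuming the SaveAs call produced a plain-text file:

using System.IO;
using System.Text;

// ... after mail.SaveAs(filePath, olSaveAsType) has run:
int codepage = mail.InternetCodepage;                      // the codepage Outlook actually used
Encoding sourceEncoding = Encoding.GetEncoding(codepage);
string text = File.ReadAllText(filePath, sourceEncoding);  // decode with that encoding
File.WriteAllText(filePath, text, new UTF8Encoding(true)); // re-save as UTF-8 (with a BOM)

The BOM is what makes editors like Notepad show the file as UTF-8 rather than ANSI; pass new UTF8Encoding(false) instead if downstream tools object to it.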

Related

What does this decode to, and is it UTF? Игорќ

I have received this in a name field (so it should be a person's name)
Игорќ
What could that decode to? Is it UTF-8? What language does that translate to? Russian?
If you can give me a hint, or maybe links to websites that explain how to work out what meaningful letters I should get out of that, it would be helpful, thank you :)
This is typically UTF-8 interpreted as some single-byte Windows encoding.
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String s = "Игорќ"; // the gibberish: UTF-8 bytes that were mis-decoded as Cp1252
byte[] b = s.getBytes(Charset.forName("windows-1252")); // recover the original UTF-8 bytes
System.out.println(new String(b, StandardCharsets.UTF_8));
// prints: Игорќ
Such data can easily get corrupted along the way. Above, I got a sensible result with Windows-1252 (MS Windows Latin-1). The Java source must be compiled with encoding UTF-8 to accept those characters.
Since you have already pasted the original gibberish into a UTF-8 encoded site such as Stack Overflow, it is now corrupt data perfectly encoded as UTF-8. If you want to find out anything certain about the data's encoding, you need to use a hexadecimal editor or a similar tool on the original raw bytes.
In any case, if you do this:
1. Open a text file in some single-byte encoding (possibly the ANSI code page used by your copy of Windows; I used Windows-1252)
2. Paste the Игорќ gibberish and save the file
3. Reload the file as UTF-8
...you get this:
Игорќ
So it's probably valid UTF-8 incorrectly decoded.

How should a properly UTF-8 encoded file look in Notepad++?

I am integrating data using some flat files. I'm getting the flat files delivered by FTP as .csv files, exported from MS SQL by a business partner.
I asked him to encode them as UTF-8 (just following the standard, I thought).
Now I can see in his files that a lot of character codes such as "& # 2 3 3 ;" (w/o the spaces) show up as plain text when I open them in Notepad++ (and also in my "ETL" tool).
Before I ask him to fix it into proper UTF-8, I would like to understand the issue and check whether my claim is actually correct.
Shouldn't special characters be shown as special characters when I open the files in Notepad++, and not as plain-text character codes?
Any help is much appreciated :))
Cheers
Martin
&#233; is an HTML entity; it encodes é. For some reason the text is HTML-formatted, which I wouldn't count as "plaintext"/flat files. The file may or may not be encoded in UTF-8 in addition to that; we can't tell from the information given.
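If you do end up having to consume the file as delivered, .NET can decode the entities for you. A minimal sketch (the file names here are made up for illustration):

using System.IO;
using System.Net;
using System.Text;

class DecodeEntities
{
    static void Main()
    {
        string raw = File.ReadAllText("partner_export.csv", Encoding.UTF8);
        string decoded = WebUtility.HtmlDecode(raw); // "&#233;" becomes "é"
        File.WriteAllText("partner_export_decoded.csv", decoded, Encoding.UTF8);
    }
}

The cleaner fix is still to have the partner export unescaped UTF-8: blanket entity-decoding will also rewrite any legitimate ampersand sequences that happen to be in the data.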
A file containing "special characters" (meaning non-ASCII characters) encoded in UTF-8, opened in a text editor which correctly interprets the file as UTF-8, looks exactly like the text it should look like, e.g.:
正式名称は、ISO/IEC 10646では “UCS Transformation Format 8”、Unicodeでは “Unicode Transformation Format-8” という。両者はISO/IEC 10646とUnicodeのコード重複範囲で互換性がある。RFCにも仕様がある。
Put this in a file, save it as UTF-8, open it in another application as UTF-8, and this is what the text should look like.
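The same check can be scripted rather than done by hand. A small C# sketch of the round trip (the file name is illustrative; the BOM is optional but helps editors such as Notepad++ auto-detect the encoding):

using System;
using System.IO;
using System.Text;

class Utf8RoundTrip
{
    static void Main()
    {
        string sample = "正式名称は UCS Transformation Format 8"; // any non-ASCII text
        File.WriteAllText("sample.txt", sample, new UTF8Encoding(true)); // UTF-8 with a BOM
        string reloaded = File.ReadAllText("sample.txt", Encoding.UTF8);
        Console.WriteLine(reloaded == sample); // True: nothing was lost or altered
    }
}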

Retrieving Unicode text from Notepad which was saved as an ANSI text file

Yesterday I wrote some text in a Notepad file which was full of Unicode characters, and saved the file as ANSI. Notepad gave me a warning, which I clicked OK on without reading it fully, and closed Notepad.
Today, when I opened the same file in Notepad again, it was full of ??? signs. I now understand that this happened because I saved Unicode data as ANSI text. Is there a way to retrieve this text? Maybe using some hex editor or so?
No. Certain characters cannot be encoded in certain encodings. "風" cannot be encoded at all in any ISO-8859 variant or other single-byte encoding, for example. Likewise, each ANSI code page can only encode a certain subset of all possible characters; characters not defined in a particular ANSI encoding simply cannot be stored in it. When Notepad saved your file, every such character was replaced with a literal question mark, so the original bytes were never written to disk.
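A small C# sketch of what effectively happened (on .NET Core/5+ the legacy code pages require the System.Text.Encoding.CodePages package; the strings are illustrative):

using System;
using System.Text;

class LossyAnsiSave
{
    static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // .NET Core/5+ only

        Encoding ansi = Encoding.GetEncoding(1252); // a typical "ANSI" code page

        // "風" has no mapping in Windows-1252, so the encoder falls back to '?'.
        byte[] saved = ansi.GetBytes("風 plus some ASCII");

        // The bytes on disk literally contain 0x3F ('?'); the original character
        // was never written, so no hex editor can recover it.
        Console.WriteLine(ansi.GetString(saved)); // prints: ? plus some ASCII
    }
}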
So, they're gone. You'd better pull out a backup.

AppleScript: Save Word documents as plain text while retaining accents

I'm trying to save Word documents as plain text docs. Currently, the accents sometimes turn into other symbols (usually the same ones; for example, é turns into a theta). Other times it works fine. How do I prevent this?
Currently using the line:
save as active document file name FullDocPath file format format Unicode text
When I encounter this error, I can save the document using the dialog (selecting Western Mac OS Roman encoding); that fixes the problem.
The AppleScript Word dictionary mentions:
[text encoding unsigned integer] : Text encoding to use when saving out as text file
I have no idea if this is the piece I'm missing or how to utilize it (is there a set integer that designates Western Mac OS Roman encoding?)
Anyone have any ideas?
Try:
set wordDoc to choose file
do shell script "textutil -convert txt " & quoted form of POSIX path of (wordDoc as text)
Check out StefanK's solution using textutil
This is in response to your comment beginning "Thanks Stefan and bibadiak"
The problem with .txt file formats is that there is no universally used way to specify the encoding of a file inside the file itself, so either the application has to guess, or you have to know the encoding and the application has to let you tell it.
AFAIK if you do not specify an output encoding when you use textutil to convert from .doc or .docx format to text, you get UTF-8. But Mac Word just does not seem to recognise that when you try to open it, either programmatically or in the UI.
So I think you need to do some mix of the following:
a. save in, and work with, a format that uses 16-bit Unicode encoding. Word should recognise that, certainly if the BOM is preserved
b. save to UTF-8 and work with UTF-8 elsewhere, but use textutil to do the conversion back to (say) .docx before you re-open the document in Mac Word
c. if all your characters can be encoded using Mac OS Roman, use e.g.
textutil -convert txt -encoding 30
to save, ensure you work only with that character set, and re-open with Word. (30 is the value of the Apple NSString constant NSMacOSRomanStringEncoding.) I think textutil will fail to convert documents that contain characters outside the Mac OS Roman set.

How to convert Unicode Hebrew that appears as gibberish in VBScript?

I am gathering information from a Hebrew (Windows-1255 / UTF-8 encoding) website using VBScript and the WinHttp.WinHttpRequest.5.1 object.
For example:
Set objWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
...
' writes the file as Unicode (can't use ASCII)
Set Fileout = FSO.CreateTextFile("c:\temp\myfile.xml", true, true)
....
Fileout.WriteLine(objWinHttp.responsetext)
When viewing the file in Notepad / Notepad++, I see the Hebrew as gibberish.
For example:
äìëåú - äøá àáøäí éåñó - îåøùú
I need a VBScript function that returns the Hebrew correctly; it should do something similar to http://www.pixiesoft.com/flip/ (choose the 2nd radio button and press the convert button, and you will see the Hebrew correctly).
Your script is correctly fetching the byte stream and saving it as-is. No problems there.
Your problem is that the local text editor doesn't know that it's supposed to read the file as cp1255, so it tries your machine's default of cp1252. You can't simply save the file locally as cp1252 so that Notepad will read it correctly, because cp1252 doesn't include any Hebrew characters.
What is ultimately going to be reading the file or byte stream that needs to pick up the Hebrew correctly? If it does not support cp1255, you will need to find an encoding that tool does support and convert the cp1255 string to it. I suggest you try UTF-8 or UTF-16LE (the encoding Windows misleadingly calls 'Unicode').
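For comparison, in .NET that conversion is only a couple of lines. A sketch, assuming the raw cp1255 bytes are already in hand (a short literal stands in for the real response body; .NET Core/5+ again needs the CodePages provider):

using System;
using System.Text;

class Cp1255ToUtf8
{
    static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // .NET Core/5+ only

        byte[] cp1255Bytes = { 0xF9, 0xEC, 0xE5, 0xED }; // "שלום" in Windows-1255

        Encoding hebrew = Encoding.GetEncoding(1255);
        byte[] utf8Bytes = Encoding.Convert(hebrew, Encoding.UTF8, cp1255Bytes);

        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes)); // prints: שלום
    }
}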
Converting text between encodings in VBScript/JScript can be done as a side-effect of an ADODB stream. See the example in this answer.
Thanks to the charming bobince (who posted the answer above), I am now able to see the Hebrew correctly when saving a Windows-1255 encoded page to a .txt file (Notepad), by implementing the following:
Function ConvertFromUTF8(sIn)
    ' Despite the name, this re-reads the mis-decoded response as Windows-1255.
    ' Usage: Fileout.WriteLine(ConvertFromUTF8(objWinHttp.responsetext))
    Dim oIn: Set oIn = CreateObject("ADODB.Stream")
    oIn.Open
    oIn.CharSet = "X-ANSI"        ' write the string back out as its raw single-byte data
    oIn.WriteText sIn
    oIn.Position = 0              ' rewind the stream...
    oIn.CharSet = "WINDOWS-1255"  ' ...and decode the same bytes as Hebrew (cp1255)
    ConvertFromUTF8 = oIn.ReadText
    oIn.Close
End Function