How to convert UNICODE Hebrew appears as Gibberish in VBScript? - unicode

I am gathering information from a HEBREW (WINDOWS-1255 / UTF-8 encoding) website using vbscript and WinHttp.WinHttpRequest.5.1 object.
For Example :
Set objWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
...
'writes the file as unicode (can't use Ascii)
Set Fileout = FSO.CreateTextFile("c:\temp\myfile.xml", true, true)
....
Fileout.WriteLine(objWinHttp.responsetext)
When Viewing the file in notepad / notepad++, I see Hebrew as Gibrish / Gibberish.
For example :
äìëåú - äøá àáøäí éåñó - îåøùú
I need a vbscript function to return Hebrew correctly, the function should be similar to the following http://www.pixiesoft.com/flip/ choosing the 2nd radio button and press convert button , you will see Hebrew correctly.

Your script is correctly fetching the byte stream and saving it as-is. No problems there.
Your problem is that the local text editor doesn't know that it's supposed to read the file as cp1255, so it tries the default on your machine of cp1252. You can't save the file locally as cp1252, so that Notepad will read it correctly, because cp1252 doesn't include any Hebrew characters.
What is ultimately going to be reading the file or byte stream, that will need to pick up the Hebrew correctly? If it does not support cp1255, you will need to find an encoding that is supported by that tool, and convert the cp1255 string to that encoding. Suggest you might try UTF-8 or UTF-16LE (the encoding Windows misleadingly calls 'Unicode'.)
Converting text between encodings in VBScript/JScript can be done as a side-effect of an ADODB stream. See the example in this answer.

Thanks to Charming Bobince (that posted the answer), I am now able to see HEBREW correctly (saving a windows-1255 encoding to a txt file (notpad)) by implementing the following :
Function ConvertFromUTF8(sIn)
Dim oIn: Set oIn = CreateObject("ADODB.Stream")
oIn.Open
oIn.CharSet = "X-ANSI"
oIn.WriteText sIn
oIn.Position = 0
oIn.CharSet = "WINDOWS-1255"
ConvertFromUTF8 = oIn.ReadText
oIn.Close
End Function

Related

What does this decode to, and is it UTF? Игорќ

I have received this in a name field (so it should be a person's name)
Игорќ
What could that decode to? Is it UTF-8? What language does that translate to? Russian?
If you can give me a hint or maybe links to websites that explain what meaningful letters I should get out of that would be helpful, thank you :)
This typically is UTF-8 interpreted as some single-byte Windows encoding.
String s = "Игорќ"; // Source encoding UTF-8
byte[] b = s.getBytes("Cp1252");
System.out.println("" + new String(b, StandardCharsets.UTF_8));
// Игорќ
The data might easily get corrupted. Above I got some results with Windows-1252 (MS Windows Latin-1). The java source must be compiled with encoding UTF-8 to accept those chars.
Since you already pasted the original code into a UTF-8 encoded site as Stack Overflow your code is now corrupt data perfectly encoded as UTF-8. If you want to ask yourself anything about the data encoding you need to use an hexadecimal editor or a similar tool on the original raw bytes.
In any case, if you do this:
Open a text file in some single-byte encoding (possibly the ANSI code page used by your copy of Windows, I used Windows-1252)
Paste the Игорќ gibberish and save the file
Reload the file as UTF-8
... you get this:
Игорќ
So it's probably valid UTF-8 incorrectly decoded.

retrieving unicode text from notepad which is saved as ANSI text file

Yesterday I wrote some text in a notepad file which was full of Unicode characters and saved the file as ANSI. Notepad gave me some warning, which i clicked OK without reading it fully and closed notepad.
Today when I again opened the same text in notepad, I am seeing notepad full of ??? signs. I now understand that this happened because I saved Unicode data as ANSI text. Is there a way to retrieve this text back? May be using some hex-editor or so?
No. Certain characters cannot be encoded in certain encodings. "風" cannot be encoded at all in ISO-8859 or any other single-byte encoding, for example. Each ANSI encoding also can only encode a certain subset of all possible characters. It is simply not possible to store characters not defined in a particular ANSI encoding in that encoding, they're simply not defined there.
So, they're gone. You better pull out a backup.

Applescript: Save Word documents as plaintext while retaining accents

I'm trying to save Word documents as plain text docs. Currently, some times the accents turn into other symbols (usually the same ones, for example: é turns into a theta). Other times it works fine. How do I prevent this?
Currently using the line:
save as active document file name FullDocPath file format format Unicode text
When I encounter this error, I can save the document using the dialog (selecting Western Mac OS Roman encoding...that fixes the problem.
The applescript Word dictionary mentions:
[text encoding unsigned integer] : Text encoding to use when saving out as text file
I have no idea if this is the piece I'm missing or how to utilize it (is there a set integer that designates Western Mac OS Roman encoding?)
Anyone have any ideas?
Try:
set wordDoc to choose file
do shell script "textutil -convert txt " & quoted form of POSIX path of (wordDoc as text)
Check out StefanK's solution using textutil
This is in response to your comment beginning "Thanks Stefan and bibadiak"
With .txt file formats is that there is no universally used way to specify the encoding of a file inside the file, so either the application has to guess, or you have to know the encoding and the application has to let you tell it.
AFAIK if you do not specify an output encoding when you use textutil to convert from .doc or .docx format to text, you get UTF-8. But Mac Word just does not seem to recognise that when you try to open it, either programmatically or in the UI.
So I think you need to do some mix of the following:
a. save in, and work with, a format that uses 16-bit Unicode encoding. Word should recognise that, certainly if the BOM is preserved
b. save to UTF and work with UTF elsewhere, but use textutil to do the conversion back to (say) .docx before you re-open the document in Mac Word
c. if all your characters can be encoded using Mac OS Roman, use e.g.
textutil -convert txt -encoding 30
to save, ensure you work only with that character set, and re-open with Word. (30 is the value of the APple NSString value NSMacOSRomanStringEncoding). I think textutil will fail to convert documents that contain characters outside the MacOS Roman set.

Saving outlook mail items in UTF-8 / Unicode using C#

We have created an Outlook Plugin which (amongst other things) can be used to save Mail items in text form to a specific folder. However, the text of the resulting text file is encoded in ANSI and I would like to save it as UTF8. I have already set the Codepage of the mail item like so:
mail = (MailItem)objItem;
mail.InternetCodepage = 65001; // equal UTF8 encoding; see http://msdn.microsoft.com/en-us/library/office/ff860730.aspx
mail.SaveAs(filePath, olSaveAsType);
However, the resulting file is saved as "ANSI as UTF8" and all extended characters (e.g. in Arabic or Russian) come out as question marks.
Does anyone know how I can save the mail item in utf8?
Thanks a lot.
Cheers,
Martin
Instead of trying to set the encoding, try reading InternetCodepage and then using a System.Text.Encoding object to read the saved file into a string. You could then convert and re-save the string as another file in the encoding you prefer.

Encoding problems in ASP when using English and Chinese characters

I am having problems with encoding Chinese in an ASP site. The file formats are:
translations.txt - UTF-8 (to store my translations)
test.asp - UTF-8 - (to render the page)
test.asp is reading translations.txt that contains the following data:
Help|ZH|帮助
Home|ZH|首页
The test.asp splits on the pipe delimiter and if the user contains a cookie with ZH, it will display this translation, else it will just revert back to the Key value.
Now, I have tried the following things, which have not worked:
Add a meta tag
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
Set the Response.CharSet = "UTF-8"
Set the Response.ContentType = "text/html"
Set the Session.CodePage (and Response) to both 65001 (UTF-8)
I have confirmed that the text in translations.txt is definitely in UTF-8 and has no byte order mark
The browser is picking up that the page is Unicode UTF-8, but the page is displaying gobbledegook.
The Scripting.OpenTextFile(<file>,<create>,<iomode>,<encoding>) method returns the same incorrect text regardless of the Encoding parameter.
Here is a sample of what I want to be displayed in China (ZH):
首页
帮助
But the following is displayed:
首页
帮助
This occurs all tested browsers - Google Chrome, IE 7/8, and Firefox 4. The font definitely has a Chinese branch of glyphs. Also, I do have Eastern languages installed.
--
I have tried pasting in the original value into the HTML, which did work (but note this is a hard coded value).
首页
首页
However, this is odd.
首页 --(in hex)--> E9 A6 96 E9 A1 --(as chars)--> 首页
Any ideas what I am missing?
In order to read the UTF-8 file, you'll probably need to use the ADODB.Stream object. I don't claim to be an expert on character encoding, but this test worked for me:
test.txt (saved as UTF-8 without BOM):
首页
帮助
test.vbs
Option Explicit
Const adTypeText = 2
Const adReadLine = -2
Dim stream : Set stream = CreateObject("ADODB.Stream")
stream.Open
stream.Type = adTypeText
stream.Charset = "UTF-8"
stream.LoadFromFile "test.txt"
Do Until stream.EOS
WScript.Echo stream.ReadText(adReadLine)
Loop
stream.Close
Whatever part of the process is reading the translations.txt file does not seem to understand that the file is in UTF-8. It looks like it is reading it in as some other encoding. You should specify encoding in whatever process is opening and reading that file. This will be different from the encoding of your web page.
Inserting the byte order mark at the beginning of that file may also be a solution.
Scripting.OpenTextFile does not understand UTF-8 at all. It can only read the current OEM encoding or Unicode. As you can see from the number of bytes being used for some character sets UTF-8 is quite inefficient. I would recommend Unicode for this sort of data.
You should save the file as Unicode (in Windows parlance) and then open with:
Dim stream : Set stream = Scripting.OpenTextFile(yourFilePath, 1, false, -1)
Just use the script below at the top of your page
Response.CodePage=65001
Response.CharSet="UTF-8"