WiX Installer heat.exe and non-ASCII filenames - encoding

I added a file in my WiX script with the character " î " in the path name. Light.exe will complain:
A string was provided with characters that are not available in the specified database code page '1252'
The character in question is 0xEE in Windows-1252 encoding, that is, U+00EE in Unicode or 0xC3 0xAE in UTF-8. These files are listed in a wxs file generated by heat.exe, and that XML is encoded as UTF-8.
I assume the error message comes from light trying to write the character as UTF-8 while the database codepage is 1252? Since UTF-8 isn't really supported by Windows Installer (as described in the WiX documentation), should I be using input XML encoded in Windows-1252 or ISO-8859-1? If so, can I tell heat.exe to use another encoding for its output?
My question is similar to this one:
Leveraging heat.exe and harvest already localized file names and including them to msi using wix, but the difference is that there the characters are "true" non-ANSI characters. In my case the character can be encoded correctly in 1252; it just seems the conversion from the UTF-8 input files does not work.

The WiX toolset verifies codepages like so (roughly):
encoding = Encoding.GetEncoding(codepage, new EncoderExceptionFallback(),
                                new DecoderExceptionFallback());
writer = new StreamWriter(idtPath, false, encoding);

try
{
    // GetBytes will throw an exception if any character doesn't
    // match our current encoding
    rowBytes = writer.Encoding.GetBytes(rowString);
}
catch (EncoderFallbackException)
{
    // convertEncoding (not shown above) is the same codepage created
    // without the exception fallbacks, so this lossy conversion succeeds
    rowBytes = convertEncoding.GetBytes(rowString);
    messageHandler.OnMessage(WixErrors.InvalidStringForCodepage(
        row.SourceLineNumbers,
        writer.Encoding.WindowsCodePage));
}
It is possible that NETFX is not translating that "î" correctly. Explicitly setting the codepage on your XML may help. To do that from heat, you can try an XSLT transform (I've never tried changing the XML document's codepage via XSL, but it seems possible) or post-edit the generated document.
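For the post-edit route, a small build step can reload the heat output and save it back with an explicit Windows-1252 declaration. This is only a sketch of that idea, not something from the original answer; "Fragment.wxs" is a placeholder for whatever file heat generated, and I haven't verified whether light then accepts the string:

using System.Text;
using System.Xml;

class ReencodeWxs
{
    static void Main()
    {
        // Load the heat-generated authoring (hypothetical file name).
        var doc = new XmlDocument();
        doc.Load("Fragment.wxs");

        // Save it back with an explicit windows-1252 declaration so the
        // document is no longer stored as UTF-8.
        var settings = new XmlWriterSettings
        {
            Encoding = Encoding.GetEncoding(1252),
            Indent = true
        };
        using (var writer = XmlWriter.Create("Fragment.wxs", settings))
        {
            doc.Save(writer);
        }
    }
}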

Related

How to detect file encoding in Octave?

I am working with many XML files and some of them are UTF-8 while most are ANSI.
In the UTF-8 files, the XML header states:
<?xml version="1.0" encoding="ISO8859-1" ?>
However, that information is wrong.
The problem this creates is that I use unicode2native to generate correct XLS files, and it produces bad output when the file is actually UTF-8 encoded.
How can I detect the real encoding of each file programmatically?
Locating them manually with a text editor is not feasible: there are hundreds of files, and my solution must also work on further files that I don't have access to yet.
There's no easy way to do this generally: because a given file might be a valid sequence in multiple encodings, detecting the character encoding requires using heuristics that are aware of natural language features, such as character frequencies, common words, and so on.
Octave doesn't have direct support for this, so you'll need to use an external program or library. Options include ICU4C, compact_enc_det, chardet, juniversalchardet, and others. chardet would probably be the easiest for you to use, since you can just install it and call it as an external command instead of building a custom program or oct-file against a library. Or juniversalchardet: if you have a Java-enabled Octave build, it's easy to pull in and use Java libraries from Octave code.
If it's really true that your input files are all either ANSI (Windows-1252/ISO 8859-1) or UTF-8, and no other encodings, you might be able to get away with just checking each file's contents to see whether it's a valid UTF-8 string, and assuming that anything that is not valid UTF-8 is ANSI. Only certain byte sequences are valid UTF-8 encodings, so there's a good chance that the ANSI-encoded files are not valid UTF-8. I think you can check whether a file is valid UTF-8 in pure Octave by doing utf8_bytes = unicode2native(file_contents, 'UTF-8') on it, and seeing whether the utf8_bytes output is identical to just casting file_contents directly to uint8. If that doesn't work, you can fall back to Java's character encoding support (which you can do with standard-library Java classes on any Java-enabled Octave build, without having to load an external JAR file).
And if all your input files are either UTF-8 or strictly 7-bit ASCII, then you can just treat them all as UTF-8, because 7-bit ASCII is a valid subset of UTF-8.
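That valid-UTF-8 check is easy to express in other environments too. Here is a rough sketch in C# rather than Octave (not from the original answer; the class and method names are invented): read the file's bytes and attempt a strict UTF-8 decode, falling back to ANSI when the decoder throws.

using System.IO;
using System.Text;

static class EncodingSniffer
{
    // Returns UTF-8 if the file's bytes form a valid UTF-8 sequence,
    // otherwise assumes the file is ANSI (Windows-1252).
    public static Encoding DetectUtf8OrAnsi(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);

        // A strict decoder: throwOnInvalidBytes makes GetString throw
        // instead of silently substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                          throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(bytes);
            return Encoding.UTF8;
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8, so treat it as the ANSI code page.
            // (.NET Core/5+: register CodePagesEncodingProvider first.)
            return Encoding.GetEncoding(1252);
        }
    }
}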
A palliative solution that I found for Windows 10, while I can't find a proper way to do this in pure Octave:
[~, output] = system(['file --mime-encoding "', fileAddress, '"']);
% the reported encoding is the last token of the command's output
encoding = strsplit(output)(columns(strsplit(output, ' '))){1};
if strcmp('utf-8', encoding)
    sheet(1, 1) = {strcat('', unicode2native(myText, 'ISO-8859-1'))};
else
    sheet(1, 1) = {myText};
endif

What does this decode to, and is it UTF? Ð˜Ð³Ð¾Ñ€Ñœ

I have received this in a name field (so it should be a person's name)
Ð˜Ð³Ð¾Ñ€Ñœ
What could that decode to? Is it UTF-8? What language does that translate to? Russian?
If you can give me a hint, or maybe links to websites that explain what meaningful letters I should get out of that, it would be helpful, thank you :)
This typically is UTF-8 interpreted as some single-byte Windows encoding.
String s = "Ð˜Ð³Ð¾Ñ€Ñœ"; // Source encoding UTF-8
byte[] b = s.getBytes("Cp1252");
System.out.println("" + new String(b, StandardCharsets.UTF_8));
// Игорќ
Data like this gets corrupted easily. Above, I got a sensible result with Windows-1252 (MS Windows Latin-1). The Java source must be compiled with UTF-8 encoding for the compiler to accept those characters in a string literal.
Since you already pasted the original text into a UTF-8 encoded site such as Stack Overflow, your sample is now corrupt data perfectly encoded as UTF-8. If you want to find out anything about the data's encoding, you need to use a hexadecimal editor or a similar tool on the original raw bytes.
In any case, if you do this:
Open a text file in some single-byte encoding (possibly the ANSI code page used by your copy of Windows, I used Windows-1252)
Paste the Ð˜Ð³Ð¾Ñ€Ñœ gibberish and save the file
Reload the file as UTF-8
... you get this:
Игорќ
So it's probably valid UTF-8 incorrectly decoded.
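The same round trip is just as short in C#, in case that's your environment (a sketch, not part of either answer): re-encode the garbled text with Windows-1252 to get the original bytes back, then decode those bytes as UTF-8.

using System;
using System.Text;

class MojibakeRepair
{
    static void Main()
    {
        // (.NET Core/5+: Encoding.GetEncoding(1252) needs the
        // System.Text.Encoding.CodePages provider registered first.)
        string garbled = "Ð˜Ð³Ð¾Ñ€Ñœ";

        // Windows-1252 maps each of these characters back to one byte,
        // recovering the original UTF-8 byte sequence.
        byte[] utf8Bytes = Encoding.GetEncoding(1252).GetBytes(garbled);

        // Decoding those bytes as UTF-8 yields the intended Cyrillic text.
        string repaired = Encoding.UTF8.GetString(utf8Bytes);
        Console.WriteLine(repaired);   // Игорќ
    }
}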

How to convert Unicode Hebrew that appears as gibberish in VBScript?

I am gathering information from a Hebrew (Windows-1255 / UTF-8 encoding) website using VBScript and the WinHttp.WinHttpRequest.5.1 object.
For Example :
Set objWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
...
'writes the file as unicode (can't use Ascii)
Set Fileout = FSO.CreateTextFile("c:\temp\myfile.xml", true, true)
....
Fileout.WriteLine(objWinHttp.responsetext)
When viewing the file in Notepad / Notepad++, I see the Hebrew as gibberish.
For example :
äìëåú - äøá àáøäí éåñó - îåøùú
I need a VBScript function that returns the Hebrew correctly. The function should behave like the converter at http://www.pixiesoft.com/flip/ : choose the 2nd radio button and press the Convert button, and you will see the Hebrew correctly.
Your script is correctly fetching the byte stream and saving it as-is. No problems there.
Your problem is that the local text editor doesn't know that it's supposed to read the file as cp1255, so it tries the default on your machine of cp1252. You can't save the file locally as cp1252, so that Notepad will read it correctly, because cp1252 doesn't include any Hebrew characters.
What is ultimately going to be reading the file or byte stream that needs to pick up the Hebrew correctly? If it does not support cp1255, you will need to find an encoding that is supported by that tool and convert the cp1255 string to that encoding. I suggest you try UTF-8 or UTF-16LE (the encoding Windows misleadingly calls 'Unicode').
Converting text between encodings in VBScript/JScript can be done as a side-effect of an ADODB stream. See the example in this answer.
Thanks to Charming Bobince (who posted the answer above), I am now able to see the Hebrew correctly (saving Windows-1255 encoded text to a txt file and viewing it in Notepad) by implementing the following:
Function ConvertFromUTF8(sIn)
    Dim oIn: Set oIn = CreateObject("ADODB.Stream")
    oIn.Open
    ' Write the string out as raw bytes using the system ANSI code page
    oIn.CharSet = "X-ANSI"
    oIn.WriteText sIn
    ' Rewind and re-read the same bytes as Windows-1255 (Hebrew)
    oIn.Position = 0
    oIn.CharSet = "WINDOWS-1255"
    ConvertFromUTF8 = oIn.ReadText
    oIn.Close
End Function
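For what it's worth, the same reinterpretation is a two-step job in .NET as well. This is only an illustrative sketch (not part of the answer), assuming the garbled text from the question is Windows-1255 bytes that were mis-decoded as Windows-1252:

using System;
using System.Text;

class HebrewRepair
{
    static void Main()
    {
        // (.NET Core/5+: register CodePagesEncodingProvider before using
        // code pages 1252 and 1255.)
        string garbled = "äìëåú - äøá àáøäí éåñó - îåøùú";

        // Re-encode with the code page it was wrongly decoded with...
        byte[] raw = Encoding.GetEncoding(1252).GetBytes(garbled);

        // ...then decode the raw bytes as Windows-1255 (Hebrew).
        string hebrew = Encoding.GetEncoding(1255).GetString(raw);
        Console.WriteLine(hebrew);   // הלכות - הרב אברהם יוסף - מורשת
    }
}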

Reading unicode string from registry

I'm using CodeGear C++Builder 2007. I'm trying to read a string value with a path from the registry. This path can contain Unicode characters, for example Russian.
I have added a string value with regedit and verified by exporting that the value really contains the expected Unicode characters. The results in S1, S2 and S3 below all contain '?' (0x3F) instead of the Unicode characters. What am I missing?
TRegistry *Registry = new TRegistry;
try
{
    Registry->RootKey = HKEY_CURRENT_USER;
    if (Registry->OpenKey("Software\\qwe\\asd", false))
    {
        AnsiString S1 = Registry->ReadString("zxc");
        WideString S2 = Registry->ReadString("zxc");
        UTF8String S3 = Registry->ReadString("zxc");
    }
}
__finally
{
    delete Registry;
}
/Björn
The VCL in C++Builder (and Delphi) 2007 uses Ansi, not Unicode. TRegistry::ReadString() internally calls the Win32 API RegQueryValueExA() function instead of RegQueryValueExW(), and returns an AnsiString that uses the OS default Ansi codepage. Any Unicode data gets automatically converted to Ansi by the OS before your code ever sees it. The '?' character means that a Unicode character got converted to an Ansi codepage that does not support that character. It does not matter what string type you assign the result of ReadString() to; the Unicode data has already been lost before ReadString() even exits.
If you need to read Unicode data as Unicode, then you need to call RegQueryValueExW() directly instead of using TRegistry::ReadString() (or upgrade to C++Builder 2009 or later, which now use Unicode).
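Not C++Builder, but the same substitution is easy to reproduce from .NET, which may make the mechanism concrete. This is just an illustrative sketch with a made-up Russian path, showing why characters that the Ansi codepage cannot represent come back as '?' (0x3F):

using System;
using System.Text;

class AnsiLoss
{
    static void Main()
    {
        // A Unicode path that Windows-1252 cannot represent (hypothetical value).
        string russianPath = "C:\\Данные\\файл.txt";

        // Converting to an Ansi code page without those characters substitutes '?'.
        // (.NET Core/5+: register CodePagesEncodingProvider before using code page 1252.)
        byte[] ansiBytes = Encoding.GetEncoding(1252).GetBytes(russianPath);
        Console.WriteLine(Encoding.GetEncoding(1252).GetString(ansiBytes));  // C:\??????\????.txt
    }
}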
http://do-the-right-things.blogspot.com/2008/03/codegear-delphi-2006nets-tregistry.html
CodeGear Delphi 2006.Net's TRegistry fails in Framework 2 SP1
I don't know whether C++Builder 2007 is also affected, but if it is, maybe there is a patch available somewhere.

Creating files with french characters and encoding

Hi, I am creating a file like so.
FileStream temp = File.Create( this.FileName );
Then putting data in the file like so.
this.Writer = new StreamWriter( this.Stream );
this.Writer.WriteLine( strMessage );
That code is encapsulated in a class hierarchy but that is the meat and potatoes of it.
My problem is this. MSDN says that the default encoding when creating a file this way is UTF-8. When I write a French character such as é, TextPad interprets the file as a UTF-8 file, but Notepad++ says it's "ANSI as UTF8" (or maybe it's an ANSI file that is being read as UTF-8). When I create a file the same way without the French character, both TextPad and Notepad++ read the file as an ANSI file, even though according to MSDN it should still be a UTF-8 file.
Which program should be trusted, Notepad++ or TextPad? Notepad++ seems to be more consistent, but is still the opposite of what MSDN says it should be. My problem is that we create files that get sent off to another company, and depending on whether there are French characters the encoding seems to keep changing.
Or is there a better way to determine the encoding of a file? I've read about byte order marks and preambles, but as far as I understand neither is guaranteed to be there.
We initially thought that all the files we were building were ANSI. Also please note that both ANSI and UTF-8 should handle the French characters appropriately, as the characters are part of both character sets.
As far as I know, on Windows "ANSI" usually refers to the system's active ANSI code page (Windows-1252 for Western European locales) rather than strictly US-ASCII, but the two agree for the first 128 characters.
If there are no characters in the file outside that 7-bit ASCII range, then the file is simultaneously valid ANSI and valid UTF-8 and the bytes are identical, so there is no way to distinguish them. Your program can write it as UTF-8 and any other program would be equally correct in seeing it as ANSI or as UTF-8. As soon as you write a character like é, the two encodings produce different bytes (0xE9 in Windows-1252, 0xC3 0xA9 in UTF-8), which is why editors that guess from the content change their answer depending on whether such characters are present.
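Since guessing tools will disagree, the practical fix is usually to stop relying on the default and pin the encoding (and BOM) yourself when you create the file. A sketch of both options, reusing strMessage from the question; fileName and the sample text are stand-ins:

using System.IO;
using System.Text;

class WriteWithExplicitEncoding
{
    static void Main()
    {
        string fileName = "out.txt";          // stand-in for this.FileName
        string strMessage = "résumé, café";   // sample French text

        // Option 1: always write UTF-8 *with* a BOM, so editors and the
        // receiving company can identify the file even when a particular
        // file happens to contain only ASCII characters.
        using (var writer = new StreamWriter(fileName, false,
                                             new UTF8Encoding(encoderShouldEmitUTF8Identifier: true)))
        {
            writer.WriteLine(strMessage);
        }

        // Option 2: always write ANSI (Windows-1252), if that is what the
        // receiving system expects. é becomes 0xE9 here instead of 0xC3 0xA9.
        // (.NET Core/5+: register CodePagesEncodingProvider first.)
        using (var writer = new StreamWriter(fileName, false, Encoding.GetEncoding(1252)))
        {
            writer.WriteLine(strMessage);
        }
    }
}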