Reading a Unicode string from the registry

I'm using CodeGear C++Builder 2007, and I'm trying to read a string value containing a path from the registry. The path can contain Unicode characters, for example Russian.
I added the string value with regedit and verified, by exporting the key, that the value really contains the expected Unicode characters. The results in S1, S2, and S3 below all contain '?' (0x3F) in place of the Unicode characters. What am I missing?
TRegistry *Registry = new TRegistry;
try
{
    Registry->RootKey = HKEY_CURRENT_USER;
    if (Registry->OpenKey("Software\\qwe\\asd", false))
    {
        AnsiString S1 = Registry->ReadString("zxc");
        WideString S2 = Registry->ReadString("zxc");
        UTF8String S3 = Registry->ReadString("zxc");
    }
}
__finally
{
    delete Registry;
}
/Björn

The VCL in C++Builder (and Delphi) 2007 uses Ansi, not Unicode. TRegistry::ReadString() is internally calling the Win32 API RegQueryValueExA() function instead of RegQueryValueExW(), and TRegistry::ReadString() returns an AnsiString that uses the OS default Ansi codepage. Any Unicode data gets automatically converted to Ansi by the OS before your code ever sees it. The '?' character means that a Unicode character got converted to an Ansi codepage that does not support that character. It does not matter what string type you assign the result of ReadString() to, the Unicode data has already been lost before ReadString() even exits.
If you need to read Unicode data as Unicode, then you need to call RegQueryValueExW() directly instead of using TRegistry::ReadString() (or upgrade to C++Builder 2009 or later, which now use Unicode).
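For illustration, a minimal sketch of what calling RegQueryValueExW() directly might look like (the key and value names are taken from the question; the helper name and the deliberately brief error handling are mine):

#include <windows.h>
#include <string>
#include <vector>

// Read a REG_SZ value as raw UTF-16, bypassing the VCL's Ansi conversion.
std::wstring ReadUnicodeValue()
{
    std::wstring result;
    HKEY hKey;
    if (RegOpenKeyExW(HKEY_CURRENT_USER, L"Software\\qwe\\asd", 0,
                      KEY_QUERY_VALUE, &hKey) != ERROR_SUCCESS)
        return result;

    DWORD type = 0, size = 0;
    // First call: ask how many bytes the value occupies.
    if (RegQueryValueExW(hKey, L"zxc", NULL, &type, NULL, &size) == ERROR_SUCCESS
        && type == REG_SZ)
    {
        // The extra wchar_t guarantees null termination even if the
        // stored value lacks a terminator.
        std::vector<wchar_t> buffer(size / sizeof(wchar_t) + 1, L'\0');
        // Second call: read the UTF-16 data itself.
        if (RegQueryValueExW(hKey, L"zxc", NULL, NULL,
                             reinterpret_cast<LPBYTE>(&buffer[0]), &size) == ERROR_SUCCESS)
        {
            result = &buffer[0];
        }
    }
    RegCloseKey(hKey);
    return result;
}

The two-call pattern (size query, then data read) is the standard idiom for registry values whose length is not known in advance.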

http://do-the-right-things.blogspot.com/2008/03/codegear-delphi-2006nets-tregistry.html
CodeGear Delphi 2006.Net's TRegistry fails in Framework 2 SP1
I don't know whether C++Builder 2007 is also affected, but if it is, maybe there is a patch available somewhere.

Related

Delphi 7: No second parameter in TStrings.SaveToFile() method

I need to make changes to an old legacy project in Delphi 7.
I need to save a TStringList to a file with Unicode encoding. All resources I have found describe a second parameter for specifying an encoding in the SaveToFile()/LoadFromFile() methods, but there is no such parameter in Delphi 7. It was probably added in later versions.
How can I save UTF-8 text to a file (.csv) in Delphi 7?
The parameter you are looking for was introduced in Delphi 2009, when Delphi's String type migrated from AnsiString to UnicodeString.
Prior to Delphi 2009, you will have to encode the TStringList entries to UTF-8 yourself. You can put UTF-8 data in an AnsiString (UTF8String in those versions was just an alias for AnsiString), and TStringList will save the data as-is.
You may be tempted to use the RTL's UTF8Encode() function, but be aware that prior to Delphi 2009 it does not support Unicode codepoints above U+FFFF. If you need to handle codepoints that high, you will have to use Microsoft's WideCharToMultiByte() function instead (MultiByteToWideChar() converts in the opposite direction, from multibyte data to UTF-16).
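The question is about Delphi 7, but the Win32 call sequence is the same from any language; here is a minimal C++ sketch of the conversion (the helper name Utf16ToUtf8 is made up for the example):

#include <windows.h>
#include <string>

// Encode a UTF-16 string as UTF-8 via the Win32 API. Unlike the
// pre-2009 UTF8Encode(), WideCharToMultiByte() handles surrogate
// pairs (codepoints above U+FFFF) correctly.
std::string Utf16ToUtf8(const std::wstring &input)
{
    if (input.empty())
        return std::string();
    // First call: query the required UTF-8 buffer size in bytes.
    int size = WideCharToMultiByte(CP_UTF8, 0, input.c_str(),
                                   (int)input.size(), NULL, 0, NULL, NULL);
    std::string output(size, '\0');
    // Second call: perform the actual conversion.
    WideCharToMultiByte(CP_UTF8, 0, input.c_str(), (int)input.size(),
                        &output[0], size, NULL, NULL);
    return output;
}

The resulting bytes can then be written to the file as-is, which matches the answer's point about storing UTF-8 data in an AnsiString.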

Weird normalization on .NET

I am trying to normalize a string (using .NET Standard 2.0) with Form D, and it works perfectly when running on a Windows machine.
[TestMethod]
public void TestChars()
{
    var original = "é";
    var normalized = original.Normalize(NormalizationForm.FormD);

    var originalBytesCsv = string.Join(',', Encoding.Unicode.GetBytes(original));
    Assert.AreEqual("233,0", originalBytesCsv);

    var normalizedBytesCsv = string.Join(',', Encoding.Unicode.GetBytes(normalized));
    Assert.AreEqual("101,0,1,3", normalizedBytesCsv);
}
When I run this on Linux, it returns "253,255" for both strings, before and after normalization. Read as a little-endian 16-bit value, those two bytes are 65533 (U+FFFD), the Unicode replacement character, which is used when something goes wrong with decoding. That's the part where I am lost.
What am I missing here? Can someone point me in the right direction?
It might be related to the encoding of the source file. I'm not sure which encodings .NET on Linux supports, but to be on the safe side, you should use plain-ASCII source files and Unicode escapes for non-ASCII characters:
var original = "\u00e9";
There is no text but encoded text.
When communicating text to person or program, both the bytes and the character encoding are essential.
The C# compiler (like all programs that process text, except in special cases like JSON) must know which character encoding the input files use. You must inform it accurately. The default is UTF-8 and that is a fine choice, especially for C# files, which are, lexically, sequences of Unicode codepoints.
If you used your editor or IDE or file transfer without full mindfulness of these requirements, you might have used an unintended character encoding.
For example, "é" saved as Windows-1252 (the single byte 0xE9) but read as UTF-8 (where 0xE9 is a leading code unit that must be followed by two continuation code units) decodes to � (U+FFFD) to signal the mishandling to the reader.
To be on the safe side, use UTF-8 everywhere but do it mindfully.
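To see the substitution from the example above in isolation, here is a small C++ sketch against the Win32 conversion API (it assumes Windows Vista or later, where invalid UTF-8 sequences decode to U+FFFD rather than being silently dropped):

#include <windows.h>
#include <cstdio>

int main()
{
    // "é" as Windows-1252: the single byte 0xE9 (plus terminator).
    const char bytes[] = { (char)0xE9, 0 };

    // Decode those bytes as if they were UTF-8.
    wchar_t decoded[4] = {};
    MultiByteToWideChar(CP_UTF8, 0, bytes, -1, decoded, 4);

    // Prints "U+FFFD": a lone 0xE9 is an invalid UTF-8 sequence, so a
    // lenient decoder substitutes the replacement character.
    std::printf("U+%04X\n", decoded[0]);
    return 0;
}

This is the same substitution the test above observed as the bytes "253,255" (0xFFFD in UTF-16LE).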

What happens if you set your integration package to Unicode?

I'm importing data from flat files (text files). I do not know which encoding they will use; it may be Unicode, or it may be ASCII. What happens if I just choose "Unicode string [DT_WSTR]" (or Unicode data) in my integration package? Would it be able to read ASCII without issues? I am using SSIS 2012.
The encoding that Microsoft misleadingly calls "Unicode" is actually UTF-16LE, an encoding based on two-byte code units.
UTF-16LE is not compatible with ASCII (or any of the locale-specific ANSI code pages), so if you read a file that is actually encoded in an ASCII superset, you will get unreadable nonsense.
There's no magic 'do the right thing' option for reading characters from files; you have to know what encoding was used to create them. If you can see an encoded Byte Order Mark at the front of the data, that usually allows you to make a good guess, but otherwise you're on your own.
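If you are forced to guess, sniffing the BOM is the usual heuristic. A minimal C++ sketch (the file name is hypothetical, and a BOM-less file still leaves you guessing):

#include <cstdio>

// Return a best-guess encoding name based on the first bytes of a file.
// BOM sniffing is only a heuristic: a BOM-less UTF-8 or UTF-16 file
// falls through to the "unknown" case.
const char* GuessEncodingFromBom(const unsigned char* b, size_t len)
{
    if (len >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16LE";   // what Microsoft tools call "Unicode"
    if (len >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16BE";
    return "unknown (no BOM: could be ASCII, ANSI, or BOM-less UTF-8)";
}

int main()
{
    unsigned char buf[3] = {};
    std::FILE* f = std::fopen("input.txt", "rb");   // hypothetical file name
    if (!f) return 1;
    size_t n = std::fread(buf, 1, sizeof buf, f);
    std::fclose(f);
    std::printf("%s\n", GuessEncodingFromBom(buf, n));
    return 0;
}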

WiX Installer heat.exe and non-ASCII filenames

I added a file in my WiX script with the character " î " in the path name. Light.exe will complain:
A string was provided with characters that are not available in the specified database code page '1252'
The character in question is 0xEE in Windows-1252 encoding, that is, U+00EE in Unicode or the bytes 0xC3 0xAE in UTF-8. The file entries are in a .wxs file generated by heat.exe, and that XML is encoded as UTF-8.
I assume the error message comes from the fact that light.exe tries to put the character into the database in UTF-8 encoding while the database codepage is 1252? Since UTF-8 isn't really supported by Windows Installer (as described in the WiX documentation), should I be using input XML encoded in 1252 or ISO-8859-1? If so, can I tell heat.exe to use another encoding for its output?
My question is similar to this one:
Leveraging heat.exe and harvest already localized file names and including them to msi using wix, but the difference is that in that case the characters are "true" non-ANSI characters; in my case the character can be encoded correctly in 1252, but it seems the conversion from UTF-8 input files does not work.
The WiX toolset verifies codepages like so (roughly):
encoding = Encoding.GetEncoding(codepage,
    new EncoderExceptionFallback(), new DecoderExceptionFallback());
writer = new StreamWriter(idtPath, false, encoding);

try
{
    // GetBytes will throw an exception if any character doesn't
    // match our current encoding
    rowBytes = writer.Encoding.GetBytes(rowString);
}
catch (EncoderFallbackException)
{
    rowBytes = convertEncoding.GetBytes(rowString);

    messageHandler.OnMessage(WixErrors.InvalidStringForCodepage(
        row.SourceLineNumbers, writer.Encoding.WindowsCodePage));
}
It is possible that NETFX is not translating that "î" correctly. Explicitly setting the codepage on your XML may help. To do that from heat, you can try to use an XSLT (I've never tried changing the XML document codepage via XSL, but it seems possible) or post-edit the document.
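If post-editing is acceptable, one low-tech option is to re-encode the generated .wxs from UTF-8 to Windows-1252 and fix up its XML declaration to match. A rough C++ sketch under that assumption (the file names are hypothetical, and the declaration match may need adjusting to your heat output):

#include <windows.h>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical post-edit pass: load a heat-generated .wxs (UTF-8),
// rewrite its declaration, and save it as Windows-1252.
int main()
{
    std::ifstream in("Fragment.wxs", std::ios::binary);   // hypothetical name
    std::string utf8((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());

    // UTF-8 -> UTF-16.
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                   (int)utf8.size(), NULL, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(),
                        &wide[0], wlen);

    // Drop a UTF-8 BOM if present; it has no meaning in 1252.
    if (!wide.empty() && wide[0] == 0xFEFF)
        wide.erase(0, 1);

    // Fix the declaration so parsers agree with the new byte encoding
    // (adjust the match if your declaration differs in case).
    const std::wstring oldDecl = L"encoding=\"utf-8\"";
    size_t pos = wide.find(oldDecl);
    if (pos != std::wstring::npos)
        wide.replace(pos, oldDecl.size(), L"encoding=\"windows-1252\"");

    // UTF-16 -> Windows-1252; fail loudly if a character has no mapping.
    BOOL lost = FALSE;
    int alen = WideCharToMultiByte(1252, 0, wide.data(), (int)wide.size(),
                                   NULL, 0, NULL, &lost);
    std::string ansi(alen, '\0');
    WideCharToMultiByte(1252, 0, wide.data(), (int)wide.size(),
                        &ansi[0], alen, NULL, &lost);
    if (lost)
    {
        std::fprintf(stderr, "some characters have no 1252 mapping\n");
        return 1;
    }

    std::ofstream out("Fragment-1252.wxs", std::ios::binary);  // hypothetical
    out.write(ansi.data(), (std::streamsize)ansi.size());
    return 0;
}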

Search for Unicode text in Windows XP

Is there a way of searching for Unicode characters inside a text file under Windows XP? For example, suppose I wish to find text documents containing the euro symbol. Although the standard XP search allows me to search for the euro symbol, it does not produce any matches when I know there should be at least a few. Wingrep has the same issue. Is there any simple software or setting that I have missed?
The input encoding of the search field (in Windows XP, UTF-16) may not match the encoding of the text file (probably UTF-8).
I haven't used this tool (freeware), but it might work for your needs.
In Windows, or any other system, how can you find out whether a document is Unicode (i.e., contains Unicode characters) or not?
To achieve this, just use this simple code. Note that it is written in C#; you should use your own equivalent.
public bool IsUnicode(string str)
{
    // If the ASCII byte count differs from the UTF-8 byte count, at
    // least one character required a multi-byte UTF-8 sequence, i.e.
    // the string contains non-ASCII characters.
    int asciiBytesCount = System.Text.Encoding.ASCII.GetByteCount(str);
    int utf8BytesCount = System.Text.Encoding.UTF8.GetByteCount(str);
    return asciiBytesCount != utf8BytesCount;
}
If you do not want to write any code to find out whether the document contains any Unicode characters, just look at the encoding type shown when you save the document (for example, in Notepad's Save As dialog).