Inno script function "LoadStringFromFile" NOT reading Unicode correctly [duplicate]

I have a function called GetServerName. I need to pass it the file name (for example, 'test.txt') as well as the required section string (for example, 'server').
The test.txt file contains something like this:
data1 | abcd
data2 | efgh
server| 'serverName1'
data3 | ijkl
I need to extract the server name, so I call something like GetServerName('test.txt', 'server') and it should return serverName1.
My problem is that test.txt used to be an ANSI-encoded file. Now it can be either an ANSI-encoded or a Unicode-encoded file. The function below worked correctly for an ANSI-encoded file, but it fails if the file is encoded in Unicode. I suspect the LoadStringsFromFile function, because when I debug, I can see that it returns Unicode characters instead of human-readable characters. How can I solve this simply? (Or how can I find the encoding of my file and convert a Unicode string to ANSI for comparison? Then I can do it myself.)
function GetServerName(const FileName, Section: string): string;
// Get Smartlink server name
var
  DirLine: Integer;
  LineCount: Integer;
  SectionLine: Integer;
  Lines: TArrayOfString;
  //Lines: String;
  AHA: TArrayOfString;
begin
  Result := '';
  if LoadStringsFromFile(FileName, Lines) then
  begin
    LineCount := GetArrayLength(Lines);
    for SectionLine := 0 to LineCount - 1 do
    begin
      AHA := StrSplit(Trim(Lines[SectionLine]), '|');
      if AHA[0] = Section then
      begin
        Result := AHA[1];
        Exit;
      end;
    end;
  end;
end;

First, note that Unicode is not an encoding. Unicode is a character set. The encodings are UTF-8, UTF-16, UTF-32, etc. So we do not know which encoding you actually use.
In the Unicode version of Inno Setup, the LoadStringsFromFile function (plural – do not confuse with singular LoadStringFromFile) uses the current Windows Ansi encoding by default.
But, if the file has the UTF-8 BOM, it will treat the contents accordingly. The BOM is a common way to autodetect the UTF-8 (and other UTF-*) encoding. You can create a file in the UTF-8 encoding with BOM using Windows Notepad.
UTF-16 or other encodings are not supported natively.
For an implementation of reading a UTF-16 file, see Reading UTF-16 file in Inno Setup Pascal script.
For working with files in any encoding, including UTF-8 without BOM, see Inno Setup - Convert array of string to Unicode and back to ANSI or Inno Setup replace a string in a UTF-8 file without BOM.
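The BOM-based autodetection mentioned above can be sketched in a few lines. This is Python rather than Inno Setup Pascal Script, and it only illustrates the idea; the helper names `sniff_encoding` and `decode_text` and the `cp1252` fallback (standing in for "the current Windows ANSI encoding") are assumptions made for this example.

```python
import codecs

# BOM signatures, longest first so that the UTF-32 LE BOM (FF FE 00 00)
# is not mistaken for the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Return a codec name based on the BOM, or the fallback if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return fallback


def decode_text(data: bytes, fallback: str = "cp1252") -> str:
    """Decode file contents; the chosen codecs strip the BOM themselves."""
    return data.decode(sniff_encoding(data, fallback))
```

According to the answer above, LoadStringsFromFile recognizes only the UTF-8 BOM case; the UTF-16 and UTF-32 branches here correspond to what the linked questions implement by hand.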

Any chance for Indy 10 to output Unicode with Delphi 6?

I gave Indy 10 a try on Delphi 6.
The problem is that with the old Indy I was able to output Unicode as UTF-8 in an AnsiString by setting the proper encoding in ResponseInfo.ContentType. Now I have lost the Unicode output. Here is an example of how I output a Unicode string with the old Indy:
var
  MyUnicodeBodyString: WideString;
function MyUTF8Encode(const s: WideString): UTF8String;
var
  Len: Integer;
begin
  Len := WideCharToMultiByte(CP_UTF8, 0, PWideChar(s), Length(s), nil, 0, nil, nil);
  SetLength(Result, Len);
  if Len > 0 then
    WideCharToMultiByte(CP_UTF8, 0, PWideChar(s), Length(s), PAnsiChar(Result), Len, nil, nil);
end;
begin
  // ...
  AResponseInfo.ContentText := MyUTF8Encode(MyUnicodeBodyString);
end;
When I do the same with Indy 10, the output is like
Товар
(that is, the UTF-8 string with each of its bytes then encoded as a separate Unicode character).
When I change the output to just
AResponseInfo.ContentText := MyUnicodeBodyString;
I see the normal output of ASCII and of symbols for "language for non-Unicode programs" (in Windows control panel). Other languages are garbled.
Indy 10 is programmed with "string" and probably assumes that "string" is WideString, but in Delphi 6 string is an alias for AnsiString.
Can I influence the output of Indy 10 HTTP Server without replacing every string in Indy 10 source code with WideString ?
Indy 10 is programmed with "string" and probably assumes that "string" is WideString
That is incorrect. Indy's existence predates Delphi's switch to Unicode in Delphi 2009, so Indy has a lot of backwards compatibility for handling AnsiString in Delphi 2007 and earlier. In those versions, Indy does not use or assume WideString anywhere in its public API (well, except for in the IIdTextEncoding interface), everything is based on AnsiString instead.
in Delphi 6 string is an alias for AnsiString.
Yes, exactly. Which is why the preferred way to send non-ASCII content in an older ANSI version of Delphi is to use ANSI-encoded strings, eg:
var
MyAnsiBodyString: AnsiString;
...
AResponseInfo.CharSet := 'utf-8';
AResponseInfo.ContentText := MyAnsiBodyString;
...
If the AnsiString is encoded in the default OS ANSI codepage (as it typically should be), then Indy will simply convert the AnsiString to Unicode using that codepage by default, and then encode that Unicode result as UTF-8 for transmission.
Can I influence the output of Indy 10 HTTP Server without replacing every string in Indy 10 source code with WideString ?
Yes. In pre-Unicode versions of Delphi, most of Indy's components/classes have additional properties/parameters to specify an ANSI byte encoding, allowing Indy to properly convert an AnsiString to Unicode before charset-converting the Unicode to bytes for transmission (and vice versa on reception).
So, if you want to send an AnsiString that is already pre-encoded as UTF-8, one approach is to manually set the AResponseInfo.ContentLength property, as well as the IOHandler.DefAnsiEncoding property, eg:
var
  MyUtf8Str: UTF8String;
...
MyUtf8Str := MyUTF8Encode(MyUnicodeBodyString);
AResponseInfo.CharSet := 'utf-8';
AResponseInfo.ContentText := MyUtf8Str;
AResponseInfo.ContentLength := Length(MyUtf8Str);
AContext.Connection.IOHandler.DefAnsiEncoding := IndyTextEncoding_UTF8;
...
If you don't set the ContentLength manually, TIdHTTPResponseInfo.WriteHeader() will calculate that value for you, by converting the ContentText to WideString using the RTL's default ANSI->Unicode conversion, and then encoding that WideString to UTF-8 to get the byte count. However, the initial ANSI->Unicode conversion will not know your AnsiString is encoded in UTF-8 and thus will not process it correctly.
If you don't set the DefAnsiEncoding manually, TIdIOHandler.Write() will use the default DefAnsiEncoding setting of IndyTextEncoding_OSDefault to convert the ContentText to Unicode using the OS's default ANSI codepage, which is likely not UTF-8 and so will not convert the text to Unicode properly before then encoding the Unicode result to UTF-8 bytes.
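The effect of the two-step conversion described above (ANSI bytes to Unicode, then Unicode to UTF-8) can be modelled outside Delphi. The following Python sketch is not Indy code: `transmit` and its parameter name are hypothetical, and `cp1252` is assumed as an example of an OS default ANSI codepage. It only shows why bytes that are already UTF-8 come out double-encoded when the first step uses the wrong codepage.

```python
def transmit(ansi_bytes: bytes, def_ansi_encoding: str) -> bytes:
    """Model of the pipeline: ANSI bytes -> Unicode text -> UTF-8 bytes on the wire."""
    text = ansi_bytes.decode(def_ansi_encoding)  # step 1: "ANSI" -> Unicode
    return text.encode("utf-8")                  # step 2: Unicode -> UTF-8


# The body is already UTF-8, as produced by a helper like MyUTF8Encode.
body = "Товар".encode("utf-8")

# Step 1 done with the wrong codepage: every UTF-8 byte becomes its own
# character, and step 2 then encodes each of those characters again.
wrong = transmit(body, "cp1252")

# Step 1 done with UTF-8: the bytes survive the round trip unchanged.
right = transmit(body, "utf-8")

print(len(body), len(right), len(wrong))
```

In this model, choosing `"utf-8"` for the first step plays the role that setting DefAnsiEncoding to IndyTextEncoding_UTF8 plays in the answer.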
Another approach is to use AResponseInfo.ContentStream instead of AResponseInfo.ContentText. That way, you can simply store your UTF-8 bytes in a TMemoryStream or TStringStream and then TIdHTTPResponseInfo.WriteContent() can send those bytes as-is, eg:
AResponseInfo.CharSet := 'utf-8';
AResponseInfo.ContentStream := TStringStream.Create(MyUTF8Encode(MyUnicodeBodyString));
Or:
var
MyUtf8Str: UTF8String;
...
MyUtf8Str := MyUTF8Encode(MyUnicodeBodyString);
AResponseInfo.CharSet := 'utf-8';
AResponseInfo.ContentStream := TMemoryStream.Create;
AResponseInfo.ContentStream.WriteBuffer(PAnsiChar(MyUtf8Str)^, Length(MyUtf8Str));
AResponseInfo.ContentStream.Position := 0;
Or:
AResponseInfo.CharSet := 'utf-8';
AResponseInfo.ContentStream := TMemoryStream.Create;
WriteStringToStream(AResponseInfo.ContentStream, MyUTF8Encode(MyUnicodeBodyString), IndyTextEncoding_UTF8, IndyTextEncoding_UTF8);
AResponseInfo.ContentStream.Position := 0;

Trying to upload specific characters in Python 3 using Windows Powershell

I'm running this code in Windows PowerShell. It reads a file called languages.txt, and I'm trying to convert between bytes and strings.
Here is languages.txt:
Afrikaans
አማርኛ
Аҧсшәа
العربية
Aragonés
Arpetan
Azərbaycanca
Bamanankan
বাংলা
Bân-lâm-gú
Беларуская
Български
Boarisch
Bosanski
Буряад
Català
Чӑвашла
Čeština
Cymraeg
Dansk
Deutsch
Eesti
Ελληνικά
Español
Esperanto
فارسی
Français
Frysk
Gaelg
Gàidhlig
Galego
한국어
Հայերեն
हिन्दी
Hrvatski
Ido
Interlingua
Italiano
עברית
ಕನ್ನಡ
Kapampangan
ქართული
Қазақша
Kreyòl ayisyen
Latgaļu
Latina
Latviešu
Lëtzebuergesch
Lietuvių
Magyar
Македонски
Malti
मराठी
მარგალური
مازِرونی
Bahasa Melayu
Монгол
Nederlands
नेपाल भाषा
日本語
Norsk bokmål
Nouormand
Occitan
Oʻzbekcha/ўзбекча
ਪੰਜਾਬੀ
پنجابی
پښتو
Plattdüütsch
Polski
Português
Română
Romani
Русский
Seeltersk
Shqip
Simple English
Slovenčina
کوردیی ناوەندی
Српски / srpski
Suomi
Svenska
Tagalog
தமிழ்
ภาษาไทย
Taqbaylit
Татарча/tatarça
తెలుగు
Тоҷикӣ
Türkçe
Українська
اردو
Tiếng Việt
Võro
文言
吴语
ייִדיש
中文
Then, here is the code I used:
import sys
script, input_encoding, error = sys.argv
def main(language_file, encoding, errors):
    line = language_file.readline()
    if line:
        print_line(line, encoding, errors)
        return main(language_file, encoding, errors)
def print_line(line, encoding, errors):
    next_lang = line.strip()
    raw_bytes = next_lang.encode(encoding, errors=errors)
    cooked_string = raw_bytes.decode(encoding, errors=errors)
    print(raw_bytes, "<===>", cooked_string)
languages = open("languages.txt", encoding="utf-8")
main(languages, input_encoding, error)
Here's the output (credit: Learn Python 3 the Hard Way by Zed A. Shaw):
I don't know why it doesn't upload the characters and shows question blocks instead. Can anyone help me?
The first string which fails is አማርኛ. Its first character, አ, is U+12A0 in Unicode. In UTF-8, that is b'\xe1\x8a\xa0'. So, that part is obviously fine. The file really is UTF-8.
Printing did not raise an exception, so your output encoding can handle all of the characters. Everything is fine.
The only remaining reason I see for it to fail is that the font used in the console does not support all of the characters.
If it is just for play, you should not worry about it. Consider it working correctly.
On the other hand, I would suggest changing some things in your code:
You are running main recursively for each line. There is absolutely no need for that, and it would run into the recursion depth limit on a longer file. Use a for loop instead.
for line in language_file:
    print_line(line, encoding, errors)
You are opening the file as UTF-8, so reading from it automatically decodes UTF-8 into Unicode. Then you encode it back into raw_bytes and decode it again into cooked_string, which is the same as next_lang. It would be a better exercise to read the file as raw binary, split it on newlines and then decode. Then you'd have a clearer picture of what is going on.
with open("languages.txt", 'rb') as f:
    raw_file_contents = f.read()
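A minimal sketch of how that exercise could continue, assuming the file is UTF-8: split the raw bytes on newlines first, then decode each chunk. The helper name `decode_lines` and the three-line stand-in for languages.txt are made up for this example.

```python
import os
import tempfile


def decode_lines(raw: bytes, encoding: str = "utf-8", errors: str = "strict"):
    """Split raw bytes on line breaks, then decode each non-empty chunk."""
    return [
        (chunk, chunk.decode(encoding, errors=errors))
        for chunk in raw.splitlines()
        if chunk
    ]


# A short stand-in for languages.txt so the sketch is self-contained.
sample = "Afrikaans\nአማርኛ\n日本語\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "languages.txt")
    with open(path, "wb") as f:
        f.write(sample.encode("utf-8"))
    # Binary mode: no decoding happens while reading.
    with open(path, "rb") as f:
        raw_file_contents = f.read()

# Each entry pairs the bytes as stored on disk with the decoded string.
pairs = decode_lines(raw_file_contents)
```

Printing each pair as `print(raw_bytes, "<===>", cooked_string)` gives the same output format as the original script, but here the bytes shown are the ones actually stored in the file.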

Trouble understanding C# URL decode with Unicode character(s) in PowerShell

I'm currently working on something that requires me to pass a Base64 string to a PowerShell script. But while decoding the string back to the original I'm getting some unexpected results as I need to use UTF-7 during decoding and I don't understand why. Would someone know why?
The Mozilla documentation would suggest that it's insufficient to use Base64 if you have Unicode characters in your string. Thus you need to use a workaround that consists of using encodeURIComponent and a replace. I don't really get why the replace is needed and shortened it to btoa(escape('✓ à la mode')) to encode the string. The result of that operation would be JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl.
Using PowerShell to decode the string back to the original, I first need to undo the Base64 encoding. To do that, System.Convert can be used (which results in a byte array), and its output can be converted to a UTF-8 string using System.Text.Encoding. Together this would look like the following:
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
What's left to do is URL decode the whole thing. As it is a UTF-8 string, I'd expect only to need to run the URL decode without any further parameters. But if you do that, you end up with an accented a that looks like � in a file or ? on the console. To get the actual original string, it's necessary to tell the URL decode to use UTF-7 as the character set. It's nice that this works, but I don't really get why it's necessary, since the string should be UTF-8 and UTF-8 certainly supports an accented a. See the last two lines of the entire script for what I mean. With those two lines you will end up with one line that has the garbled text and one which has the original text, in the same file encoded as UTF-8.
Entire PowerShell script:
Add-Type -AssemblyName System.Web
$inputstring = "JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl"
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
[System.Web.HttpUtility]::UrlDecode($utf8string) | Out-File -Encoding utf8 C:\temp\output.txt
[System.Web.HttpUtility]::UrlDecode($utf8string, [System.Text.UnicodeEncoding]::UTF7) | Out-File -Append -Encoding utf8 C:\temp\output.txt
Clarification:
The problem isn't the conversion of the Base64 to UTF-8. The problem is some inconsistent behavior of the UrlDecode of C#. If you run escape('✓ à la mode') in your browser, you will end up with the following string: %u2713%20%E0%20la%20mode. So we have a four-digit %u escape for the check mark and a two-digit %xx escape for the à. If we use this directly in UrlDecode, we end up with the same error. My current assumption would be that it's an issue with the encoding of the PowerShell window and pasting characters into it.
Turns out it actually isn't all that strange. It's just that, for what I want to do, it's advantageous to use a newer function. I'm still not sure why it works if you use the UTF-7 encoding. But anyway, as an explanation:
... The hexadecimal form for characters, whose code unit value is 0xFF or less, is a two-digit escape sequence: %xx. For characters with a greater code unit, the four-digit format %uxxxx is used.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/escape
As TesselatingHecksler pointed out, "What is the proper way to URL encode Unicode characters?" indicates that the %u format was never formally standardized. A newer function to escape characters exists, though, which is encodeURIComponent.
The encodeURIComponent() function encodes a Uniform Resource Identifier (URI) component by replacing each instance of certain characters by one, two, three, or four escape sequences representing the UTF-8 encoding of the character (will only be four escape sequences for characters composed of two "surrogate" characters).
The output of this function actually works with the C# implementation of UrlDecode without supplying an additional encoding of UTF-7.
The originally linked Mozilla article about Base64-encoding UTF-8 strings modifies the whole process in a way that allows you to just call the Base64 decode function in order to get the whole string. This is achieved by converting the URL-encoded version of the string to bytes.
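The difference between the two escape formats can be reproduced in Python. This is a sketch, not the .NET implementation: `unescape_js` is a hypothetical helper written for this example, and Python's `urllib.parse.quote` with `safe=""` is used as a stand-in for encodeURIComponent (the two agree for this sample string, but not for every character).

```python
import base64
import re
from urllib.parse import quote, unquote


def unescape_js(s: str) -> str:
    """Decode legacy escape() output, where %uXXXX and %XX are UTF-16 code units."""
    decoded = re.sub(
        r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1) or m.group(2), 16)),
        s,
    )
    # Join surrogate pairs that escape() emits for characters outside the BMP.
    return decoded.encode("utf-16", "surrogatepass").decode("utf-16")


# The Base64 payload from the question decodes to the escape() output.
payload = base64.b64decode("JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl").decode("ascii")

# A UTF-8 percent-decoder does not know %uXXXX, and %E0 alone is not valid UTF-8.
as_utf8 = unquote(payload)

# Treating every escape as a code unit recovers the original text.
legacy = unescape_js(payload)

# The UTF-8 percent-encoding round-trips with a standard decoder.
modern = quote("✓ à la mode", safe="")
round_trip = unquote(modern)
```

Here `as_utf8` still contains the literal `%u2713` and a replacement character where the à should be, while `legacy` and `round_trip` both equal the original string.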

Writing UTF16 file with std::fstream

Is it possible to imbue a std::fstream so that a std::string containing UTF-8 encoding can be streamed to a UTF-16 file?
I tried the following using the utf8-to-utf16 facet, but the result file is still UTF-8:
std::fstream utf16_stream("test.txt", std::ios_base::trunc | std::ios_base::out);
utf16_stream.imbue(std::locale(std::locale(), new codecvt_utf8_utf16<wchar_t,
    std::codecvt_mode(std::generate_header | std::little_endian)>));
std::string utf8_string = "\x54\xE2\x83\xac\x73\x74";
utf16_stream << utf8_string;
References for the codecvt_utf8_utf16 facet seem to indicate it can be used to read and write UTF-8 files, not UTF-16 - is that correct, and if so, is there a simple way to do what I want to do?
File streams (by virtue of the requirements of std::basic_filebuf, §22.4.1.4.2[locale.codecvt.virtuals]/3) do not support N:M character encoding conversions, as would be the case with UTF-8 internal / UTF-16 external.
You'd have to build a UTF-16 string, e.g. by using wstring_convert, reinterpret it as a sequence of bytes, and output it using a usual (non-converting) std::ofstream.
Or, alternatively, convert UTF-8 to wide first, and then use std::codecvt_utf16 which produces UTF-16 as a sequence of bytes, and therefore, can be used with file streams.
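The first suggestion (convert to UTF-16 yourself, then write the result as plain bytes through a non-converting stream) is language-independent. Below is a minimal Python sketch of that byte-level idea, not C++ code; the helper name `utf8_to_utf16le_with_bom` and the sample string "T€st" are assumptions made for this example.

```python
import codecs
import os
import tempfile


def utf8_to_utf16le_with_bom(utf8_bytes: bytes) -> bytes:
    """Decode UTF-8, re-encode as UTF-16 LE, and prepend the BOM (FF FE)."""
    text = utf8_bytes.decode("utf-8")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


utf8_string = b"\x54\xe2\x82\xac\x73\x74"  # "T€st" as UTF-8 bytes
utf16_bytes = utf8_to_utf16le_with_bom(utf8_string)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    # Binary mode: the stream writes the bytes as-is, with no conversion.
    with open(path, "wb") as f:
        f.write(utf16_bytes)
    with open(path, "rb") as f:
        written = f.read()
```

The conversion happens entirely before the write, so the file stream never has to map a variable number of input bytes to a variable number of output bytes.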