Unicode encoding / decoding error in Free Pascal 3.2.0

This test passed with Free Pascal 3.0.4 (source file encoding is UTF-8, OS is Windows 10 64-bit):
{$MODE DELPHI}
...
var
  Raw: RawByteString;
  Actual: string;
begin
  Raw := UTF8Encode('关于汉语');
  Actual := string(UTF8Decode(Raw));
  CheckEquals('关于汉语', Actual);
end;
With Free Pascal 3.2.0 it fails:
expected: <关于汉语> but was: <å³äºæ±è¯­>
RawByteString is declared as type AnsiString(CP_NONE) in the System unit (systemh.inc).

The conversion works if I cast (or declare) the characters as UnicodeString. The following test succeeds with Free Pascal 3.2.0:
procedure TFreePascalTests.TestUTF8Encode;
const
  THE_CHARACTERS: UnicodeString = '关于汉语';
var
  Raw: UTF8String;
  Actual: UnicodeString;
begin
  Raw := UTF8Encode(THE_CHARACTERS);
  Actual := UTF8Decode(Raw);
  CheckEquals(THE_CHARACTERS, Actual);
end;
The Raw variable may be declared as either RawByteString or UTF8String.
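Not part of the original report, but a small diagnostic sketch that may help: the code page stored in an AnsiString/RawByteString value can be inspected with StringCodePage, so you can check whether the UTF8Encode result really carries CP_UTF8 (65001). This assumes the source file is saved as UTF-8 with a BOM (or compiled with -FcUTF8) so the literal is read correctly:
{$MODE DELPHI}
var
  Raw: RawByteString;
begin
  Raw := UTF8Encode(UnicodeString('关于汉语'));
  // StringCodePage reports the code page recorded in the string header;
  // for a correct UTF-8 payload this prints 65001 (CP_UTF8).
  WriteLn(StringCodePage(Raw));
end.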

Related

Syntax for Returning One Character of String by Index

I am attempting to compare one character of a string to see if it is my delimiter character. However, when I execute the following code, the value that gets placed in the variable valstring is a number representing the byte, converted to a string, and not the character itself. For example, the value may be the string '58'.
Through my testing in CoDeSys using the debugging features I know that the string sReadLine contains a valid string of characters. I'm just not sure of the syntax to single only one of them out; the sReadLine[valPos + i] part is what I don't understand.
sReadLine : STRING;
valstring : STRING;
i : INT;
valPos : INT;

FOR i := 0 TO 20 DO
    IF BYTE_TO_STRING(sReadLine[valPos + i]) = '"' THEN
        EXIT;
    END_IF
    valstring := CONCAT(STR1 := valstring, STR2 := BYTE_TO_STRING(sReadLine[valPos + i]));
END_FOR
I think you have multiple choices.
1) Use built-in string functions instead. You can use the MID function to get part of a string; in your case, something like "get one character at position valPos + i from sReadLine":
FOR i := 0 TO 20 DO
    // MID(STR, LEN, POS) returns LEN characters of STR starting at the 1-based position POS
    IF MID(sReadLine, 1, valPos + i) = '"' THEN
        EXIT;
    END_IF
    valstring := CONCAT(STR1 := valstring, STR2 := MID(sReadLine, 1, valPos + i));
END_FOR
2) Convert the ASCII byte to a string. In TwinCAT systems, there is a function F_ToCHR. It takes an ASCII byte in and returns the character as a string. I can't find something like that for Codesys, but I'm sure there is a solution in some library. Please note that this won't work in Codesys without modifications:
FOR i := 0 TO 20 DO
    IF F_ToCHR(sReadLine[valPos + i]) = '"' THEN
        EXIT;
    END_IF
    valstring := CONCAT(STR1 := valstring, STR2 := F_ToCHR(sReadLine[valPos + i]));
END_FOR
3) The OSCAT library seems to have a CHR_TO_STRING function. You could use this instead of F_ToCHR in step 2.
4) You can use pointers to copy the ASCII byte to a string array (MemCpy) and add a string end character. This needs some knowledge of pointers etc.; see the Codesys forum for examples, and the sketch after this list.
5) You can write a helper function similar to step 2 yourself. Check the example from the Codesys forums; that example doesn't include all characters, so it needs to be updated, and it's not particularly elegant.
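Not from the original answer, but a minimal sketch of the pointer idea in option 4 (CODESYS 3 ST, untested; the function name F_BYTE_TO_CHAR is made up here):
FUNCTION F_BYTE_TO_CHAR : STRING(1)
VAR_INPUT
    byIn : BYTE;
END_VAR
VAR
    sTmp : STRING(1);          // one character plus the terminating zero byte
    pTmp : POINTER TO BYTE;
END_VAR

pTmp  := ADR(sTmp);
pTmp^ := byIn;                 // write the raw ASCII byte as the first character
pTmp  := pTmp + 1;
pTmp^ := 0;                    // string end character
F_BYTE_TO_CHAR := sTmp;
It could then be used in the loop above in the same way as F_ToCHR.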
When you convert a byte to a string with BYTE_TO_STRING, what gets converted is the numeric value of the byte.
That byte, however, is the code of an ASCII character (the ASCII decimal value of ':' is 58).
So if you want to concatenate the characters instead of their ASCII decimal representations, you need another function:
valstring := CONCAT(STR1 := valstring, STR2 := F_ToCHR(sReadLine[valPos + i]));
EDIT:
Like Quirzo, I couldn't find a similar F_ToCHR function for Codesys, but you could easily build one yourself.
For example:
Declaration Part:
FUNCTION F_ASCII_TO_STRING : STRING
VAR_INPUT
input : BYTE;
END_VAR
VAR
ascii : ARRAY[0..255] OF STRING(1):=
[
33(' '),'!','"','#',
'$$' ,'%' ,'&' ,'$'' ,
'(' ,')' ,'*' ,'+' ,
',' ,'-' ,'.' ,'/' ,
'0' ,'1' ,'2' ,'3' ,
'4' ,'5' ,'6' ,'7' ,
'8' ,'9' ,':' ,';' ,
'<' ,'=' ,'>' ,'?' ,
'@' ,'A' ,'B' ,'C' ,
'D' ,'E' ,'F' ,'G' ,
'H' ,'I' ,'J' ,'K' ,
'L' ,'M' ,'N' ,'O' ,
'P' ,'Q' ,'R' ,'S' ,
'T' ,'U' ,'V' ,'W' ,
'X' ,'Y' ,'Z' ,'[' ,
'\' ,']' ,'^' ,'_' ,
'`' ,'a' ,'b' ,'c' ,
'd' ,'e' ,'f' ,'g' ,
'h' ,'i' ,'j' ,'k' ,
'l' ,'m' ,'n' ,'o' ,
'p' ,'q' ,'r' ,'s' ,
't' ,'u' ,'v' ,'w' ,
'x' ,'y' ,'z' ,'{' ,
'|' ,'}' ,'~'
];
END_VAR
Implementation part:
F_ASCII_TO_STRING := ascii[input];
As Sergey said, this might not be an optimal solution to your problem. It seems like you want to extract, starting from position valPos, the longest substring of sReadLine that does not contain the '"' character, and store it in valstring.
In your implementation, for each valid input character, CONCAT() has to search for the end of valstring before appending a single character to it.
You should rather decompose the problem and use two standard functions for an efficient solution:
FIND() --> to get the position of the next '"' character (or to know that there is none),
MID() --> to create a string from the initial position up to just before that '"' character (or to the end of the input string).
That way, only two loops remain, and each one is hidden inside these functions.
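A minimal sketch of that decomposition (CODESYS ST, untested; it assumes valPos is a 1-based position and that the relevant quote is the first one at or after valPos):
VAR
    sReadLine : STRING(255);
    sRest     : STRING(255);
    valstring : STRING(255);
    valPos    : INT;
    quotePos  : INT;
END_VAR

// remainder of the line starting at valPos (MID arguments: string, length, 1-based position)
sRest := MID(sReadLine, LEN(sReadLine) - valPos + 1, valPos);

// 1-based position of '"' in sRest, or 0 if there is none
quotePos := FIND(sRest, '"');

IF quotePos = 0 THEN
    valstring := sRest;                       // no quote: take the rest of the line
ELSE
    valstring := MID(sRest, quotePos - 1, 1); // everything before the quote
END_IF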

Inno Setup: recording/recover file path in UTF8

We are using Inno Setup (Unicode version) to create a resource package (or "samples") for our product. The program part of our product learns the location of the samples from a file written by the samples installer. Currently, this is implemented in a plain way:
procedure CurStepChanged(CurStep: TSetupStep);
begin
  if (CurStep = ssPostInstall) then
  begin
    ForceDirectories(ExpandConstant('{userappdata}\MyCompany\MyApp'));
    SaveStringToFile(ExpandConstant('{userappdata}\MyCompany\MyApp\SamplePath.txt'), ExpandConstant('{app}'), False);
  end;
end;
This plain approach has a fatal issue: the installer is run on Chinese-language Windows, where the whole thing works in the GBK encoding, but our product is built on a UTF-8 base.
After some searching, I found a possible solution that calls the Windows WideCharToMultiByte function from Pascal code. However, this won't work here, as it requires UTF-16 as input, but what I have is GBK.
In addition, Inno Setup also won't work with an existing UTF-8 file name in my SamplePath.txt. If I manually edit SamplePath.txt to contain UTF-8-encoded Chinese characters and initialize the built-in app directory with the following code, the directory selection page displays garbled characters:
[Setup]
DefaultDirName={code:GetPreviousSampleDir}

[Code]
function GetPreviousSampleDir(Param: String): String;
var
  tmp: AnsiString;
begin
  if FileExists(ExpandConstant('{userappdata}\MyCompany\MyApp\SamplePath.txt')) then
  begin
    LoadStringFromFile(ExpandConstant('{userappdata}\MyCompany\MyApp\SamplePath.txt'), tmp);
    Result := tmp;
  end
  else
  begin
    Result := 'D:\MyApp_samples';
  end;
end;
So is there any way to load/store a file name with i18n characters in UTF8?
To load a string from a UTF-8 file, use LoadStringFromFileInCP from Inno Setup - Convert array of string to Unicode and back to ANSI:
const
  CP_UTF8 = 65001;

{ ... }

var
  FileName: string;
  S: string;
begin
  FileName := 'test.txt';
  if not LoadStringFromFileInCP(FileName, S, CP_UTF8) then
  begin
    Log('Error reading the file');
  end
  else
  begin
    Log('Read: ' + S);
  end;
end;
To save a UTF-8 file without a BOM:
either use SaveStringsToFileInCP from the same question,
or see Create a UTF8 file without BOM with Inno Setup (Unicode version).

Is white space relevant when casting from the 'any' type in Apama EPL?

I am on Apama 10.3 (Community Edition):
any emptyString := "";
any emptyDictionary := new dictionary<string,any>;
string myString := <string> emptyString;
dictionary<string,any> myDictionary := <dictionary<string,any>> emptyDictionary;
The cast in line 3 works, but for line 4 Designer complains about an unexpected token: <. It only works if I add white space:
dictionary<string,any> myDictionary := <dictionary< string,any> > emptyDictionary;
This is not mentioned in the documentation Developing Apama Applications, but on page 296, where casting with optional<> is shown, the correct syntax with the white space is used.
Does this work as expected or is it a bug?
The problem here isn't specific to casting from the any type. It is due to the EPL parser always interpreting the token >> as a right-shift operator. If you need to close two angle brackets, you always need to put a space between them. Only the closing brackets are affected (as you'd never need to write << in EPL).
The form I always use is:
dictionary<string,any> x := <dictionary<string,any> > emptyDictionary;
sequence<sequence<string> > jaggedArray := new sequence<sequence<string> >;

How to convert a hexadecimal IPv6 string separated by ':' to decimal in PostgreSQL

I am trying to convert a hex string to a NUMERIC column for an IPv6 address.
The hexadecimal input is 2001:200:101:ffff:ffff:ffff:ffff:ffff
My output should be 42540528727106952925351778646877011967
I tried the function below, taken from this site, passing my input with the colons removed, as 2001200101ffffffffffffffffffff:
CREATE OR REPLACE FUNCTION hex_to_int(hexval varchar) RETURNS numeric AS $$
DECLARE
  result NUMERIC;
  i integer;
  len integer;
  hexchar varchar;
BEGIN
  result := 0;
  len := length(hexval);
  for i in 1..len loop
    hexchar := substr(hexval, len - i + 1, 1);
    result := result + 16 ^ (i - 1) * case
      when hexchar between '0' and '9' then cast(hexchar as int)
      when upper(hexchar) between 'A' and 'F' then ascii(upper(hexchar)) - 55
    end;
  end loop;
  RETURN result;
END;
$$
LANGUAGE plpgsql IMMUTABLE STRICT;
I am getting the decimal number as:
select hex_to_int('2001200101ffffffffffffffffffff');
hex_to_int
--------------------------------------
166176317495821453702777150266933247
How to get my actual decimal number?
The decimal result you have in your question is the correct result for the hexadecimal number in your question.
What you are missing are the suppressed leading zeros in the second and third IPv6 address words. You are incorrectly converting the IPv6 address string representation (2001:200:101:ffff:ffff:ffff:ffff:ffff) to the actual hexadecimal number (200102000101ffffffffffffffffffff); notice the added zero digits. Each IPv6 word is four hexadecimal digits; leading zeroes may be suppressed in the words of the string representation (RFC 5952 in fact requires it), but that doesn't mean they are not there.
You need to make sure each IPv6 address word is four hexadecimal digits (add any missing zeroes to get to four digits, and replace any double colons with the correct number of 0000 words) before removing the colons.
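Not from the original answer, but a small sketch of that normalization in PostgreSQL (it assumes the address contains no :: compression; the aliases t, w and ord are arbitrary):
-- Rebuild the full 32-digit hexadecimal string by padding every word to 4 digits.
SELECT string_agg(lpad(w, 4, '0'), '' ORDER BY ord) AS full_hex
FROM unnest(string_to_array('2001:200:101:ffff:ffff:ffff:ffff:ffff', ':'))
     WITH ORDINALITY AS t(w, ord);
-- full_hex is '200102000101ffffffffffffffffffff', which is what hex_to_int should be given.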
There doesn't seem to be any real, legitimate reason to convert an IPv6 address to a decimal representation. IP addresses (both IPv4 and IPv6) are binary numbers, and the hexadecimal representation of IPv6 directly translates to the binary. Adding decimal into the mix is just asking for trouble.

PowerBuilder 12 how to determine encoding of input file

I'm new to PowerBuilder 12 and would like to know whether there is any way to determine the encoding (e.g. Unicode, Big5) of an input file. Any comments and code samples are appreciated! Thanks!
From the PB 12.5 help file:
FileEncoding ( filename )
filename : The name of the file you want to test for encoding type
Return Values
EncodingANSI!
EncodingUTF8!
EncodingUTF16LE!
EncodingUTF16BE!
If filename does not exist, returns null.
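Not from the help file, but a minimal usage sketch (PowerScript), assuming FileEncoding is available in your PowerBuilder version and the path is valid:
// branch on the encoding reported for the file
encoding le_enc
le_enc = FileEncoding("C:\temp\input.txt")

choose case le_enc
    case EncodingUTF8!
        // handle UTF-8 input
    case EncodingUTF16LE!, EncodingUTF16BE!
        // handle UTF-16 input
    case EncodingANSI!
        // ANSI / DBCS encodings such as Big5 end up here
    case else
        // null return: the file does not exist
end choose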
Finding Unicode is pretty easy if you assume the Unicode file has a BOM prefix (although in reality not all Unicode files do). Some code to do this is below. However, I have no idea about Big5; at first glance at the spec (I've never had occasion to use it), it doesn't appear to have a similar prefix.
Good luck,
Terry
function of_filetype (string as_filename) returns encoding
integer li_NullCount, li_NonNullCount, li_OffsetTest
long ll_File
encoding le_Return
blob lblb_UTF16BE, lblb_UTF16LE, lblb_UTF8, lblb_Test, lblb_BOMTest, lblb_Null
lblb_UTF16BE = Blob ("~hFE~hFF", EncodingANSI!)
lblb_UTF16LE = Blob ("~hFF~hFE", EncodingANSI!)
lblb_UTF8 = Blob ("~hEF~hBB~hBF", EncodingANSI!)
lblb_Null = blobmid (blob ("~h01", encodingutf16le!), 2, 1)
SetNull (le_Return)
// Get a set of bytes to test
ll_File = FileOpen (as_FileName, StreamMode!, Read!, Shared!)
FileRead (ll_File, lblb_Test)
FileClose (ll_File)
// test for BOMs: UTF-16BE (FE FF), UTF-16LE (FF FE), UTF-8 (EF BB BF)
lblb_BOMTest = BlobMid (lblb_Test, 1, Len (lblb_UTF16BE))
IF lblb_BOMTest = lblb_UTF16BE THEN RETURN EncodingUTF16BE!
lblb_BOMTest = BlobMid (lblb_Test, 1, Len (lblb_UTF16LE))
IF lblb_BOMTest = lblb_UTF16LE THEN RETURN EncodingUTF16LE!
lblb_BOMTest = BlobMid (lblb_Test, 1, Len (lblb_UTF8))
IF lblb_BOMTest = lblb_UTF8 THEN RETURN EncodingUTF8!
//I've removed a hack from here that I wouldn't encourage others to use, basically checking for
//0x00 in places I'd "expect" them to be if it was a Unicode file, but that makes assumptions
RETURN le_Return
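A short usage note, not from the original answer: of_filetype returns a null encoding when no BOM is recognized, so a caller should check for that, for example:
encoding le_enc
le_enc = of_filetype("C:\temp\input.txt")
if IsNull(le_enc) then
    // no BOM found: could be ANSI, Big5, or a BOM-less Unicode file
end if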