DB2 UTF-8 Data Storage - Extraneous Byte Values

DB2 UTF-8 Data Storage - Extraneous Byte Values - unicode

I am attempting to store Unicode characters in UTF8 format on a DB2 database. I have confirmed that the charset is 1208 and the that the database is specified to hold UTF8.
I am, however, getting odd results when querying some unicode data.
select hex(firstname), firstname, from my_schema.my_table where my_pk = 1234;
The results are as below:
C383C289 Ã
The character in the result is displaying wrong. From what I gather, it's being represented by the hex values "C383C289". The actual character sent on the insert was É and should be represented in UTF8 as C389.
At this stage I'm assuming that it could be the program that I am using to query the data that is interpreting it wrong. But to what extent are the hex values (first result column) wrong? They seem to have unused fluff "83C2" between the actual bytes. Or, is "C383C289" actually correct, and some UTF8 decoding engines can't handle the fluff? This seems unlikely to me.
The client (DB2 For Toad, and WinSQL) both display the character as an Ã which is represented in UTF8 as C383.
*Edit. I tested on the CLI and it is correctly returning the É character. Am I missing something? Is the "hex" function returning something that it shouldn't be?

É (U+00C9) in UTF-8 is 0xC3 0x89.
Ã (U+00C3) in UTF-8 is 0xC3 0x83.
‰ (U+0089) in UTF-8 is 0xC2 0x89.
This means your insert code is taking É, encoding it to UTF-8 octets 0xC3 0x89 before then inserting those octets into the DB. The DB is interpreting them as individual characters 0xC3 and 0x89 and encoding them a second time into UTF-8, thus producing 0xC3 0x83 0xC2 0x89.
You need to fix your insert code to not perform that initial encode anymore, so the DB will see the original É as-is and not a pre-encoded version of it. How you actually do it is anyone's guess, since you did not show your actual insert code yet.

This is not really an answer, just to demonstrate the correct behaviour:
> db2 "insert into t1 values ('Élan')"
DB20000I The SQL command completed successfully.
> db2 select "hex (f1), f1 from t1"
1 F1
---------- -----
C3896C616E Élan
1 record(s) selected.

Related

Powershell, can't enter hashtable keys with some non-ASCII characters (in a script)

I'm trying to create a PowerShell hash table to convert non-ASCII (UTF8) characters to their ASCII look-a-likes.
Here are two hash table entries as examples: 'ñ'='n' and 'Ñ'='N'.
Editor's note: Using both theses entries in the same hash table literal (#{ 'ñ'='n'; 'Ñ'='N' }) wouldn't work, because PowerShell uses hash tables with case-insensitive key lookups and therefore considers 'ñ' and 'Ñ'duplicate keys and complains. However, this is incidental to the problem at hand.
The first one works: 'ñ' is 0xc3b1. The second one does not work: 'Ñ' is 0xc391 which PowerShell won't accept. (The problem seems to be that 0x91 is outside the range of an acceptable powershell char.)
A simpler example of the problem is:
$c = [convert]::toChar(0x91)
which results in $c getting a value of 0x3f instead of 0x91. So what can I do to get 'Ñ'='N' into the
hash table, or a char with a value of 0x91? I've already spent hours reading web pages and experimenting.

Note: By default, PowerShell hashtables, due to using case-insensitive lookups, do not support keys that are mere case variations of another; therefore, ñ and Ñ - the former being the lowercase version of the latter - cannot both be used as keys - see bottom section.
In memory, all PowerShell strings are UTF-16 .NET strings, which are capable of representing all Unicode characters, so using character such as Ñ as keys in hash tables is not a problem.
The problem you describe only arises when PowerShell misinterprets source code read from a file, due to assuming the wrong character encoding.
Your symptom suggests that your source code is UTF-8-encoded, but the file doesn't have a BOM, which causes Windows PowerShell (but, fortunately, no longer PowerShell [Core] v6+) to misinterpret the file as encoded based on the system's active legacy ANSI code page (e.g., Windows-1252 on US-English systems), a single-byte encoding.
Make sure that your source-code file is saved as UTF-8 with a BOM[1], and your problem will go away.
What you think are Unicode code points, 0xc3b1 and 0xc391, are in reality the 2-byte UTF-8 encodings (0xc3 0xb1 and 0xc3 91) of the true code points corresponding to ñ and Ñ: 0xf1 and 0xd1
As for:
[convert]::toChar(0x91)
seemingly not producing a [char] instance with the given code point, 0x91 (decimal 145):
It does, namely in memory, which you can easily verify:
[int] [convert]::toChar(0x91) # -> 145 (0x91)
You'll only get 0x3f - which is a literal ? character (try [char] 0x3f) - if you mistakenly save the in-memory representation with ASCII encoding: since 0x91 is outside the ASCII sub-range of Unicode (which goes from 0x00 to 0x7f), it cannot be represented in the output file, and the substitute character ? is used.
Note that PowerShell's hash tables are case-insensitive, so you cannot have keys that are merely case variations of one another:
# !! FAILS
PS> #{ Ñ = 'LATIN CAPITAL LETTER N WITH TILDE'; ñ = 'LATIN SMALL LETTER N WITH TILDE' }
... Duplicate keys 'ñ' are not allowed in hash literals.
You must use the .NET [hashtable] type (System.Collections.Hashtable) directly to create case-sensitive hash tables:
# Create case-SENSITIVE hash table:
$ht = [hashtable]::new()
$ht['ñ'] = 'LATIN SMALL LETTER N WITH TILDE'
$ht['Ñ'] = 'LATIN CAPITAL LETTER N WITH TILDE'
$ht now has 2 entries and $ht['ñ'] and $ht['Ñ'] retrieve the values case-sensitively.
By contrast, if you had used $ht = #{}, i.e. initialized the hash table as a regular, case-insensitive hash table, you'd only get 1 entry with value 'LATIN CAPITAL LETTER N WITH TILDE', because the 2nd assignment, $ht['Ñ'] =, simply updated the case-insensitively looked-up key created by the 1st statement.
[1] Alternatively, use a UTF-16 encoding, which invariably uses a BOM; the UTF-16LE form is (erroneously) referred to as Unicode in PowerShell.

PostgreSQL invalid byte sequence for encoding utf8 0xbf

I am importing a CSV file which is related to the properties. It has /n between the values. While trying to import it into a table, the following error shows up:
PostgreSQL invalid byte sequence for encoding utf8 0xbf
I tried by simply importing the single column only, but it is not working.
Column values will look like this:
"Job No 305385917-001: To attached Garage (Single remain).\n10305 - 132 STREET NW
Plan 23AF Blk 84 Lot 14\n2002995 LERTA LTD O/A LIR HOMES DONTON\nHENORA"
I want to import the above whole into a single column.
COPY edmonton.general_filtered (descriptive)
FROM 'D:/property_own/descriptive_details.csv'
DELIMITER ',' CSV HEADER;

Your COPY statement is correct, but your data are not in UTF8 encoding.
They are probably in Latin-1 or Windows-1252, where 0xBF is ¿.
Specify the encoding correctly, e.g.:
COPY edmonton.general_filtered (descriptive)
FROM 'D:/property_own/descriptive_details.csv'
(FORMAT 'csv', HEADER, ENCODING 'WIN1252');

Postgres - decode special characters

I have words like this encoded: "cizaña", the encoded result is 63697A61F161
When I try to convert to 'cizaña' again
select decode('63697A61F161'::text, 'hex')
I get:
"ciza\361a"
What can I do? I tried to set set client_encoding to 'UTF-8'; without luck

the encoded result is 63697A61F161
"encoded" how? I think you care confusing text encoding with representation format of binary data.
63697A61F161 is the iso-8859-1 ("latin-1") encoding of the text "cizaña" with the binary represetned as hex octets.
decode('63697A61F161', 'hex') produces the bytea value '\x63697A61F161' if bytea_encoding is hex or 'ciza\361a' if bytea_encoding is escape. Either way, it's a representation of a binary string, not text.
If you want text, you must decode the text encoding into the current database text encoding with convert_from, e.g.
test=> select convert_from(decode('63697A61F161', 'hex'), 'iso-8859-1');
convert_from
--------------
cizaña
(1 row)
This should help explain:
demo=> select convert_from(BYTEA 'ciza\361a', 'iso-8859-1');
convert_from
--------------
cizaña
(1 row)
See? 'ciza\361a' is an octal-escape representation of the binary data for the iso-8859-1 encoding of the text 'cizaña'. It's the exact same value as the bytea hex-format value '\x63697A61F161':
demo=> select convert_from(BYTEA '\x63697A61F161', 'iso-8859-1');
convert_from
--------------
cizaña
(1 row)
So:
decode and encode transform text-string representations of binary data to and from bytea literals, Postgres binary objects. Which are output in a text form for display.h The encoding/decoding here is one of binary representation e.g. hex or base64.
convert_from and convert_to take binary data and apply text encoding processing to convert it to or from the local native database text encoding, producing text strings. The encoding here is text encoding.
It's ... not easy to follow at first. You might have to learn more about text encodings.

Trouble understanding C# URL decode with Unicode character(s) in PowerShell

I'm currently working on something that requires me to pass a Base64 string to a PowerShell script. But while decoding the string back to the original I'm getting some unexpected results as I need to use UTF-7 during decoding and I don't understand why. Would someone know why?
The Mozilla documentation would suggest that it's insufficient to use Base64 if you have Unicode characters in your string. Thus you need to use a workaround that consists of using encodeURIComponent and a replace. I don't really get why the replace is needed and shortened it to btoa(escape('✓ à la mode')) to encode the string. The result of that operation would be JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl.
Using PowerShell to decode the string back to the original, I need to first undo the Base64 encoding. In order to do System.Convert can be used (which results in a byte array) and its output can be converted to a UTF-8 string using System.Text.Encoding. Together this would look like the following:
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
What's left to do is URL decode the whole thing. As it is a UTF-8 string I'd expect only to need to run the URL decode without any further parameters. But if you do that you end up with a accented a that looks like � in a file or ? on the console. To get the actual original string it's necessary to tell the URL decode to use UTF-7 as the character set. It's nice that this works but I don't really get why it's necessary since the string should be UTF-8 and UTF-8 certainly supports an accented a. See the last two lines of the entire script for what I mean. With those two lines you will end up with one line that has the garbled text and one which has the original text in the same file encoded as UTF-8
Entire PowerShell script:
Add-Type -AssemblyName System.Web
$inputstring = "JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl"
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
[System.Web.HttpUtility]::UrlDecode($utf8string) | Out-File -Encoding utf8 C:\temp\output.txt
[System.Web.HttpUtility]::UrlDecode($utf8string, [System.Text.UnicodeEncoding]::UTF7) | Out-File -Append -Encoding utf8 C:\temp\output.txt
Clarification:
The problem isn't the conversion of the Base64 to UTF-8. The problem is some inconsistent behavior of the UrlDecode of C#. If you run escape('✓ à la mode') in your browser you will end up with the following string %u2713%20%E0%20la%20mode. So we have a Unicode representation of the check mark and a HTML entity for the á. If we use this directly in UrlDecode we end up with the same error. My current assumption would be that it's an issue with the encoding of the PowerShell window and pasting characters into it.

Turns out it actually isn't all that strange. It's just for what I want to do it's advantages to use a newer function. I'm still not sure why it works if you use the UTF-7 encoding. But anyways, as an explanation:
... The hexadecimal form for characters, whose code unit value is 0xFF or less, is a two-digit escape sequence: %xx. For characters with a greater code unit, the four-digit format %uxxxx is used.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/escape
As TesselatingHecksler pointed out What is the proper way to URL encode Unicode characters? would indicate that the %u format wasn't formerly standardized. A newer version to escape characters exists though, which is encodeURIComponent.
The encodeURIComponent() function encodes a Uniform Resource Identifier (URI) component by replacing each instance of certain characters by one, two, three, or four escape sequences representing the UTF-8 encoding of the character (will only be four escape sequences for characters composed of two "surrogate" characters).
The output of this function actually works with the C# implementation of UrlDecode without supplying an additional encoding of UTF-7.
The original linked Mozilla article about a Base64 encode for an UTF-8 strings modifies the whole process in a way to allows you to just call the Base64 decode function in order to get the whole string. This is realized by converting the URL encode version of the string to bytes.

Fetching unicode data from PostgreSQL Erlang

I'm trying to fetch data from PostgreSQL with Erlang.
Here's my code that gets data from DB. However i have cyrrilic data in 'status' column. This cyrrilic data is not being fetched correctly.
I tried using UserInfo = io_lib:format("~tp ~n",[UserInfoQuery]), however this doesn't seem to work, because it crashes the app.
UserInfoQuery = odbc_queries:get_user_info(LServer,LUser),
UserInfo = io_lib:format("~p",[UserInfoQuery]),
?DEBUG("UserInfo: ~p",[UserInfo]),
StringForUserInfo = lists:flatten(UserInfo),
get_user_info(LServer, Id) ->
ejabberd_odbc:sql_query(
LServer,
[<<"select * from users "
"where email_hash='">>, Id, "';"]).
Here's the data that is fetched from DB
{selected,[<<"username">>,<<"password">>,<<"created_at">>,
<<"id">>,<<"email_hash">>,<<"status">>],
[{<<"admin">>,<<"admin">>,<<"2014-05-13 12:40:30.757433">>,
<<"1">>,<<"adminhash">>,
<<209,139,209,132,208,178,208,176,209,139,209,132,208,
178,208,176>>}]}
Question:
How can i extract data from column? For example only data from
'status' column?
How can i extract data in unicode from DB? Should i fetch the data from DB then use
io_lib:format("~tp~n") on it? Is there any better way to do it?
Additional question: is there any way to get string in human readable format, so that StringForUserInfo = 'ыфваыфва' from RowUnicode?
I tried this:
{selected, _, [Row]} = UserInfoQuery,
RowUnicode = io_lib:format("~tp~n", [Row]),
?DEBUG("RowUnicode: ~p",[RowUnicode]),
StringForUserInfo = lists:flatten(RowUnicode),
Error:
bad argument in call to erlang:iolist_size([123,60,60,34,97,100,109,105,110,34,
62,62,44,60,60,34,97,100,109,105,110,34,62,62,44,60,60,34,50,...])

Erlang ODBC driver perfectly fetched the status column from your database. Indeed, PostgreSQL encodes your data is UTF-8, and the value you get is UTF-8 encoded.
Status = <<209,139,209,132,208,178,208,176,209,139,209,132,208,178,208,176>>.
This is a binary representing the string ыфваыфва in UTF-8.
You can directly use UTF-8 encoded binaries in your code. If you want to use unicode character points instead of UTF-8 bytes, you can convert this to a list of integers (a string in Erlang parlance). Just use unicode:characters_to_list/1, which in your case will yield list [1099,1092,1074,1072,1099,1092,1074,1072]. This is a list representation of the same string. Unicode character 1099 (16#044B in hex) is ы (CYRILLIC SMALL LETTER YERU, cf Cyrillic excerpt unicode chart).
Erlang can handle unicode texts in the two representations above: lists of unicode characters as integers and binaries of UTF-8 encoded characters.
Let's examine a smaller example, string "ы". This string is composed of unicode character 044B CYRILLIC SMALL LETTER YERU, and it can be encoded as a binary as <<209,139>> or as a list as [16#044B] (= [1099]).
Historically, lists of integers as well as binaries were Latin-1 (ISO-8859-1) encoded. Unicode and ISO-8859-1 have the same values from 0 to 255, but UTF-8 transformation only matches ISO-8859-1 for characters in the 0-127 range. For this reason, Erlang's ~s format argument has a unicode translation modifier, ~ts. The following line will not work as expected:
io:format("~s", [<<209,139>>]).
It will output two characters, 00D1 (LATIN CAPITAL LETTER N WITH TILDE) and 008B (PARTIAL LINE FORWARD). This is because <<209,139>> is interpreted as a Latin-1 string and not as a UTF-8 encoded string.
The following line will fail:
io:format("~s", [[1099]]).
This is because [1099] is not a valid Latin-1 string.
Instead, you should write:
io:format("~ts", [<<209,139>>]),
io:format("~ts", [[1099]]).
Erlang's ~p format argument also has a unicode translation modifier, ~tp. However, ~tp will not do what you are looking for alone. Whether you use ~p or ~tp, by default, io_lib:format/2 will format the Status UTF-8 encoded binary above as:
<<209,139,209,132,208,178,208,176,209,139,209,132,208,178,208,176>>
Indeed, t modifier only means the argument shall accept unicode input. If you do use ~p, when formatting a string or a binary, Erlang will determine whether this could be represented as a Latin-1 string since input may be Latin-1 encoded. This heuristic allows Erlang to properly distinguish lists of integers and strings, most of the time. To see the heuristic at work, you can try something like:
io:format("~p\n~p\n", [[69,114,108,97,110,103], [1,2,3,4,5,6]]).
The heuristic detects that [69,114,108,97,110,103] actually is "Erlang", while [1,2,3,4,5,6] is just, well, a list of integers.
If you do use ~tp, Erlang will expect strings or binaries to be unicode-encoded, and then apply the default identification heuristic. And default heuristic happens to currently (R17) be latin-1 as well. Since your string cannot be represented with Latin-1, Erlang will display it as a list of integers. Fortunately, you can switch to Unicode heuristics by passing +pc unicode to Erlang on command line, and this will produce what you are looking for.
$ erl +pc unicode
So a solution to your problem is to pass +pc unicode and to use ~tp.

I don't understand why io:format("~tp") doesn't work, but you can extract the row and column you need and print that with io:format("~ts"):
> {selected, _, [Row]} = UserInfoQuery.
> io:format("~ts~n", [element(6, Row)]).
ыфваыфва
ok

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

DB2 UTF-8 Data Storage - Extraneous Byte Values - unicode

This is not really an answer, just to demonstrate the correct behaviour: > db2 "insert into t1 values ('Élan')" DB20000I The SQL command completed successfully. > db2 select "hex (f1), f1 from t1" 1 F1 ---------- ----- C3896C616E Élan 1 record(s) selected.

Related

Powershell, can't enter hashtable keys with some non-ASCII characters (in a script)

PostgreSQL invalid byte sequence for encoding utf8 0xbf

Postgres - decode special characters

Trouble understanding C# URL decode with Unicode character(s) in PowerShell

Fetching unicode data from PostgreSQL Erlang

Categories

Resources