Postgres upper function on Turkish characters does not return expected result - postgresql

It looks like the Postgres upper/lower functions do not handle certain characters in the Turkish character set.
select upper('Aaı'), lower('Aaİ') from mytable;
returns:
AAı, aaİ
instead of:
AAI, aai
Note that plain English characters are converted correctly, but not the Turkish I (upper- or lowercase).
Postgres version: 9.2, 32-bit
Database encoding (same result in any of these): UTF-8, WIN1254, C
Client encoding: UTF-8, WIN1254, C
OS: Windows 7 Enterprise Edition, 64-bit
The SQL functions lower and upper return ı and İ unchanged. On a UTF-8 encoded database the returned bytes are
\xc4b1
\xc4b0
and the following on a WIN1254 (Turkish) encoded database:
\xfd
\xdd
I hope my investigation is wrong, and there is something I missed.

Your problem is 100% Windows. (Or rather Microsoft Visual Studio, which PostgreSQL was built with, to be more precise.)
For the record, SQL UPPER ends up calling Windows' LCMapStringW (via towupper via str_toupper) with almost all the right parameters (locale 1055 Turkish for a UTF-8-encoded, Turkish_Turkey database), but the Visual Studio Runtime (towupper) does not set the LCMAP_LINGUISTIC_CASING bit in LCMapStringW's dwMapFlags. (I can confirm that setting it does the trick.) This is not considered a bug at Microsoft; it is by design, and will probably not ever be "fixed" (oh, the joys of legacy.)
You have three ways out of this:
implement #Sorrow's wrapper solution (or write your own native replacement function as a DLL);
run your PostgreSQL instance on e.g. Ubuntu, which exhibits the right behaviour for Turkic locales (#Sorrow confirmed that it works for him); this is probably the simplest and cleanest way out;
drop a patched 32-bit MSVCR100.DLL into your PostgreSQL bin directory (although UPPER and LOWER would then work, other things such as collation may continue to fail -- again, at the Windows level. YMMV.)
For completeness (and nostalgic fun) ONLY, here is the procedure to patch a Windows system. But remember: unless you'll be managing this PostgreSQL instance from cradle to grave, you may cause a lot of grief to your successor(s); whenever deploying a new test or backup system from scratch, you or your successor(s) will have to remember to apply the patch again -- and if you one day upgrade to, say, PostgreSQL 10, which might use MSVCR120.DLL instead of MSVCR100.DLL, then you'll have to try your luck with patching the new DLL, too. On a test system:
use HxD to open C:\WINDOWS\SYSTEM32\MSVCR100.DLL
save the DLL right away with the same name under your PostgreSQL bin directory (do not attempt to copy the file using Explorer or the command line; they might copy the 64-bit version)
with the file still open in HxD, go to Search > Replace, pick Datatype: Hexvalues, then
search for...... 4E 14 33 DB 3B CB 0F 84 41 12 00 00 B8 00 01 00 00
replace with... 4E 14 33 DB 3B CB 0F 84 41 12 00 00 B8 00 01 00 01
...then once more...
search for...... FC 51 6A 01 8D 4D 08 51 68 00 02 00 00 50 E8 E2
replace with... FC 51 6A 01 8D 4D 08 51 68 00 02 00 01 50 E8 E2
...and re-save under the PostgreSQL bin directory, then restart PostgreSQL and re-run your query.
if your query still does not work (make sure your database is UTF-8 encoded with Turkish_Turkey for both LC_CTYPE and LC_COLLATE) open postgres.exe in 32-bit Dependency Walker and make sure it indicates it loads MSVCR100.DLL from the PostgreSQL bin directory.
if all works well, copy the patched DLL to the production PostgreSQL bin directory and restart (a scripted equivalent of these byte replacements is sketched at the end of this answer).
BUT REMEMBER, the moment you move the data off the Ubuntu system or off the patched Windows system to an unpatched Windows system you will have the problem again, and you may be unable to import this data back on Ubuntu if the Windows instance introduced duplicates in a citext field or in a UPPER/LOWER-based function index.
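For reference, the manual HxD steps can also be scripted. Below is a minimal Python sketch of the same two byte replacements, under the assumption that on 64-bit Windows the 32-bit MSVCR100.DLL lives under SysWOW64; the PostgreSQL path is hypothetical, and all the caveats above still apply.
# Apply the two byte-level replacements described above to a copy of the
# 32-bit MSVCR100.DLL. Patch a copy only; never overwrite the system DLL.
PATCHES = [
    (bytes.fromhex('4E 14 33 DB 3B CB 0F 84 41 12 00 00 B8 00 01 00 00'),
     bytes.fromhex('4E 14 33 DB 3B CB 0F 84 41 12 00 00 B8 00 01 00 01')),
    (bytes.fromhex('FC 51 6A 01 8D 4D 08 51 68 00 02 00 00 50 E8 E2'),
     bytes.fromhex('FC 51 6A 01 8D 4D 08 51 68 00 02 00 01 50 E8 E2')),
]

with open(r'C:\Windows\SysWOW64\msvcr100.dll', 'rb') as f:   # 32-bit DLL on 64-bit Windows
    dll = f.read()

for old, new in PATCHES:
    assert dll.count(old) == 1, 'pattern not found exactly once; do not patch'
    dll = dll.replace(old, new)

# Hypothetical install path; adjust to your PostgreSQL bin directory.
with open(r'C:\Program Files (x86)\PostgreSQL\9.2\bin\msvcr100.dll', 'wb') as f:
    f.write(dll)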

It seems to me that your problem is related to Windows. This is how it looks on Ubuntu (Postgres 8.4.14), database encoding UTF-8:
test=# select upper('Aaı'), lower('Aaİ');
upper | lower
-------+-------
AAI | aai
(1 row)
My recommendation would be - if you have to use Windows - to write a stored procedure that does the conversion for you. Use the built-in replace: replace('abcdefabcdef', 'cd', 'XX') returns abXXefabXXef. There might be a more optimal solution; I do not claim that this approach is the correct one.

This is indeed a bug in PostgreSQL (still not fixed, even in the current git tree).
Proof: https://github.com/postgres/postgres/blob/master/src/port/pgstrcasecmp.c
The PostgreSQL developers even mention those Turkish characters specifically there:
SQL99 specifies Unicode-aware case normalization, which we don't yet
have the infrastructure for. Instead we use tolower() to provide a
locale-aware translation.
However, there are some locales where this is not right either (eg, Turkish may do strange things with 'i' and 'I').
Our current compromise is to use tolower() for characters with
the high bit set, and use an ASCII-only downcasing for 7-bit
characters.
pg_toupper() implemented in this file is extremely simplistic (as is its companion pg_tolower()):
unsigned char
pg_toupper(unsigned char ch)
{
if (ch >= 'a' && ch <= 'z')
ch += 'A' - 'a';
else if (IS_HIGHBIT_SET(ch) && islower(ch))
ch = toupper(ch);
return ch;
}
As you can see, this code does not treat its parameter as a Unicode code point, and cannot possibly work 100% correctly unless the currently selected locale happens to be the one we care about (like a Turkish non-Unicode locale) and the OS-provided non-Unicode toupper() works correctly.
This is really sad, I just hope that this will be solved in upcoming PostgreSQL releases...

The source of the problem is explained above. It seems the problem only occurs with the conversion of 'I' to 'ı' and 'i' to 'İ'. As a workaround, just replace those characters directly, as below, before calling the lower or upper functions:
SELECT lower(replace('IİĞ', 'I', 'ı')) -> ıiğ
SELECT upper(replace('ıiğ', 'i', 'İ')) -> IİĞ
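If you would rather handle this on the client side, here is a minimal Python sketch of the same idea (my illustration, not from the answers above): pre-map the two problem characters, then let the generic case conversion handle the rest.
# Only 'I'/'ı' and 'İ'/'i' need special handling for Turkish; everything else cases normally.
def tr_lower(s: str) -> str:
    return s.replace('İ', 'i').replace('I', 'ı').lower()

def tr_upper(s: str) -> str:
    return s.replace('i', 'İ').replace('ı', 'I').upper()

print(tr_upper('Aaı'), tr_lower('Aaİ'))   # AAI aai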

Related

Does PowerShell try to figure out a script's encoding?

When I execute the following simple script in PowerShell 7.1, I get the (correct) value of 3, regardless of whether the script's encoding is Latin1 or UTF8.
'Bär'.length
This surprises me because I was under the (apparently wrong) impression that the default encoding in PowerShell 5.1 is UTF16-LE and in PowerShell 7.1 UTF-8.
Because both versions evaluate the expression to 3, I am forced to conclude that PowerShell 7.1 applies some heuristic method to infer a script's encoding when executing it.
Is my conclusion correct and is this documented somewhere?
I was under the (apparently wrong) impression that the default encoding in PowerShell 5.1 is UTF16-LE and in PowerShell 7.1 UTF-8.
There are two distinct default character encodings to consider:
The default output encoding used by various cmdlets (Out-File, Set-Content) and the redirection operators (>, >>) when writing a file.
This encoding varies wildly across cmdlets in Windows PowerShell (PowerShell versions up to 5.1) but now - fortunately - consistently defaults to BOM-less UTF-8 in PowerShell [Core] v6+ - see this answer for more information.
Note: This encoding is always unrelated to the encoding of a file that data may have been read from originally, because PowerShell does not preserve this information and never passes text as raw bytes through - text is always converted to .NET ([string], System.String) instances by PowerShell before the data is processed further.
The default input encoding, when reading a file - both source code read by the engine and files read by Get-Content, for instance, which applies only to files without a BOM (because files with BOMs are always properly recognized).
In the absence of a BOM:
Windows PowerShell assumes the system's active ANSI code page, such as Windows-1252 on US-English systems. Note that this means that systems with different active system locales (settings for non-Unicode applications) can interpret a given file differently.
PowerShell [Core] v6+ more sensibly assumes UTF-8, which is capable of representing all Unicode characters and whose interpretation doesn't depend on system settings.
Note that these are fixed, deterministic assumptions - no heuristic is employed.
The upshot is that for cross-edition source code the best encoding to use is UTF-8 with BOM, which both editions recognize properly.
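As a small illustration (my addition, not part of the original answer), writing a script as UTF-8 with BOM is straightforward; in Python, for example, the 'utf-8-sig' codec prepends the EF BB BF signature that both editions recognize.
# Write a .ps1 file as UTF-8 with BOM so PowerShell 5.1 and 7+ decode it identically.
with open('Bar.ps1', 'w', encoding='utf-8-sig') as f:
    f.write("'Bär'.Length\n")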
As for a source-code file containing 'Bär'.length:
If the source-code file's encoding is properly recognized, the result is always 3, given that a .NET string instance ([string], System.String) is constructed, which in memory is always composed of UTF-16 code units ([char], System.Char), and given that .Length counts the number of these code units.[1]
Leaving broken files out of the picture (such as a UTF-16 file without a BOM, or a file with a BOM that doesn't match the actual encoding):
The only scenario in which .Length does not return 3 is:
In Windows PowerShell, if the file was saved as a UTF-8 file without a BOM.
Since ANSI code pages use a fixed-width single-byte encoding, each byte that is part of a UTF-8 byte sequence is individually (mis-)interpreted as a character, and since ä (LATIN SMALL LETTER A WITH DIAERESIS, U+00E4) is encoded as 2 bytes in UTF-8, 0xc3 and 0xa4, the resulting string has 4 characters.
Thus, the string renders as BÃ¤r.
By contrast, in PowerShell [Core] v6+, a BOM-less file that was saved based on the active ANSI (or OEM) code page (e.g., with Set-Content in Windows PowerShell) causes all non-ASCII characters (in the 8-bit range) to be considered invalid characters - because they cannot be interpreted as UTF-8.
All such invalid characters are simply replaced with � (REPLACEMENT CHARACTER, U+FFFD) - in other words: information is lost.
Thus, the string renders as B�r - and its .Length is still 3.
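To make the two scenarios concrete, here is a small Python illustration (my addition) of decoding the same 'Bär'.Length source bytes under the two mismatched assumptions described above.
utf8_bytes = "'Bär'.Length".encode('utf-8')          # contents of a BOM-less UTF-8 file
ansi_bytes = "'Bär'.Length".encode('windows-1252')   # contents of a BOM-less ANSI file

# Windows PowerShell misreads the UTF-8 bytes as ANSI: ä (c3 a4) becomes two characters.
print(utf8_bytes.decode('windows-1252'))             # 'BÃ¤r'.Length  -> .Length would be 4

# PowerShell 7+ misreads the ANSI bytes as UTF-8: e4 is invalid and becomes U+FFFD.
print(ansi_bytes.decode('utf-8', errors='replace'))  # 'B�r'.Length   -> .Length is still 3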
[1] A single UTF-16 code unit is capable of directly encoding all 65K characters in the so-called BMP (Basic Multi-Lingual Plane) of Unicode, but for characters outside this plane pairs of code units encode a single Unicode character. The upshot: .Length doesn't always return the count of characters, notably not with emoji; e.g., '👋'.length is 2
The encoding is unrelated to this case: you are calling string.Length, which is documented to return the number of UTF-16 code units. This roughly correlates to letters (when you ignore combining characters and high code points like emoji).
Encoding only comes into play when converting implicitly or explicitly to/from a byte array, file, or p/invoke. It doesn’t affect how .Net stores the data backing a string.
Speaking to the encoding for PS1 files, that is dependent upon version. Older versions have a fallback encoding of Encoding.ASCII, but will respect a BOM for UTF-16 or UTF-8. Newer versions use UTF-8 as the fallback.
In at least 5.1.19041.1, loading the file 'Bär'.Length (27 42 C3 A4 72 27 2E 4C 65 6E 67 74 68) and running it with . .\Bar.ps1 will result in 4 printing.
If the same file is saved as Windows-1252 (27 42 E4 72 27 2E 4C 65 6E 67 74 68), then it will print 3.
tl;dr: string.Length always returns the number of UTF-16 code units. PS1 files should be UTF-8 with BOM for cross-version compatibility.
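As an aside (my illustration, not from the answer), the UTF-16 code-unit count that string.Length reports can be reproduced in Python, whose len() otherwise counts code points.
def utf16_length(s: str) -> int:
    # Mirror .NET string.Length: count UTF-16 code units rather than code points.
    return len(s.encode('utf-16-le')) // 2

print(utf16_length('Bär'), len('Bär'))   # 3 3
print(utf16_length('👋'), len('👋'))     # 2 1  (surrogate pair vs. one code point)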
I think without a BOM, PS 5 assumes ANSI (Windows-1252 here), while PS 7 assumes UTF-8 without BOM. This file, saved as ANSI in Notepad, works in PS 5 but not perfectly in PS 7, just like a UTF-8 no-BOM file with special characters wouldn't work perfectly in PS 5. A UTF-16 .ps1 file would always have a BOM or encoding signature. A PowerShell string in memory is always UTF-16, but a character is considered to have a length of 1, except for emojis. If you have Emacs, esc-x hexl-mode is a nice way to look at it.
'¿Cómo estás?'
format-hex file.ps1
Label: C:\Users\js\foo\file.ps1
Offset Bytes Ascii
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
------ ----------------------------------------------- -----
0000000000000000 27 BF 43 F3 6D 6F 20 65 73 74 E1 73 3F 27 0D 0A '¿Cómo estás?'��

What is the use of an MBR hex dump and what kind of things can be done using it?

I took a byte-level dump of the MBR sector on my Ubuntu OS using the following command.
dc3dd if=/dev/sda of=x cnt=1 ssz=512 hash=sha256 mlog=hashes
And I converted it to a hex dump using the following command.
hexdump x > hex_x
I received output like this.
I need some expert help to analyze this hex dump. I need to know: what are the benefits of getting an MBR hex dump, and what kind of things can be done using it? (E.g., can I tell information such as my system OS by analyzing this?)
I also need to know: are there any commands or tools for deeper analysis, and to convert this hex dump into a human-readable form?
Q. What are the benefits of getting an MBR hex dump and what kind of things can be done using it?
A. Microsoft says:
The MBR contains a small amount of executable code called the master boot code, the disk signature, and the partition table for the disk.
The master boot code and disk signature aren't very useful to an investigator. However, the partition table gives a lot of information, and it can be used to extract information in scenarios where the OS is corrupted or not booting; the MBR can then be used to investigate the disk drive and operating system.
Sample Partition Table Record: (taken from an MBR, using HEX editor)
80 20 21 00 07 7E 25 19 00 08 00 00 00 38 06 00
Each hexadecimal value has some specific meaning, for instance:
80 => Boot indicator, Active (bootable) partition
20 21 00 => Partition’s starting sector, Cylinder-Head-Sector (CHS)
07 => File System, NTFS
7E 25 19 => Partition’s ending sector, CHS
00 08 00 00 => Starting sector
00 38 06 00 => Size of the partition, 199 MiB
You can read about them in detail in Table 1.2, Partition Table Fields, on the official site.
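To make the field layout concrete, here is a small Python sketch (my addition) that decodes the sample partition table record quoted above.
# Decode one 16-byte MBR partition table entry.
entry = bytes.fromhex('80 20 21 00 07 7E 25 19 00 08 00 00 00 38 06 00')

boot_flag = entry[0]                                  # 0x80 = active (bootable) partition
chs_start = entry[1:4]                                # CHS address of the first sector
part_type = entry[4]                                  # 0x07 = NTFS
chs_end = entry[5:8]                                  # CHS address of the last sector
lba_start = int.from_bytes(entry[8:12], 'little')     # starting sector (LBA)
num_sectors = int.from_bytes(entry[12:16], 'little')  # partition size in sectors

print(hex(boot_flag), hex(part_type), lba_start, num_sectors,
      f'{num_sectors * 512 // (1024 * 1024)} MiB')    # 0x80 0x7 2048 407552 199 MiB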
Q. Are there any commands or tools for deeper analysis, and to convert this hex dump into a human-readable form?
A. You can use any hex editor, like Hex Editor Neo or Active Disk Editor. These editors will help you understand the MBR, but to my knowledge there is no magic tool available to convert a hex dump into a human-readable format.
PS: The question is pretty old; I wasn't available earlier, so please accept a late answer... :)

Why is TeraTerm not putting out the same bytes that are in the send call?

Using TeraTerm and a serial port adapter, I ran a macro with this line in it:
send $55 $0B $00 $00 $00 $BB $42 $AA
The $BB was sent out as two different bytes instead of just one. I forget which ones they were specifically, but the result on the O-scope looked like this:
55 0B 00 00 00 C8 E9 42 AA
Does anyone know why this is?
I looked in the Manual and verified that Send8Ctrl is set to off and so is the Debug option.
I did the research, tested, and verified that the answer to the question is that TeraTerm was using UTF-8 instead of English under the general settings menu. There are two options present that may be confusing: 'English' and 'Default'.
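For illustration only (my assumption about the mechanism, not stated in the answer): when the terminal treats a byte above 0x7F as text and re-encodes it as UTF-8, a single macro byte goes out as two bytes. A minimal Python sketch, assuming a Latin-1 reading of $BB:
raw = bytes([0xBB])
as_char = raw.decode('latin-1')           # '»' (U+00BB)
print(as_char.encode('utf-8').hex(' '))   # c2 bb -- one byte in, two bytes out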

Convert Unicode code point to UTF-8 sequence

I am not sure I've got my nomenclature right, so please correct me :)
I've received a text file representing a Pāli dictionary: a list of words separated by newline \n (0x0a) characters. Supposedly, some of the special letters are encoded using UTF-8, but I doubt that.
Loading this text file into any of my editors (vim, Notepad, TextEdit, ..) shows quite scrambled text, for example
mhiti
A closer look at the actual bytes then reveal the following (using hexdump -C)
0a 0a 1e 6d 68 69 74 69 0a 0a ...mhiti..
which seems to me to be the Unicode code point U+1E6D ("ṭ" or LATIN SMALL LETTER T WITH DOT BELOW). That particular letter has the UTF-8 encoding e1 b9 ad.
My question: is there a tool which helps me convert this particular file into actual UTF-8 encoding? I tried iconv but without success; I looked briefly into a Python script but would think there's an easier way to get this done. It seems that this is a useful link for this problem, but isn't there a tool that can get this done? Am I missing something?
EDIT: Just to make things a little bit more entertaining, there seem to be actual UTF-8 encoded characters scattered throughout as well. For example, the word "ākiñcaññāyatana" has the following sequence of bytes
01 01 6b 69 c3 b1 63 61 c3 b1 c3 b1 01 01 79 61 74 61 6e 61
ā k i ñ c a ñ ñ ā y a t a n a
where the "ā" is encoded by its Unicode code point U-0101, and the "ñ" is encoded by the UTF-8 sequence \xc3b1 which has Unicode code point U-00F1.
EDIT: Here's one that I can't quite figure out what it's supposed to be:
01 1e 37 01 01 76 61 6b 61
? ā v a k a
I can only guess, but that too doesn't make sense. The Unicode code point U+011e is a "Ğ" (UTF-8 \xc49e) but that's not a Pāli character AFAIK; then a "7" follows which doesn't make sense in a word. Then the Unicode code point U+1E37 is a "ḷ" (UTF-8 \xe1b8b7) which is a valid Pāli character. But that would leave the first byte \x01 by itself. If I had to guess I would think this is the name "Jīvaka" but that would not match the bytes. LATER: According to the author, this is "Āḷāvaka" — so assuming the heuristics of character encoding from above, again a \x00 is missing. Adding it back in
01 00 1e 37 01 01 76 61 6b 61
Ā ḷ ā v a k a
Are there "compressions" that remove \x00 bytes from UTF-16 encoded Unicode files?
I'm assuming in this context that "ṭhiti" makes sense as the contents of that file.
From your description, it looks like that file encodes characters < U+0080 as a single byte and characters > U+0100 as two-byte big-endian. That's not decodable, in general; two linefeeds (U+000A, U+000A) would have the same encoding as GURMUKHI LETTER UU (U+0A0A).
There's no invocation of iconv that'll decode it for you; you'll need to use the heuristics you know, based on character ranges or ordering in the file, to write a custom decoder (or ask for another copy in a standard encoding).
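Such a custom decoder might look like the following Python sketch (my illustration, based on the heuristics described above; it is inherently ambiguous and only a starting point).
# Heuristics: printable ASCII and common whitespace are one byte per character;
# byte pairs that look like 2-byte UTF-8 (e.g. c3 b1 -> ñ) are decoded as UTF-8;
# anything else is treated as a big-endian 2-byte code point (e.g. 1e 6d -> U+1E6D ṭ).
def decode_mixed(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in (0x09, 0x0A, 0x0D) or 0x20 <= b < 0x80:
            out.append(chr(b))
            i += 1
        elif 0xC2 <= b <= 0xDF and i + 1 < len(data) and 0x80 <= data[i + 1] <= 0xBF:
            out.append(data[i:i + 2].decode('utf-8'))
            i += 2
        else:
            out.append(chr(int.from_bytes(data[i:i + 2], 'big')))
            i += 2
    return ''.join(out)

print(decode_mixed(bytes.fromhex('0a 0a 1e 6d 68 69 74 69 0a 0a')))   # ṭhiti (with surrounding newlines)
print(decode_mixed(bytes.fromhex('01 01 6b 69 c3 b1 63 61 c3 b1 c3 b1 01 01 79 61 74 61 6e 61')))   # ākiñcaññāyatana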
I think in the end this was my own fault, somehow. Browsing to this file showed a very mangled and broken version of the original UTF-16 encoded file; the "Save as" menu from the browser then saved that broken file which created the initial question for this thread.
It seems that a web browser tries to display that UTF-16 encoded file, removes non-printable characters like \x00 and converts some others to UTF-8, thus completely mangling the original file.
Using wget to fetch the file fixed the problem, and I could convert it nicely into UTF-8 and use it further.

VerQueryValue and multi codepage Unicode characters

In our application we use the VerQueryValue() API call to fetch version info such as the ProductName. For some applications running on a machine set to Traditional Chinese (code page 950), where the ProductName contains Unicode sequences that span multiple code pages, some characters are not translated properly. For instance, in the sequence below,
51 00 51 00 6F 8F F6 4E A1 7B 06 74
Some characters are returned as 0x003F (a question mark).
In the above sequence, the Unicode character '8F 6F' is not picked up and converted properly by the WinAPI call and is just replaced with '00 3F' - since '8F 6F' is present only in code page 936 (i.e., Simplified Chinese).
The .exe has just one translation table - as '\StringFileInfo\080404B0' - which refers to a language ID of '804' for Traditional Chinese only
How should one handle such cases - where the ProductName refers to Unicode from both 936 and 950 even though the translation table has one entry only ? Is there any other API call to use ?
Also, if I right-click on the exe and view the 'Details' tab, it shows the ProductName correctly! So it appears Microsoft uses a different API call or somehow handles this correctly. I need to know how this is done.
Thanks in advance,
Venkat
It looks somewhat weird to have contents compatible only with one code page in a block marked as another code page. This is the source of your problem.
The best way to handle multi-code-page issues is obviously to turn your app into a Unicode-aware application. There will be no conversion to any code page anymore, which will make everyone happy.
The LANGID (0804) is only an indication of the language of the contents in the block. If a version info has several blocks, you may program your app to look up the block in the language of your user.
When you call VerQueryValue() in an ANSI application, this LANGID is not taken into account when converting the Unicode contents to ANSI: you're ANSI, so Windows assumes you only understand the machine's default ANSI code page.
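For completeness, here is a hedged Python/ctypes sketch of the Unicode route (my illustration; the file path is hypothetical and the 080404B0 sub-block is taken from the question). Calling the W functions end to end hands back the ProductName as UTF-16 with no conversion to the ANSI code page, so no '?' substitutions occur.
import ctypes

version = ctypes.WinDLL('version')
path = r'C:\path\to\your.exe'   # hypothetical path to the .exe in question

size = version.GetFileVersionInfoSizeW(path, None)
data = ctypes.create_string_buffer(size)
version.GetFileVersionInfoW(path, 0, size, data)

value = ctypes.c_void_p()
length = ctypes.c_uint()
# VerQueryValueW returns a pointer to the UTF-16 string stored in the version
# resource; nothing is funneled through the machine's ANSI code page.
if version.VerQueryValueW(data, r'\StringFileInfo\080404B0\ProductName',
                          ctypes.byref(value), ctypes.byref(length)):
    print(ctypes.wstring_at(value.value))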
Note about display in console
Beware of the console! It's an old creature that is not totally Unicode-aware. It is based on code pages. Therefore, you should expect display problems which can't be addressed. Even worse: it uses its own code page (called the OEM code page), which may be different from the usual ANSI code page (although for East Asian languages, OEM code page = ANSI code page).
HTH.