How to convert string to unicode using PostgreSQL? - postgresql

Here I want to convert my string to Unicode. I am using PostgreSQL 9.3. In SQL Server it's much easier:
Example:
sql = N'select * from tabletest'; /* For nvarchar/nchar/ntext */
OR
sql = U'select * from tabletest'; /* For varchar/char/text */
Question: How can I do the above conversion in PostgreSQL?

PostgreSQL databases have a single native character encoding, the "server encoding". It is usually utf-8.
All text is in this encoding. Mixed encoding text is not supported, except if stored as bytea (i.e. as opaque byte sequences).
You can't store "unicode" or "non-unicode" strings, and PostgreSQL has no concept of "varchar" vs "nvarchar". With utf-8, characters that fall in the 7-bit ASCII range (and some others) are stored as a single byte, and wider chars require more storage, so it's just automatic. utf-8 requires more storage than ucs-2 or utf-16 for text that is all "wide" characters, but less for text that's a mixture.
PostgreSQL automatically converts to/from the client's text encoding, using the client_encoding setting. There is no need to convert explicitly.
If your client is "Unicode" (which Microsoft products tend to say when they mean UCS-2 or UTF-16), then most client drivers take care of any utf-8 <--> utf-16 conversion for you.
So you shouldn't need to care, as long as your client does I/O with the correct charset options and sets a client_encoding that matches the data it actually sends on the wire. (This is automatic with most client drivers like PgJDBC, nPgSQL, or the Unicode psqlODBC driver.)
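For example, a minimal psql sketch of what this looks like in practice (tabletest is just the table name from the question; the SET is normally done for you by the driver):
SHOW server_encoding;                    -- usually UTF8
SHOW client_encoding;                    -- whatever the driver negotiated for this session
SET client_encoding TO 'UTF8';           -- only needed if the driver doesn't set it already
SELECT * FROM tabletest;                 -- no N'...' or U'...' prefix; a string literal is just text
SELECT convert_to('some text', 'UTF8');  -- explicit conversion to bytea, rarely needed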
See:
character set support

Related

Is there any NoSQL database that allows for 8-bit single-byte encodings, such as ISO-8859-1?

I'm trying to choose the best NoSQL database service for my data.
Most NoSQL databases out there only support UTF-8, and there's no way to enforce an encoding, unlike relational dbs. And the problem is that UTF-8 uses one byte only for the first 127 characters, but uses two bytes for the next 128, and these are the characters that comprise 80% of my data (don't ask me why I have more of these squiggles than the actual English alphabet, it's a long-winded answer).
I'll have to perform lots of queries with regular expressions on those strings, which look like "àñÿÝtçh" and are made up mostly of characters 128-255 in ISO-8859-1, ISO-8859-15 or Windows-1252. I know that I will never need to store characters outside that range in my database, so it's safe to work with only 256 characters and miss out on the gazillion characters UTF-8 supports. I am also aware that ISO-8859-1 will create lots of compatibility problems with JSON objects and things like that. However, we will be running lots of regex queries, and those can be quite complex, and the extra cost of doubling the bytes just because I have no choice but to use UTF-8 may have a negative impact on performance.
I understand that NoSQL databases tend to be schema-less, and fields are not normally defined with data types and encodings, but NoSQL will suit our project much better than SQL. Cassandra stores strings in 1-byte US-ASCII for the 0-127 lot, and not 1-byte UTF-8. Is there any NoSQL out there that defaults to ISO-8859-1 or ISO-8859-15 for the 0-255 lot?
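To make the size argument concrete, assuming a UTF-8 database, the byte counts for the sample string compare like this (shown in PostgreSQL purely as an illustration; the question itself is about NoSQL stores):
SELECT char_length('àñÿÝtçh')                        AS characters,    -- 7
       octet_length('àñÿÝtçh')                       AS utf8_bytes,    -- 12: each accented letter takes 2 bytes
       octet_length(convert_to('àñÿÝtçh', 'LATIN1')) AS latin1_bytes;  -- 7: one byte per character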

Can't find varchar chart of acceptable characters

Does anyone know of a simple chart or list that would show all acceptable varchar characters? I cannot seem to find this in my googling.
What codepage? Collation? Varchar stores characters assuming a specific codepage. Only the lower 128 characters (the ASCII subset) are standard. Higher characters vary by codepage.
The default codepage used matches the collation of the column, whose defaults are inherited from the table, database, and server. All of the defaults can be overridden.
In short, there IS no "simple chart". You'll have to check the character chart for the specific codepage, e.g. using the "Character Map" utility in Windows.
It's far, far better to use Unicode and nvarchar when storing to the database. If you store text data from the wrong codepage, you can easily end up with mangled and unrecoverable data. The only way to ensure the correct codepage is used is to enforce it all the way from the client (i.e. the desktop app) to the application server, down to the database.
Even if your client/application server uses Unicode, a difference in the locale between the server and the database can result in faulty codepage conversions and mangled data.
On the other hand, when you use Unicode no conversions are needed or made.
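A small SQL Server sketch of the difference (the table name and collation are just examples):
CREATE TABLE demo (
    v varchar(20) COLLATE Latin1_General_CI_AS,  -- stored in the collation's single-byte codepage (1252)
    n nvarchar(20)                               -- stored as UTF-16, no codepage conversion on the way in
);
INSERT INTO demo (v, n) VALUES (N'Ωmega', N'Ωmega');
SELECT v, n FROM demo;  -- v comes back as '?mega' because Ω is not in codepage 1252; n is intact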

PostgreSQL UTF-8 binary collation

I would like to have a collation which orders the UTF-8 encoding of 0x1234 below that of 0x1235, regardless of the character mapping in the Unicode standard. MySQL uses utf8_bin for this. MSSQL apparently has BIN and BIN2 collations (http://msdn.microsoft.com/en-us/library/ms143350.aspx). While finding these was easy, I can't even find a list of the collations PostgreSQL supports, much less an answer to this specific question.
The C locale will do. UTF-8 is designed so that byte ordering is also codepoint ordering. This is not trivial but consider how UTF-8 works:
Number range  Byte 1    Byte 2    Byte 3
0000-007F     0xxxxxxx
0080-07FF     110xxxxx  10xxxxxx
0800-FFFF     1110xxxx  10xxxxxx  10xxxxxx
When sorting binary data (a.k.a. the C locale), the first non-equal byte determines the ordering. What we need to see is that if two code points encoded into UTF-8 differ, then the first non-equal byte is lower for the lower code point. If the code points are in different ranges, then the first byte is indeed lower for the lower one. Within the same range, the order is determined by literally the same bits as without encoding.
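A quick check from SQL, assuming a UTF-8 server encoding:
SELECT encode(convert_to(chr(x'1234'::int), 'UTF8'), 'hex') AS u1234,   -- 'e188b4'
       encode(convert_to(chr(x'1235'::int), 'UTF8'), 'hex') AS u1235;   -- 'e188b5'
-- The byte sequences compare in the same order as the code points themselves.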
Sort order of text depends on lc_collate (not on the system locale!). The system locale only serves as a default when creating the db cluster if you don't provide another locale.
The behaviour you are expecting only works with locale C. Read all about it in the fine manual:
The C and POSIX collations both specify "traditional C" behavior, in which only the ASCII letters "A" through "Z" are treated as letters, and sorting is done strictly by character code byte values.
Emphasis mine. PostgreSQL 9.1 has a couple of new features for collation. Might be exactly what you are looking for.
Postgres uses the collation defined by the system locale on cluster creation.
You might try to ORDER BY encode(column,'hex')
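A minimal sketch of both suggestions (hypothetical table and column names; per-column and per-expression COLLATE need PostgreSQL 9.1+):
SELECT collname FROM pg_collation;       -- lists the collations this database knows about (9.1+)
CREATE TABLE t (s text COLLATE "C");     -- byte-wise ordering for this column
INSERT INTO t VALUES ('a'), ('B'), ('é');
SELECT s FROM t ORDER BY s;              -- 'B', 'a', 'é': ordered by UTF-8 byte values
SELECT s FROM t ORDER BY s COLLATE "C";  -- the same ordering as a per-query override
SELECT s FROM t ORDER BY encode(convert_to(s, 'UTF8'), 'hex');  -- the encode() idea; convert_to() supplies the bytea that encode() expects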

DB2 VARCHAR unicode data storage

We are currently using VARCHAR for storing text data in DB2. However, we are hitting the problem that the length specified for a VARCHAR is not the same as the length of the text, because in DB2 the VARCHAR length is the length of the UTF-8 data in bytes, which varies depending on the stored text. For example, some texts contain characters from different languages, and because of that some texts of 500 characters cannot be saved in a VARCHAR(500), and so on.
Now we are planning to migrate to VARGRAPHIC. I need to know the limitations of using VARGRAPHIC for storing Unicode text data in DB2.
Are there any problems with using VARGRAPHIC?
DB2 doesn't check that the data is in fact a double-byte string, but it assumes it must be. Usually the drivers will do the proper conversions for you, but you might one day bump into some bug. It is unlikely, though.
If you use federated databases, VARGRAPHIC support in queries might fail completely. Overall, the number of bug reports for the VARGRAPHIC data type is somewhat high. Support for it probably isn't as well tested and tried as for other data types.
With a Unicode database (i.e. UTF-8 is a requirement), VARGRAPHIC uses big-endian UCS-2, meaning your space requirements for those columns double. VARGRAPHIC is a DB2-proprietary data type. If you migrate off DB2 some day, you will have to do an extra conversion.
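A minimal sketch of the difference (hypothetical table name): in a Unicode (UTF-8) DB2 database, VARGRAPHIC length is counted in double-byte (UCS-2) characters, while VARCHAR length is counted in bytes of UTF-8:
CREATE TABLE notes (
    body_utf8 VARCHAR(500),      -- up to 500 bytes; multi-byte characters eat into the limit
    body_ucs2 VARGRAPHIC(500)    -- up to 500 UCS-2 code units, i.e. 500 characters for BMP text
);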

Compressing ASCII data to fit within a UTF-32 API?

I have an API that receives Unicode data, but I only need to store ASCII in it. I'd like to compress & obfuscate (or encrypt) the string values that will be persisted in Unicode.
My desire is to either compress this schema data, or to encrypt it from prying eyes. I don't think it's possible to do both well.
Considering that I want to restrict my source data to valid, printable ASCII, how can I "compress" that original string value into a value that is either smaller, obfuscated, or both?
Here is how I imagine this working (though you may have a better way):
This source code will take a given String as input
The byte representation of that string will be taken (UTF-8, ASCII, you decide)
Some magic happens - (this is the part I need your help on)
The resulting bytes will be converted into an int or long (no decimal points)
The number will be converted into a corresponding character using this utility
http://baseanythingconvert.codeplex.com/SourceControl/changeset/view/77855#1558651
(note: that utility will be used to enforce the constraint that the "final" Unicode name must not include the following characters: '/', '\', '#', '?' or '%')
Background
The Microsoft Azure Table has an API that accepts Unicode data for the storage of property names. This is a schema-free database (so columns can be created ad hoc), therefore the schema is stored per row. The downside is that this schema data is stored on disk multiple times, and it is also transmitted over the wire, quite redundantly, in an XML blob.
In addition, I'm working on a utility that dynamically encrypts/decrypts Azure Table Data, but the schema is unencrypted. I'd like to mask or obfuscate this header information somehow.
These are just some ideas.
Isn't step 3 actually straightforward (just compress and/or encrypt the data into different bytes)? For 7-bit ASCII, you can also, before compressing and/or encrypting, store the data by packing the bits so they fit into fewer bytes.
If you can use UTF-32, UTF-8, and so on in step 5, you have access to all the characters in the Unicode Standard, up to 0x10FFFD, with some exceptions; for example, some code points are noncharacters in the Unicode Standard, such as 0xFFFF, and others are invalid characters, such as 0xD800.
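As a rough illustration of the 7-bit packing arithmetic (the packing itself would live in your application code, not in SQL; the input string is just an example):
SELECT length('TableKey42')                 AS ascii_bytes,   -- 10 characters stored as 10 bytes
       ceil(length('TableKey42') * 7 / 8.0) AS packed_bytes;  -- 10 x 7 = 70 bits, which fits in 9 bytes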