If I have fields of NVARCHAR (or NTEXT) data type in a Microsoft SQL Server database, what would be the equivalent data type in a PostgreSQL database?
I'm pretty sure Postgres varchar is the same as Oracle/Sybase/MSSQL nvarchar, even though the manual doesn't say so explicitly:
http://www.postgresql.org/docs/7.4/static/datatype-character.html
Encoding conversion functions are here:
http://www.postgresql.org/docs/current/static/functions-string.html
http://www.postgresql.org/docs/current/static/functions-string.html#CONVERSION-NAMES
Example:
create table nvctest (
    utf8fld varchar(12)
);
-- note: the convert(... using conversion_name) form below comes from older
-- PostgreSQL releases and may not be accepted by current versions;
-- convert_to()/convert_from() are the functions documented today
insert into nvctest
select convert('PostgreSQL' using ascii_to_utf_8);
select * from nvctest;
Also, there is this response to a similar question from a PostgreSQL rep:

All of our TEXT datatypes are multibyte-capable, provided you've installed PostgreSQL correctly. This includes:
TEXT (recommended)
VARCHAR
CHAR
Short answer: There is no PostgreSQL equivalent to SQL Server NVARCHAR.
The types NVARCHAR(N) on different databases are not equivalent.
The standard allows for a wide choice of character collations and encodings/character sets. When dealing with Unicode, PostgreSQL and SQL Server fall into different camps, and no equivalence exists.
These differ w.r.t.
length semantics
representable content
sort order
padding semantics
Thus moving data from one DB system (or encoding/character set) to another can lead to truncation/content loss.
Specifically there is no equivalent between a PostgreSQL (9.1) character type and SQL Server NVARCHAR.
You may migrate the data to a PostgreSQL binary type, but you would then lose text querying capabilities.
(Unless PostgreSQL starts supporting a UTF-16 based unicode character set)
Length semantics
N is interpreted differently (Characters, Bytes, 2*N = Bytes) depending on database and encoding.
Microsoft SQL Server uses UCS-2 encoding, with the NVARCHAR length interpreted as UCS-2 code units, that is byte length = 2 * N ( https://learn.microsoft.com/en-us/sql/t-sql/data-types/nchar-and-nvarchar-transact-sql?view=sql-server-2017 ):
their NVARCHAR(1) can store 1 UCS-2 code unit (2 bytes of UCS-2).
Oracle's UTF encoding has the same semantics (and internal CESU-8 storage).
Postgres 9.1 only has a Unicode UTF-8 character set (https://www.postgresql.org/docs/9.1/multibyte.html), which, like
Oracle (in AL32UTF8 or AL16UTF16 encoding), counts whole Unicode code points. Each code point is potentially up to 4 bytes (see e.g.
http://www.oracletutorial.com/oracle-basics/oracle-nvarchar2/ which explicitly states that an NVARCHAR2(50) column may take up to 200 bytes).
The difference becomes significant when dealing with characters outside the Basic Multilingual Plane: they count as one "char unit" in UTF-8/UTF-32 (Go, char32_t, PostgreSQL), but are represented as surrogate pairs in UTF-16 and count as two units (Java, JavaScript, C#, ABAP, wchar_t on Windows, SQL Server).
For example, U+1F60A SMILING FACE WITH SMILING EYES uses up all the space in a SQL Server NVARCHAR(2),
but only one character unit in PostgreSQL.
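A quick sketch of the PostgreSQL side of this (assuming a UTF-8 database; the SQL Server line is shown only as a comment for comparison):

SELECT char_length(U&'\+01F60A');   -- PostgreSQL: 1 character
-- SQL Server, for comparison: LEN(N'<that emoji>') returns 2,
-- because the character is stored as a UTF-16 surrogate pair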
Classical enterprise-grade DBs offer at least a choice with UTF-16-like semantics (SAP HANA (CESU-8), DB2 with collation, SQL Anywhere (CESU8BIN), ...).
E.g. Oracle also offers what they misleadingly call a UTF-8 character set, which is effectively CESU-8.
It has the same length semantics and representable content as UTF-16 (= Microsoft SQL Server) and is a suitable choice for UTF-16-based enterprise systems (e.g. SAP R/3) or under a Java application server.
Note that some databases may still interpret NVARCHAR(N) as a length-in-bytes limitation, even with a variable-length Unicode encoding (for example SAP IQ).
Unrepresentable content
UTF-16/CESU-8-based systems can represent unpaired surrogate halves, while
UTF-8/UTF-32-based systems cannot.
Such content is unrepresentable in those character sets, but it is a frequent occurrence in UTF-16-based enterprise systems
(e.g. Windows pathnames may contain characters that are not representable in UTF-8, see e.g. https://github.com/rust-lang/rust/issues/12056).
Thus UTF-16 is, in this sense, a "superset" of UTF-8/UTF-32, which is typically a killer criterion when dealing with data from enterprise/OS systems based on this encoding (SAP, Windows, Java, JavaScript). Note that the JSON specification took specific care to be able to represent these characters (https://www.rfc-editor.org/rfc/rfc8259#page-10).
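A sketch of that restriction on the PostgreSQL side (the exact error wording may vary by version):

-- an unpaired surrogate half cannot even be written as a string literal;
-- PostgreSQL rejects it as an invalid surrogate pair
SELECT U&'\D83D';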
(2) and (3) are more relevant when migrating queries, but not for data migration.
Binary sort order:
Note that the binary sort order of CESU-8/UTF-16 differs from that of UTF-8/UTF-32.
UTF-16/CESU-8/Java/JavaScript/ABAP sort order:
U+0041 LATIN CAPITAL LETTER A
U+1F60A SMILING FACE WITH SMILING EYES
U+FB03 LATIN SMALL LIGATURE ffi
UTF-8 / UTF-32 (Go) sort order:
U+0041 LATIN CAPITAL LETTER A
U+FB03 LATIN SMALL LIGATURE ffi
U+1F60A SMILING FACE WITH SMILING EYES
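For illustration, the UTF-8/UTF-32 ordering can be reproduced in PostgreSQL (9.1+) with the "C" collation, which compares UTF-8 byte-wise and hence in code point order (a sketch, assuming a UTF-8 database):

SELECT c
FROM (VALUES ('A'), (U&'\+01F60A'), (U&'\FB03')) AS t(c)
ORDER BY c COLLATE "C";
-- returns A, then U+FB03, then U+1F60A (code point order)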
Padding semantics
Padding semantics differ between databases, especially when comparing VARCHAR with CHAR content.
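A minimal sketch of PostgreSQL's behavior here (other databases may differ):

SELECT 'a'::char(3)    = 'a '::char(3)    AS char_cmp,     -- true: trailing blanks insignificant for char(n)
       'a'::varchar(3) = 'a '::varchar(3) AS varchar_cmp;  -- false: trailing blanks significant for varchar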
It's varchar and text, assuming your database uses a Unicode (UTF8) encoding. If your database uses a non-Unicode encoding, there is no special datatype that will give you a Unicode string - you can store it as a bytea stream, but that will not be a string.
The standard TEXT datatype is perfectly fine for this.
Related
I have a column that stores a numeric value ranging from 1 to 5.
I considered using smallint and character(1), but I feel like "char" (note the quotes) might be a better data type.
"char" requires only 1 byte, which led me to expect a performance gain over smallint and character(1).
However, in PostgreSQL's own documentation, it says that "char" is not intended for general-purpose use, only for use in the internal system catalogs.
Does this mean I shouldn't use "char" data type for my production application?
If so, which data type would you recommend to store a numeric value ranging from 1 to 5?
The data type "char" is a Postgres type that's not in standard SQL. It's "safe" and you can use it freely, even if "not intended for general-purpose use". There are no restrictions, other than it being non-standard. The type is used across many system catalogs, so it's not going away. If you ever need to migrate to another RDBMS, make the target varchar(1) or similar.
"char" lends itself to 1-letter enumeration types that never grow beyond a hand full of distinct plain ASCII letters. That's how I use it - including productive DBs. (You may want to add a CHECK or FOREIGN KEY constraint to columns enforcing valid values.)
For a "numeric value ranging from 1 to 5" I would still prefer smallint, seems more appropriate for numeric data.
A storage benefit (1 byte vs. 2 bytes) only kicks in for multiple columns - and/or (more importantly!) indexes that can retain a smaller footprint after applying alignment padding.
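A minimal sketch of both variants (table and column names are made up for illustration):

-- 1-letter enumeration with "char", constrained to valid values
CREATE TABLE rating_char (
    grade "char" NOT NULL CHECK (grade IN ('1', '2', '3', '4', '5'))
);

-- the preference stated above for numeric data: plain smallint
CREATE TABLE rating_int (
    grade smallint NOT NULL CHECK (grade BETWEEN 1 AND 5)
);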
Notable updates in Postgres 15, quoting the release notes:
Change the I/O format of type "char" for non-ASCII characters (Tom Lane)
Bytes with the high bit set are now output as a backslash and three
octal digits, to avoid encoding issues.
And:
Create a new pg_type.typcategory value for "char" (Tom Lane)
Some other internal-use-only types have also been assigned to this
category.
(No effect on enumeration with plain ASCII letters.)
Related:
Shall I use enum when are too many "categories" with PostgreSQL?
How to store one-byte integer in PostgreSQL?
Calculating and saving space in PostgreSQL
Is there any relation between single-byte character locales and index scans in PostgreSQL?
I already read the pages in https://www.postgresql.org/doc, but I could not find an answer.
No, not really. The important division is not between single-byte and multi-byte locales, but between the C (or POSIX) locale and all others.
The C locale is English with a sort order (collation) that sorts words character by character, with lower code points (for example, ASCII values) sorted before higher ones. Because this collation is much simpler than natural language collations, comparisons and sorting are much faster with the C locale. Moreover, and this is where indexes come in, a simple B-tree index on a string column can support LIKE conditions, because words are ordered character by character. With other collations, you need to use the text_pattern_ops operator class to support LIKE.
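For example, a sketch of such an index (the table and column names are hypothetical):

-- supports WHERE description LIKE 'abc%' even under a non-C collation
CREATE INDEX products_description_like_idx
    ON products (description text_pattern_ops);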
So if you can live with the strange sort order of the C collation, use the C locale by all means.
Do not use any single-byte server encodings, ever. Even with the C locale, use the encoding UTF8. There is no benefit in choosing any other encoding, and you will find it quite annoying if you encounter a character that you want to store in your database but cannot.
We are working towards a migration of databases from MSSQL to PostgreSQL. During this process we came across a situation where a table contains a password field of NVARCHAR type, whose value was converted from VARBINARY and stored as NVARCHAR.
For example: if I execute
SELECT HASHBYTES('SHA1','Password')
then it returns 0x8BE3C943B1609FFFBFC51AAD666D0A04ADF83C9D, and if this value is converted to NVARCHAR it returns text in the format "䏉悱゚얿괚浦Њ鴼".
As we know, PostgreSQL doesn't support VARBINARY, so we have used BYTEA instead, and it returns binary data. But when we try to convert this binary data to VARCHAR, it returns the hex format.
For example: if the same statement is executed in PostgreSQL
SELECT ENCODE(DIGEST('Password','SHA1'),'hex')
then it returns
8be3c943b1609fffbfc51aad666d0a04adf83c9d.
When we try to convert this encoded text to VARCHAR, it returns the same result, 8be3c943b1609fffbfc51aad666d0a04adf83c9d.
Is it possible to get the same result as we retrieved from the MSSQL server? As these are password fields, we do not intend to change the values. Please suggest what needs to be done.
It sounds like you're taking a byte array containing a cryptographic hash and you want to convert it to a string to do a string comparison. This is a strange way to do hash comparisons but it might be possible depending on which encoding you were using on the MSSQL side.
If you have a byte array that can be converted to string in the encoding you're using (e.g. doesn't contain any invalid code points or sequences for that encoding) you can convert the byte array to string as follows:
SELECT CONVERT_FROM(DIGEST('Password','SHA1'), 'latin1') AS hash_string;
hash_string
-----------------------------
\u008BãÉC±`\u009Fÿ¿Å\x1Afm+
\x04ø<\u009D
If you're using Unicode, this approach won't work at all, since random byte arrays can't be converted to Unicode because there are certain sequences that are always invalid. You'll get an error like the following:
# SELECT CONVERT_FROM(DIGEST('Password','SHA1'), 'utf-8');
ERROR: invalid byte sequence for encoding "UTF8": 0x8b
Here's a list of valid string encodings in PostgreSQL. Find out which encoding you're using on the MSSQL side and try to match it to PostgreSQL. If you can I'd recommend changing your business logic to compare byte arrays directly since this will be less error prone and should be significantly faster.
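A hedged sketch of that direct comparison (the users table and its password_hash bytea column are hypothetical; digest() comes from the pgcrypto extension, as above):

SELECT *
FROM users
WHERE password_hash = digest('Password', 'SHA1');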
then it returns 0x8BE3C943B1609FFFBFC51AAD666D0A04ADF83C9D and in turn
if this value is converted into NVARCHAR then it is returning a text
in the format "䏉悱゚얿괚浦Њ鴼"
Based on that, MSSQL interprets these bytes as a text encoded in UTF-16LE.
With PostgreSQL and using only built-in functions, you cannot obtain that result because PostgreSQL doesn't use or support UTF-16 at all, for anything.
It also doesn't support nul bytes in strings, and UTF-16-encoded text contains nul bytes.
This Q/A: UTF16 hex to text suggests several solutions.
Changing your business logic not to depend on UTF-16 would be your best long-term option, though. The hexadecimal representation, for instance, is simpler and much more portable.
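For instance, a sketch of a hex-based comparison that avoids UTF-16 entirely (reusing the values from the question):

SELECT encode(digest('Password', 'SHA1'), 'hex')
       = lower('8BE3C943B1609FFFBFC51AAD666D0A04ADF83C9D');   -- true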
I have two databases, one is running on postgresql 8.4 and the other on postgresql 9.1.
Both are on CentOS machines with the same locale (en_US).
Suppose I have a table with this data:
id | description
---+------------
 1 | Morango
 2 | CAFÉ
 3 | pera
 4 | Uva
The odd thing is, when I run a query like this one:
SELECT * FROM products WHERE description ~* 'café'
On the 8.4 machine I get no results, but on the 9.1 machine I get the row (CAFÉ).
Apparently they differ in how they compare the uppercase Unicode character.
Could someone give me some insight into this problem?
Is it the different version of PostgreSQL that causes this problem?
Is there any additional configuration I could make to equalize the behavior of the two machines?
UPDATE: Both databases are UTF-8
Case-insensitive regex matching for non-ASCII Unicode characters was basically not supported before 9.0.
See this snippet in the 9.0 release notes:
E.14.3.6. Functions
[...]
Support locale-specific regular expression processing with UTF-8
server encoding (Tom Lane)
Locale-specific regular expression functionality includes
case-insensitive matching and locale-specific character classes.
Previously, these features worked correctly for non-ASCII characters
only if the database used a single-byte server encoding (such as
LATIN1). They will still misbehave in multi-byte encodings other than
UTF-8.
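As a possible workaround on the 8.4 machine (a sketch, not part of the quoted release notes; whether lower() folds É correctly there depends on that server's locale support):

SELECT * FROM products WHERE lower(description) ~ lower('café');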
I want to store Unicode characters in one of the columns of a PostgreSQL 8.4 database table. I want to store non-English language data, say Indic language texts. I have achieved this in Oracle XE by converting the text to Unicode and storing it in the table using an nvarchar2 column.
In the same way, I want to store Unicode characters of Indic languages, say Tamil or Hindi, in one of the columns of a table. How can I achieve that, and what data type should I use?
Please guide me, thanks in advance.
Just make sure the database is initialized with encoding UTF8. This applies to the whole database in 8.4; later versions are more sophisticated. You might want to check the locale settings too - see the manual for details, particularly around matching with LIKE and text_pattern_ops.
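A minimal sketch (names are illustrative; on 8.4 the chosen encoding must be compatible with the cluster locale):

CREATE DATABASE indic_db
    ENCODING 'UTF8'
    TEMPLATE template0;

-- an ordinary text column then stores Tamil, Hindi, or any other Unicode text
CREATE TABLE greetings (msg text);
INSERT INTO greetings VALUES ('வணக்கம்'), ('नमस्ते');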