Postgres - character sets and encodings - postgresql

I was wondering if someone can help me understand what's going on/wrong with my Postgres data please...
I'll explain things below, but I guess ultimately the questions I have are:
What character set/encoding should I be using (i.e. what is best practice)?
If the answer is UTF8, then will certain characters (e.g. UK pound symbols) always look "funny" in the database?
I've got a database that has a table with data about flights (although obviously it could be anything really), defined as follows...
CREATE TABLE public.flight (
flightid integer DEFAULT nextval('public.flight_seq'::regclass) NOT NULL,
tripid integer NOT NULL,
flightdatedeparted date NOT NULL,
flightairportdeparted text NOT NULL,
flightairportarrived text NOT NULL,
flightairline text NOT NULL,
flightdetails text,
flightdayflightnumber integer DEFAULT 1 NOT NULL,
flightdistance numeric
);
Now, when I enter data into it via a web front end connected to this database then I end up with data something like...
holidayinfo=# select distinct * from flight where flightid=97;
-[ RECORD 1 ]---------+---------------------------------------
flightid | 97
tripid | 36
flightdatedeparted | 2004-05-14
flightairportdeparted | LHR
flightairportarrived | WAW
flightairline | British Airways
flightdetails | Hotline, ┬ú82.40, BA850, 13:40 -> 17:05
flightdayflightnumber | 1
flightdistance | 912.7
However, the data that I'd entered into the web form for the field "flightdetails" was actually...
Hotline, £82.40, BA850, 13:40 -> 17:05
Now, when I dump the data and look at it in Notepad++, depending on what encoding I use then sometimes I see it correctly as the pound symbol (when I choose ANSI) and other times it's incorrect as xA3 (when I choose UTF8).
At least when it's stored in Postgres as the "funny" value then it also displays correctly on my webpage when I retrieve the data - so that's good.
If I try to manually update the value via psql then I get the following...
holidayinfo=# update flight set flightdetails='Hotline, £82.40, BA850, 13:40 -> 17:05' where flightid=97;
ERROR: invalid byte sequence for encoding "UTF8": 0x9c
In terms of how my database is created and what client encoding it's using, I've got the following...
holidayinfo=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-------------+----------+----------+-----------------------------+-----------------------------+-----------------------
holidayinfo | postgres | UTF8 | English_United Kingdom.1252 | English_United Kingdom.1252 |
leagueinfo | postgres | UTF8 | English_United Kingdom.1252 | English_United Kingdom.1252 |
postgres | postgres | UTF8 | English_United Kingdom.1252 | English_United Kingdom.1252 |
template0 | postgres | UTF8 | English_United Kingdom.1252 | English_United Kingdom.1252 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | English_United Kingdom.1252 | English_United Kingdom.1252 | =c/postgres +
| | | | | postgres=CTc/postgres
(5 rows)
holidayinfo=# show client_encoding;
client_encoding
-----------------
UTF8
(1 row)
Maybe this is all working as designed, but I'm just confused as to how things should be?
Ultimately, I'd love to be able to have the data stored so that I can see it as the pound sign AND be entered/retrieved/displayed as the pound sign.
The former is desirable so that if ever I need to look at the data then I can see what the real data is, and not have to make assumptions about what the characters "┬ú" really mean.
Also, this problem scales up when there are other characters having the same "issue", such as a hyphen (–) showing as "ÔÇô" or an apostrophe (’) showing as "ÔÇÖ".
Thanks in advance!

You must be viewing the data with psql using cmd.exe with code page CP-850.
The data in your database are wrong, because the application that inserted them had client_encoding set to WIN1252 while feeding the database UTF-8 characters.
So £, which is 0xC2A3 in UTF-8, is interpreted as two characters, namely Â (0xC2) and £ (0xA3). They are converted to UTF-8 and stored in the database as 4 bytes (0xC382 and 0xC2A3). When you view them with psql, they are converted back to WINDOWS-1252, but cmd.exe interprets them as CP-850 and renders them as ┬ú.
The fix is to change client_encoding to UTF8 in the application that inserts the data into the database.
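To make the byte arithmetic above concrete, here is a small Python sketch. It is only an illustration: Python's cp1252 and cp850 codecs stand in for WIN1252 and the cmd.exe code page, the variable names are made up, and nothing here talks to Postgres.

```python
# Byte-level walk-through of the pound sign, following the steps described above.
pound = "£"

# What a UTF-8 application actually sends for the pound sign.
utf8_bytes = pound.encode("utf-8")        # b'\xc2\xa3'

# Step 1: the bytes are declared as WIN1252, so they are read as two characters.
misread = utf8_bytes.decode("cp1252")     # 'Â£'

# Step 2: those two characters are stored as UTF-8, which takes four bytes.
stored = misread.encode("utf-8")          # b'\xc3\x82\xc2\xa3'

# Step 3: converted back to WIN1252 for the client, then drawn by a CP-850 console.
shown = stored.decode("utf-8").encode("cp1252").decode("cp850")   # '┬ú'

# A CP-850 console drawing the UTF-8 bytes of an en dash gives the other symptom.
en_dash_shown = "–".encode("utf-8").decode("cp850")               # 'ÔÇô'

# The byte a CP-850 console sends when you type a pound sign.
typed_pound = pound.encode("cp850")       # b'\x9c'

print(utf8_bytes.hex(), stored.hex(), typed_pound.hex())
```

The last value also suggests where the 0x9c in the error message comes from: it is the CP-850 byte for £, and on its own it is not a valid UTF-8 sequence.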

Related

UPPER function is not working properly on O umlaut characters in Postgres

Please note that I have a Postgres database as follows:
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+------------+----------+-------------+-------------+----------------------------
my_db | admin_user | UTF8 | de_DE.UTF-8 | C | =Tc/admin_user +
| | | | | admin_user=CTc/admin_user +
| | | | | my_readonly=c/admin_user
UPPER function is not working properly on O umlaut characters in this database.
Please advise if there are any settings that could be causing the issue.
LC_CTYPE* determines which characters count as letters or digits and how lower-case letters map to upper-case ones. For UPPER to handle such letters it needs to be something like de_DE.UTF-8; at the moment yours is C.
When creating a DB, Postgres takes these settings from the environment variables in the operating system, but you can override them at that point.
*I read CTYPE as Character TYPE
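As a rough illustration of the difference, here is a toy Python model. It is not how Postgres implements UPPER. The two function names are invented for this sketch, and Python's Unicode-aware str.upper() is used as a stand-in for a ctype such as de_DE.UTF-8.

```python
def upper_with_c_ctype(text: str) -> str:
    """Toy model of UPPER under LC_CTYPE = C: only ASCII a-z are treated as letters."""
    return "".join(ch.upper() if "a" <= ch <= "z" else ch for ch in text)


def upper_with_unicode_ctype(text: str) -> str:
    """Stand-in for a ctype that knows about umlauts (assumption: behaves like str.upper)."""
    return text.upper()


word = "schön"
c_result = upper_with_c_ctype(word)              # the o umlaut is left untouched
unicode_result = upper_with_unicode_ctype(word)  # the o umlaut becomes upper case

print(ascii(c_result), ascii(unicode_result))
```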

Use date function to rename a database in postgres

I would like to know how to rename a database with the current date
thanks for your help
You may use dynamic SQL in a DO block. Here I use a date suffix in YYYYMMDD format for the database name.
knayak=# CREATE DATABASE mydatabase;
CREATE DATABASE
DO $$
BEGIN
-- Build the whole new name first so that %I quotes it as a single identifier.
EXECUTE format('ALTER DATABASE %I RENAME TO %I', 'mydatabase',
'mydatabase_' || to_char(current_date, 'YYYYMMDD'));
END
$$;
knayak=#
knayak=# \l mydatabase*
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
---------------------+--------+----------+-------------+-------------+-------------------
mydatabase_20181214 | knayak | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
(1 row)
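If the rename is driven from a client script instead, the same statement can be built there. Here is a minimal Python sketch. The helpers quote_ident and rename_statement are hypothetical names written for this example (they are not part of any driver), and the statement would still have to be sent to the server from a connection to a different database.

```python
from datetime import date


def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def rename_statement(database: str, today: date) -> str:
    """Build ALTER DATABASE ... RENAME TO with a YYYYMMDD suffix."""
    new_name = f"{database}_{today.strftime('%Y%m%d')}"
    return f"ALTER DATABASE {quote_ident(database)} RENAME TO {quote_ident(new_name)}"


statement = rename_statement("mydatabase", date(2018, 12, 14))
print(statement)
# ALTER DATABASE "mydatabase" RENAME TO "mydatabase_20181214"
```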

Column sorting in PostgreSQL is different between macOS and Ubuntu using same collation

I created a database with UTF8 encoding and fr_FR collation on both my Mac and Ubuntu server like this:
CREATE DATABASE my_database OWNER 'admin' TEMPLATE 'template0' ENCODING 'UTF8' LC_COLLATE 'fr_FR.UTF-8' LC_CTYPE 'fr_FR.UTF-8';
On both, I queried the collation:
show lc_collate;
and obtained:
fr_FR.UTF-8
Then I ran the same sort on both databases and didn't get the same results:
SELECT winery FROM usr_wines WHERE user_id=1 AND status=1 ORDER BY winery LIMIT 5;
1 - On macOS:
a space before the a
A New record
Aa
Altesinoo
Aé
2- On Ubuntu 14.04:
Aa
Aé
Altesino
A New Wine
a space before a
On Ubuntu, I installed the desired locales and created a new collation:
CREATE COLLATION "fr_FR.utf8" (LOCALE = "fr_FR.utf8")
select * from pg_collation;
collname | collnamespace | collowner | collencoding | collcollate | collctype
------------+---------------+-----------+--------------+-------------+------------
default | 11 | 10 | -1 | |
C | 11 | 10 | -1 | C | C
POSIX | 11 | 10 | -1 | POSIX | POSIX
C.UTF-8 | 11 | 10 | 6 | C.UTF-8 | C.UTF-8
en_US | 11 | 10 | 6 | en_US.utf8 | en_US.utf8
en_US.utf8 | 11 | 10 | 6 | en_US.utf8 | en_US.utf8
ucs_basic | 11 | 10 | 6 | C | C
fr_FR | 2200 | 10 | 6 | fr_FR.utf8 | fr_FR.utf8
On the mac, the fr_FR collation was already installed.
So why this difference in sorting?
Another strange issue on Ubuntu: if I try to force the collation in my request:
SELECT winery FROM usr_wines WHERE user_id=1 AND status=1 ORDER BY winery COLLATE "fr_FR" LIMIT 5;
I got:
ERROR: collation "fr_FR" for encoding "UTF8" does not exist
Any help is welcome.
COLLATE "C" will give you predictable results on all platforms. Every other collation comes from the operating system's locale support, so its sort order depends entirely on the OS, which would explain why macOS and Ubuntu disagree here.
https://www.postgresql.org/docs/current/static/collation.html:
On all platforms, the collations named default, C, and POSIX are
available. Additional collations may be available depending on
operating system support. The default collation selects the LC_COLLATE
and LC_CTYPE values specified at database creation time. The C and
POSIX collations both specify "traditional C" behavior, in which only
the ASCII letters "A" through "Z" are treated as letters, and sorting
is done strictly by character code byte values.
If the operating system provides support for using multiple locales
within a single program (newlocale and related functions), then when a
database cluster is initialized, initdb populates the system catalog
pg_collation with collations based on all the locales it finds on the
operating system at the time. For example, the operating system might
provide a locale named de_DE.utf8. initdb would then create a
collation named de_DE.utf8 for encoding UTF8 that has both LC_COLLATE
and LC_CTYPE set to de_DE.utf8. It will also create a collation with
the .utf8 tag stripped off the name. So you could also use the
collation under the name de_DE, which is less cumbersome to write and
makes the name less encoding-dependent. Note that, nevertheless, the
initial set of collation names is platform-dependent.
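To see what "sorting is done strictly by character code byte values" means in practice, here is a short Python sketch. It only imitates the C collation by comparing UTF-8 bytes. The sample names are borrowed from the question and c_collation_key is a name made up for this example.

```python
names = ["a space before the a", "A New record", "Aa", "Altesino", "Aé"]


def c_collation_key(text: str) -> bytes:
    """Sort key imitating COLLATE "C": compare the raw UTF-8 bytes."""
    return text.encode("utf-8")


c_sorted = sorted(names, key=c_collation_key)
# ['A New record', 'Aa', 'Altesino', 'Aé', 'a space before the a']
print(ascii(c_sorted))
```

Upper-case ASCII letters sort before lower-case ones, and the accented name sorts after every unaccented name with the same prefix. That is rarely what a French speaker expects, but it is the same on every platform.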

Postgres 9.6.1 Full Text Search dictionaries for most spoken languages

I am trying to run full text search operations, such as to_tsvector, to_tsquery, etc and have a need for dictionaries in about 80+ languages.
Postgres seems to come with only 16 language configurations, plus an additional two I am testing for Chinese (jiebacfg and testzhcfg, aka ZHParse). I'm looking for documentation or a repository of other languages beyond these.
mydatabase=# \dF
List of text search configurations
Schema | Name | Description
------------+------------+---------------------------------------
pg_catalog | danish | configuration for danish language
pg_catalog | dutch | configuration for dutch language
pg_catalog | english | configuration for english language
pg_catalog | finnish | configuration for finnish language
pg_catalog | french | configuration for french language
pg_catalog | german | configuration for german language
pg_catalog | hungarian | configuration for hungarian language
pg_catalog | italian | configuration for italian language
pg_catalog | norwegian | configuration for norwegian language
pg_catalog | portuguese | configuration for portuguese language
pg_catalog | romanian | configuration for romanian language
pg_catalog | russian | configuration for russian language
pg_catalog | simple | simple configuration
pg_catalog | spanish | configuration for spanish language
pg_catalog | swedish | configuration for swedish language
pg_catalog | turkish | configuration for turkish language
public | jiebacfg | configuration for jieba
public | testzhcfg |
(18 rows)
As pozs commented you can get dictionary files from OpenOffice (or LibreOffice) extensions. From documentation:
To create an Ispell dictionary perform these steps:
download dictionary configuration files. OpenOffice extension files have the .oxt extension. It is necessary to extract .aff and .dic files, change extensions to .affix and .dict. For some dictionary files it is also needed to convert characters to the UTF-8 encoding with commands (for example, for a Norwegian language dictionary):
iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
copy files to the $SHAREDIR/tsearch_data directory
load files into PostgreSQL with the following command:
CREATE TEXT SEARCH DICTIONARY english_hunspell (
TEMPLATE = ispell,
DictFile = en_us,
AffFile = en_us,
Stopwords = english);
There is also a list of extensions that provide an easy way to install dictionaries. You can download them from GitHub.
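If iconv is not at hand, the conversion step can be sketched in Python as well. This is only an illustration of the re-encoding described above. convert_dictionary is a hypothetical helper, and the ISO-8859-1 default is an assumption you would need to check against the actual dictionary files.

```python
from pathlib import Path


def convert_dictionary(source: Path, target: Path, source_encoding: str = "iso-8859-1") -> None:
    """Re-encode one dictionary file to UTF-8 under its new name."""
    # Work on bytes so that line endings are passed through unchanged.
    text = source.read_bytes().decode(source_encoding)
    target.write_bytes(text.encode("utf-8"))


# Example usage, mirroring the two iconv commands above:
# convert_dictionary(Path("nn_NO.aff"), Path("nn_no.affix"))
# convert_dictionary(Path("nn_NO.dic"), Path("nn_no.dict"))
```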

Alternative to list command in postgres for scripting

Unfortunately, psql -l wraps lines.
For example, look at the "Access privileges" column in this output:
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-------------------------+------------+----------+-------------+-------------+-----------------------
firstdb | postgres | UTF8 | de_DE.UTF-8 | de_DE.UTF-8 |
secnddb | scnduser | UTF8 | de_DE.UTF-8 | de_DE.UTF-8 |
thrddb | scnduser | UTF8 | de_DE.UTF-8 | de_DE.UTF-8 |
postgres | postgres | UTF8 | de_DE.UTF-8 | de_DE.UTF-8 |
template0 | postgres | UTF8 | de_DE.UTF-8 | de_DE.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | de_DE.UTF-8 | de_DE.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
(6 rows)
Note that even with some extra options, I can't get rid of the wrapping:
$ psql -Atlqn
firstdb|postgres|UTF8|de_DE.UTF-8|de_DE.UTF-8|
secnddb|scnduser|UTF8|de_DE.UTF-8|de_DE.UTF-8|
thrddb|scnduser|UTF8|de_DE.UTF-8|de_DE.UTF-8|
postgres|postgres|UTF8|de_DE.UTF-8|de_DE.UTF-8|
template0|postgres|UTF8|de_DE.UTF-8|de_DE.UTF-8|=c/postgres
postgres=CTc/postgres
template1|postgres|UTF8|de_DE.UTF-8|de_DE.UTF-8|=c/postgres
postgres=CTc/postgres
Question: Is there another way to get the list of databases in the same way \list prints it, so I can use it in scripts for parsing with e.g. awk?
Interesting issue. You're being bitten by hard linewraps in ACL entries. Those aren't the only place they can appear btw, just the most common.
Use a null-byte recordsep
Rather than trying to avoid the newlines, why not use a different record separator? A null byte (\0) can be good; that's what the -0 option is for. It's only useful if your client can deal with null bytes, though; good for xargs -0, not good for lots of other stuff. The handy thing about a null byte is that it won't appear in psql's output otherwise so there's no risk of conflict. gawk does support null-separated records, though it's woefully underdocumented.
Try, e.g:
psql -Atlqn -0 | gawk -v 'RS=\0' '{ gsub("\n", " "); print }'
which replaces newlines in database names (yes, they can appear there!), ACL entries, etc with a space.
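If the consumer is a script rather than gawk, the same idea can be sketched in Python. This assumes the bytes come from something like psql -Atlqn -0. split_records is a made-up helper, and the sample input below is hand-written to resemble the listing in the question.

```python
from typing import List


def split_records(raw: bytes) -> List[str]:
    """Split NUL-separated psql output into records, flattening embedded newlines."""
    records = raw.decode("utf-8").split("\0")
    # Drop the empty piece left by a trailing separator, if there is one.
    return [record.replace("\n", " ") for record in records if record]


sample = (
    b"firstdb|postgres|UTF8|de_DE.UTF-8|de_DE.UTF-8|\0"
    b"template0|postgres|UTF8|de_DE.UTF-8|de_DE.UTF-8|=c/postgres\npostgres=CTc/postgres\0"
)

for record in split_records(sample):
    print(record)
```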
Use a different recordsep
Alternately, use -R, e.g. -R '!' or -R '--SEPARATOR--' or whatever is easy for you to parse and not likely to appear in the output.
Query the catalogs yourself, escaping strings
Depending on the information you need, you can instead query the catalogs or information_schema directly, too. You'll still have to deal with funny chars, so you may want a regexp to escape any funny business.
Newlines and shell metacharacters
Beware that you still have to deal with unexpected newlines; consider what happens when some ##$# does this:
CREATE DATABASE "my
database";
Yes, that's a legal database name. So are both of:
CREATE DATABASE "$(rm -rf /gladthisisnotroot);";
CREATE DATABASE "$(createuser -l -s my_haxxor -W top_secret)"
Yes, both are legal database names. Yes, that's really, really bad if you don't escape your shell metacharacters properly, you made the mistake of running your shell scripts as root or the postgres user, and they're not feeling nice.
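As a sketch of what "escape your shell metacharacters properly" could look like when the script is written in Python, the standard library's shlex.quote wraps a string so that a POSIX shell treats it as one literal word. The wrapper name shell_safe is invented for this example.

```python
import shlex


def shell_safe(name: str) -> str:
    """Quote a database name so a POSIX shell treats it as a single literal word."""
    return shlex.quote(name)


hostile = "$(rm -rf /gladthisisnotroot);"
quoted = shell_safe(hostile)
with_newline = shell_safe("my\ndatabase")

print(quoted)
# '$(rm -rf /gladthisisnotroot);'
```

Where possible, it is simpler still to skip the shell and pass the name as its own element of an argument list (for example to subprocess.run), so that nothing gets a chance to interpret it.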
All relevant data is in pg_database and pg_user
see http://www.postgresql.org/docs/current/static/catalog-pg-database.html
select pg_database.datname, pg_user.usename,
pg_encoding_to_char(pg_database.encoding),
pg_database.datcollate, pg_database.datctype, pg_database.datacl
from pg_database, pg_user
where pg_database.datdba = pg_user.usesysid;
In the shell, wrapped in a psql command:
psql -AStnq -c "select [...]"
this returns correctly formatted output:
template1|postgres|UTF8|de_DE.UTF-8|de_DE.UTF-8|{=c/postgres,postgres=CTc/postgres}
template0|postgres|UTF8|de_DE.UTF-8|de_DE.UTF-8|{=c/postgres,postgres=CTc/postgres}
postgres|postgres|UTF8|de_DE.UTF-8|de_DE.UTF-8|
firstdb|postgres|UTF8|de_DE.UTF-8|de_DE.UTF-8|
secnddb|scnduser|UTF8|de_DE.UTF-8|de_DE.UTF-8|
thrddb|scnduser|UTF8|de_DE.UTF-8|de_DE.UTF-8|