Postgresql ORDER BY not working as expected - postgresql

Let's try this simple example to represent the problem I'm facing.
Assume this table:
CREATE TABLE testing1
(
id serial NOT NULL,
word text,
CONSTRAINT testing1_pkey PRIMARY KEY (id)
);
and that data:
insert into testing1 (word) values ('Heliod, God');
insert into testing1 (word) values ('Heliod''s Inter');
insert into testing1 (word) values ('Heliod''s Pilg');
insert into testing1 (word) values ('Heliod, Sun');
Then I want to run this query to get the results ordered by the word column:
SELECT
id, word
FROM testing1
WHERE UPPER(word::text) LIKE UPPER('heliod%')
ORDER BY word asc;
But look at the output, it's not ordered. I would expect the rows to be in that order, using their ids: 2, 3, 1, 4 (or, if I use the word's values: Heliod's Inter, Heliod's Pilg, Heliod, God, Heliod, Sun). This is what I get:
I thought that maybe something could confuse postgresql because of the WHERE criteria I used, but the below happens if I just order by on the rows:
Am I missing something here? I couldn't find anything in the docs about ordering values that contain quotes (I suspect that the quotes cause that behaviour because of their special meaning in postgresql, but I may be wrong).
I am using UTF-8 encoding for my database (not sure if it matters though) and this issue is happening on Postgresql version 12.7.
The output of
show lc_ctype;
is
"en_GB.UTF-8"
and the output of
show lc_collate;
is
"en_GB.UTF-8"

That is the correct way to order the rows in en_US.UTF-8. It does 'weird' (to someone used to ASCII) things with punctuation and whitespace, skipping on a first pass and considering it only for otherwise tied values.
If you don't want those rules, maybe use the C collation instead.

Indeed, I've tried #jjanes's suggestion to use the C collation and the output is the one I would expect:
SELECT
id, word
FROM testing1
ORDER BY word collate "C" ;
How weird, I have been using postgresql for some years now and I never noticed that behaviour.
Relevant section from the docs:
23.2.2.1. Standard Collations
On all platforms, the collations named default, C, and POSIX are available. > Additional collations may be available depending on operating system
support. The default collation selects the LC_COLLATE and LC_CTYPE values
specified at database creation time. The C and POSIX collations both
specify “traditional C” behavior, in which only the ASCII letters “A”
through “Z” are treated as letters, and sorting is done strictly by
character code byte values.

Related

Postgres Varchar(n) or text which is better [duplicate]

What's the difference between the text data type and the character varying (varchar) data types?
According to the documentation
If character varying is used without length specifier, the type accepts strings of any size. The latter is a PostgreSQL extension.
and
In addition, PostgreSQL provides the text type, which stores strings of any length. Although the type text is not in the SQL standard, several other SQL database management systems have it as well.
So what's the difference?
There is no difference, under the hood it's all varlena (variable length array).
Check this article from Depesz: http://www.depesz.com/index.php/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/
A couple of highlights:
To sum it all up:
char(n) – takes too much space when dealing with values shorter than n (pads them to n), and can lead to subtle errors because of adding trailing
spaces, plus it is problematic to change the limit
varchar(n) – it's problematic to change the limit in live environment (requires exclusive lock while altering table)
varchar – just like text
text – for me a winner – over (n) data types because it lacks their problems, and over varchar – because it has distinct name
The article does detailed testing to show that the performance of inserts and selects for all 4 data types are similar. It also takes a detailed look at alternate ways on constraining the length when needed. Function based constraints or domains provide the advantage of instant increase of the length constraint, and on the basis that decreasing a string length constraint is rare, depesz concludes that one of them is usually the best choice for a length limit.
As "Character Types" in the documentation points out, varchar(n), char(n), and text are all stored the same way. The only difference is extra cycles are needed to check the length, if one is given, and the extra space and time required if padding is needed for char(n).
However, when you only need to store a single character, there is a slight performance advantage to using the special type "char" (keep the double-quotes — they're part of the type name). You get faster access to the field, and there is no overhead to store the length.
I just made a table of 1,000,000 random "char" chosen from the lower-case alphabet. A query to get a frequency distribution (select count(*), field ... group by field) takes about 650 milliseconds, vs about 760 on the same data using a text field.
(this answer is a Wiki, you can edit - please correct and improve!)
UPDATING BENCHMARKS FOR 2016 (pg9.5+)
And using "pure SQL" benchmarks (without any external script)
use any string_generator with UTF8
main benchmarks:
2.1. INSERT
2.2. SELECT comparing and counting
CREATE FUNCTION string_generator(int DEFAULT 20,int DEFAULT 10) RETURNS text AS $f$
SELECT array_to_string( array_agg(
substring(md5(random()::text),1,$1)||chr( 9824 + (random()*10)::int )
), ' ' ) as s
FROM generate_series(1, $2) i(x);
$f$ LANGUAGE SQL IMMUTABLE;
Prepare specific test (examples)
DROP TABLE IF EXISTS test;
-- CREATE TABLE test ( f varchar(500));
-- CREATE TABLE test ( f text);
CREATE TABLE test ( f text CHECK(char_length(f)<=500) );
Perform a basic test:
INSERT INTO test
SELECT string_generator(20+(random()*(i%11))::int)
FROM generate_series(1, 99000) t(i);
And other tests,
CREATE INDEX q on test (f);
SELECT count(*) FROM (
SELECT substring(f,1,1) || f FROM test WHERE f<'a0' ORDER BY 1 LIMIT 80000
) t;
... And use EXPLAIN ANALYZE.
UPDATED AGAIN 2018 (pg10)
little edit to add 2018's results and reinforce recommendations.
Results in 2016 and 2018
My results, after average, in many machines and many tests: all the same (statistically less than standard deviation).
Recommendation
Use text datatype, avoid old varchar(x) because sometimes it is not a standard, e.g. in CREATE FUNCTION clauses varchar(x)≠varchar(y).
express limits (with same varchar performance!) by with CHECK clause in the CREATE TABLE e.g. CHECK(char_length(x)<=10). With a negligible loss of performance in INSERT/UPDATE you can also to control ranges and string structure e.g. CHECK(char_length(x)>5 AND char_length(x)<=20 AND x LIKE 'Hello%')
On PostgreSQL manual
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
I usually use text
References: http://www.postgresql.org/docs/current/static/datatype-character.html
In my opinion, varchar(n) has it's own advantages. Yes, they all use the same underlying type and all that. But, it should be pointed out that indexes in PostgreSQL has its size limit of 2712 bytes per row.
TL;DR:
If you use text type without a constraint and have indexes on these columns, it is very possible that you hit this limit for some of your columns and get error when you try to insert data but with using varchar(n), you can prevent it.
Some more details: The problem here is that PostgreSQL doesn't give any exceptions when creating indexes for text type or varchar(n) where n is greater than 2712. However, it will give error when a record with compressed size of greater than 2712 is tried to be inserted. It means that you can insert 100.000 character of string which is composed by repetitive characters easily because it will be compressed far below 2712 but you may not be able to insert some string with 4000 characters because the compressed size is greater than 2712 bytes. Using varchar(n) where n is not too much greater than 2712, you're safe from these errors.
text and varchar have different implicit type conversions. The biggest impact that I've noticed is handling of trailing spaces. For example ...
select ' '::char = ' '::varchar, ' '::char = ' '::text, ' '::varchar = ' '::text
returns true, false, true and not true, true, true as you might expect.
Somewhat OT: If you're using Rails, the standard formatting of webpages may be different. For data entry forms text boxes are scrollable, but character varying (Rails string) boxes are one-line. Show views are as long as needed.
A good explanation from http://www.sqlines.com/postgresql/datatypes/text:
The only difference between TEXT and VARCHAR(n) is that you can limit
the maximum length of a VARCHAR column, for example, VARCHAR(255) does
not allow inserting a string more than 255 characters long.
Both TEXT and VARCHAR have the upper limit at 1 Gb, and there is no
performance difference among them (according to the PostgreSQL
documentation).
The difference is between tradition and modern.
Traditionally you were required to specify the width of each table column. If you specify too much width, expensive storage space is wasted, but if you specify too little width, some data will not fit. Then you would resize the column, and had to change a lot of connected software, fix introduced bugs, which is all very cumbersome.
Modern systems allow for unlimited string storage with dynamic storage allocation, so the incidental large string would be stored just fine without much waste of storage of small data items.
While a lot of programming languages have adopted a data type of 'string' with unlimited size, like C#, javascript, java, etc, a database like Oracle did not.
Now that PostgreSQL supports 'text', a lot of programmers are still used to VARCHAR(N), and reason like: yes, text is the same as VARCHAR, except that with VARCHAR you MAY add a limit N, so VARCHAR is more flexible.
You might as well reason, why should we bother using VARCHAR without N, now that we can simplify our life with TEXT?
In my recent years with Oracle, I have used CHAR(N) or VARCHAR(N) on very few occasions. Because Oracle does (did?) not have an unlimited string type, I used for most string columns VARCHAR(2000), where 2000 was at some time the maximum for VARCHAR, and in all practical purposes not much different from 'infinite'.
Now that I am working with PostgreSQL, I see TEXT as real progress. No more emphasis on the VAR feature of the CHAR type. No more emphasis on let's use VARCHAR without N. Besides, typing TEXT saves 3 keystrokes compared to VARCHAR.
Younger colleagues would now grow up without even knowing that in the old days there were no unlimited strings. Just like that in most projects they don't have to know about assembly programming.
I wasted way too much time because of using varchar instead of text for PostgreSQL arrays.
PostgreSQL Array operators do not work with string columns. Refer these links for more details: (https://github.com/rails/rails/issues/13127) and (http://adamsanderson.github.io/railsconf_2013/?full#10).
If you only use TEXT type you can run into issues when using AWS Database Migration Service:
Large objects (LOBs) are used but target LOB columns are not nullable
Due to their unknown and sometimes large size, large objects (LOBs) require more processing
and resources than standard objects. To help with tuning migrations of systems that contain
LOBs, AWS DMS offers the following options
If you are only sticking to PostgreSQL for everything probably you're fine. But if you are going to interact with your db via ODBC or external tools like DMS you should consider using not using TEXT for everything.
character varying(n), varchar(n) - (Both the same). value will be truncated to n characters without raising an error.
character(n), char(n) - (Both the same). fixed-length and will pad with blanks till the end of the length.
text - Unlimited length.
Example:
Table test:
a character(7)
b varchar(7)
insert "ok " to a
insert "ok " to b
We get the results:
a | (a)char_length | b | (b)char_length
----------+----------------+-------+----------------
"ok "| 7 | "ok" | 2

How would I diagnose what error seems to lead to non-functional underscore wildcard queries in Postgresql 15?

I am working through a quick refresher ('SQL Handbook' by Flavio Copes), and any LIKE or ILIKE query I use with the underscore wildcard returns no results.
The table is created as such:
CREATE TABLE people (
names CHAR(20)
);
INSERT INTO people VALUES ('Joe'), ('John'), ('Johanna'), ('Zoe');
Given this table, I use the following query:
SELECT * FROM people WHERE names LIKE '_oe';
I expect it to return
names
1
Joe
2
Zoe
Instead, it returns
names
The install is PostgreSQL 15 (x64), pgAdmin 4, and PostGIS v3.3.1
Using char(20) means all strings are exactly 20 chars long, being padded with spaces out to that length. The spaces make it not match the pattern, as there is nothing in the pattern to accommodate spaces at the end.
If you make the pattern be '_oe%' it would work. Or better yet, don't use char(20).

Alphanumeric sorting without any pattern on the strings [duplicate]

I've got a Postgres ORDER BY issue with the following table:
em_code name
EM001 AAA
EM999 BBB
EM1000 CCC
To insert a new record to the table,
I select the last record with SELECT * FROM employees ORDER BY em_code DESC
Strip alphabets from em_code usiging reg exp and store in ec_alpha
Cast the remating part to integer ec_num
Increment by one ec_num++
Pad with sufficient zeors and prefix ec_alpha again
When em_code reaches EM1000, the above algorithm fails.
First step will return EM999 instead EM1000 and it will again generate EM1000 as new em_code, breaking the unique key constraint.
Any idea how to select EM1000?
Since Postgres 9.6, it is possible to specify a collation which will sort columns with numbers naturally.
https://www.postgresql.org/docs/10/collation.html
-- First create a collation with numeric sorting
CREATE COLLATION numeric (provider = icu, locale = 'en#colNumeric=yes');
-- Alter table to use the collation
ALTER TABLE "employees" ALTER COLUMN "em_code" type TEXT COLLATE numeric;
Now just query as you would otherwise.
SELECT * FROM employees ORDER BY em_code
On my data, I get results in this order (note that it also sorts foreign numerals):
Value
0
0001
001
1
06
6
13
۱۳
14
One approach you can take is to create a naturalsort function for this. Here's an example, written by Postgres legend RhodiumToad.
create or replace function naturalsort(text)
returns bytea language sql immutable strict as $f$
select string_agg(convert_to(coalesce(r[2], length(length(r[1])::text) || length(r[1])::text || r[1]), 'SQL_ASCII'),'\x00')
from regexp_matches($1, '0*([0-9]+)|([^0-9]+)', 'g') r;
$f$;
Source: http://www.rhodiumtoad.org.uk/junk/naturalsort.sql
To use it simply call the function in your order by:
SELECT * FROM employees ORDER BY naturalsort(em_code) DESC
The reason is that the string sorts alphabetically (instead of numerically like you would want it) and 1 sorts before 9.
You could solve it like this:
SELECT * FROM employees
ORDER BY substring(em_code, 3)::int DESC;
It would be more efficient to drop the redundant 'EM' from your em_code - if you can - and save an integer number to begin with.
Answer to question in comment
To strip any and all non-digits from a string:
SELECT regexp_replace(em_code, E'\\D','','g')
FROM employees;
\D is the regular expression class-shorthand for "non-digits".
'g' as 4th parameter is the "globally" switch to apply the replacement to every occurrence in the string, not just the first.
After replacing every non-digit with the empty string, only digits remain.
This always comes up in questions and in my own development and I finally tired of tricky ways of doing this. I finally broke down and implemented it as a PostgreSQL extension:
https://github.com/Bjond/pg_natural_sort_order
It's free to use, MIT license.
Basically it just normalizes the numerics (zero pre-pending numerics) within strings such that you can create an index column for full-speed sorting au naturel. The readme explains.
The advantage is you can have a trigger do the work and not your application code. It will be calculated at machine-speed on the PostgreSQL server and migrations adding columns become simple and fast.
you can use just this line
"ORDER BY length(substring(em_code FROM '[0-9]+')), em_code"
I wrote about this in detail in this related question:
Humanized or natural number sorting of mixed word-and-number strings
(I'm posting this answer as a useful cross-reference only, so it's community wiki).
I came up with something slightly different.
The basic idea is to create an array of tuples (integer, string) and then order by these. The magic number 2147483647 is int32_max, used so that strings are sorted after numbers.
ORDER BY ARRAY(
SELECT ROW(
CAST(COALESCE(NULLIF(match[1], ''), '2147483647') AS INTEGER),
match[2]
)
FROM REGEXP_MATCHES(col_to_sort_by, '(\d*)|(\D*)', 'g')
AS match
)
I thought about another way of doing this that uses less db storage than padding and saves time than calculating on the fly.
https://stackoverflow.com/a/47522040/935122
I've also put it on GitHub
https://github.com/ccsalway/dbNaturalSort
The following solution is a combination of various ideas presented in another question, as well as some ideas from the classic solution:
create function natsort(s text) returns text immutable language sql as $$
select string_agg(r[1] || E'\x01' || lpad(r[2], 20, '0'), '')
from regexp_matches(s, '(\D*)(\d*)', 'g') r;
$$;
The design goals of this function were simplicity and pure string operations (no custom types and no arrays), so it can easily be used as a drop-in solution, and is trivial to be indexed over.
Note: If you expect numbers with more than 20 digits, you'll have to replace the hard-coded maximum length 20 in the function with a suitable larger length. Note that this will directly affect the length of the resulting strings, so don't make that value larger than needed.

Postgresql constraint to check for non-ascii characters

I have a Postgresql 9.3 database that is encoded 'UTF8'. However, there is a column in database that should never contain anything but ASCII. And if non-ascii gets in there, it causes a problem in another system that I have no control over. Therefore, I want to add a constraint to the column. Note: I already have a BEFORE INSERT trigger - so that might be a good place to do the check.
What's the best way to accomplish this in PostgreSQL?
You can define ASCII as ordinal 1 to 127 for this purpose, so the following query will identify a string with "non-ascii" values:
SELECT exists(SELECT 1 from regexp_split_to_table('abcdéfg','') x where ascii(x) not between 1 and 127);
but it's not likely to be super-efficient, and the use of subqueries would force you to do it in a trigger rather than a CHECK constraint.
Instead I'd use a regular expression. If you want all printable characters then you can use a range in a check constraint, like:
CHECK (my_column ~ '^[ -~]*$')
this will match everything from the space to the tilde, which is the printable ASCII range.
If you want all ASCII, printable and nonprintable, you can use byte escapes:
CHECK (my_column ~ '^[\x00-\x7F]*$')
The most strictly correct approach would be to convert_to(my_string, 'ascii') and let an exception be raised if it fails ... but PostgreSQL doesn't offer an ascii (i.e. 7-bit) encoding, so that approach isn't possible.
Use a CHECK constraint built around a regular expression.
Assuming that you mean a certain column should never contain anything but the lowercase letters from a to z, the uppercase letters from A to Z, and the numbers 0 through 9, something like this should work.
alter table your_table
add constraint allow_ascii_only
check (your_column ~ '^[a-zA-Z0-9]+$');
This is what people usually mean when they talk about "only ASCII" with respect to database columns, but ASCII also includes glyphs for punctuation, arithmetic operators, etc. Characters you want to allow go between the square brackets.

Is name a special keyword in PostgreSQL?

I am using Ubuntu and PostgreSql 8.4.9.
Now, for any table in my database, if I do select table_name.name from table_name, it shows a result of concatenated columns for each row, although I don't have any name column in the table. For the tables which have name column, no issue. Any idea why?
My results are like this:
select taggings.name from taggings limit 3;
---------------------------------------------------------------
(1,4,84,,,PlantCategory,soil_pref_tags,"2010-03-18 00:37:55")
(2,5,84,,,PlantCategory,soil_pref_tags,"2010-03-18 00:37:55")
(3,6,84,,,PlantCategory,soil_pref_tags,"2010-03-18 00:37:55")
(3 rows)
select name from taggings limit 3;
ERROR: column "name" does not exist
LINE 1: select name from taggings limit 3;
This is a known confusing "feature" with a bit of history. Specifically, you could refer to tuples from the table as a whole with the table name, and then appending .name would invoke the name function on them (i.e. it would be interpreted as select name(t) from t).
At some point in the PostgreSQL 9 development, Istr this was cleaned up a bit. You can still do select t from t explicitly to get the rows-as-tuples effect, but you can't apply a function in the same way. So on PostgreSQL 8.4.9, this:
create table t(id serial primary key, value text not null);
insert into t(value) values('foo');
select t.name from t;
produces the bizarre:
name
---------
(1,foo)
(1 row)
but on 9.1.1 produces:
ERROR: column t.name does not exist
LINE 1: select t.name from t;
^
as you would expect.
So, to specifically answer your question: name is a standard type in PostgreSQL (used in the catalogue for table names etc) and also some standard functions to convert things to the name type. It's not actually reserved, just the objects that exist called that, plus some historical strange syntax, made things confusing; and this has been fixed by the developers in recent versions.
According to the PostgreSQL documentation, name is a "non-reserved" keyword in PostgreSQL, SQL:2003, SQL:1999, or SQL-92.
SQL distinguishes between reserved and non-reserved key words. According to the standard, reserved key words are the only real key words; they are never allowed as identifiers. Non-reserved key words only have a special meaning in particular contexts and can be used as identifiers in other contexts. Most non-reserved key words are actually the names of built-in tables and functions specified by SQL. The concept of non-reserved key words essentially only exists to declare that some predefined meaning is attached to a word in some contexts.
The suggested fix when using keywords is:
As a general rule, if you get spurious parser errors for commands that contain any of the listed key words as an identifier you should try to quote the identifier to see if the problem goes away.