After running my script for quite a while, I realized that the <> operator does not behave the way I expected where NULL values are concerned: many of my MERGE commands never entered the UPDATE branch because <> did not report two different values as different when one of them was NULL.
So for a few of my comparisons, I added COALESCE.
E.g. instead of
Value1 <> Value2
I used
COALESCE(Value1, '') <> COALESCE(Value2, '')
The question is whether it is safe to add COALESCE to all the other comparisons as well, or does that depend on the data type, or are there other caveats?
You can use COALESCE like that as long as, for your data, NULLs and zero-length strings can be treated as equivalent, i.e. NULL = ''.
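If NULL and '' must stay distinct (or the columns aren't character types), a fully spelled-out predicate avoids the sentinel problem; a sketch using the names from the question:

Value1 <> Value2
OR (Value1 IS NULL AND Value2 IS NOT NULL)
OR (Value1 IS NOT NULL AND Value2 IS NULL)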
If you have a huge list of comparisons in your MERGE statement, you might try replacing it with an EXCEPT set operator:
WHEN MATCHED AND EXISTS (SELECT SRC.* EXCEPT SELECT TGT.*) THEN
UPDATE ...
A NULL compared to a NULL is unknown, but set operators test for distinctness instead: NULL is distinct from any non-NULL value, yet NULL is not distinct from another NULL.
If the SRC row and the TGT row are identical, the EXCEPT returns an empty set and the UPDATE is skipped.
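A minimal sketch of how that looks inside a MERGE (the table and column names here are placeholders, not from the question):

MERGE dbo.Target AS TGT
USING dbo.Source AS SRC
    ON TGT.Id = SRC.Id
WHEN MATCHED AND EXISTS (SELECT SRC.Col1, SRC.Col2 EXCEPT SELECT TGT.Col1, TGT.Col2) THEN
    UPDATE SET Col1 = SRC.Col1,
               Col2 = SRC.Col2;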
COALESCE should only be used where the data can actually be NULL, e.g. when you are not using an inner join or the column is nullable. My personal opinion: don't force the engine to execute unnecessary instructions if you can avoid them.
In SQL, NULL means an unknown value: two NULLs can be neither equal nor different. See this example:
declare @vA int
declare @vB int
select @vA, @vB, iif(@vA = @vB, 1, 0), iif(@vA <> @vB, 1, 0) -- both IIF expressions return 0
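Wrapped in COALESCE with a sentinel that matches the data type, the comparison yields a definite answer again (continuing the example above; 0 is an arbitrary sentinel here):

declare @vA int
declare @vB int
select iif(coalesce(@vA, 0) = coalesce(@vB, 0), 1, 0) -- returns 1: both sides collapse to the sentinel

The caveat: a genuine value equal to the sentinel (0 here, '' for strings) becomes indistinguishable from NULL, so pick one that cannot occur in the data.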
I've got a Postgres ORDER BY issue with the following table:
em_code name
EM001 AAA
EM999 BBB
EM1000 CCC
To insert a new record into the table, I:
Select the last record with SELECT * FROM employees ORDER BY em_code DESC
Strip the alphabetic prefix from em_code using a regexp and store it in ec_alpha
Cast the remaining part to an integer, ec_num
Increment ec_num by one (ec_num++)
Pad with sufficient zeros and prefix ec_alpha again
When em_code reaches EM1000, the above algorithm fails.
The first step will return EM999 instead of EM1000, and it will again generate EM1000 as the new em_code, breaking the unique key constraint.
Any idea how to select EM1000?
Since Postgres 10, it is possible to specify an ICU collation which will sort columns with numbers naturally.
https://www.postgresql.org/docs/10/collation.html
-- First create a collation with numeric sorting
CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');
-- Alter table to use the collation
ALTER TABLE "employees" ALTER COLUMN "em_code" type TEXT COLLATE numeric;
Now just query as you would otherwise.
SELECT * FROM employees ORDER BY em_code
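If altering the column is not desirable, the same collation (created above) can also be applied per query:

SELECT * FROM employees ORDER BY em_code COLLATE numeric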
On my data, I get results in this order (note that it also sorts foreign numerals):
Value
0
0001
001
1
06
6
13
۱۳
14
One approach you can take is to create a naturalsort function for this. Here's an example, written by Postgres legend RhodiumToad.
create or replace function naturalsort(text)
returns bytea language sql immutable strict as $f$
select string_agg(convert_to(coalesce(r[2], length(length(r[1])::text) || length(r[1])::text || r[1]), 'SQL_ASCII'),'\x00')
from regexp_matches($1, '0*([0-9]+)|([^0-9]+)', 'g') r;
$f$;
Source: http://www.rhodiumtoad.org.uk/junk/naturalsort.sql
To use it simply call the function in your order by:
SELECT * FROM employees ORDER BY naturalsort(em_code) DESC
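Because the function is declared immutable, it can also back an expression index, so repeated sorts don't have to recompute it (a sketch against the employees table from the question):

CREATE INDEX employees_naturalsort_idx ON employees (naturalsort(em_code));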
The reason is that the string sorts alphabetically (instead of numerically, as you would want) and '1' sorts before '9'.
You could solve it like this:
SELECT * FROM employees
ORDER BY substring(em_code, 3)::int DESC;
It would be more efficient to drop the redundant 'EM' from your em_code - if you can - and store a plain integer to begin with.
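A hypothetical migration along those lines (the column name em_num is made up for the sketch):

ALTER TABLE employees ADD COLUMN em_num integer;
UPDATE employees SET em_num = substring(em_code, 3)::int;
-- afterwards generate the display code on the way out, e.g. 'EM' || em_num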
Answer to question in comment
To strip any and all non-digits from a string:
SELECT regexp_replace(em_code, E'\\D','','g')
FROM employees;
\D is the regular expression class-shorthand for "non-digits".
'g' as 4th parameter is the "globally" switch to apply the replacement to every occurrence in the string, not just the first.
After replacing every non-digit with the empty string, only digits remain.
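Putting it together, the next code could be derived in a single statement (a sketch; it assumes every code starts with 'EM' and ignores zero-padding):

SELECT 'EM' || (max(regexp_replace(em_code, E'\\D', '', 'g')::int) + 1) AS next_em_code
FROM employees;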
This keeps coming up in questions and in my own development, and I finally tired of the tricky ways of doing it, so I broke down and implemented it as a PostgreSQL extension:
https://github.com/Bjond/pg_natural_sort_order
It's free to use, MIT license.
Basically it just normalizes the numerics within strings (zero-prepending them) so that you can create an index column for full-speed sorting au naturel. The readme explains.
The advantage is you can have a trigger do the work and not your application code. It will be calculated at machine-speed on the PostgreSQL server and migrations adding columns become simple and fast.
You can use just this line:
ORDER BY length(substring(em_code FROM '[0-9]+')), em_code
I wrote about this in detail in this related question:
Humanized or natural number sorting of mixed word-and-number strings
(I'm posting this answer as a useful cross-reference only, so it's community wiki).
I came up with something slightly different.
The basic idea is to create an array of tuples (integer, string) and then order by these. The magic number 2147483647 is int32_max, used so that strings are sorted after numbers.
ORDER BY ARRAY(
SELECT ROW(
CAST(COALESCE(NULLIF(match[1], ''), '2147483647') AS INTEGER),
match[2]
)
FROM REGEXP_MATCHES(col_to_sort_by, '(\d*)|(\D*)', 'g')
AS match
)
I thought of another way of doing this that uses less DB storage than padding and is faster than calculating on the fly.
https://stackoverflow.com/a/47522040/935122
I've also put it on GitHub
https://github.com/ccsalway/dbNaturalSort
The following solution is a combination of various ideas presented in another question, as well as some ideas from the classic solution:
create function natsort(s text) returns text immutable language sql as $$
select string_agg(r[1] || E'\x01' || lpad(r[2], 20, '0'), '')
from regexp_matches(s, '(\D*)(\d*)', 'g') r;
$$;
The design goals of this function were simplicity and pure string operations (no custom types and no arrays), so it can easily be used as a drop-in solution, and is trivial to index on.
Note: If you expect numbers with more than 20 digits, you'll have to replace the hard-coded maximum length 20 in the function with a suitable larger length. Note that this will directly affect the length of the resulting strings, so don't make that value larger than needed.
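Usage, and the index the drop-in design allows (sketched against the employees table from the question):

SELECT * FROM employees ORDER BY natsort(em_code);
CREATE INDEX employees_natsort_idx ON employees (natsort(em_code));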
I'm having trouble resetting the sequences as automatically as possible.
I'm trying to run the following query from phpPgAdmin:
SELECT SETVAL('course_subjects_seq', (SELECT MAX(subject_id) FROM course_subjects));
Somehow this query returns:
> HINT: No function matches the given name and argument types. You might need to add explicit type casts.
pointing to the first SELECT SETVAL
The next query will give the same error:
SELECT setval("course_subjects_seq", COALESCE((SELECT MAX(subject_id) FROM course_subjects), 1))
Can anyone point me to what I am doing wrong?
Fixed it like this:
The setval function expects regclass, bigint and boolean arguments, so I added the type casts:
SELECT setval('course_subjects_seq'::regclass, COALESCE((SELECT MAX(subject_id) FROM course_subjects)::bigint, 1));
i.e. ::regclass and ::bigint.
You don't need a subquery at all here. Can be a single SELECT:
SELECT setval(pg_get_serial_sequence('course_subjects', 'subject_id')
, COALESCE(max(subject_id) + 1, 1)
, false) -- not called yet
FROM course_subjects;
Assuming subject_id is a serial column, pg_get_serial_sequence() is useful so you don't have to know the sequence name (which is an implementation detail, really).
SELECT with an aggregate function like max() always returns a single row, even if the underlying table has no rows. The value is NULL in this case, that's why you have COALESCE in there.
But if you call setval() with just 1, the next number the sequence returns will be 2, not 1, because that value is considered already used. There is an overloaded variant of setval() with a third, boolean parameter is_called, which makes it possible to actually start from 1 in this case, as demonstrated above.
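An illustration of the difference, assuming the sequence from the question:

SELECT setval('course_subjects_seq', 1);        -- nextval() will return 2
SELECT setval('course_subjects_seq', 1, false); -- nextval() will return 1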
Related:
How to reset postgres' primary key sequence when it falls out of sync?
My web-application allows for token replacements and therefore my SQL INSERT query looks something like this:
INSERT INTO mytable (Col1, Col2, Col3)
VALUES ('[Col1Value]', '[Col2Value]', '[Col3Value]')
The web app replaces what's inside the brackets, e.g. [Col1Value], with the input entered into the form field.
The problem is that when an input field is left empty, this query still inserts '' (just an empty string), so the field is not considered NULL.
I'm trying to use SQL Server's default value/binding so that every column that is NULL gets a default value of '--'. But because my query inserts a blank '' instead of NULL, the default is never applied and the column stays blank rather than showing my desired default of '--'.
Any ideas how I can make sure '' (or ' ') is inserted as NULL rather than as a blank or empty string, so that SQL Server replaces NULL with my default value of '--'?
There is no easy way around this...
How are you inserting the values? If you build these statements as literal strings, you are stumbling into the dangerous territory of SQL injection... Use parameters!
One approach might be an insert through a stored procedure, another an INSTEAD OF trigger, and a third uses the fact that LEN() does not count trailing blanks:
SELECT LEN('') --returns 0
,LEN(' ') --returns 0 too
You can use this in an expression like this:
CASE WHEN LEN(@YourInputValue) = 0 THEN NULL ELSE @YourInputValue END
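Applied to the INSERT from the question, that expression might look like this (a sketch; @Col1Value etc. are assumed parameter names):

INSERT INTO mytable (Col1, Col2, Col3)
VALUES (CASE WHEN LEN(@Col1Value) = 0 THEN NULL ELSE @Col1Value END,
        CASE WHEN LEN(@Col2Value) = 0 THEN NULL ELSE @Col2Value END,
        CASE WHEN LEN(@Col3Value) = 0 THEN NULL ELSE @Col3Value END);

NULLIF(LTRIM(RTRIM(@Col1Value)), '') would be an equivalent shorthand for the same normalization.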
I have three different values in my database that represent a null: an actual null, an empty string, and a string {x:Null}. This value appears across multiple columns.
{x:Null} is normalized on the web front-end, so all these values look exactly the same although they end up ordered differently in a sort. How can I write a query that will take these values and make them actual nulls across every column and every table?
Bonus points if you can tell me how to make sure these other empty values are always inserted as nulls going forward. (Disclaimer: I have no power to grant any actual bonus points. ;)
You can query the information_schema to get a list of all tables and columns with a string type.
SELECT table_name, column_name
FROM information_schema.columns
WHERE data_type IN ('text', 'character', 'character varying')
NOTE: double-check first what values data_type actually contains; I'm not sure whether it will be 'character', 'char', or something else.
Then I would write a small program to update each column in each table. Here it is sketched out in Perl.
while( my($table, $column) = $sth->fetch ) {
    # quote_identifier() escapes table/column names safely (quote() is for values)
    my $q_table  = $dbh->quote_identifier($table);
    my $q_column = $dbh->quote_identifier($column);

    # qq[] (not q[]) so the quoted identifiers interpolate into the SQL
    $dbh->do(qq[
        UPDATE $q_table
        SET    $q_column = NULL
        WHERE  $q_column = '{x:Null}'
           OR  $q_column = ''
    ]);
}
Be sure to SQL escape $table and $column as in my sample.
Going forward, you'll have to add CHECK constraints on each and every column. You can use information_schema.columns to do this as well. Something like:
ALTER TABLE $q_table ADD CHECK ($q_column NOT IN ('{x:Null}', ''))
You could use a trigger to change the values to NULL, but I don't like data stores that silently change basic data for application purposes.
For new columns and tables, you'll have to remember to add that constraint. Same caveats about data_type apply.
However, it's probably a bad idea to say that no column can ever contain an empty string. You might want to be a bit more selective.
Another thing to note: NULL is a funny thing; it's not true and it's not false. You might be better off deciding that the empty string is the value to set empty inputs to.
I don't think this approach is maintainable. It scribbles an application rule all over the data layer. What if you have some data that doesn't follow that rule? And it will have to be maintained continuously as new schema is added. Perhaps you should handle this at your ORM layer instead, or write a few stored procedures to take care of it.
Using the information_schema.columns table, write a procedural-language routine which iterates through all applicable tables and columns, executing UPDATE ... SET column = NULL WHERE column IN ('', '{x:Null}') for each eligible column.
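For illustration, a minimal PL/pgSQL sketch of such a routine (assuming PostgreSQL and that only text-ish columns in the public schema need cleaning):

DO $$
DECLARE
    rec record;
BEGIN
    FOR rec IN
        SELECT table_schema, table_name, column_name
        FROM   information_schema.columns
        WHERE  data_type IN ('text', 'character varying', 'character')
        AND    table_schema = 'public'
    LOOP
        -- build and run the UPDATE dynamically; %I safely quotes identifiers
        EXECUTE format(
            'UPDATE %I.%I SET %I = NULL WHERE %I IN ('''', ''{x:Null}'')',
            rec.table_schema, rec.table_name, rec.column_name, rec.column_name
        );
    END LOOP;
END
$$;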
As for inserting these values as NULL going forward, you would have to set triggers on your tables to intercept these values and replace them with NULL.
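A per-table trigger for the going-forward part might be sketched like this (some_table and some_column are placeholders; EXECUTE FUNCTION needs PostgreSQL 11+, older versions use EXECUTE PROCEDURE):

CREATE OR REPLACE FUNCTION nullify_pseudo_nulls() RETURNS trigger AS $$
BEGIN
    -- repeat this block for each text column that needs normalizing
    IF NEW.some_column IN ('', '{x:Null}') THEN
        NEW.some_column := NULL;
    END IF;
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER normalize_nulls
BEFORE INSERT OR UPDATE ON some_table
FOR EACH ROW EXECUTE FUNCTION nullify_pseudo_nulls();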
I don't think there is any query that would do this thing for every table and every column. In principle, what you want to do is
UPDATE table SET column=NULL WHERE column='' OR column='{x:Null}';
You could try selecting from the pg_attribute and pg_class catalogs to get the table and column names and then generate the queries automatically. Be sure to select only columns that contain textual data.
What if somebody has entered a genuine string '{x:Null}'? You would then change it into NULL.
However, you made a real mistake by letting the situation get as bad as it currently is. You should always normalize data before putting it into a database.
I'm not using a full-on DB abstraction library; I'm using raw SQL templates in psycopg2 that look like this:
SELECT id FROM table WHERE message = %(message)s ;
The ideal query to retrieve my intended results looks something like this :
SELECT id FROM table WHERE message = 'a3cbb207' ;
SELECT id FROM table WHERE message IS NULL ;
Unfortunately... the obvious problem is that my NULL comparisons come out looking like this:
SELECT id FROM table WHERE message = NULL ;
... which is not the correct comparison - and doesn't give me the intended result set.
My actual queries are much more complex than the illustration above, so I can't change them easily. (Which would be the correct solution, I agree; I'm looking for an emergency fix right now.)
Does anyone know of a workaround so I can keep the same single templates going until a proper fix is in place? I was trying to get COALESCE and/or CAST to work, but I struck out with my attempts.
What you want is IS NOT DISTINCT FROM.
SELECT id FROM table WHERE message IS NOT DISTINCT FROM 'the text';
SELECT id FROM table WHERE message IS NOT DISTINCT FROM NULL;
NULL IS NOT DISTINCT FROM NULL is true, not NULL, so it's like = but with different NULL comparison semantics. Great in trigger functions.
AFAIK IS NOT DISTINCT FROM can't use an index for lookups though, so be careful there. It can be better to use separate tests for the NULL case and the value case.
You can try writing your query clause as follows:
WHERE message = %(message)s OR (%(message)s IS NULL AND message IS NULL)
It's a bit rough, but it means "give me the rows whose message matches my parameter, or all the rows whose message is NULL if my parameter is NULL". It should do the trick.
Unfortunately, NULL does not actually equal anything (not even another NULL) as the value of NULL is intended to represent an unknown. Your best bet is to change your templates to handle this correctly.
If it's possible for you to pass in separate values for the left and right operands in your template, one way to still use an equals sign would be:
SELECT id FROM table WHERE true = (message is null);