Why does Postgres handle NULLs inconsistently where unique constraints are involved? - postgresql

I recently noticed an inconsistency in how Postgres handles NULLs in columns with a unique constraint.
Consider a table of people:
create table People (
pid int not null,
name text not null,
SSN text unique,
primary key (pid)
);
The SSN column should be kept unique. We can check that:
-- Add a row.
insert into People(pid, name, SSN)
values(0, 'Bob', '123');
-- Test the unique constraint.
insert into People(pid, name, SSN)
values(1, 'Carol', '123');
The second insert fails because it violates the unique constraint on SSN. So far, so good. But let's try a NULL:
insert into People(pid, name, SSN)
values(1, 'Carol', null);
That works.
select *
from People;
0;"Bob";"123"
1;"Carol";"<NULL>"
A unique column will take a null. Interesting. How can Postgres assert that null is in any way unique, or not unique for that matter?
I wonder if I can add two rows with null in a unique column.
insert into People(pid, name, SSN)
values(2, 'Ted', null);
select *
from People;
0;"Bob";"123"
1;"Carol";"<NULL>"
2;"Ted";"<NULL>"
Yes I can. Now there are two rows with NULL in the SSN column even though SSN is supposed to be unique.
The Postgres documentation says, For the purpose of a unique constraint, null values are not considered equal.
Okay. I can see the point of this. It's a nice subtlety in null-handling: By considering all NULLs in a unique-constrained column to be disjoint, we delay the unique constraint enforcement until there is an actual non-null value on which to base that enforcement.
That's pretty cool. But here's where Postgres loses me. If all NULLs in a unique-constrained column are not equal, as the documentation says, then we should see all of the nulls in a select distinct query.
select distinct SSN
from People;
"<NULL>"
"123"
Nope. There's only a single null there. It seems like Postgres has this wrong. But I wonder: Is there another explanation?
Edit:
The Postgres docs do specify that "Null values are considered equal in this comparison." in the section on SELECT DISTINCT. While I do not understand that notion, I'm glad it's spelled out in the docs.

It is almost always a mistake when dealing with null to say:
"nulls behave like so-and-so here, *so they should behave like
such-and-such here"
Here is an excellent essay on the subject from a postgres perspective. Briefly summed up by saying nulls are treated differently depending on the context and don't make the mistake of making any assumptions about them.

The bottom line is, PostgreSQL does what it does with nulls because the SQL standard says so.
Nulls are obviously tricky and can be interpreted in multiple ways (unknown value, absent value, etc.), and so when the SQL standard was initially written, the authors had to make some calls at certain places. I'd say time has proved them more or less right, but that doesn't mean that there couldn't be another database language that handles unknown and absent values slightly (or wildly) differently. But PostgreSQL implements SQL, so that's that.
As was already mentioned in a different answer, Jeff Davis has written some good articles and presentations on dealing with nulls.

NULL is considered to be unique because NULL doesn't represent the absence of a value. A NULL in a column is an unknown value. When you compare two unknowns, you don't know whether or not they are equal because you don't know what they are.
Imagine that you have two boxes marked A and B. If you don't open the boxes and you can't see inside, you never know what the contents are. If you're asked "Are the contents of these two boxes the same?" you can only answer "I don't know".
In this case, PostgreSQL will do the same thing. When asked to compare two NULLs, it says "I don't know." This has a lot to do with the crazy semantics around NULL in SQL databases. The article linked to in the other answer is an excellent starting point to understanding how NULLs behave. Just beware: it varies by vendor.

Multiple NULL values in a unique index are okay because x = NULL is false for all x and, in particular, when x is itself NULL. You'll also run into this behavior in WHERE clauses where you have to say WHERE x IS NULL and WHERE x IS NOT NULL rather than WHERE x = NULL and WHERE x <> NULL.

Related

Create unique integer id column for result rows of union query

I have a view as below in which I union several tables and I'm thinking it might be a good idea to have a unique row number for each row in the result set. The prescient reason is I have an admin tool which doesn't know I'm using a view rather than an ordinary table, and which expects a unique id to be present, but I'm now speculating it might be worth doing more generally (i.e. it may make sense to do this in certain theoretical terms - discussion on this would be welcome). Wondering how to do this in postgresql.
CREATE VIEW subscriptions AS (
SELECT subscriber_id, course, end_at
FROM subscriptions_individual_stripe
UNION ALL SELECT subscriber_id, course, end_at
FROM subscriptions_individual_bank_transfer
ORDER BY end_at DESC);
Discussion
The reason these are separate tables is of course that they are actually different entities, and yet I also need to be able to contemplate them in a combined way, hence the VIEW. This is my way of avoiding so-called 'polymorphic relationships' in certain popular web frameworks.
I have a tool that expects an id and while my first thought was that views don't need a unique key, on the other hand, maybe they do...?
Reason being two records could exist in one of the UNIONed tables which were only unique by virtue of the primary key. If one does not include the primary key, the union should remove one of those, so a record would be lost. Should we also take that into account, i.e. select the primary key (here an integer id) for each of the UNIONed tables, but, "convert it" to some other unique id, so the view has its own unique integer primary key? Of course this won't be usable in terms of referencing anything in the original UNIONed tables, but I'm OK with that (The view is a terminal point of my analysis, I don't intend to do anything further with it, and of course it is not writable).
Update
I'm accepting S-Man's answer below because it is a solution to the question I asked, however, as pointed out, the row_number() must not be treated as if it was a real identifier because it will not be.
So as an important aside, I'm left wondering what row_number() is really intended for then. Perhaps it's (mainly? occasionally?) useful where you want to output some query when you plan to export the data somewhere else (i.e. seems almost spreadsheet-ish), and you abandon any sense of it being integrated with the rest of your database?
Table inheritance may be better as Abelisto has pointed out in the comments.
You can add a row count to the UNION using the row_number() window function:
demo:db<>fiddle
CREATE VIEW v_myview AS
SELECT
row_number() OVER (ORDER BY ...) AS id,
*
FROM (
SELECT ...
UNION
SELECT ...
) AS foo;
The main problem with this is: You should never deal with this id as an real identifier because the data of the table can change. So it could be that one table today generates a few records more than yesterday. So, the generated row numbers wouldn't match to the same record as before.
Edit: Removed the md5 solution I added before because of some problems with uniqueness on same data.

What to do with null values when modeling and normalizing?

I have to create a database for a venue.
A client books a room for an event. The problem is that the clients don't always provide their name, their email, and their phone number. Most of the time it's either name and email or name and phone. It's rarely all 3 but it happens.
I need to store each of these in their respective attribute (name, email, phone). But the way they give me their info, I have a lot of null values.
What can I do with these nulls? I've been told that it's better to not have nulls. I also need to normalize my table after that.
SQL treats NULL specially per its version of 3VL (3-valued logic). Normalization & other relational theory does not. However, we can translate SQL designs into relational designs and back. (Assume no duplicate rows here.)
Normalization happens to relations and is defined in terms of operators that don't treat NULL specially. The term "normalization" has two most common distinct meanings: putting a table into "1NF" and into "higher NFs (normal forms)". NULL doesn't affect "normalization to 1NF". "Normalization to higher NFs" replaces a table by smaller tables that natural join back to it. For purposes of normalization you could treat NULL like a value that is allowed in the domain of a nullable column in addition to the values of its SQL type. If our SQL tables have no NULLs then we can interpret them as relations & SQL join etc as join, etc. But if you decompose where a nullable column was shared between components then realize that to reconstruct the original in SQL you have to SQL join on same-named columns being equal or both NULL. And you won't want such CKs (candidate keys) in an SQL database. Eg you can't declare it as an SQL PK (primary key) because that means UNIQUE NOT NULL. Eg a UNIQUE constraint involving a nullable column allows multiple rows that have a NULL in that column, even if the rows have the same values in every column. Eg NULLs in SQL FKs cause them to be satisfied (in various ways per MATCH mode), not to fail from not appearing in the referenced table. (But DBMSs idiosyncratically differ from standard SQL.)
Unfortunately decomposition might lead to a table with all CKs containing NULL, so that we have nothing to declare as SQL PK or UNIQUE NOT NULL. The only sure solution is to convert to a NULL-free design. After then normalizing we might want to reintroduce some nullability in the components.
In practice, we manage to design tables so that there is always a set of NULL-free columns that we can declare as CK, via SQL PK or UNIQUE NOT NULL. Then we can get rid of a nullable column by dropping it from the table and adding a table with that column and the columns of some NULL-free CK: If the column is non-NULL for a row in the old design then a row with its CK subrow and column value go in the added table; otherwise it is NULL in the old design and no corresponding row is in the added table. (The original table is a natural left join of the new ones.) Of course, we also have to modify queries from the old design to the new design.
We can always avoid NULLs via a design that adds a boolean column for each old nullable column and has the old column NOT NULL. The new column says for a row whether the old column was NULL in the old design and when true has the old column be some one value that we pick for that purpose for that type throughout the database. Of course, we also have to modify queries from the old design to the new design.
Whether you want to avoid NULL is a separate question. Your database might in some way be "better" or "worse" for your application with either design. The idea behind avoiding NULL is that it complicates the meanings of queries, hence complicates querying, in a perverse way, compared to the complication of more joins from more NULL-free tables. (That perversity is typically managed by removing NULLs in query expressions as close to where they appear as possible.)
PS Many SQL terms including PK & FK differ from the relational terms. SQL PK means something more like superkey; SQL FK means something more like foreign superkey; but it doesn't even make sense to talk about a "superkey" in SQL:
Because of the resemblance of SQL tables to relations, terms that involve relations get sloppily applied to tables. But although you can borrow terms and give them SQL meanings--value, table, FD (functional dependency), superkey, CK (candidate key), PK (primary key), FK (foreign key), join, and, predicate, NF (normal form), normalize, 1NF, etc--you can't just substitute those SQL meanings for those words in RM definitions, theorems or algorithms and get something sensible or true. Moreover SQL presentations of RM notions almost never actually tell you how to soundly apply RM notions to an SQL database. They just parrot RM presentations, oblivious to whether their use of SQL meanings for terms makes things nonsensical or invalid.
First of all there is nothing wrong with nulls in a database. And they are made exactly for this purpose where attributes are unknown. To avoid nulls in a database is an advice that makes little sense in my opinion.
So you'd have three (or four) values - name (first/last), email address, and phone number - identifying a client. You can have them in a table and add a constraint to it assuring that always at least one of these columns is filled, e.g. coalesce(name, email, phone) is not null. This makes sure a booking cannot be done completely anonymously.
From your explanation it is not clear whether you will always have the same information from a client. So can it happen that a client books a room giving their name and later they book another room giving their phone instead? Or will the client be looked up in the database, their name found and the two booking assigned to them? In the latter case you can have a clients table holding all information you got so far, and the booking will contain the client record ID as a reference to this data. In the former case you may not want to have a clients table, because you cannot identify whether two clients (Jane Miller and mrsx#gmail.com) are really two different clients or only one client actually.
The tables I see so far:
room (room_id, ...)
venue (venue_id, ...)
client (client_id, name, email, phone)
booking (venue_id, room_id, client_id, ...)

Prevent non-collation characters in a NVarChar column using constraint?

Little weird requirement, but here it goes. We have a CustomerId VarChar(25) column in a table. We need to make it NVarChar(25) to work around issues with type conversions.
CHARINDEX vs LIKE search gives very different performance, why?
But, we don't want to allow non-latin characters to be stored in this column. Is there any way to place such a constraint on column? I'd rather let database handle this check. In general we OK with NVarChar for all of our strings, but some columns like ID's is not a good candidates for this because of possibility of look alike strings from different languages
Example:
CustomerId NVarChar(1) - PK
Value 1: BOPOH
Value 2: ВОРОН
Those 2 strings different (second one is Cyrillic)
I want to prevent this entry scenario. I want to make sure Value 2 can not be saved into the field.
Just in case it helps somebody. Not sure it's most "elegant" solution but I placed constraint like this on those fields:
ALTER TABLE [dbo].[Carrier] WITH CHECK ADD CONSTRAINT [CK_Carrier_CarrierId] CHECK ((CONVERT([varchar](25),[CarrierId],(0))=[CarrierId]))
GO

SQLlite3 Update statment has no effect

I'm no SQL expert, but have a bug in an iPhone app where and UPDATE statement has no effect on the db.
I have been using the SQLlite manger plugin for FireFox to try dbug by repeatedly amending and running the UPDATE on the db. I also ran the statement thorough and SQL Validator which said it complied to the core SQL standard.
Can you spot anything wrong with the statement given below?
UPDATE sections
SET
title = 'What is acne ? ABC',
text = 'Pus on your face',
created = '2010-03-10 18:46:55',
modified = '2011-07-04 17:38:44',
position = 1,
condition_id = 4
WHERE id = 10;
There is some confusion and inconsistency in the way SQLite and the various implementations by Mozilla, Google, Adobe, and others handle numeric primary keys in databases whose tables were created outside of these implementations and where the primary keys were defined as an integer type but not as "INTEGER" [verbatim] -- that is, they were defined as INT or INT16 or INT32 etc.
INTEGER PRIMARY KEY in mothership SQLite is an alias for the rowid.
INT PRIMARY KEY in mothership SQLite is not an alias for the rowid.
A consortium member (or any implementor) may or may not follow this rule. (SQLite is in the public domain, of course.)
See section 2.0 here: http://www.sqlite.org/datatypes.html
and see section on RowId and Primary Key here: http://www.sqlite.org/lang_createtable.html#rowid
A PRIMARY KEY column only becomes an integer primary key if the
declared type name is exactly "INTEGER". Other integer type names like
"INT" or "BIGINT" or "SHORT INTEGER" or "UNSIGNED INTEGER" causes the
primary key column to behave as an ordinary table column with integer
affinity and a unique index, not as an alias for the rowid. [emphasis
added]
An implementor who does not follow the rule might not have even been aware that they were breaking the rule in the first place, since it is a "gotcha" arcane sort of rule. Anyway, what this means practically is that one implementation may treat a supplied value as an alias for the rowid and another implementation might not. If given the value 10, one might retrieve the tuple whose rowid = 10 and one might retrieve the tuple where the specified column's value = 10. This, of course, leads to spurious results in queries -- and they might look like perfectly good and plausible results but they are dead wrong.
Consider the following simple test: using flagship SQLite's utilities, not those provided by one of the implementors, execute the following DDL and DML statements; then, in your implementation, open the database and execute the DML statements again to compare the DML results:
CREATE TABLE TEST
("id" INT PRIMARY KEY, "name" text) -- ** NOTE "INT" not "INTEGER"
INSERT INTO TEST
(id, name)
VALUES
(7,'seven')
** *** N.B. THE ROWID OF THE ROW INSERTED ABOVE = 1 *** **
select rowid, id, name from test
result: 1 | 7 | seven
select * from TEST
result: 7 | seven
select * from TEST where id = 7
result: ????? [ymmv]
select * from TEST where id = 1
result: ????? [ymmv]
Depending on how the specific implementation treats an INT primary key the third select statement above (select * from TEST where id = 7) may return one row or it may return nothing!
If the implementation treats the INT PK as an alias for the row id, well, there is no row whose rowid = 7, and so it will return nothing. If the implementation treats the INT PK as a normal value, it will find the row.
Now, if you were to insert more rows into table TEST, you would eventually create a row whose rowid = 7. In one of these wayward implementations, when you use this where-clause -- where id = 7 --- you might think you were addressing the tuple whose id=7, but you'd actually be addressing the tuple whose rowid=7. You would get the wrong tuple and you might not realize it. Consider the possibilities when joining a child table to a parent table: the child table contains foreign key value of 7. What tuple does an inner join return from the parent table? It depends on whether the implementation honors the distinction between INT and INTEGER primary keys.
Last year, I documented this thoroughly for Adobe AIR, BTW, and also reported it on the SQLite news group. It is possible that some implementations have changed the behavior in the interim.
When creating SQLite tables, it is best to use INTEGER [verbatim] for primary keys, not any of the other recognized int types.
If your query is correct then You need to make sure about 2 things.
Have you written finalize_statement like this ?
sqlite3_finalize(selectStatement);
2.If you are testing in simulator. Are you sure you are checking database update in following path ?
/user/Libary/Application Support/iPhone Simulator/Your_Version_Number/Applications/YOUR_APPLICATION_GUID/Documents
Hope this help.
The only thing in the query that strikes me as worthy of further investigation is the question mark in the value for [title]. Remove it and see if that changes anything. Maybe it's being incorrectly parsed somewhere along the way as a parameter placeholder.

In postgresql: Clarification on "CONSTRAINT foo_key PRIMARY KEY (foo)"

Sorry if this is a dead simple question but I'm confused from the documentation and I'm not getting any clear answers from searching the web.
If I have the following table schema:
CREATE TABLE footable
(
foo character varying(10) NOT NULL,
bar timestamp without time zone,
CONSTRAINT pk_foo PRIMARY KEY (foo)
);
and then use the query:
SELECT bar FROM footable WHERE foo = '1234567890';
Will the select query find the given row by searching an index or not? In other word: does the table have a primary key (which is foo) or not?
Just to get it clear. I'm used to specifying "PRIMARY KEY" after the column I'm specifying like this:
"...foo character varying(10) PRIMARY KEY, ..."
Does it change anything?
Why not look at the query plan and find out yourself? The query plan will tell you exactly what indexes are being used, so you don't have to guess. Here's how to do it:
http://www.postgresql.org/docs/current/static/sql-explain.html
But in general, it should use the index in this case since you specified the primary key in the where clause and you didn't use something that could prevent it from using it (a LIKE, for example).
It's always best to look at the query plan to verify it for sure, then there's no doubt.
In both cases, the primary key can be used but it depends. The optimizer will make a choice depending on the amount of data, the statistics, etc.
Naming the constraint can make debugging and error handling easier, you know what constraint is violated. Without a name, it can be confusing.