What to do with null values when modeling and normalizing? - postgresql

I have to create a database for a venue.
A client books a room for an event. The problem is that the clients don't always provide their name, their email, and their phone number. Most of the time it's either name and email or name and phone. It's rarely all 3 but it happens.
I need to store each of these in their respective attribute (name, email, phone). But the way they give me their info, I have a lot of null values.
What can I do with these nulls? I've been told that it's better to not have nulls. I also need to normalize my table after that.

SQL treats NULL specially per its version of 3VL (3-valued logic). Normalization & other relational theory does not. However, we can translate SQL designs into relational designs and back. (Assume no duplicate rows here.)
Normalization happens to relations and is defined in terms of operators that don't treat NULL specially. The term "normalization" has two common distinct meanings: putting a table into "1NF" and putting it into "higher NFs (normal forms)". NULL doesn't affect "normalization to 1NF". "Normalization to higher NFs" replaces a table by smaller tables that natural join back to it. For purposes of normalization you can treat NULL like a value that is allowed in the domain of a nullable column in addition to the values of its SQL type. If our SQL tables have no NULLs, then we can interpret them as relations, SQL join as relational join, and so on.
But if you decompose where a nullable column was shared between components, then realize that to reconstruct the original in SQL you have to join on same-named columns being equal or both NULL. And you won't want such CKs (candidate keys) in an SQL database. E.g. you can't declare one as an SQL PK (primary key), because that means UNIQUE NOT NULL. E.g. a UNIQUE constraint involving a nullable column allows multiple rows that have a NULL in that column, even if the rows have the same values in every column. E.g. NULLs in SQL FKs cause them to be satisfied (in various ways per MATCH mode), not to fail from not appearing in the referenced table. (But DBMSs idiosyncratically differ from standard SQL.)
Unfortunately decomposition might lead to a table with all CKs containing NULL, so that we have nothing to declare as SQL PK or UNIQUE NOT NULL. The only sure solution is to convert to a NULL-free design. After then normalizing we might want to reintroduce some nullability in the components.
In practice, we manage to design tables so that there is always a set of NULL-free columns that we can declare as a CK, via SQL PK or UNIQUE NOT NULL. Then we can get rid of a nullable column by dropping it from the table and adding a table with that column and the columns of some NULL-free CK: if the column is non-NULL for a row in the old design, then a row with its CK subrow and column value goes into the added table; otherwise it is NULL in the old design and no corresponding row is in the added table. (The original table is the natural left join of the new ones.) Of course, we also have to modify queries from the old design to the new design.
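A minimal sketch of that kind of decomposition, with illustrative names for the asker's situation (a client table whose phone column was nullable):
create table client (
    client_id integer primary key,
    name      text not null
);
create table client_phone (
    client_id integer primary key references client (client_id),
    phone     text not null
);
-- The old table (with its nullable phone) is the natural left join of the new ones:
select c.client_id, c.name, p.phone
from client c
left join client_phone p using (client_id);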
We can always avoid NULLs via a design that adds a boolean column for each old nullable column and makes the old column NOT NULL. The new column says, for a row, whether the old column was NULL in the old design; when it is true, the old column holds some single value that we pick for that purpose for that type throughout the database. Of course, we also have to modify queries from the old design to the new design.
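A sketch of that flag-column variant, again with illustrative names and the empty string as the arbitrary value picked for text columns:
create table client (
    client_id     integer primary key,
    name          text    not null,
    phone         text    not null default '',   -- holds the agreed placeholder value when unknown
    phone_missing boolean not null default true  -- true means phone was NULL in the old design
);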
Whether you want to avoid NULL is a separate question. Your database might in some way be "better" or "worse" for your application with either design. The idea behind avoiding NULL is that it complicates the meanings of queries, hence complicates querying, in a perverse way, compared to the complication of more joins from more NULL-free tables. (That perversity is typically managed by removing NULLs in query expressions as close to where they appear as possible.)
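For example, using the tables from the first sketch above, removing the NULL as close to where it appears as possible might look like this (the 'unknown' placeholder is purely illustrative):
select c.client_id, c.name, coalesce(p.phone, 'unknown') as phone
from client c
left join client_phone p using (client_id);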
PS Many SQL terms including PK & FK differ from the relational terms. SQL PK means something more like superkey; SQL FK means something more like foreign superkey; but it doesn't even make sense to talk about a "superkey" in SQL:
Because of the resemblance of SQL tables to relations, terms that involve relations get sloppily applied to tables. But although you can borrow terms and give them SQL meanings--value, table, FD (functional dependency), superkey, CK (candidate key), PK (primary key), FK (foreign key), join, and, predicate, NF (normal form), normalize, 1NF, etc--you can't just substitute those SQL meanings for those words in RM definitions, theorems or algorithms and get something sensible or true. Moreover SQL presentations of RM notions almost never actually tell you how to soundly apply RM notions to an SQL database. They just parrot RM presentations, oblivious to whether their use of SQL meanings for terms makes things nonsensical or invalid.

First of all, there is nothing wrong with nulls in a database; they exist exactly for this purpose, for when attribute values are unknown. Advice to avoid nulls in a database makes little sense, in my opinion.
So you'd have three (or four) values - name (first/last), email address, and phone number - identifying a client. You can have them in a table and add a constraint ensuring that at least one of these columns is always filled, e.g. coalesce(name, email, phone) is not null. This makes sure a booking cannot be made completely anonymously.
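A minimal sketch of that constraint, assuming text columns on a client table as outlined below (it could equally sit on the booking table if the contact details are stored there):
alter table client
    add constraint client_contact_given
    check (coalesce(name, email, phone) is not null);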
From your explanation it is not clear whether you will always get the same information from a client. Can it happen that a client books a room giving their name, and later books another room giving their phone instead? Or will the client be looked up in the database, their name found, and the two bookings assigned to them? In the latter case you can have a clients table holding all the information you have gathered so far, and the booking will contain the client record ID as a reference to this data. In the former case you may not want a clients table, because you cannot tell whether two clients (Jane Miller and mrsx@gmail.com) are really two different clients or actually just one.
The tables I see so far:
room (room_id, ...)
venue (venue_id, ...)
client (client_id, name, email, phone)
booking (venue_id, room_id, client_id, ...)

Related

Create unique integer id column for result rows of union query

I have a view, shown below, in which I union several tables, and I'm thinking it might be a good idea to have a unique row number for each row in the result set. The immediate reason is that I have an admin tool which doesn't know I'm using a view rather than an ordinary table, and which expects a unique id to be present; but I'm now speculating it might be worth doing more generally (i.e. it may make sense to do this in certain theoretical terms - discussion on this would be welcome). I'm wondering how to do this in PostgreSQL.
CREATE VIEW subscriptions AS (
    SELECT subscriber_id, course, end_at
    FROM subscriptions_individual_stripe
    UNION ALL
    SELECT subscriber_id, course, end_at
    FROM subscriptions_individual_bank_transfer
    ORDER BY end_at DESC
);
Discussion
The reason these are separate tables is of course that they are actually different entities, and yet I also need to be able to contemplate them in a combined way, hence the VIEW. This is my way of avoiding so-called 'polymorphic relationships' in certain popular web frameworks.
I have a tool that expects an id, and while my first thought was that views don't need a unique key, on the other hand, maybe they do...?
The reason being that two records could exist in one of the UNIONed tables which are only unique by virtue of the primary key. If one does not include the primary key, a plain UNION would remove one of those, so a record would be lost. Should we also take that into account, i.e. select the primary key (here an integer id) for each of the UNIONed tables, but "convert it" to some other unique id, so the view has its own unique integer primary key? Of course this won't be usable for referencing anything in the original UNIONed tables, but I'm OK with that (the view is a terminal point of my analysis; I don't intend to do anything further with it, and of course it is not writable).
Update
I'm accepting S-Man's answer below because it is a solution to the question I asked; however, as pointed out, the row_number() must not be treated as if it were a real identifier, because it is not.
So as an important aside, I'm left wondering what row_number() is really intended for. Perhaps it's (mainly? occasionally?) useful when you want to output some query result to export the data somewhere else (i.e. something almost spreadsheet-ish), and you abandon any sense of it being integrated with the rest of your database?
Table inheritance may be better as Abelisto has pointed out in the comments.
You can add a row number to the UNION using the row_number() window function:
demo:db<>fiddle
CREATE VIEW v_myview AS
SELECT
    row_number() OVER (ORDER BY ...) AS id,
    *
FROM (
    SELECT ...
    UNION
    SELECT ...
) AS foo;
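Applied to the subscriptions view from the question (keeping its UNION ALL so duplicate rows across the two tables are not collapsed), a sketch might look like this:
CREATE VIEW subscriptions AS
SELECT
    row_number() OVER (ORDER BY end_at DESC) AS id,
    *
FROM (
    SELECT subscriber_id, course, end_at
    FROM subscriptions_individual_stripe
    UNION ALL
    SELECT subscriber_id, course, end_at
    FROM subscriptions_individual_bank_transfer
) AS foo;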
The main problem with this is: you should never treat this id as a real identifier, because the data in the table can change. It could be that one table generates a few more records today than it did yesterday, so the generated row numbers would no longer map to the same records as before.
Edit: Removed the md5 solution I added before because of some problems with uniqueness on same data.

Way to migrate a create table with sequence from postgres to DB2

I need to migrate a DDL from Postgres to DB2, and it needs to work the same way as in Postgres. There is a table that generates values from a sequence, but the values can also be given explicitly.
Postgres
create sequence hist_id_seq;
create table benchmarksql.history (
hist_id integer not null default nextval('hist_id_seq') primary key,
h_c_id integer,
h_c_d_id integer,
h_c_w_id integer,
h_d_id integer,
h_w_id integer,
h_date timestamp,
h_amount decimal(6,2),
h_data varchar(24)
);
(Note the sequence call in the default of the hist_id column, which defines the value of the primary key.)
The business logic inserts into the table by explicitly providing an ID, and in other cases, it leaves the database to choose the number.
If I change this in DB2 to GENERATED ALWAYS, it will throw errors because some values are provided explicitly. On the other hand, if I create the table with GENERATED BY DEFAULT, DB2 will throw an error (SQL0803N) when an explicitly inserted value collides with one the identity generates, because the "internal sequence" does not take the already inserted values into account and does not retry with the next value.
And I do not want to restart the sequence each time a provided ID is inserted.
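For reference, a sketch of what the GENERATED BY DEFAULT variant described above might look like in DB2 (illustrative, not the project's actual DDL):
create table benchmarksql.history (
    hist_id integer not null generated by default as identity,
    h_c_id integer,
    h_data varchar(24),
    -- remaining columns as in the Postgres DDL above
    primary key (hist_id)
);
-- Inserts that omit hist_id use the identity; explicit hist_id values are accepted,
-- but the identity counter does not skip them, so a later default insert can hit
-- an already-used value and fail with SQL0803N.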
This is the problem in BenchmarkSQL when trying to port it to DB2: https://sourceforge.net/projects/benchmarksql/ (File sqlTableCreates)
How can I implement the same database logic in DB2 as it does in Postgres (and apparently in Oracle)?
You're operating under a misconception: that sources external to the db get to dictate its internal keys. Ideally/conceptually, autogenerated ids will never need to be seen outside of the db, as conceptually there should be unique natural keys for export or reporting. Still, there are times when applications will need to manage some ids, often when setting up related entities (eg, JPA seems to want to work this way).
However, if you add an id value that you generated from a different source, the db won't be able to manage it. How could it? It's not efficient; for one thing, attempting to do so would do one of the following:
Be unsafe in the face of multiple clients (attempt to add duplicate keys)
Serialize access to the table (for a potentially slow query, too)
(This usually shows up when people attempt something like SELECT MAX(id) + 1, which would require locking the entire table for thread safety, likely including statements that don't even touch that column. Trying to find the first unused id - filling gaps - gets even more complicated and problematic.)
Neither is ideal, so it's best not to have the problem in the first place. This is usually done by having id columns be autogenerated, but (as pointed out earlier) there are situations where we may need to know what the id will be before we insert the row into the table. Fortunately, there's a standard SQL object for this, SEQUENCE. This provides a db-managed, thread-safe, fast way to get ids. It appears that in PostgreSQL you can use sequences in the DEFAULT clause for a column, but DB2 doesn't allow it. If you don't want to specify an id every time (it should be autogenerated some of the time), you'll need another way; this is the perfect time to use a BEFORE INSERT trigger:
CREATE TRIGGER Add_Generated_Id
NO CASCADE BEFORE INSERT ON benchmarksql.history
REFERENCING NEW AS Incoming_Entity
FOR EACH ROW
WHEN (Incoming_Entity.hist_id IS NULL)
SET Incoming_Entity.hist_id = NEXT VALUE FOR hist_id_seq
(something like this - not tested. You didn't specify where in the project this would belong)
So, if you then add a row with something like:
INSERT INTO benchmarksql.history (hist_id, h_data) VALUES(null, 'a')
or
INSERT INTO benchmarksql.history (h_data) VALUES('a')
an id will be generated and attached automatically. Note that ALL ids added to the table must come from the given sequence (as @mustaccio pointed out, this appears to be true even in PostgreSQL), or any UNIQUE CONSTRAINT on the column will start throwing duplicate-key errors. So any time your application needs an id before inserting a row in the table, you'll need some form of
SELECT NEXT VALUE FOR hist_id_seq
FROM sysibm.sysdummy1
... and that's it, pretty much. This is completely thread and concurrency safe, will not maintain/require long-term locks, nor require serialized access to the table.

How do I create a drop down list?

I'm new to PostgreSQL.
I was wondering, how do I make a column a drop down list.
So I've got a table called Student. There's a column in there called "student_type", which indicates whether the student is a part-time student, a full-time student, or a sandwich-course student.
So I want to make "student_type" a drop-down list with 3 choices: "part time", "full time" and "sandwich".
How do I do this?
(I'm using pgAdmin to create the database, by the way.)
A drop-down is a client-side thing and should be dealt with accordingly. But as far as the relational database is concerned, there should be a student_type relation with id and type columns, which you would query like this:
select st.id, st.type
from student_type st
inner join student s on s.type_id = st.id
group by st.id, st.type
order by st.type
The inner join is to make sure you don't show an option that does not exist in the student table and would therefore produce an empty result if chosen. On the client side, the id should be the option value and the type the option text.
If there is no student_type relation as a consequence of bad db design or if you are only allowed to query a denormalized view, you can still use the student relation:
select distinct student_type
from student
order by student_type
In this case the student_type will be both the option value and the option text.
I think you used MS Access before. It is not possible to create such a drop-down list using pgAdmin, but it is possible to limit the acceptable values of the "student_type" field in a few different ways (two of them are sketched below). Still, this will not be a drop-down or combobox field.
E.g.:
You can use a table as a dictionary and reference it with a foreign key,
You can use a CHECK constraint to validate the inserted value,
You can use a domain (the field is of your domain type, and the domain is based on a proper constraint),
You can use a trigger (before insert or update),
etc.
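Sketches of two of those options, assuming the Student table has a text column student_type (names are illustrative):
alter table student
    add constraint student_type_check
    check (student_type in ('part time', 'full time', 'sandwich'));
-- or, as a reusable domain:
create domain student_type_dom as text
    check (value in ('part time', 'full time', 'sandwich'));
-- the column would then be declared as: student_type student_type_dom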
To put this a different way, Postgres is a database, not a user-interface library. It doesn't have dropdown lists, text boxes, labels and all that. It is not directly usable by a human that way.
What PG can do is provide a stable source for data used by such widgets. Each widget library has its own rules for how to bind (that is, connect) to a data source. Some let you directly connect the visual component, in this case, the dropdown widget, to a database, or better, a database cursor.
With Postgres, you can create a cursor, which is an in-memory window into the result of a SELECT query of some kind, that your widget or favorite programming language binds to. In your example, the cursor or widget binding would be to the result of a "SELECT student_type FROM student_type" query.
As an aside, the values for "student_type" should not be stored only in "student". You should have a normalized table structure, which here would give you a "student_type" table that holds the three choices, one per row. (There are other ways to do this.) The values you specified would be the primary key column. (Alternatively, you'd have those values in a UNIQUE column with a surrogate key as the primary key, but that's probably overkill for a simple lookup table.) The "student.student_type" column would then be a foreign key into the "student_type.student_type" column.

How to use BULK INSERT when rows depend on foreign key values?

My question is related to this one I asked on ServerFault.
Based on this, I've considered using BULK INSERT. I now understand that I have to prepare a file for each entity I want to save into the database. No matter what, I still wonder whether this BULK INSERT will avoid the memory issue on my system, as described in the referenced question on ServerFault.
As for the Streets table, it's quite simple! I have only two cities and five sectors to care about as the foreign keys. But then, how about the Addresses? The Addresses table is structured like this:
AddressId int not null identity(1,1) primary key
StreetNumber int null
NumberSuffix_Value int not null DEFAULT 0
StreetId int null references Streets (StreetId)
CityId int not null references Cities (CityId)
SectorId int null references Sectors (SectorId)
As I said on ServerFault, I have about 35,000 addresses to insert. Shall I memorize all the IDs? =P
And then, I now have the citizen people to insert who have an association with the addresses.
PersonId int not null identity(1,1) primary key
Surname nvarchar not null
FirstName nvarchar not null
IsActive bit
AddressId int null references Addresses (AddressId)
The only thing I can think of is to force the IDs to static values, but then I lose any flexibility that I had with my former INSERT..SELECT strategy.
What are then my options?
If I force the IDs to always be the same, then I have to SET IDENTITY_INSERT ON so that I can force the values into the table; this way, I always have the same IDs for each of my rows, just as suggested here.
How to BULK INSERT with foreign keys? I can't get any docs on this anywhere. =(
Thanks for your kind assistance!
EDIT
I edited in order to include the BULK INSERT SQL statement that finally worked for me!
I had my Excel workbook ready with the information I needed to insert. So, I simply created a few supplemental worksheets and began writing formulas in order to "import" the information into these new sheets. I had one for each of my entities.
Streets;
Addresses;
Citizens.
As for the two other entities, it wasn't worth bulk inserting them, as I had only two cities and five sectors (city subdivisions) to insert. Once both the cities and sectors were inserted, I noted their respective IDs and began to prepare my record sets for bulk insert. Using the power of Excel to compute the values and to "import" the foreign keys was a charm in itself, by the way. Afterwards, I saved each of the worksheets to a separate CSV file. My records were then ready to be bulk inserted.
USE [DatabaseName]
GO
delete from Citizens
delete from Addresses
delete from Streets
BULK INSERT Streets
FROM N'C:\SomeFolder\SomeSubfolder\Streets.csv'
WITH (
FIRSTROW = 2
, KEEPIDENTITY
, FIELDTERMINATOR = N','
, ROWTERMINATOR = N'\n'
, CODEPAGE = N'ACP'
)
GO
FIRSTROW
Indicates the row number at which to begin the insert. In my situation, my CSVs contained the column headers, so the second row was the one to begin with. As an aside, one could start anywhere in the file, say the 15th row.
KEEPIDENTITY
Allows one to bulk-insert the entity IDs specified in the file even though the table has an identity column. This parameter is the same as using SET IDENTITY_INSERT my_table ON before a row insert when you wish to insert with a precise id.
As for the other parameters, they speak for themselves.
Now that this is explained, the same code was repeated for each of the two remaining entities to insert, Addresses and Citizens. And because KEEPIDENTITY was specified, all of my foreign keys remained intact, even though my primary keys were set as identities in SQL Server.
Only a few tweaks though: do exactly what marc_s said in his answer - import your data as fast as you can into a staging table with no restrictions at all. This way, you're going to make your life much easier, while still following good practices. =)
The basic idea is to bulk insert your data into a staging table that doesn't have any restrictions, any constraints etc. - just bulk load the data as fast as you can.
Once you have the data in the staging table, then you need to start to worry about constraints etc. when you insert the data from the staging table into the real tables.
Here, you could e.g.
insert only those rows into your real work tables that match all the criteria (and mark them as "successfully inserted" in your staging table)
handle all rows that are left in the staging table and weren't successfully inserted via some error / recovery process - whatever that could be: printing a report with all the "problem" rows, tossing them into an "error bin" or whatever - totally up to you.
Key point is: the actual BULK INSERT should be into a totally unconstrained table - just load the data as fast as you can - and only then in a second step start to worry about constraints and lookup data and references and stuff like that
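A rough sketch of that two-step flow for the addresses above (the staging table and the lookup column names are illustrative, not the asker's exact schema):
-- Step 1: bulk load into an unconstrained staging table as fast as possible.
CREATE TABLE Addresses_Staging (
    StreetNumber int NULL,
    StreetName   nvarchar(100) NULL,
    CityName     nvarchar(100) NULL,
    SectorName   nvarchar(100) NULL
);
BULK INSERT Addresses_Staging
FROM N'C:\SomeFolder\SomeSubfolder\Addresses.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = N',', ROWTERMINATOR = N'\n', CODEPAGE = N'ACP');
-- Step 2: resolve the foreign keys and insert into the real table.
INSERT INTO Addresses (StreetNumber, StreetId, CityId, SectorId)
SELECT stg.StreetNumber, s.StreetId, c.CityId, sec.SectorId
FROM Addresses_Staging stg
JOIN Cities c         ON c.CityName = stg.CityName
LEFT JOIN Streets s   ON s.StreetName = stg.StreetName
LEFT JOIN Sectors sec ON sec.SectorName = stg.SectorName;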

Why does Postgres handle NULLs inconsistently where unique constraints are involved?

I recently noticed an inconsistency in how Postgres handles NULLs in columns with a unique constraint.
Consider a table of people:
create table People (
pid int not null,
name text not null,
SSN text unique,
primary key (pid)
);
The SSN column should be kept unique. We can check that:
-- Add a row.
insert into People(pid, name, SSN)
values(0, 'Bob', '123');
-- Test the unique constraint.
insert into People(pid, name, SSN)
values(1, 'Carol', '123');
The second insert fails because it violates the unique constraint on SSN. So far, so good. But let's try a NULL:
insert into People(pid, name, SSN)
values(1, 'Carol', null);
That works.
select *
from People;
0;"Bob";"123"
1;"Carol";"<NULL>"
A unique column will take a null. Interesting. How can Postgres assert that null is in any way unique, or not unique for that matter?
I wonder if I can add two rows with null in a unique column.
insert into People(pid, name, SSN)
values(2, 'Ted', null);
select *
from People;
0;"Bob";"123"
1;"Carol";"<NULL>"
2;"Ted";"<NULL>"
Yes I can. Now there are two rows with NULL in the SSN column even though SSN is supposed to be unique.
The Postgres documentation says, For the purpose of a unique constraint, null values are not considered equal.
Okay. I can see the point of this. It's a nice subtlety in null-handling: by considering all NULLs in a unique-constrained column to be distinct from one another, we delay enforcing the unique constraint until there is an actual non-null value on which to base that enforcement.
That's pretty cool. But here's where Postgres loses me. If all NULLs in a unique-constrained column are not equal, as the documentation says, then we should see all of the nulls in a select distinct query.
select distinct SSN
from People;
"<NULL>"
"123"
Nope. There's only a single null there. It seems like Postgres has this wrong. But I wonder: Is there another explanation?
Edit:
The Postgres docs do specify, in the section on SELECT DISTINCT, that "null values are considered equal in this comparison." While I do not understand that notion, I'm glad it's spelled out in the docs.
It is almost always a mistake when dealing with null to say:
"nulls behave like so-and-so here, *so they should behave like
such-and-such here"
Here is an excellent essay on the subject from a postgres perspective. Briefly summed up: nulls are treated differently depending on the context, so don't make the mistake of assuming anything about them.
The bottom line is, PostgreSQL does what it does with nulls because the SQL standard says so.
Nulls are obviously tricky and can be interpreted in multiple ways (unknown value, absent value, etc.), and so when the SQL standard was initially written, the authors had to make some calls at certain places. I'd say time has proved them more or less right, but that doesn't mean that there couldn't be another database language that handles unknown and absent values slightly (or wildly) differently. But PostgreSQL implements SQL, so that's that.
As was already mentioned in a different answer, Jeff Davis has written some good articles and presentations on dealing with nulls.
NULL is considered to be unique because NULL doesn't represent the absence of a value. A NULL in a column is an unknown value. When you compare two unknowns, you don't know whether or not they are equal because you don't know what they are.
Imagine that you have two boxes marked A and B. If you don't open the boxes and you can't see inside, you never know what the contents are. If you're asked "Are the contents of these two boxes the same?" you can only answer "I don't know".
In this case, PostgreSQL will do the same thing. When asked to compare two NULLs, it says "I don't know." This has a lot to do with the crazy semantics around NULL in SQL databases. The article linked to in the other answer is an excellent starting point to understanding how NULLs behave. Just beware: it varies by vendor.
Multiple NULL values in a unique index are okay because x = NULL is never true for any x (it evaluates to unknown), and in particular this holds when x is itself NULL. You'll also run into this behavior in WHERE clauses, where you have to say WHERE x IS NULL and WHERE x IS NOT NULL rather than WHERE x = NULL and WHERE x <> NULL.
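A quick way to see both comparisons side by side in PostgreSQL - plain equality, which is what the unique constraint relies on, versus IS NOT DISTINCT FROM, which treats two NULLs as equal the way DISTINCT grouping does:
select
    null::text = null::text                     as plain_equality,   -- returns NULL (unknown), so no unique violation
    null::text is not distinct from null::text  as is_not_distinct;  -- returns true, so SELECT DISTINCT collapses the rows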