Create unique integer id column for result rows of union query - postgresql

I have a view, shown below, in which I union several tables, and I'm thinking it might be a good idea to have a unique row number for each row in the result set. The immediate reason is that I have an admin tool which doesn't know I'm using a view rather than an ordinary table, and which expects a unique id to be present. I'm now also wondering whether it might be worth doing more generally (i.e. it may make sense on theoretical grounds; discussion on this would be welcome). I'm wondering how to do this in PostgreSQL.
CREATE VIEW subscriptions AS (
SELECT subscriber_id, course, end_at
FROM subscriptions_individual_stripe
UNION ALL SELECT subscriber_id, course, end_at
FROM subscriptions_individual_bank_transfer
ORDER BY end_at DESC);
Discussion
The reason these are separate tables is of course that they are actually different entities, and yet I also need to be able to contemplate them in a combined way, hence the VIEW. This is my way of avoiding so-called 'polymorphic relationships' in certain popular web frameworks.
I have a tool that expects an id and while my first thought was that views don't need a unique key, on the other hand, maybe they do...?
The reason is that two records could exist in one of the UNIONed tables that are unique only by virtue of the primary key. If the primary key is not selected, a plain UNION (unlike UNION ALL) would remove one of them, so a record would be lost. Should we also take that into account, i.e. select the primary key (here an integer id) for each of the UNIONed tables, but "convert it" to some other unique id, so the view has its own unique integer primary key? Of course this won't be usable for referencing anything in the original UNIONed tables, but I'm OK with that (the view is a terminal point of my analysis, I don't intend to do anything further with it, and of course it is not writable).
Update
I'm accepting S-Man's answer below because it is a solution to the question I asked. However, as pointed out, row_number() must not be treated as if it were a real identifier, because it is not one.
So as an important aside, I'm left wondering what row_number() is really intended for then. Perhaps it's (mainly? occasionally?) useful where you want to output some query when you plan to export the data somewhere else (i.e. seems almost spreadsheet-ish), and you abandon any sense of it being integrated with the rest of your database?
Table inheritance may be better as Abelisto has pointed out in the comments.

You can add a row number to the UNION result using the row_number() window function:
CREATE VIEW v_myview AS
SELECT
row_number() OVER (ORDER BY ...) AS id,
*
FROM (
SELECT ...
UNION
SELECT ...
) AS foo;
The main problem with this is that you should never treat this id as a real identifier, because the data in the underlying tables can change. One table might produce a few more records today than it did yesterday, and then the generated row numbers would no longer point to the same records as before.
Edit: Removed the md5 solution I added before because of some problems with uniqueness on same data.
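As a minimal sketch of how this could look for the tables in the question: the column types, the id primary keys, the source and source_id columns, and the view name subscriptions_with_id are all assumptions made for illustration, not taken from the original schema.

```sql
-- Hypothetical table definitions; only the table names and the columns
-- subscriber_id, course and end_at come from the question.
CREATE TABLE subscriptions_individual_stripe (
    id            serial PRIMARY KEY,
    subscriber_id integer NOT NULL,
    course        text    NOT NULL,
    end_at        timestamptz
);

CREATE TABLE subscriptions_individual_bank_transfer (
    id            serial PRIMARY KEY,
    subscriber_id integer NOT NULL,
    course        text    NOT NULL,
    end_at        timestamptz
);

CREATE VIEW subscriptions_with_id AS
SELECT
    -- Recomputed on every execution; not stable when the data changes.
    row_number() OVER (ORDER BY end_at DESC, source, source_id) AS id,
    u.*
FROM (
    SELECT 'stripe' AS source, id AS source_id, subscriber_id, course, end_at
    FROM subscriptions_individual_stripe
    UNION ALL
    SELECT 'bank_transfer', id, subscriber_id, course, end_at
    FROM subscriptions_individual_bank_transfer
) AS u;
```

UNION ALL keeps every row, and the pair (source, source_id) still identifies where each row came from, so it can serve as a stable key if one is ever needed, while id only satisfies a tool that expects a unique integer.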

Related

PostgreSQL 9.5 ON CONFLICT DO UPDATE command cannot affect row a second time

I have a table from which I want to UPSERT into another. When I try to launch the query, I get the "cannot affect row a second time" error. So I checked whether my first table has any duplicates on the field with the UNIQUE constraint, and it has none. I must be missing something, but since I cannot figure out what (and my query is a bit complex because it includes some JOINs), here is the query. The field with the UNIQUE constraint is "identifiant_immeuble":
with upd(a,b,c,d,e,f,g,h,i,j,k) as(
select id_parcelle, batimentimmeuble,etatimmeuble,nb_loc_hab_ma,nb_loc_hab_ap,nb_loc_pro, dossier.id_dossier, adresse.id_adresse, zapms.geom, 1, batimentimmeuble2
from public.zapms
left join geo_pays_gex.dossier on dossier.designation_siea=zapms.id_dossier
left join geo_pays_gex.adresse on adresse.id_voie=(select id_voie from geo_pays_gex.voie where (voie.designation=zapms.nom_voie or voie.nom_quartier=zapms.nom_quartier) and voie.rivoli=lpad(zapms.rivoli,4,'0'))
and adresse.num_voie=zapms.num_voie
and adresse.insee=zapms.insee_commune::integer
)
insert into geo_pays_gex.bal2(identifiant_immeuble, batimentimmeuble, id_etat_addr, nb_loc_hab_ma, nb_loc_hab_ap, nb_loc_pro, id_dossier, id_adresse, geom, raccordement, batimentimmeuble2)
select a,b,c,d,e,f,g,h,i,j,k from upd
on conflict (identifiant_immeuble) do update
set batimentimmeuble=excluded.batimentimmeuble, id_etat_addr=excluded.id_etat_addr, nb_loc_hab_ma=excluded.nb_loc_hab_ma, nb_loc_hab_ap=excluded.nb_loc_hab_ap, nb_loc_pro=excluded.nb_loc_pro,
id_dossier=excluded.id_dossier, id_adresse=excluded.id_adresse,geom=excluded.geom, raccordement=1, batimentimmeuble2=excluded.batimentimmeuble2
;
As you can see, I use several intermediary tables in this query: one to store the street names (voie), one related to it storing the addresses (adresse, basically numbers related through a foreign key to the street names table), and another storing other data related to the project names (dossier).
I don't know what other information I could give to help find an answer, I guess it is better I do not share the actual content of my tables since it may touch some privacy regulations or such.
Thanks for your attention.
EDIT: I found a workaround by deleting from the bal2 table the entries that are present in the zapms table, like this:
delete from geo_pays_gex.bal2 where bal2.identifiant_immeuble in (select id_parcelle from zapms);
It is not entirely satisfying though, since I would have preferred to keep track of the data creator and the date of creation, as well as the fact that the data has been modified (I have some fields to store this information), and here I simply erase all this history. I also have another table with the primary key of the bal2 table as a foreign key. I am still building the database, so I can afford to truncate this table, but in production it wouldn't be possible since I would lose data.
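One possible cause, offered as an assumption since the table contents are not shown: PostgreSQL raises this error when the rows produced by a single INSERT contain the same conflict key more than once. id_parcelle can be unique in zapms and still appear several times in upd if a LEFT JOIN matches more than one dossier or adresse row. A diagnostic sketch that reuses the joins from the query above:

```sql
-- Lists every conflict key that the CTE would hand to the INSERT more than once.
WITH upd AS (
    SELECT id_parcelle
    FROM public.zapms
    LEFT JOIN geo_pays_gex.dossier
           ON dossier.designation_siea = zapms.id_dossier
    LEFT JOIN geo_pays_gex.adresse
           ON adresse.id_voie = (SELECT id_voie
                                 FROM geo_pays_gex.voie
                                 WHERE (voie.designation = zapms.nom_voie
                                        OR voie.nom_quartier = zapms.nom_quartier)
                                   AND voie.rivoli = lpad(zapms.rivoli, 4, '0'))
          AND adresse.num_voie = zapms.num_voie
          AND adresse.insee = zapms.insee_commune::integer
)
SELECT id_parcelle, count(*) AS copies
FROM upd
GROUP BY id_parcelle
HAVING count(*) > 1;
```

If this returns rows, reducing upd to one row per key (for example with SELECT DISTINCT ON and a matching ORDER BY) would let the ON CONFLICT clause run. Which of the duplicate rows to keep is a decision that depends on the data.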

Is a PostgreSQL sequence's nextval consistent with insert order?

Given a table with id bigint default nextval('foo_sequence'),
can I assume that the order of the ids is consistent with the insert order?
I mean that a later inserted id is always greater than earlier inserted ids.
I am trying to calculate and save a continuous, incrementing row number.
Here is how I did it:
SELECT count(*) AS seq_no FROM foo WHERE id < some_id;
-- get the seq no
UPDATE foo SET seq_no = seq_no_above + 1 WHERE id = some_id;
But it sometimes gives duplicate seq_no values;
if the ids were consistent with insert order, there should be no duplicates.
In the simplest and purest sense, yes. It depends what you mean by "earlier" and "later", though, as you have to consider opening the transaction and closing the transaction. If a transaction has not been committed, then theoretically a record could show up later with an earlier ID.
The IDs are allocated when the insert happens, but the records will not show up until the records are committed. So if commit order is different, you may see some strange behavior depending on how strict your use case is.
For example:
Open Transaction A
Insert records 1,2
Open Transaction B
Insert records 3,4
Close transaction B
Select * (get 3,4)
Close transaction A
Select * (get 1,2,3,4)
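The same sequence of events, sketched as SQL. This is meant to be run in separate sessions, not as one script. The table definition and the note column are assumptions, and the id values shown assume a freshly created sequence with the default cache of 1.

```sql
-- Setup, run once (hypothetical table).
CREATE TABLE foo (
    id   bigserial PRIMARY KEY,
    note text
);

-- Session A
BEGIN;
INSERT INTO foo (note) VALUES ('a1'), ('a2');  -- allocates ids 1 and 2

-- Session B
BEGIN;
INSERT INTO foo (note) VALUES ('b1'), ('b2');  -- allocates ids 3 and 4
COMMIT;

-- Session C (a third connection)
SELECT id FROM foo ORDER BY id;  -- sees 3 and 4; rows 1 and 2 are not committed yet

-- Session A
COMMIT;

-- Session C
SELECT id FROM foo ORDER BY id;  -- now sees 1, 2, 3, 4
```

A reader in session C that computed count(*) for id 3 before session A committed would get a different answer than after, which is one way a derived seq_no can end up duplicated.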
You also have to take sequence caching into account when deciding whether to treat the values as sequential. From the (very good) Postgres docs:
Furthermore, although multiple sessions are guaranteed to allocate
distinct sequence values, the values might be generated out of
sequence when all the sessions are considered. For example, with a
cache setting of 10, session A might reserve values 1..10 and return
nextval=1, then session B might reserve values 11..20 and return
nextval=11 before session A has generated nextval=2. Thus, with a
cache setting of one it is safe to assume that nextval values are
generated sequentially; with a cache setting greater than one you
should only assume that the nextval values are all distinct, not that
they are generated purely sequentially. Also, last_value will reflect
the latest value reserved by any session, whether or not it has yet
been returned by nextval.
One last caveat is someone with appropriate privileges can always reset the sequence to a different value, which obviously would throw a wrench into things.
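A short sketch of both caveats, following the numbers in the quoted documentation. It assumes foo_sequence does not exist yet and is created here with a cache of 10; the statements marked per session have to be run in separate connections.

```sql
-- Hypothetical sequence with a cache setting of 10.
CREATE SEQUENCE foo_sequence CACHE 10;

-- Session A:
SELECT nextval('foo_sequence');  -- 1  (session A has reserved 1..10)
-- Session B:
SELECT nextval('foo_sequence');  -- 11 (session B has reserved 11..20)
-- Session A:
SELECT nextval('foo_sequence');  -- 2, handed out after 11

-- A sufficiently privileged user can also move the sequence back:
ALTER SEQUENCE foo_sequence RESTART WITH 1;
```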
EDIT:
To address your use case above, you definitely want to use sequences (and likely add NOT NULL / PRIMARY KEY constraints as well, to ensure uniqueness). In pgAdmin, at least, you can do all of this by setting data type serial. Though I have mentioned caveats, for 99% of practical purposes, you get uniqueness and sequential ordering (hence sequences) the way that you want.
In any case, we would need to see example data to confirm why you are seeing duplication (how to create a reproducible example). I presume the duplication you are seeing is in seq_no and not id, which illustrates that the problem is your query. If duplication is in id, then you have other problems, and that would explain duplication in seq_no.
Sequences are much better for transactional definition in the data (they take care of uniqueness for you, perform well in concurrency, and do not cause duplication... plus you get sequential ordering for the most part). For unique keys, they are best used with NOT NULL and PRIMARY KEY or UNIQUE constraints.
But if you need a perfect increment, it is better to do something like the below:
select *, row_number() over (order by value) as id
from foo
;
Postgres window functions are very powerful, but are definitely not the standard to use for inserting data with sequential keys. They are more useful for reporting, analysis, and complex queries after the fact.
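If seq_no does need to be stored, one option is to compute it for all rows in a single statement instead of one count(*) per row. This is a sketch under assumptions: the temporary table below stands in for the real foo, and it does not protect against rows that are inserted or committed later, so it would have to be rerun.

```sql
-- Hypothetical stand-in for the table in the question.
CREATE TEMP TABLE foo (
    id     bigserial PRIMARY KEY,
    seq_no bigint
);
INSERT INTO foo DEFAULT VALUES;
INSERT INTO foo DEFAULT VALUES;
INSERT INTO foo DEFAULT VALUES;

-- row_number() over id order equals "count of rows with a smaller id, plus one".
UPDATE foo
SET seq_no = s.rn
FROM (
    SELECT id, row_number() OVER (ORDER BY id) AS rn
    FROM foo
) AS s
WHERE foo.id = s.id;

SELECT id, seq_no FROM foo ORDER BY id;
```

Because every seq_no comes from one snapshot of the table, two rows cannot receive the same number within a single run.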

Postgres table partitioning with star schema

I have a schema with one table with the majority of data, customer, and three other tables with foreign key references to customer.entry_id which is a BIGSERIAL field. The three other tables are called location, devices and urls where we store various data related to a specific entry in the customer table.
I want to partition the customer table into monthly child tables, and have that part worked out; customer will stay as-is, each month will have a table customer_YYYY_MM that inherits from the master table with the right CHECK constraint and indexes will be created on each individual child table. Data will be moved to the correct child tables while the master table stays empty.
My question is about the other three tables, as I want to partition them as well. However, they have no date information at all, only the reference to the primary key of the master table. How can I set up the constraints on these tables? Is it even meaningful or possible without date information?
My application logic knows where to insert all the data (it's fairly trivial), but I expect to be able to do simple SELECT queries without specifying which child tables to get it from. So this should work as you would expect from non-partitioned tables:
SELECT l.*
FROM customer c
JOIN location l USING (entry_id)
WHERE c.date_field > '2015-01-01';
I would partition them by the reference key. The foreign key is used in join conditions and is not usually subject to change so it fulfills the following important points:
Partition by the information that is used mostly in the WHERE clauses of the queries or other parts where partitioning can be used to filter out tables that don't need to be scanned. As one guide puts it:
The objective when defining partitions should be to allow as many queries as possible to fetch data from as few partitions as possible - ideally one.
Partition by information that is not going to be changed so that rows don't constantly need to be thrown from one subtable to another
This all depends on the size of the tables too, of course. If the tables stay small, there is no need to partition.
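A minimal sketch of partitioning by the reference key with inheritance, in the same style the question uses for customer. The payload column, the child table names, and the entry_id boundaries are invented for illustration; in practice the boundaries would be the entry_id ranges that fall into each month.

```sql
-- Hypothetical parent table; only entry_id comes from the question.
CREATE TABLE location (
    entry_id bigint NOT NULL,
    payload  text
);

-- One child per range of entry_id values (boundaries are assumptions).
CREATE TABLE location_2015_01 (
    CHECK (entry_id >= 1 AND entry_id < 1000001)
) INHERITS (location);

CREATE TABLE location_2015_02 (
    CHECK (entry_id >= 1000001 AND entry_id < 2000001)
) INHERITS (location);

CREATE INDEX ON location_2015_01 (entry_id);
CREATE INDEX ON location_2015_02 (entry_id);

-- 'partition' is the default setting; shown for clarity.
SET constraint_exclusion = partition;

-- With a constant in the WHERE clause the planner can skip children
-- whose CHECK constraint rules the value out.
EXPLAIN SELECT * FROM location WHERE entry_id = 1500000;
```

Constraint exclusion works from values the planner knows at plan time, so a join on entry_id alone may still visit every child; the per-child indexes keep those lookups cheap.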
Read more about partitioning here.
Use views:
create view customer as
select * from customer_jan_15 union all
select * from customer_feb_15 union all
select * from customer_mar_15;
create view location as
select * from location_jan_15 union all
select * from location_feb_15 union all
select * from location_mar_15;

How multiple indexes in postgres work on the same column

I'm not really sure how multiple indexes work on the same column.
Let's say I have an id column and a country column, with one index on id and another index on id and country. When I look at my query plan, it appears to use both of those indexes. How does that work? Can I force it to use just the id and country index?
Also is it bad practice to do that? When is it a good idea to index the same column multiple times?
It is common to have indexes on both (id) and (country,id), or alternatively (country) and (country,id), if you have queries that benefit from each of them. You might also have (id) and (id, country) if you want the "covering" index on (id,country) to support index-only scans, but still need the stand-alone index on (id) to enforce a unique constraint.
In theory you could just have (id,country) and still use it to enforce uniqueness of id, but PostgreSQL does not support that at this time.
You could also sensibly have different indexes on the same column if you need to support different collations or operator classes.
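The combinations described above, sketched as DDL. The table t, its column types, and the index names are assumptions. The last statement uses the INCLUDE clause, which is available from PostgreSQL 11 onward and gives a single index that both enforces uniqueness of id and carries country for index-only scans.

```sql
-- Hypothetical table.
CREATE TABLE t (
    id      integer NOT NULL,
    country text    NOT NULL
);

-- Stand-alone unique index: enforces uniqueness of id.
CREATE UNIQUE INDEX t_id_idx ON t (id);

-- Two-column index: can answer queries on id and country from the index alone.
CREATE INDEX t_id_country_idx ON t (id, country);

-- PostgreSQL 11 and later only: unique on id, with country stored as a
-- non-key column, as an alternative to keeping both indexes above.
CREATE UNIQUE INDEX t_id_incl_country_idx ON t (id) INCLUDE (country);
```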
If you want to force PostgreSQL to not use a particular index to see what happens with it gone, you can drop it in a transaction and then roll it back when done:
BEGIN;
DROP INDEX table_id_country_idx;
EXPLAIN ANALYZE SELECT * FROM ....;
ROLLBACK;

auto-increment column in PostgreSQL on the fly?

I was wondering if it is possible to add an auto-increment integer field on the fly, i.e. without defining it in a CREATE TABLE statement?
For example, I have a statement:
SELECT 1 AS id, t.type FROM t;
and I am wondering whether I can change this to
SELECT some_nextval_magic AS id, t.type FROM t;
I need to create the auto-increment field on the fly in the some_nextval_magic part because the result relation is a temporary one during the construction of a bigger SQL statement. And the value of id field is not really important as long as it is unique.
I searched around here, and the answers to related questions (e.g. PostgreSQL Autoincrement) mostly involve specifying SERIAL or using nextval in CREATE TABLE. But I don't necessarily want to use CREATE TABLE or VIEW (unless I have to). There are also some discussions of generate_series(), but I am not sure whether it applies here.
-- Update --
My motivation is illustrated in this GIS.SE answer regarding the PostGIS extension. The original query was:
CREATE VIEW buffer40units AS
SELECT
g.path[1] as gid,
g.geom::geometry(Polygon, 31492) as geom
FROM
(SELECT
(ST_Dump(ST_UNION(ST_Buffer(geom, 40)))).*
FROM point
) as g;
where g.path[1] as gid is an id field "required for visualization in QGIS". I believe the only requirement is that it is integer and unique across the table. I encountered some errors when running the above query when the g.path[] array is empty.
While trying to fix the array in the above query, this thought came to me:
Since the gid value does not matter anyways, is there an auto-increment function that can be used here instead?
If you wish to have an id field that assigns a unique integer to each row in the output, then use the row_number() window function:
select
row_number() over () as id,
t.type from t;
The generated id will only be unique within each execution of the query. Multiple executions will not generate new unique values for id.
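A small self-contained sketch of this; the temporary table and its sample values are assumptions. Adding an ORDER BY inside OVER makes the numbering deterministic for a given set of rows, and row_number() returns bigint, so a cast can be added if the consumer expects a plain integer.

```sql
-- Hypothetical sample data.
CREATE TEMP TABLE t (type text);
INSERT INTO t (type) VALUES ('road'), ('river'), ('rail');

SELECT
    row_number() OVER (ORDER BY t.type)::integer AS id,
    t.type
FROM t;
-- id | type
--  1 | rail
--  2 | river
--  3 | road
```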