I'm working on a 1M+ row table. The software that inserts the data sometimes tries to select all rows. If it tries to do that; It crashes.
I'm not able to modify the software so I'm trying to implement a fix on the Postgresql side.
I want Postgresql to limit SELECT query results that are coming from a special user to 1.
I tried to implement a RULE but haven't been able to do it with success. Any suggestions are welcome.
Br,
You could rename the table and create a view with the name of the table (selecting from the renamed table).
Then you can include a LIMIT clause in the view definition.
There is a chance you need an index. Let me give you a few scenarios
There is a unique constraint on one of the fields but no corresponding index. This way when you insert a record PostgreSQL has to scan the table to see if there is an existing record with the same value in that field.
Your software mimics unique field constraint. Before inserting a new record it scans the table for a record with the same value in one of the fields to check if such a record already exists. Index on the right field would definitely help.
You software wants to compute the next "id" value. In this case it runs SELECT MAX(id) in order to find the next available value. "id" needs an index.
Try to find out if indexing one of the table fields helps. You can also try to trace and analyze queries submitted to the server and see if those queries can benefit from indexing the table. You can enable query logging this way How to log PostgreSQL queries?
Another guess is that your software buffers all records before processing them. Reading 1M records into memory may crash it. Limiting fetchSize (e.g. if your software uses JDBC you could add defaultRowFetchSize connection parameter to the connection string) may help though I realize you may not have means to change the way the existing software fetches data from DB.
Related
I am writing a query with code to select all records from a table where a column value is contained in a CSV. I found a suggestion that the best way to do this was using ARRAY functionality in PostgresQL.
I have a table price_mapping and it has a primary key of id and a column customer_id of type bigint.
I want to return all records that have a customer ID in the array I will generate from csv.
I tried this:
select * from price_mapping
where ARRAY[customer_id] <# ARRAY[5,7,10]::bigint[]
(the 5,7,10 part would actually be a csv inserted by my app)
But I am not sure that is right. In application the array could contain 10's of thousands of IDs so want to make sure I am doing right with best performance method.
Is this the right way in PostgreSQL to retrieve large collection of records by pre-defined column value?
Thanks
Generally this is done with the SQL standard in operator.
select *
from price_mapping
where customer_id in (5,7,10)
I don't see any reason using ARRAY would be faster. It might be slower given it has to build arrays, though it might have been optimized.
In the past this was more optimal:
select *
from price_mapping
where customer_id = ANY(VALUES (5), (7), (10)
But new-ish versions of Postgres should optimize this for you.
Passing in tens of thousands of IDs might run up against a query size limit either in Postgres or your database driver, so you may wish to batch this a few thousand at a time.
As for the best performance, the answer is to not search for tens of thousands of IDs. Find something which relates them together, index that column, and search by that.
If your data is big enough, try this:
Read your CSV using a FDW (foreign data wrapper)
If you need this connection often, you might build a materialized view from it, holding only needed columns. Refresh this when new CSV is created.
Join your table against this foreign table or materialized viev.
Can I do row-specific update / delete operations in a DB2 table Via SQL, in a NON QUNIQUE Primary Key Context?
The Table is a PHYSICAL FILE on the NATIVE SYSTEM of the AS/400.
It was, like many other Files, created without the unique definition, which leads DB2 to the conclusion, that The Table, or PF has no qunique Key.
And that's my problem. I can't override the structure of the table to insert a unique ID ROW, because, I would have to recompile ALL my correlating Programs on the AS/400, which is a serious issue, much things would not work anymore, "perhaps". Of course, I can do that refactoring for one table, but our system has thousands of those native FILES, some well done with Unique Key, some without Unique definition...
Well, I work most of the time with db2 and sql on that old files. And all files which have a UNIQUE Key are no problem for me to do those important update / delete operations.
Is there some way to get an additional column to every select with a very unique row id, respective row number. And in addition, what is much more important, how can I update this RowNumber.
I did some research and meanwhile I assume, that there is no chance to do exact alterations or deletes, when there is no unique key present. What I would wish would be some additional ID-ROW which is always been sent with the table, which I can Refer to when I do my update / delete operations. Perhaps my thinking here has an fallacy as non Unique Key Tables are purposed to be edited in other ways.
Try the RRN function.
SELECT RRN(EMPLOYEE), LASTNAME
FROM EMPLOYEE
WHERE ...;
UPDATE EMPLOYEE
SET ...
WHERE RRN(EMPLOYEE) = ...;
I have a table from which I want to UPSERT into another, when try to launch the query, I get the "cannot affect row a second time" error. So I tried to see if I have some duplicate on my first table regarding the field with the UNIQUE constraint, and I have none. I must be missing something, but since I cannot figure out what (and my query is a bit complex because it is including some JOIN), here is the query, the field with the UNIQUE constraint is "identifiant_immeuble" :
with upd(a,b,c,d,e,f,g,h,i,j,k) as(
select id_parcelle, batimentimmeuble,etatimmeuble,nb_loc_hab_ma,nb_loc_hab_ap,nb_loc_pro, dossier.id_dossier, adresse.id_adresse, zapms.geom, 1, batimentimmeuble2
from public.zapms
left join geo_pays_gex.dossier on dossier.designation_siea=zapms.id_dossier
left join geo_pays_gex.adresse on adresse.id_voie=(select id_voie from geo_pays_gex.voie where (voie.designation=zapms.nom_voie or voie.nom_quartier=zapms.nom_quartier) and voie.rivoli=lpad(zapms.rivoli,4,'0'))
and adresse.num_voie=zapms.num_voie
and adresse.insee=zapms.insee_commune::integer
)
insert into geo_pays_gex.bal2(identifiant_immeuble, batimentimmeuble, id_etat_addr, nb_loc_hab_ma, nb_loc_hab_ap, nb_loc_pro, id_dossier, id_adresse, geom, raccordement, batimentimmeuble2)
select a,b,c,d,e,f,g,h,i,j,k from upd
on conflict (identifiant_immeuble) do update
set batimentimmeuble=excluded.batimentimmeuble, id_etat_addr=excluded.id_etat_addr, nb_loc_hab_ma=excluded.nb_loc_hab_ma, nb_loc_hab_ap=excluded.nb_loc_hab_ap, nb_loc_pro=excluded.nb_loc_pro,
id_dossier=excluded.id_dossier, id_adresse=excluded.id_adresse,geom=excluded.geom, raccordement=1, batimentimmeuble2=excluded.batimentimmeuble2
;
As you can see, I use several intermediary tables in this query : one to store the street's names (voie), one related to this one storing the adresses (adresse, basically numbers related through a foreign key to the street's names table), and another storing some other datas related to the projects' names (dossier).
I don't know what other information I could give to help find an answer, I guess it is better I do not share the actual content of my tables since it may touch some privacy regulations or such.
Thanks for your attention.
EDIT : I found a workaround by deleting the entries present in the zapms table from the bal2 table, as such
delete from geo_pays_gex.bal2 where bal2.identifiant_immeuble in (select id_parcelle from zapms);
it is not entirely satisfying though, since I would have prefered to keep track of the data creator and the date of creation, as much as the fact that the data has been modified (I have some fields to store this information) and here I simply erase all this history... And I have another table with the primary key of the bal2 table as a foreign key. I am still in the DB creation so I can afford to truncate this table, but in production it wouldn't be possible since I would lose some datas.
I want to load many rows from a CSV file.
The files contain data like these "article_name,article_time,start_time,end_time"
There is a contraint on the table: for the same article name, i don't insert a new row if the new article_time falls in an existing range [start_time,end_time] for the same article.
ie: don't insert row y if exists [start_time_x,end_time_x] for which time_article_y inside range [start_time_x,end_time_x] , with article_name_y = article_name_x
I tried with psycopg by selecting the existing article names ad checking manually if there is an overlap --> too long
I tried again with psycopg, this time by setting a condition 'exclude using...' and tryig to insert with specifying "on conflict do nothing" (so that it does not fail) but still too long
I tried the same thing but this time trying to insert many values at each call of execute (psycopg): it got a little better (1M rows processed in almost 10minutes), but still not as fast as it needs to be for the amount of data I have (500M+)
I tried to parallelize by calling the same script many time, on different files but the timing didn't get any better, I guess because of the locks on the table each time we want to write something
Is there any way to create a lock only on rows containing the same article_name? (and not a lock on the whole table?)
Could you please help with any idea to make this parallellizable and/or more time efficient?
Lots of thanks folks
Your idea with the exclusion constraint and INSERT ... ON CONFLICT is good.
You could improve the speed as follows:
Do it all in a single transaction.
Like Vao Tsun suggested, maybe COPY the data into a staging table first and do it all with a single SQL statement.
Remove all indexes except the exclusion constraint from the table where you modify data and re-create them when you are done.
Speed up insertion by disabling autovacuum and raising max_wal_size (or checkpoint_segments on older PostgreSQL versions) while you load the data.
Am inserting huge number of records containing Phone Number and Status (a million) in a postgresql database from hibernate. I am reading the records from a file, processing each, and inserting them one at a time. But before the insert I need to check whether this combination of phone number and status already exists in the table.
It seems to me that the fastest way would be to do a query and limit it by 1, or an Exists query, but another suggestion I got from a colleague is to add a unique constraint on the table on the phone number and status fields and in case the unique key rule is being violated, just catch the exception in hibernate.
Any thoughts on what's the fastest and most reliable method?
It depends whether there are only these columns, or also some others, for example date. If you don't care which record will stay in database (for example you need latest combination of number, status and date), then create unique constraint and recover from exception that is thrown when inserting duplicities.
You may also insert all with duplicities but with some primary key (id) and then delete all duplicities but one you like to have (group by...), then create unique constraint.
Last option depends on size of records - if it is only 1M of them, you may filter them in application layer and then save them.
All of these depend on how much duplicities are there, if just few, use option one, if each record may be there 10 times, maybe last option is the best (depending on RAM, but you will hold only currently best record for phone and status)