Postgres index creation time on temp table

I have a statement trigger that executes a function after some rows are updated.
With the transition table of the trigger (joined with some other tables) I need to create a temporary table, which I will use to query later.
The thing is, the query on this temp table is pretty fast, but as the number of rows increases it gets slower. So I was trying to add an index.
My question is: how can I measure the impact of creating the index? If the number of rows is low, the index is not used, so I need to know whether creating it is expensive.

For bigger tables, the cost of creating a B-tree index is dominated by the cost of sorting: O(n * log(n)).
For small tables, the cost is low. So I wouldn't worry about creating the index even for small tables where it is not used.
You shouldn't forget to ANALYZE the table, however, because autovacuum doesn't do that for temporary tables.
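To give a concrete picture, here is a minimal sketch of what this could look like inside the statement trigger, with hypothetical table, column and transition-table names:

CREATE OR REPLACE FUNCTION after_orders_update() RETURNS trigger AS $$
BEGIN
    -- materialize the transition table joined with another table
    DROP TABLE IF EXISTS changed_rows;
    CREATE TEMP TABLE changed_rows AS
        SELECT n.id, n.customer_id, c.region, n.amount
        FROM new_rows AS n
        JOIN customers AS c ON c.id = n.customer_id;

    -- cheap for small tables, pays off once the row count grows
    CREATE INDEX ON changed_rows (customer_id);

    -- autovacuum never analyzes temporary tables, so do it explicitly
    ANALYZE changed_rows;

    -- ... run the later queries against changed_rows here ...
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_after_update
    AFTER UPDATE ON orders
    REFERENCING NEW TABLE AS new_rows
    FOR EACH STATEMENT
    EXECUTE FUNCTION after_orders_update();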

Related

Temp table updates are slower than on a normal table in PostgreSQL

I have a situation where updates on my temp table are slow. Below is the scenario.
A temp table is created once per session; from then on, insert, update and delete operations run against it until the session ends.
First I insert the rows, and based on those rows I update other columns. These updates are slow compared to a normal table: I checked the performance by swapping in a normal table, which takes around 50 to 60 seconds, while the temp table takes nearly 5 minutes.
I tried ANALYZE on the temp table and got better performance; with ANALYZE the updates complete in about 50 seconds.
I also tried user-defined types, but no luck.
The record count in the temp table is 480.
Can anyone help improve the performance of the temp table without ANALYZE, or suggest an alternative to bulk collect and bulk insert into user-defined types?
All of the above operations are done in PostgreSQL.
The lack of information in your question forces me to guess, but if all other things are equal, the difference is probably that you don't have accurate statistics on the temporary table. For normal tables (which are visible to the public), autovacuum takes care of that automatically, but for temporary tables, you have to call ANALYZE explicitly to gather table statistics.
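A minimal example of that, with made-up table and column names:

CREATE TEMP TABLE session_data (id int PRIMARY KEY, total numeric, status text);

INSERT INTO session_data (id, total)
SELECT g, g * 10 FROM generate_series(1, 480) AS g;

-- gather statistics by hand; autovacuum skips temporary tables
ANALYZE session_data;

-- the planner now has accurate row estimates for the updates that follow
UPDATE session_data SET status = 'done' WHERE total > 1000;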

What is the best approach for upserting large number of rows into a single table?

I'm working on a product that involves a large number of upsert operations into a single table.
We are dealing with time-based data and use TimescaleDB hypertables with a 7-day chunk interval. We have concurrent tasks that upsert data into a single table, and in extreme cases we may have 40 concurrent tasks, each upserting around 250k rows, all to the same table.
Initially we decided to delete all the old rows and then insert the updated ones with a COPY FROM statement, but when we tested the system at scale these COPYs took a long time to finish, eventually driving the DB's CPU usage to 100% and making it unresponsive.
We also noticed that the index size of the table grew drastically, filled the disk to 100%, and SELECT statements took extremely long to execute (over 10 minutes). We concluded that the cause was the large number of delete statements producing index fragmentation, and decided to try another approach.
Following the answers on this post, we decided to copy all the data to a temporary table and then upsert it into the actual table with an "extended insert" statement -
INSERT INTO table SELECT * FROM temp_table ON CONFLICT DO UPDATE...;
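Spelled out with placeholder table and column names, such an upsert might look like:

INSERT INTO metrics (device_id, ts, value)
SELECT device_id, ts, value FROM temp_metrics
ON CONFLICT (device_id, ts)
DO UPDATE SET value = EXCLUDED.value;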
Our tests show that this helped with the index fragmentation issues, but large upsert operations of ~250k rows still take over 4 minutes to execute, and during the upsert SELECT statements take too long to finish, which is unacceptable for us.
I'm wondering what the best approach is for this upsert operation with as little impact on SELECT performance as possible. The only thing that comes to mind right now is to split the insert into smaller chunks -
INSERT INTO table SELECT * FROM temp_table LIMIT 50000 OFFSET 0 ON CONFLICT DO UPDATE ...;
INSERT INTO table SELECT * FROM temp_table LIMIT 50000 OFFSET 50000 ON CONFLICT DO UPDATE ...;
INSERT INTO table SELECT * FROM temp_table LIMIT 50000 OFFSET 100000 ON CONFLICT DO UPDATE ...;
...
But if we batch the inserts, is there any advantage to first copying all the data into a temporary table? Will it perform better than a simple multi-row insert statement?
And how do I decide on the best chunk size when splitting up the upsert? Does using a temporary table and upserting the rows directly from it allow for bigger chunk sizes?
Is there any better approach to achieve this? Any suggestion would be appreciated.
There are a handful of things that you can do to speed up data loading:
Have no index or foreign key on the table while you load data (check constraints are fine). I am not suggesting that you drop all your constraints, but you could for example use partitioning, load the data into a new table, then create indexes and constraints and attach the table as a new partition (see the sketch below).
Load the data with COPY. If you cannot use COPY, use a prepared statement for the INSERT to save on parsing time.
Load many rows in a single transaction.
Set max_wal_size high so that you get no more checkpoints than necessary.
Get fast local disks.
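As a rough sketch of the first point, assuming a plain range-partitioned table rather than a hypertable (all names and paths are made up):

-- load into a fresh, unindexed table
CREATE TABLE measurements_2024_06 (LIKE measurements INCLUDING DEFAULTS);
COPY measurements_2024_06 FROM '/tmp/measurements_2024_06.csv' WITH (FORMAT csv);

-- build indexes and constraints only after the data is in
CREATE INDEX ON measurements_2024_06 (device_id, created_at);
ALTER TABLE measurements_2024_06 ADD CONSTRAINT measurements_2024_06_range
    CHECK (created_at >= '2024-06-01' AND created_at < '2024-07-01');

-- the CHECK constraint matching the bounds lets ATTACH skip the validation scan
ALTER TABLE measurements ATTACH PARTITION measurements_2024_06
    FOR VALUES FROM ('2024-06-01') TO ('2024-07-01');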

Difference between BRIN index and table partitioning in PostgreSQL

What is the difference between a BRIN index and table partitioning in PostgreSQL? When should I use one instead of the other? It seems that they provide very similar benefits and also have similar use cases.
Example
Suppose we have the following table structure
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
store_id INT,
client_id INT,
created_at timestamp,
information jsonb
)
that has the following characteristics:
orders can only be inserted; deletions are not allowed, and updates are very rare and never involve the created_at column
the created_at column contains the timestamp of the insertion of the row in the database thus the values in the column are strictly increasing
almost every query uses the created_at column in a condition, and some may also use the store_id and client_id columns
the most accessed rows are the most recent ones in terms of the created_at column
some queries may return a few records (example: analyzing a single record or the records created in a small time interval) while others may scan a vast amount of records (example: aggregate functions for a dashboard functionality)
I have chosen this example because it's very common and, in my opinion, both approaches could be used. In this case, which should I choose: a BRIN index on the whole table, or a partitioned table with perhaps a btree index (or just a simple btree index without partitioning)? Does the table size influence the choice?
I have used both features (although I'll caveat that my experience with partitioning is from back when you had to use inheritance + constraints, before the introduction of CREATE TABLE ... PARTITION BY). You are correct that they seem similar-ish on a surface level, but they function by completely different mechanisms.
Table partitioning basically works as follows: replace all references to table with (select * from table_partition1 union all select * from table_partition2 /* repeat for all partitions */). The partitions will have a constraint on the partition columns, so that if those columns appear in a WHERE, the constraints can be applied up-front to prune which partitions are actually scanned. IOW, if table_partition1 has CHECK(client_id=1), and your WHERE has client_id=2, table_partition1 will be skipped, since the table constraint automatically excludes all rows of this partition from passing that WHERE.
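With the declarative syntax available in current versions, partitioning the example table by range on created_at might look roughly like this (note that a primary key on a partitioned table must include the partition key, so "id SERIAL PRIMARY KEY" cannot be kept as-is):

CREATE TABLE orders (
    id serial,
    store_id int,
    client_id int,
    created_at timestamp,
    information jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2024_q1 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE orders_2024_q2 PARTITION OF orders
    FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');

-- a WHERE on created_at only scans the matching partitions
SELECT count(*) FROM orders WHERE created_at >= '2024-05-01';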
BRIN indexes, in contrast, choose a block size for the table and then, for each block, record a min/max bound of the indexed column. This allows WHERE conditions to skip entire blocks when we can see, say, that the maximum created_at in a particular block of rows is below a created_at>={some_value} clause in your WHERE.
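For the example table, a BRIN index on created_at is a one-liner; pages_per_range is optional and shown only for illustration:

CREATE INDEX orders_created_at_brin ON orders
    USING brin (created_at) WITH (pages_per_range = 64);

-- range conditions on created_at can now skip whole block ranges
SELECT * FROM orders WHERE created_at >= now() - interval '1 day';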
I can't tell you a definitive answer for your case as to which would work better. Well, that's not true, actually: the definitive answer is, "benchmark it for your own data" ;)
This is kind of fuzzy, but my general feeling is that BRIN is lightweight, and table partitioning is not. BRIN is something that can be added on to an existing table without much trouble, the indexes themselves are very small, and the impact on writes is not major (at least, not without inordinately many indices). Table partitioning, on the other hand, is a different way of representing the data on-disk; you are actually determining into which data files particular rows will be written. This requires a much more involved migration process when introducing it to an existing dataset.
However, the set of query optimizations available for table partitioning is much greater. Not only is there the constraint exclusion I described above, but you can also have indices (even BRIN ones!) on each individual partition. Of course, you can also have BRIN + other indices on a single-big-table, but I'm not sure that is particularly helpful IRL.
A few other thoughts: BRIN is good for monotonic data (timestamps, incrementing IDs, etc.); the more correlated the on-disk ordering is with the indexed value, the more effective a BRIN index can be at pruning blocks to be scanned. Things like customer IDs, however, are unlikely to work well with BRIN; any given block of rows is likely to contain at least one relatively low and one relatively high ID. Fields like that, however, work quite well for partitioning: a partition per client, or partitioning on the modulus of a customer ID (which would more commonly be called sharding), is a good way of scaling horizontally, almost without bound.
Any update, even if it does not change the indexed column, will make a BRIN index pretty useless (unless it is a HOT update). Even without that, there are differences, for example:
partitioning allows you to get rid of lots of data efficiently, a BRIN index won't (see the sketch below)
a partitioned table allows one autovacuum worker per partition, which improves autovacuum performance
But if your only concern is to efficiently select all rows for a certain value of the index or partitioning key, both may offer about the same benefit.
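For instance, getting rid of an old quarter of data is a metadata-only operation with partitioning, while with only a BRIN index you are back to a regular DELETE plus vacuuming (partition name made up):

-- partitioned table: detach and drop, no row-by-row work
ALTER TABLE orders DETACH PARTITION orders_2023_q4;
DROP TABLE orders_2023_q4;

-- single table with a BRIN index: every row is deleted individually
-- and the dead tuples have to be vacuumed away afterwards
DELETE FROM orders WHERE created_at < '2024-01-01';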

Should I use a database trigger on an insert-heavy table?

My use-case is that I need to copy a few columns from TABLE A to another TABLE B and also derive values of a few other columns of TABLE B by some calculation.
As per the current estimate, around 50,000 rows will be inserted into TABLE A on a daily basis.
TABLE B should be updated with all data before end of day.
Hence, I can either use a trigger that is invoked on each INSERT into TABLE A, or schedule a job at EOD that reads all the data in bulk from TABLE A, does the calculation, and inserts into TABLE B.
As I am new to triggers, I am not sure which option I should pick for this use-case. Any suggestion on which would be the better approach?
From what I have read about triggers so far, they can slow down DB performance if they are invoked frequently.
As around 50,000 insert operations will happen daily, can I assume that 50,000 counts as a heavy workload where triggers would not be beneficial?
EDIT 1: The 50,000 daily insert operations will grow to 100,000 daily insert operations.
Postgres DB is used.
If you are doing bulk COPY into an unindexed table, adding a simple trigger will slow you down by a lot (like 5 fold). But if you are using single-row INSERTs or the table is indexed, the marginal slow down of adding a simple trigger will be pretty small.
50,000 inserts per day is not very many. You should have no trouble using a trigger on that, unless the trigger has to scan a large table on each call or something.
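For reference, such a trigger is typically just a few lines of PL/pgSQL; a sketch with made-up table and column names, where one target column is derived by a calculation:

CREATE OR REPLACE FUNCTION copy_to_table_b() RETURNS trigger AS $$
BEGIN
    INSERT INTO table_b (a_id, amount, amount_with_tax)
    VALUES (NEW.id, NEW.amount, NEW.amount * 1.19);  -- derived value
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_copy_to_b
    AFTER INSERT ON table_a
    FOR EACH ROW EXECUTE FUNCTION copy_to_table_b();

If most rows arrive via bulk COPY, a statement-level AFTER trigger with a transition table (REFERENCING NEW TABLE AS ...) can cut the per-row overhead considerably.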

Is the speed of a PostgreSQL SELECT adversely affected by too many indexes on the table?

I have read that having a lot of indexes on a database can seriously hurt performance, but I can't find anything about it in the PostgreSQL docs.
I have a very big table with something like 100 columns and a billion rows, and I often have to search on a lot of different fields.
Will the performance of the PostgreSQL table drop if I add a lot of indexes (maybe 10 single-column unique indexes and 5 or 7 three-column indexes)?
EDIT: By performance drop I mean the performance of fetching rows (SELECT); the database is updated only once a month, so update and insert times are not an issue.
Indexes are maintained whenever the content of the table is modified (i.e. INSERT, UPDATE, DELETE).
The PostgreSQL query planner decides when to use an index and when a sequential scan is cheaper.
So having too many indexes hurts modification performance, not fetching.
The indexes will have to be updated at each insert and update involving those columns.
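You can check whether a given query actually uses one of the indexes with EXPLAIN, and spot indexes that are never read via pg_stat_user_indexes (table and column names are made up):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM big_table WHERE customer_id = 42;

-- indexes with idx_scan = 0 only cost write overhead and disk space
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'big_table'
ORDER BY idx_scan;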
I have some charts about that on my site: http://use-the-index-luke.com/sql/dml
An index is pure redundancy. It contains only data that is also stored in the table. During write operations, the database must keep those redundancies consistent. Specifically, it means that insert, delete and update not only affect the table but also the indexes that hold a copy of the affected data.
The chapter titles suggest the impact that indexes can have:
Insert — cannot take direct benefit from indexes
Delete — uses indexes for the where clause
Update — does not affect all indexes of the table