unexplained variations in postgresql insert performance - postgresql

I have a processing pipeline that generates two streams of data, then joins the two streams to produce a third stream. Each stream of data is a timeseries over 1 year of 30 minute intervals (so 17520 rows). Both the generated streams and the joined stream are written into a single table keyed by a unique stream id and the timestamp of each point in the timeseries.
In abstract terms, the c and g series are generated by plpgsql functions which insert into the timeseries table from data stored elsewhere in the database (e.g. with a select) and then return the unique identifiers of the newly created series. The n series is generated with a join between the timeseries identified by c_id and g_id by the calculate_n() function which returns the id of the new n series.
To illustrate with some pseudo code:
-- generation transaction
begin;
c_id = select generate_c(p_c);
g_id = select generate_g(p_g);
end transaction;
-- calculation transaction
begin;
n_id = select calculate_n(c_id, g_id);
end transaction;
I observe that generate_c() and generate_g() typically run in a lot less than a second; however, the first time calculate_n() runs, it typically takes 1 minute.
However, if I run calculate_n() a second time with exactly the same parameters as the first run, it runs in less than a second. (calculate_n() generates a completely new series each time it runs; it is not reading or re-writing any data calculated by the first execution.)
If I stop the database server, restart it, then run calculate_n() on c_id and g_id calculated previously, the execution of calculate_n() also takes less than a second.
This is very confusing to me. I could understand the second run of calculate_n() taking only a second if, somehow, the first run had warmed a cache, but if that is so, then why does the third run (after a server restart) still run quickly when any such cache would have been cleared?
It appears to me that perhaps some kind of write cache, generated by the first generation transaction, is (unexpectedly) impeding the first execution of calculate_n(), but once calculate_n() completes, that cache is purged so that it doesn't get in the way of subsequent executions of calculate_n() when they occur. I have had a look at the activity of the shared buffer cache via pg_buffercache but didn't see any strong evidence that this was happening, although there was certainly evidence of cache activity across executions of calculate_n().
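A query along these lines (a sketch only, assuming the pg_buffercache extension is installed) can be used to count cached and dirty shared buffers per relation while the pipeline runs:
-- Sketch: cached vs. dirty shared buffers per relation in the current database.
-- Requires: CREATE EXTENSION pg_buffercache;
SELECT c.relname,
       count(*)                          AS buffers_cached,
       count(*) FILTER (WHERE b.isdirty) AS buffers_dirty
FROM   pg_buffercache b
JOIN   pg_class c ON c.relfilenode = b.relfilenode
WHERE  b.reldatabase = (SELECT oid FROM pg_database
                        WHERE  datname = current_database())
GROUP  BY c.relname
ORDER  BY buffers_cached DESC
LIMIT  20;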
I may be completely off-base about this being the result of an interaction with a write-cache that was populated by the first transaction, but I am struggling to understand why the performance of calculate_n() is so poor immediately after the first transaction completes but not at other times, such as immediately after the first attempt or after the database server is restarted.
I am using postgres 11.6.
What explains this behaviour?
update:
So, further on this: running vacuum analyze between the two generate steps and the calculate step did improve the performance of the calculate step, but I found that if I repeated the steps, I needed to run vacuum analyze between the generate steps and the calculate step every time I executed the sequence, which doesn't seem like a particularly practical thing to do (since you can't call vacuum analyze in a function or a procedure). I understand the need to run vacuum analyze at least once with a reasonable number of rows in the table, but do I really need to do it every time I insert 34000 more rows?
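One possible direction (a sketch only, in the pseudo-code style used above; "timeseries" is an assumed table name) is to run a plain ANALYZE rather than VACUUM ANALYZE. Unlike VACUUM, ANALYZE is allowed inside a transaction block, so it can sit between the generate steps and the calculate step, even when issued from plpgsql:
-- generation transaction, with a statistics refresh before the join runs
begin;
c_id = select generate_c(p_c);
g_id = select generate_g(p_g);
analyze timeseries;   -- ANALYZE alone may run inside a transaction block
end transaction;
-- calculation transaction as before
begin;
n_id = select calculate_n(c_id, g_id);
end transaction;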

Related

row caching with Postgres

Being new to Postgres, I have a question regarding caching. I do not know how much is handled by the database, vs. how much I need to handle myself.
I have rows which are part of a time series. For the sake of example, let's say I have 1 row every second.
My query is to get the last minute (60 rows), on a rolling basis (from now - 1min to now).
Ideally, I could cache the last result in the client and request the last second.
Unfortunately, the data has holes in it: some rows are missing and may be added a few seconds later. So I need to query the whole thing anyway.
I could find where the holes are and make a query just for these, this is also an option.
What I was wondering is: how much caching does Postgres perform on its own?
If I have a query returning rows 3, 4, 5, 6 and my next query returns 4, 5, 6, 7. Would rows 4, 5, 6 have to be fetched again, or are they cached by the database?
Postgres includes an extensive caching system. It is very likely that subsequent executions of the same query will use the cached data, but we have virtually no influence on it because the cache works autonomously. What can and should be done is to use prepared statements. Per the documentation:
A prepared statement is a server-side object that can be used to optimize performance. When the PREPARE statement is executed, the specified statement is parsed, analyzed, and rewritten. When an EXECUTE command is subsequently issued, the prepared statement is planned and executed. This division of labor avoids repetitive parse analysis work, while allowing the execution plan to depend on the specific parameter values supplied.
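A minimal sketch of that pattern (the table and column names here are made up for illustration):
-- Prepare once per session; parse/analyze/rewrite work is done here.
PREPARE last_minute (timestamptz) AS
    SELECT * FROM readings      -- hypothetical table
    WHERE  ts >= $1             -- hypothetical timestamp column
    ORDER  BY ts;
-- Each EXECUTE only plans and runs the statement with the supplied parameter.
EXECUTE last_minute (now() - interval '1 minute');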
However, the scenario described in the question suggests a completely different approach to the issue, namely the use of the NOTIFY / LISTEN feature. In brief:
the server sends a notification every time a row is inserted into the table,
the application listens on the agreed channel and receives newly entered rows.
In this solution, the application performs the query only once and completes the data from notifications.
Sample trigger function:
create or replace function before_insert_on_my_table()
returns trigger language plpgsql as $$
begin
    perform pg_notify('my_data', new::text);
    return new;
end $$;

create trigger before_insert_on_my_table
before insert on my_table
for each row execute procedure before_insert_on_my_table();
The implementation of the LISTEN statement in an application depends on the language used. For example, the Python library psycopg2 has convenient tools for it.
Read more about NOTIFY in the docs.
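For a quick test without writing any application code, the channel from the trigger above can be watched directly in psql (sketch):
-- In a separate psql session:
LISTEN my_data;
-- After another session inserts into my_table, psql prints something like
--   Asynchronous notification "my_data" with payload "(...)" received
-- the next time it processes a command.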
PostgreSQL caches data automatically. Data are read, written and cached in units of 8kB, and whenever you access a row, the block containing the row will be cached.
All you have to make sure is that you have an index on the timestamp of your table, otherwise PostgreSQL has to read the whole table to compute the result. With an index it will read (and cache) only the blocks that contain the data you need, and that will speed up your next query for these data.
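For example (a sketch; the table and timestamp column names are assumed):
-- Assumed names: my_table with a timestamp column ts.
CREATE INDEX IF NOT EXISTS my_table_ts_idx ON my_table (ts);

-- The rolling "last minute" query can then use an index scan that reads
-- (and caches) only the relevant blocks:
SELECT *
FROM   my_table
WHERE  ts >= now() - interval '1 minute'
ORDER  BY ts;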

How to avoid being blocked by deadlocks?

Can I write an UPDATE statement that will simply not bother executing if there's a deadlock?
I have a small, but frequently updated table.
This statement is run quite frequently on it....
UPDATE table_a SET lastChangedTime = 'blah' WHERE pk = 1234;
Where pk is the primary key.
Every now and again this statement gets blocked. That's not in itself a big deal; the issue is that each time there's a lock it seems to take a minute or two for Postgres to sort itself out, and I can lose a lot of data.
table_a is very volatile, and lastChangedTime gets altered all the time, so rather than occasionally having to wait two minutes for the UPDATE to get executed, I'd rather it just didn't bother. Ok, my data might not be as up-to-date as I'd like for this one record, but at least I wouldn't have locked the whole table for 2 minutes.
Update following comments:
the application interacts very simply with the database; it only issues simple, one-line UPDATE and INSERT statements and commits each one immediately. One of the issues causing me a lot of head scratching is how something can work a million times without a problem and then just fail on another record that appears to be identical to all the others.
Final suggestion/question.....
The UPDATE statement is being invoked from a C# application. If I change the 'command timeout' to a very short value, say 1 millisecond, would that have the desired effect? Or might it end up clogging up the database with lots of broken transactions?
To avoid waiting for locks, first run
SELECT 1 FROM table_a WHERE pk = 1234 FOR UPDATE NOWAIT;
If there is a lock on the row, the statement will fail immediately, and you can go on working on something different.
Mind that the SELECT ... FOR UPDATE statement has to be in the same database transaction as your UPDATE statement.
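Put together, it could look roughly like this (a sketch only; the DO block is one way to catch the NOWAIT failure on the server side, and now() stands in for the real value being written):
-- Sketch: try to lock the row; if another session holds it, skip the update.
DO $$
BEGIN
    PERFORM 1 FROM table_a WHERE pk = 1234 FOR UPDATE NOWAIT;
    UPDATE table_a SET lastChangedTime = now() WHERE pk = 1234;
EXCEPTION
    WHEN lock_not_available THEN
        NULL;   -- row is locked elsewhere; don't bother waiting
END $$;
Because the whole DO block runs in a single transaction, the lock taken by the SELECT still covers the UPDATE, as required.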
As a general advice, you should use shorter transactions, which will reduce the length of lock waits and the risk of deadlock.

Benchmarking Redshift Queries

I want to know how long my queries take to execute, so that I can see whether my changes improve the runtime or not.
Simply timing the execution of the whole query is unsuitable, since this also takes into account the (highly variable) time spent waiting in an execution queue.
Redshift provides the STL_WLM_QUERY table that contains separate columns for queue wait time and execution time. However, my queries do not reliably show up in this table. For example if I execute the same query multiple times the number of corresponding rows in STL_WLM_QUERY is often much smaller than the number of repetitions. Sometimes, but not always, only one row is generated no matter how often I run the query. I suspect some caching is going on.
Is there a better way to find the actual execution time of a Redshift query, or can someone at least explain under what circumstances exactly a row in STL_WLM_QUERY is generated?
My tips
If possible, ensure that your query has not waited at all; if it has, there should be a row in stl_wlm_query (see the sketch below). If it did wait, rerun it.
Run the query once to compile it, then a second time to benchmark it. Compile time can be significant.
Disable the new query result caching feature (if you have it yet - you probably don't): https://aws.amazon.com/about-aws/whats-new/2017/11/amazon-redshift-introduces-result-caching-for-sub-second-response-for-repeat-queries/
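When a row is present, the queue wait and the execution time can be read separately. A sketch (times in stl_wlm_query are reported in microseconds, and pg_last_query_id() refers to the last query run in the current session):
-- Sketch: separate queue wait from execution time for this session's last query.
SELECT query,
       service_class,
       total_queue_time / 1000000.0 AS queue_seconds,
       total_exec_time  / 1000000.0 AS exec_seconds
FROM   stl_wlm_query
WHERE  query = pg_last_query_id();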

Executing same query makes time difference in postgresql

I just want to know what is the reason for having different time while executing the same query in PostgreSQL.
For example: select * from datas;
The first time it takes 45 ms.
The second time the same query takes 55 ms, and the next time it takes some other time again. Can anyone say what the reason is for this non-static timing?
Simple: every time, the database has to read the whole table and retrieve the rows. There might be 100 different things happening in the database that cause a difference of a few milliseconds. There is no need to panic; this is bound to happen. You can expect the operation to take the same time to within a few milliseconds. If there is a huge difference, then it is something that has to be looked into.
Have you applied indexing to your table? It also increases speed a great deal!
Compiling the explanation from the reference by matt b:
The EXPLAIN statement displays the execution plan that the PostgreSQL planner generates for the supplied statement.
The execution plan shows how the table(s) referenced by the statement will be scanned — by plain sequential scan, index scan, etc. — and if multiple tables are referenced, what join algorithms will be used to bring together the required rows from each input table.
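For example, using the table name from the question (a sketch; EXPLAIN ANALYZE actually runs the statement and reports real timings, and BUFFERS adds cache-hit information):
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM datas;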
And from the reference by Pablo Santa Cruz:
You need to change your PostgreSQL configuration file.
Enable this property by setting it to 0 (log all statements and their durations) or to a positive threshold in milliseconds; the default of -1 disables it:
log_min_duration_statement = 0  # -1 is disabled, 0 logs all statements
                                # and their durations, > 0 logs only
                                # statements running at least this number
                                # of milliseconds
After that, execution times will be logged and you will be able to figure out exactly how badly (or well) your queries are performing.
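For instance, the setting can be changed and reloaded without restarting the server (a sketch; requires appropriate privileges):
-- Log every statement with its duration; use a positive value (in ms)
-- to log only statements slower than that threshold.
ALTER SYSTEM SET log_min_duration_statement = 0;
SELECT pg_reload_conf();   -- reload the configuration without a restart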
Well, that's about the case with every app on every computer. Sometimes the operating system is busier than at other times, so it takes more time to get the memory you ask it for, or your app gets fewer CPU time slices, or whatever.

Can multiple threads cause duplicate updates on constrained set?

In postgres if I run the following statement
update table set col = 1 where col = 2
In the default READ COMMITTED isolation level, from multiple concurrent sessions, am I guaranteed that:
In a case of a single match only 1 thread will get a ROWCOUNT of 1 (meaning only one thread writes)
In a case of a multi match that only 1 thread will get a ROWCOUNT > 0 (meaning only one thread writes the batch)
Your stated guarantees apply in this simple case, but not necessarily in slightly more complex queries. See the end of the answer for examples.
The simple case
Assuming that col1 is unique, has exactly one value "2", or has stable ordering so every UPDATE matches the same rows in the same order:
What'll happen for this query is that the threads will find the row with col=2 and all try to grab a write lock on that tuple. Exactly one of them will succeed. The others will block waiting for the first thread's transaction to commit.
That first tx will write, commit, and return a rowcount of 1. The commit will release the lock.
The other tx's will again try to grab the lock. One by one they'll succeed. Each transaction will in turn go through the following process:
Obtain the write lock on the contested tuple.
Re-check the WHERE col=2 condition after getting the lock.
The re-check will show that the condition no longer matches so the UPDATE will skip that row.
The UPDATE matches no other rows, so it will report zero rows updated.
Commit, releasing the lock for the next tx trying to get hold of it.
In this simple case the row-level locking and the condition re-check effectively serializes the updates. In more complex cases, not so much.
You can easily demonstrate this. Open, say, four psql sessions. In the first, lock the table with BEGIN; LOCK TABLE test;*. In the rest of the sessions, run identical UPDATEs; they'll block on the table-level lock. Now release the lock by COMMITting your first session. Watch them race. Only one will report a row count of 1; the others will report 0. This is easily automated and scripted for repetition and scaling up to more connections/threads.
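A sketch of that demonstration, with the sessions interleaved as comments:
-- Session 1: take a table lock so the competing updates queue up behind it.
BEGIN;
LOCK TABLE test;

-- Sessions 2..4: run the identical statement; each one blocks on the lock.
UPDATE test SET col = 1 WHERE col = 2;

-- Session 1 again: release the lock and let the updates race.
COMMIT;
-- Exactly one waiting session reports UPDATE 1; the rest report UPDATE 0.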
To learn more, read rules for concurrent writing, page 11 of PostgreSQL concurrency issues - and then read the rest of that presentation.
And if col1 is non-unique?
As Kevin noted in the comments, if col isn't unique so you might match multiple rows, then different executions of the UPDATE could get different orderings. This can happen if they choose different plans (say one is via a PREPARE and EXECUTE and another is direct, or you're messing with the enable_ GUCs) or if the plan they all use uses an unstable sort of equal values. If they get the rows in a different order then tx1 will lock one tuple, tx2 will lock another, then they'll each try to get locks on each others' already-locked tuples. PostgreSQL will abort one of them with a deadlock exception. This is yet another good reason why all your database code should always be prepared to retry transactions.
If you're careful to make sure concurrent UPDATEs always get the same rows in the same order you can still rely on the behaviour described in the first part of the answer.
Frustratingly, PostgreSQL doesn't offer UPDATE ... ORDER BY, so ensuring that your updates always select the same rows in the same order isn't as simple as you might wish. A SELECT ... FOR UPDATE ... ORDER BY followed by a separate UPDATE is often safest.
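A sketch of that safer pattern, using the example table (the id ordering column is assumed; any stable ordering works):
BEGIN;
-- Lock the target rows in a deterministic order first...
SELECT 1 FROM test WHERE col = 2 ORDER BY id FOR UPDATE;
-- ...then perform the actual update while holding those locks.
UPDATE test SET col = 1 WHERE col = 2;
COMMIT;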
More complex queries, queuing systems
If you're doing queries with multiple phases, involving multiple tuples, or conditions other than equality you can get surprising results that differ from the results of a serial execution. In particular, concurrent runs of anything like:
UPDATE test SET col = 1 WHERE col = (SELECT t.col FROM test t ORDER BY t.col LIMIT 1);
or other efforts to build a simple "queue" system will fail to work the way you expect. See the PostgreSQL docs on concurrency and this presentation for more info.
If you want a work queue backed by a database there are well-tested solutions that handle all the surprisingly complicated corner cases. One of the most popular is PgQ. There's a useful PgCon paper on the topic, and a Google search for 'postgresql queue' is full of useful results.
* BTW, instead of a LOCK TABLE you can use SELECT 1 FROM test WHERE col = 2 FOR UPDATE; to obtain a write lock on just that one tuple. That'll block updates against it but not block writes to other tuples or block any reads. That allows you to simulate different kinds of concurrency issues.