I have a plv8 function (but I assume the same thing applies to any language). It could be called a lot in one statement (a SELECT, say). It computes expensive stuff, so I don't want to recompute it every time. But the stuff it computes depends on the contents of the database. I can naively cache things in the function, but that cache will never get cleared, so after the first call I am always operating on old data. It would be nice to flush it at the start of each statement execution.
Note that triggering off changes to the tables the cache depends on doesn't work: the cache exists in connection A, while the DB can be changed by connection B (or C, ...).
My question:
How do I clear the database's cache, so that the same query will always take the "real" time to run?
The context:
I'm trying to improve the runtime of a query. The plan is to run the query once, then run EXPLAIN on it and add some relevant indexes based on the explanation's output, and finally run the query again.
I was told that caching that occurs in the database might affect the results of my tests.
What is the simplest way to clear the cache, or to have a clean slate for tests in general?
Restarting the database will clear the database's shared_buffers cache. It will not clear the filesystem cache, which PostgreSQL relies upon heavily.
On Linux, writing 1 into the file /proc/sys/vm/drop_caches will drop the FS cache (do this after restarting the database), but you need to be a privileged user to do that. Other OSes have other methods.
It is dubious that this produces times that are more "real". They could easily be less "real". How often do you reboot your production server in reality? Usually better would be to write a driver script that runs the same query repeatedly but with different parameters so that it hits different parts of the data.
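For instance, a rough sketch of such a driver in psql (the table, column, and parameter values are made-up placeholders for your own query):
\timing on
PREPARE q(int) AS SELECT count(*) FROM orders WHERE customer_id = $1;
EXECUTE q(17);
EXECUTE q(4242);
EXECUTE q(90125);
DEALLOCATE q;
Varying the parameter between runs hits different parts of the data, so the timings reflect a realistic mix of cached and uncached pages rather than an artificially cold cache.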
DISCARD releases internal resources associated with a database session. This command is useful for partially or fully resetting the session's state. There are several subcommands to release different types of resources; the DISCARD ALL variant subsumes all the others and also resets additional state. Try DISCARD ALL, which is equivalent to the following:
SET SESSION AUTHORIZATION DEFAULT;
RESET ALL;
DEALLOCATE ALL;
CLOSE ALL;
UNLISTEN *;
SELECT pg_advisory_unlock_all();
DISCARD PLANS;
DISCARD SEQUENCES;
DISCARD TEMP;
N.B.: DISCARD ALL cannot be executed inside a transaction block.
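So between test runs, in the same session and outside of any transaction block, you would simply run:
DISCARD ALL;
Note that this only resets session-level state (cached plans, temp tables and so on); it does not empty shared_buffers or the OS cache.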
A metadata change should automatically invalidate the affected plans. So altering the table or creating/dropping indexes should do it, and you don't need to do anything special.
The ANALYZE command also does it.
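For illustration (the table and index names are made up), any of these will invalidate cached plans that reference the table:
ALTER TABLE orders ADD COLUMN note text;            -- metadata change
CREATE INDEX orders_status_idx ON orders (status);  -- new index
DROP INDEX orders_status_idx;                       -- dropped index
ANALYZE orders;                                     -- fresh statistics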
I have read some materials saying that each UPDATE statement in PostgreSQL is atomic.
For example,
update my_table set column_1 = column_1 + 100 where column_2 = 10;
Even though I have multiple processes calling the update simultaneously, I can rely on them happening in sequence, because each update is atomic behind the scenes and the read-modify-write cycle is encapsulated as a single unit.
However, what if the update statement looks like the following:
update my_table set column_1 = myFunction(column_1) where column_2 = 10;
Here, myFunction() is a stored procedure created by me. In this function, I will apply different math operations to column_1 depending on its amount. Something like:
if (column_1 < 10):
    // do something
else if (column_1 < 20):
    // do something
else:
    // do something
In this case, when the single update statement contains a self-defined function, does it remain atomic?
OK, @Schwern's knowledge of Perl may well be world class, but as regards PostgreSQL transactions, I can correct him :-)
Every statement in PostgreSQL is executed within a transaction, either an explicit one you BEGIN/COMMIT yourself or an implicit one wrapping the statement. For the duration of a statement you will see a stable view of the entire database.
If you write myFunction as an in-database custom function in pl/pgsql or some such, then it too will be in the same transaction as the statement that calls it. If it doesn't run its own queries and just operates on its parameters, then you don't need to think any further.
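For illustration, a minimal pl/pgsql sketch of such a myFunction (the arithmetic is just a made-up stand-in for "do something"); since it only operates on its argument, it is evaluated entirely inside the calling UPDATE's transaction:
CREATE OR REPLACE FUNCTION myFunction(val numeric) RETURNS numeric AS $$
BEGIN
    IF val < 10 THEN
        RETURN val + 100;  -- stand-in for "do something"
    ELSIF val < 20 THEN
        RETURN val * 2;    -- stand-in for "do something"
    ELSE
        RETURN val;        -- stand-in for "do something"
    END IF;
END;
$$ LANGUAGE plpgsql;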
If you are reading from tables within your function then you will need a passing familiarity with transaction isolation levels. In particular, make sure you understand what "read committed" implies about seeing other processes' activities.
The blog article you refer to is discussing performing operations outside of the database. The solution it proposes is exactly what you are asking about - an atomic update.
The update, subqueries, and queries in function calls should all see a consistent view of the data.
From Chapter 13. Concurrency Control.
Internally, data consistency is maintained by using a multiversion model (Multiversion Concurrency Control, MVCC). This means each SQL statement sees a snapshot of data (a database version) as it was some time ago, regardless of the current state of the underlying data. This prevents statements from viewing inconsistent data produced by concurrent transactions performing updates on the same data rows, providing transaction isolation for each database session.
What that means, I believe, is that for each statement Postgres keeps a version of the data. Each statement sees a consistent version of the database for the entirety of its run. That includes sub-queries and function calls.
This doesn't mean you don't have to think about concurrency. It just means that update will see consistent data.
If I change a PL/pgSQL function that is used by another PL/pgSQL function, will PostgreSQL rebuild the execution plan for both of them or only for the changed one?
Say I have two functions using a third one: check_permission(user_id) is used by get_company(user_id) and get_location(user_id).
And say their execution plans got cached somehow.
If I then change check_permission, would the cached execution plans for get_company(user_id) and get_location(user_id) be invalidated and rebuilt on demand?
PostgreSQL tracks dependencies, and it flushes caches pretty aggressively when things change.
If you change a function, it'll invalidate at least the plans of all functions that depend on it. In practice, IIRC it just flushes all cached query plans entirely.
The same is true of views that depend on other views, prepared statements that reference views, etc.
If you find a case where it fails to do so you have found a bug. Please report it with a complete reproducible test case.
I'm trying to run a SELECT statement on PostgreSQL database and save its result into a file.
The code runs on my environment but fails once I run it on a lightweight server.
I monitored it and saw that the reason it fails after several seconds is due to a lack of memory (the machine has only 512MB RAM). I didn't expect this to be a problem, as all I want to do is to save the whole result set as a JSON file on disk.
I was planning to use fetchrow_array or fetchrow_arrayref functions hoping to fetch and process only one row at a time.
Unfortunately I discovered that, when you use DBD::Pg, there's no difference in the actual fetch behaviour between the two above and fetchall_arrayref. My script fails at the $sth->execute() call, even before it has a chance to call any fetch... function.
This suggests to me that the implementation of execute in DBD::Pg actually fetches ALL the rows into memory, leaving only the formatting of the returned data to the fetch... functions.
A quick look at the DBI documentation gives a hint:
If the driver supports a local row cache for SELECT statements, then this attribute holds the number of un-fetched rows in the cache. If the driver doesn't, then it returns undef. Note that some drivers pre-fetch rows on execute, whereas others wait till the first fetch.
So in theory I would just need to set the RowCacheSize parameter. I've tried, but this feature doesn't seem to be implemented by DBD::Pg:
Not used by DBD::Pg
I find this limitation a huge general problem (the execute() call pre-fetches all rows?) and am more inclined to believe that I'm missing something here than that this is actually a true limitation of interacting with PostgreSQL databases using Perl.
Update (2014-03-09): My script works now thanks to the workaround described in my comment on Borodin's answer. The maintainer of the DBD::Pg library got back to me on the issue, saying the root cause is deeper and lies within libpq, the PostgreSQL client library used by DBD::Pg. Also, I think a very similar issue to the one described here affects pgAdmin. Despite being a PostgreSQL-native tool, it still doesn't offer an option to set a default limit on the number of result rows, which is probably why its Query tool sometimes waits a good while before presenting results from bulky queries, and can break the app in some cases too.
In the section Cursors, the documentation for the database driver says this
Therefore the "execute" method fetches all data at once into data structures located in the front-end application. This fact must be considered when selecting large amounts of data!
So your supposition is correct. However the same section goes on to describe how you can use cursors in your Perl application to read the data in chunks. I believe this would fix your problem.
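Under the hood, using a cursor from DBD::Pg just means sending ordinary SQL; as a rough sketch (the cursor and table names are hypothetical), your Perl code would issue statements like:
BEGIN;
DECLARE big_rows CURSOR FOR SELECT * FROM my_table;
FETCH 1000 FROM big_rows;  -- repeat until it returns no rows
CLOSE big_rows;
COMMIT;
Each FETCH brings back only one batch, so memory use is bounded by the batch size rather than by the full result set.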
Another alternative is to use OFFSET and LIMIT clauses on your SELECT statement to emulate cursor functionality. If you write
my $select = $dbh->prepare('SELECT * FROM my_table ORDER BY id OFFSET ? LIMIT 1');  # ORDER BY a unique column (a hypothetical id here) keeps OFFSET paging stable
then you can say something like (all of this is untested)
my $i = 0;
while ($select->execute($i++)) {
    my @data = $select->fetchrow_array;
    last unless @data;  # stop once the query returns no more rows
    # Process data
}
to read your tables one row at a time.
You may find that you need to increase the chunk size to get an acceptable level of efficiency.
I have a table called deposits
When a deposit is made, the table is locked, so the query looks something like:
SELECT * FROM deposits WHERE id=123 FOR UPDATE
I assume FOR UPDATE is locking the table so that we can manipulate it without another thread stomping on the data.
The problem occurs, though, when other deposits are trying to get the lock on the table. What happens is that somewhere between locking the table and calling psql_commit(), something fails and keeps the lock for a stupidly long amount of time. There are a couple of things I need help addressing:
Subsequent queries trying to get the lock should fail. I have tried achieving this with NOWAIT, but would prefer a timeout method (because it may be OK to wait, just not for a 'stupid amount of time').
Ideally I would head this off at the pass and have my initial query only hold the lock for a certain amount of time. Is this possible with PostgreSQL?
Is there some other magic function I can tack onto the query (similar to NOWAIT) which will only wait for the lock for 4 seconds before failing?
Due to the painfully monolithic spaghetti-code nature of the code base, it's not simply a matter of changing global configs; it really needs to be a per-query solution.
Thanks for your help, guys; I will keep poking around but I haven't had much luck. Is this a non-existent feature of PostgreSQL? Because I found this: http://www.postgresql.org/message-id/40286F1F.8050703@optusnet.com.au
I assume FOR UPDATE is locking the table so that we can manipulate it without another thread stomping on the data.
Nope. FOR UPDATE locks only those rows, so that another transaction that attempts to lock them (with FOR SHARE, FOR UPDATE, UPDATE or DELETE) blocks until your transaction commits or rolls back.
If you want a whole table lock that blocks inserts/updates/deletes you probably want LOCK TABLE ... IN EXCLUSIVE MODE.
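A minimal sketch of that, if you really do want to serialise all writes to deposits:
BEGIN;
LOCK TABLE deposits IN EXCLUSIVE MODE;  -- blocks concurrent INSERT/UPDATE/DELETE; plain SELECTs still run
-- ... do the deposit work ...
COMMIT;  -- the lock is released here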
Subsequent queries trying to get the lock should fail. I have tried achieving this with NOWAIT, but would prefer a timeout method (because it may be OK to wait, just not for a 'stupid amount of time').
See the lock_timeout setting. This was added in 9.3 and is not available in older versions.
Crude approximations for older versions can be achieved with statement_timeout, but that can lead to statements being cancelled unnecessarily. If statement_timeout is 1s and a statement waits 950ms on a lock, it might then get the lock and proceed, only to be immediately cancelled by a timeout. Not what you want.
There's no query-level way to set lock_timeout, but you can and should just:
SET LOCAL lock_timeout = '1s';
after you BEGIN a transaction.
Ideally I would head this off at the pass and have my initial query only hold the lock for a certain amount of time. Is this possible with PostgreSQL?
There is a statement timeout, but locks are held at transaction level. There's no transaction timeout feature.
If you're running single-statement transactions you can just set a statement_timeout before running the statement to limit how long it can run for. This isn't quite the same thing as limiting how long it can hold a lock, though, because it might wait 900ms of an allowed 1s for the lock, only actually hold the lock for 100ms, then get cancelled by the timeout.
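A sketch of that, with purely illustrative values and column names:
SET statement_timeout = '5s';  -- caps total statement runtime, including time spent waiting on locks
UPDATE deposits SET amount = amount + 100 WHERE id = 123;
RESET statement_timeout;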
Is there some other magic function I can tack onto the query (similar to NOWAIT) which will only wait for the lock for 4 seconds before failing?
No. You must:
BEGIN;
SET LOCAL lock_timeout = '4s';
SELECT ....;
COMMIT;
Due to the painfully monolithic spaghetti-code nature of the code base, it's not simply a matter of changing global configs; it really needs to be a per-query solution.
SET LOCAL is suitable, and preferred, for this.
There's no way to do it in the text of the query, it must be a separate statement.
The mailing list post you linked to is a proposal for an imaginary syntax that was never implemented (at least in a public PostgreSQL release) and does not exist.
In a situation like this you may want to consider "optimistic concurrency control", often called "optimistic locking". It gives you greater control over locking behaviour at the cost of increased rates of query repetition and the need for more application logic.
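A minimal sketch of the optimistic approach, assuming you maintain a version column yourself (the column names here are hypothetical):
UPDATE deposits
SET    amount = 250, version = version + 1
WHERE  id = 123 AND version = 7;
-- UPDATE 0 means another transaction got there first: re-read the row and retry in the application
No lock is held between the read and the write; a retry loop in your application replaces the blocking behaviour.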