Clearing cache in PostgreSQL - postgresql

My question:
How do I clear the database's cache so that the same query always takes the "real" time to run?
The context:
I'm trying to improve the runtime of a query. The plan is to run the query once, then run EXPLAIN on it, add some relevant indexes based on the explanation's output, and finally run the query again.
I was told that caching that occurs in the database might affect the results of my tests.
What is the simplest way to clear the cache, or to have a clean slate for tests in general?

Restarting the database will clear the database's shared_buffers cache. It will not clear the filesystem cache, which PostgreSQL relies upon heavily.
On Linux, writing 1 into the file /proc/sys/vm/drop_caches will drop the filesystem cache (do this after restarting the database), but you need to be a privileged user to do that. Other operating systems have other methods.
It is dubious that this produces times that are more "real"; they could easily be less "real". How often do you reboot your production server in reality? It is usually better to write a driver script that runs the same query repeatedly but with different parameters, so that it hits different parts of the data.
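For example, a rough sketch of such a test, using EXPLAIN (ANALYZE, BUFFERS) (available since 9.0) to see how many blocks were read from the cache versus from disk; the orders table and customer_id column here are hypothetical:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 137;  -- a different parameter touches different parts of the data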

DISCARD releases internal resources associated with a database session. This command is useful for partially or fully resetting the session's state. There are several subcommands to release different types of resources; the DISCARD ALL variant subsumes all the others and also resets additional state. It is equivalent to the following sequence:
SET SESSION AUTHORIZATION DEFAULT;
RESET ALL;
DEALLOCATE ALL;
CLOSE ALL;
UNLISTEN *;
SELECT pg_advisory_unlock_all();
DISCARD PLANS;
DISCARD SEQUENCES;
DISCARD TEMP;
N.B.: DISCARD ALL cannot be executed inside a transaction block.
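So, outside a transaction block, a single statement covers everything in the list above:
DISCARD ALL;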

A metadata change should automatically invalidate the affected plans, so altering the table or creating/dropping indexes should do it, and you don't need to do anything special.
The ANALYZE command also does it.
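A minimal sketch of that behaviour, assuming a hypothetical accounts table and a prepared statement in the same session:
PREPARE q AS SELECT * FROM accounts WHERE balance > 100;
EXECUTE q;                                                -- planned and cached on first use
CREATE INDEX accounts_balance_idx ON accounts (balance);  -- metadata change invalidates the cached plan
EXECUTE q;                                                -- re-planned; can now consider the new index
ANALYZE accounts;                                         -- fresh statistics also trigger re-planning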

Related

Script to track Database change

I need to track any changes to the data in a PostgreSQL database. Is there any option in the database, or any script, to view the changed data and the DML as well?
Sorry - I have no clue. But I do have some different suggestions:
Log /all/ queries and grep for those involving UPDATE, DELETE, INSERT, ALTER TABLE, etc. Caveats: this may cause performance problems if there are lots of queries and the log is on the same RAID as the data and/or WAL. It is not easy to write a regexp that is 100% certain to catch all modifying statements, and it may be difficult to catch rollbacks, etc. To log everything, add this to the configuration file: log_min_duration_statement = 0. Check that the other log_* configuration variables are sane as well.
The rules/trigger approach (as hinted by another user): I believe it involves writing rules or triggers for each and every table, but it is of course doable (and it should be possible to create them through an external script if you have a lot of tables); see the sketch after this list. You may also look a bit into how Slony works; Slony is a trigger-based replication system, and it should be possible to use it to catch all the changes in the DB.
All changes to the database end up in the WAL, so it may be theoretically possible to extract something from it, but I suspect that's not practical unless you're already a skilled Postgres hacker ... and if you're a skilled Postgres hacker, you probably wouldn't be asking this question in the first place ;-) (Eventually, the WALs may be used to see the rate of changes in the data and spot the times of day when there are more updates than otherwise, etc. They may also be used for replication and roll-forward from a binary backup.)
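A minimal sketch of that trigger approach, using a hypothetical audit_log table and a hypothetical accounts table as the table being tracked:
CREATE TABLE audit_log (
    changed_at timestamptz NOT NULL DEFAULT now(),
    table_name text NOT NULL,
    operation  text NOT NULL,
    old_row    text,
    new_row    text
);

CREATE OR REPLACE FUNCTION log_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO audit_log (table_name, operation, old_row)
        VALUES (TG_TABLE_NAME, TG_OP, OLD::text);
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO audit_log (table_name, operation, old_row, new_row)
        VALUES (TG_TABLE_NAME, TG_OP, OLD::text, NEW::text);
    ELSE  -- INSERT
        INSERT INTO audit_log (table_name, operation, new_row)
        VALUES (TG_TABLE_NAME, TG_OP, NEW::text);
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

-- repeat (or generate) one trigger per table you want to track
CREATE TRIGGER accounts_audit
AFTER INSERT OR UPDATE OR DELETE ON accounts
FOR EACH ROW EXECUTE PROCEDURE log_change();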
Besides setting log_statement='all' in postgresql.conf, you can also use tablelog to capture old data.

Executing VACUUM FULL on postgresql db which is stopped/down

I am attempting to restart a postgresql db which has stopped/is down and requires a VACUUM.
http://suwala.eu/blog/2010/10/09/how-to-vacuum-postgresql/
Following the above sequence of commands, I can't seem to get the last line to execute right.
$ postgres -D /var/lib/pgsql/data YOUR_DATABASE_NAME < /tmp/fix.sql
This gives me an error that says
postgres: invalid argument: "YOUR_DATABASE_NAME"
Try "postgres --help" for more information.
Any idea why?
CLARIFICATION
The 'YOUR_DATABASE_NAME' and the data directory I used on my server are the correct ones.
The "how-to-vacuum-postgresql" page referenced in the question gives some very bad advice when it recommends VACUUM FULL. All that is needed is a database-wide vacuum, which is simply a VACUUM run as the database superuser against the entire database (i.e., you don't specify any table name).
A VACUUM FULL works differently based on the version, but it eliminates all space within the heap files which is held by the database for quick re-use, and releases it to the OS. This can be much slower than the minimum needed to get back to a usable database, by orders of magnitude. And since any inserts or updates after the VACUUM FULL require OS calls to re-allocate space to the database, it can cause slower execution afterward, unless your database had a lot of bloat. (Although, if you turned off autovacuum, it might be in horrible shape, but you probably want to get back on your feet first, and sort that out later.)
Another issue with VACUUM FULL before version 9.0 is that while it eliminates bloat in a table's heap files, it tends to increase bloat in its index files, sometimes dramatically. If you issue a VACUUM FULL, you should normally follow it with a REINDEX to get the indexes back into good shape.
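In SQL terms, the difference looks roughly like this (some_table is just an example name):
-- the minimum needed to get back to a usable database: a plain, database-wide vacuum
VACUUM;

-- what the blog recommends: much slower, and before 9.0 it tends to bloat indexes,
-- so it should normally be followed by a REINDEX
VACUUM FULL some_table;
REINDEX TABLE some_table;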
The page referenced in the question also fails to heed the advice given in the PostgreSQL docs at http://www.postgresql.org/docs/8.3/interactive/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND to use single-user mode:
since the system will not execute commands once it has gone into the
safety shutdown mode, the only way to do this is to stop the server
and use a single-user backend to execute VACUUM. The shutdown mode is
not enforced by a single-user backend. See the postgres reference page
for details about using a single-user backend.
As others have mentioned -- there is almost no use case where turning off autovacuum is beneficial. It may be useful to supplement the autovacuum activity with explicit vacuums on large tables, or you may want to adjust autovacuum configuration, but really -- don't turn it off or you will see bloat which saps performance and you'll run into transaction ID wraparound problems periodically. People who notice a performance hit when autovacuum is performing maintenance sometimes have an instinct to make it less aggressive in triggering, but that is usually counter-productive. It is generally better to adjust the autovacuum cost limitation parameters to pace the work, rather than have it neglect tables which need maintenance.
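For example, rather than disabling autovacuum, you could pace its work on a particularly busy table with per-table storage parameters; the table name and values here are only illustrative:
ALTER TABLE big_busy_table SET (
    autovacuum_vacuum_cost_delay = 10,   -- sleep (in ms) between cost-limited chunks of work
    autovacuum_vacuum_cost_limit = 500   -- how much work to do before each sleep
);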
This appears to be an issue in PostgreSQL: according to the documentation for 9.0 and 8.3 it should work with those versions, but it doesn't.
However, using the --single switch makes it work:
postgres --single -D [path-to-data-dir] [db-name] < /tmp/fix.sql

How does PostgreSQL cache statements and data?

In Oracle, SQL statements are cached in the shared_pool, and frequently selected data is cached in the db_cache.
What does PostgreSQL do? Will SQL statements and data be cached in shared_buffers?
Generally, only the contents of table and index files will be cached in the shared buffer space.
Query plans are cached in some circumstances. The best way to ensure this is to PREPARE the query once, then EXECUTE it each time.
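A minimal sketch of that approach, assuming a hypothetical users table:
PREPARE fetch_user (int) AS
    SELECT * FROM users WHERE id = $1;  -- planned once and cached for this session
EXECUTE fetch_user(1);
EXECUTE fetch_user(2);                  -- reuses the cached plan
DEALLOCATE fetch_user;                  -- drops the prepared statement (DISCARD ALL does this too)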
The results of a query are not automatically cached. If you rerun the same query -- even if it's letter-for-letter identical, and no updates have been performed on the DB -- it will still execute the whole plan. It will, of course, make use of any table/index data that's already in the shared buffers cache; so it will not necessarily have to read all the data from disk again.
Update on plan caching
Plan caching is generally done per session. This means only the connection that made the plan can use the cached version; other connections have to make and cache their own. This isn't really a performance issue, because the saving you get from reusing a plan is almost always minuscule compared to the cost of connecting anyway (unless your queries are really complicated).
It does cache if you use PREPARE: http://www.postgresql.org/docs/current/static/sql-prepare.html
It does cache when the query is in a PL/pgSQL function: http://www.postgresql.org/docs/current/static/plpgsql-implementation.html#PLPGSQL-PLAN-CACHING
It does not cache ad-hoc queries entered in psql.
Hopefully someone else can elaborate on any other cases of query plan caching.

Clearing APC cache with postgres trigger

I was looking for a suitable caching solution for a PHP app.
I decided to let the application do all the "first of all I must land on the right server within the cluster" work, so I use the much faster APC cache rather than memcache.
It does add some overhead (in terms of improving the caching) to find the right server to land on, but I kind of like it.
I heard there is a project, pgmemcache, that can, for example, clear outdated memcached entries from within Postgres triggers.
I handle outdated data my own way, but I'm still curious whether there is something out there to access the APC cache from within Postgres triggers.
Thanks in advance,
kriscom
I don't see any equivalent of pgmemcache for APC. Pgmemcache is open source, so you could use it as a basis for creating an APC equivalent: https://github.com/Ormod/pgmemcache.
If it is OK for your cache to be a little stale, you could create a table in Postgres to function as an invalidation/update queue. Use a trigger to insert a row when the cache needs to be updated. Then create a PHP script that constantly polls the queue and performs the cache manipulations.
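A minimal sketch of that queue-table idea, with hypothetical names; the PHP poller would delete rows it has processed and call apc_delete() for each key:
CREATE TABLE cache_invalidation_queue (
    id        bigserial PRIMARY KEY,
    cache_key text NOT NULL,
    queued_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION queue_cache_invalidation() RETURNS trigger AS $$
BEGIN
    INSERT INTO cache_invalidation_queue (cache_key)
    VALUES ('item:' || NEW.id);  -- hypothetical key scheme matching what the PHP app stores in APC
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_invalidate_cache
AFTER INSERT OR UPDATE ON items  -- "items" is a hypothetical cached table
FOR EACH ROW EXECUTE PROCEDURE queue_cache_invalidation();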
I would not suggest spreading your cache management across layers. Either do it all in your data access layer or all at the database layer, but don't mix them.

Is it possible to pause an SQL query?

I've got a really long-running SQL query (a data import, etc.). It's crap: it uses cursors and it's running slowly. But it is getting the job done, so I'm not too worried about performance.
Anyways, can I pause it for a while (instead of canceling the query)?
It chews up a bit of CPU, so I was hoping to pause it, do some other stuff ... then resume it.
I'm assuming the answer is 'NO' because of how rows and data get locked, etc.
I'm using SQL Server 2008, btw.
The best approximation I know of for what you're looking for is:
BEGIN
    WAITFOR DELAY 'TIME';   -- 'TIME' is a placeholder for an hh:mm:ss delay, e.g. '02:00:00'
    EXECUTE XXXX;           -- XXXX is a placeholder for the long-running procedure or query
END;
GO
Not only can you not pause it, doing so would be bad. SQL queries hold locks (for transactional integrity), and if you paused the query, it would have to hold any locks while it was paused. This could really slow down other queries running on the server.
Rather than pause it, I would write the query so that it can be terminated, and pick up from where it left off when it is restarted. This requires work on your part as a query author, but it's the only feasible approach if you want to interrupt and resume the query. It's a good idea for other reasons as well: long running queries are often interrupted anyway.
Click the Debug button instead of Execute. SQL Server 2008 introduced the ability to debug queries on the fly. Put breakpoints at convenient locations.
When working on similar situations, where I was trying to go through an entire set of data that could be huge, and could tell which rows I had already visited, I would run the processing in chunks:
update or whatever   -- pseudocode: do one chunk of the work
where (still not done)
limit 1000           -- in T-SQL (SQL Server) use UPDATE TOP (1000) ..., since UPDATE has no LIMIT
And then I would just keep running the query until there are no rows being modified. This breaks the locks up into reasonable time chunks and can allow you to do things like move tens of millions of rows between tables while a system is in production.
Jacob
Instead of pausing the script, perhaps you could use the Resource Governor. That way you could allow the script to run in the background without severely impacting the performance of other tasks.
MSDN-Resource Governor