PostgreSQL - Why are some queries on large datasets so incredibly slow - postgresql

I have two types of queries I run often on two large datasets. They run much slower than I would expect them to.
The first type is a sequential scan updating all records:
Update rcra_sites Set street = regexp_replace(street,'/','','i')
rcra_sites has 700,000 records. It takes 22 minutes from pgAdmin! I wrote a vb.net function that loops through each record and sends an update query for each record (yes, 700,000 update queries!) and it runs in less than half the time. Hmmm....
The second type is a simple update with a relation and then a sequential scan:
Update rcra_sites as sites
Set violations='No'
From narcra_monitoring as v
Where sites.agencyid=v.agencyid and v.found_violation_flag='N'
narcra_monitoring has 1,700,000 records. This takes 8 minutes. The query planner refuses to use my indexes. The query runs much faster if I start with a set enable_seqscan = false;. I would prefer if the query planner would do its job.
I have appropriate indexes, I have vacuumed and analyzed. I optimized my shared_buffers and effective_cache_size best I know to use more memory since I have 4GB. My hardware is pretty darn good. I am running v8.4 on Windows 7.
Is PostgreSQL just this slow? Or am I still missing something?

Possibly try reducing your random_page_cost (default: 4) compared to seq_page_cost: this will reduce the planner's preference for seq scans by making random-accesses driven by indices more attractive.
Another thing to bear in mind is that MVCC means that updating a row is fairly expensive. In particular, updating every row in a table requires doubling the amount of storage for the table, until it can be vacuumed. So in your first query, you may want to qualify your update:
UPDATE rcra_sites Set street = regexp_replace(street,'/','','i')
where street ~ '/'
(afaik postgresql doesn't automatically suppress the update if it looks like you're not actually updating anything. Istr there was a standard trigger function added in 8.4 (?) to allow you to do that, but it's perhaps better to address it in the client side)

When a row is updated, a new row version is written.
If the new row does not fit in the same disk block, then every index entry pointing to the old row needs to be updated to point to the new row.
It is not just indexes on the updated data that need updating.
If you have a lot of indexes on rcra_sites, and only one or two frequently updated fields, then you might gain by separating the frequently updated fields into a table of their own.
You can also reduce the fillfactor percentage below its default of 100, so that some of the updates can result in new rows being written to the same block, resulting in the indexes pointing to that block not needing to be updated.

Related

How to decrease size of a large postgresql table

I have a postgresql table that is "frozen" i.e. no new data is coming into it. The table is strictly used for reading purposes. The table contains about 17M records. The table has 130 columns and can be queried multiple different ways. To make the queries faster, I created indices for all combinations for filters that can be used. So I have a total of about 265 indexes on the table. Each index is about 1.1 GB. This makes the total table size to be around 265 GB. I have vacuumed the table as well.
Question
Is there a way to further bring down the disk usage of this table?
Is there a better way to handle queries for "frozen" tables that never get any data entered into them?
If your table or indexes are bloated, then VACUUM FULL tablename could shrink them. But if they aren't bloated then this won't do any good. This is not a benign operation, it will lock the table for a period of time (needing rebuild hundreds of index, probably a long period of time) and generate large amounts of IO and of WAL, the last of which will be especially troublesome for replicas. So I would test it on a non-production clone to see it actually shrinks things and see about how long of a maintenance window you will need to declare.
Other than that, be more judicious in your choice of indexes. How did you get the list of "all combinations for filters that can be used"? Was it by inspecting your source code, or just by tackling slow queries one by one until you ran out of slow queries? Maybe you can look at snapshots of pg_stat_user_indexes taken a few days apart to see if all them are actually being used.
Are these mostly two-column indexes?

Run vacuum by schedule

I'm using Postgres version 9.6
Most of my tables are for queries, update, insert.
Most of them around 200K-700K.
There are bigger (millions) and smaller.
Is that a good idea to perform vacuum (and analyze?) operation once a day? once a week? regardless if there is an autovacuum..
Advantages vs disadvantages?
Autovacuum is done when needed and it only creates statistics that are used when planning a query.
Basically you never need to do this manually, unless you have made vast changes to a table (filled it with data for example), and want to use it in another query within a few milliseconds. In that scenario, old statistics will result in the query planner coming up with a very bad query plan and will lead to a significantly slower query.
What you might want to do once per day / per week, or whatever, is to cluster tables, recreate degraded indexes, on tables that were modified a lot. Research these topics more to decide if / when / how to do it.

Optimizing aggregation function and ordering in PostgreSQL

I have the following table 'medicion' with the followings fields:
id_variable[int](PK),
id_departamento[int](PK),
fecha [date](PK),
valor [number]`.
So, I want to get the minimum, maximum and the average of valor grouping all that data by id_variable. So my query is:
SELECT AVG(valor), MIN(valor), MAX(valor)
FROM medicion
GROUP BY id_variable;
Knowing that by default PostgreSQL builds an index for the primary key
(id_departamento, id_variable, fecha)
how can I optimize this query?, should I create a new index only by id_variable or the default index is valid in this query?
Thanks!
Since there is an avg(), and one needs all the values to compute an average, it's going to read the whole table. Unless you use a WHERE, but there is no WHERE, so I presume you want global statistics.
The only things an extra covering index brings are:
Not reading the entire table.
This could be beneficial if there was, say, 50 columns, or TEXTs which make the table file huge. In this case reading the whole table just to average a few int's would need to grind in tons of useless stuff from disk.
I mean, covering indexes are awesome when you want to snipe one column or two out of a huge table, and keep the small column set in cache. But this is not the case here, you only got small columns, so this reason is out.
...and of course slightly slower UPDATEs since the index needs to be updated. Also, the index needs to be cached, its gonna use some RAM, etc.
Getting the rows pre-sorted for convenient aggregation.
This can matter here, mostly if it avoids a huge sort. However, if it avoids a hash-aggregate, which super fast anyway, not so useful.
Now, if you have relatively few distinct values of id_variable... say, enough to fit into a hash-aggregate, which can be a sizable amount, depends on your work_mem... then it'll be difficult to beat it...
If the table is not updated often, or is insert-only, and you need the statistics often, consider a materialized view (keep min/max/avg for each id_variable in a separate table, and keep them updated on each insert). Updating the mat-view takes time, so this is a tradeoff if you need the stats very often.
You could keep your stats in cache if you don't mind them being stale.
Or, if your table has tons of old data, you could partition it, and keep the min/max/sum/count for the old read-only partition, and only compute the stats on the new stuff.

Postgres: Do we always need at least 3-4 times free the space of the biggest table?

we are using Postgres to store ~ 2.000.000.000 samples. This ends up in tables with ~ 500 mio entries and ~100GB Size each table.
What I want to do:
E.g. update the table entries: UPDATE table SET flag = true;
After this, the table is twice as big, i.e. 200GB
To get the space (stored on a SSD) back we: "VACCUM FULL table"
Unfortunately, this step needs again loads of space which results in the Vacuum to fail due to too little space left.
My Questions:
Does this mean, that, in order to make this UPDATE query only once and to get the space back for other tables in this DB we need at least 300-400GB space for a 100GB table?
In your scenario, you won't get away without having at least twice as much space as the table data would require.
The cheapest solution is probably to define the table with a fillfactor of 50 so that half of each block is left empty, thereby doubling the table size. Then the updated rows can all be in the same block as the original rows, and the UPDATE won't increase the table size because PostgreSQL can use the heap only tuple (HOT) update feature. The old versions will be freed immediately if there are no long running transactions that can still see them.
NOTE: This will only work if the colum you are updating is not indexed.
The downside of this approach is that the table is always twice the necessary size, and all sequential scans will take twice as long. It won't bother you if you don't use sequential scans of the table.

Why is count(*) taking extremely long in one PostgreSQL database but not another?

I have two Postgres databases. In one I have two tables, each with about 8,000,000 rows, and a count on either of them takes about a second. In another database, also Postgres, there are tables that are 1,000,000 rows, and a count takes 10s, and one table thats about 6,000,000 rows, and count takes 3min to run. What factors determine how long this will take? They are on different machines, but the database that takes longer is on a faster machine.
I've read about how postgres count is slow in general, but this seems odd to me. I can't really use a workaround, because I am using django, and it does a count in the admin, which is taking forever and making it dificult to use.
Any information on this would be helpful.
Speed of counting depends not just on the number of rows in the table but on the time taken to read the data from disk. The time depends on many things:
Number of rows in the table - as you already mentioned.
The number of records per page (if each record takes more space you need to read more pages to read the same number of rows).
If pages are only partly full you have to read more pages.
If the tables is already cached in memory (having more memory available helps here).
If the table is indexed with a small index (the index can be counted instead).
Hardware differences.
etc....
Indexes, caches, disk speed, for starters all have an impact.
Is the "slow table" properly vacuumed?
Do not use VACUUM FULL, it only creates table and index bloat. VACUUM is absolutely enough. VACUUM ANALYZE would even be better.
And make sure autovacuum is turned on and properly configured