Is it possible, perhaps using DB-triggers to set a maximum table-size in a postgres DB?
For example, say I have a table called: Comments.
From the user perspective, this can be done as frequently as possible, but say I only want to store the 100 most recent comments in the DB. So what I want to do is have a trigger that automatically maintains this. I.e. when more than 100 comments are there, it deletes the oldest one, etc.
Could someone help me with writing such a trigger?
I think a trigger is the wrong tool for the job; although it is possible to implement this. Something about spawning a "delete" from an executing insert makes the hair on my neck neck stand up. You will generate a lot of locking and possibly contention that way; and inserts should generally not generate locks.
To me this says "stored procedure" all the way.
But I also think you should ask yourself, "why delete" old comments? Deletes are an anathema. Better just limit them when you display them. If you are really worried about the size of the table, use a TEXT column. Postgres will maintain these in a shadow table and full scans of the original table will blaze along just fine.
Limiting to 100 comments per user is rather simple, e.g.
delete from comments where user_id = new.user_id
order by comment_date desc offset 100;
Limiting the byte size is trickier. You'd need to calculate the relevant row sizes and that won't account for index sizes, dead rows, etc. At best you'd use the admin functions to get the table size but these won't yield the size per user, only the total size.
We could in theory create a table of 100 dummy records and then simply overwrite them with the actual comments. Once we pass the 100th we will overwrite the 1st one, etc.
This way we are suppose to keep the same size of the table, but that is not possible, because an update is equivalent to delete,insert in Postgresql. So the size of the table will continue to grow.
So if the objective is not to overflow the disk drive then once the disk is full at 80% a "vacuum full" should be performed to free up disk space. "Vacuum full" requires disk space by itself. If you kept the records to a fixed number then there will be an effect of the vacuum. Also there seems to be cases where vacuum can fail.
Related
I have the following table 'medicion' with the followings fields:
id_variable[int](PK),
id_departamento[int](PK),
fecha [date](PK),
valor [number]`.
So, I want to get the minimum, maximum and the average of valor grouping all that data by id_variable. So my query is:
SELECT AVG(valor), MIN(valor), MAX(valor)
FROM medicion
GROUP BY id_variable;
Knowing that by default PostgreSQL builds an index for the primary key
(id_departamento, id_variable, fecha)
how can I optimize this query?, should I create a new index only by id_variable or the default index is valid in this query?
Thanks!
Since there is an avg(), and one needs all the values to compute an average, it's going to read the whole table. Unless you use a WHERE, but there is no WHERE, so I presume you want global statistics.
The only things an extra covering index brings are:
Not reading the entire table.
This could be beneficial if there was, say, 50 columns, or TEXTs which make the table file huge. In this case reading the whole table just to average a few int's would need to grind in tons of useless stuff from disk.
I mean, covering indexes are awesome when you want to snipe one column or two out of a huge table, and keep the small column set in cache. But this is not the case here, you only got small columns, so this reason is out.
...and of course slightly slower UPDATEs since the index needs to be updated. Also, the index needs to be cached, its gonna use some RAM, etc.
Getting the rows pre-sorted for convenient aggregation.
This can matter here, mostly if it avoids a huge sort. However, if it avoids a hash-aggregate, which super fast anyway, not so useful.
Now, if you have relatively few distinct values of id_variable... say, enough to fit into a hash-aggregate, which can be a sizable amount, depends on your work_mem... then it'll be difficult to beat it...
If the table is not updated often, or is insert-only, and you need the statistics often, consider a materialized view (keep min/max/avg for each id_variable in a separate table, and keep them updated on each insert). Updating the mat-view takes time, so this is a tradeoff if you need the stats very often.
You could keep your stats in cache if you don't mind them being stale.
Or, if your table has tons of old data, you could partition it, and keep the min/max/sum/count for the old read-only partition, and only compute the stats on the new stuff.
we are using Postgres to store ~ 2.000.000.000 samples. This ends up in tables with ~ 500 mio entries and ~100GB Size each table.
What I want to do:
E.g. update the table entries: UPDATE table SET flag = true;
After this, the table is twice as big, i.e. 200GB
To get the space (stored on a SSD) back we: "VACCUM FULL table"
Unfortunately, this step needs again loads of space which results in the Vacuum to fail due to too little space left.
My Questions:
Does this mean, that, in order to make this UPDATE query only once and to get the space back for other tables in this DB we need at least 300-400GB space for a 100GB table?
In your scenario, you won't get away without having at least twice as much space as the table data would require.
The cheapest solution is probably to define the table with a fillfactor of 50 so that half of each block is left empty, thereby doubling the table size. Then the updated rows can all be in the same block as the original rows, and the UPDATE won't increase the table size because PostgreSQL can use the heap only tuple (HOT) update feature. The old versions will be freed immediately if there are no long running transactions that can still see them.
NOTE: This will only work if the colum you are updating is not indexed.
The downside of this approach is that the table is always twice the necessary size, and all sequential scans will take twice as long. It won't bother you if you don't use sequential scans of the table.
I'm running a really big query, that insert a lot of rows in table, almost 8 million of rows divide in some smaller querys, but in some moment appear that error : "I get an error "could not write block .... of temporary file no space left on device ..." using postgresql". I don't know if i need to delete temporary files after each query and how I can to do that, or if it is related with another issue.
Thank you
OK. As there are still some facts missing, an attempt to answer to maybe clarify the issue:
It appears that you are running out of disk space. Most likely because you don't have enough space on your disk. Check on a Linux/Unix df -h for example.
To show you, how this could happen:
Having a table with maybe 3 integers the data alone will occupy about 12Byte. You need to add some overhead to it for row management etc. On another answer Erwin mentioned about 23Byte and linked to the manual for more information about. Also there might needs some padding betweens rows etc. So doing a little math:
Even with a 3 integer we will end up at about 40 Byte per row. Having in mind you wanted to insert 8,000,000 this will sum up to 320,000,000Byte or ~ 300MB (for our 3 integer example only and very roughly).
Now giving, you have a couple of indexes on this table, the indexes will also grow during the inserts. Also another aspect might could be bloat on the table and indexes which might can be cleared with a vacuum.
So what's the solution:
Provide more disk space to your database
Split your inserts a little more and ensure, vacuum is running between them
Inserting data or index(create) always needs temp_tablespaces, which determines the placement of temporary tables and indexes, as well as temporary files that are used for purposes such as sorting large data sets.according to your error, it meant that your temp_tablespace location is not enough for disk space .
To resolve this problem you may need these two ways:
1.Re-claim the space of your temp_tablespace located to, default /PG_DATA/base/pgsql_tmp
2. If your temp_tablespace space still not enough for temp storing you can create the other temp tablespace for that database:
create tablespace tmp_YOURS location '[your enough space location';
alter database yourDB set temp_tablespaces = tmp_YOURS ;
GRANT ALL ON TABLESPACE tmp_YOURS to USER_OF_DB;
then disconnect the session and reconnect it.
The error is quite self-explanatory. You are running a big query yet you do not have enough disk space to do so. If postgresql is installed in /opt...check if you have enough space to run the query. If not LIMIT the output to confirm you are getting the expected output and then proceed to run the query and write the output to a file.
I process a table with ~10^7 rows the following way: take last N rows, update them in some way, and delete, then vacuum table. In the end I make a query for pg_total_relation_size. Loop repeats until the table is over. Each iteration last for several seconds. There are no any other queries for this table except mentioned above. The problem is that I get the same results for table size. It changes about once a several hours.
So the question is -- does postgres store somewhere the table size or does it calculate it every time the function is invoked? I.e., does my table size really stays the same in spite of its processing?
Your table really does stay the same size on disk despite the DELETEs and VACUUMing you're doing. As per the documentation on VACUUM, ordinary VACUUM only releases space back to the OS if it can do so by truncating free space from the end of the file without rearranging live rows.
The space is still "free" in that PostgreSQL can re-use it for other new rows. It is much, much faster to re-use space that PostgreSQL hasn't given back to the OS than it is to extend a relation with new space, so this is often preferable.
The other reason Pg doesn't just give this space back is that it can only give space back to the OS when it's a contiguous chunk with no visible rows until the end of the file. This doesn't happen much so in practice Pg needs to move some rows around to compact the table and allow it to free space at the end, kind of like a defrag on a file system. This is an inefficient and slow process that can counter-intuitively make the table slower to access instead of faster, so it's not always a good idea.
If you have a relation that's mostly but not entirely empty it can be worth doing a VACUUM FULL (Pg 9.0 and above) or CLUSTER (all versions) to free the space. If you expect to refill the table this is usually counter-productive; it's actually better to leave it as-is.
(For what I mean by terms like "live" and "visible" see the documentation on MVCC which will help you understand Pg's table organisation.)
Personally I'd skip the manual VACUUM in your case. Turn autovacuum up if you need to. If you really need to you could consider partitioning your table, processing it partition by partition and TRUNCATE each partition when you're done processing it.
I have two Postgres databases. In one I have two tables, each with about 8,000,000 rows, and a count on either of them takes about a second. In another database, also Postgres, there are tables that are 1,000,000 rows, and a count takes 10s, and one table thats about 6,000,000 rows, and count takes 3min to run. What factors determine how long this will take? They are on different machines, but the database that takes longer is on a faster machine.
I've read about how postgres count is slow in general, but this seems odd to me. I can't really use a workaround, because I am using django, and it does a count in the admin, which is taking forever and making it dificult to use.
Any information on this would be helpful.
Speed of counting depends not just on the number of rows in the table but on the time taken to read the data from disk. The time depends on many things:
Number of rows in the table - as you already mentioned.
The number of records per page (if each record takes more space you need to read more pages to read the same number of rows).
If pages are only partly full you have to read more pages.
If the tables is already cached in memory (having more memory available helps here).
If the table is indexed with a small index (the index can be counted instead).
Hardware differences.
etc....
Indexes, caches, disk speed, for starters all have an impact.
Is the "slow table" properly vacuumed?
Do not use VACUUM FULL, it only creates table and index bloat. VACUUM is absolutely enough. VACUUM ANALYZE would even be better.
And make sure autovacuum is turned on and properly configured