PostgreSQL database size is not equal to the sum of the sizes of all tables

I am using an AWS RDS PostgreSQL instance. I am using the query below to get the size of all databases:
SELECT datname, pg_size_pretty(pg_database_size(datname))
from pg_database
order by pg_database_size(datname) desc
One database's size is 23 GB, but when I ran the query below to get the sum of the sizes of all individual tables in this particular database, the total was around 8 GB:
select pg_size_pretty(sum(pg_total_relation_size(table_schema || '.' || table_name)))
from information_schema.tables
As it is an AWS RDS instance, I don't have rights on the pg_toast schema.
How can I find out which database objects are consuming the space?
Thanks in advance.

The documentation says:
pg_total_relation_size ( regclass ) → bigint
Computes the total disk space used by the specified table, including all indexes and TOAST data. The result is equivalent to pg_table_size + pg_indexes_size.
So TOAST tables are covered, and so are indexes.
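If you want to check a single table, those same functions can be used to break its size down. This is just a sketch, with mytable as a placeholder for one of your table names:
-- mytable is a placeholder table name
SELECT pg_size_pretty(pg_table_size('mytable'))          AS table_including_toast,
       pg_size_pretty(pg_indexes_size('mytable'))        AS indexes,
       pg_size_pretty(pg_total_relation_size('mytable')) AS total;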
One simple explanation could be that you are connected to a different database than the one that is shown to be 23GB in size.
Another likely explanation would be materialized views, which consume space, but do not show up in information_schema.tables.
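One way to check that (a sketch, not specific to RDS) is to sum over pg_class instead, which includes materialized views (relkind 'm') alongside ordinary tables (relkind 'r'):
-- Sum the sizes of ordinary tables and materialized views
SELECT pg_size_pretty(sum(pg_total_relation_size(oid)))
FROM pg_class
WHERE relkind IN ('r', 'm');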
Yet another explanation could be that there have been crashes that left some garbage files behind, for example after an out-of-space condition during the rewrite of a table or index.
This is of course harder to debug on a hosted platform, where you don't have shell access...

Related

Query takes a long time to run on a PostgreSQL database despite creating an index

Using PostgreSQL 14.3.1, I have created a database instance that is now 1TB in size. The main userlogs table is 751GB in size with 525GB used for data and 226GB used for various indexes on this table. The userlogs table currently contains over 900 million rows. In order to assist with querying this table, a separate Logdates table holds all unique dates for the user logs and there is an integer foreign key column for logdates created in userlogs called logdateID. Amongst the various indexes on the userlogs table, one of them is on logdateID. There are 104 date entries in Logdates table. When running the below query I would expect the index to be used and the 104 records to be retrieved in a reasonable period of time.
select distinct logdateid from userlogs;
This query took a few hours to return with the data. I did an explain plan on the query and the output is as shown below.
"HashAggregate (cost=80564410.60..80564412.60 rows=200 width=4)"
" Group Key: logdateid"
" -> Seq Scan on userlogs (cost=0.00..78220134.28 rows=937710528 width=4)"
I then issued the command below to request that the database use the index:
set enable_seqscan=off
The revised explain plan now looks like this:
"Unique (cost=0.57..3705494150.82 rows=200 width=4)"
" -> Index Only Scan using ix_userlogs_logdateid on userlogs (cost=0.57..3703149874.49 rows=937710528 width=4)"
However, when running the same query, it still takes a few hours to retrieve the data. My question is, why should it take that long to retrieve the data if it is doing an index only scan?
The machine on which the database sits is highly spec'd: a 16-core Xeon processor that, with virtualisation enabled, gives 32 logical cores. There is 96GB of RAM, and data storage is via a RAID 10 configured 2TB SSD disk, with a separate 500GB system SSD disk.
There is no possibility to optimize such queries in PostgreSQL, due to the internal structure of the data storage into rows inside pages.
All queries involving an aggregate in PostgreSQL, such as COUNT, COUNT DISTINCT or DISTINCT, must read all rows inside the table pages to produce the result.
Let's take a look at the paper I wrote about this problem:
PostGreSQL vs Microsoft SQL Server – Comparison part 2 : COUNT performances
It seems like your table has none of its pages set as all visible (compare pg_class.relallvisible to the actual number of pages in the table), which is weird because even insert-only tables should get autovacuumed in v13 and up. This will severely punish the index-only scan. You can try to manually vacuum the table to see if that changes things.
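A minimal sketch of that check and the manual vacuum, using the userlogs table from the question:
-- Compare visibility map coverage to the total number of pages
SELECT relname, relpages, relallvisible
FROM pg_class
WHERE relname = 'userlogs';
-- Then vacuum manually and re-check
VACUUM (VERBOSE) userlogs;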
It is also weird that it is not using parallelization. It certainly should be. What are your non-default configuration settings?
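One way to gather that list (a generic diagnostic query, not something specific to this setup):
-- Settings that differ from their defaults
SELECT name, setting, source
FROM pg_settings
WHERE source NOT IN ('default', 'override');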
Finally, I wouldn't expect even the poor plan you show to take a few hours. Maybe your hardware is not performing up to what it should. (Also, RAID 10 requires at least 4 disks, but your description makes it sound like that is not what you have)
Since you have the foreign key table, you could use it in your query, just testing for each of its rows that at least one matching row exists in the log table.
select logdateid
from logdates
where exists (select 1 from userlogs where userlogs.logdateid = logdates.logdateid);

How does Heroku Postgres count rows for purposes of enforcing plan limits?

I have received the following email from Heroku:
The database DATABASE_URL on Heroku app [redacted] has
exceeded its allocated storage capacity. Immediate action is required.
The database contains 12,858 rows, exceeding the Hobby-dev plan limit
of 10,000. INSERT privileges to the database will be automatically
revoked in 7 days. This will cause service failures in most
applications dependent on this database.
To avoid a disruption to your service, migrate the database to a Hobby
Basic ($9/month) or higher database plan:
https://hello.heroku.com/upgrade-postgres-c#upgrading-with-pg-copy
If you are unable to upgrade the database, you should reduce the
number of records stored in it.
My postgres database had a single table with 5693 rows at the time I received this email, which does not match the '12858 rows' mentioned by the email. What am I missing here?
It is perhaps worth mentioning that my DB also has a view of the table mentioned above, which Heroku might be adding to the count (despite not being an actual table), doubling the row count from 5693 to 11386, which still does not match the 12858 mentioned in the email, but it is closer.
TL;DR: the rows in views DO factor into the total row count, even when they are not materialized views, despite the fact that views do not store data.
I ran heroku pg:info and saw the line:
Rows: 12858/10000 (Above limits, access disruption imminent)
I then dropped the view I mentioned in the original post, and ran heroku pg:info again:
Rows: 5767/10000 (Above limits, access disruption imminent)
So it seems indeed that views DO get counted in the total row count, which seems rather silly, since views don't actually store any data.
I also don't know why the (Above limits, access disruption imminent) string is still present after reducing the row number below the 10000 limit, but after running heroku pg:info again a minute later, I got
Rows: 5767/10000 (In compliance)
so apparently the compliance flag is not updated at the same time as the row number.
What's even stranger is that when I later re-created the same view that I had dropped and ran heroku pg:info again, the row count did not double back up to ~11000; it stayed at ~5500.
It is useful to note that the following SQL command will display the row counts of the various objects in the database:
select table_schema,
       table_name,
       (xpath('/row/cnt/text()', xml_count))[1]::text::int as row_count
from (
  select table_name, table_schema,
         query_to_xml(format('select count(*) as cnt from %I.%I', table_schema, table_name), false, true, '') as xml_count
  from information_schema.tables
  where table_schema = 'public' --<< change here for the schema you want
) t
the above query was copy-pasted from here
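If you want to see how much of the count comes from base tables alone, a small variation of the same query (just adding a filter on the standard table_type column of information_schema.tables) excludes the views:
select table_schema,
       table_name,
       (xpath('/row/cnt/text()', xml_count))[1]::text::int as row_count
from (
  select table_name, table_schema,
         query_to_xml(format('select count(*) as cnt from %I.%I', table_schema, table_name), false, true, '') as xml_count
  from information_schema.tables
  where table_schema = 'public'
    and table_type = 'BASE TABLE' --<< only real tables, no views
) t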
It sounds as if your database usage was measured at two different moments: the first with higher values (12,858 rows, which is over the free limit, as measured at Heroku) and the second with lower values (5,693 rows, which would be within the free limit and could have been measured in your local environment?).
Anyway, first things first: take a look at your PostgreSQL database at Heroku. This can be done in two ways:
Connect your local Heroku CLI to your dyno and check the info of the related PostgreSQL database:
HowTo see the database related info
Log in to the Heroku web GUI and check size and rows there:
Heroku Postgres
The background behind their database monitoring is explained by Heroku here:
Monitoring Heroku Postgres

PostgreSQL: UPDATE large table

I have a large PostgreSQL table of 29 million rows. The size (according to the stats tab in pgAdmin) is almost 9GB. The table is PostGIS-enabled with an empty geometry column.
I want to UPDATE the geometry column using ST_GeomFromText, reading from X and Y coordinate columns (SRID: 27700) stored in the same table. However, running this query on the whole table at once results in 'out of disk space' and 'connection to server lost' errors... the latter being less frequent.
To overcome this, should I UPDATE the 29 million rows in batches/stages? How can I do 1 million rows (the FIRST 1 million), then do the next 1 million rows until I reach 29 million?
Or are there other more efficient ways of updating large tables like this?
I should add, the table is hosted in AWS.
My UPDATE query is:
UPDATE schema.table
SET geom = ST_GeomFromText('POINT(' || eastingcolumn || ' ' || northingcolumn || ')',27700);
You did not give any server specs; writing 9GB can be pretty fast on recent hardware.
You should be OK with one long update, unless you have concurrent writes to this table.
A common trick to overcome this problem (a very long transaction, locking writes to the table) is to split the UPDATE into ranges based on the primary key, run in separate transactions.
/* Use PK or any attribute with a known distribution pattern */
UPDATE schema.table SET ... WHERE id BETWEEN 0 AND 1000000;
UPDATE schema.table SET ... WHERE id BETWEEN 1000001 AND 2000000;
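Applied to the query from the question, each batch might look like the sketch below, assuming the table has an integer primary key column named id (not shown in the question):
UPDATE schema.table
SET geom = ST_GeomFromText('POINT(' || eastingcolumn || ' ' || northingcolumn || ')', 27700)
WHERE id BETWEEN 0 AND 1000000
  AND geom IS NULL;  -- makes the batch safe to re-run if it is interrupted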
For a high level of concurrent writes, people use more subtle tricks (SELECT FOR UPDATE / NOWAIT, lightweight locks, retry logic, etc.).
From my original question:
However, running this query on the whole table at once results in 'out of disk space' and 'connection to server lost' errors... the latter being less frequent.
Turns out our Amazon AWS instance database was running out of space, stopping my original ST_GeomFromText query from completing. Freeing up space fixed it.
On an important note, as suggested by @mlinth, ST_Point ran my query far quicker than ST_GeomFromText (24 mins vs 2 hours).
My final query being:
UPDATE schema.tablename
SET geom = ST_SetSRID(ST_Point(eastingcolumn,northingcolumn),27700);

Evaluate how much space will be freed by VACUUM in Redshift

According to AWS doc:
Amazon Redshift does not automatically reclaim and reuse space that is freed when you delete rows and update rows.
Before running VACUUM, is there a way to know or evaluate how much space will be freed on disk by the VACUUM?
Thx
References:
http://docs.aws.amazon.com/redshift/latest/dg/t_Reclaiming_storage_space202.html
http://docs.aws.amazon.com/redshift/latest/dg/r_VACUUM_command.html
You can calculate the amount of storage that will be freed up from a vacuum command by looking up the tbl_rows column in the svv_table_info view. This includes rows that are marked for deletion. Compare that to a select count(*) from the same table and you'll have a ratio. Something like this on a theoretical table named factsales.
select (select cast(count(*) as numeric(12,0)) from factsales)
       / cast(tbl_rows as numeric(12,0)) as "percentage of non deleted rows"
from svv_table_info
where "table" = 'factsales'
There doesn't appear to be a straightforward way to execute dynamic SQL and cursors, so to get this same ratio across all tables you'd have to execute the code from an external source or programming language, e.g. Python.
It's not an extremely accurate method, but you can query svv_table_info and look at the deleted_pct column. This will give you a rough idea, in percentage terms, of what fraction of the table needs to be rebuilt using vacuum.
You can run it for all the tables to get this estimate for the whole system.
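Assuming the deleted_pct column described above is available in your Redshift release, a rough system-wide overview could look like this sketch:
-- Rough estimate of how much of each table consists of deleted rows
SELECT "schema", "table", tbl_rows, deleted_pct
FROM svv_table_info
ORDER BY deleted_pct DESC;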

PostgreSQL - restored database smaller than original

I have made a backup of my PostgreSQL database using pg_dump to a ".sql" file.
When I restored the database, its size was 2.8GB, compared with 3.7GB for the source (original) database. The application that accesses the database appears to work fine.
What is the reason for the smaller size of the restored database?
The short answer is that database storage is more optimised for speed than space.
For instance, if you inserted 100 rows into a table, then deleted every row with an odd numbered ID, the DBMS could write out a new table with only 50 rows, but it's more efficient for it to simply mark the deleted rows as free space and reuse them when you next insert a row. Therefore the table takes up twice as much space as is currently needed.
Postgres's use of "MVCC", rather than locking, for transaction management makes this even more likely, since an UPDATE usually involves writing a new row to storage, then marking the old row deleted once no transactions are looking at it.
By dumping and restoring the database, you are recreating a DB without all this free space. This is essentially what the VACUUM FULL command does - it rewrites the current data into a new file, then deletes the old file.
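If you want to observe this without dumping and restoring, you can compare a table's size before and after VACUUM FULL. In this sketch mytable is a placeholder; note that VACUUM FULL takes an exclusive lock while it rewrites the table:
SELECT pg_size_pretty(pg_total_relation_size('mytable'));  -- before
VACUUM FULL mytable;
SELECT pg_size_pretty(pg_total_relation_size('mytable'));  -- after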
There is an extension distributed with Postgres called pg_freespacemap (providing a pg_freespace() function) which lets you examine some of this. For example, you can list the main table size (not including indexes and columns stored in separate "TOAST" tables) and the tracked free space in each table with the query below:
-- Requires the extension: CREATE EXTENSION pg_freespacemap;
Select oid::regclass::varchar As table,
       pg_size_pretty(pg_relation_size(oid)) As size,
       pg_size_pretty(sum(free)) As free
From (
    Select c.oid,
           (pg_freespace(c.oid)).avail As free
    From pg_class c
    Join pg_namespace n On n.oid = c.relnamespace
    Where c.relkind = 'r'
      And n.nspname Not In ('information_schema', 'pg_catalog')
) tbl
Group By oid
Order By pg_relation_size(oid) Desc, sum(free) Desc;
The reason is simple: during its normal operation, when rows are updated, PostgreSQL adds a new copy of the row and marks the old copy as deleted. This is multi-version concurrency control (MVCC) in action. VACUUM then reclaims the space taken by the old row for data that can be inserted in the future, but doesn't return this space to the operating system, because it is in the middle of a file. Note that VACUUM isn't executed immediately; autovacuum only runs after enough data has been modified in the table or deleted from the table.
What you're seeing is entirely normal. It just shows that a PostgreSQL database will be larger in size than the sum of the sizes of its rows. Your new database will most likely eventually grow to 3.7GB when you start actively using it.