Evaluate how much space will be freed by VACUUM in Redshift - amazon-redshift

According to AWS doc:
Amazon Redshift does not automatically reclaim and reuse space that is freed when you delete rows and update rows.
Before running VACUUM, is there a way to know or evaluate how much disk space will be freed by the VACUUM?
Thx
References:
http://docs.aws.amazon.com/redshift/latest/dg/t_Reclaiming_storage_space202.html
http://docs.aws.amazon.com/redshift/latest/dg/r_VACUUM_command.html

You can estimate how much of a table a vacuum will reclaim by looking at the tbl_rows column in the svv_table_info view, which includes rows that are marked for deletion. Compare that to a select count(*) from the same table and you'll have a ratio. Something like this on a theoretical table named factsales:
select (select cast(count(*) as numeric(12,0)) from factsales) /
cast(tbl_rows as numeric(12,0))
as "percentage of non deleted rows"
from svv_table_info where "table" = 'factsales'
There doesn't appear to be a straightforward way to execute dynamic SQL and cursors, so to get this same ratio across all tables you'd have to run the statements from an external source or programming language, e.g. Python.
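One way to keep most of the work in SQL is to let svv_table_info generate the per-table comparison statements and then run the generated text from your external tool. A rough sketch (the pct_non_deleted alias and the lack of identifier quoting are illustrative only):
select 'select ''' || "schema" || '.' || "table" || ''' as tbl, '
    || 'count(*)::numeric(12,0) / ' || cast(tbl_rows as varchar) || ' as pct_non_deleted '
    || 'from ' || "schema" || '.' || "table" || ';'
from svv_table_info
where tbl_rows > 0;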

It's not an extremely accurate method, but you can query svv_table_info and look at the deleted_pct column. This will give you a rough idea, in percentage terms, of what fraction of the table needs to be rebuilt using vacuum.
You can run it for all the tables in your system to get this estimate for the whole system.
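A minimal sketch of that system-wide check, relying on the deleted_pct column this answer describes (confirm the column exists in your cluster's version of svv_table_info before depending on it):
select "schema", "table", deleted_pct
from svv_table_info
order by deleted_pct desc;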

Related

PostgreSQL Database size is not equal to sum of size of all tables

I am using an AWS RDS PostgreSQL instance. I am using the query below to get the size of each database.
SELECT datname, pg_size_pretty(pg_database_size(datname))
from pg_database
order by pg_database_size(datname) desc
One database's size is 23 GB, but when I ran the query below to get the sum of the sizes of all the individual tables in that database, it came to only around 8 GB.
select pg_size_pretty(sum(pg_total_relation_size(table_schema || '.' || table_name)))
from information_schema.tables
As it is an AWS RDS instance, I don't have rights on pg_toast schema.
How can I find out which database object is consuming size?
Thanks in advance.
The documentation says:
pg_total_relation_size ( regclass ) → bigint
Computes the total disk space used by the specified table, including all indexes and TOAST data. The result is equivalent to pg_table_size + pg_indexes_size.
So TOAST tables are covered, and so are indexes.
One simple explanation could be that you are connected to a different database than the one that is shown to be 23GB in size.
Another likely explanation would be materialized views, which consume space, but do not show up in information_schema.tables.
Yet another explanation could be that there have been crashes that left some garbage files behind, for example after an out-of-space condition during the rewrite of a table or index.
This is of course harder to debug on a hosted platform, where you don't have shell access...
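To chase the gap from SQL alone, one sketch is to sum pg_total_relation_size over every ordinary table and materialized view that pg_class knows about in the current database (TOAST and indexes are already included by that function) and compare it with pg_database_size; the aliases are illustrative:
select pg_size_pretty(sum(pg_total_relation_size(c.oid))) as relations_total,
       pg_size_pretty(pg_database_size(current_database())) as database_total
from pg_class c
where c.relkind in ('r', 'm');  -- ordinary tables and materialized views
A large remaining difference would point to something like orphaned files from a crash, which you cannot inspect without shell access.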

Postgres VACUUM doesn't free up space

I have a table in my database which is occupying 161 GB of hard disk space. Only 5 GB of free space is left out of the 200 GB disk.
The following command shows that my table is consuming 161 GB of disk space:
select pg_size_pretty(pg_total_relation_size('Employee'));
There are close to 527 rows in the table. I deleted 250 rows and then checked the pg_total_relation_size of Employee again. The size is still 161 GB.
After seeing the output of the above query, I ran the vacuum command:
VACUUM VERBOSE ANALYZE Employee;
I checked if the VACUUM actually happened using,
SELECT relname, last_vacuum, last_autovacuum FROM pg_stat_user_tables;
I can see the last vacuum time matching the time I ran the VACUUM command.
I also ran the below command to see if there are any dead tuples,
SELECT relname, n_dead_tup FROM pg_stat_user_tables;
n_dead_tup count for Employee table is 0.
Still after all these above commands if I run,
select pg_size_pretty(pg_total_relation_size('Employee'));
it still shows 161GB.
May I please know the reason behind this? Also, please advise how I can free up this space.
vacuum doesn't physically "free" space. It only marks no longer used space as re-usable. So subsequent UPDATE or INSERT statements can use that space instead of appending to the table.
Quote from the manual
The standard form of VACUUM removes dead row versions in tables and indexes and marks the space available for future reuse. However, it will not return the space to the operating system, except in the special case where one or more pages at the end of a table become entirely free and an exclusive table lock can be easily obtained
(emphasis mine)
If you re-insert the 250 deleted rows, you will see that the table doesn't grow again, as the newly inserted rows simply use the space that was marked free by vacuum.
If you actually want to physically reduce the size of the table to the size that is needed, you need to run vacuum full.
Quote from the manual
VACUUM FULL actively compacts tables by writing a complete new version of the table file with no dead space. This minimizes the size of the table, but can take a long time. It also requires extra disk space for the new copy of the table, until the operation completes
(emphasis mine)
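If you do decide to reclaim the space physically, and can tolerate the exclusive lock plus enough free disk space for the compacted copy, the sequence is simply:
VACUUM FULL VERBOSE Employee;
-- re-check the physical size afterwards
select pg_size_pretty(pg_total_relation_size('Employee'));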

Select * from table_name is running slow

The table contains around 700,000 rows. Is there any way to make the query run faster?
This table is stored on a server.
I have tried running the query selecting only specific columns.
If select * from table_name is unusually slow, check for these things:
Network speed. How large is the data and how fast is your network? For large queries you may want to think about your data in bytes instead of rows. Run select bytes/1024/1024/1024 gb from dba_segments where segment_name = 'TABLE_NAME'; and compare that with your network speed.
Row fetch size. If the application or IDE is fetching one-row-at-a-time, each row has a large overhead with network lag. You may need to increase that setting.
Empty segment. In a few weird cases the table's segment size can increase and never shrink. For example, if the table used to have billions of rows, and they were deleted but not truncated, the space would not be released. Then a select * from table_name may need to read a lot of empty extents to get to the real data. If the GB size from the above query seems too large, run alter table table_name move; to rebuild the table and possibly save space.
Recursive query. A query that simple almost couldn't have a bad execution plan. It's possible, but rare, that a recursive query has a bad execution plan. While the query is running, look at select * from gv$sql where users_executing > 0;. There might be a data dictionary query that's really slow and needs to be tuned.
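A consolidated sketch of the checks above (replace TABLE_NAME with your table; querying gv$sql requires the relevant privileges):
-- how many gigabytes have to travel to the client?
select bytes/1024/1024/1024 as gb
from dba_segments
where segment_name = 'TABLE_NAME';

-- is a slow recursive or data dictionary query running alongside yours?
select sql_id, users_executing, sql_text
from gv$sql
where users_executing > 0;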

PostgreSQL - Does doing analyze have the same performance cost as count(*)?

To start off, I want to do an estimated count of how large a table is. Since I am building analytical graphs over the database, an exact count is not important. Thus I came across this wiki, which suggested doing an estimated count using
SELECT reltuples FROM pg_class WHERE relname = 'table_name';
Now in order to get an updated count, we would have to do an analyze on that table.
So my question is: is using analyze to get an updated reltuples count the same thing as doing count(*)? Is the performance hit of doing analyze as big as that of doing count(*)?
The cost of running analyze depends on how much of the table is sampled, which depends on default_statistics_target or the per-column statistics setting. It could be either faster or slower than count(*) depending on your specifics; there is no replacement for actually trying it and seeing on your own system with your own data.
But normally, if you are happy with an estimate, you would just use the reltuples value you find there already. Re-analyzing the table each time would seem to defeat the purpose.
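If the table may have changed size since the last analyze, a common refinement scales reltuples by the table's current number of pages ('table_name' is a placeholder; the result is NULL for a never-analyzed or empty table):
select (reltuples / nullif(relpages, 0)
        * (pg_relation_size('table_name') / current_setting('block_size')::int))::bigint
       as estimated_rows
from pg_class
where relname = 'table_name';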

Is using Table variables faster than temp tables

Am I safe to assume that where I have stored procedures using the tempdb to write a temporary table, I'd be better off switching these to table variables to get better performance?
Temp tables are better in performance. If you use a table variable and the data in the variable gets too big, SQL Server converts the variable automatically into a temp table.
It depends, like almost every database-related question, on what you are trying to do, so it is hard to answer without more information.
So my answer is: try it and have a look at the execution plan. Use the fastest way with the lowest cost.
MSDN - Displaying Graphical Execution Plans (SQL Server Management Studio)
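One way to try it yourself is to load the same data into both structures and compare the execution plans and timing output (object names here are illustrative):
SET STATISTICS TIME, IO ON;

-- temp table: gets statistics and can carry extra indexes
CREATE TABLE #nums (n INT PRIMARY KEY);
INSERT INTO #nums
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns a CROSS JOIN sys.all_columns b;

SELECT COUNT(*) FROM #nums t1 JOIN #nums t2 ON t2.n = t1.n;

-- table variable: same data, but on older compatibility levels the optimizer estimates 1 row
DECLARE @nums TABLE (n INT PRIMARY KEY);
INSERT INTO @nums SELECT n FROM #nums;

SELECT COUNT(*) FROM @nums t1 JOIN @nums t2 ON t2.n = t1.n;

DROP TABLE #nums;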
@Table variables can be faster as there is less "setup time", since the object is in memory only.
@Table variables have a lot of catches though.
You can have a primary key on a @Table variable, but that's about it. Other indexes, clustered or nonclustered, on combinations of columns are not possible.
Also, if your table is going to contain any real data volume (more than about 200, maybe 1,000 rows) then accessing the table variable will be slower, especially as you will probably not have a useful index on it.
#Tables are a pain in procs as they need to be dropped when debugging, they take longer to create, and they take longer to set up as you need to add indexes as a second step. But if you have lots of data then it's #Tables every time.
Even in cases where you have fewer than 100 rows of data in a table, you may still want to use #Tables, as you can create a useful index on the table.
In summary, I use @Table variables most of the time for ease when writing simple procs etc., but anything that needs to perform should be a #Table.
@Table variables have no statistics, so the execution plan entails more guesswork; hence the recommended upper limit of 1,000-ish rows. #Tables have statistics, but these can be cached between invocations. If your cardinalities differ significantly each time the SP runs, you'd want to REBUILD and RECOMPILE each time. This is an overhead, of course, but one which must be balanced against the cost of a rubbish plan.
Both types will do IO to tempdb.
So no, @Table variables are not a panacea.
Table variables can perform very poorly as the number of rows in them increases.
Why is this?
Table variables don't have distribution statistics and don't trigger recompiles. Because of this, SQL Server is not able to estimate the number of rows in a table variable like it does for normal tables. When the optimiser compiles code that contains a table variable, it assumes the table variable is empty and uses an expected row count of 1 for the cardinality estimate. Because the optimiser only thinks a table variable contains a single row, it picks operators for the execution plan that work well with a small set of records, like the NESTED LOOPS operator for a JOIN operation.
As an example, I have just fixed a stored procedure which was performing poorly. The code was populating a table variable and using it in a join to filter the number of rows to accounts which were relevant:
FROM dbo.DimInvestorAccount
INNER JOIN @accounts acclist
ON acclist.AccountNumber = DimInvestorAccount.investorAccountNumber
+ 9 additional tables joined...
When run for a list of 1700 accounts, the query was taking 1m17s. Just changing the filter table definition from:
DECLARE @accounts TABLE (AccountNumber VARCHAR(20) COLLATE Latin1_General_BIN INDEX idx NONCLUSTERED)
to
CREATE TABLE #accounts (AccountNumber VARCHAR(20) COLLATE Latin1_General_BIN INDEX idx NONCLUSTERED)
brought the query time down to 800ms. Note that with 5 rows in the table there was no significant difference: both the temp table and the table variable versions ran in +/- 400ms.
Microsoft's recommendation is to use Table Variables if the number of rows is <100.
Note that Microsoft have made changes in SQL Server 2019 (v15.x, compatibility level 150) to improve this: table variable deferred compilation lets the optimiser use the actual row count of the table variable instead of the fixed estimate of 1.
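To check whether a database already benefits from that change, and to opt in if not, a quick sketch (changing the setting requires ALTER DATABASE permission):
-- table variable deferred compilation is tied to compatibility level 150 or higher
SELECT name, compatibility_level FROM sys.databases WHERE name = DB_NAME();

ALTER DATABASE CURRENT SET COMPATIBILITY_LEVEL = 150;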