Postgres table bloated without dead tuples

I have a table that has 0 dead tuples, but at the same time its bloat factor is 1.7.
The wasted bytes figure is around 21 GB.
Is it possible to have 0 dead tuples while the table is still bloated?
If so, on what basis are the wasted bytes calculated?
EDIT: Below is the query I used to get these figures; it came from AWS support.
SELECT
current_database(),
schemaname,
tablename,
/*reltuples::bigint, relpages::bigint, otta,*/
ROUND((
CASE WHEN otta = 0 THEN
0.0
ELSE
sml.relpages::float / otta
END)::numeric, 1) AS tbloat,
CASE WHEN relpages < otta THEN
0
ELSE
bs * (sml.relpages - otta)::bigint
END AS wastedbytes,
iname,
/*ituples::bigint, ipages::bigint, iotta,*/
ROUND((
CASE WHEN iotta = 0
OR ipages = 0 THEN
0.0
ELSE
ipages::float / iotta
END)::numeric, 1) AS ibloat,
CASE WHEN ipages < iotta THEN
0
ELSE
bs * (ipages - iotta)
END AS wastedibytes
FROM (
SELECT
schemaname,
tablename,
cc.reltuples,
cc.relpages,
bs,
CEIL((cc.reltuples * ((datahdr + ma - (
CASE WHEN datahdr % ma = 0 THEN
ma
ELSE
datahdr % ma
END)) + nullhdr2 + 4)) / (bs - 20::float)) AS otta,
COALESCE(c2.relname, '?') AS iname,
COALESCE(c2.reltuples, 0) AS ituples,
COALESCE(c2.relpages, 0) AS ipages,
COALESCE(CEIL((c2.reltuples * (datahdr - 12)) / (bs - 20::float)), 0) AS iotta -- very rough approximation, assumes all cols
FROM (
SELECT
ma,
bs,
schemaname,
tablename,
(datawidth + (hdr + ma - (
CASE WHEN hdr % ma = 0 THEN
ma
ELSE
hdr % ma
END)))::numeric AS datahdr,
(maxfracsum * (nullhdr + ma - (
CASE WHEN nullhdr % ma = 0 THEN
ma
ELSE
nullhdr % ma
END))) AS nullhdr2
FROM (
SELECT
schemaname,
tablename,
hdr,
ma,
bs,
SUM((1 - null_frac) * avg_width) AS datawidth,
MAX(null_frac) AS maxfracsum,
hdr + (
SELECT
1 + COUNT(*) / 8
FROM
pg_stats s2
WHERE
null_frac <> 0
AND s2.schemaname = s.schemaname
AND s2.tablename = s.tablename) AS nullhdr
FROM
pg_stats s,
(
SELECT
(
SELECT
current_setting('block_size')::numeric) AS bs,
CASE WHEN SUBSTRING(v, 12, 3) IN ('8.0', '8.1', '8.2') THEN
27
ELSE
23
END AS hdr,
CASE WHEN v ~ 'mingw32' THEN
8
ELSE
4
END AS ma
FROM (
SELECT
version() AS v) AS foo) AS constants
GROUP BY 1, 2, 3, 4, 5) AS foo) AS rs
JOIN pg_class cc ON cc.relname = rs.tablename
JOIN pg_namespace nn ON cc.relnamespace = nn.oid
AND nn.nspname = rs.schemaname
AND nn.nspname <> 'information_schema'
LEFT JOIN pg_index i ON indrelid = cc.oid
LEFT JOIN pg_class c2 ON c2.oid = i.indexrelid) AS sml
ORDER BY wastedbytes DESC;

A table can be bloated with empty space even if there is not a single dead tuple. The query you show uses heuristics and has been known to get it wrong occasionally.
Use pgstattuple:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('tablename');
That will show you the actual bloat.
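When reading the output, the columns most relevant here (a sketch, assuming your table is called tablename) are free_space and free_percent, which report reusable empty space, and dead_tuple_percent, which should be near zero in your case:
SELECT table_len,        -- physical size of the table in bytes
       tuple_percent,    -- share of the table taken by live tuples
       dead_tuple_count, -- should be 0, per your observation
       free_percent      -- share that is empty, reusable space
FROM pgstattuple('tablename');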

Your query considers any space not thought to be used by live tuples to be wasted. That includes space currently occupied by dead tuples, space that used to be occupied by dead tuples but has since been vacuumed away and is now available for reuse, and space reserved by fillfactor settings, which is available for updates but not for inserts.
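As an illustration of the fillfactor case (a hypothetical table, not from the question): a table created like this leaves roughly 30% of every heap page free as rows are inserted, so heuristic bloat queries will report that space as "wasted" even though the table never contained a single dead tuple.
CREATE TABLE t_fillfactor_demo (
    id      integer,
    payload text
) WITH (fillfactor = 70);  -- each heap page is filled to at most 70%,
                           -- reserving the rest for HOT updates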

Related

Split comma separated data and get its respective value from another table

I have concatenated data in table1:
 id | concats         | sum
----+-----------------+-----
  1 | b,c             |
  2 | a,k,f,l,s       |
  3 | b,f,t           |
  4 | a,b,h,k,l,q,s,t |
  5 | b,c,k,f,p,s     |
  6 | a,c,q,s         |
and another table (table2) with the grade values:
 grade | score
-------+-------
 a     |  4.82
 b     |  2.65
 c     |  2.56
 d     |  2.75
 g     |  6.90
 h     |  5.90
 k     |  6.41
 f     | 12.80
 l     |  2.56
 p     | 12.80
 q     |  1.35
 s     |  2.90
 t     |  5.97
I want to update table1.sum; for example, for b,c the sum would be 2.65 + 2.56 = 5.21.
I tried the code below, but it raises an error.
UPDATE table1 as t1 SET sum =
(SELECT (CASE WHEN (SELECT SPLIT_PART(concats,',',1) from t1) = t2.grade then t2.score ELSE 0 END) +
(CASE WHEN (SELECT SPLIT_PART(concats,',',2) from t1) = t2.grade then t2.score ELSE 0 END) +
(CASE WHEN (SELECT SPLIT_PART(concats,',',3) from t1) = t2.grade then t2.score ELSE 0 END) +
(CASE WHEN (SELECT SPLIT_PART(concats,',',4) from t1) = t2.grade then t2.score ELSE 0 END) +
(CASE WHEN (SELECT SPLIT_PART(concats,',',5) from t1) = t2.grade then t2.score ELSE 0 END) +
(CASE WHEN (SELECT SPLIT_PART(concats,',',6) from t1) = t2.grade then t2.score ELSE 0 END) +
(CASE WHEN (SELECT SPLIT_PART(concats,',',7) from t1) = t2.grade then t2.score ELSE 0 END ) +
(CASE WHEN (SELECT SPLIT_PART(concats,',',8) from t1) = t2.grade then t2.score ELSE 0 END )
FROM table2 AS t2 )
You can join the two tables by converting the dreaded CSV columns to an array, then do the GROUP BY and sum on the result of that. This can be used to update the target table:
update table1
set sum = x.sum_score
from (
select t1.id,
sum(t2.score) as sum_score
from table1 t1
join table2 t2 on t2.grade = any(string_to_array(t1.concats, ','))
group by t1.id
) x
where x.id = table1.id;
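As a quick check against the sample data (the arithmetic follows from the two tables above): for id = 1 the value b,c matches grades b (2.65) and c (2.56), so sum becomes 5.21. Note that any row whose concats matches no grade in table2 is left untouched, because the derived table produces no row for it.
select id, concats, sum
from table1
order by id;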

How can you generate a date list from a range in Amazon Redshift?

Getting date list in a range in PostgreSQL shows how to get a date range in PostgreSQL. However, Redshift does not support generate_series():
ans=> select (generate_series('2012-06-29', '2012-07-03', '1 day'::interval))::date;
ERROR: function generate_series("unknown", "unknown", interval) does not exist
HINT: No function matches the given name and argument types. You may need to add explicit type casts.
Is there a way to replicate what generate_series() does in Redshift?
A hack, but it works: use a table with many rows and a window function to generate the series.
This works as long as the series you are generating is smaller than the number of rows in the table you use to generate it.
WITH x(dt) AS (SELECT '2016-01-01'::date)
SELECT
dateadd(
day,
COUNT(*) over(rows between unbounded preceding and current row) - 1,
dt)
FROM users, x
LIMIT 100
The initial date 2016-01-01 controls the start date, and the LIMIT controls the number of days in the generated series.
Update: this will only run on the leader node.
Redshift has partial support for the generate_series function, but unfortunately does not mention it in the documentation.
This works and is the shortest and most legible way of generating a series of dates as of this writing (2018-01-29):
SELECT ('2016-01-01'::date + x)::date
FROM generate_series(1, 100, 1) x
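One caveat that follows from the leader-node restriction (the users table below is just a stand-in for any regular table): generate_series cannot be combined with tables stored on the compute nodes, so a query like this fails rather than producing a cross join.
-- Fails: generate_series runs only on the leader node, while the
-- users table lives on the compute nodes (exact error text varies
-- by Redshift version)
SELECT u.user_id, ('2016-01-01'::date + x)::date AS d
FROM users u, generate_series(1, 100, 1) x;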
One option if you don't want to rely on any existing tables is to pre-generate a series table filled with a range of numbers, one for each row.
create table numbers as (
select
p0.n
+ p1.n * power(2,1)
+ p2.n * power(2,2)
+ p3.n * power(2,3)
+ p4.n * power(2,4)
+ p5.n * power(2,5)
+ p6.n * power(2,6)
+ p7.n * power(2,7)
+ p8.n * power(2,8)
+ p9.n * power(2,9)
+ p10.n * power(2,10)
as number
from
(select 0 as n union select 1) p0,
(select 0 as n union select 1) p1,
(select 0 as n union select 1) p2,
(select 0 as n union select 1) p3,
(select 0 as n union select 1) p4,
(select 0 as n union select 1) p5,
(select 0 as n union select 1) p6,
(select 0 as n union select 1) p7,
(select 0 as n union select 1) p8,
(select 0 as n union select 1) p9,
(select 0 as n union select 1) p10
order by 1
);
This will create a table with numbers from 0 to 2047 (2^11 - 1); if you need more numbers, just add more clauses.
Once you have this table, you can join to it as a substitute for generate_series:
with date_range as (select
'2012-06-29'::timestamp as start_date ,
'2012-07-03'::timestamp as end_date
)
select
dateadd(day, number::int, start_date)
from date_range
inner join numbers on number <= datediff(day, start_date, end_date)
@michael_erasmus It's interesting, and I made a change for possibly better performance, replacing the multiplications with bit shifts:
CREATE OR REPLACE VIEW v_series_0_to_1024 AS SELECT
p0.n
| (p1.n << 1)
| (p2.n << 2)
| (p3.n << 3)
| (p4.n << 4)
| (p5.n << 5)
| (p6.n << 6)
| (p7.n << 7)
| (p8.n << 8)
| (p9.n << 9)
as number
from
(select 0 as n union select 1) p0,
(select 0 as n union select 1) p1,
(select 0 as n union select 1) p2,
(select 0 as n union select 1) p3,
(select 0 as n union select 1) p4,
(select 0 as n union select 1) p5,
(select 0 as n union select 1) p6,
(select 0 as n union select 1) p7,
(select 0 as n union select 1) p8,
(select 0 as n union select 1) p9
order by number
Last 30 days as a date series (note that, despite its name, the view yields numbers 0 through 1023):
select dateadd(day, -number, current_date) as dt from v_series_0_to_1024 where number < 30
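The same view also covers the original question's fixed date range; a sketch along the lines of the numbers-table example above:
select dateadd(day, number::int, '2012-06-29'::date) as dt
from v_series_0_to_1024
where number <= datediff(day, '2012-06-29'::date, '2012-07-03'::date);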

Debugging: LATERAL query in PostgreSQL

SELECT
cu.user_id,
cu.gender,
CASE WHEN cu.looking_for_gender = cu.gender THEN 1 ELSE 0 END AS
sexual_orientation,
os_name,
ROUND((DATE(NOW()) - cu.birthdate)/365.25) AS user_age,
SUM(dsb.likes) AS likes,
SUM(dsb.dislikes) AS dislikes,
SUM(dsb.blocks) AS blocks,
SUM(dsb.matches) AS matches,
SUM(dsb.received_likes) AS received_likes,
SUM(dsb.received_dislikes) AS received_dislikes,
SUM(dsb.received_blocks) AS received_blocks,
CASE WHEN cu.status = 'default' THEN 1 ELSE 0 END AS recall_case,
CASE WHEN cu.status = 'default' THEN extract(epoch from
cu.last_activity - cu.updated_time)/86400 ELSE 0 END AS
recall_retention
FROM ( SELECT stats.core_users cu
LEFT JOIN yay.daily_swipes_by_users dsb ON (dsb.user_id = cu.user_id)
WHERE cu.user_id = '1' GROUP BY 1) e1
LEFT JOIN LATERAL (SELECT cd.os_name FROM stats.core_devices cd WHERE
e1.user_id = cd.user_id ORDER BY cd.updated_time DESC LIMIT 1) e2
ON TRUE;
Current Error Code:
ERROR: syntax error at or near "LEFT"
LINE 18: LEFT JOIN yay.daily_swipes_by_users dsb ON (dsb.user_id = cu...
^
The supplied query would fail in many ways. The following might work, I hope, but as you can see it drops a great deal of other columns in the process:
SELECT
e1.user_id
, e1.cu
, e2.os_name
FROM (
SELECT cu, cu.user_id
FROM stats.core_users cu
LEFT JOIN yay.daily_swipes_by_users dsb ON (dsb.user_id = cu.user_id)
WHERE cu.user_id = '1'
GROUP BY cu, cu.user_id
) e1
LEFT JOIN LATERAL(
SELECT cd.os_name
FROM stats.core_devices cd
WHERE e1.user_id = cd.user_id
ORDER BY cd.updated_time DESC
LIMIT 1) e2 ON TRUE
;
SELECT
cu.user_id,
cu.gender,
CASE WHEN cu.looking_for_gender = cu.gender THEN 1 ELSE 0 END AS sexual_orientation,
e2.os_name,
ROUND((DATE(NOW()) - cu.birthdate)/365.25) AS user_age,
CASE WHEN cu.status = 'default' THEN 1 ELSE 0 END AS recall_case,
CASE WHEN cu.status = 'default' THEN extract(epoch from cu.last_activity - cu.updated_time)/86400 ELSE 0 END AS recall_retention,
SUM(dsb.likes) AS likes,
SUM(dsb.dislikes) AS dislikes,
SUM(dsb.blocks) AS blocks,
SUM(dsb.matches) AS matches,
SUM(dsb.received_likes) AS received_likes,
SUM(dsb.received_dislikes) AS received_dislikes,
SUM(dsb.received_blocks) AS received_blocks
FROM
stats.core_users cu
LEFT JOIN yay.daily_swipes_by_users dsb ON (dsb.user_id = cu.user_id)
LEFT JOIN LATERAL (SELECT cd.os_name FROM stats.core_devices cd WHERE cu.user_id = cd.user_id ORDER BY cd.updated_time DESC LIMIT 1) e2
ON TRUE
WHERE cu.user_id = '1'
GROUP BY 1,2,3,4,5,6,7
;
This works.
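To see why, here is the bare shape of the pattern with hypothetical table names: LATERAL is what allows the subquery to reference a column from an earlier FROM item, and LEFT ... ON TRUE keeps rows that have no match.
SELECT p.user_id, latest.os_name
FROM some_users p
LEFT JOIN LATERAL (
    SELECT d.os_name
    FROM some_devices d
    WHERE d.user_id = p.user_id   -- correlation is legal because of LATERAL
    ORDER BY d.updated_time DESC  -- newest device first
    LIMIT 1                       -- keep only that one
) latest ON TRUE;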

Postgresql: Select query on view returning no records

I have a view named vw_check_space in my public schema (using PostgreSQL 9.4). When I run
select * from public.vw_check_space;
as the postgres user, I get a list of rows, but when I run the same query as another user, user1, it returns nothing.
View:
CREATE OR REPLACE VIEW public.vw_check_space AS
WITH constants AS (
SELECT current_setting('block_size'::text)::numeric AS bs,
23 AS hdr,
8 AS ma
), no_stats AS (
SELECT columns.table_schema,
columns.table_name,
psut.n_live_tup::numeric AS est_rows,
pg_table_size(psut.relid::regclass)::numeric AS table_size
FROM columns
JOIN pg_stat_user_tables psut ON columns.table_schema::name = psut.schemaname AND columns.table_name::name = psut.relname
LEFT JOIN pg_stats ON columns.table_schema::name = pg_stats.schemaname AND columns.table_name::name = pg_stats.tablename AND columns.column_name::name = pg_stats.attname
WHERE pg_stats.attname IS NULL AND (columns.table_schema::text <> ALL (ARRAY['pg_catalog'::character varying, 'information_schema'::character varying]::text[]))
GROUP BY columns.table_schema, columns.table_name, psut.relid, psut.n_live_tup
), null_headers AS (
SELECT constants.hdr + 1 + sum(
CASE
WHEN pg_stats.null_frac <> 0::double precision THEN 1
ELSE 0
END) / 8 AS nullhdr,
sum((1::double precision - pg_stats.null_frac) * pg_stats.avg_width::double precision) AS datawidth,
max(pg_stats.null_frac) AS maxfracsum,
pg_stats.schemaname,
pg_stats.tablename,
constants.hdr,
constants.ma,
constants.bs
FROM pg_stats
CROSS JOIN constants
LEFT JOIN no_stats ON pg_stats.schemaname = no_stats.table_schema::name AND pg_stats.tablename = no_stats.table_name::name
WHERE (pg_stats.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND no_stats.table_name IS NULL AND (EXISTS ( SELECT 1
FROM columns
WHERE pg_stats.schemaname = columns.table_schema::name AND pg_stats.tablename = columns.table_name::name))
GROUP BY pg_stats.schemaname, pg_stats.tablename, constants.hdr, constants.ma, constants.bs
), data_headers AS (
SELECT null_headers.ma,
null_headers.bs,
null_headers.hdr,
null_headers.schemaname,
null_headers.tablename,
(null_headers.datawidth + (null_headers.hdr + null_headers.ma -
CASE
WHEN (null_headers.hdr % null_headers.ma) = 0 THEN null_headers.ma
ELSE null_headers.hdr % null_headers.ma
END)::double precision)::numeric AS datahdr,
null_headers.maxfracsum * (null_headers.nullhdr + null_headers.ma -
CASE
WHEN (null_headers.nullhdr % null_headers.ma::bigint) = 0 THEN null_headers.ma::bigint
ELSE null_headers.nullhdr % null_headers.ma::bigint
END)::double precision AS nullhdr2
FROM null_headers
), table_estimates AS (
SELECT data_headers.schemaname,
data_headers.tablename,
data_headers.bs,
pg_class.reltuples::numeric AS est_rows,
pg_class.relpages::numeric * data_headers.bs AS table_bytes,
ceil(pg_class.reltuples * (data_headers.datahdr::double precision + data_headers.nullhdr2 + 4::double precision + data_headers.ma::double precision -
CASE
WHEN (data_headers.datahdr % data_headers.ma::numeric) = 0::numeric THEN data_headers.ma::numeric
ELSE data_headers.datahdr % data_headers.ma::numeric
END::double precision) / (data_headers.bs - 20::numeric)::double precision) * data_headers.bs::double precision AS expected_bytes,
pg_class.reltoastrelid
FROM data_headers
JOIN pg_class ON data_headers.tablename = pg_class.relname
JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid AND data_headers.schemaname = pg_namespace.nspname
WHERE pg_class.relkind = 'r'::"char"
), estimates_with_toast AS (
SELECT table_estimates.schemaname,
table_estimates.tablename,
true AS can_estimate,
table_estimates.est_rows,
table_estimates.table_bytes + COALESCE(toast.relpages, 0)::numeric * table_estimates.bs AS table_bytes,
table_estimates.expected_bytes + ceil(COALESCE(toast.reltuples, 0::real) / 4::double precision) * table_estimates.bs::double precision AS expected_bytes
FROM table_estimates
LEFT JOIN pg_class toast ON table_estimates.reltoastrelid = toast.oid AND toast.relkind = 't'::"char"
), table_estimates_plus AS (
SELECT current_database() AS databasename,
estimates_with_toast.schemaname,
estimates_with_toast.tablename,
estimates_with_toast.can_estimate,
estimates_with_toast.est_rows,
CASE
WHEN estimates_with_toast.table_bytes > 0::numeric THEN estimates_with_toast.table_bytes
ELSE NULL::numeric
END AS table_bytes,
CASE
WHEN estimates_with_toast.expected_bytes > 0::double precision THEN estimates_with_toast.expected_bytes::numeric
ELSE NULL::numeric
END AS expected_bytes,
CASE
WHEN estimates_with_toast.expected_bytes > 0::double precision AND estimates_with_toast.table_bytes > 0::numeric AND estimates_with_toast.expected_bytes <= estimates_with_toast.table_bytes::double precision THEN (estimates_with_toast.table_bytes::double precision - estimates_with_toast.expected_bytes)::numeric
ELSE 0::numeric
END AS bloat_bytes
FROM estimates_with_toast
UNION ALL
SELECT current_database() AS databasename,
no_stats.table_schema,
no_stats.table_name,
false AS bool,
no_stats.est_rows,
no_stats.table_size,
NULL::numeric AS "numeric",
NULL::numeric AS "numeric"
FROM no_stats
), bloat_data AS (
SELECT current_database() AS databasename,
table_estimates_plus.schemaname,
table_estimates_plus.tablename,
table_estimates_plus.can_estimate,
table_estimates_plus.table_bytes,
round(table_estimates_plus.table_bytes / (1024::double precision ^ 2::double precision)::numeric, 3) AS table_mb,
table_estimates_plus.expected_bytes,
round(table_estimates_plus.expected_bytes / (1024::double precision ^ 2::double precision)::numeric, 3) AS expected_mb,
round(table_estimates_plus.bloat_bytes * 100::numeric / table_estimates_plus.table_bytes) AS pct_bloat,
round(table_estimates_plus.bloat_bytes / (1024::numeric ^ 2::numeric), 2) AS mb_bloat,
table_estimates_plus.est_rows
FROM table_estimates_plus
)
SELECT bloat_data.databasename,
bloat_data.schemaname,
bloat_data.tablename,
bloat_data.can_estimate,
bloat_data.table_bytes,
bloat_data.table_mb,
bloat_data.expected_bytes,
bloat_data.expected_mb,
bloat_data.pct_bloat,
bloat_data.mb_bloat,
bloat_data.est_rows
FROM bloat_data
ORDER BY bloat_data.pct_bloat DESC;
I have granted CONNECT on the database and USAGE and SELECT privileges to user1. I am not sure what other privileges I might be missing here. Any help would be appreciated.
PS: I have also granted USAGE and SELECT on the schema and the tables the view uses.
https://www.postgresql.org/docs/9.4/static/view-pg-stats.html
The view pg_stats provides access to the information stored in the
pg_statistic catalog. This view allows access only to rows of
pg_statistic that correspond to tables the user has permission to
read, and therefore it is safe to allow public read access to this
view.
https://www.postgresql.org/docs/9.4/static/monitoring-stats.html
pg_stat_user_tables Same as pg_stat_all_tables, except that only user
tables are shown.
So even after you grant read access on the other owners' tables to the user, you are still joining pg_stat_user_tables, which will cut the list down to only those tables the user owns. Either exclude it from the view, or use a left outer join instead of an inner join.
I'm talking about the JOIN against pg_stat_user_tables specifically, but you should check every table you join and read the documentation for every view you include in your query.
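A quick way to test this explanation (run each count once as postgres and once as user1, then compare):
SELECT count(*) FROM pg_stats;            -- pg_stats only shows tables the current user can read
SELECT count(*) FROM pg_stat_user_tables; -- compare this count between the two roles as well
If the counts are much smaller for user1, the view's joins are silently filtering away all of its rows.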

Replace Subselect for something more efficient

I have this query, which takes a long time, partly because the number of records in the table exceeds 500,000, but the join I have to use slows it down quite a lot, at least as far as I can tell:
SELECT TOP (10) PERCENT H1.DateCompteur, CASE WHEN (h1.cSortie - h2.cSortie > 0)
THEN h1.cSortie - h2.cSortie ELSE 0 END AS Compte, H1.IdMachine
FROM dbo.T_HistoriqueCompteur AS H1 INNER JOIN
dbo.T_HistoriqueCompteur AS H2 ON H1.IdMachine = H2.IdMachine AND H2.DateCompteur =
(SELECT MAX(DateCompteur) AS Expr1
FROM dbo.T_HistoriqueCompteur AS HS
WHERE (DateCompteur < H1.DateCompteur) AND (H1.IdMachine = IdMachine))
ORDER BY H1.DateCompteur DESC
The ORDER BY is important since I need only the most recent information. I tried using the ID field in my subselect, since the rows are ordered by date, but could not detect any significant improvement.
SELECT TOP (10) PERCENT H1.DateCompteur, CASE WHEN (h1.cSortie - h2.cSortie > 0)
THEN h1.cSortie - h2.cSortie ELSE 0 END AS Compte, H1.IdMachine
FROM dbo.T_HistoriqueCompteur AS H1 INNER JOIN
dbo.T_HistoriqueCompteur AS H2 ON H1.IdMachine = H2.IdMachine AND H2.ID =
(SELECT MAX(ID) AS Expr1
FROM dbo.T_HistoriqueCompteur AS HS
WHERE (ID < H1.ID) AND (H1.IdMachine = IdMachine))
ORDER BY H1.DateCompteur DESC
The table I use looks a little like this (it has many more columns, but they are unused in this query):
ID bigint
IdMachine bigint
cSortie bigint
DateCompteur datetime
I think that if I could get rid of the subselect, my query would run much faster, but I can't really find a way to do so. What I really want to do is find the previous row with the same IdMachine so that I can calculate the difference between the two cSortie values. The CASE in the query is there because sometimes the counter is reset to 0, and in that case I want to return 0 instead of a negative value.
So my question is: can I do better than what I already have? I plan to put this in a view, if that makes a difference.
Try this query
WITH T as
(
SELECT TOP (10) PERCENT H1.DateCompteur, H1.cSortie as cSortie1, H1.IdMachine,
(
SELECT TOP 1 H2.cSortie
FROM dbo.T_HistoriqueCompteur H2
WHERE (H2.DateCompteur < H1.DateCompteur) AND (H1.IdMachine = H2.IdMachine)
ORDER BY H2.DateCompteur DESC
) as cSortie2
FROM dbo.T_HistoriqueCompteur AS H1
ORDER BY H1.DateCompteur DESC
)
select DateCompteur,
CASE WHEN (cSortie1 - cSortie2 > 0)
THEN cSortie1 - cSortie2
ELSE 0 END
AS Compte,
IdMachine
FROM T
You could also try CTEs (common table expressions) with windowing functions (ROW_NUMBER):
;WITH CTE AS
(
SELECT ID,IdMachine,cSortie,ROW_NUMBER() OVER(PARTITION BY h.IdMachine ORDER BY ID ASC) AS [ROW]
FROM T_HistoriqueCompteur h
)
SELECT
TOP (10) PERCENT
H1.DateCompteur,
CASE WHEN (h1.cSortie - h2.cSortie > 0) THEN h1.cSortie - h2.cSortie
ELSE 0
END AS Compte,
H1.IdMachine
FROM dbo.T_HistoriqueCompteur AS H1
INNER JOIN CTE cte on cte.idmachine = h1.idmachine and cte.id = h1.id
INNER JOIN CTE h2 on h2.idmachine = cte.idmachine and h2.row + 1 = cte.row
ORDER BY H1.DateCompteur DESC
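If you are on SQL Server 2012 or later, the LAG window function expresses "the previous row with the same IdMachine" directly, with no self-join or subselect at all; a sketch against the same table (note that the first row of each machine gets Compte = 0 here, whereas the original inner join drops it):
SELECT TOP (10) PERCENT
       DateCompteur,
       CASE WHEN cSortie - prev_cSortie > 0
            THEN cSortie - prev_cSortie
            ELSE 0
       END AS Compte,
       IdMachine
FROM (
    SELECT DateCompteur,
           cSortie,
           IdMachine,
           -- previous cSortie for the same machine, ordered by date
           LAG(cSortie) OVER (PARTITION BY IdMachine
                              ORDER BY DateCompteur) AS prev_cSortie
    FROM dbo.T_HistoriqueCompteur
) x
ORDER BY DateCompteur DESC;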