What's the unit of buffers_checkpoint in the pg_stat_bgwriter table? - postgresql

I'm using PostgreSQL 9.1.6 and trying to build a monitoring application for a PostgreSQL server.
I'm planning to select physical and logical I/O stats from the pg_stat_* statistics views.
According to the manual, the unit of the fields in pg_stat_database is the block, which is 8 kB by default.
postgres=# select * from pg_stat_database where datname='postgres';
-[ RECORD 3 ]-+------------------------------
datid | 12780
datname | postgres
numbackends | 2
xact_commit | 974
xact_rollback | 57
blks_read | 210769
blks_hit | 18664177
tup_returned | 16074339
tup_fetched | 35121
tup_inserted | 18182015
tup_updated | 572
tup_deleted | 3075
conflicts | 0
I could figure out the size of physical reads using blks_read * 8 kB.
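For example, something like this should give the physical read volume (reading block_size from the server rather than hard-coding 8 kB):
-- physical read volume for one database; block_size is 8 kB by default
SELECT datname,
       pg_size_pretty(blks_read * current_setting('block_size')::bigint) AS physical_read
FROM pg_stat_database
WHERE datname = 'postgres';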
However, there is no comment on the unit of the stats in pg_stat_bgwriter.
postgres=# select * from pg_stat_bgwriter;
-[ RECORD 1 ]---------+------------------------------
checkpoints_timed | 276
checkpoints_req | 8
buffers_checkpoint | 94956
buffers_clean | 0
maxwritten_clean | 0
buffers_backend | 82618
buffers_backend_fsync | 0
buffers_alloc | 174760
stats_reset | 2013-07-15 22:27:05.503125+09
How can I calculate the size of physical writes from buffers_checkpoint?
Any advice would be much appreciated.

Taken from the de facto performance handbook "PostgreSQL 9.0 High Performance" by Greg Smith, in the chapter on Database Activity and Statistics:
What percentage of the time are checkpoints being requested based on activity instead of time passing?
How much data does the average checkpoint write?
What percentage of the data being written out happens from checkpoints and backends, respectively?
SELECT
    (100 * checkpoints_req) /
        (checkpoints_timed + checkpoints_req) AS checkpoints_req_pct,
    pg_size_pretty(buffers_checkpoint * block_size /
        (checkpoints_timed + checkpoints_req)) AS avg_checkpoint_write,
    pg_size_pretty(block_size *
        (buffers_checkpoint + buffers_clean + buffers_backend)) AS total_written,
    100 * buffers_checkpoint /
        (buffers_checkpoint + buffers_clean + buffers_backend) AS checkpoint_write_pct,
    100 * buffers_backend /
        (buffers_checkpoint + buffers_clean + buffers_backend) AS backend_write_pct,
    *
FROM pg_stat_bgwriter,
    (SELECT cast(current_setting('block_size') AS integer) AS block_size) AS bs;
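In short: the buffers_* counters in pg_stat_bgwriter are measured in buffers, and one buffer is one block (block_size, 8 kB by default), just like blks_read in pg_stat_database. So the physical write volume from checkpoints is simply:
-- bytes written by checkpoints since stats_reset
SELECT pg_size_pretty(
           buffers_checkpoint * current_setting('block_size')::bigint
       ) AS written_by_checkpoints
FROM pg_stat_bgwriter;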

Related

PostgreSQL Checkpoint Discrepancy

There is a problem in one of our databases that I don't understand. Our configuration is as follows:
archive_mode = on
archive_timeout = 900
checkpoint_timeout = 60min
checkpoint_completion_target = 0.9
max_wal_size = 4GB
We do not hit the max_wal_size limit. Our average checkpoint interval is 60 * 0.9 = 54 minutes, which makes sense:
postgres=# SELECT
total_checkpoints,
seconds_since_start / total_checkpoints / 60 AS minutes_between_checkpoints
FROM
(SELECT
EXTRACT(EPOCH FROM (now() - pg_postmaster_start_time())) AS seconds_since_start,
(checkpoints_timed+checkpoints_req) AS total_checkpoints
FROM pg_stat_bgwriter
) AS sub;
-[ RECORD 1 ]---------------+------------
total_checkpoints | 240
minutes_between_checkpoints | 54.63359986
Yet, yesterday I checked the latest checkpoint at 13:19:
postgres=# SELECT * FROM pg_control_checkpoint();
-[ RECORD 1 ]--------+-------------------------
checkpoint_lsn | 862/D67582F0
prior_lsn | 862/7EBA9A80
redo_lsn | 862/87008050
redo_wal_file | 000000030000086200000087
timeline_id | 3
prev_timeline_id | 3
full_page_writes | t
next_xid | 0:1484144344
next_oid | 8611735
next_multixact_id | 151786
next_multi_offset | 305073
oldest_xid | 1284151498
oldest_xid_dbid | 1905285
oldest_active_xid | 1484144342
oldest_multi_xid | 1
oldest_multi_dbid | 1905305
oldest_commit_ts_xid | 0
newest_commit_ts_xid | 0
checkpoint_time | 2022-09-21 12:19:17+02
So more than 60 minutes had passed since the latest checkpoint; it should have taken another checkpoint by then. Archive mode is enabled with a 15-minute timeout, but it does not take a checkpoint. The only possible explanation according to the official documentation is that no WAL is being generated, but we generated lots of WAL; this is a very active database (though not active enough to fill 4 GB of WAL). What am I missing?
Thanks!
That seems perfectly fine.
With your settings, PostgreSQL will run a checkpoint every hour and pace it to take around 54 minutes. So 90% of the time there is some checkpoint activity, and 10% of the time there is none. Of course this timing is not 100% accurate, so don't worry about a minute up or down.
If you want to observe this behavior in more detail, set log_checkpoints = on. Then you will get a log message whenever a checkpoint starts and whenever it completes. Leave this setting on; it is useful information for debugging database problems.
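If you'd rather not edit postgresql.conf by hand, a minimal way to enable it (assuming a superuser session) is:
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();  -- log_checkpoints only needs a reload, not a restart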

Postgres percent_rank() across rows, not columns

Data:
I have a Postgres table where each row contains a question_id and counts of how many times users pressed each feedback button.
+----------+-----------+------+-----+----------+
| Question | Very Good | Good | Bad | Very Bad |
+----------+-----------+------+-----+----------+
| 1 | 23 | 12 | 23 | 67 |
+----------+-----------+------+-----+----------+
| 2 | 56 | 90 | 23 | 18 |
+----------+-----------+------+-----+----------+
Requirement:
I want to convert each value in a row into a percentage of the row total.
+----------+-----------+-------+-------+----------+
| Question | Very Good | Good | Bad | Very Bad |
+----------+-----------+-------+-------+----------+
| 1 | 18.4 | 9.6 | 18.4 | 53.6 |
+----------+-----------+-------+-------+----------+
| 2 | 29.94 | 48.12 | 12.29 | 9.6 |
+----------+-----------+-------+-------+----------+
Attempt:
I found that percent_rank() gives me a percentage computed down a column; I'm wondering, is there a similar function that works row-wise?
SELECT
    question_id,
    PERCENT_RANK() OVER (ORDER BY Very_good),
    PERCENT_RANK() OVER (ORDER BY Good),
    PERCENT_RANK() OVER (ORDER BY Bad),
    PERCENT_RANK() OVER (ORDER BY Very_bad)
FROM Question_feedback
I'm afraid the only thing that will work is to do this manually:
SELECT
    question_id,
    -- multiplying by 100 turns the fraction into a percentage, as in the desired output
    100 * Very_good::double precision / (Very_good + Good + Bad + Very_bad) AS very_good_pct,
    100 * Good::double precision / (Very_good + Good + Bad + Very_bad) AS good_pct,
    100 * Bad::double precision / (Very_good + Good + Bad + Very_bad) AS bad_pct,
    100 * Very_bad::double precision / (Very_good + Good + Bad + Very_bad) AS very_bad_pct
FROM Question_feedback
The good news is it will be faster than PERCENT_RANK because it only needs to consider that row, which is much cheaper.
Working Solution
WITH QUESTION_FEEDBACK
AS (SELECT 1 AS QUESTION,
23 VERYGOOD,
12 GOOD,
23 BAD,
67 VERYBAD
UNION ALL
SELECT 2 AS QUESTION,
56 VERYGOOD,
90 GOOD,
23 BAD,
18 VERYBAD
)
SELECT QUESTION,
VERYGOOD,
GOOD,
BAD,
VERYBAD,
(CAST(VERYGOOD AS DECIMAL) / SUM (VERYGOOD + GOOD + BAD + VERYBAD) OVER (PARTITION BY QUESTION))*100 VERYGOODPER,
(CAST(GOOD AS DECIMAL) / SUM (VERYGOOD + GOOD + BAD + VERYBAD) OVER (PARTITION BY QUESTION))*100 GOODPER,
(CAST(BAD AS DECIMAL) / SUM (VERYGOOD + GOOD + BAD + VERYBAD) OVER (PARTITION BY QUESTION) )*100 BADPER,
(CAST(VERYBAD AS DECIMAL) / SUM (VERYGOOD + GOOD + BAD + VERYBAD) OVER (PARTITION BY QUESTION))*100 VERYBADPER
FROM QUESTION_FEEDBACK
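Since each question is a single row here, the window function isn't strictly necessary; a plain subquery that computes a per-row total should produce the same numbers (reusing the same sample data as above):
WITH question_feedback AS (
    SELECT 1 AS question, 23 AS verygood, 12 AS good, 23 AS bad, 67 AS verybad
    UNION ALL
    SELECT 2, 56, 90, 23, 18
)
SELECT question,
       round(100.0 * verygood / total, 2) AS verygoodper,
       round(100.0 * good / total, 2) AS goodper,
       round(100.0 * bad / total, 2) AS badper,
       round(100.0 * verybad / total, 2) AS verybadper
FROM (SELECT *, verygood + good + bad + verybad AS total
      FROM question_feedback) AS q;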

PostgreSQL calculate with calculated value from previous rows

The problem I need to solve:
To calculate the number of hours per day that are credited for (public) holidays or days of illness, the average working hours of the previous 3 months are used (with a starting value of 8 hours per day).
The tricky part is that the calculated value of the previous month needs to be factored in: if there was a public holiday last month that was assigned a calculated value of 8.5 hours, those calculated hours influence the average working hours per day for that month, which in turn is used to assign working hours to the current month's holidays.
So far I have only come up with the following, which doesn't yet factor in the row-by-row calculation:
WITH
const (h_target, h_extra) AS (VALUES (8.0, 20)),
monthly_sums (c_month, d_work, d_off, h_work) AS (VALUES
('2018-12', 16, 5, 150.25),
('2019-01', 20, 3, 171.25),
('2019-02', 15, 5, 120.5)
),
calc AS (
SELECT
ms.*,
(ms.d_work + ms.d_off) AS d_total,
(ms.h_work + ms.d_off * const.h_target) AS h_total,
(avg((ms.h_work + ms.d_off * const.h_target) / (ms.d_work + ms.d_off))
OVER (ORDER BY ms.c_month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW))::numeric(10,2)
AS h_off
FROM monthly_sums AS ms
CROSS JOIN const
)
SELECT
calc.c_month,
calc.d_work,
calc.d_off,
calc.d_total,
calc.h_work,
calc.h_off,
(d_off * lag(h_off, 1, const.h_target) OVER (ORDER BY c_month)) AS h_off_sum,
(h_work + d_off * lag(h_off, 1, const.h_target) OVER (ORDER BY c_month)) AS h_sum
FROM calc CROSS JOIN const;
...giving the following result:
c_month | d_work | d_off | d_total | h_work | h_off | h_off_sum | h_sum
---------+--------+-------+---------+--------+-------+-----------+--------
2018-12 | 16 | 5 | 21 | 150.25 | 9.06 | 40.0 | 190.25
2019-01 | 20 | 3 | 23 | 171.25 | 8.77 | 27.18 | 198.43
2019-02 | 15 | 5 | 20 | 120.5 | 8.52 | 43.85 | 164.35
(3 rows)
This calculates correctly for the first row, and for the second row in the columns that rely on previous-row values (lag), but the average-hours-per-day calculation is obviously wrong, as I couldn't figure out how to feed the current row's value (h_sum) back into the calculation of the new h_off.
The desired result should be as follows:
c_month | d_work | d_off | d_total | h_work | h_off | h_off_sum | h_sum
---------+--------+-------+---------+--------+-------+-----------+--------
2018-12 | 16 | 5 | 21 | 150.25 | 9.06 | 40.0 | 190.25
2019-01 | 20 | 3 | 23 | 171.25 | 8.84 | 27.18 | 198.43
2019-02 | 15 | 5 | 20 | 120.5 | 8.64 | 44.2 | 164.7
(3 rows)
...meaning h_off is used for the next month's h_off_sum and the resulting h_sum, and the h_sum values of the available months (at most three) in turn feed into the calculation of the current month's h_off (essentially avg(h_sum / d_total) over up to three months).
So, actual calculation is:
c_month | calculation | h_off
---------+----------------------------------------------------+-------
| | 8.00 << initial
.---------------------- uses ---------------------^
2018-12 | ((190.25 / 21)) / 1 | 9.06
.------------ uses ---------------^
2019-01 | ((190.25 / 21) + (198.43 / 23)) / 2 | 8.84
.--- uses --------^
2019-02 | ((190.25 / 21) + (198.43 / 23) + (164.7 / 20)) / 3 | 8.64
P.S.: I am using PostgreSQL 11, so I have the latest features at hand if that makes any difference.
I wasn't able to solve this inter-column plus inter-row calculation problem with window functions at all; I had to fall back to a special use of a recursive CTE and introduce special-purpose columns for the days (d_total_1) and hours (h_sum_1) of the 3rd historical month (since you cannot join the recursive temporary table more than once).
In addition, I added a 4th row to the input data and used an additional index column to refer to when joining, which would normally be generated with a sub-query like this:
SELECT ROW_NUMBER() OVER (ORDER BY c_month) AS row_num, * FROM monthly_sums
So, here's my take at it:
WITH RECURSIVE calc AS (
SELECT
monthly_sums.row_num,
monthly_sums.c_month,
monthly_sums.d_work,
monthly_sums.d_off,
monthly_sums.h_work,
(monthly_sums.d_off * 8)::numeric(10,2) AS h_off_sum,
monthly_sums.d_work + monthly_sums.d_off AS d_total,
0.0 AS d_total_1,
(monthly_sums.h_work + monthly_sums.d_off * 8)::numeric(10,2) AS h_sum,
0.0 AS h_sum_1,
(
(monthly_sums.h_work + monthly_sums.d_off * 8)
/
(monthly_sums.d_work + monthly_sums.d_off)
)::numeric(10,2) AS h_off
FROM
(
SELECT * FROM (VALUES
(1, '2018-12', 16, 5, 150.25),
(2, '2019-01', 20, 3, 171.25),
(3, '2019-02', 15, 5, 120.5),
(4, '2019-03', 19, 2, 131.75)
) AS tmp (row_num, c_month, d_work, d_off, h_work)
) AS monthly_sums
WHERE
monthly_sums.row_num = 1
UNION ALL
SELECT
monthly_sums.row_num,
monthly_sums.c_month,
monthly_sums.d_work,
monthly_sums.d_off,
monthly_sums.h_work,
lat_off.h_off_sum::numeric(10,2),
lat_days.d_total,
calc.d_total AS d_total_1,
lat_sum.h_sum::numeric(10,2),
calc.h_sum AS h_sum_1,
lat_calc.h_off::numeric(10,2)
FROM
(
SELECT * FROM (VALUES
(1, '2018-12', 16, 5, 150.25),
(2, '2019-01', 20, 3, 171.25),
(3, '2019-02', 15, 5, 120.5),
(4, '2019-03', 19, 2, 131.75)
) AS tmp (row_num, c_month, d_work, d_off, h_work)
) AS monthly_sums
INNER JOIN calc ON (calc.row_num = monthly_sums.row_num - 1),
LATERAL (SELECT monthly_sums.d_work + monthly_sums.d_off AS d_total) AS lat_days,
LATERAL (SELECT monthly_sums.d_off * calc.h_off AS h_off_sum) AS lat_off,
LATERAL (SELECT monthly_sums.h_work + lat_off.h_off_sum AS h_sum) AS lat_sum,
LATERAL (SELECT
(calc.h_sum_1 + calc.h_sum + lat_sum.h_sum)
/
(calc.d_total_1 + calc.d_total + lat_days.d_total)
AS h_off
) AS lat_calc
WHERE
monthly_sums.row_num > 1
)
SELECT c_month, d_work, d_off, d_total, h_work, h_off, h_off_sum, h_sum FROM calc
;
...which gives:
c_month | d_work | d_off | d_total | h_work | h_off | h_off_sum | h_sum
---------+--------+-------+---------+--------+-------+-----------+--------
2018-12 | 16 | 5 | 21 | 150.25 | 9.06 | 40.00 | 190.25
2019-01 | 20 | 3 | 23 | 171.25 | 8.83 | 27.18 | 198.43
2019-02 | 15 | 5 | 20 | 120.5 | 8.65 | 44.15 | 164.65
2019-03 | 19 | 2 | 21 | 131.75 | 8.00 | 17.30 | 149.05
(4 rows)
(PostgreSQL's default type conversion behavior is to round numeric values, so the result is slightly different from what was initially expected, but actually correct.)
Please note that PostgreSQL is generally pretty picky about data types and refuses to process queries like this whenever there is a discrepancy that could potentially lead to loss of precision (e.g. numeric vs. integer), which is why I have used explicit types for the columns in both places.
One of the final pieces of the puzzle was solved by using LATERAL subqueries, which enables me to have one calculation reference the result of a previous one and even shift around columns in the final output independent of the calculation hierarchy.
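To illustrate that pattern in isolation, here is a tiny sketch (with made-up names) of chaining LATERAL subqueries so that each step can reference the results of the previous one:
SELECT t.x, a.double_x, b.double_x_plus_one
FROM (VALUES (1), (2)) AS t (x),
     LATERAL (SELECT t.x * 2 AS double_x) AS a,                  -- references t
     LATERAL (SELECT a.double_x + 1 AS double_x_plus_one) AS b;  -- references a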
If anyone can come up with a simpler variant I'd be happy to learn about it.

Update intermediate result

EDIT
As requested, a little background on what I want to achieve. I have a table that I want to query, but I don't want to change the table itself. Next, the result of the SELECT query (what I called the 'intermediate table') needs to be cleaned up a bit. For example, certain cells of certain rows need to be swapped and some strings need to be trimmed. Of course this could all be done as postprocessing in, e.g., Python, but I was hoping to do all of it in one query.
Being new to PostgreSQL, I want to update the intermediate table that results from a SELECT statement. So I basically want to edit the table resulting from a SELECT statement in one query. I'd like to avoid having to store the intermediate result.
I've tried the following WITH clause:
with result as (
select
a
from
b
)
update result as r
set
a = 'd'
...but that results in ERROR: relation "result" does not exist, while the following does work:
with result as (
select
a
from
b
)
select
*
from
result
As I said, I'm new to Postgresql so it is entirely possible that I'm using the wrong approach.
A plain CTE is just a named subquery, not a relation you can UPDATE, which is why the first attempt fails. Depending on the complexity of the transformations you want to perform, you might be able to munge them into the SELECT itself, which would let you get away with a single query:
WITH foo AS (SELECT lower(name), freq, cumfreq, rank, vec FROM names WHERE name LIKE 'G%')
SELECT ... FROM foo WHERE ...
Or, for more or less unlimited manipulation options, you could create a temp table that will disappear at the end of the current transaction. That doesn't get the job done in a single query, but it does get it all done on the SQL server, which might still be worthwhile.
db=# BEGIN;
BEGIN
db=# CREATE TEMP TABLE foo ON COMMIT DROP AS SELECT * FROM names WHERE name LIKE 'G%';
SELECT 4677
db=# SELECT * FROM foo LIMIT 5;
name | freq | cumfreq | rank | vec
----------+-------+---------+------+-----------------------
GREEN | 0.183 | 11.403 | 35 | 'KRN':1 'green':1
GONZALEZ | 0.166 | 11.915 | 38 | 'KNSL':1 'gonzalez':1
GRAY | 0.106 | 15.921 | 69 | 'KR':1 'gray':1
GONZALES | 0.087 | 18.318 | 94 | 'KNSL':1 'gonzales':1
GRIFFIN | 0.084 | 18.659 | 98 | 'KRFN':1 'griffin':1
(5 rows)
db=# UPDATE foo SET name = lower(name);
UPDATE 4677
db=# SELECT * FROM foo LIMIT 5;
name | freq | cumfreq | rank | vec
--------+-------+---------+-------+---------------------
grube | 0.002 | 67.691 | 7333 | 'KRP':1 'grube':1
gasper | 0.001 | 69.999 | 9027 | 'KSPR':1 'gasper':1
gori | 0.000 | 81.360 | 28946 | 'KR':1 'gori':1
goeltz | 0.000 | 85.471 | 47269 | 'KLTS':1 'goeltz':1
gani | 0.000 | 86.202 | 51743 | 'KN':1 'gani':1
(5 rows)
db=# COMMIT;
COMMIT
db=# SELECT * FROM foo;
ERROR: relation "foo" does not exist

PostgreSQL - Pull earliest timestamp per user

I have a table which records each time a user performs a certain behavior, with a timestamp for each occurrence. I need to pull one row per user with the earliest timestamp, as part of a nested query.
As an example, the table looks like this:
 row | user_id | timestamp  | description
-----+---------+------------+-------------
   1 |     100 | 02-02-2010 | android
   2 |     100 | 02-03-2010 | ios
   3 |     100 | 02-05-2010 | windows
   4 |     111 | 02-01-2010 | ios
   5 |     112 | 02-03-2010 | android
   6 |     112 | 02-04-2010 | android
And my query should pull just rows 1, 4 and 5.
Thanks!
This should help. I don't understand your nested query part, though.
SELECT user_id, MIN(timestamp) AS min_timestamp
FROM table1
GROUP BY user_id
ORDER BY user_id;
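Note that GROUP BY on its own won't let you pull the other columns (like description) from the earliest row. If you need the whole row, PostgreSQL's DISTINCT ON is a common way to get it (using the same table1 as above):
-- one row per user_id, keeping the row with the earliest timestamp
SELECT DISTINCT ON (user_id) *
FROM table1
ORDER BY user_id, timestamp;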