PostgreSQL Checkpoint Discrepancy

There is a problem that I don't understand in one of our databases. Our configuration is as follows:
archive_mode = on
archive_timeout = 900
checkpoint_timeout = 60min
checkpoint_completion_target = 0.9
max_wal_size = 4GB
We do not hit the max_wal_size limit. Our average is 60 × 0.9 = 54 minutes between checkpoints, which makes sense.
postgres=# SELECT
total_checkpoints,
seconds_since_start / total_checkpoints / 60 AS minutes_between_checkpoints
FROM
(SELECT
EXTRACT(EPOCH FROM (now() - pg_postmaster_start_time())) AS seconds_since_start,
(checkpoints_timed+checkpoints_req) AS total_checkpoints
FROM pg_stat_bgwriter
) AS sub;
-[ RECORD 1 ]---------------+------------
total_checkpoints | 240
minutes_between_checkpoints | 54.63359986
Yet yesterday at 13:19 I checked the latest checkpoint:
postgres=# SELECT * FROM pg_control_checkpoint();
-[ RECORD 1 ]--------+-------------------------
checkpoint_lsn | 862/D67582F0
prior_lsn | 862/7EBA9A80
redo_lsn | 862/87008050
redo_wal_file | 000000030000086200000087
timeline_id | 3
prev_timeline_id | 3
full_page_writes | t
next_xid | 0:1484144344
next_oid | 8611735
next_multixact_id | 151786
next_multi_offset | 305073
oldest_xid | 1284151498
oldest_xid_dbid | 1905285
oldest_active_xid | 1484144342
oldest_multi_xid | 1
oldest_multi_dbid | 1905305
oldest_commit_ts_xid | 0
newest_commit_ts_xid | 0
checkpoint_time | 2022-09-21 12:19:17+02
So more than 60 minutes had passed since the latest checkpoint; it should have taken another checkpoint by then. Archive mode is enabled with a 15-minute timeout, but it does not take a checkpoint. According to the official documentation, the only possible explanation is that no WAL is being generated, but we generate lots of WAL; this is a very active database (though not active enough to fill 4 GB of WAL). What am I missing?
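One quick way to confirm WAL is actually advancing (a sketch, assuming PostgreSQL 10 or later for pg_current_wal_lsn) is to compare the current write LSN a few minutes apart:
-- Run twice, a few minutes apart; a growing LSN means WAL is being generated.
SELECT pg_current_wal_lsn(), now();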
Thanks!

That seems perfectly fine.
With your settings, PostgreSQL will run a checkpoint every hour and time it to take around 54 minutes. So 90% of the time you have some checkpoint activity, and 10% nothing. Of course this timing is not 100% accurate, so don't worry about a minute up or down.
If you want to observe this behavior in more detail, set log_checkpoints = on. Then you will get a log message whenever a checkpoint starts and whenever it completes. Leave this setting on; it is useful information for debugging database problems.
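For example, it can be switched on without a restart (a sketch, assuming PostgreSQL 9.4 or later for ALTER SYSTEM):
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();  -- log_checkpoints only needs a configuration reload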

Related

Scala Spark get sum by time bucket across time spans and key

I have a question that is very similar to How to group by time interval in Spark SQL
However, my metric is time spent (duration), so my data looks like
KEY |Event_Type | duration | Time
001 |event1 | 10 | 2016-05-01 10:49:51
002 |event2 | 100 | 2016-05-01 10:50:53
001 |event3 | 20 | 2016-05-01 10:50:55
001 |event1 | 15 | 2016-05-01 10:51:50
003 |event1 | 13 | 2016-05-01 10:55:30
001 |event2 | 12 | 2016-05-01 10:57:00
001 |event3 | 11 | 2016-05-01 11:00:01
Is there a way to sum the time spent into five-minute buckets, grouped by key, and know when the duration goes outside the bounds of the bucket?
For example, the first row starts at 10:49:51 and ends at 10:50:01
Thus, the bucket for key 001 in window [2016-05-01 10:45:00.0, 2016-05-01 10:50:00.0] would get 8 seconds of duration (seconds 51 to 60), and the 10:50 to 10:55 bucket would get 2 seconds of duration, plus the relevant seconds from other log lines (20 seconds from the third row, 15 from the 4th row).
I want to sum the time in a specific bucket, but the solution on the other thread of
df.groupBy($"KEY", window($"time", "5 minutes")).sum("metric")
would overcount the bucket that a spanning timestamp starts in and undercount the subsequent buckets.
Note: My Time column is also in Epoch timestamps like 1636503077, but I can easily cast it to the above format if that makes this calculation easier.
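For reference, epoch seconds cast directly to timestamps in Spark SQL, so converting to the format above is a one-liner (the literal is the example value from this question):
SELECT CAST(1636503077 AS TIMESTAMP) AS Time  -- seconds since epoch -> timestamp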
In my opinion, you may need to preprocess your data by splitting each duration across minute (or five-minute) boundaries.
For example, the first row
001 |event1 | 10 | 2016-05-01 10:49:51
should be converted to
001 |event1 | 9 | 2016-05-01 10:49:51
001 |event1 | 1 | 2016-05-01 10:50:00
Then you can use the Spark window function to sum it properly:
df.groupBy($"KEY", window($"time", "5 minutes")).sum("metric")
That will not change the result if you only want the total duration per time bucket, but it will increase the record count.
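A minimal Spark SQL sketch of the same splitting idea (rather than the DataFrame API shown above), assuming a registered view events(KEY, Event_Type, duration, Time) with positive durations in seconds and Spark 2.4+ for sequence/explode; here each event is split per second, so every exploded row contributes exactly one second to its bucket:
SELECT
  KEY,
  window(sec_ts, '5 minutes') AS bucket,
  COUNT(*) AS duration_seconds          -- one exploded row = one second
FROM (
  SELECT
    KEY,
    CAST(unix_timestamp(Time) + offset AS TIMESTAMP) AS sec_ts
  FROM events
  LATERAL VIEW explode(sequence(0, duration - 1)) t AS offset
) per_second
GROUP BY KEY, window(sec_ts, '5 minutes')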

slow running postgres11 logical replication from EC2 to RDS

I'm trying to move a Postgres (11.6) database on EC2 to RDS (Postgres 11.6). I started replication a couple of nights ago and have now noticed that it has slowed down considerably, judging by how fast the database size is increasing on the subscriber (SELECT pg_database_size('db_name')/1024/1024). Here are some stats of the environment:
Publisher Node:
Instance type: r5.24xlarge
Disk: 5 TB GP2 with 16,000 PIOPS
Database size via pg_database_size/1024/1024: 2,295,955 MB
Subscriber Node:
Instance type: RDS r5.24xlarge
Disk: 3 TB GP2
Here is the current DB size for the subscriber and publisher:
Publisher:
SELECT pg_database_size('db_name')/1024/1024 db_size_publisher;
db_size_publisher
-------------------
2295971
(1 row)
Subscriber:
SELECT pg_database_size('db_name')/1024/1024 as db_size_subscriber;
db_size_subscriber
--------------------
1506348
(1 row)
It seems there is still about 789 GB left to replicate, and I've noticed that the subscriber DB is increasing at a rate of about 250 kB/sec:
db_name=> SELECT pg_database_size('db_name')/1024/1024, current_timestamp;
?column? | current_timestamp
----------+-------------------------------
1506394 | 2020-05-21 06:27:46.028805-07
(1 row)
db_name=> SELECT pg_database_size('db_name')/1024/1024, current_timestamp;
?column? | current_timestamp
----------+-------------------------------
1506396 | 2020-05-21 06:27:53.542946-07
(1 row)
At this rate, it would take another 30 days to finish replication, which makes me think I've set something up wrong.
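A back-of-the-envelope check, using the sizes above and assuming the ~250 kB/s rate holds:
-- (2295971 - 1506348) MB remaining at 0.25 MB/s is on the order of a month.
SELECT 2295971 - 1506348 AS mb_remaining,
       round((2295971 - 1506348) / 0.25 / 86400) AS days_remaining;  -- ~37 days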
Here are also some other stats from the publisher and subscriber:
Subscriber pg_stat_subscription:
db_name=> select * from pg_stat_subscription;
subid | subname | pid | relid | received_lsn | last_msg_send_time | last_msg_receipt_time | latest_end_lsn | latest_end_time
-------+----------------+-------+-------+---------------+-------------------------------+-------------------------------+----------------+-------------------------------
21562 | rds_subscriber | 2373 | 18411 | | 2020-05-20 18:41:54.132407-07 | 2020-05-20 18:41:54.132407-07 | | 2020-05-20 18:41:54.132407-07
21562 | rds_subscriber | 43275 | | 4811/530587E0 | 2020-05-21 06:15:55.160231-07 | 2020-05-21 06:15:55.16003-07 | 4811/5304BD10 | 2020-05-21 06:15:54.931053-07
(2 rows)
At this rate it would take weeks to complete. What am I doing wrong here?

Select top 10 from subquery of median CPU usage and display time series data with Influx

I want to create a graph panel in Grafana which shows the top 10 highest consumers of CPU and shows their respective history over whatever time interval has been selected. I think that last part is the tricky bit.
I have this so far:
SELECT TOP("median_Percent_Processor_Time", 10) as "usage", host FROM (
SELECT median("Percent_Processor_Time") AS "median_Percent_Processor_Time" FROM "telegraf_monitoring"."autogen"."win_cpu" WHERE time > now() - 5s GROUP BY time(:interval:), "host" FILL(none)
)
This produces the following table:
time | usage | host
12/17/18 02:38:36PM | 88.4503173828125 | CNVDWSO202
12/17/18 02:38:36PM | 60.55384826660156 | CNVDSerr01
12/17/18 02:38:36PM | 46.807456970214844 | NVsABAr01
12/17/18 02:38:36PM | 27.402353286743164 | NVDARCH02
12/17/18 02:38:36PM | 21.320478439331055 | NVDABAr05
12/17/18 02:38:36PM | 5.546620845794678 | NVDALMBOE
12/17/18 02:38:36PM | 3.654918909072876 | NVDLeNCXE01
12/17/18 02:38:36PM | 47.08285903930664 | NVDOKTARAD01
The table is useful, but that's just a single point in time. I need to subsequently query and pull time series data from that win_cpu measurement for those 10 hosts. The host values are dynamic; I have no way of predicting what will show up, so I can't string together OR statements, and Influx doesn't support IN as far as I can see.
You can use a regular expression instead of IN: "host" =~ /HOST1|HOST2|HOST3/ plus GROUP BY "host", and a single InfluxDB query will return all the data. The tricky part is the Grafana variable that holds those top 10 hosts. Once you have it, just use advanced variable formatting in the regexp condition, for example =~ /${tophosts:pipe}/.
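A sketch of the follow-up time-series query under that approach, reusing the tophosts variable from the example above (the exact variable name and time range are placeholders):
SELECT median("Percent_Processor_Time") AS "usage"
FROM "telegraf_monitoring"."autogen"."win_cpu"
WHERE "host" =~ /${tophosts:pipe}/ AND time > now() - 1h
GROUP BY time(:interval:), "host" FILL(none)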

Cognos Calculate Variance Crosstab (Dimensional)

This is very similar to Cognos Calculate Variance Crosstab (Relational), but my data source is dimensional.
I have a simple crosstab such as this:
| 04-13-2013 | 04-13-2014
---------------------------------------
Sold | 75 | 50
Purchased | 10 | 15
Repaired | 33 | 44
Filter: The user selects 1 date and then we include that date plus 1 year ago.
Dimension: The date is the day level in a YQMD Hierarchy.
Measures: We are showing various measures from a Measure Dimension.
Sold
Purchased
Repaired
Here is what it looks like in Report Studio:
| <#Day#> | <#Day#>
---------------------------------------
<#Sold#> | <#1234#> | <#1234#>
<#Purchased#> | <#1234#> | <#1234#>
<#Repaired#> | <#1234#> | <#1234#>
I want to be able to calculate the variance as a percentage between the two time periods for each measure like this.
| 04-13-2013 | 04-13-2014 | Var. %
-----------------------------------------------
Sold | 75 | 50 | -33%
Purchased | 10 | 15 | 50%
Repaired | 33 | 44 | 33%
I added a Query Expression to the right of the <#Day#> as shown below, but I cannot get the variance calculation to work.
| <#Day#> | <#Variance#>
---------------------------------------
<#Sold#> | <#1234#> | <#1234#>
<#Purchased#> | <#1234#> | <#1234#>
<#Repaired#> | <#1234#> | <#1234#>
These are the expressions I've tried and the results that I get:
An expression that is hard coded works, but only for that 1 measure:
total(case when [date] = 2014-04-13 then [Sold] end)
/
total(case when [date] = 2013-04-13 then [Sold] end)
-1
I thought CurrentMember and PrevMember might work, but it produces blank cells:
CurrentMember( [YQMD Hierarchy] )
/
prevMember(CurrentMember([YQMD Hierarchy]))
-1
I think it is because prevMember produces blank.
prevMember(CurrentMember([YQMD Hierarchy]))
Using only CurrentMember gives a total of both columns:
CurrentMember([YQMD Hierarchy])
What expression can I use to take advantage of my dimensional model and add a column with % variance?
These are the pages I used for research:
Variance reporting in Report Studio on Cognos 8.4?
Calculations that span dimensions - PDF
IBM Cognos 10 Report Studio: Creating Consumer-Friendly Reports
I hope there is a better way to do this. I finally found a resource that describes one approach to this problem. Using the tail and head functions, we can get to the first and last periods, and thereby calculate the % variance.
item(tail(members([Day])),0)
/
item(head(members([Day])),0)
-1
This idea came from IBM Cognos BI – Using Dimensional Functions to Determine Current Period.
Example 2 – Find Current Period by Filtering on Measure Data
If the OLAP or DMR data source has been populated with time periods into the future (e.g. end of year or future years), then the calculation of current period is more complicated. However, it can still be determined by finding the latest period that has data for a given measure.
item(tail(filter(members([sales_and_marketing].[Time].[Time].[Month]),
tuple([Revenue], currentMember([sales_and_marketing].[Time].[Time]))
is not null), 1), 0)

What's the unit of buffers_checkpoint in pg_stat_bgwriter table?

I'm using PostgreSQL 9.1.6 and trying to build a monitoring application for the PostgreSQL server.
I'm planning to select physical and logical I/O stats from the pg_stat_* information tables.
According to the manual, the unit of the fields in pg_stat_database is a block, which means 8 kB.
postgres=# select * from pg_stat_database where datname='postgres';
-[ RECORD 3 ]-+------------------------------
datid | 12780
datname | postgres
numbackends | 2
xact_commit | 974
xact_rollback | 57
blks_read | 210769
blks_hit | 18664177
tup_returned | 16074339
tup_fetched | 35121
tup_inserted | 18182015
tup_updated | 572
tup_deleted | 3075
conflicts | 0
I could figure out the size of physical reads using blks_read * 8 kB.
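For example, with the blks_read value from the record above and the default 8 kB block size:
SELECT pg_size_pretty(210769 * 8192::bigint) AS physical_reads;  -- roughly 1.6 GB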
However, there is no comment on the unit of the stats in pg_stat_bgwriter.
postgres=# select * from pg_stat_bgwriter;
-[ RECORD 1 ]---------+------------------------------
checkpoints_timed | 276
checkpoints_req | 8
buffers_checkpoint | 94956
buffers_clean | 0
maxwritten_clean | 0
buffers_backend | 82618
buffers_backend_fsync | 0
buffers_alloc | 174760
stats_reset | 2013-07-15 22:27:05.503125+09
How can I calculate the size of physical writes from buffers_checkpoint?
Any advice would be much appreciated.
Taken from the de facto performance handbook, "PostgreSQL 9.0 High Performance" by Greg Smith, in the chapter on Database Activity and Statistics:
What percentage of the time are checkpoints being requested based on activity instead of time passing?
How much data does the average checkpoint write?
What percentage of the data being written out happens from checkpoints and backends, respectively?
SELECT
(100 * checkpoints_req) /
(checkpoints_timed + checkpoints_req) AS checkpoints_req_pct,
pg_size_pretty(buffers_checkpoint * block_size /
(checkpoints_timed + checkpoints_req)) AS avg_checkpoint_write,
pg_size_pretty(block_size *
(buffers_checkpoint + buffers_clean + buffers_backend)) AS total_written,
100 * buffers_checkpoint /
(buffers_checkpoint + buffers_clean + buffers_backend) AS checkpoint_write_pct,
100 * buffers_backend /
(buffers_checkpoint + buffers_clean + buffers_backend) AS backend_write_pct,
*
FROM pg_stat_bgwriter,
(SELECT cast(current_setting('block_size') AS integer) AS block_size) bs;
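Applied to the numbers in the question, and assuming the default 8 kB block size, the same buffers-times-block-size arithmetic gives the checkpoint write volume directly:
SELECT pg_size_pretty(94956 * 8192::bigint) AS checkpoint_writes;  -- about 742 MB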