PostgreSQL 9.5.2 RDS in AWS
select name,setting from pg_settings
where name like '%vacuum%'
order by name;
name | setting
-------------------------------------+-----------
autovacuum | on
autovacuum_analyze_scale_factor | 0.05
autovacuum_analyze_threshold | 50
autovacuum_freeze_max_age | 450000000
autovacuum_max_workers | 3
autovacuum_multixact_freeze_max_age | 400000000
autovacuum_naptime | 30
autovacuum_vacuum_cost_delay | 20
autovacuum_vacuum_cost_limit | -1
autovacuum_vacuum_scale_factor | 0.1
autovacuum_vacuum_threshold | 50
autovacuum_work_mem | -1
log_autovacuum_min_duration | 0
rds.force_autovacuum_logging_level | log
vacuum_cost_delay | 0
vacuum_cost_limit | 300
vacuum_cost_page_dirty | 20
vacuum_cost_page_hit | 1
vacuum_cost_page_miss | 10
vacuum_defer_cleanup_age | 0
vacuum_freeze_min_age | 50000000
vacuum_freeze_table_age | 250000000
vacuum_multixact_freeze_min_age | 5000000
vacuum_multixact_freeze_table_age | 150000000
I've been trying to figure out how auto vacuuming is working in two Postgres databases. The databases are identical in size, parameters and structure. (These are two data warehouses for the same application - different locations and different patterns of data).
We are using partitions for some of our very large tables. I've noticed that the older (static) partitions are regularly getting auto vacuumed. I understand that XIDs are frozen, but the relation does need periodic vacuuming to look for any new XIDs.
I've been using this query to look for relations that will require vacuuming to avoid XID wrap around:
SELECT 'Relation Name',age(c.relfrozenxid) c_age, age(t.relfrozenxid) t_age,
greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind IN ('r', 'm')
order by age desc limit 5;
?column? | c_age | t_age | age
---------------+-----------+-----------+-----------
Relation Name | 461544753 | | 461544753
Relation Name | 461544753 | | 461544753
Relation Name | 461544753 | | 461544753
Relation Name | 461544753 | | 461544753
Relation Name | 461544753 | 310800517 | 461544753
All of the relations listed are old stable partitions. The column relfrozenxid is defined as: "All transaction IDs before this one have been replaced with a permanent ("frozen") transaction ID in this table. This is used to track whether the table needs to be vacuumed in order to prevent transaction ID wraparound or to allow pg_clog to be shrunk."
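The age() arithmetic in the query above can be pictured with a toy model. This is a hedged sketch, not PostgreSQL's actual implementation (which reserves a few special XIDs): transaction IDs are 32-bit counters that wrap around, and age() is essentially the modular distance from relfrozenxid to the current XID.

```python
XID_MODULUS = 2**32

def xid_age(current_xid, relfrozenxid):
    """Modular distance from relfrozenxid to the current XID."""
    return (current_xid - relfrozenxid) % XID_MODULUS

# A table whose relfrozenxid is 1000 while the cluster's XID counter
# sits at 461,545,753 shows an age in wraparound-danger territory:
print(xid_age(461_545_753, 1_000))      # 461544753

# The modular arithmetic still works after the counter wraps past 2^32:
print(xid_age(500, XID_MODULUS - 100))  # 600
```

Because only half the 32-bit space can be "in the past", an age approaching ~2 billion would be catastrophic, which is why autovacuum forces freezing long before that.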
Out of curiosity I looked at relfrozenxid for all of the partitions of a particular table:
SELECT c.oid::regclass as table_name,age(c.relfrozenxid) as age , c.reltuples::int, n_live_tup, n_dead_tup,
date_trunc('day',last_autovacuum)
FROM pg_class c
JOIN pg_stat_user_tables u on c.relname = u.relname
WHERE c.relkind IN ('r', 'm')
and c.relname like 'tablename%'
table_name | age | reltuples | n_live_tup | n_dead_tup | date_trunc
-------------------------------------+-----------+-----------+------------+------------+------------------------
schema_partition.tablename_201202 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201306 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201204 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201110 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201111 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201112 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201201 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201203 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201109 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201801 | 435086084 | 37970232 | 37970230 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201307 | 433975635 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201107 | 433975635 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201312 | 433975635 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201311 | 433975635 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201401 | 433975635 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201310 | 423675180 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201704 | 423222113 | 43842668 | 43842669 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201612 | 423222113 | 65700844 | 65700845 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201705 | 423221655 | 46847336 | 46847338 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201702 | 423171142 | 50701032 | 50701031 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_overflow | 423171142 | 754 | 769 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201106 | 421207271 | 1 | 1 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201309 | 421207271 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201108 | 421207271 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201308 | 421207271 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201806 | 374122782 | 44626756 | 44626757 | 0 | 2018-09-26 00:00:00+00
schema.tablename | 360135561 | 0 | 0 | 0 | 2018-09-27 00:00:00+00
I'm fairly sure I don't fully understand how relfrozenxid works, but the partitions do appear to be affected by activity on the parent table (which would change the relfrozenxid values of the child partitions). I can't find any documentation about this. I would expect that for static tables relfrozenxid would remain unchanged until a vacuum occurred.
Additionally I have a handful of relations that have static data that apparently have never been auto vacuumed (last_autovacuum is null). Could this be a result of a VACUUM FREEZE operation?
I am new to Postgres and I readily admit to not fully understanding the auto vacuum processes.
I'm not seeing any performance problems that I can identify.
Edit:
I set up a query to run every 4 hours against one partitioned table:
SELECT c.oid::regclass as table_name,age(c.relfrozenxid) as age , c.reltuples::int, n_live_tup, n_dead_tup,
date_trunc('day',last_autovacuum)
FROM pg_class c
JOIN pg_stat_user_tables u on c.relname = u.relname
WHERE c.relkind IN ('r', 'm')
and c.relname like 'sometable%'
order by age desc;
Looking at two different partitions here is the output for the last 20 hours:
schemaname.sometable_201812 | 206286536 | 0 | 0 | 0 |
schemaname.sometable_201812 | 206286537 | 0 | 0 | 0 |
schemaname.sometable_201812 | 225465100 | 0 | 0 | 0 |
schemaname.sometable_201812 | 225465162 | 0 | 0 | 0 |
schemaname.sometable_201812 | 225465342 | 0 | 0 | 0 |
schemaname.sometable_201812 | 236408374 | 0 | 0 | 0 |
-bash-4.2$ grep 201610 test1.out
schemaname.sometable_201610 | 449974426 | 31348368 | 31348369 | 0 | 2018-09-22 00:00:00+00
schemaname.sometable_201610 | 449974427 | 31348368 | 31348369 | 0 | 2018-09-22 00:00:00+00
schemaname.sometable_201610 | 469152990 | 31348368 | 31348369 | 0 | 2018-09-22 00:00:00+00
schemaname.sometable_201610 | 50000051 | 31348368 | 31348369 | 0 | 2018-10-10 00:00:00+00
schemaname.sometable_201610 | 50000231 | 31348368 | 31348369 | 0 | 2018-10-10 00:00:00+00
schemaname.sometable_201610 | 60943263 | 31348368 | 31348369 | 0 | 2018-10-10 00:00:00+00
The relfrozenxid of partitions is being modified even though there is no direct DML against the partitions. I would assume that inserts into the base table are somehow modifying the relfrozenxid of the partitions.
The partition sometable_201610 has 31 million rows but is static. Looking at the log files, the autovacuum of this kind of partition takes 20-30 minutes. I don't know whether that is a performance problem, but it does seem expensive. The logs also show that typically several of these large partitions are autovacuumed every night. (Many of the zero-tuple partitions are autovacuumed as well, but those take very little time.)
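The pattern of static, zero-dead-tuple partitions still being vacuumed is consistent with autovacuum's anti-wraparound trigger. As a rough sketch (a simplified model, not the actual PostgreSQL logic), autovacuum vacuums a table when either dead tuples pass the scale-factor threshold or age(relfrozenxid) passes autovacuum_freeze_max_age (450,000,000 in the settings above):

```python
def needs_autovacuum(n_dead_tup, reltuples, relfrozenxid_age,
                     vacuum_threshold=50, vacuum_scale_factor=0.1,
                     freeze_max_age=450_000_000):
    """True if either the dead-tuple or the anti-wraparound trigger fires."""
    dead_tuple_trigger = n_dead_tup > vacuum_threshold + vacuum_scale_factor * reltuples
    wraparound_trigger = relfrozenxid_age > freeze_max_age
    return dead_tuple_trigger or wraparound_trigger

# A static partition with zero dead tuples is still vacuumed once
# age(relfrozenxid) crosses autovacuum_freeze_max_age:
print(needs_autovacuum(n_dead_tup=0, reltuples=31_348_368,
                       relfrozenxid_age=460_250_527))  # True
print(needs_autovacuum(n_dead_tup=0, reltuples=31_348_368,
                       relfrozenxid_age=100_000_000))  # False
```

This would also explain the ages dropping to roughly 50,000,051 in the output above: an aggressive vacuum advances relfrozenxid to about the current XID minus vacuum_freeze_min_age (50,000,000 here), after which the age starts climbing again as the cluster consumes XIDs.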
Related
We have had an incredibly long-running autovacuum process on one of our smaller database machines that we believe has been consuming a lot of Aurora:StorageIOUsage:
We determined this by running SELECT * FROM pg_stat_activity WHERE wait_event_type = 'IO';
and seeing the below results repeatedly.
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query | backend_type
--------+----------------------------+-------+----------+-----------+------------------+----------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+--------------+--------+-------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------
398954 | postgres | 17582 | | | | | | | 2022-09-29 18:45:55.364654+00 | 2022-09-29 18:46:20.253107+00 | 2022-09-29 18:46:20.253107+00 | 2022-09-29 18:46:20.253108+00 | IO | DataFileRead | active | | 66020718 | autovacuum: VACUUM pg_catalog.pg_depend | autovacuum worker
398954 | postgres | 17846 | | | | | | | 2022-09-29 18:46:04.092536+00 | 2022-09-29 18:46:29.196309+00 | 2022-09-29 18:46:29.196309+00 | 2022-09-29 18:46:29.19631+00 | IO | DataFileRead | active | | 66020732 | autovacuum: VACUUM pg_toast.pg_toast_2618 | autovacuum worker
As you can see from the screenshot it has been going for well over a month, and is mainly for the pg_depend, pg_attribute, and pg_toast_2618 tables which are not all that large. I haven't been able to find any reason why these tables would need so much vacuuming other than maybe a database restore from our production environment (this is one of our lower environments). Here are the pg_stat_sys_tables entries for those tables and the pg_rewrite which is the table that pg_toast_2618 is associated with:
relid | schemaname | relname | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup | n_dead_tup | n_mod_since_analyze | last_vacuum | last_autovacuum | last_analyze | last_autoanalyze | vacuum_count | autovacuum_count | analyze_count | autoanalyze_count
-------+------------+---------------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------+------------+------------+---------------------+-------------+-------------------------------+--------------+-------------------------------+--------------+------------------+---------------+-------------------
1249 | pg_catalog | pg_attribute | 185251 | 12594432 | 31892996 | 119366792 | 1102817 | 3792 | 1065737 | 1281 | 543392 | 1069529 | 23584 | | 2022-09-29 18:53:25.227334+00 | | 2022-09-28 01:12:47.628499+00 | 0 | 1266763 | 0 | 36
2608 | pg_catalog | pg_depend | 2429 | 369003445 | 14152628 | 23494712 | 7226948 | 0 | 7176855 | 0 | 476267 | 7176855 | 0 | | 2022-09-29 18:52:34.523257+00 | | 2022-09-28 02:02:52.232822+00 | 0 | 950137 | 0 | 71
2618 | pg_catalog | pg_rewrite | 25 | 155083 | 1785288 | 1569100 | 64127 | 314543 | 62472 | 59970 | 7086 | 377015 | 13869 | | 2022-09-29 18:53:11.288732+00 | | 2022-09-23 18:54:50.771969+00 | 0 | 1280018 | 0 | 81
2838 | pg_toast | pg_toast_2618 | 0 | 0 | 1413436 | 3954640 | 828571 | 0 | 825143 | 0 | 15528 | 825143 | 1653714 | | 2022-09-29 18:52:47.242386+00 | | | 0 | 608881 | 0 | 0
I'm pretty new to Postgres, and I'm wondering what could possibly cause this many records to need cleaning up, and why it would take well over a month given that autovacuum is enabled. We are running Postgres version 10.17 on a single db.t3.medium, and the only thing I can think of at this point is to try increasing the instance size. Do we simply need to increase the database instance size on our Aurora cluster so that this can be done more in memory? I'm at a bit of a loss for how to reduce this huge sustained spike in Storage IO costs.
Additional information on our autovacuum settings:
=> SELECT * FROM pg_catalog.pg_settings WHERE name LIKE '%autovacuum%';
name | setting | unit | category | short_desc | extra_desc | context | vartype | source | min_val | max_val | enumvals | boot_val | reset_val | sourcefile | sourceline | pending_restart
-------------------------------------+-----------+------+-------------------------------------+-------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------+---------+--------------------+---------+------------+-----------------------------------------------------------------------------------------+-----------+-----------+-----------------------------------+------------+-----------------
autovacuum | on | | Autovacuum | Starts the autovacuum subprocess. | | sighup | bool | configuration file | | | | on | on | /rdsdbdata/config/postgresql.conf | 78 | f
autovacuum_analyze_scale_factor | 0.05 | | Autovacuum | Number of tuple inserts, updates, or deletes prior to analyze as a fraction of reltuples. | | sighup | real | configuration file | 0 | 100 | | 0.1 | 0.05 | /rdsdbdata/config/postgresql.conf | 55 | f
autovacuum_analyze_threshold | 50 | | Autovacuum | Minimum number of tuple inserts, updates, or deletes prior to analyze. | | sighup | integer | default | 0 | 2147483647 | | 50 | 50 | | | f
autovacuum_freeze_max_age | 200000000 | | Autovacuum | Age at which to autovacuum a table to prevent transaction ID wraparound. | | postmaster | integer | default | 100000 | 2000000000 | | 200000000 | 200000000 | | | f
autovacuum_max_workers | 3 | | Autovacuum | Sets the maximum number of simultaneously running autovacuum worker processes. | | postmaster | integer | configuration file | 1 | 262143 | | 3 | 3 | /rdsdbdata/config/postgresql.conf | 45 | f
autovacuum_multixact_freeze_max_age | 400000000 | | Autovacuum | Multixact age at which to autovacuum a table to prevent multixact wraparound. | | postmaster | integer | default | 10000 | 2000000000 | | 400000000 | 400000000 | | | f
autovacuum_naptime | 5 | s | Autovacuum | Time to sleep between autovacuum runs. | | sighup | integer | configuration file | 1 | 2147483 | | 60 | 5 | /rdsdbdata/config/postgresql.conf | 9 | f
autovacuum_vacuum_cost_delay | 5 | ms | Autovacuum | Vacuum cost delay in milliseconds, for autovacuum. | | sighup | integer | configuration file | -1 | 100 | | 20 | 5 | /rdsdbdata/config/postgresql.conf | 73 | f
autovacuum_vacuum_cost_limit | -1 | | Autovacuum | Vacuum cost amount available before napping, for autovacuum. | | sighup | integer | default | -1 | 10000 | | -1 | -1 | | | f
autovacuum_vacuum_scale_factor | 0.1 | | Autovacuum | Number of tuple updates or deletes prior to vacuum as a fraction of reltuples. | | sighup | real | configuration file | 0 | 100 | | 0.2 | 0.1 | /rdsdbdata/config/postgresql.conf | 22 | f
autovacuum_vacuum_threshold | 50 | | Autovacuum | Minimum number of tuple updates or deletes prior to vacuum. | | sighup | integer | default | 0 | 2147483647 | | 50 | 50 | | | f
autovacuum_work_mem | -1 | kB | Resource Usage / Memory | Sets the maximum memory to be used by each autovacuum worker process. | | sighup | integer | default | -1 | 2147483647 | | -1 | -1 | | | f
log_autovacuum_min_duration | -1 | ms | Reporting and Logging / What to Log | Sets the minimum execution time above which autovacuum actions will be logged. | Zero prints all actions. -1 turns autovacuum logging off. | sighup | integer | default | -1 | 2147483647 | | -1 | -1 | | | f
rds.force_autovacuum_logging_level | disabled | | Customized Options | Emit autovacuum log messages irrespective of other logging configuration. | Each level includes all the levels that follow it.Set to disabled to disable this feature and fall back to using log_min_messages. | sighup | enum | default | | | {debug5,debug4,debug3,debug2,debug1,info,notice,warning,error,log,fatal,panic,disabled} | disabled | disabled | | | f
I would say you have some very long-lived snapshot being held. These tables need to be vacuumed, but the vacuum doesn't accomplish anything because the dead tuples can't be removed: some old snapshot can still see them. So immediately after being vacuumed, they are still eligible to be vacuumed again, and autovacuum tries again every 5 seconds (autovacuum_naptime), because it has no way to say "don't bother until the snapshot that blocked me from accomplishing anything last time goes away".
Check pg_stat_activity for very old 'idle in transaction' and for any prepared transactions.
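The "vacuum runs but removes nothing" situation can be illustrated with a toy model (an illustrative sketch, not PostgreSQL internals): a dead tuple is only removable if the transaction that deleted it is older than the oldest snapshot still alive (the xmin horizon).

```python
def removable_dead_tuples(dead_tuple_xmaxes, oldest_snapshot_xmin):
    """XIDs of deleting transactions already invisible to every snapshot."""
    return [x for x in dead_tuple_xmaxes if x < oldest_snapshot_xmin]

dead = [100, 5_000, 66_000_000]  # XIDs that deleted the dead tuples

# A month-old 'idle in transaction' session pins the xmin horizon low,
# so vacuum can reclaim almost nothing and autovacuum keeps retrying:
print(removable_dead_tuples(dead, oldest_snapshot_xmin=200))         # [100]

# Once that session ends, the horizon advances and everything is removable:
print(removable_dead_tuples(dead, oldest_snapshot_xmin=70_000_000))  # [100, 5000, 66000000]
```

This is why the n_dead_tup numbers above stay high across over a million autovacuum_count runs: each pass scans the table (paying the I/O) but cannot advance past the pinned horizon.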
I have two similar queries. One of them has an additional dummy WHERE condition (1=1, 0=0, true):
SELECT t1.*
FROM table1 t1
JOIN table2 t2 ON t2.fk_t1 = t1.id
JOIN table3 t3 ON t3.id = t1.fk_t3
WHERE
0 = 0 AND /* with this in 1st case, without this line in 2nd case */
t3.field = 6
AND EXISTS (SELECT 1 FROM table2 x WHERE x.fk2_t2 = t2.id)
All necessary fields are indexed.
Firebird (both versions 2.1 and 3.0) handles each case differently, and the read statistics look like this:
1st case (with 0=0):
Query Time
------------------------------------------------
Prepare : 32,00 ms
Execute : 1 046,00 ms
Avg fetch time: 61,53 ms
Operations
------------------------------------------------
Read : 8 342
Writes : 1
Fetches: 1 316 042
Marks : 0
Enhanced Info:
+-------------------------------+-----------+-----------+-------------+---------+---------+---------+----------+----------+----------+
| Table Name | Records | Indexed | Non-Indexed | Updates | Deletes | Inserts | Backouts | Purges | Expunges |
| | Total | reads | reads | | | | | | |
+-------------------------------+-----------+-----------+-------------+---------+---------+---------+----------+----------+----------+
|TABLE2 | 0 | 4804 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|TABLE1 | 0 | 0 | 96884 | 0 | 0 | 0 | 0 | 0 | 0 |
|TABLE3 | 0 | 387553 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+-------------------------------+-----------+-----------+-------------+---------+---------+---------+----------+----------+----------+
And in 2nd case (without dummy condition):
Query Time
------------------------------------------------
Prepare : 16,00 ms
Execute : 515,00 ms
Avg fetch time: 30,29 ms
Operations
------------------------------------------------
Read : 7 570
Writes : 1
Fetches: 648 103
Marks : 0
Enhanced Info:
+-------------------------------+-----------+-----------+-------------+---------+---------+---------+----------+----------+----------+
| Table Name | Records | Indexed | Non-Indexed | Updates | Deletes | Inserts | Backouts | Purges | Expunges |
| | Total | reads | reads | | | | | | |
+-------------------------------+-----------+-----------+-------------+---------+---------+---------+----------+----------+----------+
|TABLE2 | 0 | 506 | 152655 | 0 | 0 | 0 | 0 | 0 | 0 |
|TABLE1 | 0 | 467 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|TABLE3 | 0 | 1885 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+-------------------------------+-----------+-----------+-------------+---------+---------+---------+----------+----------+----------+
Queries have different execution plans.
PLAN JOIN (T2 NATURAL, T1 INDEX (T1_ID_IDX), T3 INDEX (T3_ID_IDX))
PLAN JOIN (T1 NATURAL, T3 INDEX (T3_ID_IDX1), T2 INDEX (T2_FK_T1_IDX))
This is strange to me. Why do two queries whose conditions have the same meaning perform so differently? How does the Firebird optimizer work, and how do I write fast, optimal queries? How should I understand this?
P.S. https://github.com/FirebirdSQL/firebird/issues/6941
We use a time-series extension (a Prometheus storage adapter) along with Postgres to store Prometheus data in Postgres.
We have a Prometheus exporter to export Postgres metrics.
It tries to get the number of reads/writes to the DB. Since the values were not showing up, I did the following.
postgres=# select * from pg_stat_database;
datid | datname | numbackends | xact_commit | xact_rollback | blks_read | blks_hit | tup_returned | tup_fetched | tup_inserted | tup_updated | tup_deleted | conflicts | temp_files | temp_bytes | deadlocks | blk_read_time | blk_write_time | stats_reset
-------+-----------+-------------+-------------+---------------+-----------+----------+--------------+-------------+--------------+-------------+-------------+-----------+------------+------------+-----------+---------------+----------------+-------------
1 | template1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
12291 | template0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
12292 | postgres | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
(3 rows)
Specifically, things like tup_fetched and tup_inserted show zero values for the 'postgres' database.
I then used the following query to find the queries currently running on the DB.
postgres=# select pg_stat_activity.datname, pg_stat_activity.usename, pg_stat_activity.query from pg_stat_activity;
datname | usename | query
----------+-------------------+-----------------------------------------------------------------------------------------------------------
| postgres |
postgres | psqladmin | COMMIT
postgres | psqladmin | COMMIT
postgres | psqladmin | COMMIT
postgres | psqladmin | COMMIT
postgres | psqladmin | COMMIT
postgres | psqladmin | select * from pg_stat_database;
postgres | postgres_exporter | SELECT * FROM pg_stat_database;
postgres | psqladmin | COMMIT
postgres | psqladmin | COMMIT
postgres | postgres | select pg_stat_activity.datname, pg_stat_activity.usename, pg_stat_activity.query from pg_stat_activity;
postgres | psqladmin | COMMIT
postgres | psqladmin | COMMIT
| |
| |
| |
(16 rows)
Unfortunately I could not see any writes per se, only COMMITs, as you can see. I assume those are inserts? In any case there are clearly some SELECT queries running, so why are tup_fetched and tup_returned zero?
Has anyone come across this situation? Any ideas on how to tackle this issue?
I have a Spark dataframe containing data similar to the following:
+----+---------------------+-------+----------+-------------+
| ID | Timestamp | Value | Interval | Consumption |
+----+---------------------+-------+----------+-------------+
| 1 | 2012-05-02 12:30:00 | 550 | 1 | 5 |
| 1 | 2012-05-02 12:45:00 | 551 | 1 | 1 |
| 1 | 2012-05-02 13:00:00 | 554 | 1 | 3 |
| 1 | 2012-05-02 14:00:00 | 578 | 4 | 24 |
| 1 | 2012-05-02 14:15:00 | 578 | 1 | 0 |
| 1 | 2012-05-02 14:30:00 | 584 | 1 | 6 |
+----+---------------------+-------+----------+-------------+
I'm looking to turn this into something like the following:
+----+---------------------+-------+----------+-------------+------------+
| ID | Timestamp | Value | Interval | Consumption | Estimation |
+----+---------------------+-------+----------+-------------+------------+
| 1 | 2012-05-02 12:30:00 | 550 | 1 | 5 | ? |
| 1 | 2012-05-02 12:45:00 | 551 | 1 | 1 | ? |
| 1 | 2012-05-02 13:00:00 | 554 | 1 | 3 | ? |
| 1 | 2012-05-02 13:15:00 | 560 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:30:00 | 566 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:45:00 | 572 | 1 | 6 | 4 |
| 1 | 2012-05-02 14:00:00 | 578 | 1 | 6 | 4 |
| 1 | 2012-05-02 14:15:00 | 578 | 1 | 0 | ? |
| 1 | 2012-05-02 14:30:00 | 584 | 1 | 6 | ? |
+----+---------------------+-------+----------+-------------+------------+
More specifically I want to turn this:
+----+---------------------+-------+----------+-------------+
| ID | Timestamp | Value | Interval | Consumption |
+----+---------------------+-------+----------+-------------+
| 1 | 2012-05-02 14:00:00 | 578 | 4 | 24 |
+----+---------------------+-------+----------+-------------+
Into this:
+----+---------------------+-------+----------+-------------+------------+
| ID | Timestamp | Value | Interval | Consumption | Estimation |
+----+---------------------+-------+----------+-------------+------------+
| 1 | 2012-05-02 13:15:00 | 560 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:30:00 | 566 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:45:00 | 572 | 1 | 6 | 4 |
| 1 | 2012-05-02 14:00:00 | 578 | 1 | 6 | 4 |
+----+---------------------+-------+----------+-------------+------------+
I want to take the rows with more than 1 interval out of the original table, interpolate Values for missing intervals and reinsert the newly created rows into the initial table place of the original rows. I have ideas of how to achieve this (in PostgreSQL for example I would simply use the generate_series() function to create the required Timestamps and calculate new Values), but implementing these in Spark/Scala is proving troublesome.
Assuming I've created a new dataframe containing only rows with Interval > 1, how could I replicate those rows 'n' times with 'n' being the value of Interval? I believe that would give me enough to get going using a Counter function partitioned by some row reference I can create.
If there's a way to replicate the behavior of generate_series() that I've missed, even better.
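As an illustrative sketch, the expansion step can be written in plain Python; the same per-row logic would port to Spark as a flatMap over the rows with Interval > 1, unioned with the untouched rows. The Estimation semantics (flagging estimated rows with the original interval count) are my assumption from the example.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=15)  # the question's data is on a 15-minute grid

def expand(row):
    """Split an n-interval row into n one-interval rows.

    Row layout follows the question: (ID, Timestamp, Value, Interval,
    Consumption). The returned rows append an Estimation column, which
    I assume flags estimated rows with the original interval count.
    """
    id_, ts_end, value_end, n, consumption = row
    if n <= 1:
        return [row + (None,)]                    # unchanged; Estimation unknown ("?")
    per_step = consumption // n                   # assumes even division, as in the example
    return [
        (id_,
         ts_end - (n - k) * STEP,                 # walk backwards from the row's timestamp
         value_end - consumption + k * per_step,  # linear interpolation of Value
         1,                                       # each generated row covers one interval
         per_step,
         n)
        for k in range(1, n + 1)
    ]

# The 4-interval row from the question expands into four estimated rows
# (13:15/560, 13:30/566, 13:45/572, 14:00/578, each with Consumption 6):
for r in expand((1, datetime(2012, 5, 2, 14, 0), 578, 4, 24)):
    print(r)
```

The backwards walk from the row's own timestamp plays the role of generate_series(): it only needs the row itself, not its neighbours, because the starting Value can be recovered as Value - Consumption.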
I have a large single table of sent emails with dates and outcomes and I'd like to be able to match each row with the last time that email was sent and a specific outcome occurred (here that open=1). This needs to be done with PostgreSQL. For example:
Initial table:
id | sent_dt | bounced | open | clicked | unsubscribe
1 | 2015-01-01 | 1 | 0 | 0 | 0
1 | 2015-01-02 | 0 | 1 | 1 | 0
1 | 2015-01-03 | 0 | 1 | 1 | 0
2 | 2015-01-01 | 0 | 1 | 0 | 0
2 | 2015-01-02 | 1 | 0 | 0 | 0
2 | 2015-01-03 | 0 | 1 | 0 | 0
2 | 2015-01-04 | 0 | 1 | 0 | 1
Result table:
id | sent_dt | bounced| open | clicked | unsubscribe| previous_time
1 | 2015-01-01 | 1 | 0 | 0 | 0 | NULL
1 | 2015-01-02 | 0 | 1 | 1 | 0 | NULL
1 | 2015-01-03 | 0 | 1 | 1 | 0 | 2015-01-02
2 | 2015-01-01 | 0 | 1 | 0 | 0 | NULL
2 | 2015-01-02 | 1 | 0 | 0 | 0 | 2015-01-01
2 | 2015-01-03 | 0 | 1 | 0 | 0 | 2015-01-01
2 | 2015-01-04 | 0 | 1 | 0 | 1 | 2015-01-03
I have tried using lag(), but I don't know how to apply the condition that open must equal 1 while still returning all rows. I also tried a many-to-many join on id and then taking the minimum date difference, but that essentially squares the size of my table and takes far too long to compute (>7 hrs). There are several existing answers that would work in other SQL dialects, but none that I've found work in PostgreSQL.
Thanks for any help guys!
You can use ROW_NUMBER() to achieve the desired result, connecting each row to the one that occurred immediately before it when that row has open = 1.
SELECT t.*,s.sent_dt
FROM
(SELECT p.*,
ROW_NUMBER() OVER(PARTITION BY ID ORDER BY sent_dt DESC) rnk
FROM YourTable p) t
LEFT OUTER JOIN
(SELECT p.*,
ROW_NUMBER() OVER(PARTITION BY ID ORDER BY sent_dt DESC) rnk
FROM YourTable p) s
ON(t.ID = s.ID AND t.rnk = s.rnk - 1 AND s.open = 1)
First I create a CTE, openFilter, with the dates on which each mail was opened.
Then I join the mail table against that filter to get the dates previous to each email, and finally filter out everything except the latest opened mail.
SQL Fiddle Demo
WITH openFilter as (
SELECT m."id", m."sent_dt"
FROM mail m
WHERE "open" = 1
)
SELECT m."id",
to_char(m."sent_dt", 'YYYY-MM-DD'),
"bounced", "open", "clicked", "unsubscribe",
to_char(o."sent_dt", 'YYYY-MM-DD') previous_time
FROM mail m
LEFT JOIN openFilter o
ON m."id" = o."id"
AND m."sent_dt" > o."sent_dt"
WHERE o."sent_dt" = (SELECT MAX(t."sent_dt")
FROM openFilter t
WHERE t."id" = m."id"
AND t."sent_dt" < m."sent_dt")
OR o."sent_dt" IS NULL
Output
| id | to_char | bounced | open | clicked | unsubscribe | previous_time |
|----|------------|---------|------|---------|-------------|---------------|
| 1 | 2015-01-01 | 1 | 0 | 0 | 0 | (null) |
| 1 | 2015-01-02 | 0 | 1 | 1 | 0 | (null) |
| 1 | 2015-01-03 | 0 | 1 | 1 | 0 | 2015-01-02 |
| 2 | 2015-01-01 | 0 | 1 | 0 | 0 | (null) |
| 2 | 2015-01-02 | 1 | 0 | 0 | 0 | 2015-01-01 |
| 2 | 2015-01-03 | 0 | 1 | 0 | 0 | 2015-01-01 |
| 2 | 2015-01-04 | 0 | 1 | 0 | 1 | 2015-01-03 |
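For comparison, the intended semantics ("most recent earlier sent_dt for the same id with open = 1") can be sketched in plain Python over the sample data, to make the logic the SQL has to reproduce concrete:

```python
rows = [  # (id, sent_dt, bounced, open, clicked, unsubscribe)
    (1, "2015-01-01", 1, 0, 0, 0),
    (1, "2015-01-02", 0, 1, 1, 0),
    (1, "2015-01-03", 0, 1, 1, 0),
    (2, "2015-01-01", 0, 1, 0, 0),
    (2, "2015-01-02", 1, 0, 0, 0),
    (2, "2015-01-03", 0, 1, 0, 0),
    (2, "2015-01-04", 0, 1, 0, 1),
]

def previous_open_time(rows):
    """Append previous_time: the last earlier sent_dt with open = 1, per id."""
    out, last_open = [], {}
    for id_, dt, bounced, open_, clicked, unsub in sorted(rows):
        out.append((id_, dt, bounced, open_, clicked, unsub, last_open.get(id_)))
        if open_ == 1:
            last_open[id_] = dt  # update after emitting: only strictly earlier rows count
    return out

for r in previous_open_time(rows):
    print(r)
```

In PostgreSQL the same single-pass idea can be expressed as a conditional window aggregate, e.g. max(CASE WHEN open = 1 THEN sent_dt END) OVER (PARTITION BY id ORDER BY sent_dt ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), which avoids the self-join entirely.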