How to create range bars in kdb

Unfortunately I'm a beginner with kdb. I am trying to transform futures market tick data (date, time, price, size) into range bars, which are explained on this web page:
https://help.cqg.com/cqgic/20/default.htm#!Documents/rangebarrb.htm
"A Range Bar chart is a non-time-based chart constructed of bars that indicate price movement as a way to help expose trends and volatility. A bar is created each time a trade occurs outside of the previous bar’s stated price range"
Image of RangeBars
This is my starting tick table t:
date time price size
------------------------------------
2021.12.13 09:55:22.520 4682.5 1
2021.12.13 09:55:22.520 4682.5 1
2021.12.13 09:55:22.592 4682.5 1
2021.12.13 09:55:22.592 4682.5 1
2021.12.13 09:55:22.592 4682.5 1
2021.12.13 09:55:22.592 4682.5 1
2021.12.13 09:55:22.592 4682.5 2
2021.12.13 09:55:22.696 4682.5 1
2021.12.13 09:55:22.708 4682.5 1
I'm trying to create a table of range bars with a size of 2: when the price covers a delta of 2 (from the bar's min price to its max price), the bar is completed and a new bar starts.
The time it takes to complete each bar varies.
I use this kdb query:
select last time, open:first price, high:max price, low:min price, close:last price, volume:sum size by date, 2 xbar price from t
but the result (which is not what I want) is:
date price| time open high low close volume
----------------| ---------------------------------------------------
2021.12.23 4712 | 10:41:04.700 4713.75 4713.75 4712.5 4713.75 5839
2021.12.23 4714 | 16:27:59.508 4715.75 4715.75 4714 4715.75 87912
2021.12.23 4716 | 16:59:59.704 4716.5 4717.75 4716 4716.75 78900
2021.12.23 4718 | 16:56:00.940 4718 4719.75 4718 4718 78230
2021.12.23 4720 | 15:59:04.468 4720 4721.75 4720 4720 114064
2021.12.23 4722 | 15:57:43.356 4722 4723.75 4722 4722 87195
2021.12.23 4724 | 15:55:24.700 4724 4725.75 4724 4724 67896
2021.12.23 4726 | 15:55:10.136 4726 4727.75 4726 4726 23351
2021.12.23 4728 | 15:55:04.172 4728 4729.75 4728 4728 26191
2021.12.23 4730 | 15:54:40.096 4730 4731.25 4730 4730 18846
2021.12.26 4716 | 20:17:59.108 4717 4717.75 4716.75 4717.75 303
2021.12.26 4718 | 21:09:08.688 4718 4719.75 4718 4719.75 3529
2021.12.26 4720 | 23:59:58.476 4720 4721.75 4720 4720.5 12145
2021.12.26 4722 | 23:05:46.528 4722 4723.75 4722 4722 9456
2021.12.26 4724 | 19:39:53.516 4724 4725.75 4724 4724 3120
2021.12.26 4726 | 19:10:05.092 4726 4726.5 4726 4726 262
2021.12.27 4712 | 02:48:12.664 4713.75 4713.75 4713.25 4713.75 422
2021.12.27 4714 | 03:04:59.368 4715.75 4715.75 4714 4715.75 2997
2021.12.27 4716 | 04:33:28.224 4717.75 4717.75 4716 4717.75 4544
2021.12.27 4718 | 04:36:56.816 4719.75 4719.75 4718 4719.75 7983
2021.12.27 4720 | 04:48:57.840 4720.25 4721.75 4720 4721.75 8017
2021.12.27 4722 | 07:05:54.468 4722 4723.75 4722 4723.75 6283
2021.12.27 4724 | 07:18:25.944 4724 4725.75 4724 4725.75 6577
2021.12.27 4726 | 07:29:00.936 4726 4727.75 4726 4727.75 1079
2021.12.27 4728 | 09:30:12.684 4728 4729.75 4728 4729.75 4587
2021.12.27 4730 | 09:30:20.096 4730 4731.75 4730 4731.75 18311
2021.12.27 4732 | 09:30:33.416 4732 4733.75 4732 4733.75 15286
2021.12.27 4734 | 09:31:20.188 4734 4735.75 4734 4735.75 8068
2021.12.27 4736 | 09:35:20.584 4736 4737.75 4736 4737.75 5642
2021.12.27 4738 | 09:55:42.292 4738 4739.75 4738 4739.75 30781
2021.12.27 4740 | 10:00:45.252 4740 4741.75 4740 4741.75 44855
2021.12.27 4742 | 10:02:42.868 4742 4743.75 4742 4743.75 15155
2021.12.27 4744 | 10:07:01.228 4744 4745.75 4744 4745.75 13155
2021.12.27 4746 | 10:12:59.244 4746 4747.75 4746 4747.75 20020
2021.12.27 4748 | 10:20:53.264 4748 4749.75 4748 4749.75 25253
2021.12.27 4750 | 10:27:04.184 4750 4751.75 4750 4751.75 8133
2021.12.27 4752 | 10:28:31.980 4752 4753.75 4752 4753.75 10472
2021.12.27 4754 | 10:48:52.712 4754 4755.75 4754 4755.75 18458
2021.12.27 4756 | 11:36:26.204 4756 4757.75 4756 4757.75 44302
2021.12.27 4758 | 11:51:05.524 4758 4759.75 4758 4759.75 39598
2021.12.27 4760 | 11:59:20.924 4760 4761.75 4760 4761.75 44517
2021.12.27 4762 | 12:11:28.400 4762 4763.75 4762 4763.75 11789
2021.12.27 4764 | 12:48:30.932 4764 4765.75 4764 4765.75 30577
2021.12.27 4766 | 15:22:42.212 4766 4767.75 4766 4767.75 34908
2021.12.27 4768 | 15:25:30.632 4768 4769.75 4768 4769.75 52600
2021.12.27 4770 | 15:41:42.400 4770 4771.75 4770 4771.75 61220
2021.12.27 4772 | 21:27:14.048 4772 4773.75 4772 4773.75 42183
2021.12.27 4774 | 22:38:58.564 4774 4775.75 4774 4775.75 43111
2021.12.27 4776 | 23:59:24.392 4776 4777.75 4776 4777.5 28879
2021.12.27 4778 | 23:42:43.300 4778 4779.75 4778 4778 22715
2021.12.27 4780 | 20:01:27.168 4780 4781.75 4780 4780 68495
2021.12.27 4782 | 18:06:48.512 4782 4783.75 4782 4782 52289
2021.12.27 4784 | 16:02:12.176 4784 4784.25 4784 4784 1880
2021.12.28 4774 | 02:54:08.386 4775.75 4775.75 4775.25 4775.75 178
2021.12.28 4776 | 03:07:23.086 4777.25 4777.75 4776 4777.75 3124
2021.12.28 4778 | 03:16:56.649 4778 4779.75 4778 4779.75 4677
2021.12.28 4780 | 03:27:35.693 4780 4781.75 4780 4781.75 5385
2021.12.28 4782 | 09:39:53.615 4782 4783.75 4782 4783.75 6951
2021.12.28 4784 | 09:49:00.299 4784 4785.75 4784 4785.75 23809
2021.12.28 4786 | 10:09:00.008 4786 4787.75 4786 4787.25 55220
2021.12.28 4788 | 10:08:54.026 4788 4789.75 4788 4788 35137
2021.12.28 4790 | 10:07:39.542 4790 4791.75 4790 4790 26044
2021.12.28 4792 | 10:07:30.735 4792 4793.75 4792 4792 22558
2021.12.28 4794 | 10:07:01.984 4794 4795.75 4794 4794 40433
2021.12.28 4796 | 10:06:17.482 4796 4797.75 4796 4796 22644
2021.12.28 4798 | 09:59:29.502 4798 4798 4798 4798 109
The result is not what I want because kdb, as in the example table for the day 2021.12.27, takes the price and divides it into blocks of 2 across the entire day.
The trick is to transform the original time-dependent table into a time-independent one.
I have also tried adding a column deltap: (0,1_deltas price) to table t and then aggregating it in steps of 2, but with no success.
For clarity, the futures instrument in the tables above is the E-mini S&P 500 future, which has a tick size (minimum price increment) of 0.25.
Any solutions?
Thanks a lot.

Edit: scrapped previous answer
So what you need to do is work out the time intervals over which the price has changed by +/- 2, to create the "bars". This is done with the times variable in part 1.
Part 1 is basically a sums deltas, except that when the running delta reaches the bar size you want, the value is set to null; where null then grabs the times that mark the bar/time buckets. This uses \ (scan, https://code.kx.com/q/ref/over/) to iterate through the prices, keeping a running total of the change in price and resetting it to null/0 to end/start a bar.
Then bin is used with those times for the grouping in part 2 below. This is grouping at irregular intervals, in the spirit of xbar. (https://code.kx.com/q/ref/xbar/)
deltas0 is so that the first delta is 0 rather than the first price (4710). (https://code.kx.com/q/ref/deltas/)
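As a quick illustration, here is what the part 1 scan does on its own for a made-up price list in a q session: the running delta resets to null whenever a move of the bar size (2 here) is reached, and where null then picks out those rows.
q)deltas0:{first[x]-':x}
q)price:4710 4710.5 4711 4712 4712.5 4711 4709.5f
q)deltas0 price
0 0.5 0.5 1 0.5 -1.5 -1.5
q){[x;y;bar] y:(0^x)+y;if[(abs y%bar)>=1;y:0Nf];y}[;;2]\[0;deltas0 price]
0 0.5 1 0n 0.5 -1 0n
q)where null {[x;y;bar] y:(0^x)+y;if[(abs y%bar)>=1;y:0Nf];y}[;;2]\[0;deltas0 price]
3 6
In f, the timestamps of those rows (3 and 6 here) become the times list that marks the bar boundaries.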
Edit: updated to >= in part 1
Edit2: added timestamp column to correct across dates
Edit3: alternative approach
// create mock price variance
t:update price:"f"${ y+last 1?x }[raze flip 1 -1 *\: 0.1 * til 10]\[9999;4710] from
  `date`time xasc ([]date:10000?2022.01.01 + til 3;time:10000?.z.t;size:10000?1000)
deltas0:{first[x]-':x};
f:{[bar;tbl]
  // add timestamp column for 'times' across dates
  tbl:update ts:"P"$"D" sv/: flip string (date;time) from tbl;
  // part 1
  times:exec ts from tbl where null
    {[x;y;bar] y:(0^x)+y;
     if[(abs y%bar)>=1;y:0Nf];y}[;;bar]\[0;deltas0 price];
  // part 2
  select start:first time, end:last time, open:first price,
    high:max price, low:min price, close:last price, volume:sum size
    by times times bin ts from tbl
 };
q)f[4;t]
ts | start end open high low close volume
-----------------------------| ------------------------------------------------------------
| 00:00:59.063 00:07:14.804 4710 4713.9 4708.4 4713.9 9186
2022.01.01D00:08:23.650000000| 00:08:23.650 00:10:06.468 4714.3 4717.8 4714.3 4717.8 5433
2022.01.01D00:10:42.424000000| 00:10:42.424 00:31:50.024 4718.5 4719.9 4714.8 4714.8 33448
2022.01.01D00:32:08.135000000| 00:32:08.135 00:56:20.906 4714 4714.5 4710.4 4710.7 40659
2022.01.01D00:56:26.240000000| 00:56:26.240 01:02:38.680 4709.8 4713.2 4709.7 4713.2 11804
q)f[2;t]
ts | start end open high low close volume
-----------------------------| ------------------------------------------------------------
| 00:00:59.063 00:04:27.040 4710 4711.8 4708.4 4711.8 7133
2022.01.01D00:05:11.841000000| 00:05:11.841 00:08:23.650 4712.5 4714.3 4711.6 4714.3 2712
2022.01.01D00:08:39.812000000| 00:08:39.812 00:09:28.143 4714.8 4716.2 4714.8 4716.2 1881
2022.01.01D00:09:31.499000000| 00:09:31.499 00:15:59.067 4716.9 4718.5 4716.6 4718.5 13352
2022.01.01D00:16:13.128000000| 00:16:13.128 00:18:14.294 4719.3 4719.5 4718 4718 4362
The current approach with f creates time buckets/bars but won't include the final tick that caused the bucket/bar to change. Including that final value is perhaps what you want, and that is shown in the alternative approach g. Below shows how f handles the rows in t. This also highlights why the bars shouldn't be expected to be exactly 2, as market data is not perfect.
bar1  (price = 1)
      +1    (price = 2)
      high = 2, low = 1
bar2  +2    (price = 3)   (delta of +2, so a new bar starts)
      +1    (price = 4)
      -0.5  (price = 3.5)
      -1    (price = 2.5)
      -0.5  (price = 2)
      -0.5  (price = 1.5)
      high = 4, low = 1.5  // price went up then down before a >= 2 delta was achieved
bar3  -0.5  (price = 1)   (delta of -2, so a new bar starts)
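As a sanity check, the trace above can be reproduced with a small made-up table (the column values are illustrative only) and the f defined earlier:
q)toy:([]date:2021.12.13;time:09:00:00.000+1000*til 9;price:1 2 3 4 3.5 2.5 2 1.5 1f;size:9#1)
q)f[2;toy]
Rows 0-1 come out as bar1 (high 2, low 1), rows 2-7 as bar2 (high 4, low 1.5) and row 8 as bar3 on its own, matching the trace.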
An alternative approach would be to use the virtual row number column i instead of times.
The deltas calculation is the same, but instead of using the time, the i/row number is used to generate a list of rows per bar like this:
// note the overlap to include the end row as the first row in the next bar
bar1 = 0 1 2 3 4
bar2 = 4 5 6 7 8
bar3 = 8 9 10 11
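The row-expansion helper used in g can be checked on its own; for example, if the bar-ending rows were found at indices 4, 8 and 11 (matching the lists above):
q){{x+til 1+(y-x)} ./: flip (0^prev x;x)} 4 8 11
0 1 2 3 4
4 5 6 7 8
8 9 10 11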
g:{[bar;tbl]
  // split table by date
  dates:value exec ([]date;time;size;price) by date from tbl;
  // then for each date table do:
  raze {[bar;tbl]
    // get the rows with the deltas calculation and expand out to be a list
    rows:{{x+til 1 + (y-x)} ./: flip (0^prev x;x)} exec i from tbl where
      null {[x;y;bar] y:(0^x)+y;if[(abs y%bar)>=1;y:0Nf];y}[;;bar]\[0;deltas0 price];
    // apply the bar row numbers to the table then do the select query
    raze { select first date, start:first time, end:last time,
      open:first price, high:max price, low:min price, close:last price,
      volume:sum size from x
     } each tbl[rows]
   }[bar] each dates
 }
Example with perfectMkt:
perfectMkt:([]date:.z.d;time:"t"$(til 20)*1000*60*60;price:0.5 * 1 + til 20;size:100*1 + til 20)
q)g[2;perfectMkt]
date start end open high low close volume
---------------------------------------------------------------
2022.01.17 00:00:00.000 04:00:00.000 0.5 2.5 0.5 2.5 1500
2022.01.17 04:00:00.000 08:00:00.000 2.5 4.5 2.5 4.5 3500
2022.01.17 08:00:00.000 12:00:00.000 4.5 6.5 4.5 6.5 5500
2022.01.17 12:00:00.000 16:00:00.000 6.5 8.5 6.5 8.5 7500

Related

How to Limit and Partition data in PySpark Dataframe

I have the below data:
+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+
|restaurant_id| restaurant_name| city|state|postal_code| stars|review_count|cuisine_name|
+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+
| 62112| Neptune Oyster| Boston| MA| 02113|4.500000000000000000| 5115| American|
| 62112| Neptune Oyster| Boston| MA| 02113|4.500000000000000000| 5115| Thai|
| 60154|Giacomo's Ristora...| Boston| MA| 02113|4.000000000000000000| 3520| Italian|
| 61455|Atlantic Fish Com...| Boston| MA| 02116|4.000000000000000000| 2575| American|
| 57757| Top of the Hub| Boston| MA| 02199|3.500000000000000000| 2273| American|
| 58631| Carmelina's| Boston| MA| 02113|4.500000000000000000| 2250| Italian|
| 58895| The Beehive| Boston| MA| 02116|3.500000000000000000| 2184| American|
| 56517|Lolita Cocina & T...| Boston| MA| 02116|4.000000000000000000| 2179| American|
| 56517|Lolita Cocina & T...| Boston| MA| 02116|4.000000000000000000| 2179| Mexican|
| 58440| Toro| Boston| MA| 02118|4.000000000000000000| 2175| Spanish|
| 58615| Regina Pizzeria| Boston| MA| 02113|4.000000000000000000| 2071| Italian|
| 58723| Gaslight| Boston| MA| 02118|4.000000000000000000| 2056| American|
| 58723| Gaslight| Boston| MA| 02118|4.000000000000000000| 2056| French|
| 60920| Modern Pastry Shop| Boston| MA| 02113|4.000000000000000000| 2042| Italian|
| 59453|Gourmet Dumpling ...| Boston| MA| 02111|3.500000000000000000| 1990| Taiwanese|
| 59453|Gourmet Dumpling ...| Boston| MA| 02111|3.500000000000000000| 1990| Chinese|
| 59204|Russell House Tavern|Cambridge| MA| 02138|4.000000000000000000| 1965| American|
| 60732|Eastern Standard ...| Boston| MA| 02215|4.000000000000000000| 1890| American|
| 60732|Eastern Standard ...| Boston| MA| 02215|4.000000000000000000| 1890| French|
| 56970| Border Café|Cambridge| MA| 02138|4.000000000000000000| 1880| Mexican|
+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+
I want to partition the data based on city, state and cuisine, order by stars and review count, and finally limit the number of records per partition.
Can this be done with PySpark?
You can add a row_number to the partitions after windowing and filter on it to limit the records per window. You can control the maximum number of rows per window using the max_number_of_rows_per_partition variable in the code below.
Since your question did not specify how you want stars and review_count ordered, I have assumed both to be descending.
import pyspark.sql.functions as F
from pyspark.sql import Window

window_spec = Window.partitionBy("city", "state", "cuisine_name")\
    .orderBy(F.col("stars").desc(), F.col("review_count").desc())

max_number_of_rows_per_partition = 3

df.withColumn("row_number", F.row_number().over(window_spec))\
    .filter(F.col("row_number") <= max_number_of_rows_per_partition)\
    .drop("row_number")\
    .show(200, False)

ClickHouse group by with difference less than

date_start | date_end | value
2020-12-05 11:00:00 | 2020-12-05 11:15:00 | 1
2020-12-05 11:15:00 | 2020-12-05 11:30:00 | 2
2020-12-05 11:30:00 | 2020-12-05 11:45:00 | 3
2020-12-05 13:00:00 | 2020-12-05 13:15:00 | 4
If the difference between consecutive date_start values is less than 15 minutes, then group the rows and calculate the sum of the values.
Expected result
date_start | date_end | sum
2020-12-05 11:00:00 | 2020-12-05 11:45:00 | 6
2020-12-05 13:00:00 | 2020-12-05 13:15:00 | 4

PostgreSQL autovacuum on partitioned tables

PostgreSQL 9.5.2 RDS in AWS
select name,setting from pg_settings
where name like '%vacuum%'
order by name;
name | setting
-------------------------------------+-----------
autovacuum | on
autovacuum_analyze_scale_factor | 0.05
autovacuum_analyze_threshold | 50
autovacuum_freeze_max_age | 450000000
autovacuum_max_workers | 3
autovacuum_multixact_freeze_max_age | 400000000
autovacuum_naptime | 30
autovacuum_vacuum_cost_delay | 20
autovacuum_vacuum_cost_limit | -1
autovacuum_vacuum_scale_factor | 0.1
autovacuum_vacuum_threshold | 50
autovacuum_work_mem | -1
log_autovacuum_min_duration | 0
rds.force_autovacuum_logging_level | log
vacuum_cost_delay | 0
vacuum_cost_limit | 300
vacuum_cost_page_dirty | 20
vacuum_cost_page_hit | 1
vacuum_cost_page_miss | 10
vacuum_defer_cleanup_age | 0
vacuum_freeze_min_age | 50000000
vacuum_freeze_table_age | 250000000
vacuum_multixact_freeze_min_age | 5000000
vacuum_multixact_freeze_table_age | 150000000
I've been trying to figure out how auto vacuuming is working in two Postgres databases. The databases are identical in size, parameters and structure. (These are two data warehouses for the same application - different locations and different patterns of data).
We are using partitions for some of our very large tables. I've noticed that the older (static) partitions are regularly getting autovacuumed. I understand that XIDs are frozen, but the relation does still need periodic vacuuming to look for any new XIDs.
I've been using this query to look for relations that will require vacuuming to avoid XID wrap around:
SELECT 'Relation Name',age(c.relfrozenxid) c_age, age(t.relfrozenxid) t_age,
greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind IN ('r', 'm')
order by age desc limit 5;
?column? | c_age | t_age | age
---------------+-----------+-----------+-----------
Relation Name | 461544753 | | 461544753
Relation Name | 461544753 | | 461544753
Relation Name | 461544753 | | 461544753
Relation Name | 461544753 | | 461544753
Relation Name | 461544753 | 310800517 | 461544753
All of the relations listed are old stable partitions. The column relfrozenxid is defined as: "All transaction IDs before this one have been replaced with a permanent ("frozen") transaction ID in this table. This is used to track whether the table needs to be vacuumed in order to prevent transaction ID wraparound or to allow pg_clog to be shrunk."
Out of curiosity I looked at relfrozenxid for all of the partitions of a particular table:
SELECT c.oid::regclass as table_name,age(c.relfrozenxid) as age , c.reltuples::int, n_live_tup, n_dead_tup,
date_trunc('day',last_autovacuum)
FROM pg_class c
JOIN pg_stat_user_tables u on c.relname = u.relname
WHERE c.relkind IN ('r', 'm')
and c.relname like 'tablename%'
table_name | age | reltuples | n_live_tup | n_dead_tup | date_trunc
-------------------------------------+-----------+-----------+------------+------------+------------------------
schema_partition.tablename_201202 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201306 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201204 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201110 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201111 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201112 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201201 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201203 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201109 | 460250527 | 0 | 0 | 0 | 2018-09-23 00:00:00+00
schema_partition.tablename_201801 | 435086084 | 37970232 | 37970230 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201307 | 433975635 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201107 | 433975635 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201312 | 433975635 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201311 | 433975635 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201401 | 433975635 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201310 | 423675180 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201704 | 423222113 | 43842668 | 43842669 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201612 | 423222113 | 65700844 | 65700845 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201705 | 423221655 | 46847336 | 46847338 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201702 | 423171142 | 50701032 | 50701031 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_overflow | 423171142 | 754 | 769 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201106 | 421207271 | 1 | 1 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201309 | 421207271 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201108 | 421207271 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201308 | 421207271 | 0 | 0 | 0 | 2018-09-25 00:00:00+00
schema_partition.tablename_201806 | 374122782 | 44626756 | 44626757 | 0 | 2018-09-26 00:00:00+00
schema.tablename | 360135561 | 0 | 0 | 0 | 2018-09-27 00:00:00+00
I'm pretty sure I don't really understand how relfrozenxid works, but it does appear that the partitions are affected by the parent table (which would affect the relfrozenxid value for the partitioned table). I can't find any documentation regarding this. I would think that for static tables the relfrozenxid would remain static until a vacuum occurred.
Additionally I have a handful of relations that have static data that apparently have never been auto vacuumed (last_autovacuum is null). Could this be a result of a VACUUM FREEZE operation?
I am new to Postgres and I readily admit to not fully understanding the auto vacuum processes.
I'm not seeing any performance problems that I can identify.
Edit:
I set up a query to run every 4 hours against one partitioned table:
SELECT c.oid::regclass as table_name,age(c.relfrozenxid) as age , c.reltuples::int, n_live_tup, n_dead_tup,
date_trunc('day',last_autovacuum)
FROM pg_class c
JOIN pg_stat_user_tables u on c.relname = u.relname
WHERE c.relkind IN ('r', 'm')
and c.relname like 'sometable%'
order by age desc;
Looking at two different partitions here is the output for the last 20 hours:
schemaname.sometable_201812 | 206286536 | 0 | 0 | 0 |
schemaname.sometable_201812 | 206286537 | 0 | 0 | 0 |
schemaname.sometable_201812 | 225465100 | 0 | 0 | 0 |
schemaname.sometable_201812 | 225465162 | 0 | 0 | 0 |
schemaname.sometable_201812 | 225465342 | 0 | 0 | 0 |
schemaname.sometable_201812 | 236408374 | 0 | 0 | 0 |
-bash-4.2$ grep 201610 test1.out
schemaname.sometable_201610 | 449974426 | 31348368 | 31348369 | 0 | 2018-09-22 00:00:00+00
schemaname.sometable_201610 | 449974427 | 31348368 | 31348369 | 0 | 2018-09-22 00:00:00+00
schemaname.sometable_201610 | 469152990 | 31348368 | 31348369 | 0 | 2018-09-22 00:00:00+00
schemaname.sometable_201610 | 50000051 | 31348368 | 31348369 | 0 | 2018-10-10 00:00:00+00
schemaname.sometable_201610 | 50000231 | 31348368 | 31348369 | 0 | 2018-10-10 00:00:00+00
schemaname.sometable_201610 | 60943263 | 31348368 | 31348369 | 0 | 2018-10-10 00:00:00+00
The relfrozenxid of the partitions is being modified even though there is no direct DML against the partitions. I would assume that inserts into the base table are somehow modifying the relfrozenxid of the partitions.
The partition sometable_201610 has 31 million rows but is static. When I look at the log files, the autovacuum of this type of partition is taking 20-30 minutes. I don't know if that is a performance problem or not, but it does seem expensive. The autovacuum entries in the log files show that typically several of these large partitions are autovacuumed every night. (There are also lots of partitions with zero tuples that are autovacuumed, but these take very little time.)

Postgres weighted average given two time interval columns and a separate table

I want to calculate a weighted average for parent orders that are composed of several child orders. The first table defines the parent orders, giving the product id, parent order, start_time and end_time. The second table contains the data I need to aggregate.
Groups Definition Table:
id | order_Parent | start_time | end_time
1 | 1 | 2018-01-26 15:53:00 | 2018-01-26 15:54:00
2 | 2 | 2018-01-26 15:51:00 | 2018-01-26 16:01:00
2 | 3 | 2018-01-26 15:27:00 | 2018-01-26 15:35:00
Data Table To Calculate Weighted Average :
id | order_child | time_stamp | weight | target_value
1 | 1 | 2018-01-26 15:53:00 | 100 | 99.99
1 | 1 | 2018-01-26 15:53:00 | 200 | 89.99
1 | 1 | 2018-01-26 15:53:30 | 50 | 114.99
2 | 2 | 2018-01-26 15:49:00 | 100 | 49.99
2 | 2 | 2018-01-26 15:55:00 | 100 | 59.99
2 | 2 | 2018-01-26 15:57:30 | 250 | 54.99
2 | 3 | 2018-01-26 15:27:30 | 100 | 54.99
2 | 3 | 2018-01-26 15:31:30 | 75 | 49.99
2 | 3 | 2018-01-26 15:34:30 | 100 | 54.99
Ideal Output:
id | order_Parent | start_time | end_time | WgtAvg
1 | 1 | 2018-01-26 15:53:00 | 2018-01-26 15:54:00 | 96.41
2 | 2 | 2018-01-26 15:51:00 | 2018-01-26 16:01:00 | 54.99
2 | 3 | 2018-01-26 15:27:00 | 2018-01-26 15:35:00 | 53.62
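(For clarity, WgtAvg here is taken to be the weight-weighted mean of target_value, i.e. sum(weight * target_value) / sum(weight). For the first parent order that gives:
(100*99.99 + 200*89.99 + 50*114.99) / (100 + 200 + 50) = 33746.5 / 350 = 96.4185...
which is the 96.41 shown above, truncated to two decimals.)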
The problem seems clear, but I am stumped on how to use the first table for my group definitions in order to calculate the weighted average per parent order.
Any thoughts are greatly appreciated.

Postgres lead window with a series index

So I'm working with a table of instruments and their historic closing prices, and I'm trying to match each day with the closing price an arbitrary number of days in the future. To do this I'm trying to use the lead function with the index from a generated series.
select subq.*, lead(subq.close, x) over w, lead(subq.date, x) over w, x from
  (select instrument_id, close, timestamp::date as date from
   instrument_data_daily order by instrument_id, date) as subq,
  generate_series(2, 5) as x
window w as (PARTITION BY instrument_id ORDER BY date);
However this provides what I consider to be incorrect results:
instrument_id | close | date | lead | lead | x
---------------|-------------|------------|-------------|------------|---
801 | 34499.96 | 2014-12-04 | 34499.96 | 2014-12-04 | 2
801 | 34499.96 | 2014-12-04 | 31599.99 | 2014-12-05 | 3
801 | 34499.96 | 2014-12-04 | 31599.99 | 2014-12-05 | 4
801 | 34499.96 | 2014-12-04 | 28599.99 | 2014-12-08 | 5
801 | 31599.99 | 2014-12-05 | 31599.99 | 2014-12-05 | 2
801 | 31599.99 | 2014-12-05 | 28599.99 | 2014-12-08 | 3
801 | 31599.99 | 2014-12-05 | 28599.99 | 2014-12-08 | 4
801 | 31599.99 | 2014-12-05 | 25800.04 | 2014-12-09 | 5
801 | 28599.99 | 2014-12-08 | 28599.99 | 2014-12-08 | 2
801 | 28599.99 | 2014-12-08 | 25800.04 | 2014-12-09 | 3
801 | 28599.99 | 2014-12-08 | 25800.04 | 2014-12-09 | 4
801 | 28599.99 | 2014-12-08 | 23399.95 | 2014-12-10 | 5
801 | 25800.04 | 2014-12-09 | 25800.04 | 2014-12-09 | 2
801 | 25800.04 | 2014-12-09 | 23399.95 | 2014-12-10 | 3
801 | 25800.04 | 2014-12-09 | 23399.95 | 2014-12-10 | 4
801 | 25800.04 | 2014-12-09 | 21499.98 | 2014-12-11 | 5
Note that the lead dates for indexes 3 and 4 are the same.
The underlying query generates a table with no duplicates:
select instrument_id, close, timestamp::date as date from
instrument_data_daily order by instrument_id, date;
It provides the following results:
instrument_id | close | date
---------------+-------------+------------
801 | 34499.96 | 2014-12-04
801 | 31599.99 | 2014-12-05
801 | 28599.99 | 2014-12-08
801 | 25800.04 | 2014-12-09
801 | 23399.95 | 2014-12-10
801 | 21499.98 | 2014-12-11
801 | 23100.00 | 2014-12-12
801 | 23300.04 | 2014-12-15
So we can see that the underlying data doesn't contain any problems with duplicates, and that the generated series index, x, is where I'm expecting it to be. Any ideas as to why the window would pull the wrong index?
(Data set is truncated to fit the example, but it's about a quarter of a million rows deep, making joins expensive).
Edit: (adding in expected results for clarity)
The expected results are that each date is paired with the row the correct number of lead steps ahead, indexed by x, as follows (note this is manually created; the query has not yet been updated):
instrument_id | close | date | lead | lead | x
---------------|-------------|------------|-------------|------------|---
801 | 34499.96 | 2014-12-04 | 31599.99 | 2014-12-05 | 2
801 | 34499.96 | 2014-12-04 | 28599.99 | 2014-12-08 | 3
801 | 34499.96 | 2014-12-04 | 25800.04 | 2014-12-09 | 4
801 | 34499.96 | 2014-12-04 | 23399.95 | 2014-12-10 | 5
801 | 31599.99 | 2014-12-05 | 31599.99 | 2014-12-08 | 2
801 | 31599.99 | 2014-12-05 | 25800.04 | 2014-12-09 | 3
801 | 31599.99 | 2014-12-05 | 23399.95 | 2014-12-10 | 4
801 | 31599.99 | 2014-12-05 | 21499.98 | 2014-12-11 | 5
801 | 28599.99 | 2014-12-08 | 25800.04 | 2014-12-09 | 2
801 | 28599.99 | 2014-12-08 | 23399.95 | 2014-12-10 | 3
801 | 28599.99 | 2014-12-08 | 21499.98 | 2014-12-11 | 4
801 | 28599.99 | 2014-12-08 | 23100.00 | 2014-12-12 | 5
801 | 25800.04 | 2014-12-09 | 23399.95 | 2014-12-10 | 2
801 | 25800.04 | 2014-12-09 | 21499.98 | 2014-12-11 | 3
801 | 25800.04 | 2014-12-09 | 23100.00 | 2014-12-12 | 4
801 | 25800.04 | 2014-12-09 | 23300.04 | 2014-12-15 | 5
We should actually update the generated series to run from 1 to x, where x is the maximal lookahead desired. However, the same overlapping lead results occur for any nontrivial series.