Sum values from the previous N number of days in KDB? - kdb

I have a table with following two columns:
Initial Table
Date Value
-------------------
2019.01.01 | 150
2019.01.02 | 100
2019.01.04 | 200
2019.01.07 | 300
2019.01.08 | 100
2019.01.10 | 150
2019.01.14 | 200
2019.01.15 | 100
For each row, I would like to sum values from the previous N number of days. In this case, N = 5.
Resultant Table
Date Value Sum
------------------------
2019.01.01 | 150 | 150 (01 -> ..)
2019.01.02 | 100 | 250 (02 -> 01)
2019.01.04 | 200 | 450 (04 -> 01)
2019.01.07 | 300 | 600 (07 -> 02)
2019.01.08 | 100 | 600 (08 -> 04)
2019.01.10 | 150 | 550 (10 -> 07)
2019.01.14 | 200 | 350 (14 -> 10)
2019.01.15 | 100 | 450 (15 -> 10)
Query
t:([] Date: 2019.01.01 2019.01.02 2019.01.04 2019.01.07 2019.01.08 2019.01.10 2019.01.14 2019.01.15; Value: 150 100 200 300 100 150 200 100)
How can I go about doing that?

One way you could go about this is to use an update statement like below:
q)N:5
q)update Sum:sum each Value where each Date within/:flip(Date-N;Date)from t
Date Value Sum
--------------------
2019.01.01 150 150
2019.01.02 100 250
2019.01.04 200 450
2019.01.07 300 600
2019.01.08 100 600
2019.01.10 150 550
2019.01.14 200 350
2019.01.15 100 450
The within keyword checks each date in the Date column is within the window of the current date and the current date-N, which is possible with an each right.
q)flip(-5+t`Date;t`Date)
2018.12.27 2019.01.01
2018.12.28 2019.01.02
2018.12.30 2019.01.04
2019.01.02 2019.01.07
2019.01.03 2019.01.08
2019.01.05 2019.01.10
2019.01.09 2019.01.14
2019.01.10 2019.01.15
q)t[`Date]within/:flip(-5+t`Date;t`Date)
10000000b
11000000b
11100000b
01110000b
00111000b
00011100b
00000110b
00000111b
This will return a list of boolean lists, which can be turned to indexes using where each (each since its a list of list), then indexed back into Value.
q)where each t[`Date]within/:flip(-5+t`Date;t`Date)
,0
0 1
0 1 2
1 2 3
2 3 4
3 4 5
5 6
5 6 7
q)t[`Value]where each t[`Date]within/:flip(-5+t`Date;t`Date)
,150
150 100
150 100 200
100 200 300
200 300 100
300 100 150
150 200
150 200 100
Then using sum each you can sum each of the list of numbers to get your desired result.
q)sum each t[`Value]where each t[`Date]within/:flip(-5+t`Date;t`Date)
150 250 450 600 600 550 350 450

You could also achieve this using an update statement like the one below. It doesn't require the flip and so should execute faster.
q)N:5
q)delete s from update runningSum:s-0^s[Date bin neg[1]+Date-N] from update s:sums Value from t
Date Value runningSum
---------------------------
2019.01.01 150 150
2019.01.02 100 250
2019.01.04 200 450
2019.01.07 300 600
2019.01.08 100 600
2019.01.10 150 550
2019.01.14 200 350
2019.01.15 100 450
This works using sums on the Value column, and then bin to find the running count from N days prior.
The delete keyword then removes the summed Value column to obtain your required result
q)\t:1000 delete s from update runningSum:s-0^s[Date bin neg[1]+Date-N] from update s:sums Value from t
7
While the time difference between this answer and Elliot's is negligible for small values of N, for larger values e.g. 1000, this is faster
q)\t:1000 update Sum:sum each Value where each Date within/:flip(Date-1000;Date)from t
11
q)\t:1000 delete s from update runningSum:s-0^s[Date bin neg[1]+Date-1000] from update s:sums Value from t
7
It should be noted that this answer requires the date field to be sorted, where Elliot's does not.
Another slightly slower way would is to generate 0 values for all the dates that is in between the min and max Date.
Then can use moving sums, msums, to get the values for the past 5 days.
It first takes the min and max Date from the table and makes a list of the dates that span between them.
q)update t: 0^Value from ([]Date:{[x] x[0]+til 1+x[1]-x[0]} exec (min[Date], max Date) from t) lj `Date xkey t
Date Value t
--------------------
2019.01.01 150 150
2019.01.02 100 100
2019.01.03 0
2019.01.04 200 200
2019.01.05 0
2019.01.06 0
2019.01.07 300 300
2019.01.08 100 100
2019.01.09 0
2019.01.10 150 150
Then it adds them to the table and fills in the empty values. This will then work for only the previous N days, taking into account any missing data
q){[x] select from x where not null Value } update t: 5 msum 0^Value from ([]Date:{[x] x[0]+til 1+x[1]-x[0]} exec (min[Date], max Date) from t) lj `Date xkey t
Date Value t
--------------------
2019.01.01 150 150
2019.01.02 100 250
2019.01.04 200 450
2019.01.07 300 500
2019.01.08 100 600
2019.01.10 150 550
2019.01.14 200 350
2019.01.15 100 300
I would also be careful when using Value as a column name, as you can run into issues with the value keyword
I hope this answers your question

A window join is a pretty natural fit here. See: https://code.kx.com/v2/ref/wj/
q)wj1[-5 0+\:t`Date;`Date;t;(t;(sum;`Value))]
Date Value
----------------
2019.01.01 150
2019.01.02 250
2019.01.04 450
2019.01.07 600
2019.01.08 600
2019.01.10 550
2019.01.14 350
2019.01.15 450
To go back 5 observations rather than 5 calendar days you could do:
q)wj1[{(4 xprev x;x)}t`Date;`Date;t;(t;(sum;`Value))]
Date Value
----------------
2019.01.01 150
2019.01.02 250
2019.01.04 450
2019.01.07 750
2019.01.08 850
2019.01.10 850
2019.01.14 950
2019.01.15 850

You can use the moving window mwin function to achieve this:
mwin:{[f;w;l] f each {1_x,y}\[w#0n;`float$l]}
You can then set the function f to sum and get the desired results over the last w:5 days for the desired list of values l (here l:exec Value from t):
update Sum:(mwin[sum;5;] exec Value from t) from t
Date Value Sum
--------------------
2019.01.01 150 150
2019.01.02 100 250
2019.01.04 200 450
2019.01.07 300 750
2019.01.08 100 850
2019.01.10 150 850
2019.01.14 200 950
2019.01.15 100 850

Related

How to calculate the amount of SQL?

I have a table transaction_details:
transaction_id
customer_id
item_id
item_number
transaction_dttm
7765
1
23
1
2022-01-15
1254
2
12
4
2022-02-03
3332
3
56
2
2022-02-15
7658
1
43
1
2022-03-01
7231
4
56
1
2022-01-15
7231
2
23
2
2022-01-29
I need to calculate the amount spent by the client in the last month and find out the item (item_name) on which the client spent the most in the last month.
Example result:
|customer_id|amount_spent_lm|top_item_lm|
| - | ---------- | ----- |
| 1 | 700 | glasses |
| 2 | 20000 | notebook |
| 3 | 100 | cup |
When calculating, it is necessary to take into account the current price at the time of the transaction (dict_item_prices). Customers who have not made purchases in the last month are not included in the final table. he last month is defined as the last 30 days at the time of the report creation.
There is also a table dict_item_prices:
item_id
item_name
item_price
valid_from_dt
valid_to_dt
23
phone 1
1000
2022-01-01
2022-12-31
12
notebook
5000
2022-01-02
2022-12-31
56
cup
50
2022-01-02
2022-12-31
43
glasses
700
2022-01-01
2022-12-31

Address and smoothen noise in sensor data

I have sensors data as below wherein under Data Column, there are 6rows containing value 45 in between preceding and following rows containing value 50. The requirement is to clean this data and impute with 50 (prev value) in the new_data column. Moreover, the no of noise records (shown as 45 in table) might either vary in number or with level of rows.
Case 1 (sample data) :-
Sl.no
Timestamp
Data
New_data
1
1/1/2021 0:00:00
50
50
2
1/1/2021 0:15:00
50
50
3
1/1/2021 0:30:00
50
50
4
1/1/2021 0:45:00
50
50
5
1/1/2021 1:00:00
50
50
6
1/1/2021 1:15:00
50
50
7
1/1/2021 1:30:00
50
50
8
1/1/2021 1:45:00
50
50
9
1/1/2021 2:00:00
50
50
10
1/1/2021 2:15:00
50
50
11
1/1/2021 2:30:00
45
50
12
1/1/2021 2:45:00
45
50
13
1/1/2021 3:00:00
45
50
14
1/1/2021 3:15:00
45
50
15
1/1/2021 3:30:00
45
50
16
1/1/2021 3:45:00
45
50
17
1/1/2021 4:00:00
50
50
18
1/1/2021 4:15:00
50
50
19
1/1/2021 4:30:00
50
50
20
1/1/2021 4:45:00
50
50
21
1/1/2021 5:00:00
50
50
22
1/1/2021 5:15:00
50
50
23
1/1/2021 5:30:00
50
50
I am thinking of a need to group these data ordered by timestamp asc (like below) and then could have a condition in place where it will have to check group by group in large sample data and if group 1 is same as group 3 , replace group 2 with group 1 values.
Sl.no
Timestamp
Data
New_data
group
1
1/1/2021 0:00:00
50
50
1
2
1/1/2021 0:15:00
50
50
1
3
1/1/2021 0:30:00
50
50
1
4
1/1/2021 0:45:00
50
50
1
5
1/1/2021 1:00:00
50
50
1
6
1/1/2021 1:15:00
50
50
1
7
1/1/2021 1:30:00
50
50
1
8
1/1/2021 1:45:00
50
50
1
9
1/1/2021 2:00:00
50
50
1
10
1/1/2021 2:15:00
50
50
1
11
1/1/2021 2:30:00
45
50
2
12
1/1/2021 2:45:00
45
50
2
13
1/1/2021 3:00:00
45
50
2
14
1/1/2021 3:15:00
45
50
2
15
1/1/2021 3:30:00
45
50
2
16
1/1/2021 3:45:00
45
50
2
17
1/1/2021 4:00:00
50
50
3
18
1/1/2021 4:15:00
50
50
3
19
1/1/2021 4:30:00
50
50
3
20
1/1/2021 4:45:00
50
50
3
21
1/1/2021 5:00:00
50
50
3
22
1/1/2021 5:15:00
50
50
3
23
1/1/2021 5:30:00
50
50
3
Moreover, there is also a need to add an exception like, if the next group is having similar pattern, not to change but to retain the data as it is.
Ex below : If group 1 and group 3 are same , impute group 2 with group 1 value.
But if group 2 and group 4 are same, do not change group 3 , retain same data in New_data.
Case 2:-
Sl.no
Timestamp
Data
New_data
group
1
1/1/2021 0:00:00
50
50
1
2
1/1/2021 0:15:00
50
50
1
3
1/1/2021 0:30:00
50
50
1
4
1/1/2021 0:45:00
50
50
1
5
1/1/2021 1:00:00
50
50
1
6
1/1/2021 1:15:00
50
50
1
7
1/1/2021 1:30:00
50
50
1
8
1/1/2021 1:45:00
50
50
1
9
1/1/2021 2:00:00
50
50
1
10
1/1/2021 2:15:00
50
50
1
11
1/1/2021 2:30:00
45
50
2
12
1/1/2021 2:45:00
45
50
2
13
1/1/2021 3:00:00
45
50
2
14
1/1/2021 3:15:00
45
50
2
15
1/1/2021 3:30:00
45
50
2
16
1/1/2021 3:45:00
45
50
2
17
1/1/2021 4:00:00
50
50
3
18
1/1/2021 4:15:00
50
50
3
19
1/1/2021 4:30:00
50
50
3
20
1/1/2021 4:45:00
50
50
3
21
1/1/2021 5:00:00
50
50
3
22
1/1/2021 5:15:00
50
50
3
23
1/1/2021 5:30:00
50
50
3
24
1/1/2021 5:45:00
45
45
4
25
1/1/2021 6:00:00
45
45
4
26
1/1/2021 6:15:00
45
45
4
27
1/1/2021 6:30:00
45
45
4
28
1/1/2021 6:45:00
45
45
4
29
1/1/2021 7:00:00
45
45
4
30
1/1/2021 7:15:00
45
45
4
31
1/1/2021 7:30:00
45
45
4
Reaching out for help in coding in postgresql to address above scenario. Please feel free to suggest any alternative approaches to solve above problem.
The query below should answer the need.
The first query identifies the rows which correspond to a change of
data.
The second query groups the rows between two successive changes of data and set up the corresponding range of timestamp
The third query is a recursive query which calculates the new_data in an
iterative way according to the timestamp order.
The last query display the expected result.
WITH RECURSIVE list As
(
SELECT no
, timestamp
, lag(data) OVER w AS previous
, data
, lead(data) OVER w AS next
, data IS DISTINCT FROM lag(data) OVER w AS first
, data IS DISTINCT FROM lead(data) OVER w AS last
FROM sensors
WINDOW w AS (ORDER BY timestamp ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
), range_list AS
(
SELECT tsrange(timestamp, lead(timestamp) OVER w, '[]') AS range
, previous
, data
, lead(next) OVER w AS next
, first
FROM list
WHERE first OR last
WINDOW w AS (ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
), rec_list (range, previous, data, next, new_data, arr) AS
(
SELECT range
, previous
, data
, next
, data
, array[range]
FROM range_list
WHERE previous IS NULL
UNION ALL
SELECT c.range
, p.data
, c.data
, c.next
, CASE
WHEN p.new_data IS NOT DISTINCT FROM c.next
THEN p.data
ELSE c.data
END
, p.arr || c.range
FROM rec_list AS p
INNER JOIN range_list AS c
ON lower(c.range) = upper(p.range) + interval '15 minutes'
WHERE NOT array[c.range] <# p.arr
AND first
)
SELECT s.*, r.new_data
FROM sensors AS s
INNER JOIN rec_list AS r
ON r.range #> s.timestamp
ORDER BY timestamp
see the test result in dbfiddle

Combine 2 data frames with different columns in spark

I have 2 dataframes:
df1 :
Id purchase_count purchase_sim
12 100 1500
13 1020 1300
14 1010 1100
20 1090 1400
21 1300 1600
df2:
Id click_count click_sim
12 1030 2500
13 1020 1300
24 1010 1100
30 1090 1400
31 1300 1600
I need to get the combined data frame with results as :
Id click_count click_sim purchase_count purchase_sim
12 1030 2500 100 1500
13 1020 1300 1020 1300
14 null null 1010 1100
24 1010 1100 null null
30 1090 1400 null null
31 1300 1600 null null
20 null null 1090 1400
21 null null 1300 1600
I can't use union because of different column names. Can some one suggest me a better way to do this ?
All you require a full outer join on ID column.
df1.join(df2, Seq("Id"), "full_outer")
// Since the Id column name is same in both the dataframes, if you use comparison like
df1($"Id") === df2($"Id"), you will get duplicate ID columns
Please refer the below documentation for future references.
https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html

How to match date and string from 2 lists (KDB)?

I have two lists:
data:
dt sym bid ask
2017.01.01D05:00:09.140745000 AAPL 101.20 101.30
2017.01.01D05:00:09.284281800 GOOG 801.00 802.00
2017.01.02D05:00:09.824847299 AAPL 101.30 101.40
info:
date sym shares divisor
2017.01.01 AAPL 500 2
2017.01.01 GOOG 100 1
2017.01.02 AAPL 200 2
I need to append from "info" the shares and divisor values for each ticker based on the date. How can I achieve this? Below is an example:
result:
dt sym bid ask shares divisor
2017.01.01D05:00:09.140745000 AAPL 101.20 101.30 500 2
2017.01.01D05:00:09.284281800 GOOG 801.00 802.00 100 1
2017.01.02D05:00:09.824847299 AAPL 101.30 101.40 200 2
If matching based on an exact date match then you can use lj. For this to work you will need to create a date column in the data table and key info by date and sym. Like so:
(update date:`date$dt from data)lj 2!info
dt sym price date shares divisor
---------------------------------------------------------------------
2018.02.04D17:25:06.658216000 AAPL 103.9275 2018.02.04 500 2
2018.02.04D17:25:06.658216000 GOOG 105.1709 2018.02.04 100 1
2018.02.05D17:25:06.658217000 AAPL 105.1598 2018.02.05 200 2
2018.02.05D17:25:06.658217000 GOOG 104.0666 2018.02.05
You can then delete the date column from this output.
It might be useful for you to use the stepped attribute [ http://code.kx.com/q/cookbook/temporal-data/#stepped-attribute ]
This will allow you to have e.g. missing dates from the info table and use the "most recent" date instead (so you don't have to have data for every sym every day). For example, without stepped attribute:
q)data:([] dt:(10?2017.01.01+til 2)+10?.z.t;sym:10?`AAPL`GOOG;bid:100+10?5;ask:105+10?5)
q)info:([] date:2017.01.01 2017.01.01 2017.01.02;sym:`AAPL`GOOG`AAPL;shares:500 100 200;divisor:2 1 2)
q)(update date:`date$dt from data) lj 2!info
dt sym bid ask date shares divisor
--------------------------------------------------------------------
2017.01.01D04:04:03.440000000 GOOG 104 105 2017.01.01 100 1
2017.01.01D14:00:02.748000000 GOOG 104 105 2017.01.01 100 1
2017.01.02D09:34:52.869000000 GOOG 102 106 2017.01.02
2017.01.02D16:44:16.648000000 AAPL 100 107 2017.01.02 200 2
2017.01.01D08:48:23.285000000 AAPL 102 108 2017.01.01 500 2
2017.01.02D02:31:11.038000000 AAPL 104 109 2017.01.02 200 2
2017.01.01D05:50:50.463000000 GOOG 104 109 2017.01.01 100 1
2017.01.02D02:13:45.275000000 AAPL 101 107 2017.01.02 200 2
2017.01.01D10:25:30.322000000 AAPL 104 109 2017.01.01 500 2
2017.01.01D14:51:12.687000000 AAPL 103 109 2017.01.01 500 2
Note the nulls for GOOG on 2017.01.02. With stepped attribute:
q)(update date:`date$dt from data) lj `s#2!`sym xasc `sym`date xcols info
dt sym bid ask date shares divisor
--------------------------------------------------------------------
2017.01.01D04:04:03.440000000 GOOG 104 105 2017.01.01 100 1
2017.01.01D14:00:02.748000000 GOOG 104 105 2017.01.01 100 1
2017.01.02D09:34:52.869000000 GOOG 102 106 2017.01.02 100 1
2017.01.02D16:44:16.648000000 AAPL 100 107 2017.01.02 200 2
2017.01.01D08:48:23.285000000 AAPL 102 108 2017.01.01 500 2
2017.01.02D02:31:11.038000000 AAPL 104 109 2017.01.02 200 2
2017.01.01D05:50:50.463000000 GOOG 104 109 2017.01.01 100 1
2017.01.02D02:13:45.275000000 AAPL 101 107 2017.01.02 200 2
2017.01.01D10:25:30.322000000 AAPL 104 109 2017.01.01 500 2
2017.01.01D14:51:12.687000000 AAPL 103 109 2017.01.01 500 2
Here, GOOG gets the values for 2017.01.01 as there is no new value on 2017.01.02
Could possibly use an aj as well.
q)aj[`date`sym;update date:`date$dt from data;info]
dt sym bid ask date shares divisor
--------------------------------------------------------------------
2017.01.02D07:57:14.764000000 GOOG 101 109 2017.01.02 200 2
2017.01.02D02:31:39.330000000 AAPL 100 105 2017.01.02 200 2
2017.01.02D04:25:17.604000000 AAPL 102 107 2017.01.02 200 2
2017.01.01D01:47:51.333000000 GOOG 104 106 2017.01.01 100 1
2017.01.02D15:50:12.140000000 AAPL 101 107 2017.01.02 200 2
2017.01.01D02:59:16.636000000 GOOG 102 106 2017.01.01 100 1
2017.01.01D14:35:31.860000000 AAPL 100 107 2017.01.01 500 2
2017.01.01D16:36:29.214000000 GOOG 101 108 2017.01.01 100 1
2017.01.01D14:01:18.498000000 GOOG 101 107 2017.01.01 100 1
2017.01.02D08:31:52.958000000 AAPL 102 109 2017.01.02 200 2

More then Print Repeated Values

Branch ID date tax total
banglore bang01 22-11-2013 450 5000
banglore bang01 22-11-2013 350 4300
banglore bang01 22-11-2013 450 5000
bangalore bang02 22-11-2013 350 4300
bangalore bang02 22-11-2013 250 2500
banglore bang03 24-11-2013 350 4500
when I remove Print Repeated Values
Branch ID date tax total
banglore bang01 22-11-2013 450 5000
350 4300
450 5000
bang02 350 4300
250 2500
bang03 24-11-2013 350 4500
but I want like this :
Branch ID date tax total
banglore bang01 22-11-2013 450 5000
350 4300
450 5000
bangalore bang02 22-11-2013 350 4300
250 2500
banglore bang03 24-11-2013 350 4500