How to match date and string from 2 lists (KDB)? - kdb

I have two lists:
data:
dt sym bid ask
2017.01.01D05:00:09.140745000 AAPL 101.20 101.30
2017.01.01D05:00:09.284281800 GOOG 801.00 802.00
2017.01.02D05:00:09.824847299 AAPL 101.30 101.40
info:
date sym shares divisor
2017.01.01 AAPL 500 2
2017.01.01 GOOG 100 1
2017.01.02 AAPL 200 2
I need to append from "info" the shares and divisor values for each ticker based on the date. How can I achieve this? Below is an example:
result:
dt sym bid ask shares divisor
2017.01.01D05:00:09.140745000 AAPL 101.20 101.30 500 2
2017.01.01D05:00:09.284281800 GOOG 801.00 802.00 100 1
2017.01.02D05:00:09.824847299 AAPL 101.30 101.40 200 2

If you are matching on an exact date, you can use lj. For this to work, you need to create a date column in the data table and key info by date and sym, like so:
(update date:`date$dt from data)lj 2!info
dt sym price date shares divisor
---------------------------------------------------------------------
2018.02.04D17:25:06.658216000 AAPL 103.9275 2018.02.04 500 2
2018.02.04D17:25:06.658216000 GOOG 105.1709 2018.02.04 100 1
2018.02.05D17:25:06.658217000 AAPL 105.1598 2018.02.05 200 2
2018.02.05D17:25:06.658217000 GOOG 104.0666 2018.02.05
You can then delete the date column from this output.

It might be useful for you to use the stepped attribute [ http://code.kx.com/q/cookbook/temporal-data/#stepped-attribute ].
This lets the lookup fall back to the most recent earlier date when a date is missing from the info table, so you don't need an entry for every sym on every day. For example, without the stepped attribute:
q)data:([] dt:(10?2017.01.01+til 2)+10?.z.t;sym:10?`AAPL`GOOG;bid:100+10?5;ask:105+10?5)
q)info:([] date:2017.01.01 2017.01.01 2017.01.02;sym:`AAPL`GOOG`AAPL;shares:500 100 200;divisor:2 1 2)
q)(update date:`date$dt from data) lj 2!info
dt sym bid ask date shares divisor
--------------------------------------------------------------------
2017.01.01D04:04:03.440000000 GOOG 104 105 2017.01.01 100 1
2017.01.01D14:00:02.748000000 GOOG 104 105 2017.01.01 100 1
2017.01.02D09:34:52.869000000 GOOG 102 106 2017.01.02
2017.01.02D16:44:16.648000000 AAPL 100 107 2017.01.02 200 2
2017.01.01D08:48:23.285000000 AAPL 102 108 2017.01.01 500 2
2017.01.02D02:31:11.038000000 AAPL 104 109 2017.01.02 200 2
2017.01.01D05:50:50.463000000 GOOG 104 109 2017.01.01 100 1
2017.01.02D02:13:45.275000000 AAPL 101 107 2017.01.02 200 2
2017.01.01D10:25:30.322000000 AAPL 104 109 2017.01.01 500 2
2017.01.01D14:51:12.687000000 AAPL 103 109 2017.01.01 500 2
Note the nulls for GOOG on 2017.01.02. With stepped attribute:
q)(update date:`date$dt from data) lj `s#2!`sym xasc `sym`date xcols info
dt sym bid ask date shares divisor
--------------------------------------------------------------------
2017.01.01D04:04:03.440000000 GOOG 104 105 2017.01.01 100 1
2017.01.01D14:00:02.748000000 GOOG 104 105 2017.01.01 100 1
2017.01.02D09:34:52.869000000 GOOG 102 106 2017.01.02 100 1
2017.01.02D16:44:16.648000000 AAPL 100 107 2017.01.02 200 2
2017.01.01D08:48:23.285000000 AAPL 102 108 2017.01.01 500 2
2017.01.02D02:31:11.038000000 AAPL 104 109 2017.01.02 200 2
2017.01.01D05:50:50.463000000 GOOG 104 109 2017.01.01 100 1
2017.01.02D02:13:45.275000000 AAPL 101 107 2017.01.02 200 2
2017.01.01D10:25:30.322000000 AAPL 104 109 2017.01.01 500 2
2017.01.01D14:51:12.687000000 AAPL 103 109 2017.01.01 500 2
Here, GOOG gets the values for 2017.01.01 as there is no new value on 2017.01.02
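For readers who do not have q to hand, the difference between the two lookups can be sketched in plain Python. This is only an illustration of the matching rule, not how lj or the stepped attribute is implemented; the names history, exact_lookup and stepped_lookup are made up for the example, and the data is the info table from the question.

```python
from bisect import bisect_right
from datetime import date

# The info table from the question: (date, sym, shares, divisor).
info = [
    (date(2017, 1, 1), "AAPL", 500, 2),
    (date(2017, 1, 1), "GOOG", 100, 1),
    (date(2017, 1, 2), "AAPL", 200, 2),
]

# Hypothetical helper structure: per-sym rows in ascending date order.
history = {}
for d, sym, shares, divisor in sorted(info):
    history.setdefault(sym, []).append((d, shares, divisor))


def exact_lookup(sym, d):
    """Plain left-join behaviour: only an exact (date, sym) match counts."""
    for row_date, shares, divisor in history.get(sym, []):
        if row_date == d:
            return shares, divisor
    return None  # shows up as nulls in the q output


def stepped_lookup(sym, d):
    """Stepped behaviour: use the latest row for sym dated on or before d."""
    rows = history.get(sym, [])
    i = bisect_right([r[0] for r in rows], d)
    return rows[i - 1][1:] if i else None


print(exact_lookup("GOOG", date(2017, 1, 2)))    # None
print(stepped_lookup("GOOG", date(2017, 1, 2)))  # (100, 1)
```

The exact lookup leaves GOOG empty on 2017.01.02, while the stepped lookup carries the 2017.01.01 values forward.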

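You could possibly use an aj as well. The last column in the list is the one matched as-of, so sym must come first and date last, and info should be sorted by sym and then date:
q)aj[`sym`date;update date:`date$dt from data;`sym`date xasc info]
dt sym bid ask date shares divisor
--------------------------------------------------------------------
2017.01.02D07:57:14.764000000 GOOG 101 109 2017.01.02 100 1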
2017.01.02D02:31:39.330000000 AAPL 100 105 2017.01.02 200 2
2017.01.02D04:25:17.604000000 AAPL 102 107 2017.01.02 200 2
2017.01.01D01:47:51.333000000 GOOG 104 106 2017.01.01 100 1
2017.01.02D15:50:12.140000000 AAPL 101 107 2017.01.02 200 2
2017.01.01D02:59:16.636000000 GOOG 102 106 2017.01.01 100 1
2017.01.01D14:35:31.860000000 AAPL 100 107 2017.01.01 500 2
2017.01.01D16:36:29.214000000 GOOG 101 108 2017.01.01 100 1
2017.01.01D14:01:18.498000000 GOOG 101 107 2017.01.01 100 1
2017.01.02D08:31:52.958000000 AAPL 102 109 2017.01.02 200 2

Address and smoothen noise in sensor data

I have sensor data as below. In the Data column there are 6 rows containing the value 45, sitting between preceding and following rows containing the value 50. The requirement is to clean this data by imputing 50 (the previous value) in the New_data column. The noise records (shown as 45 in the table) may vary in number and in value.
Case 1 (sample data):
Sl.no  Timestamp          Data  New_data
1      1/1/2021 0:00:00   50    50
2      1/1/2021 0:15:00   50    50
3      1/1/2021 0:30:00   50    50
4      1/1/2021 0:45:00   50    50
5      1/1/2021 1:00:00   50    50
6      1/1/2021 1:15:00   50    50
7      1/1/2021 1:30:00   50    50
8      1/1/2021 1:45:00   50    50
9      1/1/2021 2:00:00   50    50
10     1/1/2021 2:15:00   50    50
11     1/1/2021 2:30:00   45    50
12     1/1/2021 2:45:00   45    50
13     1/1/2021 3:00:00   45    50
14     1/1/2021 3:15:00   45    50
15     1/1/2021 3:30:00   45    50
16     1/1/2021 3:45:00   45    50
17     1/1/2021 4:00:00   50    50
18     1/1/2021 4:15:00   50    50
19     1/1/2021 4:30:00   50    50
20     1/1/2021 4:45:00   50    50
21     1/1/2021 5:00:00   50    50
22     1/1/2021 5:15:00   50    50
23     1/1/2021 5:30:00   50    50
I am thinking of grouping the data, ordered by timestamp ascending (as below), and then checking group by group through the large sample data: if group 1 is the same as group 3, replace the group 2 values with the group 1 value.
Sl.no  Timestamp          Data  New_data  group
1      1/1/2021 0:00:00   50    50        1
2      1/1/2021 0:15:00   50    50        1
3      1/1/2021 0:30:00   50    50        1
4      1/1/2021 0:45:00   50    50        1
5      1/1/2021 1:00:00   50    50        1
6      1/1/2021 1:15:00   50    50        1
7      1/1/2021 1:30:00   50    50        1
8      1/1/2021 1:45:00   50    50        1
9      1/1/2021 2:00:00   50    50        1
10     1/1/2021 2:15:00   50    50        1
11     1/1/2021 2:30:00   45    50        2
12     1/1/2021 2:45:00   45    50        2
13     1/1/2021 3:00:00   45    50        2
14     1/1/2021 3:15:00   45    50        2
15     1/1/2021 3:30:00   45    50        2
16     1/1/2021 3:45:00   45    50        2
17     1/1/2021 4:00:00   50    50        3
18     1/1/2021 4:15:00   50    50        3
19     1/1/2021 4:30:00   50    50        3
20     1/1/2021 4:45:00   50    50        3
21     1/1/2021 5:00:00   50    50        3
22     1/1/2021 5:15:00   50    50        3
23     1/1/2021 5:30:00   50    50        3
There is also an exception: if the following group shows a similar pattern, the data should be retained as it is rather than changed.
For example, if group 1 and group 3 are the same, impute group 2 with the group 1 value.
But if group 2 and group 4 are the same, do not change group 3; retain its data in New_data.
Case 2:
Sl.no  Timestamp          Data  New_data  group
1      1/1/2021 0:00:00   50    50        1
2      1/1/2021 0:15:00   50    50        1
3      1/1/2021 0:30:00   50    50        1
4      1/1/2021 0:45:00   50    50        1
5      1/1/2021 1:00:00   50    50        1
6      1/1/2021 1:15:00   50    50        1
7      1/1/2021 1:30:00   50    50        1
8      1/1/2021 1:45:00   50    50        1
9      1/1/2021 2:00:00   50    50        1
10     1/1/2021 2:15:00   50    50        1
11     1/1/2021 2:30:00   45    50        2
12     1/1/2021 2:45:00   45    50        2
13     1/1/2021 3:00:00   45    50        2
14     1/1/2021 3:15:00   45    50        2
15     1/1/2021 3:30:00   45    50        2
16     1/1/2021 3:45:00   45    50        2
17     1/1/2021 4:00:00   50    50        3
18     1/1/2021 4:15:00   50    50        3
19     1/1/2021 4:30:00   50    50        3
20     1/1/2021 4:45:00   50    50        3
21     1/1/2021 5:00:00   50    50        3
22     1/1/2021 5:15:00   50    50        3
23     1/1/2021 5:30:00   50    50        3
24     1/1/2021 5:45:00   45    45        4
25     1/1/2021 6:00:00   45    45        4
26     1/1/2021 6:15:00   45    45        4
27     1/1/2021 6:30:00   45    45        4
28     1/1/2021 6:45:00   45    45        4
29     1/1/2021 7:00:00   45    45        4
30     1/1/2021 7:15:00   45    45        4
31     1/1/2021 7:30:00   45    45        4
Reaching out for help in coding in postgresql to address above scenario. Please feel free to suggest any alternative approaches to solve above problem.
The query below should answer the need.
The first query identifies the rows that correspond to a change of data.
The second query groups the rows between two successive changes of data and sets up the corresponding timestamp range.
The third query is a recursive query that calculates new_data iteratively, in timestamp order.
The last query displays the expected result.
WITH RECURSIVE list AS
(
  SELECT no
       , timestamp
       , lag(data) OVER w AS previous
       , data
       , lead(data) OVER w AS next
       , data IS DISTINCT FROM lag(data) OVER w AS first
       , data IS DISTINCT FROM lead(data) OVER w AS last
    FROM sensors
  WINDOW w AS (ORDER BY timestamp ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
), range_list AS
(
  SELECT tsrange(timestamp, lead(timestamp) OVER w, '[]') AS range
       , previous
       , data
       , lead(next) OVER w AS next
       , first
    FROM list
   WHERE first OR last
  WINDOW w AS (ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
), rec_list (range, previous, data, next, new_data, arr) AS
(
  SELECT range
       , previous
       , data
       , next
       , data
       , array[range]
    FROM range_list
   WHERE previous IS NULL
  UNION ALL
  SELECT c.range
       , p.data
       , c.data
       , c.next
       , CASE
           WHEN p.new_data IS NOT DISTINCT FROM c.next
           THEN p.data
           ELSE c.data
         END
       , p.arr || c.range
    FROM rec_list AS p
   INNER JOIN range_list AS c
      ON lower(c.range) = upper(p.range) + interval '15 minutes'
   WHERE NOT array[c.range] <@ p.arr
     AND first
)
SELECT s.*, r.new_data
  FROM sensors AS s
 INNER JOIN rec_list AS r
    ON r.range @> s.timestamp
 ORDER BY timestamp
see the test result in dbfiddle
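The rule from the question can also be written procedurally, which may help when checking the query output. The sketch below is plain Python and is not a translation of the SQL above: it assumes a run of equal values counts as noise when the (already cleaned) run before it has the same value as the run after it, and that the last run is always kept. The function name impute_runs is made up for the example.

```python
from itertools import groupby


def impute_runs(values):
    """Replace a run with the previous cleaned value when that value
    equals the value of the run that follows it."""
    # Collapse consecutive equal readings into (value, length) runs.
    runs = [(value, len(list(group))) for value, group in groupby(values)]
    cleaned = []
    prev_new = None  # cleaned value of the previous run
    for i, (value, length) in enumerate(runs):
        next_value = runs[i + 1][0] if i + 1 < len(runs) else None
        is_noise = (
            prev_new is not None
            and next_value is not None
            and prev_new == next_value
        )
        new_value = prev_new if is_noise else value
        cleaned.extend([new_value] * length)
        prev_new = new_value
    return cleaned


case1 = [50] * 10 + [45] * 6 + [50] * 7
case2 = case1 + [45] * 8

print(impute_runs(case1) == [50] * 23)             # True
print(impute_runs(case2) == [50] * 23 + [45] * 8)  # True
```

In Case 2 the third run is kept because the cleaned value before it (50) differs from the run after it (45), which matches the exception described in the question.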

How do I window join across different dates in kdb

I am a beginner in kdb. As I was practicing window join on test data from NYSE, I came across issues with window join across different dates.
Basically, my table looks like:
t:([] sym:10#`AAPL;date:2021.03.21 2021.03.21 2021.03.21 2021.03.21 2021.03.21 2021.03.22 2021.03.22 2021.03.22 2021.03.22 2021.03.22;price:100 101 105 110 120 130 140 150 160 170;time:10:01 10:04 10:07 10:10 10:13 10:01 10:04 10:07 10:10 10:13)
I am trying to create a sliding window for every 3 minutes on each date and calculate the sum of price in that window. However, I am not sure how to do window join on different groups.
I tried:
w3:-3 0+\:t[`minute];
newdata: wj1[w3;`minute;t;(t;(sum;`price)
but this does not give me the correct result. Could someone please help with this. Thank you!
To do a wj across dates you need a timestamp column which you can create from date and time:
t:update timeStamp:"P"$"D" sv/: flip string (date;time) from t
t
sym date price time timeStamp
---------------------------------------------------------
AAPL 2021.03.21 100 10:01 2021.03.21D10:01:00.000000000
AAPL 2021.03.21 101 10:04 2021.03.21D10:04:00.000000000
AAPL 2021.03.21 105 10:07 2021.03.21D10:07:00.000000000
AAPL 2021.03.21 110 10:10 2021.03.21D10:10:00.000000000
AAPL 2021.03.21 120 10:13 2021.03.21D10:13:00.000000000
You can then use the timeStamp column like so:
w3:-00:03 00:00 +\:t[`timeStamp]
wj1[w3;`timeStamp;t;(t;(sum;`price))]
sym date price time timeStamp
---------------------------------------------------------
AAPL 2021.03.21 100 10:01 2021.03.21D10:01:00.000000000
AAPL 2021.03.21 201 10:04 2021.03.21D10:04:00.000000000
AAPL 2021.03.21 206 10:07 2021.03.21D10:07:00.000000000
AAPL 2021.03.21 215 10:10 2021.03.21D10:10:00.000000000
AAPL 2021.03.21 230 10:13 2021.03.21D10:13:00.000000000
AAPL 2021.03.22 130 10:01 2021.03.22D10:01:00.000000000
AAPL 2021.03.22 270 10:04 2021.03.22D10:04:00.000000000
AAPL 2021.03.22 290 10:07 2021.03.22D10:07:00.000000000
AAPL 2021.03.22 310 10:10 2021.03.22D10:10:00.000000000
AAPL 2021.03.22 330 10:13 2021.03.22D10:13:00.000000000
If you have more than 1 sym in the table you should apply the parted attribute:
t:update `p#sym from `sym`timeStamp xasc t
Then add sym before timeStamp in the 2nd argument of wj:
q)wj[w3;`sym`timeStamp;select sym, timeStamp from t;(t;(sum;`price))]
sym timeStamp price
----------------------------------------
AAPL 2021.03.21D10:01:00.000000000 100
AAPL 2021.03.21D10:04:00.000000000 201
AAPL 2021.03.21D10:07:00.000000000 206
AAPL 2021.03.21D10:10:00.000000000 215
AAPL 2021.03.21D10:13:00.000000000 230
AAPL 2021.03.22D10:01:00.000000000 250
AAPL 2021.03.22D10:04:00.000000000 270
AAPL 2021.03.22D10:07:00.000000000 290
AAPL 2021.03.22D10:10:00.000000000 310
AAPL 2021.03.22D10:13:00.000000000 330
MSFT 2021.03.21D10:01:00.000000000 468
MSFT 2021.03.21D10:04:00.000000000 915
MSFT 2021.03.21D10:07:00.000000000 668
MSFT 2021.03.21D10:10:00.000000000 403
MSFT 2021.03.21D10:13:00.000000000 604
MSFT 2021.03.22D10:01:00.000000000 775
MSFT 2021.03.22D10:04:00.000000000 697
MSFT 2021.03.22D10:07:00.000000000 829
MSFT 2021.03.22D10:10:00.000000000 799
MSFT 2021.03.22D10:13:00.000000000 382
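The two outputs above differ on the first row of 2021.03.22 (130 with wj1, 250 with wj) because wj also includes the value prevailing at the start of each window, while wj1 only considers values that fall inside the window. The plain Python sketch below reproduces that difference for the AAPL rows; it illustrates the semantics only and is not how q evaluates the join. The function name window_sums is made up for the example.

```python
from datetime import datetime, timedelta


def window_sums(rows, width, include_prevailing):
    """Sum prices in [t - width, t] for each row time t.

    With include_prevailing=True, the last price before the window is
    added when no row sits exactly on the window start (wj-like);
    with False only rows inside the window are used (wj1-like).
    """
    rows = sorted(rows)
    sums = []
    for t, _ in rows:
        start = t - width
        total = sum(price for ts, price in rows if start <= ts <= t)
        if include_prevailing and not any(ts == start for ts, _ in rows):
            earlier = [price for ts, price in rows if ts < start]
            if earlier:
                total += earlier[-1]
        sums.append(total)
    return sums


minutes = [1, 4, 7, 10, 13]
times = [datetime(2021, 3, day, 10, m) for day in (21, 22) for m in minutes]
prices = [100, 101, 105, 110, 120, 130, 140, 150, 160, 170]
rows = list(zip(times, prices))

print(window_sums(rows, timedelta(minutes=3), include_prevailing=False))
print(window_sums(rows, timedelta(minutes=3), include_prevailing=True))
```

The first call gives the wj1 column shown earlier and the second gives the AAPL part of the wj column, including 250 for the first row of 2021.03.22 (the prevailing 120 from the previous day plus 130).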

Create PostgreSQL view to feed a chart generating tool having filter option

We need to create a PostgreSQL view to generate a chart. The chart-creating tool allows only a single SQL view as input. The chart can be filtered by studentname, coursecode and feecode. Besides the chart display, we need to show the sum of the total course fee and the fee amount paid by all the students from the same view.
table1: student
id name address
1 John USA
2 Robert UK
3 Tinger NZ
table2: student_course
id std_id coursecode fee
1 1 CHEM 3000
2 1 PHY 4000
3 1 BIO 2000
4 2 CHEM 3000
5 2 GEO 1500
6 3 ENG 2000
table3: student_fees
id std_name coursecode feecode amount
1 1 CHEM BKFEE 100
2 1 CHEM SPFEE 140
3 1 CHEM MATFEE 250
4 1 PHY BKFEE 150
5 1 PHY SPFEE 200
6 1 BIO LBFEE 300
7 1 BIO MATFEE 350
9 1 BIO TECFEE 200
10 2 CHEM BKFEE 100
11 2 CHEM SPFEE 140
12 2 GEO BKFEE 150
13 3 ENG BKFEE 75
14 3 ENG SPFEE 140
15 3 ENG LBFEE 180
I am able to create a view like this, but it is not enough for my purpose: I can't calculate the sum of the total course fee from it, because the course fee repeats on every row. Grouping will not work here either, because the data must remain filterable by studentname, coursecode and feecode.
View:
id std_id coursecode course_fee feecode fee_amount
1 John CHEM 3000 BKFEE 100
2 John CHEM 3000 SPFEE 140
3 John CHEM 3000 MATFEE 250
4 John PHY 4000 BKFEE 150
5 John PHY 4000 SPFEE 200
6 John BIO 4000 LBFEE 300
7 John BIO 4000 MATFEE 350
8 John BIO 4000 TECFEE 200
9 Robert CHEM 3000 BKFEE 100
10 Robert CHEM 3000 SPFEE 140
11 Robert GEO 1500 BKFEE 150
12 Tinger ENG 2000 BKFEE 75
13 Tinger ENG 2000 SPFEE 140
14 Tinger ENG 2000 LBFEE 180
Is there any way to create a view like this?
View:
id std_id coursecode course_fee feecode fee_amount
1 John CHEM 3000 BKFEE 100
2 John CHEM 0 SPFEE 140
3 John CHEM 0 MATFEE 250
4 John PHY 4000 BKFEE 150
5 John PHY 0 SPFEE 200
6 John BIO 4000 LBFEE 300
7 John BIO 0 MATFEE 350
8 John BIO 0 TECFEE 200
9 Robert CHEM 3000 BKFEE 100
10 Robert CHEM 0 SPFEE 140
11 Robert GEO 1500 BKFEE 150
12 Tinger ENG 2000 BKFEE 75
13 Tinger ENG 0 SPFEE 140
14 Tinger ENG 0 LBFEE 180
Any help appreciated...
I guess you are looking for rollup functionality in your view query. I am sharing two links: the first covers the basics of how rollup works and the second is specific to PostgreSQL.
first link, second link. Hope this helps.
I have worked out a demo for you; please check the rollup query.
This is not quite the answer you are expecting, but you can explore GROUPING SETS:
select name, sf.coursecode, amount, sum(fee)
from student s, student_course sc, student_fees sf
where s.id = sc.std_id
  and sf.std_name = s.id
  and sf.coursecode = sc.coursecode
group by
  GROUPING SETS (
    (name, sf.coursecode, amount, fee),
    (name, sf.coursecode, fee),
    ()
  )
order by name, sf.coursecode asc
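The shape the question asks for, with course_fee shown only on the first row of each student and course, can be sketched in plain Python to make the intended rule concrete. This is an illustration of the target output rather than a view definition; the function name blank_repeated_course_fee is made up, and the rows are John's CHEM and PHY entries from the tables above.

```python
def blank_repeated_course_fee(rows):
    """Keep course_fee on the first row per (student, course), 0 afterwards."""
    seen = set()
    result = []
    for name, course, course_fee, feecode, amount in rows:
        key = (name, course)
        shown_fee = 0 if key in seen else course_fee
        seen.add(key)
        result.append((name, course, shown_fee, feecode, amount))
    return result


# (name, coursecode, course_fee, feecode, fee_amount)
joined = [
    ("John", "CHEM", 3000, "BKFEE", 100),
    ("John", "CHEM", 3000, "SPFEE", 140),
    ("John", "CHEM", 3000, "MATFEE", 250),
    ("John", "PHY", 4000, "BKFEE", 150),
    ("John", "PHY", 4000, "SPFEE", 200),
]

view = blank_repeated_course_fee(joined)
print(sum(row[2] for row in view))  # 7000
print(sum(row[4] for row in view))  # 840
```

One thing to keep in mind with this layout: filtering by feecode can drop the single row that carries the course fee, so the course fee total is only reliable when the feecode filter is not applied.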

Add unique rows for each group when similar group repeats after certain rows

Hi, can anyone please help me get a unique group number?
I need to give each group its own number, even when the same product repeats after some other groups.
I have following data:
id version product startdate enddate
123 0 2443 2010/09/01 2011/01/02
123 1 131 2011/01/03 2011/03/09
123 2 131 2011/08/10 2012/09/10
123 3 3009 2012/09/11 2014/03/31
123 4 668 2014/04/01 2014/04/30
123 5 668 2014/05/01 2016/01/01
123 6 668 2016/01/02 2017/09/08
123 7 131 2017/09/09 2017/10/10
123 8 131 2018/10/11 2019/01/01
123 9 550 2019/01/02 2099/01/01
select *,
dense_rank()over(partition by id order by id,product)
from table
Expected results:
id version product startdate enddate count
123 0 2443 2010/09/01 2011/01/02 1
123 1 131 2011/01/03 2011/03/09 2
123 2 131 2011/08/10 2012/09/10 2
123 3 3009 2012/09/11 2014/03/31 3
123 4 668 2014/04/01 2014/04/30 4
123 5 668 2014/05/01 2016/01/01 4
123 6 668 2016/01/02 2017/09/08 4
123 7 131 2017/09/09 2017/10/10 5
123 8 131 2018/10/11 2019/01/01 5
123 9 550 2019/01/02 2099/01/01 6
Try the following
SELECT
  id, version, product, startdate, enddate,
  1 + SUM(v) OVER (PARTITION BY id ORDER BY version) n
FROM
(
  SELECT
    *,
    IIF(LAG(product) OVER (PARTITION BY id ORDER BY version) <> product, 1, 0) v
  FROM TestTable
) q
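The idea behind the query is to flag every row whose product differs from the previous row and then take a running total of those flags. A minimal plain Python sketch of the same numbering, using the product column from the question (the function name run_numbers is made up for the example):

```python
def run_numbers(products):
    """Number consecutive runs of equal values 1, 2, 3, ... in order."""
    numbers = []
    current = 0
    previous = object()  # sentinel that never equals a real product
    for product in products:
        if product != previous:
            current += 1
            previous = product
        numbers.append(current)
    return numbers


products = [2443, 131, 131, 3009, 668, 668, 668, 131, 131, 550]
print(run_numbers(products))  # [1, 2, 2, 3, 4, 4, 4, 5, 5, 6]
```

The second block of product 131 gets its own number (5) instead of reusing 2, which is what dense_rank over product cannot do.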

Combine 2 data frames with different columns in spark

I have 2 dataframes:
df1 :
Id purchase_count purchase_sim
12 100 1500
13 1020 1300
14 1010 1100
20 1090 1400
21 1300 1600
df2:
Id click_count click_sim
12 1030 2500
13 1020 1300
24 1010 1100
30 1090 1400
31 1300 1600
I need to get the combined data frame with results as :
Id click_count click_sim purchase_count purchase_sim
12 1030 2500 100 1500
13 1020 1300 1020 1300
14 null null 1010 1100
24 1010 1100 null null
30 1090 1400 null null
31 1300 1600 null null
20 null null 1090 1400
21 null null 1300 1600
I can't use union because the column names differ. Can someone suggest a better way to do this?
All you need is a full outer join on the Id column.
df1.join(df2, Seq("Id"), "full_outer")
// Since the Id column name is the same in both dataframes, a comparison like
// df1("Id") === df2("Id") would give you duplicate Id columns.
Please refer to the documentation below for future reference.
https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
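To make the join semantics concrete without a Spark session, here is a small plain Python sketch of a full outer join on Id using the two frames from the question. It only illustrates which rows and nulls the join produces; it does not use the Spark API, and the function name full_outer_join is made up for the example.

```python
def full_outer_join(left, right, left_cols, right_cols):
    """Join two {id: {col: value}} mappings, keeping ids from both sides."""
    joined = []
    for key in sorted(left.keys() | right.keys()):
        row = {"Id": key}
        for col in right_cols:
            row[col] = right.get(key, {}).get(col)  # None when id is missing
        for col in left_cols:
            row[col] = left.get(key, {}).get(col)
        joined.append(row)
    return joined


df1 = {
    12: {"purchase_count": 100, "purchase_sim": 1500},
    13: {"purchase_count": 1020, "purchase_sim": 1300},
    14: {"purchase_count": 1010, "purchase_sim": 1100},
    20: {"purchase_count": 1090, "purchase_sim": 1400},
    21: {"purchase_count": 1300, "purchase_sim": 1600},
}
df2 = {
    12: {"click_count": 1030, "click_sim": 2500},
    13: {"click_count": 1020, "click_sim": 1300},
    24: {"click_count": 1010, "click_sim": 1100},
    30: {"click_count": 1090, "click_sim": 1400},
    31: {"click_count": 1300, "click_sim": 1600},
}

combined = full_outer_join(
    df1, df2,
    left_cols=["purchase_count", "purchase_sim"],
    right_cols=["click_count", "click_sim"],
)
for row in combined:
    print(row)
```

Ids present in only one frame (14, 20, 21, 24, 30, 31) keep their own columns and get None for the other side, as in the expected result in the question.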