KDB+ How to join data for particular dates - left-join

I have the following table containing some time-series data about some countries:
t1 : ([]dates:"d"$4+til 6) cross ([]country:`PT`AR`MR`LT; category1:1+til 4)
dates      country category1
----------------------------
2000.01.05 PT      1
2000.01.05 AR      2
2000.01.05 MR      3
2000.01.05 LT      4
2000.01.06 PT      1
2000.01.06 AR      2
2000.01.06 MR      3
2000.01.06 LT      4
2000.01.07 PT      1
2000.01.07 AR      2
2000.01.07 MR      3
2000.01.07 LT      4
..
I have another table containing some complementary data for t1, but which is only valid from a certain point in time, as follows:
t2 : (([]validFrom:"d"$(0;6)) cross ([]country:`PT`AR`MR`LT)),'([]category2:1000*(1+til 8))
validFrom  country category2
----------------------------
2000.01.01 PT      1000
2000.01.01 AR      2000
2000.01.01 MR      3000
2000.01.01 LT      4000
2000.01.07 PT      5000
2000.01.07 AR      6000
2000.01.07 MR      7000
2000.01.07 LT      8000
My question is: how do I join t1 and t2 so that each row of t1 picks up the category2 that is valid on its date, i.e. the row of t2 for the same country with the most recent validFrom on or before that date? The resulting table would look like this:
dates      country category1 category2
--------------------------------------
2000.01.05 PT      1         1000
2000.01.05 AR      2         2000
2000.01.05 MR      3         3000
2000.01.05 LT      4         4000
2000.01.06 PT      1         1000
2000.01.06 AR      2         2000
2000.01.06 MR      3         3000
2000.01.06 LT      4         4000
2000.01.07 PT      1         5000
2000.01.07 AR      2         6000
2000.01.07 MR      3         7000
2000.01.07 LT      4         8000
..

You can use an as-of join (aj) to pick up, for each date and country in t1, the most recent category2 from t2:
aj[`country`dates;t1;`dates xasc `dates xcol t2]
Just don't forget to rename the validFrom column to dates in t2 and to sort it by dates, which is what `dates xcol and `dates xasc do inline above.
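Not part of the q answer above, but for readers coming from pandas the same as-of semantics can be sketched with pandas.merge_asof. The snippet below just re-creates the displayed rows of t1 and all of t2 as DataFrames and is purely illustrative.

import pandas as pd

# re-create the question's tables (dates as datetimes)
t1 = pd.DataFrame({
    "dates": pd.to_datetime(["2000-01-05"] * 4 + ["2000-01-06"] * 4 + ["2000-01-07"] * 4),
    "country": ["PT", "AR", "MR", "LT"] * 3,
    "category1": [1, 2, 3, 4] * 3,
})
t2 = pd.DataFrame({
    "dates": pd.to_datetime(["2000-01-01"] * 4 + ["2000-01-07"] * 4),  # validFrom renamed to dates
    "country": ["PT", "AR", "MR", "LT"] * 2,
    "category2": [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000],
})

# merge_asof needs both frames sorted on the "on" key; "by" plays the role of the
# country grouping in aj, and direction="backward" takes the most recent
# validFrom on or before each date
result = pd.merge_asof(
    t1.sort_values("dates"),
    t2.sort_values("dates"),
    on="dates",
    by="country",
    direction="backward",
)
print(result.sort_values(["dates", "country"]))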

Related

Making different groups of sequential null data in PySpark

I have data that contains sequential nulls, and I want to assign those sequential nulls to different groups.
The data looks like this:
group_num days     time usage
1         20200101 1    10
1         20200101 2    10
1         20200101 3    null
2         20200102 1    30
2         20200102 2    null
2         20200102 3    null
2         20200102 4    50
2         20200102 5    null
3         20200105 10   null
3         20200105 11   null
3         20200105 12   5
What I want to do is turn the nulls in the usage column into groups: consecutive nulls should share the same null_group, while nulls that are not consecutive, or that belong to a different group_num, should get a different null_group.
group_num days     time usage null_group
1         20200101 1    10
1         20200101 2    10
1         20200101 3    null  group1
2         20200102 1    30
2         20200102 2    null  group2
2         20200102 3    null  group2
2         20200102 4    50
2         20200102 5    null  group3
3         20200105 10   null  group4
3         20200105 11   null  group4
3         20200105 12   5
Alternatively, it would also be fine to produce new data that contains only the null rows, each with its group:
group_num days     time usage null_group
1         20200101 3    null  group1
2         20200102 2    null  group2
2         20200102 3    null  group2
2         20200102 5    null  group3
3         20200105 10   null  group4
3         20200105 11   null  group4
null_group could also be numeric, like below:
group_num days     time usage null_group
1         20200101 3    null  1
2         20200102 2    null  2
2         20200102 3    null  2
2         20200102 5    null  3
3         20200105 10   null  4
3         20200105 11   null  4
Can anyone help with this problem? I thought I could do this with PySpark window functions, but it didn't work very well. I think I have to use PySpark because the original data is too large to handle in plain Python.
This looks a bit complicated, but the last two parts are only there to display the result correctly. The main logic goes like this:
calculate the "time" difference "dt" between rows (it needs to be 1 within the same "null_group")
generate a "key" from the "usage" and "dt" columns
use the trick for labelling consecutive rows (originally described for pandas: https://www.codeforests.com/2021/03/30/group-consecutive-rows-in-pandas/)
rename and manipulate the labels to get the desired result
Full solution:
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('group_num').orderBy('time')
w_cumsum = w.rowsBetween(Window.unboundedPreceding, 0)

# main logic: label runs of consecutive rows that share the same (usage, dt) key
df_tmp = (
    df
    .withColumn('dt', F.coalesce(F.col('time') - F.lag('time').over(w), F.lit(1)))
    .withColumn('key', F.concat_ws('-', 'usage', 'dt'))
    .withColumn('prev_key', F.lag('key').over(w))
    .withColumn('diff', F.coalesce((F.col('key') != F.col('prev_key')).cast('int'), F.lit(1)))
    .withColumn('cumsum', F.sum('diff').over(w_cumsum))
    .withColumn('null_group_key',
                F.when(F.isnull('usage'), F.concat_ws('-', 'group_num', 'cumsum')).otherwise(None))
)

# map each distinct key to the required group names (group1, group2, ...)
# note: monotonically_increasing_id is increasing but not guaranteed consecutive;
# for this small, sorted mapping it yields consecutive labels in practice
df_map = (
    df_tmp
    .select('null_group_key')
    .distinct()
    .dropna()
    .sort('null_group_key')
    .withColumn('null_group', F.concat(F.lit('group'), F.monotonically_increasing_id() + F.lit(1)))
)

# rename and display as needed
(
    df_tmp
    .join(df_map, 'null_group_key', 'left')
    .fillna('', 'null_group')
    .select('group_num', 'days', 'time', 'usage', 'null_group')
    .sort('group_num', 'time')
    .show()
)
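To try the above end to end, the question's sample data can be recreated as a small DataFrame. This is only a sketch: the column names follow the question, and keeping days as a plain string is an assumption.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data transcribed from the question (None marks the nulls)
df = spark.createDataFrame(
    [
        (1, '20200101', 1, 10), (1, '20200101', 2, 10), (1, '20200101', 3, None),
        (2, '20200102', 1, 30), (2, '20200102', 2, None), (2, '20200102', 3, None),
        (2, '20200102', 4, 50), (2, '20200102', 5, None),
        (3, '20200105', 10, None), (3, '20200105', 11, None), (3, '20200105', 12, 5),
    ],
    ['group_num', 'days', 'time', 'usage'],
)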

Unpivot data in PostgreSQL

I have a table in PostgreSQL with the below values,
empid hyderabad bangalore mumbai chennai
1     20        30        40     50
2     10        20        30     40
And my output should be like below
empid city      nos
1     hyderabad 20
1     bangalore 30
1     mumbai    40
1     chennai   50
2     hyderabad 10
2     bangalore 20
2     mumbai    30
2     chennai   40
How can I do this unpivot in PostgreSQL?
You can use a lateral join:
select t.empid, x.city, x.nos
from the_table t
cross join lateral (
    values
        ('hyderabad', t.hyderabad),
        ('bangalore', t.bangalore),
        ('mumbai', t.mumbai),
        ('chennai', t.chennai)
) as x(city, nos)
order by t.empid, x.city;
Or this one, which is simpler to read and plain standard SQL:
WITH
input(empid,hyderabad,bangalore,mumbai,chennai) AS (
              SELECT 1,20,30,40,50
    UNION ALL SELECT 2,10,20,30,40
)
,
i(i) AS (
              SELECT 1
    UNION ALL SELECT 2
    UNION ALL SELECT 3
    UNION ALL SELECT 4
)
SELECT
  empid
, CASE i
    WHEN 1 THEN 'hyderabad'
    WHEN 2 THEN 'bangalore'
    WHEN 3 THEN 'mumbai'
    WHEN 4 THEN 'chennai'
    ELSE 'unknown'
  END AS city
, CASE i
    WHEN 1 THEN hyderabad
    WHEN 2 THEN bangalore
    WHEN 3 THEN mumbai
    WHEN 4 THEN chennai
    ELSE NULL::INT
  END AS nos
FROM input CROSS JOIN i
ORDER BY empid,i;
-- out  empid |   city    | nos
-- out -------+-----------+-----
-- out      1 | hyderabad |  20
-- out      1 | bangalore |  30
-- out      1 | mumbai    |  40
-- out      1 | chennai   |  50
-- out      2 | hyderabad |  10
-- out      2 | bangalore |  20
-- out      2 | mumbai    |  30
-- out      2 | chennai   |  40
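Not part of the PostgreSQL answers above, but as a quick cross-check of the expected shape, the same unpivot can be sketched in pandas with melt (purely illustrative):

import pandas as pd

# the question's table, transcribed
df = pd.DataFrame({
    "empid": [1, 2],
    "hyderabad": [20, 10],
    "bangalore": [30, 20],
    "mumbai": [40, 30],
    "chennai": [50, 40],
})

# melt turns the city columns into (city, nos) rows, mirroring the lateral VALUES join
long_df = df.melt(id_vars="empid", var_name="city", value_name="nos")
print(long_df.sort_values(["empid", "city"]).to_string(index=False))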

How to consistently sum lists of values contained in a table?

I have the following two tables:
t1:([]sym:`AAPL`GOOG; histo_dates1:(2000.01.01+til 10;2000.01.01+til 10);histo_values1:(til 10;5+til 10));
t2:([]sym:`AAPL`GOOG; histo_dates2:(2000.01.05+til 5;2000.01.06+til 4);histo_values2:(til 5; 2+til 4));
What I want is to sum the histo_values of each symbol across the histo_dates, such that the resulting table would look like this:
t:([]sym:`AAPL`GOOG; histo_dates:(2000.01.01+til 10;2000.01.01+til 10);histo_values:(0 1 2 3 4 6 8 10 12 9;5 6 7 8 9 12 14 16 18 14))
So the resulting dates histo_dates should be the union of histo_dates1 and histo_dates2, and histo_values should be the sum of histo_values1 and histo_values2 across dates.
EDIT:
I insist on the union of the dates, as I want the resulting histo_dates to be the union of both histo_dates1 and histo_dates2.
There are a few ways. One would be to ungroup to remove nesting, join the tables, aggregate on sym/date and then regroup on sym:
q)0!select histo_dates:histo_dates1, histo_values:histo_values1 by sym from select sum histo_values1 by sym, histo_dates1 from ungroup[t1],cols[t1]xcol ungroup[t2]
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
A possibly faster way would be to make each row a dictionary and then key the tables on sym and add them:
q)select sym:s, histo_dates:key each v, histo_values:value each v from (1!select s, d!'v from `s`d`v xcol t1)+(1!select s, d!'v from `s`d`v xcol t2)
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
Another option would be to use a plus join pj:
q)0!`sym xgroup 0!pj[ungroup `sym`histo_dates`histo_values xcol t1;2!ungroup `sym`histo_dates`histo_values xcol t2]
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
See here for more on plus joins: https://code.kx.com/v2/ref/pj/
EDIT:
To explicitly make sure the result has the union of the dates, you could use a union join:
q)0!`sym xgroup select sym,histo_dates,histo_values:hv1+hv2 from 0^uj[2!ungroup `sym`histo_dates`hv1 xcol t1;2!ungroup `sym`histo_dates`hv2 xcol t2]
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
Another way:
// rename the columns to be common names, ungroup the tables, and place the key on `sym and `histo_dates
q){2!ungroup `sym`histo_dates`histo_values xcol x} each (t1;t2)
// add them together (or use pj in place of +), group on `sym
`sym xgroup (+) . {2!ungroup `sym`histo_dates`histo_values xcol x} each (t1;t2)
// and to test this matches t, remove the key from the resulting table
q)t~0!`sym xgroup (+) . {2!ungroup `sym`histo_dates`histo_values xcol x} each (t1;t2)
1b
Another possible way using functional amend
//Column join the histo_dates* columns and get the distinct dates - drop idx
//Using a functional apply use the idx to determine which values to plus
//Join the two tables using sym as the key - Find the idx of common dates
(enlist `idx) _select sym,histo_dates:distinct each (histo_dates1,'histo_dates2),
histovalues:{#[x;z;+;y]}'[histo_values1;histo_values2;idx],idx from
update idx:(where each histo_dates1 in' histo_dates2) from ((1!t1) uj 1!t2)
One possible problem with this approach is that computing idx relies on the date columns being sorted, which is usually the case.
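For readers outside q, the ungroup/aggregate/regroup idea from the first answer can be sketched in pandas as well. This is purely illustrative, not part of the q answers, and multi-column explode assumes pandas 1.3 or later:

import pandas as pd

# re-create t1 and t2 from the question
t1 = pd.DataFrame({
    "sym": ["AAPL", "GOOG"],
    "histo_dates": [pd.date_range("2000-01-01", periods=10).tolist()] * 2,
    "histo_values": [list(range(10)), list(range(5, 15))],
})
t2 = pd.DataFrame({
    "sym": ["AAPL", "GOOG"],
    "histo_dates": [pd.date_range("2000-01-05", periods=5).tolist(),
                    pd.date_range("2000-01-06", periods=4).tolist()],
    "histo_values": [list(range(5)), list(range(2, 6))],
})

# "ungroup": one row per (sym, date), then sum values over the union of dates
flat = pd.concat([t1.explode(["histo_dates", "histo_values"]),
                  t2.explode(["histo_dates", "histo_values"])])
flat["histo_values"] = flat["histo_values"].astype(int)
summed = flat.groupby(["sym", "histo_dates"], as_index=False)["histo_values"].sum()

# "regroup": back to one list of dates and one list of values per sym
result = summed.groupby("sym").agg(list).reset_index()
print(result)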

Spark - Grouping 2 Dataframe Rows in only 1 row [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 4 years ago.
I have the following dataframe
id col1 col2 col3 col4
1  1    10   100  A
1  1    20   101  B
1  1    30   102  C
2  1    10   80   D
2  1    20   90   E
2  1    30   100  F
2  1    40   104  G
So, I want to return a new dataframe in which I have, in only one row, the values for the same (col1, col2) pair, and also create a new column with some operation over both col3 values, for example:
id(1) col1(1) col2(1) col3(1) col4(1) id(2) col1(2) col2(2) col3(2) col4(2) new_column
1     1       10      100     A       2     1       10      80      D       (100-80)*100
1     1       20      101     B       2     1       20      90      E       (101-90)*100
1     1       30      102     C       2     1       30      100     F       (102-100)*100
-     -       -       -       -       2     1       40      104     G       -
I tried ordering and grouping by (col1, col2), but grouping returns a RelationalGroupedDataset on which I cannot do anything apart from aggregation functions, so I would appreciate any help. I'm using Scala 2.11. Thanks!
What about joining the df with itself?
Something like:
df.as("left")
  .join(df.as("right"), Seq("col1", "col2"), "outer")
  .where($"left.id" =!= $"right.id")
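In PySpark the same self-join reads much the same; the extra withColumn below is just an illustrative sketch of the derived column the question asks for, not part of the original answer:

from pyspark.sql import functions as F

paired = (
    df.alias("left")
      .join(df.alias("right"), ["col1", "col2"], "outer")
      # drop the rows where a record was paired with itself
      .where(F.col("left.id") != F.col("right.id"))
      # the derived column from the question: difference of the two col3 values, times 100
      .withColumn("new_column", (F.col("left.col3") - F.col("right.col3")) * 100)
)
paired.show()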

Replace DataFrame rows with most recent data based on key

I have a dataframe that looks like this:
user_id val date
1       10  2015-02-01
1       11  2015-01-01
2       12  2015-03-01
2       13  2015-02-01
3       14  2015-03-01
3       15  2015-04-01
I need to run a function that calculates (let's say) the sum of val using each user's most recent row as of a given date: if a user has a row with a more recent date up to that cut-off, use it; otherwise keep the older one.
For example, if I run the function with the date 2015-03-15, then the table will be:
user_id val date
1       10  2015-02-01
2       12  2015-03-01
3       14  2015-03-01
Giving me a sum of 36.
If I run the function with the date 2015-04-15, then the table will be:
user_id val date
1       10  2015-02-01
2       12  2015-03-01
3       15  2015-04-01
(User 3's row was replaced with a more recent date).
I know this is fairly esoteric, but I thought I could bounce it off all of you, as I have been trying to think of a simple way of doing this.
try this:
In [36]: df.loc[df.date <= '2015-03-15']
Out[36]:
   user_id  val        date
0        1   10  2015-02-01
1        1   11  2015-01-01
2        2   12  2015-03-01
3        2   13  2015-02-01
4        3   14  2015-03-01

In [39]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').agg({'date':'last', 'val':'last'}).reset_index()
Out[39]:
   user_id        date  val
0        1  2015-02-01   10
1        2  2015-03-01   12
2        3  2015-03-01   14

or:

In [40]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[40]:
   user_id  val        date
0        1   10  2015-02-01
1        2   12  2015-03-01
2        3   14  2015-03-01

In [41]: df.loc[df.date <= '2015-04-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[41]:
   user_id  val        date
0        1   10  2015-02-01
1        2   12  2015-03-01
2        3   15  2015-04-01
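Since the question talks about "running the function with a date", the same approach can be wrapped up as a small function; a minimal sketch following the answer above (the name latest_sum and the string-date comparison are assumptions):

import pandas as pd

def latest_sum(df, as_of):
    """Sum each user's most recent val on or before as_of (ISO date string)."""
    latest = (
        df.loc[df.date <= as_of]
          .sort_values('date')
          .groupby('user_id')
          .last()
    )
    return latest['val'].sum()

# the question's data
df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3],
    'val': [10, 11, 12, 13, 14, 15],
    'date': ['2015-02-01', '2015-01-01', '2015-03-01', '2015-02-01', '2015-03-01', '2015-04-01'],
})
print(latest_sum(df, '2015-03-15'))  # 36
print(latest_sum(df, '2015-04-15'))  # 37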