Divide a dataframe into 3 buckets based on the sum of one column - scala

I have a dataframe:
c1    c2
user1 5
user2 3
user3 3
user4 1
I want to divide the dataframe into 3 roughly equal groups based on the total sum of c2.
The total sum of c2 is 12, and 12/3 = 4, so each group should hold a total of about 4.
In this case user1 alone has value 5 (>= 4), so it forms the 1st group; user2 and user3 together total 6 (>= 4), so they form the 2nd group; and all remaining rows go into the 3rd group.
So my expected dataframe is:
c1    c2 rank
user1 5  1
user2 3  2
user3 3  2
user4 1  3
I have been trying with window functions and a custom window function, but no luck so far.
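One possible direction (my sketch, not an answer from the original thread): a plain window cumulative sum cannot reproduce the 1, 2, 2, 3 ranking, because a bucket only closes once its own running total reaches the target, which makes the assignment inherently sequential. Below is a minimal Scala sketch under two assumptions: the rows are processed in c2-descending order (which matches the example), and the data is small enough to collect and fold on the driver. Only the column names c1/c2 come from the question; everything else is illustrative.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder.appName("buckets").getOrCreate()
import spark.implicits._

val df = Seq(("user1", 5), ("user2", 3), ("user3", 3), ("user4", 1)).toDF("c1", "c2")

// Target per bucket: total of c2 divided by the number of buckets (12 / 3 = 4 here).
val target = df.agg(sum($"c2")).first.getLong(0) / 3.0

// A bucket only closes once its running total reaches the target, so the
// assignment is sequential; fold over the rows ordered by c2 descending.
// Collecting to the driver is only acceptable because the example is tiny.
val rows = df.orderBy($"c2".desc).as[(String, Int)].collect()

val ranked = rows.foldLeft((List.empty[(String, Int, Int)], 0.0, 1)) {
  case ((acc, running, rank), (c1, c2)) =>
    val total = running + c2
    val entry = (c1, c2, rank)
    if (total >= target && rank < 3) (entry :: acc, 0.0, rank + 1) // bucket full, start the next one
    else (entry :: acc, total, rank)
}._1.reverse

ranked.toDF("c1", "c2", "rank").show() // ranks come out 1, 2, 2, 3 for this example

For data too large to collect, the same sequential pass would have to be expressed differently, e.g. as a scan over a sorted RDD, but that is beyond this sketch.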

Related

Join Keyed tables in a list

I've got a list of keyed tables:
ktbls:(([filter: `a`b] user1: 3 4f);([filter: `a`c] user2: 3 4f);([filter: `$()] user3: "f"$()))
I want to join the tables by column, i.e. run ktbls[0],'ktbls[1],'ktbls[2], which results in the keyed table below:
filter| user1 user2 user3
------| -----------------
a     | 3     3     0n
b     | 4     0n    0n
c     | 0n    4     0n
Now, since the length of the keyed-table list can vary, I need to somehow functionalise ktbls[0],'ktbls[1],'ktbls[2],'...
However, I can't seem to figure it out.
Using your syntax:
q){x,'y}/[ktbls] / alternate forms ,'/[ktbls] or (,')/[ktbls]
filter| user1 user2 user3
------| -----------------
a     | 3     3
b     | 4
c     |       4
But perhaps union join (uj) could work too?
q)(uj/)ktbls
filter| user1 user2 user3
------| -----------------
a     | 3     3
b     | 4
c     |       4
Alternative syntax: uj/[ktbls].
See the documentation on this use case of over /.
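As an aside for the Scala-leaning readers of this thread (my addition, not part of the original answer): / (over) used this way is q's left fold. Here is a purely conceptual Scala model, where a keyed table is crudely approximated as a Map from key to column values and missing entries stand in for q's 0n:

// Conceptual analogue only: q's join-over ((uj/)ktbls) behaves like a left fold.
// Modeling each keyed table as Map[key, Map[column, value]] is an assumption.
val ktbls: List[Map[String, Map[String, Double]]] = List(
  Map("a" -> Map("user1" -> 3.0), "b" -> Map("user1" -> 4.0)),
  Map("a" -> Map("user2" -> 3.0), "c" -> Map("user2" -> 4.0)),
  Map.empty // the empty user3 table
)

def join(a: Map[String, Map[String, Double]],
         b: Map[String, Map[String, Double]]): Map[String, Map[String, Double]] =
  (a.keySet ++ b.keySet).map { k =>
    // union of keys; b's columns win on conflict, as in an upsert
    k -> (a.getOrElse(k, Map.empty[String, Double]) ++ b.getOrElse(k, Map.empty[String, Double]))
  }.toMap

val joined = ktbls.reduceLeft(join) // plays the role of (uj/)ktbls

This is only meant to convey why / removes the need to spell out ktbls[0],'ktbls[1],'ktbls[2],'...; the real q semantics of ,' versus uj differ in how conflicting keys are handled.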

How to perform a pivot() and write.parquet on each partition of pyspark dataframe?

I have a spark dataframe df as below:
key date val col3
1   1    10  1
1   2    12  1
2   1    5   1
2   2    7   1
3   1    30  2
3   2    20  2
4   1    12  2
4   2    8   2
5   1    0   2
5   2    12  2
I want to:
1) df_pivot = df.groupBy(['date', 'col3']).pivot('key').sum('val')
2) df_pivot.write.parquet('location')
But my data can get really big, with millions of unique keys and unique col3 values.
Is there any way to do the above operations per partition of col3?
E.g., for the partition where col3 == 1, do the pivot and write the parquet.
Note: I do not want to use a for loop!
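One possible direction (a sketch, not an answer from the original thread): since col3 is already a grouping key, DataFrameWriter.partitionBy can split the output by col3 in a single job. It is shown in Spark's Scala API to match the rest of this thread; the equivalent pivot/write.partitionBy calls also exist in PySpark. The schema and path are taken from the question.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pivot-per-partition").getOrCreate()
import spark.implicits._

// df has columns key, date, val, col3 as in the question.
val df = Seq(
  (1, 1, 10, 1), (1, 2, 12, 1), (2, 1, 5, 1), (2, 2, 7, 1),
  (3, 1, 30, 2), (3, 2, 20, 2), (4, 1, 12, 2), (4, 2, 8, 2),
  (5, 1, 0, 2), (5, 2, 12, 2)
).toDF("key", "date", "val", "col3")

val df_pivot = df.groupBy($"date", $"col3").pivot("key").sum("val")

// partitionBy writes one directory per col3 value
// (location/col3=1/, location/col3=2/, ...) without any driver-side loop.
df_pivot.write.partitionBy("col3").parquet("location")

Caveat: the pivot itself still runs over the global set of distinct keys, so the pivoted table is as wide as that key set; if the keys are known up front, passing them explicitly as pivot's second argument limits the width. Whether this meets the millions-of-keys scale requirement is an assumption to verify.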

How to add columns using Scala

I have a df as below and want to add an additional column using Scala.
Id Name
1  ab
2  BC
1  Cd
2  mf
3  Hh
Expected output should be as below:
Id Name repeatedcount
1  ab   2
2  BC   2
1  Cd   2
2  mf   2
3  Hh   3
I'm using DF.groupBy($"id").count.show(), but I'm getting different output.
Can someone please help me with this?
val grouped = df.groupBy($"id").count            // one row per id with its count
val res = df.join(grouped, Seq("id"))            // attach that count to every row
  .withColumnRenamed("count", "repeatedcount")
groupBy gives the count for each id; joining that back onto the original dataframe attaches the count to every row with that id.
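An alternative sketch (my addition, not part of the original answer): a window count computes the same repeatedcount without the aggregate-and-join round trip.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit}

// Same result without the explicit join: count rows per id partition
// and attach the count directly to every row.
val byId = Window.partitionBy($"id")
val res2 = df.withColumn("repeatedcount", count(lit(1)).over(byId))

Both versions produce one count per id attached to every row; the window form just keeps it in a single expression.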

KDB: select first n rows from each group

How can I extract the first n rows from each group? For example, for the table
bb: ([]sym:(4#`a),(5#`b);val: til 9)
sym val
-------
a   0
a   1
a   2
a   3
b   4
b   5
b   6
b   7
b   8
How can I select the first 2 rows of each group by sym?
Thanks
You can use fby: the lambda {x in 2#x} applied to the row index i keeps only the rows whose index is among the first two indices of each sym group:
q)select from bb where ({x in 2#x};i) fby sym
sym val
-------
a   0
a   1
b   4
b   5
You can also try this, using the bb table from the question:
q)select from bb where i in raze exec 2#i by sym from bb
sym val
-------
a   0
a   1
b   4
b   5

KDB: Aggregate across consecutive rows with common label

I would like to sum across consecutive rows that share the same label. Any very simple ways to do this?
Example: I start with this table...
qty flag
1   OFF
3   ON
2   ON
2   OFF
9   OFF
4   ON
... and would like to generate...
qty flag
1   OFF
5   ON
11  OFF
4   ON
One method: differ is true on each row where flag changes from the previous row, and sums turns those booleans into a running group number d, so every run of consecutive equal labels gets its own group id:
q)show t:flip`qty`flag!(1 3 2 2 9 4;`OFF`ON`ON`OFF`OFF`ON)
qty flag
--------
1   OFF
3   ON
2   ON
2   OFF
9   OFF
4   ON
q)show result:select sum qty by d:sums differ flag,flag from t
d flag| qty
------| ---
1 OFF | 1
2 ON  | 5
3 OFF | 11
4 ON  | 4
Then to get it in the format you require:
q)`qty`flag#0!result
qty flag
--------
1   OFF
5   ON
11  OFF
4   ON