PySpark: How to use the salting technique for skewed aggregates

How do I use the salting technique for a skewed aggregation in PySpark?
Say we have skewed data like the table below; how do we create a salting column and use it in the aggregation?
city        state        count
---------   -----------  -----------
Lachung     Sikkim       3,000
Rangpo      Sikkim       50,000
Gangtok     Sikkim       3,00,000
Bangalore   Karnataka    2,50,00,000
Mumbai      Maharashtra  2,90,00,000

To use the salting technique on skewed data, we need to add a column, say "salt", filled with a random number in the range 0 to (spark.sql.shuffle.partitions - 1).
The table should then look like the one below, where the "salt" column holds values from 0 to 199 (since the shuffle partition count here is 200). You can now group by "city", "state" and "salt" to compute partial counts, and then aggregate again on "city" and "state" alone to combine them into the final counts.
city        state        salt
---------   -----------  ----
Lachung     Sikkim       151
Lachung     Sikkim       102
Lachung     Sikkim       16
Rangpo      Sikkim       5
Rangpo      Sikkim       19
Rangpo      Sikkim       16
Rangpo      Sikkim       102
Gangtok     Sikkim       55
Gangtok     Sikkim       119
Gangtok     Sikkim       16
Gangtok     Sikkim       10
Bangalore   Karnataka    19
Mumbai      Maharashtra  0
Bangalore   Karnataka    199
Mumbai      Maharashtra  190
code:
from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType

# random salt in the range 0 .. (spark.sql.shuffle.partitions - 1)
num_parts = int(spark.conf.get("spark.sql.shuffle.partitions"))
salval = f.floor(f.rand() * num_parts).cast(IntegerType())

record_df.withColumn("salt", salval)\
    .groupBy("city", "state", "salt")\
    .agg(f.count("city").alias("count"))\
    .drop("salt")\
    .groupBy("city", "state")\
    .agg(f.sum("count").alias("count"))
output:
city        state        count
---------   -----------  -----------
Lachung     Sikkim       3,000
Rangpo      Sikkim       50,000
Gangtok     Sikkim       3,00,000
Bangalore   Karnataka    2,50,00,000
Mumbai      Maharashtra  2,90,00,000
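For reference, below is a minimal end-to-end sketch of the same two-stage (salted) aggregation. The local SparkSession setup and the sample rows are illustrative only and are not part of the original question:

from pyspark.sql import SparkSession, functions as f
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master("local[*]").appName("salting-demo").getOrCreate()

# Illustrative skewed input: far more rows for Mumbai than for the Sikkim cities
record_df = spark.createDataFrame(
    [("Lachung", "Sikkim")] * 3
    + [("Gangtok", "Sikkim")] * 300
    + [("Mumbai", "Maharashtra")] * 29000,
    ["city", "state"],
)

# Attach a random salt in 0 .. (shuffle partitions - 1)
num_parts = int(spark.conf.get("spark.sql.shuffle.partitions"))
salted = record_df.withColumn("salt", f.floor(f.rand() * num_parts).cast(IntegerType()))

# Stage 1: partial counts per (city, state, salt) spread each hot key across many partitions
# Stage 2: re-aggregate the partial counts to one row per (city, state)
result = (
    salted.groupBy("city", "state", "salt")
    .agg(f.count("city").alias("count"))
    .groupBy("city", "state")
    .agg(f.sum("count").alias("count"))
)
result.show()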

Related

Extracting all rows containing a specific datetime value (MATLAB)

I have a table which looks like this:
Entry number  Timestamp         Value1  Value2  Value3  Value4
------------  ----------------  ------  ------  ------  ------
5758          28-06-2018 16:30  34      63      34.2    60.9
5759          28-06-2018 17:00  33.5    58      34.9    58.4
5760          28-06-2018 17:30  33      53      35.2    58.5
5761          28-06-2018 18:00  33      63      35      57.9
5762          28-06-2018 18:30  33      61      34.6    58.9
5763          28-06-2018 19:00  33      59      34.1    59.4
5764          28-06-2018 19:30  28      89      33.5    64.2
5765          28-06-2018 20:00  28      89      33      66.1
5766          28-06-2018 20:30  28      83      32.5    67
5767          28-06-2018 21:00  29      89      32.2    68.4
Here '28-06-2018 16:30' sits in a single column, so I have 6 columns:
Entry number, Timestamp, Value1, Value2, Value3, Value4
I want to extract all rows that belong to '28-06-2018', i.e. all data pertaining to that day. My table is too large to show more data here, but the timestamps span a couple of months.
t=table([5758;5759],["28-06-2018 16:30";"29-06-2018 16:30"],[34;33.5],'VariableNames',{'Entry number','Timestamp','Value1'})
t =
2×3 table
Entry number Timestamp Value1
____________ __________________ ______
5758 "28-06-2018 16:30" 34
5759 "29-06-2018 16:30" 33.5
t(contains(t.('Timestamp'),"28-06"),:)
ans =
1×3 table
Entry number Timestamp Value1
____________ __________________ ______
5758 "28-06-2018 16:30" 34

Add unique rows for each group when similar group repeats after certain rows

Hi, can anyone help me get a unique group number?
I need to assign a unique number to each group of rows, even when the same group value repeats again after other groups.
I have the following data:
id version product startdate enddate
123 0 2443 2010/09/01 2011/01/02
123 1 131 2011/01/03 2011/03/09
123 2 131 2011/08/10 2012/09/10
123 3 3009 2012/09/11 2014/03/31
123 4 668 2014/04/01 2014/04/30
123 5 668 2014/05/01 2016/01/01
123 6 668 2016/01/02 2017/09/08
123 7 131 2017/09/09 2017/10/10
123 8 131 2018/10/11 2019/01/01
123 9 550 2019/01/02 2099/01/01
select *,
dense_rank()over(partition by id order by id,product)
from table
Expected results:
id version product startdate enddate count
123 0 2443 2010/09/01 2011/01/02 1
123 1 131 2011/01/03 2011/03/09 2
123 2 131 2011/08/10 2012/09/10 2
123 3 3009 2012/09/11 2014/03/31 3
123 4 668 2014/04/01 2014/04/30 4
123 5 668 2014/05/01 2016/01/01 4
123 6 668 2016/01/02 2017/09/08 4
123 7 131 2017/09/09 2017/10/10 5
123 8 131 2018/10/11 2019/01/01 5
123 9 550 2019/01/02 2099/01/01 6
Try the following
SELECT
    id, version, product, startdate, enddate,
    1 + SUM(v) OVER (PARTITION BY id ORDER BY version) n
FROM
(
    SELECT
        *,
        IIF(LAG(product) OVER (PARTITION BY id ORDER BY version) <> product, 1, 0) v
    FROM TestTable
) q
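For readers doing the same thing in PySpark (the topic of this page), the identical change-detection trick can be expressed with window functions. This is only a sketch and assumes a DataFrame named df with the columns shown above:

from pyspark.sql import functions as f, Window

w = Window.partitionBy("id").orderBy("version")

# Flag rows where product differs from the previous row, then take a running
# sum of the flags; adding 1 gives the group number expected above.
df_with_group = (
    df.withColumn("v", f.when(f.lag("product").over(w) != f.col("product"), 1).otherwise(0))
    .withColumn("count", f.sum("v").over(w) + 1)
    .drop("v")
)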

Update Spark dataframe to populate data from another dataframe

I have 2 DataFrames. I want to take the distinct values of one column and link them with all the rows of another DataFrame. For example:
Dataframe 1 : df1 contains
scenarioId
---------------
101
102
103
Dataframe 2 : df2 contains columns
trades
-------------------------------------
isin price
ax11 111
re32 909
erre 445
Expected output
trades
----------------
isin price scenarioid
ax11 111 101
re32 909 101
erre 445 101
ax11 111 102
re32 909 102
erre 445 102
ax11 111 103
re32 909 103
erre 445 103
Note that I don't have a common column on which to join the 2 DataFrames. Please suggest.
What you need is a cross join, or Cartesian product:
val result = df1.crossJoin(df2)
although I do not recommend it, as the amount of data grows very fast. You'll get all possible pairs, i.e. the elements of the Cartesian product (the number of rows will be the number of rows in df1 times the number of rows in df2).
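The PySpark equivalent would be along these lines (a sketch; the distinct() call matches the "distinct values of 1 column" requirement in the question):

# attach every distinct scenarioId to every trade row
result = df2.crossJoin(df1.select("scenarioId").distinct())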

Combine 2 data frames with different columns in spark

I have 2 dataframes:
df1 :
Id purchase_count purchase_sim
12 100 1500
13 1020 1300
14 1010 1100
20 1090 1400
21 1300 1600
df2:
Id click_count click_sim
12 1030 2500
13 1020 1300
24 1010 1100
30 1090 1400
31 1300 1600
I need to get the combined data frame with results as:
Id click_count click_sim purchase_count purchase_sim
12 1030 2500 100 1500
13 1020 1300 1020 1300
14 null null 1010 1100
24 1010 1100 null null
30 1090 1400 null null
31 1300 1600 null null
20 null null 1090 1400
21 null null 1300 1600
I can't use union because of the different column names. Can someone suggest a better way to do this?
All you require is a full outer join on the Id column.
df1.join(df2, Seq("Id"), "full_outer")
// Since the Id column name is the same in both DataFrames, joining with a comparison like
// df1($"Id") === df2($"Id") would give you duplicate Id columns
Please refer to the documentation below for future reference:
https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
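The PySpark equivalent of the answer above is a one-liner (a sketch, assuming the two DataFrames shown in the question):

combined = df1.join(df2, ["Id"], "full_outer")
combined.show()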

kdb getting float from integer division

I have a table
id, turnover, qty
and I want to query
select sum turnover, sum qty, (sum turnover) div (sum qty) by id from Table
However, the resulting value from the division seems to be an int and shows 0 (as the unit price is a lot smaller than 1). I tried to cast the results to a float, but that doesn't help:
select sum turnover, sum qty, `float$(`float$(sum turnover) div `float$(sum qty)) by id from Table.
How can I get a float in return?
Also, as a side question, how can I name the result column (the equivalent of SQL's select sum(x) as my_column_name ...)?
That's the expected output from div; you should use % to divide numbers, which always returns a float.
q)200 div 8.5
22
q)200%8.5
23.52941
q)
References here:
Div: http://code.kx.com/q/ref/arith-integer/#div
%: http://code.kx.com/q/ref/arith-float/#divide
Edit:
Apologies, I forgot to address the rest of your question. In your example you are calculating the sum of turnover and the sum of qty twice; you will want to avoid that if you're dealing with a lot of records.
How about this:
q)show trade:([] id:(`$"A",'string[til 10]);turnover:10?til 10; qty:10?100+til 200)
id turnover qty
---------------
A0 4 152
A1 4 238
A2 2 298
A3 2 268
A4 7 246
A5 2 252
A6 0 279
A7 5 286
A8 7 245
A9 5 191
q)update toverq:sumT%sumQ from select sumT:sum turnover,sumQ:sum qty by id from trade
id| sumT sumQ toverq
--| ---------------------
A0| 4 152 0.02631579
A1| 4 238 0.01680672
A2| 2 298 0.006711409
A3| 2 268 0.007462687
A4| 7 246 0.02845528
A5| 2 252 0.007936508
A6| 0 279 0
A7| 5 286 0.01748252
A8| 7 245 0.02857143
A9| 5 191 0.02617801