I have a question. I have a dataset like this:
id / color
1 / red
2 / green
2 / green
2 / blue
3 / blue
4 / yellow
4 / pink
5 / red
and I would like to group by id and keep the most frequent color, to have something like this
(in case of a tie, picking randomly is OK, or a better solution if you have one):
id / most_color
1 / red
2 / green
3 / blue
4 / yellow
5 / red
I have tried things like:
display(dataset.select("id","color").
dropDuplicates().
withColumn("most_color",count("color").over(w)))
or like this :
dataset2= (dataset.select("id","color").
withColumn("most_color", dataset["color"]).
groupBy("id").
agg(count('color').
alias('count').
filter(column('count') == max(count))))
display(dataset2)
thank you everyone
You can use the window function row_number() to achieve this:
from pyspark.sql import functions as F
from pyspark.sql import Window as W
_w = W.partitionBy('id').orderBy(F.col('id').desc())
df_final = df_final.withColumn('rn_no', F.row_number().over(_w))
df_final = df_final.filter(F.col('rn_no') == 1)
df_final.show()
Output
id / most_color
1 / red
2 / green
3 / blue
4 / yellow
5 / red
Modified version: this will give you the most frequently occurring value in each group.
Input
df_a = spark.createDataFrame([(1,'red'),(2,'green'),(2,'green'),(2,'blue'),(3,'blue'),(4,'yellow'),(4,'pink'),(5,'red')],[ "id","color"])
+---+------+
| id| color|
+---+------+
| 1| red|
| 2| green|
| 2| green|
| 2| blue|
| 3| blue|
| 4|yellow|
| 4| pink|
| 5| red|
+---+------+
# First, count how many times each color appears per id
df = df_a.groupBy('id','color').agg(F.count('color').alias('count')).orderBy(F.col('id'))
# Now partition by id, sort each window by count in descending order, and take the first row
_w = W.partitionBy('id').orderBy(F.col('count').desc())
df_a = df.withColumn('rn_no', F.row_number().over(_w))
df_a = df_a.filter(F.col('rn_no') == 1)
Output
df_a.show()
+---+-----+-----+-----+
| id|color|count|rn_no|
+---+-----+-----+-----+
| 1| red| 1| 1|
| 2|green| 2| 1|
| 3| blue| 1| 1|
| 4| pink| 1| 1|
| 5| red| 1| 1|
+---+-----+-----+-----+
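If you only want the id and the winning colour (to match the desired output exactly), a small follow-up select could look like this, for example:
df_a.select('id', F.col('color').alias('most_color')).show()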
Related
I have a column named "sales" and another column with the salesman, and I want to know how many salesmen account for 80% of the sales for each type of sale (A, B, C).
For this example,
+--------+-----+----+
|salesman|sales|type|
+--------+-----+----+
|       5|    9|   a|
|       8|   12|   b|
|       6|    3|   b|
|       6|    1|   a|
|       1|    3|   a|
|       5|    1|   b|
|       2|   11|   b|
|       4|    3|   a|
|       1|    1|   b|
|       2|    3|   a|
|       3|    4|   a|
+--------+-----+----+
The result should be:
+----+----------+
|type|Salesman80|
+----+----------+
|   a|         4|
|   b|         2|
+----+----------+
We'll find the total sales per type, join this back to the table, group by salesman and type to get each salesman's total sales per type, then use a small math trick to get the percentage.
You can then chop this table to any percentage you wish with a where clause (see the sketch after the output below).
from pyspark.sql.functions import avg, col, expr, floor, rand, sum

# create some data
data = spark.range(1 , 1000)
sales = data.select( data.id, floor((rand()*13)).alias("salesman"), floor((rand()*26)+65).alias("type"), floor(rand()*26).alias("sale") )
totalSales = sales.groupby(sales.type)\
.agg(
sum(sales.sale).alias("total_sales")
)\
.select( col("*"), expr("chr( type)").alias("type_") ) #fix int to chr
sales.join( totalSales , ["type"] )\
.groupby("salesman","type_")\
.agg( (sum("sale")/avg("total_sales")).alias("percentage"),
avg("total_sales").alias("total_sales_by_type") #Math trick as the total sum is the same on all types. so average = total sales by type.
).show()
+--------+-----+--------------------+-------------------+
|salesman|type_| percentage|total_sales_by_type|
+--------+-----+--------------------+-------------------+
| 10| H| 0.04710144927536232| 552.0|
| 9| U| 0.21063394683026584| 489.0|
| 0| I| 0.09266409266409266| 518.0|
| 11| K| 0.09683426443202979| 537.0|
| 0| F|0.027070063694267517| 628.0|
| 11| F|0.054140127388535034| 628.0|
| 1| G| 0.08086253369272237| 371.0|
| 5| N| 0.1693548387096774| 496.0|
| 9| L| 0.05353728489483748| 523.0|
| 7| R|0.003058103975535...| 327.0|
| 0| C| 0.05398457583547558| 389.0|
| 6| G| 0.1105121293800539| 371.0|
| 12| A|0.057007125890736345| 421.0|
| 0| J| 0.09876543209876543| 567.0|
| 11| B| 0.11337209302325581| 344.0|
| 8| K| 0.08007448789571694| 537.0|
| 4| N| 0.06854838709677419| 496.0|
| 11| H| 0.1358695652173913| 552.0|
| 10| W| 0.11617312072892938| 439.0|
| 1| C| 0.06940874035989718| 389.0|
+--------+-----+--------------------+-------------------+
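As a rough sketch of the where-clause "chop" mentioned above (the 10% cut-off is only an illustrative assumption, not part of the original answer), you can keep the aggregated result in a variable and filter it:
# same aggregation as above, kept in a variable instead of calling .show()
per_salesman = (sales.join(totalSales, ["type"])
    .groupby("salesman", "type_")
    .agg((sum("sale") / avg("total_sales")).alias("percentage")))

# hypothetical cut-off: only salesmen contributing at least 10% of their type's sales
per_salesman.where(col("percentage") >= 0.10).show()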
I am guessing you mean "how many salesmen contribute to 80% of the total sales per type" and you want the lowest possible number of salesmen.
If that is what you meant, you can do it in these steps:
Calculate the total sales per group
Get cumulative sum of sales percentage (sales / total sales)
Assign the row number by ordering sales in descending order
Take minimum row number where the cumulative sum of sales percentage >= 80% per group
Note this is probably not the most efficient approach, but it produces what you want.
from pyspark.sql import functions as F
from pyspark.sql import Window

part_window = Window.partitionBy('type')
order_window = part_window.orderBy(F.desc('sales'))
cumsum_window = order_window.rowsBetween(Window.unboundedPreceding, 0)
df = (df.withColumn('total_sales', F.sum('sales').over(part_window)) # Step 1
.select('*',
F.sum(F.col('sales') / F.col('total_sales')).over(cumsum_window).alias('cumsum_percent'), # Step 2
F.row_number().over(order_window).alias('rn')) # Step 3
.groupby('type') # Step 4
.agg(F.min(F.when(F.col('cumsum_percent') >= 0.8, F.col('rn'))).alias('Salesman80')))
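For reference, a minimal sketch of how this could be run on the sample data from the question (assuming a SparkSession named spark); the expected output is a = 4 and b = 2:
df = spark.createDataFrame(
    [(5, 9, 'a'), (8, 12, 'b'), (6, 3, 'b'), (6, 1, 'a'), (1, 3, 'a'), (5, 1, 'b'),
     (2, 11, 'b'), (4, 3, 'a'), (1, 1, 'b'), (2, 3, 'a'), (3, 4, 'a')],
    ['salesman', 'sales', 'type'])

# after running the steps above, df.show() gives (row order may vary):
# +----+----------+
# |type|Salesman80|
# +----+----------+
# |   a|         4|
# |   b|         2|
# +----+----------+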
I think an example says more than a description. The right column, "sum", is the one I am looking for.
to_count|sum
-------------
-1 |0
+1 |1
-1 |0
-1 |0
+1 |1
+1 |2
-1 |1
+1 |2
. |.
. |.
I tried to rebuild that with several groupings, comparing lead and lag, but that only works the first time; usually the sum ends up at a negative value.
Summing the positive and negative values separately also leads to a different final result.
It would be great if anyone has a good idea how to solve this in PySpark!
I would use pandas_udf here:
import math
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

pdf = pd.DataFrame({'g':[1]*8, 'id':range(8), 'value': [-1,1,-1,-1,1,1,-1,1]})
df = spark.createDataFrame(pdf)
df = df.withColumn('cumsum', F.lit(math.inf))

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def _calc_cumsum(pdf):
pdf.sort_values(by=['id'], inplace=True, ascending=True)
cumsums = []
prev = 0
for v in pdf['value'].values:
prev = max(prev + v, 0)
cumsums.append(prev)
pdf['cumsum'] = cumsums
return pdf
df = df.groupby('g').apply(_calc_cumsum)
df.show()
The results:
+---+---+-----+------+
| g| id|value|cumsum|
+---+---+-----+------+
| 1| 0| -1| 0.0|
| 1| 1| 1| 1.0|
| 1| 2| -1| 0.0|
| 1| 3| -1| 0.0|
| 1| 4| 1| 1.0|
| 1| 5| 1| 2.0|
| 1| 6| -1| 1.0|
| 1| 7| 1| 2.0|
+---+---+-----+------+
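As a side note, on Spark 3.x the grouped-map pandas_udf style used above is deprecated in favour of applyInPandas; the same grouping logic can be expressed roughly like this (a sketch, reusing _calc_cumsum as a plain, undecorated function):
# pass the plain Python function plus an output schema
# (define _calc_cumsum without the @pandas_udf decorator for this variant)
df = df.groupby('g').applyInPandas(_calc_cumsum, schema=df.schema)
df.show()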
Please look at the picture first: it shows a test dataset (the first 3 columns) and the calculation steps.
The column "flag" is now in another format. We also checked our data source and realized that we only have to handle 1 and -1 entries. We mapped 1 to 0 and -1 to 1. Now it works as expected, as you can see in the "result" column.
The code is this:
from pyspark.sql import functions as F
from pyspark.sql import Window

w1 = Window.partitionBy('group').orderBy('order')
# detect a "rising edge": flag switches from 1 (count down) to 0 (count up)
df_0 = tst.withColumn('edge_det',F.when(((F.col('flag')==0)&((F.lag('flag',default=1).over(w1))==1)),1).otherwise(0))
# running count of edges seen so far
df_0 = df_0.withColumn('edge_cyl',F.sum('edge_det').over(w1))
# translate flag into +1 / -1 increments; rows before the first edge contribute 0
df1 = df_0.withColumn('condition', F.when(F.col('edge_cyl')==0,0).otherwise(F.when(F.col('flag')==1,-1).otherwise(1)))
df1 =df1.withColumn('cond_sum',F.sum('condition').over(w1))
# zero out decrements that would push the running sum below 0
cond = (F.col('cond_sum')>=0)|(F.col('condition')==1)
df2 = df1.withColumn('new_cond',F.when(cond,F.col('condition')).otherwise(0))
df3 = df2.withColumn("result",F.sum('new_cond').over(w1))
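A minimal sketch of how this could be checked against the earlier example, assuming the column names group/order/flag used above and the 1 -> 0 / -1 -> 1 mapping:
# flag: 0 means "count up", 1 means "count down" (after the mapping described above)
tst = spark.createDataFrame(
    [(1, i, f) for i, f in enumerate([1, 0, 1, 1, 0, 0, 1, 0])],
    ['group', 'order', 'flag'])

# running the steps above on tst yields result = 0, 1, 0, 0, 1, 2, 1, 2,
# which matches the "sum" column from the original example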
Say I have a DataFrame like:
+------------+-----------+-----+
| feed|artist |count|
+------------+-----------+-----+
| y| Kanye West| 9|
| y| Radiohead| 6|
| y| Zero 7| 3|
| y| Puts Marie| 1|
| gwas| Drax| 7|
| gwas| Calibre| 4|
| gwas| Aphex Twin| 1|
| gwas| Jay Z| 1|
| x| DJ Krush| 2|
| x| Titeknots| 1|
+------------+-----------+-----+
I want to add a new column which chunks the rows into buckets of N rows for each partition (feed).
It seems like the inverse of NTILE to me. NTILE lets you choose the # of buckets but I want to choose the bucket-size instead.
Here's the desired result. Notice how each feed is chunked into groups of N = 2, including the x feed which has just one chunk of 2 rows. (Edit: each partition is ordered by count, so group 1 in each partition will be the rows with the highest value for count)
+------------+-----------+-----+-----+
|        feed|artist     |count|group|
+------------+-----------+-----+-----+
|           y| Kanye West|    9|    1|
|           y|  Radiohead|    6|    1|
|           y|     Zero 7|    3|    2|
|           y| Puts Marie|    1|    2|
|        gwas|       Drax|    7|    1|
|        gwas|    Calibre|    4|    1|
|        gwas| Aphex Twin|    1|    2|
|        gwas|      Jay Z|    1|    2|
|           x|   DJ Krush|    2|    1|
|           x|  Titeknots|    1|    1|
+------------+-----------+-----+-----+
As a bonus, I would like each bucket to be a different size. E.g. List(2, 2, 4, 10, 10, -1) would mean that the first bucket has 2 rows, the second has 2 rows, the third has 4 rows, etc., and the final bucket (-1) contains the remainder.
EDIT
(Another useful variation)
While implementing the answers, I realized that there's another variation which I would prefer:
Add a column to a DataFrame which chunks its rows into groups of N, without knowing the size of the DataFrame.
Example:
If N = 100 and the DataFrame has 800 rows, it chunks it into 8 buckets of 100. If the DataFrame has 950 rows, it will chunk it into 9 buckets of 100 and 1 bucket of 50. It should not require a scan/call to .count().
The example DataFrames are analogous to the ones above.
(meta: should I make a new question for this variation? I feel like "NTILE with a fixed bucket size" is a more elegant problem and probably more common than my original use-case)
If I understand you correctly, this can be handled by using an SQL expression:
import org.apache.spark.sql.functions.{expr,row_number,desc}
import org.apache.spark.sql.expressions.Window
// set up WindowSpec
val w1 = Window.partitionBy("feed").orderBy(desc("count"))
val L = List(2, 2, 4, 10, 10, -1)
// dynamically create SQL expression from the List `L` to map row_number into group-id
var sql_expr = "CASE"
var running_total = 0
for(i <- 1 to L.size) {
running_total += L(i-1)
sql_expr += (if(L(i-1) > 0) s" WHEN rn <= $running_total THEN $i " else s" ELSE $i END")
}
println(sql_expr)
//CASE WHEN rn <= 2 THEN 1 WHEN rn <= 4 THEN 2 WHEN rn <= 8 THEN 3 WHEN rn <= 18 THEN 4 WHEN rn <= 28 THEN 5 ELSE 6 END
val df_new = df.withColumn("rn", row_number().over(w1)).withColumn("group", expr(sql_expr)).drop("rn")
df_new.show
+----+----------+-----+-----+
|feed| artist|count|group|
+----+----------+-----+-----+
|gwas| Drax| 7| 1|
|gwas| Calibre| 4| 1|
|gwas|Aphex Twin| 1| 2|
|gwas| Jay Z| 1| 2|
| x| DJ Krush| 2| 1|
| x| Titeknots| 1| 1|
| y|Kanye West| 9| 1|
| y| Radiohead| 6| 1|
| y| Zero 7| 3| 2|
| y|Puts Marie| 1| 2|
+----+----------+-----+-----+
For a fixed N, just cast (row_number-1)/N + 1 to int:
val N = 2
val df_new = df.withColumn("group", ((row_number().over(w1)-1)/N+1).cast("int"))
This could work :
val bucketDef = List(2, 2, 4, 10, 10)
val bucketRunsum = bucketDef.scanLeft(1)( _ + _) // calc running sum
// maps a row-number to a bucket
val indexBucketMapping = bucketRunsum.zip(bucketRunsum.tail)
.zipWithIndex
.map{case ((start,end),index) => ((start,end),index+1)} // make index start at 1
// gives List(((1,3),1), ((3,5),2), ((5,9),3), ((9,19),4), ((19,29),5))
// udf to assign a bucket to a given row-number
val calcBucket = udf((rnb:Long) => indexBucketMapping
.find{case ((start,end),_) => start<=rnb && rnb < end}
.map(_._2) // get index
.getOrElse(indexBucketMapping.last._2+1) // is in last bucket
)
df
.withColumn("group",calcBucket(row_number().over(Window.partitionBy($"feed").orderBy($"count"))))
Alternatively (without a UDF), construct a DataFrame which maps a row-number to a bucket and then join:
val bucketSizeDef =List(2, 2, 4, 10, 10)
val bucketDef = (1 +: bucketSizeDef).zipWithIndex.map{case (bs,index) => (bs,index+1)}
.toDF("bucketSize","group")
.withColumn("i",sum($"bucketSize").over(Window.orderBy($"group")))
.withColumn("i_to",coalesce(lead($"i",1).over(Window.orderBy($"group")),lit(Long.MaxValue)))
.drop($"bucketSize")
bucketDef.show()
gives:
+-----+---+-------------------+
|group| i| i_to|
+-----+---+-------------------+
| 1| 1| 3|
| 2| 3| 5|
| 3| 5| 9|
| 4| 9| 19|
| 5| 19| 29|
| 6| 29|9223372036854775807|
+-----+---+-------------------+
then join to df:
df
.withColumn("rnb",row_number().over(Window.partitionBy($"feed").orderBy($"count")))
.join(broadcast(bucketDef),$"rnb">= $"i" and $"rnb"< $"i_to")
.drop("rnb","i","i_to")
Hi, I have 2 different DataFrames:
scala> d1.show()
+--------+-------+
|   fecha|eventos|
+--------+-------+
|20180404|      3|
|20180405|      7|
|20180406|     10|
|20180409|      4|
....

scala> d1.count()
res3: Long = 60

scala> d2.show()
+--------+----------+
|   fecha|TotalEvent|
+--------+----------+
|       0|     23534|
|20180322|        10|
|20180326|        50|
|20180402|         6|
|20180403|       118|
|20180404|      1110|
...

scala> d2.count()
res7: Long = 74
But I would like to join them by fecha without losing data, and then create a new column with the math operation (TotalEvent - eventos) * 100 / TotalEvent.
Something like this:
+---------+-------+----------+--------+
|fecha |eventos|TotalEvent| KPI |
+---------+-------+----------+--------+
| 0| | 23534 | 100.00|
| 20180322| | 10 | 100.00|
| 20180326| | 50 | 100.00|
| 20180402| | 6 | 100.00|
| 20180403| | 118 | 100.00|
| 20180404| 3 | 1110 | 99.73|
| 20180405| 7 | 1204 | 99.42|
| 20180406| 10 | 1526 | 99.34|
| 20180407| | 14 | 100.00|
| 20180409| 4 | 1230 | 99.67|
| 20180410| 11 | 1456 | 99.24|
| 20180411| 6 | 1572 | 99.62|
| 20180412| 5 | 1450 | 99.66|
| 20180413| 7 | 1214 | 99.42|
.....
The problem is that I can't find a way to do it.
When I use:
scala> d1.join(d2,d2("fecha").contains(d1("fecha")), "left").show()
I lose the data that isn't in both tables.
+--------+-------+--------+----------+
| fecha|eventos| fecha|TotalEvent|
+--------+-------+--------+----------+
|20180404| 3|20180404| 1110|
|20180405| 7|20180405| 1204|
|20180406| 10|20180406| 1526|
|20180409| 4|20180409| 1230|
|20180410| 11|20180410| 1456|
....
Additionally, how can I add a new column with the math operation?
Thank you
I would recommend left-joining df2 with df1 and calculating KPI based on whether eventos is null or not in the joined dataset (using when/otherwise):
import org.apache.spark.sql.functions._
val df1 = Seq(
("20180404", 3),
("20180405", 7),
("20180406", 10),
("20180409", 4)
).toDF("fecha", "eventos")
val df2 = Seq(
("0", 23534),
("20180322", 10),
("20180326", 50),
("20180402", 6),
("20180403", 118),
("20180404", 1110),
("20180405", 100),
("20180406", 100)
).toDF("fecha", "TotalEvent")
df2.
join(df1, Seq("fecha"), "left_outer").
withColumn( "KPI",
round( when($"eventos".isNull, 100.0).
otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"),
2
)
).show
// +--------+----------+-------+-----+
// | fecha|TotalEvent|eventos| KPI|
// +--------+----------+-------+-----+
// | 0| 23534| null|100.0|
// |20180322| 10| null|100.0|
// |20180326| 50| null|100.0|
// |20180402| 6| null|100.0|
// |20180403| 118| null|100.0|
// |20180404| 1110| 3|99.73|
// |20180405| 100| 7| 93.0|
// |20180406| 100| 10| 90.0|
// +--------+----------+-------+-----+
Note that if the more precise raw KPI is wanted instead, just remove the wrapping round( , 2).
I would do this in several steps: first join, then select the calculated column, then fill in the NAs:
val df2a = df2.withColumnRenamed("fecha", "fecha2") // to avoid ambiguous column names after the join
val df3 = df1.join(df2a, df1("fecha") === df2a("fecha2"), "outer")
val kpi = df3.withColumn("KPI", (($"TotalEvent" - $"eventos") / $"TotalEvent" * 100 as "KPI")).na.fill(100, Seq("KPI"))
kpi.show()
+--------+-------+--------+----------+-----------------+
| fecha|eventos| fecha2|TotalEvent| KPI|
+--------+-------+--------+----------+-----------------+
| null| null|20180402| 6| 100.0|
| null| null| 0| 23534| 100.0|
| null| null|20180322| 10| 100.0|
|20180404| 3|20180404| 1110|99.72972972972973|
|20180406| 10| null| null| 100.0|
| null| null|20180403| 118| 100.0|
| null| null|20180326| 50| 100.0|
|20180409| 4| null| null| 100.0|
|20180405| 7| null| null| 100.0|
+--------+-------+--------+----------+-----------------+
I solved the problem by mixing both of the suggestions I received:
val dfKPI = d1.join(right = d2, usingColumns = Seq("cliente", "fecha"), "outer")
  .orderBy("fecha")
  .withColumn("KPI", round(when($"eventos".isNull, 100.0).otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"), 2))
My input:
val df = sc.parallelize(Seq(
("0","car1", "success"),
("0","car1", "success"),
("0","car3", "success"),
("0","car2", "success"),
("1","car1", "success"),
("1","car2", "success"),
("0","car3", "success")
)).toDF("id", "item", "status")
My intermediate groupBy output looks like this:
val df2 = df.groupBy("id", "item").agg(count("item").alias("occurences"))
+---+----+----------+
| id|item|occurences|
+---+----+----------+
| 0|car3| 2|
| 0|car2| 1|
| 0|car1| 2|
| 1|car2| 1|
| 1|car1| 1|
+---+----+----------+
The output I would like is the sum of occurrences of item for each id, skipping the occurrences value of the current row's item.
For example in the output table below, car3 appeared for id "0" 2 times, car 2 appeared 1 time and car 1 appeared 2 times.
So for id "0", the sum of other occurrences for its "car3" item would be value of car2(1) + car1(2) = 3.
For the same id "0", the sum of other occurrences for its "car2" item would be value of car3(2) + car1(2) = 4.
This continues for the rest. Sample output
+---+----+----------+----------------------+
| id|item|occurences| other_occurences_sum |
+---+----+----------+----------------------+
| 0|car3| 2| 3 |<- (car2+car1) for id 0
| 0|car2| 1| 4 |<- (car3+car1) for id 0
| 0|car1| 2| 3 |<- (car3+car2) for id 0
| 1|car2| 1| 1 |<- (car1) for id 1
| 1|car1| 1| 1 |<- (car2) for id 1
+---+----+----------+----------------------+
That's a perfect target for a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
val w = Window.partitionBy("id")
df2.withColumn(
"other_occurences_sum", sum($"occurences").over(w) - $"occurences"
).show
// +---+----+----------+--------------------+
// | id|item|occurences|other_occurences_sum|
// +---+----+----------+--------------------+
// | 0|car3| 2| 3|
// | 0|car2| 1| 4|
// | 0|car1| 2| 3|
// | 1|car2| 1| 1|
// | 1|car1| 1| 1|
// +---+----+----------+--------------------+
where sum($"occurences").over(w) is a sum of all occurrences for the current id. Of course join is also valid:
df2.join(
df2.groupBy("id").agg(sum($"occurences") as "total"), Seq("id")
).select(
$"*", ($"total" - $"occurences") as "other_occurences_sum"
).show
// +---+----+----------+--------------------+
// | id|item|occurences|other_occurences_sum|
// +---+----+----------+--------------------+
// | 0|car3| 2| 3|
// | 0|car2| 1| 4|
// | 0|car1| 2| 3|
// | 1|car2| 1| 1|
// | 1|car1| 1| 1|
// +---+----+----------+--------------------+