Spark DataFrame Scala: add new columns based on some conditions

I revised my question so that it is easier to understand.
The original df looks like this:
+---+----------+-------+----+------+
| id|tim |price | qty|qtyChg|
+---+----------+-------+----+------+
| 1| 31951.509| 0.370| 1| 1|
| 2| 31951.515|145.380| 100| 100|
| 3| 31951.519|149.370| 100| 100|
| 4| 31951.520|144.370| 100| 100|
| 5| 31951.520|119.370| 5| 5|
| 6| 31951.520|149.370| 300| 200|
| 7| 31951.521|149.370| 400| 100|
| 8| 31951.522|149.370| 410| 10|
| 9| 31951.522|149.870| 50| 50|
| 10| 31951.522|109.370| 50| 50|
| 11| 31951.522|144.370| 400| 300|
| 12| 31951.524|149.370| 610| 200|
| 13| 31951.526|135.130| 22| 22|
| 14| 31951.527|149.370| 750| 140|
| 15| 31951.528| 89.370| 100| 100|
| 16| 31951.528|145.870| 50| 50|
| 17| 31951.528|139.370| 100| 100|
| 18| 31951.531|144.370| 410| 10|
| 19| 31951.531|149.370| 769| 19|
| 20| 31951.538|149.370| 869| 100|
| 21| 31951.538|144.880| 200| 200|
| 22| 31951.541|139.370| 221| 121|
| 23| 31951.542|149.370|1199| 330|
| 24| 31951.542|139.370| 236| 15|
| 25| 31951.542|144.370| 510| 100|
| 26| 31951.543|146.250| 50| 50|
| 27| 31951.543|143.820| 100| 100|
| 28| 31951.543|139.370| 381| 145|
| 29| 31951.544|149.370|1266| 67|
| 30| 31951.544|150.000| 50| 50|
| 31| 31951.544|137.870| 300| 300|
| 32| 31951.544|140.470| 10| 10|
| 33| 31951.545|150.000| 53| 3|
| 34| 31951.545|140.000| 25| 25|
| 35| 31951.545|148.310| 8| 8|
| 36| 31951.547|149.000| 20| 20|
| 37| 31951.549|143.820| 102| 2|
| 38| 31951.549|150.110| 75| 75|
+---+----------+-------+----+------+
Then I run this code:
val ww = Window.partitionBy().orderBy($"tim")
val step1 = df.withColumn("sequence",sort_array(collect_set(col("price")).over(ww),asc=false))
.withColumn("top1price",col("sequence").getItem(0))
.withColumn("top2price",col("sequence").getItem(1))
.drop("sequence")
The new dataframe looks like this:
+---+---------+-------+----+------+---------+---------+
| id| tim| price| qty|qtyChg|top1price|top2price|
+---+---------+-------+----+------+---------+---------+
| 1|31951.509| 0.370| 1| 1| 0.370| null|
| 2|31951.515|145.380| 100| 100| 145.380| 0.370|
| 3|31951.519|149.370| 100| 100| 149.370| 145.380|
| 4|31951.520|149.370| 300| 200| 149.370| 145.380|
| 5|31951.520|144.370| 100| 100| 149.370| 145.380|
| 6|31951.520|119.370| 5| 5| 149.370| 145.380|
| 7|31951.521|149.370| 400| 100| 149.370| 145.380|
| 8|31951.522|109.370| 50| 50| 149.870| 149.370|
| 9|31951.522|144.370| 400| 300| 149.870| 149.370|
| 10|31951.522|149.870| 50| 50| 149.870| 149.370|
| 11|31951.522|149.370| 410| 10| 149.870| 149.370|
| 12|31951.524|149.370| 610| 200| 149.870| 149.370|
| 13|31951.526|135.130| 22| 22| 149.870| 149.370|
| 14|31951.527|149.370| 750| 140| 149.870| 149.370|
| 15|31951.528| 89.370| 100| 100| 149.870| 149.370|
| 16|31951.528|139.370| 100| 100| 149.870| 149.370|
| 17|31951.528|145.870| 50| 50| 149.870| 149.370|
| 18|31951.531|144.370| 410| 10| 149.870| 149.370|
| 19|31951.531|149.370| 769| 19| 149.870| 149.370|
| 20|31951.538|144.880| 200| 200| 149.870| 149.370|
| 21|31951.538|149.370| 869| 100| 149.870| 149.370|
| 22|31951.541|139.370| 221| 121| 149.870| 149.370|
| 23|31951.542|144.370| 510| 100| 149.870| 149.370|
| 24|31951.542|139.370| 236| 15| 149.870| 149.370|
| 25|31951.542|149.370|1199| 330| 149.870| 149.370|
| 26|31951.543|139.370| 381| 145| 149.870| 149.370|
| 27|31951.543|143.820| 100| 100| 149.870| 149.370|
| 28|31951.543|146.250| 50| 50| 149.870| 149.370|
| 29|31951.544|140.470| 10| 10| 150.000| 149.870|
| 30|31951.544|137.870| 300| 300| 150.000| 149.870|
| 31|31951.544|150.000| 50| 50| 150.000| 149.870|
| 32|31951.544|149.370|1266| 67| 150.000| 149.870|
| 33|31951.545|140.000| 25| 25| 150.000| 149.870|
| 34|31951.545|150.000| 53| 3| 150.000| 149.870|
| 35|31951.545|148.310| 8| 8| 150.000| 149.870|
| 36|31951.547|149.000| 20| 20| 150.000| 149.870|
| 37|31951.549|150.110| 75| 75| 150.110| 150.000|
| 38|31951.549|143.820| 102| 2| 150.110| 150.000|
+---+---------+-------+----+------+---------+---------+
I am hoping to add two new columns, top1priceQty and top2priceQty, which store the most recent qty corresponding to top1price and top2price.
For example, in row 6, top1price = 149.370; based on this value, I want its corresponding qty, which is 400 (not 100 or 300). In row 33, when top1price = 150.000, I want its corresponding qty, which is 53 (coming from row 32), not 50 from row 28. The same rule applies to top2price.
Thank you all in advance!

You were very close to the answer yourself. Instead of collecting a set of just one column, collect a set of arrays holding 'LMTPRICE' and its corresponding 'qty' (in the output below, 'LMTPRICE', 'INTEREST_TIME', and 'index' correspond to price, tim, and id in the question's df). Then use getItem(0).getItem(0) for top1price and getItem(0).getItem(2) for top1priceQty. To keep the ordering by INTEREST_TIME, so that the most recent qty is picked, place INTEREST_TIME after 'LMTPRICE' and before 'qty' in the array.
df.withColumn("sequence",sort_array(collect_set(array("LMTPRICE","INTEREST_TIME","qty")).over(ww),asc=false)).withColumn("top1price",col("sequence").getItem(0).getItem(0)).withColumn("top1priceQty",col("sequence").getItem(0).getItem(2).cast("int")).drop("sequence").show(false)
+-----+-------------+--------+---+------+---------+------------+
|index|INTEREST_TIME|LMTPRICE|qty|qtyChg|top1price|top1priceQty|
+-----+-------------+--------+---+------+---------+------------+
|0 |31951.509 |0.37 |1 |1 |0.37 |1 |
|1 |31951.515 |145.38 |100|100 |145.38 |100 |
|2 |31951.519 |149.37 |100|100 |149.37 |100 |
|3 |31951.52 |119.37 |5 |5 |149.37 |300 |
|4 |31951.52 |144.37 |100|100 |149.37 |300 |
|5 |31951.52 |149.37 |300|200 |149.37 |300 |
|6 |31951.521 |149.37 |400|100 |149.37 |400 |
|7 |31951.522 |149.87 |50 |50 |149.87 |50 |
|8 |31951.522 |149.37 |410|10 |149.87 |50 |
|9 |31951.522 |109.37 |50 |50 |149.87 |50 |
|10 |31951.522 |144.37 |400|300 |149.87 |50 |
|11 |31951.524 |149.87 |610|200 |149.87 |610 |
|12 |31951.526 |135.13 |22 |22 |149.87 |610 |
|13 |31951.527 |149.37 |750|140 |149.87 |610 |
|14 |31951.528 |139.37 |100|100 |149.87 |610 |
|15 |31951.528 |145.87 |50 |50 |149.87 |610 |
|16 |31951.528 |89.37 |100|100 |149.87 |610 |
|17 |31951.531 |144.37 |410|10 |149.87 |610 |
|18 |31951.531 |149.37 |769|19 |149.87 |610 |
|19 |31951.538 |149.37 |869|100 |149.87 |610 |
+-----+-------------+--------+---+------+---------+------------+
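This covers top1price and top1priceQty. If you also need top2price and top2priceQty, one possible extension is to drop every entry that shares the top price before taking the next item. The snippet below is only a sketch (untested), assumes Spark 2.4+ for the filter higher-order function, and uses the column names from the question's df (tim, price, qty):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val ww = Window.partitionBy().orderBy($"tim")
val step2 = df
  .withColumn("sequence", sort_array(collect_set(array($"price", $"tim", $"qty")).over(ww), asc = false))
  .withColumn("top1price", $"sequence".getItem(0).getItem(0))
  .withColumn("top1priceQty", $"sequence".getItem(0).getItem(2).cast("int"))
  // keep only entries strictly below top1price; the first remaining entry is then
  // the latest row seen at the second-highest price
  .withColumn("rest", expr("filter(sequence, x -> x[0] < top1price)"))
  .withColumn("top2price", $"rest".getItem(0).getItem(0))
  .withColumn("top2priceQty", $"rest".getItem(0).getItem(2).cast("int"))
  .drop("sequence", "rest")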

Related

To generate the numbers in between a range within a PySpark data frame

Input dataset:
Output dataset:
Basically, I want to add one more column, "new_month", whose values are the numbers between "dvpt_month" and "lead_month", while all other columns' values stay the same for the new rows generated in between these months.
I want to do it with PySpark.
You can do it by creating an array with sequence between the two columns and then exploding that array to get one row per value:
from pyspark.sql import functions as F

daf = spark.createDataFrame([(12, 24), (24, 36), (36, 48)], "col1 int, col2 int")
daf.withColumn("arr", F.sequence(F.col("col1"), F.col("col2") - 1)).select("col1", "col2", F.explode("arr").alias("col3")).show()
#output
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 12| 24| 12|
| 12| 24| 13|
| 12| 24| 14|
| 12| 24| 15|
| 12| 24| 16|
| 12| 24| 17|
| 12| 24| 18|
| 12| 24| 19|
| 12| 24| 20|
| 12| 24| 21|
| 12| 24| 22|
| 12| 24| 23|
| 24| 36| 24|
| 24| 36| 25|
| 24| 36| 26|
| 24| 36| 27|
| 24| 36| 28|
| 24| 36| 29|
| 24| 36| 30|
| 24| 36| 31|
+----+----+----+
only showing top 20 rows
Edit: sequence is available in Spark version >= 2.4.0. In earlier versions you can try to use range or map to generate a similar array.

Unable to get the result from the window function

+---------------+--------+
|YearsExperience| Salary|
+---------------+--------+
| 1.1| 39343.0|
| 1.3| 46205.0|
| 1.5| 37731.0|
| 2.0| 43525.0|
| 2.2| 39891.0|
| 2.9| 56642.0|
| 3.0| 60150.0|
| 3.2| 54445.0|
| 3.2| 64445.0|
| 3.7| 57189.0|
| 3.9| 63218.0|
| 4.0| 55794.0|
| 4.0| 56957.0|
| 4.1| 57081.0|
| 4.5| 61111.0|
| 4.9| 67938.0|
| 5.1| 66029.0|
| 5.3| 83088.0|
| 5.9| 81363.0|
| 6.0| 93940.0|
| 6.8| 91738.0|
| 7.1| 98273.0|
| 7.9|101302.0|
| 8.2|113812.0|
| 8.7|109431.0|
| 9.0|105582.0|
| 9.5|116969.0|
| 9.6|112635.0|
| 10.3|122391.0|
| 10.5|121872.0|
+---------------+--------+
I want to find the highest salary from the above data, which is 122391.0.
My Code
val top= Window.partitionBy("id").orderBy(col("Salary").desc)
val res= df1.withColumn("top", rank().over(top))
Result
+---------------+--------+---+---+
|YearsExperience| Salary| id|top|
+---------------+--------+---+---+
| 1.1| 39343.0| 0| 1|
| 1.3| 46205.0| 1| 1|
| 1.5| 37731.0| 2| 1|
| 2.0| 43525.0| 3| 1|
| 2.2| 39891.0| 4| 1|
| 2.9| 56642.0| 5| 1|
| 3.0| 60150.0| 6| 1|
| 3.2| 54445.0| 7| 1|
| 3.2| 64445.0| 8| 1|
| 3.7| 57189.0| 9| 1|
| 3.9| 63218.0| 10| 1|
| 4.0| 55794.0| 11| 1|
| 4.0| 56957.0| 12| 1|
| 4.1| 57081.0| 13| 1|
| 4.5| 61111.0| 14| 1|
| 4.9| 67938.0| 15| 1|
| 5.1| 66029.0| 16| 1|
| 5.3| 83088.0| 17| 1|
| 5.9| 81363.0| 18| 1|
| 6.0| 93940.0| 19| 1|
| 6.8| 91738.0| 20| 1|
| 7.1| 98273.0| 21| 1|
| 7.9|101302.0| 22| 1|
| 8.2|113812.0| 23| 1|
| 8.7|109431.0| 24| 1|
| 9.0|105582.0| 25| 1|
| 9.5|116969.0| 26| 1|
| 9.6|112635.0| 27| 1|
| 10.3|122391.0| 28| 1|
| 10.5|121872.0| 29| 1|
+---------------+--------+---+---+
I have also tried partitioning by Salary and ordering by id, but the result was the same.
As you can see, 122391 appears near the bottom, but it should come in first position given the ordering I applied.
Can anybody please help me find what is wrong?
Are you sure you need a window function here? The window you defined partitions the data by id, which I assume is unique, so each group produced by the window will only have one row. It looks like you want a window over the entire dataframe, which means you don't actually need one. If you just want to add a column with the max, you can get the max using an aggregation on your original dataframe and cross join with it:
val maxDF = df1.agg(max("salary").as("top"))
val res = df1.crossJoin(maxDF)
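If you do prefer a window-based version, a sketch of the "window over the entire dataframe" idea would be an unpartitioned window (note that Spark will move all rows to a single partition for this, so it only makes sense for small data):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

// an empty partitionBy() defines one window spanning the whole dataframe
val res = df1.withColumn("top", max("Salary").over(Window.partitionBy()))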

Taking sum in spark-scala based on a condition

I have a data frame like this. How can I take the sum of the column Sales where Rank is greater than 3, per 'M'?
+---+-----+----+
| M|Sales|Rank|
+---+-----+----+
| M1| 200| 1|
| M1| 175| 2|
| M1| 150| 3|
| M1| 125| 4|
| M1| 90| 5|
| M1| 85| 6|
| M2| 1001| 1|
| M2| 500| 2|
| M2| 456| 3|
| M2| 345| 4|
| M2| 231| 5|
| M2| 123| 6|
+---+-----+----+
Expected Output --
+---+-----+----+---------------+
| M|Sales|Rank|SumGreaterThan3|
+---+-----+----+---------------+
| M1| 200| 1| 300|
| M1| 175| 2| 300|
| M1| 150| 3| 300|
| M1| 125| 4| 300|
| M1| 90| 5| 300|
| M1| 85| 6| 300|
| M2| 1001| 1| 699|
| M2| 500| 2| 699|
| M2| 456| 3| 699|
| M2| 345| 4| 699|
| M2| 231| 5| 699|
| M2| 123| 6| 699|
+---+-----+----+---------------+
I have tried a sum over a window like this:
df.withColumn("SumGreaterThan3", sum("Sales").over(Window.partitionBy(col("M")))) // But this gives the total sum of Sales.
To replicate the same DF-
val df = Seq(
("M1",200,1),
("M1",175,2),
("M1",150,3),
("M1",125,4),
("M1",90,5),
("M1",85,6),
("M2",1001,1),
("M2",500,2),
("M2",456,3),
("M2",345,4),
("M2",231,5),
("M2",123,6)
).toDF("M","Sales","Rank")
The partition alone is enough to define the window. You then apply a conditional sum by combining sum and when:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{sum, when}

val w = Window.partitionBy("M")
df.withColumn("SumGreaterThan3", sum(when('Rank > 3, 'Sales).otherwise(0)).over(w)).show
This gives you the expected result.

Spark Dataframe ORDER BY giving mixed combination(asc + desc)

I have a DataFrame whose column I want to sort in descending order when the count value is greater than 10.
But I'm getting a mixed combination: ascending for a couple of records, then descending, then ascending again, and so on.
I'm using the orderBy() function, which sorts records in ascending order by default.
Since I'm new to Scala and Spark, I don't understand why this is happening.
df.groupBy("Value").count().filter("count>5.0").orderBy("Value").show(1000);
Reading the CSV:
val df = sparkSession
.read
.option("header", "true")
.option("inferSchema", "true")
.csv("src/main/resources/test.csv")
.toDF("Country_Code", "Country","Data_Source","Data_File","Category","Metric","Time","Data_Cut1","Option1_Dummy","Option1_Visible","Value")````
The records I get by executing the above code:
+-------+-----+
| Value|count|
+-------+-----+
| 0| 225|
| 0.01| 12|
| 0.02| 13|
| 0.03| 12|
| 0.04| 15|
| 0.05| 9|
| 0.06| 11|
| 0.07| 9|
| 0.08| 6|
| 0.09| 10|
| 0.1| 66|
| 0.11| 12|
| 0.12| 9|
| 0.13| 12|
| 0.14| 8|
| 0.15| 10|
| 0.16| 14|
| 0.17| 11|
| 0.18| 14|
| 0.19| 21|
| 0.2| 78|
| 0.21| 16|
| 0.22| 15|
| 0.23| 13|
| 0.24| 7|
| 0.3| 85|
| 0.31| 7|
| 0.34| 8|
| 0.4| 71|
| 0.5| 103|
| 0.6| 102|
| 0.61| 6|
| 0.62| 9|
| 0.69| 7|
| 0.7| 98|
| 0.72| 6|
| 0.74| 8|
| 0.78| 7|
| 0.8| 71|
| 0.81| 10|
| 0.82| 9|
| 0.83| 8|
| 0.84| 6|
| 0.86| 8|
| 0.87| 10|
| 0.88| 12|
| 0.9| 95|
| 0.91| 9|
| 0.93| 6|
| 0.94| 6|
| 0.95| 8|
| 0.98| 8|
| 0.99| 6|
| 1| 254|
| 1.08| 8|
| 1.1| 80|
| 1.11| 6|
| 1.15| 9|
| 1.17| 7|
| 1.18| 6|
| 1.19| 9|
| 1.2| 94|
| 1.25| 7|
| 1.3| 91|
| 1.32| 8|
| 1.4| 215|
| 1.45| 7|
| 1.5| 320|
| 1.56| 6|
| 1.6| 280|
| 1.64| 6|
| 1.66| 10|
| 1.7| 310|
| 1.72| 7|
| 1.74| 6|
| 1.8| 253|
| 1.9| 117|
| 10| 78|
| 10.1| 45|
| 10.2| 49|
| 10.3| 30|
| 10.4| 40|
| 10.5| 38|
| 10.6| 52|
| 10.7| 35|
| 10.8| 39|
| 10.9| 42|
| 10.96| 7|------------mark
| 100| 200|
| 101.3| 7|
| 101.8| 8|
| 102| 6|
| 102.2| 6|
| 102.7| 8|
| 103.2| 6|--------------here
| 11| 93|
| 11.1| 32|
| 11.2| 38|
| 11.21| 6|
| 11.3| 42|
| 11.4| 32|
| 11.5| 34|
| 11.6| 38|
| 11.69| 6|
| 11.7| 42|
| 11.8| 25|
| 11.86| 6|
| 11.9| 39|
| 11.96| 9|
| 12| 108|
| 12.07| 7|
| 12.1| 31|
| 12.11| 6|
| 12.2| 34|
| 12.3| 28|
| 12.39| 6|
| 12.4| 32|
| 12.5| 31|
| 12.54| 7|
| 12.57| 6|
| 12.6| 18|
| 12.7| 33|
| 12.8| 20|
| 12.9| 21|
| 13| 85|
| 13.1| 25|
| 13.2| 19|
| 13.3| 30|
| 13.34| 6|
| 13.4| 32|
| 13.5| 16|
| 13.6| 15|
| 13.7| 31|
| 13.8| 8|
| 13.83| 7|
| 13.89| 7|
| 14| 46|
| 14.1| 10|
| 14.3| 10|
| 14.4| 7|
| 14.5| 15|
| 14.7| 6|
| 14.9| 11|
| 15| 52|
| 15.2| 6|
| 15.3| 9|
| 15.4| 12|
| 15.5| 21|
| 15.6| 11|
| 15.7| 14|
| 15.8| 18|
| 15.9| 18|
| 16| 44|
| 16.1| 30|
| 16.2| 26|
| 16.3| 29|
| 16.4| 26|
| 16.5| 32|
| 16.6| 42|
| 16.7| 44|
| 16.72| 6|
| 16.8| 40|
| 16.9| 54|
| 17| 58|
| 17.1| 48|
| 17.2| 51|
| 17.3| 47|
| 17.4| 57|
| 17.5| 51|
| 17.6| 51|
| 17.7| 46|
| 17.8| 33|
| 17.9| 38|---------again
|1732.04| 6|
| 18| 49|
| 18.1| 21|
| 18.2| 23|
| 18.3| 29|
| 18.4| 22|
| 18.5| 22|
| 18.6| 17|
| 18.7| 13|
| 18.8| 13|
| 18.9| 19|
| 19| 36|
| 19.1| 15|
| 19.2| 13|
| 19.3| 12|
| 19.4| 15|
| 19.5| 15|
| 19.6| 15|
| 19.7| 15|
| 19.8| 14|
| 19.9| 9|
| 2| 198|------------see after 19 again 2 came
| 2.04| 7|
| 2.09| 8|
| 2.1| 47|
| 2.16| 6|
| 2.17| 8|
| 2.2| 55|
| 2.24| 6|
| 2.26| 7|
| 2.27| 6|
| 2.29| 8|
| 2.3| 53|
| 2.4| 33|
| 2.5| 36|
| 2.54| 6|
| 2.59| 6|
Can you tell me what I am doing wrong?
My dataframe has the columns
"Country_Code", "Country","Data_Source","Data_File","Category","Metric","Time","Data_Cut1","Option1_Dummy","Option1_Visible","Value"
As we talked about in the comments, it seems your Value column is of type String. You can cast it to Double (for instance) to order it numerically.
These lines will cast the Value column to DoubleType:
import org.apache.spark.sql.types._
df.withColumn("Value", $"Value".cast(DoubleType))
EXAMPLE INPUT
df.show
+-----+-------+
|Value|another|
+-----+-------+
| 10.0| b|
| 2| a|
+-----+-------+
With Value as Strings
df.orderBy($"Value").show
+-----+-------+
|Value|another|
+-----+-------+
| 10.0| b|
| 2| a|
+-----+-------+
Casting Value as Double
df.withColumn("Value", $"Value".cast(DoubleType)).orderBy($"Value").show
+-----+-------+
|Value|another|
+-----+-------+
| 2.0| a|
| 10.0| b|
+-----+-------+
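Applied back to the original query, the same cast would look roughly like this (a sketch, assuming the same df and the count > 5.0 filter from the question):
import org.apache.spark.sql.types.DoubleType

df.groupBy("Value").count()
  .filter("count > 5.0")
  .withColumn("Value", $"Value".cast(DoubleType))
  .orderBy($"Value")   // or orderBy($"Value".desc) for descending, as the question asks
  .show(1000)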

How to calculate row mean before and after a given index for each row - pyspark?

I have a data frame with multiple columns and an index, and I have to calculate the mean of those columns before and after the index.
This is my pandas code:
for i in range(len(res.index)):
    i = int(i)
    m = int(res['index'].ix[i])
    n = len(res.columns[1:m])
    if n == 0:
        res['mean'].ix[i] = 0
    else:
        res['mean'].ix[i] = int(res.ix[i, 1:m].sum()) / n
I want to do this in PySpark. Any help, please!
You can calculate this using a UDF in PySpark. Here is an example:
from pyspark.sql import functions as F
from pyspark.sql import types as T
import numpy as np
sample_data = sqlContext.createDataFrame([
    list(range(10)) + [4],
    list(range(50, 60)) + [2],
    list(range(9, 19)) + [4],
    list(range(19, 29)) + [3],
], ["col_" + str(i) for i in range(10)] + ["index"])
sample_data.show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|index|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9| 4|
| 50| 51| 52| 53| 54| 55| 56| 57| 58| 59| 2|
| 9| 10| 11| 12| 13| 14| 15| 16| 17| 18| 4|
| 19| 20| 21| 22| 23| 24| 25| 26| 27| 28| 3|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
def def_mn(data, index, mean="pre"):
    # mean of the values before the index ("pre") or from the index onwards ("post")
    if mean == "pre":
        return sum(data[:index]) / float(len(data[:index]))
    elif mean == "post":
        return sum(data[index:]) / float(len(data[index:]))

mn_udf = F.udf(def_mn)

sample_data.withColumn(
    "index_pre_mean",
    mn_udf(F.array([cl for cl in sample_data.columns[:-1]]), "index")
).withColumn(
    "index_post_mean",
    mn_udf(F.array([cl for cl in sample_data.columns[:-1]]), "index", F.lit("post"))
).show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|index|index_pre_mean|index_post_mean|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
|0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |4 |1.5 |6.5 |
|50 |51 |52 |53 |54 |55 |56 |57 |58 |59 |2 |50.5 |55.5 |
|9 |10 |11 |12 |13 |14 |15 |16 |17 |18 |4 |10.5 |15.5 |
|19 |20 |21 |22 |23 |24 |25 |26 |27 |28 |3 |20.0 |25.0 |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+