Unable to get the result from the window function - scala

+---------------+--------+
|YearsExperience| Salary|
+---------------+--------+
| 1.1| 39343.0|
| 1.3| 46205.0|
| 1.5| 37731.0|
| 2.0| 43525.0|
| 2.2| 39891.0|
| 2.9| 56642.0|
| 3.0| 60150.0|
| 3.2| 54445.0|
| 3.2| 64445.0|
| 3.7| 57189.0|
| 3.9| 63218.0|
| 4.0| 55794.0|
| 4.0| 56957.0|
| 4.1| 57081.0|
| 4.5| 61111.0|
| 4.9| 67938.0|
| 5.1| 66029.0|
| 5.3| 83088.0|
| 5.9| 81363.0|
| 6.0| 93940.0|
| 6.8| 91738.0|
| 7.1| 98273.0|
| 7.9|101302.0|
| 8.2|113812.0|
| 8.7|109431.0|
| 9.0|105582.0|
| 9.5|116969.0|
| 9.6|112635.0|
| 10.3|122391.0|
| 10.5|121872.0|
+---------------+--------+
I want to find the top highest salary from the above data which is 122391.0
My Code
val top= Window.partitionBy("id").orderBy(col("Salary").desc)
val res= df1.withColumn("top", rank().over(top))
Result
+---------------+--------+---+---+
|YearsExperience| Salary| id|top|
+---------------+--------+---+---+
| 1.1| 39343.0| 0| 1|
| 1.3| 46205.0| 1| 1|
| 1.5| 37731.0| 2| 1|
| 2.0| 43525.0| 3| 1|
| 2.2| 39891.0| 4| 1|
| 2.9| 56642.0| 5| 1|
| 3.0| 60150.0| 6| 1|
| 3.2| 54445.0| 7| 1|
| 3.2| 64445.0| 8| 1|
| 3.7| 57189.0| 9| 1|
| 3.9| 63218.0| 10| 1|
| 4.0| 55794.0| 11| 1|
| 4.0| 56957.0| 12| 1|
| 4.1| 57081.0| 13| 1|
| 4.5| 61111.0| 14| 1|
| 4.9| 67938.0| 15| 1|
| 5.1| 66029.0| 16| 1|
| 5.3| 83088.0| 17| 1|
| 5.9| 81363.0| 18| 1|
| 6.0| 93940.0| 19| 1|
| 6.8| 91738.0| 20| 1|
| 7.1| 98273.0| 21| 1|
| 7.9|101302.0| 22| 1|
| 8.2|113812.0| 23| 1|
| 8.7|109431.0| 24| 1|
| 9.0|105582.0| 25| 1|
| 9.5|116969.0| 26| 1|
| 9.6|112635.0| 27| 1|
| 10.3|122391.0| 28| 1|
| 10.5|121872.0| 29| 1|
+---------------+--------+---+---+
Also I have choosed partioned by salary and orderby id.
<br>
But the result was same.
As you can see 122391 is coming just below the above but it should come in first position as i have done ascending.
Please help anybody find any things

Are you sure you need a window function here? The window you defined partitions the data by id, which I assume is unique, so each group produced by the window will only have one row. It looks like you want a window over the entire dataframe, which means you don't actually need one. If you just want to add a column with the max, you can get the max using an aggregation on your original dataframe and cross join with it:
val maxDF = df1.agg(max("salary").as("top"))
val res = df1.crossJoin(maxDF)

Related

How does Spark DataFrame find out some lines that only appear once?

I want to eliminate some rows that only appear once in the ‘county’ column, which is not conducive to my statistics.
I used groupBy+count to find:
fault_data.groupBy("county").count().show()
The data looks like this:
+----------+-----+
| county|count|
+----------+-----+
| A| 117|
| B| 31|
| C| 1|
| D| 272|
| E| 1|
| F| 1|
| G| 280|
| H| 1|
| I| 1|
| J| 1|
| K| 112|
| L| 63|
| M| 18|
| N| 71|
| O| 1|
| P| 1|
| Q| 82|
| R| 2|
| S| 31|
| T| 2|
+----------+-----+
Next, I want to eliminate the data whose count is 1.
But when I wrote it like this, it was wrong:
fault_data.filter("count(county)=1").show()
The result is:
Aggregate/Window/Generate expressions are not valid in where clause of the query.
Expression in where clause: [(count(county) = CAST(1 AS BIGINT))]
Invalid expressions: [count(county)];
Filter (count(county#7) = cast(1 as bigint))
+- Relation [fault_id#0,fault_type#1,acs_way#2,fault_1#3,fault_2#4,province#5,city#6,county#7,town#8,detail#9,num#10,insert_time#11] JDBCRelation(fault_data) [numPartitions=1]
So I want to know the right way, thank you.
fault_data.groupBy("county").count().where(col("count")===1).show()

Pyspark combine different rows base on a column

I have a dataframe
+----------------+------------+-----+
| Sport|Total_medals|count|
+----------------+------------+-----+
| Alpine Skiing| 3| 4|
| Alpine Skiing| 2| 18|
| Alpine Skiing| 4| 1|
| Alpine Skiing| 1| 38|
| Archery| 2| 12|
| Archery| 1| 72|
| Athletics| 2| 50|
| Athletics| 1| 629|
| Athletics| 3| 8|
| Badminton| 2| 5|
| Badminton| 1| 86|
| Baseball| 1| 216|
| Basketball| 1| 287|
|Beach Volleyball| 1| 48|
| Biathlon| 4| 1|
| Biathlon| 3| 9|
| Biathlon| 1| 61|
| Biathlon| 2| 23|
| Bobsleigh| 2| 6|
| Bobsleigh| 1| 60|
+----------------+------------+-----+
Is there a way for me to combine the value of counts from multiple rows if they are from the same sport?
For example, if Sport = Alpine Skiing I would have something like this:
+----------------+-----+
| Sport|count|
+----------------+-----+
| Alpine Skiing| 61|
+----------------+-----+
where count is equal to 4+18+1+38 = 61. I would like to do this for all sports
any help would be appreciated
You need to groupby on the Sport column and then aggregate the count column with the sum() function.
Example:
import pyspark.sql.functions as F
grouped_df = df.groupby('Sport').agg(F.sum('count'))

How to count change in row values in pyspark

Logic to count the change in the row values of a given column
Input
df22 = spark.createDataFrame(
[(1, 1.0), (1,22.0), (1,22.0), (1,21.0), (1,20.0), (2, 3.0), (2,3.0),
(2, 5.0), (2, 10.0), (2,3.0), (3,11.0), (4, 11.0), (4,15.0), (1,22.0)],
("id", "v"))
+---+----+
| id| v|
+---+----+
| 1| 1.0|
| 1|22.0|
| 1|22.0|
| 1|21.0|
| 1|20.0|
| 2| 3.0|
| 2| 3.0|
| 2| 5.0|
| 2|10.0|
| 2| 3.0|
| 3|11.0|
| 4|11.0|
| 4|15.0|
+---+----+
Expect output
+---+----+---+
| id| v| c|
+---+----+---+
| 1| 1.0| 0|
| 1|22.0| 1|
| 1|22.0| 1|
| 1|21.0| 2|
| 1|20.0| 3|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 5.0| 1|
| 2|10.0| 2|
| 2| 3.0| 3|
| 3|11.0| 0|
| 4|11.0| 0|
| 4|15.0| 1|
+---+----+---+
Any help on this will be greatly appreciated
Thanks in advance
Ramabadran
Before adding answer, I would like to ask you ,"what you have tried ??". Please try something from your end and then seek for support in this platform. Also your question is not clear. You have not provided if you are looking for a delta capture count per 'id' or as a whole. Just giving an expected output is not going to make the question clear.
And now comes to your question , if I understood it correctly from the sample input and output,you need delta capture count per 'id'. So one way to achieve it as below
#Capture the incremented count using lag() and sum() over below mentioned window
import pyspark.sql.functions as F
from pyspark.sql.window import Window
winSpec=Window.partitionBy('id').orderBy('v') # Your Window for capturing the incremented count
df22.\
withColumn('prev',F.coalesce(F.lag('v').over(winSpec),F.col('v'))).\
withColumn('c',F.sum(F.expr("case when v-prev<>0 then 1 else 0 end")).over(winSpec)).\
drop('prev').\
orderBy('id','v').\
show()
+---+----+---+
| id| v| c|
+---+----+---+
| 1| 1.0| 0|
| 1|20.0| 1|
| 1|21.0| 2|
| 1|22.0| 3|
| 1|22.0| 3|
| 1|22.0| 3|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 5.0| 1|
| 2|10.0| 2|
| 3|11.0| 0|
| 4|11.0| 0|
| 4|15.0| 1|
+---+----+---+

pySpark windows partition sortby instead of order by (exclamation marks)

this is my current dataset
+----------+--------------------+---------+--------+
|session_id| timestamp| item_id|category|
+----------+--------------------+---------+--------+
| 1|2014-04-07 10:51:...|214536502| 0|
| 1|2014-04-07 10:54:...|214536500| 0|
| 1|2014-04-07 10:54:...|214536506| 0|
| 1|2014-04-07 10:57:...|214577561| 0|
| 2|2014-04-07 13:56:...|214662742| 0|
| 2|2014-04-07 13:57:...|214662742| 0|
| 2|2014-04-07 13:58:...|214825110| 0|
| 2|2014-04-07 13:59:...|214757390| 0|
| 2|2014-04-07 14:00:...|214757407| 0|
| 2|2014-04-07 14:02:...|214551617| 0|
| 3|2014-04-02 13:17:...|214716935| 0|
| 3|2014-04-02 13:26:...|214774687| 0|
| 3|2014-04-02 13:30:...|214832672| 0|
| 4|2014-04-07 12:09:...|214836765| 0|
| 4|2014-04-07 12:26:...|214706482| 0|
| 6|2014-04-06 16:58:...|214701242| 0|
| 6|2014-04-06 17:02:...|214826623| 0|
| 7|2014-04-02 06:38:...|214826835| 0|
| 7|2014-04-02 06:39:...|214826715| 0|
| 8|2014-04-06 08:49:...|214838855| 0|
+----------+--------------------+---------+--------+
I want to get the difference between the timestamp of the current row and the timestamp of the previous row.
so I converted the time stamp as follows
data = data.withColumn('time_seconds',data.timestamp.astype('Timestamp').cast("long"))
data.show()
next, I tried the following
my_window = Window.partitionBy().orderBy("session_id")
data = data.withColumn("prev_value", F.lag(data.time_seconds).over(my_window))
data = data.withColumn("diff", F.when(F.isnull(data.time_seconds - data.prev_value), 0)
.otherwise(data.time_seconds - data.prev_value))
data.show()
this is what I got
+----------+-----------+---------+--------+------------+----------+--------+
|session_id| timestamp| item_id|category|time_seconds|prev_value| diff|
+----------+--------------------+---------+--------+------------+----------+
| 1|2014-04-07 |214536502| 0| 1396831869| null| 0|
| 1|2014-04-07 |214536500| 0| 1396832049|1396831869| 180|
| 1|2014-04-07 |214536506| 0| 1396832086|1396832049| 37|
| 1|2014-04-07 |214577561| 0| 1396832220|1396832086| 134|
| 10000001|2014-09-08 |214854230| S| 1410136538|1396832220|13304318|
| 10000001|2014-09-08 |214556216| S| 1410136820|1410136538| 282|
| 10000001|2014-09-08 |214556212| S| 1410136836|1410136820| 16|
| 10000001|2014-09-08 |214854230| S| 1410136872|1410136836| 36|
| 10000001|2014-09-08 |214854125| S| 1410137314|1410136872| 442|
| 10000002|2014-09-08 |214849322| S| 1410167451|1410137314| 30137|
| 10000002|2014-09-08 |214838094| S| 1410167611|1410167451| 160|
| 10000002|2014-09-08 |214714721| S| 1410167694|1410167611| 83|
| 10000002|2014-09-08 |214853711| S| 1410168818|1410167694| 1124|
| 10000003|2014-09-05 |214853090| 3| 1409880735|1410168818| -288083|
| 10000003|2014-09-05 |214851326| 3| 1409880865|1409880735| 130|
| 10000003|2014-09-05 |214853094| 3| 1409881043|1409880865| 178|
| 10000004|2014-09-05 |214853090| 3| 1409886885|1409881043| 5842|
| 10000004|2014-09-05 |214851326| 3| 1409889318|1409886885| 2433|
| 10000004|2014-09-05 |214853090| 3| 1409889388|1409889318| 70|
| 10000004|2014-09-05 |214851326| 3| 1409889428|1409889388| 40|
+----------+--------------------+---------+--------+------------+----------+
only showing top 20 rows
I was hoping that the session Id came out in order of numerical sequence instead of what that gave me...
is there anyway to make the session id come out in numerical order (as in 1,2,3.....) instead of (1,100001......)
thank you so much
​

Spark Dataframe ORDER BY giving mixed combination(asc + desc)

I have a Dataframe that I want to sort column by descending if the count value is greater than 10.
But I'm getting a mixed combination like ascending for couple of records then again descending and then again ascending and son on.
I'm using orderBy() function which sort the record in ascending by default.
Since i'm new to Scala and Spark I'm not getting the reason for why I'm getting this.
df.groupBy("Value").count().filter("count>5.0").orderBy("Value").show(1000);
reading the csv
val df = sparkSession
.read
.option("header", "true")
.option("inferSchema", "true")
.csv("src/main/resources/test.csv")
.toDF("Country_Code", "Country","Data_Source","Data_File","Category","Metric","Time","Data_Cut1","Option1_Dummy","Option1_Visible","Value")````
the records I'm getting by executing the above syntax:
+-------+-----+
| Value|count|
+-------+-----+
| 0| 225|
| 0.01| 12|
| 0.02| 13|
| 0.03| 12|
| 0.04| 15|
| 0.05| 9|
| 0.06| 11|
| 0.07| 9|
| 0.08| 6|
| 0.09| 10|
| 0.1| 66|
| 0.11| 12|
| 0.12| 9|
| 0.13| 12|
| 0.14| 8|
| 0.15| 10|
| 0.16| 14|
| 0.17| 11|
| 0.18| 14|
| 0.19| 21|
| 0.2| 78|
| 0.21| 16|
| 0.22| 15|
| 0.23| 13|
| 0.24| 7|
| 0.3| 85|
| 0.31| 7|
| 0.34| 8|
| 0.4| 71|
| 0.5| 103|
| 0.6| 102|
| 0.61| 6|
| 0.62| 9|
| 0.69| 7|
| 0.7| 98|
| 0.72| 6|
| 0.74| 8|
| 0.78| 7|
| 0.8| 71|
| 0.81| 10|
| 0.82| 9|
| 0.83| 8|
| 0.84| 6|
| 0.86| 8|
| 0.87| 10|
| 0.88| 12|
| 0.9| 95|
| 0.91| 9|
| 0.93| 6|
| 0.94| 6|
| 0.95| 8|
| 0.98| 8|
| 0.99| 6|
| 1| 254|
| 1.08| 8|
| 1.1| 80|
| 1.11| 6|
| 1.15| 9|
| 1.17| 7|
| 1.18| 6|
| 1.19| 9|
| 1.2| 94|
| 1.25| 7|
| 1.3| 91|
| 1.32| 8|
| 1.4| 215|
| 1.45| 7|
| 1.5| 320|
| 1.56| 6|
| 1.6| 280|
| 1.64| 6|
| 1.66| 10|
| 1.7| 310|
| 1.72| 7|
| 1.74| 6|
| 1.8| 253|
| 1.9| 117|
| 10| 78|
| 10.1| 45|
| 10.2| 49|
| 10.3| 30|
| 10.4| 40|
| 10.5| 38|
| 10.6| 52|
| 10.7| 35|
| 10.8| 39|
| 10.9| 42|
| 10.96| 7|------------mark
| 100| 200|
| 101.3| 7|
| 101.8| 8|
| 102| 6|
| 102.2| 6|
| 102.7| 8|
| 103.2| 6|--------------here
| 11| 93|
| 11.1| 32|
| 11.2| 38|
| 11.21| 6|
| 11.3| 42|
| 11.4| 32|
| 11.5| 34|
| 11.6| 38|
| 11.69| 6|
| 11.7| 42|
| 11.8| 25|
| 11.86| 6|
| 11.9| 39|
| 11.96| 9|
| 12| 108|
| 12.07| 7|
| 12.1| 31|
| 12.11| 6|
| 12.2| 34|
| 12.3| 28|
| 12.39| 6|
| 12.4| 32|
| 12.5| 31|
| 12.54| 7|
| 12.57| 6|
| 12.6| 18|
| 12.7| 33|
| 12.8| 20|
| 12.9| 21|
| 13| 85|
| 13.1| 25|
| 13.2| 19|
| 13.3| 30|
| 13.34| 6|
| 13.4| 32|
| 13.5| 16|
| 13.6| 15|
| 13.7| 31|
| 13.8| 8|
| 13.83| 7|
| 13.89| 7|
| 14| 46|
| 14.1| 10|
| 14.3| 10|
| 14.4| 7|
| 14.5| 15|
| 14.7| 6|
| 14.9| 11|
| 15| 52|
| 15.2| 6|
| 15.3| 9|
| 15.4| 12|
| 15.5| 21|
| 15.6| 11|
| 15.7| 14|
| 15.8| 18|
| 15.9| 18|
| 16| 44|
| 16.1| 30|
| 16.2| 26|
| 16.3| 29|
| 16.4| 26|
| 16.5| 32|
| 16.6| 42|
| 16.7| 44|
| 16.72| 6|
| 16.8| 40|
| 16.9| 54|
| 17| 58|
| 17.1| 48|
| 17.2| 51|
| 17.3| 47|
| 17.4| 57|
| 17.5| 51|
| 17.6| 51|
| 17.7| 46|
| 17.8| 33|
| 17.9| 38|---------again
|1732.04| 6|
| 18| 49|
| 18.1| 21|
| 18.2| 23|
| 18.3| 29|
| 18.4| 22|
| 18.5| 22|
| 18.6| 17|
| 18.7| 13|
| 18.8| 13|
| 18.9| 19|
| 19| 36|
| 19.1| 15|
| 19.2| 13|
| 19.3| 12|
| 19.4| 15|
| 19.5| 15|
| 19.6| 15|
| 19.7| 15|
| 19.8| 14|
| 19.9| 9|
| 2| 198|------------see after 19 again 2 came
| 2.04| 7|
| 2.09| 8|
| 2.1| 47|
| 2.16| 6|
| 2.17| 8|
| 2.2| 55|
| 2.24| 6|
| 2.26| 7|
| 2.27| 6|
| 2.29| 8|
| 2.3| 53|
| 2.4| 33|
| 2.5| 36|
| 2.54| 6|
| 2.59| 6|
Can you tell me what is wrong i'm doing.
My dataframe has column
"Country_Code", "Country","Data_Source","Data_File","Category","Metric","Time","Data_Cut1","Option1_Dummy","Option1_Visible","Value"
As we talked about in the comments, it seems your Value column is of type String. You can cast it to Double (for instance) to order it numerically.
This lines will cast the Value column to doubleType:
import org.apache.spark.sql.types._
df.withColumn("Value", $"Value".cast(DoubleType))
EXAMPLE INPUT
df.show
+-----+-------+
|Value|another|
+-----+-------+
| 10.0| b|
| 2| a|
+-----+-------+
With Value as Strings
df.orderBy($"Value").show
+-----+-------+
|Value|another|
+-----+-------+
| 10.0| b|
| 2| a|
+-----+-------+
Casting Value as Double
df.withColumn("Value", $"Value".cast(DoubleType)).orderBy($"Value").show
+-----+-------+
|Value|another|
+-----+-------+
| 2.0| a|
| 10.0| b|
+-----+-------+