Need to add order on ids on the basis of timestamp - pyspark

Desired Outcome
I tried everything, including groupBy and conditions, but nothing works.

You can achieve that with a window function, like this:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
window = Window.partitionBy("user").orderBy("timestamp")
df.withColumn("order", row_number().over(window)).show()
+----+---------+-----+
|user|timestamp|order|
+----+---------+-----+
| 111| 12:00| 1|
| 111| 12:30| 2|
| 111| 12:45| 3|
| 112| 12:00| 1|
| 112| 12:30| 2|
| 112| 12:45| 3|
| 113| 12:00| 1|
| 113| 12:30| 2|
| 113| 12:45| 3|
+----+---------+-----+

Related

How does Spark DataFrame find out some lines that only appear once?

I want to eliminate rows whose ‘county’ value appears only once, because they distort my statistics.
I used groupBy+count to find:
fault_data.groupBy("county").count().show()
The data looks like this:
+----------+-----+
| county|count|
+----------+-----+
| A| 117|
| B| 31|
| C| 1|
| D| 272|
| E| 1|
| F| 1|
| G| 280|
| H| 1|
| I| 1|
| J| 1|
| K| 112|
| L| 63|
| M| 18|
| N| 71|
| O| 1|
| P| 1|
| Q| 82|
| R| 2|
| S| 31|
| T| 2|
+----------+-----+
Next, I want to eliminate the rows whose county has a count of 1.
But when I wrote it like this, I got an error:
fault_data.filter("count(county)=1").show()
The result is:
Aggregate/Window/Generate expressions are not valid in where clause of the query.
Expression in where clause: [(count(county) = CAST(1 AS BIGINT))]
Invalid expressions: [count(county)];
Filter (count(county#7) = cast(1 as bigint))
+- Relation [fault_id#0,fault_type#1,acs_way#2,fault_1#3,fault_2#4,province#5,city#6,county#7,town#8,detail#9,num#10,insert_time#11] JDBCRelation(fault_data) [numPartitions=1]
So I want to know the right way, thank you.
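Filter on the aggregated result instead of the original DataFrame. The line below is Scala; in PySpark, compare with == instead of ===. It lists the counties whose count is 1; compare with > 1 to keep the others:
fault_data.groupBy("county").count().where(col("count")===1).show()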
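The aggregation above only returns the counties, not the original rows. As a plain-Python model of the two steps involved (count per key, then drop rows whose key occurs once), here is a minimal sketch. It is not Spark code, and the sample rows and the `fault_id` values are made up for illustration:

```python
from collections import Counter

# Hypothetical sample rows; only the "county" column matters here.
rows = [
    {"county": "A", "fault_id": 1},
    {"county": "A", "fault_id": 2},
    {"county": "C", "fault_id": 3},
    {"county": "R", "fault_id": 4},
    {"county": "R", "fault_id": 5},
]

# Step 1: count rows per county (the groupBy + count step).
counts = Counter(row["county"] for row in rows)

# Counties that appear exactly once (what the filter on count == 1 returns).
singletons = sorted(county for county, n in counts.items() if n == 1)

# Step 2: keep only rows whose county appears more than once.
kept = [row for row in rows if counts[row["county"]] > 1]

print(singletons)
print(kept)
```

In Spark the second step would typically be a join back to the original DataFrame or a count over a window partitioned by county, followed by a filter.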

Unable to get the result from the window function

+---------------+--------+
|YearsExperience| Salary|
+---------------+--------+
| 1.1| 39343.0|
| 1.3| 46205.0|
| 1.5| 37731.0|
| 2.0| 43525.0|
| 2.2| 39891.0|
| 2.9| 56642.0|
| 3.0| 60150.0|
| 3.2| 54445.0|
| 3.2| 64445.0|
| 3.7| 57189.0|
| 3.9| 63218.0|
| 4.0| 55794.0|
| 4.0| 56957.0|
| 4.1| 57081.0|
| 4.5| 61111.0|
| 4.9| 67938.0|
| 5.1| 66029.0|
| 5.3| 83088.0|
| 5.9| 81363.0|
| 6.0| 93940.0|
| 6.8| 91738.0|
| 7.1| 98273.0|
| 7.9|101302.0|
| 8.2|113812.0|
| 8.7|109431.0|
| 9.0|105582.0|
| 9.5|116969.0|
| 9.6|112635.0|
| 10.3|122391.0|
| 10.5|121872.0|
+---------------+--------+
I want to find the highest salary in the data above, which is 122391.0.
My Code
val top= Window.partitionBy("id").orderBy(col("Salary").desc)
val res= df1.withColumn("top", rank().over(top))
Result
+---------------+--------+---+---+
|YearsExperience| Salary| id|top|
+---------------+--------+---+---+
| 1.1| 39343.0| 0| 1|
| 1.3| 46205.0| 1| 1|
| 1.5| 37731.0| 2| 1|
| 2.0| 43525.0| 3| 1|
| 2.2| 39891.0| 4| 1|
| 2.9| 56642.0| 5| 1|
| 3.0| 60150.0| 6| 1|
| 3.2| 54445.0| 7| 1|
| 3.2| 64445.0| 8| 1|
| 3.7| 57189.0| 9| 1|
| 3.9| 63218.0| 10| 1|
| 4.0| 55794.0| 11| 1|
| 4.0| 56957.0| 12| 1|
| 4.1| 57081.0| 13| 1|
| 4.5| 61111.0| 14| 1|
| 4.9| 67938.0| 15| 1|
| 5.1| 66029.0| 16| 1|
| 5.3| 83088.0| 17| 1|
| 5.9| 81363.0| 18| 1|
| 6.0| 93940.0| 19| 1|
| 6.8| 91738.0| 20| 1|
| 7.1| 98273.0| 21| 1|
| 7.9|101302.0| 22| 1|
| 8.2|113812.0| 23| 1|
| 8.7|109431.0| 24| 1|
| 9.0|105582.0| 25| 1|
| 9.5|116969.0| 26| 1|
| 9.6|112635.0| 27| 1|
| 10.3|122391.0| 28| 1|
| 10.5|121872.0| 29| 1|
+---------------+--------+---+---+
I also tried partitioning by Salary and ordering by id, but the result was the same.
As you can see, 122391.0 appears near the bottom, but I expected it in first position given the ordering I specified.
Any help would be appreciated.
Are you sure you need a window function here? The window you defined partitions the data by id, which I assume is unique, so each group produced by the window will only have one row. It looks like you want a window over the entire dataframe, which means you don't actually need one. If you just want to add a column with the max, you can get the max using an aggregation on your original dataframe and cross join with it:
val maxDF = df1.agg(max("salary").as("top"))
val res = df1.crossJoin(maxDF)
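To see why every row received rank 1, here is a minimal plain-Python sketch that mimics rank() over a partition ordered by salary descending. It is not Spark code; the helper `rank_desc`, the constant `all` key, and the three sample rows (taken from the end of the posted result) are illustrative assumptions:

```python
from collections import defaultdict

def rank_desc(rows, partition_key, order_key):
    """Mimic rank() over (partition by partition_key order by order_key desc)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    ranked = []
    for row in rows:
        peers = partitions[row[partition_key]]
        # rank = 1 + number of rows in the same partition with a strictly larger value
        rank = 1 + sum(1 for p in peers if p[order_key] > row[order_key])
        ranked.append({**row, "top": rank})
    return ranked

rows = [
    {"id": 27, "Salary": 112635.0},
    {"id": 28, "Salary": 122391.0},
    {"id": 29, "Salary": 121872.0},
]

# Partitioning by a unique id: each partition holds one row, so every rank is 1.
by_id = rank_desc(rows, "id", "Salary")

# One partition for all rows (constant key): ranks now compare across rows.
single = rank_desc([{**r, "all": 0} for r in rows], "all", "Salary")

print([r["top"] for r in by_id])
print([r["top"] for r in single])
```

With a single partition the highest salary gets rank 1, which is the behaviour the question was after. The aggregation plus cross join above gets the same value without a window.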

Create a new column that marks customers

My goal is to aggregate over customerID (count), create a new column, and mark the customers who often return an article. How can I do that? (Using Databricks, PySpark.)
train.select("itemID","customerID","returnShipment").show(10)
+------+----------+--------------+
|itemID|customerID|returnShipment|
+------+----------+--------------+
| 186| 794| 0|
| 71| 794| 1|
| 71| 794| 1|
| 32| 850| 1|
| 32| 850| 1|
| 57| 850| 1|
| 2| 850| 1|
| 259| 850| 1|
| 603| 850| 1|
| 259| 850| 1|
+------+----------+--------------+
You can define a threshold value and then compare this threshold value to the sum of returnShipments for each customerID:
from pyspark.sql import functions as F
threshold = 5
df.groupBy("customerID") \
    .sum("returnShipment") \
    .withColumn("mark", F.col("sum(returnShipment)") > threshold) \
    .show()

PySpark: combine different rows based on a column

I have a dataframe
+----------------+------------+-----+
| Sport|Total_medals|count|
+----------------+------------+-----+
| Alpine Skiing| 3| 4|
| Alpine Skiing| 2| 18|
| Alpine Skiing| 4| 1|
| Alpine Skiing| 1| 38|
| Archery| 2| 12|
| Archery| 1| 72|
| Athletics| 2| 50|
| Athletics| 1| 629|
| Athletics| 3| 8|
| Badminton| 2| 5|
| Badminton| 1| 86|
| Baseball| 1| 216|
| Basketball| 1| 287|
|Beach Volleyball| 1| 48|
| Biathlon| 4| 1|
| Biathlon| 3| 9|
| Biathlon| 1| 61|
| Biathlon| 2| 23|
| Bobsleigh| 2| 6|
| Bobsleigh| 1| 60|
+----------------+------------+-----+
Is there a way for me to combine the value of counts from multiple rows if they are from the same sport?
For example, if Sport = Alpine Skiing I would have something like this:
+----------------+-----+
| Sport|count|
+----------------+-----+
| Alpine Skiing| 61|
+----------------+-----+
where count is equal to 4+18+1+38 = 61. I would like to do this for all sports.
Any help would be appreciated.
You need to group by the Sport column and then aggregate the count column with the sum() function.
Example:
import pyspark.sql.functions as F
# alias keeps the column name "count"; without it the column is named "sum(count)"
grouped_df = df.groupby('Sport').agg(F.sum('count').alias('count'))

Window function count() does not work properly when there is orderBy in the window definition

In PySpark, when using count().over(window) with orderBy in the window definition, the results are not what I expect. I am not sure whether this is a bug or whether there is a better way to do it.
Compare the same group under two window definitions, one with orderBy and one without. They show different results. The window definition without orderBy gives the expected results.
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window
cschema = StructType([StructField('customer',StringType()),StructField('sales', IntegerType())])
data = [
['Bob',20],
['Bob',30],
['Bob',22],
['John',33],
['John', 18],
['Bob', 30],
['John', 18]]
test_df = spark.createDataFrame(data, schema = cschema)
test_df.show()
+--------+-----+
|customer|sales|
+--------+-----+
| Bob| 20|
| Bob| 30|
| Bob| 22|
| John| 33|
| John| 18|
| Bob| 30|
| John| 18|
+--------+-----+
win_ordered = Window.partitionBy('customer').orderBy(col('sales'))
win_non_ordered = Window.partitionBy('customer')
test_df.withColumn('cnt1', count(col('sales')).over(win_ordered)).withColumn('cnt2', count(col('sales')).over(win_non_ordered)).show()
+--------+-----+----+----+
|customer|sales|cnt1|cnt2|
+--------+-----+----+----+
| Bob| 20| 1| 4|
| Bob| 22| 2| 4|
| Bob| 30| 4| 4|
| Bob| 30| 4| 4|
| John| 18| 2| 3|
| John| 18| 2| 3|
| John| 33| 3| 3|
+--------+-----+----+----+
I expect the 'cnt1' column to have the same value across the group, just like the 'cnt2' column.
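This behaviour is consistent with Spark's default window frame rather than a bug. When a window has an orderBy and no explicit frame, the frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so count() becomes a running count that includes ties. Without orderBy, the frame is the whole partition. The plain-Python sketch below models both frames on the question's data. It is not Spark code, and the helper `count_over` and its `frame` labels are illustrative names:

```python
def count_over(rows, frame):
    """Model count(sales) over a window partitioned by customer.

    frame="default_ordered": RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    frame="whole_partition": every row of the partition
    """
    result = []
    for customer, sales in rows:
        peers = [s for c, s in rows if c == customer]
        if frame == "default_ordered":
            # Rows with a sales value <= the current one (ties are included).
            n = sum(1 for s in peers if s <= sales)
        else:
            n = len(peers)
        result.append((customer, sales, n))
    return sorted(result)

data = [("Bob", 20), ("Bob", 30), ("Bob", 22), ("John", 33),
        ("John", 18), ("Bob", 30), ("John", 18)]

cnt1 = count_over(data, "default_ordered")
cnt2 = count_over(data, "whole_partition")

for row in cnt1:
    print(row)
```

The modelled cnt1 values match the posted output (1, 2, 4, 4 for Bob and 2, 2, 3 for John). In PySpark, one way to keep the ordering but count the whole partition is to add an explicit frame to the ordered window with .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing).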