Add a column in PySpark that assigns a group number to the corresponding rows - group-by

I have a DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('').getOrCreate()
df = spark.createDataFrame([("a", "65"), ("b", "23"), ("c", "65"), ("d", "23"),
                            ("a", "66"), ("b", "46"), ("c", "23"), ("d", "66"),
                            ("b", "5"), ("b", "3"), ("c", "3")], ["column2", "value"])
df.show()
+-------+-----+
|column2|value|
+-------+-----+
| a| 65 |
| b| 23 |
| c| 65 |
| d| 23 |
| a| 66 |
| b| 46 |
| c| 23 |
| d| 66 |
| b| 5 |
| b| 3 |
| c| 3 |
+-------+-----+
I want to make every 4 rows one group, and then create a new column that assigns the group number to the corresponding rows. The desired output is as follows:
+-------+-----+------+
|column2|value|gr_val|
+-------+-----+------+
| a| 65 | 1 |
| b| 23 | 1 |
| c| 65 | 1 |
| d| 23 | 1 |
| a| 66 | 2 |
| b| 46 | 2 |
| c| 23 | 2 |
| d| 66 | 2 |
| b| 5 | 3 |
| b| 3 | 3 |
| c| 3 | 3 |
+-------+-----+------+
I would appreciate any help!

Try this approach -
(1) Create a new column (a dummy row number) that holds a sequentially increasing number for each row. lit('a') is used as a static value so that the window spans all rows and row_number generates a sequentially increasing number.
(2) Divide the row number by the number of records you want in each group (e.g. 4) and take the ceiling. ceil returns the smallest integer not less than its argument.
Here is a detailed example -
from pyspark.sql.functions import *
from pyspark.sql.window import *

w = Window.partitionBy(lit('a')).orderBy(lit('a'))
df.withColumn("row_num", row_number().over(w))\
  .selectExpr('column2 AS column2', 'value AS value', 'ceil(row_num/4) as gr_val')\
  .show()
#+-------+-----+------+
#|column2|value|gr_val|
#+-------+-----+------+
#| a| 65| 1|
#| b| 23| 1|
#| c| 65| 1|
#| d| 23| 1|
#| a| 66| 2|
#| b| 46| 2|
#| c| 23| 2|
#| d| 66| 2|
#| b| 5| 3|
#| b| 3| 3|
#| c| 3| 3|
#+-------+-----+------+
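Note that partitionBy(lit('a')) pulls every row into a single partition for the window, which can become a bottleneck on large DataFrames. As a rough alternative sketch (an assumption on my part, not part of the original answer: it relies on the DataFrame's current row order being the order you want to group by), the rows can be numbered with zipWithIndex instead:
from pyspark.sql.functions import ceil, col

# Number rows via the RDD API instead of a single-partition window.
indexed = df.rdd.zipWithIndex().map(lambda pair: (*pair[0], pair[1] + 1))
df_idx = indexed.toDF(df.columns + ["row_num"])

df_idx.withColumn("gr_val", ceil(col("row_num") / 4)).drop("row_num").show()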

Related

How to solve the following issue with Apache Spark with an optimal solution

I need to solve the following problem without GraphFrames, please help.
Input DataFrame:
|-----------+-----------+--------------|
| ID | prev | next |
|-----------+-----------+--------------|
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 2 | null |
| 9 | 9 | null |
|-----------+-----------+--------------|
Output DataFrame:
|-----------+------------|
| bill_id | item_id |
|-----------+------------|
| 1 | [1, 2, 3] |
| 9 | [9] |
|-----------+------------|
This is probably quite inefficient, but it works. It is inspired by how GraphFrames does connected components: basically, join the DataFrame with itself on the prev column until prev doesn't get any lower, then group.
from pyspark.sql import functions as F

df = sc.parallelize([(1, 1, 2), (2, 1, 3), (3, 2, None), (9, 9, None)]).toDF(['ID', 'prev', 'next'])
df.show()
+---+----+----+
| ID|prev|next|
+---+----+----+
| 1| 1| 2|
| 2| 1| 3|
| 3| 2|null|
| 9| 9|null|
+---+----+----+
converged = False
count = 0
while not converged:
    step = df.join(df.selectExpr('ID as prev', 'prev as lower_prev'), 'prev', 'left').cache()
    print('step', count)
    step.show()
    converged = step.where('prev != lower_prev').count() == 0
    df = step.selectExpr('ID', 'lower_prev as prev')
    print('df', count)
    df.show()
    count += 1
step 0
+----+---+----+----------+
|prev| ID|next|lower_prev|
+----+---+----+----------+
| 2| 3|null| 1|
| 1| 2| 3| 1|
| 1| 1| 2| 1|
| 9| 9|null| 9|
+----+---+----+----------+
df 0
+---+----+
| ID|prev|
+---+----+
| 3| 1|
| 1| 1|
| 2| 1|
| 9| 9|
+---+----+
step 1
+----+---+----------+
|prev| ID|lower_prev|
+----+---+----------+
| 1| 3| 1|
| 1| 1| 1|
| 1| 2| 1|
| 9| 9| 9|
+----+---+----------+
df 1
+---+----+
| ID|prev|
+---+----+
| 3| 1|
| 1| 1|
| 2| 1|
| 9| 9|
+---+----+
df.groupBy('prev').agg(F.collect_set('ID').alias('item_id')).withColumnRenamed('prev', 'bill_id').show()
+-------+---------+
|bill_id| item_id|
+-------+---------+
| 1|[1, 2, 3]|
| 9| [9]|
+-------+---------+
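One practical note if the chains are long: each pass through the loop extends the DataFrame lineage, which can slow later iterations down. A hedged sketch of one mitigation inside the loop above (localCheckpoint needs no checkpoint directory; checkpointing every 5 iterations is an arbitrary choice, not part of the original answer):
# Replace the reassignment of df inside the while loop with:
df = step.selectExpr('ID', 'lower_prev as prev')
if count % 5 == 0:
    df = df.localCheckpoint()  # truncate the accumulated lineage periodically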

Spark dataframe groupby and order group?

I have the following data,
+-------+----+----+
|user_id|time|item|
+-------+----+----+
| 1| 5| ggg|
| 1| 5| ddd|
| 1| 20| aaa|
| 1| 20| ppp|
| 2| 3| ccc|
| 2| 3| ttt|
| 2| 20| eee|
+-------+----+----+
This could be generated by the following code:
val df = sc.parallelize(Array(
  (1, 20, "aaa"),
  (1, 5, "ggg"),
  (2, 3, "ccc"),
  (1, 20, "ppp"),
  (1, 5, "ddd"),
  (2, 20, "eee"),
  (2, 3, "ttt"))).toDF("user_id", "time", "item")
How can I get the result:
+---------+------+------+----------+
| user_id | time | item | order_id |
+---------+------+------+----------+
| 1 | 5 | ggg | 1 |
| 1 | 5 | ddd | 1 |
| 1 | 20 | aaa | 2 |
| 1 | 20 | ppp | 2 |
| 2 | 3 | ccc | 1 |
| 2 | 3 | ttt | 1 |
| 2 | 20 | eee | 2 |
+---------+------+------+----------+
Group by user_id and time, order by time, and rank the groups. Thanks~
To rank the rows you can use the dense_rank window function, and the ordering can be achieved with a final orderBy transformation:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank

val w = Window.partitionBy("user_id").orderBy("user_id", "time")
val result = df
  .withColumn("order_id", dense_rank().over(w))
  .orderBy("user_id", "time")
result.show()
+-------+----+----+--------+
|user_id|time|item|order_id|
+-------+----+----+--------+
| 1| 5| ddd| 1|
| 1| 5| ggg| 1|
| 1| 20| aaa| 2|
| 1| 20| ppp| 2|
| 2| 3| ttt| 1|
| 2| 3| ccc| 1|
| 2| 20| eee| 2|
+-------+----+----+--------+
Note that the order within the item column is not guaranteed.
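For reference, the same approach in PySpark would look roughly like this (a sketch, not from the original answer, assuming a DataFrame df with the columns user_id, time, and item):
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

# Rank the (user_id, time) groups within each user, then restore the ordering.
w = Window.partitionBy("user_id").orderBy("time")
result = df.withColumn("order_id", dense_rank().over(w)).orderBy("user_id", "time")
result.show()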

spark aggregation count on condition

I'm trying to group a DataFrame and then, when aggregating the rows with a count, apply a condition to the rows before counting.
Here is an example:
val test=Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.show
+---+---+
| _1| _2|
+---+---+
| A| X|
| A| X|
| B| O|
| B| O|
| c| O|
| c| X|
| d| X|
| d| O|
+---+---+
In this example I want to group by column _1 and count the values in column _2 where the value = 'X'.
Here is the expected result:
+---+-----------+
| _1| count(_2) |
+---+-----------+
| A| 2 |
| B| 0 |
| c| 1 |
| d| 1 |
+---+-----------+
Use when to get this aggregation. A PySpark solution is shown here, followed by the Scala version.
from pyspark.sql.functions import col, count, when
test.groupBy(col("_1")).agg(count(when(col("_2") == 'X', 1))).show()
import spark.implicits._
import org.apache.spark.sql.functions.{count, when}

val test = Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.groupBy("_1").agg(count(when($"_2" === "X", 1)).as("count")).orderBy("_1").show
+---+-----+
| _1|count|
+---+-----+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-----+
As an alternative, in Scala, it can be:
val counter1 = test.select(col("_1"),
  when(col("_2") === lit("X"), lit(1)).otherwise(lit(0)).as("_2"))
val agg1 = counter1.groupBy("_1").agg(sum("_2")).orderBy("_1")
agg1.show
gives result:
+---+-------+
| _1|sum(_2)|
+---+-------+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-------+
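For completeness, the same conditional count can also be written as a single SQL expression in PySpark (a sketch on my part, assuming a DataFrame named test with columns _1 and _2):
from pyspark.sql.functions import expr

# CASE WHEN yields NULL when the condition fails, and count() skips NULLs.
test.groupBy("_1") \
    .agg(expr("count(CASE WHEN _2 = 'X' THEN 1 END)").alias("count")) \
    .orderBy("_1") \
    .show()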

How to aggregate contiguous rows in pyspark

I have an immense amount of user data (billions of rows) where I need to summarize the amount of time spent in a specific state by each user.
Let's say it's historical web data, and I want to sum the amount of time each user has spent on the site. The data only says if the user is present.
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
The correct answer would be this since I'm summing the total per contiguous segment.
+----+---------+
|user| ttl |
+----+---------+
| A| 4|
| B| 1|
+----+---------+
I tried doing a max()-min() with a groupBy, but that resulted in segment A being 8-1 = 7, which is the wrong answer.
In sqlite I was able to get the answer by creating a partition number and then finding the difference and summing. I created the partition with this...
SELECT
    COUNT(*) FILTER (WHERE a.user <>
        (SELECT b.user
         FROM foobar AS b
         WHERE a.timestamp > b.timestamp
         ORDER BY b.timestamp DESC
         LIMIT 1))
    OVER (ORDER BY timestamp) c,
    user,
    timestamp
FROM foobar a;
which gave me...
+----+---------+---+
|user|timestamp| c |
+----+---------+---+
| A| 1| 1 |
| A| 2| 1 |
| A| 3| 1 |
| B| 4| 2 |
| B| 5| 2 |
| A| 6| 3 |
| A| 7| 3 |
| A| 8| 3 |
+----+---------+---+
Then the LAST() - FIRST() functions in sql made that easy to finish.
Any ideas on how to scale this and do it in PySpark? I can't seem to find an adequate substitute for the COUNT(*) FILTER (WHERE ...) that SQLite offered.
We can do this:
Create the DataFrame
from pyspark.sql.window import Window
from pyspark.sql.functions import max, min
from pyspark.sql import functions as F
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
df.show()
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
Assign a row_number to each row, ordered by timestamp. The dummy column is used so that we can apply the row_number window function over the entire DataFrame.
df = df.withColumn('dummy', F.lit(1))
w1 = Window.partitionBy('dummy').orderBy('timestamp')
df = df.withColumn('row_number', F.row_number().over(w1))
df.show()
+----+---------+-----+----------+
|user|timestamp|dummy|row_number|
+----+---------+-----+----------+
| A| 1| 1| 1|
| A| 2| 1| 2|
| A| 3| 1| 3|
| B| 4| 1| 4|
| B| 5| 1| 5|
| A| 6| 1| 6|
| A| 7| 1| 7|
| A| 8| 1| 8|
+----+---------+-----+----------+
We want to create a sub group within each user group here.
(1) For each user group, compute the difference between the current row's row_number and the previous row's row_number. Any difference larger than 1 indicates a new contiguous group. This gives the column diff; note that the first row in each group gets a value of -1 (from fillna).
(2) We then assign null to every row with diff == 1. This gives the column diff2.
(3) Next, we use the last function to fill the rows where diff2 is null with the last non-null value in column diff2. This gives subgroupid.
This is the sub-group we want to create within each user group.
w2 = Window.partitionBy('user').orderBy('timestamp')
df = df.withColumn('diff', df['row_number'] - F.lag('row_number').over(w2)).fillna(-1)
df = df.withColumn('diff2', F.when(df['diff']==1, None).otherwise(F.abs(df['diff'])))
df = df.withColumn('subgroupid', F.last(F.col('diff2'), True).over(w2))
df.show()
+----+---------+-----+----------+----+-----+----------+
|user|timestamp|dummy|row_number|diff|diff2|subgroupid|
+----+---------+-----+----------+----+-----+----------+
| B| 4| 1| 4| -1| 1| 1|
| B| 5| 1| 5| 1| null| 1|
| A| 1| 1| 1| -1| 1| 1|
| A| 2| 1| 2| 1| null| 1|
| A| 3| 1| 3| 1| null| 1|
| A| 6| 1| 6| 3| 3| 3|
| A| 7| 1| 7| 1| null| 3|
| A| 8| 1| 8| 1| null| 3|
+----+---------+-----+----------+----+-----+----------+
We now group by both user and subgroupid to compute the time each user spent on each contiguous time interval.
Lastly, we group by user only to sum up the total time spent by each user.
df = df.groupBy('user', 'subgroupid').agg((F.max('timestamp') - F.min('timestamp')).alias('duration'))
df = df.groupBy('user').agg(F.sum('duration').alias('total_time'))
df.show()
+----+----------+
|user|total_time|
+----+----------+
| B| 1|
| A| 4|
+----+----------+
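A more compact variant of the same idea is the classic gaps-and-islands trick: the difference between a global row number and a per-user row number is constant within each contiguous run, so it can serve directly as the sub-group id. A sketch, assuming the same df (and with the same caveat that a global window, like the dummy column above, funnels all rows through a single partition):
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w_all = Window.orderBy('timestamp')                        # global row order
w_user = Window.partitionBy('user').orderBy('timestamp')   # row order per user

result = (df
    .withColumn('grp', F.row_number().over(w_all) - F.row_number().over(w_user))
    .groupBy('user', 'grp')
    .agg((F.max('timestamp') - F.min('timestamp')).alias('duration'))
    .groupBy('user')
    .agg(F.sum('duration').alias('ttl')))
result.show()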

Update a column value from another two columns based on multiple conditions in Spark Structured Streaming

I want to update the value in one column using two other columns, based on multiple conditions. For example, the stream looks like this:
+---+---+----+---+
| A | B | C | D |
+---+---+----+---+
| a | T | 10 | 0 |
| a | T | 100| 0 |
| a | L | 0 | 0 |
| a | L | 1 | 0 |
+---+---+----+---+
What I have is multiple conditions like -
(B = "T" && C > 20 ) OR (B = "L" && C = 0)
The values "T", 20, "L" and 0 are dynamic, and the AND/OR operators are also supplied at run-time. I want to set D = 1 whenever the condition holds true; otherwise it should remain D = 0. The number of conditions is also dynamic.
I tried the UPDATE command in Spark SQL, i.e. UPDATE df SET D = '1' WHERE CONDITIONS, but it says that UPDATE is not yet supported. The resulting DataFrame should be:
+---+---+----+---+
| A | B | C | D |
+---+---+----+---+
| a | T | 10 | 0 |
| a | T | 100| 1 |
| a | L | 0 | 1 |
| a | L | 1 | 0 |
+---+---+----+---+
Is there any way I can achieve this?
I hope you are using Python. I will post the same for Scala as well! Use a udf.
PYTHON
>>> df.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 0|
| a| L| 0| 0|
| a| L| 1| 0|
+---+---+---+---+
>>> from pyspark.sql.functions import udf
>>> def get_column(B, C):
...     return int((B == "T" and C > 20) or (B == "L" and C == 0))
...
>>> fun = udf(get_column)
>>> res = df.withColumn("D", fun(df['B'], df['C']))
>>> res.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 1|
| a| L| 0| 1|
| a| L| 1| 0|
+---+---+---+---+
SCALA
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> df.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 0|
| a| L| 0| 0|
| a| L| 1| 0|
+---+---+---+---+
scala> def get_column(B : String, C : Int) : Int = {
| if((B == "T" && C > 20) || (B == "L" && C == 0))
| 1
| else
| 0
| }
get_column: (B: String, C: Int)Int
scala> val fun = udf(get_column _)
fun: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,Some(List(StringType, IntegerType)))
scala> val res = df.withColumn("D", fun(df("B"), df("C")))
res: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields]
scala> res.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 1|
| a| L| 0| 1|
| a| L| 1| 0|
+---+---+---+---+
You can also use when and otherwise, like this:
PYTHON
>>> df.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 0|
| a| L| 0| 0|
| a| L| 1| 0|
+---+---+---+---+
>>> from pyspark.sql.functions import when, col
>>> new_column = when(
...     (col("B") == "T") & (col("C") > 20), 1
... ).when((col("B") == "L") & (col("C") == 0), 1).otherwise(0)
>>> res = df.withColumn("D", new_column)
>>> res.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 1|
| a| L| 0| 1|
| a| L| 1| 0|
+---+---+---+---+
SCALA
scala> df.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 0|
| a| L| 0| 0|
| a| L| 1| 0|
+---+---+---+---+
scala> val new_column = when(
| col("B") === "T" && col("C") > 20, 1
| ).when(col("B") === "L" && col("C") === 0, 1 ).otherwise(0)
new_column: org.apache.spark.sql.Column = CASE WHEN ((B = T) AND (C > 20)) THEN 1 WHEN ((B = L) AND (C = 0)) THEN 1 ELSE 0 END
scala> df.withColumn("D", new_column).show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 1|
| a| L| 0| 1|
| a| L| 1| 0|
+---+---+---+---+
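Since the question says the values, operators, and number of conditions are only known at run-time, one way to extend either answer is to build the Column expression from a list of conditions and fold them together. A minimal sketch in PySpark (the conditions list and its contents here are purely illustrative, not from the original answers):
from functools import reduce
from operator import or_
from pyspark.sql import functions as F

# Build each condition as a Column at run-time, then OR them together.
conditions = [
    (F.col("B") == "T") & (F.col("C") > 20),
    (F.col("B") == "L") & (F.col("C") == 0),
]
combined = reduce(or_, conditions)

res = df.withColumn("D", F.when(combined, 1).otherwise(0))
res.show()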