Select specific rows from Spark dataframe per grouping - scala

I have a dataframe like that:
+-----------------+------------------+-----------+--------+---+
| conversation_id | message_body | timestamp | sender | |
+-----------------+------------------+-----------+--------+---+
| A | hi | 9:00 | John | |
| A | how are you? | 10:00 | John | |
| A | can we meet? | 10:05 | John | * |
| A | not bad | 10:30 | Steven | * |
| A | great | 10:40 | John | |
| A | yeah, let's meet | 10:35 | Steven | |
| B | Hi | 12:00 | Anna | * |
| B | Hello | 12:05 | Ken | * |
+-----------------+------------------+-----------+--------+---+
For each conversation I would like to have the last message in the first block of the 1st sender and the first message of the 2nd sender. I marked them with an asterisk.
One idea that I had is to assign 0s for the first user and 1s for the second user.
Ideally I would like to have something like that:
+-----------------+---------+------------+--------------+---------+------------+----------+
| conversation_id | sender1 | timestamp1 | message1 | sender2 | timestamp2 | message2 |
+-----------------+---------+------------+--------------+---------+------------+----------+
| A | John | 10:05 | can we meet? | Steven | 10:30 | not bad |
| B | Anna | 12:00 | Hi | Ken | 12:05 | Hello |
+-----------------+---------+------------+--------------+---------+------------+----------+
How could I do that in Spark?

Interesting issues arose.
Amended 10:35 to 10:45
Used leading 0 format e.g. 09:00 instead of 9:00
You will need to use your own data types accordingly, this simply demonstrates the approach needed
Done in DataBricks Notebook
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df =
Seq(("A", "hi", "09:00", "John"), ("A", "how are you?", "10:00", "John"),
("A", "can we meet?", "10:05", "John"), ("A", "not bad", "10:30", "Steven"),
("A", "great", "10:40", "John"), ("A", "yeah, let's meet", "10:45", "Steven"),
("B", "Hi", "12:00", "Anna"), ("B", "Hello", "12:05", "Ken")
).toDF("conversation_id", "message_body", "timestampX", "sender")
// Get rank, 1 is who were initiates conversation, the other values can be used to infer relationships
// Note no #Transient required here with Window
val df2 = df.withColumn("rankC", row_number().over(Window.partitionBy($"conversation_id").orderBy($"timestampX".asc)))
// A value <> 1 is the first message of second Sender.
// The 1 value of this is the last message of first "block"
val df3 = df2.select('conversation_id as "df3_conversation_id", 'sender as "df3_sender", 'rankC as "df3_rank")
// To avoid pipelined renaming issues that occur
val df3a = df3.groupBy("df3_conversation_id", "df3_sender").agg(min("df3_rank") as "rankC2").filter("rankC2 != 1")
//JOIN the values with some smarts. Some odd errors in Spark thru pipe-lining gotten. Need to drop pipelined row(), ranking etc.
val df4 = df3a.join(df2, (df3a("df3_conversation_id") === df2("conversation_id")) && (df3a("rankC2") === df2("rankC") + 1)).drop("rankC").drop("rankC2")
val df4a = df3a.join(df2, (df3a("df3_conversation_id") === df2("conversation_id")) && (df3a("rankC2") === df2("rankC"))).drop("rankC").drop("rankC2")
// The get other missing data, could have all been combined but done in steps for simplicity. Just a simple final JOIN and you ahve the answer.
val df5 = df4.join(df4a, (df4("df3_conversation_id") === df4a("df3_conversation_id")))
df5.show(false)
returns:
Output will not completely format here, run it in REPL to see titles.
|B |Ken |B |Hi |12:00 |Anna |B |Ken |B |Hello |12:05 |Ken |
|A |Steven |A |can we meet?|10:05 |John |A |Steven |A |not bad |10:30 |Steven|
You can further manipulate the data, the heavy lifting is done now. The Catalyst Optimizer has some issues compiling etc. so this is why I worked around in this fashion.

Related

PySpark Return Exact Match from list of strings

I have a dataset as follows:
| id | text |
--------------
| 01 | hello world |
| 02 | this place is hell |
I also have a list of keywords I'm search for:
Keywords = ['hell', 'horrible', 'sucks']
When using the following solution using .rlike() or .contains(), sentences with either partial and exact matches to the list of words are returned to be true. I would like only exact matches to be returned.
Current code:
KEYWORDS = 'hell|horrible|sucks'
df = (
df
.select(
F.col('id'),
F.col('text'),
F.when(F.col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
)
)
Current output:
| id | text | keyword_found |
-------------------------------
| 01 | hello world | 1 |
| 02 | this place is hell | 1 |
Expected output:
| id | text | keyword_found |
--------------------------------
| 01 | hello world | 0 |
| 02 | this place is hell | 1 |
Try below code, I have just change the Keyword only :
from pyspark.sql.functions import col,when
data = [["01","hello world"],["02","this place is hell"]]
schema =["id","text"]
df2 = spark.createDataFrame(data, schema)
df2.show()
+---+------------------+
| id| text|
+---+------------------+
| 01| hello world|
| 02|this place is hell|
+---+------------------+
KEYWORDS = '(hell|horrible|sucks)$'
df = (
df2
.select(
col('id'),
col('text'),
when(col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
)
)
df.show()
+---+------------------+-------------+
| id| text|keyword_found|
+---+------------------+-------------+
| 01| hello world| 0|
| 02|this place is hell| 1|
+---+------------------+-------------+
Let me know if you need more help on this.
This should work
Keywords = 'hell|horrible|sucks'
df = (df.select(F.col('id'),F.col('text'),F.when(F.col('text').rlike('('+Keywords+')(\s|$)').otherwise(0).alias('keyword_found')))
id
text
keyword_found
01
hello world
0
02
this place is hell
1

How to get column with list of values from another column in Pyspark

Can someone help me with any idea how to create pyspark DataFrame with all Recepients of each person?
For example:
Input DataFrame:
+------+---------+
|Sender|Recepient|
+------+---------+
|Alice | Bob |
|Alice | John |
|Alice | Mike |
|Bob | Tom |
|Bob | George |
|George| Alice |
|George| Bob |
+------+---------+
Result:
+------+------------------+
|Sender|Recepients |
+------+------------------+
|Alice |[Bob, John, Mike] |
|Bob |[Tom, George] |
|George|[Alice, Bob] |
+------+------------------+
I tried df.groupBy("Sender").sum("Recepients") to get string and split it but had the error Aggregation function can only be applied on a numeric column.
All you need was to do was a groupBy Sender column and collect the Recepient.
Below is the full solution
# create a data frame
df = spark.createDataFrame(
[
("Alice","Bob"),
("Alice","John"),
("Alice","Mike"),
("Bob","Tom"),
("Bob","George"),
("George","Alice"),
("George","Bob")],
("sender","Recepient"))
df.show()
# results below
+------+---------+
|Sender|Recepient|
+------+---------+
|Alice | Bob |
|Alice | John |
|Alice | Mike |
|Bob | Tom |
|Bob | George |
|George| Alice |
|George| Bob |
+------+---------+
# Import functions
import pyspark.sql.functions as f
# perform a groupBy and use collect_list
df1 = df.groupby("Sender").agg(f.collect_list('Recepient').alias('Recepients'))
df1.show()
# results
+------+------------------+
|Sender|Recepients |
+------+------------------+
|Alice |[Bob, John, Mike] |
|Bob |[Tom, George] |
|George|[Alice, Bob] |
+------+------------------+

how to find which date the consecutive column status "Complete" started with in a 7day period

I need to get a date from below input on which there is a consecutive 'complete' status for past 7 days from that given date.
Requirement:
1. go Back 8 days (this is easy)
2. So we are on 20190111 from below data frame, I need to check day by day from 20190111 to 20190104 (7 day period) and get a date on which status has 'complete' for consecutive 7 days. So we should get 20190108
I need this in spark-scala.
input
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
| 9|20190109| pending|
| 10|20190110|complete|
| 11|20190111|complete|
| 12|20190112| pending|
| 13|20190113|complete|
| 14|20190114|complete|
| 15|20190115| pending|
| 16|20190116| pending|
| 17|20190117| pending|
| 18|20190118| pending|
| 19|20190119| pending|
+---+--------+--------+
output
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
output
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
for >= spark 2.4
import org.apache.spark.sql.expressions.Window
val df= Seq((1,"20190101","complete"),(2,"20190102","complete"),
(3,"20190103","complete"),(4,"20190104","complete"), (5,"20190105","complete"),(6,"20190106","complete"),(7,"20190107","complete"),(8,"20190108","complete"),
(9,"20190109", "pending"),(10,"20190110","complete"),(11,"20190111","complete"),(12,"20190112", "pending"),(13,"20190113","complete"),(14,"20190114","complete"),(15,"20190115", "pending") , (16,"20190116", "pending"),(17,"20190117", "pending"),(18,"20190118", "pending"),(19,"20190119", "pending")).toDF("id","date","status")
val df1= df.select($"id", to_date($"date", "yyyyMMdd").as("date"), $"status")
val win = Window.orderBy("id")
coalesce lag_status and status to remove null
val df2= df1.select($"*", lag($"status",1).over(win).as("lag_status")).withColumn("lag_stat", coalesce($"lag_status", $"status")).drop("lag_status")
create integer columns to denote if staus for current day is equal to status for previous days
val df3=df2.select($"*", ($"status"===$"lag_stat").cast("integer").as("status_flag"))
val win1= Window.orderBy($"id".desc).rangeBetween(0,7)
val df4= df3.select($"*", sum($"status_flag").over(win1).as("previous_7_sum"))
val df_new= df4.where($"previous_7_sum"===8).select($"date").select(explode(sequence(date_sub($"date",7), $"date")).as("date"))
val df5=df4.join(df_new, Seq("date"), "inner").select($"id", concat_ws("",split($"date".cast("string"), "-")).as("date"), $"status")
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
+---+--------+--------+
for spark < 2.4
use udf instead of built in array function "sequence"
val df1= df.select($"id", $"date".cast("integer").as("date"), $"status")
val win = Window.orderBy("id")
coalesce lag_status and status to remove null
val df2= df1.select($"*", lag($"status",1).over(win).as("lag_status")).withColumn("lag_stat", coalesce($"lag_status", $"status")).drop("lag_status")
create integer columns to denote if staus for current day is equal to status for previous days
val df3=df2.select($"*", ($"status"===$"lag_stat").cast("integer").as("status_flag"))
val win1= Window.orderBy($"id".desc).rangeBetween(0,7)
val df4= df3.select($"*", sum($"status_flag").over(win1).as("previous_7_sum"))
val ud1= udf((col1:Int) => {
((col1-7).to(col1 )).toArray})
val df_new= df4.where($"previous_7_sum"===8)
.withColumn("dt_arr", ud1($"date"))
.select(explode($"dt_arr" ).as("date"))
val df5=df4.join(df_new, Seq("date"), "inner").select($"id", concat_ws("",split($"date".cast("string"), "-")).as("date"), $"status")

How filter one big dataframe many times(equal to small df‘s row count) by another small dataframe(row by row) ?

I have two spark dataframe,dfA and dfB.
I want to filter dfA by dfB's each row, which means if dfB have 10000 rows, i need to filter dfA 10000 times with 10000 different filter conditions generated by dfB. Then, after each filter i need to collect the filter result as a column in dfB.
dfA dfB
+------+---------+---------+ +-----+-------------+--------------+
| id | value1 | value2 | | id | min_value1 | max_value1 |
+------+---------+---------+ +-----+-------------+--------------+
| 1 | 0 | 4345 | | 1 | 0 | 3 |
| 1 | 1 | 3434 | | 1 | 5 | 9 |
| 1 | 2 | 4676 | | 2 | 1 | 4 |
| 1 | 3 | 3454 | | 2 | 6 | 8 |
| 1 | 4 | 9765 | +-----+-------------+--------------+
| 1 | 5 | 5778 | ....more rows, nearly 10000 rows.
| 1 | 6 | 5674 |
| 1 | 7 | 3456 |
| 1 | 8 | 6590 |
| 1 | 9 | 5461 |
| 1 | 10 | 4656 |
| 2 | 0 | 2324 |
| 2 | 1 | 2343 |
| 2 | 2 | 4946 |
| 2 | 3 | 4353 |
| 2 | 4 | 4354 |
| 2 | 5 | 3234 |
| 2 | 6 | 8695 |
| 2 | 7 | 6587 |
| 2 | 8 | 5688 |
+------+---------+---------+
......more rows,nearly one billons rows
so my expected result is
resultDF
+-----+-------------+--------------+----------------------------+
| id | min_value1 | max_value1 | results |
+-----+-------------+--------------+----------------------------+
| 1 | 0 | 3 | [4345,3434,4676,3454] |
| 1 | 5 | 9 | [5778,5674,3456,6590,5461] |
| 2 | 1 | 4 | [2343,4946,4353,4354] |
| 2 | 6 | 8 | [8695,6587,5688] |
+-----+-------------+--------------+----------------------------+
My stupid solutions is
def tempFunction(id:Int,dfA:DataFrame,dfB:DataFrame): DataFrame ={
val dfa = dfA.filter("id ="+ id)
val dfb = dfB.filter("id ="+ id)
val arr = dfb.groupBy("id")
.agg(collect_list(struct("min_value1","max_value1"))
.collect()
val rangArray = arr(0)(1).asInstanceOf[Seq[Row]] // get range array of id
// initial a resultDF to store each query's results
val min_value1 = rangArray(0).get(0).asInstanceOf[Int]
val max_value1 = rangArray(0).get(1).asInstanceOf[Int]
val s = "value1 between "+min_value1+" and "+ max_value1
var resultDF = dfa.filter(s).groupBy("id")
.agg(collect_list("value1").as("results"),
min("value1").as("min_value1"),
max("value1").as("max_value1"))
for( i <-1 to timePairArr.length-1){
val temp_min_value1 = rangArray(0).get(0).asInstanceOf[Int]
val temp_max_value1 = rangArray(0).get(1).asInstanceOf[Int]
val query = "value1 between "+temp_min_value1+" and "+ temp_max_value1
val tempResultDF = dfa.filter(query).groupBy("id")
.agg(collect_list("value1").as("results"),
min("value1").as("min_value1"),
max("value1").as("max_value1"))
resultDF = resultDF.union(tempResultDF)
}
return resultDF
}
def myFunction():DataFrame = {
val dfA = spark.read.parquet(routeA)
val dfB = spark.read.parquet(routeB)
val idArrays = dfB.select("id").distinct().collect()
// initial result
var resultDF = tempFunction(idArrays(0).get(0).asInstanceOf[Int],dfA,dfB)
//tranverse all id
for(i<-1 to idArrays.length-1){
val tempDF = tempFunction(idArrays(i).get(0).asInstanceOf[Int],dfA,dfB)
resultDF = resultDF.union(tempDF)
}
return resultDF
}
Maybe you don't want to see my brute force code.it's idea is
finalResult = null;
for each id in dfB:
for query condition of this id:
tempResult = query dfA
union tempResult to finalResult
I've tried my algorithms, it cost almost 50 hours.
Does anybody has a more efficient way ? Very thanks.
Assuming that your DFB is small dataset, I am trying to give the below solution.
Try using a Broadcast Join like below
import org.apache.spark.sql.functions.broadcast
dfA.join(broadcast(dfB), col("dfA.id") === col("dfB.id") && col("dfA.value1") >= col("dfB.min_value1") && col("dfA.value1") <= col("dfB.max_value1")).groupBy(col("dfA.id")).agg(collect_list(struct("value2").as("results"));
BroadcastJoin is like a Map Side Join. This will materialize the smaller data to all the mappers. This will improve the performance by omitting the required sort-and-shuffle phase during a reduce step.
Some points i would like you to avoid:
Never use collect(). When a collect operation is issued on a RDD, the dataset is copied to the driver.
If your data is too big you might get memory out of bounds exception.
Try using take() or takeSample() instead.
It is obvious that when two dataframes/datasets are involved in calculation then join should be performed. So join is a must step for you. But when should you join is the important question.
I would suggest to aggregate and reduce rows in dataframes as much as possible before joining as it would reduce shuffling.
In your case you can reduce only dfA as you need exact dfB with a column added from dfA meeting the condition
So you can groupBy id and aggregate dfA so that you get one row of each id, then you can perform the join. And then you can use a udf function for your logic of calculation
comments are provided for clarity and explanation
import org.apache.spark.sql.functions._
//udf function to filter only the collected value2 which has value1 within range of min_value1 and max_value1
def selectRangedValue2Udf = udf((minValue: Int, maxValue: Int, list: Seq[Row])=> list.filter(row => row.getAs[Int]("value1") <= maxValue && row.getAs[Int]("value1") >= minValue).map(_.getAs[Int]("value2")))
dfA.groupBy("id") //grouping by id
.agg(collect_list(struct("value1", "value2")).as("collection")) //collecting all the value1 and value2 as structs
.join(dfB, Seq("id"), "right") //joining both dataframes with id
.select(col("id"), col("min_value1"), col("max_value1"), selectRangedValue2Udf(col("min_value1"), col("max_value1"), col("collection")).as("results")) //calling the udf function defined above
which should give you
+---+----------+----------+------------------------------+
|id |min_value1|max_value1|results |
+---+----------+----------+------------------------------+
|1 |0 |3 |[4345, 3434, 4676, 3454] |
|1 |5 |9 |[5778, 5674, 3456, 6590, 5461]|
|2 |1 |4 |[2343, 4946, 4353, 4354] |
|2 |6 |8 |[8695, 6587, 5688] |
+---+----------+----------+------------------------------+
I hope the answer is helpful

Spark SQL Map only one column of DataFrame

Sorry for the noob question, I have a dataframe in SparkSQL like this:
id | name | data
----------------
1 | Mary | ABCD
2 | Joey | DOGE
3 | Lane | POOP
4 | Jack | MEGA
5 | Lynn | ARGH
I want to know how to do two things:
1) use a scala function on one or more columns to produce another column
2) use a scala function on one or more columns to replace a column
Examples:
1) Create a new boolean column that tells whether the data starts with A:
id | name | data | startsWithA
------------------------------
1 | Mary | ABCD | true
2 | Joey | DOGE | false
3 | Lane | POOP | false
4 | Jack | MEGA | false
5 | Lynn | ARGH | true
2) Replace the data column with its lowercase counterpart:
id | name | data
----------------
1 | Mary | abcd
2 | Joey | doge
3 | Lane | poop
4 | Jack | mega
5 | Lynn | argh
What is the best way to do this in SparkSQL? I've seen many examples of how to return a single transformed column, but I don't know how to get back a new DataFrame with all the original columns as well.
You can use withColumn to add new column or to replace the existing column
as
val df = Seq(
(1, "Mary", "ABCD"),
(2, "Joey", "DOGE"),
(3, "Lane", "POOP"),
(4, "Jack", "MEGA"),
(5, "Lynn", "ARGH")
).toDF("id", "name", "data")
val resultDF = df.withColumn("startsWithA", $"data".startsWith("A"))
.withColumn("data", lower($"data"))
If you want separate dataframe then
val resultDF1 = df.withColumn("startsWithA", $"data".startsWith("A"))
val resultDF2 = df.withColumn("data", lower($"data"))
withColumn replaces the old column if the same column name is provided and creates a new column if new column name is provided.
Output:
+---+----+----+-----------+
|id |name|data|startsWithA|
+---+----+----+-----------+
|1 |Mary|abcd|true |
|2 |Joey|doge|false |
|3 |Lane|poop|false |
|4 |Jack|mega|false |
|5 |Lynn|argh|true |
+---+----+----+-----------+