Spark Structured Streaming operations on rows of a single dataframe - scala

In my problem, there is a data stream of information about package delivery coming in. The data consists of "NumberOfPackages", "Action" (which can be "Loaded", "Delivered" or "In Transit"), "TrackingId", and "Driver".
val streamingData = <filtered data frame based on "Loaded" and "Delivered" Action types only>
The goal is to look at the number of packages at the moment of loading and at the moment of delivery, and if they are not the same, execute a function that would call a REST service with the "TrackingId" as a parameter.
The data looks like this:
+----------------+---------+----------+------+
|NumberOfPackages|Action   |TrackingId|Driver|
+----------------+---------+----------+------+
|5               |Loaded   |a         |Alex  |
|5               |Delivered|a         |Alex  |
|8               |Loaded   |b         |James |
|8               |Delivered|b         |James |
|7               |Loaded   |c         |Mark  |
|3               |Delivered|c         |Mark  |
+----------------+---------+----------+------+
<...more rows in this streaming data frame...>
In this case, we see that for the "TrackingId" equal to "c", the number of packages loaded and delivered isn't the same, so this is where we'd need to call the REST API with the "TrackingId".
I would like to combine rows based on "TrackingId", which will always be unique for each trip. If we get the rows combined based on this tracking id, we could have two columns for number of packages, something like "PackagesAtLoadTime" and "PackagesAtDeliveryTime". Then we could compare these two values for each row and filter the dataframe by those which are not equal.
So far I have tried the groupByKey method with the "TrackingId", but I couldn't find a similar example and my experimental attempts weren't successful.
After I figure out how to "merge" the two rows with the same tracking id together and have a column for each corresponding count of packages, I could define a UDF:
def notEqualPackages = udf((packagesLoaded: Int, packagesDelivered: Int) => packagesLoaded!=packagesDelivered)
And use it to filter the rows of the dataframe to contain only those with non-matching numbers:
streamingData.where(notEqualPackages(streamingData("packagesLoaded"), streamingData("packagesDelivered")))
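One way the "merge" step might be expressed (a sketch only, assuming the column names from the example above and that the stream runs in an output mode that supports aggregation) is to group by "TrackingId" and use conditional aggregation:

import org.apache.spark.sql.functions._
import spark.implicits._ // assuming `spark` is the active SparkSession

// Collapse the two rows per trip into one row with a column per Action.
// when() without otherwise() yields null for the non-matching action,
// and max() ignores nulls, so each aggregate picks up the single value.
val combined = streamingData
  .groupBy($"TrackingId")
  .agg(
    max(when($"Action" === "Loaded", $"NumberOfPackages")).as("PackagesAtLoadTime"),
    max(when($"Action" === "Delivered", $"NumberOfPackages")).as("PackagesAtDeliveryTime")
  )

// With the counts side by side, a plain column comparison replaces the UDF.
val mismatched = combined.where($"PackagesAtLoadTime" =!= $"PackagesAtDeliveryTime")

Depending on the sink, a watermark on an event-time column may also be needed for the streaming aggregation, and with this shape the notEqualPackages UDF isn't strictly necessary.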

Related

How can I apply QuantileDiscretizer on a groupBy object in PySpark?

I want to compute Quantiles on a dataframe after grouping it.
This is my sample dataframe:
|id |shop|amount|
|:--|:--:|-----:|
|1  |A   |100   |
|2  |B   |200   |
|3  |A   |125   |
|1  |A   |25    |
|2  |B   |220   |
|3  |A   |110   |
I want to bin the amount into low, medium and high, based on each shop.
So, I would group my dataframe like this.
shop_groups= df.groupBy('shop')
The mistake I originally made was applying the QuantileDiscretizer to the whole amount column as is, without grouping it by shop.
How can I do this on shop_groups?
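One possible workaround, sketched in Scala to match the rest of this page even though the question is PySpark: as far as I know QuantileDiscretizer fits on a whole column, so per-shop binning can instead be approximated with ntile over a window partitioned by shop.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank amounts within each shop and split them into three equal-sized tiles.
val byShop = Window.partitionBy("shop").orderBy("amount")

val binned = df
  .withColumn("tile", ntile(3).over(byShop))
  .withColumn("amount_bin",
    when(col("tile") === 1, "low")
      .when(col("tile") === 2, "medium")
      .otherwise("high"))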

Delta table update if column exists otherwise add as a new column

I have a delta table as shown below. There is an id column and columns for counts of signals. There will be around 300 signals, so the number of columns in the final delta table will be approximately 601.
The requirement is to have a single record for each id_no in the final table.
+-----+-----------+-----------------+-----------+-----------------+-----------+-----------------+-----------------+..............+------------------+------------------+----------------------------+
|id_no|sig01_total|sig01_valid_total|sig02_total|sig02_valid_total|sig03_total|sig03_valid_total|sig03_total_valid|..............|sig300_valid_total|sig300_total_valid|load_timestamp              |
+-----+-----------+-----------------+-----------+-----------------+-----------+-----------------+-----------------+..............+------------------+------------------+----------------------------+
|050  |25         |23               |45         |43               |66         |60               |55               |..............|60                |55                |2021-08-10T16:58:30.054+0000|
|051  |78         |70               |15         |14               |10         |10               |9                |..............|10                |9                 |2021-08-10T16:58:30.054+0000|
|052  |88         |88               |75         |73               |16         |13               |13               |..............|13                |13                |2021-08-10T16:58:30.054+0000|
+-----+-----------+-----------------+-----------+-----------------+-----------+-----------------+-----------------+..............+------------------+------------------+----------------------------+
I could perform the upsert based on the id_no using the Delta merge option as shown below.
targetDeltaTable.alias("t")
  .merge(
    sourceDataFrame.alias("s"),
    "t.id_no = s.id_no")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
This code will execute weekly and the final table has to be updated weekly. Not all signals might be available in every week's data.
But the issue with merging is: if the id_no is available, then I also have to check whether the (signal) columns are available in the existing table.
If the signal column exists, then I have to add the new count to the existing count.
If the signal column does not exist for that particular id_no, then I have to add that signal as a new column to the existing row.
I have tried to execute the Delta table upsert command as shown below, but since the set of columns is not static for every data load, and considering the huge number of columns, it did not succeed.
DeltaTable.forPath(spark, "/data/events/target/")
  .as("t")
  .merge(
    updatesDF.as("s"),
    "t.id_no = s.id_no")
  .whenMatched
  .updateExpr( // some condition needed here to check whether the column exists or not?
    Map(
      "sig01_total" -> "t.sig01_total + s.sig01_total"
      // -> ........ (entries for the remaining signal columns)
    ))
  .whenNotMatched
  .insertAll()
  .execute()
How can I achieve this requirement? Any leads appreciated!
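For what it's worth, a sketch of one possible direction (not a verified solution): build the updateExpr map dynamically from the columns that are actually present in this week's source dataframe, summing with the target value when the target already has the column. The coalesce and load_timestamp handling below are assumptions, and if the target table is missing some of the new signal columns you would also need Delta schema evolution (or an explicit ALTER TABLE) for the merge to accept them.

import io.delta.tables.DeltaTable

val targetDeltaTable = DeltaTable.forPath(spark, "/data/events/target/")
val targetCols = targetDeltaTable.toDF.columns.toSet

// Only the signal columns that arrived in this week's load.
val signalCols = updatesDF.columns.filter(c => c != "id_no" && c != "load_timestamp")

val updateExprs: Map[String, String] = signalCols.map { c =>
  if (targetCols.contains(c))
    c -> s"coalesce(t.$c, 0) + coalesce(s.$c, 0)" // column already exists: add the counts
  else
    c -> s"s.$c"                                  // new signal column for this load
}.toMap + ("load_timestamp" -> "s.load_timestamp")

targetDeltaTable.as("t")
  .merge(updatesDF.as("s"), "t.id_no = s.id_no")
  .whenMatched()
  .updateExpr(updateExprs)
  .whenNotMatched()
  .insertAll()
  .execute()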

Executing multiple spark queries and storing as dataframe

I have 3 Spark queries saved in a List - sqlQueries. The first two of them create global temporary views, and the third one executes on those temporary views and fetches some output.
I am able to run a single query using this -
val resultDF = spark.sql(sql)
Then I add partition information on this dataframe object and save it.
In case of multiple queries, I tried executing
sqlQueries.foreach(query => spark.sql(query))
How do I save the output of the third query while still running the other two queries first?
I have 3 queries just as an example; it can be any number.
You can write the last query as an insert statement to save the results into a table. You are executing the queries through foreach, which will execute them sequentially.
I am taking the query from your other question as a reference; it needs some modification, as explained in the global temporary view part of the SQL section.
After modification, your query file should look like:
CREATE GLOBAL TEMPORARY VIEW VIEW_1 AS select a,b from abc
CREATE GLOBAL TEMPORARY VIEW VIEW_2 AS select a,b from global_temp.VIEW_1
select * from global_temp.VIEW_2
Then, to answer this question: you can use foldLeft so that all of the queries are executed in order.
Lets say you have a dataframe
+----+---+---+
|a   |b  |c  |
+----+---+---+
|a   |b  |1  |
|adfs|df |2  |
+----+---+---+
And given the multi-line query file above, you can do the following
df.createOrReplaceTempView("abc")
val sqlFile = "path to test.sql"
// read the query file, dropping blank lines
val queryList = scala.io.Source.fromFile(sqlFile).getLines().filterNot(_.isEmpty).toList
// run each query in order; the dataframe returned by the last query is what foldLeft yields
val finalresult = queryList.foldLeft(df)((tempdf, query) => sqlContext.sql(query))
finalresult.show(false)
which should give you
+----+---+
|a   |b  |
+----+---+
|a   |b  |
|adfs|df |
+----+---+
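If the final dataframe also needs to be persisted from Scala (as mentioned for the single-query case), something like the following would do it; the partition column and output path here are placeholders:

// finalresult comes from the foldLeft above; adjust mode, partition column and path as needed.
finalresult.write
  .mode("overwrite")
  .partitionBy("partition_col")
  .parquet("/path/to/output")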

Flip each bit in Spark dataframe calling a custom function

I have a spark Dataframe that looks like
ID|col1|col2|col3|col4.....
A |0   |1   |0   |0....
C |1   |0   |0   |0.....
E |1   |0   |1   |1......
ID is a unique key and the other columns have binary values 0/1.
Now, I want to iterate over each row, and if a column value is 0, I want to apply some function, passing this single row as a data frame to that function.
For example, col1 == 0 in the above data frame for ID A.
Now the DF for that row should look like:
newDF.show()
ID|col1|col2|col3|col4.....
A |1   |1   |0   |0....
myfunc(newDF)
The next 0 is encountered at col3 for ID A, so the new DF looks like:
newDF.show()
ID|col1|col2|col3|col4.....
A |0   |1   |1   |0....
val max=myfunc(newDF) //function returns a double.
and so on...
Note: each 0 bit is flipped once at row level for the function call, resetting the effect of the last flipped bit.
P.S.: I tried using withColumn calling a UDF, but ran into serialisation issues from having a DF inside a DF.
Actually, the myfunc I'm calling sends the data for scoring to an ML model I have, which returns the probability for that user if a particular bit is flipped. So I have to iterate through each column set to 0 and set it to 1 for that particular instance.
I'm not sure you need anything particularly complex for this. Given that you have imported the SQL functions and the session implicits
val spark: SparkSession = ??? // your session
import spark.implicits._
import org.apache.spark.sql.functions._
you should be able to "flip the bits" (although I'm assuming those are actually encoded as numbers) by applying the following function
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))
as in this example
df.select($"ID", flip($"col1") as "col1", flip($"col2") as "col2")
You can easily rewrite the flip function to deal with edge cases or use a different type (if, for example, the "bits" are encoded as booleans or strings).
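For example, a sketch (using the same imports and the flip function above) of applying flip to every column except the ID key without listing each column by hand:

// Flip every bit column, keeping ID untouched; column names follow the question's example.
val flipped = df.select(
  col("ID") +: df.columns.filterNot(_ == "ID").map(c => flip(col(c)).as(c)): _*
)

Restricting the change to a single column at a time, as in the original question, would mean looping over the column names and building one such select per column.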

Calculate depart flights from sorted data using Spark

I've a dataset of flights in the form of
+--------+----------+----------+
|flightID|depart_ts |arrival_ts|
+--------+----------+----------+
|1       |1451603468|1451603468|
|2       |1451603468|1451603468|
|3       |1451603468|1451603468|
|4       |1451603468|1451603468|
|5       |1451603468|1451603468|
+--------+----------+----------+
and my job is to use Apache Spark to find the return flight for each flight, given some conditions (the departure time of return flight B should be within 2 hours of the arrival time of flight A). Doing a cross join of 1M records to check these conditions is not efficient and will take a lot of time. I've thought about using a window function with 1 partition and a custom UDAF to do the calculation. Something like this:
1. val flightsWindow = Window.orderBy("depart_ts").rangeBetween(0, 7200)
2. flights.withColumn("returnFlightID", calcReturn( $"arrival_ts", $"depart_ts").over(flightsWindow)).show()
Assuming that this approach will lead to a solution, I'm facing some challenges:
In line 1, I want to let the frame range span from CURRENT ROW to arrival_ts + 7200, but apparently I cannot do dynamic ranges in Spark, no?
In line 1, and assuming that 2 flights have the same arrival time, this will make it impossible to retrieve the values of the second flight when the CURRENT_ROW pointer moves there, since the difference between the first flight and the second flight is 0. Is it possible to explicitly tell the range to start framing from CURRENT_ROW?
In line 2, I want to retrieve the depart_ts value for the very first row of the frame to compare against the other flights in the frame. Is it possible to do that? I tried the first() function, but it doesn't fit my case.
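Not an answer to the window questions themselves, but a sketch of one alternative that avoids both the cross join and a single-partition window: a self-join restricted to coarse 2-hour time buckets. The dataframe name and column names below follow the sample; any extra matching conditions from the real data would go into the where clause.

import org.apache.spark.sql.functions._
import spark.implicits._ // assuming `spark` is the active SparkSession

val bucket = 7200L // 2 hours in seconds

// Duplicate each arrival into its own bucket and the next one, so any departure
// within 2 hours is guaranteed to share a bucket with it.
val arrivals = flights
  .select($"flightID".as("out_id"), $"arrival_ts")
  .withColumn("b", explode(array(floor($"arrival_ts" / bucket), floor($"arrival_ts" / bucket) + 1)))

val departures = flights
  .select($"flightID".as("ret_id"), $"depart_ts")
  .withColumn("b", floor($"depart_ts" / bucket))

// Only flights in the same coarse bucket are compared, then the exact 2-hour
// condition is applied.
val candidates = arrivals
  .join(departures, "b")
  .where($"ret_id" =!= $"out_id" &&
         $"depart_ts" >= $"arrival_ts" &&
         $"depart_ts" <= $"arrival_ts" + bucket)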