Calculate depart flights from sorted data using Spark - scala

I have a dataset of flights in the following form:
+--------+----------+----------+
|flightID|depart_ts |arrival_ts|
+--------+----------+----------+
|1       |1451603468|1451603468|
|2       |1451603468|1451603468|
|3       |1451603468|1451603468|
|4       |1451603468|1451603468|
|5       |1451603468|1451603468|
+--------+----------+----------+
and my job is to use Apache Spark to find the return flight for each flight, given some conditions (the departure time of return flight B should be within 2 hours of the arrival time of flight A). A cross join of 1M records to check these conditions is not efficient and will take a lot of time. I've thought about using a window function with a single partition and a custom UDAF to do the calculation, something like this:
1. val flightsWindow = Window.orderBy("depart_ts").rangeBetween(0, 7200)
2. flights.withColumn("returnFlightID", calcReturn( $"arrival_ts", $"depart_ts").over(flightsWindow)).show()
Assuming this approach will lead to a solution, I'm facing some challenges:
In line 1, I want the frame range to span from the CURRENT ROW to arrival_ts + 7200, but apparently I cannot define dynamic ranges in Spark, can I?
In line 1, assuming two flights have the same arrival time, it will be impossible to retrieve the values of the second flight when the CURRENT_ROW pointer moves there, since the difference between the first and second flight is 0. Is it possible to explicitly tell the range to start framing from CURRENT_ROW?
In line 2, I want to retrieve the depart_ts value for the very first row of the frame to compare against the other flights in the frame. Is it possible to do that? I tried the first() function, but it doesn't fit my case.
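For what it's worth, one way to avoid the full cross join without window functions is a bucketed self-join: bucket both sides by the allowed gap (7200 s), join each flight only against departures in its own bucket or the next one, and then apply the exact 2-hour predicate. This is only a sketch, assuming flights has the schema shown above and that the time condition is the only matching condition (any route matching would have to be added to the join):

import org.apache.spark.sql.functions._

// Bucket size equal to the allowed gap (7200 s): a qualifying return departure can
// only fall in the same bucket as the arrival or in the next one.
val bucketSize = 7200L

val arrivals = flights
  .withColumn("arr_bucket", (col("arrival_ts") / bucketSize).cast("long"))

val departures = flights
  .withColumn("dep_bucket", (col("depart_ts") / bucketSize).cast("long"))
  .select(
    col("flightID").as("returnFlightID"),
    col("depart_ts").as("return_depart_ts"),
    col("dep_bucket"))

// Join only neighbouring buckets, then keep departures within 2 hours of the arrival.
val candidates = arrivals.join(
  departures,
  departures("dep_bucket").between(arrivals("arr_bucket"), arrivals("arr_bucket") + 1) &&
    col("return_depart_ts").between(col("arrival_ts"), col("arrival_ts") + bucketSize) &&
    col("returnFlightID") =!= col("flightID"))

candidates.select("flightID", "returnFlightID", "arrival_ts", "return_depart_ts").show()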

Delta table update if column exists otherwise add as a new column

I have a Delta table as shown below. There is an id_no column and columns for counts of signals. There will be around 300 signals, so the final Delta table will have approximately 601 columns.
The requirement is to have a single record for each id_no in the final table.
+-----+-----------+-----------------+-----------+-----------------+-----------+-----------------+-----------------+...+------------------+------------------+----------------------------+
|id_no|sig01_total|sig01_valid_total|sig02_total|sig02_valid_total|sig03_total|sig03_valid_total|sig03_total_valid|...|sig300_valid_total|sig300_total_valid|load_timestamp              |
+-----+-----------+-----------------+-----------+-----------------+-----------+-----------------+-----------------+...+------------------+------------------+----------------------------+
|050  |25         |23               |45         |43               |66         |60               |55               |...|60                |55                |2021-08-10T16:58:30.054+0000|
|051  |78         |70               |15         |14               |10         |10               |9                |...|10                |9                 |2021-08-10T16:58:30.054+0000|
|052  |88         |88               |75         |73               |16         |13               |13               |...|13                |13                |2021-08-10T16:58:30.054+0000|
+-----+-----------+-----------------+-----------+-----------------+-----------+-----------------+-----------------+...+------------------+------------------+----------------------------+
I could perform the upsert based on the id_no using the Delta merge operation as shown below:
targetDeltaTable.alias("t")
  .merge(
    sourceDataFrame.alias("s"),
    "t.id_no = s.id_no")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
This code will execute weekly, and the final table has to be updated weekly. Not all signals might be available in every week's data.
But the issue with merging is: if the id_no is available, then I also have to check whether the columns (signal columns) exist in the existing table.
If the signal column exists, then I have to add the new count to the existing count.
If the signal column does not exist for that particular id_no, then I have to add that signal as a new column to the existing row.
I have tried to execute the Delta table upsert command as shown below, but since the set of columns is not static for every data load, and considering the huge number of columns, it did not succeed.
DeltaTable.forPath(spark, "/data/events/target/")
  .as("t")
  .merge(
    updatesDF.as("s"),
    "t.id_no = s.id_no")
  .whenMatched
  .updateExpr( // some condition to be applied here to check whether the column exists or not?
    Map(
      "sig01_total" -> "t.sig01_total + s.sig01_total"
      // -> ........ (one entry per signal column)
    ))
  .whenNotMatched
  .insertAll()
  .execute()
How can I achieve this requirement? Any leads appreciated!
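One way this could be handled (a sketch, not a tested solution) is to build the updateExpr map programmatically from whatever signal columns the weekly source actually contains, using coalesce so that a NULL on either side is treated as 0. This assumes the target table already contains all columns that appear in the source; a signal column that is entirely new would need to be added to the target schema beforehand (for example with ALTER TABLE ADD COLUMNS). targetDeltaTable's path and updatesDF are taken from the snippets above.

import io.delta.tables.DeltaTable

// Columns present in this week's source, excluding the key and the timestamp.
val keyCols = Set("id_no", "load_timestamp")
val signalCols = updatesDF.columns.filterNot(keyCols.contains)

// For each signal column, add the incoming count to the existing one;
// coalesce covers a NULL on either side.
val updateMap = signalCols.map { c =>
  c -> s"coalesce(t.`$c`, 0) + coalesce(s.`$c`, 0)"
}.toMap + ("load_timestamp" -> "s.load_timestamp")

DeltaTable.forPath(spark, "/data/events/target/")
  .as("t")
  .merge(updatesDF.as("s"), "t.id_no = s.id_no")
  .whenMatched.updateExpr(updateMap)
  .whenNotMatched.insertAll()
  .execute()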

Spark UDF not giving rolling counts properly

I have a Spark UDF to calculate rolling counts of a column, precise with respect to time. If I need to calculate a rolling count for 24 hours, then, for example, for an entry with time 2020-10-02 09:04:00 I need to look back until 2020-10-01 09:04:00 (very precise).
The rolling count UDF works fine and gives correct counts if I run locally, but when I run on a cluster, it gives incorrect results. Here is the sample input and output:
Input
+---------+-----------------------+
|OrderName|Time |
+---------+-----------------------+
|a |2020-07-11 23:58:45.538|
|a |2020-07-12 00:00:07.307|
|a |2020-07-12 00:01:08.817|
|a |2020-07-12 00:02:15.675|
|a |2020-07-12 00:05:48.277|
+---------+-----------------------+
Expected Output
+---------+-----------------------+-----+
|OrderName|Time |Count|
+---------+-----------------------+-----+
|a |2020-07-11 23:58:45.538|1 |
|a |2020-07-12 00:00:07.307|2 |
|a |2020-07-12 00:01:08.817|3 |
|a |2020-07-12 00:02:15.675|1 |
|a |2020-07-12 00:05:48.277|1 |
+---------+-----------------------+-----+
The last two values are 4 and 5 when I run locally, but on the cluster they are incorrect. My best guess is that the data is being distributed across executors and the UDF is also being called in parallel on each executor. Since one of the parameters to the UDF is a column (the partition key, OrderName in this example), how could I control/correct this behavior on the cluster, if that's the case, so that it calculates the proper counts for each partition? Any suggestions, please.
As per your comment, you want to count, for every record, the total number of records within the preceding 24 hours:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType

// Sample data (guessed from your question)
val df = Seq(("a","2020-07-10 23:58:45.438","1"),("a","2020-07-11 23:58:45.538","1"),("a","2020-07-11 23:58:45.638","1")).toDF("OrderName","Time","Count")

// Build a millisecond-precision epoch value by concatenating the Unix timestamp
// (seconds) with the fractional part of the time string.
val df2 = df.withColumn("unix_time", concat(unix_timestamp($"Time"), split($"Time", "\\.")(1)).cast(LongType))

val noOfMilisecondsDay: Long = 24 * 60 * 60 * 1000

// Create a window per `OrderName` and select rows from `current time - 24 hours` to `current time`
val winSpec = Window.partitionBy("OrderName").orderBy("unix_time").rangeBetween(Window.currentRow - noOfMilisecondsDay, Window.currentRow)

// Finally, perform your COUNT or SUM(Count) as per your need
val finalDf = df2.withColumn("tot_count", count("OrderName").over(winSpec))
// or: val finalDf = df2.withColumn("tot_count", sum("Count").over(winSpec))
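As a side note, and assuming a newer Spark version (3.1+ for unix_millis), the millisecond epoch can also be derived without the string concatenation; on older versions a timestamp cast to double gives epoch seconds with a fractional part:

// Assumes Spark 3.1+ for unix_millis.
val df2a = df.withColumn("unix_time", unix_millis(to_timestamp($"Time")))

// Works on older Spark versions: timestamp cast to double is epoch seconds with fraction.
val df2b = df.withColumn("unix_time",
  ($"Time".cast("timestamp").cast("double") * 1000).cast(LongType))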

Flip each bit in Spark dataframe calling a custom function

I have a Spark DataFrame that looks like:
ID |col1|col2|col3|col4.....
A |0 |1 |0 |0....
C |1 |0 |0 |0.....
E |1 |0 |1 |1......
ID is a unique key and the other columns have binary values 0/1.
Now, I want to iterate over each row and, if a column value is 0, apply some function, passing this single row as a DataFrame to that function.
For example, col1 == 0 in the above DataFrame for ID A.
Now the DataFrame for that row should look like:
newDF.show()
ID |col1|col2|col3|col4.....
A |1 |1 |0 |0....
myfunc(newDF)
The next 0 is encountered at col3 for ID A, so the new DF should look like:
newDF.show()
ID |col1|col2|col3|col4.....
A |0 |1 |1 |0....
val max = myfunc(newDF) // the function returns a double
and so on...
Note: each 0 bit is flipped once at row level per function call, resetting the effect of the previously flipped bit.
P.S.: I tried using withColumn calling a UDF, but ran into serialization issues from using a DataFrame inside a DataFrame.
Actually, the myfunc I'm calling sends the row for scoring to an ML model I have, which returns the probability for that user if a particular bit is flipped. So I have to iterate through each column set to 0 and set it to 1 for that particular instance.
I'm not sure you need anything particularly complex for this. Given that you have imported the SQL functions and the session implicits
val spark: SparkSession = ??? // your session
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
you should be able to "flip the bits" (although I'm assuming those are actually encoded as numbers) by applying the following function
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))
as in this example
df.select($"ID", flip($"col1") as "col1", flip($"col2") as "col2")
You can easily rewrite the flip function to deal with edge cases or to use a different type (if, for example, the "bits" are encoded as booleans or strings).
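If you need to flip every non-ID column without listing each one by hand, a possible extension of the above (just a sketch, assuming all columns other than ID hold the 0/1 values) is:

// Flip every column except the ID, keeping the original column names.
val bitCols = df.columns.filterNot(_ == "ID")
val flippedAll = df.select(($"ID" +: bitCols.map(c => flip(col(c)) as c)): _*)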

Spark Structured Streaming operations on rows of a single dataframe

In my problem, there is a data stream of information about package delivery coming in. The data consists of "NumberOfPackages", "Action" (which can be either "Loaded", "Delivered" or "In Transit"), and "Driver".
val streamingData = <filtered data frame based on "Loaded" and "Delivered" Action types only>
The goal is to look at number of packages at the moment of loading and at the moment of delivery, and if they are not the same - execute a function that would call a REST service with the parameter of "TrackingId".
The data looks like this:
+----------------+---------+----------+------+
|NumberOfPackages|Action   |TrackingId|Driver|
+----------------+---------+----------+------+
|5               |Loaded   |a         |Alex  |
|5               |Delivered|a         |Alex  |
|8               |Loaded   |b         |James |
|8               |Delivered|b         |James |
|7               |Loaded   |c         |Mark  |
|3               |Delivered|c         |Mark  |
+----------------+---------+----------+------+
<...more rows in this streaming data frame...>
In this case, we see that by the "TrackingId" equal to "c", the number of packages loaded and delivered isn't the same, so this is where we'd need to call the REST api with the "TrackingId".
I would like to combine rows based on "TrackingId", which will always be unique for each trip. If we get the rows combined based on this tracking id, we could have two columns for number of packages, something like "PackagesAtLoadTime" and "PackagesAtDeliveryTime". Then we could compare these two values for each row and filter the dataframe by those which are not equal.
So far I have tried the groupByKey method with the "TrackingId", but I couldn't find a similar example and my experimental attempts weren't successful.
After I figure out how to "merge" the two rows with the same tracking id together and have a column for each corresponding count of packages, I could define a UDF:
def notEqualPackages = udf((packagesLoaded: Int, packagesDelivered: Int) => packagesLoaded!=packagesDelivered)
And use it to filter the rows of the dataframe to contain only those with not matching numbers:
streamingData.where(notEqualPackages(streamingData("packagesLoaded"), streamingData("packagesDelivered")))
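For the step of merging the two rows per TrackingId, one possibility (only a sketch; the column names PackagesAtLoadTime and PackagesAtDeliveryTime are the ones proposed above, and on a streaming DataFrame this aggregation needs an output mode or watermark that suits your sink) is a conditional aggregation, which also removes the need for the UDF:

import org.apache.spark.sql.functions._
import spark.implicits._ // assumes `spark` is your SparkSession

val mismatches = streamingData
  .groupBy("TrackingId")
  .agg(
    // max ignores the nulls produced by when(), so each column picks up the
    // NumberOfPackages value from the matching Action row.
    max(when($"Action" === "Loaded", $"NumberOfPackages")).as("PackagesAtLoadTime"),
    max(when($"Action" === "Delivered", $"NumberOfPackages")).as("PackagesAtDeliveryTime"))
  .where($"PackagesAtLoadTime" =!= $"PackagesAtDeliveryTime")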

Spark time series data generation

I am trying to generate time series data in Spark and Scala. I have the following hourly data in a DataFrame:
sid|date                     |count
200|2016-04-30T18:00:00+00:00|10
200|2016-04-30T21:00:00+00:00|5
I want to generate hourly time series data for the last 2 days, taking the max time from the input as the anchor. In my case the series should end at 2016-04-30T21:00:00+00:00 and cover the preceding 48 hours, one row per hour. Any hour without data should have a null count. Sample output as follows:
id|sid|date                     |count
1 |200|2016-04-28T22:00:00+00:00|
2 |200|2016-04-28T23:00:00+00:00|
3 |200|2016-04-29T00:00:00+00:00|
--------------------------------------
45|200|2016-04-30T18:00:00+00:00|10
--------------------------------------
--------------------------------------
48|200|2016-04-30T21:00:00+00:00|5
Thanks,
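A possible approach (just a sketch, assuming Spark 2.4+ for the sequence SQL function and that df is the input DataFrame shown above; adjust the timestamp parsing to your actual date format):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Parse the string date into a proper timestamp column.
val events = df.withColumn("ts", to_timestamp(col("date")))

// For each sid, take the max time and expand a 48-point hourly range ending there
// (47 hours back plus the max hour itself = 48 rows).
val hourly = events
  .groupBy("sid")
  .agg(max(col("ts")).as("max_ts"))
  .withColumn("ts", explode(expr(
    "sequence(max_ts - interval 47 hours, max_ts, interval 1 hour)")))
  .select("sid", "ts")

// Left-join the actual counts back; hours without data keep a null count,
// and row_number provides the running id column.
val series = hourly
  .join(events.select("sid", "ts", "count"), Seq("sid", "ts"), "left")
  .withColumn("id", row_number().over(Window.partitionBy("sid").orderBy("ts")))
  .orderBy("sid", "ts")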