Executing multiple spark queries and storing as dataframe - scala

I have 3 Spark SQL queries saved in a list, sqlQueries. The first two create global temporary views, and the third one runs against those temporary views and fetches some output.
I am able to run a single query using this:
val resultDF = spark.sql(sql)
Then I add partition information to this dataframe object and save it.
In the case of multiple queries, I tried executing
sqlQueries.foreach(query => spark.sql(query))
How do I save the output of the third query while still having the other two queries run?
I have 3 queries just as an example; it can be any number.

You can write the last query as an insert statement to save the results into a table. You are executing the queries through foreach, which will run them sequentially.
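Alternatively, if you want to keep the final result as a DataFrame and save it yourself, here is a minimal sketch; the partition column and table name are placeholders, not from the question:
// Run every query in order; the earlier ones only create the temporary views,
// so keeping the last result gives the DataFrame produced by the final query
val resultDF = sqlQueries.map(q => spark.sql(q)).last
resultDF.write
  .mode("overwrite")
  .partitionBy("partition_col")  // placeholder partition column
  .saveAsTable("output_table")   // placeholder target table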

I am referencing the query from your other question, which needs some modification as explained in the Global Temporary View part of the Spark SQL documentation.
After the modification, your query file should look like this:
CREATE GLOBAL TEMPORARY VIEW VIEW_1 AS select a,b from abc
CREATE GLOBAL TEMPORARY VIEW VIEW_2 AS select a,b from global_temp.VIEW_1
select * from global_temp.VIEW_2
Then, to answer this question: you can use foldLeft to run the multiple queries in order and keep the final result.
Let's say you have a dataframe
+----+---+---+
|a |b |c |
+----+---+---+
|a |b |1 |
|adfs|df |2 |
+----+---+---+
Given the multi-line query file above, you can do the following:
df.createOrReplaceTempView("abc")
val sqlFile = "path to test.sql"
val queryList = scala.io.Source.fromFile(sqlFile).getLines().filterNot(_.isEmpty).toList
// foldLeft executes the queries in order; each step keeps the latest result,
// so the final value is the DataFrame returned by the last query
val finalresult = queryList.foldLeft(df)((tempdf, query) => spark.sql(query))
finalresult.show(false)
which should give you
+----+---+
|a |b |
+----+---+
|a |b |
|adfs|df |
+----+---+

Related

Spark UDF not giving rolling counts properly

I have a Spark UDF to calculate rolling counts of a column, precise with respect to time. If I need to calculate a rolling count for 24 hours, then for an entry with time 2020-10-02 09:04:00 I need to look back until 2020-10-01 09:04:00 (very precise).
The rolling-count UDF works fine and gives correct counts when I run it locally, but when I run it on a cluster it gives incorrect results. Here are the sample input and output:
Input
+---------+-----------------------+
|OrderName|Time |
+---------+-----------------------+
|a |2020-07-11 23:58:45.538|
|a |2020-07-12 00:00:07.307|
|a |2020-07-12 00:01:08.817|
|a |2020-07-12 00:02:15.675|
|a |2020-07-12 00:05:48.277|
+---------+-----------------------+
Expected Output
+---------+-----------------------+-----+
|OrderName|Time |Count|
+---------+-----------------------+-----+
|a |2020-07-11 23:58:45.538|1 |
|a |2020-07-12 00:00:07.307|2 |
|a |2020-07-12 00:01:08.817|3 |
|a        |2020-07-12 00:02:15.675|4    |
|a        |2020-07-12 00:05:48.277|5    |
+---------+-----------------------+-----+
The last two entries should have values 4 and 5 (which is what I get locally), but on the cluster they are incorrect. My best guess is that the data is being distributed across executors and the UDF is also being called in parallel on each executor. Since one of the parameters to the UDF is a column (the partition key, OrderName in this example), how could I control or correct this behavior on the cluster, if that is the case, so that it calculates the proper counts for each partition? Any suggestions, please.
As per your comment, you want to count, for every record, the total number of records within the preceding 24 hours:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType
import spark.implicits._
// Sample data (guessing from your question)
val df = Seq(("a","2020-07-10 23:58:45.438","1"),("a","2020-07-11 23:58:45.538","1"),("a","2020-07-11 23:58:45.638","1")).toDF("OrderName","Time","Count")
// Extract the Unix timestamp in milliseconds by appending the fractional part to the seconds
val df2 = df.withColumn("unix_time",concat(unix_timestamp($"Time"),split($"Time","\\.")(1)).cast(LongType))
val noOfMilisecondsDay : Long = 24*60*60*1000
// Create a window per `OrderName` and select rows from `current time - 24 hours` to `current time`
val winSpec = Window.partitionBy("OrderName").orderBy("unix_time").rangeBetween(Window.currentRow - noOfMilisecondsDay, Window.currentRow)
// Finally, perform your COUNT or SUM(Count) as per your need
val finalDf = df2.withColumn("tot_count", count("OrderName").over(winSpec))
// or: val finalDf = df2.withColumn("tot_count", sum("Count").over(winSpec))
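As a side note, the string splitting above assumes the fractional part of Time always has exactly three digits. If that assumption does not hold, a cast-based variant (my own sketch, not part of the original answer) avoids the string surgery:
// Cast to timestamp, then to epoch seconds with fraction, then to epoch milliseconds
val df2Alt = df.withColumn(
  "unix_time",
  ($"Time".cast("timestamp").cast("double") * 1000).cast(LongType)
)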

Logic to manipulate dataframe in spark scala [Spark]

Take for example the following dataFrame:
x.show(false)
+-----+--------------------------------------------------------------------------------------+-------------+
|colId|hdfsPath                                                                              |timestamp    |
+-----+--------------------------------------------------------------------------------------+-------------+
|11   |hdfs://novus-nameservice/a/b/c/done/compiled-20200218050518-1-0-0-1582020318751.snappy|1662157400000|
|12   |hdfs://novus-nameservice/a/b/c/done/compiled-20200219060507-1-0-0-1582023907108.snappy|1662158000000|
+-----+--------------------------------------------------------------------------------------+-------------+
Now I am trying to update the existing DF to create a new DF based on the column hdfsPath.
The new DF should look like the following:
+-----+----------------------------------------------------------------------------------------------------+-------------+
|colId|hdfsPath                                                                                            |timestamp    |
+-----+----------------------------------------------------------------------------------------------------+-------------+
|11   |hdfs://novus-nameservice/a/b/c/target/20200218/11/compiled-20200218050518-1-0-0-1582020318751.snappy|1662157400000|
|12   |hdfs://novus-nameservice/a/b/c/target/20200219/12/compiled-20200219060507-1-0-0-1582023907108.snappy|1662158000000|
+-----+----------------------------------------------------------------------------------------------------+-------------+
So the path segment done changes to target, and then from the compiled-20200218050518-1-0-0-1582020318751.snappy portion I get the date 20200218, then the colId 11, and finally the snappy file. What would be the easiest and most efficient way to achieve this?
It's not a hard requirement to create a new DF; I can update the existing DF with a new column.
To summarize:
Current hdfsPath:
hdfs://novus-nameservice/a/b/c/done/compiled-20200218050518-1-0-0-1582020318751.snappy
Expected hdfsPath:
hdfs://novus-nameservice/a/b/c/target/20200218/11/compiled-20200218050518-1-0-0-1582020318751.snappy
Based on colID.
The simplest way I can imagine doing this is converting your dataframe to a Dataset, applying a map operation, and then going back to a dataframe:
// Define a case class; the field names and types need to match the columns
case class MyType(colId: Int, hdfsPath: String, timestamp: Long)
dataframe.as[MyType].map(x => <<Your Transformation code>>).toDF()
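For illustration only, the map body could do the path rewrite directly on the case class; the helper name and regex below are my own, and import spark.implicits._ is assumed to be in scope for the Dataset encoder:
// Hypothetical helper: pull the 8-digit date out of the file name and splice
// "/target/<date>/<colId>/" in place of "/done/"
def rewritePath(colId: Int, path: String): String = {
  val date = "compiled-(\\d{8})".r.findFirstMatchIn(path).map(_.group(1)).getOrElse("")
  path.replace("/done/", s"/target/$date/$colId/")
}
val newDf = dataframe.as[MyType]
  .map(x => x.copy(hdfsPath = rewritePath(x.colId, x.hdfsPath)))
  .toDF()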
Here is what you can do with regexp_replace and regexp_extract: extract the values you want and build the replacement from them.
import org.apache.spark.sql.functions._
df.withColumn("hdfsPath", regexp_replace(
  $"hdfsPath",
  lit("/done"),
  concat(
    lit("/target/"),
    regexp_extract($"hdfsPath", "compiled-([0-9]{1,8})", 1),
    lit("/"),
    $"colId")
))
Output:
+-----+----------------------------------------------------------------------------------------------------+-------------+
|colId|hdfsPath |timestamp |
+-----+----------------------------------------------------------------------------------------------------+-------------+
|11 |hdfs://novus-nameservice/a/b/c/target/20200218/11/compiled-20200218050518-1-0-0-1582020318751.snappy|1662157400000|
|12 |hdfs://novus-nameservice/a/b/c/target/20200219/12/compiled-20200219060507-1-0-0-1582023907108.snappy|1662158000000|
+-----+----------------------------------------------------------------------------------------------------+-------------+
Hope this helps!

Populate a "Grouper" column using .withcolumn in scala.spark dataframe

Trying to populate the grouper column like below. In the table below, X signifies the start of a new record, so each X, Y, Z needs to be grouped together. In MySQL, I would accomplish this like:
select @x:=1;
update table set grouper=if(column_1='X',@x:=@x+1,@x);
I am trying to see if there is a way to do this without using a loop, using .withColumn or something similar.
What I have tried:
var group = 1;
val mydf4 = mydf3.withColumn("grouper", when(col("column_1").equalTo("INS"),group=group+1).otherwise(group))
Example DF
A simple window function with the row_number() built-in function should get you your desired output.
val df = Seq(
  Tuple1("X"),
  Tuple1("Y"),
  Tuple1("Z"),
  Tuple1("X"),
  Tuple1("Y"),
  Tuple1("Z")
).toDF("column_1")
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("column_1").orderBy("column_1")
import org.apache.spark.sql.functions._
df.withColumn("grouper", row_number().over(windowSpec)).orderBy("grouper", "column_1").show(false)
which should give you
+--------+-------+
|column_1|grouper|
+--------+-------+
|X |1 |
|Y |1 |
|Z |1 |
|X |2 |
|Y |2 |
|Z |2 |
+--------+-------+
Note: the last orderBy is just to match the expected output and is for visualization only. In a real cluster with distributed processing, an orderBy like that doesn't make much sense.
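If a group is instead meant to run from one X to the next (which is what the MySQL counter in the question does), another option is a running sum of an is-X flag over an ordered window. This is only a sketch of that idea, under the assumption that some ordering column exists, since the example frame has none:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// monotonically_increasing_id preserves order within a partition only,
// so treat the generated id as an assumption about how the rows were read
val dfWithId = df.withColumn("id", monotonically_increasing_id())
val cumulativeWindow = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, Window.currentRow)
dfWithId
  .withColumn("grouper", sum(when(col("column_1") === "X", 1).otherwise(0)).over(cumulativeWindow))
  .show(false)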

Flip each bit in Spark dataframe calling a custom function

I have a spark Dataframe that looks like
ID |col1|col2|col3|col4.....
A |0 |1 |0 |0....
C |1 |0 |0 |0.....
E |1 |0 |1 |1......
ID is a unique key and the other columns have binary values 0/1.
Now, I want to iterate over each row, and if a column value is 0, I want to apply some function, passing this single row as a dataframe to that function,
like col1 == 0 in the above dataframe for ID A.
Now the DF for that row should look like:
newDF.show()
ID |col1|col2|col3|col4.....
A |1 |1 |0 |0....
myfunc(newDF)
The next 0 is encountered at col3 for ID A, so the new DF looks like:
newDF.show()
ID |col1|col2|col3|col4.....
A |0 |1 |1 |0....
val max = myfunc(newDF) // the function returns a double
and so on...
Note: each 0 bit is flipped once at the row level for the function call, resetting the effect of the previously flipped bit.
P.S.: I tried using withColumn calling a UDF, but ran into serialization issues from having a DF inside a DF.
Actually, the myfunc I'm calling sends the row for scoring to an ML model I have, which returns the probability for that user if a particular bit is flipped. So I have to iterate through each column set to 0 and set it to 1 for that particular instance.
I'm not sure you need anything particularly complex for this. Given that you have imported the SQL functions and the session implicits
val spark: SparkSession = ??? // your session
import spark.implicits._
import org.apache.spark.sql.functions._
you should be able to "flip the bits" (although I'm assuming those are actually encoded as numbers) by applying the following function
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))
as in this example
df.select($"ID", flip($"col1") as "col1", flip($"col2") as "col2")
You can easily rewrite the flip function to deal with edge cases or use a different type (if, for example, the "bits" are encoded as booleans or strings).
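If the frame has many such columns, the same flip can be applied to all of them programmatically; a small sketch, assuming every column other than ID holds 0/1 values:
// Keep ID as-is and flip every other column, preserving the column names
val bitCols = df.columns.filterNot(_ == "ID")
val flippedAll = df.select((col("ID") +: bitCols.map(c => flip(col(c)).as(c))): _*)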

Spark Structured Streaming operations on rows of a single dataframe

In my problem, there is a data stream of information about package delivery coming in. The data consists of "NumberOfPackages", "Action" (which can be either "Loaded", "Delivered" or "In Transit"), and "Driver".
val streamingData = <filtered data frame based on "Loaded" and "Delivered" Action types only>
The goal is to look at number of packages at the moment of loading and at the moment of delivery, and if they are not the same - execute a function that would call a REST service with the parameter of "TrackingId".
The data looks like this:
+----------------+---------+----------+------+
|NumberOfPackages|Action   |TrackingId|Driver|
+----------------+---------+----------+------+
|5               |Loaded   |a         |Alex  |
|5               |Delivered|a         |Alex  |
|8               |Loaded   |b         |James |
|8               |Delivered|b         |James |
|7               |Loaded   |c         |Mark  |
|3               |Delivered|c         |Mark  |
+----------------+---------+----------+------+
<...more rows in this streaming data frame...>
In this case, we see that for the "TrackingId" equal to "c", the number of packages loaded and delivered isn't the same, so this is where we'd need to call the REST API with that "TrackingId".
I would like to combine rows based on "TrackingId", which will always be unique for each trip. If we get the rows combined based on this tracking id, we could have two columns for number of packages, something like "PackagesAtLoadTime" and "PackagesAtDeliveryTime". Then we could compare these two values for each row and filter the dataframe by those which are not equal.
So far I have tried the groupByKey method with the "TrackingId", but I couldn't find a similar example and my experimental attempts weren't successful.
After I figure out how to "merge" the two rows with the same tracking id together and have a column for each corresponding count of packages, I could define a UDF:
def notEqualPackages = udf((packagesLoaded: Int, packagesDelivered: Int) => packagesLoaded!=packagesDelivered)
And use it to filter the rows of the dataframe down to only those with non-matching numbers:
streamingData.where(notEqualPackages(streamingData("packagesLoaded"), streamingData("packagesDelivered")))
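For what it's worth, the reshaping described above can be sketched with a conditional aggregation instead of a UDF. This is only an outline on my part; in a streaming query the aggregation would also need a watermark and a compatible output mode:
import org.apache.spark.sql.functions._
val reshaped = streamingData
  .groupBy("TrackingId")
  .agg(
    max(when(col("Action") === "Loaded", col("NumberOfPackages"))).as("PackagesAtLoadTime"),
    max(when(col("Action") === "Delivered", col("NumberOfPackages"))).as("PackagesAtDeliveryTime")
  )
// Rows where the two counts differ are the ones that need the REST call
val mismatches = reshaped.where(col("PackagesAtLoadTime") =!= col("PackagesAtDeliveryTime"))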