Spark UDF not giving rolling counts properly - scala

I have a Spark UDF that calculates a rolling count of a column, precisely with respect to time. If I need a 24-hour rolling count, then for an entry with time 2020-10-02 09:04:00 I need to look back exactly as far as 2020-10-01 09:04:00.
The rolling count UDF works fine and gives correct counts when I run it locally, but when I run it on a cluster it gives incorrect results. Here are the sample input and output.
Input
+---------+-----------------------+
|OrderName|Time |
+---------+-----------------------+
|a |2020-07-11 23:58:45.538|
|a |2020-07-12 00:00:07.307|
|a |2020-07-12 00:01:08.817|
|a |2020-07-12 00:02:15.675|
|a |2020-07-12 00:05:48.277|
+---------+-----------------------+
Output on the cluster (incorrect)
+---------+-----------------------+-----+
|OrderName|Time |Count|
+---------+-----------------------+-----+
|a |2020-07-11 23:58:45.538|1 |
|a |2020-07-12 00:00:07.307|2 |
|a |2020-07-12 00:01:08.817|3 |
|a |2020-07-12 00:02:15.675|1 |
|a |2020-07-12 00:05:48.277|1 |
+---------+-----------------------+-----+
The last two counts come out as 4 and 5 when I run locally, but on the cluster they are incorrect, as shown above. My best guess is that the data is being distributed across executors and the UDF is also being called in parallel on each executor. One of the parameters to the UDF is a column (the partition key, OrderName in this example). If that is the case, how can I control or correct the behaviour on the cluster so that it calculates the proper counts for each partition? Any suggestions, please.

As per your comment, you want to count the total number of records over the last 24 hours for every record.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType
import spark.implicits._
// A sample data set (guessing from your question)
val df = Seq(("a","2020-07-10 23:58:45.438","1"),("a","2020-07-11 23:58:45.538","1"),("a","2020-07-11 23:58:45.638","1")).toDF("OrderName","Time","Count")
// Build a millisecond-precision epoch value by appending the fractional part to the UNIX timestamp (assumes exactly 3 fractional digits)
val df2 = df.withColumn("unix_time",concat(unix_timestamp($"Time"),split($"Time","\\.")(1)).cast(LongType))
val noOfMillisecondsDay : Long = 24*60*60*1000
// Create a window per OrderName and select rows from `current time - 24 hours` to `current time`
val winSpec = Window.partitionBy("OrderName").orderBy("unix_time").rangeBetween(Window.currentRow - noOfMillisecondsDay, Window.currentRow)
// Finally, perform your COUNT or SUM(Count) as per your need
val finalDf = df2.withColumn("tot_count", count("OrderName").over(winSpec))
// or: val finalDf = df2.withColumn("tot_count", sum("Count").over(winSpec))
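Applied to the five sample rows in the question, which all fall within 24 hours of each other, the same window should yield the 1, 2, 3, 4, 5 progression you expect, because Spark evaluates the window per OrderName partition itself instead of inside a UDF. If you only want the original columns plus the rolling count, a minimal usage sketch with the names used above:
// Show the result without the intermediate epoch-millisecond helper column
finalDf.drop("unix_time").show(false)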

Related

Filling blank field in a DataFrame with previous field value

I am working with Scala and Spark and I am relatively new to programming in Scala, so maybe my question has a simple solution.
I have one DataFrame that keeps information about clients activating and deactivating in some promotion. That DataFrame shows the client ID, the action he or she took (a client can activate or deactivate from the promotion at any time) and the date of that action. Here is an example of that format:
Example of how the DataFrame works
I want to monitor daily how many clients are active and see how this number varies over the days, but I have not been able to code anything that works like that.
My idea was to crossJoin two DataFrames: one with only the client IDs and another with only the dates, so that I would have every date paired with every client ID and would only need to fill in the client status (active or deactivated) on each date. After that, I left-joined this new DataFrame with the DataFrame relating client IDs and events, but the result has a lot of dates with a "null" status and I don't know how to fill them with the correct status. Here's the example:
Example of the final DataFrame
I have already tried to use the lag function, but it did not solve my problem. Does anyone have any idea that could help me?
Thank You!
This is a slightly expensive operation, because Spark SQL has restrictions on correlated sub-queries with <, <=, >, >=.
Starting from your second DataFrame with the NULLs, and assuming a large enough system and a manageable volume of data:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
// My sample input
val df = Seq(
(1,"2018-03-12", "activate"),
(1,"2018-03-13", null),
(1,"2018-03-14", null),
(1,"2018-03-15", "deactivate"),
(1,"2018-03-16", null),
(1,"2018-03-17", null),
(1,"2018-03-18", "activate"),
(2,"2018-03-13", "activate"),
(2,"2018-03-14", "deactivate"),
(2,"2018-03-15", "activate")
).toDF("ID", "dt", "act")
//df.show(false)
// Rank each event by date within each client
val w = Window.partitionBy("ID").orderBy(col("dt").asc)
val df2 = df.withColumn("rank", dense_rank().over(w)).select("ID", "dt", "act", "rank") //.where("rank == 1")
//df2.show(false)
// Rows whose action is missing and needs to be filled
val df3 = df2.filter($"act".isNull)
//df3.show(false)
// Rows that do have an action, renamed so they can be joined against df3
val df4 = df2.filter(!($"act".isNull)).toDF("ID2", "dt2", "act2", "rank2")
//df4.show(false)
// Pair every NULL row with all earlier non-NULL rows of the same client
val df5 = df3.join(df4, (df3("ID") === df4("ID2")) && (df4("rank2") < df3("rank")), "inner")
//df5.show(false)
// Keep only the most recent of those earlier non-NULL actions for each NULL row
val w2 = Window.partitionBy("ID", "rank").orderBy(col("rank2").desc)
val df6 = df5.withColumn("rank_final", dense_rank().over(w2)).where("rank_final == 1").select("ID", "dt", "act2").toDF("ID", "dt", "act")
//df6.show
// Re-attach the rows that already had an action
val df7 = df.filter(!($"act".isNull))
val dfFinal = df6.union(df7)
dfFinal.show(false)
returns:
+---+----------+----------+
|ID |dt |act |
+---+----------+----------+
|1 |2018-03-13|activate |
|1 |2018-03-14|activate |
|1 |2018-03-16|deactivate|
|1 |2018-03-17|deactivate|
|1 |2018-03-12|activate |
|1 |2018-03-15|deactivate|
|1 |2018-03-18|activate |
|2 |2018-03-13|activate |
|2 |2018-03-14|deactivate|
|2 |2018-03-15|activate |
+---+----------+----------+
I solved this step-wise and in a rush, so it is not the most obvious approach.
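For this particular fill-forward pattern there is also a shorter route (a sketch only, assuming the same ID/dt/act columns as above): the ignore-nulls variant of last over an ordered window carries the most recent non-NULL action forward within each client and avoids the join entirely.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Carry the last non-NULL action forward within each client, ordered by date
val wFill = Window.partitionBy("ID").orderBy("dt").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val dfFilled = df.withColumn("act", last("act", ignoreNulls = true).over(wFill))
dfFilled.show(false)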

Spark window function with condition on current row

For a given order_id, I am trying to count how many orders in the past 365 days had a payment. That part is not the problem: I use the window function.
Where it gets tricky is that I don't want to count orders in this time window whose payment_date is after the order_date of the current order_id.
Currently, I have something like this:
val window: WindowSpec = Window
.partitionBy("customer_id")
.orderBy("order_date")
.rangeBetween(-365*days, -1)
and
df.withColumn("paid_order_count", count("*") over window)
which would count all orders for the customer within the last 365 days before his current order.
How can I now incorporate a condition for the counting that takes the order_date of the current order into account?
Example:
+---------+-----------+-------------+------------+
|order_id |order_date |payment_date |customer_id |
+---------+-----------+-------------+------------+
|1 |2017-01-01 |2017-01-10 |A |
|2 |2017-02-01 |2017-02-10 |A |
|3 |2017-02-02 |2017-02-20 |A |
+---------+-----------+-------------+------------+
The resulting table should look like this:
+---------+-----------+-------------+------------+-----------------+
|order_id |order_date |payment_date |customer_id |paid_order_count |
+---------+-----------+-------------+------------+-----------------+
|1 |2017-01-01 |2017-01-10 |A |0 |
|2 |2017-02-01 |2017-02-10 |A |1 |
|3 |2017-02-02 |2017-02-20 |A |1 |
+---------+-----------+-------------+------------+-----------------+
For order_id = 3 the paid_order_count should not be 2 but 1 as order_id = 2 is paid after order_id = 3 is placed.
I hope that I explained my problem well and look forward to your ideas!
Very good question!!!
A couple of remarks. A frame based on a fixed number of rows (rowsBetween) rather than on values would be problematic in two cases:
1. the customer does not place an order every single day, so a 365-row window might contain rows with an order_date well before one year ago
2. if the customer has more than one order per day, it throws off the one-year coverage
3. or a combination of 1 and 2
Also, rangeBetween does not work with Date and Timestamp data types, so the frame from your snippet cannot be expressed directly on order_date.
To solve it, it is possible to use a window function with lists and a UDF:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = spark.sparkContext.parallelize(Seq(
    (1, "2017-01-01", "2017-01-10", "A"),
    (2, "2017-02-01", "2017-02-10", "A"),
    (3, "2017-02-02", "2017-02-20", "A")
  )).toDF("order_id", "order_date", "payment_date", "customer_id")
  .withColumn("order_date_ts", to_timestamp($"order_date", "yyyy-MM-dd").cast("long"))
  .withColumn("payment_date_ts", to_timestamp($"payment_date", "yyyy-MM-dd").cast("long"))
// df.printSchema()
// df.show(false)
// Collect, for each order, the payment timestamps of all earlier orders (by order_date) of the same customer
val window = Window.partitionBy("customer_id").orderBy("order_date_ts").rangeBetween(Window.unboundedPreceding, -1)
// Count the collected payments that fall within the last `days` days and happened before the current order
val count_filtered_dates = udf( (days: Int, top: Long, array: Seq[Long]) => {
  val bottom = top - (days * 60 * 60 * 24).toLong // in Spark, timestamps are in seconds; this is the cutoff `days` days before `top`
  array.count(v => v >= bottom && v < top)
})
val res = df.withColumn("paid_orders", collect_list("payment_date_ts") over window)
  .withColumn("paid_order_count", count_filtered_dates(lit(365), $"order_date_ts", $"paid_orders"))
res.show(false)
Output:
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
|order_id|order_date|payment_date|customer_id|order_date_ts|payment_date_ts|paid_orders |paid_order_count|
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
|1 |2017-01-01|2017-01-10 |A |1483228800 |1484006400 |[] |0 |
|2 |2017-02-01|2017-02-10 |A |1485907200 |1486684800 |[1484006400] |1 |
|3 |2017-02-02|2017-02-20 |A |1485993600 |1487548800 |[1484006400, 1486684800]|1 |
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
Converting dates to Spark timestamps in seconds makes the lists more memory efficient.
This is the easiest code to implement, but not the most optimal, since the collected lists take up some memory; a custom UDAF would be best, but requires more coding, and I might do it later. It will still work fine even if you have thousands of orders per customer.
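One possible refinement of the same idea (a sketch only, reusing df from above; note that it bounds the earlier orders' order_date to the last 365 days, as in the window from the question, rather than their payment_date): push the one-year limit into the frame itself so the collected lists stay small, and let the UDF check only the paid-before-the-current-order condition.
val oneYearSecs = 365L * 24 * 60 * 60
val boundedWindow = Window.partitionBy("customer_id").orderBy("order_date_ts").rangeBetween(-oneYearSecs, -1)
// Count payments of those earlier orders that happened before the current order was placed
val count_paid_before = udf((orderTs: Long, payments: Seq[Long]) => payments.count(_ < orderTs))
val res2 = df.withColumn("paid_orders", collect_list("payment_date_ts") over boundedWindow)
  .withColumn("paid_order_count", count_paid_before($"order_date_ts", $"paid_orders"))
res2.show(false)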

Executing multiple spark queries and storing as dataframe

I have 3 Spark queries saved in a List called sqlQueries. The first two create global temporary views, and the third one runs against those views and fetches some output.
I am able to run a single query using this -
val resultDF = spark.sql(sql)
Then I add partition information to this DataFrame object and save it.
In case of multiple queries, I tried executing
sqlQueries.foreach(query => spark.sql(query))
How do I save the output of the third query while still running the other two?
I have 3 queries here just as an example; it can be any number.
You can write the last query as an INSERT statement to save the results into a table. You are executing the queries through foreach, which will run them sequentially.
I am taking the query from your other question as a reference; it needs some modification, as explained in the Global Temporary View section of the Spark SQL documentation.
After modification, your query file should look like
CREATE GLOBAL TEMPORARY VIEW VIEW_1 AS select a,b from abc
CREATE GLOBAL TEMPORARY VIEW VIEW_2 AS select a,b from global_temp.VIEW_1
select * from global_temp.VIEW_2
Then, answering this question: you can use foldLeft again so that all of the queries are applied.
Let's say you have a DataFrame
+----+---+---+
|a |b |c |
+----+---+---+
|a |b |1 |
|adfs|df |2 |
+----+---+---+
Given the multi-line query file above, you can do the following:
df.createOrReplaceTempView("abc")
val sqlFile = "path to test.sql"
val queryList = scala.io.Source.fromFile(sqlFile).getLines().filterNot(_.isEmpty).toList
// Run the queries in order; the DataFrame returned by the last query becomes the final result
val finalresult = queryList.foldLeft(df)((tempdf, query) => spark.sql(query))
finalresult.show(false)
which should give you
+----+---+
|a |b |
+----+---+
|a |b |
|adfs|df |
+----+---+
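The question also mentions adding partition information to the resulting DataFrame before saving it. A minimal sketch of that last step (the partition column "a" and the output path are placeholders, not taken from your setup):
// Write the result of the final query, partitioned by a column of your choice
finalresult.write.mode("overwrite").partitionBy("a").parquet("/path/to/output")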

Flip each bit in Spark dataframe calling a custom function

I have a spark Dataframe that looks like
ID |col1|col2|col3|col4.....
A |0 |1 |0 |0....
C |1 |0 |0 |0.....
E |1 |0 |1 |1......
ID is a unique key and the other columns hold binary values 0/1.
Now I want to iterate over each row, and whenever a column value is 0 I want to apply some function, passing that single row as a DataFrame to the function.
For example, col1 == 0 in the above DataFrame for ID A.
The single-row DF should then look like
newDF.show()
ID |col1|col2|col3|col4.....
A |1 |1 |0 |0....
myfunc(newDF)
The next 0 is encountered at col3 for ID A, so the new DF looks like
newDF.show()
ID |col1|col2|col3|col4.....
A |0 |1 |1 |0....
val max = myfunc(newDF) // the function returns a Double
and so on...
Note: each 0 bit is flipped once at row level for the function call, resetting the effect of the previously flipped bit.
P.S.: I tried using withColumn calling a UDF, but ran into serialization issues with a DF inside a DF.
Actually, the myfunc I'm calling sends the row for scoring to an ML model I have, which returns the probability for that user if a particular bit is flipped. So I have to iterate through each column that is set to 0 and set it to 1 for that particular instance.
I'm not sure you need anything particularly complex for this. Given that you have imported the SQL functions and the session implicits
val spark: SparkSession = ??? // your session
import spark.implicits._
import org.apache.spark.sql.functions._
you should be able to "flip the bits" (although I'm assuming those are actually encoded as numbers) by applying the following function
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))
as in this example
df.select($"ID", flip($"col1") as "col1", flip($"col2") as "col2")
You can easily rewrite the flip function to deal with edge cases or use different type (if, for example, the "bits" are encoded with booleans or strings).
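If you want to apply this to every value column without listing them by hand, a small sketch along the same lines (assuming ID is the only non-bit column):
// Flip every column except the ID column in a single select
val bitCols = df.columns.filterNot(_ == "ID")
val flipped = df.select(($"ID" +: bitCols.map(c => flip(col(c)) as c)): _*)
flipped.show(false)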

Spark 2.0 - How to obtain Cluster ID associated with Cluster Center

I want to know what is the ID associated with the Cluster Centers. model.transform(dataset) will assign a predicted cluster ID to my data points, and model.clusterCenters.foreach(println) will print these cluster centers, but I cannot figure out how to associate the cluster centers with their ID.
import org.apache.spark.ml.clustering.KMeans
// Loads data.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
val prediction = model.transform(dataset)
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
Ideally, I want an output such as:
|I.D |cluster center |
==========================
|0 |[0.0,...,0.3] |
|2 |[1.0,...,1.3] |
|1 |[2.0,...,1.3] |
|3 |[3.0,...,1.3] |
It does not seem to me that the println order is sorted by ID. I tried converting model.clusterCenters into a DF to transform() on it, but I couldn't figure out how to convert Array[org.apache.spark.ml.linalg.Vector] to org.apache.spark.sql.Dataset[_]
Once you save the model, it writes out the cluster IDs and cluster centers. You can read that data back to see the desired output:
// For an org.apache.spark.ml model, save takes just a path
model.save("/user/hadoop/kmeanModel")
// Read back the parquet data the writer produced
val parq = spark.read.parquet("/user/hadoop/kmeanModel/data/*")
parq.collect.foreach(println)
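Alternatively, you can build the mapping in memory without saving anything: the index into model.clusterCenters corresponds to the cluster ID that transform writes into the prediction column, so (a sketch using the model from above):
import spark.implicits._
// clusterCenters(i) is the center of the cluster predicted as i
val centersDF = model.clusterCenters.zipWithIndex
  .map { case (center, id) => (id, center.toString) }
  .toSeq.toDF("id", "cluster_center")
centersDF.show(false)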