Create new column in Spark DataFrame with diff of previous values from another column - scala

I have a data frame which has a column with epoch seconds.
In addition to this I would like to add a column which contains the difference between current and previous time value - in other words time diff since last row in the data frame based on the timestamp column.
How would I add such a column based on earlier values?
I am using the Scala API.

you can use the lag function of spark to achieve this
val df = sc.parallelize(Seq(
(1540000005),
(1540000004),
(1540000003),
(1540000002))).toDF("epoch")
// a lag function needs to have a window
val w = org.apache.spark.sql.expressions.Window.orderBy("epoch")
import org.apache.spark.sql.functions.lag
// create a column epoch_lag_1 which is the epoch column with an offset of 1 and default value 0
val dfWithLag = df.withColumn("epoch_lag_1", lag("epoch", 1, 0).over(w))
// calculate the diff between epoch and epoch_lag_1
val dfWithDiff = dfWithLag.withColumn("diff", dfWithLag("epoch") - dfWithLag("epoch_lag_1"))
this should result in
dfWithDiff.show
+----------+-----------+----------+
| epoch|epoch_lag_1| diff|
+----------+-----------+----------+
|1540000002| 0|1540000002|
|1540000003| 1540000002| 1|
|1540000004| 1540000003| 1|
|1540000005| 1540000004| 1|
+----------+-----------+----------+

This will do what you want, though as pointed out it could be a little slow.
df.printSchema
root
|-- ts: long (nullable = false)
df.join(
df.toDF("ts2"),
$"ts2" < $"ts",
"left_outer"
).groupBy($"ts").agg(max($"ts2") as "prev").select($"ts", $"ts" - $"prev" as "diff").show
We can even use my pimped out DataFrame-ified zipWithIndex to make it better. Assuming we used that to add an id column, you could do:
df.join(
df.toDF("prev_id", "prev_ts"),
$"id" === $"prev_id" + 1,
"left_outer"
).select($"ts", $"ts" - $"prev_ts" as "diff").show

I do not know Scala. But how about generating a lagged column with lag and then subtracting one column from the other?

Related

Check the minimum by iterating one row in a dataframe over all the rows in another dataframe

Let's say I have the following two dataframes:
DF1:
+----------+----------+----------+
| Place|Population| IndexA|
+----------+----------+----------+
| A| Int| X_A|
| B| Int| X_B|
| C| Int| X_C|
+----------+----------+----------+
DF2:
+----------+----------+
| City| IndexB|
+----------+----------+
| D| X_D|
| E| X_E|
| F| X_F|
| ....| ....|
| ZZ| X_ZZ|
+----------+----------+
The dataframes above are normally of much larger size.
I want to determine to which City(DF2) the shortest distance is from every Place from DF1. The distance can be calculated based on the index. So for every row in DF1, I have to iterate over every row in DF2 and look for the shortest distances based on the calculations with the indexes. For the distance calculation there is a function defined:
val distance = udf(
(indexA: Long, indexB: Long) => {
h3.instance.h3Distance(indexA, indexB)
})
I tried the following:
val output = DF1.agg(functions.min(distance(col("IndexA"), DF2.col("IndexB"))))
But this, the code compiles but I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s)
H3Index#220L missing from Places#316,Population#330,IndexAx#338L in operator !Aggregate
[min(if ((isnull(IndexA#338L) OR isnull(IndexB#220L))) null else
UDF(knownnotnull(IndexA#338L), knownnotnull(IndexB#220L))) AS min(UDF(IndexA, IndexB))#346].
So I suppose I do something wrong with iterating over each row in DF2 when taking one row from DF1 but I couldn't find a solution.
What am I doing wrong? And am I in the right direction?
You are getting this error because the index column you are using only exists in DF2 and not DF1 where you are attempting to perform the aggregation.
In order to make this field accessible and determine the distance from all points, you would need to
Cross join DF1 and Df2 to have every index of Df1 matching every index of DF2
Determine the distance using your udf
Find the min on this new cross joined udf with the distances
This may look like :
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min, udf}
val distance = udf(
(indexA: Long, indexB: Long) => {
h3.instance.h3Distance(indexA, indexB)
})
val resultDF = DF1.crossJoin(DF2)
.withColumn("distance", distance(col("IndexA"), col("IndexB")))
//instead of using a groupby then matching the min distance of the aggregation with the initial df. I've chosen to use a window function min to determine the min_distance of each group (determined by Place) and filter by the city with the min distance to each place
.withColumn("min_distance", min("distance").over(Window.partitionBy("Place")))
.where(col("distance") === col("min_distance"))
.drop("min_distance")
This will result in a dataframe with columns from both dataframes and and additional column distance.
NB. Your current approach which is comparing every item in one df to every item in another df is an expensive operation. If you have the opportunity to filter early (eg joining on heuristic columns, i.e. other columns which may indicate a place may be closer to a city), this is recommended.
Let me know if this works for you.
If you have only a few cities (less than or around 1000), you can avoid crossJoin and Window shuffle by collecting cities in an array and then perform distance computation for each place using this collected array:
import org.apache.spark.sql.functions.{array_min, col, struct, transform, typedLit, udf}
val citiesIndexes = df2.select("City", "IndexB")
.collect()
.map(row => (row.getString(0), row.getLong(1)))
val result = df1.withColumn(
"City",
array_min(
transform(
typedLit(citiesIndexes),
x => struct(distance(col("IndexA"), x.getItem("_2")), x.getItem("_1"))
)
).getItem("col2")
)
This piece of code works for Spark 3 and greater. If you are on a Spark version smaller than 3.0, you should replace array_min(...).getItem("col2") part by an user-defined function.

How would I create bins of date ranges in spark scala?

Hi how's it going? I'm a Python developer trying to learn Spark Scala. My task is to create date range bins, and count the frequency of occurrences in each bin (histogram).
My input dataframe looks something like this
My bin edges are this (in Python):
bins = ["01-01-1990 - 12-31-1999","01-01-2000 - 12-31-2009"]
and the output dataframe I'm looking for is (counts of how many values in original dataframe per bin):
Is there anyone who can guide me on how to do this is spark scala? I'm a bit lost. Thank you.
Are You Looking for A result Like Following:
+------------------------+------------------------+
|01-01-1990 -- 12-31-1999|01-01-2000 -- 12-31-2009|
+------------------------+------------------------+
| 3| null|
| null| 2|
+------------------------+------------------------+
It can be achieved with little bit of spark Sql and pivot function as shown below
check out the left join condition
val binRangeData = Seq(("01-01-1990","12-31-1999"),
("01-01-2000","12-31-2009"))
val binRangeDf = binRangeData.toDF("start_date","end_date")
// binRangeDf.show
val inputDf = Seq((0,"10-12-1992"),
(1,"11-11-1994"),
(2,"07-15-1999"),
(3,"01-20-2001"),
(4,"02-01-2005")).toDF("id","input_date")
// inputDf.show
binRangeDf.createOrReplaceTempView("bin_range")
inputDf.createOrReplaceTempView("input_table")
val countSql = """
SELECT concat(date_format(c.st_dt,'MM-dd-yyyy'),' -- ',date_format(c.end_dt,'MM-dd-yyyy')) as header, c.bin_count
FROM (
(SELECT
b.st_dt, b.end_dt, count(1) as bin_count
FROM
(select to_date(input_date,'MM-dd-yyyy') as date_input , * from input_table) a
left join
(select to_date(start_date,'MM-dd-yyyy') as st_dt, to_date(end_date,'MM-dd-yyyy') as end_dt from bin_range ) b
on
a.date_input >= b.st_dt and a.date_input < b.end_dt
group by 1,2) ) c"""
val countDf = spark.sql(countSql)
countDf.groupBy("bin_count").pivot("header").sum("bin_count").drop("bin_count").show
Although, since you have 2 bin ranges there will be 2 rows generated.
We can achieve this by looking at the date column and determining within which range each record falls.
// First we set up the problem
// Create a format that looks like yours
val dateFormat = java.time.format.DateTimeFormatter.ofPattern("MM-dd-yyyy")
// Get the current local date
val now = java.time.LocalDate.now
// Create a range of 1-10000 and map each to minusDays
// so we can have range of dates going 10000 days back
val dates = (1 to 10000).map(now.minusDays(_).format(dateFormat))
// Create a DataFrame we can work with.
val df = dates.toDF("date")
So far so good. We have date entries to work with, and they are like your format (MM-dd-yyyy).
Next up, we need a function which returns 1 if the date falls within range, and 0 if not. We create a UserDefinedFunction (UDF) from this function so we can apply it to all rows simultaneously across Spark executors.
// We will process each range one at a time, so we'll take it as a string
// and split it accordingly. Then we perform our tests. Using Dates is
// necessary to cater to your format.
import java.text.SimpleDateFormat
def isWithinRange(date: String, binRange: String): Int = {
val format = new SimpleDateFormat("MM-dd-yyyy")
val startDate = format.parse(binRange.split(" - ").head)
val endDate = format.parse(binRange.split(" - ").last)
val testDate = format.parse(date)
if (!(testDate.before(startDate) || testDate.after(endDate))) 1
else 0
}
// We create a udf which uses an anonymous function taking two args and
// simply pass the values to our prepared function
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def isWithinRangeUdf: UserDefinedFunction =
udf((date: String, binRange: String) => isWithinRange(date, binRange))
Now that we have our UDF setup, we create new columns in our DataFrame and group by the given bins and sum the values over (hence why we made our functions evaluate to an Int)
// We define our bins List
val bins = List("01-01-1990 - 12-31-1999",
"01-01-2000 - 12-31-2009",
"01-01-2010 - 12-31-2020")
// We fold through the bins list, creating a column from each bin at a time,
// enriching the DataFrame with more columns as we go
import org.apache.spark.sql.functions.{col, lit}
val withBinsDf = bins.foldLeft(df){(changingDf, bin) =>
changingDf.withColumn(bin, isWithinRangeUdf(col("date"), lit(bin)))
}
withBinsDf.show(1)
//+----------+-----------------------+-----------------------+-----------------------+
//| date|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+----------+-----------------------+-----------------------+-----------------------+
//|09-01-2020| 0| 0| 1|
//+----------+-----------------------+-----------------------+-----------------------+
//only showing top 1 row
Finally we select our bin columns and groupBy them and sum.
val binsDf = withBinsDf.select(bins.head, bins.tail:_*)
val sums = bins.map(b => sum(b).as(b)) // keep col name as is
val summedBinsDf = binsDf.groupBy().agg(sums.head, sums.tail:_*)
summedBinsDf.show
//+-----------------------+-----------------------+-----------------------+
//|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+-----------------------+-----------------------+-----------------------+
//| 2450| 3653| 3897|
//+-----------------------+-----------------------+-----------------------+
2450 + 3653 + 3897 = 10000, so it seems our work was correct.
Perhaps I overdid it and there is a simpler solution, please let me know if you know a better way (especially to handle MM-dd-yyyy dates).

How to calculate the average of rows between a start index and end index of a column in a dataframe in Spark using Scala?

I have a spark dataframe with a column having float type values. I am trying to find the average of values between row 11 to row 20. Please note, I am not trying any sort of moving average. I tried using partition window like so -
var avgClose= avg(priceDF("Close")).over(partitionWindow.rowsBetween(11,20))
It returns an 'org.apache.spark.sql.Column' result. I don't know how to view avgClose.
I am new to Spark and Scala. Appreciate your help in getting this.
Assign an increasing id to your table. Then you can do an average between the ids.
val df = Seq(20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1).toDF("val1")
val dfWithId = df.withColumn("id", monotonically_increasing_id())
val avgClose= dfWithId.filter($"id" >= 11 && $"id" <= 20).agg(avg("val1"))
avgClose.show()
result:
+---------+
|avg(val1)|
+---------+
| 5.0|
+---------+

How can I split a column off of a DataFrame, but keep its association with the initial DataFrame?

I have a dataframe dataDF that is:
+-------+------+-----+-----+-----------+
|TEST_PK| COL_1|COL_2|COL_3|h_timestamp|
+-------+------+-----+-----+-----------+
| 1| apple| 10| 1.79| 1111|
| 1| apple| 11| 1.79| 1114|
| 2|banana| 15| 1.79| 1112|
| 2|banana| 16| 1.79| 1115|
| 3|orange| 7| 1.79| 1113|
+-------+------+-----+-----+-----------+
And I need to run this function:
operation(row, h_timestamp)
On each row, but row can not contain h_timestamp, so my first thought is to split the dataframe like:
val columns = dataDF.drop("h_timestamp")
val timestamp = dataDF.select("h_timestamp")
But that doesn't help when I want to perform the operation on every row like:
dataDF.map(row => {
...
val rowWithoutTimestamp = ???
val timestamp = ???
operation(rowWithoutTimestamp, timestamp)
...
})
But now those two dataframes are not linked and I don't know how to get the right timestamp for each row. The TEST_PK column is not necessarily unique.
Is there a way to use .drop() or .select() on just a row or some other way to do this?
Edit: Also, the table could have any number of columns, but will always have the timestamp column and at least one more that is not the timestamp
Since you have what appears to be a primary key column, just fork the timestamp with the id column into it's own dataframe to re-join it later.
val tsDF = dataDF.select("TEST_PK", "h_timestamp")
Then, drop the column from dataDF, do your operation, and re-join the h_timestamp back onto a new dataframe.
val finalDF = postopDF.join(tsDF, "TEST_PK")
Update
The sample code is helpful, you should be able to essentially dissasemble your row, and rebuild a new row with the desired values with something like this:
dataDF.map(row => {
val rowWithoutTimestamp = Row(
row.getAs[Long]("TEST_PK"),
row.getAs[String]("COL_1"),
row.getAs[Long]("COL_2"),
row.getAs[Double]("COL_3")
)
val timestamp = row.getAs[Long]("h_timestamp")
val result = operation(rowWithoutTimestamp, timestamp)
Row(result, timestamp)
})
Of course, I'm not certain what your operation() returns, so it may be necessary to disassemble result into individual values and compose a new row with those and the timestamp.
Update 2
OK, here is a more generic method. It wraps "all columns except" h_timestamp into a struct, and maps over the (struct, ts) tuple. Actually more elegant than the previous solution anyway.
val cols = df.drop("h_timestamp").columns.toSeq
dataDF
.select(struct(cols.map(c => col(c)):_*).as("row_no_ts"), $"h_timestamp")
.map(row => {
val rowWithoutTimestamp = row.getAs[Row]("row_no_ts")
val timestamp = row.getAs[Long]("h_timestamp")
operation(rowWithoutTimestamp, timestamp)
})
I'm not sure if you are mapping to just the output of operation() or some combination with the timestamp again, but both are available to modify to suit your needs.

Filtering a DataFrame on date columns comparison

I am trying to filter a DataFrame comparing two date columns using Scala and Spark. Based on the filtered DataFrame there are calculations running on top to calculate new columns.
Simplified my data frame has the following schema:
|-- received_day: date (nullable = true)
|-- finished: int (nullable = true)
On top of that I create two new column t_start and t_end that would be used for filtering the DataFrame. They have 10 and 20 days difference from the original column received_day:
val dfWithDates= df
.withColumn("t_end",date_sub(col("received_day"),10))
.withColumn("t_start",date_sub(col("received_day"),20))
I now want to have a new calculated column that indicates for each row of data how many rows of the dataframe are in the t_start to t_end period. I thought I can achieve this the following way:
val dfWithCount = dfWithDates
.withColumn("cnt", lit(
dfWithDates.filter(
$"received_day".lt(col("t_end"))
&& $"received_day".gt(col("t_start"))).count()))
However, this count only returns 0 and I believe that the problem is in the the argument that I am passing to lt and gt.
From following that issue here Filtering a spark dataframe based on date I realized that I need to pass a string value. If I try with hard coded values like lt(lit("2018-12-15")), then the filtering works. So I tried casting my columns to StringType:
val dfWithDates= df
.withColumn("t_end",date_sub(col("received_day"),10).cast(DataTypes.StringType))
.withColumn("t_start",date_sub(col("received_day"),20).cast(DataTypes.StringType))
But the filter still returns an empty dataFrame.
I would assume that I am not handling the data type right.
I am running on Scala 2.11.0 with Spark 2.0.2.
Yes you are right. For $"received_day".lt(col("t_end") each reveived_day value is compared with the current row's t_end value, not the whole dataframe. So each time you'll get zero as count.
You can solve this by writing a simple udf. Here is the way how you can solve the issue:
Creating sample input dataset:
import org.apache.spark.sql.{Row, SparkSession}
import java.sql.Date
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq((Date.valueOf("2018-10-12"),1),
(Date.valueOf("2018-10-13"),1),
(Date.valueOf("2018-09-25"),1),
(Date.valueOf("2018-10-14"),1)).toDF("received_day", "finished")
val dfWithDates= df
.withColumn("t_start",date_sub(col("received_day"),20))
.withColumn("t_end",date_sub(col("received_day"),10))
dfWithDates.show()
+------------+--------+----------+----------+
|received_day|finished| t_start| t_end|
+------------+--------+----------+----------+
| 2018-10-12| 1|2018-09-22|2018-10-02|
| 2018-10-13| 1|2018-09-23|2018-10-03|
| 2018-09-25| 1|2018-09-05|2018-09-15|
| 2018-10-14| 1|2018-09-24|2018-10-04|
+------------+--------+----------+----------+
Here for 2018-09-25 we desire count 3
Generate output:
val count_udf = udf((received_day:Date) => {
(dfWithDates.filter((col("t_end").gt(s"$received_day")) && col("t_start").lt(s"$received_day")).count())
})
val dfWithCount = dfWithDates.withColumn("count",count_udf(col("received_day")))
dfWithCount.show()
+------------+--------+----------+----------+-----+
|received_day|finished| t_start| t_end|count|
+------------+--------+----------+----------+-----+
| 2018-10-12| 1|2018-09-22|2018-10-02| 0|
| 2018-10-13| 1|2018-09-23|2018-10-03| 0|
| 2018-09-25| 1|2018-09-05|2018-09-15| 3|
| 2018-10-14| 1|2018-09-24|2018-10-04| 0|
+------------+--------+----------+----------+-----+
To make computation faster i would suggest to cache dfWithDates as there are repetition of same operation for each row.
You can cast date value to string with any pattern using DateTimeFormatter
import java.time.format.DateTimeFormatter
date.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"))