Spark dataframe change column value to timestamp [duplicate] - scala

This question already has answers here:
Apache Spark subtract days from timestamp column
(2 answers)
Closed 4 years ago.
I have a jsonl file that I've read in, created a temporary table view from, and filtered down to the records that I want to amend.
val df = session.read.json("tiny.jsonl")
df.createOrReplaceTempView("tempTable")
val filter = df.select("*").where("field IS NOT NULL")
Now I am at the part where I have been trying various things. I want to replace a column called "time" with the current timestamp before I write the data back. Sometimes I will want the new value to be, for example, the current timestamp minus 5 days.
val change = filter.withColumn("server_time", date_add(current_timestamp(), -1))
The example above gives me back a date one day before today, rather than a timestamp.
Edit:
Sample Dataframe that mocks out my jsonl input:
val df = Seq(
  (1, "fn", "2018-02-18T22:18:28.645Z"),
  (2, "fu", "2018-02-18T22:18:28.645Z"),
  (3, null, "2018-02-18T22:18:28.645Z")
).toDF("id", "field", "time")
Expected output:
+---+------+-------------------------+
| id|field |time |
+---+------+-------------------------+
| 1| fn | 2018-04-09T22:18:28.645Z|
| 2| fu | 2018-04-09T22:18:28.645Z|
+---+------+-------------------------+

If you want to replace the current time column with the current timestamp, you can use the current_timestamp function. To add a number of days you can use a SQL INTERVAL:
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF and the $ column syntax

val df = Seq(
  (1, "fn", "2018-02-18T22:18:28.645Z"),
  (2, "fu", "2018-02-18T22:18:28.645Z"),
  (3, null, "2018-02-18T22:18:28.645Z")
).toDF("id", "field", "time")
  .na.drop() // drop the row where field is null

val ddf = df
  .withColumn("time", current_timestamp())                  // overwrite time with the current timestamp
  .withColumn("newTime", $"time" + expr("INTERVAL 5 DAYS")) // derive a column shifted 5 days forward
Output:
+---+-----+-----------------------+-----------------------+
|id |field|time |newTime |
+---+-----+-----------------------+-----------------------+
|1 |fn |2018-04-10 15:14:27.501|2018-04-15 15:14:27.501|
|2 |fu |2018-04-10 15:14:27.501|2018-04-15 15:14:27.501|
+---+-----+-----------------------+-----------------------+
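For the original requirement of writing back a timestamp that lies in the past (e.g. "now minus 5 days"), the same interval trick should work with subtraction. A minimal sketch, reusing the column names above:
val past = df
  .withColumn("time", current_timestamp() - expr("INTERVAL 5 DAYS")) // current timestamp shifted 5 days back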

Related

Spark dataframe filter a timestamp by just the date part

How can I filter a Spark dataframe that has a column of type timestamp by just the date part? I tried the below, but it only matches rows where the time is 00:00:00.
Basically I want the filter to match all rows with date 2020-01-01 (3 rows).
import java.sql.Timestamp
val df = Seq(
  (1, Timestamp.valueOf("2020-01-01 23:00:01")),
  (2, Timestamp.valueOf("2020-01-01 00:00:00")),
  (3, Timestamp.valueOf("2020-01-01 12:54:00")),
  (4, Timestamp.valueOf("2019-12-15 09:54:00")),
  (5, Timestamp.valueOf("2019-12-09 10:12:43"))
).toDF("someCol", "someTimeStamp")
df.filter(df("someTimeStamp") === "2020-01-01").show
+-------+-------------------+
|someCol| someTimeStamp|
+-------+-------------------+
| 2|2020-01-01 00:00:00| // ONLY MATCHED with time 00:00
+-------+-------------------+
Use the to_date function to extract the date from the timestamp:
scala> df.filter(to_date(df("someTimeStamp")) === "2020-01-01").show
+-------+-------------------+
|someCol| someTimeStamp|
+-------+-------------------+
| 1|2020-01-01 23:00:01|
| 2|2020-01-01 00:00:00|
| 3|2020-01-01 12:54:00|
+-------+-------------------+
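An equivalent approach is a half-open range comparison on the raw timestamp, which avoids applying to_date to every row. A sketch against the same df, assuming Spark's usual implicit cast of the string literals to timestamps:
df.filter(df("someTimeStamp") >= "2020-01-01" && df("someTimeStamp") < "2020-01-02").show // all rows on 2020-01-01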

Spark dataframe last object in a group [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 4 years ago.
I need to select the last 'name' for the given 'id'. A possible solution could be the following:
val channels = sessions
  .select($"start_time", $"id", $"name")
  .orderBy($"start_time")
  .select($"id", $"name")
  .groupBy($"id")
  .agg(last("name"))
I don't know if it's correct because I'm not sure that orderBy is kept after doing groupBy.
But it's certainly not a performant solution. Probably I should use reduceByKey. I tried the following in the spark shell and it works
val x = sc.parallelize(Array(("1", "T1"), ("2", "T2"), ("1", "T11"), ("1", "T111"), ("2", "T22"), ("1", "T100"), ("2", "T222"), ("2", "T200")), 3)
x.reduceByKey((acc,x) => x).collect
But it doesn't work with my dataframe.
case class ChannelRecord(id: Long, name: String)
val channels = sessions
  .select($"start_time", $"id", $"name")
  .orderBy($"start_time")
  .select($"id", $"name")
  .as[ChannelRecord]
  .reduceByKey((acc, x) => x) // take the last object
I got a compilation error: value reduceByKey is not a member of org.apache.spark.sql.Dataset
I think I should add a map() call before doing reduceByKey, but I cannot figure out what I should map.
You could do it with a window function, for example. This will require a shuffle on the id column and a sort on start_time.
There are two stages:
Get last name for each id
Keep only rows with the last name (max start_time)
Example dataframe:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val rowsRdd: RDD[Row] = spark.sparkContext.parallelize(
  Seq(
    Row(1, "a", 1),
    Row(1, "b", 2),
    Row(1, "c", 3),
    Row(2, "d", 4),
    Row(2, "e", 5),
    Row(2, "f", 6),
    Row(3, "g", 7),
    Row(3, "h", 8)
  ))

val schema: StructType = new StructType()
  .add(StructField("id", IntegerType, false))
  .add(StructField("name", StringType, false))
  .add(StructField("start_time", IntegerType, false))

val df0: DataFrame = spark.createDataFrame(rowsRdd, schema)
Define a window. Note that I am sorting here by start_time in decreasing order. This is to be able to choose the first row in the next step.
val w = Window.partitionBy("id").orderBy(col("start_time").desc)
Then
df0.withColumn("last_name", first("name").over(w)) // get first name for each id (first because of decreasing start_time)
  .withColumn("row_number", row_number().over(w))  // get row number for each id sorted by start_time
  .filter("row_number=1")                          // choose only first rows (first row = max start_time)
  .drop("row_number")                              // get rid of the row_number column
  .sort("id")
  .show(10, false)
This returns
+---+----+----------+---------+
|id |name|start_time|last_name|
+---+----+----------+---------+
|1 |c |3 |c |
|2 |f |6 |f |
|3 |h |8 |h |
+---+----+----------+---------+
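If you would rather stay close to the reduceByKey idea from the question, the Dataset equivalent is groupByKey followed by reduceGroups. A sketch against df0 above; the ChannelRecord fields are adapted to this example's integer schema:
import spark.implicits._

case class ChannelRecord(id: Int, name: String, start_time: Int)

val lastPerId = df0.as[ChannelRecord]
  .groupByKey(_.id)                                                    // shuffle records by id
  .reduceGroups((a, b) => if (a.start_time >= b.start_time) a else b)  // keep the record with the max start_time
  .map(_._2)                                                           // drop the key, keep the record

lastPerId.show(false)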

how to apply partition in spark scala dataframe with multiple columns? [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 5 years ago.
I have the following dataframe df in Spark Scala:
id  project  start_date  Change_date  designation
1   P1       08/10/2018  01/09/2017   2
1   P1       08/10/2018  02/11/2018   3
1   P1       08/10/2018  01/08/2016   1
I then need to get the designation whose Change_date is closest to start_date while still being earlier than it.
Expected output:
id  project  start_date  designation
1   P1       08/10/2018  2
This is because change date 01/09/2017 is the closest date before start_date.
Can somebody advise how to achieve this?
This is not about selecting the first row, but about selecting the designation corresponding to the change date closest to the start date.
Parse dates:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = ???
import spark.implicits._

val df = Seq(
  (1, "P1", "08/10/2018", "01/09/2017", 2),
  (1, "P1", "08/10/2018", "02/11/2018", 3),
  (1, "P1", "08/10/2018", "01/08/2016", 1)
).toDF("id", "project_id", "start_date", "changed_date", "designation")

val parsed = df
  .withColumn("start_date", to_date($"start_date", "dd/MM/yyyy"))
  .withColumn("changed_date", to_date($"changed_date", "dd/MM/yyyy"))
Find the difference in days, keeping only changes that happened before the start date:
val diff = parsed
  .withColumn("diff", datediff($"start_date", $"changed_date"))
  .where($"diff" > 0)
Apply a solution of your choice from How to select the first row of each group?, for example window functions. If you group by id:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"id").orderBy($"diff")
diff.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn").show
// +---+----------+----------+------------+-----------+----+
// | id|project_id|start_date|changed_date|designation|diff|
// +---+----------+----------+------------+-----------+----+
// | 1| P1|2018-10-08| 2017-09-01| 2| 402|
// +---+----------+----------+------------+-----------+----+
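If you prefer to avoid a window function here, grouping with min over a struct gives the same row per id, because structs compare field by field and diff comes first. A sketch against the diff dataframe from above:
diff
  .groupBy($"id")
  .agg(min(struct($"diff", $"start_date", $"changed_date", $"designation")).as("closest"))
  .select($"id", $"closest.start_date", $"closest.designation")
  .show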
Reference:
How to select the first row of each group?

Timestamp comparison in spark-scala dataframe

I have a field of type string in a Spark dataframe, and its value is in the format 2019-07-08 00:00. I have to apply a condition on the field like
df.filter(myfield > "2019-07-08 00:00")
Standard comparison operators on Strings work here, because the yyyy-MM-dd HH:mm format sorts lexicographically in chronological order:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "2019-07-06 16:00"),
  (2, "2019-07-08 09:00"),
  (3, "2019-07-11 06:30")
).toDF("id", "date")

df.filter(col("date") > "2019-07-08 00:00").show
// +---+----------------+
// | id| date|
// +---+----------------+
// | 2|2019-07-08 09:00|
// | 3|2019-07-11 06:30|
// +---+----------------+
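If the format did not sort chronologically (say dd/MM/yyyy HH:mm), you would parse the strings into real timestamps before comparing. A sketch against the same df, assuming the yyyy-MM-dd HH:mm pattern above:
df.filter(to_timestamp(col("date"), "yyyy-MM-dd HH:mm") > to_timestamp(lit("2019-07-08 00:00"), "yyyy-MM-dd HH:mm")).show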

scala spark - matching dataframes based on variable dates

I'm trying to match two dataframes based on a variable date window. I am not simply trying to get an exact match, which my code achieves, but to get all likely candidates within a variable day window.
I was able to get exact matches on dates with my code.
But I want to find out if the records are still viable to match since they could be a few days off either side but would still be reasonable enough to join on.
I've tried looking for something similar to python's pd.to_timedelta('1 day') in Spark to add to the filter, but so far have had no luck.
Here is my current code, which matches the dataframes on the ID column and then runs a filter to ensure that the from_date in the second dataframe is between the start_date and the end_date of the first dataframe.
What I need is not the exact date match but be able to match records if they fall between a day or two (either side) of the actual dates.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
val df1 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("../data/df1.csv")

val df2 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("../data/df2.csv")

val df = df2.join(df1,
    (df1("ID") === df2("ID")) &&
    (df2("from_date") >= df1("start_date")) &&
    (df2("from_date") <= df1("end_date")), "left")
  .select(df1("ID"), df1("start_date"), df1("end_date"),
    $"from_date", $"to_date")

df.coalesce(1).write.format("com.databricks.spark.csv")
  .option("header", "true").save("../mydata.csv")
Essentially I want to be able to edit this date window to increase or decrease the data actually matching.
Would really appreciate your input. I'm new to spark/scala but gotta say I'm loving it so far ... soo much faster (and cleaner) than python!
cheers
You can apply date_add and date_sub to start_date/end_date in your join condition, as shown below:
import org.apache.spark.sql.functions._
import java.sql.Date
val df1 = Seq(
  (1, Date.valueOf("2018-12-01"), Date.valueOf("2018-12-05")),
  (2, Date.valueOf("2018-12-01"), Date.valueOf("2018-12-06")),
  (3, Date.valueOf("2018-12-01"), Date.valueOf("2018-12-07"))
).toDF("ID", "start_date", "end_date")

val df2 = Seq(
  (1, Date.valueOf("2018-11-30")),
  (2, Date.valueOf("2018-12-08")),
  (3, Date.valueOf("2018-12-08"))
).toDF("ID", "from_date")

val deltaDays = 1

df2.join(df1,
  df1("ID") === df2("ID") &&
  df2("from_date") >= date_sub(df1("start_date"), deltaDays) &&
  df2("from_date") <= date_add(df1("end_date"), deltaDays),
  "left_outer"
).show
// +---+----------+----+----------+----------+
// | ID| from_date| ID|start_date| end_date|
// +---+----------+----+----------+----------+
// | 1|2018-11-30| 1|2018-12-01|2018-12-05|
// | 2|2018-12-08|null| null| null|
// | 3|2018-12-08| 3|2018-12-01|2018-12-07|
// +---+----------+----+----------+----------+
You can get the same result using the datediff() function as well. Check this out:
scala> val df1 = Seq((1, "2018-12-01", "2018-12-05"),(2, "2018-12-01", "2018-12-06"),(3, "2018-12-01", "2018-12-07")).toDF("ID", "start_date", "end_date").withColumn("start_date",'start_date.cast("date")).withColumn("end_date",'end_date.cast("date"))
df1: org.apache.spark.sql.DataFrame = [ID: int, start_date: date ... 1 more field]
scala> val df2 = Seq((1, "2018-11-30"), (2, "2018-12-08"),(3, "2018-12-08")).toDF("ID", "from_date").withColumn("from_date",'from_date.cast("date"))
df2: org.apache.spark.sql.DataFrame = [ID: int, from_date: date]
scala> val delta = 1;
delta: Int = 1
scala> df2.join(df1,df1("ID") === df2("ID") && datediff('from_date,'start_date) >= -delta && datediff('from_date,'end_date)<=delta, "leftOuter").show(false)
+---+----------+----+----------+----------+
|ID |from_date |ID |start_date|end_date |
+---+----------+----+----------+----------+
|1 |2018-11-30|1 |2018-12-01|2018-12-05|
|2 |2018-12-08|null|null |null |
|3 |2018-12-08|3 |2018-12-01|2018-12-07|
+---+----------+----+----------+----------+
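Closer in spirit to pandas' to_timedelta mentioned in the question, the tolerance can also be written as a SQL interval expression. A sketch reusing df1/df2 from the first snippet; the 1-day window is just an example:
val tolerance = expr("INTERVAL 1 DAYS")
df2.join(df1,
  df1("ID") === df2("ID") &&
  df2("from_date") >= df1("start_date") - tolerance &&  // widen the window on the lower bound
  df2("from_date") <= df1("end_date") + tolerance,      // and on the upper bound
  "left_outer"
).show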