How to get the lag of a column in a Spark streaming dataframe? - scala

I have data streaming into my Spark Scala application in this format
id mark1 mark2 mark3 time
uuid1 100 200 300 Tue Aug 8 14:06:02 PDT 2017
uuid1 100 200 300 Tue Aug 8 14:06:22 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:32 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:52 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:58 PDT 2017
I have it read into columns id, mark1, mark2, mark3 and time. The time is converted to datetime format as well.
I want to group this by id and compute the lag of mark1, which gives the previous row's mark1 value.
Something like this:
id mark1 mark2 mark3 prev_mark time
uuid1 100 200 300 null Tue Aug 8 14:06:02 PDT 2017
uuid1 100 200 300 100 Tue Aug 8 14:06:22 PDT 2017
uuid2 150 250 350 null Tue Aug 8 14:06:32 PDT 2017
uuid2 150 250 350 150 Tue Aug 8 14:06:52 PDT 2017
uuid2 150 250 350 150 Tue Aug 8 14:06:58 PDT 2017
Consider the dataframe to be markDF. I have tried:
val window = Window.partitionBy("uuid").orderBy("timestamp")
val newerDF = markDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))
which fails, saying that non-time-based windows cannot be applied on streaming/appending datasets/frames.
I have also tried:
val window = Window.partitionBy("uuid").orderBy("timestamp").rowsBetween(-10, 10)
val newerDF = markDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))
to get a window over a few rows, which did not work either. A streaming time window such as:
window("timestamp", "10 minutes")
cannot be used as the window for lag either. I am super confused about how to do this. Any help would be awesome!

I would advise you to change the time column to String, as
+-----+-----+-----+-----+----------------------------+
|id |mark1|mark2|mark3|time |
+-----+-----+-----+-----+----------------------------+
|uuid1|100 |200 |300 |Tue Aug 8 14:06:02 PDT 2017|
|uuid1|100 |200 |300 |Tue Aug 8 14:06:22 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:32 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:52 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:58 PDT 2017|
+-----+-----+-----+-----+----------------------------+
root
|-- id: string (nullable = true)
|-- mark1: integer (nullable = false)
|-- mark2: integer (nullable = false)
|-- mark3: integer (nullable = false)
|-- time: string (nullable = true)
After that, doing the following should work:
df.withColumn("prev_mark", lag("mark1", 1).over(Window.partitionBy("id").orderBy("time")))
Which will give you output as
+-----+-----+-----+-----+----------------------------+---------+
|id |mark1|mark2|mark3|time |prev_mark|
+-----+-----+-----+-----+----------------------------+---------+
|uuid1|100 |200 |300 |Tue Aug 8 14:06:02 PDT 2017|null |
|uuid1|100 |200 |300 |Tue Aug 8 14:06:22 PDT 2017|100 |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:32 PDT 2017|null |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:52 PDT 2017|150 |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:58 PDT 2017|150 |
+-----+-----+-----+-----+----------------------------+---------+
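For reference, here is a self-contained sketch of the batch version of this answer (it assumes a SparkSession named spark; as the error in the question says, non-time-based window functions are not supported on streaming DataFrames, so this applies to a static frame):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import spark.implicits._

val markDF = Seq(
  ("uuid1", 100, 200, 300, "Tue Aug 8 14:06:02 PDT 2017"),
  ("uuid1", 100, 200, 300, "Tue Aug 8 14:06:22 PDT 2017"),
  ("uuid2", 150, 250, 350, "Tue Aug 8 14:06:32 PDT 2017"),
  ("uuid2", 150, 250, 350, "Tue Aug 8 14:06:52 PDT 2017"),
  ("uuid2", 150, 250, 350, "Tue Aug 8 14:06:58 PDT 2017")
).toDF("id", "mark1", "mark2", "mark3", "time")

// Ordering by the raw time string only works here because every row shares the
// same date prefix; in general, parse it to a timestamp before ordering.
val byId = Window.partitionBy("id").orderBy("time")
markDF.withColumn("prev_mark", lag("mark1", 1).over(byId)).show(false)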

Related

Returns the value in the next empty rows

Input
Here is an example of my input.
Number  Date                          Motore
1       Fri Jan 01 00:00:00 CET 2021  Motore 1
2                                     Motore 2
3                                     Motore 3
4                                     Motore 4
5       Fri Feb 01 00:00:00 CET 2021  Motore 1
6                                     Motore 2
7                                     Motore 3
8                                     Motore 4
Expected Output
Number  Date                          Motore
1       Fri Jan 01 00:00:00 CET 2021  Motore 1
2       Fri Jan 01 00:00:00 CET 2021  Motore 2
3       Fri Jan 01 00:00:00 CET 2021  Motore 3
4       Fri Jan 01 00:00:00 CET 2021  Motore 4
5       Fri Feb 01 00:00:00 CET 2021  Motore 1
6       Fri Feb 01 00:00:00 CET 2021  Motore 2
7       Fri Feb 01 00:00:00 CET 2021  Motore 3
8       Fri Feb 01 00:00:00 CET 2021  Motore 4
I tried to use the tMemorizeRows component but without any result: the second row gets the date, but the others do not. Could you kindly help me?
You can solve this with a simple tMap using 2 inner variables (using the "var" array in the middle of the tMap).
Create two variables:
currentValue: holds the value of your input date column (in my example, "row1.data").
updateValue: checks whether currentValue is null or not. If it is null, do not modify updateValue; if it is not null, update it with the new value. This way updateValue always contains non-null data.
In the output, just use the updateValue variable.
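For clarity, here is the same carry-forward logic as a small Scala sketch; it is only an illustration of what the two tMap variables do, not Talend code, and the row values are copied from the example input above.
// Carry the last non-null date forward over the rows, in input order.
val rows = Seq(
  (1, Some("Fri Jan 01 00:00:00 CET 2021"), "Motore 1"),
  (2, None, "Motore 2"),
  (3, None, "Motore 3"),
  (4, None, "Motore 4"),
  (5, Some("Fri Feb 01 00:00:00 CET 2021"), "Motore 1"),
  (6, None, "Motore 2"),
  (7, None, "Motore 3"),
  (8, None, "Motore 4")
)

// updateValue plays the role of the second tMap variable: it is replaced only
// when the current row carries a non-null date, otherwise it keeps its value.
var updateValue: Option[String] = None
val filled = rows.map { case (number, date, motore) =>
  if (date.isDefined) updateValue = date
  (number, updateValue.getOrElse(""), motore)
}

filled.foreach(println)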

Search date field by year in JanusGraph

I have a 'Date' property on my 'Patent' node class that is formatted like this:
==>Sun Jan 28 00:08:00 UTC 2007
==>Tue Jan 27 00:10:00 UTC 1987
==>Wed Jan 10 00:04:00 UTC 2001
==>Sun Jan 17 00:08:00 UTC 2010
==>Tue Jan 05 00:10:00 UTC 2010
==>Thu Jan 28 00:09:00 UTC 2010
==>Wed Jan 04 00:09:00 UTC 2012
==>Wed Jan 09 00:12:00 UTC 2008
==>Wed Jan 24 00:04:00 UTC 2018
And is stored as class java.util.Date in the database.
Is there a way to search this field to return all the 'Patents' for a particular year?
I tried variations of g.V().has("Patent", "date", 2000).values(), but it returns neither results nor an error message.
Is there a way to search this property field by year or do I need to create a separate property that just contains year?
You do not need to create a separate property for the year. JanusGraph recognizes the Date data type and can filter by date values.
gremlin> dateOfBirth1 = new GregorianCalendar(2000, 5, 6).getTime()
==>Tue Jun 06 00:00:00 MDT 2000
gremlin> g.addV("person").property("name", "Person 1").property("dateOfBirth", dateOfBirth1)
==>v[4144]
gremlin> dateOfBirth2 = new GregorianCalendar(2001, 5, 6).getTime()
==>Wed Jun 06 00:00:00 MDT 2001
gremlin> g.addV("person").property("name", "Person 2").property("dateOfBirth", dateOfBirth2)
==>v[4328]
gremlin> dateOfBirthFrom = new GregorianCalendar(2000, 0, 1).getTime()
==>Sat Jan 01 00:00:00 MST 2000
gremlin> dateOfBirthTo = new GregorianCalendar(2001, 0, 1).getTime()
==>Mon Jan 01 00:00:00 MST 2001
gremlin> g.V().hasLabel("person").
......1> has("dateOfBirth", gte(dateOfBirthFrom)).
......2> has("dateOfBirth", lt(dateOfBirthTo)).
......3> values("name")
==>Person 1
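Applied to the question's schema, the same range filter would look roughly like the sketch below (Scala against the TinkerPop API; it assumes an already-connected GraphTraversalSource named g, the vertex label "Patent", and the property key "date" used in the query attempt above).
import java.util.{Calendar, GregorianCalendar}
import org.apache.tinkerpop.gremlin.process.traversal.P

// Year boundaries as java.util.Date, matching the stored property type.
val from = new GregorianCalendar(2000, Calendar.JANUARY, 1).getTime // 2000-01-01 inclusive
val to   = new GregorianCalendar(2001, Calendar.JANUARY, 1).getTime // 2001-01-01 exclusive

// All patents whose date falls within the year 2000.
val patents2000 = g.V()
  .hasLabel("Patent")
  .has("date", P.gte(from))
  .has("date", P.lt(to))
  .toList()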

Converting a CDT timestamp into UTC format in Spark Scala

My DataFrame, myDF, looks like below:
DATE_TIME
Wed Sep 6 15:24:27 CDT 2017
Wed Sep 6 15:30:05 CDT 2017
Expected output, in this format:
2017-09-06 15:24:27
2017-09-06 15:30:05
I need to convert the DATE_TIME timestamp to UTC.
I tried the code below in a Databricks notebook, but it's not working.
%scala
val df = Seq(("Wed Sep 6 15:24:27 CDT 2017")).toDF("times")
df.withColumn("times2",date_format(to_timestamp('times,"ddd MMM dd hh:mm:ss CDT yyyy"),"yyyy-MM-dd HH:mm:ss")).show(false)
times | times2
Wed Sep 6 15:24:27 CDT 2017 | null
I think we need to remove the day name ("Wed") from your string and then use the to_timestamp() function.
Example:
df.show(false)
/*
+---------------------------+
|times |
+---------------------------+
|Wed Sep 6 15:24:27 CDT 2017|
+---------------------------+
*/
df.withColumn("times2",expr("""to_timestamp(substring(times,5,length(times)),"MMM d HH:mm:ss z yyyy")""")).
show(false)
/*
+---------------------------+-------------------+
|times |times2 |
+---------------------------+-------------------+
|Wed Sep 6 15:24:27 CDT 2017|2017-09-06 15:24:27|
+---------------------------+-------------------+
*/
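If the result should actually be shifted to UTC (the output above keeps the Central-time wall clock), one option is to parse the local time without the zone token and shift it explicitly with to_utc_timestamp. This is a sketch, assuming the strings always carry a US Central abbreviation (CDT/CST) and that spark.implicits._ is in scope as in the notebook above.
import org.apache.spark.sql.functions._

val df = Seq("Wed Sep 6 15:24:27 CDT 2017").toDF("times")

// Drop the leading day name ("Wed ") and the zone token (" CDT "/" CST "),
// parse what is left as a local wall-clock time, then shift Central time to UTC.
val result = df
  .withColumn("stripped", regexp_replace(regexp_replace(col("times"), "^\\w{3} ", ""), " C[DS]T ", " "))
  .withColumn("local_ts", to_timestamp(col("stripped"), "MMM d HH:mm:ss yyyy"))
  .withColumn("utc_ts", to_utc_timestamp(col("local_ts"), "America/Chicago"))

result.select("times", "local_ts", "utc_ts").show(false)
// utc_ts for the sample row should display as 2017-09-06 20:24:27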

Pivot table with columns as year/date in KDB+

I am trying to create a pivot table with years as columns out of a simple table:
q)growth:([] stock:asc 9#`goog`apple`nokia; year: 9#2015 2016 2017; returns:9?20 )
q)growth
stock year returns
------------------
apple 2015 9
apple 2016 18
apple 2017 17
goog 2015 8
goog 2016 13
goog 2017 17
nokia 2015 12
nokia 2016 12
nokia 2017 2
but I am not able to get the correct structure; it is still returning a dictionary rather than multiple year columns.
q)exec (distinct growth`year)#year!returns by stock:stock from growth
stock|
-----| ----------------------
apple| 2015 2016 2017!9 18 17
goog | 2015 2016 2017!8 13 17
nokia| 2015 2016 2017!12 12 2
am I doing anything wrong?
You need to convert the years to symbols in order to use them as column headers. In this case I have updated the growth table first and then performed the pivot:
q)exec distinct[year]#year!returns by stock:stock from update `$string year from growth
stock| 2015 2016 2017
-----| --------------
apple| 12 8 10
goog | 1 9 11
nokia| 5 6 1
Additionally, you may see that I have changed (distinct growth`year) to distinct[year], as this yields the same result, with year now being pulled from the updated table.
The column names of a table in kdb+ must be symbols rather than any other data type.
In your pivot, the datatype of the 'year' column is int/long, which is why a proper table is not produced.
If you cast it to symbol, it will work.
q)growth:([] stock:asc 9#`goog`apple`nokia; year: 9#2015 2016 2017; returns:9?20 )
q)growth:update `$string year from growth
q)exec (distinct growth`year)#year!returns by stock:stock from growth
stock| 2015 2016 2017
-----| --------------
apple| 9 18 17
goog | 8 13 17
nokia| 12 12 2
Alternatively, you can pivot on 'stock' rather than 'year' and get a pivot table from the original table unchanged.
q)growth:([] stock:asc 9#`goog`apple`nokia; year: 9#2015 2016 2017; returns:9?20 )
q)show exec (distinct growth`stock)#stock!returns by year:year from growth
year| apple goog nokia
----| ----------------
2015| 4 2 4
2016| 5 13 12
2017| 12 6 1

How do you apply a function like MIN/MAX to an Iterable in an RDD?

I would like to find out an efficient way to apply a function to an RDD.
Here is what I am trying to do:
I have defined the following class:
case class Type(Key: String, category: String, event: String, date: java.util.Date, value: Double)
case class Type2(Key: String, Mdate: java.util.Date, label: Double)
Then I loaded the RDDs:
val TypeRDD: RDD[Type] = types.map(s=>s.split(",")).map(s=>Type(s(0), s(1), s(2),dateFormat.parse(s(3).asInstanceOf[String]), s(4).toDouble))
val Type2RDD: RDD[Type2] = types2.map(s=>s.split(",")).map(s=>Type2(s(0), dateFormat.parse(s(1).asInstanceOf[String]), s(2).toDouble))
Then I try to create two new RDDs: one where Type.Key = Type2.Key, and another where Type.Key does not exist in Type2.
val grpType = TypeRDD.groupBy(_.Key)
val grpType2 = Type2RDD.groupBy(_.Key)
// get the groups whose Key does not exist in Type2 and return the values from grpType
val tempresult = grpType fullOuterJoin grpType2
val result = tempresult.filter(_._2._2.isEmpty).map(_._2._1)
// get the groups where Type.Key == Type2.Key
val result2 = (grpType join grpType2).map(_._2)
UPDATED:
typeRDD =
(19,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(21,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(21,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(24,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(24,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(40,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
type2RDD =
(24,Wed Dec 22 00:00:00 EST 3080,1.0)
(40,Wed Jan 22 00:00:00 EST 3080,1.0)
For result 1, I would like to get the following:
(19,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(21,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(21,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
For result 2:
(24,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(24,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(40,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
I then want to count the number of occurrences of each event per key.
Result 1:
19 EVENT1 2
19 EVENT2 1
19 EVENT3 2
21 EVENT3 2
Result 2:
24 EVENT2 2
40 EVENT1 1
Then I want to get the MIN/MAX/AVG of the event counts:
Result 1 min event count = 1
Result 1 max event count = 5
Result 1 avg event count = 10/4 = 2.5
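A minimal sketch of one way to compute these counts and then the MIN/MAX/AVG over them, assuming the result and result2 RDDs built above (the helper name eventCounts is just for illustration, and this is one possible approach rather than necessarily the most efficient one):
import org.apache.spark.rdd.RDD

// result  : RDD[Option[Iterable[Type]]]            -- keys absent from Type2
// result2 : RDD[(Iterable[Type], Iterable[Type2])] -- keys present in both
val rows1: RDD[Type] = result.flatMap(_.getOrElse(Iterable.empty))
val rows2: RDD[Type] = result2.flatMap(_._1)

// Count how many times each (Key, event) pair occurs.
def eventCounts(rows: RDD[Type]): RDD[((String, String), Int)] =
  rows.map(t => ((t.Key, t.event), 1)).reduceByKey(_ + _)

val counts1 = eventCounts(rows1)   // ((19,EVENT1),2), ((19,EVENT2),1), ...
val counts2 = eventCounts(rows2)   // ((24,EVENT2),2), ((40,EVENT1),1)

// MIN / MAX / AVG over the per-key, per-event counts of result 1.
val justCounts = counts1.values
val minCount = justCounts.min()
val maxCount = justCounts.max()
val avgCount = justCounts.sum() / justCounts.count()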