How to get the lag of a column in a Spark streaming dataframe? - scala

I have data streaming into my Spark Scala application in this format
id mark1 mark2 mark3 time
uuid1 100 200 300 Tue Aug 8 14:06:02 PDT 2017
uuid1 100 200 300 Tue Aug 8 14:06:22 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:32 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:52 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:58 PDT 2017
I have it read into columns id, mark1, mark2, mark3 and time. The time is converted to datetime format as well.
I want to group this by id and compute the lag of mark1, which gives the previous row's mark1 value.
Something like this:
id mark1 mark2 mark3 prev_mark time
uuid1 100 200 300 null Tue Aug 8 14:06:02 PDT 2017
uuid1 100 200 300 100 Tue Aug 8 14:06:22 PDT 2017
uuid2 150 250 350 null Tue Aug 8 14:06:32 PDT 2017
uuid2 150 250 350 150 Tue Aug 8 14:06:52 PDT 2017
uuid2 150 250 350 150 Tue Aug 8 14:06:58 PDT 2017
Consider the dataframe to be markDF. I have tried:
val window = Window.partitionBy("uuid").orderBy("timestamp")
val newerDF = markDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))
which fails, saying that non-time-based windows cannot be applied on streaming/appending datasets/frames.
I have also tried:
val window = Window.partitionBy("uuid").orderBy("timestamp").rowsBetween(-10, 10)
val newerDF = markDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))
to get a window over a few rows, which did not work either. A streaming time window such as:
window("timestamp", "10 minutes")
cannot be used as the window for lag either. I am super confused about how to do this. Any help would be awesome!

I would advise you to change the time column to String, as
+-----+-----+-----+-----+----------------------------+
|id |mark1|mark2|mark3|time |
+-----+-----+-----+-----+----------------------------+
|uuid1|100 |200 |300 |Tue Aug 8 14:06:02 PDT 2017|
|uuid1|100 |200 |300 |Tue Aug 8 14:06:22 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:32 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:52 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:58 PDT 2017|
+-----+-----+-----+-----+----------------------------+
root
|-- id: string (nullable = true)
|-- mark1: integer (nullable = false)
|-- mark2: integer (nullable = false)
|-- mark3: integer (nullable = false)
|-- time: string (nullable = true)
After that, doing the following should work:
df.withColumn("prev_mark", lag("mark1", 1).over(Window.partitionBy("id").orderBy("time")))
Which will give you output as
+-----+-----+-----+-----+----------------------------+---------+
|id |mark1|mark2|mark3|time |prev_mark|
+-----+-----+-----+-----+----------------------------+---------+
|uuid1|100 |200 |300 |Tue Aug 8 14:06:02 PDT 2017|null |
|uuid1|100 |200 |300 |Tue Aug 8 14:06:22 PDT 2017|100 |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:32 PDT 2017|null |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:52 PDT 2017|150 |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:58 PDT 2017|150 |
+-----+-----+-----+-----+----------------------------+---------+
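For reference, here is a self-contained sketch of the batch version of this answer (it assumes a SparkSession named spark; as the error in the question says, non-time-based window functions are not supported on streaming DataFrames, so this applies to a static frame):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import spark.implicits._

val markDF = Seq(
  ("uuid1", 100, 200, 300, "Tue Aug 8 14:06:02 PDT 2017"),
  ("uuid1", 100, 200, 300, "Tue Aug 8 14:06:22 PDT 2017"),
  ("uuid2", 150, 250, 350, "Tue Aug 8 14:06:32 PDT 2017"),
  ("uuid2", 150, 250, 350, "Tue Aug 8 14:06:52 PDT 2017"),
  ("uuid2", 150, 250, 350, "Tue Aug 8 14:06:58 PDT 2017")
).toDF("id", "mark1", "mark2", "mark3", "time")

// Ordering by the raw time string only works here because every row shares the
// same date prefix; in general, parse it to a timestamp before ordering.
val byId = Window.partitionBy("id").orderBy("time")
markDF.withColumn("prev_mark", lag("mark1", 1).over(byId)).show(false)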

Related

Returns the value in the next empty rows

Input
Here is an example of my input.
Number  Date                          Motore
1       Fri Jan 01 00:00:00 CET 2021  Motore 1
2                                     Motore 2
3                                     Motore 3
4                                     Motore 4
5       Fri Feb 01 00:00:00 CET 2021  Motore 1
6                                     Motore 2
7                                     Motore 3
8                                     Motore 4
Expected Output
Number  Date                          Motore
1       Fri Jan 01 00:00:00 CET 2021  Motore 1
2       Fri Jan 01 00:00:00 CET 2021  Motore 2
3       Fri Jan 01 00:00:00 CET 2021  Motore 3
4       Fri Jan 01 00:00:00 CET 2021  Motore 4
5       Fri Feb 01 00:00:00 CET 2021  Motore 1
6       Fri Feb 01 00:00:00 CET 2021  Motore 2
7       Fri Feb 01 00:00:00 CET 2021  Motore 3
8       Fri Feb 01 00:00:00 CET 2021  Motore 4
I tried to use the tMemorizeRows component but without any result: the second row gets the date, but the others do not. Could you kindly help me?
You can solve this with a simple tMap using 2 inner variables (using the "var" array in the middle of the tMap).
Create two variables:
currentValue: holds the value of your input date column (in my example, "row1.data").
updateValue: checks whether currentValue is null or not. If it is null, do not modify updateValue; if it is not null, update it with the new value. This way updateValue always contains non-null data.
In the output, just use the updateValue variable.
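For clarity, here is the same carry-forward logic as a small Scala sketch; it is only an illustration of what the two tMap variables do, not Talend code, and the row values are copied from the example input above.
// Carry the last non-null date forward over the rows, in input order.
val rows = Seq(
  (1, Some("Fri Jan 01 00:00:00 CET 2021"), "Motore 1"),
  (2, None, "Motore 2"),
  (3, None, "Motore 3"),
  (4, None, "Motore 4"),
  (5, Some("Fri Feb 01 00:00:00 CET 2021"), "Motore 1"),
  (6, None, "Motore 2"),
  (7, None, "Motore 3"),
  (8, None, "Motore 4")
)

// updateValue plays the role of the second tMap variable: it is replaced only
// when the current row carries a non-null date, otherwise it keeps its value.
var updateValue: Option[String] = None
val filled = rows.map { case (number, date, motore) =>
  if (date.isDefined) updateValue = date
  (number, updateValue.getOrElse(""), motore)
}

filled.foreach(println)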

Search date field by year in JanusGraph

I have a 'Date' property on my 'Patent' node class that is formatted like this:
==>Sun Jan 28 00:08:00 UTC 2007
==>Tue Jan 27 00:10:00 UTC 1987
==>Wed Jan 10 00:04:00 UTC 2001
==>Sun Jan 17 00:08:00 UTC 2010
==>Tue Jan 05 00:10:00 UTC 2010
==>Thu Jan 28 00:09:00 UTC 2010
==>Wed Jan 04 00:09:00 UTC 2012
==>Wed Jan 09 00:12:00 UTC 2008
==>Wed Jan 24 00:04:00 UTC 2018
And is stored as class java.util.Date in the database.
Is there a way to search this field to return all the 'Patents' for a particular year?
I tried variations of g.V().has("Patent", "date", 2000).values(), but it returns neither results nor an error message.
Is there a way to search this property field by year or do I need to create a separate property that just contains year?
You do not need to create a separate property for the year. JanusGraph recognizes the Date data type and can filter by date values.
gremlin> dateOfBirth1 = new GregorianCalendar(2000, 5, 6).getTime()
==>Tue Jun 06 00:00:00 MDT 2000
gremlin> g.addV("person").property("name", "Person 1").property("dateOfBirth", dateOfBirth1)
==>v[4144]
gremlin> dateOfBirth2 = new GregorianCalendar(2001, 5, 6).getTime()
==>Wed Jun 06 00:00:00 MDT 2001
gremlin> g.addV("person").property("name", "Person 2").property("dateOfBirth", dateOfBirth2)
==>v[4328]
gremlin> dateOfBirthFrom = new GregorianCalendar(2000, 0, 1).getTime()
==>Sat Jan 01 00:00:00 MST 2000
gremlin> dateOfBirthTo = new GregorianCalendar(2001, 0, 1).getTime()
==>Mon Jan 01 00:00:00 MST 2001
gremlin> g.V().hasLabel("person").
......1> has("dateOfBirth", gte(dateOfBirthFrom)).
......2> has("dateOfBirth", lt(dateOfBirthTo)).
......3> values("name")
==>Person 1
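Applied to the question's schema, the same range filter would look roughly like the sketch below (Scala against the TinkerPop API; it assumes an already-connected GraphTraversalSource named g, the vertex label "Patent", and the property key "date" used in the query attempt above).
import java.util.{Calendar, GregorianCalendar}
import org.apache.tinkerpop.gremlin.process.traversal.P

// Year boundaries as java.util.Date, matching the stored property type.
val from = new GregorianCalendar(2000, Calendar.JANUARY, 1).getTime // 2000-01-01 inclusive
val to   = new GregorianCalendar(2001, Calendar.JANUARY, 1).getTime // 2001-01-01 exclusive

// All patents whose date falls within the year 2000.
val patents2000 = g.V()
  .hasLabel("Patent")
  .has("date", P.gte(from))
  .has("date", P.lt(to))
  .toList()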

Converting a CDT timestamp into UTC format in Spark Scala

My DataFrame, myDF, looks like below:
DATE_TIME
Wed Sep 6 15:24:27 CDT 2017
Wed Sep 6 15:30:05 CDT 2017
Expected output, in this format:
2017-09-06 15:24:27
2017-09-06 15:30:05
I need to convert the DATE_TIME timestamp to UTC.
I tried the code below in a Databricks notebook, but it's not working.
%scala
val df = Seq(("Wed Sep 6 15:24:27 CDT 2017")).toDF("times")
df.withColumn("times2",date_format(to_timestamp('times,"ddd MMM dd hh:mm:ss CDT yyyy"),"yyyy-MM-dd HH:mm:ss")).show(false)
times | times2
Wed Sep 6 15:24:27 CDT 2017 | null
I think we need to remove the day name ("Wed") from your string and then use the to_timestamp() function.
Example:
df.show(false)
/*
+---------------------------+
|times |
+---------------------------+
|Wed Sep 6 15:24:27 CDT 2017|
+---------------------------+
*/
df.withColumn("times2",expr("""to_timestamp(substring(times,5,length(times)),"MMM d HH:mm:ss z yyyy")""")).
show(false)
/*
+---------------------------+-------------------+
|times |times2 |
+---------------------------+-------------------+
|Wed Sep 6 15:24:27 CDT 2017|2017-09-06 15:24:27|
+---------------------------+-------------------+
*/
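If the result should actually be shifted to UTC (the output above keeps the Central-time wall clock), one option is to parse the local time without the zone token and shift it explicitly with to_utc_timestamp. This is a sketch, assuming the strings always carry a US Central abbreviation (CDT/CST) and that spark.implicits._ is in scope as in the notebook above.
import org.apache.spark.sql.functions._

val df = Seq("Wed Sep 6 15:24:27 CDT 2017").toDF("times")

// Drop the leading day name ("Wed ") and the zone token (" CDT "/" CST "),
// parse what is left as a local wall-clock time, then shift Central time to UTC.
val result = df
  .withColumn("stripped", regexp_replace(regexp_replace(col("times"), "^\\w{3} ", ""), " C[DS]T ", " "))
  .withColumn("local_ts", to_timestamp(col("stripped"), "MMM d HH:mm:ss yyyy"))
  .withColumn("utc_ts", to_utc_timestamp(col("local_ts"), "America/Chicago"))

result.select("times", "local_ts", "utc_ts").show(false)
// utc_ts for the sample row should display as 2017-09-06 20:24:27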

Pivot table with columns as year/date in KDB+

I am trying to create a pivot table with years as columns out of a simple table:
q)growth:([] stock:asc 9#`goog`apple`nokia; year: 9#2015 2016 2017; returns:9?20 )
q)growth
stock year returns
------------------
apple 2015 9
apple 2016 18
apple 2017 17
goog 2015 8
goog 2016 13
goog 2017 17
nokia 2015 12
nokia 2016 12
nokia 2017 2
but I am not able to get the correct structure; it is still returning a dictionary rather than multiple year columns.
q)exec (distinct growth`year)#year!returns by stock:stock from growth
stock|
-----| ----------------------
apple| 2015 2016 2017!9 18 17
goog | 2015 2016 2017!8 13 17
nokia| 2015 2016 2017!12 12 2
am I doing anything wrong?
You need to convert the years to symbols in order to use them as column headers. In this case I have updated the growth table first and then performed the pivot:
q)exec distinct[year]#year!returns by stock:stock from update `$string year from growth
stock| 2015 2016 2017
-----| --------------
apple| 12 8 10
goog | 1 9 11
nokia| 5 6 1
Additionally, you may see that I have changed (distinct growth`year) to distinct[year], as this yields the same result, with year now being pulled from the updated table.
The column names of a table in kdb+ must be symbols rather than any other data type.
In your pivot, the datatype of the 'year' column is int/long, which is why a proper table is not produced.
If you cast it to symbol, it will work.
q)growth:([] stock:asc 9#`goog`apple`nokia; year: 9#2015 2016 2017; returns:9?20 )
q)growth:update `$string year from growth
q)exec (distinct growth`year)#year!returns by stock:stock from growth
stock| 2015 2016 2017
-----| --------------
apple| 9 18 17
goog | 8 13 17
nokia| 12 12 2
Alternatively, you can pivot on 'stock' rather than 'year' and get a pivot table from the original table unchanged.
q)growth:([] stock:asc 9#`goog`apple`nokia; year: 9#2015 2016 2017; returns:9?20 )
q)show exec (distinct growth`stock)#stock!returns by year:year from growth
year| apple goog nokia
----| ----------------
2015| 4 2 4
2016| 5 13 12
2017| 12 6 1

How do you apply a function like MIN/MAX to an Iterable in an RDD?

I would like to find out an efficient way to apply a function to an RDD.
Here is what I am trying to do:
I have defined the following class:
case class Type(Key: String, category: String, event: String, date: java.util.Date, value: Double)
case class Type2(Key: String, Mdate: java.util.Date, label: Double)
Then I loaded the RDDs:
val TypeRDD: RDD[Type] = types.map(s=>s.split(",")).map(s=>Type(s(0), s(1), s(2),dateFormat.parse(s(3).asInstanceOf[String]), s(4).toDouble))
val Type2RDD: RDD[Type2] = types2.map(s=>s.split(",")).map(s=>Type2(s(0), dateFormat.parse(s(1).asInstanceOf[String]), s(2).toDouble))
Then I try to create two new RDDs: one where Type.Key = Type2.Key, and another where Type.Key does not exist in Type2.
val grpType = TypeRDD.groupBy(_.Key)
val grpType2 = Type2RDD.groupBy(_.Key)
// get the groups whose Key does not exist in Type2 and return the values from grpType
val tempresult = grpType fullOuterJoin grpType2
val result = tempresult.filter(_._2._2.isEmpty).map(_._2._1)
// get the groups where Type.Key == Type2.Key
val result2 = (grpType join grpType2).map(_._2)
UPDATED:
typeRDD =
(19,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(21,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(21,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(24,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(24,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(40,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
type2RDD =
(24,Wed Dec 22 00:00:00 EST 3080,1.0)
(40,Wed Jan 22 00:00:00 EST 3080,1.0)
For result 1, I would like to get the following:
(19,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(21,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(21,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
For result 2:
(24,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(24,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(40,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
I then want to count the number of occurrences of each event per key.
Result 1:
19 EVENT1 2
19 EVENT2 1
19 EVENT3 2
21 EVENT3 2
Result 2:
24 EVENT2 2
40 EVENT1 1
Then I want to get the MIN/MAX/AVG of the event counts:
Result 1 min event count = 1
Result 1 max event count = 5
Result 1 avg event count = 10/4 = 2.5
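A minimal sketch of one way to compute these counts and then the MIN/MAX/AVG over them, assuming the result and result2 RDDs built above (the helper name eventCounts is just for illustration, and this is one possible approach rather than necessarily the most efficient one):
import org.apache.spark.rdd.RDD

// result  : RDD[Option[Iterable[Type]]]            -- keys absent from Type2
// result2 : RDD[(Iterable[Type], Iterable[Type2])] -- keys present in both
val rows1: RDD[Type] = result.flatMap(_.getOrElse(Iterable.empty))
val rows2: RDD[Type] = result2.flatMap(_._1)

// Count how many times each (Key, event) pair occurs.
def eventCounts(rows: RDD[Type]): RDD[((String, String), Int)] =
  rows.map(t => ((t.Key, t.event), 1)).reduceByKey(_ + _)

val counts1 = eventCounts(rows1)   // ((19,EVENT1),2), ((19,EVENT2),1), ...
val counts2 = eventCounts(rows2)   // ((24,EVENT2),2), ((40,EVENT1),1)

// MIN / MAX / AVG over the per-key, per-event counts of result 1.
val justCounts = counts1.values
val minCount = justCounts.min()
val maxCount = justCounts.max()
val avgCount = justCounts.sum() / justCounts.count()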