How to iterate through rows after group by in spark scala dataframe? - scala

I have a dataframe, df1, with the columns below. For example:
Project_end_date I_date Project_start_date id
Jan 30 2017 Jan 10 2017 Jan 1 2017 1
Jan 30 2017 Jan 15 2017 Jan 1 2017 1
Jan 30 2017 Jan 20 2017 Jan 1 2017 1
Here you would first find the differences between the I date and the start date, which would be 10, 15, and 20 days. Then you would express those as a percentage of the project's duration, so 100*10/30=33%, 100*15/30=50%, 100*20/30=67%. Then you would obtain the mean (50%), min (33%), max (67%), etc. of these.
How do I achieve this after doing a group by on id?
df.groupby("id"). ?

The easiest way would be to add the value you care about as a column just before the groupBy:
import org.apache.spark.sql.{functions => F}
import spark.implicits._
df.withColumn("ival", (
$"I_date" - $"Project_start_date") /
($"Project_end_date" - $"Project_start_date"))
.groupBy('id').agg(
F.min($"ival").as("min"),
F.max($"ival").as("max"),
F.avg($"ival").as("avg")
)
If you want to avoid the withColumn, you can just put the expression for ival inside F.min, F.max and F.avg, but that's more verbose.
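For example, one way to sketch that withColumn-free version (same assumption that the three columns are dates; here the expression is held in a val rather than repeated inside each aggregate):
import org.apache.spark.sql.{functions => F}
import spark.implicits._

// Define the ival expression once and reuse it inside each aggregate,
// instead of materialising it with withColumn.
val ival = F.datediff($"I_date", $"Project_start_date") /
  F.datediff($"Project_end_date", $"Project_start_date")

df.groupBy($"id").agg(
  F.min(ival).as("min"),
  F.max(ival).as("max"),
  F.avg(ival).as("avg")
)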

Returns the value in the next empty rows

Input
Here is an example of my input.
Number  Date                          Motore
1       Fri Jan 01 00:00:00 CET 2021  Motore 1
2                                     Motore 2
3                                     Motore 3
4                                     Motore 4
5       Fri Feb 01 00:00:00 CET 2021  Motore 1
6                                     Motore 2
7                                     Motore 3
8                                     Motore 4
Expected Output
Number  Date                          Motore
1       Fri Jan 01 00:00:00 CET 2021  Motore 1
2       Fri Jan 01 00:00:00 CET 2021  Motore 2
3       Fri Jan 01 00:00:00 CET 2021  Motore 3
4       Fri Jan 01 00:00:00 CET 2021  Motore 4
5       Fri Feb 01 00:00:00 CET 2021  Motore 1
6       Fri Feb 01 00:00:00 CET 2021  Motore 2
7       Fri Feb 01 00:00:00 CET 2021  Motore 3
8       Fri Feb 01 00:00:00 CET 2021  Motore 4
I tried to use the tMemorizeRows component but without any result: the second row gets the date filled in, but the others do not. Could you kindly help me?
You can solve this with a simple tMap and two inner variables (using the "Var" array in the middle of the tMap).
Create two variables:
currentValue: holds the value of your input date column (in my example "row1.data").
updateValue: checks whether currentValue is null or not. If it is null, do not modify updateValue; if it is not null, update updateValue with the new value. This way "updateValue" always contains the last non-null date.
In the output, just use the "updateValue" variable.
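To make the carry-forward logic concrete, here is a minimal sketch of the same fill-down idea written in Scala (the Row case class and field names are invented for illustration and are not part of Talend):
case class Row(number: Int, date: Option[String], motore: String)

// Carry the last non-null date forward onto rows whose date is empty,
// just like the updateValue variable in the tMap.
def fillDown(rows: Seq[Row]): Seq[Row] = {
  var lastDate: Option[String] = None        // plays the role of updateValue
  rows.map { r =>
    if (r.date.isDefined) lastDate = r.date  // a non-empty date refreshes the memory
    r.copy(date = lastDate)                  // the output always uses the remembered value
  }
}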

Search data field by year in Janusgraph

I have a 'Date' property on my 'Patent' node class that is formatted like this:
==>Sun Jan 28 00:08:00 UTC 2007
==>Tue Jan 27 00:10:00 UTC 1987
==>Wed Jan 10 00:04:00 UTC 2001
==>Sun Jan 17 00:08:00 UTC 2010
==>Tue Jan 05 00:10:00 UTC 2010
==>Thu Jan 28 00:09:00 UTC 2010
==>Wed Jan 04 00:09:00 UTC 2012
==>Wed Jan 09 00:12:00 UTC 2008
==>Wed Jan 24 00:04:00 UTC 2018
And is stored as class java.util.Date in the database.
Is there a way to search this field to return all the 'Patents' for a particular year?
I tried variations of g.V().has("Patent", "date", 2000).values(). However, it doesn't return any results or an error message.
Is there a way to search this property field by year or do I need to create a separate property that just contains year?
You do not need to create a separate property for the year. JanusGraph recognizes the Date data type and can filter by date values.
gremlin> dateOfBirth1 = new GregorianCalendar(2000, 5, 6).getTime()
==>Tue Jun 06 00:00:00 MDT 2000
gremlin> g.addV("person").property("name", "Person 1").property("dateOfBirth", dateOfBirth1)
==>v[4144]
gremlin> dateOfBirth2 = new GregorianCalendar(2001, 5, 6).getTime()
==>Wed Jun 06 00:00:00 MDT 2001
gremlin> g.addV("person").property("name", "Person 2").property("dateOfBirth", dateOfBirth2)
==>v[4328]
gremlin> dateOfBirthFrom = new GregorianCalendar(2000, 0, 1).getTime()
==>Sat Jan 01 00:00:00 MST 2000
gremlin> dateOfBirthTo = new GregorianCalendar(2001, 0, 1).getTime()
==>Mon Jan 01 00:00:00 MST 2001
gremlin> g.V().hasLabel("person").
......1> has("dateOfBirth", gte(dateOfBirthFrom)).
......2> has("dateOfBirth", lt(dateOfBirthTo)).
......3> values("name")
==>Person 1

Select Columns in Spark Dataframe based on Column name pattern

I have a spark dataframe with the following column structure:
UT_LVL_17_CD,UT_LVL_20_CD, 2017 1Q,2017 2Q,2017 3Q,2017 4Q, 2017 FY,2018 1Q, 2018 2Q,2018 3Q,2018 4Q,2018 FY
In the above column structure, I will get new columns for subsequent quarters, like 2019 1Q, 2019 2Q, etc.
I want to select UT_LVL_17_CD, UT_LVL_20_CD, and the columns which have the pattern year<space>quarter, like 2017 1Q.
Basically I want to avoid selecting columns like 2017 FY and 2018 FY, and this has to be dynamic as I will get new FY data each year.
I am using spark 2.4.4
Like I stated in my comment, this can be done with plain Scala using a Regex, since the DataFrame exposes its column names as an Array[String]:
scala> val columns = df.columns
// columns: Array[String] = Array(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2017 FY, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q, 2018 FY)
scala> val regex = """^((?!FY).)*$""".r
// regex: scala.util.matching.Regex = ^((?!FY).)*$
scala> val selection = columns.filter(s => regex.findFirstIn(s).isDefined)
// selection: Seq[String] = List(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q)
You can check that the selected columns does not contain the unwanted columns :
scala> columns.diff(selection)
// res2: Seq[String] = List(2017 FY, 2018 FY)
Now you can use the selection :
scala> df.select(selection.head, selection.tail : _*)
// res3: org.apache.spark.sql.DataFrame = [UT_LVL_17_CD: int, UT_LVL_20_CD: int ... 8 more fields]
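If you would rather match the wanted columns than exclude the FY ones, here is a sketch of a positive match on the year<space>quarter pattern (the two ID column names are taken from the question):
// Keep the two ID columns plus anything shaped like "2017 1Q".
val quarterPattern = """^\d{4} [1-4]Q$""".r
val keep = df.columns.filter { c =>
  c == "UT_LVL_17_CD" || c == "UT_LVL_20_CD" || quarterPattern.findFirstIn(c).isDefined
}
val quarterly = df.select(keep.head, keep.tail : _*)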
You could use the desc SQL command to get the list of column names:
import java.util
val fyStringList = new util.ArrayList[String]()
spark.sql("desc <table_name>").select("col_name")
  .filter(str => str.getString(0).toLowerCase.contains("fy"))
  .collect.foreach(str => fyStringList.add(str.getString(0)))
println(fyStringList)
Use the snippet above to get the list of column names which contain "fy".
You can update the filter logic with a regex, and also update the logic inside the foreach to change how the column names are stored.
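For instance, a sketch of that kind of tweak, collecting the names into a plain Scala list and filtering with a case-insensitive regex instead of contains (the <table_name> placeholder is kept as above):
// Same idea, but with a regex filter and a Scala List instead of a java.util.ArrayList.
val fyColumns = spark.sql("desc <table_name>")
  .select("col_name")
  .collect()
  .map(_.getString(0))
  .filter(name => "(?i)fy".r.findFirstIn(name).isDefined)
  .toList
println(fyColumns)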
You can try this snippet, assuming DF is your dataframe consisting of all those columns.
val DF1 = DF.select(DF.columns.filter(x => !x.contains("FY")).map(DF(_)) : _*)
This will remove the FY-related columns. Hope this works for you.

Pivot table with columns as year/date in KDB+

I am trying to create a pivot table, with the years as columns, out of a simple table:
q)growth:([] stock:asc 9#`goog`apple`nokia; year: 9#2015 2016 2017; returns:9?20 )
q)growth
stock year returns
------------------
apple 2015 9
apple 2016 18
apple 2017 17
goog 2015 8
goog 2016 13
goog 2017 17
nokia 2015 12
nokia 2016 12
nokia 2017 2
but I am not able to get the correct structure; it still returns a dictionary rather than multiple year columns.
q)exec (distinct growth`year)#year!returns by stock:stock from growth
stock|
-----| ----------------------
apple| 2015 2016 2017!9 18 17
goog | 2015 2016 2017!8 13 17
nokia| 2015 2016 2017!12 12 2
Am I doing anything wrong?
You need to convert the years to symbols in order to use them as column headers. In this case I have updated the growth table first then performed the pivot:
q)exec distinct[year]#year!returns by stock:stock from update `$string year from growth
stock| 2015 2016 2017
-----| --------------
apple| 12 8 10
goog | 1 9 11
nokia| 5 6 1
Additionally, you may see that I have changed (distinct growth`year) to distinct[year], as this yields the same result, with year being pulled from the updated table.
The column names of a table in KDB should be symbols rather than any other data type.
In your pivot, the datatype of the 'year' column is int/long, which is why a proper table is not produced.
If you cast it to symbol, it will work.
q)growth:([] stock:asc 9#`goog`apple`nokia; year: 9#2015 2016 2017; returns:9?20 )
q)growth:update `$string year from growth
q)exec (distinct growth`year)#year!returns by stock:stock from growth
stock| 2015 2016 2017
-----| --------------
apple| 9 18 17
goog | 8 13 17
nokia| 12 12 2
Alternatively, you can switch the pivot columns to 'stock' rather than 'year' and get a pivot table from the same original table.
q)growth:([] stock:asc 9#`goog`apple`nokia; year: 9#2015 2016 2017; returns:9?20 )
q)show exec (distinct growth`stock)#stock!returns by year:year from growth
year| apple goog nokia
----| ----------------
2015| 4 2 4
2016| 5 13 12
2017| 12 6 1

In an RDD, how do you apply a function like MIN/MAX to an Iterable class?

I would like to find out an efficient way to apply a function to an RDD.
Here is what I am trying to do:
I have defined the following class:
case class Type(Key: String, category: String, event: String, date: java.util.Date, value: Double)
case class Type2(Key: String, Mdate: java.util.Date, label: Double)
Then I loaded the RDDs:
val TypeRDD: RDD[Type] = types.map(s=>s.split(",")).map(s=>Type(s(0), s(1), s(2),dateFormat.parse(s(3).asInstanceOf[String]), s(4).toDouble))
val Type2RDD: RDD[Type2] = types2.map(s=>s.split(",")).map(s=>Type2(s(0), dateFormat.parse(s(1).asInstanceOf[String]), s(2).toDouble))
Then I try to create two new RDDs: one where Type.Key = Type2.Key and another where Type.Key is not in Type2.
val grpType = TypeRDD.groupBy(_.Key)
val grpType2 = Type2RDD.groupBy(_.Key)
//get the data whose Key does not exist in Type2 and return the values from grpType
val tempresult = grpType fullOuterJoin grpType2
val result = tempresult.filter(_._2._2.isEmpty).map(_._2._1)
//get the data where Type.Key == Type2.Key
val result2 = grpType.join(grpType2).map(_._2)
UPDATED:
typeRDD =
(19,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(21,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(21,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(24,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(24,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(40,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
type2RDD =
(24,Wed Dec 22 00:00:00 EST 3080,1.0)
(40,Wed Jan 22 00:00:00 EST 3080,1.0)
For result 1, I would like to get the following:
(19,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(19,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(21,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
(21,EVENT3,TEST3,Sun Aug 21 00:00:00 EDT 3396,1.0)
For result 2:
(24,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(24,EVENT2,TEST2,Sun Aug 21 00:00:00 EDT 3396,1.0)
(40,EVENT1,TEST1,Sun Aug 21 00:00:00 EDT 3396,1.0)
I then want to count the number of events per key.
Result 1:
19 EVENT1 2
19 EVENT2 1
19 EVENT3 2
21 EVENT3 2
Result 2:
24 EVENT2 2
40 EVENT1 1
Then I want to get the min/max/avg for the events:
Result 1 min event count = 1
Result 1 max event count = 5
Result 1 avg event count = 10/4 = 2.5
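A minimal sketch of one way to get those per-key event counts and the min/max/avg over them with the RDD API, assuming TypeRDD and Type2RDD are defined as above (the helper name eventStats is just for illustration):
import org.apache.spark.rdd.RDD

// Split TypeRDD by whether its Key also appears in Type2RDD.
val type2Keys = Type2RDD.map(t => (t.Key, ())).distinct()
val tagged    = TypeRDD.keyBy(_.Key).leftOuterJoin(type2Keys)

val res1 = tagged.filter(_._2._2.isEmpty).map(_._2._1)    // Key not present in Type2
val res2 = tagged.filter(_._2._2.isDefined).map(_._2._1)  // Key present in Type2

// Count events per (Key, event), then take min/max/avg of those counts.
def eventStats(rdd: RDD[Type]): Unit = {
  val counts = rdd.map(t => ((t.Key, t.event), 1)).reduceByKey(_ + _)
  counts.collect().foreach { case ((k, e), c) => println(s"$k $e $c") }
  val values = counts.values.cache()
  println(s"min=${values.min()} max=${values.max()} avg=${values.sum() / values.count()}")
}

eventStats(res1)
eventStats(res2)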