How to fetch the last row's first-column value of a Spark Scala DataFrame inside a for loop with an if condition

s_n181n is a DataFrame. I go through its 3rd and 5th columns row by row, and as soon as the column nd is <= 1.0 the code breaks out of the loop.
The output columns are:
ts (timestamp) | nd (nearest distance)
What I need is the timestamp of the last row processed, i.e. 1529157727000.
I want to break out of the loop and keep the last value seen by the loop.
How can I store that last row's timestamp in a variable so that I can use it outside the loop?

Here's my understanding of your requirement based on your question description and comment:
Loop through the collect-ed RDD row-wise, and whenever nd in the
current row is less than or equal to the ndLimit, extract ts from
the previous row and reset ndLimit to value of nd from that same
row.
If that's correct, I would suggest using foldLeft to assemble the list of timestamps, as shown below:
import org.apache.spark.sql.Row
val s_n181n = Seq(
(1, "a1", 101L, "b1", 1.0), // nd 1.0 is the initial limit
(2, "a2", 102L, "b2", 1.6),
(3, "a3", 103L, "b3", 1.2),
(4, "a4", 104L, "b4", 0.8), // 0.8 <= 1.0, hence ts 103 is saved and nd 1.2 is the new limit
(5, "a5", 105L, "b5", 1.5),
(6, "a6", 106L, "b6", 1.3),
(7, "a7", 107L, "b7", 1.1), // 1.1 <= 1.2, hence ts 106 is saved and nd 1.3 is the new limit
(8, "a8", 108L, "b8", 1.2) // 1.2 <= 1.3, hence ts 107 is saved and nd 1.1 is the new limit
).toDF("c1", "c2", "ts", "c4", "nd")
val s_rows = s_n181n.rdd.collect
val s_list = s_rows.map(r => (r.getAs[Long](2), r.getAs[Double](4))).toList
// List[(Long, Double)] = List(
// (101,1.0), (102,1.6), (103,1.2), (104,0.8), (105,1.5), (106,1.3), (107,1.1), (108,1.2)
// )
val ndLimit = s_list.head._2 // 1.0
s_list.tail.foldLeft( (s_list.head._1, s_list.head._2, ndLimit, List.empty[Long]) ){
(acc, x) =>
if (x._2 <= acc._3)
(x._1, x._2, acc._2, acc._1 :: acc._4)
else
(x._1, x._2, acc._3, acc._4)
}._4.reverse
// res1: List[Long] = List(103, 106, 107)
Note that a tuple of ( previous ts, previous nd, current ndLimit, list of timestamps ) is used as the accumulator to carry over items from the previous row for the necessary comparison logic in the current row.
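The same fold can also be written with a named accumulator, which may be easier to follow than positional tuple fields. This is only a sketch in plain Scala collections: the Acc case class and its field names are made up for illustration, and the (ts, nd) pairs are the sample values from s_list above.

```scala
// Hypothetical accumulator type; the field names are not from the original answer.
final case class Acc(prevTs: Long, prevNd: Double, ndLimit: Double, saved: List[Long])

// (ts, nd) pairs, same sample values as s_list above
val pairs: List[(Long, Double)] = List(
  (101L, 1.0), (102L, 1.6), (103L, 1.2), (104L, 0.8),
  (105L, 1.5), (106L, 1.3), (107L, 1.1), (108L, 1.2)
)

// The first row supplies the initial previous-row values and the initial limit
val start = Acc(pairs.head._1, pairs.head._2, pairs.head._2, Nil)

val savedTimestamps: List[Long] =
  pairs.tail.foldLeft(start) { case (acc, (ts, nd)) =>
    if (nd <= acc.ndLimit)
      // save the previous row's ts; the previous row's nd becomes the new limit
      Acc(ts, nd, acc.prevNd, acc.prevTs :: acc.saved)
    else
      // nothing saved; only remember the current row as the new "previous" row
      acc.copy(prevTs = ts, prevNd = nd)
  }.saved.reverse
// savedTimestamps == List(103, 106, 107)
```

If only the most recent saved timestamp is needed outside the fold, savedTimestamps.lastOption returns it as an Option[Long].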

Related

Calculate date difference for a specific column ID Scala

I need to calculate a date difference for a column, considering a specific ID shown in a different column and the first date for that specific ID, using Scala.
I have the following dataset:
The column ID shows the specific ID previously mentioned, the column date shows the date of the event and the column rank shows the chronological positioning of the different event dates for each specific ID.
I need to calculate for ID 1, the date difference for ranks 2 and 3 compared to rank 1 for that same ID, the same for ID 2 and so forth.
The expected result is the following:
Does somebody know how to do it?
Thanks!!!
Outside of using a library like Spark to reason about your data in SQL-esque terms, this can be accomplished using the Collections API by first finding the minimum date for each ID and then comparing the dates in the original collection:
# import java.time.temporal.ChronoUnit.DAYS
import java.time.temporal.ChronoUnit.DAYS
# import java.time.LocalDate
import java.time.LocalDate
# case class Input(id : Int, date : LocalDate, rank : Int)
defined class Input
# case class Output(id : Int, date : LocalDate, rank : Int, diff : Long)
defined class Output
# val inData = Seq(Input(1, LocalDate.of(2020, 12, 10), 1),
Input(1, LocalDate.of(2020, 12, 12), 2),
Input(1, LocalDate.of(2020, 12, 16), 3),
Input(2, LocalDate.of(2020, 12, 11), 1),
Input(2, LocalDate.of(2020, 12, 13), 2),
Input(2, LocalDate.of(2020, 12, 14), 3))
inData: Seq[Input] = List(
Input(1, 2020-12-10, 1),
Input(1, 2020-12-12, 2),
Input(1, 2020-12-16, 3),
Input(2, 2020-12-11, 1),
Input(2, 2020-12-13, 2),
Input(2, 2020-12-14, 3)
)
# val minDates = inData.groupMapReduce(_.id)(identity){(a, b) =>
a.date.isBefore(b.date) match {
case true => a
case false => b
}}
minDates: Map[Int, Input] = Map(1 -> Input(1, 2020-12-10, 1), 2 -> Input(2, 2020-12-11, 1))
# val outData = inData.map(a => Output(a.id, a.date, a.rank, DAYS.between(minDates(a.id).date, a.date)))
outData: Seq[Output] = List(
Output(1, 2020-12-10, 1, 0L),
Output(1, 2020-12-12, 2, 2L),
Output(1, 2020-12-16, 3, 6L),
Output(2, 2020-12-11, 1, 0L),
Output(2, 2020-12-13, 2, 2L),
Output(2, 2020-12-14, 3, 3L)
)
You can get the required output with the following steps:
//Creating the Sample data
import org.apache.spark.sql.types._
val sampledf = Seq((1,"2020-12-10",1),(1,"2020-12-12",2),(1,"2020-12-16",3),(2,"2020-12-08",1),(2,"2020-12-11",2),(2,"2020-12-13",3))
.toDF("ID","Date","Rank").withColumn("Date",$"Date".cast("Date"))
//adding column with just the value for the rank = 1 column
import org.apache.spark.sql.functions._
val df1 = sampledf.withColumn("Basedate",when($"Rank" === 1 ,$"Date"))
//Doing GroupBy based on ID and basedate column and filtering the records with null basedate
val groupedDF = df1.groupBy("ID","basedate").min("Rank").filter($"min(Rank)" === 1)
//joining the two dataframes and selecting the required columns.
val joinedDF = df1.join(groupedDF.as("t"), Seq("ID"),"left").select("ID","Date","Rank","t.basedate")
//Applying the inbuilt datediff function to get the required output.
val finalDF = joinedDF.withColumn("DateDifference", datediff($"Date",$"basedate"))
finalDF.show(false)
//If using Databricks, you can use the display method instead (it is not available in plain Spark):
//display(finalDF)

Values of a Dataframe Column into an Array in Scala Spark

Say I have this DataFrame:
val df1 = sc.parallelize(List(
("A1",45, "5", 1, 90),
("A2",60, "1", 1, 120),
("A3", 45, "9", 1, 450),
("A4", 26, "7", 1, 333)
)).toDF("CID","age", "children", "marketplace_id","value")
Now I want all the values of the column "children" in a separate array, in the same order.
The code below works for a smaller dataset with only one partition:
val list1 = df1.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list1: Array[String] = Array(5, 1, 9, 7)
But the above code does not preserve the order once the data is repartitioned:
val partitioned = df1.repartition($"CID")
val list = partitioned.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list: Array[String] = Array(9, 1, 7, 5)
Is there a way to get all the values of a column into an array without changing the order?
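One possible way to think about this, sketched here in plain Scala collections rather than Spark: attach each value's original position before anything reorders the data, then sort by that position when collecting. All names below are hypothetical, and the reordering is only simulated with a sort. Whether and how this carries over to a DataFrame (for example by adding an index column before repartitioning and ordering by it before collect) is an assumption to check against your Spark version.

```scala
// Values of the "children" column in their original order (from df1 above)
val children: List[String] = List("5", "1", "9", "7")

// Attach the original position to every value
val indexed: List[(String, Int)] = children.zipWithIndex

// Simulate a reordering such as the one a repartition can cause
val reordered: List[(String, Int)] = indexed.sortBy(_._1)

// Restore the original order by sorting on the saved position
val restored: List[String] = reordered.sortBy(_._2).map(_._1)
// restored == List("5", "1", "9", "7")
```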

Scala code to label rows of data frame based on another data frame

I just started learning Scala for data analytics, and I ran into a problem when trying to label my data rows based on another data frame.
Suppose I have a df1 with columns "date", "id", "value", and "label", where "label" is initially set to "F" for all rows. Then I have df2, a smaller set of data with columns "date", "id", "value". I want to change the row label in df1 from "F" to "T" if that row appears in df2, i.e. some row in df2 has the same combination of ("date", "id", "value") as that row in df1.
I tried df.filter and df.join, but it seems that neither solves my problem.
I think this is what you are looking for:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}
val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._
//create Dataframe 1
val df1 = spark.sparkContext.parallelize(Seq(
("2016-01-01", 1, "abcd", "F"),
("2016-01-01", 2, "efg", "F"),
("2016-01-01", 3, "hij", "F"),
("2016-01-01", 4, "klm", "F")
)).toDF("date","id","value", "label")
//Create Dataframe 2
val df2 = spark.sparkContext.parallelize(Seq(
("2016-01-01", 1, "abcd"),
("2016-01-01", 3, "hij")
)).toDF("date1","id1","value1")
val condition = $"date" === $"date1" && $"id" === $"id1" && $"value" === $"value1"
//Join the two DataFrames on the condition above
val result = df1.join(df2, condition, "left")
//Check whether all three fields match, then drop the extra columns
val finalResult = result.withColumn("label", condition)
.drop("date1","id1","value1")
//Update the label column from true/false to T or F
finalResult.withColumn("label", when(col("label") === true, "T").otherwise("F")).show
The basic idea is to join the two and then calculate the result. Something like this:
val df2Mod = df2.withColumn("tmp", lit(true))
val joined = df1.join(df2Mod, df1("date") <=> df2Mod("date") && df1("id") <=> df2Mod("id") && df1("value") <=> df2Mod("value"), "left_outer")
joined.withColumn("label", when(joined("tmp").isNull, "F").otherwise("T"))
The idea is that we add the "tmp" column and then do a left_outer join. "tmp" would be null for everything not in df2 and therefore we can use that to calculate the label.
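The labelling logic itself can be illustrated without Spark. The sketch below uses plain Scala collections, with hypothetical case classes Rec and Labeled standing in for the rows of df1 and df2. A Set of the df2 rows plays the role of the "tmp" marker: membership in the set corresponds to a non-null "tmp" after the left outer join.

```scala
// Hypothetical row types standing in for df1 and df2; not part of the original answer
final case class Rec(date: String, id: Int, value: String)
final case class Labeled(date: String, id: Int, value: String, label: String)

val rows1 = List(
  Rec("2016-01-01", 1, "abcd"),
  Rec("2016-01-01", 2, "efg"),
  Rec("2016-01-01", 3, "hij"),
  Rec("2016-01-01", 4, "klm")
)
val rows2 = List(
  Rec("2016-01-01", 1, "abcd"),
  Rec("2016-01-01", 3, "hij")
)

// Rows present in the smaller data set
val present: Set[Rec] = rows2.toSet

// "T" when the (date, id, value) combination also appears in rows2, otherwise "F"
val labeledRows: List[Labeled] = rows1.map { r =>
  Labeled(r.date, r.id, r.value, if (present.contains(r)) "T" else "F")
}
// labels in row order: T, F, T, F
```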

How to compare records in a dataframe in scala

For example I have a dataframe as below:
var tmp_df = sqlContext.createDataFrame(Seq(
("One", "Sagar", 1),
("Two", "Ramesh" , 2),
("Three", "Suresh", 3),
("One", "Sagar", 5)
)).toDF("ID", "Name", "Balance");
Now I want to write all records from the above DataFrame that share the same ID into one file, and likewise for each such ID. Please advise.
//find records having same id and rename the id column to idstowrite
val idsMoreThanOne = tmp_df.groupBy('id).count.filter('count.gt(1)).withColumnRenamed("id" , "idstowrite")
idsMoreThanOne.show
//join back with original dataframe
val joinedDf = idsMoreThanOne.join(tmp_df ,tmp_df("id") === idsMoreThanOne("idstowrite") , "left")
joinedDf.show
//select only the columns we want
val dfToWrite = joinedDf.select("id" , "Name" , "Balance")
dfToWrite.show
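As a rough illustration of what the groupBy, count and filter steps select, here is the same idea in plain Scala collections. The Account case class is a hypothetical stand-in for a row of tmp_df, and this sketch does not write any files; it only shows which records end up grouped together.

```scala
// Hypothetical stand-in for one row of tmp_df
final case class Account(id: String, name: String, balance: Int)

val accounts = List(
  Account("One", "Sagar", 1),
  Account("Two", "Ramesh", 2),
  Account("Three", "Suresh", 3),
  Account("One", "Sagar", 5)
)

// Group the records by ID
val byId: Map[String, List[Account]] = accounts.groupBy(_.id)

// Keep only the IDs that occur more than once
val duplicated: Map[String, List[Account]] =
  byId.filter { case (_, records) => records.size > 1 }
// duplicated contains only "One", with the records whose balances are 1 and 5
```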

read individual elements of a tuple from a map((tuple),(tuple)) in scala

The output of reduceByKey is a ShuffledRDD whose key and value are both tuples of multiple fields. I need to extract all the fields and write them to a Hive table.
Below is the code which I was trying:
val USAGE_DATA = sqlContext.sql(s"select SUBS_CIRCLE_ID,SUBS_MSISDN,EVENT_START_DT,RMNG_NW_OP_KEY, ACCESS_TYPE FROM FACT.FCT_MEDIATED_USAGE_DATA")
val USAGE_DATA_Reduce = USAGE_DATA.map{ USAGE_DATA => ((USAGE_DATA.getShort(0), USAGE_DATA.getString(1),USAGE_DATA.getString(2)),
(USAGE_DATA.getInt(3), USAGE_DATA.getInt(4)))}.reduceByKey((x, y) => (math.min(x._1, y._1), math.max(x._2,y._2)))
The final output I am expecting is all five fields:
SUBS_CIRCLE_ID,SUBS_MSISDN,EVENT_START_DT, MINVAL, MAXVAL
so that it can be inserted directly into the Hive table.
If you mean:
Given an RDD[(TupleN, TupleM)], how do I map each record's elements of both key and value tuples into a single concatenated string?
Here's a simplified version; you should be able to extrapolate this to solve your problem:
import org.apache.spark.rdd.RDD
val keyValueRdd = sc.parallelize(Seq(
(1, "key1") -> (10, "value1", "A"),
(2, "key2") -> (20, "value2", "B"),
(3, "key3") -> (30, "value3", "C")
))
val asStrings: RDD[String] = keyValueRdd.map {
case ((k1, k2), (v1, v2, v3)) => List(k1, k2, v1, v2, v3).mkString(",")
}
asStrings.foreach(println)
// prints (order may vary across partitions):
// 3,key3,30,value3,C
// 2,key2,20,value2,B
// 1,key1,10,value1,A
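If typed fields are preferable to a concatenated string, the same pattern match can produce a flat five-field tuple instead. The sketch below uses plain Scala collections (Scala 2.13 or later for groupMapReduce) as a stand-in for the RDD. The sample records are invented, and groupMapReduce here only imitates what reduceByKey does in the question.

```scala
// Invented records shaped like the question's data: ((circleId, msisdn, eventDate), (v1, v2))
val usage: List[((Short, String, String), (Int, Int))] = List(
  ((1.toShort, "111", "2020-01-01"), (5, 2)),
  ((1.toShort, "111", "2020-01-01"), (3, 9)),
  ((2.toShort, "222", "2020-01-02"), (7, 7))
)

// Plain-collections analogue of reduceByKey: min of the first value, max of the second
val reduced: Map[(Short, String, String), (Int, Int)] =
  usage.groupMapReduce(_._1)(_._2) { (x, y) =>
    (math.min(x._1, y._1), math.max(x._2, y._2))
  }

// Flatten the key and value tuples into one five-field tuple per record
val flat: List[(Short, String, String, Int, Int)] =
  reduced.toList
    .map { case ((circle, msisdn, date), (minVal, maxVal)) =>
      (circle, msisdn, date, minVal, maxVal)
    }
    .sortBy(_._1) // sorted only to make the result order predictable
// flat == List((1, "111", "2020-01-01", 3, 9), (2, "222", "2020-01-02", 7, 7))
```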