My code is supposed to extract a Map from a dataframe. The map will be used later for some calculations (mapping Credit to best matching original Billing). However the first step is failing already - the TransactionId is always retrieved as 0.
Simplified version of the code:
case class SalesTransaction(
CustomerId : Int,
Score : Int,
Revenue : Double,
Type : String,
Credited : Double = 0.0,
LinkedTransactionId : Int = 0,
IsProcessed : Boolean = false
)
val df = Seq(
(1, 1, 123, "Sales", 100),
(1, 2, 122, "Credit", 100),
(1, 3, 99, "Sales", 70),
(1, 4, 101, "Sales", 77),
(1, 5, 102, "Credit", 75),
(1, 6, 98, "Sales", 71),
(2, 7, 200, "Sales", 55),
(2, 8, 220, "Sales", 55),
(2, 9, 200, "Credit", 50),
(2, 10, 205, "Sales", 50)
).toDF("CustomerId", "TransactionId", "TransactionAttributesScore", "TransactionType", "Revenue")
.withColumn("Revenue", $"Revenue".cast(DoubleType))
.repartition($"CustomerId")
//map generation:
val m2 : Map[Int, SalesTransaction] =
df.map(row => (
row.getAs("TransactionId")
, new SalesTransaction(row.getAs("CustomerId")
, row.getAs("TransactionAttributesScore")
, row.getAs("Revenue")
, row.getAs("TransactionType")
)
)
).collect.toMap
m2.foreach(m => println("key: " + m._1 +" Value: "+ m._2))
The output has only the very last record, because all values captured by row.getAs("TransactionId") is null (i.e. translates as 0 in the m2 Map) thus tuple created in each iteration is (null, [current row SalesTransaction]).
Could you please advice me what might be wrong with my code? I'm quite new to Scala and must be missing some syntactical nuance here.
You can also use row.getAs[Int]("TransactionId") as shown below :
val m2 : Map[Int, SalesTransaction] =
df.map(row => (
row.getAs[Int]("TransactionId"),
new SalesTransaction(row.getAs("CustomerId"),
row.getAs("TransactionAttributesScore"),
row.getAs("Revenue"),
row.getAs("TransactionType"))
)
).collect.toMap
It is always better to use the casted version of getAs to avoid errors like this.
The issue is related to data type obtained from row.getAs("TransactionId"). Despite underlying $"TransactionId" being integer. Converting the input explicitly resolved the issue:
//… code above unchanged
val m2 : Map[Int, SlTransaction] =
df.map(row => {
val mKey : Int = row.getAs("TransactionId") //forcing into Int variable
val mValue : SlTransaction = new SlTransaction(row.getAs("CustomerId")
, row.getAs("TransactionAttributesScore")
, row.getAs("Revenue")
, row.getAs("TransactionType")
)
(mKey, mValue)
}
).collect.toMap
Related
I need to calculate a date difference for a column, considering a specific ID shown in a different column and the first date for that specific ID, using Scala.
I have the following dataset:
The column ID shows the specific ID previously mentioned, the column date shows the date of the event and the column rank shows the chronological positioning of the different event dates for each specific ID.
I need to calculate for ID 1, the date difference for ranks 2 and 3 compared to rank 1 for that same ID, the same for ID 2 and so forth.
The expected result is the following:
Does somebody know how to do it?
Thanks!!!
Outside of using a library like Spark to reason about your data in SQL-esque terms, this can be accomplished using the Collections API by first finding the minimum date for each ID and then comparing the dates in the original collection:
# import java.time.temporal.ChronoUnit.DAYS
import java.time.temporal.ChronoUnit.DAYS
# import java.time.LocalDate
import java.time.LocalDate
# case class Input(id : Int, date : LocalDate, rank : Int)
defined class Input
# case class Output(id : Int, date : LocalDate, rank : Int, diff : Long)
defined class Output
# val inData = Seq(Input(1, LocalDate.of(2020, 12, 10), 1),
Input(1, LocalDate.of(2020, 12, 12), 2),
Input(1, LocalDate.of(2020, 12, 16), 3),
Input(2, LocalDate.of(2020, 12, 11), 1),
Input(2, LocalDate.of(2020, 12, 13), 2),
Input(2, LocalDate.of(2020, 12, 14), 3))
inData: Seq[Input] = List(
Input(1, 2020-12-10, 1),
Input(1, 2020-12-12, 2),
Input(1, 2020-12-16, 3),
Input(2, 2020-12-11, 1),
Input(2, 2020-12-13, 2),
Input(2, 2020-12-14, 3)
# val minDates = inData.groupMapReduce(_.id)(identity){(a, b) =>
a.date.isBefore(b.date) match {
case true => a
case false => b
}}
minDates: Map[Int, Input] = Map(1 -> Input(1, 2020-12-10, 1), 2 -> Input(2, 2020-12-11, 1))
# val outData = inData.map(a => Output(a.id, a.date, a.rank, DAYS.between(minDates(a.id).date, a.date)))
outData: Seq[Output] = List(
Output(1, 2020-12-10, 1, 0L),
Output(1, 2020-12-12, 2, 2L),
Output(1, 2020-12-16, 3, 6L),
Output(2, 2020-12-11, 1, 0L),
Output(2, 2020-12-13, 2, 2L),
Output(2, 2020-12-14, 3, 3L)
You can get the required output by performing the steps as done below :
//Creating the Sample data
import org.apache.spark.sql.types._
val sampledf = Seq((1,"2020-12-10",1),(1,"2020-12-12",2),(1,"2020-12-16",3),(2,"2020-12-08",1),(2,"2020-12-11",2),(2,"2020-12-13",3))
.toDF("ID","Date","Rank").withColumn("Date",$"Date".cast("Date"))
//adding column with just the value for the rank = 1 column
import org.apache.spark.sql.functions._
val df1 = sampledf.withColumn("Basedate",when($"Rank" === 1 ,$"Date"))
//Doing GroupBy based on ID and basedate column and filtering the records with null basedate
val groupedDF = df1.groupBy("ID","basedate").min("Rank").filter($"min(Rank)" === 1)
//joining the two dataframes and selecting the required columns.
val joinedDF = df1.join(groupedDF.as("t"), Seq("ID"),"left").select("ID","Date","Rank","t.basedate")
//Applying the inbuilt datediff function to get the required output.
val finalDF = joinedDF.withColumn("DateDifference", datediff($"Date",$"basedate"))
finalDF.show(false)
//If using databricks you can use display method.
display(finalDF)
Say, I have dataframe
val df1 = sc.parallelize(List(
("A1",45, "5", 1, 90),
("A2",60, "1", 1, 120),
("A3", 45, "9", 1, 450),
("A4", 26, "7", 1, 333)
)).toDF("CID","age", "children", "marketplace_id","value")
Now I want all the values of column "children" into an separate array in the same order.
the below code works for smaller dataset with only one partition
val list1 = df.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list1: Array[String] = Array(5, 1, 9, 7)
But the above code fails when we have partitions
val partitioned = df.repartition($"CID")
val list = partitioned.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list: Array[String] = Array(9, 1, 7, 5)
is there way, that I can get all the values of a column into an array without changing an order?
I am trying to run a test spark/scala code to find employees who is having salary more than the avarage salary with a test data using below spark dataframe . But this is failing while executing :
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot evaluate expression: avg(input[4, double, false])
What might be the correct syntax to achieve this ?
val dataDF20 = spark.createDataFrame(Seq(
(11, "emp1", 2, 45, 1000.0),
(12, "emp2", 1, 34, 2000.0),
(13, "emp3", 1, 33, 3245.0),
(14, "emp4", 1, 54, 4356.0),
(15, "emp5", 2, 76, 56789.0)
)).toDF("empid", "name", "deptid", "age", "sal")
val condition1 : Column = col("sal") > avg(col("sal"))
val d0 = dataDF20.filter(condition1)
println("------ d0.show()----", d0.show())
You can get this done in two steps:
val avgVal = dataDF20.select(avg($"sal")).take(1)(0)(0)
dataDF20.filter($"sal" > avgVal).show()
+-----+----+------+---+-------+
|empid|name|deptid|age| sal|
+-----+----+------+---+-------+
| 15|emp5| 2| 76|56789.0|
+-----+----+------+---+-------+
s_n181n is a dataframe and here I go through the 3rd and 5th column of the dataframe row wise
and
where the column nd is <=1.0 it breaks the code
ts(timestamp) | nd (nearest distance)
is the output columns, shown above
But what i need is the timestamp of last row value i.e 1529157727000
I want to break the loop showing last value of the loop
here. How to store that last row's timestamp value in a variable , so that outside this loop I can use it .
Here's my understanding of your requirement based on your question description and comment:
Loop through the collect-ed RDD row-wise, and whenever nd in the
current row is less than or equal to the ndLimit, extract ts from
the previous row and reset ndLimit to value of nd from that same
row.
If that's correct, I would suggest using foldLeft to assemble the list of timestamps, as shown below:
import org.apache.spark.sql.Row
val s_n181n = Seq(
(1, "a1", 101L, "b1", 1.0), // nd 1.0 is the initial limit
(2, "a2", 102L, "b2", 1.6),
(3, "a3", 103L, "b3", 1.2),
(4, "a4", 104L, "b4", 0.8), // 0.8 <= 1.0, hence ts 103 is saved and nd 1.2 is the new limit
(5, "a5", 105L, "b5", 1.5),
(6, "a6", 106L, "b6", 1.3),
(7, "a7", 107L, "b7", 1.1), // 1.1 <= 1.2, hence ts 106 is saved and nd 1.3 is the new limit
(8, "a8", 108L, "b8", 1.2) // 1.2 <= 1.3, hence ts 107 is saved and nd 1.1 is the new limit
).toDF("c1", "c2", "ts", "c4", "nd")
val s_rows = s_n181n.rdd.collect
val s_list = s_rows.map(r => (r.getAs[Long](2), r.getAs[Double](4))).toList
// List[(Long, Double)] = List(
// (101,1.0), (102,1.6), (103,1.2), (104,0.8), (105,1.5), (106,1.3), (107,1.1), (108,1.2)
// )
val ndLimit = s_list.head._2 // 1.0
s_list.tail.foldLeft( (s_list.head._1, s_list.head._2, ndLimit, List.empty[Long]) ){
(acc, x) =>
if (x._2 <= acc._3)
(x._1, x._2, acc._2, acc._1 :: acc._4)
else
(x._1, x._2, acc._3, acc._4)
}._4.reverse
// res1: List[Long] = List(103, 106, 107)
Note that a tuple of ( previous ts, previous nd, current ndLimit, list of timestamps ) is used as the accumulator to carry over items from the previous row for the necessary comparison logic in the current row.
Orgin data
ID, NAME, SEQ, NUMBER
A, John, 1, 3
A, Bob, 2, 5
A, Sam, 3, 1
B, Kim, 1, 4
B, John, 2, 3
B, Ria, 3, 5
To mak ID group list, I did below
val MapRDD = originDF.map { x => (x.getAs[String](colMap.ID), List(x)) }
val ListRDD = MapRDD.reduceByKey { (a: List[Row], b: List[Row]) => List(a, b).flatten }
My goal is making this RDD (purpose is to find SEQ-1's NAME and Number diff in each ID group)
ID, NAME, SEQ, NUMBER, PRE_NAME, DIFF
A, John, 1, 3, NULL, NULL
A, Bob, 2, 5, John, 2
A, Sam, 3, 1, Bob, -4
B, Kim, 1, 4, NULL, NULL
B, John, 2, 3, Kim, -1
B, Ria, 3, 5, John, 2
Currently ListRDD would be like
A, ([A,Jone,1,3], [A,Bob,2,5], ..)
B, ([B,Kim,1,4], [B,John,2,3], ..)
This is code I tried to make my goal RDD with ListRDD (not working as I want)
def myFunction(ListRDD: RDD[(String, List[Row])]) = {
var rows: List[Row] = Nil
ListRDD.foreach( row => {
rows ::: make(row._2)
})
//rows has nothing and It's not RDD
}
def make( eachList: List[Row]): List[Row] = {
caseList.foreach { x => //... Make PRE_NAME and DIFF in new List
}
My final goal is to save this RDD in csv (RDD.saveAsFile...). How to make this RDD(not list) with this data.
Window functions look like a good fit here:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
val df = sc.parallelize(Seq(
("A", "John", 1, 3),
("A", "Bob", 2, 5),
("A", "Sam", 3, 1),
("B", "Kim", 1, 4),
("B", "John", 2, 3),
("B", "Ria", 3, 5))).toDF("ID", "NAME", "SEQ", "NUMBER")
val w = Window.partitionBy($"ID").orderBy($"SEQ")
df.select($"*",
lag($"NAME", 1).over(w).alias("PREV_NAME"),
($"NUMBER" - lag($"NUMBER", 1).over(w)).alias("DIFF"))