Values of a Dataframe Column into an Array in Scala Spark

Say I have a dataframe:
val df1 = sc.parallelize(List(
  ("A1", 45, "5", 1, 90),
  ("A2", 60, "1", 1, 120),
  ("A3", 45, "9", 1, 450),
  ("A4", 26, "7", 1, 333)
)).toDF("CID", "age", "children", "marketplace_id", "value")
Now I want all the values of the column "children" in a separate array, in the same order.
The code below works for a smaller dataset with only one partition:
val list1 = df1.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list1: Array[String] = Array(5, 1, 9, 7)
But the above code fails to preserve the order once the dataframe has multiple partitions:
val partitioned = df1.repartition($"CID")
val list = partitioned.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list: Array[String] = Array(9, 1, 7, 5)
Is there a way to get all the values of a column into an array without changing the order?
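One way to keep the order (a sketch that is not from the original thread; it assumes the dataframe's current order, before repartitioning, is the order you want to keep, which holds for a locally parallelized list like df1): tag each row with monotonically_increasing_id, carry that index through the repartition, and sort by it before collecting.
import org.apache.spark.sql.functions.monotonically_increasing_id

// Tag each row while the dataframe is still in its original order.
val indexed = df1.withColumn("row_idx", monotonically_increasing_id())
val partitioned = indexed.repartition($"CID")

// Sort by the saved index before collecting, so the array comes back in the original order.
val list = partitioned
  .select("row_idx", "children")
  .orderBy("row_idx")
  .collect()
  .map(_.getString(1))
// expected: Array(5, 1, 9, 7)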

Related

Why is Key always 0 when creating map

My code is supposed to extract a Map from a dataframe. The map will be used later for some calculations (mapping Credit to best matching original Billing). However the first step is failing already - the TransactionId is always retrieved as 0.
Simplified version of the code:
case class SalesTransaction(
  CustomerId: Int,
  Score: Int,
  Revenue: Double,
  Type: String,
  Credited: Double = 0.0,
  LinkedTransactionId: Int = 0,
  IsProcessed: Boolean = false
)
import org.apache.spark.sql.types.DoubleType

val df = Seq(
  (1, 1, 123, "Sales", 100),
  (1, 2, 122, "Credit", 100),
  (1, 3, 99, "Sales", 70),
  (1, 4, 101, "Sales", 77),
  (1, 5, 102, "Credit", 75),
  (1, 6, 98, "Sales", 71),
  (2, 7, 200, "Sales", 55),
  (2, 8, 220, "Sales", 55),
  (2, 9, 200, "Credit", 50),
  (2, 10, 205, "Sales", 50)
).toDF("CustomerId", "TransactionId", "TransactionAttributesScore", "TransactionType", "Revenue")
  .withColumn("Revenue", $"Revenue".cast(DoubleType))
  .repartition($"CustomerId")
// map generation:
val m2 : Map[Int, SalesTransaction] =
  df.map(row => (
    row.getAs("TransactionId"),
    new SalesTransaction(row.getAs("CustomerId"),
      row.getAs("TransactionAttributesScore"),
      row.getAs("Revenue"),
      row.getAs("TransactionType"))
  )).collect.toMap

m2.foreach(m => println("key: " + m._1 + " Value: " + m._2))
The output has only the very last record, because every value captured by row.getAs("TransactionId") is null (which translates to 0 as the key in the m2 Map), so the tuple created in each iteration is (null, [current row SalesTransaction]).
Could you please advise me what might be wrong with my code? I'm quite new to Scala and must be missing some syntactical nuance here.
You can also use row.getAs[Int]("TransactionId") as shown below:
val m2 : Map[Int, SalesTransaction] =
  df.map(row => (
    row.getAs[Int]("TransactionId"),
    new SalesTransaction(row.getAs("CustomerId"),
      row.getAs("TransactionAttributesScore"),
      row.getAs("Revenue"),
      row.getAs("TransactionType"))
  )).collect.toMap
It is always better to use the typed version of getAs to avoid errors like this.
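As a minimal, self-contained illustration of the typed accessor (a sketch that is not part of the original answer; it assumes a spark-shell session where spark.implicits._ is in scope, as in the rest of these snippets):
import org.apache.spark.sql.Row

// Hypothetical one-row dataframe, just to show getAs with an explicit type parameter.
val demoRow: Row = Seq((10, 123.0)).toDF("TransactionId", "Revenue").head
val key: Int = demoRow.getAs[Int]("TransactionId")      // 10, typed at the call site
val revenue: Double = demoRow.getAs[Double]("Revenue")  // 123.0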
The issue is related to the data type obtained from row.getAs("TransactionId"), despite the underlying $"TransactionId" column being an integer. Converting the value explicitly resolved the issue:
//… code above unchanged
val m2 : Map[Int, SalesTransaction] =
  df.map(row => {
    val mKey : Int = row.getAs("TransactionId") // forcing the value into an Int variable
    val mValue : SalesTransaction = new SalesTransaction(row.getAs("CustomerId"),
      row.getAs("TransactionAttributesScore"),
      row.getAs("Revenue"),
      row.getAs("TransactionType"))
    (mKey, mValue)
  }).collect.toMap

How to fetch the last row's 1st column value of a Spark Scala dataframe inside the for and if loop

s_n181n is a dataframe, and I go through its 3rd and 5th columns row-wise: ts (timestamp) and nd (nearest distance). Where the column nd is <= 1.0, the code breaks out of the loop.
What I need is the timestamp value of the last row at that point, i.e. 1529157727000; I want to break the loop while keeping the last value seen in the loop. How do I store that last row's timestamp value in a variable, so that I can use it outside this loop?
Here's my understanding of your requirement based on your question description and comment:
Loop through the collect-ed RDD row-wise, and whenever nd in the current row is less than or equal to the ndLimit, extract ts from the previous row and reset ndLimit to the value of nd from that same row.
If that's correct, I would suggest using foldLeft to assemble the list of timestamps, as shown below:
import org.apache.spark.sql.Row
val s_n181n = Seq(
  (1, "a1", 101L, "b1", 1.0), // nd 1.0 is the initial limit
  (2, "a2", 102L, "b2", 1.6),
  (3, "a3", 103L, "b3", 1.2),
  (4, "a4", 104L, "b4", 0.8), // 0.8 <= 1.0, hence ts 103 is saved and nd 1.2 is the new limit
  (5, "a5", 105L, "b5", 1.5),
  (6, "a6", 106L, "b6", 1.3),
  (7, "a7", 107L, "b7", 1.1), // 1.1 <= 1.2, hence ts 106 is saved and nd 1.3 is the new limit
  (8, "a8", 108L, "b8", 1.2)  // 1.2 <= 1.3, hence ts 107 is saved and nd 1.1 is the new limit
).toDF("c1", "c2", "ts", "c4", "nd")
val s_rows = s_n181n.rdd.collect
val s_list = s_rows.map(r => (r.getAs[Long](2), r.getAs[Double](4))).toList
// List[(Long, Double)] = List(
// (101,1.0), (102,1.6), (103,1.2), (104,0.8), (105,1.5), (106,1.3), (107,1.1), (108,1.2)
// )
val ndLimit = s_list.head._2 // 1.0
s_list.tail.foldLeft( (s_list.head._1, s_list.head._2, ndLimit, List.empty[Long]) ){
  (acc, x) =>
    if (x._2 <= acc._3)
      (x._1, x._2, acc._2, acc._1 :: acc._4)
    else
      (x._1, x._2, acc._3, acc._4)
}._4.reverse
// res1: List[Long] = List(103, 106, 107)
Note that a tuple of ( previous ts, previous nd, current ndLimit, list of timestamps ) is used as the accumulator to carry over items from the previous row for the necessary comparison logic in the current row.
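If, as the original question asks, only the very last saved timestamp is needed in a single variable, you can take the last element of that list. A tiny sketch, with the foldLeft result from above repeated as a stand-in value:
val tsList: List[Long] = List(103L, 106L, 107L) // stand-in for the foldLeft result above
val lastTs: Option[Long] = tsList.lastOption    // Some(107): the value to keep in a variable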

Fetch columns based on list in Spark

I have a list List(0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13), and I have a dataframe which reads input from a text file with no headers. I want to fetch the columns mentioned in my list from that dataframe (inputFile). My input file has more than 20 columns, but I only want to fetch the columns mentioned in my list.
val inputFile = spark.read
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("delimiter", "|")
.load("C:\\demo.txt")
You can get the required columns using the following:
import org.apache.spark.sql.functions.col

val fetchIndex = List(0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13)
val fetchCols = inputFile.columns.zipWithIndex
  .filter { case (colName, idx) => fetchIndex.contains(idx) }
  .map(x => col(x._1))

inputFile.select(fetchCols : _*)
Basically what it does is this: zipWithIndex adds a contiguous index to each element of the collection. So, on a smaller example dataframe, you get something like this:
df.columns.zipWithIndex.filter { case (data, idx) => a.contains(idx) }.map(x => col(x._1))
res8: Array[org.apache.spark.sql.Column] = Array(companyid, event, date_time)
And then you can just use the splat operator to pass the generated array as varargs to the select function.
You can use the following steps to get the columns whose indexes you have defined in a list.
You can get the column names by doing the following:
val names = df.schema.fieldNames
And you have a list of column indexes as
val list = List(0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13)
Now you can pick out the column names at the indexes in the list:
val selectCols = list.map(x => names(x))
The last step is to select only those columns:
import org.apache.spark.sql.functions.col
val selectedDataFrame = df.select(selectCols.map(col): _*)
You should now have a dataframe containing only the columns whose indexes are mentioned in the list.
Note: indexes in the list must not exceed the largest column index present in the dataframe.
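As a small defensive variant (a sketch, not part of the original answer, reusing the names, list, and col import defined above), out-of-range indexes can simply be dropped instead of letting names(x) throw an out-of-bounds exception:
// Keep only indexes that actually exist in the dataframe's schema.
val safeSelectCols = list.filter(_ < names.length).map(names(_))
val safeDataFrame = df.select(safeSelectCols.map(col): _*)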

How to select values from one field of an RDD only if they are present in a second field of the RDD

I have an rdd with 3 fields as mentioned below.
1,2,6
2,4,6
1,4,9
3,4,7
2,3,8
Now, from the above rdd, I want to get following rdd.
2,4,6
3,4,7
2,3,8
The resultant RDD does not have the rows starting with 1, because 1 appears nowhere in the second field of the input RDD.
OK, if I understood correctly what you want to do, there are two ways:
Split your RDD into two, where the first RDD contains the unique values of the second field and the second RDD is keyed by the first field, then join the RDDs together. The drawback of this approach is that distinct and join are slow operations.
import org.apache.spark.rdd.RDD

val r: RDD[(String, String, Int)] = sc.parallelize(Seq(
  ("1", "2", 6),
  ("2", "4", 6),
  ("1", "4", 9),
  ("3", "4", 7),
  ("2", "3", 8)
))

val uniqueValues: RDD[(String, Unit)] = r.map(x => x._2 -> ()).distinct
val r1: RDD[(String, (String, String, Int))] = r.map(x => x._1 -> x)
val result: RDD[(String, String, Int)] = r1.join(uniqueValues).map { case (_, (x, _)) => x }

result.collect.foreach(println)
If your RDD is relatively small and the Set of second-field values can fit completely in memory on all the nodes, then you can create that in-memory set as a first step, broadcast it to all nodes, and then just filter your RDD:
val r: RDD[(String, String, Int)] = sc.parallelize(Seq(
  ("1", "2", 6),
  ("2", "4", 6),
  ("1", "4", 9),
  ("3", "4", 7),
  ("2", "3", 8)
))

val uniqueValues = sc.broadcast(r.map(x => x._2).distinct.collect.toSet)
val result: RDD[(String, String, Int)] = r.filter(x => uniqueValues.value.contains(x._1))

result.collect.foreach(println)
Both examples output:
(2,4,6)
(2,3,8)
(3,4,7)

Make RDD from List in Scala & Spark

Original data
ID, NAME, SEQ, NUMBER
A, John, 1, 3
A, Bob, 2, 5
A, Sam, 3, 1
B, Kim, 1, 4
B, John, 2, 3
B, Ria, 3, 5
To make an ID-grouped list, I did the below:
val MapRDD = originDF.map { x => (x.getAs[String](colMap.ID), List(x)) }
val ListRDD = MapRDD.reduceByKey { (a: List[Row], b: List[Row]) => List(a, b).flatten }
My goal is to make this RDD (the purpose is to find the previous row's (SEQ-1) NAME and the NUMBER diff within each ID group):
ID, NAME, SEQ, NUMBER, PRE_NAME, DIFF
A, John, 1, 3, NULL, NULL
A, Bob, 2, 5, John, 2
A, Sam, 3, 1, Bob, -4
B, Kim, 1, 4, NULL, NULL
B, John, 2, 3, Kim, -1
B, Ria, 3, 5, John, 2
Currently ListRDD would look like:
A, ([A,John,1,3], [A,Bob,2,5], ..)
B, ([B,Kim,1,4], [B,John,2,3], ..)
This is the code I tried in order to make my goal RDD from ListRDD (it does not work as I want):
def myFunction(ListRDD: RDD[(String, List[Row])]) = {
  var rows: List[Row] = Nil
  ListRDD.foreach( row => {
    rows ::: make(row._2)
  })
  // rows has nothing, and it's not an RDD
}

def make(eachList: List[Row]): List[Row] = {
  eachList.foreach { x => // ... make PRE_NAME and DIFF in a new List
  }
}
My final goal is to save this RDD as CSV (RDD.saveAsFile...). How can I make this RDD (not a List) from this data?
Window functions look like a good fit here:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
val df = sc.parallelize(Seq(
  ("A", "John", 1, 3),
  ("A", "Bob", 2, 5),
  ("A", "Sam", 3, 1),
  ("B", "Kim", 1, 4),
  ("B", "John", 2, 3),
  ("B", "Ria", 3, 5))).toDF("ID", "NAME", "SEQ", "NUMBER")

val w = Window.partitionBy($"ID").orderBy($"SEQ")

df.select($"*",
  lag($"NAME", 1).over(w).alias("PREV_NAME"),
  ($"NUMBER" - lag($"NUMBER", 1).over(w)).alias("DIFF"))