Spark 2.3: Reading dataframe inside rdd.map() - scala

I want to iterate through each row of an RDD using .map(), and use a DataFrame inside the map function, as follows:
val rdd = ... // RDD holding a Seq of ids in each row
val df = ...  // DataFrame with columns `id: String` and `value: Double`
rdd
  .map { case Row(listOfStrings: Seq[String]) =>
    listOfStrings.foldLeft(Seq[Double]()) { (temp, curr) =>
      // calling df here
      val extractedValue: Double = df.filter(s"id == '$curr'").first().getDouble(1)
      temp :+ extractedValue
    }
  }
Above is pseudocode that I made up, and it results in an exception because I cannot call a DataFrame inside .map().
The only way I can think of overcoming this is to collect df before .map() so that it is no longer a DataFrame. Is there a way I can do this without collecting? Note that joining the rdd and df is not suitable.

Basically you have an RDD of lists of IDs, RDD[Seq[String]], and a DataFrame of tuples (id, value). You are trying to replace the IDs in the RDD with the corresponding values from the DataFrame.
The way you are trying to do it is impossible in Spark. You cannot reference a DataFrame or an RDD inside a map. They are objects that you manipulate in the driver to parallelize jobs that are executed by the workers. However, the code inside map is executed by a worker, and a worker cannot delegate work to other workers; only the driver can. This is why (intuitively) what you are trying to do is not possible.
You say that a join is not suitable. I am not sure why, but a join is exactly what I propose, in combination with a flatMap. I use the RDD API, but similar code could be written with the DataFrame API (see the sketch after the example).
// generating data
val data = Seq(Seq("a", "b", "c"), Seq("d", "e"), Seq("f"))
val rdd = sc.parallelize(data)
val df = Seq("a" -> 1d, "b" -> 2d, "c" -> 3d,
             "d" -> 4d, "e" -> 5d, "f" -> 6d)
  .toDF("id", "value")

// transforming the dataframe into a RDD[(String, Double)]
val rdd_df = df.rdd
  .map(row => row.getAs[String]("id") -> row.getAs[Double]("value"))

val result = rdd
  // we start with zipWithUniqueId to remember how the lists were arranged
  .zipWithUniqueId
  // we flatten the lists, remembering for each row the list id
  .flatMap { case (ids, unique_id) => ids.map(id => id -> unique_id) }
  .join(rdd_df)
  .map { case (_, (unique_id, value)) => unique_id -> value }
  // we reform the lists by grouping by list id
  .groupByKey
  .map(_._2.toArray)
scala> result.collect
res: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0), Array(4.0, 5.0), Array(6.0))
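For completeness, here is a rough sketch of the same approach with the DataFrame API. It assumes spark.implicits._ is imported so the RDD can be converted with toDF, and the column names ids, list_id and values are illustrative:
import org.apache.spark.sql.functions._

// one row per original list, keyed by a generated list id
val listsDf = rdd.zipWithUniqueId.toDF("ids", "list_id")

val resultDf = listsDf
  .select(col("list_id"), explode(col("ids")).as("id")) // one row per id in each list
  .join(df, Seq("id"))                                  // attach each id's value
  .groupBy("list_id")                                   // rebuild one row per original list
  .agg(collect_list("value").as("values"))
As with the RDD version, the order of the values inside each rebuilt list is not guaranteed after the grouping.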

Collecting unique elements during Spark aggregation

Problem
How do I update this line in my code
case StringType => concat_ws(",", collect_list(col(c)))
so that it only appends strings that are not already in the existing field? In this example, the letter "b" would not appear twice.
Code
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df = Seq(
  (1, 1.0, true, "a"),
  (2, 2.0, false, "b"),
  (3, 2.0, false, "b"),
  (3, 2.0, false, "c")
).toDF("id", "d", "b", "s")

val dataTypes: Map[String, DataType] =
  df.schema.map(sf => (sf.name, sf.dataType)).toMap

def genericAgg(c: String) = dataTypes(c) match {
  case DoubleType  => sum(col(c))
  case StringType  => concat_ws(",", collect_list(col(c)))
  case BooleanType => max(col(c))
}

val aggExprs: Seq[Column] = df.columns.filterNot(_ == "id").map(c => genericAgg(c))

df
  .groupBy("id")
  .agg(aggExprs.head, aggExprs.tail: _*)
  .show()
You probably want to use collect_set() instead of collect_list(). This will automatically remove duplicates during the collection.
I am not sure why you want to turn the array of unique strings into a comma-delimited list. Spark can easily handle array columns and they are displayed such that each element can be seen. Still, if you absolutely must have the array converted into a comma-delimited string, use array_join in Spark 2.4+ or a UDF in earlier versions of Spark.
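As a rough sketch, the StringType case could then look like this (assuming Spark 2.4+ for array_join; the rest of genericAgg stays as in the question):
def genericAgg(c: String) = dataTypes(c) match {
  case DoubleType  => sum(col(c))
  // collect_set drops duplicates; array_join turns the resulting array
  // into a comma-delimited string if that format is really required
  case StringType  => array_join(collect_set(col(c)), ",")
  case BooleanType => max(col(c))
}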

Scala - How to convert a pair RDD to an RDD?

I have an RDD[Sale] and want to keep only the latest sales. So what I did was create a pair RDD and then perform grouping and filtering:
val sales: RDD[(String, Sale)] = rawSales
  .map(sale => sale.id -> sale)
  .groupByKey()
  .mapValues(_.maxBy(_.timestamp))
But how do I get back to an RDD[Sale] instead of the pair RDD in this case?
The only way I figured out is the following:
val value: RDD[Sale] = sales.map(salePaired => salePaired._2)
Is this the proper way to do it?
You can access the keys or values of a pair RDD directly, just like you access any Map:
val keys: RDD[String] = sales.keys
val values: RDD[Sale] = sales.values
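Putting it together, the whole flow can end in .values directly. A minimal sketch, assuming a Sale case class with id and timestamp fields as implied by the question:
case class Sale(id: String, timestamp: Long)

val latestSales: RDD[Sale] = rawSales
  .map(sale => sale.id -> sale)
  .groupByKey()
  .mapValues(_.maxBy(_.timestamp))
  .values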

How can I deal with Tuples in Spark Streaming?

I have a problem in Spark with Scala: I want to multiply tuple elements in Spark Streaming. I get data from Kafka into a DStream, and my RDD data looks like this:
(2,[2,3,4,6,5])
(4,[2,3,4,6,5])
(7,[2,3,4,6,5])
(9,[2,3,4,6,5])
I want to multiply every element of the list by the key, like this:
(2,[2*2,3*2,4*2,6*2,5*2])
(4,[2*4,3*4,4*4,6*4,5*4])
(7,[2*7,3*7,4*7,6*7,5*7])
(9,[2*9,3*9,4*9,6*9,5*9])
Then I get an RDD like this:
(2,[4,6,8,12,10])
(4,[8,12,16,24,20])
(7,[14,21,28,42,35])
(9,[18,27,36,54,45])
Finally, I want to keep only the smallest value as the second element of each tuple, like this:
(2,4)
(4,8)
(7,14)
(9,18)
How can I do this in Scala on a DStream? I use Spark version 1.6.
Here is a demo in Scala:
// val conf = new SparkConf().setAppName("ttt").setMaster("local")
// val sc = new SparkContext(conf)
// val data = Array("2,2,3,4,6,5", "4,2,3,4,6,5", "7,2,3,4,6,5", "9,2,3,4,6,5")
// val lines = sc.parallelize(data)
// change to your data (each RDD in the stream)
lines
  .map { x =>
    val f = x.split(",")
    (f(0).toInt, List(f(1).toInt, f(2).toInt, f(3).toInt, f(4).toInt, f(5).toInt))
  }
  .map(x => (x._1, x._2.min))     // take the smallest element of the list
  .map(x => (x._1, x._2 * x._1))  // multiply it by the key
  .foreach(x => println(x))
Here is the result:
(2,4)
(4,8)
(7,14)
(9,18)
Each RDD in a DStream contains the data of a specific time interval, and you can manipulate each RDD as you want.
Let's say you are getting the tuple RDD in a variable input:
import scala.collection.mutable.ListBuffer

val result = input
  .map { x =>                        // for each (key, list) element
    val l = new ListBuffer[Int]()    // a new list for storing the multiplication results
    for (i <- x._2) {                // for each element in the list
      l += x._1 * i                  // append the element multiplied by the key
    }
    (x._1, l.toList)                 // return the new tuple
  }
  .map { x =>
    (x._1, x._2.min)                 // keep only the minimum of the multiplied list
  }
result.foreach(println) should result in:
(2,4)
(4,8)
(7,14)
(9,18)
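Since the question is about Spark Streaming, here is a minimal sketch of applying the same logic directly to the stream, assuming a DStream[(Int, List[Int])] named dstream has already been built from the Kafka source:
val resultStream = dstream
  .map { case (key, values) => (key, values.map(_ * key)) } // multiply every element by the key
  .map { case (key, products) => (key, products.min) }      // keep the smallest product
resultStream.print()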

Spark-Scala RDD

I have an RDD (let's call it RDD1) with the following schema:
RDD[(String, Array[String])]
and I would like to create a new RDD RDD2 with each row as (String, String), where the key and value come from RDD1.
For example:
RDD1 = Array(("Fruit", Array("Orange", "Apple", "Peach")), ("Shape", Array("Square", "Rectangle")), ("Mathematician", Array("Aryabhatt")))
I want the output to be:
RDD2 = Array(("Fruit","Orange"), ("Fruit","Apple"), ("Fruit","Peach"), ("Shape","Square"), ("Shape","Rectangle"), ("Mathematician","Aryabhatt"))
Can someone help me with this piece of code?
My Try:
val R1 = RDD1.map(line => (line._1,line._2.split((","))))
val R2 = R1.map(line => line._2.foreach(ph => ph.map(line._1)))
This gives me an error:
error: value map is not a member of Char
I understand that this is because the inner map is being applied to each string/char rather than to the RDD. Please help me with a way to use nested functions for this purpose in Spark.
Break down the problem.
("Fruit",Array("Orange","Apple","Peach") -> Array(("Fruit", "Orange"), ("Fruit", "Apple"), ("Fruit", "Peach"))
def flattenLine(line: (String, Array[String])) = line._2.map(x => (line._1, x)
Apply that function to your rdd:
rdd1.flatMap(flattenLine)
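Equivalently, the same thing can be written inline with a pattern match (just a stylistic variant of the helper above):
val rdd2 = rdd1.flatMap { case (key, values) => values.map(value => (key, value)) }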

Spark Cassandra List Data type Mapping

I started testing Spark with Cassandra.
I get data from Cassandra, from a table which has two columns (a primary key and a set).
val sc = new SparkContext("spark://172.31.32.224:7077", "test", conf)
val rdd = sc.cassandraTable("test", "table").select("pk", "lists")
  .map(l => (l.get[String]("pk"), l.getList[String]("lists")))
But this code maps to (String, Seq[String]).
I'd like to break up the Seq[String] and make pairs with "pk", such as
((pk1, list(1)), (pk1, list(2)), (pk1, list(3)))
Is there way to do this?
Replace map with flatMap and create a collection of pairs:
.flatMap { l =>
  val pk = l.get[String]("pk")
  l.getList[String]("lists").map(item => (pk, List(item)))
}
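For context, the change slots into the original snippet like this (a sketch reusing the question's table and column names):
val pairs = sc.cassandraTable("test", "table").select("pk", "lists")
  .flatMap { l =>
    val pk = l.get[String]("pk")
    l.getList[String]("lists").map(item => (pk, List(item)))  // or (pk, item) for plain pairs
  }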