Spark-Scala RDD

I have an RDD (let's call it RDD1) with the following schema:
RDD[(String, Array[String])]
and I would like to create a new RDD, RDD2, with rows of type (String, String), one row per key/value combination from RDD1.
For example:
RDD1 =Array(("Fruit",("Orange","Apple","Peach")),("Shape",("Square","Rectangle")),("Mathematician",("Aryabhatt"))))
I want the output to be as:
RDD2 = Array(("Fruit","Orange"),("Fruit","Apple"),("Fruit","Peach"),("Shape","Square"),("Shape","Rectangle"),("Mathematician","Aryabhatt"))
Can someone help me with this piece of code?
My Try:
val R1 = RDD1.map(line => (line._1,line._2.split((","))))
val R2 = R1.map(line => line._2.foreach(ph => ph.map(line._1)))
This gives me an error:
error: value map is not a member of Char
I understand that this is because the inner map ends up being applied to the characters of each string rather than to the RDD. Please help me with a way to nest these functions for this purpose in Spark.

Break down the problem.
("Fruit",Array("Orange","Apple","Peach") -> Array(("Fruit", "Orange"), ("Fruit", "Apple"), ("Fruit", "Peach"))
def flattenLine(line: (String, Array[String])) = line._2.map(x => (line._1, x)
Apply that function to your rdd:
rdd1.flatMap(flattenLine)
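Putting it together, a minimal self-contained sketch (assuming the usual sc SparkContext from the shell; names are illustrative):

import org.apache.spark.rdd.RDD

// Build RDD1 from the example data
val rdd1: RDD[(String, Array[String])] = sc.parallelize(Seq(
  ("Fruit", Array("Orange", "Apple", "Peach")),
  ("Shape", Array("Square", "Rectangle")),
  ("Mathematician", Array("Aryabhatt"))
))

// flatMap emits one (key, value) pair per element of each row's array
val rdd2: RDD[(String, String)] = rdd1.flatMap { case (key, values) =>
  values.map(value => (key, value))
}

// rdd2.collect() gives:
// Array((Fruit,Orange), (Fruit,Apple), (Fruit,Peach),
//       (Shape,Square), (Shape,Rectangle), (Mathematician,Aryabhatt))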

Related

Converting a dataframe into a hashmap where Key is int and Value is a list in Scala

I have a dataframe which looks like this:
key | words
1   | ['a', 'test']
2   | ['hi', 'there']
And I would like to create the following hashmap:
Map(1 -> ['a', 'test'], 2 -> ['hi', 'there'])
But I cannot figure out how to do this, can anyone help me?
Thanks!
There must be dozens of ways of doing this. One would be:
import scala.collection.mutable

df.collect().map(row => row.getAs[Int](0) -> row.getAs[mutable.WrappedArray[String]](1)).toMap
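If you prefer plain Scala collections for the values, a variant of the same idea (a sketch, using Row.getSeq) would be:

val asMap: Map[Int, List[String]] =
  df.collect()
    .map(row => row.getInt(0) -> row.getSeq[String](1).toList)
    .toMap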
This is very similar to the solution in this question. The following should give you the output you want: it gathers all the maps into a collection and then uses a UDF to merge them into a single map. This comes with the usual caveats about the potentially poor performance of UDFs.
import org.apache.spark.sql.functions.{col, map, collect_list, lit}
import org.apache.spark.sql.functions.udf
import spark.implicits._ // assuming a SparkSession named spark; needed for .toDF outside the shell

val joinMap = udf { values: Seq[Map[Int, Seq[String]]] =>
  values.flatten.toMap
}

val df = Seq((1, Seq("a", "test")), (2, Seq("hi", "there"))).toDF("key", "words")

val rDf = df
  .select(lit(1) as "id", map(col("key"), col("words")) as "kwMap")
  .groupBy("id")
  .agg(collect_list(col("kwMap")) as "kwMaps")
  .select(joinMap(col("kwMaps")) as "map")
rDf.show
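If you then need the result as a regular Scala Map on the driver, something along these lines should work (a sketch; Row.getMap returns a scala.collection.Map):

val asMap: collection.Map[Int, Seq[String]] = rDf.head.getMap[Int, Seq[String]](0)
// asMap: Map(1 -> [a, test], 2 -> [hi, there])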

Spark 2.3: Reading dataframe inside rdd.map()

I want to iterate through each row of an RDD using .map() and I want to use a dataframe inside the map function as follows:
val rdd = ... // rdd holding seq of ids in each row
val df = ... // df with columns `id: String` and `value: Double`
rdd
  .map { case Row(listOfStrings: Seq[String]) =>
    listOfStrings.foldLeft(Seq[Double]())(op = (temp, curr) => {
      // calling df here
      val extractValue: Double = df.filter(s"id == $curr").first()(1)
      temp :+ extractValue
    })
  }
The above is pseudocode that I made up; it results in an exception because I cannot reference a dataframe inside .map().
The only way I can think of to get around this is to collect df before the .map() so that it is no longer a dataframe. Is there a way to do this without collecting? Note that joining the rdd and df is not suitable.
Basically you have an RDD of lists of IDs (RDD[Seq[String]]) and a dataframe of (id, value) tuples. You are trying to replace the IDs in the RDD with their corresponding values from the dataframe.
The way you are trying to do it is not possible in Spark. You cannot reference a dataframe or an RDD inside a map: they are objects that you manipulate in the driver in order to parallelize jobs, which are executed by the workers. The code inside map, however, runs on a worker, and a worker cannot delegate work to other workers; only the driver can. This is why (intuitively) what you are trying to do is not possible.
You say that a join is not suitable. I am not sure why, because that is exactly what I propose, in combination with a flatMap. I use the RDD API here; a rough dataframe-API equivalent is sketched after the code below.
// generating data
val data = Seq(Seq("a", "b", "c"), Seq("d", "e"), Seq("f"))
val rdd = sc.parallelize(data)
val df = Seq("a" -> 1d, "b" -> 2d, "c" -> 3d,
             "d" -> 4d, "e" -> 5d, "f" -> 6d)
  .toDF("id", "value")

// Transforming the dataframe into an RDD[(String, Double)]
val rdd_df = df.rdd
  .map(row => row.getAs[String]("id") -> row.getAs[Double]("value"))

val result = rdd
  // We start with zipWithUniqueId to remember which list each element came from
  .zipWithUniqueId
  // we flatten the lists, keeping the list id alongside each element
  .flatMap { case (ids, unique_id) => ids.map(id => id -> unique_id) }
  .join(rdd_df)
  .map { case (_, (unique_id, value)) => unique_id -> value }
  // we reform the lists by grouping by list id
  .groupByKey
  .map(_._2.toArray)
scala> result.collect
res: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0), Array(4.0, 5.0), Array(6.0))
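As mentioned above, similar logic can be expressed with the dataframe API. A rough sketch (assuming import spark.implicits._ is in scope; note that collect_list does not guarantee the order of elements within each rebuilt list):

import org.apache.spark.sql.functions.{col, collect_list, explode, monotonically_increasing_id}

val listsDf = rdd.toDF("ids")
  .withColumn("list_id", monotonically_increasing_id())

val resultDf = listsDf
  // one row per (list_id, id)
  .withColumn("id", explode(col("ids")))
  // look up the value for each id
  .join(df, "id")
  // rebuild one row per original list
  .groupBy("list_id")
  .agg(collect_list(col("value")) as "values")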

Scala - How to convert a pair RDD to an RDD?

I have an RDD[Sale] and want to keep only the latest sale for each id. So what I did was create a pair RDD and then perform grouping and filtering:
val sales: RDD[(String, Sale)] = rawSales.map(sale => sale.id -> sale)
.groupByKey()
.mapValues(_.maxBy(_.timestamp))
But how do I get back to RDD[Sale] from the pair RDD in this case?
The only way I have figured out is the following:
val value: RDD[Sale] = sales.map(salePaired => salePaired._2)
Is this the proper way to do it?
You can access the keys or the values of a pair RDD directly, much like you would with a Map:
val keys: RDD[String] = sales.keys
val values: RDD[Sale] = sales.values
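For completeness, a minimal sketch of the whole pipeline (the Sale case class and its fields are assumed here purely for illustration):

import org.apache.spark.rdd.RDD

// Hypothetical Sale type, for illustration only
case class Sale(id: String, timestamp: Long, amount: Double)

val rawSales: RDD[Sale] = sc.parallelize(Seq(
  Sale("a", 1L, 10.0), Sale("a", 2L, 12.0), Sale("b", 1L, 7.0)
))

// keep only the latest sale per id, then drop the keys
val latestSales: RDD[Sale] = rawSales
  .map(sale => sale.id -> sale)
  .groupByKey()
  .mapValues(_.maxBy(_.timestamp))
  .values

For large groups, reduceByKey((a, b) => if (a.timestamp >= b.timestamp) a else b) achieves the same result without materializing every sale per key before picking the latest one.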

Converting String RDD to Int RDD

I am new to Scala. When processing large datasets with Scala in Spark, is it possible to read the data as an Int RDD instead of a String RDD?
I tried the below:
val intArr = sc
.textFile("Downloads/data/train.csv")
.map(line=>line.split(","))
.map(_.toInt)
But I am getting the error:
error: value toInt is not a member of Array[String]
I need to convert to an Int RDD because further down the line I need to do the following:
val vectors = intArr.map(p => Vectors.dense(p))
which requires the values to be numeric.
Any help is truly appreciated, thanks in advance.
As far as I understand, one line should produce one vector, so it should go like this:
val result = sc
  .textFile("Downloads/data/train.csv")
  .map(line => line.split(","))
  .map(numbers => Vectors.dense(numbers.map(_.toDouble)))
numbers.map(_.toDouble) converts every element of the array to a Double, which is what Vectors.dense expects (use _.toInt instead wherever you genuinely need an Array[Int]).
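If you do want an Int RDD as an intermediate step, as the question asks, here is a sketch under the same assumptions (the import may be org.apache.spark.ml.linalg.Vectors instead, depending on which API you use):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

val intArr: RDD[Array[Int]] = sc
  .textFile("Downloads/data/train.csv")
  .map(_.split(",").map(_.toInt))

// Vectors.dense takes Doubles, so widen the Ints when building the vectors
val vectors = intArr.map(p => Vectors.dense(p.map(_.toDouble)))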

Spark Cassandra List Data type Mapping

I started testing Spark with Cassandra.
I get data from Cassandra which has two columns (primary key, set).
val sc = new SparkContext("spark://172.31.32.224:7077","test", conf)
val rdd = sc.cassandraTable("test", "table").select("pk", "lists")
.map(l => (l.get[String]("pk"), l.getList[String]("lists")))
But this code gives me (String, Seq[String]) pairs.
I'd like to break up the Seq[String] and pair each element with "pk", such as:
((pk1, list(1)), (pk1, list(2)), (pk1, list(3)))
Is there a way to do this?
Replace map with flatMap and create a collection of pairs:
.flatMap { l =>
  val pk = l.get[String]("pk")
  l.getList[String]("lists").map(item => (pk, List(item)))
}
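If you want plain (pk, element) pairs rather than single-element lists, the same pattern applies. A minimal sketch on an ordinary RDD, independent of the Cassandra connector (the data is illustrative):

val rows = sc.parallelize(Seq(("pk1", Seq("a", "b", "c")), ("pk2", Seq("d"))))

// one (pk, element) pair per element of each row's list
val pairs = rows.flatMap { case (pk, items) => items.map(item => (pk, item)) }

// pairs.collect() gives: Array((pk1,a), (pk1,b), (pk1,c), (pk2,d))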