Scala - How to convert a pair RDD to an RDD? - scala

I have an RDD[Sale] and want to keep only the latest sale for each id. To do that, I created a pair RDD, grouped it by key, and kept the sale with the highest timestamp:
val sales: RDD[(String, Sale)] = rawSales.map(sale => sale.id -> sale)
.groupByKey()
.mapValues(_.maxBy(_.timestamp))
But how do I get back to an RDD[Sale] from the pair RDD?
The only way I have found is the following:
val value: RDD[Sale] = sales.map(salePaired => salePaired._2)
Is this the most appropriate solution?

You can get the keys or the values of a pair RDD directly, much like you would with a Map:
val keys: RDD[String] = sales.keys
val values: RDD[Sale] = sales.values
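As a minimal sketch of the whole round trip, the following assumes a hypothetical Sale(id, timestamp, amount) case class and a local SparkSession, neither of which appears in the question. It swaps groupByKey for reduceByKey, a common substitution that avoids collecting every sale for a key before picking the latest one, and then drops the keys with values.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical record type: the question does not show the fields of Sale.
case class Sale(id: String, timestamp: Long, amount: Double)

object LatestSales {
  // Keep the sale with the greatest timestamp for each id, then drop the keys.
  def latest(rawSales: RDD[Sale]): RDD[Sale] =
    rawSales
      .map(sale => (sale.id, sale))
      .reduceByKey((a, b) => if (a.timestamp >= b.timestamp) a else b)
      .values

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[2]").appName("latest-sales").getOrCreate()
    val rawSales = spark.sparkContext.parallelize(Seq(
      Sale("a", 1L, 10.0), Sale("a", 2L, 12.0), Sale("b", 5L, 7.0)))
    latest(rawSales).collect().foreach(println)
    spark.stop()
  }
}
```

Whether reduceByKey is worth the change depends on how many sales share an id; for a handful per key the original groupByKey version behaves the same.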

Related

Spark 2.3: Reading dataframe inside rdd.map()

I want to iterate through each row of an RDD using .map() and I want to use a dataframe inside the map function as follows:
val rdd = ... // rdd holding seq of ids in each row
val df = ... // df with columns `id: String` and `value: Double`
rdd
.map { case Row(listOfStrings: Seq[String]) =>
listOfStrings.foldLeft(Seq[Double]())(op = (temp, curr) => {
// calling df here
val extractValue: Double = df.filter(s"id == $curr").first()(1)
temp :+ extractValue
})
}
Above is pseudocode which I made up, and this results in an exception because I cannot call a dataframe inside .map().
The only way I can think of overcoming this is to collect df before .map() so that it is no longer a dataframe. Is there a method in which I can do this without collecting? Note that joining the rdd and df is not suitable.
Basically, you have an RDD of lists of IDs, RDD[Seq[String]], and a dataframe of (id, value) tuples. You are trying to replace the IDs in the RDD with the corresponding values from the dataframe.
The way you are trying to do it is impossible in Spark. You cannot reference a dataframe or an RDD inside a map. They are objects that you manipulate on the driver to define parallel jobs, which the workers then execute. The code inside map runs on a worker, and a worker cannot delegate work to other workers; only the driver can. That is, intuitively, why what you are trying to do is not possible.
You say that a join is not suitable. I am not sure why but this is exactly what I propose, in combination with a flatMap. I use the RDD API but we could write similar code using the dataframe API.
// generating data
val data = Seq(Seq("a", "b", "c"), Seq("d", "e"), Seq("f"))
val rdd = sc.parallelize(data)
val df = Seq("a" -> 1d, "b" -> 2d, "c" -> 3d,
"d" -> 4d, "e" -> 5d, "f" -> 6d)
.toDF("id", "value")
// Transforming the dataframe into an RDD[(String, Double)]
val rdd_df = df.rdd
.map(row => row.getAs[String]("id") -> row.getAs[Double]("value"))
val result = rdd
// We start with zipWithUniqueId to remember how the lists were arranged
.zipWithUniqueId
// we flatten the lists, remembering for each row the list id
.flatMap{ case (ids, unique_id) => ids.map(id => id -> unique_id) }
.join(rdd_df)
.map{ case(_, (unique_id, value)) => unique_id -> value }
// we reform the lists by grouping by list id
.groupByKey
.map(_._2.toArray)
scala> result.collect
res: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0), Array(4.0, 5.0), Array(6.0))
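Neither join nor groupByKey guarantees any ordering, so on a real cluster the values inside each array, and the arrays themselves, may come back in a different order than shown. A hedged variant, assuming the original order matters: carry each id's position through the join and sort when rebuilding the lists. The names OrderedLookup and replaceIds are illustrative, and ids with no match in the lookup are dropped by the inner join.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object OrderedLookup {
  // Replace each id by its value while keeping list order and order within each list.
  def replaceIds(lists: RDD[Seq[String]], lookup: RDD[(String, Double)]): RDD[Seq[Double]] =
    lists
      .zipWithIndex() // (list, index of the list in the RDD)
      .flatMap { case (ids, listIdx) =>
        // remember both the list and the position of the id inside it
        ids.zipWithIndex.map { case (id, pos) => (id, (listIdx, pos)) }
      }
      .join(lookup)
      .map { case (_, ((listIdx, pos), value)) => (listIdx, (pos, value)) }
      .groupByKey()
      .sortByKey() // restore the order of the lists
      .map { case (_, pairs) => pairs.toSeq.sortBy(_._1).map(_._2) } // restore order inside each list

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("ordered-lookup").getOrCreate()
    val sc = spark.sparkContext
    val lists = sc.parallelize(Seq(Seq("a", "b", "c"), Seq("d", "e"), Seq("f")))
    val lookup = sc.parallelize(Seq("a" -> 1d, "b" -> 2d, "c" -> 3d, "d" -> 4d, "e" -> 5d, "f" -> 6d))
    replaceIds(lists, lookup).collect().foreach(println)
    spark.stop()
  }
}
```

zipWithIndex is used here instead of zipWithUniqueId because its indices follow the RDD's element order, which is what makes the final sortByKey meaningful; the trade-off is that it may trigger an extra job to count partition sizes.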

Add new columns

Hi, I am trying to add a new column in Spark. I have a data set to which I want to add the percentage of all games that each publisher accounts for.
The data set looks like this:
Name, Platform, Year, Genre, Publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales
val vgdataLines = sc.textFile("hdfs:///user/ashhall1616/bdc_data/t1/vgsales-small.csv")
val vgdata = vgdataLines.map(_.split(";"))
def toPercentage(x: Double): Double = {x * 100}
val countPubl = vgdata.map(r => (r(4),1)).reduceByKey(_+_)
val addpercen = countPubl.withColumn("count", toPercentage($"count"/countPubl.count(_._2)))
I used withColumn() to add a new column 'count' and expected the output to look like:
(Ubisoft,3,15.0)
Can anyone tell me what's wrong here?
You cannot use withColumn with an RDD.
You could do the following instead:
val addpercen = countPubl.map({case(key, value) => (key, value, toPercentage(value))})
Use map to add the calculated value as a new column, and convert the result to a DataFrame if you want:
import spark.implicits._
val myDf = addpercen.toDF("key","value","myNewColumn")
myDf.show()
Hope it helps.
You cannot use withColumn with an RDD, so convert it to a DataFrame first, as below, and then use it:
val countPubl : DataFrame = vgdata.map(r => (r(4),1)).reduceByKey(_+_).toDF()
If you still want an RDD, just convert back to one after adding the column with withColumn:
val javaRdd : JavaRDD[Row] = countPubl.withColumn("...",col("...")).toJavaRDD
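The map-based snippet above multiplies each count by 100 without dividing by a total, so it would not produce the expected (Ubisoft,3,15.0). A sketch of the share-of-total calculation on the RDD, assuming the percentage means each publisher's count divided by the sum of all counts; the sample data and the names PublisherShare and withShare are made up for illustration:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object PublisherShare {
  // (publisher, count) => (publisher, count, percentage of all counted rows)
  def withShare(counts: RDD[(String, Int)]): RDD[(String, Int, Double)] = {
    // sum() is an action: it runs a job and returns a Double to the driver.
    val total = counts.values.sum()
    counts.map { case (publisher, n) => (publisher, n, n * 100.0 / total) }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("publisher-share").getOrCreate()
    // Illustrative counts only, not taken from the real data set.
    val counts = spark.sparkContext.parallelize(Seq(("Ubisoft", 3), ("Nintendo", 17)))
    withShare(counts).collect().foreach(println)
    spark.stop()
  }
}
```

The total is computed once on the driver and captured by the closure, which sidesteps the problem of referencing one RDD from inside another RDD's map.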

Scala - Converting RDD to map

I am a beginner in Scala.
I have a class User that has a userId as one of its attributes.
I would like to convert an RDD of users to a map with the userId as the key and the user as the value.
Thanks!
Let's suppose you have the RDD myUsers: RDD[User], where each record has a userId attribute. You can transform it into a pair RDD, newRdd, this way:
val newRdd = myUsers.map(x => (x.userId, x))
If you want to convert newRdd to a Map, note that an RDD has no toMap method; collect the pairs to the driver with collectAsMap instead:
val myMap = newRdd.collectAsMap()
You can do these two computations in one line; I split them just for the explanation.
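A short sketch of both steps together, assuming a hypothetical User(userId, name) case class since the question does not show the class. keyBy is shorthand for mapping each element to a (key, element) pair, and collectAsMap pulls every pair to the driver, so this is only reasonable when the data fits in driver memory. If two users share a userId, only one of them survives in the map.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical class: only userId is mentioned in the question.
case class User(userId: String, name: String)

object UsersById {
  // Build a driver-side map from userId to User.
  def toMapById(users: RDD[User]): scala.collection.Map[String, User] =
    users.keyBy(_.userId).collectAsMap()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("users-by-id").getOrCreate()
    val users = spark.sparkContext.parallelize(Seq(User("u1", "Ann"), User("u2", "Bob")))
    println(toMapById(users))
    spark.stop()
  }
}
```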

Spark-Scala RDD

I have an RDD with the following schema:
RDD[(String, Array[String])]
(let's call it RDD1)
and I would like to create a new RDD, RDD2, of type RDD[(String, String)], where each row pairs a key from RDD1 with one of its values.
For example:
RDD1 = Array(("Fruit",("Orange","Apple","Peach")),("Shape",("Square","Rectangle")),("Mathematician",("Aryabhatt")))
I want the output to be as:
RDD2 = Array(("Fruit","Orange"),("Fruit","Apple"),("Fruit","Peach"),("Shape","Square"),("Shape","Rectangle"),("Mathematician","Aryabhatt"))
Can someone help me with this piece of code?
My Try:
val R1 = RDD1.map(line => (line._1,line._2.split((","))))
val R2 = R1.map(line => line._2.foreach(ph => ph.map(line._1)))
This gives me an error:
error: value map is not a member of Char
I understand that it is because that map function is only applicable to the RDDs and not each string/char. Please help me with a way to use nested functions for this purpose in Spark.
Break down the problem. First, work out how to transform a single row:
("Fruit", Array("Orange","Apple","Peach")) -> Array(("Fruit", "Orange"), ("Fruit", "Apple"), ("Fruit", "Peach"))
def flattenLine(line: (String, Array[String])) = line._2.map(x => (line._1, x))
Then apply that function to your RDD with flatMap:
rdd1.flatMap(flattenLine)
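Since only the values are being expanded, flatMapValues expresses the same idea more directly. A sketch using a local SparkSession and the sample data from the question; the object name FlattenPairs is illustrative, and _.toSeq is there so the example does not rely on an implicit Array conversion.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object FlattenPairs {
  // Emit one (key, value) pair per element of each array, keeping the key.
  def flatten(rdd1: RDD[(String, Array[String])]): RDD[(String, String)] =
    rdd1.flatMapValues(_.toSeq)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("flatten-pairs").getOrCreate()
    val rdd1 = spark.sparkContext.parallelize(Seq(
      ("Fruit", Array("Orange", "Apple", "Peach")),
      ("Shape", Array("Square", "Rectangle")),
      ("Mathematician", Array("Aryabhatt"))))
    flatten(rdd1).collect().foreach(println)
    spark.stop()
  }
}
```

flatMapValues also preserves the RDD's partitioner when there is one, which a plain flatMap does not.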

Spark Cassandra List Data type Mapping

I started testing Spark with Cassandra.
I get data from Cassandra which has two columns (primary key, set).
val sc = new SparkContext("spark://172.31.32.224:7077","test", conf)
val rdd = sc.cassandraTable("test", "table").select("pk", "lists")
.map(l => (l.get[String]("pk"), l.getList[String]("lists")))
But this code maps each row to a (String, Seq[String]).
I'd like to break up the Seq[String] and pair each element with "pk", such as
((pk1, list(1)), (pk1, list(2)), (pk1, list(3)))
Is there a way to do this?
Replace map with flatMap and return a collection of (pk, item) pairs for each row:
.flatMap{l =>
val pk = l.get[String]("pk")
// one (pk, item) pair per element of the list column
l.getList[String]("lists").map(item => (pk, item))
}