Scala - Converting RDD to map

I am a beginner in scala.
I have a class User containing a userId as one of the attributes.
I would like to convert RDD of users to a map with the userId as key and user as value.
Thanks!

Let's suppose you have the RDD myUsers: RDD[User]. Each record of the RDD contains the attribute userId. You can transform it into a new RDD this way:
val newRdd = myUsers.map(x => (x.userId, x))
If you want to convert newRdd to a Map, collect it to the driver:
val myMap = newRdd.collectAsMap()
You can do these two computations in one line; I split them just for explanation.
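For completeness, a minimal end-to-end sketch. The User case class and the sample data are hypothetical, just to make it runnable; sc is assumed to be an existing SparkContext (as in spark-shell):
import org.apache.spark.rdd.RDD

case class User(userId: String, name: String)

// hypothetical sample data
val myUsers: RDD[User] = sc.parallelize(Seq(User("u1", "Alice"), User("u2", "Bob")))

// key each user by its userId, then collect the pairs into a driver-side Map
val myMap: scala.collection.Map[String, User] = myUsers
  .map(u => (u.userId, u))
  .collectAsMap()
Note that collectAsMap() brings all entries to the driver, so it is only suitable when the result fits in memory.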

Related

Scala - How to convert a pair RDD to an RDD?

I have an RDD[Sale] and wanted to keep only the latest sales. So what I did was create a pair RDD and then perform grouping and filtering:
val sales: RDD[(String, Sale)] = rawSales.map(sale => sale.id -> sale)
.groupByKey()
.mapValues(_.maxBy(_.timestamp))
But how do I get back to RDD[Sale] instead of the pair RDD in this case?
The only way I figured out is the following:
val value: RDD[Sale] = sales.map(salePaired => salePaired._2)
Is it the most proper solution?
You can access the keys or values of a pair RDD directly, like you access any Map:
val keys: RDD[String] = sales.keys
val values: RDD[Sale] = sales.values
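Putting the pieces together, a minimal sketch of the whole pipeline. Sale is assumed to be a case class with id and timestamp fields (as the question implies), and reduceByKey is used here as an alternative to groupByKey/mapValues because it avoids materializing all sales per key; the groupByKey version from the question works the same way once .values is appended:
case class Sale(id: String, timestamp: Long)

// keep only the latest sale per id, then drop the keys again
val latestSales: org.apache.spark.rdd.RDD[Sale] = rawSales
  .map(sale => sale.id -> sale)
  .reduceByKey((a, b) => if (a.timestamp >= b.timestamp) a else b)
  .values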

Get Json Key value from List[Row] with Scala

Let's say that I have a List[Row] such as {"name":"abc","salary":"somenumber","id":"1"},{"name":"xyz","salary":"some_number_2","id":"2"}
How do I get the JSON key-value pair with Scala? Let's assume that I want to get the value of the key "salary". Is the below one right?
val rows = List[Row] //Assuming that rows has the list of rows
for(row <- rows){
row.get(0).+("salary")
}
If you have a List[Row], I assume that you had a DataFrame and you did collectAsList. If you collect/collectAsList, that means that you:
- can no longer use Spark SQL operations
- cannot run your calculations in parallel on the nodes in your cluster; at this point everything is executed in your driver.
I would recommend keeping it as a DataFrame and then doing:
val salaries = df.select("salary")
Then you can do further calculations on the salaries, show them or collect or persist them somewhere.
If you choose to use a Dataset (which is like a typed DataFrame) then you could do:
val salaries = dataSet.map(_.salary)
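A sketch of that Dataset route, assuming a hypothetical Employee case class matching the JSON fields from the question and an active SparkSession named spark:
case class Employee(name: String, salary: String, id: String)

import spark.implicits._

// convert the untyped DataFrame into a typed Dataset, then map over it
val dataSet = df.as[Employee]
val salaries = dataSet.map(_.salary)
salaries.show()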
Using Spray Json:
import spray.json._
import DefaultJsonProtocol._
object sprayApp extends App {
  val list = List("""{"name":"abc","salary":"somenumber","id":"1"}""", """{"name":"xyz","salary":"some_number_2","id":"2"}""")
  val jsonAst = list.map(_.parseJson)
  for (l <- jsonAst) {
    println(l.asJsObject.getFields("salary")(0))
  }
}

Spark Scala: Generating list of DataFrame based on values in RDD

I have an RDD containing values; each of those values will be passed to a function generate_df(num: Int) to create a DataFrame. So essentially, in the end, we will have a list of DataFrames stored in a list buffer, like this: var df_list_example = new ListBuffer[org.apache.spark.sql.DataFrame]().
First I will show the code and result of doing it using a list instead of RDD:
var df_list = new ListBuffer[org.apache.spark.sql.DataFrame]()
for (i <- list_values) //list_values contains values
{
df_list += generate_df(i)
}
Result:
df_list:
scala.collection.mutable.ListBuffer[org.apache.spark.sql.DataFrame] =
ListBuffer([value: int], [value: int], [value: int])
However, when I am using an RDD, which is essential for my use case, I am having an issue:
var df_rdd_list = new ListBuffer[org.apache.spark.sql.DataFrame]()
//rdd_values contains values
rdd_values.map( i => df_rdd_list += generate_df(i) )
Result:
df_rdd_list:
scala.collection.mutable.ListBuffer[org.apache.spark.sql.DataFrame] =
ListBuffer()
Basically the list buffer remains empty and cannot store DataFrames, unlike when I use a list of values instead of an RDD of values. Mapping over the RDD is essential for my original use case.
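One likely reason the buffer stays empty: map is a lazy transformation whose closure runs on the executors, so appending to a driver-side ListBuffer never takes effect on the driver (and DataFrames generally cannot be created inside executor code anyway, since the SparkSession lives on the driver). A hedged sketch of the common workaround, assuming rdd_values and generate_df as defined in the question, is to bring the values back to the driver first:
// collect the (presumably small) set of values to the driver,
// then build the DataFrames there
val df_list_from_rdd: List[org.apache.spark.sql.DataFrame] =
  rdd_values.collect().toList.map(i => generate_df(i))
This only makes sense when the number of values is small enough to collect; the DataFrame generation itself still runs lazily on the cluster.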

Converting String RDD to Int RDD

I am new to Scala. I want to know whether, when processing large datasets with Scala in Spark, it is possible to read the data as an Int RDD instead of a String RDD.
I tried the below:
val intArr = sc
.textFile("Downloads/data/train.csv")
.map(line=>line.split(","))
.map(_.toInt)
But I am getting the error:
error: value toInt is not a member of Array[String]
I need to convert to an Int RDD because down the line I need to do the below:
val vectors = intArr.map(p => Vectors.dense(p))
which requires the type to be integer
Any kind of help is truly appreciated. Thanks in advance!
As far as I understood, one line should create one vector, so it should go like this:
val result = sc
.textFile("Downloads/data/train.csv")
.map(line => line.split(","))
.map(numbers => Vectors.dense(numbers.map(_.toInt)))
numbers.map(_.toInt) will map every element of the array to Int, so the result type will be Array[Int].
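One caveat worth checking: Vectors.dense (whether from org.apache.spark.mllib.linalg or org.apache.spark.ml.linalg) expects Double values, so passing an Array[Int] will generally not compile directly. If the goal is to feed the values into Vectors.dense, mapping to Double is the safer variant; a sketch under that assumption:
import org.apache.spark.mllib.linalg.Vectors

val vectors = sc
.textFile("Downloads/data/train.csv")
.map(line => line.split(","))
.map(numbers => Vectors.dense(numbers.map(_.toDouble)))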

Why don't keys() and values() work on a (String, String) pair RDD, while sortByKey() works?

I create an RDD using the README.md file in the Spark directory. The type of newRDD is (String, String):
val lines = sc.textFile("README.md")
val newRDD = lines.map(x => (x.split(" ")(0),x))
So, when I try to run newRDD.values() or newRDD.keys(), I get the error:
error: org.apache.spark.rdd.RDD[String] does not take parameters (for newRDD.values() and newRDD.keys() respectively)
What I can understand from the error is that maybe the String data type cannot be a key (and I think I am wrong). But if that's the case, why does
newRDD.sortByKey() work?
Note: I am trying the values() and keys() transformations because they're listed as valid transformations for pair RDDs.
Edit: I am using Apache Spark version 1.5.2 in Scala
It doesn't work because values (or keys) receives no parameters, and because of that it has to be called without parentheses:
val rdd = sc.parallelize(Seq(("foo", "bar")))
rdd.keys.first
// String = foo
rdd.values.first
// String = bar
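As for the second part of the question: sortByKey is declared with a parameter list (its arguments, such as ascending, all have default values), so calling it with empty parentheses compiles, while keys and values are parameterless members. The error from the question arises because newRDD.values already evaluates to an RDD[String], and the trailing () is then read as trying to apply that RDD to arguments. A small illustration on the same rdd as above:
rdd.sortByKey() // fine: sortByKey has a (defaulted) parameter list
rdd.keys        // fine: parameterless member
// rdd.keys()   // error: org.apache.spark.rdd.RDD[String] does not take parameters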