Classify every instance of an RDD | Apache Spark Scala

I'm starting to work with RDDs and I have some doubts. In my case, I have an RDD and I want to classify its data. My RDD contains the following:
Array[(String, String)] = Array((data: BD=bd_users,BD_classified,contains_people, rbd: BD=bd_users,BD_classified,contains_people),
(data: BD=bd_users,BD_classified,contains_people,contains_users, user: id=8282bd, BD_USERS,bdd),
(data: BD=bd_experts,BD_exp,contains_exp,contains_adm, rbd: BD=bd_experts,BD_ea,contains_exp,contains_adm),
(data: BD=bd_test,BD_test,contains_acc,contains_tst, rbd: BD=bd_test,BD_test,contains_tst,contains_t))
As you can see, each element of the RDD contains two strings: the first one starts with data and the second one starts with rbd. What I want to do is classify every instance of this RDD as you can see here:
If the instance contains bd_users & BD_classified -> users
bd_experts & BD_exp -> experts
BD_test -> tests
The output would be something like this for this RDD:
1. Users
2. Users
3. Experts
4. Test
To do this I would like to use a map that calls a function for every instance in this RDD, but I don't know how to approach it:
val rdd_groups = rdd_1.map(x => x._1).map(x => getGroups(x))
def getGroups(input: String): String = {
  // here should I use, for example, a match/case to classify these strings?
}
If you need anything more, or more examples, just let me know. Thanks in advance!

Well, assuming you have an RDD of strings and a classifier already defined:
val rdd: RDD[String] =
???
def classify(input: String): String =
???
rdd.map(input => classify(input))
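For the classify part, a minimal sketch based on the rules in the question (the exact substrings to match are taken from the sample data and may need adjusting) could use a match with guards:
def classify(input: String): String = input match {
  case s if s.contains("bd_users") && s.contains("BD_classified") => "Users"
  case s if s.contains("bd_experts") && s.contains("BD_exp") => "Experts"
  case s if s.contains("BD_test") => "Tests"
  case _ => "Unknown" // fallback for anything that matches none of the rules
}
val rdd_groups = rdd.map(classify)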

Related

create scala dataframe out of custom object

I am a newbie in Scala. I will try to be as clear as possible. I have the following code:
case class Session (bf: Array[File])
case class File(s: s, a: Option[a], b: Option[b], c: Option[c])
case class s(s1:Int, s2:String)
case class a(a1:Int, a2:String)
case class b(b1:Int, b2:String)
case class c(c1:Int, c2:String)
val x = Session(...) // some values here; many Session objects grouped in a Dataset collection, i.e. Dataset[Session]
I want to know how to create a DataFrame from a Dataset[Session]; I do not know how to manipulate such a complex structure. Specifically, how do I create a DataFrame from a Dataset[Session] containing only the custom object "a"?
Thank you
A Spark Dataset works much like a regular Scala collection. It has a toDF() operation to create a DataFrame out of it. Now you just need to extract the right data out of it using different transformations:
flatMap it into a Dataset of File
filter every File for a non-empty a
map every remaining File to a
call toDF() to create a DataFrame
In code this would be:
val ds: Dataset[Session] = ...
ds.flatMap(_.bf)
.filter(_.a.isDefined)
.map(_.a.get)
.toDF()
On a plain Scala collection (or an RDD) you could also combine the filter and map into a collect with a partial function; Dataset itself has no such overload, but you can get the same effect by flatMapping over the Option, which would lead to the following code:
ds.flatMap(_.bf).flatMap(_.a).toDF()
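Note that for the flatMap/map and toDF() calls above to compile you need the session implicits in scope (assuming a SparkSession named spark, which is not shown in the question), so that Spark can derive encoders for File and a:
import spark.implicits._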

Convert Dataframe back to RDD of case class in Spark

I am trying to convert a DataFrame of multiple case classes to an RDD of these case classes. I can't find any solution. This WrappedArray has driven me crazy :P
For example, assume I have the following:
case class randomClass(a:String,b: Double)
case class randomClass2(a:String,b: Seq[randomClass])
case class randomClass3(a:String,b:String)
val anRDD = sc.parallelize(Seq(
(randomClass2("a",Seq(randomClass("a1",1.1),randomClass("a2",1.1))),randomClass3("aa","aaa")),
(randomClass2("b",Seq(randomClass("b1",1.2),randomClass("b2",1.2))),randomClass3("bb","bbb")),
(randomClass2("c",Seq(randomClass("c1",3.2),randomClass("c2",1.2))),randomClass3("cc","Ccc"))))
val aDF = anRDD.toDF()
Assuming I have aDF, how can I get anRDD back?
I tried something like this just to get the second column but it was giving an error:
aDF.map { case r:Row => r.getAs[randomClass3]("_2")}
You can convert indirectly using Dataset[randomClass3]:
aDF.select($"_2.*").as[randomClass3].rdd
Spark DataFrame / Dataset[Row] represents data as Row objects, using the mapping described in the Spark SQL, DataFrames and Datasets Guide. Any call to getAs should use this mapping.
For the second column, which is struct<a: string, b: string>, it would be a Row as well:
aDF.rdd.map { _.getAs[Row]("_2") }
As commented by Tzach Zohar, to get back the full RDD you'll need:
aDF.as[(randomClass2, randomClass3)].rdd
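Putting the pieces together, a sketch assuming a SparkSession named spark (not shown in the question) whose implicits are imported so the encoders for the case classes can be derived:
import org.apache.spark.rdd.RDD
import spark.implicits._

// full round trip: DataFrame -> typed Dataset -> RDD of the original case classes
val fullRdd: RDD[(randomClass2, randomClass3)] =
  aDF.as[(randomClass2, randomClass3)].rdd

// just the second column, as randomClass3 values
val secondCol: RDD[randomClass3] =
  aDF.select($"_2.*").as[randomClass3].rdd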
I don't know the Scala API well, but have you considered the rdd value?
Maybe something like:
aDF.rdd.map { case r: Row => r.getAs[randomClass3]("_2") }

What is the best practice for reading/parsing Spark generated files which saved using SaveAsTextFile?

As you know, if you use saveAsTextFile on an RDD[(String, Int)], the output looks like this:
(T0000036162,1747)
(T0000066859,1704)
(T0000043861,1650)
(T0000075501,1641)
(T0000071951,1638)
(T0000075623,1638)
(T0000070102,1635)
(T0000043868,1627)
(T0000094043,1626)
You may want to use this file in Spark again, so what is the best practice for reading and parsing it? Should it be something like the following, or is there a more elegant way?
val lines = sc.textFile("result/hebe")
case class Foo(id: String, count: Long)
val parsed = lines
.map(l => l.stripPrefix("(").stripSuffix(")").split(","))
.map(l => new Foo(id=l(0),count = l(1).toLong))
It depends what you're looking for.
If you want something prettier, I'd consider adding an alternative "constructor" on Foo's companion object which takes the single raw line, so you could have something like
lines.map(line => Foo(line))
And Foo would look like
case class Foo(id: String, count: Long)

object Foo {
  // parses one saveAsTextFile line of the form "(id,count)"
  def apply(l: String): Foo = {
    val split = l.stripPrefix("(").stripSuffix(")").split(",")
    Foo(split(0), split(1).toLong)
  }
}
If you have no requirement to output the data like that then I'd consider saving it as a sequence file.
If performance isn't an issue then it's fine. I'd just say the most important thing is to isolate the text parsing, so that later you can unit test it and come back to it and easily edit it.
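For the sequence-file route, a minimal sketch might look like this (the path is just a placeholder, and yourRDD is assumed to be the original RDD[(String, Long)] pair RDD):
// write: String/Long keys and values are converted to Hadoop Writables automatically
yourRDD.saveAsSequenceFile("result_seq")
// read back, with the element types spelled out
val restored = sc.sequenceFile[String, Long]("result_seq")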
You should either save it as a DataFrame, which will use the case class as a schema (allowing you to easily parse it back into Spark), or you should map out the individual components of your RDD before saving (so that you remove the brackets, since they only make the file larger):
yourRDD.toDF("id","count").saveAsParquetFile(path)
When you load the DataFrame back in, you can map it back into an RDD if you want:
val rddInput = input.rdd.map(x => (x.getAs[String]("id"), x.getAs[Int]("count")))
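Here input is assumed to be the DataFrame you get when reading the Parquet files back in, for example:
val input = sqlContext.read.parquet(path)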
If you prefer to store as a text file, you could consider mapping the elements without the brackets:
yourRDD.map(x => s"${x._1}, ${x._2}")
The best approach is to write a DataFrame instead of the RDD directly to a file.
Code that writes the files:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = rdd.toDF()
df.write.parquet("dir")
Code that reads the files:
val rdd = sqlContext.read.parquet("dir").rdd.map(row => (row.getString(0), row.getLong(1)))
Before calling saveAsTextFile, map each tuple to a comma-separated string (tuples have no mkString, so go through productIterator):
rdd.map(x => x.productIterator.mkString(",")).saveAsTextFile(path)
The output will not have brackets; it will be:
T0000036162,1747
T0000066859,1704
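Reading that bracket-free output back in is then a plain split; as a sketch, reusing the Foo case class from the question (path being wherever the output was written):
val parsed = sc.textFile(path)
  .map(_.split(","))
  .map(a => Foo(a(0), a(1).toLong))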

Spark: Cannot add RDD elements into a mutable HashMap inside a closure

I have the following code, where rddMap is of type org.apache.spark.rdd.RDD[(String, (String, String))] and myHashMap is a scala.collection.mutable.HashMap.
I did .saveAsTextFile("temp_out") to force the evaluation of rddMap.map.
However, even though println(" t " + t) is printing things, myHashMap afterwards still has only the one element I manually put in at the beginning, ("test1", ("10", "20")).
Nothing from rddMap gets put into myHashMap.
Snippet code:
val myHashMap = new HashMap[String, (String, String)]
myHashMap.put("test1", ("10", "20"))
rddMap.map { t =>
println(" t " + t)
myHashMap.put(t._1, t._2)
}.saveAsTextFile("temp_out")
println(rddMap.count)
println(myHashMap.toString)
Why can't I put the elements from rddMap into myHashMap?
Here is a working example of what you want to accomplish.
val rddMap = sc.parallelize(Map("A" -> ("v", "v"), "B" -> ("d","d")).toSeq)
// Collects all the data in the RDD and converts the data to a Map
val myMap = rddMap.collect().toMap
myMap.foreach(println)
Output:
(A,(v,v))
(B,(d,d))
Here is similar code to what you've posted
rddMap.map { t=>
println("t" + t)
newHashMap.put(t._1, t._2)
println(newHashMap.toString)
}.collect
Here is the output to the above code from the Spark shell
t(A,(v,v))
Map(A -> (v,v), test1 -> (10,20))
t(B,(d,d))
Map(test1 -> (10,20), B -> (d,d))
To me it looks like Spark copies your HashMap and adds the elements to the copied map.
What you are trying to do is not really supported in Spark today.
Note that every user-defined function (e.g., what you add inside a map()) is a closure that gets serialized and shipped to each executor.
Therefore everything you have inside this map() gets serialized and gets transferred around:
.map{ t =>
println(" t " + t)
myHashMap.put(t._1, t._2)
}
Essentially, your myHashMap will be copied to each executor and each executor will be updating its own version of that HashMap. This is why at the end of the execution the myHashMap you have in your driver will never get changed. (The driver is the JVM that manages/orchestrates your Spark jobs. It's the place where you define your SparkContext.)
In order to push structures defined in the driver to all executors you need to broadcast them (see link here). Note that broadcast variables are read-only, so again, using broadcasts will not help you here.
Another way is to use Accumulators, but I feel that these are more tuned towards summarizing numeric values, like doing sum, max, min, etc. Maybe you can take a look at creating a custom accumulator that extends AccumulatorParam. See link here.
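As a rough illustration of that last idea, a custom accumulator for this HashMap could look like the sketch below. It uses the old pre-2.0 AccumulatorParam API mentioned above and is untested, so treat it as a starting point only:
import org.apache.spark.AccumulatorParam
import scala.collection.mutable.HashMap

// merges partial HashMaps produced on the executors into one map on the driver
object HashMapParam extends AccumulatorParam[HashMap[String, (String, String)]] {
  def zero(initial: HashMap[String, (String, String)]) = HashMap.empty[String, (String, String)]
  def addInPlace(m1: HashMap[String, (String, String)], m2: HashMap[String, (String, String)]) = m1 ++= m2
}

val acc = sc.accumulator(HashMap.empty[String, (String, String)])(HashMapParam)
rddMap.foreach { t => acc += HashMap(t._1 -> t._2) } // foreach is an action, so this actually runs
println(acc.value)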
Coming back to the original question, if you want to collect values to your driver, currently the best way to do this is to transform your RDDs until they become a small and manageable collection of elements and then you collect() this final/small RDD.
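For this particular case, a minimal sketch of that approach (keeping the entry that was added manually) could be:
// collectAsMap() pulls the whole pair RDD back to the driver,
// so only do this when the RDD is small enough to fit in driver memory
val driverMap = myHashMap ++ rddMap.collectAsMap()
println(driverMap)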

flatMapping in scala/spark

Looking for some assistance with how to do something in Scala using Spark.
I have:
type DistanceMap = HashMap[(VertexId,String), Int]
this forms part of my data in the form of an RDD of:
org.apache.spark.rdd.RDD[(DistanceMap, String)]
in short my dataset looks like this:
({(101,S)=3},piece_of_data_1)
({(101,S)=3},piece_of_data_2)
({(101,S)=1, (100,9)=2},piece_of_data_3)
What I want to do is flatMap my distance map (which I can do), but at the same time, for each flatMapped DistanceMap entry, I want to retain the string associated with it. So my resulting data would look like this:
({(101,S)=3},piece_of_data_1)
({(101,S)=3},piece_of_data_2)
({(101,S)=1},piece_of_data_3)
({(109,S)=2},piece_of_data_3)
As mentioned I can flatMap the first part using:
x.flatMap(x => x._1).collect.foreach(println)
but am stuck on how I can retain the string from the second part of my original data.
This might work for you:
x.flatMap(x => x._1.map(y => (y,x._2)))
The idea is to convert from (Seq(a,b,c),Value) to Seq( (a,Value), (b, Value), (c, Value)).
This is the same in Scala, so here is a standalone simplified Scala example you can paste in Scala REPL:
Seq((Seq("a","b","c"), 34), (Seq("r","t"), 2)).flatMap( x => x._1.map(y => (y,x._2)))
This results in:
res0: Seq[(String, Int)] = List((a,34), (b,34), (c,34), (r,2), (t,2))
update
I have an alternative solution: flip key with value, use the flatMapValues transformation, and then flip key with value again. See pseudo code:
x.map(p => (p._2, p._1)).flatMapValues(m => m).map(p => (p._2, p._1))
previous version
I propose adding one preprocessing step (sorry, I have no computer with a Scala interpreter in front of me till tomorrow to come up with working code):
transform the pair RDD from (DistanceMap, String) into an RDD with a list of Tuple4: List((VertexId, String, Int, String), ... ())
apply flatMap on the result
Pseudocode:
rdd.map( (DistanceMap, String) => List((VertexId,String, Int, String), ... ()))
.flatMap(x=>x)