Using key-value pair RDD to build a kdtree in Spark - scala

I am trying to build kd-trees from points in a pair RDD called "RDDofPoints" with type RDD[(BoundingBox[Double], (Double, Double))]. All the points are assigned to a particular bounding box, and my goal is to build a kd-tree for each of the bounding boxes.
I am trying to use reduceByKey for this purpose. However, I am stuck at how to call the buildtree function in this case.
The function declaration of buildtree is:
def buildtree(points: RDD[(Double, Double)], depth: Int = 0): Option[KdNodeforRDD]
And, I am trying to call it as:
val treefromPairRDD = RDDofPoints.reduceByKey((k,v) => buildtree(v))
This obviously does not work. I am fairly new to Scala and Spark; please suggest an appropriate way to go about this. I am not sure reduceByKey is the right choice here; if some other pair RDD function can be applied instead, which one would it be?
Thank you.
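One possible direction (just a sketch, not a drop-in fix): an RDD cannot be used inside another RDD's operations, so one option is to rewrite buildtree against a local collection and group the points per bounding box first. The Seq-based buildtreeLocal helper below is an assumed rewrite, not the original function; BoundingBox, KdNodeforRDD and RDDofPoints are the names from the question.
def buildtreeLocal(points: Seq[(Double, Double)], depth: Int = 0): Option[KdNodeforRDD] = ??? // same logic as buildtree, but on a local Seq
val treefromPairRDD: RDD[(BoundingBox[Double], Option[KdNodeforRDD])] =
  RDDofPoints
    .groupByKey()                                      // gathers all points of one bounding box
    .mapValues(points => buildtreeLocal(points.toSeq)) // builds one kd-tree per bounding box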

Related

Why does nested flatMap - map in Scala give an RDD of type Object instead of a list of tuples?

I have an RDD that I want to group according to some key, but it just doesn't work. I am a Scala and Spark beginner. So I have the following RDD:
rdd: RDD[WikipediaArticle]
val meinVal = rdd.flatMap(article=>langs.map(lang=>{if (article.mentionsLanguage(lang)){ Tuple2(lang,article)} else{None}})).filter(_!=None)
meinVal.collect.foreach(println) gives:
(Scala,WikipediaArticle(2,Scala and Java run on the JVM))
(Java,WikipediaArticle(2,Scala and Java run on the JVM))
(Scala,WikipediaArticle(3,Scala is not purely functional))
I have two questions:
Why can I not apply the groupByKey function? It is an rdd that contains a list of tuples, the first tuple-entry is the key.
I don't see how to apply groupBy either. I thought I could do meinVal.groupBy(x => x._1), but that throws an error.
I notice that when I use an IDE and hover over "meinVal", it shows that it is RDD[Object], whereas it should be RDD[(String, WikipediaArticle)]. I do not know how to get this information without the IDE. So it seems that the rdd contains just one big object; I just don't see why that is.
Anyone? Please?
Irene
Ok, so thanks to this post https://stackoverflow.com/a/29426336/909909 I figured it out. The problem was not the nested flatMap-map construct, but the condition in the map instruction. In my code I returned "None" if the condition was not met. Since None is not a tuple, I get an RDD[Object] and therefore I cannot use groupByKey.
To solve this I use Option and then flatten the rdd to get rid of the Option and its Nones again.
val meinVal = rdd.flatMap( article=> langs.map(lang=> { if(article.mentionsLanguage(lang)){Some(Tuple2(lang,article))}else{None}}).flatten)
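A slightly shorter variant of the same idea (just a sketch, assuming langs: List[String] and mentionsLanguage from the question): flatMap over the Option directly, which keeps the element type a tuple so that groupByKey works afterwards.
val meinVal: RDD[(String, WikipediaArticle)] =
  rdd.flatMap(article =>
    langs.flatMap(lang =>
      if (article.mentionsLanguage(lang)) Some((lang, article)) else None))
val grouped = meinVal.groupByKey() // now compiles, since the element type is a tuple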

Cannot create graph in GraphX (Scala Spark)

I have huge problems creating a simple graph in Spark GraphX. I really don't understand what is going on, so I try everything I find, but nothing works.
For example, I am trying to reproduce the steps from here.
The following two were OK:
val flightsFromTo = df_1.select($"Origin",$"Dest")
val airportCodes = df_1.select($"Origin", $"Dest").flatMap(x => Iterable(x(0).toString, x(1).toString))
But after this I obtain an error:
val airportVertices: RDD[(VertexId, String)] = airportCodes.distinct().map(x => (MurmurHash.stringHash(x), x))
Error: missing Parameter type
Could You please tell me what is wrong?
And by the way, why MurmurHash? What is the purpose of it?
My guess is that you are working from a three-year-old tutorial with a recent Spark version.
The sqlContext read returns a Dataset instead of an RDD.
If you want it like in the tutorial, use .rdd instead:
val airportVertices: RDD[(VertexId, String)] = airportCodes.rdd.distinct().map(x => (MurmurHash3.stringHash(x), x))
or change the type of the variable:
val airportVertices: Dataset[(Int, String)] = airportCodes.distinct().map(x => (MurmurHash3.stringHash(x), x))
You could also check out https://graphframes.github.io/ if you are interested in graphs and Spark.
Updated
To create a Graph you need vertices and edges
To make computation easier all vertices have to be identified by a VertexId (in essence a Long)
MurmurHash is used to create well-distributed hashes. More info here: MurmurHash - what is it?
Hashing is a best practice for distributing the data without skewing it, but there is no technical reason why you couldn't use an incremental counter for each vertex.
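If you prefer the incremental-counter route, here is a sketch (not from the tutorial) using zipWithIndex, which assigns each distinct code a Long index:
val airportVerticesByCounter: RDD[(VertexId, String)] =
  airportCodes.rdd.distinct().zipWithIndex().map { case (code, id) => (id, code) }
The hash approach is still usually simpler, because with counter ids the edges would need a join to look up the id for each airport code.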
I've looked at the tutorial, but the only thing you have to change to make it work is to add .rdd:
val flightsFromTo = df_1.select($"Origin",$"Dest").rdd
val airportCodes = df_1.select($"Origin", $"Dest").flatMap(x => Iterable(x(0).toString, x(1).toString)).rdd
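From those two RDDs the graph itself can then be assembled. The Edge/Graph part below is only a sketch (it goes beyond the two lines above) and assumes column 0 is Origin and column 1 is Dest, as in the select:
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import scala.util.hashing.MurmurHash3
val airportVertices: RDD[(VertexId, String)] =
  airportCodes.distinct().map(code => (MurmurHash3.stringHash(code).toLong, code))
val flightEdges: RDD[Edge[Int]] = flightsFromTo.map { row =>
  Edge(MurmurHash3.stringHash(row(0).toString).toLong,  // source airport id
       MurmurHash3.stringHash(row(1).toString).toLong,  // destination airport id
       1)                                               // edge attribute: one flight
}
val graph: Graph[String, Int] = Graph(airportVertices, flightEdges)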

java.lang.NullPointerException after df.na.fill("Missing") in Scala?

I've been trying to learn/use Scala for machine learning and to do that I need to convert string variables to an index of dummies.
The way I've done it is with the StringIndexer in Scala. Before running it, I used df.na.fill("Missing") to replace missing values. Even after I run that, I still get a NullPointerException.
Is there something else I should be doing or something else I should be checking? I used printSchema to filter only on the string columns to get the list of columns I needed to run StringIndexer on.
val newDf1 = reweight.na.fill("Missing")
val cat_cols = Array("highest_tier_nm", "day_of_week", "month",
"provided", "docsis", "dwelling_type_grp", "dwelling_type_cd", "market",
"bulk_flag")
val transformers: Array[org.apache.spark.ml.PipelineStage] = cat_cols
.map(cname => new StringIndexer()
.setInputCol(cname)
.setOutputCol(s"${cname}_index"))
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers
val categorical = new Pipeline().setStages(stages)
val cat_reweight = categorical.fit(newDf)
Normally when using machine learning you would train the model with one part of the data and then test it with another part. Hence, there are two different methods that reflect this. You have only used fit(), which is equivalent to training a model (or a pipeline).
This means that your cat_reweight is not a dataframe, it is a PipelineModel. A PipelineModel has a method transform() that takes data in the same format as was used for training and gives a dataframe as output. In other words, you should add .transform(newDf1) after fit(newDf1).
Another possible issue is that in your code you have used fit(newDf) instead of fit(newDf1). Make sure the correct dataframe is used for both the fit() and transform() methods, otherwise you will get a NullPointerException.
It works for me when running locally; however, if you still get an error, you could try calling cache() after replacing the nulls and then performing an action to make sure all transformations are applied.
Hope it helps!
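A minimal sketch of the fix described above, reusing the names from the question (newDf1, stages); nothing here is a new API:
import org.apache.spark.ml.Pipeline
val categorical  = new Pipeline().setStages(stages)
val catModel     = categorical.fit(newDf1)       // a PipelineModel, not a DataFrame
val cat_reweight = catModel.transform(newDf1)    // DataFrame with the *_index columns added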

Most common string in list with counter

I am trying to write a function that
takes a List read from a file as input
outputs the most frequently used string as well as an integer that shows the number of times it was used.
example output:
("Cat",5)
function signature:
def mostFreq(info: List[List[String]]): (String, Int) =
First, I thought about:
creating a Map variable and a counter variable
iterating over my list to fill the map
then iterating over the map
However, there must be a simpler way to do this in Scala, but I'm not used to Scala's library just yet.
I have seen this as one way to do it that uses only integers.
Finding the most frequent/common element in a collection?
But I was wondering how it could be done using strings and integers.
The solution from the linked post has just about everything you need for this.
def mostFreq(info: List[List[String]]): (String, Int) =
info.flatten.groupBy(identity).mapValues(_.size).maxBy(_._2)
It doesn't handle ties terribly well, but you haven't stated how ties should be handled.
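A quick hypothetical check against the example output from the question, where "Cat" appears five times across the inner lists:
val info = List(List("Cat", "Dog", "Cat"), List("Cat", "Cat"), List("Dog", "Cat"))
mostFreq(info) // => ("Cat", 5)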

Subtracting an RDD from another RDD doesn't work correctly

I want to subtract an RDD from another RDD. I looked into the documentation and found that subtract can do that. However, when I tested subtract, the final RDD remained the same and the values were not removed!
Is there any other function to do that? Or am I using subtract incorrectly?
Here is the code that I used:
val vertexRDD: org.apache.spark.rdd.RDD[(VertexId, Array[Int])]
val clusters = vertexRDD.takeSample(false, 3)
val clustersRDD: RDD[(VertexId, Array[Int])] = sc.parallelize(clusters)
val result = vertexRDD.subtract(clustersRDD)
result.collect().foreach(println(_))
Performing set operations like subtract with mutable types (Array in this example) is usually unsupported, or at least not recommended.
Try using an immutable type instead.
I believe WrappedArray is the relevant container for storing arrays in sets, but I'm not sure.
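A sketch of that suggestion applied to the question's code (an assumption, not tested there): convert the Array values to Vector, which has structural equality, before subtracting.
val result = vertexRDD.mapValues(_.toVector)
  .subtract(clustersRDD.mapValues(_.toVector))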
If your RDD is composed of mutable objects it won't work. The problem is that it won't show an error either, so this kind of problem is hard to identify; I had a similar one yesterday and used a workaround:
rdd.keyBy(someImmutableValue) // do this using the same key value for both of your RDDs
val resultRDD = rdd.subtractByKey(otherRDD).values
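Applied to the code from the question, this becomes a one-liner (a sketch; vertexRDD and clustersRDD are already keyed by VertexId, an immutable Long, so subtractByKey never has to compare the mutable Array values):
val remaining = vertexRDD.subtractByKey(clustersRDD)
remaining.collect().foreach(println)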
Recently I tried the subtract operation on two RDDs (of array lists) and it works. The important note is that the RDD you call .subtract on should be the list you are subtracting from, and the argument should be the elements you want removed, not the other way around.
Correct: val result = fromList.subtract(theElementsYouWantToSubtract)
Incorrect: val result = theElementsYouWantToSubtract.subtract(fromList) (this will not give any compile/runtime error message, just the wrong result)