I want to convert Pair RDD "myRDD" values from Iterable[(Double,Double)] to Seq(Seq(Double)), however I am not sure how to do it. I tried the following but it does not work.
val groupedrdd: RDD[BB,Iterable[(Double,Double)]] = RDDofPoints.groupByKey()
val RDDofSeq = groupedrdd.mapValues{case (x,y) => Seq(x,y)}
The myRDD is formed using a groupByKey operation on a RddofPoints with their respective bounding boxes as keys. The BB is a case class and it is the key for a set of points with type (Double,Double). I want the RDDofSeq to have the type RDD[BB,Seq(Seq(Double))], however after groupByKey, myRDD has the type RDD[BB,Iterable[(Double,Double)]].
Here, it gives an error as:
Error:(107, 58) constructor cannot be instantiated to expected type;
found : (T1, T2)
required: Iterable[(Double, Double)]
I am new to Scala, any help in this regard is appreciated. Thanks.
ANSWER : The following is used to accomplish the above goal:
val RDDofSeq = groupedrdd.mapValues{iterable => iterable.toSeq.map{case (x,y) => Seq(x,y)}}
I tried this on Scalafiddle
val myRDD: Iterable[(Double,Double)] = Seq((1.1, 1.2), (2.1, 2.2))
val RDDofSeq = myRDD.map{case (x,y) => Seq(x,y)}
println(RDDofSeq) // returns List(List(1.1, 1.2), List(2.1, 2.2))
The only difference is that I used myRDD.map(.. instead of myRDD.mapValues(..
Make sure that myRDD is really of the type Iterable[(Double,Double)]!
Update after comment:
If I understand you correctly you want a Seq[Double] and not a Seq[Seq[Double]]
That would be this:
val RDDofSeq = myRDD.map{case (k,v) => v} // returns List(1.2, 2.2)
Update after the Type is now clear:
The values are of type Iterable[(Double,Double)] so you cannot match on a pair.
Try this:
val RDDofSeq = groupedrdd.mapValues{iterable =>
Seq(iterable.head._1, iterable.head._2)}
You just need map, not mapValues.
val RDDofSeq = myRDD.map{case (x,y) => Seq(x,y)}
Related
in my main program i receive inputs like -
key1=value1 key2=value2
Now what I want is to create a map out of it. I know the imperative way of doing this where I would get Array[String] that can be foreach and then split by "=" and then key and value can be used to form a Map.
is there a good functional and readable way to achieve this?
Also It will be great if I can avoid mutable Map and I want to avoid initial Dummy value initialization.
def initialize(strings: Array[String]): Unit = {
val m = collection.mutable.Map("dummy" -> "dummyval")
strings.foreach(
s => {
val keyVal:Array[String] = s.split("=")
m += keyVal(0) -> keyVal(1)
})
println(m)
}
you can just use toMap().
However, converting from array to tuple is not quite trivial:
How to convert an Array to a Tuple?
scala> val ar = Array("key1=value1","key2=value2")
ar: Array[String] = Array(key1=value1, key2=value2)
scala> ar.collect(_.split("=") match { case Array(x,y) => (x,y)}).toMap
res10: scala.collection.immutable.Map[String,String] = Map(key1 -> value1, key2 -> value2)
Maybe you have to call Function.unlift for intellij
val r = ar.collect(Function.unlift(_.split("=") match { case Array(x, y) => Some(x, y)})).toMap
similar to above but using only 'map'
ar.map(_.split("=")).map(a=>(a(0), a(1))).toMap
You can use Scopt to do the command line argument parsing in a neat way.
I am encountering this error:
java.lang.ClassCastException: scala.collection.immutable.$colon$colon cannot be cast to [Ljava.lang.Object;
whenever I try to use "contains" to find if a string is inside an array. Is there a more appropriate way of doing this? Or, am I doing something wrong? (I am fairly new to Scala)
Here is the code:
val matches = Set[JSONObject]()
val config = new SparkConf()
val sc = new SparkContext("local", "SparkExample", config)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val ebay = sqlContext.read.json("/Users/thomassquires/Downloads/products.json")
val catalogue = sqlContext.read.json("/Users/thomassquires/Documents/catalogue2.json")
val eins = ebay.map(item => (item.getAs[String]("ID"), Option(item.getAs[Set[Row]]("itemSpecifics"))))
.filter(item => item._2.isDefined)
.map(item => (item._1 , item._2.get.find(x => x.getAs[String]("k") == "EAN")))
.filter(x => x._2.isDefined)
.map(x => (x._1, x._2.get.getAs[String]("v")))
.collect()
def catEins = catalogue.map(r => (r.getAs[String]("_id"), Option(r.getAs[Array[String]]("item_model_number")))).filter(r => r._2.isDefined).map(r => (r._1, r._2.get)).collect()
def matched = for(ein <- eins) yield (ein._1, catEins.filter(z => z._2.contains(ein._2)))
The exception occurs on the last line. I have tried a few different variants.
My data structure is one List[Tuple2[String, String]] and one List[Tuple2[String, Array[String]]] . I need to find the zero or more matches from the second list that contain the string.
Thanks
Long story short (there is still part that eludes me here*) you're using wrong types. getAs is implemented as fieldIndex (String => Int) followed by get (Int => Any) followed by asInstanceOf.
Since Spark doesn't use Arrays nor Sets but WrappedArray to store array column data, calls like getAs[Array[String]] or getAs[Set[Row]] are not valid. If you want specific types you should use either getAs[Seq[T]] or getAsSeq[T] and convert your data to desired type with toSet / toArray.
* See Why wrapping a generic method call with Option defers ClassCastException?
This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD that contains a set of tuples of type (String,String) and since they aren't unique I want to get a look at how many times each String, String combination occurs so I use countByValue like so
val PairCount = Pairs.countByValue().toSeq
which gives me a tuple as output like this ((String,String),Long) where long is the number of times that the (String, String) tuple appeared
These Strings can be repeated in different combinations and I essentially want to run word count on this PairCount variable so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output the this spits out is String1->1, String2->1, String3->1, etc.
How do I output a key value pair from a map job in this case where the key is going to be one of the String values from the inner tuple, and the value is going to be the Long value from the outter tuple?
Update:
#vitalii gets me almost there. the answer gets me to a Seq[(String,Long)], but what I really need is to turn that into a map so that I can run reduceByKey it afterwards. when I run
PairCount.flatMap{case((x,y),n) => Seq[x->n]}.toMap
for each unique x I get x->1
for example the above line of code generates mom->1 dad->1 even if the tuples out of the flatMap included (mom,30) (dad,59) (mom,2) (dad,14) in which case I would expect toMap to provide mom->30, dad->59 mom->2 dad->14. However, I'm new to scala so I might be misinterpreting the functionality.
how can I get the Tuple2 sequence converted to a map so that I can reduce on the map keys?
If I correctly understand question, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res : RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
Seq(s1 -> n, s2 -> n)
}
Update: I didn't quiet understand what your final goal is, but here's a few more examples that may help you, btw code above is incorrect, I have missed the fact that countByValue returns map, and not RDD:
val pairs = sc.parallelize(
List(
"mom"-> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
)
)
// don't use countByValue, if pairs is large you will run out of memmory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)
val wordCount = pairs.flatMap { case (a,b) => Seq(a -> 1, b ->1)}.reduceByKey(_ + _)
wordCount.take(10)
// count in how many pairs each word occur, keys and values:
val wordPairCount = pairs.flatMap { case (a,b) =>
if (a == b) {
Seq(a->1)
} else {
Seq(a -> 1, b ->1)
}
}.reduceByKey(_ + _)
wordPairCount.take(10)
to get the histograms for the (String,String) RDD I used this code.
val Hist_X = histogram.map(x => (x._1-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_Y = histogram.map(x => (x._2-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_XY = histogram.map(x => (x-> 1.0)).reduceByKey(_+_)
where histogram was the (String,String) RDD
I have following list:
val list = List(("name1",20),("name2",20),("name1",30),("name2",30),
("name3",40),("name3",30),("name3",20))
I want following output:
List(("name3",40))
I tried following:
val distElements = list.map(_._2).distinct
list.groupBy(_._1).map{ case(k,v) =>
val h = v.map(_._2)
if(distElements.equals(h)) List.empty else distElements.diff(h)
}.flatten
But this is not I am looking for.
Can anybody give answer/hint me to get expected output.
I understand the question as looking for the element of the list whose _2 (number) occurs only once.
val list = List(("name1",20),("name2",20),("name1",30),("name2",30),
("name3",40),("name3",30),("name3",20))
First you group by the _2 element, which gives you a map whose keys are lists of all elements with the same _2:
val g = list.groupBy(_._2) // Map[Int, List[(String, Int)]]
Now you can filter those entries that consists only of one element:
val opt = g.collectFirst { // Option[(String, Int)]
case (_, single :: Nil) => single
}
Or (if you are expecting possibly more than one distinct value)
val col = g.collect { // Map[String, Int]
case (_, single :: Nil) => single
}
Seems to me that you're looking to match against both the value of the left hand and the right hand at the same time while also preserving the type of collection you're looking at, a List. I would use collect:
val out = myList.collect{
case item # ("name3", 40) => item
}
which combines a PartialFunction with filter and map like qualities. In this case, it filters out any value for which the PartialFunction is not defined while mapping the values which match. Here, I've only allowed for a singular match.
What is the best way to convert a List[String, Int] A to List[Int, String] B. I wanted to use the map function which would iterate through all items in my list A and then return a new list B however whenever I apply the map function on the list A it complains about wrong number of arguments
val listA:List[(String, Int)] = List(("graduates", 20), ("teachers", 10), ("students", 300))
val listB:List[(Int, String)] = listA.map((x:String, y:Int) => y, x)
Any suggestions? Thanks
How about this:
val listB = listA.map(_.swap)
You need to use pattern matching to get the elements of a pair. I swear a question like this was asked just a few days ago....
listA.map{case (a,b) => (b,a)}