Say I have an Iterable[(Int, String)]. How do I get an array of just the "values"? That is, how do I convert an Iterable[(Int, String)] into an Array[String]? The "keys" and "values" do not have to be unique, which is why I put them in quotation marks.
iterable.map(_._2).toArray
_._2 takes the second element of the tuple bound to the placeholder variable (_), whose name we don't care about.
Simply:
val iterable: Iterable[(Int, String)] = Iterable((1, "a"), (2, "b"))
val values = iterable.toArray.map(_._2)
Simply map the iterable and extract the second element (tuple._2):
scala> val iterable: Iterable[(Int, String)] = Iterable((100, "Bring me the horizon"), (200, "Porcupine Tree"))
iterable: Iterable[(Int, String)] = List((100,Bring me the horizon), (200,Porcupine Tree))
scala> iterable.map(tuple => tuple._2).toArray
res3: Array[String] = Array(Bring me the horizon, Porcupine Tree)
In addition to the already suggested map, you might want to build the array as you map from tuple to string, instead of converting at some later point, as it can save an iteration:
import scala.collection.breakOut
val values: Array[String] = iterable.map(_._2)(breakOut)
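Note that breakOut was removed in Scala 2.13. There, a minimal sketch of an equivalent single-pass conversion goes through an iterator:
// Scala 2.13+: breakOut no longer exists; mapping the iterator avoids
// building an intermediate collection before the final array.
val values: Array[String] = iterable.iterator.map(_._2).toArray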
I created this RDD:
scala> val data=sc.textFile("sparkdata.txt")
Then I am trying to return the contents of the file:
scala> data.collect
I am splitting the existing data into individual words using:
scala> val splitdata = data.flatMap(line => line.split(" "));
scala> splitdata.persist()
scala> splitdata.collect;
Now I am doing the map-reduce operation:
scala> val mapdata = splitdata.map(word => (word,1));
scala> mapdata.collect;
scala> val reducedata = mapdata.reduceByKey(_+_);
To get the result:
scala> reducedata.collect;
When I want to display the first 10 rows:
splitdata.groupByKey(identity).count().show(10)
I get the following error:
<console>:38: error: value groupByKey is not a member of org.apache.spark.rdd.RDD[String]
splitdata.groupByKey(identity).count().show(10)
^
<console>:38: error: missing argument list for method identity in object Predef
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `identity _` or `identity(_)` instead of `identity`.
splitdata.groupByKey(identity).count().show(10)
^
Similar to reduceByKey(), groupByKey() is a method on pair RDDs of type RDD[(K, V)], rather than on general RDDs. While reduceByKey() uses a provided binary function to reduce an RDD[(K, V)] to another RDD[(K, V)], groupByKey() transforms an RDD[(K, V)] into an RDD[(K, Iterable[V])]. To further transform the Iterable[V] by key, one would typically apply mapValues() (or flatMapValues()) with a provided function.
For example:
val rdd = sc.parallelize(Seq(
  "apple", "apple", "orange", "banana", "banana", "orange", "apple", "apple", "orange"
))
rdd.map((_, 1)).reduceByKey(_ + _).collect
// res1: Array[(String, Int)] = Array((apple,4), (banana,2), (orange,3))
rdd.map((_, 1)).groupByKey().mapValues(_.sum).take(2)
// res2: Array[(String, Int)] = Array((apple,4), (banana,2))
In case you're interested only in getting the count of groups after applying groupByKey():
rdd.map((_, 1)).groupByKey().count()
// res3: Long = 3
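As for the asker's failing line: show is a DataFrame/Dataset method and groupByKey(identity) comes from the Dataset API, so the quoted snippet mixes two APIs. To display the first 10 word counts from the RDD pipeline above, take is the usual tool (a sketch against the asker's reducedata):
// take(10) pulls at most 10 (word, count) pairs to the driver for printing
reducedata.take(10).foreach(println)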
I have a list of tuples that I want to sort ascending by the second element of the tuple. I do it with this code:
freqs.sortWith( _._2 < _._2 )
But I don't like the ._2 naming; I would prefer to give the second element a nice name, as in freqs.sortWith( _.weight < _.weight ).
Any ideas how to do this?
You don't need to repeat everything twice in sortWith; you can use sortBy(_._2) instead.
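Applied to the asker's freqs, that is simply:
freqs.sortBy(_._2)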
If you want to have a nice name, create a custom case class that has member variables of this name:
case class Foo(whatever: String, weight: Double)
val list: List[Foo] = ???
list.sortBy(_.weight)
It takes just a single line.
Alternatively, you can "pimp" the tuples locally:
class WeightOps(val whatever: String, val weight: Double)
implicit def tupleToWeightOps(t: (String, Double)): WeightOps =
  new WeightOps(t._1, t._2)
then you can use .weight on tuples directly:
val list: List[(String, Double)] = ???
list.sortBy(_.weight)
Don't forget to keep the implicit scope as small as possible.
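As an aside, since Scala 2.10 the same enrichment is usually written as an implicit value class (a sketch reusing the names above), which also keeps the conversion self-contained:
implicit class WeightOps(val t: (String, Double)) extends AnyVal {
  def weight: Double = t._2  // expose the second tuple element under a readable name
}
val list: List[(String, Double)] = ???
list.sortBy(_.weight)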
scala> val list: List[(String, Int)] = List (("foo", 7), ("bar", 3), ("foobar", 5))
list: List[(String, Int)] = List((foo,7), (bar,3), (foobar,5))
scala> list.sortBy { case (ignore, price) => price }
res70: List[(String, Int)] = List((bar,3), (foobar,5), (foo,7))
A case extractor can be used to put meaningful names on variables.
My requirement is this:
arr1 : Array[(String, String)] = Array((bangalore,Kanata), (kannur,Kerala))
arr2 : Array[(String, String)] = Array((001,anup), (002,sithu))
should give me
Array((001,anup,bangalore,Kanata), (002,sithu,kannur,Kerala))
I tried this:
val arr3 = arr2.map(field=>(field,arr1))
but it didn't work.
@nicodp's answer addresses your question very nicely: zip and then map will give you the resulting array. Recall, though, that if one list is longer than the other, its remaining elements are ignored.
My attempt tries to address this:
Consider:
val arr1 = Array(("bangalore","Kanata"), ("kannur","Kerala"))
val arr2 = Array(("001","anup", "ramakrishan"), ("002","sithu", "bhattacharya"))
Zipping and then mapping on the tuples gives:
arr1.zip(arr2).map(field => (field._1._1, field._1._2, field._2._1, field._2._2))
Array[(String, String, String, String)] = Array((bangalore,Kanata,001,anup), (kannur,Kerala,002,sithu))
// This ignores the last field of arr2
While mapping, you can convert each tuple into an iterator and build a list from it. That way you don't have to keep track of whether you have a Tuple2 or a Tuple3:
arr1.zip(arr2).map { case (k, v) => List(k.productIterator.toList, v.productIterator.toList).flatten }
// Array[List[Any]] = Array(List(bangalore, Kanata, 001, anup, ramakrishan), List(kannur, Kerala, 002, sithu, bhattacharya))
You can do a zip followed by a map:
scala> val arr1 = Array((1,2),(3,4))
arr1: Array[(Int, Int)] = Array((1,2), (3,4))
scala> val arr2 = Array((5,6),(7,8))
arr2: Array[(Int, Int)] = Array((5,6), (7,8))
scala> arr1.zip(arr2).map(field => (field._1._1, field._1._2, field._2._1, field._2._2))
res1: Array[(Int, Int, Int, Int)] = Array((1,2,5,6), (3,4,7,8))
The map acts as a flatten for tuples, that is, takes things of type ((A, B), (C, D)) and maps them to (A, B, C, D).
What zip does is... meh, let's see its type:
def zip[B](that: GenIterable[B]): List[(A, B)]
So, from there, we can see that it takes an iterable collection (which can be another list) and returns a list combining the corresponding elements of the this: List[A] and that: List[B] lists. Recall that if one list is larger than the other, its remaining elements are ignored. You can read more about list functions in the documentation.
I agree that the cleanest solution is using the zip method from collections:
val arr1 = Array(("bangalore","Kanata"), ("kannur","Kerala"))
val arr2 = Array(("001","anup"), ("002","sithu"))
arr1.zip(arr2).foldLeft(List.empty[Any]) {
  case (acc, (a, b)) => acc ::: List(a.productIterator.toList ++ b.productIterator.toList)
}
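For the two sample arrays above, this evaluates to:
// List(List(bangalore, Kanata, 001, anup), List(kannur, Kerala, 002, sithu))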
Imagine you have a Map[Option[Int], String] and you want a Map[Int, String], discarding the entries that have None as the key.
Another, somewhat similar example: transforming a List[(Option[Int], String)] into a List[(Int, String)], again discarding the tuples that have None as the first element.
What's the best approach?
collect is your friend here:
Example data definition:
val data = Map(Some(1) -> "data", None -> "")
Solution for Map:
scala> data collect { case (Some(i), s) => (i, s) }
res4: scala.collection.immutable.Map[Int,String] = Map(1 -> data)
The same approach works for a list of tuples:
scala> data.toList collect { case (Some(i), s) => (i, s) }
res5: List[(Int, String)] = List((1,data))
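An equivalent alternative (a sketch) is flatMap, since mapping the Option key yields either one entry or none:
data.flatMap { case (k, s) => k.map(_ -> s) }
// Map(1 -> data)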
I have a list of tuples of type (user id, name, count).
For example,
val x = sc.parallelize(List(
  ("a", "b", 1),
  ("a", "b", 1),
  ("c", "b", 1),
  ("a", "d", 1)
))
I'm attempting to reduce this collection to one where each name is counted per id. So the above val x is converted to:
(a,ArrayBuffer((d,1), (b,2)))
(c,ArrayBuffer((b,1)))
Here is the code I am currently using:
val byKey = x.map({case (id,uri,count) => (id,uri)->count})
val grouped = byKey.groupByKey
val count = grouped.map{case ((id,uri),count) => ((id),(uri,count.sum))}
val grouped2: org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])] = count.groupByKey
grouped2.foreach(println)
I'm attempting to use reduceByKey as it performs faster than groupByKey.
How can reduceByKey be used instead of the above code to produce the same mapping?
Following your code:
val byKey = x.map({case (id,uri,count) => (id,uri)->count})
You could do:
val reducedByKey = byKey.reduceByKey(_ + _)
scala> reducedByKey.collect.foreach(println)
((a,d),1)
((a,b),2)
((c,b),1)
PairRDDFunctions[K,V].reduceByKey takes an associative reduce function that can be applied to the type V of the RDD[(K,V)]. In other words, you need a function f[V](e1: V, e2: V): V. In this particular case, with sum on Ints: (x: Int, y: Int) => x + y, or _ + _ in short underscore notation.
For the record: reduceByKey performs better than groupByKey because it attempts to apply the reduce function locally before the shuffle/reduce phase, whereas groupByKey forces a shuffle of all elements before grouping.
Your original data structure is RDD[(String, String, Int)], and reduceByKey can only be used if the data structure is RDD[(K, V)].
val kv = x.map(e => e._1 -> e._2 -> e._3) // kv is RDD[((String, String), Int)]
val reduced = kv.reduceByKey(_ + _) // reduced is RDD[((String, String), Int)]
val kv2 = reduced.map(e => e._1._1 -> (e._1._2 -> e._2)) // kv2 is RDD[(String, (String, Int))]
val grouped = kv2.groupByKey() // grouped is RDD[(String, Iterable[(String, Int)])]
grouped.foreach(println)
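For the sample x above, this prints something like the following (Spark gives no ordering guarantee, and the Iterable is typically a CompactBuffer):
// (a,CompactBuffer((b,2), (d,1)))
// (c,CompactBuffer((b,1)))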
The syntax is below:
reduceByKey(func: Function2[V, V, V]): JavaPairRDD[K, V],
which says that, for the same key in an RDD, it takes the values (which will definitely be of the same type), performs the operation provided as the function, and returns a value of the same type as the parent RDD's.
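Note that the quoted signature is from the Java API; the Scala API equivalent is def reduceByKey(func: (V, V) => V): RDD[(K, V)]. A minimal sketch:
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _).collect()  // Array((a,4), (b,2)) -- ordering may vary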