On Spark's rdd.map(_.swap) - scala

I'm new to both Scala and Spark. Could anyone explain the meaning of
rdd.map(_.swap)
? If I look at the Scala/Spark API I cannot find swap as a method on the RDD class.

swap is a method on Scala Tuples. It swaps the first and second elements of a Tuple2 (or pair) with each other. For example:
scala> val pair = ("a","b")
pair: (String, String) = (a,b)
scala> val swapped = pair.swap
swapped: (String, String) = (b,a)
RDD's map function applies a given function to each element of the RDD. In this case, the function to be applied to each element is simply
_.swap
The underscore here is Scala shorthand for writing anonymous functions: it refers to the function's parameter without naming it. So the snippet above can be rewritten as:
rdd.map{ pair => pair.swap }
So the code snippet you posted swaps the first and second elements of the tuple/pair in each row of the RDD.
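For example, a minimal sketch of the whole operation (assuming an existing SparkContext named sc):
// assuming sc is an existing SparkContext
val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))   // RDD[(String, Int)]
val swapped = rdd.map(_.swap)                       // RDD[(Int, String)]
swapped.collect()                                   // Array((1,a), (2,b))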

This only works if rdd is of type RDD[Tuple2[T1, T2]] (i.e. a pair RDD), since swap is defined on Tuple2, not on RDD.

In Python it works as follows:
rdd.map(lambda x: (x[1], x[0]))
This swaps (a, b) to (b, a) in each key/value pair.

For an RDD of tuples created in Spark, you can also write the anonymous function explicitly:
RDD map1 : ("a", 1), ("b", 2), ("c", 3)...
val map2 = map1.map(a => (a._2, a._1))
This returns the RDD
RDD map2 : (1, "a"), (2, "b"), (3, "c")...

Related

Tuple keys of a Map are converted to another Map when using the map function to retrieve the keys

When I ran the following piece of code, I got an unexpected result
val a = Map(
  ("1", "2") -> 1,
  ("1", "4") -> 2,
  ("2", "2") -> 3,
  ("2", "4") -> 4
)
println(a.size)
val b = a.map(_._1)
println(b.size)
val c = a.keySet
println(c.size)
The result is:
res0: Int = 4
b: scala.collection.immutable.Map[String,String] = Map(1 -> 4, 2 -> 4)
res1: Int = 2
c: scala.collection.immutable.Set[(String, String)] = Set((1,2), (1,4), (2,2), (2,4))
res2: Int = 4
What I expected is the content of b is the same as that of c. Is it expected in Scala? Or some kind of side effect?
Yes, this is expected behaviour. As a general rule, Scala's collection map methods try to make the output collection the same type as the collection on which the method is called.
So, for Map.map, the output collection can be a Map whenever the result type of the function you pass to map is a Tuple. This is precisely the case when you call val b = a.map(_._1). At this point, another rule comes into play: a Map's keys must be unique. So, as Scala traverses a during the call to a.map(_._1), it inserts the result of _._1 into the new map that it's building (to become b). The second entry replaces the first, and the fourth entry replaces the third, because they have the same key.
If this is not the behaviour you want, you can get around it by building something other than a Map, such as a Seq of the keys.
e.g.
val b: Seq[(String, String)] = a.toSeq.map(_._1)
I think you want to use .keys instead: you are mapping a Map[(String, String), Int] into a Map[String, String], and since a Map's keys must be unique, the colliding entries are discarded.
This happens because the result of the mapping is a collection of pairs, so Scala builds a Map[String, String] from it.
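For example (a small sketch using the a from the question):
val keys = a.keys        // Iterable[(String, String)] containing all four tuple keys
keys.size                // 4
a.keySet.size            // 4, matching c in the question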

creating pair RDD in spark using scala

I'm new to Spark and I need to create an RDD in which each key is paired with just two values.
Array1 = ((1,1),(1,2),(1,3),(2,1),(2,2),(2,3))
When I execute groupByKey the output is ((1,(1,2,3)),(2,(1,2,3)))
But I need the output to pair each key with just two values. I'm not sure how to get it.
Expected Output = ((1,(1,2)),(1,(1,3)),(1,(2,3)),(2,(1,2)),(2,(1,3)),(2,(2,3)))
Each combination of values should appear only once: there should only be (1,2) and not (2,1), and likewise (2,3) but not (3,2).
Thanks
You can get the result you require as follows:
// Prior to doing the `groupBy`, you have an RDD[(Int, Int)], x, containing:
// (1,1),(1,2),(1,3),(2,1),(2,2),(2,3)
//
// Can simply map values as below. The result is an RDD[(Int, (Int, Int))].
val x: RDD[(Int, Int)] = sc.parallelize(Seq((1,1),(1,2),(1,3),(2,1),(2,2),(2,3)))
val y: RDD[(Int, (Int, Int))] = x.map(t => (t._1, t)) // Map first value in pair tuple to the tuple
y.collect // Get result as an array
// res0: Array[(Int, (Int, Int))] = Array((1,(1,1)), (1,(1,2)), (1,(1,3)), (2,(2,1)), (2,(2,2)), (2,(2,3)))
That is, the result is a pair RDD that relates the key (the first value of each pair) to the pair (as a tuple). Do not use groupBy, since—in this case—it will not give you what you want.
If I understand your requirement correctly, you can use groupByKey and flatMapValues to flatten the 2-combinations of the grouped values, as shown below:
val rdd = sc.parallelize(Seq(
  (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)
))
rdd.groupByKey.flatMapValues(_.toList.combinations(2)).
  map{ case (k, v) => (k, (v(0), v(1))) }.
  collect
// res1: Array[(Int, (Int, Int))] =
// Array((1,(1,2)), (1,(1,3)), (1,(2,3)), (2,(1,2)), (2,(1,3)), (2,(2,3)))

How to separate array or vector column into multiple columns?

Suppose I have a Spark DataFrame generated as:
val df = Seq(
  (Array(1, 2, 3), Array("a", "b", "c")),
  (Array(1, 2, 3), Array("a", "b", "c"))
).toDF("Col1", "Col2")
It's possible to extract an element at a given index of "Col1" with something like:
val extractFirstInt = udf { (x: Seq[Int], i: Int) => x(i) }
df.withColumn("Col1_1", extractFirstInt($"Col1", lit(1)))
And similarly for the second column "Col2" with e.g.
val extractFirstString = udf { (x: Seq[String], i: Int) => x(i) }
df.withColumn("Col2_1", extractFirstString($"Col2", lit(1)))
But the code duplication is a little ugly -- I need a separate UDF for each underlying element type.
Is there a way to write a generic UDF, that automatically infers the type of the underlying Array in the column of the Spark Dataset? E.g. I'd like to be able to write something like (pseudocode; with generic T)
val extractFirst = udf { (x: Seq[T], i: Int) => x(i) }
df.withColumn("Col1_1", extractFirst($"Col1", lit(1)))
Where somehow the type T would just be automagically inferred by Spark / the Scala compiler (perhaps using reflection if appropriate).
Bonus points if you're aware of a solution that works both with array-columns and Spark's own DenseVector / SparseVector types. The main thing I'd like to avoid (if at all possible) is the requirement of defining a separate UDF for each underlying array-element type I want to handle.
Perhaps frameless could be a solution?
Since manipulating datasets requires an Encoder for a given type, you have to define the type upfront so Spark SQL can create one for you. I think a Scala macro to generate all sorts of Encoder-supported types would make sense here.
As of now, I'd define a generic method and a UDF per type (which is against your wish to find a way to have "a generic UDF, that automatically infers the type of the underlying Array in the column of the Spark Dataset").
def myExtract[T](x: Seq[T], i: Int) = x(i)
// define UDF for extracting strings
val extractString = udf(myExtract[String] _)
Use as follows:
val df = Seq(
  (Array(1, 2, 3), Array("a", "b", "c")),
  (Array(1, 2, 3), Array("a", "b", "c"))
).toDF("Col1", "Col2")
scala> df.withColumn("Col1_1", extractString($"Col2", lit(1))).show
+---------+---------+------+
| Col1| Col2|Col1_1|
+---------+---------+------+
|[1, 2, 3]|[a, b, c]| b|
|[1, 2, 3]|[a, b, c]| b|
+---------+---------+------+
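The integer column works the same way with a second UDF built from the same generic method (a small sketch reusing the df above; the name extractInt is mine):
// define UDF for extracting integers
val extractInt = udf(myExtract[Int] _)
df.withColumn("Col1_1", extractInt($"Col1", lit(1))).show
// Col1_1 now holds 2 in both rows (the element at index 1 of Col1)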
You could explore Dataset (not DataFrame, i.e. Dataset[Row]) instead. That would give you all the type machinery (and perhaps you could avoid any macro development).
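A minimal sketch of that direction (not from the original answer), assuming the same df and that spark.implicits._ is in scope; the case class name is arbitrary:
case class Cols(Col1: Seq[Int], Col2: Seq[String])
val ds = df.as[Cols]                                 // typed Dataset[Cols]
val extracted = ds.map(r => (r.Col1(1), r.Col2(1)))  // Dataset[(Int, String)]
extracted.show()                                     // each row becomes (2, b)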
As per advice from #zero323, I settled on an implementation of the following form:
def extractFirst(df: DataFrame, column: String, into: String) = {
  // extract column of interest
  val col = df.apply(column)
  // figure out the type name for this column
  val schema = df.schema
  val typeName = schema.apply(schema.fieldIndex(column)).dataType.typeName
  // delegate based on column type
  typeName match {
    case "array" => df.withColumn(into, col.getItem(0))
    case "vector" => {
      // construct a udf to extract first element
      // (could almost certainly do better here,
      // but this demonstrates the strategy regardless)
      val extractor = udf {
        (x: Any) => {
          val el = x.getClass.getDeclaredMethod("toArray").invoke(x)
          val array = el.asInstanceOf[Array[Double]]
          array(0)
        }
      }
      df.withColumn(into, extractor(col))
    }
    case _ => throw new IllegalArgumentException("unexpected type '" + typeName + "'")
  }
}
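Usage then looks like the following (a hypothetical call, reusing the df with array columns defined in the question; the output column name is mine):
val withFirst = extractFirst(df, "Col1", "Col1_first")
withFirst.show()
// Col1_first holds 1 in both rows (element 0 of each array in Col1)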

Extract elements of lists in an RDD

What I want to achieve
I'm working with Spark and Scala. I have two Pair RDDs.
rdd1 : RDD[(String, List[String])]
rdd2 : RDD[(String, List[String])]
Both RDDs are joined on their first value.
val joinedRdd = rdd1.join(rdd2)
So the resulting RDD is of type RDD[(String, (List[String], List[String]))]. I want to map this RDD and extract the elements of both lists, so that the resulting RDD contains just these elements of the two lists.
Example
rdd1 (id, List(a, b))
rdd2 (id, List(d, e, f))
wantedResult (a, b, d, e, f)
Naive approach
My naive approach would be to address each element directly with (i), like below:
val rdd = rdd1.join(rdd2)
  .map({ case (id, lists) =>
    (lists._1(0), lists._1(1), lists._2(0), lists._2(1), lists._2(2)) })
/* results in an RDD[(String, String, String, String, String)] */
Is there a way to get the elements of each list without addressing each one individually? Something like "lists._1.extractAll". Is there a way to use flatMap to achieve this?
You can simply concatenate the two lists with the ++ operator:
val res: RDD[List[String]] = rdd1.join(rdd2)
  .map { case (_, (list1, list2)) => list1 ++ list2 }
A better approach, which avoids carrying around List[String]s that may be very large, is to explode the RDDs into smaller (key, value) pairs, concatenate them, and then do a groupByKey:
val flatten1: RDD[(String, String)] = rdd1.flatMapValues(identity)
val flatten2: RDD[(String, String)] = rdd2.flatMapValues(identity)
val res: RDD[Iterable[String]] = (flatten1 ++ flatten2).groupByKey.values

TreeMap Keys and Iteration in Scala

I am using TreeMap and it behaves strangely in the following code.
Here is the code :
import scala.collection.immutable.TreeMap
object TreeMapTest extends App {
  val mp = TreeMap((0,1) -> "a", (0,2) -> "b", (1,3) -> "c", (3,4) -> "f")
  mp.keys.foreach(println) //A
  println("****")
  mp.map(x => x._1).foreach(println) //B
}
As you can see the two print lines (A and B) should have printed the same thing but the result is as follows:
(0,1)
(0,2)
(1,3)
(3,4)
****
(0,2)
(1,3)
(3,4)
Why is this happening here? The interesting thing is that even the IDE believes the two can be used interchangeably and suggests the replacement.
The Scala collection library generally tries to return the same kind of collection it starts with, so that e.g. val seq: Seq[Int] = ...; seq.map(...) returns a Seq, val list: List[Int] = ...; list.map(...) returns a List, etc. This isn't always possible: e.g. a String is considered to be a collection of Char, but "ab".map(x => x.toInt) obviously can't return a String. Similarly for Map: if you map each entry to a non-pair, you can't get a Map back; but here you map each entry to its key, which is a pair (Int, Int), and so Scala returns a Map[Int, Int]. That is why you can't keep both (0, 1) and (0, 2): they would be duplicate keys, and the later entry wins.
To avoid this problem, convert your map to Seq[((Int, Int), String)] first: mp.toSeq.map(x => x._1) (or mp.keySet.toSeq).
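For example (a small sketch using the mp from the question):
mp.toSeq.map(_._1)   // keeps all four keys: (0,1), (0,2), (1,3), (3,4)
mp.keySet.toSeq      // the same four keys, in the TreeMap's sorted order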