Why doesn't Spark allow map-side combining with array keys?

I'm using Spark 1.3.1 and I'm curious why Spark doesn't allow array keys with map-side combining.
The relevant piece of the combineByKey function:
if (keyClass.isArray) {
if (mapSideCombine) {
throw new SparkException("Cannot use map-side combining with array keys.")
}
}

Basically for the same reason that the default partitioner cannot partition array keys.
A Scala Array is just a wrapper around a Java array, and its hashCode doesn't depend on its content:
scala> val x = Array(1, 2, 3)
x: Array[Int] = Array(1, 2, 3)
scala> val h = x.hashCode
h: Int = 630226932
scala> x(0) = -1
scala> x.hashCode() == h
res3: Boolean = true
This means that two arrays with exactly the same content are not equal:
scala> x
res4: Array[Int] = Array(-1, 2, 3)
scala> val y = Array(-1, 2, 3)
y: Array[Int] = Array(-1, 2, 3)
scala> y == x
res5: Boolean = false
As a result, Arrays cannot be used as meaningful keys. If you're not convinced, just check what happens when you use an Array as a key for a Scala Map:
scala> Map(Array(1) -> 1, Array(1) -> 2)
res7: scala.collection.immutable.Map[Array[Int],Int] = Map(Array(1) -> 1, Array(1) -> 2)
If you want to use a collection as a key, you should use an immutable data structure like a Vector or a List:
scala> Map(Array(1).toVector -> 1, Array(1).toVector -> 2)
res15: scala.collection.immutable.Map[Vector[Int],Int] = Map(Vector(1) -> 2)
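As a rough illustration of the same effect outside Spark (plain Scala collections only; the names records, byArray and byVector are made up for this sketch), grouping by array keys keeps identical-looking keys apart, while converting the keys to Vector first merges them:

```scala
// Hypothetical sample data: two records share the "same" key by content.
val records = List(Array(1, 2) -> 10, Array(1, 2) -> 20, Array(3) -> 5)

// Array keys use identity-based equals/hashCode, so every array is its own group.
val byArray = records.groupBy(_._1)

// Vector keys compare by content, so the two Array(1, 2) records end up together.
val byVector = records
  .groupBy(_._1.toVector)
  .map { case (key, group) => key -> group.map(_._2).sum }

// Content-based comparison of arrays is still available when requested explicitly.
val sameContent = Array(1, 2).sameElements(Array(1, 2))
val sameHash =
  java.util.Arrays.hashCode(Array(1, 2)) == java.util.Arrays.hashCode(Array(1, 2))
```

Here byArray ends up with three entries and byVector with two, the values for Vector(1, 2) being summed. In a Spark job the analogous step would be to map each key to an immutable sequence before any by-key operation.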
See also:
SI-1607
How does HashPartitioner work?
A list as a key for PySpark's reduceByKey

Related

Compare two list and get the index of same elements

val a = List(1,1,1,0,0,2)
val b = List(1,0,3,2)
I want to get the List of indices of the elements of "List b" which exist in "List a".
Here the output should be List(0,1,3)
I tried this
for(x <- a.filter(b.contains(_))) yield a.indexOf(x)
Sorry. I missed this. The list size may vary. Edited the Lists
Is there a better way to do this?
If you want a result of indices, it's often useful to start with indices.
b.indices.filter(a contains b(_))
REPL tested.
scala> val a = List(1,1,1,0,0,2)
a: List[Int] = List(1, 1, 1, 0, 0, 2)
scala> val b = List(1,0,3,2)
b: List[Int] = List(1, 0, 3, 2)
scala> b.indices.filter(a contains b(_))
res0: scala.collection.immutable.IndexedSeq[Int] = Vector(0, 1, 3)
val result = (a zip b).zipWithIndex.flatMap {
case ((aItem, bItem), index) => if(aItem == bItem) Option(index) else None
}
a zip b will return all elements from a that have a matching pair in b.
For example, if a is longer, like in your example, the result would be List((1,1),(1,0),(1,3),(0,2)) (the list will be b.length long).
Then you need the index also, that's zipWithIndex.
Since you only want the indexes, you return an Option[Int] and flatten it.
You can use an indexed for comprehension for this:
for{ i <- 0 to b.length-1
if (a contains b(i))
} yield i
scala> for(x <- b.indices.filter(a contains b(_))) yield x;
res27: scala.collection.immutable.IndexedSeq[Int] = Vector(0, 1, 3)
Here is another option:
scala> val a = List(1,1,1,0,0,2)
a: List[Int] = List(1, 1, 1, 0, 0, 2)
scala> val b = List(1,0,3,2)
b: List[Int] = List(1, 0, 3, 2)
scala> b.zipWithIndex.filter(x => a.contains(x._1)).map(x => x._2)
res7: List[Int] = List(0, 1, 3)
I also want to point out that your original idea (finding the elements in b that are in a and then getting the indices of those elements) would not work unless all elements of b contained in a are unique, because indexOf returns the index of the first matching element. Just a heads up.
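If a is large, every a contains ... call is a linear scan. A possible variant, sketched here with a hypothetical aSet lookup, builds a Set once and collects the matching indices in a single pass over b:

```scala
val a = List(1, 1, 1, 0, 0, 2)
val b = List(1, 0, 3, 2)

// Build the lookup set once; membership tests are then effectively constant time.
val aSet = a.toSet

// Keep the index of every element of b that also occurs in a.
val indices = b.zipWithIndex.collect { case (x, i) if aSet(x) => i }
```

For the lists from the question this yields List(0, 1, 3), the same indices as the answers above.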

Scala hash of stacks has only one stack for all the keys

I have the following hashmap, where each element should be mapped to a stack:
var pos = new HashMap[Int, Stack[Int]] withDefaultValue Stack.empty[Int]
for(i <- a.length - 1 to 0 by -1) {
pos(a(i)).push(i)
}
If a has the elements {4, 6, 6, 4, 6, 6},
and I add the following lines after the code above:
println("pos(0) is " + pos(0))
println("pos(4) is " + pos(4))
The output will be:
pos(0) is Stack(0, 1, 2, 3, 4, 5)
pos(4) is Stack(0, 1, 2, 3, 4, 5)
Why is this happening?
I don't want to add elements to pos(0), but only to pos(4) and pos(6) (the elements of a).
It looks like there is only one stack mapped to all the keys. I want a stack for each key.
Check the docs:
http://www.scala-lang.org/api/current/index.html#scala.collection.mutable.HashMap
The withDefaultValue method takes the value as a regular (by-value) parameter, so it is evaluated once and never recalculated; all your entries share the same mutable stack.
def withDefaultValue(d: B): Map[A, B]
You should use withDefault method instead.
val pos = new HashMap[Int, Stack[Int]] withDefault (_ => Stack.empty[Int])
Edit
The above solution doesn't seem to work; I get empty stacks. Checking the sources shows that the default value is returned but never put into the map:
override def apply(key: A): B = {
val result = findEntry(key)
if (result eq null) default(key)
else result.value
}
I guess one solution might be to override the apply or default method to add the entry to the map before returning it. Example for the default method:
val pos = new mutable.HashMap[Int, mutable.Stack[Int]]() {
override def default(key: Int) = {
val newValue = mutable.Stack.empty[Int]
this += key -> newValue
newValue
}
}
Btw, that is the punishment for being mutable; I encourage you to use immutable data structures.
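Another option, shown here only as a sketch, is to leave the map without a default and let getOrElseUpdate create and store a fresh stack the first time a key is seen. The sample array a is taken from the question; everything else is illustrative:

```scala
import scala.collection.mutable

val a = Array(4, 6, 6, 4, 6, 6)
val pos = mutable.HashMap.empty[Int, mutable.Stack[Int]]

for (i <- a.length - 1 to 0 by -1) {
  // Creates a new stack per key on first access and stores it in the map,
  // so later lookups for the same key reuse that stack.
  pos.getOrElseUpdate(a(i), mutable.Stack.empty[Int]).push(i)
}
```

After the loop only the keys 4 and 6 exist, and a lookup such as pos.get(0) returns None instead of handing back a shared stack.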
If you are looking for a more idiomatic Scala, functional-style solution without those mutable collections, consider this:
scala> val a = List(4, 6, 6, 4, 6, 6)
a: List[Int] = List(4, 6, 6, 4, 6, 6)
scala> val pos = a.zipWithIndex groupBy {_._1} mapValues { _ map (_._2) }
pos: scala.collection.immutable.Map[Int,List[Int]] = Map(4 -> List(0, 3), 6 -> List(1, 2, 4, 5))
It may look confusing at first, but if you break it down, zipWithIndex gets pairs of values and their positions, groupBy makes a map from each value to a list of entries, and mapValues is then used to turn the lists of (value, position) pairs into just lists of positions.
scala> val pairs = a.zipWithIndex
pairs: List[(Int, Int)] = List((4,0), (6,1), (6,2), (4,3), (6,4), (6,5))
scala> val pairsByValue = pairs groupBy (_._1)
pairsByValue: scala.collection.immutable.Map[Int,List[(Int, Int)]] = Map(4 -> List((4,0), (4,3)), 6 -> List((6,1), (6,2), (6,4), (6,5)))
scala> val pos = pairsByValue mapValues (_ map (_._2))
pos: scala.collection.immutable.Map[Int,List[Int]] = Map(4 -> List(0, 3), 6 -> List(1, 2, 4, 5))

Selecting multiple arbitrary columns from Scala array using map()

I'm new to Scala (and Spark). I'm trying to read in a csv file and extract multiple arbitrary columns from the data. The following function does this, but with hard-coded column indices:
def readCSV(filename: String, sc: SparkContext): RDD[String] = {
val input = sc.textFile(filename).map(line => line.split(","))
val out = input.map(csv => csv(2)+","+csv(4)+","+csv(15))
return out
}
Is there a way to use map with an arbitrary number of column indices passed to the function in an array?
If you have a sequence of indices, you could map over it and return the values:
scala> val m = List(List(1,2,3), List(4,5,6))
m: List[List[Int]] = List(List(1, 2, 3), List(4, 5, 6))
scala> val indices = List(0,2)
indices: List[Int] = List(0, 2)
// For each inner sequence, get the relevant values
// indices.map(inner) is the same as indices.map(i => inner(i))
scala> m.map(inner => indices.map(inner))
res1: List[List[Int]] = List(List(1, 3), List(4, 6))
// If you want to join all of them use .mkString
scala> m.map(inner => indices.map(inner).mkString(","))
res2: List[String] = List(1,3, 4,6) // that's actually a List containing 2 String
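Applied to the CSV case from the question, the idea might look like the sketch below. It uses plain Scala lists instead of an RDD so that it runs without Spark; selectColumns and the sample lines are hypothetical names and data, not part of the original code:

```scala
// Hypothetical helper: keep only the requested columns of one CSV line.
def selectColumns(line: String, indices: Seq[Int]): String = {
  val fields = line.split(",")
  // Look up each requested index and join the values back together.
  indices.map(i => fields(i)).mkString(",")
}

// Stand-in for the lines of a file.
val lines = List("a,b,c,d,e", "1,2,3,4,5")
val out = lines.map(line => selectColumns(line, Seq(2, 4)))
```

With an RDD[String], the same function could presumably be passed to map in the same way, for example sc.textFile(filename).map(line => selectColumns(line, indices)). Note that an index beyond the number of fields in a line throws an exception.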

How to select elements of collection based on another of different type?

I know I can do this:
scala> val a = List(1,2,3)
a: List[Int] = List(1, 2, 3)
scala> val b = List(2,4)
b: List[Int] = List(2, 4)
scala> a.filterNot(b.toSet)
res0: List[Int] = List(1, 3)
But I'd like to select elements of a collection based on their integer key, as in the following:
case class N (p: Int , q: Int)
val x = List(N(1,100), N(2,200), N(3,300))
val y = List(2,4)
val z = .... ?
z // want z to be List(N(1,100), N(3,300)) after removing the items of type N with 'p'
// matching any item in list y.
I know one way to do it is something like the following, which makes the above broken code work:
val z = x.filterNot(e => y.contains(e.p))
but this seems very inefficient. Is there a better way?
Just do
val ySet = y.toSet
val z = x.filterNot(e => ySet.contains(e.p))
That's linear.
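A self-contained version of the Set-based approach, as a minimal sketch using the data from the question (the name excluded is made up here):

```scala
case class N(p: Int, q: Int)

val x = List(N(1, 100), N(2, 200), N(3, 300))
val y = List(2, 4)

// Convert once; a Set[Int] is also an Int => Boolean, so it can be applied directly.
val excluded = y.toSet

// One pass over x with an effectively constant-time lookup per element.
val z = x.filterNot(e => excluded(e.p))
```

The conversion to a Set costs one pass over y, after which each membership test no longer scans the whole list.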
The problem with contains is that the search is linear, so you are looking at an O(N^2) solution (which is still OK if the dataset is not large).
Anyway, a simple solution is to use binary search to get an O(N log N) solution. You can easily convert y from a List to an Array (it must be sorted for binary search to work) and then use Java's binary search method.
scala> case class N(p: Int, q: Int)
defined class N
scala> val x = List(N(1, 100), N(2, 200), N(3, 300))
x: List[N] = List(N(1,100), N(2,200), N(3,300))
scala> val y = Array(2, 4) // Using Array directly.
y: Array[Int] = Array(2, 4)
scala> val z = x.filterNot(e => java.util.Arrays.binarySearch(y, e.p) >= 0)
z: List[N] = List(N(1,100), N(3,300))

Mutating a mutable collection using map?

If you have a mutable data structure like an Array, is it possible to use map operations or something similar to change its values?
Say I have val a = Array(5, 1, 3), what's the best way of, say, subtracting 1 from each value? The best I've come up with is
for(i <- 0 until a.size) { a(i) = a(i) - 1 }
I suppose one way would be to make the array a var rather than a val so I can say
a = a map (_-1)
edit: fairly easy to roll my own if there's nothing built-in, although I don't know how to generalise to other mutable collections
scala> implicit def ArrayToMutator[T](a: Array[T]) = new {
| def mutate(f: T => T) = a.indices.foreach {i => a(i) = f(a(i))}
| }
ArrayToMutator: [T](a: Array[T])java.lang.Object{def mutate(f: (T) => T): Unit}
scala> val a = Array(5, 1, 3)
a: Array[Int] = Array(5, 1, 3)
scala> a mutate (_-1)
scala> a
res16: Array[Int] = Array(4, 0, 2)
If you don't care about indices, you want the transform method.
scala> val a = Array(1,2,3)
a: Array[Int] = Array(1, 2, 3)
scala> a.transform(_+1)
res1: scala.collection.mutable.WrappedArray[Int] = WrappedArray(2, 3, 4)
scala> a
res2: Array[Int] = Array(2, 3, 4)
(It returns the wrapper around the array, not a copy of the data, so that transforms can be chained; the original array is modified, as you can see.)
How about using foreach, since you don't really want to return a modified collection but to perform a task for each element? For example:
a.indices.foreach{i => a(i) = a(i) - 1}
or
a.zipWithIndex.foreach{case (x,i) => a(i) = x - 1}
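On the question of generalising to other mutable collections, one possible sketch is a helper written against mutable.IndexedSeq. The name mutateInPlace is hypothetical and not a library method; on Scala 2.13 and later the standard library offers mapInPlace for the same purpose.

```scala
import scala.collection.mutable

// Hypothetical helper: overwrite every element of an indexed mutable sequence with f(element).
def mutateInPlace[T](xs: mutable.IndexedSeq[T])(f: T => T): Unit = {
  var i = 0
  while (i < xs.length) {
    xs(i) = f(xs(i))
    i += 1
  }
}

val buf = mutable.ArrayBuffer(5, 1, 3)
mutateInPlace(buf)(_ - 1)
```

After the call buf holds 4, 0, 2; no new collection is allocated, which is the point of mutating in place.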