Collections code involving mutable.IndexedSeq, view, take, and grouped throws ClassCastException

The following Scala code compiles fine:
object Main extends App {
  import scala.collection.mutable.IndexedSeq

  def doIt() {
    val nums: IndexedSeq[Int] = Array(3, 5, 9, 11)
    val view: IndexedSeq[Int] = nums.view
    val half: IndexedSeq[Int] = view.take(2)
    val grouped: Iterator[IndexedSeq[Int]] = half.grouped(2)
    val firstPair: IndexedSeq[Int] = grouped.next() // throws exception here
  }

  doIt()
}
However, at runtime it throws
java.lang.ClassCastException: scala.collection.SeqViewLike$$anon$1 cannot be cast to scala.collection.mutable.IndexedSeq
on the call to grouped.next().
I would expect the call to grouped.next() to return something equal to IndexedSeq[Int](3,5)
I am wondering why this code fails, and whether there is a proper way to fix it.
If I repeat the same steps in the REPL, the type information confirms why the code compiles, but does not give me any insight into why the exception was thrown:
scala> val nums = Array(3,5,9,11)
nums: Array[Int] = Array(3, 5, 9, 11)
scala> val view = nums.view
view: scala.collection.mutable.IndexedSeqView[Int,Array[Int]] = SeqView(...)
scala> val half = view.take(2)
half: scala.collection.mutable.IndexedSeqView[Int,Array[Int]] = SeqViewS(...)
scala> val grouped = half.grouped(2)
grouped: Iterator[scala.collection.mutable.IndexedSeqView[Int,Array[Int]]] = non-empty iterator
scala> val firstPair = grouped.next()
java.lang.ClassCastException: scala.collection.SeqViewLike$$anon$1 cannot be cast to scala.collection.mutable.IndexedSeqView
Scala version 2.10.0-20121205-235900-18481cef9b -- Copyright 2002-2012, LAMP/EPFL

It looks like you ran into bug SI-6709.
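Until that bug is fixed, one way to sidestep it is to avoid calling grouped on the view itself. A minimal sketch of two alternatives, assuming the same Array-backed mutable.IndexedSeq as above:

import scala.collection.mutable.IndexedSeq

val nums: IndexedSeq[Int] = Array(3, 5, 9, 11)

// Alternative 1: skip the view and work on the concrete collection.
val firstPair = nums.take(2).grouped(2).next()                    // IndexedSeq(3, 5)

// Alternative 2: keep the view for laziness, but materialize it with
// force before grouping, so grouped never runs on the view.
val firstPairFromView = nums.view.take(2).force.grouped(2).next()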

Related

Scala type mismatch in map operation

I am trying a map operation on a Spark DStream in the code below:
val hashesInRecords: DStream[(RecordKey, Array[Int])] = records.map(record => {
  val hashes: List[Int] = calculateIndexing(record.fields())
  val ints: Array[Int] = hashes.toArray(Array.ofDim[Int](hashes.length))
  (new RecordKey(record.key, hashes.length), ints)
})
The code looks fine in IntelliJ, but when I try to build it I get an error that I don't really understand:
Error:(53, 61) type mismatch;
found : Array[Int]
required: scala.reflect.ClassTag[Int]
val ints: Array[Int] = hashes.toArray(Array.ofDim[Int](hashes.length))
This error remains even after I add the type to the map operation, like so:
records.map[(RecordKey, Array[Int])](record => {...
The error occurs because Scala's toArray takes an implicit scala.reflect.ClassTag, not a pre-allocated array like Java's Collection.toArray, so the Array[Int] you pass is interpreted as that ClassTag argument. The version below should fix your problem; it also avoids the call to List.length, which is O(N), and uses Array.length instead, which is O(1).
val hashesInRecords: DStream[(RecordKey, Array[Int])] = records.map { record =>
  val ints = calculateIndexing(record.fields()).toArray
  (new RecordKey(record.key, ints.length), ints)
}
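For illustration, a minimal sketch of the same conversion outside Spark (plain List, no DStream or RecordKey assumed):

val hashes: List[Int] = List(3, 1, 4, 1, 5)

// The compiler supplies the ClassTag[Int] implicitly, so no array argument is needed.
val ints: Array[Int] = hashes.toArray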

Scala Guava type mismatch issue

I am trying to implement a simple use case using Guava caching, but I am facing some issues, as shown below:
case class Person(x:Int, y:String)
val db = Map(1 -> Person(1,"A"), 2 -> Person(2,"B"), 3 -> Person(3,"C"))
val loader: CacheLoader[Int, Person] = new CacheLoader[Int, Person]() {
  def load(key: Int): Person = {
    db(key)
  }
}
lazy val someData = CacheBuilder.newBuilder().expireAfterWrite(60, MINUTES).maximumSize(10).build(loader)
someData.get(3)
The error I am getting is related to types, and I am not able to figure it out:
scala> someData.get(3)
<console>:24: error: type mismatch;
found : Int(3)
required: Int
someData.get(3)
Can someone advise on what the issue might be?
That's a common issue with Java's use-site covariance annotations.
This works with Scala 2.12.4 and Guava 24.1:
import com.google.common.cache._
import java.util.concurrent.TimeUnit._
object GuavaCacheBuilderTypeProblem {
  case class Person(x: Int, y: String)

  val db = Map(1 -> Person(1,"A"), 2 -> Person(2,"B"), 3 -> Person(3,"C"))

  val loader: CacheLoader[java.lang.Integer, Person] =
    new CacheLoader[java.lang.Integer, Person]() {
      def load(key: java.lang.Integer): Person = {
        db(key)
      }
    }

  lazy val someData = CacheBuilder
    .newBuilder()
    .expireAfterWrite(60, MINUTES)
    .maximumSize(10)
    .build[java.lang.Integer, Person](loader)

  someData.get(3)
}
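As a side note, someData.get(3) compiles here because Scala boxes the Int literal to java.lang.Integer via the implicit conversion in Predef; if you prefer to make the boxing explicit, a small sketch:

val person: Person = someData.get(Int.box(3)) // explicit Int -> java.lang.Integer boxing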
Answers with similar errors:
compiler error when using Google guava from scala code

Need workaround for scala breeze matrix slicing and vector indexing

Because of the odd behaviour in method foo I cannot write methods like bar,
which I need:
import breeze.linalg.DenseMatrix
import breeze.linalg.DenseVector
class Test {
  val dim = 3
  val X: DenseMatrix[Double] = DenseMatrix.rand(dim, dim)
  val u: DenseVector[Double] = DenseVector.fill(dim){1.0}

  def foo: Unit = {
    val i = 0
    val row_i: DenseVector[Double] = X(i,::).t // OK
    val s = u(i) + u(i)                        // OK
    val j: Integer = 0
    val row_j: DenseVector[Double] = X(j,::).t // does not compile (A)
    val a = u(j) + u(j)                        // does not compile (B)
  }

  def bar(i: Integer): Double = u(i) + u(i)    // does not compile (C)
}
Is there a workaround?
Thanks in advance for all replies.
Compilation errors:
(A) could not find implicit value for parameter canSlice:
breeze.linalg.support.CanSlice2[breeze.linalg.DenseMatrix[Double],Integer,collection.immutable.::.type,Result]
not enough arguments for method apply: (implicit canSlice:
breeze.linalg.support.CanSlice2[breeze.linalg.DenseMatrix[Double],Integer,collection.immutable.::.type,Result])
Result in trait TensorLike. Unspecified value parameter canSlice.
(B), (C)
could not find implicit value for parameter canSlice:
breeze.linalg.support.CanSlice[breeze.linalg.DenseVector[Double],Integer,Result]
not enough arguments for method apply: (implicit canSlice: breeze.linalg.support.CanSlice[breeze.linalg.DenseVector[Double],Integer,Result])Result
in trait TensorLike. Unspecified value parameter canSlice.
First off: convert your Integer to Int. That takes care of at least the first compilation error. The following update to your code does compile:
import breeze.linalg.DenseMatrix
import breeze.linalg.DenseVector
class Test {
  val dim = 3
  val X: DenseMatrix[Double] = DenseMatrix.rand(dim, dim)
  val u: DenseVector[Double] = DenseVector.fill(dim){1.0}

  def foo: Unit = {
    val i = 0
    val row_i: DenseVector[Double] = X(i,::).t // OK
    val s = u(i) + u(i)                        // OK
    val j: Int = 0
    val row_j: DenseVector[Double] = X(j,::).t // now compiles (A)
    val a = u(j) + u(j)                        // now compiles (B)
  }

  def bar(i: Int): Double = u(i) + u(i)        // now compiles (C)
}
From the REPL:
// Exiting paste mode, now interpreting.
import breeze.linalg.DenseMatrix
import breeze.linalg.DenseVector
defined class Test
So your other errors also disappear for me (Scala 2.11.8). Let me know if you have further issues.
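If bar really has to keep a java.lang.Integer parameter (for example, because it is called from Java code), a minimal sketch of a workaround is to unbox inside the method, since Breeze's indexing implicits are defined for Int:

def bar(i: Integer): Double = {
  val idx: Int = i.intValue // unbox java.lang.Integer to a Scala Int
  u(idx) + u(idx)
}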

Intersection in Spark gives odd results [duplicate]

Why does pattern matching in Spark not work the same as in Scala? See the example below: function f() tries to pattern match on the class, which works in the Scala REPL but fails in Spark and results in all "???". f2() is a workaround that gets the desired result in Spark using .isInstanceOf(), but I understand that to be bad form in Scala.
Any help on pattern matching the correct way in this scenario in Spark would be greatly appreciated.
abstract class a extends Serializable {val a: Int}
case class b(a: Int) extends a
case class bNull(a: Int=0) extends a
val x: List[a] = List(b(0), b(1), bNull())
val xRdd = sc.parallelize(x)
attempt at pattern matching which works in Scala REPL but fails in Spark
def f(x: a) = x match {
  case b(n) => "b"
  case bNull(n) => "bnull"
  case _ => "???"
}
workaround that functions in Spark, but is bad form (I think)
def f2(x: a) = {
  if (x.isInstanceOf[b]) {
    "b"
  } else if (x.isInstanceOf[bNull]) {
    "bnull"
  } else {
    "???"
  }
}
View results
xRdd.map(f).collect //does not work in Spark
// result: Array("???", "???", "???")
xRdd.map(f2).collect // works in Spark
// result: Array("b", "b", "bnull")
x.map(f(_)) // works in Scala REPL
// result: List("b", "b", "bnull")
Versions used...
Spark results run in spark-shell (Spark 1.6 on AWS EMR-4.3)
Scala REPL in SBT 0.13.9 (Scala 2.10.5)
This is a known issue with the Spark REPL. You can find more details in SPARK-2620. It affects multiple operations in the Spark REPL, including most transformations on PairwiseRDDs. For example:
case class Foo(x: Int)
val foos = Seq(Foo(1), Foo(1), Foo(2), Foo(2))
foos.distinct.size
// Int = 2
val foosRdd = sc.parallelize(foos, 4)
foosRdd.distinct.count
// Long = 4
foosRdd.map((_, 1)).reduceByKey(_ + _).collect
// Array[(Foo, Int)] = Array((Foo(1),1), (Foo(1),1), (Foo(2),1), (Foo(2),1))
foosRdd.first == foos.head
// Boolean = false
Foo.unapply(foosRdd.first) == Foo.unapply(foos.head)
// Boolean = true
What makes it even worse is that the results depend on the data distribution:
sc.parallelize(foos, 1).distinct.count
// Long = 2
sc.parallelize(foos, 1).map((_, 1)).reduceByKey(_ + _).collect
// Array[(Foo, Int)] = Array((Foo(2),2), (Foo(1),2))
The simplest thing you can do is to define and package the required case classes outside the REPL. Any code submitted directly using spark-submit should work as well.
In Scala 2.11+ you can create a package directly in the REPL with :paste -raw.
scala> :paste -raw
// Entering paste mode (ctrl-D to finish)
package bar
case class Bar(x: Int)
// Exiting paste mode, now interpreting.
scala> import bar.Bar
import bar.Bar
scala> sc.parallelize(Seq(Bar(1), Bar(1), Bar(2), Bar(2))).distinct.collect
res1: Array[bar.Bar] = Array(Bar(1), Bar(2))
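Applied to the classes from the question, the same trick would mean moving the hierarchy into a real package via :paste -raw before defining f; a sketch (the package name model is just an example):

// contents of the :paste -raw block
package model

abstract class a extends Serializable { val a: Int }
case class b(a: Int) extends a
case class bNull(a: Int = 0) extends a

After import model._, the original f and xRdd.map(f).collect should then behave the same way as in the plain Scala REPL.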

Spark fails while calling scala class method to comma split strings

I have the following class in the Scala shell in Spark.
class StringSplit(val query: String) {
  def getStrSplit(rdd: RDD[String]): RDD[String] = {
    rdd.map(x => x.split(query))
  }
}
I am trying to call the method in this class like this:
val inputRDD=sc.parallelize(List("one","two","three"))
val strSplit=new StringSplit(",")
strSplit.getStrSplit(inputRDD)
This step fails with the error: getStrSplit is not a member of StringSplit.
Can you please let me know what is wrong with this?
It seems like a reasonable thing to do, but...
the result type for getStrSplit is wrong because .split returns Array[String] not String
parallelizing List("one","two","three") results in "one", "two" and "three" being stored, and there are no strings needing a comma split.
Another way:
val input = sc.parallelize(List("1,2,3,4","5,6,7,8"))
input: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[16] at parallelize at <console>
The test input here is a list of two strings that each require some comma splitting to get to the data.
To parse input by splitting can be as easy as:
val parsedInput = input.map(_.split(","))
parsedInput: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[19] at map at <console>:25
Here _.split(",") is an anonymous function with a single parameter (the underscore), where Scala infers the types from the surrounding calls rather than requiring them to be explicitly declared.
Notice the type is RDD[Array[String]] not RDD[String]
We could extract the 3rd element of each line with
parsedInput.map(_(2)).collect()
res27: Array[String] = Array(3, 7)
So how about the original question: doing the same operation in a class. I tried:
class StringSplit(query: String) {
  def get(rdd: RDD[String]) = rdd.map(_.split(query))
}
val ss = new StringSplit(",")
ss.get(input)
---> org.apache.spark.SparkException: Task not serializable
I'm guessing that occurs because the class is not serialized to each worker; rather, Spark tries to send the split function, but it has a parameter (the query string from the enclosing class) that is not also sent.
scala> class commaSplitter {
def get(rdd:RDD[String])=rdd.map(_.split(","));
}
defined class commaSplitter
scala> val cs = new commaSplitter;
cs: commaSplitter = $iwC$$iwC$commaSplitter@262f1580
scala> cs.get(input);
res29: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[23] at map at <console>:10
scala> cs.get(input).collect()
res30: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))
This parameter-free class works.
EDIT
You can tell Scala that you want your class to be serializable by extending Serializable, like so:
scala> class stringSplitter(s:String) extends Serializable {
def get(rdd:RDD[String]) = rdd.map(_.split(s));
}
defined class stringSplitter
scala> val ss = new stringSplitter(",");
ss: stringSplitter = $iwC$$iwC$stringSplitter@2a33abcd
scala> ss.get(input)
res33: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[25] at map at <console>:10
scala> ss.get(input).collect()
res34: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))
and this works.
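Finally, if the goal from the original question is a method that really returns RDD[String] (one element per token) rather than RDD[Array[String]], a flatMap-based sketch along the same lines (assuming the same org.apache.spark.rdd.RDD import and the input RDD defined above):

class StringFlatSplitter(sep: String) extends Serializable {
  // flatMap flattens each Array[String] produced by split into individual elements,
  // so the result is RDD[String] rather than RDD[Array[String]].
  def get(rdd: RDD[String]): RDD[String] = rdd.flatMap(_.split(sep))
}

val flat = new StringFlatSplitter(",").get(input)
// flat.collect() should yield Array(1, 2, 3, 4, 5, 6, 7, 8) for the input above.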