Spark fails while calling scala class method to comma split strings - scala

I have the following class in the Scala shell in Spark.
class StringSplit(val query: String) {
  def getStrSplit(rdd: RDD[String]): RDD[String] = {
    rdd.map(x => x.split(query))
  }
}
I am trying to call the method in this class like this:
val inputRDD=sc.parallelize(List("one","two","three"))
val strSplit=new StringSplit(",")
strSplit.getStrSplit(inputRDD)
This step fails with the error: getStrSplit is not a member of StringSplit.
Can you please let me know what is wrong with this?

It seems like a reasonable thing to do, but...
the result type for getStrSplit is wrong, because .split returns Array[String], not String
parallelizing List("one","two","three") stores the strings "one", "two" and "three", none of which contains a comma to split on.
Another way:
val input = sc.parallelize(List("1,2,3,4","5,6,7,8"))
input: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[16] at parallelize at <console>
The test input here is a list of two strings that each require some comma splitting to get to the data.
Parsing the input by splitting can be as easy as:
val parsedInput = input.map(_.split(","))
parsedInput: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[19] at map at <console>:25
Here _.split(",") is an anonymous function with a single parameter _, where Scala infers the types from the surrounding code rather than requiring them to be written out explicitly.
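Written out fully, the same call looks like this (a small sketch; parsedInputExplicit is just an illustrative name):
val parsedInputExplicit = input.map((line: String) => line.split(","))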
Notice the type is RDD[Array[String]], not RDD[String].
We could extract the 3rd element of each line with
parsedInput.map(_(2)).collect()
res27: Array[String] = Array(3, 7)
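As an aside (not part of the original answer), if an RDD[String] of individual fields is what the original getStrSplit signature intended, flatMap produces that directly by flattening each Array[String]:
val flattened = input.flatMap(_.split(","))
flattened.collect() // Array[String] = Array(1, 2, 3, 4, 5, 6, 7, 8)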
So how about the original question: doing the same operation in a class? I tried:
class StringSplit(query: String) {
  def get(rdd: RDD[String]) = rdd.map(_.split(query))
}
val ss = new StringSplit(",")
ss.get(input)
---> org.apache.spark.SparkException: Task not serializable
I'm guessing that occurs because the class is not serialized to each worker; rather, Spark tries to send the split function, but that function references a class parameter (query) that is not also sent.
scala> class commaSplitter {
def get(rdd:RDD[String])=rdd.map(_.split(","));
}
defined class commaSplitter
scala> val cs = new commaSplitter;
cs: commaSplitter = $iwC$$iwC$commaSplitter@262f1580
scala> cs.get(input);
res29: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[23] at map at <console>:10
scala> cs.get(input).collect()
res30: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))
This parameter-free class works.
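Another common workaround (a sketch, not from the original answer) is to copy the constructor parameter into a local val inside the method, so the closure shipped to the workers captures only that String rather than the whole non-serializable class:
class StringSplit(query: String) {
  def get(rdd: RDD[String]) = {
    val q = query // local copy; only q is captured by the closure
    rdd.map(_.split(q))
  }
}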
EDIT
You can tell Scala you want your class to be serializable by adding extends Serializable, like so:
scala> class stringSplitter(s:String) extends Serializable {
def get(rdd:RDD[String]) = rdd.map(_.split(s));
}
defined class stringSplitter
scala> val ss = new stringSplitter(",");
ss: stringSplitter = $iwC$$iwC$stringSplitter@2a33abcd
scala> ss.get(input)
res33: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[25] at map at <console>:10
scala> ss.get(input).collect()
res34: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))
and this works.
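Tying this back to the original question, a corrected StringSplit would look like the sketch below: it extends Serializable, and the return type is fixed to RDD[Array[String]], since split returns arrays.
class StringSplit(val query: String) extends Serializable {
  def getStrSplit(rdd: RDD[String]): RDD[Array[String]] = rdd.map(_.split(query))
}
val strSplit = new StringSplit(",")
strSplit.getStrSplit(input).collect() // Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))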

Related

How to convert an Array of Array containing Integer to a Scala Spark List/Seq?

I am new to Scala and Spark. I have an array-like string "[[1,2,100], [1, 2, 111]]" and I don't know how to convert it into a Scala List or Seq. I could not find a solution to it.
I tried to use circe's parse method, but it did not help me out.
val e = parse(json_string).getOrElse(Json.Null)
e.asArray.foreach(l => {
  println(l)
})
You can pull the number groups out with a regular expression instead:
val s = "[[1,2,100], [1, 2, 111]]"
val r = "\\[(?:\\d+,? *)+\\]".r
r.findAllMatchIn(s).map { m =>
  val group = m.toString
  group.substring(1, group.length - 1).split(", *").map(_.toInt)
}.toArray
For your example, it produces:
res26: Array[Array[Int]] = Array(Array(1, 2, 100), Array(1, 2, 111))
Not sure what you want to do with the result after extracting it, though.
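Since the question mentions circe and the input string is itself valid JSON, a circe-based alternative (a sketch, assuming circe-parser is on the classpath) could decode it directly into nested lists:
import io.circe.parser.decode
val raw = "[[1,2,100], [1, 2, 111]]"
// a Decoder[List[List[Int]]] is derived automatically for nested standard types
decode[List[List[Int]]](raw) match {
  case Right(lists) => println(lists) // List(List(1, 2, 100), List(1, 2, 111))
  case Left(err) => println(s"parse failed: $err")
}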

Scala type mismatch in map operation

I am trying a map operation on a Spark DStream in the code below:
val hashesInRecords: DStream[(RecordKey, Array[Int])] = records.map(record => {
  val hashes: List[Int] = calculateIndexing(record.fields())
  val ints: Array[Int] = hashes.toArray(Array.ofDim[Int](hashes.length))
  (new RecordKey(record.key, hashes.length), ints)
})
The code looks fine in IntelliJ, but when I try to build it I get an error which I don't really understand:
Error:(53, 61) type mismatch;
found : Array[Int]
required: scala.reflect.ClassTag[Int]
val ints: Array[Int] = hashes.toArray(Array.ofDim[Int](hashes.length))
This error remains even after I add the type in the map operation, like so:
records.map[(RecordKey, Array[Int])](record => {...
This should fix your problem; it also avoids calling List.length, which is O(N), and uses Array.length, which is O(1), instead.
val hashesInRecords: DStream[(RecordKey, Array[Int])] = records.map { record =>
  val ints = calculateIndexing(record.fields()).toArray
  (new RecordKey(record.key, ints.length), ints)
}
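For reference, a tiny sketch of why the original call failed: List.toArray takes only an implicit ClassTag parameter, so the Array.ofDim[Int](hashes.length) that was passed explicitly was being read as that ClassTag argument.
val hashes = List(1, 2, 3)
// simplified signature: def toArray[B >: A](implicit ct: scala.reflect.ClassTag[B]): Array[B]
val ok: Array[Int] = hashes.toArray // ClassTag[Int] supplied implicitly
val alsoOk: Array[Int] = hashes.toArray(scala.reflect.ClassTag.Int) // or passed explicitly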

spark implicit encoder not found in scope

I have a problem with Spark already outlined in spark custom kryo encoder not providing schema for UDF, but I have now created a minimal sample:
https://gist.github.com/geoHeil/dc9cfb8eca5c06fca01fc9fc03431b2f
class SomeOtherClass(foo: Int)
case class FooWithSomeOtherClass(a: Int, b: String, bar: SomeOtherClass)
case class FooWithoutOtherClass(a: Int, b: String, bar: Int)
case class Foo(a: Int)
implicit val someOtherClassEncoder: Encoder[SomeOtherClass] = Encoders.kryo[SomeOtherClass]
val df2 = Seq(FooWithSomeOtherClass(1, "one", new SomeOtherClass(4))).toDS
val df3 = Seq(FooWithoutOtherClass(1, "one", 1), FooWithoutOtherClass(2, "two", 2)).toDS
val df4 = df3.map(d => FooWithSomeOtherClass(d.a, d.b, new SomeOtherClass(d.bar)))
Here, even the createDataset statement fails due to:
java.lang.UnsupportedOperationException: No Encoder found for SomeOtherClass
- field (class: "SomeOtherClass", name: "bar")
- root class: "FooWithSomeOtherClass"
Why is the encoder not in scope or at least not in the right scope?
Also, trying to specify an explicit encoder like:
df3.map(d => {FooWithSomeOtherClass(d.a, d.b, new SomeOtherClass(d.bar))}, (Int, String, Encoders.kryo[SomeOtherClass]))
does not work.
This happens because you should use the Kryo encoder through the whole serialization stack, meaning that your top-level object should have a Kryo encoder. The following runs successfully on a local Spark shell (the change you are interested in is on the first line):
implicit val topLevelObjectEncoder: Encoder[FooWithSomeOtherClass] = Encoders.kryo[FooWithSomeOtherClass]
val df1 = Seq(Foo(1), Foo(2)).toDF
val df2 = Seq(FooWithSomeOtherClass(1, "one", new SomeOtherClass(4))).toDS
val df3 = Seq(FooWithoutOtherClass(1, "one", 1), FooWithoutOtherClass(2, "two", 2)).toDS
df3.printSchema
df3.show
val df4 = df3.map(d => FooWithSomeOtherClass(d.a, d.b, new SomeOtherClass(d.bar)))
df4.printSchema
df4.show
df4.collect
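As a side note, if you are free to change SomeOtherClass, an alternative sketch is to make it a case class: Spark can then derive a regular product encoder, no Kryo encoder is needed, and the Dataset keeps a readable nested schema instead of a single binary column.
case class SomeOtherClass(foo: Int)
case class FooWithSomeOtherClass(a: Int, b: String, bar: SomeOtherClass)
// with product encoders derived via spark.implicits._, toDS needs no explicit Encoder
val ds = Seq(FooWithSomeOtherClass(1, "one", SomeOtherClass(4))).toDS
ds.printSchema // shows columns a, b and a nested struct bar with field foo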

How to use ConvertToWritableTypes in scala spark?

I was looking at BasicSaveSequenceFile and tried to follow it in Scala.
So I had:
val input = Seq(("coffee", 1), ("coffee", 2), ("pandas", 3))
val inputRDD = sc.parallelize(input) // no parallelizePairs
but then when I try:
val outputRDD = inputRDD.map(new ConvertToWritableTypes()) // there is no mapToPair; how do I write this instead?
Right now I get:
Error:(29, 38) type mismatch;
found : SparkExampleWriteSeqLZO.ConvertToWritableTypes
required: ((String, Int)) => ?
val outputRDD = inputRDD.map(new ConvertToWritableTypes())
^
You're looking at the Java version; you really should be looking at the Scala version, as the APIs are fairly different. From the example given, you don't need mapToPair: you can just use a normal map without the static class:
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.Text
val input = Seq(("coffee", 1), ("coffee", 2), ("pandas", 3))
val inputRDD = sc.parallelize(input)
val outputRDD = inputRDD.map(record => (new Text(record._1), new IntWritable(record._2)))
You really don't need to do this though, as the Scala example I linked to shows you:
val data = sc.parallelize(List(("Holden", 3), ("Kay", 6), ("Snail", 2)))
data.saveAsSequenceFile(outputFile)
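For completeness, a sketch of reading the file back (assuming outputFile is the same path used above); the implicit WritableConverters turn Text and IntWritable back into String and Int:
val reloaded = sc.sequenceFile[String, Int](outputFile)
reloaded.collect() // e.g. Array((Holden,3), (Kay,6), (Snail,2))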

Collections code involving mutable.IndexedSeq, view, take, and grouped throws ClassCastException

The following Scala code compiles fine.
object Main extends App {
  import scala.collection.mutable.IndexedSeq
  def doIt() {
    val nums: IndexedSeq[Int] = Array(3, 5, 9, 11)
    val view: IndexedSeq[Int] = nums.view
    val half: IndexedSeq[Int] = view.take(2)
    val grouped: Iterator[IndexedSeq[Int]] = half.grouped(2)
    val firstPair: IndexedSeq[Int] = grouped.next() // throws exception here
  }
  doIt()
}
However, at runtime it throws java.lang.ClassCastException: scala.collection.SeqViewLike$$anon$1 cannot be cast to scala.collection.mutable.IndexedSeq on the call to grouped.next().
I would expect the call to grouped.next() to return something equal to IndexedSeq[Int](3,5)
I am wondering why is this code failing, and if there a proper way to fix it?
If I repeat the same steps in the REPL, the type information confirms why the code compiles, but does not give me any insight into why the exception was thrown:
scala> val nums = Array(3,5,9,11)
nums: Array[Int] = Array(3, 5, 9, 11)
scala> val view = nums.view
view: scala.collection.mutable.IndexedSeqView[Int,Array[Int]] = SeqView(...)
scala> val half = view.take(2)
half: scala.collection.mutable.IndexedSeqView[Int,Array[Int]] = SeqViewS(...)
scala> val grouped = half.grouped(2)
grouped: Iterator[scala.collection.mutable.IndexedSeqView[Int,Array[Int]]] = non-empty iterator
scala> val firstPair = grouped.next()
java.lang.ClassCastException: scala.collection.SeqViewLike$$anon$1 cannot be cast to scala.collection.mutable.IndexedSeqView
Scala version 2.10.0-20121205-235900-18481cef9b -- Copyright 2002-2012, LAMP/EPFL
Looks like you ran into bug SI-6709.
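One possible workaround (a sketch, not from the original answer, using the 2.10-era collection API) is to force the view back into a concrete collection before calling grouped, so no view class ever needs to be cast:
import scala.collection.mutable.IndexedSeq
val nums: IndexedSeq[Int] = Array(3, 5, 9, 11)
val half: IndexedSeq[Int] = nums.view.take(2).force // force materializes the sliced view
half.grouped(2).next() // IndexedSeq(3, 5), no ClassCastException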