Spark implicit encoder not found in scope - Scala

I have a problem with Spark, already outlined in "spark custom kryo encoder not providing schema for UDF", but I have now created a minimal sample:
https://gist.github.com/geoHeil/dc9cfb8eca5c06fca01fc9fc03431b2f
class SomeOtherClass(foo: Int)
case class FooWithSomeOtherClass(a: Int, b: String, bar: SomeOtherClass)
case class FooWithoutOtherClass(a: Int, b: String, bar: Int)
case class Foo(a: Int)
implicit val someOtherClassEncoder: Encoder[SomeOtherClass] = Encoders.kryo[SomeOtherClass]
val df2 = Seq(FooWithSomeOtherClass(1, "one", new SomeOtherClass(4))).toDS
val df3 = Seq(FooWithoutOtherClass(1, "one", 1), FooWithoutOtherClass(2, "two", 2)).toDS
val df4 = df3.map(d => FooWithSomeOtherClass(d.a, d.b, new SomeOtherClass(d.bar)))
Here, even the dataset-creation statement fails with:
java.lang.UnsupportedOperationException: No Encoder found for SomeOtherClass
- field (class: "SomeOtherClass", name: "bar")
- root class: "FooWithSomeOtherClass"
Why is the encoder not in scope or at least not in the right scope?
Also, trying to specify an explicit encoder like:
df3.map(d => {FooWithSomeOtherClass(d.a, d.b, new SomeOtherClass(d.bar))}, (Int, String, Encoders.kryo[SomeOtherClass]))
does not work.

This happens because the Kryo encoder needs to be used through the whole serialization stack, meaning that your top-level object should have a Kryo encoder. The following runs successfully in a local Spark shell (the change you are interested in is on the first line):
implicit val topLevelObjectEncoder: Encoder[FooWithSomeOtherClass] = Encoders.kryo[FooWithSomeOtherClass]
val df1 = Seq(Foo(1), Foo(2)).toDF
val df2 = Seq(FooWithSomeOtherClass(1, "one", new SomeOtherClass(4))).toDS
val df3 = Seq(FooWithoutOtherClass(1, "one", 1), FooWithoutOtherClass(2, "two", 2)).toDS
df3.printSchema
df3.show
val df4 = df3.map(d => FooWithSomeOtherClass(d.a, d.b, new SomeOtherClass(d.bar)))
df4.printSchema
df4.show
df4.collect
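Equivalently, the encoder can be passed to map explicitly instead of being declared implicit. Note that with a Kryo encoder the resulting dataset has a single binary value column, so printSchema will not show the individual fields. A minimal sketch:
val df4 = df3.map(d => FooWithSomeOtherClass(d.a, d.b, new SomeOtherClass(d.bar)))(Encoders.kryo[FooWithSomeOtherClass])
df4.printSchema // only shows: value: binary (nullable = true)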

Related

How do I write a Dataset encoder to support mapping a function to a org.apache.spark.sql.Dataset[String] in Scala Spark

Moving from Spark 1.6 to Spark 2.2* has brought the error “error: Unable to find encoder for type stored in a 'Dataset'. Primitive types (Int, String, etc)” when trying to apply a method to a dataset returned from querying a parquet table.
I have oversimplified my code to demonstrate the same error. The code queries a parquet file to return the following datatype:
'org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]'
I apply a function to extract a string and an integer, returning a string. Collecting the result gives the following datatype: Array[String]
Next, I need to perform extensive manipulations requiring a separate function. In this test function, I try to append a string, producing the same error as in my detailed example.
I have tried some encoder examples and the use of case classes but have not come up with a workable solution. Any suggestions or examples would be appreciated.
scala> var d1 = hive.executeQuery(st)
d1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cvdt35_message_id_d: string,
cvdt35_input_timestamp_s: decimal(16,5) ... 2 more fields]
val parseCVDP_parquet = (s: org.apache.spark.sql.Row) => s.getString(2).split("0x")(1) + "," + s.getDecimal(1)
scala> var d2 = d1.map(parseCVDP_parquet)
d2: org.apache.spark.sql.Dataset[String] = [value: string]
scala> d2.take(1)
20/03/25 19:01:08 WARN TaskSetManager: Stage 3 contains a task of very large size (131 KB). The
maximum recommended task size is 100 KB.
res10: Array[String] = Array(ab04006000504304,1522194407.95162)
scala> def dd(s:String){
| s + "some string"
| }
dd: (s: String)Unit
scala> var d3 = d2.map{s=> dd(s) }
<console>:47: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int,
String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support
for serializing other types will be added in future releases.
To distill the problem further, I believe this scenario (though I have not tried all possible solutions) can be simplified to the following code:
scala> var test = ( 1 to 3).map( _ => "just some words").toDS()
test: org.apache.spark.sql.Dataset[String] = [value: string]
scala> def f(s: String){
| s + "hi"
| }
f: (s: String)Unit
scala> var test2 = test.map{ s => f(s) }
<console>:42: error: Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are
supported by importing spark.implicits._ Support for serializing other types
will be added in future releases.
var test2 = test.map{ s => f(s) }
I have a solution, at least to my simplified problem (below).
I will be testing more...
scala> var test = ( 1 to 3).map( _ => "just some words").toDS()
test: org.apache.spark.sql.Dataset[String] = [value: string]
scala> def f(s: String): String = {
| val r = s + "hi"
| return r
| }
f: (s: String)String
scala> var test2 = test.rdd.map{ s => f(s) }
test2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at map at <console>:43
scala> test2.take(1)
res9: Array[String] = Array(just some wordshi)
The first solution does not work on my initial (production) data set; there it instead produces the error "org.apache.spark.SparkException: Task not serializable" (interestingly, both are stored as the same data type, org.apache.spark.sql.Dataset[String] = [value: string], which I believe to be related). I have included yet another solution for my test data set below that eliminates the initial encoder error and, as shown, actually works on my toy problem, but it also does not ramp up to the production data set. I am a bit confused as to exactly why my application is sidelined in the move from Spark 1.6 to 2.3, as I didn't have to make any special accommodations to my application for years and have run it successfully for calculations that most likely count in the trillions. Other explorations have included wrapping my method as Serializable, experimenting with the @transient keyword, leveraging "org.apache.spark.serializer.KryoSerializer", writing my methods as functions, and changing all vars to vals (following related posts on Stack Overflow).
scala> import spark.implicits._
import spark.implicits._
scala> var test = ( 1 to 3).map( _ => "just some words").toDS()
test: org.apache.spark.sql.Dataset[String] = [value: string]
scala> def f(s: String): String = {
| val r = s + "hi"
| return r
| }
f: (s: String)String
scala> var d2 = test.map{s => f(s)}(Encoders.STRING)
d2: org.apache.spark.sql.Dataset[String] = [value: string]
scala> d2.take(1)
res0: Array[String] = Array(just some wordshi)
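For what it's worth, a likely root cause of the failing snippets above is that def dd(s: String){ ... } and def f(s: String){ ... } use procedure syntax and therefore return Unit, so test.map(f) becomes a Dataset[Unit], for which spark.implicits._ provides no encoder. Once the function is declared to return String, the implicit Encoder[String] should be enough on its own; a sketch (not tested against the production data set):
import spark.implicits._
def f(s: String): String = s + "hi" // returns String, not Unit
val test = (1 to 3).map(_ => "just some words").toDS()
val test2 = test.map(f) // Encoder[String] is resolved from spark.implicits._
test2.take(1) // Array(just some wordshi)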

How to convert an RDD to another RDD using a case class?

I have an RDD named other_nodes, as below:
(4,(1,true))
(22,(1,true))
(14,(1,true))
(3,(1,true))
(8,(1,true))
(18,(1,true))
I wrote a case class as below and applied it to a graph, and it gave the result I wanted:
case class nodes_properties(label:Int, isVisited:Boolean=false)
When I apply the case class on a graph, its result looks like this:
(1,nodes_properties(15,false))
(2,nodes_properties(11,false))
(3,nodes_properties(9,false))
Problem: how can I apply the case class I have defined to the other_nodes RDD, to get a result like the one below:
(4,nodes_properties(1,true))
(22,nodes_properties(1,true))
(14,nodes_properties(1,true))
(3,nodes_properties(1,true))
(8,nodes_properties(1,true))
(18,nodes_properties(1,true))
This solution might work:
scala> val data = sc.parallelize(Seq((4,(1, true)),(22,(1,true))))
data: org.apache.spark.rdd.RDD[(Int, (Int, Boolean))] = ParallelCollectionRDD[72] at parallelize at <console>:39
scala> data.take(2)
res27: Array[(Int, (Int, Boolean))] = Array((4,(1,true)), (22,(1,true)))
scala> val data1 = data.map(elem => (elem._1, nodes_properties(elem._2._1, elem._2._2)))
data1: org.apache.spark.rdd.RDD[(Int, nodes_properties)] = MapPartitionsRDD[73] at map at <console>:42
scala> data1.take(2)
res28: Array[(Int, nodes_properties)] = Array((4,nodes_properties(1,true)), (22,nodes_properties(1,true)))
EDIT
The problem is that each element in others_rdd is of type (VertexId, Any). You need to convert it to type (VertexId, (Int, Boolean)) for your case class to apply. The way to do this is:
val newRdd = others_rdd.map(elem => (elem._1, elem._2.asInstanceOf[(Int,Boolean)]))
After performing this, you can apply the solution shown above by mapping to the nodes_properties class.
Let me know if it helps!!
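If the elements are already (Int, (Int, Boolean)) tuples, a pattern match can replace the tuple accessors, and for the (VertexId, Any) case you can bind the inner types in the pattern instead of calling asInstanceOf. A sketch of both (either version throws a MatchError for elements that don't have that shape):
val data1 = data.map { case (id, (label, visited)) => (id, nodes_properties(label, visited)) }
val newRdd = others_rdd.map { case (id, (label: Int, visited: Boolean)) => (id, nodes_properties(label, visited)) }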

Scala deserialize JSON to Collection

My JSON file contains the details below:
{
"category":"age, gender,post_code"
}
My Scala code is below:
val filename = args.head
println(s"Reading ${args.head} ...")
val json = Source.fromFile(filename)
val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
val parsedJson = mapper.readValue[Map[String, Any]](json.reader())
val data = parsedJson.get("category").toSeq
It returns Seq[Any], for example List(age, gender,post_code), but I need Seq[String] output. If anyone has an idea about this, please help me.
The idea in Scala is to be typesafe whenever possible, which you give away by using Map[String, Any].
So, I recommend using a data class that represents your JSON data.
For example, define a mapper:
scala> import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper
scala> import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
scala> import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.DefaultScalaModule
scala> val mapper = new ObjectMapper() with ScalaObjectMapper
mapper: com.fasterxml.jackson.databind.ObjectMapper with com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper = $anon$1#d486a4d
scala> mapper.registerModule(DefaultScalaModule)
res0: com.fasterxml.jackson.databind.ObjectMapper = $anon$1#d486a4d
Now, when you deserialize to Map[K, V], you cannot specify all the nested data structures:
scala> val jsonString = """{"category": ["metal", "metalcore"], "age": 10, "gender": "M", "postCode": "98109"}"""
jsonString: String = {"category": ["metal", "metalcore"], "age": 10, "gender": "M", "postCode": "98109"}
scala> mapper.readValue[Map[String, Any]](jsonString)
res2: Map[String,Any] = Map(category -> List(metal, metalcore), age -> 10, gender -> M, postCode -> 98109)
The following is a solution that casts the key to the desired data structure, but I personally do not recommend it:
scala> mapper.readValue[Map[String, Any]](jsonString).get("category").map(_.asInstanceOf[List[String]]).getOrElse(List.empty[String])
res3: List[String] = List(metal, metalcore)
The best solution is to define a data class, which I'm calling SomeData in the following example, and deserialize to it. SomeData is defined based on your JSON data structure.
scala> final case class SomeData(category: List[String], age: Int, gender: String, postCode: String)
defined class SomeData
scala> mapper.readValue[SomeData](jsonString)
res4: SomeData = SomeData(List(metal, metalcore),10,M,98109)
scala> mapper.readValue[SomeData](jsonString).category
res5: List[String] = List(metal, metalcore)
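Applied to the JSON in the question, where category is a single comma-separated string rather than an array, the same idea could look like the sketch below (Category and categoryJson are names made up here); splitting on commas then gives the Seq[String] you asked for:
final case class Category(category: String)
val categoryJson = """{"category":"age, gender,post_code"}"""
val data: Seq[String] = mapper.readValue[Category](categoryJson).category.split(",").map(_.trim).toSeq
// Seq(age, gender, post_code)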
Just read the JSON as a JsonNode, and access the property directly:
val jsonNode = objectMapper.readTree(json.reader())
val parsedJson = jsonNode.get("category").asText
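Since asText returns the whole "age, gender,post_code" string, one extra split is needed to end up with a Seq[String] (categories is just an illustrative name):
val categories: Seq[String] = parsedJson.split(",").map(_.trim).toSeq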
By using a generic Scala function for converting a JSON string to a case class or object, you can deserialize to anything you want, such as:
JSON to a collection,
JSON to a case class, and
JSON to a case class with an object as a field.
Please find a working and detailed answer, which I have provided using generics, here.

Array[Byte] Spark RDD to String Spark RDD

I'm using Cloudera's SparkOnHBase module in order to get data from HBase.
I get an RDD in this way:
var getRdd = hbaseContext.hbaseRDD("kbdp:detalle_feedback", scan)
Based on that, what I get is an object of type
RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])]
which corresponds to a row key and a list of values, all of them represented by byte arrays.
If I save getRdd to a file, what I see is:
([B#f7e2590,[([B#22d418e2,[B#12adaf4b,[B#48cf6e81), ([B#2a5ffc7f,[B#3ba0b95,[B#2b4e651c), ([B#27d0277a,[B#52cfcf01,[B#491f7520), ([B#3042ad61,[B#6984d407,[B#f7c4db0), ([B#29d065c1,[B#30c87759,[B#39138d14), ([B#32933952,[B#5f98506e,[B#8c896ca), ([B#2923ac47,[B#65037e6a,[B#486094f5), ([B#3cd385f2,[B#62fef210,[B#4fc62b36), ([B#5b3f0f24,[B#8fb3349,[B#23e4023a), ([B#4e4e403e,[B#735bce9b,[B#10595d48), ([B#5afb2a5a,[B#1f99a960,[B#213eedd5), ([B#2a704c00,[B#328da9c4,[B#72849cc9), ([B#60518adb,[B#9736144,[B#75f6bc34)])
for each record (rowKey and the columns)
But what I need is the String representation of each of the keys and values, or at least the values, in order to save it to a file and see something like
key1,(value1,value2...)
or something like
key1,value1,value2...
I'm completely new to Spark and Scala, and it's been quite hard to get anything working.
Could you please help me with that?
First, let's create some sample data:
scala> val d = List( ("ab" -> List(("qw", "er", "ty")) ), ("cd" -> List(("ac", "bn", "afad")) ) )
d: List[(String, List[(String, String, String)])] = List((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))
This is how the data is:
scala> d foreach println
(ab,List((qw,er,ty)))
(cd,List((ac,bn,afad)))
Convert it to Array[Byte] format
scala> val arrData = d.map { case (k,v) => k.getBytes() -> v.map { case (a,b,c) => (a.getBytes(), b.getBytes(), c.getBytes()) } }
arrData: List[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = List((Array(97, 98),List((Array(113, 119),Array(101, 114),Array(116, 121)))), (Array(99, 100),List((Array(97, 99),Array(98, 110),Array(97, 102, 97, 100)))))
Create an RDD out of this data
scala> val rdd1 = sc.parallelize(arrData)
rdd1: org.apache.spark.rdd.RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = ParallelCollectionRDD[0] at parallelize at <console>:25
Create a conversion function from Array[Byte] to String:
scala> def b2s(a: Array[Byte]): String = new String(a)
b2s: (a: Array[Byte])String
Perform our final conversion:
scala> val rdd2 = rdd1.map { case (k,v) => b2s(k) -> v.map{ case (a,b,c) => (b2s(a), b2s(b), b2s(c)) } }
rdd2: org.apache.spark.rdd.RDD[(String, List[(String, String, String)])] = MapPartitionsRDD[1] at map at <console>:29
scala> rdd2.collect()
res2: Array[(String, List[(String, String, String)])] = Array((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))
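To get the key1,value1,value2... layout from the question, one more map before saving should do it (the output path below is a placeholder):
val lines = rdd2.map { case (key, values) =>
  (key +: values.flatMap { case (a, b, c) => List(a, b, c) }).mkString(",")
}
lines.collect() // Array(ab,qw,er,ty, cd,ac,bn,afad)
// lines.saveAsTextFile("/path/to/output")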
I don't know about HBase but if those Array[Byte]s are Unicode strings, something like this should work:
rdd: RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = *whatever*
rdd.map { case (k, l) =>
  (new String(k),
   l.map { case (a, b, c) =>
     (new String(a), new String(b), new String(c))
   })
}
Sorry for bad styling and whatnot, I am not even sure it will work.

Spark fails while calling a Scala class method to comma-split strings

I have the following class in the Scala shell in Spark.
class StringSplit(val query:String)
{
def getStrSplit(rdd:RDD[String]):RDD[String]={
rdd.map(x=>x.split(query))
}
}
I am trying to call the method in this class like
val inputRDD=sc.parallelize(List("one","two","three"))
val strSplit=new StringSplit(",")
strSplit.getStrSplit(inputRDD)
-> This step fails with the error: getStrSplit is not a member of StringSplit.
Can you please let me know what is wrong with this?
It seems like a reasonable thing to do, but...
the result type for getStrSplit is wrong because .split returns Array[String] not String
parallelizing List("one","two","three") results in "one", "two" and "three" being stored, and there are no strings needing a comma split.
Another way:
val input = sc.parallelize(List("1,2,3,4","5,6,7,8"))
input: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[16] at parallelize at <console>
The test input here is a list of two strings that each require some comma splitting to get to the data.
To parse input by splitting can be as easy as:
val parsedInput = input.map(_.split(","))
parsedInput: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[19] at map at <console>:25
Here _.split(",") is an anonymous function with one parameter _, where Scala infers the types from the other calls rather than the types being explicitly defined.
Notice the type is RDD[Array[String]] not RDD[String]
We could extract the 3rd element of each line with
parsedInput.map(_(2)).collect()
res27: Array[String] = Array(3, 7)
So how about the original question: doing the same operation in a class? I tried:
class StringSplit(query:String){
def get(rdd:RDD[String]) = rdd.map(_.split(query));
}
val ss = new StringSplit(",");
ss.get(input);
---> org.apache.spark.SparkException: Task not serializable
I'm guessing that occurs because the class is not serialized to each worker; rather, Spark tries to send the split function, but that function references a class parameter that cannot be sent along with it.
scala> class commaSplitter {
def get(rdd:RDD[String])=rdd.map(_.split(","));
}
defined class commaSplitter
scala> val cs = new commaSplitter;
cs: commaSplitter = $iwC$$iwC$commaSplitter#262f1580
scala> cs.get(input);
res29: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[23] at map at <console>:10
scala> cs.get(input).collect()
res30: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))
This parameter-free class works.
EDIT
You can tell Scala you want your class to be serializable by adding extends Serializable, like so:
scala> class stringSplitter(s:String) extends Serializable {
def get(rdd:RDD[String]) = rdd.map(_.split(s));
}
defined class stringSplitter
scala> val ss = new stringSplitter(",");
ss: stringSplitter = $iwC$$iwC$stringSplitter#2a33abcd
scala> ss.get(input)
res33: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[25] at map at <console>:10
scala> ss.get(input).collect()
res34: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))
and this works.
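An alternative to extends Serializable, if you would rather not ship the whole object, is to copy the constructor parameter into a local val inside the method, so the closure captures only that value and not this. A sketch of the same idea:
class StringSplit(query: String) {
  def getStrSplit(rdd: RDD[String]): RDD[Array[String]] = {
    val q = query // local copy, so the closure does not capture `this`
    rdd.map(_.split(q))
  }
}
val ss = new StringSplit(",")
ss.getStrSplit(input).collect() // Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))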