I refactored my code to work with Kryo serialization.
Everything works fine except deserializing a geometry property of a certain class.
No exception is thrown (I set "spark.kryo.registrationRequired" to true).
While debugging I collect the data and see that the geometry is simply empty, so I conclude that deserialization failed.
The geometry is declared as Any (Scala), presumably because it is a complex property.
My question is why the data is empty, and whether this is connected to the property's type being Any.
Update :
class code:
class Entity(val id: String) extends Serializable {
  var index: Any = null
  var geometry: Any = null
}
geometry contains a centroid, a shape and coordinates (a complex object)
You should not use Kryo directly with Scala, since the behavior of many Scala classes differs from that of Java classes and Kryo was originally written to work with Java. You will probably encounter many weird issues like this one if you do. You should instead use chill-scala, an extension of Kryo that handles all of Scala's special cases.
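For what it's worth, a minimal sketch of wiring chill-scala's registrations into Spark's Kryo setup could look like this (the registrator name and the explicit Entity registration are my own assumptions, not something from the question):

import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.AllScalaRegistrar
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator that layers chill-scala's Scala-aware
// registrations (collections, Option, tuples, ...) on top of Spark's Kryo.
class ScalaAwareRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    new AllScalaRegistrar().apply(kryo)
    kryo.register(classOf[Entity]) // the question's class, registered explicitly
  }
}

// Wiring it into the SparkConf
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[ScalaAwareRegistrator].getName)
  .set("spark.kryo.registrationRequired", "true")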
As mentioned in the title, what is the main difference between using df.as[T] and df.asInstanceOf[Dataset[T]]?
First, asInstanceOf just tells the compiler to trust you that df is an instance of the Dataset class (the T part is irrelevant due to type erasure). At runtime you would get an exception if the value were not an instance of that class, but here it never throws, because a DataFrame already is a Dataset (of Row).
On the other hand, as is a method defined on the Dataset class, which asks for an implicit Encoder so it can safely cast the data; note that since the data is processed at runtime, the conversion may still fail.
So the difference is big and you should not use the former.
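For illustration, a small sketch of the difference (the Person class, the column names and the local master are made up for this example):

import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(name: String, age: Long) // hypothetical schema

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._ // provides the implicit Encoder[Person]

val df = Seq(("Alice", 30L), ("Bob", 25L)).toDF("name", "age")

val ds: Dataset[Person] = df.as[Person] // checked against the schema via the Encoder
// val bad = df.asInstanceOf[Dataset[Person]] // compiles and never throws here,
// but the rows stay untyped and bad.map(_.age) would fail at runtime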
I have the following case classes defined in my Flink application (Flink 1.10.1):
case class FilterDefinition(filterDefId: String, filter: TileFilter)
case class TileFilter(tiles: Seq[Long], zoomLevel: Int)
At runtime, I noticed a log message saying:
FilterDefinition cannot be used as a POJO type because not all fields are valid POJO fields, and must be processed as GenericType. Please read the Flink documentation on "Data Types & Serialization" for details of the effect on performance.
If I interpreted the Flink documentation correctly, Flink should be able to serialize Scala case classes without needing Kryo for them. However, for me the case class above falls back to the Kryo serializer.
Did I misinterpret how case classes are handled by Flink?
Excerpting here from the documentation:
Java and Scala classes are treated by Flink as a special POJO data type if they fulfill the following requirements:
The class must be public.
It must have a public constructor without arguments (default constructor).
All fields are either public or must be accessible through getter and setter functions. For a field called foo the getter and setter methods must be named getFoo() and setFoo().
The type of a field must be supported by a registered serializer.
In this case, it appears that Flink doesn't know how to serialize TileFilter (or more specifically, Seq[Long]).
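One way to check which path is actually taken, assuming the Scala DataStream API is on the classpath (the case classes are repeated from the question only to keep the snippet self-contained):

import org.apache.flink.streaming.api.scala._ // brings the createTypeInformation macro into scope

case class TileFilter(tiles: Seq[Long], zoomLevel: Int)
case class FilterDefinition(filterDefId: String, filter: TileFilter)

// If this prints a CaseClassTypeInfo, Flink's Scala case class serializer is used;
// a GenericTypeInfo here means the type falls back to Kryo.
val info = createTypeInformation[FilterDefinition]
println(info)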
I was debugging a deserialization issue with Jackson where Scala object instances seemed to be replaced. I managed to drill the issue down to this code:
object WaitWhat extends App {
  object XX
  val x1 = XX

  // Notice: no assignment!
  XX.getClass.getConstructor().newInstance()

  val x2 = XX

  println(x1)
  println(x2)
}
The output is:
WaitWhat$XX$#5315b42e
WaitWhat$XX$#2ef9b8bc
(Of course the actual hash codes change each run.)
IntelliJ's debugger also indicates that x1 and x2 really are different instances, despite the fact that the result of newInstance is completely ignored.
I would have expected a no-op, or an exception of some kind. How is it possible that the actual object instance gets replaced by this call?
Objects in Scala have a private constructor that can’t be called with new (since it’s private), but can still be called using reflection.
Under the hood, the object is accessed through a static MODULE$ field. This field holds the singleton instance, which is created internally by calling the private constructor.
As long as you access the object in your Scala or Java code through MODULE$, you will be fine. However, you can't be sure that some library won't create an additional instance of your object by calling the private constructor via reflection. Whenever the private constructor is called, a new instance of the object is created and assigned to MODULE$.
This can happen especially if you use Java libraries that are not aware of the existence of Scala objects.
Please check this article for more details.
Anyway, I would just create a custom deserializer for Jackson (similar to the solution described in the article).
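A minimal sketch of such a deserializer, using a hypothetical Status object and hand-written registration (none of these names come from the article):

import com.fasterxml.jackson.core.JsonParser
import com.fasterxml.jackson.databind.{DeserializationContext, JsonDeserializer, ObjectMapper}
import com.fasterxml.jackson.databind.module.SimpleModule

object Status // hypothetical Scala object that Jackson needs to read back

// Always return the existing singleton instead of letting Jackson
// invoke the private constructor through reflection.
class StatusDeserializer extends JsonDeserializer[Status.type] {
  override def deserialize(p: JsonParser, ctxt: DeserializationContext): Status.type = {
    p.skipChildren() // consume whatever JSON was written for the object
    Status           // hand back the MODULE$ instance
  }
}

val statusClass = Status.getClass.asInstanceOf[Class[Status.type]]
val mapper = new ObjectMapper()
mapper.registerModule(new SimpleModule().addDeserializer(statusClass, new StatusDeserializer))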
I have always seen that, when using a map function, we can create a DataFrame from an RDD using a case class like below:
case class filematches(
  row_num: Long,
  matches: Long,
  non_matches: Long,
  non_match_column_desc: Array[String]
)
newrdd1.map(x => filematches(x._1, x._2, x._3, x._4)).toDF()
This works great as we all know!!
I was wondering why we specifically need case classes here.
We should be able to achieve the same effect using normal classes with parameterized constructors (as their fields will be vals and not private):
class filematches1(
  val row_num: Long,
  val matches: Long,
  val non_matches: Long,
  val non_match_column_desc: Array[String]
)
newrdd1.map(x => new filematches1(x._1, x._2, x._3, x._4)).toDF
Here, I am using the new keyword to instantiate the class.
Running the above gives me the error:
error: value toDF is not a member of org.apache.spark.rdd.RDD[filematches1]
I am sure I am missing some key concept about case classes vs. regular classes here, but I have not been able to find it yet.
To resolve the error
value toDF is not a member of org.apache.spark.rdd.RDD[...]
you should move your case class definition out of the function where you are using it. You can refer to http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-Scala-Error-value-toDF-is-not-a-member-of-org-apache/td-p/29878 for more detail.
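A sketch of that layout, with hypothetical names and Spark's implicits imported inside the method:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// The case class lives at the top level, outside the method that calls toDF()
case class FileMatches(rowNum: Long, matches: Long, nonMatches: Long, desc: Array[String])

object Job {
  def run(spark: SparkSession, rdd: RDD[(Long, Long, Long, Array[String])]): DataFrame = {
    import spark.implicits._
    rdd.map(x => FileMatches(x._1, x._2, x._3, x._4)).toDF()
  }
}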
On your other query: case classes are syntactic sugar, and they provide the following additional things.
Case classes are different from regular classes. They are especially suited to creating immutable objects.
They have a default apply function which is used as a constructor to create objects (so less code).
All the variables in a case class are of val type by default, hence immutable, which is a good thing in the Spark world since RDDs are immutable as well.
An example of a case class is:
case class Book(name: String)
val book1 = Book("test")
You cannot change the value of book1.name as it is immutable, and you do not need to say new Book() to create the object here.
The class variables are public by default, so you don't need setters and getters.
Moreover, when comparing two case class objects, their structure is compared instead of their references.
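A short sketch contrasting the points above with a hypothetical PlainBook class:

case class Book(name: String)
class PlainBook(val name: String)

val a = Book("test")          // companion apply(), no `new` needed
val b = Book("test")
println(a == b)               // true: case classes compare by structure

val c = new PlainBook("test")
val d = new PlainBook("test")
println(c == d)               // false: plain classes compare by reference

// a.name = "other"           // does not compile: case class fields are vals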
Edit: Spark uses the following class to infer the schema.
Code link:
https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
If you check the schemaFor function (lines 719 to 791), it converts Scala types to Catalyst types. I think the case for handling non-case classes in schema inference has not been added yet, so every time you try to infer a schema for a non-case class it falls through to the catch-all case and hence gives the error Schema for type $other is not supported.
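A small probe of that path, assuming the catalyst ScalaReflection object is accessible from your code (the Book classes here are made up):

import org.apache.spark.sql.catalyst.ScalaReflection

case class CaseBook(name: String)
class RegularBook(val name: String)

println(ScalaReflection.schemaFor[CaseBook]) // a StructType with a single "name" field
// ScalaReflection.schemaFor[RegularBook]    // throws: Schema for type RegularBook is not supported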
Hope this helps
What is the difference between
scala.collection.immutable.List$SerializationProxy
and
scala.collection.immutable.List
in Scala 2.11?
List$SerializationProxy is a helper class used by List to implement the Serialization Proxy Pattern.
You can see some discussion about this in the source code, List.scala:415
// Create a proxy for Java serialization that allows us to avoid mutation
// during deserialization. This is the Serialization Proxy Pattern.
protected final def writeReplace(): AnyRef = new List.SerializationProxy(this)
As a normal user of Scala you do not need to use or interact with List$SerializationProxy; it is an implementation detail.
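For completeness, a round-trip through Java serialization shows that the proxy never surfaces to user code (just a sketch):

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

val original = List(1, 2, 3)

val bytes = new ByteArrayOutputStream()
val out = new ObjectOutputStream(bytes)
out.writeObject(original) // writeReplace() swaps in List.SerializationProxy here
out.close()

val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
val restored = in.readObject().asInstanceOf[List[Int]] // a plain List comes back, not the proxy
in.close()

println(restored == original) // true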