Reference a Java nested class in Spark Scala

I'm trying to read some data from hadoop into an RDD in Spark using the interactive Scala shell but I'm having trouble accessing some of the classes I need to deserialise the data.
I start by importing the necessary class
import com.example.ClassA
This works fine: ClassA is located in a jar on the 'jars' path and has ClassB as a public static nested class.
I'm then trying to use ClassB like so:
val rawData = sc.newAPIHadoopFile(dataPath, classOf[com.example.mapreduce.input.Format[com.example.ClassA$ClassB]], classOf[org.apache.hadoop.io.LongWritable], classOf[com.example.ClassA$ClassB])
This is slightly complicated by one of the other classes taking ClassB as a type, but I think that should be fine.
When I execute this line, I get the following error:
<console>:17: error: type ClassA$ClassB is not a member of package com.example
I have also tried the import statement
import com.example.ClassA$ClassB
and the compiler seems fine with that too.
Any advice as to how I could proceed to debug this would be appreciated
Thanks for reading.
Update:
Changing the '$' to a '.' to reference the nested class seems to get past this problem, although I then got the following type error:
<console>:17: error: inferred type arguments [org.apache.hadoop.io.LongWritable,com.example.ClassA.ClassB,com.example.mapreduce.input.Format[com.example.ClassA.ClassB]] do not conform to method newAPIHadoopFile's type parameter bounds [K,V,F <: org.apache.hadoop.mapreduce.InputFormat[K,V]]

Notice the types that newAPIHadoopFile expects:
K,V,F <: org.apache.hadoop.mapreduce.InputFormat[K,V]
The important part here is that the generic type InputFormat expects the types K and V, i.e. the exact types of the first two parameters to the method.
In your case, the third parameter should be of type
F <: org.apache.hadoop.mapreduce.InputFormat[LongWritable, ClassA.ClassB]
Does your class extend FileInputFormat<LongWritable, V>?
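For the bound to be satisfied, the third class literal must name an InputFormat parameterised by exactly the first two types. Below is a minimal sketch of the call, assuming (which the question leaves open) that Format is declared roughly as class Format[V] extends FileInputFormat[LongWritable, V]; sc and dataPath come from the question and the package names are not verified:
import org.apache.hadoop.io.LongWritable
import com.example.ClassA
import com.example.mapreduce.input.Format

// This only type-checks if Format[ClassA.ClassB] <: InputFormat[LongWritable, ClassA.ClassB]
val rawData = sc.newAPIHadoopFile(
  dataPath,
  classOf[Format[ClassA.ClassB]],  // F: the InputFormat
  classOf[LongWritable],           // K: must match Format's key type
  classOf[ClassA.ClassB]           // V: must match Format's value type
)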

Related

Creating Spark Dataframes from regular classes

I have always seen that, when we use a map function, we can create a DataFrame from an RDD using a case class like the one below:
case class filematches(
  row_num: Long,
  matches: Long,
  non_matches: Long,
  non_match_column_desc: Array[String]
)
newrdd1.map(x => filematches(x._1, x._2, x._3, x._4)).toDF()
This works great as we all know!!
I was wondering why we specifically need case classes here.
We should be able to achieve the same effect using normal classes with parameterized constructors (as long as the constructor parameters are vals and not private):
class filematches1(
  val row_num: Long,
  val matches: Long,
  val non_matches: Long,
  val non_match_column_desc: Array[String]
)
newrdd1.map(x => new filematches1(x._1, x._2, x._3, x._4)).toDF
Here, I am using the new keyword to instantiate the class.
Running the above gives me the error:
error: value toDF is not a member of org.apache.spark.rdd.RDD[filematches1]
I am sure I am missing some key concept on case classes vs regular classes here but not able to find it yet.
To resolve the error
value toDF is not a member of org.apache.spark.rdd.RDD[...]
you should move your case class definition out of the function where you are using it. You can refer to http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-Scala-Error-value-toDF-is-not-a-member-of-org-apache/td-p/29878 for more detail.
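A minimal sketch of that arrangement (spark here is an illustrative SparkSession and newrdd1 is assumed to be the RDD of 4-tuples from the question):
import org.apache.spark.sql.SparkSession

// Case class defined at the top level, outside any method, so Spark's
// reflection-based schema inference can find it.
case class filematches(
  row_num: Long,
  matches: Long,
  non_matches: Long,
  non_match_column_desc: Array[String]
)

val spark: SparkSession = SparkSession.builder().appName("example").getOrCreate()
import spark.implicits._  // brings toDF for RDDs of case classes into scope

val df = newrdd1.map(x => filematches(x._1, x._2, x._3, x._4)).toDF()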
On your other query: case classes are syntactic sugar, and they provide the following additional things.
Case classes are different from general classes. They are specially suited to creating immutable objects.
They have a default apply function, which is used as a constructor to create objects (so less code).
All the variables in a case class are vals by default, and hence immutable, which is a good thing in the Spark world since RDDs are immutable too.
An example of a case class is
case class Book(name: String)
val book1 = Book("test")
You cannot change the value of book1.name because it is immutable, and you do not need to say new Book(...) to create the object here.
The class variables are public by default, so you don't need setters and getters.
Moreover, when comparing two case class instances, their structure is compared instead of their references (illustrated below).
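A quick sketch of those points, using the Book case class from above and a hypothetical plain Journal class for contrast:
case class Book(name: String)     // companion apply, val field, structural equality for free
class Journal(val name: String)   // plain class: reference equality unless you override equals

val book1 = Book("test")          // no `new` needed: the generated apply acts as the constructor
// book1.name = "other"           // does not compile: case class fields are vals
Book("test") == book1             // true: compared field by field
new Journal("test") == new Journal("test")  // false: compared by reference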
Edit: Spark uses the following class to infer the schema.
Code Link :
https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
If you look at the schemaFor function (lines 719 to 791), you'll see it converts Scala types to Catalyst types. I think the case to handle non-case classes for schema inference has not been added yet, so every time you try to use a non-case class with schema inference, it falls through to the default case and hence gives the error Schema for type $other is not supported.
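If you do want a DataFrame from a regular class today, one option is to sidestep reflection-based inference and supply the schema explicitly via Rows. A hedged sketch, reusing the column names and the newrdd1/spark names assumed from the question:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Describe the columns by hand instead of relying on case-class reflection.
val schema = StructType(Seq(
  StructField("row_num", LongType, nullable = false),
  StructField("matches", LongType, nullable = false),
  StructField("non_matches", LongType, nullable = false),
  StructField("non_match_column_desc", ArrayType(StringType), nullable = true)
))

val rowRdd = newrdd1.map(x => Row(x._1, x._2, x._3, x._4))
val df = spark.createDataFrame(rowRdd, schema)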
Hope this helps

Why does spray-json apply this hierarchy way in RootJsonFormat?

Recently, I have been reading the source code of spray-json. I noticed the following hierarchy relation in JsonFormat.scala; please see the code snippet below.
/**
* A special JsonFormat signaling that the format produces a legal JSON root
* object, i.e. either a JSON array
* or a JSON object.
*/
trait RootJsonFormat[T] extends JsonFormat[T] with RootJsonReader[T] with RootJsonWriter[T]
To express the confusion more conveniently, I drew a diagram of the hierarchy.
According to my limited knowledge of Scala, I thought that JsonFormat[T] with could be removed from the above code. So I cloned the spray-json repository and removed JsonFormat[T] with:
trait RootJsonFormat[T] extends RootJsonReader[T] with RootJsonWriter[T]
Then I compiled it in sbt (using the package/compile commands); compilation succeeded and a spray-json_2.11-1.3.4.jar was generated.
However, when I ran the test cases via sbt's test command, they failed.
So I would like to know why. Thanks in advance.
I suggest not thinking of it in terms of OOP, but in terms of type classes. When some entity must be serialized and deserialized at the same time, there is a type class JsonFormat that includes both JsonWriter and JsonReader. This is convenient since you don't need to search for two type class instances when you need both capabilities. But for this approach to work, there has to be an instance of the JsonFormat type class, which is why you can't just remove it from the hierarchy. For instance:
def myMethod[T](t: T)(implicit format: JsonFormat[T]): Unit = {
  format.read(format.write(t))
}
If you want this method to work properly, there has to be a direct descendant of JsonFormat and a concrete implicit instance of it for the specific type T.
UPD: By creating an instance of the JsonFormat type class, you get instances of the JsonWriter and JsonReader type classes automatically (in case you need both). So this is also a way to reduce boilerplate.
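To make this concrete, here is a small sketch using spray-json's DefaultJsonProtocol (the Person type is a made-up example):
import spray.json._
import DefaultJsonProtocol._

case class Person(name: String)

// jsonFormatN gives back a RootJsonFormat[Person], which is also a JsonFormat[Person].
implicit val personFormat: RootJsonFormat[Person] = jsonFormat1(Person)

def myMethod[T](t: T)(implicit format: JsonFormat[T]): Unit = {
  format.read(format.write(t))
}

// Resolves only because RootJsonFormat[T] extends JsonFormat[T]; with that extension
// removed, no implicit JsonFormat[Person] would be found here.
myMethod(Person("someone"))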

Scala convert Map$ to Map

I have an exception:
java.lang.ClassCastException: scala.collection.immutable.Map$ cannot be cast to scala.collection.immutable.Map
which I'm getting in this part of the code:
val iterator = new CsvMapper()
  .disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES)
  .readerFor(Map.getClass).`with`(CsvSchema.emptySchema().withHeader()).readValues(reader)
while (iterator.hasNext) {
  println(iterator.next.asInstanceOf[Map[String, String]])
}
So, are there any options to avoid this issue, because this:
val iterator = new CsvMapper()
  .disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES)
  .readerFor(Map[String,String].getClass).`with`(CsvSchema.emptySchema().withHeader()).readValues(reader)
doesn't help, because I get
[error] Unapplied methods are only converted to functions when a function type is expected.
[error] You can make this conversion explicit by writing `apply _` or `apply(_)` instead of `apply`.
Thanks in advance
As has been pointed out in the earlier comments, in general you need classOf[X[_,_]] rather than X.getClass or X[A, B].getClass for a class that takes two generic types. (instance.getClass retrieves the class of the associated instance; classOf[X] does the same for some type X when an instance isn't available. Since Map is an object and objects are also instances, it retrieves the class type of the object Map - the Map trait's companion.)
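The distinction in code, as a brief sketch (the types are noted in comments, not verified REPL output):
val companionClass = Map.getClass    // class of the Map companion object, i.e. scala.collection.immutable.Map$
val traitClass = classOf[Map[_, _]]  // class literal for the Map trait itself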
However, a second problem here is that scala.collection.immutable.Map is abstract (it's actually a trait), and so it cannot be instantiated as-is. (If you look at the type of Scala Map instances created via the companion's apply method, you'll see that they're actually instances of classes such as Map.EmptyMap or Map.Map1, etc.) As a consequence, that's why your modified code still produced an error.
However, the ultimate problem here is that you required - as you mentioned - a Java java.util.Map and not a Scala scala.collection.immutable.Map (which is what you'll get by default if you just type Map in a Scala program). Just one more thing to watch out for when converting Java code examples to Scala. ;-)
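Putting those pieces together, a hedged sketch of the corrected call (assuming reader is an open java.io.Reader and Jackson's CSV module is on the classpath, as in the question):
import com.fasterxml.jackson.databind.DeserializationFeature
import com.fasterxml.jackson.dataformat.csv.{CsvMapper, CsvSchema}

val iterator = new CsvMapper()
  .disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES)
  .readerFor(classOf[java.util.Map[String, String]])  // class literal for the Java Map, not Map.getClass
  .`with`(CsvSchema.emptySchema().withHeader())
  .readValues[java.util.Map[String, String]](reader)

while (iterator.hasNext) {
  println(iterator.next())  // each row is a java.util.Map[String, String]
}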

Dealing with nested classes in Scala

I am unable to understand how to work with nested classes in Scala, especially when I encountered the error below:
class Action {
  val entityModelVar = new EntityModel
}
class EntityModel {
  class EntityLabel {
    ....
  }
}
The above code snippet gives an idea of my class structure. Here are two code blocks that puzzle me as to how they work.
val actionList = Array[Action](Action1, Action2)
..
val newLabels = actionList(i).test(doc)
actionList(i).retrain(newLabels) //error pointed here
Error: type mismatch:
found    : Seq[a.entityModelVar.EntityLabel]
required : Seq[_13.entityModelVar.EntityLabel] where _13: Action
However, the following code compiles without any error:
//This works fine
val a = actionList(i)
val newLabels = a.test(doc2)
a.retrain(newLabels)
Also, here is the definition of the retrain function:
def retrain(labels: Seq[entityModelVar.EntityLabel]) = {
  entityModelVar.retrain(labels)
}
and the signature of the EntityModel.retrain function:
def retrain(testLabels:Seq[EntityLabel]):Unit
The problem is that the inner class has got to belong to the same instance of the outer class. But is actionList(i) guaranteed to be the same instance between two calls? The compiler doesn't know for certain (maybe another thread fiddles with it? who knows what apply does anyway?), so it complains. The _13 is its name for a temporary variable that it wishes were there to assure that it is the same instance.
Your next one works because the compiler can see that you call actionList(i) once, store that instance, get an inner class from it and then apply it.
So, moral of the story is: you need to make it abundantly obvious to the compiler that your inner class instances match up to their proper outer class, and the best way to do that is to store that outer class in a val where it can't change without you (or the compiler) noticing.
(You can also specify the types of individual variables if you break up parameter blocks. So, for instance, def foo(m: EntityModel)(l: m.EntityLabel) would be a way to write a function that takes an outer class instance and an inner one corresponding to it.)
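A self-contained sketch of the same idea with simplified, made-up names (Outer and Inner stand in for the Action/EntityModel and EntityLabel structure above):
class Outer {
  class Inner
  def make: Inner = new Inner
  def use(labels: Seq[Inner]): Unit = ()
}

val outers = Array(new Outer, new Outer)

// Rejected for the same reason as the question: each occurrence of outers(0)
// is a separate expression, so the compiler cannot prove the Inner paths match.
// outers(0).use(Seq(outers(0).make))

// Fine: pin the instance in a val so both the Seq[Inner] and the call share one path.
val o = outers(0)
o.use(Seq(o.make))

// Dependent method type: the second parameter list refers to the first.
def useWith(m: Outer)(i: m.Inner): Unit = ()
useWith(o)(o.make)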

Why does Scala complain about illegal inheritance when there are raw types in the class hierarchy?

I'm writing a wrapper that takes a Scala ObservableBuffer and fires events compatible with the Eclipse/JFace Databinding framework.
In the Databinding framework, there is an abstract ObservableList that decorates a normal Java list. I wanted to reuse this base class, but even this simple code fails:
val list = new java.util.ArrayList[Int]
val obsList = new ObservableList(list, null) {}
with errors:
illegal inheritance; anonymous class $anon inherits different type instances of trait Collection: java.util.Collection[E] and java.util.Collection[E]
illegal inheritance; anonymous class $anon inherits different type instances of trait Iterable: java.lang.Iterable[E] and java.lang.Iterable[E]
Why? Does it have to do with raw types? ObservableList implements IObservableList, which extends the raw type java.util.List. Is this expected behavior, and how can I work around it?
Having a Java raw type in the inheritance hierarchy causes this kind of problem. One solution is to write a tiny bit of Java to fix up the raw type, as in the answer to Scala class cant override compare method from Java Interface which extends java.util.comparator.
For more about why raw types are problematic for Scala, see this bug: http://lampsvn.epfl.ch/trac/scala/ticket/1737. That bug has a workaround using existential types that probably won't work for this particular case, at least not without a lot of casting, because the java.util.List type parameter is in both covariant and contravariant positions.
From looking at the Javadoc, the argument of the constructor isn't parameterized.
I'd try this:
val list = new java.util.ArrayList[_]
val obsList = new ObservableList(list, null) {}