I have always seen that when we are using a map function, we can create a DataFrame from an RDD using a case class, like below:
case class filematches(
row_num:Long,
matches:Long,
non_matches:Long,
non_match_column_desc:Array[String]
)
newrdd1.map(x=> filematches(x._1,x._2,x._3,x._4)).toDF()
This works great, as we all know!
I was wondering: why do we specifically need case classes here?
We should be able to achieve the same effect using a normal class with a parameterized constructor (its parameters will be vals and not private):
class filematches1(
val row_num:Long,
val matches:Long,
val non_matches:Long,
val non_match_column_desc:Array[String]
)
newrdd1.map(x=> new filematches1(x._1,x._2,x._3,x._4)).toDF
Here, I am using the new keyword to instantiate the class.
Running the above gave me the error:
error: value toDF is not a member of org.apache.spark.rdd.RDD[filematches1]
I am sure I am missing some key concept about case classes vs regular classes here, but I haven't been able to find it yet.
To resolve the error
value toDF is not a member of org.apache.spark.rdd.RDD[...]
you should move your case class definition out of the function where you are using it. You can refer to http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-Scala-Error-value-toDF-is-not-a-member-of-org-apache/td-p/29878 for more detail.
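For example, a minimal sketch of the working shape (assuming a SparkSession named spark and the tuple RDD from the question; the names are illustrative):

case class FileMatches(
  row_num: Long,
  matches: Long,
  non_matches: Long,
  non_match_column_desc: Array[String]
)

object Example {
  def run(spark: org.apache.spark.sql.SparkSession,
          newrdd1: org.apache.spark.rdd.RDD[(Long, Long, Long, Array[String])]): Unit = {
    import spark.implicits._ // brings toDF into scope
    // the case class lives at the top level, not inside this method
    val df = newrdd1.map(x => FileMatches(x._1, x._2, x._3, x._4)).toDF()
    df.printSchema()
  }
}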
On your other query: case classes are syntactic sugar, and they provide the following additional things.
Case classes are different from general classes. They are specifically meant for creating immutable objects.
They have a default apply function which is used as a constructor to create objects (so less code).
All the variables in a case class are vals by default, hence immutable, which is a good thing in the Spark world, as all RDDs are immutable too.
An example of a case class:
case class Book(name: String)
val book1 = Book("test") // no `new` needed: the generated apply acts as the constructor
You cannot change the value of book1.name, as it is immutable.
The class variables are public by default, so you don't need setters and getters.
Moreover, when comparing two objects of a case class, their structure is compared instead of their references.
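A quick sketch of that last point:

case class Point(x: Int, y: Int)
class RawPoint(val x: Int, val y: Int)

Point(1, 2) == Point(1, 2)               // true: field values are compared
new RawPoint(1, 2) == new RawPoint(1, 2) // false: references are compared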
Edit: Spark uses the following class to infer the schema.
Code link:
https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
If you check the schemaFor function (Line 719 to 791), you'll see it converts Scala types to Catalyst types. I think the case for handling non-case classes in schema inference has not been added yet, so every time you try to use a non-case class with schema inference, it falls through to the default branch and hence gives the error Schema for type $other is not supported.
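If you do want to keep a regular class, a hedged sketch of one workaround is to bypass inference and supply the schema explicitly (same spark and newrdd1 assumptions as above):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("row_num", LongType),
  StructField("matches", LongType),
  StructField("non_matches", LongType),
  StructField("non_match_column_desc", ArrayType(StringType))
))
val df = spark.createDataFrame(newrdd1.map(x => Row(x._1, x._2, x._3, x._4)), schema)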
Hope this helps
I have been using sealed traits and case objects to define enumerated types in Scala, and I recently came across another approach: extending the Enumeration class, like below:
object CertificateStatusEnum extends Enumeration {
val Accepted, SignatureError, CertificateExpired, CertificateRevoked, NoCertificateAvailable, CertChainError, ContractCancelled = Value
}
against doing something like this:
sealed trait CertificateStatus
object CertificateStatus {
case object Accepted extends CertificateStatus
case object SignatureError extends CertificateStatus
case object CertificateExpired extends CertificateStatus
case object CertificateRevoked extends CertificateStatus
case object NoCertificateAvailable extends CertificateStatus
case object CertChainError extends CertificateStatus
case object ContractCancelled extends CertificateStatus
}
What is considered a good approach?
They both get the job done for simple purposes, but in terms of best practice, the use of sealed traits + case objects is more flexible.
The backstory is that Scala came with everything Java had: Java had enumerations, so Scala had to include them for interoperability reasons. But Scala does not need them, because it supports ADTs (algebraic data types), so it can express enumerations in a functional way, like the one you just saw.
You'll encounter certain limitations with the normal Enumeration class:
the inability of the compiler to detect non-exhaustive pattern matches
it's actually harder to extend the elements to hold more data besides the String name and the Int id, because Value is final
at runtime, all enums have the same type because of type erasure, so type-level programming is limited - for example, you can't have overloaded methods
when you write object CertificateStatusEnum extends Enumeration, your enumeration values are not typed as CertificateStatusEnum but as CertificateStatusEnum.Value - so you have to use a type alias to fix that. The problem is that the type of the companion will still be CertificateStatusEnum.Value.type, so you end up adding multiple aliases and get a rather confusing enumeration (see the sketch after this list).
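A minimal sketch of those last two points (the alias, and a match the compiler cannot check):

object CertificateStatusEnum extends Enumeration {
  type CertificateStatus = Value // alias so signatures can say CertificateStatusEnum.CertificateStatus
  val Accepted, SignatureError = Value
}
import CertificateStatusEnum._

def describe(s: CertificateStatus): String = s match {
  case Accepted => "accepted"
  // omitting SignatureError would compile without any warning here;
  // with sealed trait + case objects the compiler would flag the missing case
  case SignatureError => "signature error"
}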
On the other hand, the algebraic data type comes as a type-safe alternative where you specify the shape of each element and to encode the enumeration you just need sum types which are expressed exactly using sealed traits (or abstract classes) and case objects.
These solve the limitations of the Enumeration class, but you'll encounter some other (minor) drawbacks, though they are not that limiting:
case objects don't have a default order - so if you need one, you'll have to add an id as an attribute of the sealed trait and provide an ordering yourself.
a somewhat more problematic issue is that even though case objects are serializable, there is no easy way to deserialize a case object from its name; you will most probably need to write a custom deserializer (the lookup sketch after the code below is a starting point).
you can't iterate over them by default as you can with Enumeration. It's not a very common use case, but it can be easily achieved, e.g.:
object CertificateStatus {
val values: Seq[CertificateStatus] = Seq(
Accepted,
SignatureError,
CertificateExpired,
CertificateRevoked,
NoCertificateAvailable,
CertChainError,
ContractCancelled
)
// rest of the code
}
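With values in place, a name-based lookup (the analogue of Enumeration.withName, and a starting point for the deserialization concern above) is a one-liner; this sketch relies on the default toString of a case object being its simple name:

def fromName(name: String): Option[CertificateStatus] =
  CertificateStatus.values.find(_.toString == name)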
In practice, there's nothing you can do with Enumeration that you can't do with sealed trait + case objects, so the former fell out of favor, in favor of the latter.
This comparison only concerns Scala 2.
In Scala 3, ADTs and their generalized versions (GADTs) were unified with enums under a new, powerful syntax, effectively giving you everything you need. So you'll have every reason to use them. As Gael mentioned, they became first-class entities.
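For reference, a sketch of the same enumeration in Scala 3 syntax:

enum CertificateStatus:
  case Accepted, SignatureError, CertificateExpired, CertificateRevoked,
       NoCertificateAvailable, CertChainError, ContractCancelled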
It depends on what you want from an enum.
In the first case, you implicitly have an order on the items (accessed through the id property), and reordering them has consequences.
I'd prefer case objects; in some cases an enum item could carry extra info in its constructor (like a Color with RGB values, not just a name).
Also, I'd recommend https://index.scala-lang.org/mrvisser/sealerate or similar libraries, which allow iterating over all elements.
I am wondering if there is a way to get Quick Documentation in IntelliJ to work for the object construction pattern many Scala developers use, shown below:
SomeClass(param1, param2)
instead of
new SomeClass(param1, param2)
The direct constructor call made with new obviously works, but many Scala devs use apply to construct objects. When that pattern is used, the IntelliJ documentation lookup fails to find any information on the class.
I don't know whether IntelliJ has documentation for this per se. However, the pattern is fairly easy to explain.
There's a pattern in Java code for having static factory methods (this is a specialization of the Gang of Four Factory Method Pattern), often along the lines of (translated to Scala-ish):
object Foo {
def barInstance(args...): Bar = ???
}
The main benefit of doing this is that the factory controls object instantiation, in particular:
the particular runtime class to instantiate, possibly based on the arguments to the factory. For example, the generic immutable collections in Scala have factory methods which may create optimized small collections when they're created with a sufficiently small number of elements. An example of this is a sequence of length 1, which can be implemented with basically no overhead: a single field referring to the sole element, and a lookup that returns that field if the offset is 0 and throws otherwise.
whether an instance is created at all. One can cache the arguments to the factory and memoize or "hash-cons" the created objects, or precreate the most common instances and hand them out repeatedly (see the sketch below).
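A minimal sketch of that second point, with hypothetical names:

final class Temperature private (val celsius: Int)

object Temperature {
  // precreate the most common instances once...
  private val common: Map[Int, Temperature] =
    (0 to 40).map(c => c -> new Temperature(c)).toMap
  // ...and hand them out repeatedly; only unusual values allocate
  def apply(celsius: Int): Temperature =
    common.getOrElse(celsius, new Temperature(celsius))
}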
A further benefit is that the factory is a function, while new is an operator, which allows the factory to be passed around:
class Foo(x: Int)
object Foo {
def instance(x: Int) = new Foo(x)
}
Seq(1, 2, 3).map(Foo.instance) // the factory is passed as a function, yielding a Seq of three Foo instances
In Scala, this is combined with two language features: any object that defines an apply method can be used syntactically as a function (even if it doesn't extend Function, which would additionally let the object be passed around as a function value), and a class can have a "companion object" (which incorporates the things that in Java would be static members of the class). Together, these give you something like:
class Foo(constructor_args...)
object Foo {
def apply(args...): Foo = ???
}
Which can be used like:
Foo(...)
For a case class, the Scala compiler automatically generates a companion object with certain behaviors, one of which is an apply method with the same arguments as the constructor (other behaviors include contract-obeying hashCode and equals, as well as an unapply method to allow pattern matching).
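Conceptually (a rough sketch, not the exact generated code), case class Bar(x: Int) expands to something like:

class Bar(val x: Int) { /* plus equals, hashCode, toString, copy, ... */ }

object Bar {
  def apply(x: Int): Bar = new Bar(x)          // enables Bar(42)
  def unapply(b: Bar): Option[Int] = Some(b.x) // enables pattern matching
}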
I have an exception:
java.lang.ClassCastException: scala.collection.immutable.Map$ cannot be cast to scala.collection.immutable.Map
which I'm getting in this part of the code:
val iterator = new CsvMapper()
.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES)
.readerFor(Map.getClass).`with`(CsvSchema.emptySchema().withHeader()).readValues(reader)
while (iterator.hasNext) {
println(iterator.next.asInstanceOf[Map[String, String]])
}
So, are there any options to avoid this issue? Because this:
val iterator = new CsvMapper()
.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES)
.readerFor(Map[String,String].getClass).`with`(CsvSchema.emptySchema().withHeader()).readValues(reader)
doesn't help, because I get
[error] Unapplied methods are only converted to functions when a function type is expected.
[error] You can make this conversion explicit by writing `apply _` or `apply(_)` instead of `apply`.
Thanks in advance
As has been pointed out in the earlier comments, in general you need classOf[X[_, _]] rather than X.getClass or X[A, B].getClass for a class that takes two generic type parameters. (instance.getClass retrieves the class of the associated instance; classOf[X] does the same for some type X when an instance isn't available. Since Map is an object, and objects are also instances, Map.getClass retrieves the class of the object Map - the Map trait's companion - which is why the error message mentions Map$.)
However, a second problem here is that scala.collection.immutable.Map is abstract (it's actually a trait), and so it cannot be instantiated as-is. (If you look at the type of Scala Map instances created via the companion's apply method, you'll see that they're actually instances of classes such as Map.EmptyMap or Map.Map1, etc.) As a consequence, that's why your modified code still produced an error.
However, the ultimate problem here is that you require - as you mentioned - a Java java.util.Map and not a Scala scala.collection.immutable.Map (which is what you get by default if you just type Map in a Scala program). Just one more thing to watch out for when converting Java code examples to Scala. ;-)
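Putting that together, a hedged sketch of the corrected call (assuming Jackson's jackson-dataformat-csv and the reader from your snippet):

import com.fasterxml.jackson.databind.DeserializationFeature
import com.fasterxml.jackson.dataformat.csv.{CsvMapper, CsvSchema}

val iterator = new CsvMapper()
  .disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES)
  .readerFor(classOf[java.util.Map[String, String]]) // the Java Map class, not the Scala companion
  .`with`(CsvSchema.emptySchema().withHeader())
  .readValues[java.util.Map[String, String]](reader)

while (iterator.hasNext) {
  println(iterator.next())
}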
Assuming we have a model of something, represented as a case class, like so:
case class User(firstName:String,lastName:String,age:Int,planet:Option[Planet])
sealed abstract class Planet
case object Earth extends Planet
case object Mars extends Planet
case object Venus extends Planet
Essentially, I want, either by use of reflection or macros, to be able to get the field names of the User case class, as well as the types represented by those fields. This also includes Option; i.e., in the example provided, I need to be able to differentiate between an Option[Planet] and just a Planet.
In Scala-ish pseudocode, something like this:
val someMap = createTypedMap[User] // Assume createTypedMap is some function which returns map of Strings to Types
someMap.foreach{ case (fieldName, someType) =>
val statement = someType match {
case String => s"$fieldName happened to be a string"
case Int => s"$fieldName happened to be an integer"
case Planet => s"$fieldName happened to be a planet"
case Option[Planet] => s"$fieldName happened to be an optional planet"
case _ => s"unknown type for $fieldName"
}
println(statement)
}
I am aware that you can't do stuff like case Option[Planet], since it gets erased by Scala's type erasure; however, even when using TypeTags, I am unable to write code that does what I am trying to do and can also deal with other types (like Either[SomeError, String]).
We are currently using the latest version of Scala (2.11.2), so any solution that uses TypeTags, ClassTags, or macros would be more than enough.
Option is a type-parametrized type (Option[T]). At runtime, unless you have structured your code to use type tags, you have no means to distinguish between an Option[String] and an Option[Int], due to type erasure (this is true for all type-parametrized types).
Nonetheless, you can discriminate between an Option[*] and a Planet. Just keep in mind the first issue.
Through reflection, getting all the "things" inside a class is easy. For example, say you only want the getters (you can apply other kinds of filters - there are A LOT of them, and not all behave as expected when inheritance is part of the process, so you'll need to experiment a little):
import reflect.runtime.{universe=>ru}
val fieldSymbols = ru.typeOf[User].members.collect{
case m: ru.MethodSymbol if m.isGetter => m
}
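Building on that, a minimal sketch (reusing the ru import above, and assuming the User and Planet definitions from the question) that collects the case-class accessors together with their declared types - the declared types come from the compile-time Type, so Option[Planet] is still distinguishable from Planet:

def fieldTypes[T: ru.TypeTag]: Map[String, ru.Type] =
  ru.typeOf[T].members.collect {
    case m: ru.MethodSymbol if m.isCaseAccessor =>
      m.name.toString -> m.returnType
  }.toMap

fieldTypes[User].foreach { case (fieldName, someType) =>
  val statement =
    if (someType =:= ru.typeOf[String]) s"$fieldName happened to be a string"
    else if (someType =:= ru.typeOf[Int]) s"$fieldName happened to be an integer"
    else if (someType =:= ru.typeOf[Option[Planet]]) s"$fieldName happened to be an optional planet"
    else if (someType <:< ru.typeOf[Planet]) s"$fieldName happened to be a planet"
    else s"unknown type for $fieldName"
  println(statement)
}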
Another option, if you are calling the code on instances rather than on classes, is to go through every method, call it, assign the result to a variable, and then test the type of that variable. This assumes you are only calling methods that don't alter the state of the instance.
You have a lot of options, time for you to find the best one for your needs.
I'm writing a wrapper that takes a Scala ObservableBuffer and fires events compatible with the Eclipse/JFace Databinding framework.
In the Databinding framework, there is an abstract ObservableList that decorates a normal Java list. I wanted to reuse this base class, but even this simple code fails:
val list = new java.util.ArrayList[Int]
val obsList = new ObservableList(list, null) {}
with errors:
illegal inheritance; anonymous class $anon inherits different type instances of trait Collection: java.util.Collection[E] and java.util.Collection[E]
illegal inheritance; anonymous class $anon inherits different type instances of trait Iterable: java.lang.Iterable[E] and java.lang.Iterable[E]
Why? Does it have to do with raw types? ObservableList implements IObservableList, which extends the raw type java.util.List. Is this expected behavior, and how can I work around it?
Having a Java raw type in the inheritance hierarchy causes this kind of problem. One solution is to write a tiny bit of Java to fix up the raw type, as in the answer to Scala class can't override compare method from Java Interface which extends java.util.Comparator.
For more about why raw types are problematic for Scala, see this bug: http://lampsvn.epfl.ch/trac/scala/ticket/1737. That bug has a workaround using existential types, but it probably won't work for this particular case, at least not without a lot of casting, because the java.util.List type parameter appears in both covariant and contravariant positions.
From looking at the Javadoc, the argument of the constructor isn't parameterized.
I'd try this:
val list: java.util.List[_] = new java.util.ArrayList[Int]
val obsList = new ObservableList(list, null) {}