Passing Spark DataFrame between Scala methods - Performance

I recently developed a Spark Streaming application in Scala. In it, I made extensive use of implicit classes (the Pimp my Library pattern) to implement general-purpose utilities, such as writing a DataFrame to HBase, by creating an implicit class that extends Spark's DataFrame with extra methods. For example:
implicit class DataFrameExtension(private val dataFrame: DataFrame) extends Serializable {
  ..... // Custom methods to perform some computations
}
However, a senior architect on my team refactored the code (citing style mismatch and performance as reasons) and copied these methods to a new class. Now, these methods accept a DataFrame as an argument.
Can anyone help me with the following?
Do Scala's implicit classes create any overhead at run time?
Does passing a DataFrame object between methods create any overhead, either in terms of method calls or serialization?
I have searched a bit, but couldn't find any style guide that gives guidelines on using implicit classes or methods over traditional methods.
Thanks in advance.

Do Scala's implicit classes create any overhead at run time?
Not in your case. There is some overhead when the implicit type is an AnyVal (and thus needs to be boxed). Implicits are resolved at compile time, so apart from perhaps a few virtual method calls there should be no overhead.
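As a minimal sketch of the two styles being compared, the following uses a plain Seq[Int] as a stand-in for a DataFrame so that it runs without Spark. The names ExtensionStylesDemo, SeqExtension, SeqUtils and total are hypothetical and only serve the illustration.

```scala
object ExtensionStylesDemo {
  // Implicit-class style: the compiler rewrites `data.total` into
  // `new SeqExtension(data).total` at compile time.
  implicit class SeqExtension(private val xs: Seq[Int]) {
    def total: Int = xs.sum
  }

  // Plain-method style: the same logic, taking the value as an argument.
  object SeqUtils {
    def total(xs: Seq[Int]): Int = xs.sum
  }

  val data: Seq[Int] = Seq(1, 2, 3)

  // Both calls run the same code; only the call syntax differs.
  val viaExtension: Int = data.total
  val viaMethod: Int = SeqUtils.total(data)
}
```

Both values come out the same, so the choice between the two styles is mostly a matter of taste and team convention rather than run-time cost.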
Does passing a DataFrame object between methods create any overhead, either in terms of method calls or serialization?
No, no more than for any other type. Passing a DataFrame to a method only passes a reference, so no serialization takes place.
... if I pass DataFrames between methods in Spark code, it might create a closure and, as a result, bring in the parent class that holds the DataFrame object.
Only if you use scoped variables inside your DataFrame operations, for example filter($"col" === myVar), where myVar is declared in the scope of the method. In this case, Spark might serialize the wrapping class, but it's easy to avoid that. Remember that DataFrames are passed around quite often, and quite deep, inside Spark's own code, and probably in every other library that you might be using (data sources, for example).
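The capture problem, and the usual way around it, can be sketched without Spark by using plain Java serialization. This assumes that the variable is a field of the enclosing class and is used inside a function value, and that Spark's closure serialization behaves analogously to Java serialization here. The names ClosureCaptureSketch, Job, threshold and canSerialize are hypothetical.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureSketch {
  // Hypothetical helper: true if Java serialization of `obj` succeeds.
  def canSerialize(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Deliberately NOT Serializable, like a typical driver-side helper class.
  class Job(val threshold: Int) {
    // Reads the field directly, so the function captures `this` (the whole Job).
    def capturesOuter: Int => Boolean = x => x > threshold

    // Copies the field into a local val first, so only the Int is captured.
    def capturesLocal: Int => Boolean = {
      val t = threshold
      x => x > t
    }
  }

  val job = new Job(10)
  val outerOk: Boolean = canSerialize(job.capturesOuter)
  val localOk: Boolean = canSerialize(job.capturesLocal)
}
```

Copying the field into a local val before building the function is the "easy to avoid" part: the function then carries only the value it needs, not the enclosing object.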
It is very common (and handy) to use extension implicit classes like you did.

Related

Scala boilerplate: lack of common superclass of Iterable and ParIterable

Why is Scala designed with the following irritating form of boilerplate?
It would be convenient to write
def doStuffWithInts(ints: BaseIterable[Int]): Unit = ints foreach doStuffWithInt
for a common superclass BaseIterable of Iterable and ParIterable so that we can write both
val sequentialInts: Vector[Int] = getSomeHugeVector()
doStuffWithInts(sequentialInts)
and
val parInts: ParVector[Int] = getSomeHugeParVector()
doStuffWithInts(parInts)
Yet Scala forces us to copy and paste our doStuffWithInts method, once for Iterable and once for ParIterable. Why does Scala thrust such boilerplate on us by failing to have a common superclass BaseIterable of both Iterable and ParIterable?
You can use IterableOnce, but that would force you to go through an Iterator, which is always sequential.
This is a conscious decision by the maintainers. You can read all the related discussions by starting here: https://github.com/scala/scala-parallel-collections/issues/101
The TL;DR is that the maintainers agree it is a bad idea to provide an abstraction over the two, mainly because parallel collections should not be used as general collections but rather as localized optimizations. They also point out how easy it would be to introduce errors if you could abstract over the two (as was the case in 2.12).
Now, if you insist on abstracting over the two, you may create your own typeclass.
Finally, I would suggest looking at Future.traverse instead of parallel collections.
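A minimal sketch of the Future.traverse suggestion, assuming the global execution context is acceptable and that the per-element work returns a value. The body of doStuffWithInt and the names TraverseSketch and run are hypothetical stand-ins for the code in the question.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object TraverseSketch {
  // Hypothetical stand-in for the per-element work in the question.
  def doStuffWithInt(i: Int): Int = i * 2

  // Takes an ordinary sequential collection; each element is processed
  // in its own Future, and the results keep the input order.
  def doStuffWithInts(ints: Seq[Int]): Future[Seq[Int]] =
    Future.traverse(ints)(i => Future(doStuffWithInt(i)))

  // Blocks for the result; kept in a method so nothing waits on a Future
  // while this object is still being initialized.
  def run(): Seq[Int] =
    Await.result(doStuffWithInts(Vector(1, 2, 3)), 10.seconds)
}
```

With this shape the method signature only mentions the sequential collection type, and the parallelism stays a local implementation detail, which matches the maintainers' point about localized optimizations.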

Special grammar in scala

I am very new to Scala and Spark, and I found a strange piece of syntax in the Scala-related code of the Apache Beam project that I can't understand.
Here is the code in question:
JavaDStream<Metadata> metadataDStream = mapWithStateDStream.map(new Tuple2MetadataFunction());
// register ReadReportDStream to report information related to this read.
new ReadReportDStream(metadataDStream.dstream(), id, getSourceName(source, id), stepName)
.register();
From the above code, you can see that inside the constructor of ReadReportDStream, the first parameter is
metadataDStream.dstream()
If we go to the definition of dstream, we see the following code:
class JavaDStream[T](val dstream: DStream[T])(implicit val classTag: ClassTag[T])
extends AbstractJavaDStreamLike[T, JavaDStream[T], JavaRDD[T]] {
I am wondering why it uses "metadataDStream.dstream()" in the constructor instead of "metadataDStream.dstream"? What does the "()" do?
It's mostly a question of convention. Methods declared with an empty parameter list are expected to be evaluated for their side effects. Methods declared without a parameter list are assumed to be purely functional and free of side effects. You can read more about that here: https://docs.scala-lang.org/style/method-invocation.html (Arity-0 section)
So in this case, metadataDStream.dstream() probably has some side effects. However, writing it as metadataDStream.dstream won't be a syntax error.
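The convention can be sketched in Scala as follows. The Counter class and its members are hypothetical and unrelated to the Beam or Spark code above.

```scala
object ArityZeroSketch {
  class Counter {
    private var n = 0

    // Empty parameter list: signals a side effect (mutates the counter).
    def increment(): Unit = n += 1

    // No parameter list: signals a pure accessor.
    def current: Int = n
  }

  val counter = new Counter
  counter.increment() // side-effecting call, written with ()
  counter.increment()
  val seen: Int = counter.current // pure read, written without ()
}
```

Note that the snippet in the question looks like Java rather than Scala. From Java, every method call requires parentheses, so the accessor generated for a Scala val such as dstream is always written as dstream() there, regardless of how it was declared.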

using Calcite's ReflectiveSchema from scala

I'm experimenting with Calcite from Scala, trying to pass a simple Scala class to create a schema at runtime (using ReflectiveSchema), and I'm running into trouble.
For example, when re-implementing the FoodMart JDBC example (which works well in Java), I simply call new ReflectiveSchema(new HR()), using an HR class rewritten in Scala as:
class HR {
val emps: Array[Employee] = Array(new Employee(100, "Bill"))
}
I get an error: ...SqlValidatorException: Object 'emps' not found within 'hr'. This problem seems to be related to the fact that Scala val fields are compiled to private fields in the bytecode, while the implementation in Calcite seems to use (by means of Java reflection) only the fields returned by a class's .getFields() method.
So I suppose this direction requires a lot more hacking than a simple my_field.setAccessible(true) or similar.
Is there any other way to construct a schema through the API, avoiding reflection and the use of JSON?
Thanks in advance for any suggestions.
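The visibility issue described above can be checked with plain Java reflection, without Calcite. This is only a sketch: the Employee field names (empid, name) and the wrapper object ValReflectionSketch are hypothetical.

```scala
object ValReflectionSketch {
  class Employee(val empid: Int, val name: String)

  class HR {
    val emps: Array[Employee] = Array(new Employee(100, "Bill"))
  }

  private val cls = classOf[HR]

  // Public fields only, which is what getFields() returns.
  val publicFieldNames: List[String] = cls.getFields.map(_.getName).toList

  // All fields declared on the class, including private ones.
  val declaredFieldNames: List[String] = cls.getDeclaredFields.map(_.getName).toList

  // Scala exposes the val through a public accessor method of the same name.
  val hasAccessor: Boolean = cls.getMethods.exists(_.getName == "emps")
}
```

Here getFields() sees nothing, while the private field and the public accessor method both exist, which is consistent with the error in the question.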

Scala Case Class Map Expansion

In Groovy one can do:
class Foo {
Integer a,b
}
Map map = [a:1,b:2]
def foo = new Foo(map) // map expanded, object created
I understand that Scala is not, in any sense of the word, Groovy, but I am wondering whether map expansion is supported in this context.
Simplistically, I tried and failed with:
case class Foo(a:Int, b:Int)
val map = Map("a"-> 1, "b"-> 2)
Foo(map: _*) // no dice, always applied to first property
A related thread shows possible solutions to the problem.
Now, from what I've been able to dig up, as of Scala 2.9.1 at least, reflection in regard to case classes is basically a no-op. The net effect then appears to be that one is forced into some form of manual object creation, which, given the power of Scala, is somewhat ironic.
I should mention that the use case involves the servlet request parameters map. Specifically, using Lift, Play, Spray, Scalatra, etc., I would like to take the sanitized params map (filtered via routing layer) and bind it to a target case class instance without needing to manually create the object, nor specify its types. This would require "reliable" reflection and implicits like "str2Date" to handle type conversion errors.
Perhaps in 2.10 with the new reflection library, implementing the above will be cake. Only 2 months into Scala, so just scratching the surface; I do not see any straightforward way to pull this off right now (for seasoned Scala developers, maybe doable)
Well, the good news is that Scala's Product interface, implemented by all case classes, actually doesn't make this very hard to do. I'm the author of a Scala serialization library called Salat that supplies some utilities for using pickled Scala signatures to get typed field information:
https://github.com/novus/salat (check out some of the utilities in the salat-util package).
Actually, I think this is something that Salat should do - what a good idea.
Re: D.C. Sobral's point about the impossibility of verifying params at compile time - point taken, but in practice this should work at runtime just like deserializing anything else with no guarantees about structure, like JSON or a Mongo DBObject. Also, Salat has utilities to leverage default args where supplied.
This is not possible, because it is impossible to verify at compile time that all parameters were passed in that map.
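Since the check cannot happen at compile time, the manual object creation mentioned in the question ends up as a runtime check. A minimal hand-written sketch, assuming a Map[String, Int] and using a hypothetical helper named fooFromMap (this is not Salat's API):

```scala
object MapBindingSketch {
  case class Foo(a: Int, b: Int)

  // Hand-written binder: yields None as soon as a required key is missing.
  def fooFromMap(m: Map[String, Int]): Option[Foo] =
    for {
      a <- m.get("a")
      b <- m.get("b")
    } yield Foo(a, b)

  val complete: Option[Foo] = fooFromMap(Map("a" -> 1, "b" -> 2))
  val incomplete: Option[Foo] = fooFromMap(Map("a" -> 1))
}
```

The Option return type makes the missing-parameter case explicit, which is the runtime counterpart of the compile-time guarantee that a map cannot provide.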

Why does the Scala API have two strategies for organizing types?

I've noticed that the Scala standard library uses two different strategies for organizing classes, traits, and singleton objects.
1. Using packages whose members are then imported. This is, for example, how you get access to scala.collection.mutable.ListBuffer. This technique is familiar coming from Java, Python, etc.
2. Using type members of traits. This is, for example, how you get access to the Parser type. You first need to mix in scala.util.parsing.combinator.Parsers. This technique is not familiar coming from Java, Python, etc., and isn't much used in third-party libraries.
I guess one advantage of (2) is that it organizes both methods and types, but in light of Scala 2.8's package objects the same can be done using (1). Why have both these strategies? When should each be used?
The nomenclature of note here is path-dependent types. That's the option number 2 you talk of, and I'll speak only of it. Unless you happen to have a problem solved by it, you should always take option number 1.
What you miss is that the Parser class makes reference to things defined in the Parsers class. In fact, the Parser class itself depends on what Input has been defined on Parsers:
abstract class Parser[+T] extends (Input => ParseResult[T])
The type Input is defined like this:
type Input = Reader[Elem]
And Elem is abstract. Consider, for instance, RegexParsers and TokenParsers. The former defines Elem as Char, while the latter defines it as Token. That means the Parser for each is different. More importantly, because Parser is an inner class of Parsers, the Scala compiler will make sure at compile time that you aren't passing a RegexParsers Parser to TokenParsers or vice versa. As a matter of fact, you won't even be able to pass the Parser of one instance of RegexParsers to another instance of it.
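The effect can be sketched with a much-simplified stand-in for Parsers. MiniParsers, CharParsers and IntParsers are hypothetical names and not part of the standard library.

```scala
object PathDependentSketch {
  // Hypothetical, much-simplified stand-in for Parsers.
  trait MiniParsers {
    type Elem // abstract, fixed by each implementation

    // Inner class: its type depends on the enclosing instance.
    class Parser(val accepts: Elem => Boolean)

    def run(p: Parser, e: Elem): Boolean = p.accepts(e)
  }

  object CharParsers extends MiniParsers { type Elem = Char }
  object IntParsers extends MiniParsers { type Elem = Int }

  val digit: CharParsers.Parser = new CharParsers.Parser(c => c.isDigit)
  val even: IntParsers.Parser = new IntParsers.Parser(i => i % 2 == 0)

  val digitOk: Boolean = CharParsers.run(digit, '7')
  val evenOk: Boolean = IntParsers.run(even, 4)

  // CharParsers.run(even, '7') // rejected by the compiler:
  //                            // an IntParsers.Parser is not a CharParsers.Parser
}
```

The commented-out line is the interesting one: uncommenting it should produce a type mismatch, because the two Parser types belong to different enclosing instances.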
The second is also known as the Cake pattern.
It has the benefit that the code inside the class that has a trait mixed in becomes independent of the particular implementation of the methods and types in that trait. It allows you to use the members of the trait without knowing their concrete implementation.
trait Logging {
  // Abstract: the concrete implementation is mixed in later.
  def log(msg: String): Unit
}
trait App extends Logging {
  log("My app started.")
}
Above, the Logging trait is the requirement for the App (requirements can also be expressed with self-types). Then, at some point in your application you can decide what the implementation will be and mix the implementation trait into the concrete class.
trait ConsoleLogging extends Logging {
  def log(msg: String): Unit = println(msg)
}
object MyApp extends App with ConsoleLogging
This has an advantage over imports, in the sense that the requirements of your piece of code aren't bound to the implementation defined by the import statement. Furthermore, it allows you to build and distribute an API which can be used in a different build somewhere else provided that its requirements are met by mixing in a concrete implementation.
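The self-type variant mentioned above might look like the following sketch. Service, BufferLogging and MyService are hypothetical names, and the buffer-based logger is only there so that the result can be inspected.

```scala
object SelfTypeSketch {
  trait Logging {
    def log(msg: String): Unit
  }

  // The self-type states that whatever mixes in Service must also be a Logging,
  // without Service itself extending Logging.
  trait Service { this: Logging =>
    def start(): Unit = log("My app started.")
  }

  // Hypothetical implementation that records messages instead of printing them.
  trait BufferLogging extends Logging {
    val messages = scala.collection.mutable.ListBuffer.empty[String]
    def log(msg: String): Unit = {
      messages += msg
      ()
    }
  }

  // The implementation is chosen here, at the point of assembly.
  object MyService extends Service with BufferLogging
}
```

Leaving out the logging implementation, as in `object Broken extends Service`, should fail to compile, because the self-type requirement is not met.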
However, there are a few things to be careful with when using this pattern.
All of the classes defined inside the trait will have a reference to the outer class. This can be an issue where performance is concerned, or when you're using serialization (when the outer class is not serializable, or worse, if it is, but you don't want it to be serialized).
If your 'module' gets really large, you will either have a very big trait and a very big source file, or will have to distribute the module trait code across several files. This can lead to some boilerplate.
It can force you to have to write your entire application using this paradigm. Before you know it, every class will have to have its requirements mixed in.
The concrete implementation must be known at compile time, unless you use some sort of hand-written delegation. You cannot mix in an implementation trait dynamically based on a value available at runtime.
I guess the library designers didn't regard any of the above as an issue where Parsers are concerned.
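The hand-written delegation mentioned in the last caveat can be sketched as follows, assuming the choice depends on a runtime flag. DelegatingLogging, ConsoleLogger, SilentLogger, Tool and verbose are hypothetical names.

```scala
object DelegationSketch {
  trait Logging {
    def log(msg: String): Unit
  }

  final class ConsoleLogger extends Logging {
    def log(msg: String): Unit = println(msg)
  }

  final class SilentLogger extends Logging {
    def log(msg: String): Unit = ()
  }

  // Hand-written delegation: log forwards to whichever implementation
  // the concrete class selects.
  trait DelegatingLogging extends Logging {
    def delegate: Logging
    def log(msg: String): Unit = delegate.log(msg)
  }

  // The implementation is picked from a value known only at runtime.
  class Tool(verbose: Boolean) extends DelegatingLogging {
    val delegate: Logging =
      if (verbose) new ConsoleLogger else new SilentLogger

    def start(): Unit = log("My app started.")
  }
}
```

The trait that gets mixed in is still fixed at compile time; only the object it forwards to varies, which is the price of keeping the mix-in style while allowing a runtime choice.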