SparkContext implicit functions definition - scala

I am reading the latest Spark code, and there is a comment:
// The following implicit functions were in SparkContext before 1.3 and users had to
// `import SparkContext._` to enable them. Now we move them here to make the compiler find
// them automatically. However, we still keep the old functions in SparkContext for backward
// compatibility and forward to the following functions directly.
I don't understand "Now we move them here to make the compiler find them automatically". How can Spark automatically find these implicit definitions and put them into scope, when users only create a SparkContext instance in their Spark code?

I would ask how spark could automatically find these implicit definitions
It is not Spark, it is the Scala compiler. The compiler searches a number of places while compiling your code to try and find implicits; they are listed in Where does Scala look for implicits?
Since these methods are defined in WritableConverter's companion object, they are part of that type's implicit scope: whenever a WritableConverter is required, in any of the ways Scala looks for implicits, these conversions are in scope automatically and you can apply them without an import.
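Here is a minimal sketch of the mechanism (the names mimic Spark's, but this is simplified code, not Spark's actual source): implicits defined in a type's companion object belong to its implicit scope, so the compiler finds them automatically:
class WritableConverter[T]

object WritableConverter {
  // Lives in the companion object, so it is found whenever an implicit
  // WritableConverter[Int] is required -- no import needed.
  implicit val intWritableConverter: WritableConverter[Int] =
    new WritableConverter[Int]
}

def sequenceFile[T](path: String)(implicit converter: WritableConverter[T]): Unit =
  println(s"reading $path using $converter")

sequenceFile[Int]("/tmp/data") // compiles without `import SparkContext._` or any other import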

Special grammar in Scala

I am very new to Scala and Spark, and I found a strange piece of Scala grammar inside the Apache Beam project that I can't understand.
Here is the strange place:
JavaDStream<Metadata> metadataDStream = mapWithStateDStream.map(new Tuple2MetadataFunction());
// register ReadReportDStream to report information related to this read.
new ReadReportDStream(metadataDStream.dstream(), id, getSourceName(source, id), stepName)
.register();
From the above code, you can see that the first argument passed to the ReadReportDStream constructor is
metadataDStream.dstream()
If you go to the definition of the dstream() method, you will see the following code:
class JavaDStream[T](val dstream: DStream[T])(implicit val classTag: ClassTag[T])
extends AbstractJavaDStreamLike[T, JavaDStream[T], JavaRDD[T]] {
I am wondering why it uses "metadataDStream.dstream()" in the constructor call instead of "metadataDStream.dstream". What does the "()" do?
It's mostly a question of convention. By convention, methods declared with an empty parameter list are the ones expected to have side effects, while parameterless methods are assumed to be purely functional and free of side effects. You can read more about that here: https://docs.scala-lang.org/style/method-invocation.html (Arity-0 section).
So in this case, metadataDStream.dstream() probably has some side effects, or at least the author wanted to allow for them. Syntactically, though, writing it as metadataDStream.dstream would not be an error in Scala.
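A small, self-contained illustration of that convention (plain Scala, unrelated to Spark):
class Counter {
  private var n = 0

  // Declared with empty parens: the parentheses advertise a side effect.
  def tick(): Int = { n += 1; n }

  // Declared without parens: reads like a pure accessor.
  def current: Int = n
}

val c = new Counter
c.tick()    // conventional: keep the parentheses because state changes
c.current   // conventional: no parentheses, it just reads a value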

No implicits found for parameter evidence

I have a line of code in a Scala app that takes a DataFrame with one column and two rows, and assigns its two values to the variables start and end:
val Array(start, end) = datesInt.map(_.getInt(0)).collect()
This code works fine when run in a REPL, but when I try to put the same line in a Scala object in IntelliJ, it inserts a grey (?: Encoder[Int]) before the .collect() statement and shows the inline error No implicits found for parameter evidence$6: Encoder[Int].
I'm pretty new to Scala and I'm not sure how to resolve this.
Spark needs to know how to serialize JVM types in order to ship data around the cluster. For some types the encoders can be generated automatically, and for common types there are explicit implementations written by the Spark developers, which are passed implicitly. If your SparkSession is named spark, then you are missing the following line:
import spark.implicits._
Since you are new to Scala: implicit parameters are parameters you don't have to pass explicitly. In your example, the map function requires an Encoder[Int]. Adding this import brings one into scope, so it is passed to map automatically.
Check the Scala documentation to learn more.
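For example, a minimal self-contained version of your snippet might look like this (the sample data here is made up, standing in for your datesInt DataFrame):
import org.apache.spark.sql.SparkSession

object EncoderExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("encoder-example").getOrCreate()
    import spark.implicits._ // brings Encoder[Int] (and other common encoders) into scope

    // Stand-in for your one-column, two-row DataFrame
    val datesInt = Seq(20200101, 20201231).toDF("date")

    // Without the import above, this line fails with
    // "No implicits found for parameter evidence$6: Encoder[Int]"
    val Array(start, end) = datesInt.map(_.getInt(0)).collect()
    println(s"start=$start end=$end")

    spark.stop()
  }
}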

Passing Spark DataFrames between Scala methods - Performance

Recently, I developed a Spark Streaming application using Scala and Spark. In this application I made extensive use of implicit classes (the "pimp my library" pattern) to implement more general utilities, such as writing a DataFrame to HBase, by creating an implicit class extending Spark's DataFrame. For example:
implicit class DataFrameExtension(private val dataFrame: DataFrame) extends Serializable {
  // ... custom methods to perform some computations
}
However, a senior architect on my team refactored the code (citing style mismatch and performance as the reasons) and copied these methods into a new class. These methods now accept the DataFrame as an argument.
Can anyone help me with the following?
Do Scala's implicit classes create any overhead at run-time?
Does moving a DataFrame object between methods create any overhead, either in terms of method calls or serialization?
I have searched a bit, but couldn't find any style guide that gives guidelines on using implicit classes or methods over traditional methods.
Thanks in advance.
Do Scala's implicit classes create any overhead at run-time?
Not in your case. There can be some overhead when the wrapped type is an AnyVal (because it then needs to be boxed). Implicits are resolved at compile time, and except for maybe a few virtual method calls there should be no overhead.
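To see why, here is roughly what the compiler does with an extension-method call (a simplified, hypothetical example, not your HBase code):
object Syntax {
  implicit class RichList[A](private val underlying: List[A]) {
    def secondOption: Option[A] = underlying.drop(1).headOption
  }
}

import Syntax._
List(1, 2, 3).secondOption
// The call above is rewritten at compile time to roughly:
//   new Syntax.RichList(List(1, 2, 3)).secondOption
// i.e. one small wrapper allocation plus an ordinary method call;
// nothing is searched or resolved at run time.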
Does moving a DataFrame object between methods create any overhead, either in terms of method calls or serialization?
No, no more than for any other type. And obviously there will be no serialization.
... if I pass dataframes between methods in Spark code, it might create a closure and, as a result, will bring in the parent class that holds the dataframe object.
Only if you use variables from the enclosing scope inside your dataframe operations, for example filter($"col" === myVar) where myVar is declared in the scope of the method. In that case, Spark might have to serialize the wrapping class, but that is easy to avoid. Remember that dataframes are passed around quite often and quite deep inside Spark's own code, and probably in every other library you might be using (data sources, for example).
It is very common (and handy) to use implicit extension classes the way you did.
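As a sketch of both points (the class and method names here are illustrative, not from any particular codebase):
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col

object DataFrameSyntax {
  implicit class DataFrameExtension(private val dataFrame: DataFrame) extends Serializable {
    // Resolved at compile time; at run time this is just a thin wrapper and a method call.
    def withOnlyPositive(columnName: String): DataFrame =
      dataFrame.filter(col(columnName) > 0)
  }
}

class ReportJob(threshold: Int) { // note: not Serializable
  import DataFrameSyntax._

  def run(df: DataFrame): DataFrame = {
    // Copy the field to a local value first: a lambda that referenced
    // `threshold` directly would capture `this` (the whole ReportJob)
    // in its closure and force Spark to serialize it.
    val localThreshold = threshold
    df.withOnlyPositive("amount")
      .filter((row: Row) => row.getAs[Int]("amount") > localThreshold)
  }
}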

How to Find Available Implicit Conversions for a Given Type in Scala

I am writing an autocompleter (i.e., code completion like in Eclipse or IntelliJ) for a domain specific language that is a subset of Scala. Users frequently use implicit conversions to hide the more advanced features of Scala like options or Scalaz disjunctions.
I am looking for a way, either at compile time or runtime, to acquire a list of implicit conversions available for a receiver (i.e., for the ‘x’ in ‘val y = x.foo’). So, I have two specific questions:
Is there some library that, given the type of a receiver, can find all the implicit conversions that the compiler could use to turn that receiver into another type?*
How is the identification of available implicit conversions actually done by the Scala compiler? I am not sure where in the source to look to find it; some documentation about how the compiler does this or the location in the source where it does it would also be very helpful.
*: As you might have guessed, I plan to use the resulting list to get all the available fields and methods of all the types the given variable could be implicitly converted to so that the autocompleter can suggest them all to users. If there’s an even more direct way to do that, that would be great too.
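Not a complete answer, but two pointers. The compiler's implicit search lives in the typechecker (scala/tools/nsc/typechecker/Implicits.scala in the scalac source). And, assuming scala-compiler is on the classpath, the ToolBox API lets you invoke that search programmatically for a given expected type, which is one possible starting point for such a tool:
import scala.reflect.runtime.currentMirror
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox

val tb = currentMirror.mkToolBox()

// Ask the compiler's implicit search for an Ordering[Int]; the result is the
// tree of the implicit it found, or EmptyTree if nothing is in scope.
val found = tb.inferImplicitValue(typeOf[Ordering[Int]])
println(found) // e.g. scala.math.Ordering.Int

// There is also an inferImplicitView variant for implicit conversions from a
// given receiver to a specific target type; enumerating *all* conversions for
// a receiver still requires walking candidate target types yourself.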

Scala Case Class Map Expansion

In Groovy one can do:
class Foo {
  Integer a, b
}
Map map = [a:1, b:2]
def foo = new Foo(map) // map expanded, object created
I understand that Scala is not, in any sense of the word, Groovy, but I am wondering whether map expansion in this context is supported.
Simplistically, I tried and failed with:
case class Foo(a:Int, b:Int)
val map = Map("a"-> 1, "b"-> 2)
Foo(map: _*) // no dice, always applied to first property
A related thread that shows possible solutions to the problem.
Now, from what I've been able to dig up, as of Scala 2.9.1 at least, reflection in regard to case classes is basically a no-op. The net effect then appears to be that one is forced into some form of manual object creation, which, given the power of Scala, is somewhat ironic.
I should mention that the use case involves the servlet request parameters map. Specifically, using Lift, Play, Spray, Scalatra, etc., I would like to take the sanitized params map (filtered via routing layer) and bind it to a target case class instance without needing to manually create the object, nor specify its types. This would require "reliable" reflection and implicits like "str2Date" to handle type conversion errors.
Perhaps in 2.10, with the new reflection library, implementing the above will be a piece of cake. I'm only two months into Scala, so I'm just scratching the surface; I do not see any straightforward way to pull this off right now (though for seasoned Scala developers it may be doable).
Well, the good news is that Scala's Product interface, implemented by all case classes, actually doesn't make this very hard to do. I'm the author of a Scala serialization library called Salat that supplies some utilities for using pickled Scala signatures to get typed field information.
https://github.com/novus/salat - check out some of the utilities in the salat-util package.
Actually, I think this is something that Salat should do - what a good idea.
Re: D.C. Sobral's point about the impossibility of verifying params at compile time - point taken, but in practice this should work at runtime just like deserializing anything else with no guarantees about structure, like JSON or a Mongo DBObject. Also, Salat has utilities to leverage default args where supplied.
This is not possible, because it is impossible to verify at compile time that all parameters were passed in that map.
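For what it's worth, with the 2.10+ reflection library a runtime-only version (with no compile-time guarantees, as noted above) is short enough to sketch. fromMap below is a hypothetical helper, written against the Scala 2.11+ reflection API:
import scala.reflect.runtime.{universe => ru}

object MapBinder {
  private val mirror = ru.runtimeMirror(getClass.getClassLoader)

  // Hypothetical helper: instantiate a case class from a Map by matching map
  // keys against the primary constructor's parameter names. Throws at runtime
  // if a key is missing or a value has the wrong type.
  def fromMap[T: ru.TypeTag](values: Map[String, Any]): T = {
    val tpe         = ru.typeOf[T]
    val classMirror = mirror.reflectClass(tpe.typeSymbol.asClass)
    val ctor        = tpe.decl(ru.termNames.CONSTRUCTOR).asMethod
    val args        = ctor.paramLists.flatten.map(p => values(p.name.toString))
    classMirror.reflectConstructor(ctor)(args: _*).asInstanceOf[T]
  }
}

case class Foo(a: Int, b: Int)

MapBinder.fromMap[Foo](Map("a" -> 1, "b" -> 2)) // Foo(1,2)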