Why does a Spark RDD behave differently depending on its contents?

Based on this description of Datasets and DataFrames, I wrote this very short test code, which works.
import org.apache.spark.sql.functions._
val thing = Seq("Spark I am your father", "May the spark be with you", "Spark I am your father")
val wordsDataset = sc.parallelize(thing).toDS()
If that works... why does running this give me the following error?
error: value toDS is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.catalog.Table]
import org.apache.spark.sql.functions._
val sequence = spark.catalog.listDatabases().collect().flatMap(db =>
spark.catalog.listTables(db.name).collect()).toSeq
val result = sc.parallelize(sequence).toDS()

toDS() is not a member of RDD[T]. Welcome to the bizarre world of Scala implicits, where nothing is what it seems to be.
toDS() is a member of DatasetHolder[T]. In SparkSession, there is an object called implicits. When brought into scope with an expression like import spark.implicits._, an implicit method called rddToDatasetHolder becomes available for resolution:
implicit def rddToDatasetHolder[T](rdd: RDD[T])(implicit arg0: Encoder[T]): DatasetHolder[T]
When you call rdd.toDS(), the compiler first searches the RDD class and all of its superclasses for a method called toDS(). It doesn't find one so what it does is start searching all the compatible implicits in scope. While doing so, it finds the rddToDatasetHolder method which accepts an RDD instance and returns an object of a type which does have a toDS() method. Basically, the compiler rewrites:
sc.parallelize(sequence).toDS()
into
spark.implicits.rddToDatasetHolder(sc.parallelize(sequence)).toDS()
Now, if you look at rddToDatasetHolder itself, it has two argument lists:
(rdd: RDD[T])
(implicit arg0: Encoder[T])
Implicit arguments in Scala are optional and if you do not supply the argument explicitly, the compiler searches the scope for implicits that match the required argument type and passes whatever object it finds or can construct. In this particular case, it looks for an instance of the Encoder[T] type. There are many predefined encoders for the standard Scala types, but for most complex custom types no predefined encoders exist.
So, in short: The existence of a predefined Encoder[String] makes it possible to call toDS() on an instance of RDD[String], but the absence of a predefined Encoder[org.apache.spark.sql.catalog.Table] makes it impossible to call toDS() on an instance of RDD[org.apache.spark.sql.catalog.Table].
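One way to make the failing snippet work is to supply an Encoder[T] yourself. A rough sketch, reusing the spark and sc values from the question (TableInfo is a hypothetical case class added here for illustration): either map the Table objects to a case class, for which Spark derives an encoder automatically, or keep the original type and provide a binary Kryo encoder explicitly.
import spark.implicits._
import org.apache.spark.sql.Encoders

// Hypothetical case class holding the fields we care about; Spark can derive
// an Encoder for case classes whose fields have supported types.
case class TableInfo(database: String, name: String, tableType: String)

val sequence = spark.catalog.listDatabases().collect().flatMap(db =>
  spark.catalog.listTables(db.name).collect()).toSeq

// Option 1: map to the case class, so toDS() finds Encoder[TableInfo] implicitly.
val ds1 = sc.parallelize(sequence.map(t => TableInfo(t.database, t.name, t.tableType))).toDS()

// Option 2: keep the Catalog Table type and put a Kryo-based encoder in scope.
implicit val tableEncoder = Encoders.kryo[org.apache.spark.sql.catalog.Table]
val ds2 = sc.parallelize(sequence).toDS()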
By the way, SparkSession.implicits also contains the implicit class StringToColumn, which wraps a StringContext and has a $ method. This is how the $"foo" expression gets converted to a Column instance for the column foo.
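A quick illustration, assuming the same spark session and import spark.implicits._ as above:
val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
df.select($"key").show()   // $"key" expands to StringContext("key").$(), yielding a Column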
Resolving all the implicit arguments and implicit transformations is why compiling Scala code is so dang slow.

Related

Understanding type inference in Scala

I wrote the following simple program:
import java.util.{Set => JavaSet}
import java.util.Collections._
object Main extends App {
def test(set: JavaSet[String]) = ()
test(emptySet()) //fine
test(emptySet) //error
}
And I was really surprised that the final line, test(emptySet), did not compile. Why? How does it differ from test(emptySet())? I thought that in Scala we could freely omit parentheses in such cases.
See Method Conversions in Scala specification:
The following four implicit conversions can be applied to methods which are not applied to some argument list.
Evaluation
A parameterless method m of type => T is always converted to type T by evaluating the expression to which m is bound.
Implicit Application
If the method takes only implicit parameters, implicit arguments are passed following the rules here.
Eta Expansion
Otherwise, if the method is not a constructor, and the expected type pt is a function type (Ts′) ⇒ T′, eta-expansion is performed on the expression e.
Empty Application
Otherwise, if e has method type ()T, it is implicitly applied to the empty argument list, yielding e().
The one you want is "Empty Application", but it's only applied if none of the earlier conversions are, and in this case "Eta Expansion" happens instead.
EDIT: This was wrong and @Jasper-M's comment is right. No eta-expansion is happening; "Empty Application" is simply inapplicable to generic methods currently.
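For completeness, a minimal sketch of invocations that do compile, using the same imports as the question:
import java.util.{Set => JavaSet}
import java.util.Collections._

object Main extends App {
  def test(set: JavaSet[String]) = ()
  test(emptySet())          // explicit empty argument list; T is inferred as String
  test(emptySet[String]())  // type argument supplied explicitly
}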

Scala implicitly vs implicit arguments

I am new to Scala, and when I look at different projects, I see two styles for dealing with implicit arguments
scala> def sum[A](xs:List[A])(implicit m:Monoid[A]): A = xs.foldLeft(m.mzero)(m.mappend)
sum: [A](xs:List[A])(implicit m:Monoid[A])A
and
scala> def sum[A:Monoid](xs:List[A]): A = {
     |   val m = implicitly[Monoid[A]]
     |   xs.foldLeft(m.mzero)(m.mappend)
     | }
sum: [A](xs:List[A])(implicit evidence$1:Monoid[A])A
Based on the types of both functions, they match. Is there a difference between the two? Why would you want to use implicitly over an implicit argument? In this simple example, it feels more verbose.
When I run the above in the REPL with something that doesn't have an implicit, I get the following errors
with implicit param
<console>:11: error: could not find implicit value for parameter m: Monoid[String]
and
with implicitly and a context bound A: Monoid
<console>:11: error: could not find implicit value for evidence parameter of type Monoid[String]
In some circumstances, the implicit formal parameter is not directly used in the body of the method that takes it as an argument. Rather, it simply becomes an implicit val to be passed on to another method that requires an implicit parameter of the same (or a compatible) type. In that case, not having the overt implicit parameter list is convenient.
In other cases, the context bound notation, which is strictly syntactic sugar for an overt implicit parameter, is considered aesthetically preferable, even when the actual parameter is needed in the body and the implicitly method must therefore be used to retrieve it.
Given that there is no semantic difference between the two, the choice is predicated on fairly subjective criteria.
Do whichever you like. Lastly, note that changing from one to the other will not break any code, nor would it require recompilation (though I don't know whether SBT is discriminating enough to forgo re-compiling code that can see the changed definition).
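To make the "just pass it along" case concrete, here is a REPL-style sketch using the Monoid names from the question (the Int instance below is an illustrative assumption):
trait Monoid[A] {
  def mzero: A
  def mappend(a: A, b: A): A
}

implicit val intMonoid: Monoid[Int] = new Monoid[Int] {
  def mzero = 0
  def mappend(a: Int, b: Int) = a + b
}

def sum[A](xs: List[A])(implicit m: Monoid[A]): A = xs.foldLeft(m.mzero)(m.mappend)

// The context bound never names the instance; the compiler resolves Monoid[A]
// once and threads it through to the nested calls to sum.
def sumAll[A: Monoid](xss: List[List[A]]): A = sum(xss.map(xs => sum(xs)))

sumAll(List(List(1, 2), List(3, 4)))  // res0: Int = 10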

Collision of implicits in Scala

The following Scala code works correctly:
val str1 = "hallo"
val str2 = "huhu"
val zipped: IndexedSeq[(Char, Char)] = str1.zip(str2)
However if I import the implicit method
implicit def stringToNode(str: String): xml.Node = new xml.Text(str)
then the Scala (2.10) compiler shows an error: value zip is not a member of String
It seems that the presence of stringToNode somehow blocks the implicit conversion of str1 and str2 to WrappedString. Why? And is there a way to modify stringToNode such that zip works but stringToNode is still used when I call a function that requires a Node argument with a String?
You have ambiguous implicits here. Both StringOps and xml.Node have a zip method, therefore the implicit conversion is ambiguous and cannot be resolved. I don't know why it doesn't give a better error message.
Here are some links to back it up:
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.StringOps
and
http://www.scala-lang.org/api/current/index.html#scala.xml.Node
edit: it was StringOps, not WrappedString, changed the links :) Have a look at Predef: http://www.scala-lang.org/api/current/index.html#scala.Predef$
to see predefined implicits in Scala.
I would avoid using implicits in this case. You want two different implicit conversions which both provide a method of the same name (zip), and I don't think that is possible. Also, if you import xml.Text, you can convert with just Text(str), which should be concise enough for anyone. If you must have this implicit conversion to xml.Node, I would pack the implicit def into an object and then import it only in the places where you need it, to keep your code readable and, possibly, to avoid conflicts where you also need to zip strings. But basically, I would very much avoid using implicits just for convenient conversions.
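A REPL-style sketch of that last suggestion, with the conversion packed into an object so it only applies where it is imported (names are illustrative; the scala-xml classes from the question are assumed to be on the classpath):
object XmlConversions {
  implicit def stringToNode(str: String): xml.Node = new xml.Text(str)
}

def needsNode(node: xml.Node): String = node.toString

val zipped = "hallo".zip("huhu")        // StringOps.zip resolves as usual here

def usesConversion: String = {
  import XmlConversions.stringToNode    // conversion in scope only inside this method
  needsNode("hallo")
}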
Like @Felix wrote, it is generally a bad idea to define implicit conversions between similar data types, like the one you used. Doing that weakens the type system, leads to ambiguities like the one you encountered, and may produce extremely unclear ("magic") code which is very hard to analyze and debug.
Implicit conversions in Scala are mostly used to define lightweight, short-lived wrappers in order to enrich API of wrapped type. Implicit conversion that converts String into WrappedString falls into that category.
Twitter's Effective Scala has a section about this issue.

How does the Scala type checker know to prevent calling flatten on a List that is already flat

I'm kind of new to Scala and I have a question about the type system.
The flatten method works on nested collections, so if I have a List of Lists it will flatten it into a List. But it does not make sense to call flatten on a collection that is already flat. And sure enough the Scala type checker will flag that as an error.
List(List(1,2,3),List(4,5,6)).flatten // produces List(1,2,3,4,5,6)
List(1,2,3,4).flatten // type error
I understand that this somehow relies on an implicit parameter to flatten. But I don't know where the implicit value comes from and how it is used to assert the type of the object flatten is called on. Also, why doesn't the implicit parameter show up in the scaladocs for List.flatten?
To understand how this works, we need to have a look at the type signature of flatten:
def flatten[B](implicit asTraversable: (A) ⇒ GenTraversableOnce[B]): List[B]
A is the element type of the list and B is the type of the elements of each element. In order for flatten to work, there has to be an implicit conversion from the element types to a GenTraversableOnce[B]. That is only the case for collections or if you implement your own implicit conversion. For example, you could define one for pairs:
implicit def pairToList[A](p:(A,A)) = List(p._1, p._2)
List(1->2,2->3).flatten //compiles! List(1,2,2,3)
The trick is an implicit witness that ensures that the elements in the list are traversable. Here is the (slightly simplified) signature of flatten taken from GenericTraversableTemplate:
def flatten[B](implicit asTraversable: A => TraversableOnce[B]): GenTraversable[B]
In your case, the witness cannot be found for elements of type Int, and the invocation of flatten is therefore rejected by the compiler.
If you wanted to make your example compile, you could use the following implicit definition:
implicit def singleToList[A](a: A) = List(a)
(As a side note: I'd consider such an implicit quite dangerous, since its applicability is very general, which can result in unpleasant surprises because the compiler might inject invocations at various places without you being aware of it.)
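Putting the two answers together, a short REPL-style recap of how the witness drives what flatten accepts:
List(List(1, 2, 3), List(4, 5, 6)).flatten   // List(1, 2, 3, 4, 5, 6)

// List(1, 2, 3, 4).flatten                  // rejected: no implicit conversion from Int to a collection

implicit def pairToList[A](p: (A, A)): List[A] = List(p._1, p._2)
List(1 -> 2, 2 -> 3).flatten                 // List(1, 2, 2, 3): the pair witness is found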

Strange behavior with implicits

I'm using the Scalacheck library to test my application. In that library there's a Gen object that defines implicit conversions of any object to a generator of objects of that class.
E.g., importing Gen._ lets you call methods such as sample on any object, through its implicit conversion to Gen:
scala> import org.scalacheck.Gen._
import org.scalacheck.Gen._
scala> "foo" sample
res1: Option[java.lang.String] = Some(foo)
In this example, the implicit Gen.value() is applied to "foo", yielding a generator that always generates "foo", which is why sample returns Some(foo).
But this doesn't work:
scala> import org.scalacheck.Gen.value
import org.scalacheck.Gen.value
scala> "foo" sample
<console>:5: error: value sample is not a member of java.lang.String
"foo" sample
^
Why not?
Update
I'm using Scala 2.7.7final and ScalaCheck 2.7.7-1.6.
Update
Just switched to Scala 2.8.0.final with ScalaCheck 2.8.0-1.7. The problem did indeed go away.
I just tried this with Scala 2.8.0.final and ScalaCheck 1.7 built for the same. Both imports worked, meaning the second line produced the desired result for both imports:
scala> "foo" sample
res1: Option[java.lang.String] = Some(foo)
What version of Scala and ScalaCheck did you use?
Simple: You did not import the implicit conversion (whatever its name is), you only imported something called value from object org.scalacheck.Gen.
Correction / clarification:
Gen.value (that's object Gen, not trait Gen[+T]) is the implicit used to wrap arbitrary values in an instance of (an anonymous class implementing) trait Gen[T], whose generation function maps Gen.Params to the argument to which Gen.value was applied. Gen.sample is a method of trait Gen[T] that invokes the concrete Gen subclass's apply method to get the synthesized value.
Sadly, having looked closer, I have to admit I don't understand why the code doesn't work when the rest of the members of object Gen are not imported.