Spark toDF cannot resolve symbol after importing sqlContext implicits - scala

I'm working on writing some unit tests for my Scala Spark application. To do so I need to create different DataFrames in my tests, so I wrote a very short DFsBuilder class that basically lets me add rows one at a time and eventually build the DataFrame. The code is:
class DFsBuilder[T](private val sqlContext: SQLContext, private val columnNames: Array[String]) {
  var rows = new ListBuffer[T]()

  def add(row: T): DFsBuilder[T] = {
    rows += row
    this
  }

  def build(): DataFrame = {
    import sqlContext.implicits._
    rows.toList.toDF(columnNames: _*) // UPDATE: added ": _*" because it was accidentally removed in the original question
  }
}
However, the call to toDF doesn't compile; the compiler reports "cannot resolve symbol toDF".
I wrote this builder with generics since I need to create different kinds of DataFrames (different numbers of columns and different column types). The way I would like to use it is to define a case class in the unit test and use it as the builder's type parameter.
I know this issue somehow relates to the fact that I'm using generics (probably some kind of type erasure issue), but I can't quite put my finger on what exactly the problem is.
And so my questions are:
Can anyone show me where the problem is, and hopefully how to fix it?
If this issue cannot be solved this way, could someone perhaps offer another elegant way to create DataFrames? (I prefer not to pollute my unit tests with the creation code.)
I obviously googled this issue first, but only found examples where people forgot to import sqlContext.implicits._, or something about a case class being out of scope, which is probably not the same issue I'm having.
Thanks in advance

If you look at the signatures of toDF and of SQLImplicits.localSeqToDataFrameHolder (which is the implicit function used), you'll be able to detect two issues:
Type T must be a subtype of Product (the supertype of all case classes and tuples), and you must provide an implicit TypeTag for it. To fix this, change the declaration of your class to:
class DFsBuilder[T <: Product : TypeTag](...) { ... }
The columnNames argument is not of type Array; it's a "repeated parameter" (like Java's varargs; see section 4.6.2 of the Scala Language Specification), so you have to expand the array into individual arguments:
rows.toList.toDF(columnNames: _*)
With these two changes, your code compiles (and works).
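Putting both fixes together, a complete version of the builder might look like the sketch below. The Person case class, the column names, and the sqlContext value in the usage snippet are made up for illustration:

import scala.collection.mutable.ListBuffer
import scala.reflect.runtime.universe.TypeTag

import org.apache.spark.sql.{DataFrame, SQLContext}

// T is bounded by Product and carries a TypeTag, so the implicit
// conversions from sqlContext.implicits can derive the schema.
class DFsBuilder[T <: Product : TypeTag](private val sqlContext: SQLContext,
                                         private val columnNames: Array[String]) {
  private val rows = new ListBuffer[T]()

  def add(row: T): DFsBuilder[T] = {
    rows += row
    this
  }

  def build(): DataFrame = {
    import sqlContext.implicits._
    rows.toList.toDF(columnNames: _*)
  }
}

// Hypothetical usage in a test. Note that the case class should be defined
// at the top level of the file (not inside a test method), otherwise Spark
// may fail to derive a schema for it.
case class Person(name: String, age: Int)

val df = new DFsBuilder[Person](sqlContext, Array("name", "age"))
  .add(Person("alice", 30))
  .add(Person("bob", 25))
  .build()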

Related

Cast a DataFrame Entry to a Case Class with Any-type Member [duplicate]

I recently moved from Spark 1.6 to Spark 2.X, and I would like to move - where possible - from DataFrames to Datasets as well. I tried code like this:
case class MyClass(a : Any, ...)
val df = ...
df.map(x => MyClass(x.get(0), ...))
As you can see, MyClass has a field of type Any, as I do not know at compile time the type of the field I retrieve with x.get(0). It may be a Long, String, Int, etc.
However, when I try to execute code similar to what you see above, I get an exception:
java.lang.ClassNotFoundException: scala.Any
With some debugging, I realized that the exception is raised not because my data is of type Any, but because MyClass has a field of type Any. So how can I use Datasets then?
Unless you're interested in limited and ugly workarounds like Encoders.kryo:
import org.apache.spark.sql.Encoders

case class FooBar(foo: Int, bar: Any)

spark.createDataset(
  sc.parallelize(Seq(FooBar(1, "a")))
)(Encoders.kryo[FooBar])
or
spark.createDataset(
  sc.parallelize(Seq(FooBar(1, "a"))).map(x => (x.foo, x.bar))
)(Encoders.tuple(Encoders.scalaInt, Encoders.kryo[Any]))
you don't. All fields / columns in a Dataset have to be of a known, homogeneous type for which there is an implicit Encoder in scope. There is simply no place for Any there.
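For contrast, with fully concrete field types the built-in encoders are derived automatically. A minimal sketch, assuming a SparkSession named spark is in scope (the TypedFooBar name is invented):

case class TypedFooBar(foo: Int, bar: String)

import spark.implicits._
// The Encoder is derived from the concrete field types; no Kryo needed.
val ds = Seq(TypedFooBar(1, "a")).toDS()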
The UDT API provides a bit more flexibility and allows for limited polymorphism, but it is private, not fully compatible with the Dataset API, and comes with a significant performance and storage penalty.
If, for a given execution, all values are of the same type, you can of course create specialized classes and decide at run time which one to use, as sketched below.
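A minimal sketch of that idea (the class names and the (Int, Any) row shape are invented for illustration):

import org.apache.spark.sql.{Dataset, SparkSession}

// One fully typed class per possible runtime type:
case class FooLong(foo: Int, bar: Long)
case class FooString(foo: Int, bar: String)

// Inspect the values at run time and build the matching class:
def toDataset(spark: SparkSession, rows: Seq[(Int, Any)]): Dataset[_] = {
  import spark.implicits._
  rows.headOption.map(_._2) match {
    case Some(_: Long)   => rows.map { case (i, v) => FooLong(i, v.asInstanceOf[Long]) }.toDS()
    case Some(_: String) => rows.map { case (i, v) => FooString(i, v.asInstanceOf[String]) }.toDS()
    case _               => spark.emptyDataset[FooString]
  }
}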

Scala - How to get a Vector's contained class?

Does Scala have a way to get the contained class(es) of a collection? i.e. if I have:
val foo = Vector[Int]()
Is there a way to get back classOf[Int] from it?
(Just checking the first element doesn't work since it might be empty.)
You can use TypeTag:
import scala.reflect.runtime.universe._

def getType[F[_], A: TypeTag](as: F[A]) = typeOf[A]

val foo = Vector[Int]()
getType(foo) // returns the Type for Int, even though foo is empty
Not from the collection itself, but if you receive it as a parameter of a method, you can add an implicit TypeTag to that method to obtain the type at runtime. E.g.
def mymethod[T](x: Vector[T])(implicit tag: TypeTag[T]) = ...
See https://docs.scala-lang.org/.../typetags-manifests.html for details.
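A fleshed-out version of that sketch might look like this (the method body is illustrative; the implicit TypeTag parameter is the point):

import scala.reflect.runtime.universe._

def mymethod[T](x: Vector[T])(implicit tag: TypeTag[T]): String =
  s"Vector of ${tag.tpe}" // works even when x is empty

mymethod(Vector[Int]())    // "Vector of Int"
mymethod(Vector("a", "b")) // "Vector of String"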
Technically you can do it by using TypeTag, or Typeable/TypeCase from the Shapeless library (see link). But I just want to note that all these tricks are really very advanced solutions for when there is no better way to get the task done without digging into type parameters.
All type parameters in Scala and Java are affected by type erasure at runtime, and if you catch yourself thinking about extracting this information from a class, it might be a good sign that the solution you are trying to implement needs a redesign.
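For completeness, a small sketch of the Shapeless route mentioned above (my own reconstruction, since the original link is gone; note that TypeCase inspects the elements at runtime, so an empty Vector matches any element type):

import shapeless.TypeCase

val vectorOfInt = TypeCase[Vector[Int]]

def describe(value: Any): String = value match {
  case vectorOfInt(v) => s"a Vector[Int] with ${v.size} elements"
  case _              => "something else"
}

describe(Vector(1, 2, 3)) // "a Vector[Int] with 3 elements"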

Scala Unit Testing - Mocking an implicitly wrapped function

I have a question concerning unit tests that I'm trying to write using Mockito in Scala. I've also looked at ScalaMock, but it sounds like the feature is not provided there either. I suppose that maybe I'm looking at the solution too narrowly and there might be a different perspective or approach to what I'm doing, so all your opinions are welcome.
Basically, I want to mock a function that is made available on an object through an implicit conversion, and I don't have any control over how that is done, since I'm a user of the library. The concrete example is similar to the following scenario:
val rdd: RDD[T] = // existing RDD
val sqlContext: SQLContext = // existing SQLContext

import sqlContext.implicits._
rdd.toDF()
/* toDF() doesn't originally exist on RDD but is implicitly added when importing sqlContext.implicits._ */
Now, in the tests, I'm mocking the rdd and the sqlContext and I want to mock the toDF() function. I can't mock toDF() since it doesn't exist on the RDD level. Even if I do a simple trick, importing the mocked sqlContext.implicits._, I get an error that any function that is not publicly available on the object can't be mocked. I even tried to mock the code that is implicitly executed up to toDF(), but I get stuck with final/private (inaccessible) classes that I also can't mock. Your suggestions are more than welcome. Thanks in advance :)
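One common workaround for this kind of problem (a sketch of a general technique, not an answer taken from this thread) is to stop trying to mock the implicit machinery and instead hide the conversion behind a small trait that the production code depends on; the test then stubs the trait rather than toDF() itself. All names below are invented for illustration:

import scala.reflect.runtime.universe.TypeTag

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

// A seam that production code calls instead of rdd.toDF() directly.
trait DataFrameConverter {
  def toDF[T <: Product : TypeTag](rdd: RDD[T], sqlContext: SQLContext): DataFrame
}

// Real implementation: the implicit conversion is confined to one place.
class SparkDataFrameConverter extends DataFrameConverter {
  def toDF[T <: Product : TypeTag](rdd: RDD[T], sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._
    rdd.toDF()
  }
}

// In a test, provide a stub instead of mocking anything implicit:
class StubConverter(fixed: DataFrame) extends DataFrameConverter {
  def toDF[T <: Product : TypeTag](rdd: RDD[T], sqlContext: SQLContext): DataFrame = fixed
}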

How to interpret a val in Scala that is of type Option[T]

I am trying to analyze Scala code written by someone else, and in doing so I would like to write unit tests (which, unfortunately, were not written along with the original code).
Being a relative newbie to Scala, especially in the area of Futures, I am trying to understand the following line of code:
val niceAnalysis:Option[(niceReport) => Future[niceReport]] = None
Update:
The above line of code should be:
val niceAnalysis: Option[(NiceReport) => Future[NiceReport]] = None
where NiceReport is a case class.
----------- Update ends here -----------
Since I am trying to mock up an Actor, I created this new Actor where I introduce my niceAnalysis val as a field.
The first problem I see with this "niceAnalysis" thing is that it looks like an anonymous function.
How do I "initialize" this val, i.e. give it an initial value?
My goal is to create a test in my test class, where I am going to pass in this initialized val value into my test actor's receive method.
My naive approach to accomplish this looked like:
val myActorUnderTestRef = TestActorRef(new MyActorUnderTest("None"))
IntelliJ doesn't like it either, and my SBT compile and test fail.
So, I need to understand the "niceAnalysis" declaration first, and then understand how to give it an initial value. Please advise.
You are correct that this is a value that might contain a function from type niceReport to Future[niceReport]. You can pass an anonymous function or just a reference to an existing function. The reference might be the easiest to understand, so I will provide that first, but in the longer term the anonymous function is most likely the easier option, so I will show it second:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

def strToFuture(x: String) = Future { x } // merely wrap the string in a Future
val foo = Option(strToFuture _)           // the trailing underscore converts the method into a function value
Conversely, the one-liner is as follows:
val foo = Option((x: String) => Future { x })
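To connect this back to the question: to give the field a non-empty initial value, wrap the function in Some(...), and use map (or pattern matching) to apply it only when it is present. A small hypothetical sketch, using the NiceReport case class from the update:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

case class NiceReport(text: String)

// "Empty" value, as in the original line of code:
val noAnalysis: Option[NiceReport => Future[NiceReport]] = None

// Initialized with an actual analysis function:
val someAnalysis: Option[NiceReport => Future[NiceReport]] =
  Some(report => Future { report.copy(text = report.text.toUpperCase) })

// Applying the function only when present:
val result: Option[Future[NiceReport]] =
  someAnalysis.map(analyse => analyse(NiceReport("hello")))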

Collision of implicits in Scala

The following Scala code works correctly:
val str1 = "hallo"
val str2 = "huhu"
val zipped: IndexedSeq[(Char, Char)] = str1.zip(str2)
However, if I import the implicit method
implicit def stringToNode(str: String): xml.Node = new xml.Text(str)
then the Scala (2.10) compiler shows an error: value zip is not a member of String
It seems that the presence of stringToNode somehow blocks the implicit conversion of str1 and str2 to WrappedString. Why? And is there a way to modify stringToNode such that zip still works, while stringToNode is still applied when I pass a String to a function that requires a Node argument?
You have ambiguous implicits here. Both StringOps and xml.Node have a zip method, therefore the implicit conversion is ambiguous and cannot be resolved. I don't know why it doesn't give a better error message.
Here are some links to back it up:
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.StringOps
and
http://www.scala-lang.org/api/current/index.html#scala.xml.Node
edit: it was StringOps, not WrappedString, changed the links :) Have a look at Predef: http://www.scala-lang.org/api/current/index.html#scala.Predef$
to see predefined implicits in Scala.
I would avoid using implicits in this case. You want two different implicit conversions which both provide a method of the same name (zip), and I don't think this is possible. Also, if you import xml.Text, you can convert with just Text(str), which should be concise enough for anyone. If you must have this implicit conversion to xml.Node, I would pack the implicit def into an object and then import it only in the places where you need it, to keep your code readable and to avoid conflicts in places where you also need to zip strings (see the sketch below). But basically, I would very much avoid using implicits just for convenient conversions.
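A minimal sketch of that scoping approach (the object name XmlConversions and the needsNode function are invented for illustration):

object XmlConversions {
  implicit def stringToNode(str: String): xml.Node = new xml.Text(str)
}

def needsNode(node: xml.Node): String = node.toString // stand-in for a function requiring a Node

val zipped = "hallo".zip("huhu") // fine here: no conflicting conversion in scope

def render(): String = {
  import XmlConversions._ // bring the conversion in only where it is needed
  needsNode("hallo")      // stringToNode converts the String to a Node
}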
Like @Felix wrote, it is generally a bad idea to define implicit conversions between similar data types, like the one you used. Doing that weakens the type system, leads to ambiguities like the one you encountered, and may produce extremely unclear ("magic") code which is very hard to analyze and debug.
Implicit conversions in Scala are mostly used to define lightweight, short-lived wrappers in order to enrich the API of the wrapped type. The implicit conversion that converts String into WrappedString falls into that category.
Twitter's Effective Scala has a section about this issue.