I am new to Scala and trying to grasp the language fundamentals. I have working knowledge of Spark with the Java API.
I am having a hard time understanding some Scala code, and therefore I am not able to write the same in Java. I found this piece of code at https://learn.microsoft.com/en-us/azure/cosmos-db/spark-connector
// Import Necessary Libraries
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.config.Config
// Read Configuration
val readConfig = Config(Map(
"Endpoint" -> "https://doctorwho.documents.azure.com:443/",
"Masterkey" -> "YOUR-KEY-HERE",
"Database" -> "DepartureDelays",
"Collection" -> "flights_pcoll",
"query_custom" -> "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'" // Optional
))
// Connect via azure-cosmosdb-spark to create Spark DataFrame
val flights = spark.read.cosmosDB(readConfig)
flights.count()
As far as I know, the read method returns an object of type org.apache.spark.sql.DataFrameReader, and this does not have any method cosmosDB(), so how is this code working? Also, how do I convert this code to Java?
Thank You
What you are seeing is the magic of Scala implicit conversions. The compiler sees that you intend to call the cosmosDB method of a DataFrameReader and that there's no method of that name with the proper signature, as you note.
When you
import com.microsoft.azure.cosmosdb.spark.schema._
you also import the contents of the package object (current git commit as of this writing, last updated in 2017 so it's stable code). The relevant bit that gets imported is
implicit def toDataFrameReaderFunctions(dfr: DataFrameReader): DataFrameReaderFunctions
An implicit def which takes one argument signals to the compiler that, if this def is in scope, the compiler can insert a call to this method if:
it has a DataFrameReader
a method is being called which is not a member of DataFrameReader
com.microsoft.azure.cosmosdb.spark.schema.DataFrameReaderFunctions has a member with the desired name and signature
Since DataFrameReaderFunctions has a method cosmosDB, the compiler then translates your code to
toDataFrameReaderFunctions(spark.read).cosmosDB(readConfig)
This general approach of using an implicit conversion to make it look like you're adding methods to a type without modifying that type is called enrichment, or an extension method. Implicit conversions in general should probably be avoided: they very often make code hard to follow, and an errant implicit conversion in scope can make code you don't intend to compile compile. For an enrichment like this, there's an alternative: use an implicit class, where the compiler essentially autogenerates the implicit conversion; because the only conversion target is the wrapper class itself, it can't be abused to, say, pass an Int where a String is expected.
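For illustration, here is a toy enrichment written as an implicit class (just a sketch of the pattern; the names StringSyntax, RichString and shout are made up and have nothing to do with the connector's actual code):
object StringSyntax {
  // The compiler autogenerates an implicit conversion String => RichString,
  // which it uses to resolve the extra method.
  implicit class RichString(val underlying: String) extends AnyVal {
    def shout: String = underlying.toUpperCase + "!"
  }
}
import StringSyntax._
println("hello".shout) // compiles as if you had written new RichString("hello").shout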
Based on this description of datasets and dataframes, I wrote this very short test code, which works.
import org.apache.spark.sql.functions._
val thing = Seq("Spark I am your father", "May the spark be with you", "Spark I am your father")
val wordsDataset = sc.parallelize(thing).toDS()
If that works... why does running this give me a
error: value toDS is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.catalog.Table]
import org.apache.spark.sql.functions._
val sequence = spark.catalog.listDatabases().collect().flatMap(db =>
spark.catalog.listTables(db.name).collect()).toSeq
val result = sc.parallelize(sequence).toDS()
toDS() is not a member of RDD[T]. Welcome to the bizarre world of Scala implicits where nothing is what it seems to be.
toDS() is a member of DatasetHolder[T]. SparkSession has an object called implicits. When it is brought into scope with an expression like import spark.implicits._ (where spark is your SparkSession), an implicit method called rddToDatasetHolder becomes available for resolution:
implicit def rddToDatasetHolder[T](rdd: RDD[T])(implicit arg0: Encoder[T]): DatasetHolder[T]
When you call rdd.toDS(), the compiler first searches the RDD class and all of its superclasses for a method called toDS(). It doesn't find one so what it does is start searching all the compatible implicits in scope. While doing so, it finds the rddToDatasetHolder method which accepts an RDD instance and returns an object of a type which does have a toDS() method. Basically, the compiler rewrites:
sc.parallelize(sequence).toDS()
into
spark.implicits.rddToDatasetHolder(sc.parallelize(sequence)).toDS()
Now, if you look at rddToDatasetHolder itself, it has two argument lists:
(rdd: RDD[T])
(implicit arg0: Encoder[T])
Implicit arguments in Scala are optional and if you do not supply the argument explicitly, the compiler searches the scope for implicits that match the required argument type and passes whatever object it finds or can construct. In this particular case, it looks for an instance of the Encoder[T] type. There are many predefined encoders for the standard Scala types, but for most complex custom types no predefined encoders exist.
So, in short: The existence of a predefined Encoder[String] makes it possible to call toDS() on an instance of RDD[String], but the absence of a predefined Encoder[org.apache.spark.sql.catalog.Table] makes it impossible to call toDS() on an instance of RDD[org.apache.spark.sql.catalog.Table].
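One possible workaround (just a sketch, assuming the Spark shell where spark and sc are predefined) is to put an Encoder for the missing type into implicit scope yourself, for example a Kryo-based one:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalog.Table
import spark.implicits._
// Supply the evidence that rddToDatasetHolder needs; Kryo stores the objects as binary blobs.
implicit val tableEncoder: Encoder[Table] = Encoders.kryo[Table]
val tables = spark.catalog.listDatabases().collect().flatMap(db =>
  spark.catalog.listTables(db.name).collect()).toSeq
val result = sc.parallelize(tables).toDS() // now compiles: Encoder[Table] is found in scope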
By the way, spark.implicits also contains the implicit class StringToColumn, which has a $ method. This is how the $"foo" expression gets converted to a Column instance for column foo.
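For example (a quick sketch, assuming a live SparkSession named spark):
import spark.implicits._
val df = Seq(("SEA", 10), ("SFO", 20)).toDF("origin", "delay")
df.select($"origin").show() // $"origin" goes through StringToColumn and becomes a Column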
Resolving all the implicit arguments and implicit transformations is why compiling Scala code is so dang slow.
I am trying to write some use cases for Apache Flink. One error I run into pretty often is
could not find implicit value for evidence parameter of type org.apache.flink.api.common.typeinfo.TypeInformation[SomeType]
My problem is that I can't really nail down when they happen and when they don't.
The most recent example of this would be the following
...
val largeJoinDataGen = new LargeJoinDataGen(dataSetSize, dataGen, hitRatio)
val see = StreamExecutionEnvironment.getExecutionEnvironment
val newStreamInput = see.addSource(largeJoinDataGen)
...
where LargeJoinDataGen extends GeneratorSource[(Int, String)] and GeneratorSource[T] extends SourceFunction[T], both defined in separate files.
When trying to build this I get
Error:(22, 39) could not find implicit value for evidence parameter of type org.apache.flink.api.common.typeinfo.TypeInformation[(Int, String)]
val newStreamInput = see.addSource(largeJoinDataGen)
1. Why is there an error in the given example?
2. What would be a general guideline when these errors happen and how to avoid them in the future?
P.S.: This is my first Scala project and my first Flink project, so please be patient.
Instead of defining the implicit value yourself, you can simply add this import:
import org.apache.flink.streaming.api.scala._
That will also fix the error.
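For instance, with that import in scope, a source of (Int, String) compiles without any hand-written implicit. This is a minimal sketch using an inline SourceFunction instead of the question's LargeJoinDataGen:
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._ // brings the implicit TypeInformation derivation into scope
val see = StreamExecutionEnvironment.getExecutionEnvironment
val src = new SourceFunction[(Int, String)] {
  override def run(ctx: SourceFunction.SourceContext[(Int, String)]): Unit = ctx.collect((1, "one"))
  override def cancel(): Unit = ()
}
val newStreamInput: DataStream[(Int, String)] = see.addSource(src) // TypeInformation[(Int, String)] is derived implicitly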
This mostly happens when you have user code, i.e. a source or a map function or something of that nature that has a generic parameter. In most cases you can fix that by adding something like
import org.apache.flink.api.common.typeinfo.TypeInformation
implicit val typeInfo: TypeInformation[(Int, String)] = TypeInformation.of(classOf[(Int, String)])
If your code is inside another method that has a generic parameter you can also try adding a context bound to the generic parameter of the method, as in
def myMethod[T: TypeInformation](input: DataStream[Int]): DataStream[T] = ...
My problem is that I can't really nail down when they happen and when they don't.
They happen when an implicit parameter is required. If we look at the method definition we see:
def addSource[T: TypeInformation](function: SourceFunction[T]): DataStream[T]
But we don't see any implicit parameter defined, where is it?
When you see a polymorphic method whose type parameter is of the form
def foo[T : M](param: T)
where T is the type parameter and M is a context bound, it means that the creator of the method is requesting an implicit parameter of type M[T]. It is equivalent to:
def foo[T](param: T)(implicit ev: M[T])
In the case of your method, it is actually expanded to:
def addSource[T](function: SourceFunction[T])(implicit evidence: TypeInformation[T]): DataStream[T]
This is why you see the compiler complaining, as it can't find the implicit parameter the method is requiring.
If we go to the Apache Flink Wiki, under Type Information, we can see why this happens:
No Implicit Value for Evidence Parameter Error
In the case where TypeInformation could not be created, programs fail to compile with an error stating “could not find implicit value for evidence parameter of type TypeInformation”.
A frequent reason is that the code that generates the TypeInformation has not been imported. Make sure to import the entire flink.api.scala package.
import org.apache.flink.api.scala._
For generic methods, you'll need to require them to generate a TypeInformation at the call-site as well:
For generic methods, the data types of the function parameters and return type may not be the same for every call and are not known at the site where the method is defined. The code above will result in an error that not enough implicit evidence is available.
In such cases, the type information has to be generated at the invocation site and passed to the method. Scala offers implicit parameters for that.
Note that import org.apache.flink.streaming.api.scala._ may also be necessary.
For your types this means that if the invoking method is generic, it also needs to request the context bound for its type parameter, as in the sketch below.
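For example, a helper that wraps addSource must itself carry the context bound so the evidence can be forwarded (a sketch; addMySource is a hypothetical helper, not part of Flink's API):
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._
// The [T: TypeInformation] context bound on the helper forwards the implicit
// evidence down to addSource.
def addMySource[T: TypeInformation](env: StreamExecutionEnvironment, source: SourceFunction[T]): DataStream[T] =
  env.addSource(source)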
Also note that different Scala minor versions (2.11, 2.12, etc.) are not binary compatible.
The following Maven configuration is wrong even if you use import org.apache.flink.api.scala._ :
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<scala.version>2.12.8</scala.version>
<scala.binary.version>2.11</scala.binary.version>
</properties>
Correct configuration in Maven:
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<scala.version>2.12.8</scala.version>
<scala.binary.version>2.12</scala.binary.version>
</properties>
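The same consistency matters in an sbt build, shown here as a sketch (the Flink artifact and version are only examples): the %% operator appends the Scala binary suffix, so it must line up with scalaVersion.
// build.sbt (sketch)
scalaVersion := "2.12.8"
// %% appends "_2.12" automatically, keeping the binary version consistent
libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % "1.9.0"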
I'm working on writing some unit tests for my Scala Spark application.
In order to do so I need to create different dataframes in my tests, so I wrote a very short DFsBuilder class that basically lets me add new rows and eventually create the DF. The code is:
class DFsBuilder[T](private val sqlContext: SQLContext, private val columnNames: Array[String]) {
var rows = new ListBuffer[T]()
def add(row: T): DFsBuilder[T] = {
rows += row
this
}
def build() : DataFrame = {
import sqlContext.implicits._
rows.toList.toDF(columnNames: _*) // UPDATE: added :_* because it was accidentally removed in the original question
}
}
However, the toDF call doesn't compile, failing with cannot resolve symbol toDF.
I wrote this builder with generics since I need to create different kinds of DFs (different numbers of columns and different column types). The way I would like to use it is to define a case class in the unit test and use it with the builder.
I know this issue somehow relates to the fact that I'm using generics (probably some kind of type erasure issue), but I can't quite put my finger on what the problem is exactly.
And so my questions are:
Can anyone show me where the problem is? And also hopefully how to fix it
If this issue cannot be solved this way, could someone perhaps offer another elegant way to create dataframes? (I prefer not to pollute my unit tests with the creation code)
I obviously googled this issue first, but only found examples where people forgot to import sqlContext.implicits or had a case class defined out of scope, which is probably not the same issue I'm having.
Thanks in advance
If you look at the signatures of toDF and of SQLImplicits.localSeqToDataFrameHolder (which is the implicit function used) you'll be able to detect two issues:
Type T must be a subclass of Product (the supertype of all case classes, tuples, etc.), and you must provide an implicit TypeTag for it. To fix this, change the declaration of your class to:
class DFsBuilder[T <: Product : TypeTag](...) { ... }
The columnNames argument of toDF is not of type Array, it's a "repeated parameter" (like Java's varargs; see section 4.6.2 of the Scala Language Specification), so you have to expand the array into individual arguments:
rows.toList.toDF(columnNames: _*)
With these two changes, your code compiles (and works).
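Putting both fixes together, the builder could look like this (a sketch that mirrors the names used in the question):
import scala.collection.mutable.ListBuffer
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.{DataFrame, SQLContext}
class DFsBuilder[T <: Product : TypeTag](private val sqlContext: SQLContext, private val columnNames: Array[String]) {
  private val rows = new ListBuffer[T]()
  def add(row: T): DFsBuilder[T] = {
    rows += row
    this
  }
  def build(): DataFrame = {
    import sqlContext.implicits._
    rows.toList.toDF(columnNames: _*) // expand the array into repeated parameters
  }
}
Usage would then be along the lines of new DFsBuilder[(String, Int)](sqlContext, Array("name", "age")).add(("a", 1)).build(), since tuples and case classes are subtypes of Product with TypeTags available.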
I've made use of a few of scala's built-in type classes, and created a few of my own. However, the biggest issue I have with them at the moment is: how do I find type classes available to me? While most of those that I write are small and simple, it would be nice to know if something already exists that does what I'm about to implement!
So, is there a list, somewhere, of all the type classes or implicit values available in the standard library?
Even better, is it possible to somehow (probably within the REPL) generate a list of the implicit values available in the current scope?
It's a job for a good IDE.
IntelliJ IDEA 14+
Check out Implicits analyser in Scala Plugin 1.4.x. Example usage:
def myMethod(implicit a: Int) = {
}
implicit val a: Int = 1
myMethod // click myMethod and press Ctrl+Shift+P; the "Implicit Parameters" popup is shown
Eclipse
Check out Implicit highlighting.
Scala REPL
You can list implicits like this:
:implicits -v
And investigate where implicits come from with something like this:
import reflect.runtime.universe
val tree = universe.reify(1 to 4).tree
universe.showRaw(tree)
universe.show(tree)
The following Scala code works correctly:
val str1 = "hallo"
val str2 = "huhu"
val zipped: IndexedSeq[(Char, Char)] = str1.zip(str2)
However if I import the implicit method
implicit def stringToNode(str: String): xml.Node = new xml.Text(str)
then the Scala (2.10) compiler shows an error: value zip is not a member of String
It seems that the presence of stringToNode somehow blocks the implicit conversion of str1 and str2 to WrappedString. Why? And is there a way to modify stringToNode such that zip works but stringToNode is still used when I call a function that requires a Node argument with a String?
You have ambiguous implicits here. Both StringOps and xml.Node have a zip method, therefore the implicit conversion is ambiguous and cannot be resolved. I don't know why it doesn't give a better error message.
Here are some links to back it up:
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.StringOps
and
http://www.scala-lang.org/api/current/index.html#scala.xml.Node
edit: it was StringOps, not WrappedString, changed the links :) Have a look at Predef: http://www.scala-lang.org/api/current/index.html#scala.Predef$
to see predefined implicits in Scala.
I would avoid using implicits in this case. You want two different implicit conversions that both provide a method of the same name (zip), and I don't think that can be resolved. Also, if you import xml.Text, you can convert with just Text(str), which should be concise enough for anyone. If you must have this implicit conversion to xml.Node, I would pack the implicit def into an object and then import it only in the places where you need it, to keep your code readable and, possibly, to avoid conflicts where you also need to zip strings; see the sketch below. But basically, I would very much avoid using implicits just for convenient conversions.
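A rough sketch of that scoping approach (renderNode is just a stand-in for any function that takes an xml.Node):
import scala.language.implicitConversions
object XmlConversions {
  implicit def stringToNode(str: String): xml.Node = new xml.Text(str)
}
def renderNode(node: xml.Node): String = node.toString // stand-in for a Node-taking API
val zipped = "hallo".zip("huhu") // still compiles: no competing conversion in scope here
def example: String = {
  import XmlConversions.stringToNode // the conversion is visible only inside this method
  renderNode("hallo")                // String is implicitly converted to xml.Node
}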
As @Felix wrote, it is generally a bad idea to define implicit conversions between similar data types like the one you used. Doing so weakens the type system, leads to ambiguities like the one you encountered, and can produce extremely unclear ("magic") code that is very hard to analyze and debug.
Implicit conversions in Scala are mostly used to define lightweight, short-lived wrappers in order to enrich the API of the wrapped type. The implicit conversion that converts a String into a WrappedString falls into that category.
Twitter's Effective Scala has a section about this issue.