How does the $ symbol work when selecting columns from a DataFrame? - scala

When selecting columns from a DataFrame, one can use $"columnname", col("columnname"), or just "columnname".
My question is how the $ symbol (which returns a ColumnName) works. I understand that I need to import sqlContext.implicits._ to use $ in df.select.
I don't see a $ method on the SQLImplicits class, though. I can see one method named symbolToColumn(scala.Symbol s).
Can someone explain more on this?

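It comes from the StringToColumn implicit inner class in SQLImplicits (which is implemented by the implicits object). That class wraps a StringContext and defines a $ method on it.
StringContext is the mechanism behind s, f, and other string interpolators in Scala, so $"columnname" is simply an interpolated string that uses $ as its interpolator.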
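To make the mechanism concrete, here is a minimal Spark-free sketch of how such an interpolator can be defined. The ColumnName case class and the DollarInterpolatorDemo object are hypothetical stand-ins written for this illustration, not Spark's actual source:

```scala
object DollarInterpolatorDemo {
  // Hypothetical stand-in for Spark's ColumnName; it only remembers the name.
  final case class ColumnName(name: String)

  // An implicit class wrapping StringContext adds a new interpolator:
  // the method name ($) becomes the prefix written before the string literal.
  implicit class StringToColumn(val sc: StringContext) {
    def $(args: Any*): ColumnName = ColumnName(sc.s(args: _*))
  }

  // The interpolated form ...
  val viaInterpolator: ColumnName = $"columnname"

  // ... is rewritten by the compiler to roughly this explicit call.
  val desugared: ColumnName =
    new StringToColumn(StringContext("columnname")).$()
}
```

Both values end up holding the same ColumnName, which is why no method called $ needs to exist on the DataFrame or on SQLImplicits itself.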

Related

Scala | Spark | Invoking undefined method

I am new to Scala and trying to grasp the language fundamentals. I have working knowledge of Spark with the Java API.
I am having a hard time understanding some Scala code, and therefore I am not able to write the same in Java. I got this piece of code from https://learn.microsoft.com/en-us/azure/cosmos-db/spark-connector
// Import Necessary Libraries
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.config.Config
// Read Configuration
val readConfig = Config(Map(
"Endpoint" -> "https://doctorwho.documents.azure.com:443/",
"Masterkey" -> "YOUR-KEY-HERE",
"Database" -> "DepartureDelays",
"Collection" -> "flights_pcoll",
"query_custom" -> "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'" // Optional
))
// Connect via azure-cosmosdb-spark to create Spark DataFrame
val flights = spark.read.cosmosDB(readConfig)
flights.count()
As far as I know, the read method returns an object of type org.apache.spark.sql.DataFrameReader, which does not have a cosmosDB() method, so how does this code work? Also, how do I convert this code to Java?
Thank You
What you are seeing is the magic of Scala implicit conversions. The compiler sees that you intend to call the cosmosDB method of a DataFrameReader and that there's no method of that name with the proper signature, as you note.
When you
import com.microsoft.azure.cosmosdb.spark.schema._
you also import the contents of the package object (current git commit as of this writing, last updated in 2017 so it's stable code). The relevant bit that gets imported is
implicit def toDataFrameReaderFunctions(dfr: DataFrameReader): DataFrameReaderFunctions
An implicit def which takes one argument signals to the compiler that, if this def is in scope, the compiler can insert a call to this method if:
it has a DataFrameReader
a method is being called which is not a member of DataFrameReader
com.microsoft.azure.cosmosdb.spark.schema.DataFrameReaderFunctions has a member with the desired name and signature
Since DataFrameReaderFunctions has a method cosmosDB, the compiler then translates your code to
toDataFrameReaderFunctions(spark.read).cosmosDB(readConfig)
This general approach of using an implicit conversion to make it look like you're adding methods to a type without modifying the type is called enrichment or an extension method. Implicit conversions in general should probably be avoided: they very often make code hard to follow, and an errant implicit conversion in scope can make code compile that you never intended to compile. For an enrichment like this, there's an alternative: use an implicit class. The compiler essentially autogenerates the implicit conversion to a dedicated wrapper class, so you gain the extra methods without enabling surprising conversions between unrelated existing types (such as using an Int in place of a String).
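As a rough illustration of both styles, here is a small Spark-free sketch. Reader and ReaderFunctions are hypothetical stand-ins for DataFrameReader and DataFrameReaderFunctions, and the method bodies are invented purely for the demo:

```scala
import scala.language.implicitConversions

object EnrichmentDemo {
  // Hypothetical stand-ins for DataFrameReader and DataFrameReaderFunctions.
  final class Reader(val source: String)
  final class ReaderFunctions(reader: Reader) {
    def cosmosDB(collection: String): String = s"${reader.source}/$collection"
  }

  // Style 1: an explicit implicit conversion, as in the connector's package object.
  implicit def toReaderFunctions(reader: Reader): ReaderFunctions =
    new ReaderFunctions(reader)

  // Style 2: an implicit class; the compiler generates the conversion for you.
  implicit class ReaderOps(reader: Reader) {
    def describe: String = s"Reader(${reader.source})"
  }

  val reader = new Reader("cosmos")

  // Reader has no cosmosDB method, so the compiler inserts toReaderFunctions.
  val implicitCall: String = reader.cosmosDB("flights_pcoll")

  // This is what the compiler effectively generates.
  val explicitCall: String = toReaderFunctions(reader).cosmosDB("flights_pcoll")

  val described: String = reader.describe
}
```

Java has no implicit conversions, so a Java caller would have to spell out the equivalent of the explicit form itself.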

No implicits found for parameter evidence

I have a line of code in a scala app that takes a dataframe with one column and two rows, and assigns them to variables start and end:
val Array(start, end) = datesInt.map(_.getInt(0)).collect()
This code works fine when run in a REPL, but when I try to put the same line in a Scala object in IntelliJ, it inserts a grey (?: Encoder[Int]) before the .collect() call and shows an inline error: No implicits found for parameter evidence$6: Encoder[Int]
I'm pretty new to scala and I'm not sure how to resolve this.
Spark needs an Encoder to convert JVM types to and from its internal representation, for example when data moves between the executors and the driver. For some types encoders can be generated automatically, and for others the Spark developers have written explicit implementations. In either case they can be passed implicitly. If your SparkSession is named spark, then you are missing the following line:
import spark.implicits._
As you are new to Scala: implicits are parameters that you don't have to pass explicitly. In your example, the map function requires an Encoder[Int]. Adding this import brings one into scope, so it is passed to map automatically.
Check Scala documentation to learn more.
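If it helps, the same resolution mechanism can be sketched without Spark. Enc, the nested implicits object, and mapAll below are hypothetical stand-ins for Encoder, spark.implicits, and Dataset.map, with simplified signatures:

```scala
object EvidenceDemo {
  // Hypothetical stand-in for Spark's Encoder[T].
  trait Enc[T] { def encode(value: T): String }

  // Stand-in for what `import spark.implicits._` brings into scope.
  object implicits {
    implicit val intEnc: Enc[Int] = new Enc[Int] {
      def encode(value: Int): String = s"int:$value"
    }
  }

  // `T: Enc` is a context bound. The compiler turns it into an extra implicit
  // parameter, which is the "evidence" named in the error message.
  def mapAll[T: Enc](values: Seq[T]): Seq[String] =
    values.map(implicitly[Enc[T]].encode)

  import implicits._ // without this import, the next line does not compile

  val encoded: Seq[String] = mapAll(Seq(1, 2))
}
```

The REPL case likely worked because the Spark shell already performs the equivalent import for you at startup.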

Spark Notebook: Does GeoPointsChart accept a Dataframe?

I have a Dataframe which has two columns latitude and longitude. I passed that to GeoPointsChart. The output is "showing 1000 rows" but it isn't actually showing me anything. Has anyone faced the same issue? Is this a syntactical mistake?
I have not worked with this notebook, but it looks like you have an API call somewhere producing a java.util.List. That class does not have a toSeq method. You want to convert java.util.List into its Scala equivalent.
First, import this:
import scala.collection.JavaConverters._
This import enriches (or pimps) the Java collections with an asScala method to do the conversion:
val testAsScala = test.asScala.toSeq
I would note though that the call to toSeq is unnecessary since the result already mixes in Seq. Still, with asScala you can now work entirely with Scala collections, which are so much easier.
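A minimal self-contained version of that conversion might look like the following. The test value here is built with java.util.Arrays.asList as an assumption, since the original API call is not shown:

```scala
import scala.collection.JavaConverters._

object JavaListDemo {
  // Assumed stand-in for whatever API call returns a java.util.List.
  val test: java.util.List[Int] = java.util.Arrays.asList(1, 2, 3)

  // asScala wraps the Java list in a Scala Buffer; toSeq then yields a Seq.
  val testAsScala: Seq[Int] = test.asScala.toSeq
}
```

On Scala 2.13 and later, scala.collection.JavaConverters is deprecated in favour of scala.jdk.CollectionConverters, and because the default Seq is immutable there, toSeq copies the Buffer into an immutable sequence rather than being a no-op.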

What's the meaning of "$" in Dataset's operators (like select or filter)?

I am a bit confused about using $ to reference columns in DataFrame operators like select or filter.
The following statements work:
df.select("app", "renders").show
df.select($"app", $"renders").show
But, only the first statement in the following works:
df.filter("renders = 265").show // <-- this works
df.filter($"renders" = 265).show // <-- this does not work (!) Why?!
However, this again works:
df.filter($"renders" > 265).show
Basically, what is this $ in DataFrame's operators and when/how should I use it?
Implicits are a major feature of the Scala language that take a lot of different forms--like implicit classes as we will see shortly. They have different purposes, and they all come with varying levels of debate regarding how useful or dangerous they are. Ultimately though, implicits generally come down to simply having the compiler convert one class to another when you bring them into scope.
Why does this matter? Because in Spark there is an implicit class called StringToColumn that endows a StringContext with additional functionality. Specifically, StringToColumn adds the $ method to the Scala class StringContext. This method produces a ColumnName, which extends Column.
The end result of all this is that the $ method allows you to treat the name of a column, represented as a String, as if it were the Column itself. Implicits, when used wisely, can produce convenient conversions like this to make development easier.
So let's use this to understand what you found:
df.select("app","renders").show -- succeeds because select takes multiple Strings
df.select($"app",$"renders").show -- succeeds because select takes multiple Columns, which result after the implicit conversions are applied
df.filter("renders = 265").show -- succeeds because Spark supports SQL-like filters
df.filter($"renders" = 265).show -- fails because $"renders" is of type Column after implicit conversion, and Columns use the custom === operator for equality (unlike the case in SQL).
df.filter($"renders" > 265).show -- succeeds because you're using a Column after implicit conversion and > is a function on Column.
$ is a way to convert a string to the column with that name.
Both forms of select work because select is overloaded to accept either columns or strings.
When you do the filter, $"renders" = 265 is an attempt to assign a number to the column, whereas > is a comparison method. You should be using === instead of =.
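To see why === and > behave differently from =, here is a toy model that does not need Spark. Col is a hypothetical stand-in for Spark's Column, and the expression strings it builds are invented for the illustration:

```scala
object ColumnEqualityDemo {
  // Hypothetical stand-in for Spark's Column: operators build an expression
  // description instead of comparing anything immediately.
  final case class Col(expr: String) {
    def ===(other: Any): Col = Col(s"($expr = $other)")
    def >(other: Any): Col = Col(s"($expr > $other)")
  }

  val renders = Col("renders") // plays the role of $"renders"

  // Both operators are ordinary methods, so both lines compile.
  val isEqual: Col = renders === 265
  val isGreater: Col = renders > 265

  // `renders = 265` would not compile: = is Scala's assignment syntax rather
  // than a method, so no class can define it as a comparison operator.
}
```

The same reasoning is why a library has to pick a different spelling such as === for column equality.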

Spark toDF cannot resolve symbol after importing sqlContext implicits

I'm working on writing some unit tests for my Scala Spark application
In order to do so I need to create different dataframes in my tests. So I wrote a very short DFsBuilder code that basically allows me to add new rows and eventually create the DF. The code is:
import org.apache.spark.sql.{DataFrame, SQLContext}
import scala.collection.mutable.ListBuffer

class DFsBuilder[T](private val sqlContext: SQLContext, private val columnNames: Array[String]) {
  var rows = new ListBuffer[T]()

  def add(row: T): DFsBuilder[T] = {
    rows += row
    this
  }

  def build(): DataFrame = {
    import sqlContext.implicits._
    rows.toList.toDF(columnNames:_*) // UPDATE: added :_* because it was accidentally removed in the original question
  }
}
However, the toDF call fails to compile with the error: cannot resolve symbol toDF.
I wrote this builder code with generics since I need to create different kinds of DFs (different numbers of columns and different column types). The way I would like to use it is to define a case class in the unit test and use it with the builder.
I know this issue somehow relates to the fact that I'm using generics (probably some kind of type erasure issue), but I can't quite put my finger on what the problem is exactly.
And so my questions are:
Can anyone show me where the problem is, and hopefully how to fix it?
If this issue cannot be solved this way, could someone perhaps offer another elegant way to create dataframes? (I prefer not to pollute my unit tests with the creation code.)
I obviously googled this issue first, but only found examples where people forgot to import sqlContext.implicits or had a case class defined out of scope, which is probably not the same issue I'm having.
Thanks in advance
If you look at the signatures of toDF and of SQLImplicits.localSeqToDataFrameHolder (which is the implicit function used) you'll be able to detect two issues:
Type T must be a subclass of Product (the superclass of all case classes, tuples...), and you must provide an implicit TypeTag for it. To fix this - change the declaration of your class to:
class DFsBuilder[T <: Product : TypeTag](...) { ... }
The columnNames argument is not of type Array, it's a "repeated parameter" (like Java's "varargs", see section 4.6.2 here), so you have to convert the array into arguments:
rows.toList.toDF(columnNames: _*)
With these two changes, your code compiles (and works).
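The two language features involved can be tried out without Spark. In this sketch, label is a hypothetical helper and Person an example case class; TypeTag is left out because it needs the scala-reflect library, so only the Product bound and the repeated parameter are shown:

```scala
object BuilderConstraintsDemo {
  // T <: Product gives generic access to a row's fields via productIterator,
  // and `String*` is a repeated parameter, like toDF's column names.
  def label[T <: Product](row: T, columnNames: String*): Seq[String] =
    columnNames.zip(row.productIterator.toSeq).map { case (n, v) => s"$n=$v" }

  final case class Person(id: Int, name: String)

  val names: Array[String] = Array("id", "name")

  // An Array is not a repeated parameter; `: _*` expands it into arguments.
  val labelled: Seq[String] = label(Person(1, "Ann"), names: _*)
}
```

Passing names without : _* would be rejected here, because an Array[String] is not a String, which mirrors the second issue above.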