SQLContext implicits - Scala

I am learning Spark and Scala. I am well versed in Java, but not so much in Scala. I am going through a tutorial on Spark and came across the following line of code, which has not been explained:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
(sc is the SparkContext instance)
I know the concepts behind Scala implicits (at least I think I do). Could somebody explain what exactly the import statement above means? What implicits are bound to the sqlContext instance when it is instantiated, and how? Are these implicits defined inside the SQLContext class?
EDIT
The following seems to work for me as well (fresh code):
val sqlc = new SQLContext(sc)
import sqlContext.implicits._
In the code just above, what exactly is sqlContext and where is it defined?

From ScalaDoc:
sqlContext.implicits contains "(Scala-specific) Implicit methods available in Scala for converting common Scala objects into DataFrames."
It is also explained in the Spark programming guide:
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
For example, in the code below, .toDF() won't work unless you import sqlContext.implicits._:
import scala.io.Source  // needed for Source.fromFile below

// assumes a case class Airport with seven String fields and an airportsPath value defined elsewhere
val airports = sc.makeRDD(Source.fromFile(airportsPath).getLines().drop(1).toSeq, 1)
  .map(s => s.replaceAll("\"", "").split(","))
  .map(a => Airport(a(0), a(1), a(2), a(3), a(4), a(5), a(6)))
  .toDF()
What implicits are bound to the sqlContext instance when it is instantiated and how? Are these implicits defined inside the SQLContext class?
Yes, they are defined in the object implicits inside the SQLContext class, which extends SQLImplicits. It looks like there are two kinds of implicit conversions defined there:
An RDD to DataFrameHolder conversion, which enables the above-mentioned rdd.toDF().
Various instances of Encoder, which are "used to convert a JVM object of type T to and from the internal Spark SQL representation."
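To see how the language mechanism itself works, here is a minimal, self-contained sketch of importing implicit conversions from a member object of an instance. It is deliberately simplified and is not Spark's actual code:
class MiniContext {
  object implicits {
    // an implicit class that adds an extension method to List, much like toDF is added to RDD
    implicit class ListOps[A](val xs: List[A]) {
      def describe(): String = s"a list of ${xs.size} element(s)"
    }
  }
}

val ctx = new MiniContext
import ctx.implicits._            // analogous to import sqlContext.implicits._
List(1, 2, 3).describe()          // compiles only because of the import above
This is also why the import has to come after the instance exists, and why the instance must be a stable identifier (a val).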

Related

sparkpb UDF compile giving "error: could not find implicit value for evidence parameter of type frameless.TypedEncoder[Array[Byte]]"

I'm a Scala newbie, using PySpark extensively (on Databricks, FWIW). I'm finding that Protobuf deserialization is too slow for me in Python, so I'm porting my deserialization UDF to Scala.
I've compiled my .proto files to Scala and then into a JAR using scalapb, as described here.
When I try to use these instructions to create a UDF like this:
import gnmi.gnmi._
import org.apache.spark.sql.{Dataset, DataFrame, functions => F}
import spark.implicits.StringToColumn
import scalapb.spark.ProtoSQL
// import scalapb.spark.ProtoSQL.implicits._
import scalapb.spark.Implicits._
val deserialize_proto_udf = ProtoSQL.udf { bytes: Array[Byte] => SubscribeResponse.parseFrom(bytes) }
I get the following error:
command-4409173194576223:9: error: could not find implicit value for evidence parameter of type frameless.TypedEncoder[Array[Byte]]
val deserialize_proto_udf = ProtoSQL.udf { bytes: Array[Byte] => SubscribeResponse.parseFrom(bytes) }
I've double-checked that I'm importing the correct implicits, to no avail. I'm pretty fuzzy on implicits, evidence parameters, and Scala in general.
I would really appreciate it if someone would point me in the right direction. I don't even know how to start diagnosing!
Update
It seems like frameless doesn't include an implicit encoder for Array[Byte]?
This works:
frameless.TypedEncoder[Byte]
this does not:
frameless.TypedEncoder[Array[Byte]]
The code for frameless.TypedEncoder seems to include a generic Array encoder, but I'm not sure I'm reading it correctly.
@Dymtro, thanks for the suggestion. That helped.
Does anyone have ideas about what is going on here?
Update
OK, progress: this looks like a Databricks issue. I think that the notebook does something like the following on startup:
import spark.implicits._
I'm using scalapb, which requires that you don't do that.
I'm hunting for a way to disable that automatic import now, or "unimport" or "shadow" those modules after they get imported.
If spark.implicits._ is already imported, then a way to "unimport" (hide or shadow) those implicits is to create a duplicate object and import it too:
import org.apache.spark.sql.{SQLContext, SQLImplicits}

object implicitShadowing extends SQLImplicits with Serializable {
  // never called at runtime: the object exists only so that its members shadow spark.implicits._
  protected override def _sqlContext: SQLContext = ???
}
import implicitShadowing._
Testing with case class Person(id: Long, name: String):
// no import
List(Person(1, "a")).toDS() // doesn't compile, value toDS is not a member of List[Person]
import spark.implicits._
List(Person(1, "a")).toDS() // compiles
import spark.implicits._
import implicitShadowing._
List(Person(1, "a")).toDS() // doesn't compile, value toDS is not a member of List[Person]
How to override an implicit value?
Wildcard Import, then Hide Particular Implicit?
How to override an implicit value, that is imported?
How can an implicit be unimported from the Scala repl?
Not able to hide Scala Class from Import
NullPointerException on implicit resolution
Constructing an overridable implicit
Caching the circe implicitly resolved Encoder/Decoder instances
Scala implicit def do not work if the def name is toString
Is there a workaround for this format parameter in Scala?
Please check whether this helps.
A possible problem is that you don't want to just unimport spark.implicits._ (scalapb.spark.Implicits._); you probably want to import scalapb.spark.ProtoSQL.implicits._ too. And I don't know whether implicitShadowing._ shadows some of those as well.
Another possible workaround is to resolve implicits manually and use them explicitly.
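As a generic sketch of that idea (the names below are illustrative, not scalapb's API): summon the instance once with implicitly, so the compiler tells you exactly which one, if any, is in scope, and then pass it explicitly to whatever method normally takes it as evidence.
// sketch only: summoning the implicit by hand; a compile error here pinpoints the missing instance
val enc = implicitly[frameless.TypedEncoder[Array[Byte]]]
// someUdfBuilder(myFunction)(enc)   // hypothetical call site: pass the evidence explicitly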

How to fix "error: encountered unrecoverable cycle resolving import"?

How to resolve the following compile error?
SOApp.scala:7: error: encountered unrecoverable cycle resolving import.
Note: this is often due in part to a class depending on a definition nested within its companion.
If applicable, you may wish to try moving some members into another object.
import spark.implicits._
Code:
object SOApp extends App with Logging {

  // For implicit conversions like converting RDDs to DataFrames
  import spark.implicits._

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession
    .builder()
    .appName("Stackoverflow App")
    .master("local[*]")
    .getOrCreate()
}
tl;dr Move import spark.implicits._ after val spark = SparkSession...getOrCreate().
That name spark causes a lot of confusion, since it could refer to the org.apache.spark package as well as to the spark value.
Unlike Java, Scala allows import statements in many more places.
What you could consider a Spark SQL idiom is to create a spark value that gives access to the implicits. In Scala, you can only bring implicits into scope from stable identifiers (like values), so the following is correct:
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
And as your comment says, it is there to bring in implicit conversions, such as converting RDDs to DataFrames (among other things).
This does not import the org.apache.spark package; it is for the implicit conversions.
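Putting the tl;dr into code, a corrected version of the snippet from the question (same names) looks like this:
import org.apache.spark.sql.SparkSession

object SOApp extends App with Logging {

  val spark = SparkSession
    .builder()
    .appName("Stackoverflow App")
    .master("local[*]")
    .getOrCreate()

  // For implicit conversions like converting RDDs to DataFrames;
  // the import now refers to the spark value defined just above
  import spark.implicits._
}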

saveasTextFile("path") in scala

I used Scala in Spark and tried to save my file to HDFS, but I got an error.
I have tried rdd.saveAsTextFile("path"), sc.saveAsTextFile("path"), and
saveAsTextFile("path")
scala> inputJPG.map(x=>x.split("")).map(array=>array(0)).sc.saveAsTextFile("/loudacre/iplist")
<console>:28: error: value sc is not a member of Array[String]
inputJPG.map(x=>x.split("")).map(array=>array(0)).sc.saveAsTextFile("/loudacre/iplist")
I'm a new student and still learning Scala, so I'm not familiar with RDDs and Scala functions.
Turning to my problem: I found that the cause was that I had
val xx = data.collect()
before, which returns an Array rather than an RDD, so I couldn't call data.saveAsTextFile in Spark.
So I dropped the .collect() call, i.e. val xx = data,
and then data.saveAsTextFile("path") works.
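A sketch of the same fix applied to the original snippet, assuming inputJPG itself is kept as an RDD (i.e. never collected); the intermediate val name is illustrative:
// keep the data as an RDD (no collect()) so saveAsTextFile is available on it
val ips = inputJPG.map(x => x.split("")).map(array => array(0))
ips.saveAsTextFile("/loudacre/iplist")   // called on the RDD itself, not via sc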

How to make SparkSession and Spark SQL implicits globally available (in functions and objects)?

I have a project with many .scala files inside a package. I want to use Spark SQL as follows:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark: SparkSession = SparkSession.builder()
.appName("My app")
.config("spark.master", "local")
.getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
Is it a good practice to wrap the above code inside a singleton object like:
object sparkSessX {

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  val spark: SparkSession = SparkSession.builder()
    .appName("My App")
    .config("spark.master", "local")
    .getOrCreate()

  // For implicit conversions like converting RDDs to DataFrames
  import spark.implicits._
}
and have every class extend or import that object?
I've never seen it before, but the more Scala developers use Spark, the more new design patterns emerge. That could be one of them.
I think you could instead consider making val spark implicit and passing it around where needed through this implicit context (as the second argument list of your functions).
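A minimal sketch of that approach (the helper function name and its body are just illustrative):
import org.apache.spark.sql.{DataFrame, SparkSession}

// hypothetical helper: the SparkSession arrives through the implicit parameter list
def loadNames(path: String)(implicit spark: SparkSession): DataFrame = {
  import spark.implicits._          // implicits come from whichever session was passed in
  spark.read.textFile(path).map(_.trim).toDF("name")
}

// at the call site, declare the session implicit once and just call the function
implicit val spark: SparkSession = SparkSession.builder().master("local[*]").getOrCreate()
val names = loadNames("/tmp/names.txt")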
I'd however consider making the object a trait (you can't extend a Scala object), and moreover to leave room for mixing other traits into your classes.
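A sketch of that trait-based variant (the trait and class names are illustrative; the builder settings mirror the question):
import org.apache.spark.sql.SparkSession

trait SparkSessionWrapper {
  // lazy, so the session is only created when a mixing-in class first touches it
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("My App")
    .config("spark.master", "local")
    .getOrCreate()
}

class MyJob extends SparkSessionWrapper {
  import spark.implicits._          // implicits are imported from the inherited spark value
  def run(): Unit = Seq(1, 2, 3).toDF("n").show()
}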

Spark 2.0 Scala - RDD.toDF()

I am working with Spark 2.0 Scala. I am able to convert an RDD to a DataFrame using the toDF() method.
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
But for the life of me I cannot find where this is in the API docs. It is not under RDD, but it is under Dataset (link 1). However, I have an RDD, not a Dataset.
Also, I can't see it under implicits (link 2).
So please help me understand why toDF() can be called on my RDD. Where does this method come from?
It's coming from here:
Spark 2 API
Explanation: if you import sqlContext.implicits._, you get an implicit method that converts an RDD to a DatasetHolder (rddToDatasetHolder), and then you call toDF on the DatasetHolder.
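Roughly what the compiler does when you write rdd.toDF() after the import, with the implicit conversion spelled out by hand (a sketch; rddToDatasetHolder is the conversion defined in SQLImplicits):
import sqlContext.implicits._

val rdd = sc.textFile("/pathtologfile/logfile.txt")   // RDD[String]
val holder = rddToDatasetHolder(rdd)                  // the implicit conversion, written explicitly
val df = holder.toDF()                                // toDF lives on DatasetHolder, not on RDD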
Yes, you should import the sqlContext implicits like that:
val sqlContext = ... // create the sqlContext
import sqlContext.implicits._
val df = rdd.toDF() // rdd being your RDD
before you call toDF on your RDDs.
Yes, I finally found peace of mind on this issue. It was troubling me like hell; this post is a life saver. I was trying to generically load data from log files into a case class object, keeping it as a mutable List, with the idea of finally converting the list into a DataFrame. However, as it was mutable and Spark 2.1.1 had changed the toDF implementation, the list was not getting converted. I had even considered saving the data to a file and loading it back using .read. But five minutes ago this post saved my day.
I did it exactly the way described: after loading the data into the mutable list, I immediately used
import spark.sqlContext.implicits._
val df = <mutable list object>.toDF
df.show()
I have done just this with Spark 2, and it worked:
val orders = sc.textFile("/user/gd/orders")
val ordersDF = orders.toDF()