I have the following method:
def loadData(a: String, b: String)(implicit sparkContext: SparkContext): RDD[Result]
I am trying to test it using this SharedSparkContext: https://github.com/holdenk/spark-testing-base/wiki/SharedSparkContext.
So, I made my test class extend SharedSparkContext:
class Ingest$Test extends FunSuite with SharedSparkContext
And within the test method I made this call:
val res: RDD[Result] = loadData("x", "y")
However, I am getting this error:
Error:(113, 64) could not find implicit value for parameter sparkContext: org.apache.spark.SparkContext
val result: RDD[Result] = loadData("x", "y")
So how can I make the SparkContext from the testing method visible?
EDIT:
I don't see how the question is related to Understanding implicit in Scala
What is the variable name of your SparkContext? If it is 'sc', as is typically the case with SharedSparkContext, note that it is not declared implicit, so the compiler cannot supply it for you. Bring it into implicit scope with implicit val sparkContext = sc and then call your method in the same scope.
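For example, inside the test method (a minimal sketch; loadData and Result are the method and result type from the question):
implicit val sparkContext: SparkContext = sc // sc is provided by SharedSparkContext
val res: RDD[Result] = loadData("x", "y")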
Related
I want to extend the SparkSession class in Spark. I copied the constructor of the original SparkSession, partially reproduced here:
class SparkSession private(
    @transient val sparkContext: SparkContext,
    @transient private val existingSharedState: Option[SharedState],
    @transient private val parentSessionState: Option[SessionState],
    @transient private[sql] val extensions: SparkSessionExtensions)
  extends Serializable with Closeable with Logging { self =>

  private[sql] def this(sc: SparkContext) {
    this(sc, None, None, new SparkSessionExtensions)
  }

  // other implementations
}
Here's my attempt at extending it:
class CustomSparkSession private(
    @transient override val sparkContext: SparkContext,
    @transient private val existingSharedState: Option[SharedState],
    @transient private val parentSessionState: Option[SessionState],
    @transient override private[sql] val extensions: SparkSessionExtensions)
  extends SparkSession {
  // implementation
}
But I get an error on the SparkSession part of extends SparkSession:
Unspecified value parameters: sc: SparkContext
I know that it's coming from the this(sc: SparkContext) constructor in the original SparkSession, but I'm not sure how to fix it, or whether I can even extend this class properly. Any ideas?
When you write class Foo extends Bar you are actually (1) creating a default (no-argument) constructor for class Foo, and (2) calling a default constructor of class Bar.
Consequently, if you have something like class Bar(bar: String), you can't just write class Foo extends Bar, because there is no default constructor to call; you need to pass a value for bar. So you could write something like
class Foo(bar: String) extends Bar(bar)
This is why you are seeing this error - you are trying to call a constructor for SparkSession, but not passing any value for sc.
But you have a bigger problem. That private keyword you see next to SparkSession (and another one before this) means that the constructor is ... well ... private. You cannot call it. In other words, this class cannot be subclassed (outside the sql package), so you should look for another way to achieve what you are trying to do.
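If the goal is simply to add convenience methods on top of SparkSession, one alternative that avoids subclassing is an implicit wrapper class. A minimal sketch (readCsvWithHeader is a hypothetical helper, not part of Spark):

import org.apache.spark.sql.{DataFrame, SparkSession}

object CustomSessionOps {
  implicit class RichSparkSession(val spark: SparkSession) extends AnyVal {
    // hypothetical helper: read a CSV file with the header option preset
    def readCsvWithHeader(path: String): DataFrame =
      spark.read.option("header", "true").csv(path)
  }
}

After import CustomSessionOps._, the extra method is available on any existing SparkSession, including one obtained from SparkSession.builder().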
I am trying to convert a Spark RDD to a Spark SQL dataframe with toDF(). I have used this function successfully many times, but in this case I'm getting a compiler error:
error: value toDF is not a member of org.apache.spark.rdd.RDD[com.example.protobuf.SensorData]
Here is my code:
// SensorData is an auto-generated class
import com.example.protobuf.SensorData
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MyApplication {
  def loadSensorDataToRdd(): RDD[SensorData] = ???

  def main(argv: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("My application")
    conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    val sensorDataRdd = loadSensorDataToRdd()
    val sensorDataDf = sensorDataRdd.toDF() // <-- CAUSES COMPILER ERROR
  }
}
I am guessing that the problem is with the SensorData class, which is a Java class that was auto-generated from a Protocol Buffer. What can I do in order to convert the RDD to a dataframe?
The reason for the compilation error is that there's no Encoder in scope to convert an RDD of com.example.protobuf.SensorData to a Dataset of com.example.protobuf.SensorData.
Encoders (ExpressionEncoders to be exact) are used to convert InternalRow objects into JVM objects according to the schema (usually a case class or a Java bean).
Hopefully you can create an Encoder for the custom Java class using the bean method of the org.apache.spark.sql.Encoders object:
Creates an encoder for Java Bean of type T.
Something like the following:
import org.apache.spark.sql.Encoders
implicit val sensorDataEncoder = Encoders.bean(classOf[com.example.protobuf.SensorData])
If SensorData uses unsupported types, you'll have to map the RDD[SensorData] to an RDD of some simpler type(s), e.g. a tuple of the fields, and only then expect toDF to work.
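For example, something like this (a sketch; getId and getTemperature are hypothetical SensorData getters):

// map the protobuf objects to a tuple of plain fields first
val sensorDataDf = sensorDataRdd
  .map(sd => (sd.getId, sd.getTemperature))
  .toDF("id", "temperature")

This relies on the import sqlContext.implicits._ that is already in scope in the question's main method.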
I have a function in a Scala object which has the following signature
def f(v1: Int)(implicit sqlContext: SQLContext)
when I try to call this function from spark-shell I call it like
f(1)
and I expect the existing sqlContext to be passed to it implicitly, but it isn't. How can I make it work so that the sqlContext gets passed to this function automatically?
--------------update-------------------
I tried to import sqlContext.implicits._ in the spark-shell before calling my function but it didn't help
You just need to bring a SQLContext into implicit scope in the same context from which you are calling your function:
implicit val sqlContext = new SQLContext(sc) // just an example
// and then
f(1)
If you are using Apache Spark, you can use this import:
import sqlContext.implicits._
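In spark-shell specifically, a sqlContext value is already predefined (in Spark 1.x shells), so another option is to alias it into implicit scope before the call, for example:

implicit val ctx: org.apache.spark.sql.SQLContext = sqlContext // the shell's predefined sqlContext is not implicit
f(1)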
I'm struggling to understand why implicit imports do not work as I expect them to in ScalaTest. The simplified failing example (using Spark, but I can make it fail with my custom class as well) is as follows:
class FailingSpec extends FlatSpec with Matchers with MySparkContext {
  val testSqlctx = sqlctx
  import sqlctx.implicits._

  "sql context implicits" should "work" in {
    val failingDf = Seq(ID(1)).toDS.toDF
  }
}
The MySparkContext trait creates and destroys the Spark context in beforeAll and afterAll, and makes sqlctx available (that I already have to reassign it to a local variable in order to import the implicits is a puzzle in itself, but maybe for a different time). The .toDS and .toDF calls then come from implicit methods imported from sqlctx.implicits. Running the test results in a java.lang.NullPointerException.
If I move import into test block things work:
class WorkingSpec extends FlatSpec with Matchers with MySparkContext {
  "sql context implicits" should "work" in {
    val testSqlctx = sqlctx
    import sqlctx.implicits._
    val workingDf = Seq(ID(1)).toDS.toDF
  }
}
Any ideas why I can't import the implicits at the top level of the test class?
beforeAll runs before any of the tests, but does not run before the constructor for the class. The order of operations in the first snippet is:
1. Constructor invoked, executing val testSqlctx = sqlctx and import sqlctx.implicits._
2. beforeAll invoked
3. Tests run
And the order of operations for the second snippet:
1. beforeAll invoked
2. Tests run, executing val testSqlctx = sqlctx and import sqlctx.implicits._
Assuming you give your SparkContext a default (null) value and initialize it in beforeAll, the first order of operations would try to use sqlctx when it is still null, thus causing the null pointer exception.
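For illustration, a hypothetical MySparkContext along those lines (a sketch only; the real trait from the question may differ):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.scalatest.{BeforeAndAfterAll, Suite}

trait MySparkContext extends BeforeAndAfterAll { this: Suite =>
  var sc: SparkContext = _
  var sqlctx: SQLContext = _ // still null while the spec's constructor runs

  override def beforeAll(): Unit = {
    super.beforeAll()
    sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("test"))
    sqlctx = new SQLContext(sc)
  }

  override def afterAll(): Unit = {
    sc.stop()
    super.afterAll()
  }
}

With such a trait, the constructor of FailingSpec runs before beforeAll, so the class-level val testSqlctx = sqlctx and the import see sqlctx while it is still null, which is what produces the NullPointerException.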
I am trying to run a very simple Scala class in Spark with Kryo registration. This class just loads data from a file into an RDD[LabeledPoint].
The code (inspired by the one in https://spark.apache.org/docs/latest/mllib-decision-tree.html):
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object test {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("test")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.kryo.registrationRequired", "true")

    val sc = new SparkContext(conf)
    sc.getConf.registerKryoClasses(Array(classOf[org.apache.spark.mllib.regression.LabeledPoint]))
    sc.getConf.registerKryoClasses(Array(classOf[org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]]))

    // Load data
    val rawData = sc.textFile("data/mllib/sample_tree_data.csv")
    val data = rawData.map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }

    sc.stop()
    System.exit(0)
  }
}
What I understand is that, since I have set spark.kryo.registrationRequired = true, all utilized classes must be registered, which is why I have registered RDD[LabeledPoint] and LabeledPoint.
The problem
I receive the following error:
java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.mllib.regression.LabeledPoint[]
Note: To register this class use: kryo.register(org.apache.spark.mllib.regression.LabeledPoint[].class);
at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:162)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
As I understand it, it means that the class LabeledPoint[] is not registered, whereas I have registered the class LabeledPoint.
Furthermore, the code proposed in the error to register the class (kryo.register(org.apache.spark.mllib.regression.LabeledPoint[].class);) does not work.
What is the difference between the two classes?
How can I register this class?
Thanks a lot to @eliasah who contributed a lot to this answer by pointing out that the proposed solution (kryo.register(org.apache.spark.mllib.regression.LabeledPoint[].class);) is in Java and not in Scala.
Hence, what Java denotes as LabeledPoint[] is Array[LabeledPoint] in Scala.
I solved my problem by registering the Array[LabeledPoint] class, i.e. adding in my code:
sc.getConf.registerKryoClasses(Array(classOf[Array[org.apache.spark.mllib.regression.LabeledPoint]]))
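As a side note, sc.getConf returns a copy of the context's configuration, so a more defensive variant is to register everything on the SparkConf before the SparkContext is created (a sketch of that variant):

val conf = new SparkConf().setMaster("local").setAppName("test")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrationRequired", "true")
// register both the element class and its array class
conf.registerKryoClasses(Array(
  classOf[org.apache.spark.mllib.regression.LabeledPoint],
  classOf[Array[org.apache.spark.mllib.regression.LabeledPoint]]
))
val sc = new SparkContext(conf)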