My requirement is to call a "Spark Scala" function from an existing PySpark program.
What is the best way to pass sparkSession created in PySpark program to Scala function.
I pass my scala jar to Pyspark as follows.
spark-submit --jars ScalaExample-0.1.jar pyspark_call_scala_example.py iris.data
Scalacode
def getDf(spark: SparkSession, query:String, df: DataFrame, log: Logger): DataFrame = {
import spark.implicits._
val df = spark.sql(query)
df
}
Pysparkcode
if __name__ == '__main__':
query = sys.argv[1]
spark = SparkSession \
.builder \
.appName("PySpark using Scala example") \
.getOrCreate()
log4jLogger = sc._jvm.org.apache.log4j
log = log4jLogger.LogManager.getLogger(__name__)
query_df = DataFrame(sc._jvm.com.crowdstrike.dsci.sparkjobs.PythonHelper.getDf(???, query, ???), sqlContext)
Question
How to pass sparksession and logger to getDf ?
https://www.crowdstrike.com/blog/spark-hot-potato-passing-dataframes-between-scala-spark-and-pyspark/
To pass SparkSession from Python to Scala, use spark._jsparkSession.
Getting error: error: Unable to coerce 'startDate' to a formatted date (long) when I ran spark submit as below:
dse -u cassandra -p cassandra spark-submit --class com.abc.rm.Total_count \
--master dse://x.x.x.x:9042 TotalCount.jar \
"2024-06-11 00:00:00.000+0000" "2027-11-15 00:00:00.000+0000" \
10-118-16-132.bbc.ds.com pramod history
Below is my code:
package com.abc.rm
import com.datastax.spark.connector._
import org.apache.spark.SparkContext
object Total_count {
def main(args: Array[String]):Unit = {
var startDate = args(0)
var endDate = args(1)
val master = args(2)
var ks = args(3)
var table_name = args(4)
println("startDate-->"+startDate)
println("endDate-->"+endDate)
println("master-->"+master)
val conf = new org.apache.spark.SparkConf().setAppName("Total_count")
.set("spark.cassandra.connection.host", master)
.set("spark.cassandra.auth.username","cassandra")
.set("spark.cassandra.auth.password","cassandra")
var sc = new SparkContext(conf)
val rdd = sc.cassandraTable("pramod", "history")
.where("sent_date>='startDate' and sent_date <='endDate'")
.cassandraCount()
println("count--> "+rdd)
sc.stop()
System.exit(1)
}}
How can I pass/convert the argument.
You aren't passing the arguments, but instead passing the strings startDate and endDate literally. To make it working you need to write it as
.where(s"sent_date>='$startDate' and sent_date <='$endDate'")
spark 2.7 scala 2.12.7 ,when i use spark-submit submit a simple project --WordCount, i ensure package and className is OK, but still have a error
java.lang.ClassNotFoundException
as my code:
1../bin/spark-submit --master spark://localhost.localdomain:7077 --class sparkTes.WordCount.scala /java/spark/scala.jar
2.enter image description here
3.spark code
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("wordcount");
val sc = new SparkContext(conf)
val input = sc.textFile("/java/text/scala.md", 2).cache()
val lines = input.flatMap(line=>line.split(" "))
val count = lines.map(word => (word,1)).reduceByKey{case (x,y)=>x+y}
val output = count.saveAsTextFile("/java/text/WordCount")
}
I have compiled Spark scala program on command line. But now I want to execute it. I dont want to use Maven or sbt.
the program .I have used the command to execute the
scala -cp ".:sparkDIrector/jars/*" wordcount
But I am getting this error
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
import org.apache.spark._
import org.apache.spark.SparkConf
/** Create a RDD of lines from a text file, and keep count of
* how often each word appears.
*/
object wordcount1 {
def main(args: Array[String]) {
// Set up a SparkContext named WordCount that runs locally using
// all available cores.
println("before conf")
val conf = new SparkConf().setAppName("WordCount")
conf.setMaster("local[*]")
val sc = new SparkContext(conf)
println("after the textfile")
// Create a RDD of lines of text in our book
val input = sc.textFile("book.txt")
println("after the textfile")
// Use flatMap to convert this into an rdd of each word in each line
val words = input.flatMap(line => line.split(' '))
// Convert these words to lowercase
val lowerCaseWords = words.map(word => word.toLowerCase())
// Count up the occurence of each unique word
println("before text file")
val wordCounts = lowerCaseWords.countByValue()
// Print the first 20 results
val sample = wordCounts.take(20)
for ((word, count) <- sample) {
println(word + " " + count)
}
sc.stop()
}
}
It is showing that the error is at location
val conf = new SparkConf().setAppName("WordCount").
Any help?
Starting from Spark 2.0 the entry point is the SparkSession:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder
.appName("App Name")
.getOrCreate()
Then you can access the SparkContext and read the file with:
spark.sparkContext().textFile(yourFileOrURL)
Remember to stop your session at the end:
spark.stop()
I suggest you to have a look at these examples: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples
Then, to launch your application, you have to use spark-submit:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
In your case, it will be something like:
./bin/spark-submit \
--class wordcount1 \
--master local \
/path/to/your.jar
I have tried to write a transform method from DataFrame to DataFrame.
And I also want to test it by scalatest.
As you know, in Spark 2.x with Scala API, you can create SparkSession object as follows:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.bulider
.config("spark.master", "local[2]")
.getOrCreate()
This code works fine with unit tests.
But, when I run this code with spark-submit, the cluster options did not work.
For example,
spark-submit --master yarn --deploy-mode client --num-executors 10 ...
does not create any executors.
I have found that the spark-submit arguments are applied when I remove config("master", "local[2]") part of the above code.
But, without master setting the unit test code did not work.
I tried to split spark (SparkSession) object generation part to test and main.
But there is so many code blocks needs spark, for example import spark.implicit,_ and spark.createDataFrame(rdd, schema).
Is there any best practice to write a code to create spark object both to test and to run spark-submit?
One way is to create a trait which provides the SparkContext/SparkSession, and use that in your test cases, like so:
trait SparkTestContext {
private val master = "local[*]"
private val appName = "testing"
System.setProperty("hadoop.home.dir", "c:\\winutils\\")
private val conf: SparkConf = new SparkConf()
.setMaster(master)
.setAppName(appName)
.set("spark.driver.allowMultipleContexts", "false")
.set("spark.ui.enabled", "false")
val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
val sc: SparkContext = ss.sparkContext
val sqlContext: SQLContext = ss.sqlContext
}
And your test class header then looks like this for example:
class TestWithSparkTest extends BaseSpec with SparkTestContext with Matchers{
I made a version where Spark will close correctly after tests.
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite, Matchers}
trait SparkTest extends FunSuite with BeforeAndAfterAll with Matchers {
var ss: SparkSession = _
var sc: SparkContext = _
var sqlContext: SQLContext = _
override def beforeAll(): Unit = {
val master = "local[*]"
val appName = "MyApp"
val conf: SparkConf = new SparkConf()
.setMaster(master)
.setAppName(appName)
.set("spark.driver.allowMultipleContexts", "false")
.set("spark.ui.enabled", "false")
ss = SparkSession.builder().config(conf).getOrCreate()
sc = ss.sparkContext
sqlContext = ss.sqlContext
super.beforeAll()
}
override def afterAll(): Unit = {
sc.stop()
super.afterAll()
}
}
The spark-submit command with parameter --master yarn is setting yarn master.
And this will be conflict with your code master("x"), even using like master("yarn").
If you want to use import sparkSession.implicits._ like toDF ,toDS or other func,
you can just use a local sparkSession variable created like below:
val spark = SparkSession.builder().appName("YourName").getOrCreate()
without setting master("x") in spark-submit --master yarn, not in local machine.
I advice : do not use global sparkSession in your code. That may cause some errors or exceptions.
hope this helps you.
good luck!
How about defining an object in which a method creates a singleton instance of SparkSession, like MySparkSession.get(), and pass it as a paramter in each of your unit tests.
In your main method, you can create a separate SparkSession instance, which can have different configurations.