I need to pass additional input parameters to a Spark job for validation. I know that after the uber.jar we can pass all required parameters separated by spaces. Is there an option to read parameters like the ones below using Scala?
spark-submit --jar uber.jar -Dtable.name=emp -Dfiltercondition=age,name
The -D format is for Java system properties, not CLI arguments.
Spark passes arguments to your application's main method like any other Java/Scala program:
object App {
  def main(args: Array[String]): Unit = {
    val cmd: CommandLine = parseArg(args) // <-- here
    val master = cmd.getOptionValue("master", "local[*]") // parse args
    val spark = SparkSession.builder()
      .appName(App.getClass.getName)
      .master(master)
      .getOrCreate()
    ...
  }

  // Using Apache Commons CLI
  private def parseArg(args: Array[String]): CommandLine = {
    import org.apache.commons.cli._
    val options = new Options
    ...
  }
}
Then: spark-submit --class my.app.App app.jar --master 'local[*]'
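For reference, a minimal sketch of what parseArg could look like with Apache Commons CLI (the option names and descriptions here are illustrative, not part of the original answer):

import org.apache.commons.cli.{CommandLine, DefaultParser, Options}

// inside object App
private def parseArg(args: Array[String]): CommandLine = {
  val options = new Options()
    .addOption("m", "master", true, "Spark master URL")
    .addOption("t", "table", true, "table name to process")
    .addOption("f", "filter", true, "comma-separated filter columns")
  // DefaultParser accepts both --master local[*] and --master=local[*]
  new DefaultParser().parse(options, args)
}

Application arguments go after the jar on the command line, e.g. spark-submit --class my.app.App app.jar --master 'local[*]' --table emp --filter age,name.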
I'm writing unit tests for Spark Scala code and facing this issue: when I run the unit test files separately everything is fine, but when I run all of the unit tests in the module using Maven, the test cases fail.
How can we create a local instance of Spark, or a mock, for unit tests?
Cannot call methods on a stopped SparkContext. This stopped SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
Methods I tried:
Creating a private SparkSession for each unit test file.
Creating a common SparkSession trait for all unit test files.
Calling spark.stop() at the end of each file, and also removing it from all files.
To reproduce, make two unit test files and try to execute them together. Both files have their own SparkSession:
import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.scalatest.flatspec.AnyFlatSpec

class test1 extends AnyFlatSpec {
  val spark: SparkSession = SparkSession.builder
    .master("local[*]")
    .getOrCreate()
  val sc: SparkContext = spark.sparkContext
  val sqlCont: SQLContext = spark.sqlContext

  "test1" should "take spark session spark context and sql context" in {
    // do something
  }
}

class test2 extends AnyFlatSpec {
  val spark: SparkSession = SparkSession.builder
    .master("local[*]")
    .getOrCreate()
  val sc: SparkContext = spark.sparkContext
  val sqlCont: SQLContext = spark.sqlContext

  "test2" should "take spark session spark context and sql context" in {
    // do something
  }
}
When you run these files independently each one works fine, but when you run them together using mvn test they fail.
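One pattern that usually fixes this (a sketch of the common workaround, not from the original post): obtain the session through a shared trait and never call spark.stop() inside an individual suite. getOrCreate() hands every suite in the same JVM the same session, so the first stop() kills it for all suites that run afterwards.

import org.apache.spark.sql.SparkSession
import org.scalatest.flatspec.AnyFlatSpec

// One session for the whole test JVM; no suite stops it.
trait SharedSparkSession {
  lazy val spark: SparkSession = SparkSession.builder
    .master("local[*]")
    .appName("unit-tests")
    .getOrCreate()
}

class test1 extends AnyFlatSpec with SharedSparkSession {
  "test1" should "take the shared spark session" in {
    assert(spark.range(10).count() == 10)
  }
}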
I want to include a properties file explicitly and use it in my Spark code, instead of hardcoding all the credentials directly in the Spark code.
I am trying the following approach but cannot get it to work; AppContext cannot be resolved.
Please guide me on how to achieve this.
Spark_env.properties (under src/main/resources in a Maven project for Spark with Scala):
CASSANDRA_HOST1=127.0.0.133
CASSANDRA_PORT1=9042
CASSANDRA_USER1=usr1
CASSANDRA_PASS1=pas2
DataMigration.cassandra.keyspace1=demo2
DataMigration.cassandra.table1= data1
CASSANDRA_HOST2=
CASSANDRA_PORT2=9042
CASSANDRA_USER2=usr2
CASSANDRA_PASS2=pas2
D.cassandra.keyspace2=kesp2
D.cassandra.table2= data2
DataMigration.DifferencedRecords.output.path1=C:/spark_windows_proj/File1.csv
DataMigration.DifferencedRecords.output.path2=C:/spark_windows_proj/File1.parquet
----------------------------------------------------------------------------------
DM.scala
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.mapreduce.v2.app.AppContext

object Data_Migration {
  def main(args: Array[String]) {

    val host1: String = AppContext.getProperties().getProperty("CASSANDRA_HOST1")
    val port1 = AppContext.getProperties().getProperty("CASSANDRA_PORT1").toInt
    val keySpace1: String = AppContext.getProperties().getProperty("DataMigration.cassandra.keyspace1")
    val DataMigrationTableName1: String = AppContext.getProperties().getProperty("DataMigration.cassandra.table1")
    val username1: String = AppContext.getProperties().getProperty("CASSANDRA_USER1")
    val pass1: String = AppContext.getProperties().getProperty("CASSANDRA_PASS1")

    val host2: String = AppContext.getProperties().getProperty("CASSANDRA_HOST2")
    val port2 = AppContext.getProperties().getProperty("CASSANDRA_PORT2").toInt
    val keySpace2: String = AppContext.getProperties().getProperty("DataMigration.cassandra.keyspace2")
    val DataMigrationTableName2: String = AppContext.getProperties().getProperty("DataMigration.cassandra.table2")
    val username2: String = AppContext.getProperties().getProperty("CASSANDRA_USER2")
    val pass2: String = AppContext.getProperties().getProperty("CASSANDRA_PASS2")

    val Result_csv: String = AppContext.getProperties().getProperty("DataMigration.DifferencedRecords.output.path1")
    val Result_parquet: String = AppContext.getProperties().getProperty("DataMigration.DifferencedRecords.output.path2")

    val sc = AppContext.getSparkContext()
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("ABC")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    val df_read1 = spark.read
      .format("org.apache.spark.sql.cassandra")
      .option("spark.cassandra.connection.host", host1)
      .option("spark.cassandra.connection.port", port1)
      .option("spark.cassandra.auth.username", username1)
      .option("spark.cassandra.auth.password", pass1)
      .option("keyspace", keySpace1)
      .option("table", DataMigrationTableName1)
      .load()
I would rather pass the properties explicitly by passing the --properties-file option to spark-submit when submitting the job.
The AppContext won't necessarily work for all submission types, while passing a config file should work everywhere.
Edit: For local usage without spark-submit, you can simply use the standard Properties class, load it from the resources, and access the properties. You only need to put the property file into src/main/resources rather than src/test/resources, which is on the classpath only for tests. The code is something like:
import java.util.Properties

val props = new Properties()
props.load(getClass.getClassLoader.getResourceAsStream("file.props"))
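Tying that back to the question, a minimal sketch of how the keys from Spark_env.properties could replace the AppContext calls (assuming the file sits under src/main/resources and the spark-cassandra-connector is on the classpath; only the first Cassandra source is shown):

import java.util.Properties
import org.apache.spark.sql.SparkSession

object Data_Migration {
  def main(args: Array[String]): Unit = {
    // Load the property file from the classpath (src/main/resources).
    val props = new Properties()
    props.load(getClass.getClassLoader.getResourceAsStream("Spark_env.properties"))

    val spark = SparkSession.builder()
      .master("local")
      .appName("ABC")
      .getOrCreate()

    val df_read1 = spark.read
      .format("org.apache.spark.sql.cassandra")
      .option("spark.cassandra.connection.host", props.getProperty("CASSANDRA_HOST1"))
      .option("spark.cassandra.connection.port", props.getProperty("CASSANDRA_PORT1"))
      .option("spark.cassandra.auth.username", props.getProperty("CASSANDRA_USER1"))
      .option("spark.cassandra.auth.password", props.getProperty("CASSANDRA_PASS1"))
      .option("keyspace", props.getProperty("DataMigration.cassandra.keyspace1"))
      .option("table", props.getProperty("DataMigration.cassandra.table1"))
      .load()

    df_read1.show()
  }
}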
I have data in an S3 bucket in directory /data/vw/. Each line is of the form:
| abc:2 def:1 ghi:3 ...
I want to convert it to the following format:
abc abc def ghi ghi ghi
The new converted lines should go to S3 in directory /data/spark
Basically, repeat each string the number of times that follows the colon. I am trying to convert a VW LDA input file to a corresponding file for consumption by Spark's LDA library.
The code:
import org.apache.spark.{SparkConf, SparkContext}

object Vw2SparkLdaFormatConverter {

  def repeater(s: String): String = {
    val ssplit = s.split(':')
    (ssplit(0) + ' ') * ssplit(1).toInt
  }

  def main(args: Array[String]) {
    val inputPath = args(0)
    val outputPath = args(1)

    val conf = new SparkConf().setAppName("FormatConverter")
    val sc = new SparkContext(conf)

    val vwdata = sc.textFile(inputPath)
    val sparkdata = vwdata.map(s => s.trim().split(' ').map(repeater).mkString)

    val coalescedSparkData = sparkdata.coalesce(100)
    coalescedSparkData.saveAsTextFile(outputPath)

    sc.stop()
  }
}
When I run this (as a Spark EMR job in AWS), the step fails with exception:
18/01/20 00:16:28 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at ...
The code is run as:
spark-submit --class Vw2SparkLdaFormatConverter --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=true --executor-memory 4g s3a://mybucket/scripts/myscalajar.jar s3a://mybucket/data/vw s3a://mybucket/data/spark
I have tried specifying new output paths (/data/spark1 etc.) and ensuring that they do not exist before the step is run. Even then it does not work.
What am I doing wrong? I am new to Scala and Spark, so I might be overlooking something here.
You could convert to a DataFrame and then save with overwrite enabled (calling toDF on an RDD requires import spark.implicits._ to be in scope):
coalescedSparkData.toDF.write.mode("overwrite").csv(outputPath)
Or if you insist on using RDD methods, you can do as described already in this answer
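If you stay with the RDD API, one common workaround (sketched here; the linked answer may differ) is to delete the existing output directory through the Hadoop FileSystem API before writing:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Remove the target directory (recursively) if a previous run left it behind.
val outputDir = new Path(outputPath)
val fs = FileSystem.get(URI.create(outputPath), sc.hadoopConfiguration)
if (fs.exists(outputDir)) {
  fs.delete(outputDir, true)
}
coalescedSparkData.saveAsTextFile(outputPath)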
I have tried to write a transform method from DataFrame to DataFrame, and I also want to test it with ScalaTest.
As you know, in Spark 2.x with the Scala API, you can create a SparkSession object as follows:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .config("spark.master", "local[2]")
  .getOrCreate()
This code works fine in unit tests.
But when I run this code with spark-submit, the cluster options do not take effect.
For example,
spark-submit --master yarn --deploy-mode client --num-executors 10 ...
does not create any executors.
I have found that the spark-submit arguments are applied when I remove the config("spark.master", "local[2]") part of the above code.
But without the master setting, the unit test code does not work.
I tried to split the spark (SparkSession) object creation between the tests and main.
But there are many code blocks that need spark, for example import spark.implicits._ and spark.createDataFrame(rdd, schema).
Is there any best practice for writing code that creates the spark object both for tests and for spark-submit?
One way is to create a trait which provides the SparkContext/SparkSession, and use that in your test cases, like so:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SparkSession}

trait SparkTestContext {
  private val master = "local[*]"
  private val appName = "testing"
  System.setProperty("hadoop.home.dir", "c:\\winutils\\")

  private val conf: SparkConf = new SparkConf()
    .setMaster(master)
    .setAppName(appName)
    .set("spark.driver.allowMultipleContexts", "false")
    .set("spark.ui.enabled", "false")

  val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
  val sc: SparkContext = ss.sparkContext
  val sqlContext: SQLContext = ss.sqlContext
}
And your test class header then looks like this for example:
class TestWithSparkTest extends BaseSpec with SparkTestContext with Matchers {
I made a version where Spark will close correctly after tests.
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite, Matchers}

trait SparkTest extends FunSuite with BeforeAndAfterAll with Matchers {
  var ss: SparkSession = _
  var sc: SparkContext = _
  var sqlContext: SQLContext = _

  override def beforeAll(): Unit = {
    val master = "local[*]"
    val appName = "MyApp"
    val conf: SparkConf = new SparkConf()
      .setMaster(master)
      .setAppName(appName)
      .set("spark.driver.allowMultipleContexts", "false")
      .set("spark.ui.enabled", "false")
    ss = SparkSession.builder().config(conf).getOrCreate()
    sc = ss.sparkContext
    sqlContext = ss.sqlContext
    super.beforeAll()
  }

  override def afterAll(): Unit = {
    sc.stop()
    super.afterAll()
  }
}
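A suite mixing in this trait might then look like this (an illustrative example):

class MyTransformTest extends SparkTest {
  test("range produces the expected number of rows") {
    val df = ss.range(3).toDF("n")
    df.count() shouldBe 3
  }
}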
The spark-submit command with the parameter --master yarn sets the YARN master.
And this will conflict with master("x") in your code, even when you use something like master("yarn").
If you want to use import sparkSession.implicits._ for toDF, toDS or other functions,
you can just use a local sparkSession variable created like below:
val spark = SparkSession.builder().appName("YourName").getOrCreate()
without setting master("x") in the code; pass --master yarn to spark-submit when you are not running on a local machine.
My advice: do not use a global sparkSession in your code. That may cause errors or exceptions.
Hope this helps you.
Good luck!
How about defining an object in which a method creates a singleton instance of SparkSession, like MySparkSession.get(), and passing it as a parameter to each of your unit tests?
In your main method, you can create a separate SparkSession instance, which can have different configurations.
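A sketch of what such a singleton might look like (the master and appName settings are illustrative, not from the original answer):

import org.apache.spark.sql.SparkSession

object MySparkSession {
  // Returns the one test session for this JVM (getOrCreate already caches the active session).
  def get(): SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("unit-tests")
    .getOrCreate()
}

// In a unit test:
// val result = myTransform(MySparkSession.get(), inputDf)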
I am new to Scala.
I am trying to implement custom Kryo serialization, in which I have two classes and one object:
Operation
package org.agg

import org.apache.spark.{SparkConf, SparkContext}

object Operation {
  def main(args: Array[String]) {
    var SparkConf = new SparkConf()
      .setAppName("Operation")
      .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      .set("spark.kryo.registrator", "org.agg.KryoClass")

    var sc = new SparkContext(SparkConf)
    var sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

    println("**********Operation***********")
  }
}
KryoClass
package org.agg

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class KryoClass extends KryoRegistrator {
  def registerClasses(kryo: Kryo) {
    println("**********KryoClass***********")
    kryo.register(classOf[org.agg.KryoSerializeCode])
  }
}
KryoSerializeCode
package org.agg

class KryoSerializeCode {
  println("**********KryoSerializeCode*************")
}
My understanding is that, since the Operation object sets set("spark.kryo.registrator", "org.agg.KryoClass"), KryoClass should be invoked and the println("**********KryoClass***********") statement should appear in the log file.
The command to execute the Operation object is below:
spark-submit --class org.agg.Operation --master yarn --deploy-mode
cluster --num-executors 40 --executor-cores 1 --executor-memory 3400m
--files /home/hive-site.xml --jars /usr/iop/4.1.0.0/spark/lib/datanucleus-api-jdo-3.2.6.jar,/usr/iop/4.1.0.0/spark/lib/datanucleus-rdbms-3.2.9.jar,/usr/iop/4.1.0.0/spark/lib/datanucleus-core-3.2.10.jar
/home/operation_jar.jar
But after executing this, only the print statement in the Operation object is printed, not the ones in KryoClass or KryoSerializeCode.
Does anyone have an idea why the print statements inside KryoClass or KryoSerializeCode are not being called?
Spark has the habit of doing things lazily - delaying unnecessary operations until they are actually needed. Seems like the creation of the Kryo serializer falls into that category - since you didn't actually do anything with your newly created SparkContext, Spark didn't bother creating the serializer.
If you add any Spark action to your Operation.main, e.g.:
sc.parallelize(List(1,2,3)).count()
You'll see the printouts you're expecting.
NOTE: you might have to register a few more classes, or disable registrationRequired
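For the note about extra registrations: with spark.kryo.registrationRequired set to true, Spark will fail with a "Class is not registered" message for every unregistered class it tries to serialize; each of those can be added in KryoClass, or the requirement can be relaxed. A sketch (the extra registrations below are purely illustrative):

class KryoClass extends KryoRegistrator {
  def registerClasses(kryo: Kryo) {
    println("**********KryoClass***********")
    kryo.register(classOf[org.agg.KryoSerializeCode])
    // Illustrative extras: register whatever the exception message names,
    // e.g. array types used by your RDDs.
    kryo.register(classOf[Array[Int]])
    kryo.register(classOf[Array[String]])
  }
}

// Or, while debugging, in Operation:
//   .set("spark.kryo.registrationRequired", "false")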