How can I add configuration files to a Spark job running in YARN-CLUSTER mode? - scala

I am using Spark 1.6.0. I want to upload a file using the --files option and read its content after initializing the Spark context.
My spark-submit command syntax looks like below:
spark-submit \
--master yarn --deploy-mode cluster \
--files /home/user/test.csv \
/home/user/spark-test-0.1-SNAPSHOT.jar
I read the Spark documentation and it suggests using SparkFiles.get("test.csv"), but this does not work in yarn-cluster mode.
If I run in local mode the code works fine, but in yarn-cluster mode I get a file-not-found exception.
I can see in the logs that my file is uploaded to the hdfs://host:port/user/guest/.sparkStaging/application_1452310382039_0019/test.csv directory, but SparkFiles.get tries to look for the file at /tmp/test.csv, which is not correct. If someone has used this successfully, please help me solve this.

Spark submit command
spark-submit \
--master yarn --deploy-mode client \
--files /home/user/test.csv \
/home/user/spark-test-0.1-SNAPSHOT.jar /home/user/test.csv
Read file in main program
import java.io.FileInputStream
def main(args: Array[String]) {
  // args(0) is the local file path passed after the jar in spark-submit
  val fis = new FileInputStream(args(0))
  // read content of file
}
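For yarn-cluster mode specifically, a minimal sketch of an alternative (assuming the driver itself runs inside a YARN container, where --files places test.csv in the container's working directory) is to open the file by its bare name instead of an absolute path:
import scala.io.Source
// With --files, YARN localizes test.csv into the working directory of the
// driver and executor containers, so the bare file name resolves there.
val content = Source.fromFile("test.csv").getLines().mkString("\n")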

Related

Read local/linux files in Spark Scala code executing in Yarn Cluster Mode

How can I access and read local file data in Spark executing in YARN cluster mode?
local/linux file: /home/test_dir/test_file.csv
spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar
Spark code to read the CSV:
val test_data = spark.read.option("inferSchema", "true").option("header", "true").csv("/home/test_dir/test_file.csv")
val test_file_data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:///home/test_dir/test_file.csv")
The above spark-submit fails with a file-not-found error for the local file (/home/test_dir/test_file.csv).
Spark by default looks for the file on hdfs://, but my file is local; it should not be copied into HDFS and should be read only from the local file system.
Any suggestions to resolve this error?
Using the file:// prefix will pull files from the YARN NodeManager's filesystem, not from the machine where you submitted the code.
To access your --files use csv("#test_file.csv")
Regarding "should not be copied into hdfs": using --files copies the files into a temporary location that is mounted by the YARN executors, and you can see them in the YARN UI.
The solution below worked for me:
local/linux file: /home/test_dir/test_file.csv
spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar
To access the file passed with spark-submit:
import scala.io.Source
val lines = Source.fromFile("test_file.csv").getLines().toList
Instead of specifying the complete path, specify only the name of the file you want to read. Since Spark already distributes a copy of the file to each node's working directory, you can access its data by file name alone.
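If the goal is a DataFrame rather than raw lines, one possible follow-up (a sketch, assuming Spark 2.2+ where DataFrameReader.csv accepts a Dataset[String]) is to parse the locally read lines with the built-in CSV reader:
import scala.io.Source
import spark.implicits._
// Read the localized file from the container's working directory...
val lines = Source.fromFile("test_file.csv").getLines().toSeq
// ...and let Spark parse the in-memory lines as CSV (Spark 2.2+).
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(spark.createDataset(lines))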

Error in spark-submit: java.lang.OutOfMemoryError while reading a 5-6 GB text file using the wholeTextFiles method

I have 5 files with the following sizes:
File1=~500KB
File2=~1MB
File3=~1GB
File4=~6GB
File5=~1GB
I am using wholeTextFiles to read all 5 files. Each file has a different number of columns.
val data = sc.wholeTextFiles("..........Path......./*")
On further analysis I found that my code does not work after the line below. Any suggestion on how to use mapPartitions in this case?
val files = data.map { case (filename, content) => filename }
files.collect.foreach( filename => {
  // ... performing some operations ...
})
When I try to submit this code on the server, it fails with java.lang.OutOfMemoryError.
The code works fine when I remove the 6GB file from the source path, so the issue is only with the large file.
I am using the spark-submit command below:
spark-submit --class myClassName \
--master yarn-client \
--conf spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j.properties" \
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=...FilePath.../log4j.properties" \
--files ...FilePath.../log4j.properties --num-executors 4 --executor-cores 4 \
--executor-memory 10g --driver-memory 5g --conf "spark.yarn.executor.memoryOverhead=409" \
--conf "spark.yarn.driver.memoryOverhead=409" .................JarFilePath.jar
Spark Version:1.6.0
Scala Version: 2.10.5
I suppose you are using wholeTextFiles instead of textFile because "each file has a different number of columns", i.e. the schema is not aligned between the files. (Note: textFile has a smaller memory requirement in this case, so you could get this code working without increasing --executor-memory.) If your end result is schema-independent (i.e. ends up with the same number of columns), you can add a preprocessing layer: start a Spark job on each file with textFile that outputs the desired content with a consistent number of columns, as sketched below.
Otherwise you can filter out the large files and start separate Spark jobs on those to split them into smaller ones; that way you will fit in memory.
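A minimal sketch of the per-file textFile preprocessing mentioned above (the input directory, the output suffix, and the per-line normalisation are placeholders to adapt):
import org.apache.hadoop.fs.{FileSystem, Path}
// List the input files and process each one with textFile, which streams lines
// instead of materialising an entire file in memory like wholeTextFiles does.
val fs = FileSystem.get(sc.hadoopConfiguration)
val inputFiles = fs.listStatus(new Path("/input/dir")).map(_.getPath.toString)
inputFiles.foreach { file =>
  sc.textFile(file)
    .map(line => line) // placeholder: align/normalise the columns here
    .saveAsTextFile(file + "_normalised")
}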

AWS EMR import pyfile from S3

I'm struggling to understand how to import files as libraries with pyspark.
Let's say that I have the following
HappyBirthday.py
def run():
    print('Happy Birthday!')
sparky.py
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import HappyBirthday
sc = SparkContext(appName="kmeans")
HappyBirthday.run()
sc.stop()
And both of them are stored in the same folder in S3.
How do I make sure that, when I use
spark-submit --deploy-mode cluster s3://<PATH TO FILE>/sparky.py
, HappyBirthday.py is also imported?
If you are trying to run sparky.py and use a function inside HappyBirthday.py, you can try something like this.
spark-submit \
--deploy-mode cluster --master yarn \
--py-files s3://<PATH TO FILE>/HappyBirthday.py \
s3://<PATH TO FILE>/sparky.py
Just remember that S3 does not have the concept of "folders", so you need to provide the exact path of the file or the group of files.
In case you have a whole bunch of dependencies in your project, you can bundle them all up into a single .zip file with the necessary __init__.py files, and then you can import any of the functions inside those libraries.
For example, I have the sqlparse library as a dependency, with a bunch of Python files inside it. I have a package zip file like the one below.
unzip -l packages.zip
Archive: packages.zip
0 05-05-2019 12:44 sqlparse/
2249 05-05-2019 12:44 sqlparse/__init__.py
5916 05-05-2019 12:44 sqlparse/cli.py
...
110 05-05-2019 12:44 sqlparse-0.3.0.dist-info/WHEEL
--------- -------
125034 38 files
This is uploaded to S3 and then used in the job.
spark-submit --deploy-mode cluster --master yarn --py-files s3://my0-test-bucket/artifacts/packages.zip s3://my-test-script/script/script.py
My file can contain imports like below.
import pyspark
import sqlparse # Importing the library
from pprint import pprint
What you want to use here is the --py-files argument for spark-submit. From the submitting applications page in the Spark documentation:
For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.
For your example, this would be:
spark-submit --deploy-mode cluster --py-files s3://<PATH TO FILE>/HappyBirthday.py s3://<PATH TO FILE>/sparky.py

Submit an application property file with Spark typesafe config

Please, I need your help: I'm trying to submit an external configuration file for my Spark application using Typesafe Config.
I'm loading the application.conf file in my application code like this:
lazy val conf = ConfigFactory.load()
File content
ingestion {
  process {
    value = "sas"
  }
  sas {
    origin {
      value = "/route"
    }
    destination {
      value = "/route"
    }
    extension {
      value = ".sas7bdat"
    }
    file {
      value = "mytable"
    }
    month {
      value = "201010,201011"
    }
    table {
      value = "tbl"
    }
  }
}
My spark-submit command is:
spark2-submit --class com.antonio.Main --master yarn --deploy-mode client --driver-memory 10G --driver-cores 8 --executor-memory 13G --executor-cores 4 --num-executors 10 --verbose --files properties.conf /home/user/ingestion-1.0-SNAPSHOT-jar-with-dependencies.jar --files application.conf
But for some reason, I'm receiving
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'ingestion'
Everything looks configured correctly. Have I missed something?
thanks,
Antonio
Your application.conf must by default be present at the root of the classpath for ConfigFactory.load() to find it. Alternatively, you can modify where the application.conf file is looked up through system properties. Therefore, your options are as follows.
The first alternative is to add the root (working) directory of the job to the classpath:
spark2-submit ... \
--conf spark.driver.extraClassPath=./ \
--conf spark.executor.extraClassPath=./ \ // if you need to load config at executors
...
Keep the --files option as is. Note that if you run your job in client mode, you must pass the proper path to where application.conf is located on the driver machine in the spark.driver.extraClassPath option.
The second alternative (and I think this one is superior) is to use the config.file system property to control where ConfigFactory.load() looks for the config file:
spark2-submit ... \
--conf spark.driver.extraJavaOptions=-Dconfig.file=./application.conf \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./application.conf \
...
The remarks about loading the config on executors and keeping the --files option also apply here.
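Once the file is resolvable (via the classpath or config.file), a minimal sketch of reading it, using the keys from the application.conf shown in the question:
import com.typesafe.config.ConfigFactory
// Resolves application.conf from the classpath, or from -Dconfig.file if set.
lazy val conf = ConfigFactory.load()
val process = conf.getString("ingestion.process.value")             // "sas"
val origin = conf.getString("ingestion.sas.origin.value")           // "/route"
val months = conf.getString("ingestion.sas.month.value").split(",") // Array("201010", "201011")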

ClassNotFoundException: com.databricks.spark.csv.DefaultSource

I am trying to export data from Hive using Spark with Scala, but I am getting the following error:
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
My Scala script is as below:
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM sparksdata")
df.write.format("com.databricks.spark.csv").save("/root/Desktop/home.csv")
I have also tried the command below, but the issue is still not resolved. Please help me.
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
If you wish to run that script the way you are running it, you'll need to use --jars for local jars or --packages for packages from a remote repository when you run the command.
So running the script should look like this:
spark-shell -i /path/to/script.scala --packages com.databricks:spark-csv_2.10:1.5.0
If you also want to stop the spark-shell after the job is done, you'll need to add:
System.exit(0)
at the end of your script.
PS: You won't need to fetch this dependency with Spark 2.x, which has built-in CSV support.
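For reference, a minimal sketch of the same export on Spark 2.x, where the CSV source is built in (the SparkSession setup here is an assumption, not taken from the question):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("hive-to-csv")
  .enableHiveSupport() // needed to query Hive tables
  .getOrCreate()
val df = spark.sql("SELECT * FROM sparksdata")
// Writes a directory of CSV part files rather than a single file.
df.write.option("header", "true").csv("/root/Desktop/home")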