How to use custom spark-defaults.conf settings - scala

I have added a custom value to conf/spark-defaults.conf but that value is not being used.
stephen#ubuntu:~/spark-1.2.2$ cat conf/spark-defaults.conf
spark.akka.frameSize 92345678
Now let us run my program LBFGSRunner
sbt/sbt '; project mllib; runMain org.apache.spark.mllib.optimization.LBFGSRunner spark://ubuntu:7077'
Notice from the following error that the conf setting was not used:
[error] Exception in thread "main" org.apache.spark.SparkException:
Job aborted due to stage failure: Serialized task 0:0 was 26128706 bytes,
which exceeds max allowed: spark.akka.frameSize (10485760 bytes) -
reserved (204800 bytes). Consider increasing spark.akka.frameSize
or using broadcast variables for large values

Note: verified on Linux Mint.
If you set properties in spark-defaults.conf, Spark picks them up only when you submit your job using spark-submit.
file: spark-defaults.conf
spark.driver.extraJavaOptions -Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=hello-spark
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1
If you want to run your job in development mode, set the configuration directly on the SparkSession builder:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('Hello Spark') \
    .master('local[3]') \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
    .getOrCreate()
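Since the original question launches the job with sbt runMain, which bypasses spark-submit and therefore never reads spark-defaults.conf, a minimal Scala sketch that sets the property programmatically instead could look like this (the object name and frame size value are illustrative, not from the original post):

import org.apache.spark.{SparkConf, SparkContext}

object FrameSizeExample {
  def main(args: Array[String]): Unit = {
    // Settings applied on the SparkConf take effect even when the job is not launched via spark-submit.
    val conf = new SparkConf()
      .setAppName("FrameSizeExample")
      .setMaster(args.headOption.getOrElse("local[*]"))
      .set("spark.akka.frameSize", "100")   // Spark 1.x interprets this value in MB
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}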

Related

Read local/linux files in Spark Scala code executing in Yarn Cluster Mode

How to access and read local file data in Spark executing in Yarn Cluster Mode.
local/linux file: /home/test_dir/test_file.csv
spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar
Spark code to read csv:
val test_data = spark.read.option("inferSchema", "true").option("header", "true").csv("/home/test_dir/test_file.csv")
val test_file_data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:///home/test_dir/test_file.csv")
The above spark-submit fails with a file-not-found error for /home/test_dir/test_file.csv.
Spark by default looks for the file in hdfs://, but my file is local; it should not be copied into HDFS and should be read only from the local file system.
Any suggestions to resolve this error?
Using the file:// prefix will pull files from the YARN NodeManager's filesystem, not from the machine where you submitted the code.
To access your --files, use csv("#test_file.csv").
Regarding "should not be copied into hdfs": using --files copies the files into a temporary location that is mounted by the YARN executors, and you can see them from the YARN UI.
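As a quick way to confirm where the localized copies end up (a hedged debugging sketch, not part of the original answer), you can list the container's working directory from the driver code:

// Print the working directory and its contents; a file shipped with --files
// should appear here under its bare name.
println(new java.io.File(".").getAbsolutePath)
new java.io.File(".").listFiles().foreach(f => println(f.getName))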
The solution below worked for me:
local/linux file: /home/test_dir/test_file.csv
spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar
To access the file passed with spark-submit:
import scala.io.Source
// the --files copy sits in the container's working directory, so the bare file name resolves
val lines = Source.fromFile("test_file.csv").getLines().toList
Instead of specifying the complete path, specify only the name of the file you want to read. Since Spark has already copied the file to each node, it can be accessed by file name alone.
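If the data itself is needed as a DataFrame rather than just on the driver, one hedged option (assuming spark is the existing SparkSession and Spark 2.2+, where DataFrameReader.csv accepts a Dataset[String]) is to read the localized file on the driver and hand the raw lines to Spark's CSV parser:

import scala.io.Source

// Read the --files copy on the driver, then parse it with Spark's CSV reader.
val source = Source.fromFile("test_file.csv")
val rawLines = try source.getLines().toList finally source.close()

import spark.implicits._   // provides the String encoder used by createDataset
val df = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(spark.createDataset(rawLines))
df.show()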

Error in spark-submit: java.lang.OutOfMemoryError while reading a 5-6 GB text file using the wholeTextFiles method

I have 5 files, with sizes as follows:
File1=~500KB
File2=~1MB
File3=~1GB
File4=~6GB
File5=~1GB
I am using wholeTextFiles to read all 5 files. Each file has a different number of columns.
val data = sc.wholeTextFiles("..........Path......./*")
On further analysis I found that my code stops working after the line below. Any suggestion on how to use mapPartitions in this case?
val files = data.map { case (filename, content) => filename }
files.collect.foreach( filename => {
  // ... performing some operations ...
})
When I submit this code on the server it fails with java.lang.OutOfMemoryError.
The code works fine when I remove the 6GB file from the source path, so the issue is only with the large file.
I am using the spark-submit command below:
spark-submit --class myClassName \
--master yarn-client \
--conf spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j.properties" \
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=...FilePath.../log4j.properties" \
--files ...FilePath.../log4j.properties --num-executors 4 --executor-cores 4 \
--executor-memory 10g --driver-memory 5g --conf "spark.yarn.executor.memoryOverhead=409" \
--conf "spark.yarn.driver.memoryOverhead=409" .................JarFilePath.jar
Spark Version:1.6.0
Scala Version: 2.10.5
I suppose you are using wholeTextFiles instead of textFile because "each file has a different number of columns", i.e. the schema is not aligned between the files. (Note that textFile has a much smaller memory requirement here, since each line becomes a separate record, so that alone may let this code work without increasing --executor-memory.) If your end result is schema independent (i.e. ends up with the same number of columns), you can implement a preprocessing layer that starts a Spark job per file with textFile and writes out the desired content in a common format with the same number of columns.
Otherwise, you can filter out the large files and start separate Spark jobs on those to split them into smaller ones. That way they will fit in memory.
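A hedged sketch of the per-file preprocessing idea, assuming a spark-shell style context where sc is available (the input directory and the tab-split parsing are illustrative, not from the original post):

import org.apache.hadoop.fs.{FileSystem, Path}

// List the input files on the driver, then process each one with textFile,
// which streams the file line by line instead of loading it as one huge record.
val fs = FileSystem.get(sc.hadoopConfiguration)
val files = fs.listStatus(new Path("hdfs:///data/input")).map(_.getPath.toString)

files.foreach { file =>
  val lines = sc.textFile(file)
  // ... per-file parsing that knows this particular file's column layout ...
  lines.map(_.split("\t")).take(5).foreach(row => println(row.mkString("|")))
}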

Spark Shell Add Multiple Drivers/Jars to Classpath using spark-defaults.conf

We are using the spark-shell REPL mode to test various use cases and to connect to multiple sources/sinks.
We need to add custom drivers/jars in the spark-defaults.conf file. I have tried adding multiple jars separated by commas, like:
spark.driver.extraClassPath = /home/sandeep/mysql-connector-java-5.1.36.jar
spark.executor.extraClassPath = /home/sandeep/mysql-connector-java-5.1.36.jar
But it's not working. Can anyone please provide the correct syntax?
Note: verified on Linux Mint with Spark 3.0.1.
If you set properties in spark-defaults.conf, Spark picks them up only when you submit your job using spark-submit.
Note: behaviour with spark-shell and pyspark still needs to be verified.
file: spark-defaults.conf
spark.driver.extraJavaOptions -Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=hello-spark
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1
In the terminal, run your job (say wordcount.py):
spark-submit /path-to-file/wordcount.py
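For the spark-shell use case in the question, the jars can also be supplied directly on the command line instead of (or in addition to) spark-defaults.conf. A hedged example, assuming the MySQL connector jar path and Maven coordinates from the question; multiple entries are comma separated for both flags:
spark-shell --jars /home/sandeep/mysql-connector-java-5.1.36.jar
spark-shell --packages mysql:mysql-connector-java:5.1.36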
If you want to run your job in development mode from an IDE, use the config() method. Here we set the Kafka and Avro packages; if you also want to include log4j.properties, use extraJavaOptions.
The app name and master can be provided in two ways:
use .appName() and .master()
use a .conf file
file: hellospark.py
from logger import Log4j
from util import get_spark_app_config
from pyspark.sql import SparkSession
# first approach
spark = SparkSession.builder \
    .appName('Hello Spark') \
    .master('local[3]') \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,"
            "org.apache.spark:spark-avro_2.12:3.0.1") \
    .config("spark.driver.extraJavaOptions",
            "-Dlog4j.configuration=file:log4j.properties "
            "-Dspark.yarn.app.container.log.dir=app-logs "
            "-Dlogfile.name=hello-spark") \
    .getOrCreate()

# second approach
conf = get_spark_app_config()
spark = SparkSession.builder \
    .config(conf=conf) \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
    .getOrCreate()

logger = Log4j(spark)
file: logger.py
from pyspark.sql import SparkSession


class Log4j(object):
    def __init__(self, spark: SparkSession):
        conf = spark.sparkContext.getConf()
        app_name = conf.get("spark.app.name")
        log4j = spark._jvm.org.apache.log4j
        self.logger = log4j.LogManager.getLogger(app_name)

    def warn(self, message):
        self.logger.warn(message)

    def info(self, message):
        self.logger.info(message)

    def error(self, message):
        self.logger.error(message)

    def debug(self, message):
        self.logger.debug(message)
file: util.py
import configparser
from pyspark import SparkConf


def get_spark_app_config(enable_delta_lake=False):
    """
    Reads configuration from the spark.conf file to create
    an instance of SparkConf(). Can be used as
    SparkSession.builder.config(conf=conf).getOrCreate()
    :return: instance of SparkConf()
    """
    spark_conf = SparkConf()
    config = configparser.ConfigParser()
    config.read("spark.conf")
    for (key, value) in config.items("SPARK_APP_CONFIGS"):
        spark_conf.set(key, value)
    if enable_delta_lake:
        for (key, value) in config.items("DELTA_LAKE_CONFIGS"):
            spark_conf.set(key, value)
    return spark_conf
file: spark.conf
[SPARK_APP_CONFIGS]
spark.app.name = Hello Spark
spark.master = local[3]
spark.sql.shuffle.partitions = 3
[DELTA_LAKE_CONFIGS]
spark.jars.packages = io.delta:delta-core_2.12:0.7.0
spark.sql.extensions = io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog = org.apache.spark.sql.delta.catalog.DeltaCatalog
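For the Scala/spark-shell context of the original question, a rough Scala equivalent of util.py might look like the sketch below. It is a hedged example that assumes a flat key=value properties file (java.util.Properties cannot parse the INI-style [SECTION] headers shown above):

import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.spark.SparkConf

// Load "key = value" pairs from a properties file into a SparkConf.
def getSparkAppConfig(path: String = "spark.conf"): SparkConf = {
  val props = new Properties()
  val in = new FileInputStream(path)
  try props.load(in) finally in.close()

  val sparkConf = new SparkConf()
  props.asScala.foreach { case (key, value) => sparkConf.set(key, value.trim) }
  sparkConf
}

// Usage: val spark = SparkSession.builder().config(getSparkAppConfig()).getOrCreate()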
As an example in addition to Prateek's answer, I have had some success by adding the following to the spark-defaults.conf file to be loaded when starting a spark-shell session in client mode.
spark.jars jars_added/aws-java-sdk-1.7.4.jar,jars_added/hadoop-aws-2.7.3.jar,jars_added/sqljdbc42.jar,jars_added/jtds-1.3.1.jar
Adding this exact line to the spark-defaults.conf file loads the three jar files as long as they are stored in the jars_added folder when spark-shell is run from the appropriate directory (doing this also seems to remove the need to place the jar files on the slaves in the specified locations). I created the jars_added folder in my $SPARK_HOME directory, so whenever I run spark-shell I must run it from that directory (I have not yet worked out how to change the base path that the spark.jars setting resolves against; it seems to default to the current directory when launching spark-shell). As Prateek hinted, the jar files need to be comma separated.
I also had to set SPARK_CONF_DIR to $SPARK_HOME/conf (export SPARK_CONF_DIR="${SPARK_HOME}/conf") for spark-shell to recognise the location of my config file (i.e. spark-defaults.conf). I'm using PuTTY to ssh onto the master.
Just to clarify: once I have added spark.jars jar1,jar2,jar3 to my spark-defaults.conf file, I type the following to start my spark-shell session:
cd $SPARK_HOME   # navigate to the Spark home directory, which contains the jars_added folder
spark-shell
On startup, spark-shell then loads the specified jar files from the jars_added folder.
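If running spark-shell from a fixed directory is inconvenient, a hedged workaround (not verified in the answer above) is to list the jars with absolute paths so the current directory no longer matters; /opt/spark is only an illustrative location:
spark.jars /opt/spark/jars_added/aws-java-sdk-1.7.4.jar,/opt/spark/jars_added/hadoop-aws-2.7.3.jar,/opt/spark/jars_added/sqljdbc42.jar,/opt/spark/jars_added/jtds-1.3.1.jar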

Submit an application property file with Spark typesafe config

Please, I need your help: I'm trying to submit an external configuration file for my Spark application using Typesafe Config.
I'm loading the application.conf file in my application code like this:
lazy val conf = ConfigFactory.load()
File content
ingestion{
process {
value = "sas"
}
sas {
origin{
value = "/route"
}
destination{
value = "/route"
}
extension{
value = ".sas7bdat"
}
file{
value = "mytable"
}
month{
value = "201010,201011"
}
table{
value = "tbl"
}
}
}
My spark-submit command is:
spark2-submit --class com.antonio.Main --master yarn --deploy-mode client --driver-memory 10G --driver-cores 8 --executor-memory 13G --executor-cores 4 --num-executors 10 --verbose --files properties.conf /home/user/ingestion-1.0-SNAPSHOT-jar-with-dependencies.jar --files application.conf
But for some reason, I'm receiving
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'ingestion'
Everything looks configured correctly. Have I missed something?
thanks,
Antonio
Your application.conf must, by default, be present at the root of the classpath for ConfigFactory.load() to find it. Alternatively, you can change where the application.conf file is looked for through system properties. Your options are therefore as follows.
The first alternative is to add the job's working directory to the classpath:
spark2-submit ... \
--conf spark.driver.extraClassPath=./ \
--conf spark.executor.extraClassPath=./ \
...
Keep the --files option as is. The spark.executor.extraClassPath line is needed only if you load the config on the executors. Note that if you run your job in client mode, you must pass the proper path to where application.conf is located on the driver machine in the spark.driver.extraClassPath option.
The second alternative (and I think this one is superior) is to use the config.file system property to control where ConfigFactory.load() looks for the config file:
spark2-submit ... \
--conf spark.driver.extraJavaOptions=-Dconfig.file=./application.conf \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./application.conf \
...
The remarks about loading the config on executors and about keeping the --files option also apply here.
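Once the file is found, reading the nested keys from the application.conf shown in the question might look like this (a hedged sketch; the key paths simply follow the structure above):

import com.typesafe.config.ConfigFactory

val conf = ConfigFactory.load()   // resolves application.conf from the classpath, or from -Dconfig.file

val process     = conf.getString("ingestion.process.value")         // "sas"
val origin      = conf.getString("ingestion.sas.origin.value")      // "/route"
val destination = conf.getString("ingestion.sas.destination.value")
val months      = conf.getString("ingestion.sas.month.value").split(",").toSeq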

How can I add configuration files to a Spark job running in YARN-CLUSTER mode?

I am using Spark 1.6.0. I want to upload a file using the --files flag and read its content after initializing the Spark context.
My spark-submit command looks like this:
spark-submit \
--master yarn --deploy-mode cluster \
--files /home/user/test.csv \
/home/user/spark-test-0.1-SNAPSHOT.jar
I read the Spark documentation and it suggested using SparkFiles.get("test.csv"), but this is not working in yarn-cluster mode.
If I change the deploy mode to local, the code works fine, but I get a file-not-found exception in yarn-cluster mode.
I can see in the logs that my file is uploaded to the hdfs://host:port/user/guest/.sparkStaging/application_1452310382039_0019/test.csv directory, but SparkFiles.get is looking for the file at /tmp/test.csv, which is not correct. If someone has used this successfully, please help me solve this.
Spark-submit command:
spark-submit \
--master yarn --deploy-mode client \
--files /home/user/test.csv \
/home/user/spark-test-0.1-SNAPSHOT.jar /home/user/test.csv
Read the file in the main program:
import java.io.FileInputStream

def main(args: Array[String]) {
  // args(0) is the local path passed on the command line; in yarn-client mode the
  // driver runs on the submitting machine, so the local path is readable here
  val fis = new FileInputStream(args(0))
  // read content of file
}
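For completeness, a hedged sketch of the yarn-cluster variant: since --files localizes test.csv into the driver container's working directory in cluster mode, passing just the file name as the argument should also work there:

// spark-submit --master yarn --deploy-mode cluster \
//   --files /home/user/test.csv \
//   /home/user/spark-test-0.1-SNAPSHOT.jar test.csv
import scala.io.Source

def main(args: Array[String]) {
  // args(0) is just "test.csv"; the --files copy sits in the container's working directory
  val source = Source.fromFile(args(0))
  try source.getLines().foreach(println)
  finally source.close()
}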