I am trying to launch a JAR with the command scala nameJar.jar.
My Spark configuration:
val sc = SparkSession.builder()
  .master("local")
  .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryoserializer.buffer.max", "1048")
  .config("spark.driver.memory", "2048")
  .appName("Lea")
  .getOrCreate()
Error:
17/06/13 09:35:29 INFO SparkEnv: Registering MapOutputTracker
17/06/13 09:35:29 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 239075328 must be at least 471859200. Please increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.
at org.apache.spark.memory.UnifiedMemoryManager$.getMaxMemory(UnifiedMemoryManager.scala:216)
I think the error message is clear:
java.lang.IllegalArgumentException: System memory 239075328 must be at
least 471859200. Please increase heap size using the --driver-memory
option or spark.driver.memory in Spark configuration.
You need to increase the driver memory by passing --driver-memory 1g when you run the application.
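For example, a sketch of the launch with that flag, assuming you run through spark-submit rather than plain scala (the class name here is a placeholder):
spark-submit --driver-memory 1g --class your.MainClass nameJar.jar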
If you are using Maven, then:
export MAVEN_OPTS="-Xms1024m -Xmx4096m -XX:PermSize=1024m"
Or you can pass the VM options argument in IntelliJ and Eclipse as:
-Xms1024m -Xmx4096m -XX:PermSize=1024m
Hope this helps!
Related
Scala version: 2.11.12
Spark version: 2.4.0
emr-5.23.0
I get the following exception when running the spark-submit command below on an Amazon EMR cluster:
spark-submit --class etl.SparkDataProcessor --master yarn --deploy-mode cluster --conf spark.yarn.appMasterEnv.ETL_NAME=foo --conf spark.yarn.appMasterEnv.ETL_SPARK_MASTER=yarn --conf spark.yarn.appMasterEnv.ETL_AWS_ACCESS_KEY_ID=123 --conf spark.yarn.appMasterEnv.ETL_AWS_SECRET_ACCESS_KEY=abc MY-Tool.jar
Exception
ERROR ApplicationMaster: Uncaught exception:
java.lang.IllegalStateException: User did not initialize spark context!
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:485)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:773)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:772)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:797)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
How I create my Spark session (where sparkMaster = yarn):
lazy val spark: SparkSession = {
  val logger: Logger = Logger.getLogger("etl")
  val sparkAppName = EnvConfig.ETL_NAME
  val sparkMaster = EnvConfig.ETL_SPARK_MASTER
  val sparkInstance = SparkSession
    .builder()
    .appName(sparkAppName)
    .master(sparkMaster)
    .getOrCreate()
  val hadoopConf = sparkInstance.sparkContext.hadoopConfiguration
  hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  hadoopConf.set("fs.s3a.access.key", EnvConfig.ETL_AWS_ACCESS_KEY_ID)
  hadoopConf.set("fs.s3a.secret.key", EnvConfig.ETL_AWS_SECRET_ACCESS_KEY)
  logger.info("Created My SparkSession")
  logger.info(s"Spark Application Name: $sparkAppName")
  logger.info(s"Spark Master: $sparkMaster")
  sparkInstance
}
UPDATE:
I determined that, due to the application logic, in certain cases we did not initialize the Spark session. Because of this, it seems that when the cluster terminates it also tries to do something with the session (perhaps close it) and thus fails. Now that I have figured out this issue, the application runs but never actually completes. Currently it seems to be hanging in a particular Spark-related part when running in cluster mode:
val data: DataFrame = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(s"s3://$csvPath/$fileKey")
  .toDF()
20/03/16 18:38:35 INFO Client: Application report for application_1584324418613_0031 (state: RUNNING)
AFAIK, EnvConfig.ETL_AWS_ACCESS_KEY_ID and ETL_AWS_SECRET_ACCESS_KEY are not getting populated, so the SparkSession cannot be instantiated when they are null or empty. Try to print and debug the values, as in the sketch below.
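A minimal sanity-check sketch, assuming the EnvConfig object from the question (the checkEnv helper and its messages are just illustrative):
def checkEnv(name: String, value: String): Unit = {
  // Fail fast with a readable message instead of building a SparkSession on empty credentials
  require(value != null && value.nonEmpty, s"$name is null or empty; check --conf spark.yarn.appMasterEnv.$name")
  println(s"$name is set (length ${value.length})")
}

checkEnv("ETL_AWS_ACCESS_KEY_ID", EnvConfig.ETL_AWS_ACCESS_KEY_ID)
checkEnv("ETL_AWS_SECRET_ACCESS_KEY", EnvConfig.ETL_AWS_SECRET_ACCESS_KEY)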
Also, reading the properties passed via --conf spark.xxx
should look like this example; I hope you are following this:
spark.sparkContext.getConf.getOption("spark.ETL_AWS_ACCESS_KEY_ID")
Once you check that, this example approach should work:
/**
* Hadoop-AWS Configuration
*/
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.host", proxyHost)
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.port", proxyPort)
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3n.server-side-encryption-algorithm", "AES256")
sparkSession.sparkContext.hadoopConfiguration.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem
Another thing: you can use --master yarn or --master local[*] instead of
--conf spark.yarn.appMasterEnv.ETL_SPARK_MASTER=yarn
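For example, a sketch of the question's submit command with the master passed only via --master (the application would then also need to stop requiring the ETL_SPARK_MASTER variable):
spark-submit --class etl.SparkDataProcessor --master yarn --deploy-mode cluster --conf spark.yarn.appMasterEnv.ETL_NAME=foo --conf spark.yarn.appMasterEnv.ETL_AWS_ACCESS_KEY_ID=123 --conf spark.yarn.appMasterEnv.ETL_AWS_SECRET_ACCESS_KEY=abc MY-Tool.jar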
UPDATE:
--conf spark.driver.port=20002 may solve this issue, where 20002 is an arbitrary port. It seems like the ApplicationMaster waits for that particular port for some time, retries for a while, and then fails with the exception you got.
I got this idea by walking through Spark's ApplicationMaster code from here
and this comment: "This a bit hacky, but we need to wait until the spark.driver.port property has been set by the Thread executing the user class."
You can try this and let me know.
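As an illustration, a sketch of the question's submit command with that flag added (the port number is arbitrary, as noted above):
spark-submit --class etl.SparkDataProcessor --master yarn --deploy-mode cluster --conf spark.driver.port=20002 --conf spark.yarn.appMasterEnv.ETL_NAME=foo --conf spark.yarn.appMasterEnv.ETL_AWS_ACCESS_KEY_ID=123 --conf spark.yarn.appMasterEnv.ETL_AWS_SECRET_ACCESS_KEY=abc MY-Tool.jar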
Further reading: Apache Spark: How to change the port the Spark driver listens to
In my case (after resolving the application issues), I needed to include core AND task node types when deploying in cluster mode.
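As an illustration only (instance types and counts are placeholders, not my actual setup), an EMR cluster with both CORE and TASK instance groups can be created with the AWS CLI roughly like this:
aws emr create-cluster --name etl-cluster --release-label emr-5.23.0 --applications Name=Spark \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
                    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
                    InstanceGroupType=TASK,InstanceCount=2,InstanceType=m5.xlarge \
  --use-default-roles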
I am receiving a SparkContext error while starting a SparkSession on EMR 5.3.1 in Scala. Below is the version of Spark I am using and the error.
This works fine on a Windows machine but errors out on EMR. Also, is this the right way to create a SparkSession?
spark_version: 2.1.0
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("vierweship_test")
  .config("spark.sql.warehouse.dir", "target/spark-warehouse")
  //.enableHiveSupport()
  .getOrCreate()
Error:
17/08/01 13:08:14 ERROR SparkContext: Error initializing SparkContext.
java.io.IOException: Incomplete HDFS URI, no host: hdfs:///var/log/spark/apps
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
I get the below error if I do not use the warehouse directory property.
17/08/01 13:00:02 ERROR SparkContext: Error initializing SparkContext.
java.lang.NullPointerException
at java.io.File.<init>(File.java:277)
at org.apache.spark.deploy.yarn.Client.addDistributedUri$1(Client.scala:438)
at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:476)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:600)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:599)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
Command I am using :
spark-submit --verbose --class xxx --master yarn --jars="s3-dist-cp.jar:common-0.1.jar" --deploy-mode client --packages "Xxx:xxx:XXx" myjar-0.1.jar
I am trying to read a large Avro file (2 GB) using spark-shell, but I am getting a StackOverflowError.
val newDataDF = spark.read.format("com.databricks.spark.avro").load("abc.avro")
java.lang.StackOverflowError
at com.databricks.spark.avro.SchemaConverters$.toSqlType(SchemaConverters.scala:71)
at com.databricks.spark.avro.SchemaConverters$.toSqlType(SchemaConverters.scala:81)
I tried to increase driver memory and executor memory, but I am still getting the same error.
./bin/spark-shell --packages com.databricks:spark-avro_2.11:3.1.0 --driver-memory 8G --executor-memory 8G
How can I read this file? Is there a way to partition it?
I am training an LDA model on Wikipedia articles (4 million docs, ~14 GB of data). I am running a Scala script on one machine with ~98 GB of memory. I run the Scala code in spark-shell with the following params:
$SPARK_HOME/bin/spark-shell --executor-memory 2G --driver-memory 25G --total-executor-cores 10 --conf spark.driver.maxResultSize=50g
Code snippet:
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

val lda = new LDA().setOptimizer(new OnlineLDAOptimizer()).setK(numTopics).setMaxIterations(maxIterations)
val ldaModel = lda.run(lda_countVector)
I get the following error when I execute lda.run():
scala> val ldaModel = lda.run(lda_countVector)
16/06/30 12:53:45 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
16/06/30 12:53:45 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
[Stage 21:==============================> (238 + 85) / 408]16/06/30 13:35:59 ERROR Executor: Exception in task 315.0 in stage 21.0 (TID 2803)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1853)
at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:183)
at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply$mcV$sp(TaskResult.scala:52)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:49)
at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I have tried various memory settings, like increasing driver memory, maxResultSize, executor memory, etc., but I still get the same error. If I reduce maxResultSize below 40 GB, then I get a maxResultSize error.
Please help me figure out the right memory settings. What should be done for a typical application like this?
Thanks.
In spark-shell you didn't provide a master URL. This means that you run in local mode, so the executor memory setting will be ignored. The number of cores will default to the number of CPUs. To be most conservative, run with --master local[1]; this means only 1 core.
You could increase the driver memory to 80G and also set spark.memory.storageFraction to 0.1. This takes memory away from caching RDDs and gives more to the reducers.
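For example, a sketch of the spark-shell launch from the question with these adjustments applied (values are illustrative, not a tested recommendation):
$SPARK_HOME/bin/spark-shell --executor-memory 2G --driver-memory 80G --total-executor-cores 10 --conf spark.driver.maxResultSize=50g --conf spark.memory.storageFraction=0.1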
How do I increase Spark memory when using local[*]?
I tried setting the memory like this:
val conf = new SparkConf()
  .set("spark.executor.memory", "1g")
  .set("spark.driver.memory", "4g")
  .setMaster("local[*]")
  .setAppName("MyApp")
But I still get:
MemoryStore: MemoryStore started with capacity 524.1 MB
Does this have something to do with:
.setMaster("local[*]")
Assuming that you are using spark-shell: setting spark.driver.memory in your application isn't working because your driver process has already started with the default memory.
You can either launch your spark-shell using:
./bin/spark-shell --driver-memory 4g
or you can set it in spark-defaults.conf:
spark.driver.memory 4g
If you are launching an application using spark-submit, you must specify the driver memory as an argument:
./bin/spark-submit --driver-memory 4g --class main.class yourApp.jar
I was able to solve this by running SBT with:
sbt -mem 4096
However, the MemoryStore is half that size. Still looking into where this fraction comes from.
In Spark 2.x you can use SparkSession, which looks like this:
val spark = SparkSession.builder()
  .config("spark.executor.memory", "1g")
  .config("spark.driver.memory", "4g")
  .master("local[*]")
  .appName("MyApp")
  .getOrCreate()
I tried --driver-memory 4g and --executor-memory 4g; neither worked to increase working memory. However, I noticed that bin/spark-submit was picking up _JAVA_OPTIONS, and setting that to -Xmx4g resolved it. I use JDK 7.
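For example (a sketch of what worked for me; the class and JAR names are placeholders and the heap size should match your needs):
export _JAVA_OPTIONS="-Xmx4g"
./bin/spark-submit --class main.class yourApp.jar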
The fraction of the heap used for Spark's memory cache is 0.6 by default, so if you need more than 524.1 MB, you should increase the spark.executor.memory setting :)
Technically you could also increase the fraction used for Spark's memory cache, but I believe this is discouraged or at least requires you to do some additional configuration. See https://spark.apache.org/docs/1.0.2/configuration.html for more details.
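For reference, a sketch of what that extra configuration might look like in spark-defaults.conf for the Spark 1.x property the linked page documents (newer Spark versions use spark.memory.fraction instead; treat these numbers as illustrative only):
spark.executor.memory 2g
spark.storage.memoryFraction 0.7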
You can't change the driver memory after the application has started (link).
Version
spark-2.3.1
Source Code
org.apache.spark.launcher.SparkSubmitCommandBuilder:267
String memory = firstNonEmpty(tsMemory, config.get(SparkLauncher.DRIVER_MEMORY),
System.getenv("SPARK_DRIVER_MEMORY"), System.getenv("SPARK_MEM"), DEFAULT_MEM);
cmd.add("-Xmx" + memory);
The sources it checks, in order:
SparkLauncher.DRIVER_MEMORY: pass --driver-memory 2g on the command line
SPARK_DRIVER_MEMORY: in conf/spark-env.sh, set SPARK_DRIVER_MEMORY="2g"
SPARK_MEM: in conf/spark-env.sh, set SPARK_MEM="2g"
DEFAULT_MEM: 1g
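If you launch programmatically instead of via spark-submit, here is a minimal sketch using the SparkLauncher API, which fixes the driver memory before the driver JVM starts (the app resource, main class, and master are placeholders):
import org.apache.spark.launcher.SparkLauncher

// Driver memory must be decided before the driver JVM is created, so it is set on the launcher
val process = new SparkLauncher()
  .setAppResource("/path/to/yourApp.jar")
  .setMainClass("main.class")
  .setMaster("local[*]")
  .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
  .launch()
process.waitFor()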
To assign memory to Spark:
On the command shell: /usr/lib/spark/bin/spark-shell --driver-memory=16G --num-executors=100 --executor-cores=8 --executor-memory=16G
Or, with a larger result size limit: /usr/lib/spark/bin/spark-shell --driver-memory=16G --num-executors=100 --executor-cores=8 --executor-memory=16G --conf spark.driver.maxResultSize=2G