I'm trying to merge two config files (or create a config file based on a single reference file) using
lazy val finalConfig:
I'm passing Java system properties to Spark through spark-submit ....... --conf spark.driver.extraJavaOptions=-Dconfig.resource=./reference.conf,-Duser.resource=./user.conf ...
My goal is to point to a file that is not inside my jar so that my code can read it via System.getProperty(".."). I changed the working directory for testing (cd ..) and keep getting the same error, so I guess Spark ignores my Java arguments?
Is there a way to point to a file (or even 2 files in my case) so that they can be merged?
I also tried bundling the reference.conf file but not the user.conf file: reference.conf is recognized, but the user.conf that I passed with --conf spark.driver.extraJavaOptions=-Duser.resource=./user.conf is not.
Is there a way to do that? Thanks if you can help
I don't see you calling ConfigFactory.parseFile to load a file containing properties.
Typesafe Config automatically reads application.conf, application.json, application.properties and reference.conf from the classpath, as well as all -D system properties passed to the JVM, and merges them.
I read an external properties file that is not part of the jar as follows. The file "application.conf" is placed in the same directory as the jar.
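import java.io.File
import scala.util.Try
import com.typesafe.config.ConfigFactory
val applicationRootPath = System.getProperty("user.dir")
// External file next to the jar; if it cannot be parsed, fall back to an empty config
val config = Try(ConfigFactory.parseFile(new File(applicationRootPath + "/" + "application.conf"))).getOrElse(ConfigFactory.empty())
// External values win; classpath defaults and -D properties fill the gaps
val appConfig = config.withFallback(ConfigFactory.load()).resolve()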
ConfigFactory.load() already contains all the properties from the configuration files on the classpath and from the -D parameters. I give priority to my external "application.conf" and fall back on the defaults: for matching keys, "application.conf" takes precedence over the other sources.
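The precedence described above can be illustrated without the library. What follows is a plain-Python sketch of the fallback idea only, using nested dicts. The helper `with_fallback` and the keys are hypothetical and made up for illustration; this is not the Typesafe Config API.

```python
def with_fallback(primary, fallback):
    """Return a new dict where `primary` wins and `fallback` fills missing keys."""
    merged = dict(fallback)
    for key, value in primary.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are nested objects: merge them key by key.
            merged[key] = with_fallback(value, merged[key])
        else:
            # Scalar (or type mismatch): the primary value replaces the fallback.
            merged[key] = value
    return merged


# Hypothetical contents of the external application.conf and the classpath defaults.
external = {"db": {"host": "prod-db"}}
defaults = {"db": {"host": "localhost", "port": 5432}, "debug": False}

app_config = with_fallback(external, defaults)
```

Here `db.host` comes from the external file, while `db.port` and `debug` are filled in from the defaults, which mirrors the behaviour of `config.withFallback(ConfigFactory.load())` for matching and missing keys.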
I installed pyspark with pip.
I code in Jupyter notebooks. Everything works fine, but now I get a Java heap space error when exporting a large .csv file.
Here someone suggested editing spark-defaults.conf. The Spark documentation also says:
"Note: In client mode, this config must not be set through the
SparkConf directly in your application, because the driver JVM has
already started at that point. Instead, please set this through the
--driver-memory command line option or in your default properties file."
But I'm afraid there is no such file when installing pyspark with pip.
Am I right? How do I solve this?
I recently ran into this as well. If you look at the Spark UI under the Classpath Entries, the first path is probably the configuration directory, something like /.../lib/python3.7/site-packages/pyspark/conf/. When I looked for that directory, it didn't exist; presumably it's not part of the pip installation. However, you can easily create it and add your own configuration files. For example,
mkdir /.../lib/python3.7/site-packages/pyspark/conf
vi /.../lib/python3.7/site-packages/pyspark/conf/spark-defaults.conf
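The same two steps can be done from Python. This is a minimal sketch, assuming (as in the answer above) that the pip installation looks for its configuration in a `conf` directory inside the `pyspark` package. The helper names `pyspark_conf_dir` and `write_spark_defaults` are hypothetical, and the `4g` value in the usage note is only an example.

```python
import importlib.util
import os


def pyspark_conf_dir():
    """Return <site-packages>/pyspark/conf, or None if pyspark is not installed."""
    spec = importlib.util.find_spec("pyspark")
    if spec is None or spec.origin is None:
        return None
    return os.path.join(os.path.dirname(spec.origin), "conf")


def write_spark_defaults(conf_dir, properties):
    """Create conf_dir if needed and write 'key value' lines to spark-defaults.conf.

    Note: an existing spark-defaults.conf in conf_dir is overwritten.
    """
    os.makedirs(conf_dir, exist_ok=True)
    path = os.path.join(conf_dir, "spark-defaults.conf")
    with open(path, "w") as handle:
        for key, value in properties.items():
            handle.write(f"{key} {value}\n")
    return path
```

A call such as `write_spark_defaults(pyspark_conf_dir(), {"spark.driver.memory": "4g"})` would then create the directory and the file in one go. The kernel has to be restarted afterwards, because the driver JVM reads the file only at startup.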
The spark-defaults.conf file should be located in:
If no file is present, create one (a template should be available in the same directory).
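As a sketch of that step, the following copies the template when no spark-defaults.conf exists yet. The helper name `ensure_spark_defaults` is hypothetical, and the file names are the ones Spark ships (`spark-defaults.conf.template`).

```python
import os
import shutil


def ensure_spark_defaults(conf_dir):
    """Copy spark-defaults.conf.template to spark-defaults.conf if the latter is missing."""
    target = os.path.join(conf_dir, "spark-defaults.conf")
    template = target + ".template"
    if os.path.exists(target):
        return target                    # keep the existing file untouched
    if os.path.exists(template):
        shutil.copyfile(template, target)
    else:
        open(target, "w").close()        # no template either: start from an empty file
    return target
```

With SPARK_HOME set, this could be called as `ensure_spark_defaults(os.path.join(os.environ["SPARK_HOME"], "conf"))`.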
How to find the default configuration folder
Check contents of the folder in Python:
import glob, os
glob.glob(os.path.join(os.environ["SPARK_HOME"], "conf", "spark*"))
# ['/usr/local/spark-3.1.2-bin-hadoop3.2/conf/spark-env.sh.template',
# '/usr/local/spark-3.1.2-bin-hadoop3.2/conf/spark-defaults.conf.template']
When no spark-defaults.conf file is available, built-in values are used
To my surprise, no spark-defaults.conf but just a template file was present!
Still I could look at Spark properties, either in the “Environment” tab of the Web UI http://<driver>:4040 or using getConf().getAll() on the Spark context:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("myApp") \
    .getOrCreate()
# List every property that was set explicitly
spark.sparkContext.getConf().getAll()
# [('spark.driver.port', '55128'),
# ('spark.app.name', 'myApp'),
# ('spark.rdd.compress', 'True'),
# ('spark.sql.warehouse.dir', 'file:/path/spark-warehouse'),
# ('spark.serializer.objectStreamReset', '100'),
# ('spark.master', 'local[*]'),
# ('spark.submit.pyFiles', ''),
# ('spark.app.startTime', '1645484409629'),
# ('spark.executor.id', 'driver'),
# ('spark.submit.deployMode', 'client'),
# ('spark.app.id', 'local-1645484410352'),
# ('spark.ui.showConsoleProgress', 'true'),
# ('spark.driver.host', 'xxx.xxx.xxx.xxx')]
Note that not all properties are listed but:
only values explicitly specified through spark-defaults.conf, SparkConf, or the command line. For all other configuration properties, you can assume the default value is used.
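Because the list only holds explicitly set values, a lookup needs a default for everything else. A minimal sketch of that lookup, using a shortened copy of the list above; `effective_value` is a hypothetical helper name, and `1g` is the documented default for spark.driver.memory.

```python
def effective_value(explicit_pairs, key, default):
    """Look up `key` among explicitly set (key, value) pairs, else return `default`."""
    return dict(explicit_pairs).get(key, default)


# A shortened copy of the getConf().getAll() output.
pairs = [("spark.app.name", "myApp"), ("spark.master", "local[*]")]

master = effective_value(pairs, "spark.master", None)
driver_memory = effective_value(pairs, "spark.driver.memory", "1g")
```

On a live session the equivalent call is `spark.sparkContext.getConf().get("spark.driver.memory", "1g")`, which also takes a default as its second argument.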
For instance, spark.default.parallelism is not in the list, yet the default parallelism in my case is 8.
This is the default for local mode, namely the number of cores on the local machine--see https://spark.apache.org/docs/latest/configuration.html. In my case 8=2x4cores because of hyper-threading.
If the property spark.default.parallelism is passed when launching the app
spark = SparkSession \
    .builder \
    .appName("Set parallelism") \
    .config("spark.default.parallelism", 4) \
    .getOrCreate()
then the property is shown in the Web UI and in the list
Precedence of configuration settings
Spark considers properties in this order of precedence (spark-defaults.conf comes last):
properties set directly on the SparkConf
flags passed to spark-submit
options in spark-defaults.conf
From https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties:
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
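The three-level precedence can be modelled with a `ChainMap`, which returns the value from the first mapping that contains a key. This is only an illustration of the lookup order, not how Spark implements it; the three dicts and their values are hypothetical, and the renamed-key rule from the quote is not modelled.

```python
from collections import ChainMap

# Hypothetical values for the three sources, listed from highest to lowest precedence.
spark_conf = {"spark.app.name": "myApp"}
submit_flags = {"spark.driver.memory": "4g", "spark.app.name": "fromFlag"}
defaults_file = {"spark.driver.memory": "2g", "spark.rdd.compress": "true"}

# The first mapping that defines a key wins.
effective = ChainMap(spark_conf, submit_flags, defaults_file)
```

Here `effective["spark.app.name"]` resolves to the SparkConf value, `spark.driver.memory` to the spark-submit flag, and `spark.rdd.compress` falls through to spark-defaults.conf.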
Some pyspark Jupyter kernels contain flags for spark-submit in the environment variable $PYSPARK_SUBMIT_ARGS, so one might want to check that too.
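One way to check that variable is to split it the way a shell would and collect the flags. The parser below is a rough sketch with a hypothetical helper name (`parse_submit_args`). It assumes every `--flag` is followed by a value, so a value-less flag such as `--verbose` placed directly before `pyspark-shell` would be misread.

```python
import os
import shlex


def parse_submit_args(args_string):
    """Split a spark-submit argument string into a {flag: value} dict.

    Handles '--flag value' and '--flag=value'; '--conf k=v' entries are
    stored under their property name k.
    """
    tokens = shlex.split(args_string)
    flags = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--"):
            if "=" in token:
                name, value = token.split("=", 1)
            elif i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                name, value = token, tokens[i + 1]
                i += 1
            else:
                name, value = token, None
            if name == "--conf" and value and "=" in value:
                name, value = value.split("=", 1)
            flags[name] = value
        i += 1
    return flags


# Flags the current kernel would hand to spark-submit (empty dict if the variable is unset).
current = parse_submit_args(os.environ.get("PYSPARK_SUBMIT_ARGS", ""))
```

If `current` contains `--driver-memory` or a `spark.*` key, that value takes precedence over the one in spark-defaults.conf.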
Related question: Where to modify spark-defaults.conf if I installed pyspark via pip install pyspark
The spark-defaults.conf file is needed when we have to change any of the default configs for Spark.
As @niuer suggested, it should be present in the $SPARK_HOME/conf/ directory, but that might not be the case for you. By default, only a template config file is present there. You can just add a new spark-defaults.conf file in $SPARK_HOME/conf/.
Check your spark path. There are configuration files under:
$SPARK_HOME/conf/, e.g.
I get the below error when I use pyspark via Zeppelin.
The python & spark interpreters work and all environment variables are set correctly.
print os.environ['PYTHONPATH']
zeppelin-env.sh is set with the variables below
export PYSPARK_PYTHON=/usr/local/bin/python2
export PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:${PYTHONPATH}
See the below log file
INFO [2017-11-01 12:30:42,972] ({pool-2-thread-4}
RemoteInterpreter.java[init]:221) - Create remote interpreter
paragraph_1509038605940_-1717438251's Interpreter pyspark not
Thank you in advance
I found a workaround for the above issue. The "interpreter not found" issue does not happen when I create the note inside a directory; it only happens when I use notes at the top level. Additionally, I found out that this issue does not happen in version 0.7.2.
Ex :
My Flume spool directory contains files that are not UTF-8 encoded.
So I get a java.nio.charset.MalformedInputException error when I try to collect them.
Changing the encoding option in the .conf file also causes an error.
And I have to use the spooldir source type.
How can I collect non-UTF-8 files?
The encoding of our log files was Latin-5 (ISO-8859-9, used for Turkish).
Fixed it by adding the below line into the conf file:
AGENTNAME.sources.SOURCENAME.inputCharset = ISO-8859-9
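Why the charset matters can be reproduced outside Flume. The sketch below uses Python codecs rather than the Java ones and a made-up sample string: bytes written as ISO-8859-9 fail a strict UTF-8 decode, which is the Python counterpart of the MalformedInputException, and decode cleanly once the right charset is named.

```python
# "ı" and "ş" are the single bytes 0xFD and 0xFE in ISO-8859-9;
# neither byte can appear in valid UTF-8.
raw = "kayıt başarılı".encode("iso-8859-9")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False                  # strict UTF-8 decoding rejects the bytes

decoded = raw.decode("iso-8859-9")   # succeeds and round-trips the original text
```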
I have downloaded the graphframes package (from here) and saved it on my local disk. Now, I would like to use it. So, I use the following command:
IPYTHON_OPTS="notebook --no-browser" pyspark --num-executors=4 --name gorelikboris_notebook_1 --py-files ~/temp/graphframes-0.1.0-spark1.5.jar --jars ~/temp/graphframes-0.1.0-spark1.5.jar --packages graphframes:graphframes:0.1.0-spark1.5
All the pyspark functionality works as expected, except for the new graphframes package: whenever I try to import graphframes, I get an ImportError. When I examine sys.path, I can see the following two paths:
/tmp/spark-1eXXX/userFiles-9XXX/graphframes_graphframes-0.1.0-spark1.5.jar and /tmp/spark-1eXXX/userFiles-9XXX/graphframes-0.1.0-spark1.5.jar, however these files don't exist. Moreover, the /tmp/spark-1eXXX/userFiles-9XXX/ directory is empty.
What am I missing?
In my case:
1. cd /home/zh/.ivy2/jars
2. jar xf graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar
3. Add /home/zh/.ivy2/jars to PYTHONPATH in spark-env.sh, like this:
export PYTHONPATH=$PYTHONPATH:/home/zh/.ivy2/jars:.
This might be an issue with Spark packages that contain Python code in general. Someone else asked about it earlier on the Spark user mailing list.
My workaround is to unpack the jar to find the embedded Python code, and then move the Python code into a subdirectory called graphframes.
For instance, I run pyspark from my home directory
~$ ls -lart
drwxr-xr-x 2 user user 4096 Feb 24 19:55 graphframes
~$ ls graphframes/
__init__.pyc examples.pyc graphframe.pyc tests.pyc
You would not need the --py-files or --jars parameters, though; something like
IPYTHON_OPTS="notebook --no-browser" pyspark --num-executors=4 --name gorelikboris_notebook_1 --packages graphframes:graphframes:0.1.0-spark1.5
and having the python code in the graphframes directory should work.
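Since a jar is a zip archive, the unpacking step can also be scripted. This is a sketch of the workaround, assuming the jar keeps its Python code under a top-level `graphframes/` directory as described above. The helper names `extract_package_from_jar` and `make_importable` are hypothetical.

```python
import os
import sys
import zipfile


def extract_package_from_jar(jar_path, package, dest_dir):
    """Extract every entry under '<package>/' from the jar (a zip file) into dest_dir."""
    prefix = package + "/"
    with zipfile.ZipFile(jar_path) as jar:
        members = [name for name in jar.namelist() if name.startswith(prefix)]
        jar.extractall(dest_dir, members)
    return os.path.join(dest_dir, package)


def make_importable(dest_dir):
    """Put dest_dir on sys.path so that 'import <package>' can find the extracted code."""
    if dest_dir not in sys.path:
        sys.path.insert(0, dest_dir)
```

Calling `extract_package_from_jar(jar_path, "graphframes", some_dir)` followed by `make_importable(some_dir)` has the same effect on the driver as running pyspark from the directory that contains the `graphframes` folder.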
Add these lines to your $SPARK_HOME/conf/spark-defaults.conf :
spark.executor.extraClassPath file_path/jar1:file_path/jar2
spark.driver.extraClassPath file_path/jar1:file_path/jar2
In the more general case of importing an 'orphan' Python file (outside of the current folder, not part of a properly installed package), use SparkContext.addPyFile. From the documentation:
addPyFile(path): Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
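A small wrapper around that call might look like the sketch below. `add_orphan_modules` is a hypothetical helper, `sc` is assumed to be an existing SparkContext, and the path in the usage note is made up.

```python
def add_orphan_modules(sc, paths):
    """Call sc.addPyFile for each .py or .zip path; return the paths that were added."""
    added = []
    for path in paths:
        if path.endswith((".py", ".zip")):
            sc.addPyFile(path)   # registers the dependency for all future tasks
            added.append(path)
    return added
```

For example, `add_orphan_modules(spark.sparkContext, ["/path/to/helpers.py"])` would register the file, after which `import helpers` should work in code run by the tasks.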