I am using MapR5.2 - Spark version 2.1.0
And i am running my spark app jar in Yarn CLuster mode.
I have tried all the available options that i found But unable to succeed.
This is our Production environment. But i need that for my particular spark job it should follow and pick-up my log4j-Driver.properties file which is present in my src/main/resources folder(I also confirmed by opening the jar my property file is present)
1) Content of My Log File -> log4j-Driver.properties
log4j.rootCategory=DEBUG, FILE
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=/users/myuser/logs/Myapp.log
log4j.appender.FILE.ImmediateFlush=true
log4j.appender.FILE.Threshold=debug
log4j.appender.FILE.Append=true
log4j.appender.FILE.MaxFileSize=100MB
log4j.appender.FILE.MaxBackupIndex=10
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
2) My Script for Spark-Submit Command
propertyFile=application.properties
spark-submit --class MyAppDriver \
--conf "spark.driver.extraJavaOptions -Dlog4j.configuration=file:/users/myuser/log4j-Driver.properties" \
--master yarn --deploy-mode cluster \
--files /users/myuser/log4j-Driver.properties,/opt/mapr/spark/spark-2.1.0/conf/hive-site.xml,/users/myuser/application.properties
/users/myuser/myapp_2.11-1.0.jar $propertyFile
All i Need is as of now i am trying to Write my Driver Logs in the directory mentioned in my properties file(mentioned above) If i am successful in this then i will try for Executor logs as well. But first i need to make this Driver Log to write on my local (and its an Edge node of our Cluster)
/users/myuser/log4j-Driver.properties seems to be the path to the file on your local computer so you were right to use it for the --files argument.
The problem is, that there's no such file on the driver and/or executor, so when you use file:/users/myuser/log4j-Driver.properties as an argument to -Dlog4j.configuration Log4j will fail to find it.
Since you run on YARN, files listed as arguments to --files will be submitted to HDFS. Each application submission will have its own base directory in HDFS where all the files will be put by spark-submit.
In order to refer to these files use relative paths. In your case --conf "spark.driver.extraJavaOptions -Dlog4j.configuration=log4j-Driver.properties" should work.
Related
We have a pyspark based application and we are doing a spark-submit as shown below. Application is working as expected, however we are seeing a weird warning message. Any way to handle this or why is this coming ?
Note: The cluster is Azure HDI Cluster.
spark-submit --master yarn --deploy-mode cluster --jars file:/<localpath>/* --py-files pyFiles/__init__.py,pyFiles/<abc>.py,pyFiles/<abd>.py --files files/<env>.properties,files/<config>.json main.py
warning seen is:
warnings.warn(
/usr/hdp/current/spark3-client/python/pyspark/context.py:256:
RuntimeWarning: Failed to add file
[file:///home/sshuser/project/pyFiles/abc.py] speficied in
'spark.submit.pyFiles' to Python path:
/mnt/resource/hadoop/yarn/local/usercache/sshuser/filecache/929
above warning coming for all files i.e abc.py, abd.py etc (which ever passed to --py-files to)
I want to log in maprDB a spark job with log4j. I have written a custom appender, and here my log4j.properties :
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss}
%-5p %c{1}:%L - %m%n
log4j.appender.MapRDB=com.datalake.maprdblogger.Appender
log4j.logger.testest=WARN, MapRDB
Put on src/main/resources directory
This is my main method :
object App {
val log: Logger = org.apache.log4j.LogManager.getLogger(getClass.getName)
def main(args: Array[String]): Unit = {
// custom appender
LogHelper.fillLoggerContext("dev", "test", "test", "testest", "")
log.error("bad record.")
}
}
When I run my spark-submit without any configuration, nothing happens. It is like my log4j.properties wasn't here.
If I deploy my log4j.properties file manually and add options :
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/PATH_TO_FILE/log4j.properties
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/PATH_TO_FILE/log4j.properties
It works well. Why it doesn't work without theese options ?
The "spark.driver.extraJavaOptions" :
Default value is: (none)
A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. Note that it is illegal to set maximum heap size (-Xmx) settings with this option. Maximum heap size settings can be set with spark.driver.memory in the cluster mode and through the --driver-memory command-line option in the client mode.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-java-options command line option or in your default properties file.
[Refer to this link for more details:
https://spark.apache.org/docs/latest/configuration.html]
I am using IntelliJ with gradle to build spark project. I tried multiple ways to disable logging but to no avail.
Here are some of the things that I have tried.
1. Add log4j.properties under src/main/resources with the following configuration
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
# Change this to set Spark log level
log4j.logger.org.apache.spark=WARN
# Silence akka remoting
log4j.logger.Remoting=WARN
# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.eclipse.jetty=WARN
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
Add these before and after sparkSession creation.
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
Logger.getLogger("spark").setLevel(Level.OFF)
LogManager.getRootLogger.setLevel(Level.OFF)
Add this when creating sparkSession
val spark: SparkSession = SparkSession.builder()
.appName("test").master("local")
.enableHiveSupport()
.getOrCreate()
spark.sparkContext.setLogLevel("OFF")
I am running IntelliJ using mac OS,
scalaVersion=2.11.8
sparkVersion=2.1.1
Help me please, I have looked up and down in the Internet and still can't find anything that works
Thanks a million!
May.
I'm using the spark-shell for learning purpose and for that I created several scala files containing frequently used code, like class definitions. I use the files by calling the ":load" command within the shell.
Now I would like to to use the spark-shell in in yarn-cluster mode. I start it using spark-shell --master yarn --deploy-mode client.
the shell starts without any issues but when I try to run the code loaded by ":load", I get execution errors.
17/05/04 07:59:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e68_1493271022021_0168_01_000002 on host: xxxw03.mine.de. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e68_1493271022021_0168_01_000002
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:933)
at org.apache.hadoop.util.Shell.run(Shell.java:844)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1123)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:225)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I think I will have to share the code loaded in the shell to the workers. But how do I have to do this?
The spark-shell is useful for quickly testing but once you have an idea of what you want to do and put together a complete program it's usefulness plummets.
You probably want to now move on to using the spark-submit command.
See the docs on submitting an application https://spark.apache.org/docs/latest/submitting-applications.html
Using this command you provide a JAR file instead of individual class files.
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
<main-class> is the Java style path to your class e.g. com.example.MyMainClass
<application-jar> is the path to the JAR file containing the classes in your project and the other params are as per documented on the link I included above but these two are the two key differences in terms of how you supply your code to the cluster.
I have:
- Hadoop
- Spark JobServer
- SQL Database
I have created a file to access my SQL Database from a local instance of the Spark JobServer. In order to do this, I first have to load my JDBC-driver with this command: Class.forName("com.mysql.jdbc.Driver");. However, when I try to execute the file on Spark JobServer, I get a classNotFound error:
"message": "com.mysql.jdbc.Driver",
"errorClass": "java.lang.ClassNotFoundException",
I have read that in order to load the JDBC driver, you have to change some configurations in either the application.conf file of the Spark JobServer or its server_start.sh file. I have done this as follows. In server_start.sh I have changed the cmd value which is sent with as spark-submit command:
cmd='$SPARK_HOME/bin/spark-submit --class $MAIN --driver-memory $JOBSERVER_MEMORY
--conf "spark.executor.extraJavaOptions=$LOGGING_OPTS spark.executor.extraClassPath = hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"
--driver-java-options "$GC_OPTS $JAVA_OPTS $LOGGING_OPTS $CONFIG_OVERRIDES"
--driver-class-path "hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"
--jars "hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"
$# $appdir/spark-job-server.jar $conffile'
I also changed some lines of the application.conf file of the Spark JobServer which is used when starting the instance:
# JDBC driver, full classpath
jdbc-driver = com.mysql.jdbc.Driver
# dependent-jar-uris = ["hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"]
But the error that JDBC class cannot be found still comes back.
Already checked for the following errors:
ERROR1:
In case somebody thinks that I just have the wrong file path (which could very well be the case as far as I know myself), I have checked for the correct file on HDFS with hadoop fs -ls hdfs://quickstart.cloudera:8020/user/cloudera/ and the file was there:
-rw-r--r-- 1 cloudera cloudera 983914 2016-01-26 02:23 hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar
ERROR2:
I have the necessary dependency loaded in my build.sbt file: libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.+" and the import command in my scala-file import java.sql._.
How can I solve this ClassNotFound error?
Are there any good alternatives to JDBC to connect to SQL?
We have something like this in local.conf
# JDBC driver, full classpath
jdbc-driver = org.postgresql.Driver
# Directory where default H2 driver stores its data. Only needed for H2.
rootdir = "/var/spark-jobserver/sqldao/data"
jdbc {
url = "jdbc:postgresql://dbserver/spark_jobserver"
user = "****"
password = "****"
}
dbcp {
maxactive = 20
maxidle = 10
initialsize = 10
}
And in the start script I have
EXTRA_JARS="/opt/spark-jobserver/lib/*"
CLASSPATH="$appdir:$appdir/spark-job-server.jar:$EXTRA_JARS:$(dse spark-classpath)"
And all dependent files that are used by Spark Jobserver is put in /opt/spark-jobserver/lib
I have not used HDFS to load jar for job-server.
But if you need mysql driver to be loaded on spark worker nodes then you should do it via dependent-jar-uris. I think that is what you are doing now.
I have packaged the project using sbt assembly and it finally works and I am happy.
But it's actually not working to have HDFS files in your dependent-jar-uri. So don't use HDFS links as your dependent-jar-uris.
Also, read this link in case you are curious: https://github.com/spark-jobserver/spark-jobserver/issues/372