log4j.properties file not found on classpath or ignored - scala

I want to log a Spark job to MapR-DB with log4j. I have written a custom appender, and here is my log4j.properties:
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
log4j.appender.MapRDB=com.datalake.maprdblogger.Appender
log4j.logger.testest=WARN, MapRDB
It is placed in the src/main/resources directory.
This is my main method:
import org.apache.log4j.Logger

object App {
  val log: Logger = org.apache.log4j.LogManager.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {
    // custom appender
    LogHelper.fillLoggerContext("dev", "test", "test", "testest", "")
    log.error("bad record.")
  }
}
When I run spark-submit without any extra configuration, nothing happens. It is as if my log4j.properties were not there.
If I deploy my log4j.properties file manually and add these options:
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/PATH_TO_FILE/log4j.properties
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/PATH_TO_FILE/log4j.properties
It works well. Why doesn't it work without these options?

The "spark.driver.extraJavaOptions" :
Default value is: (none)
A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. Note that it is illegal to set maximum heap size (-Xmx) settings with this option. Maximum heap size settings can be set with spark.driver.memory in the cluster mode and through the --driver-memory command-line option in the client mode.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-java-options command line option or in your default properties file.
[Refer to this link for more details: https://spark.apache.org/docs/latest/configuration.html]
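If relying on the classpath copy keeps failing, a possible workaround (a sketch only, assuming log4j 1.x and that log4j.properties is bundled in the jar as described above) is to configure log4j explicitly at the start of main, so the job no longer depends on -Dlog4j.configuration:

import org.apache.log4j.{LogManager, Logger, PropertyConfigurator}

object App {
  val log: Logger = LogManager.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {
    // Load the log4j.properties bundled in the application jar explicitly,
    // instead of relying on log4j's automatic classpath lookup.
    val in = getClass.getResourceAsStream("/log4j.properties")
    if (in != null) {
      val props = new java.util.Properties()
      props.load(in)
      in.close()
      PropertyConfigurator.configure(props)
    }

    // custom appender setup from the original question
    LogHelper.fillLoggerContext("dev", "test", "test", "testest", "")
    log.error("bad record.")
  }
}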

Related

Spark Application Not reading log4j.properties present in Jar

I am using MapR 5.2 with Spark 2.1.0, and I am running my Spark application jar in YARN cluster mode.
I have tried all the options I could find, but without success.
This is our production environment, and I need this particular Spark job to pick up the log4j-Driver.properties file that is present in my src/main/resources folder (I confirmed by opening the jar that the properties file is there).
1) Contents of my log configuration file, log4j-Driver.properties:
log4j.rootCategory=DEBUG, FILE
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=/users/myuser/logs/Myapp.log
log4j.appender.FILE.ImmediateFlush=true
log4j.appender.FILE.Threshold=debug
log4j.appender.FILE.Append=true
log4j.appender.FILE.MaxFileSize=100MB
log4j.appender.FILE.MaxBackupIndex=10
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
2) My spark-submit script:
propertyFile=application.properties
spark-submit --class MyAppDriver \
--conf "spark.driver.extraJavaOptions -Dlog4j.configuration=file:/users/myuser/log4j-Driver.properties" \
--master yarn --deploy-mode cluster \
--files /users/myuser/log4j-Driver.properties,/opt/mapr/spark/spark-2.1.0/conf/hive-site.xml,/users/myuser/application.properties \
/users/myuser/myapp_2.11-1.0.jar $propertyFile
For now, all I need is to write my driver logs to the directory mentioned in the properties file above. If I succeed with that, I will try the executor logs as well. But first I need the driver log to be written locally (on an edge node of our cluster).
/users/myuser/log4j-Driver.properties seems to be the path to the file on your local machine, so you were right to use it for the --files argument.
The problem is that there is no such file on the driver and/or executors, so when you pass file:/users/myuser/log4j-Driver.properties to -Dlog4j.configuration, log4j fails to find it.
Since you run on YARN, files listed as arguments to --files are uploaded to HDFS. Each application submission gets its own base directory in HDFS, where spark-submit puts all of the files, and YARN then makes them available in each container's working directory.
To refer to these files, use relative paths. In your case --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-Driver.properties" should work.
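To double-check that YARN actually localized the file into the driver container, a small sketch like the following can be added at the start of the driver code (it assumes the file name passed to --files and uses the log4j 1.x PropertyConfigurator as a fallback):

import java.io.File
import org.apache.log4j.PropertyConfigurator

// In YARN cluster mode, files passed with --files end up in the container's
// working directory, so a relative path should resolve.
val log4jFile = new File("log4j-Driver.properties")
if (log4jFile.exists()) {
  // Re-apply the configuration explicitly, in case -Dlog4j.configuration was not picked up.
  PropertyConfigurator.configure(log4jFile.getAbsolutePath)
} else {
  System.err.println(s"log4j-Driver.properties not found in ${new File(".").getAbsolutePath}")
}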

How to disable logging in a Spark/Scala project using Gradle?

I am using IntelliJ with Gradle to build a Spark project. I have tried multiple ways to disable logging, but to no avail.
Here are some of the things that I have tried.
1. Add log4j.properties under src/main/resources with the following configuration
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
# Change this to set Spark log level
log4j.logger.org.apache.spark=WARN
# Silence akka remoting
log4j.logger.Remoting=WARN
# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.eclipse.jetty=WARN
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
2. Add these before and after the SparkSession creation:
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
Logger.getLogger("spark").setLevel(Level.OFF)
LogManager.getRootLogger.setLevel(Level.OFF)
3. Add this when creating the SparkSession:
val spark: SparkSession = SparkSession.builder()
.appName("test").master("local")
.enableHiveSupport()
.getOrCreate()
spark.sparkContext.setLogLevel("OFF")
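One more diagnostic that could be added before the SparkSession is created (a sketch, on the assumption that a log4j.properties earlier on the classpath, such as Spark's own conf/log4j.properties, may be shadowing the one in src/main/resources): print which file the classloader actually resolves.

// Check which log4j.properties the classloader picks up at runtime.
val log4jUrl = getClass.getClassLoader.getResource("log4j.properties")
println(s"log4j.properties resolved from: $log4jUrl")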
I am running IntelliJ on macOS, with
scalaVersion=2.11.8
sparkVersion=2.1.1
Please help; I have looked all over the Internet and still can't find anything that works.
Thanks a million!
May.

Spark 2.2.0 - unable to read recursively into directory structure

Problem summary:
I am unable to read from nested subdirectories in my Spark program, despite setting the required Hadoop configuration (see the attempts below).
I get the error pasted below.
Any help is appreciated.
Version:
Spark 2.2.0
Input directory layout:
/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/batch_id=1502939225073/part-00000-3a44cd00-e895-4a01-9ab9-946064b739d4-c000.parquet
/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/batch_id=1502939234036/part-00000-cbd47353-0590-4cc1-b10d-c18886df1c25-c000.parquet
...
Input directory parameter passed:
/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/*/*
Attempted (1):
Set the parameter in code:
val sparkSession: SparkSession = SparkSession.builder().master("yarn").getOrCreate()

//Recursive glob support & loglevel
import sparkSession.implicits._
sparkSession.sparkContext.hadoopConfiguration.setBoolean("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", true)
I did not see the configuration in place in the Spark UI.
Attempted (2):
Passed the config from the CLI - spark-submit, and set it in code (see below).
spark-submit --conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true \...
I do see the configuration in the Spark UI, but I get the same error – it cannot traverse into the directory structure.
Code:
//Spark Session
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkSession: SparkSession = SparkSession.builder().master("yarn").getOrCreate()

//Recursive glob support
val conf = new SparkConf()
val cliRecursiveGlobConf = conf.get("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive")

import sparkSession.implicits._
sparkSession.sparkContext.hadoopConfiguration.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", cliRecursiveGlobConf)
Error & overall output:
Full error is at - https://gist.github.com/airawat/77fbdb821410a5a87dfd29ffaf60fdf9
17/08/18 15:59:29 INFO state.StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" java.io.FileNotFoundException: File /user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/batch_id=*/* does not exist.
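One detail worth checking (an assumption, not a confirmed fix): when the option is set directly on hadoopConfiguration, the spark.hadoop. prefix should be dropped, because Spark only strips that prefix for entries passed through --conf or SparkConf. A sketch:

// Set the Hadoop option directly, without the "spark.hadoop." prefix.
sparkSession.sparkContext.hadoopConfiguration
  .setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true)

// Entries passed as --conf spark.hadoop.* are copied into hadoopConfiguration
// automatically with the prefix removed, so they can be read back like this:
val recursive = sparkSession.sparkContext.hadoopConfiguration
  .get("mapreduce.input.fileinputformat.input.dir.recursive", "false")
println(s"dir.recursive = $recursive")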

Spark JobServer JDBC-ClassNotFound error

I have:
- Hadoop
- Spark JobServer
- SQL Database
I have created a file to access my SQL database from a local instance of the Spark JobServer. In order to do this, I first have to load my JDBC driver with this command: Class.forName("com.mysql.jdbc.Driver");. However, when I try to execute the file on the Spark JobServer, I get a ClassNotFoundException:
"message": "com.mysql.jdbc.Driver",
"errorClass": "java.lang.ClassNotFoundException",
I have read that in order to load the JDBC driver, you have to change some configuration in either the application.conf file of the Spark JobServer or its server_start.sh file. I have done this as follows. In server_start.sh I have changed the cmd value which is sent as the spark-submit command:
cmd='$SPARK_HOME/bin/spark-submit --class $MAIN --driver-memory $JOBSERVER_MEMORY
--conf "spark.executor.extraJavaOptions=$LOGGING_OPTS spark.executor.extraClassPath = hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"
--driver-java-options "$GC_OPTS $JAVA_OPTS $LOGGING_OPTS $CONFIG_OVERRIDES"
--driver-class-path "hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"
--jars "hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"
$# $appdir/spark-job-server.jar $conffile'
I also changed some lines of the application.conf file of the Spark JobServer which is used when starting the instance:
# JDBC driver, full classpath
jdbc-driver = com.mysql.jdbc.Driver
# dependent-jar-uris = ["hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"]
But the error that the JDBC class cannot be found still comes back.
I have already checked for the following possible errors:
ERROR1:
In case somebody thinks that I just have the wrong file path (which could very well be the case, as far as I know), I have checked for the file on HDFS with hadoop fs -ls hdfs://quickstart.cloudera:8020/user/cloudera/ and the file was there:
-rw-r--r-- 1 cloudera cloudera 983914 2016-01-26 02:23 hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar
ERROR2:
I have the necessary dependency declared in my build.sbt file: libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.+", and the import statement import java.sql._ in my Scala file.
How can I solve this ClassNotFound error?
Are there any good alternatives to JDBC to connect to SQL?
We have something like this in local.conf
# JDBC driver, full classpath
jdbc-driver = org.postgresql.Driver
# Directory where default H2 driver stores its data. Only needed for H2.
rootdir = "/var/spark-jobserver/sqldao/data"
jdbc {
  url = "jdbc:postgresql://dbserver/spark_jobserver"
  user = "****"
  password = "****"
}
dbcp {
  maxactive = 20
  maxidle = 10
  initialsize = 10
}
And in the start script I have
EXTRA_JARS="/opt/spark-jobserver/lib/*"
CLASSPATH="$appdir:$appdir/spark-job-server.jar:$EXTRA_JARS:$(dse spark-classpath)"
And all dependent files that are used by Spark JobServer are put in /opt/spark-jobserver/lib.
I have not used HDFS to load jars for the job server.
But if you need the MySQL driver to be loaded on the Spark worker nodes, then you should do it via dependent-jar-uris. I think that is what you are doing now.
I have packaged the project using sbt assembly, and it finally works; I am happy.
However, it does not actually work to have HDFS files in your dependent-jar-uris, so don't use HDFS links as your dependent-jar-uris.
Also, read this link in case you are curious: https://github.com/spark-jobserver/spark-jobserver/issues/372
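Once the connector is actually on the driver/executor classpath (for example bundled with sbt assembly, as noted above), a minimal JDBC check looks roughly like this (the URL, user and password below are placeholders, not values from the question):

import java.sql.{Connection, DriverManager}

// With JDBC 4+ drivers the Class.forName call is optional, but it is the line
// that fails with ClassNotFoundException when the jar is missing.
Class.forName("com.mysql.jdbc.Driver")

val conn: Connection = DriverManager.getConnection(
  "jdbc:mysql://dbserver:3306/mydb", "user", "password")
try {
  val rs = conn.createStatement().executeQuery("SELECT 1")
  while (rs.next()) println(rs.getInt(1))
} finally {
  conn.close()
}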

Programmatically setting (remote) master address for launching Spark

Note that the following local setting does work:
sc = new SparkContext("local[8]", testName)
But setting the remote master programmatically does not work:
sc = new SparkContext(master, testName)
or (same end result)
val sconf = new SparkConf()
.setAppName(testName)
.setMaster(master)
sc = new SparkContext(sconf)
In both of the latter cases the result is:
[16:25:33,427][INFO ][AppClient$ClientActor] Connecting to master akka.tcp://sparkMaster@mellyrn:7077/user/Master...
[16:25:33,439][WARN ][ReliableDeliverySupervisor] Association with remote system [akka.tcp://sparkMaster@mellyrn:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
The following command line approach for setting the spark master consistently works (verified on multiple projects):
$SPARK_HOME/bin/spark-submit --master spark://mellyrn.local:7077 \
  --class $1 $curdir/sparkclass.jar )
Clearly there is some additional configuration happening related to the command line spark-submit. Anyone want to posit what that might be?
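One difference worth noting (a guess, not a confirmed diagnosis of the Disassociated error): spark-submit ships the application jar to the cluster, which a programmatically created SparkContext does not do unless told to. A sketch of doing the same from code, where the jar path and master URL are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val testName = "test" // placeholder app name
val sconf = new SparkConf()
  .setAppName(testName)
  .setMaster("spark://mellyrn.local:7077")          // placeholder master URL
  .setJars(Seq("target/scala-2.11/sparkclass.jar")) // placeholder path to the application jar
val sc = new SparkContext(sconf)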
In the UNIX shell script below:
SP_MAST_URL=$($CASSANDRA_HOME/dse client-tool spark master-address)
echo $SP_MAST_URL
This will print the master address from your Spark cluster environment. You can use this command utility and pass the result on to the spark-submit command.
Note: CASSANDRA_HOME is the path where the Apache Cassandra installation resides. It could be any UNIX file path, depending on the environment.
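As a side note (a sketch under the assumption that the master URL is supplied externally): when the job is launched through spark-submit --master, the application can omit setMaster entirely and let spark-submit provide spark.master:

import org.apache.spark.{SparkConf, SparkContext}

// spark.master is supplied by spark-submit (via --master or spark-defaults.conf),
// so it is not set in code here.
val sc = new SparkContext(new SparkConf().setAppName("testName"))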