How to do effective logging in a Spark application (Scala)

I have a Spark application written in Scala that runs a series of Spark SQL statements. The results are computed by calling a single action, count(), at the end against the final dataframe. I would like to know what the best way is to do logging from within a Spark/Scala application job. Since all the dataframes (around 20 of them) are computed using a single action at the end, what are my options when it comes to logging the outputs/sequence/success of particular statements?
The question is a little generic in nature. Since Spark uses lazy evaluation, the execution plan is decided by Spark, and I want to know up to what point the application statements ran successfully and what the intermediate results were at that stage.
The intention here is to monitor the long-running job and see up to which point it was fine and where the problems crept in.
If we put logging before/after transformations, it gets printed when the code is read (while the plan is being built). So the logging has to emit custom messages during the actual execution, i.e. when the action is called at the end of the Scala code. If I put count/take/first etc. in between the code, the execution of the job slows down a lot.
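To make the lazy-evaluation point concrete, here is a minimal sketch (the spark session, the input_table name, and the amount column are made up for illustration): the first three log lines print while the plan is being built, and only the count() at the end actually runs the transformations.
val logger = org.apache.log4j.Logger.getLogger(getClass.getName)
logger.info("defining step 1")                        // printed immediately, at plan-building time
val step1 = spark.sql("SELECT * FROM input_table")    // transformation only, nothing executes yet
logger.info("defining step 2")                        // also printed immediately
val step2 = step1.filter("amount > 0")
logger.info("calling the action")                     // only the next line triggers real work
val n = step2.count()                                 // all transformations execute here
logger.info(s"final count = $n")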

I understand the problem that you are facing. Let me put out a simple solution for this.
You need to make use of org.apache.log4j.Logger. Use the following lines of code to generate logger messages.
val logger = org.apache.log4j.Logger.getRootLogger()
logger.error(errorMessage)
logger.info(infoMessage)
logger.debug(debugMessage)
Now, in order to redirect these messages to a log file, you need to create a log4j properties file with the contents below.
# Root logger option
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=OFF
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=OFF
log4j.logger.org.spark-project.jetty.servlet.ServletHandler=OFF
log4j.logger.org.spark-project.jetty.server=OFF
log4j.logger.org.spark-project.jetty=OFF
log4j.category.org.spark_project.jetty=OFF
log4j.logger.Remoting=OFF
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# Setting properties to have logger logs in local file system
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.encoding=UTF-8
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.maxFileSize=50MB
log4j.logger.org.apache.spark=OFF
log4j.logger.org.spark-project=OFF
log4j.logger.org.apache.hadoop=OFF
log4j.logger.io.netty=OFF
log4j.logger.org.apache.zookeeper=OFF
log4j.rootLogger=INFO, rolling
log4j.appender.rolling.file=/tmp/logs/application.log
You can name the log file in the last line. Make sure the folder exists on every node with appropriate permissions.
Now, we need to pass these configurations while submitting the Spark job, as follows:
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
And,
--files "location of spark-log4j.properties file"
Hope this helps!

You can use the Log4j 2 library from Maven:
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <version>${log4j.version}</version>
</dependency>
For logging, first create a logger object; then you can log at different levels such as info, error, and warning. Below is an example of info logging in Spark Scala using Log4j:
import org.apache.logging.log4j.LogManager
val logger = LogManager.getLogger(this.getClass.getName)
logger.info("logging message")
So, to add info messages at certain points, call logger.info("logging message") at those points.
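As a small sketch of what this can look like at a couple of checkpoints in a Spark driver (the spark session, the SQL text, and the employees table are hypothetical), logging before and after the single action that triggers execution:
import org.apache.logging.log4j.LogManager

val logger = LogManager.getLogger(this.getClass.getName)
logger.info("defining the aggregation")                     // plan-time message
val result = spark.sql("SELECT dept, count(*) AS c FROM employees GROUP BY dept")
val start = System.nanoTime()
logger.info("triggering execution with count()")
val rows = result.count()                                   // the one action that runs everything
val elapsedMs = (System.nanoTime() - start) / 1000000
logger.info(s"count finished: $rows rows in $elapsedMs ms")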

Related

How to log useful error if scala play config files missing?

Is it possible to give a better error message if the config filename is mistyped when using Scala Play's play.api.Configuration?
E.g. if I run my application with sbt run -J-Dconfig.file=conf/my-config.conf but the file is actually called my_config.conf, no "file not found" error is raised; instead, the error first appears when applicationConfig.has(configPath) is called, at which point it is not clear how to determine programmatically whether a config value is missing from the file or the config file itself is missing.
Here is what I do:
Wrap the configuration in a Config class.
Initialize that class on startup.
Log all property values.
This will surface exceptions on startup. Here is an example: AdaptersContext.scala
As a remark:
If you have your config file in the conf directory (on the classpath), use:
config.resource=demo.conf
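A minimal sketch of such a fail-fast wrapper (the class name, keys, and the use of Typesafe Config's ConfigFactory directly are just for illustration; the same idea works when wrapping play.api.Configuration):
import com.typesafe.config.{Config, ConfigFactory}
import org.slf4j.LoggerFactory

class AppConfig(underlying: Config = ConfigFactory.load()) {
  private val log = LoggerFactory.getLogger(getClass)

  // Reading every value eagerly makes a missing file or key fail at startup,
  // not at the first call to has(...)/get(...).
  val dbUrl: String  = underlying.getString("app.db.url")
  val dbUser: String = underlying.getString("app.db.user")

  // Log the resolved values once so the effective configuration is visible.
  log.info(s"app.db.url=$dbUrl, app.db.user=$dbUser")
}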

Can we use akka.event.Logging to write logs in file?

I have tried using log4j and slf4j with Akka in Scala and I am able to get log files. Can I achieve the same thing without using any external API other than the Akka APIs? Using akka.event.Logging I am able to print logs to the console, but I want to print them to a file.
I have already tried setting a log4j.properties file for my project on the classpath, and it is not working when I am using akka.event.Logging.
Please suggest.
According to http://doc.akka.io/docs/akka/current/java/logging.html
you have 3 options:
Use akka.event.Logging$DefaultLogger (to stdout, not for production)
Use akka.event.slf4j.Slf4jLogger (logger by akka for SLF4J)
Use the SLF4J API directly (with async appender)
Your case is 2 or 3 (you use log4j.properties).
Therefore you should properly configure the log4j.properties file to write output to a file.
And
in case 2 (your desired case), you should use akka.event.Logging, for example: Logging.getLogger(system.eventStream(), "my.string")
in case 3, you should use SLF4J API, for example: org.slf4j.LoggerFactory.getLogger(...)
You are in case 2 if your Akka config contains something like this:
akka {
  loggers = ["akka.event.slf4j.Slf4jLogger"]
  loglevel = "DEBUG"
  logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
}
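For completeness, a small sketch of case 2 from the actor side (the actor and system names are illustrative); the LoggingAdapter routes through whatever loggers are configured, here the Slf4jLogger, whose backend writes the file:
import akka.actor.{Actor, ActorSystem, Props}
import akka.event.Logging

class Worker extends Actor {
  val log = Logging(context.system, this)        // adapter bound to the configured loggers
  def receive = {
    case msg => log.info("received: {}", msg)    // ends up in the SLF4J backend's file appender
  }
}

val system = ActorSystem("my-system")
system.actorOf(Props[Worker], "worker") ! "hello"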

IntelliJ IDEA - Disable INFO messages when running a Spark application

I'm getting so many messages when running an application that uses Apache Spark and the HBase/Hadoop libraries. For example:
0 [main] DEBUG org.apache.hadoop.metrics2.lib.MutableMetricsFactory - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation #org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, valueName=Time, value=[Rate of successful kerberos logins and latency (milliseconds)])
How can I disable them, so that I just get the straight-to-the-point output, like println(varABC), only?
What you are seeing are logs produced by Spark through log4j; by default it enables quite a lot of printouts to stderr. You can configure it as you would usually configure log4j behavior, e.g. through a log4j.properties configuration file. Refer to http://spark.apache.org/docs/latest/configuration.html#configuring-logging
In the /spark-2.0.0-bin-hadoop2.6/conf folder you have a file log4j.properties.template.
Rename it from log4j.properties.template to log4j.properties
and make the following change in log4j.properties,
from: log4j.rootCategory=INFO, console
to: log4j.rootCategory=ERROR, console
Hope this helps!
Under the $SPARK_HOME/conf directory, modify the log4j.properties file, changing the values from INFO to ERROR as below:
log4j.rootLogger=${root.logger}
root.logger=ERROR,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.logger.org.apache.spark.repl.Main=WARN
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
This will disable all INFO log messages and will only print ERROR or FATAL log messages. You can change these values according to your requirements.
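If you would rather not touch the properties file, the same effect can be achieved programmatically near the start of the application; a sketch (assuming Log4j 1.x on the classpath, as shipped with Spark 2.x, and an existing SparkSession called spark):
import org.apache.log4j.{Level, Logger}

Logger.getLogger("org").setLevel(Level.ERROR)     // quiets Spark/Hadoop packages under org.*
Logger.getLogger("akka").setLevel(Level.ERROR)
// or, once the context exists:
spark.sparkContext.setLogLevel("ERROR")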

Running junit tests in maven ignores programmatic log4j2 setup

I have a particular JUnit test which processes a big data file and when the logging level is left on TRACE, it kills Eclipse - something to do with the console handling, which is not relevant to this question.
I often switch between running all my tests using m2e, in which case I don't need any debugging log output, and running individual tests, where I often want to see the TRACE output.
To avoid having to edit my log4j2.xml config every time, I coded the log4j config to raise the logging level to INFO in this particular test, like this (from programmatically-change-log-level-in-log4j2):
@Before
public void beforeTest() {
    LoggerContext ctx = (LoggerContext) LogManager.getContext(false);
    Configuration config = ctx.getConfiguration();
    LoggerConfig loggerConfig = config.getLoggerConfig(LogManager.ROOT_LOGGER_NAME);
    initialLogLevel = loggerConfig.getLevel();
    loggerConfig.setLevel(Level.INFO);
}
But it has no effect.
If the "ROOT_LOGGER" that I am manipulating here represents the same logger as the <root> in my log4j2.xml, then this is not going to work, is it? I need to override all the other loggers, or shut it down completely, but how?
Could it be influenced by my use of slf4j as the log4j2 wrapper in all of my other classes?
I have tried getting hold of the Appenders and calling appender.stop(), but that doesn't work.
You can put a log4j2 config file in the src/test/resources/ directory. During unit tests, that file will be used.
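For example, a file named log4j2-test.xml placed in src/test/resources/ is picked up automatically by Log4j 2 during tests; the logger name and levels below are only an illustration:
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{HH:mm:ss} %-5level %logger{1} - %msg%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <!-- keep the noisy big-file test at INFO while everything else stays at TRACE -->
    <Logger name="com.example.BigFileTest" level="INFO"/>
    <Root level="TRACE">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>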

Can Perl's Log::Log4perl's log levels be changed dynamically without updating config?

I have a Mason template running under mod_perl, which is using Log::Log4perl.
I want to change the log level of a particular appender, but changing the config is too awkward, as it would have to pass through our deployment process to go live.
Is there a way to change the log level of an appender at run-time, after Apache has started, without changing the config file, and then have that change affect any new Apache threads?
If you've imported the log level constants from Log::Log4perl::Level, then you can do things like:
$logger->level($ERROR); # one of DEBUG, INFO, WARN, ERROR, FATAL
$logger->more_logging($delta); # Increase log level by $delta levels,
# a positive integer
$logger->less_logging($delta); # Decrease log level by $delta levels.
This is in the Changing the Log Level on a Logger section in the Log::Log4perl docs.
It seems kinda hacky to me, but it works:
$Log::Log4perl::Logger::APPENDER_BY_NAME{SCREEN}->threshold($DEBUG);
And to make it more dynamic, you could pass in a variable for the Appender name and level.
%LOG4PERL_LEVELS = (
    OFF   => $OFF,
    FATAL => $FATAL,
    ERROR => $ERROR,
    WARN  => $WARN,
    INFO  => $INFO,
    DEBUG => $DEBUG,
    TRACE => $TRACE,
    ALL   => $ALL,
);
$Log::Log4perl::Logger::APPENDER_BY_NAME{$appender_name}->threshold($LOG4PERL_LEVELS{$new_level});