Redirecting Logs to a File in Scala

I am new to Scala and I am struggling to figure out how to redirect my logs to a file. This is a simple task in Python, but I can't find the relevant documentation for Scala. I am trying to use log4j, but I don't mind using another package. All the references I have found discuss how to do this through a configuration file, whereas I would like to do it programmatically.
This is what I have found so far, and it works, but I do not know how to add a file. I think FileAppender should solve my problem, but I can't find an example of how to add it to my logger:
import org.apache.log4j.Logger
val logger = Logger.getLogger("My Logger")
logger.info("I am a log message")
What I wish to achieve (with some extra details) can be written in Python as follows:
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
handler = logging.FileHandler('output.log')
handler.setLevel(logging.INFO)
logger.addHandler(handler)
logger.info("I am a log message")

Link from comment translated to Scala:
import org.apache.log4j.PatternLayout
import org.apache.log4j.{Level, Logger, FileAppender}
val fa = new FileAppender
fa.setName("FileLogger")
fa.setFile("mylog.log")
fa.setLayout(new PatternLayout("%d %-5p [%c{1}] %m%n"))
fa.setThreshold(Level.DEBUG)
fa.setAppend(true)
fa.activateOptions
//add appender to any Logger (here is root)
Logger.getRootLogger.addAppender(fa)
// usage
val logger = Logger.getLogger("My Logger")
logger.info("I am a log message")
If " import org.apache.log4j.{Level, Logger, FileAppender}" is not worked, means, log4j libraries absent in classpath.

Related

How to "force" CRC files to appear when writing csv/parquet on HDFS in Spark

I seem to have the opposite problem from the rest of the Internet - any search on the topic would throw thousands of questions on how to suppress CRC files when writing out using Spark.
When using Spark on a cluster and writing stuff out to the HDFS I can't see any of the .crc files I usually see on the local system. Any ideas how to "force" them to appear?
You can try the approach below and see if the .crc files appear in the HDFS folders.
val customConf = spark.sparkContext.hadoopConfiguration
val fileSystemObject = org.apache.hadoop.fs.FileSystem.get(customConf)
fileSystemObject.setVerifyChecksum(true)
If you write a text file to HDFS and want only your file, with no checksum sidecar, you need to call setWriteChecksum with false:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val conf = new Configuration()
conf.set("fs.defaultFS", uri)
val hdfs = FileSystem.get(conf)
// this is it!
hdfs.setWriteChecksum(false)
val outputStream = hdfs.create(new Path("full/file/path"))
outputStream.write("string to be written".getBytes)
outputStream.close()
hdfs.close()
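Conversely, if the goal (as in the original question) is to make the checksum files appear, setWriteChecksum(true) is the flag to try, although whether a visible .crc sidecar is produced depends on the FileSystem implementation: checksum-based filesystems such as the local one write .crc sidecar files, while HDFS itself stores checksums in block metadata on the datanodes, so they may never show up there as separate files. A minimal sketch, reusing the uri and path placeholders from above:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val conf = new Configuration()
conf.set("fs.defaultFS", uri) // uri: placeholder for your filesystem URI
val fs = FileSystem.get(conf)
fs.setWriteChecksum(true) // ask the client to write checksum data for new files
val out = fs.create(new Path("full/file/path"))
out.write("string to be written".getBytes)
out.close()
fs.close()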

How to fix spark.read.format("parquet") error

I'm running Scala code on Azure Databricks without problems. Now I want to move this code from the Azure notebook to Eclipse.
I installed the Databricks connection following the Microsoft documentation, and the Databricks data connection test passes.
I also installed SBT and imported it into my project in Eclipse.
I created a Scala object in Eclipse and also imported all the jar files from pyspark as external files.
package Student

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import java.util.Properties
//import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

object Test {
  def isTypeSame(df: DataFrame, name: String, coltype: String) = (df.schema(name).dataType.toString == coltype)

  def main(args: Array[String]) {
    var Result = true
    val Borrowers = List(("col1", "StringType"), ("col2", "StringType"), ("col3", "DecimalType(38,18)"))
    val dfPcllcus22 = spark.read.format("parquet").load("/mnt/slraw/ServiceCenter=*******.parquet")
    if (Result == false) println("Test Fail, Please check") else println("Test Pass")
  }
}
When I run this code in Eclipse, it says it cannot find the main class. But if I comment out "val dfPcllcus22 = spark.read.format("parquet").load("/mnt/slraw/ServiceCenter=*******.parquet")", the test passes.
So it seems spark.read.format cannot be recognized.
I'm new to Scala and Databricks.
I have been researching this for several days but still cannot solve it.
If anyone can help, I would really appreciate it.
The environment is a bit complicated for me; if more information is required, please let me know.
A SparkSession is needed to run your code in Eclipse. Your code does not contain the line that creates the SparkSession, which leads to the error:
val spark = SparkSession.builder.appName("SparkDBFSParquet").master("local[*]").getOrCreate()
Please add this line and run the code and it should work.
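For context, a rough sketch of how the object might look once the session is created explicitly (the /mnt/slraw/... path is the asker's Databricks mount and will only resolve where that mount is reachable, e.g. through databricks-connect):
import org.apache.spark.sql.SparkSession
object Test {
  def main(args: Array[String]): Unit = {
    // Create the session explicitly; in a Databricks notebook `spark` is provided for you.
    val spark = SparkSession.builder
      .appName("SparkDBFSParquet")
      .master("local[*]")
      .getOrCreate()
    val dfPcllcus22 = spark.read.format("parquet")
      .load("/mnt/slraw/ServiceCenter=*******.parquet") // asker's (elided) path
    println(s"Columns: ${dfPcllcus22.columns.mkString(", ")}")
    spark.stop()
  }
}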

Execute Scala code from Spark in Zeppelin

I would like to run Scala code in Zeppelin on a Spark cluster.
For example:
This is the code of the Spark file "HelloWorldScala.scala", stored on HDFS:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object HelloWorldScala {
  def main(arg: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("myApp_Enrico")
    val spark = SparkSession.builder.config(conf).getOrCreate()
    val aList = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    val aRdd = spark.sparkContext.parallelize(aList)
    println("********* HELLO WORLD AND HELLO SPARK!! ******")
    println("Print even numbers")
    aRdd.filter(x => x % 2 == 0).map(x => x * 2).collect().foreach(println)
  }
}
I would like to import the HelloWorldScala file into Zeppelin and run its main method, but I see the error:
[Error code Zeppelin (screenshot)]
Unfortunately, you can't import a single file in Zeppelin. You can package your Scala files into a .jar library and add it via spark.jars (set as a property in the Spark interpreter); after that you can import your library with a line like import your.library.packages.YourClass and use its non-private functions. If you don't know about jar packaging and the spark.jars property, just read a bit more about that.
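For reference, a minimal build.sbt for packaging such a class with sbt package might look like this (the name and the version numbers are illustrative; match the Scala and Spark versions to your cluster):
// build.sbt (illustrative values)
name := "hello-world-scala"
version := "0.1"
scalaVersion := "2.12.15" // match your cluster's Scala version
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" % "provided" // match your Spark version
Running sbt package then produces a jar under target/scala-2.12/ that spark.jars can point at.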
UPDATE:
%dep
z.load("your_package_group:artifact:version")
%spark
import com.yourpackage.HelloWorldScala
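After the import, the object's main method can be invoked from the same paragraph, for example (passing an empty argument array, since the code above never reads it):
HelloWorldScala.main(Array.empty[String])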

Error in running Scala in terminal: "object apache is not a member of package org"

I'm using sublime to write my first Scala program, and I'm using terminal to run it.
First I use the scalac assignment2.scala command to compile it, but it shows the error message: "error: object apache is not a member of package org".
How can I fix it?
This is my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object assignment2 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("assignment2")
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.map(x => x * x)
    println(result.collect().mkString(","))
  }
}
Where are you trying to submit the job from? To run any Spark application you need to submit it with bin/spark-submit from your Spark installation directory, or you need SPARK_HOME set in your environment so that you can refer to it when submitting.
Actually, you can't run a Spark Scala file directly, because compiling your Scala class requires the Spark library on the classpath. So to execute the Scala file you need spark-shell. To execute your Spark Scala file inside spark-shell, follow the steps below:
Open your spark-shell using the following command:
'spark-shell --master yarn-client'
Load your file with its exact location:
':load File_Name_With_Absolute_path'
Run your main method using the class name: 'ClassName.main(null)'
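Put together, an example session might look like the following (the file path is hypothetical; note also that spark-shell already provides a SparkContext as sc, so code calling new SparkContext may complain about an existing context, in which case running the logic against the shell's sc avoids the clash):
// inside spark-shell (started with: spark-shell --master yarn-client)
:load /home/user/assignment2.scala // hypothetical absolute path
assignment2.main(null) // as suggested above; args is never used
// if "new SparkContext" clashes with the shell's existing context, run the logic directly:
val result = sc.parallelize(List(1, 2, 3, 4)).map(x => x * x)
println(result.collect().mkString(","))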

How do I write messages to the output log on AWS Glue?

AWS Glue jobs log output and errors to two different CloudWatch logs, /aws-glue/jobs/error and /aws-glue/jobs/output by default. When I include print() statements in my scripts for debugging, they get written to the error log (/aws-glue/jobs/error).
I have tried using:
log4jLogger = sparkContext._jvm.org.apache.log4j
log = log4jLogger.LogManager.getLogger(__name__)
log.warn("Hello World!")
but "Hello World!" doesn't show up in either of the logs for the test job I ran.
Does anyone know how to go about writing debug log statements to the output log (/aws-glue/jobs/output)?
TIA!
EDIT:
It turns out the above actually does work. What was happening was that I was running the job in the AWS Glue script editor window, which captures Command-F key combinations and only searches within the current script. So when I tried to search within the page for the logging output, it seemed as if it hadn't been logged.
NOTE: while testing the first responder's suggestion, I did discover that AWS Glue scripts don't seem to output any log message with a level lower than WARN!
Try the built-in Python logger from the logging module; by default it writes messages to the standard output stream.
import logging
MSG_FORMAT = '%(asctime)s %(levelname)s %(name)s: %(message)s'
DATETIME_FORMAT = '%Y-%m-%d %H:%M:%S'
logging.basicConfig(format=MSG_FORMAT, datefmt=DATETIME_FORMAT)
logger = logging.getLogger(<logger-name-here>)
logger.setLevel(logging.INFO)
...
logger.info("Test log message")
I know this thread is not new, but maybe it will be helpful for someone:
For me, logging in Glue works with the following lines of code:
# create glue context
glueContext = GlueContext(sc)
# set custom logging on
logger = glueContext.get_logger()
...
#write into the log file with:
logger.info("s3_key:" + your_value)
I noticed the above answers are written in Python. For Scala you could do the following:
import com.amazonaws.services.glue.log.GlueLogger
object GlueApp {
  def main(sysArgs: Array[String]) {
    val logger = new GlueLogger
    logger.info("info message")
    logger.warn("warn message")
    logger.error("error message")
  }
}
You can find both the Python and the Scala solution in the official documentation here.
Just in case this helps: this works to change the log level.
sc = SparkContext()
sc.setLogLevel('DEBUG')
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
logger.info('Hello Glue')
This worked for INFO level in a Glue Python job:
import logging
import sys
root = logging.getLogger()
root.setLevel(logging.DEBUG)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
root.addHandler(handler)
root.info("check")
source
I faced the same problem. I resolved it by adding
logging.getLogger().addHandler(logging.StreamHandler(sys.stdout))
Before that, there were no prints at all, not even at ERROR level.
The idea was taken from here
https://medium.com/tieto-developers/how-to-do-application-logging-in-aws-745114ac6eb7
Another option would be to log to stdout and hook AWS logging up to stdout (using stdout is actually one of the best practices in cloud logging).
Update: it works only with setLevel("WARNING") and only when printing ERROR or WARNING messages. I didn't find out how to make it work for the INFO level :(
If you're just debugging, print() (Python) or println() (Scala) works just fine.