How to write spark-submit log files from Scala code?

I am trying to build a Scala-based jar that uses log4j to write logs. Running the code below with spark-shell works fine (logs print to the console), but when I try to make it write to a log file (with spark-shell or spark-submit), only the logging.info line is written out. I want to set the log level to DEBUG. Here is my code:
import org.apache.log4j
import org.apache.log4j.{Level, Logger, PatternLayout, Priority, RollingFileAppender}
import org.apache.spark.sql.SparkSession
import java.time
import java.time.format.DateTimeFormatter

trait SparkContextProvider {
  def spark: SparkSession
}

trait Logs extends SparkContextProvider {
  lazy val logging: log4j.Logger = Logger.getLogger(getClass.getName)
  lazy val applicationId: String = spark.sparkContext.applicationId

  val appender = new RollingFileAppender()
  appender.setAppend(true)
  appender.setMaxFileSize("50MB")
  appender.setMaxBackupIndex(10)
  appender.setFile("/usr/spark-3.0.2/app-logs/spark-" + applicationId + ".log")
  appender.activateOptions()

  val layOut = new PatternLayout()
  layOut.setConversionPattern("%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n")
  appender.setLayout(layOut)

  logging.addAppender(appender)
  logging.setLevel(Level.DEBUG)
}

object DataExtractionProcess extends Logs {
  def Main(): Unit = {
    logging.info("hello test world")
  }

  override def spark: SparkSession = SparkSession.builder
    .appName("PredictiveDataOperation")
    .getOrCreate()
}
I trigger the job with DataExtractionProcess.Main().
I also tried to set the log level with:
//Logger.getLogger("org.apache.spark").setLevel(Level.DEBUG)
//Logger.getRootLogger().setLevel(Level.DEBUG)
//spark.sparkContext.setLogLevel("all")
But nothing changes in the log file.
Thanks for the help
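
One thing worth checking, offered as a hedged sketch rather than a confirmed fix (assuming log4j 1.x, which Spark bundles): set the layout before calling activateOptions() (which opens the file), give the appender an explicit threshold, and attach it to the root logger so DEBUG events from every logger reach the file, not just those from your class:

import org.apache.log4j.{Level, Logger, PatternLayout, RollingFileAppender}

// A minimal sketch, assuming log4j 1.x; the output path is a placeholder.
val layout = new PatternLayout()
layout.setConversionPattern("%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n")

val appender = new RollingFileAppender()
appender.setLayout(layout)                 // layout first, then activateOptions()
appender.setFile("/usr/spark-3.0.2/app-logs/spark-test.log")
appender.setAppend(true)
appender.setMaxFileSize("50MB")
appender.setMaxBackupIndex(10)
appender.setThreshold(Level.DEBUG)         // let DEBUG events through the appender
appender.activateOptions()

Logger.getRootLogger.addAppender(appender)
Logger.getRootLogger.setLevel(Level.DEBUG)

Note also that in cluster deploy modes the driver and executors run on other machines, so a local file path only captures output from the process that executes this code.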

Related

How to override Spark default log4j profile in IntelliJ?

I am writing Spark code in IntelliJ with sbt. I have tried many tricks to turn off Spark warnings.
My code:
import org.apache.spark.{SparkConf, SparkContext, sql}
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}

object TheObject {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("SimpleApplication")
      .config("spark.master", "local")
      .getOrCreate()
    Logger.getLogger("org").setLevel(Level.ERROR)
    val data = spark.read.format("csv").load("data.csv") // .load() added so this compiles; path is a placeholder
    data.printSchema()
  }
}
But the output still says:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
and nothing changes.
Is there any way to turn it off without changing any properties file?
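
A hedged sketch of one approach that often helps (assuming log4j 1.x, which Spark 2.x bundles): raise the logger levels before the SparkSession is built, since the noisy startup messages are emitted while the session is being constructed:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object TheObject {
  def main(args: Array[String]): Unit = {
    // Raise the thresholds before any Spark class logs anything.
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)

    val spark = SparkSession.builder
      .appName("SimpleApplication")
      .config("spark.master", "local")
      .getOrCreate()
  }
}

The "Using Spark's default log4j profile" banner itself is printed while Spark initializes its logging, so it may still appear; suppressing it entirely generally requires shipping your own log4j.properties on the classpath.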

sbt test fails my Spark testing suite while the IntelliJ test works

I am trying to test the behaviour of a class which consumes and processes DataFrames.
Following this previous question: How to write unit tests in Spark 2.0+?, I tried to use the loan pattern to run my tests in the following way:
I have a SparkSession provider trait:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * This trait allows using Spark in unit tests.
 * https://stackoverflow.com/questions/43729262/how-to-write-unit-tests-in-spark-2-0
 */
trait SparkSetup {
  def withSparkSession(testMethod: SparkSession => Any) {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("Spark test")
    val sparkSession = SparkSession
      .builder()
      .config(conf)
      .enableHiveSupport()
      .getOrCreate()
    try {
      testMethod(sparkSession)
    }
    // finally sparkSession.stop()
  }
}
Which I use in my test class:
class InnerNormalizationStrategySpec
  extends WordSpec
    with Matchers
    with BeforeAndAfterAll
    with SparkSetup {

  ...

  "A correct contact message" should {
    "be normalized without errors" in withSparkSession { ss =>
      import ss.implicits._

      val df = ss.createDataFrame(
        ss.sparkContext.parallelize(Seq[Row](Row(validContact))),
        StructType(List(StructField("value", StringType, nullable = false))))

      val result = target.innerTransform(df)

      val collectedResult: Array[NormalizedContactHistoryMessage] = result
        .where(result.col("contact").isNotNull)
        .as[NormalizedContactHistoryMessage]
        .collect()

      collectedResult.isEmpty should be(false)               // There should be something
      collectedResult.length should be(1)                    // There should be exactly 1 message...
      collectedResult.head.contact.isDefined should be(true) // ... of type contact.
    }
  }

  ...
}
When I run my tests from IntelliJ (running the whole Spec class at once), all tests written in this manner pass; however, the sbt test command from the terminal makes every test fail.
I also thought it was due to parallelism, so I added
concurrentRestrictions in Global += Tags.limit(Tags.Test, 1)
to my sbt settings, but that didn't help.
Here is the stack trace I receive: https://pastebin.com/LNTd3KGW
Any help?
Thanks
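
A hedged suggestion rather than a confirmed diagnosis: Spark suites that pass in IntelliJ but fail under sbt are often hitting classloader or shared-JVM differences, and a common workaround is to fork a separate test JVM and run suites serially. A minimal build.sbt sketch (written in sbt 1.x slash syntax; the memory setting is an arbitrary example):

// Fork a fresh JVM for tests so Spark does not run inside sbt's classloader,
// and run test suites one at a time to avoid SparkContext clashes.
Test / fork := true
Test / parallelExecution := false
Test / javaOptions ++= Seq("-Xmx2g")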

Best practice to create a SparkSession object in Scala to use both in unit tests and spark-submit

I have written a transform method that maps a DataFrame to a DataFrame.
I also want to test it with ScalaTest.
As you know, in Spark 2.x with Scala API, you can create SparkSession object as follows:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
  .config("spark.master", "local[2]")
  .getOrCreate()
This code works fine with unit tests.
But when I run this code with spark-submit, the cluster options do not take effect.
For example,
spark-submit --master yarn --deploy-mode client --num-executors 10 ...
does not create any executors.
I have found that the spark-submit arguments are applied when I remove the config("spark.master", "local[2]") part of the above code.
But without the master setting, the unit test code does not work.
I tried to split the SparkSession creation between the test code and the main code.
But many code blocks need spark, for example import spark.implicits._ and spark.createDataFrame(rdd, schema).
Is there a best practice for creating the SparkSession so that it works both in tests and under spark-submit?
One way is to create a trait which provides the SparkContext/SparkSession, and use that in your test cases, like so:
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}

trait SparkTestContext {
  private val master = "local[*]"
  private val appName = "testing"
  System.setProperty("hadoop.home.dir", "c:\\winutils\\")
  private val conf: SparkConf = new SparkConf()
    .setMaster(master)
    .setAppName(appName)
    .set("spark.driver.allowMultipleContexts", "false")
    .set("spark.ui.enabled", "false")

  val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
  val sc: SparkContext = ss.sparkContext
  val sqlContext: SQLContext = ss.sqlContext
}
Your test class header then looks like this, for example:
class TestWithSparkTest extends BaseSpec with SparkTestContext with Matchers{
Here is a version where Spark shuts down correctly after the tests:
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite, Matchers}

trait SparkTest extends FunSuite with BeforeAndAfterAll with Matchers {
  var ss: SparkSession = _
  var sc: SparkContext = _
  var sqlContext: SQLContext = _

  override def beforeAll(): Unit = {
    val master = "local[*]"
    val appName = "MyApp"
    val conf: SparkConf = new SparkConf()
      .setMaster(master)
      .setAppName(appName)
      .set("spark.driver.allowMultipleContexts", "false")
      .set("spark.ui.enabled", "false")
    ss = SparkSession.builder().config(conf).getOrCreate()
    sc = ss.sparkContext
    sqlContext = ss.sqlContext
    super.beforeAll()
  }

  override def afterAll(): Unit = {
    sc.stop()
    super.afterAll()
  }
}
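
A suite can then extend the trait directly; a minimal usage sketch (the suite name and assertion are invented for illustration):

class WordSplitSpec extends SparkTest {
  test("parallelize produces the expected count") {
    val rdd = sc.parallelize(Seq("a", "b", "c"))
    rdd.count() shouldBe 3
  }
}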
The spark-submit parameter --master yarn sets YARN as the master, and it will conflict with a master("x") call in your code, even master("yarn").
If you want to use import sparkSession.implicits._ (for toDF, toDS, and other helpers), you can just use a local sparkSession variable created like below:
val spark = SparkSession.builder().appName("YourName").getOrCreate()
Without a master("x") setting in the code, spark-submit --master yarn then works on the cluster (though not on a local machine).
My advice: do not use a global sparkSession in your code; that may cause errors or exceptions.
Hope this helps. Good luck!
How about defining an object in which a method creates a singleton instance of SparkSession, such as MySparkSession.get(), and passing it as a parameter to each of your unit tests?
In your main method, you can create a separate SparkSession instance, which can have a different configuration. A sketch follows.
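
A minimal sketch of that idea (MySparkSession and its get() method are the names suggested above, not an existing API):

import org.apache.spark.sql.SparkSession

object MySparkSession {
  // Lazily build one shared session. Tests pass a local master;
  // main() passes None so that spark-submit's --master flag applies.
  private var instance: Option[SparkSession] = None

  def get(master: Option[String] = None): SparkSession = synchronized {
    instance.getOrElse {
      val builder = SparkSession.builder().appName("MyApp")
      val session = master.fold(builder)(builder.master).getOrCreate()
      instance = Some(session)
      session
    }
  }
}

A test would call MySparkSession.get(Some("local[2]")), while main() would call MySparkSession.get() and let spark-submit supply the master.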

How should I write unit tests in Spark, for a basic data frame creation example?

I'm struggling to write a basic unit test for the creation of a data frame, using the example text file provided with Spark, as follows.
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {

  private val master = "local[*]"
  private val appName = "data_load_testing"

  private var spark: SparkSession = _

  override def beforeEach() {
    spark = new SparkSession.Builder().appName(appName).getOrCreate()
  }

  import spark.implicits._

  case class Person(name: String, age: Int)

  val df = spark.sparkContext
    .textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
    .toDF()

  test("Creating dataframe should produce data frame of correct size") {
    assert(df.count() == 3)
    assert(df.take(1).equals(Array("Michael", 29)))
  }

  override def afterEach(): Unit = {
    spark.stop()
  }
}
I know that the code itself works (from spark.implicits._ through .toDF()) because I have verified this in the Spark Scala shell, but inside the test class I'm getting lots of errors; the IDE does not recognise import spark.implicits._ or toDF(), and therefore the tests don't run.
I am using SparkSession which automatically creates SparkConf, SparkContext and SQLContext under the hood.
My code simply uses the example code from the Spark repo.
Any ideas why this is not working? Thanks!
NB. I have already looked at the Spark unit test questions on StackOverflow, like this one: How to write unit tests in Spark 2.0+?
I have used this to write the test but I'm still getting the errors.
I'm using Scala 2.11.8, and Spark 2.2.0 with SBT and IntelliJ. These dependencies are correctly included within the SBT build file. The errors on running the tests are:
Error:(29, 10) value toDF is not a member of org.apache.spark.rdd.RDD[dataLoadTest.this.Person]
possible cause: maybe a semicolon is missing before `value toDF'?
.toDF()
Error:(20, 20) stable identifier required, but dataLoadTest.this.spark.implicits found.
import spark.implicits._
IntelliJ won't recognise import spark.implicits._ or the .toDF() method.
I have imported:
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FlatSpec, FunSuite, Matchers}
You need to assign sqlContext to a val for the implicits to work. Since your sparkSession is a var, implicits won't work with it.
So you need to do:
val sQLContext = spark.sqlContext
import sQLContext.implicits._
Moreover, you can write functions for your tests so that your test class looks like the following:
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FunSuite, Matchers}

class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {

  private val master = "local[*]"
  private val appName = "data_load_testing"

  var spark: SparkSession = _

  override def beforeEach() {
    spark = new SparkSession.Builder().appName(appName).master(master).getOrCreate()
  }

  test("Creating dataframe should produce data frame of correct size") {
    val sQLContext = spark.sqlContext
    import sQLContext.implicits._

    val df = spark.sparkContext
      .textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
      .toDF()

    assert(df.count() == 3)
    assert(df.take(1)(0)(0).equals("Michael"))
  }

  override def afterEach() {
    spark.stop()
  }
}

case class Person(name: String, age: Int)
There are many libraries for unit testing Spark; one of the most widely used is
spark-testing-base, by Holden Karau.
This library provides everything through sc as the SparkContext; below is a simple example:
class TestSharedSparkContext extends FunSuite with SharedSparkContext {

  val expectedResult = List(("a", 3), ("b", 2), ("c", 4))

  test("Word counts should be equal to expected") {
    verifyWordCount(Seq("c a a b a c b c c"))
  }

  def verifyWordCount(seq: Seq[String]): Unit = {
    assertResult(expectedResult)(new WordCount().transform(sc.makeRDD(seq)).collect().toList)
  }
}
Here, everything is prepared with sc as the SparkContext.
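
The WordCount class is assumed by the snippet above rather than provided by the library; a minimal hypothetical version, just to make the example self-contained, could be:

import org.apache.spark.rdd.RDD

// Hypothetical WordCount used by the test above: split each line on
// whitespace and count occurrences per word.
class WordCount extends Serializable {
  def transform(rdd: RDD[String]): RDD[(String, Int)] =
    rdd.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
}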
Another approach is to create a test wrapper and reuse it across multiple test cases, as below:
import org.apache.spark.sql.SparkSession

trait TestSparkWrapper {
  lazy val sparkSession: SparkSession =
    SparkSession.builder().master("local").appName("spark test example").getOrCreate()
}
Use this TestSparkWrapper for all the tests with ScalaTest, combined with BeforeAndAfterAll and BeforeAndAfterEach as needed; a wiring sketch follows.
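
A hedged sketch of that wiring (the suite name and test body are invented for illustration):

import org.scalatest.{BeforeAndAfterAll, FunSuite, Matchers}

class MyTransformSpec extends FunSuite with Matchers with BeforeAndAfterAll with TestSparkWrapper {

  test("a trivial DataFrame has the expected row count") {
    import sparkSession.implicits._
    val df = Seq(("Michael", 29)).toDF("name", "age")
    df.count() shouldBe 1
  }

  override def afterAll(): Unit = {
    sparkSession.stop() // release the shared session once the suite finishes
    super.afterAll()
  }
}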
Hope this helps!

Spark unable to find "spark-version-info.properties" when run from ammonite script

I have an ammonite script which creates a spark context:
#!/usr/local/bin/amm

import ammonite.ops._
import $ivy.`org.apache.spark:spark-core_2.11:2.0.1`
import org.apache.spark.{SparkConf, SparkContext}

@main
def main(): Unit = {
  val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("Demo"))
}
When I run this script, it throws an error:
Exception in thread "main" java.lang.ExceptionInInitializerError
Caused by: org.apache.spark.SparkException: Error while locating file spark-version-info.properties
...
Caused by: java.lang.NullPointerException
at java.util.Properties$LineReader.readLine(Properties.java:434)
at java.util.Properties.load0(Properties.java:353)
The script isn't being run from the spark installation directory and doesn't have any knowledge of it or the resources where this version information is packaged - it only knows about the ivy dependencies. So perhaps the issue is that this resource information isn't on the classpath in the ivy dependencies. I have seen other spark "standalone scripts" so I was hoping I could do the same here.
I poked around a bit to try and understand what was happening. I was hoping I could programmatically hack some build information into the system properties at runtime.
The source of the exception comes from package.scala in the Spark library. The relevant bits of code are:
val resourceStream = Thread.currentThread().getContextClassLoader.
  getResourceAsStream("spark-version-info.properties")

try {
  val unknownProp = "<unknown>"
  val props = new Properties()
  props.load(resourceStream) // <-- causing an NPE?
  (
    props.getProperty("version", unknownProp),
    // Load some other properties
  )
} catch {
  case npe: NullPointerException =>
    throw new SparkException("Error while locating file spark-version-info.properties", npe)
It seems that the implicit assumption is that props.load will fail with an NPE if the version information can't be found in the resources. (That's not so clear to the reader!)
The NPE itself looks like it's coming from this code in java.util.Properties:
class LineReader {
    public LineReader(InputStream inStream) {
        this.inStream = inStream;
        inByteBuf = new byte[8192];
    }

    ...

    InputStream inStream;
    Reader reader;

    int readLine() throws IOException {
        ...
        inLimit = (inStream == null) ? reader.read(inCharBuf)
                                     : inStream.read(inByteBuf);
The LineReader is constructed with a null InputStream, which the class internally interprets as meaning that the reader is non-null and should be used instead; but the reader is also null. (Is this kind of thing really in the standard library? It seems very unsafe...)
From looking at the bin/spark-shell script that comes with Spark, it adds -Dscala.usejavacp=true when it launches spark-submit. Is this the right direction?
Thanks for your help!
The following seems to work on Scala 2.11 with version 1.0.1, though not with the experimental one. It could simply be that this is handled better in Spark 2.2:
#!/usr/local/bin/amm

import ammonite.ops._
import $ivy.`org.apache.spark:spark-core_2.11:2.2.0`
import $ivy.`org.apache.spark:spark-sql_2.11:2.2.0`
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.sql.SparkSession

@main
def main(): Unit = {
  val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("Demo"))
}
Or, as a more expanded answer:
@main
def main(): Unit = {
  val spark = SparkSession.builder()
    .appName("testings")
    .master("local")
    .config("configuration key", "configuration value")
    .getOrCreate
  val sqlContext = spark.sqlContext
  val tdf2 = spark.read.option("delimiter", "|").option("header", true).csv("./tst.dat")
  tdf2.show()
}