How to suppress INFO Spark logs? - scala

I am experimenting with Apache Spark 3 in IntelliJ by creating a simple standalone Scala application. When I run my program I get lots of INFO logs. Based on various SO answers I tried all of the following:
spark.sparkContext.setLogLevel("ERROR")
SparkSession.builder.getOrCreate().sparkContext.setLogLevel("ERROR")
Logger.getRootLogger().setLevel(Level.ERROR)
Logger.getLogger(classOf[RackResolver]).getLevel
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
But none of them works. I still see tons of INFO logs.
So what's the working way of achieving this?
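Which of these calls actually takes effect depends on the logging backend bundled with your Spark version: Spark 3.0-3.2 ships Log4j 1.x (so a log4j.properties in src/main/resources with log4j.rootCategory=ERROR, console is the usual fix), while Spark 3.3+ ships Log4j 2, where the old Log4j 1 setLevel calls frequently don't take effect and a log4j2.properties on the classpath, or the programmatic configurator, is needed instead. A rough sketch, assuming Spark 3.3+ with Log4j 2 on the classpath (the object and app names here are made up):

import org.apache.logging.log4j.Level
import org.apache.logging.log4j.core.config.Configurator
import org.apache.spark.sql.SparkSession

object QuietSparkApp {
  def main(args: Array[String]): Unit = {
    // Quiet Log4j 2 before any Spark class gets a chance to log
    Configurator.setRootLevel(Level.ERROR)
    Configurator.setLevel("org.apache.spark", Level.ERROR)

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("QuietSparkApp")
      .getOrCreate()

    // Covers logging routed through the SparkContext from here on
    spark.sparkContext.setLogLevel("ERROR")

    spark.range(10).count()
    spark.stop()
  }
}

Note that setLogLevel can only take effect once the context exists, which is why the startup INFO lines still appear when it is the only measure used.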

Related

Spark streaming from Kafka on the Spark Operator (Kubernetes)

I have a Spark Structured Streaming job in Scala, reading from Kafka and writing to S3 as Hudi tables. Now I am trying to move this job to the Spark Operator on EKS.
I give the following option in the YAML file:
spark.jars.packages: org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.1
But I still get this error on both the driver and the executor:
java.lang.ClassNotFoundException: org.apache.spark.sql.kafka010.KafkaBatchInputPartition
How do I add the package org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 so that it works?
Edit: It seems this is a known issue, fixed only in the not-yet-released Spark 3.4. Based on the suggestions here and here, I had to bake all the jars (spark-sql-kafka-0-10_2.12-3.1.2 with its dependencies, plus the Hudi bundle jar) into the Spark image. Then it worked.

Eclipse jobs not in Spark UI History

I really like the DAG visualisation on the Spark UI (http://localhost:4040/jobs)
I am running local Spark, and when I run a job through the Eclipse IDE for Scala these jobs are not logged in the Spark UI. Any ideas how I can get these to show?
So it looks like Eclipse is not reading [SPARK_HOME]\conf\spark-defaults.conf. There are ideas on how to point to this file, but none have worked for me, so I have set the conf properties in the Scala code instead. Be careful of trailing spaces:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("WordCount")
  .setMaster("local[*]")
  // directory the History Server reads completed applications from
  .set("spark.history.fs.logDirectory", "file:///c:/tmp/spark-events")
  // where this application writes its event log
  .set("spark.eventLog.dir", "file:///c:/tmp/spark-events")
  .set("spark.eventLog.enabled", "true")

Connect from a Windows machine to Spark

I'm very (very!) new to Spark and Scala. I've been trying to implement what I thought to be the easy task of connecting to a Linux machine that has Spark on it, and running some simple code.
When I write a simple piece of Scala code, build a jar from it, place it on the machine and run spark-submit, everything works and I get a result.
(like the "SimpleApp" example here: http://spark.apache.org/docs/latest/quick-start.html)
My question is:
Are all of these steps mandatory? Must I compile, build and copy the jar to the machine and then manually run it every time I change it?
Assuming the jar is already on the machine, is there a way to run it (calling spark-submit) directly from different code through my IDE?
Taking it a bit further, let's say I want to run different tasks: do I have to create different jars and place all of them on the machine? Are there any other approaches?
Any help will be appreciated!
Thanks!
There are two modes of running your code: submitting the job to a cluster, or running in local mode, which requires no Spark cluster to be set up. Most people use local mode for building and testing their application on small data sets, and then build and submit the tasks as jobs for production.
Running in Local Mode
val conf = new SparkConf().setMaster("local").setAppName("wordCount Example")
Setting the master to "local" runs Spark embedded in the same JVM as your application.
If you have already built your jars, you can reuse them: by specifying the Spark master's URL and adding the required jars, you can submit the job to a remote cluster.
val conf = new SparkConf()
.setMaster("spark://cyborg:7077")
.setAppName("SubmitJobToCluster Example")
.setJars(Seq("target/spark-example-1.0-SNAPSHOT-driver.jar"))
Using this SparkConf you can initialize the SparkContext in your application and use it in either a local or cluster setup.
val sc = new SparkContext(conf)
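Putting those pieces together, here is a rough sketch of a driver you could run straight from the IDE against a remote cluster (the master URL and jar path are just the example values from above; the object name and the toy job are made up):

import org.apache.spark.{SparkConf, SparkContext}

object SubmitJobToCluster {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://cyborg:7077") // example master URL from above
      .setAppName("SubmitJobToCluster Example")
      .setJars(Seq("target/spark-example-1.0-SNAPSHOT-driver.jar")) // jar with your classes, already built
    val sc = new SparkContext(conf)

    // trivial job just to prove the round trip to the cluster works
    val sum = sc.parallelize(1 to 100).reduce(_ + _)
    println(s"sum = $sum")

    sc.stop()
  }
}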
This is an old project, spark-examples, with sample programs that you can run directly from your IDE.
So, answering your questions:
Are all of these steps mandatory? Must I compile, build and copy the jar to the machine and then manually run it every time I change it?
NO
Assuming the jar is already on the machine, is there a way to run it (calling spark-submit) directly from different code through my IDE?
Yes, you can. The above example does exactly that.
Taking it a bit further, let's say I want to run different tasks: do I have to create different jars and place all of them on the machine? Are there any other approaches?
Yes, there are other approaches: you just need one jar containing all your tasks and dependencies, and you can specify the class while submitting the job to Spark. When doing it programmatically you have complete control over it.

How to install Scala libraries in HDP (Hortonworks Data Platform)

Thanks in advance for the time you spend reading this, and sorry for my bad English.
I am trying to use Spark Streaming for real-time data processing. I have Spark installed in HDP (Hortonworks Data Platform), and for my process I need to install a Scala library for JSON parsing. I have read a lot about this on the internet, but it was all for a plain Spark cluster rather than a distribution like HDP or CDH. I tried to adapt those solutions but couldn't; I can't find any Scala files to install the library into. Does anybody know a solution or have any tips to help me?
Thank you
To load dependencies for Spark in Zeppelin you need to create a new cell and use the following:
%dep
// it's a good idea to do a reset first, but not required
z.reset()
// the following line will load directly from the Maven online repo
z.load("org.apache.spark:spark-streaming-karka_2.10:1.6.1")
Additional details on loading dependencies for Zeppelin can be found here:
https://zeppelin.apache.org/docs/latest/interpreter/spark.html#3-dynamic-dependency-loading-via-dep-interpreter
One thing to note here is that the dependency-loading cell must be the first cell you run in your notebook; it will give you an error message if it's not. To get around this, click on the Interpreter tab and click restart on the Spark interpreter, then go back to your notebook and run the cell with the %dep.
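As a rough illustration of using the loaded artifact in a later %spark paragraph (the ZooKeeper host, consumer group and topic below are made-up placeholders, and this assumes the Spark 1.6 streaming API matching the artifact version above):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10)) // sc is provided by Zeppelin
val stream = KafkaUtils.createStream(
  ssc,
  "zk-host:2181",      // ZooKeeper quorum (placeholder)
  "zeppelin-consumer", // consumer group id (placeholder)
  Map("events" -> 1))  // topic -> number of receiver threads (placeholder)

// each record is (key, message); parse the message with whatever JSON library you loaded
stream.map(_._2).print()
ssc.start()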

How to build and run Scala Spark locally

I'm attempting to build Apache Spark locally. The reason for this is to debug Spark methods like reduce. In particular I'm interested in how Spark implements and distributes MapReduce under the covers, as I'm experiencing performance issues and I think running these tasks from source is the best method of finding out what the issue is.
So I have cloned the latest from the Spark repo:
git clone https://github.com/apache/spark.git
Spark appears to be a Maven project, so when I create it in Eclipse here is the structure:
Some of the top-level folders also have pom files:
So should I just be building one of these sub-projects? Are these the correct steps for running Spark against a local code base?
Building Spark locally, the short answer:
git clone git@github.com:apache/spark.git
cd spark
sbt/sbt compile
Going into your question in more detail, what you're actually asking is 'How do I debug a Spark application in Eclipse?'.
To have debugging in Eclipse, you don't really need to build Spark in Eclipse. All you need is to create a project with the Spark library as a dependency and ask Maven to 'download sources'. That way you can use the Eclipse debugger to step into the Spark code.
Then, when creating the SparkContext, set local[1] as the master, like:
val conf = new SparkConf()
.setMaster("local[1]")
.setAppName("SparkDebugExample")
so that all Spark interactions are executed in local mode in one thread and therefore visible to your debugger.
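For example, a small app like the following (the numbers and names are arbitrary) is enough to step into Spark's RDD.reduce once the sources are attached:

import org.apache.spark.{SparkConf, SparkContext}

object SparkDebugExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[1]") // single thread, so the debugger follows one execution path
      .setAppName("SparkDebugExample")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 1000, numSlices = 4)
    val total = rdd.reduce(_ + _) // set a breakpoint here, then step into RDD.reduce
    println(s"total = $total")

    sc.stop()
  }
}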
If you are investigating a performance issue, remember that Spark is a distributed system, where network plays an important role. Debugging the system locally will only give you part of the answer. Monitoring the job in the actual cluster will be required in order to have a complete picture of the performance characteristics of your job.