How to integrate a Watson Studio Python notebook with IAE Spark (pyspark)?

I have an IBM Analytics Engine (IAE) instance, added it to my Watson Studio project as an associated service, and created an Environment based on it.
Then I created a Python notebook and set its environment to the one mentioned above.
I ran a simple pyspark script in the notebook and noticed that it uses a local Spark instance instead of the IAE.
from pyspark.sql import SparkSession
# Get (or create) the SparkSession and display it in the notebook
spark = SparkSession.builder.getOrCreate()
spark
Output:
SparkSession - in-memory
SparkContext
  Spark UI
  Version: v2.4.5
  Master: local[*]
  AppName: pyspark-shell
What am I doing wrong?

You're not doing anything wrong. Your code is sent to the IAE instance for execution, so the local[*] master means it is running locally on your IAE instance, not on your own machine.

Related

Spark closures behavior in local mode in IDEs

I am observing weird behavior of Spark and closures in local mode on my machine vs. a 3-node cluster (Spark 2.4.5).
Following is the piece of code:
object Example {
  val num = 5
  def myfunc = {
    // sc is the SparkContext provided by spark-shell
    sc.parallelize(1 to 4).map(_ + num).foreach(println)
  }
}
I expected this to fail in every case, since the variable num is needed inside the closure, which means the Example object would need to be serialized, but it cannot be because it does not extend the Serializable interface.
When I run this code from spark-shell on my local machine, it fails with the expected serialization error, consistent with the rationale above.
When I run the same code in yarn mode on a 3-node EMR cluster, it fails with the exact same error, for the same reason.
When I run it in local mode on that same cluster (i.e., on the master node), it also fails; the same rationale still holds.
However, when I run it from an sbt project (not a Spark installation or anything; I just added the Spark libraries to my sbt project and used a conf.master(local[..])) in local mode, it runs fine and gives me an output of 6, 7, 8, 9.
So it fails everywhere except when run by adding the Spark dependencies to an sbt project. The question is: what explains the different local-mode behavior when running Spark code by simply adding the Spark libraries to an sbt project?
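For reference (this is not part of the original post, just a minimal sketch of the usual workaround): copying the field into a local val means the closure captures only the value, so the enclosing object no longer needs to be serialized and the job succeeds in every mode.
import org.apache.spark.{SparkConf, SparkContext}

object Example {
  val num = 5

  def myfunc(sc: SparkContext): Unit = {
    // Copy the field into a local val: the closure now captures only an Int,
    // so the Example object itself does not need to be serialized.
    val localNum = num
    sc.parallelize(1 to 4).map(_ + localNum).foreach(println)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("closure-example"))
    myfunc(sc)
    sc.stop()
  }
}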

Is there a config file when installing the Spark dependency with Scala?

I installed Spark with sbt as a project dependency. I then want to change variables of the Spark environment without doing it in my code with a .setMaster(). The problem is that I cannot find any config file on my computer.
This is because I get an error: org.apache.spark.SparkException: Invalid Spark URL: spark://HeartbeatReceiver@my-mbp.domain_not_set.invalid:50487, even after trying to change my hostname. Thus, I would like to dig into the Spark library and try some things.
I tried pretty much everything from this SO post: Invalid Spark URL in local spark session.
Many thanks
What worked for the issue:
export SPARK_LOCAL_HOSTNAME=localhost in the shell profile (e.g. ~/.bash_profile).
sbt was not able to resolve the host even when the export command was run just before launching sbt; I had to put it in the profile to get the right context.
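As an alternative sketch (not from the original answer), roughly the same effect can be achieved programmatically by setting the driver's host and bind address before the session is created; the property names are standard Spark settings, the rest is illustrative:
import org.apache.spark.sql.SparkSession

// Sketch: force the driver to bind to localhost instead of an unresolvable hostname.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("local-hostname-example")
  .config("spark.driver.host", "localhost")
  .config("spark.driver.bindAddress", "127.0.0.1")
  .getOrCreate()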

Unable to create partition on S3 using spark

I would like to use this new functionality: overwrite a specific partition without deleting all the data in S3.
I used the new flag (spark.sql.sources.partitionOverwriteMode="dynamic") and tested it locally from my IDE, and it worked (I was able to overwrite a specific partition in S3). But when I deployed it to HDP 2.6.5 with Spark 2.3.0, the same code did not create the S3 folders as expected: no folder was created at all, only a temp folder.
My code:
df.write
  .mode(SaveMode.Overwrite)
  .partitionBy("day", "hour")
  .option("compression", "gzip")
  .parquet(s3Path)
Have you tried Spark version 2.4? I have worked with this version on both EMR and Glue and it has worked well. To use the "dynamic" mode in version 2.4, just use this code:
dataset.write.mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")
  .partitionBy("dt")
  .parquet("s3://bucket/output")
The AWS documentation specifies Spark version 2.3.2 for using spark.sql.sources.partitionOverwriteMode="dynamic".
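For completeness, the same behavior can also be enabled at the session level rather than per write; a minimal sketch (bucket paths and partition columns are placeholders):
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("dynamic-partition-overwrite")
  // Session-wide setting; a write can still override it with
  // .option("partitionOverwriteMode", "dynamic").
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  .getOrCreate()

val df = spark.read.parquet("s3://bucket/input")
df.write
  .mode(SaveMode.Overwrite)
  .partitionBy("day", "hour")
  .parquet("s3://bucket/output")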

Spark without SPARK_HOME

I am relatively new to Spark and Scala.
I have a Scala application that runs in local mode both on my Windows box and on a CentOS cluster.
As long as Spark is on my classpath (i.e., in pom.xml), Spark runs in unit tests without the need for a SPARK_HOME. But then how do I set Spark properties such as spark.driver.memory?
If I do have an instance of Spark running locally, my unit-test application seems to ignore it when in local mode. I do not see any output on the Spark console suggesting that it is using the Spark instance I started from the command line (via the spark-shell command). Am I mistaken? If not, how do I get my Scala application to use that instance?
EDITED to include useful info from comments as well
spark-shell is just an interactive shell; it stands alone and is not an "instance" that other processes should connect to. When you run your Spark application through spark-submit (or by just running your Spark code), it will start its own instance of Spark. If you need to set any properties, they can be passed in as system properties or through spark-submit --conf parameters.
spark-submit requires that you first use the Maven assembly plugin to build your application jar together with its dependencies.
This jar should then be deployed to the SPARK_HOME directory.
Then use the submit script, which is located in SPARK_HOME.
The spark-submit invocation looks like this:
./bin/spark-submit --class xxx.ml.PipelineStart \
  --master local[*] \
  ./xxx/myApp-1.0-SNAPSHOT-jar-with-dependencies.jar 100
You can set options in your SparkConf. Look at the methods available in the documentation.
There are explicit methods like SparkConf.setMaster to set certain properties. However, if you don't see a method to explicitly set a property, just use SparkConf.set. It takes a key and a value, and the configurable properties are all listed in the Spark configuration documentation.
If you're curious about what a property is set to, you can also use SparkConf.get to check.
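A minimal sketch of what that looks like in a local-mode test setup (the property choices below are just examples, not from the original answer):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")            // explicit setter
  .setAppName("unit-test-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // generic key/value setter
  .set("spark.ui.enabled", "false") // keep local test runs lightweight

val sc = new SparkContext(conf)
println(sc.getConf.get("spark.serializer"))  // read a property back to verify it was applied
sc.stop()
Note that a few properties, spark.driver.memory among them, only take effect if they are set before the driver JVM starts (for example via spark-submit --conf or the JVM options of your test runner), so setting them on a SparkConf inside an already running process has no effect.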

Connect from a Windows machine to Spark

I'm very (very!) new to Spark and Scala. I've been trying to implement what I thought would be the easy task of connecting to a Linux machine that has Spark on it and running a simple piece of code.
When I create a simple Scala code, build a jar from it, place it in the machine and run spark-submit, everything works and I get a result.
(like the "SimpleApp" example here: http://spark.apache.org/docs/latest/quick-start.html)
My question is:
Are all of these steps mandatory? Must I compile, build, and copy the jar to the machine and then manually run it every time I change it?
Assuming the jar is already on the machine, is there a way to run it (calling spark-submit) directly from other code, through my IDE?
Taking it a bit further, let's say I want to run different tasks: do I have to create different jars and place all of them on the machine? Are there any other approaches?
Any help will be appreciated!
Thanks!
There are two modes of running your code: either submitting your job to a server (cluster), or running in local mode, which requires no Spark cluster to be set up. Most people use local mode for building and testing their application on small data sets, and then build and submit the tasks as jobs for production.
Running in Local Mode
val conf = new SparkConf().setMaster("local").setAppName("wordCount Example")
Setting the master to "local" runs Spark embedded alongside your application.
If you have already built your jars, you can reuse them by specifying the Spark master's URL and adding the required jars; this way you can submit the job to a remote cluster.
val conf = new SparkConf()
  .setMaster("spark://cyborg:7077")
  .setAppName("SubmitJobToCluster Example")
  .setJars(Seq("target/spark-example-1.0-SNAPSHOT-driver.jar"))
Using this SparkConf, you can initialize a SparkContext in your application and use it against either a local or a cluster setup.
val sc = new SparkContext(conf)
There is an old project, spark-examples, with sample programs that you can run directly from your IDE.
So, answering your questions:
Are all of these steps mandatory? Must I compile, build, and copy the jar to the machine and then manually run it every time I change it?
No.
Assuming the jar is already on the machine, is there a way to run it (calling spark-submit) directly from other code, through my IDE?
Yes, you can. The example above does it.
Taking it a bit further, let's say I want to run different tasks: do I have to create different jars and place all of them on the machine? Are there any other approaches?
Yes. You just need one jar containing all your tasks and dependencies; you can specify the class when submitting the job to Spark. When doing it programmatically you have complete control over it.
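As an illustration of the one-jar-many-tasks point (the object and task names below are hypothetical, not from the original answer), a single jar can simply contain several main classes, and spark-submit --class selects which one to run:
import org.apache.spark.sql.SparkSession

// Hypothetical: two independent tasks packaged in the same jar.
// spark-submit --class WordCountTask ... or --class StatsTask ... picks one.
object WordCountTask {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("word-count-task").getOrCreate()
    spark.range(10).show()  // placeholder task body
    spark.stop()
  }
}

object StatsTask {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stats-task").getOrCreate()
    spark.range(10).describe().show()  // placeholder task body
    spark.stop()
  }
}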