What is a Spark kernel for Apache Toree? (Scala)

I have a Spark cluster whose master is at 192.168.0.60:7077.
I have been using Jupyter Notebook to write some PySpark scripts.
I now want to move on to Scala, but I don't know the Scala world.
I am trying to use Apache Toree.
I installed it, downloaded the Scala kernel, and ran it to the point of opening a Scala notebook. Up to there everything seems OK :-/
But I can't find the Spark context, and there are errors in the Jupyter server logs:
[I 16:20:35.953 NotebookApp] Kernel started: afb8cb27-c0a2-425c-b8b1-3874329eb6a6
Starting Spark Kernel with SPARK_HOME=/Users/romain/spark
Error: Master must start with yarn, spark, mesos, or local
Run with --help for usage help or --verbose for debug output
[I 16:20:38.956 NotebookApp] KernelRestarter: restarting kernel (1/5)
As I don't know Scala, I am not sure what the issue is here.
It could be:
I need a Spark kernel (according to https://github.com/ibm-et/spark-kernel/wiki/Getting-Started-with-the-Spark-Kernel )
I need to add an option on the server (the error message says 'Master must start with yarn, spark, mesos, or local')
or something else :-/
I just wanted to migrate from Python to Scala, and I have spent a few hours lost just getting the Jupyter IDE to start :-/

It looks like you are using Spark in standalone deploy mode. As Tzach suggested in his comment, the following should work:
SPARK_OPTS='--master=spark://192.168.0.60:7077' jupyter notebook
SPARK_OPTS expects the usual spark-submit parameter list.
If that does not help, you would need to check the SPARK_MASTER_PORT value in conf/spark-env.sh (7077 is the default).

Related

Create Sparkmagic spark session on Ipython kernel

I started working with SparkMagic recently, and I was able to create notebooks on the PySpark kernel, which works fine.
Now, when I use SparkMagic on the IPython kernel, there is a step that has to be done manually (execute %manage_spark, create an endpoint, create a session).
I would like to know if there is a way to do these steps programmatically!

Using Postgresql JDBC source with Apache Spark on EMR

I have an existing EMR cluster running and wish to create a DataFrame from a PostgreSQL database source.
To do this, it seems you need to modify spark-defaults.conf with an updated spark.driver.extraClassPath pointing to the relevant PostgreSQL JAR that has already been downloaded on the master and slave nodes, or you can add these as arguments to a spark-submit job.
Since I want to use an existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?
I tried the following:
Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL JAR (postgresql-9.41207.jre6.jar) to it.
Edited spark-defaults.conf to include the wildcard location:
spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$
Tried to create a DataFrame in a Jupyter cell using the following code:
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
I get a Java error, as below:
Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver
Help appreciated.
I don't think you need to copy the Postgres JAR to the slaves, as the driver program and cluster manager take care of everything. I created a DataFrame from a Postgres external source in the following way:
Download the Postgres driver JAR:
cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
Create the DataFrame:
attribute = {'url': 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}' \
                 .format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
             'database': <db>,
             'dbtable': <select * from table>}
df = spark.read.format('jdbc').options(**attribute).load()
Submit the Spark job:
Add the downloaded JAR to the driver class path when submitting the Spark job.
--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5
Also check the GitHub repo of the driver. The driver class name is org.postgresql.Driver, not com.postgresql.jdbc.Driver; try using that.
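Putting those pieces together, here is a minimal sketch of the corrected read. It reuses the placeholder connection string and table name from the question and assumes the driver JAR is already on the driver/executor classpath as described above; only the driver class name changes.

# Same call as in the question, but with the correct driver class name.
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
df = spark.read.jdbc(
    url=SQL_CONN,
    table="someTable",
    properties={"driver": "org.postgresql.Driver"},
)
df.show(5)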

Syntax error on topology.py when I try to run scala command in spark through Cloudera VM

Every time I try to run the following Scala commands
val dataRDD = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/data/data.txt")
dataRDD.collect().foreach(println)
//or
dataRDD.count()
I get the following exception:
exitCodeException exitCode=1: File "/etc/hadoop/conf.cloudera.yarn/topology.py", line 43 print default_rack^
SyntaxError: Missing parentheses in call to 'print'
I am running Spark 1.6.0 on the Cloudera VM.
Has anyone else faced this issue? What could be the reason? I understand that this is due to the topology.py file, which tries to print without the "(" required in Python 3. But why is this script being executed when I am not running Python/PySpark?
This only happens through the Cloudera VM; when I run outside the VM with some other sample data, the commands work!
I know it might be too late, but I am posting the answer anyway in case any other user faces the same issue.
The above is a known issue, and the workaround is the following:
Workaround: add a YARN Gateway role to each host that does not already have at least one YARN role (of any type). The YARN Gateway role needs to be added on the node/host where you are facing this issue.
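For context on the error itself: line 43 of topology.py uses Python 2 print-statement syntax, which is a SyntaxError under a Python 3 interpreter. A tiny standalone reproduction (the offending line is taken from the traceback above):

# Compiling Python 2 print-statement syntax under Python 3 raises the same
# SyntaxError reported for topology.py.
source = "print default_rack"  # the offending line from topology.py
try:
    compile(source, "topology.py", "exec")
except SyntaxError as err:
    print("SyntaxError:", err.msg)  # e.g. "Missing parentheses in call to 'print' ..."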

Working with jdbc jar in pyspark

I need to read from a PostgreSQL database in PySpark.
I know this has been asked before, such as here, here and many other places; however, the solutions there either use a JAR in the local running directory or copy it to all workers manually.
I downloaded the postgresql-9.4.1208 jar and placed it in /tmp/jars. I then proceeded to call pyspark with the --jars and --driver-class-path switches:
pyspark --master yarn --jars /tmp/jars/postgresql-9.4.1208.jar --driver-class-path /tmp/jars/postgresql-9.4.1208.jar
Inside pyspark I did:
df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name").load()
df.count()
However, while using --jars and --driver-class-path worked fine for JARs I created, it failed for the JDBC driver, and I got an exception from the workers:
java.lang.IllegalStateException: Did not find registered driver with class org.postgresql.Driver
If I copy the JAR manually to all workers and add --conf spark.executor.extraClassPath and --conf spark.driver.extraClassPath, it does work (with the same JAR). The documentation, by the way, suggests using SPARK_CLASSPATH, which is deprecated but effectively adds these two settings (and has the side effect of preventing the addition of OTHER jars with the --jars option, which I need to do).
So my question is: what is special about the JDBC driver that makes it not work, and how can I add it without having to copy it manually to all workers?
Update:
I did some more looking and found this in the documentation:
"The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java’s DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs.".
The problem is I can't seem to find compute_classpath.sh, nor do I understand what the primordial class loader means.
I did find this which basically explains that this needs to be done locally.
I also found this which basically says there is a fix but it is not yet available in version 1.6.1.
I found a solution that works (I don't know if it is the best one, so feel free to keep commenting).
Apparently, if I add the option driver="org.postgresql.Driver", it works properly. That is, my full line (inside pyspark) is:
df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name",driver="org.postgresql.Driver").load()
df.count()
Another thing: if you are already using a fat JAR of your own (I am in my full application), then all you need to do is add the JDBC driver to your POM file like so:
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>9.4.1208</version>
</dependency>
and then you don't have to add the driver as a separate JAR; just use the JAR with dependencies.
What version of the documentation are you looking at?
It seems compute-classpath.sh was dropped a while back; it still exists in Spark 1.3.1 but is gone as of 1.4.0:
$ unzip -l spark-1.3.1.zip | egrep '\.sh' | egrep classpa
6592 2015-04-11 00:04 spark-1.3.1/bin/compute-classpath.sh
$ unzip -l spark-1.4.0.zip | egrep '\.sh' | egrep classpa
produces nothing.
I think you should be using load-spark-env.sh to set your classpath:
$/opt/spark-1.6.0-bin-hadoop2.6/bin/load-spark-env.sh
and you'll need to set SPARK_CLASSPATH in your $SPARK_HOME/conf/spark-env.sh file (which you'll copy over from the template file $SPARK_HOME/conf/spark-env.sh.template).
I think this is another manifestation of the issue described and fixed here: https://github.com/apache/spark/pull/12000. I authored that fix three weeks ago, and there has been no movement on it. Maybe if others also report that they have been affected by it, that will help?

Using profiles with ipython/jupyter

Here is help output from ipython:
Examples
ipython notebook # start the notebook
ipython notebook --profile=sympy # use the sympy profile
ipython notebook --certfile=mycert.pem # use SSL/TLS certificate
Seems straightforward... but then when invoking
$ipython notebook --profile=pyspark
The following warning occurs:
[W 20:54:38.623 NotebookApp] Unrecognized alias: '--profile=pyspark',
it will probably have no effect.
So the online help is inconsistent with the warning message.
What is the correct way to activate the profile?
Update: I tried reversing the order as follows:
$ipython --profile=pyspark notebook
But then a different warning occurs:
[TerminalIPythonApp] WARNING | File not found: u'notebook
The option belongs to the ipython binary, but you are trying to pass it to the notebook application, as is evident from the warning, which comes from NotebookApp:
[W 20:54:38.623 NotebookApp] Unrecognized alias: '--profile=pyspark',
it will probably have no effect.
That's basically saying you are passing an option to notebook which it doesn't recognize, so it won't have any effect.
You need to pass the option to ipython:
ipython --profile=foo -- notebook
The online docs are inaccurate. The order needs to be reversed, with the --profile option before notebook:
$ipython --profile=pyspark notebook
But the issues go beyond even that.
It seems that Jupyter (the newer version of IPython) may not respect ipython profiles at all.
There are multiple references to this. Here is one from the Spark mailing list:
Does anyone have a pointer to Jupyter configuration with pyspark? The
current material on python inotebook is out of date, and jupyter
ignores ipython profiles.
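Since newer Jupyter may ignore ipython profiles, one workaround is to run the profile's startup logic yourself, either in a startup file or in the first notebook cell. Below is a rough sketch of what a typical "pyspark" profile startup script does; the SPARK_HOME default and the py4j zip name are assumptions for illustration, so adjust them to your install.

# Sketch of a typical pyspark profile startup script, e.g.
# ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py
import os
import sys

spark_home = os.environ.get("SPARK_HOME", "/opt/spark-1.6.0-bin-hadoop2.6")
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.9-src.zip"))

# pyspark/shell.py builds the SparkContext (`sc`) and SQLContext, just like
# the interactive pyspark shell does.
exec(open(os.path.join(spark_home, "python", "pyspark", "shell.py")).read())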