I am using the PySpark kernel inside JupyterHub and want to connect to Hive LLAP from Spark. I am able to create a Spark session, but when I try to execute
from pyspark_llap import HiveWarehouseSession it shows the error: no module found pyspark_llap.
The same import works in the plain Python kernel, where it executes successfully.
Kindly suggest what configuration is needed to import HiveWarehouseSession from pyspark_llap inside the PySpark kernel.
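One thing worth checking first is whether the PySpark kernel runs the same Python interpreter as the plain Python kernel; if it doesn't, the package has to be installed into the kernel's own interpreter. A minimal diagnostic sketch (the pip command in the comment is an assumption about where pyspark_llap should be installed):

```python
import sys

# Run this in BOTH kernels and compare: a module that imports in one
# kernel but not the other usually means two different interpreters
# or two different search paths.
print(sys.executable)   # which Python binary this kernel runs
print(sys.path)         # where it looks for modules

# If the PySpark kernel's interpreter is missing the package, install it
# into that exact interpreter from a notebook cell, e.g.:
#   !{sys.executable} -m pip install pyspark_llap
```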
I started working with SparkMagic recently and was able to create notebooks on the PySpark kernel, which works fine.
Now when I use SparkMagic on the IPython kernel, there is a step which has to be done manually (execute %manage_spark, create an endpoint, create a session).
I would like to know if there is a way to do these steps programmatically!
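Sparkmagic is essentially a client for the Livy REST API, so one way to script those manual steps is to talk to Livy directly: "create endpoint" amounts to knowing the Livy URL, and "create session" is a POST to /sessions. A sketch under that assumption (the URL is a placeholder; actually sending the request needs the requests package and a reachable Livy server):

```python
import json

LIVY_URL = "http://livy-server:8998"  # placeholder endpoint -- use your own

# Equivalent of "create session" in %manage_spark: POST /sessions
# with the session kind you want (pyspark, spark, sparkr, sql).
payload = {"kind": "pyspark"}
headers = {"Content-Type": "application/json"}
body = json.dumps(payload)
print(body)

# To actually create the session:
#   import requests
#   r = requests.post(LIVY_URL + "/sessions", data=body, headers=headers)
#   print(r.json())  # contains the session id; poll it until state is "idle"
```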
I have performed some queries using PySpark in a Jupyter notebook, and now I want the outputs to be visualized in Power BI/Tableau.
How do I do it? I'm completely new to PySpark and need help!
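Power BI and Tableau cannot read a live Spark DataFrame out of a notebook; the usual pattern is to materialize the query result somewhere both tools can read, e.g. collect a small result with toPandas() and export it as CSV, or write it to a database/Hive table and connect the BI tool there. A minimal sketch of the CSV route (the rows below stand in for your query output; in the notebook you would get them from spark.sql(...).toPandas()):

```python
import csv

# Stand-in for a small PySpark query result, e.g.
#   result = spark.sql("SELECT country, count(*) AS total ...").toPandas()
#   result.to_csv("query_output.csv", index=False)
rows = [{"country": "DE", "total": 42}, {"country": "FR", "total": 17}]

# Write a CSV that Power BI / Tableau can open directly.
with open("query_output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["country", "total"])
    writer.writeheader()
    writer.writerows(rows)
```

Only do this for results small enough to collect to the driver; for large outputs, write Parquet to S3 or load a database table instead and point the BI tool at that.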
I have an existing EMR cluster running and wish to create a DataFrame from a PostgreSQL DB source.
To do this, it seems you need to modify spark-defaults.conf with an updated spark.driver.extraClassPath pointing to the relevant PostgreSQL JAR that has already been downloaded on the master and slave nodes, or add these as arguments to a spark-submit job.
Since I want to use the existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?
I tried the following:
Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL jar to it (postgresql-9.41207.jre6.jar).
Edited spark-defaults.conf to include the wildcard location:
spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$
Tried to create the dataframe in a Jupyter cell using the following code:
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
I get a Java error as per below:
Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver
Help appreciated.
I think you don't need to copy the Postgres jar to the slaves, as the driver program and cluster manager take care of everything. I created a dataframe from a Postgres external source in the following way:
Download postgres driver jar:
cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
Create the dataframe:
attribute = {'url' : 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}' \
                 .format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
             'driver' : 'org.postgresql.Driver',
             'dbtable' : '(select * from <table>) as t'}  # or just a table name
df = spark.read.format('jdbc').options(**attribute).load()
Submit the Spark job:
Add the downloaded jar to the driver class path when submitting the Spark job.
--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5
Check the GitHub repo of the driver. The driver class appears to be org.postgresql.Driver, not the com.postgresql.jdbc.Driver used in your code. Try using that instead.
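Putting the answer together, a corrected read for the original question might look like the sketch below (the connection details are the question's placeholders; the two fixes are the driver class name and, for a subquery, wrapping it in parentheses with an alias):

```python
# Placeholder connection details from the question -- substitute your own.
options = {
    "url": "jdbc:postgresql://some_postgresql_db:5432/dbname",
    "user": "user",
    "password": "password",
    "driver": "org.postgresql.Driver",            # not com.postgresql.jdbc.Driver
    "dbtable": "(select * from someTable) as t",  # or just "someTable"
}
print(options["driver"])

# With a SparkSession and the postgres jar on the driver class path:
#   df = spark.read.format("jdbc").options(**options).load()
```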
I run the following code in a Jupyter notebook, but get an ImportError. Note that 'udf' can be imported in Jupyter.
from pyspark.sql.functions import pandas_udf
ImportError Traceback (most recent call
last) in ()
----> 1 from pyspark.sql.functions import pandas_udf
ImportError: cannot import name 'pandas_udf'
Anyone knows how to fix it? Thank you very much!
It looks like you started Jupyter notebook by itself, rather than starting PySpark with the Jupyter notebook, which is done with the following command:
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
If your Jupyter notebook server process is running on another machine, you may want to use this command to make it available on all IP addresses of your server.
(NOTE: This could be a potential security issue if your server is on a public or untrusted network)
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=0.0.0.0 " pyspark
I will revise my answer if the problem still persists after you start Jupyter notebook like that.
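One more thing worth ruling out: pandas_udf only exists in Spark 2.3.0 and later, so this ImportError can also mean the kernel picked up an older pyspark than you expect. A small sketch for checking the version string (the helper name is mine):

```python
def supports_pandas_udf(version: str) -> bool:
    """Return True if a pyspark version string is 2.3 or newer,
    the release where pandas_udf was introduced."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (2, 3)

# In the notebook you would check the actual install:
#   import pyspark
#   print(pyspark.__version__, supports_pandas_udf(pyspark.__version__))
print(supports_pandas_udf("2.2.0"))  # False
print(supports_pandas_udf("2.3.1"))  # True
```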
I have a Spark cluster whose master is on 192.168.0.60:7077.
I used to use Jupyter notebook to write some PySpark scripts.
I am now willing to move on to Scala.
I don't know Scala's world.
I am trying to use Apache Toree.
I installed it, downloaded the Scala kernel, and ran it to the point of opening a Scala notebook. Up to there everything seems OK :-/
But I can't find the Spark context, and there are errors in Jupyter's server logs:
[I 16:20:35.953 NotebookApp] Kernel started: afb8cb27-c0a2-425c-b8b1-3874329eb6a6
Starting Spark Kernel with SPARK_HOME=/Users/romain/spark
Error: Master must start with yarn, spark, mesos, or local
Run with --help for usage help or --verbose for debug output
[I 16:20:38.956 NotebookApp] KernelRestarter: restarting kernel (1/5)
As I don't know Scala, I am not sure what the issue is here. It could be:
I need a Spark kernel (according to https://github.com/ibm-et/spark-kernel/wiki/Getting-Started-with-the-Spark-Kernel)
I need to add an option on the server (the error message says 'Master must start with yarn, spark, mesos, or local')
or something else :-/
I just wanted to migrate from Python to Scala, and I have spent a few hours lost just on starting up the Jupyter IDE :-/
It looks like you are using Spark in standalone deploy mode. As Tzach suggested in his comment, the following should work:
SPARK_OPTS='--master=spark://192.168.0.60:7077' jupyter notebook
SPARK_OPTS expects the usual spark-submit parameter list.
If that does not help, you will need to check the SPARK_MASTER_PORT value in conf/spark-env.sh (7077 is the default).
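If setting SPARK_OPTS in the shell fixes it, you can make the setting permanent by adding it to the env section of the Toree kernel spec instead of exporting it each time. A sketch of patching such a kernel.json (the path varies by install, often under ~/.local/share/jupyter/kernels/ or /usr/local/share/jupyter/kernels/; the helper function is mine):

```python
import json

def add_spark_opts(spec: dict, master: str) -> dict:
    """Set SPARK_OPTS in a Jupyter kernel spec's environment block."""
    env = spec.setdefault("env", {})
    env["SPARK_OPTS"] = "--master=" + master
    return spec

# A kernel spec shaped like the one Toree installs (contents vary).
spec = {
    "display_name": "Apache Toree - Scala",
    "env": {"SPARK_HOME": "/Users/romain/spark"},
}
patched = add_spark_opts(spec, "spark://192.168.0.60:7077")
print(json.dumps(patched, indent=2))
```

In a real setup you would json.load the kernel.json file, patch it, json.dump it back, and restart the kernel.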