I'm new to PySpark and I want to connect to a remote Hadoop cluster (CDP) from a Linux server using the spark-submit command.
Any help would be appreciated.
I need a spark-submit command to connect to the remote CDP cluster.
You can use Apache Livy to submit remote jobs to a CDP cluster. Here is detailed info on how to install and use Livy to submit jobs:
After downloading and unzipping Livy, add the following lines to the livy.conf file, then start the Livy service.
livy.spark.master = yarn
livy.spark.deploy-mode = cluster
You can find examples of how to create a spark-submit script at the following links:
https://community.cloudera.com/t5/Community-Articles/Submit-a-Spark-Job-to-CDP-Data-Hub-using-the-Livy-REST-API/ta-p/322481
https://livy.apache.org/examples/
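For illustration, here is a minimal sketch of submitting a PySpark batch job through Livy's REST API from Python. The host name and HDFS path are placeholders, and it assumes Livy is listening on its default port 8998 with your script already uploaded to HDFS:

import json
import requests

# Placeholder Livy endpoint and application path -- adjust for your cluster
livy_url = "http://livy-host.example.com:8998/batches"
payload = {
    "file": "hdfs:///user/myuser/my_job.py",  # PySpark script already on HDFS
    "name": "my-remote-job",
}

# Submit the batch; Livy returns the batch id and state as JSON
response = requests.post(livy_url, data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
print(response.json())

You can then poll GET /batches/<id> on the same endpoint to follow the job's state.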
I am aware of "Change Apache Livy's Python Version" and "How do i setup Pyspark in Python 3 with spark-env.sh.template".
I have also seen the Livy documentation.
However, none of that works. Livy keeps using Python 2.7 no matter what.
This is running Livy 0.6.0 on an EMR cluster.
I have changed the PYSPARK_PYTHON environment variable to /usr/bin/python3 for the hadoop user, my user, root, and ec2-user. Logging into the EMR master node via ssh and running pyspark starts python3 as expected. But Livy keeps using python2.7.
I added export PYSPARK_PYTHON=/usr/bin/python3 to the /etc/spark/conf/spark-env.sh file. Livy keeps using python2.7.
I added "spark.yarn.appMasterEnv.PYSPARK_PYTHON":"/usr/bin/python3" and "spark.executorEnv.PYSPARK_PYTHON":"/usr/bin/python3" to the items listed below and in every case . Livy keeps using python2.7.
sparkmagic config.json and config_other_settings.json files before starting a PySpark kernel Jupyter
Session Properties in the sparkmagic %manage_spark Jupyter widget. Livy keeps using python2.7.
%%spark config cell-magic before the line-magic %spark add --session test --url http://X.X.X.X:8998 --auth None --language python
Note: This works without any issues on another EMR cluster running Livy 0.7.0. I have gone over all of the settings on the other cluster and cannot find what is different. I did not have to do any of this on the other cluster; Livy just used python3 by default.
How exactly do I get Livy to use python3 instead of python2?
I finally found an answer, just after posting.
I ran the following in a PySpark-kernel Jupyter session cell, before running any code, to start the PySpark session on the remote EMR cluster via Livy.
%%configure -f
{
    "conf": {
        "spark.pyspark.python": "python3"
    }
}
Simply adding "spark.pyspark.python": "python3" to the .sparkmagic config.json or config_other_settings.json also worked.
It is confusing that this does not match the official Livy documentation.
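For reference, a sketch of where that setting can live in the sparkmagic config.json (assuming the standard sparkmagic layout, where per-session Spark configuration goes under session_configs; the rest of the file stays unchanged):

"session_configs": {
    "conf": {
        "spark.pyspark.python": "python3"
    }
}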
My PySpark cluster is installed on one server, say 10.45.25.30, and I want to run my PySpark code on another server, say 10.45.32.67, by connecting to the PySpark cluster installed on 10.45.25.30.
How can I connect, from my current server, to the PySpark cluster installed on the other server?
Here are installed kernels:
$ jupyter-kernelspec list
Available kernels:
apache_toree_scala /usr/local/share/jupyter/kernels/apache_toree_scala
apache_toree_sql /usr/local/share/jupyter/kernels/apache_toree_sql
pyspark3kernel /usr/local/share/jupyter/kernels/pyspark3kernel
pysparkkernel /usr/local/share/jupyter/kernels/pysparkkernel
python3 /usr/local/share/jupyter/kernels/python3
sparkkernel /usr/local/share/jupyter/kernels/sparkkernel
sparkrkernel /usr/local/share/jupyter/kernels/sparkrkernel
A new notebook was created, but it fails with:
The code failed because of a fatal error:
Error sending http request and maximum retry encountered..
There is no [error] message in the Jupyter console.
If you use sparkmagic to connect your Jupyter notebook, you also need to start Livy, which is the REST API service sparkmagic uses to talk to your Spark cluster.
Download Livy from Apache Livy and unzip it.
Check that the SPARK_HOME environment variable is set; if not, set it to your Spark installation directory.
Run the Livy server with <livy_home>/bin/livy-server from the shell/command line.
Now go back to your notebook; you should be able to run Spark code in a cell.
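To verify that Livy is actually reachable before retrying the notebook, a quick sketch (assuming Livy runs on localhost with its default port 8998):

import requests

# List current Livy sessions; an empty list is fine -- it just shows Livy is up
response = requests.get("http://localhost:8998/sessions")
print(response.status_code, response.json())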
I'm trying to configure Spark in my local IDE and my local Conda Jupyter environment to use our corporate Spark/Hive connection, whose specs look something like this:
host: mycompany.com
port: 10003
I tried to configure spark-defaults.conf:
spark.master spark://mycompany.com:10003
And when I try to call the Spark context, sc, I get the following error in Jupyter:
Exception: Java gateway process exited before sending the driver its port number
Does anyone know of any good documentation that I can use to configure my local instance of Jupyter and/or NetBeans to use Spark with Scala or Python?
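For reference, what I am effectively trying to do from Python is something like the following sketch (the host and port are the placeholders above, and this assumes that endpoint really is a Spark standalone master):

from pyspark.sql import SparkSession

# Placeholder master URL -- same host/port as in spark-defaults.conf above
spark = (SparkSession.builder
         .master("spark://mycompany.com:10003")
         .appName("local-jupyter-test")
         .getOrCreate())
print(spark.version)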
The situation is as follows:
I'm doing this on Windows 7, with MIT Kerberos client kfw 4.0.1. I'm connecting to a YARN cluster, via OpenVPN, that is secured with Kerberos 5. This cluster has been around for a while and it's been in use by other people, so the error is not likely to be on that side of things.
I can get a ticket via kinit (it returns without error). However, once I run any of the following commands:
hdfs dfs -ls
spark-shell --master yarn
spark-submit anything --master yarn --deploy-mode cluster
essentially any spark or hadoop command on the cluster
I get the error: Can't get Kerberos realm (or Unable to locate Kerberos realm).
My krb5.ini file is in C:\ProgramData\MIT\Kerberos5
How can I further troubleshoot this?
Your JVM cannot locate the krb5.conf file. You have several options:
set the JVM property -Djava.security.krb5.conf=/path/to/krb5.conf
or put the krb5.conf file into the <jdk-home>/jre/lib/security folder
or put the file (named krb5.ini on Windows) into the c:\winnt\ folder
More information about locating the krb5.conf file can be found here: https://docs.oracle.com/javase/7/docs/technotes/guides/security/jgss/tutorials/KerberosReq.html
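For example, one way to apply the first option on Windows is to set the property for the Hadoop and Spark client JVMs before running the commands. This is only a sketch: it uses the krb5.ini path from the question and assumes your hdfs and spark launcher scripts pick up HADOOP_OPTS and SPARK_SUBMIT_OPTS as usual:

rem Point both client JVMs at the MIT Kerberos config before running any command
set HADOOP_OPTS=-Djava.security.krb5.conf=C:\ProgramData\MIT\Kerberos5\krb5.ini
set SPARK_SUBMIT_OPTS=-Djava.security.krb5.conf=C:\ProgramData\MIT\Kerberos5\krb5.ini
hdfs dfs -ls
spark-shell --master yarn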