What's the pyspark directory location in Cloudera? - pyspark

I am not able to find the Python directory in the Cloudera VM.
Can anyone please tell me the default pyspark directory in Cloudera?
Thanks

You can find the Spark home directory under /usr/lib/spark; inside its bin subdirectory you will find pyspark and spark-shell.
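If you want to check that installation from Python on the VM, here is a minimal sketch, assuming the default /usr/lib/spark path above and that the findspark package is installed (both are assumptions about your setup):

import findspark
# Point Python at the Cloudera-installed Spark before importing pyspark.
findspark.init("/usr/lib/spark")

from pyspark import SparkContext
sc = SparkContext(master="local[*]", appName="cloudera-check")
print(sc.version)
sc.stop()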

Related

How to install and use pyspark on mac

I'm taking a machine learning course and am trying to install pyspark to complete some of the class assignments. I downloaded pyspark from this link, unzipped it and put it in my home directory, and added the following lines to my .bash_profile.
export SPARK_PATH=~/spark-3.3.0-bin-hadoop2.6
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
However, when I try to run the command:
pyspark
to start a session, I get the error:
-bash: pyspark: command not found
Can someone tell me what I need to do to get pyspark working on my local machine? Thank you.
You are probably missing the PATH entry. Here are the environment variable changes I did to get pyspark working on my Mac:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-11.0.6.jdk/Contents/Home/
export SPARK_HOME=/opt/spark-3.3.0-bin-hadoop3
export PATH=$JAVA_HOME/bin:$SPARK_HOME:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=8889'
Also ensure that you have Java SE 8+ and Python 3.5+ installed.
Start the master with /opt/spark-3.3.0-bin-hadoop3/sbin/start-master.sh.
Then run pyspark and copy and paste the URL displayed on screen into a web browser.
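To sanity-check the installation outside of Jupyter, here is a minimal sketch you can run with python3 (the app name and sample data are made up for illustration):

from pyspark.sql import SparkSession

# Runs entirely locally; works once SPARK_HOME and PATH are set as above.
spark = SparkSession.builder.master("local[*]").appName("mac-smoke-test").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()
spark.stop()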

pyspark pip installation hive-site.xml

I have installed pyspark with pipenv install pyspark and run pyspark after activating the pipenv shell.
I can open the pyspark terminal and run some Spark code,
but I am trying to figure out how to enable Hive: where do I need to place hive-site.xml (with the MySQL metastore properties)? I cannot see any spark/conf folder in which to place hive-site.xml.
Unfortunately, the existing application relies heavily on the Pipfile, so I have to stick with pipenv install pyspark.
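For reference, one common workaround with a pip/pipenv-installed pyspark is to pass the Hive settings programmatically when building the SparkSession instead of relying on a hive-site.xml on disk; a minimal sketch, where the metastore URI and warehouse path are placeholders rather than values from the question (alternatively, SPARK_CONF_DIR can be pointed at a directory that contains hive-site.xml):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-enabled")
    # Placeholder for a MySQL-backed Hive metastore service; replace with your own.
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
    .enableHiveSupport()
    .getOrCreate()
)
spark.sql("SHOW DATABASES").show()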

<console>:25: error: object databricks is not a member of package com

I work on Zeppelin with Spark and Scala. I want to import the library that contains import com.databricks.spark.xml.
I tried, but I still get the same error in Zeppelin: <console>:25: error: object databricks is not a member of package com.
What have I done so far? I created a note in Zeppelin with this code: %dep
z.load("com.databricks:spark-xml_2.11:jar:0.5.0"). Even with that, the interpreter doesn't work; it seems it doesn't manage to load the library.
Do you have any idea why it doesn't work?
Thanks for your help and have a nice day!
Your problem is very common and not intuitive to solve. I resolved a similar issue (I wanted to load the Postgres JDBC connector on AWS EMR and was working from a Linux terminal). Your issue can be resolved by checking whether you can:
load the jar file manually onto the environment that is hosting Zeppelin.
add the path of the jar file to your CLASSPATH environment variable. I don't know where you host the files that manage your CLASSPATH, but on EMR my file, viewed from the Zeppelin root directory, was here: /usr/lib/zeppelin/conf/zeppelin-env.sh
download the zeppelin interpreter with
$ sudo ./bin/install-interpreter.sh --name "" --artifact
add the interpreter in Zeppelin by going to the Zeppelin Interpreter GUI and adding it to the interpreter group.
Reboot Zeppelin with:
$ sudo stop zeppelin
$ sudo start zeppelin
It's very likely that your configuration will vary slightly, but I hope this helps provide some structure and relevance.
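As a side note, if the end goal is just to use spark-xml, the dependency can also be declared from pyspark rather than through the Zeppelin %dep loader; a minimal Python sketch, with the package coordinates taken from the question (minus the :jar: classifier) and a placeholder input path:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-xml-example")
    # Fetches the artifact from Maven Central at startup, so the driver needs network access.
    .config("spark.jars.packages", "com.databricks:spark-xml_2.11:0.5.0")
    .getOrCreate()
)

# Placeholder input path; spark-xml registers the "xml" data source format.
df = spark.read.format("xml").option("rowTag", "record").load("/tmp/example.xml")
df.printSchema()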

No module named pyspark in mac

I have configured my .bash_profile as below. Please let me know if I'm missing anything here. I'm getting:
No module named pyspark
# added by Anaconda3 5.2.0 installer
export PATH=/Users/pkumar5/anaconda3/bin:$PATH
export JAVA_HOME=/Library/Java/Home
# spark configuration
export SPARK_PATH=~/spark-2.3.2-bin-hadoop2.7
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
alias snotebook='$SPARK_PATH/bin/pyspark --master "local[2]"'
I'm trying to use pyspark in a Jupyter notebook and I'm getting the error "No module named pyspark". Please help me resolve this.
You might have to define the correct $PYTHONPATH (this is where Python looks for modules).
Also check: even if pyspark is installed, it may have been installed for Python 3 while your Jupyter notebook kernel is running Python 2, so switching kernels would solve the issue.
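A quick way to check both points from inside the notebook is a small sketch like this (it assumes the findspark package is installed; the Spark path mirrors the SPARK_PATH value from the question and is an assumption here):

import os
import sys

print(sys.executable)  # which interpreter the notebook kernel is actually using
print(sys.version)     # Python 2 vs Python 3

import findspark
# Point findspark at the unpacked Spark distribution so pyspark becomes importable.
findspark.init(os.path.expanduser("~/spark-2.3.2-bin-hadoop2.7"))

import pyspark
print(pyspark.__version__)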

How to connect/run map reduce on Hadoop installed on Ubuntu

I have successfully installed Hadoop on Ubuntu and it is running well. Now I want to run a sample MapReduce job from Eclipse, connecting to the Hadoop installation I set up.
It would be really great if someone could help sort this out.
Thanks,
Rajesh
You can export the Eclipse MapReduce code as a jar, place it on the local file system, and run the following command:
hadoop jar <jarFileName> [<argumentsToBePassed>]
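For example, with a made-up jar name, main class, and HDFS paths:
hadoop jar WordCount.jar com.example.WordCount /user/rajesh/input /user/rajesh/output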