Error when running pyspark - pyspark

I tried to run PySpark from the terminal. From my terminal, I run snotebook and it automatically loads Jupyter. After that, when I select Python 3, the following error comes up in the terminal.
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file
/Users/simon/spark-1.6.0-bin-hadoop2.6/python/pyspark/shell.py
Here's my .bash_profile setting:
export PATH="/Users/simon/anaconda/bin:$PATH"
export SPARK_HOME=~/spark-1.6.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_HOME/bin/pyspark'
Please let me know if you have any ideas, thanks.

You need to add one of the lines below to your configuration:
PYSPARK_DRIVER_PYTHON=ipython
or
PYSPARK_DRIVER_PYTHON=ipython3
Hope it helps.
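If it is not obvious which interpreter the notebook actually picked up, a minimal check (assuming nothing beyond the standard library and the variable names from the .bash_profile above) is to print them from a notebook cell:
import os
import sys

# Interpreter running the notebook kernel
print(sys.executable)
# What PySpark will use for the driver and for the workers, if these are set at all
print(os.environ.get("PYSPARK_DRIVER_PYTHON"))
print(os.environ.get("PYSPARK_PYTHON"))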

In my case, I was using a virtual environment and forgot to install Jupyter, so it was using some version that it found in the $PATH. Installing it inside the environment fixed this issue.
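A quick way to see which Jupyter the shell would resolve, and whether it lives inside the active environment, is a small sketch like this (standard library only; nothing here is specific to PySpark):
import shutil
import sys

# The jupyter executable that $PATH resolves to
print(shutil.which("jupyter"))
# The root of the Python environment currently running; the two should agree
print(sys.prefix)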

Spark now includes PySpark as part of the install, so remove the separate PySpark library unless you really need it.
Remove the old Spark and install the latest version.
Install the findspark library with pip.
In Jupyter, import and use findspark:
import findspark
findspark.init()
Quick PySpark / Python 3 Check
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext()
print(sc)
sc.stop()

Related

Do I need to import Spark again when I restart working on Google Colab?

I've been using Google Colab to practice PySpark. Do I need to re-install PySpark, findspark and all other files before I start using queries?
Or is there any shortcut that I should be aware of?
cmd 1
!wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
cmd 2
!tar -xvzf spark-3.3.1-bin-hadoop3.tgz
cmd 3
!ls /content/spark-3.3.1-bin-hadoop3
!pip install findspark
cmd 4
import os
os.environ["SPARK_HOME"] = "/content/spark-3.3.1-bin-hadoop3"
import findspark
findspark.init()
cmd 5
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySpark 3.3 on Google Colab").getOrCreate()
Is there any way I can save time instead of copy-pasting all of these setup steps every session, for the sake of learning faster?
What does resetting the stored data mean (on the runtime)?
Do you have any productivity tips for using Google Colab?
How can I make a PySpark cluster just like in Databricks?
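One way to save some of the copy-pasting is to fold the five cells above into a single setup cell that a fresh runtime can run in one go. The sketch below simply restates the commands from the question in Python (same Spark 3.3.1 build, same /content paths) and skips the download when the folder already exists; it is an illustration, not an official Colab recipe:
import os
import subprocess
import sys

SPARK_DIR = "/content/spark-3.3.1-bin-hadoop3"
SPARK_TGZ = "/content/spark-3.3.1-bin-hadoop3.tgz"
SPARK_URL = "https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz"

# Download and unpack Spark only when the runtime does not already have it
if not os.path.isdir(SPARK_DIR):
    subprocess.run(["wget", "-q", "-O", SPARK_TGZ, SPARK_URL], check=True)
    subprocess.run(["tar", "-xzf", SPARK_TGZ, "-C", "/content"], check=True)

# findspark is not preinstalled on a fresh runtime
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "findspark"], check=True)

os.environ["SPARK_HOME"] = SPARK_DIR

import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySpark 3.3 on Google Colab").getOrCreate()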

AttributeError: module 'torchtext' has no attribute 'legacy'

I am trying to use torchtext to process test data; however, I get the error "AttributeError: module 'torchtext' has no attribute 'legacy'" when I run the following code. Can anyone please guide me on what the issue is here? I am using Python 3.10.4. Thanks.
import pandas as pd
import torch
import torchtext
import spacy

def prep_data(file_path):
    TEXT = torchtext.legacy.data.Field(tokenize='spacy', tokenizer_language='en_core_web_sm')
    LABEL = torchtext.legacy.data.LabelField(dtype=torch.long)
    fields = [('clean_text', TEXT), ('label', LABEL)]
    dataset = torchtext.legacy.data.TabularDataset(
        path=file_path, format='csv',
        skip_header=True, fields=fields)
    print(dataset.examples[0])

if __name__ == "__main__":
    train_path = './data/train.csv'
    test_path = './data/test.csv'
    prep_data(train_path)
I addressed the same issue by changing the torchtext version, pinning it to an older release that still ships the legacy module:
pip install torchtext==0.9
I also had the same issue. I solved my problem by using a stable PyTorch version. You are probably using torchtext 0.10 or 0.11; these were the versions that still used legacy. Please update to the latest versions, 0.13 or 0.14:
pip install torchtext==<version>
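Note that if you move to torchtext 0.13 or 0.14, the torchtext.legacy module is gone entirely, so the Field/TabularDataset code in the question has to be rewritten rather than just reinstalled. A rough sketch of equivalent preprocessing with the current API could look like the following (the clean_text and label column names come from the question; the basic_english tokenizer and the rest of the pipeline are assumptions):
import pandas as pd
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

def prep_data(file_path):
    # Read the CSV directly instead of using the removed legacy TabularDataset
    df = pd.read_csv(file_path)
    # basic_english is a built-in tokenizer; a spaCy tokenizer can be swapped in instead
    tokenizer = get_tokenizer("basic_english")
    tokens = (tokenizer(text) for text in df["clean_text"])
    # Build the vocabulary that legacy Field.build_vocab used to handle
    vocab = build_vocab_from_iterator(tokens, specials=["<unk>"])
    vocab.set_default_index(vocab["<unk>"])
    return df, vocab

if __name__ == "__main__":
    df, vocab = prep_data("./data/train.csv")
    print(df.iloc[0], len(vocab))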

PySpark on Linux with pycharm - first exception error

I am trying to run my first PySpark script on a Linux VM I configured. The error message I have is KeyError: SPARK_HOME when I run the following:
from os import environ
from pyspark import SparkContext
I temporarily made this error go away by running export SPARK_HOME=~/spark-2.4.3-bin-hadoop2.7. I then ran into a new error, error=2, No such file or directory. Searching took me to this page: https://community.cloudera.com/t5/Community-Articles/Tutorial-Install-Configure-iPython-and-create-run-PySpark/ta-p/246400. I then ran export PYSPARK_PYTHON=~/python3*. This brought me back to the KeyError: SPARK_HOME error.
Honestly, I'm stumbling through this, because it's my first time configuring Spark and using PySpark. I also don't quite understand the ins and outs of PyCharm yet.
I expect to be able to run the following basic sample script on this page: https://medium.com/parrot-prediction/integrating-apache-spark-2-0-with-pycharm-ce-522a6784886f with no issues.
There is a package called findspark here.
Or you may use the code below to set the path if it is not found in the environment:
import os

if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = 'full_path_to_spark_root'
[code continues]
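As a related sketch, findspark can also be pointed at the Spark root explicitly, which sidesteps the KeyError when the shell profile is not loaded (the path below reuses the spark-2.4.3-bin-hadoop2.7 directory from the question; adjust it to your install):
import os
import findspark

# Passing the Spark root directly means SPARK_HOME does not have to be pre-set;
# the path mirrors the export used in the question.
findspark.init(os.path.expanduser("~/spark-2.4.3-bin-hadoop2.7"))

from pyspark import SparkContext
sc = SparkContext.getOrCreate()
print(sc.version)
sc.stop()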

can't find module 'graphframes' -- Jupyter

I'm trying to install graphframes package following some instructions I have already read.
My first attempt was to do this in the command line:
pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
This works perfectly, and the download completed successfully on the machine.
However, when I try to import the package in my Jupyter notebook, it displays the error:
can't find module 'graphframes'
My next attempt was to copy the package folder graphframes to site-packages, but I could not make it work with a simple cp command.
I'm quite new to using Spark and I'm sure I'm missing some parts of the configuration...
Could you please help me?
This was what worked for me.
Extract the contents of the graphframes-xxx-xxx-xxx.jar file. You should get something like
graphframes
|-- examples
|-- ...
|-- __init__.py
|-- ...
Zip up the entire folder (not just the contents) and name it whatever you want. We'll just call it graphframes.zip.
Then, run the pyspark shell with
pyspark --py-files graphframes.zip \
--packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
You may need to do
sc.addPyFile('graphframes.zip')
before
import graphframes
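Once the import succeeds, a minimal smoke test could look like this (it assumes the spark session that the pyspark shell creates; the tiny DataFrames are made up for illustration):
from graphframes import GraphFrame

# GraphFrame expects an 'id' column on vertices and 'src'/'dst' columns on edges
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.edges.show()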
The simplest way to start Jupyter with PySpark and graphframes is to launch Jupyter from pyspark itself.
Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
The advantage of this approach is that if you later want to run your code via spark-submit, you can use the same start command.
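From inside the notebook this launches, one hedged way to confirm the package actually came along is to import it and look at the Spark configuration (spark is the session the pyspark-launched kernel provides):
import graphframes

# The --packages coordinates should appear under spark.jars.packages if they were picked up
print(spark.sparkContext.getConf().get("spark.jars.packages", "not set"))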

Running Apache SystemML

I am trying to get Apache SystemML set up and running (on Ubuntu) in a standalone mode.
I am relying on the github documentation to set this up.
I would like to run this with PySpark, and I am following the instructions from this beginner's guide.
After successfully installing systemml and launching pyspark shell, I tried the following code from the tutorial:
import systemml as sml
import numpy as np
m1 = sml.matrix(np.ones((3,3)) + 2)
The import statements work fine, however I encounter the following error with the 3rd line:
ImportError: Unable to load systemML.jar into the current pyspark session.Hint: Provide
the following argument to pyspark: --driver-class-path /usr/local...
As per the hint provided, I launched pyspark again, appending the "--driver-class-path ..." argument at the end, but I encountered the same error.
While googling for this, I found this error highlighted in the Apache SystemML documentation. However, I wasn't really able to address the issue.
Any help will be greatly appreciated!
Can you please confirm that "/usr/local..." in your comment is the path to systemml-*-incubating-SNAPSHOT.jar and that the file exists?
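One way to check that from inside the pyspark session itself is a short sketch like the one below (it assumes the shell's spark session; --driver-class-path is exposed in the configuration as spark.driver.extraClassPath):
import os

# The value passed via --driver-class-path shows up as spark.driver.extraClassPath
cp = spark.sparkContext.getConf().get("spark.driver.extraClassPath", "not set")
print(cp)
# Check that each classpath entry exists on disk; the SystemML jar should be among them
if cp != "not set":
    for entry in cp.split(os.pathsep):
        print(entry, os.path.exists(entry))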