Do I need to import Spark again when I restart working on Google Colab? - pyspark

I've been using Google Colab to practice PySpark. Do I need to re-install PySpark, findspark, and all the other pieces every time before I can start running queries?
Or is there a shortcut that I should be aware of?
cmd 1
!wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
cmd 2
!tar -xvzf spark-3.3.1-bin-hadoop3.tgz
cmd 3
!ls /content/spark-3.3.1-bin-hadoop3
!pip install findspark
cmd 4
import os
os.environ["SPARK_HOME"] = "/content/spark-3.3.1-bin-hadoop3"
import findspark
findspark.init()
cmd 5
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySpark 3.3 on Google Colab").getOrCreate()
Is there any way to save time instead of copy-pasting all of these setup steps every session, for the sake of learning faster?
What does "resetting the stored data" mean (in the Runtime menu)?
Do you have any productivity tips for using Google Colab?
How can I make a PySpark cluster, just like in Databricks?
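For reference, the commands above can be collapsed into a single setup cell that is re-run after every runtime reset; resetting the runtime wipes the VM's disk and memory, so the download, extraction, and environment variables do have to be redone each session, while the notebook itself (and anything saved to Drive) persists. A minimal consolidated sketch, using the same Spark version and paths as above:

# Single setup cell -- re-run after each runtime reset
!wget -q https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
!tar -xzf spark-3.3.1-bin-hadoop3.tgz
!pip install -q findspark

import os
import findspark
from pyspark.sql import SparkSession

os.environ["SPARK_HOME"] = "/content/spark-3.3.1-bin-hadoop3"
findspark.init()
spark = SparkSession.builder.appName("PySpark 3.3 on Google Colab").getOrCreate()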

Related

Download file from Databricks (Scala)

I've used the following piece of code to divide romania-latest.osm.pbf into romania-latest.osm.pbf.node.parquet and romania-latest.osm.pbf.way.parquet in Databricks. Now I want to download these files to my local computer so I can use them in IntelliJ, but I can't seem to find where they're located or how to get them. I'm using the Community Edition of Databricks. This is done in Scala.
import sys.process._
"wget https://github.com/adrianulbona/osm-parquetizer/releases/download/v1.0.0/osm-parquetizer-1.0.0.jar -P /tmp/osm" !!
import sys.process._
"wget http://download.geofabrik.de/europe/monaco-latest.osm.pbf -P /tmp/osm" !!
import sys.process._
"java -jar /tmp/osm/osm-parquetizer-1.0.0.jar /tmp/osm/monaco-latest.osm.pbf" !!
I've searched on Google for a solution but nothing seems to work.
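One hedged sketch of how the output could be located and fetched (not from the original thread): the parquetizer presumably writes its .parquet files next to the input on the driver's local disk, so a notebook cell could list that directory and copy the files into DBFS's FileStore, which Databricks can expose for download in the browser. The dbutils.fs calls below are standard Databricks utilities (shown in Python notebook syntax; they work the same from Scala), but the exact output file names are assumptions based on the question.

# Assumes a Databricks notebook, where dbutils is predefined
display(dbutils.fs.ls("file:/tmp/osm"))  # look for the generated *.node.parquet / *.way.parquet

# Copy one of the outputs from the driver's local disk into DBFS so it can be
# downloaded via /FileStore (add recurse=True if the output turns out to be a directory)
dbutils.fs.cp("file:/tmp/osm/monaco-latest.osm.pbf.node.parquet",
              "dbfs:/FileStore/monaco-latest.osm.pbf.node.parquet")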

Kedro on Databricks: Cannot import SparkDataset

Cannot import SparkDataSet in Databricks using:
from kedro.extras.datasets.spark import SparkDataSet
Have you run pip install "kedro[spark.SparkDataSet]"?
A new Kedro project needs the connector dependencies installed before use.
Also, dataset types are case-sensitive, so make sure your catalog says SparkDataSet and not Sparkdataset, etc.
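Once the extras are installed, a quick sanity check that the import and the CamelCase name are right; a minimal sketch with a hypothetical file path:

# Assumes pip install "kedro[spark.SparkDataSet]" has already been run on the cluster
from kedro.extras.datasets.spark import SparkDataSet  # note the exact capitalisation

ds = SparkDataSet(filepath="/dbfs/tmp/example.parquet", file_format="parquet")  # hypothetical path
df = ds.load()  # returns a pyspark.sql.DataFrame
df.show()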

XGBoost in Databricks with Python

So recently I've been working with a Databricks ML cluster and saw that, according to the docs, XGBoost is available for my cluster version (5.1). This cluster is running Python 2.
I get the feeling that XGBoost4J is only available for Scala and Java. So my question is: how do I import the xgboost module into this environment without losing the distribution capabilities?
A sample of my code is below
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
import xgboost as xgb # throws an error because the module is not installed, though it should be
# Transform the string class label into a numeric index to make xgboost happy
stringIndexer = StringIndexer(inputCol="species", outputCol="species_index").fit(newInput)
labelTransformed = stringIndexer.transform(newInput).drop("species")
# Compose feature columns as vectors
vectorCols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species_index"]
vectorAssembler = VectorAssembler(inputCols=vectorCols, outputCol="features")
xgbInput = vectorAssembler.transform(labelTransformed).select("features", "species_index")
You can try to use spark-sklearn to distribute the Python or scikit-learn version of xgboost, but that distribution is different from the XGBoost4J distribution. I heard that the PySpark API for XGBoost4J on Databricks is coming, so stay tuned.
The relevant pull request, by the way, can be found here.
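For the spark-sklearn route mentioned above, a rough sketch of what that could look like; the call signature is an assumption from spark-sklearn's documented usage and may differ across versions, and note that it only distributes the hyperparameter search while the training data stays local:

import xgboost as xgb
from sklearn.datasets import load_iris
from spark_sklearn import GridSearchCV  # distributes the cross-validation fits over the cluster

X, y = load_iris(return_X_y=True)  # small local dataset, matching the iris columns above
param_grid = {"max_depth": [3, 5], "n_estimators": [50, 100]}

# sc is the notebook's SparkContext (predefined on Databricks)
search = GridSearchCV(sc, xgb.XGBClassifier(), param_grid=param_grid)
search.fit(X, y)
print(search.best_params_)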

can't find module 'graphframes' -- Jupyter

I'm trying to install graphframes package following some instructions I have already read.
My first attempt was to do this in the command line:
pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
This works perfectly and the download was successfully done in the machine.
However, when I try to import the package in my Jupyter notebook, it displays the error:
can't find module 'graphframes'
I then tried to copy the package folder /graphframes into /site-packages, but I could not manage it with a simple cp command.
I'm quite new to using Spark and I'm sure I'm missing some part of the configuration...
Could you please help me?
This was what worked for me.
Extract the contents of the graphframes-xxx-xxx-xxx.jar file. You should get something like
graphframes
|-- examples
|-- ...
|-- __init__.py
|-- ...
Zip up the entire folder (not just the contents) and name it whatever you want. We'll just call it graphframes.zip.
Then, run the pyspark shell with
pyspark --py-files graphframes.zip \
--packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
You may need to do
sc.addPyFile('graphframes.zip')
before
import graphframes
The simplest way to start Jupyter with pyspark and graphframes is to launch Jupyter from pyspark.
Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
The advantage of this is also that if you later want to run your code via spark-submit, you can use the same start command.
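Once the notebook comes up this way, a quick check that the package actually imports and runs; a minimal sketch that assumes the spark session created by the pyspark launcher:

from graphframes import GraphFrame

# Tiny vertex/edge DataFrames just to confirm graphframes loads and runs
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])
g = GraphFrame(v, e)
g.inDegrees.show()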

Error when running pyspark

I tried to run pyspark via the terminal. From my terminal, I run snotebook and it automatically loads Jupyter. After that, when I select Python 3, the error comes out in the terminal.
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file
/Users/simon/spark-1.6.0-bin-hadoop2.6/python/pyspark/shell.py
Here's my .bash_profile setting:
export PATH="/Users/simon/anaconda/bin:$PATH"
export SPARK_HOME=~/spark-1.6.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_HOME/bin/pyspark'
Please let me know if you have any ideas, thanks.
You need to add one of the lines below to your .bash_profile:
PYSPARK_DRIVER_PYTHON=ipython
or
PYSPARK_DRIVER_PYTHON=ipython3
Hope it will help.
In my case, I was using a virtual environment and forgot to install Jupyter, so it was using some version that it found in the $PATH. Installing it inside the environment fixed this issue.
Spark now includes PySpark as part of the install, so remove the separate PySpark library unless you really need it.
Remove the old Spark and install the latest version.
Install the findspark library with pip.
In Jupyter, import and use findspark:
import findspark
findspark.init()
Quick PySpark / Python 3 Check
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext()
print(sc)
sc.stop()
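Since the heading above also mentions a Python 3 check, one small hedged extension of that snippet (not part of the original answer) that prints the interpreter versions in play:

import sys
import findspark
findspark.init()

from pyspark import SparkContext

print(sys.version)      # the driver's Python version -- should report 3.x
sc = SparkContext()
print(sc.pythonVer)     # the Python version PySpark reports for this context
sc.stop()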