XGBoost in Databricks with Python - pyspark

So recently I've been working with an MLlib Databricks cluster and saw that, according to the docs, XGBoost is available for my cluster version (5.1). This cluster is running Python 2.
I get the feeling that XGBoost4J is only available for Scala and Java. So my question is: how do I import the xgboost module into this environment without losing the distribution capabilities?
A sample of my code is below:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
import xgboost as xgb # Throws an error because the module is not installed, but it should be
# Transform class to classIndex to make xgboost happy
stringIndexer = StringIndexer(inputCol="species", outputCol="species_index").fit(newInput)
labelTransformed = stringIndexer.transform(newInput).drop("species")
# Compose feature columns as vectors
vectorCols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species_index"]
vectorAssembler = VectorAssembler(inputCols=vectorCols, outputCol="features")
xgbInput = vectorAssembler.transform(labelTransformed).select("features", "species_index")

You can try to use spark-sklearn to distribute the Python/scikit-learn version of XGBoost, but that distribution model is different from the XGBoost4J one. I have heard that a PySpark API for XGBoost4J on Databricks is coming, so stay tuned.

The relevant pull request, by the way, can be found here.
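If you go the spark-sklearn route in the meantime, here is a minimal sketch. It assumes spark-sklearn and the scikit-learn build of xgboost are installed on the cluster, that sc is the SparkContext Databricks predefines, and that spark-sklearn's GridSearchCV keeps scikit-learn's interface with a SparkContext as its first argument. It distributes the hyperparameter search across executors, not the training of a single model.
import numpy as np
from spark_sklearn import GridSearchCV  # drop-in for sklearn's GridSearchCV
from xgboost import XGBClassifier

# Collect the (small) assembled dataset to the driver as pandas.
pdf = labelTransformed.toPandas()
featureCols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X = pdf[featureCols].values
y = pdf["species_index"].values

param_grid = {"max_depth": [3, 5], "n_estimators": [50, 100]}
# Each parameter combination is evaluated on a different executor.
search = GridSearchCV(sc, XGBClassifier(), param_grid)
search.fit(X, y)
print(search.best_params_)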

Related

Do I need to import Spark again when I restart working on Google Colab?

I've been using Google Colab to practice PySpark. Do I need to reinstall PySpark, findspark and all the other files each time before I start running queries?
Or is there any shortcut that I should be aware of?
\cmd 1
!wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
\cmd 2
!tar -xvzf spark-3.3.1-bin-hadoop3.tgz
\cmd 3
!ls /content/spark-3.3.1-bin-hadoop3
!pip install findspark
\cmd 4
import os
os.environ["SPARK_HOME"] = "/content/spark-3.3.1-bin-hadoop3"
import findspark
findspark.init()
\cmd 5
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySpark 3.3 on Google Colab").getOrCreate()
Is there any way I can save some time on copy-pasting all these formalities, for the sake of learning faster? (See the sketch after this list.)
What does resetting the stored data mean (on the runtime)?
Do you have any productivity tips for using Google Colab?
How can I make a PySpark cluster just like in Databricks?
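One common shortcut (my suggestion, not something from the original post) is to skip the tarball, SPARK_HOME, and findspark steps entirely and install PySpark from pip, since the pip wheel bundles Spark itself. A minimal sketch for a fresh runtime:
# Single setup cell to re-run after every runtime reset (resetting the runtime
# wipes installed packages and files under /content).
!pip install -q pyspark

from pyspark.sql import SparkSession

# The pip wheel ships its own Spark, so no tarball, SPARK_HOME, or findspark is needed.
spark = SparkSession.builder.appName("PySpark on Google Colab").getOrCreate()
print(spark.version)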

'GroupedData' object has no attribute 'applyInPandas' for Pandas udf. Why am I unable to implement applyInPandas for the Grouped Data?

(Code and error screenshot omitted.)
Link referred:
https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.GroupedData.applyInPandas.html
I am trying to use a pandas_udf in PySpark.
My PySpark version: 2.4.4
My PyArrow version: 8.0.0
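For context, GroupedData.applyInPandas was only added in Spark 3.0, so it does not exist on PySpark 2.4.4. The sketch below shows how it is called on Spark 3.x, mirroring the example in the linked documentation:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ("id", "v"))

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Called once per group with a pandas DataFrame; must return a pandas DataFrame.
    return pdf.assign(v=pdf.v - pdf.v.mean())

# The schema string describes the columns of the returned DataFrame.
df.groupBy("id").applyInPandas(subtract_mean, schema="id long, v double").show()
On PySpark 2.4.x, the closest equivalent is a GROUPED_MAP pandas_udf used with GroupedData.apply.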

AttributeError: module 'torchtext' has no attribute 'legacy'

I am trying to use torchtext to process test data; however, I get the error "AttributeError: module 'torchtext' has no attribute 'legacy'" when I run the following code. Can anyone please tell me what the issue is here? I am using Python 3.10.4. Thanks.
import pandas as pd
import torch
import torchtext
import spacy

def prep_data(file_path):
    TEXT = torchtext.legacy.data.Field(tokenize='spacy', tokenizer_language='en_core_web_sm')
    LABEL = torchtext.legacy.data.LabelField(dtype=torch.long)
    fields = [('clean_text', TEXT), ('label', LABEL)]
    dataset = torchtext.legacy.data.TabularDataset(
        path=file_path, format='csv',
        skip_header=True, fields=fields)
    print(dataset.examples[0])

if __name__ == "__main__":
    train_path = './data/train.csv'
    test_path = './data/test.csv'
    prep_data(train_path)
I addressed the same issue by pinning torchtext to an older release that still ships the legacy module:
pip install torchtext==0.9
I also had the same issue and solved it by using a stable PyTorch/torchtext pairing. You are probably coming from versions 0.10 or 0.11, which were the last releases shipping the legacy module. Please update to the latest versions, 0.13 or 0.14 (note that the legacy API no longer exists there, so the code has to be migrated):
pip install torchtext==<version>
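Either way, it is worth checking which torchtext is actually installed, since the legacy module only exists in the 0.9.x to 0.11.x releases. A quick check (my addition, not part of the answers above):
import torchtext

# torchtext.legacy shipped from 0.9.0 up to 0.11.x and was removed in 0.12.0,
# so any version at or above 0.12 raises the AttributeError from the question.
print(torchtext.__version__)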

Importing spark.implicits._ inside a Jupyter notebook

In order to make use of $"my_column" constructs within Spark SQL we need to:
import spark.implicits._
This, however, is not working (as far as I can tell) inside a Jupyter notebook cell; the result is:
Name: Compile Error
Message: <console>:49: error: stable identifier required, but this.$line7$read.spark.implicits found.
import spark.implicits._
^
I have seen notebooks in the past for which that did work, but they may have been Zeppelin. Is there a way to get this working in Jupyter?
Here is a hack that works:
val spark2: SparkSession = spark
import spark2.implicits._
So now the spark2 reference is "stable": it is a plain val in the current scope, whereas the notebook's spark is reached through the REPL's generated wrapper objects (the this.$line7$read.spark in the error), which does not count as a stable identifier for the import.

Error when running pyspark

I tried to run PySpark from the terminal. From my terminal, I run snotebook and it automatically loads Jupyter. After that, when I select Python 3, the error below comes up in the terminal.
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file
/Users/simon/spark-1.6.0-bin-hadoop2.6/python/pyspark/shell.py
Here's my .bash_profile setting:
export PATH="/Users/simon/anaconda/bin:$PATH"
export SPARK_HOME=~/spark-1.6.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_HOME/bin/pyspark'
Please let me know if you have any ideas, thanks.
You need to add one of the lines below to your environment (e.g. in your .bash_profile):
PYSPARK_DRIVER_PYTHON=ipython
or
PYSPARK_DRIVER_PYTHON=ipython3
Hope it helps.
In my case, I was using a virtual environment and forgot to install Jupyter, so PySpark was picking up some version it found on the $PATH. Installing Jupyter inside the environment fixed the issue.
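As a quick sanity check (my addition, not part of the answer above), you can confirm from inside the notebook which interpreter and which PySpark install the kernel actually picked up:
import sys
import pyspark

print(sys.executable)        # the Python binary the notebook kernel is running
print(pyspark.__version__)   # the PySpark installation that was found first
print(pyspark.__file__)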
Spark now includes PySpark as part of the install, so remove the PySpark library unless you really need it.
Remove the old Spark, install latest version.
Install (pip) findspark library.
In Jupyter, import and use findspark:
import findspark
findspark.init()
Quick PySpark / Python 3 Check
import findspark
findspark.init()

from pyspark import SparkContext

sc = SparkContext()  # should start a local SparkContext without errors
print(sc)
sc.stop()