PySpark in Google Colab

I am trying to use PySpark on Google Colab. Every tutorial follows a similar method:
!pip install pyspark
# Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark session information
spark
# Import a Spark function from the library
from pyspark.sql.functions import col
But I get an error at this line:
----> 4 spark = SparkSession.builder.master("local[*]").getOrCreate() # Check Spark Session Information
RuntimeError: Java gateway process exited before sending its port number
I tried installing Java with something like this:
# Download Java Virtual Machine (JVM)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
as suggested by the tutorials, but nothing seems to work.

This worked for me, so I am posting it in case someone needs it.
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
import pyspark
import pyspark.sql as pyspark_sql
import pyspark.sql.types as pyspark_types
import pyspark.sql.functions as pyspark_functions
from pyspark import SparkContext, SparkConf
# configure the Spark UI port
conf = SparkConf().set("spark.ui.port", "4050")
# create the context
sc = pyspark.SparkContext(conf=conf)
# create the session (it reuses the context created above)
spark = pyspark_sql.SparkSession.builder.getOrCreate()
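Some context on why the JAVA_HOME line matters: when JAVA_HOME is set, Spark's launcher starts the JVM from $JAVA_HOME/bin/java, so a missing JDK or a wrong path can produce the "Java gateway process exited" error. Below is a minimal sketch (not part of the original answer) of a check you could run before creating the session. The helper name ensure_java_home is made up for illustration, and the default candidate path is simply the one used above; it may differ on other images.

```python
import os


def ensure_java_home(candidates=("/usr/lib/jvm/java-8-openjdk-amd64",)):
    """Point JAVA_HOME at the first directory that contains bin/java.

    An already-set JAVA_HOME is checked first, then the candidates in order.
    Raises RuntimeError if none of them contains a java executable.
    """
    current = os.environ.get("JAVA_HOME")
    search = ([current] if current else []) + list(candidates)
    for path in search:
        if os.path.isfile(os.path.join(path, "bin", "java")):
            os.environ["JAVA_HOME"] = path
            return path
    raise RuntimeError(
        "No JDK found in: {}. Install one first, e.g. with "
        "'apt install openjdk-8-jdk-headless'.".format(search)
    )


# Usage in a notebook cell, before any SparkSession is created:
#   ensure_java_home()
#   spark = SparkSession.builder.master("local[*]").getOrCreate()
```

If the function raises, the JDK install step did not succeed (or installed to a different directory), which is worth knowing before blaming PySpark itself.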

Related

Error in Pycharm when linking to pyspark: name 'spark' is not defined

When I run the example code in cmd, everything is ok.
>>> import pyspark
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1='Alice', _2=1)]
But when I execute the code in pycharm, I get an error.
spark.createDataFrame(l).collect()
NameError: name 'spark' is not defined
Maybe something went wrong when I linked PyCharm to PySpark.
[Screenshots of the PyCharm settings: Environment Variable, Project Structure, Project Interpreter]
When you start the pyspark shell from the command line, it creates a SparkSession and a SparkContext for you and exposes them as spark and sc respectively.
A script run from PyCharm gets neither, so you have to create these variables yourself before using them:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
EDIT:
Please have a look at: Failed to locate the winutils binary in the hadoop binary path

No module named 'pyspark' in Zeppelin

I am new to Spark and just started using it. I am trying to import SparkSession from pyspark, but it throws an error: No module named 'pyspark'. Please see my code below.
# Import our SparkSession so we can use it
from pyspark.sql import SparkSession
# Create our SparkSession, this can take a couple minutes locally
spark = SparkSession.builder.appName("basics").getOrCreate()
Error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-2-6ce0f5f13dc0> in <module>
1 # Import our SparkSession so we can use it
----> 2 from pyspark.sql import SparkSession
3 # Create our SparkSession, this can take a couple minutes locally
4 spark = SparkSession.builder.appName("basics").getOrCreate()
ModuleNotFoundError: No module named 'pyspark'
I am in my conda env and I tried pip install pyspark, but pip reports that it is already installed.
If you are using Zepl, paragraphs need an interpreter directive on the first line that tells the notebook which interpreter should run the code. This is the notebook's own syntax rather than Python, which makes sense because the code runs in their cloud environment. For PySpark the directive is %spark.pyspark:
%spark.pyspark
from pyspark.sql import SparkSession
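When pip says the package is already installed but the import still fails, one thing worth checking is whether the notebook is running the same Python interpreter that pip installed into. The following is a minimal diagnostic sketch using only the standard library; the helper name describe_module is made up for illustration.

```python
import importlib.util
import sys


def describe_module(name):
    """Report which interpreter is running and whether it can see `name`."""
    # find_spec returns None when a top-level module is not importable.
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


print(describe_module("pyspark"))
```

If "interpreter" is not the Python of the conda env where pip install pyspark was run, the notebook is using a different environment, and the package has to be installed there (or the notebook pointed at the right interpreter).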

How to refer to Delta Lake tables in a Jupyter notebook using PySpark

I'm trying to start using Delta Lake with PySpark.
To be able to use Delta Lake, I invoke pyspark from the Anaconda shell prompt as:
pyspark --packages io.delta:delta-core_2.11:0.3.0
Here is the reference from Delta Lake: https://docs.delta.io/latest/quick-start.html
All Delta Lake commands work fine from the Anaconda shell prompt.
In a Jupyter notebook, a reference to a Delta Lake table gives an error. Here is the code I am running in the Jupyter notebook:
df_advisorMetrics.write.mode("overwrite").format("delta").save("/DeltaLake/METRICS_F_DELTA")
spark.sql("create table METRICS_F_DELTA using delta location '/DeltaLake/METRICS_F_DELTA'")
Below is the code I am using at the start of the notebook to connect to PySpark:
import findspark
findspark.init()
findspark.find()
import pyspark
findspark.find()
Below is the error I get:
Py4JJavaError: An error occurred while calling o116.save.
: java.lang.ClassNotFoundException: Failed to find data source: delta. Please find packages at http://spark.apache.org/third-party-projects.html
Any suggestions?
I have created a Google Colab/Jupyter Notebook example that shows how to run Delta Lake.
https://github.com/prasannakumar2012/spark_experiments/blob/master/examples/Delta_Lake.ipynb
It contains all the steps needed to run. It uses the latest Spark and Delta versions at the time of writing, so change the versions to match your setup.
A potential solution is to follow the techniques noted in Import PySpark packages with a regular Jupyter notebook.
Another potential solution is to download the delta-core JAR and place it in the $SPARK_HOME/jars folder so when you run jupyter notebook it automatically includes the Delta Lake JAR.
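One way to pass --packages from a regular notebook, as far as I know, is to set the PYSPARK_SUBMIT_ARGS environment variable before the first SparkSession or SparkContext is created, so that the JVM is launched with the extra package. A minimal sketch, assuming the same Delta coordinates as in the question; the helper name set_pyspark_packages is made up for illustration.

```python
import os


def set_pyspark_packages(packages):
    """Ask PySpark to launch its JVM with the given Maven packages."""
    # The trailing "pyspark-shell" token is required by PySpark's launcher.
    submit_args = "--packages {} pyspark-shell".format(",".join(packages))
    os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
    return submit_args


# Must run before the first SparkSession/SparkContext is created in the kernel.
set_pyspark_packages(["io.delta:delta-core_2.11:0.3.0"])
```

If a session already exists in the kernel, restart the kernel first: the variable is only read when the JVM is started.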
I use DeltaLake all the time from a Jupyter notebook.
Try the following in your Jupyter notebook running Python 3.x.
### import Spark libraries
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
### spark package maven coordinates - in case you are loading more than just delta
spark_packages_list = [
    'io.delta:delta-core_2.11:0.6.1',
]
spark_packages = ",".join(spark_packages_list)
### SparkSession
spark = (
    SparkSession.builder
    .config("spark.jars.packages", spark_packages)
    .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
sc = spark.sparkContext
### Python library in delta jar.
### Must create sparkSession before import
from delta.tables import *
Assuming you have a spark dataframe df
HDFS
Save
### overwrite, change mode="append" if you prefer
(df.write.format("delta")
    .save("my_delta_file", mode="overwrite", partitionBy="partition_column_name")
)
Load
df_delta = spark.read.format("delta").load("my_delta_file")
AWS S3 ObjectStore
Initial S3 setup
### Spark S3 access
import os
hdpConf = sc._jsc.hadoopConfiguration()
user = os.getenv("USER")
### Assuming you have your AWS credentials in a jceks keystore.
hdpConf.set("hadoop.security.credential.provider.path", f"jceks://hdfs/user/{user}/awskeyfile.jceks")
hdpConf.set("fs.s3a.fast.upload", "true")
### optimize s3 bucket-level parquet column selection
### un-comment to use
# hdpConf.set("fs.s3a.experimental.fadvise", "random")
### Pick one upload buffer option
hdpConf.set("fs.s3a.fast.upload.buffer", "bytebuffer") # JVM off-heap memory
# hdpConf.set("fs.s3a.fast.upload.buffer", "array") # JVM on-heap memory
# hdpConf.set("fs.s3a.fast.upload.buffer", "disk") # DEFAULT - directories listed in fs.s3a.buffer.dir
s3_bucket_path = "s3a://your-bucket-name"
s3_delta_prefix = "delta" # or whatever
Save
### overwrite, change mode="append" if you prefer
(df.write.format("delta")
    .save(f"{s3_bucket_path}/{s3_delta_prefix}/", mode="overwrite", partitionBy="partition_column_name")
)
Load
df_delta = spark.read.format("delta").load(f"{s3_bucket_path}/{s3_delta_prefix}/")
Spark Submit
Not directly answering the original question, but for completeness, you can do the following as well.
Add the following to your spark-defaults.conf file:
spark.jars.packages io.delta:delta-core_2.11:0.6.1
spark.delta.logStore.class org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
Refer to the conf file in the spark-submit command (--class applies only to Java/Scala applications, so it is omitted for a Python script):
spark-submit \
    --properties-file /path/to/your/spark-defaults.conf \
    --name your_spark_delta_app \
    --py-files /path/to/your/supporting_pyspark_files.zip \
    /path/to/your/pyspark_script.py

Databricks-connect with Python + Scala UDF in JAR file not working locally

I'm trying to use a JAR file from Python (using Databricks Connect) in VS Code.
I already checked the path to the jar file.
I have the following code as example:
import datetime
import time
from pyspark.sql import SparkSession
from pyDataHub import LoadProcessorBase, ProcessItem
from pyspark.sql.functions import col, lit, sha1, concat, udf, array
from pyspark.sql import functions
from pyspark.sql.types import TimestampType, IntegerType, DoubleType, StringType
from pyspark import SparkContext
from pyspark.sql.functions import sha1, upper
from pyspark.sql.column import Column, _to_java_column, _to_seq
spark = SparkSession \
    .builder \
    .config("spark.jars", "/users/Phill/source/jar/DataHub_Core_Functions.jar") \
    .getOrCreate()
sc = spark.sparkContext
def PhillHash(col):
    f = sc._jvm.com.narato.datahub.core.HashContentGenerator.getGenerateHashUdf()
    return upper(sha1(Column(f.apply(_to_seq(sc, [col], _to_java_column)))))
sc._jsc.addJar("/users/Phill/source/jar/DataHub_Core_Functions.jar")
spark.range(100).withColumn("test", PhillHash("id")).show()
Any help would be appreciated, because I'm out of options here.
The error I get is the following:
Exception has occurred: TypeError 'JavaPackage' object is not callable
Upload the JAR to a DBFS location and update the path accordingly. The cluster's workers cannot read your local filesystem.
Also make sure you are running Databricks Runtime 5.4 or higher.
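To make the "update the path accordingly" step concrete, here is a small sketch that derives the DBFS URI of a JAR from its local path. It assumes the JAR has already been uploaded to a directory such as dbfs:/FileStore/jars, which is a hypothetical location chosen for illustration; the helper name to_dbfs_uri is likewise made up.

```python
import posixpath


def to_dbfs_uri(local_jar_path, dbfs_dir="dbfs:/FileStore/jars"):
    """Return the DBFS URI a JAR would have after being uploaded to dbfs_dir."""
    # Normalise Windows separators so basename works for either path style.
    jar_name = posixpath.basename(local_jar_path.replace("\\", "/"))
    return dbfs_dir.rstrip("/") + "/" + jar_name


jar_uri = to_dbfs_uri("/users/Phill/source/jar/DataHub_Core_Functions.jar")
print(jar_uri)

# The URI would then replace the local path in the question's code, e.g.:
#   sc._jsc.addJar(jar_uri)
```

The point is only that every path handed to Spark must be one the cluster can resolve, not one that exists solely on the development machine.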

How to execute arbitrary SQL from a PySpark notebook using SQLContext?

I'm trying a basic test case of reading data from dashDB into spark and then writing it back to dashDB again.
Step 1. First within the notebook, I read the data:
sqlContext = SQLContext(sc)
dashdata = sqlContext.read.jdbc(
    url="jdbc:db2://bluemix05.bluforcloud.com:50000/BLUDB:user=****;password=****;",
    table="GOSALES.BRANCH"
).cache()
Step 2. Then from dashDB I create the target table:
DROP TABLE ****.FROM_SPARK;
CREATE TABLE ****.FROM_SPARK AS (
    SELECT *
    FROM GOSALES.BRANCH
) WITH NO DATA
Step 3. Finally, within the notebook I save the data to the table:
from pyspark.sql import DataFrameWriter
writer = DataFrameWriter(dashdata)
# jdbc() returns None, so its result is not assigned back to dashdata
writer.jdbc(
    url="jdbc:db2://bluemix05.bluforcloud.com:50000/BLUDB:user=****;password=****;",
    table="****.FROM_SPARK"
)
Question: Is it possible to run the sql in step 2 from pyspark? I couldn't see how this could be done from the pyspark documentation. I don't want to use vanilla python for connecting to dashDB because of the effort involved in setting up the library.
Use ibmdbpy. See this brief demo.
With as_idadataframe() you can upload DataFrames into dashDB as a table.
I have added the key steps here because Stack Overflow discourages link-only answers:
Step 1: Add a cell containing:
#!pip install --user future
#!pip install --user lazy
#!pip install --user jaydebeapi
#!pip uninstall --yes ibmdbpy
#!pip install ibmdbpy --user --no-deps
#!wget -O $HOME/.local/lib/python2.7/site-packages/ibmdbpy/db2jcc4.jar https://ibm.box.com/shared/static/lmhzyeslp1rqns04ue8dnhz2x7fb6nkc.zip
Step 2: Then, from another notebook cell:
from ibmdbpy import IdaDataBase
idadb = IdaDataBase('jdbc:db2://<dashdb server name>:50000/BLUDB:user=<dashdb user>;password=<dashdb pw>')
....
Yes, you can create a table in dashDB from the notebook.
Below is the code in Scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.log4j.Logger
import org.apache.log4j.Level
import java.sql.Connection
import java.sql.DriverManager
import java.sql.SQLException
import com.ibm.db2.jcc._
import java.io._
val jdbcClassName="com.ibm.db2.jcc.DB2Driver"
val url="jdbc:db2://awh-yp-small02.services.dal.bluemix.net:50001/BLUDB:sslConnection=true;" // enter the host from the connection settings
val user="<username>"
val password="<password>"
Class.forName(jdbcClassName)
val connection = DriverManager.getConnection(url, user, password)
val stmt = connection.createStatement()
stmt.executeUpdate("CREATE TABLE COL12345(" +
"month VARCHAR(82))")
stmt.close()
connection.commit()
connection.close()
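Since the question asked for PySpark: the same JDBC calls can, in principle, be made from Python through the Py4J gateway that PySpark exposes as sc._jvm. The following is an untested sketch under assumptions. The DB2 JDBC driver must be visible to the driver JVM's DriverManager (otherwise getConnection fails with a "No suitable driver" error), and the function name run_sql_statements is made up for illustration. sc._jvm is an internal attribute, so treat this as a workaround rather than a supported API.

```python
def run_sql_statements(sc, url, user, password, statements):
    """Run SQL statements over JDBC using the JVM behind the SparkContext `sc`."""
    # java.sql.DriverManager is reached through PySpark's Py4J gateway.
    driver_manager = sc._jvm.java.sql.DriverManager
    connection = driver_manager.getConnection(url, user, password)
    try:
        stmt = connection.createStatement()
        try:
            for sql in statements:
                # executeUpdate handles DDL such as DROP TABLE / CREATE TABLE.
                stmt.executeUpdate(sql)
        finally:
            stmt.close()
    finally:
        # New JDBC connections are in auto-commit mode by default,
        # so no explicit commit() is issued here.
        connection.close()


# Usage in the notebook (placeholders as in the question):
#   run_sql_statements(
#       sc,
#       "jdbc:db2://<dashdb server name>:50000/BLUDB",
#       "<dashdb user>",
#       "<dashdb pw>",
#       ["DROP TABLE ****.FROM_SPARK",
#        "CREATE TABLE ****.FROM_SPARK AS (SELECT * FROM GOSALES.BRANCH) WITH NO DATA"],
#   )
```

Each statement is passed separately and without a trailing semicolon, since JDBC statements execute one SQL command at a time.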