How to refer to Delta Lake tables in a Jupyter notebook using PySpark - pyspark

I'm trying to start using Delta Lake with PySpark.
To be able to use Delta Lake, I invoke pyspark from the Anaconda shell prompt as:
pyspark --packages io.delta:delta-core_2.11:0.3.0
Here is the Delta Lake reference: https://docs.delta.io/latest/quick-start.html
All Delta Lake commands work fine from the Anaconda shell prompt.
In a Jupyter notebook, however, referencing a Delta Lake table gives an error. Here is the code I am running in the notebook:
df_advisorMetrics.write.mode("overwrite").format("delta").save("/DeltaLake/METRICS_F_DELTA")
spark.sql("create table METRICS_F_DELTA using delta location '/DeltaLake/METRICS_F_DELTA'")
Below is the code I use at the start of the notebook to connect to PySpark:
import findspark
findspark.init()
findspark.find()
import pyspark
findspark.find()
Below is the error I get:
Py4JJavaError: An error occurred while calling o116.save.
: java.lang.ClassNotFoundException: Failed to find data source: delta. Please find packages at http://spark.apache.org/third-party-projects.html
Any suggestions?

I have created a Google Colab/Jupyter Notebook example that shows how to run Delta Lake.
https://github.com/prasannakumar2012/spark_experiments/blob/master/examples/Delta_Lake.ipynb
It has all the steps needed to run. It uses the latest Spark and Delta versions, so please change the versions accordingly.

A potential solution is to follow the techniques noted in Import PySpark packages with a regular Jupyter notebook.
Another potential solution is to download the delta-core JAR and place it in the $SPARK_HOME/jars folder so when you run jupyter notebook it automatically includes the Delta Lake JAR.
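A third option, if you are using findspark as in the question, is to set PYSPARK_SUBMIT_ARGS before the session is created so that Spark resolves the Delta package for you. This is only a minimal sketch; the delta-core version and the test path below are assumptions, so match the artifact to your Spark/Scala build.
import os
# must end with "pyspark-shell" for the arguments to be picked up
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages io.delta:delta-core_2.11:0.6.1 pyspark-shell"
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# smoke test: write and read back a small Delta table (example path)
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta_smoke_test")
spark.read.format("delta").load("/tmp/delta_smoke_test").show()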

I use DeltaLake all the time from a Jupyter notebook.
Try the following in your Jupyter notebook running Python 3.x.
### import Spark libraries
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
### spark package maven coordinates - in case you are loading more than just delta
spark_packages_list = [
'io.delta:delta-core_2.11:0.6.1',
]
spark_packages = ",".join(spark_packages_list)
### SparkSession
spark = (
SparkSession.builder
.config("spark.jars.packages", spark_packages)
.config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.getOrCreate()
)
sc = spark.sparkContext
### Python library in delta jar.
### Must create sparkSession before import
from delta.tables import *
Assuming you have a Spark DataFrame df:
HDFS
Save
### overwrite, change mode="append" if you prefer
(df.write.format("delta")
.save("my_delta_file", mode="overwrite", partitionBy="partition_column_name")
)
Load
df_delta = spark.read.format("delta").load("my_delta_file")
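Since delta.tables is imported above, you can also work with the same path through the DeltaTable API. A small sketch, assuming the table written above:
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "my_delta_file")
delta_table.history().show()            # inspect the transaction history
delta_table.vacuum(retentionHours=168)  # remove unreferenced files older than 7 days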
AWS S3 ObjectStore
Initial S3 setup
### Spark S3 access
import os
hdpConf = sc._jsc.hadoopConfiguration()
user = os.getenv("USER")
### Assuming you have your AWS credentials in a jceks keystore.
hdpConf.set("hadoop.security.credential.provider.path", f"jceks://hdfs/user/{user}/awskeyfile.jceks")
hdpConf.set("fs.s3a.fast.upload", "true")
### optimize s3 bucket-level parquet column selection
### un-comment to use
# hdpConf.set("fs.s3a.experimental.fadvise", "random")
### Pick one upload buffer option
hdpConf.set("fs.s3a.fast.upload.buffer", "bytebuffer") # JVM off-heap memory
# hdpConf.set("fs.s3a.fast.upload.buffer", "array") # JVM on-heap memory
# hdpConf.set("fs.s3a.fast.upload.buffer", "disk") # DEFAULT - directories listed in fs.s3a.buffer.dir
s3_bucket_path = "s3a://your-bucket-name"
s3_delta_prefix = "delta" # or whatever
Save
### overwrite, change mode="append" if you prefer
(df.write.format("delta")
.save(f"{s3_bucket_path}/{s3_delta_prefix}/", mode="overwrite", partitionBy="partition_column_name")
)
Load
df_delta = spark.read.format("delta").load(f"{s3_bucket_path}/{s3_delta_prefix}/")
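Delta also supports time travel on the same path. A small sketch, assuming at least one earlier version of the table exists:
df_v0 = (spark.read.format("delta")
    .option("versionAsOf", 0)
    .load(f"{s3_bucket_path}/{s3_delta_prefix}/")
)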
Spark Submit
Not directly answering the original question, but for completeness, you can do the following as well.
Add the following to your spark-defaults.conf file
spark.jars.packages io.delta:delta-core_2.11:0.6.1
spark.delta.logStore.class org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
Refer to the conf file in the spark-submit command:
spark-submit \
--properties-file /path/to/your/spark-defaults.conf \
--name your_spark_delta_app \
--py-files /path/to/your/supporting_pyspark_files.zip \
/path/to/your/pyspark_script.py

Related

Pyspark in google colab

I am trying to use pyspark on google colab. Every tutorial follows a similar method
!pip install pyspark # Import SparkSession
from pyspark.sql import SparkSession # Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate() # Check Spark Session Information
spark # Import a Spark function from library
from pyspark.sql.functions import col
But I get an error at this line:
----> 4 spark = SparkSession.builder.master("local[*]").getOrCreate() # Check Spark Session Information
RuntimeError: Java gateway process exited before sending its port number
I tried installing Java using something like this:
# Download Java Virtual Machine (JVM)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
as suggested by the tutorials, but nothing seems to work.
This worked for me, so I'm posting it in case someone needs it.
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
import pyspark
import pyspark.sql as pyspark_sql
import pyspark.sql.types as pyspark_types
import pyspark.sql.functions as pyspark_functions
from pyspark import SparkContext, SparkConf
# create the session
conf = SparkConf().set("spark.ui.port", "4050")
# create the context
sc = pyspark.SparkContext(conf=conf)
spark = pyspark_sql.SparkSession.builder.getOrCreate()
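A quick sanity check once the session is up (just an example DataFrame, not required):
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()
print(spark.version)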

Error I'm getting when I write Pyspark code to connect with Snowflake

I'm getting an error when I try to run PySpark code from a Jupyter notebook to connect with Snowflake. Here's the error I got:
Py4JJavaError: An error occurred while calling o526.load.
: java.lang.ClassNotFoundException: Failed to find data source: net.snowflake.spark.snowflake. Please find packages at http://spark.apache.org/third-party-projects.html
Spark-version: v2.4.5
Master: local[*]
Python 3.X
Here's my code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
sc = SparkContext.getOrCreate()
spark = SparkSession.builder \
.master("local") \
.appName("Test") \
.config('spark.jars','/Users/zhao/Downloads/snowflake-jdbc-3.5.4.jar,/Users/zhao/Downloads/spark-snowflake_2.11-2.3.2.jar') \
.getOrCreate()
sfOptions = {
"sfURL" : "xxx",
"sfUser" : "xxx",
"sfPassword" : "xxx",
"sfDatabase" : "xxx",
"sfSchema" : "xxx",
"sfWarehouse" : "xxx",
"sfRole": "xxx"
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "select * from CustomerInfo limit 10") \
.load()
I'd appreciate it if anyone can give me some ideas :)
How are you starting up your jupyter notebook server instance? Are you ensuring your PYTHONPATH and SPARK_HOME variables are properly set, and that Spark isn't pre-running an instance? Also, is your Snowflake Spark Connector jar using the right Spark and Scala version variants?
Here's a fully bootstrapped and tested run on a macOS machine, as a reference (uses homebrew):
# Install JDK8
~> brew tap adoptopenjdk/openjdk
~> brew cask install adoptopenjdk8
# Install Apache Spark (v2.4.5 as of post date)
~> brew install apache-spark
# Install Jupyter Notebooks (incl. optional CLI notebooks)
~> pip3 install --user jupyter notebook
# Ensure we use JDK8 (using very recent JDKs will cause class version issues)
~> export JAVA_HOME="/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home"
# Setup environment to allow discovery of PySpark libs and the Spark binaries
# (Uses homebrew features to set the paths dynamically)
~> export SPARK_HOME="$(brew --prefix apache-spark)/libexec"
~> export PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/build:${PYTHONPATH}"
~> export PYTHONPATH="$(brew list apache-spark | grep 'py4j-.*-src.zip$' | head -1):${PYTHONPATH}"
# Download jars for dependencies in notebook code into /tmp
# Snowflake JDBC (v3.12.8 used here):
~> curl --silent --location \
'https://search.maven.org/classic/remotecontent?filepath=net/snowflake/snowflake-jdbc/3.12.8/snowflake-jdbc-3.12.8.jar' \
> /tmp/snowflake-jdbc-3.12.8.jar
# Snowflake Spark Connector (v2.7.2 used here)
# But more importantly, a Scala 2.11 and Spark 2.4.x compliant one is fetched
~> curl --silent --location \
'https://search.maven.org/classic/remotecontent?filepath=net/snowflake/spark-snowflake_2.11/2.7.2-spark_2.4/spark-snowflake_2.11-2.7.2-spark_2.4.jar' \
> /tmp/spark-snowflake_2.11-2.7.2-spark_2.4.jar
# Run the jupyter notebook service (opens up in webbrowser)
~> jupyter notebook
Code run within a new Python 3 notebook:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
sfOptions = {
"sfURL": "account.region.snowflakecomputing.com",
"sfUser": "username",
"sfPassword": "password",
"sfDatabase": "db_name",
"sfSchema": "schema_name",
"sfWarehouse": "warehouse_name",
"sfRole": "role_name",
}
spark = SparkSession.builder \
.master("local") \
.appName("Test") \
.config('spark.jars','/tmp/snowflake-jdbc-3.12.8.jar,/tmp/spark-snowflake_2.11-2.7.2-spark_2.4.jar') \
.getOrCreate()
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "select * from CustomerInfo limit 10") \
.load()
df.show()
The above example uses a read method (to move data from Snowflake into Spark) but if you'd like to write a dataframe instead, see the documentation on moving data from Spark to Snowflake.
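For reference, a minimal sketch of that write path (the target table name and mode here are only examples):
(df.write.format(SNOWFLAKE_SOURCE_NAME)
    .options(**sfOptions)
    .option("dbtable", "CustomerInfoCopy")
    .mode("overwrite")
    .save())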
You need to have the Spark Snowflake connector on your classpath. Follow the instructions on the official page:
https://docs.snowflake.com/en/user-guide/spark-connector-install.html
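If you would rather not manage the jar files by hand, one alternative (a sketch; the versions are assumptions, so match them to your Spark and Scala build) is to let Spark resolve the connector from Maven via spark.jars.packages:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("Test") \
    .config("spark.jars.packages",
            "net.snowflake:spark-snowflake_2.11:2.7.2-spark_2.4,"
            "net.snowflake:snowflake-jdbc:3.12.8") \
    .getOrCreate()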
I have followed the steps described in the answer from #user13472370, with the same library versions. Additionally, I am using the same Snowflake connection parameters that I use to connect to Snowflake from SQL Workbench. However, I am still receiving the same error:
An error occurred while calling o43.load.
: java.lang.NoClassDefFoundError: scala/Product$class
at net.snowflake.spark.snowflake.Parameters$MergedParameters.<init>(Parameters.scala:288)
Update: Meanwhile, I found an easy-to-implement solution using the AWS Glue service: https://aws.amazon.com/blogs/big-data/performing-data-transformations-using-snowflake-and-aws-glue/
Probably a bit late, but here is another alternative for future users: in my case this issue was due to version incompatibilities. After lots of trial and error and looking at examples on the web, I found Snowflake JDBC, Spark connector, and Python versions compatible with our (old) Spark cluster version (2.4.4). With no code or credential changes, the error was gone.
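The scala/Product$class error above is typically a Scala 2.11 vs 2.12 mismatch between the connector jar and the Spark build. One way to check which Scala version your Spark uses before picking connector artifacts (a sketch using the py4j gateway):
print(spark.version)  # e.g. 2.4.4
print(spark.sparkContext._jvm.scala.util.Properties.versionString())  # e.g. "version 2.11.12"
# a 2.11.x result means you need the _2.11 connector artifacts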

sequence files from sqoop import

I have imported a table using sqoop and saved it as a sequence file.
How do I read this file into an RDD or Dataframe?
I have tried sc.sequenceFile(), but I'm not sure what to pass as keyClass and valueClass. I tried using org.apache.hadoop.io.Text and org.apache.hadoop.io.LongWritable for keyClass and valueClass,
but it did not work. I am using pyspark for reading the files.
It does not work in Python; however, it works in Scala.
You need to do the following steps:
step1:
If you import the table as a sequence file from sqoop, a jar file is generated; you need to use the class in that jar as the value class when reading the sequence file. This jar file is generally placed in the /tmp folder, but you can redirect it to a specific folder (i.e. a local folder, not HDFS) using the --bindir option.
example:
sqoop import --connect jdbc:mysql://ms.itversity.com/retail_export \
  --username retail_user --password itversity --table customers -m 1 \
  --target-dir '/user/srikarthik/udemy/practice4/problem2/outputseq' \
  --as-sequencefile --delete-target-dir --bindir /home/srikarthik/sqoopjars/
step2:
Also, you need to download the sqoop jar file from the link below:
http://www.java2s.com/Code/Jar/s/Downloadsqoop144hadoop200jar.htm
step3:
Suppose the customers table has been imported as a sequence file using sqoop.
Run spark-shell --jars path-to-customers.jar,sqoop-1.4.4-hadoop200.jar
example:
spark-shell --master yarn --jars /home/srikarthik/sqoopjars/customers.jar,/home/srikarthik/tejdata/kjar/sqoop-1.4.4-hadoop200.jar
step4: Now run the commands below inside the spark-shell:
scala> import org.apache.hadoop.io.LongWritable
scala> val data = sc.sequenceFile[LongWritable,customers]("/user/srikarthik/udemy/practice4/problem2/outputseq")
scala> data.map(tup => (tup._1.get(), tup._2.toString())).collect.foreach(println)
You can use the SeqDataSourceV2 package to read the sequence file with the DataFrame API without any prior knowledge of the schema (i.e. the keyClass and valueClass).
Please note that the current version is only compatible with Spark 2.4.
$ pyspark --jars seq-datasource-v2-0.2.0.jar
df = spark.read.format("seq").load("data.seq")
df.show()

Spark Shell Add Multiple Drivers/Jars to Classpath using spark-defaults.conf

We are using the spark-shell REPL mode to test various use cases and connect to multiple sources/sinks.
We need to add custom drivers/jars in the spark-defaults.conf file. I have tried to add multiple jars separated by commas,
like:
spark.driver.extraClassPath = /home/sandeep/mysql-connector-java-5.1.36.jar
spark.executor.extraClassPath = /home/sandeep/mysql-connector-java-5.1.36.jar
But it's not working. Can anyone please provide details on the correct syntax?
Note: Verified on Linux Mint with Spark 3.0.1.
If you set properties in spark-defaults.conf, Spark will take those settings only when you submit your job using spark-submit.
Note: I still need to verify this with spark-shell and pyspark.
file: spark-defaults.conf
spark.driver.extraJavaOptions -Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=hello-spark
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1
In the terminal, run your job, say wordcount.py:
spark-submit /path-to-file/wordcount.py
If you want to run your job in development mode from an IDE, then you should use the config() method. Here we will set the Kafka and Avro packages. Also, if you want to include log4j.properties, use extraJavaOptions.
The app name and master can be provided in two ways:
use .appName() and .master()
use a .conf file
file: hellospark.py
from logger import Log4j
from util import get_spark_app_config
from pyspark.sql import SparkSession
# first approach.
spark = SparkSession.builder \
.appName('Hello Spark') \
.master('local[3]') \
.config("spark.streaming.stopGracefullyOnShutdown", "true") \
.config("spark.jars.packages",
"org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,
org.apache.spark:spark-avro_2.12:3.0.1") \
.config("spark.driver.extraJavaOptions",
"-Dlog4j.configuration=file:log4j.properties "
"-Dspark.yarn.app.container.log.dir=app-logs "
"-Dlogfile.name=hello-spark") \
.getOrCreate()
# second approach.
conf = get_spark_app_config()
spark = SparkSession.builder \
.config(conf=conf) \
.config("spark.jars.packages",
"org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
.getOrCreate()
logger = Log4j(spark)
file: logger.py
from pyspark.sql import SparkSession
class Log4j(object):
def __init__(self, spark: SparkSession):
conf = spark.sparkContext.getConf()
app_name = conf.get("spark.app.name")
log4j = spark._jvm.org.apache.log4j
self.logger = log4j.LogManager.getLogger(app_name)
def warn(self, message):
self.logger.warn(message)
def info(self, message):
self.logger.info(message)
def error(self, message):
self.logger.error(message)
def debug(self, message):
self.logger.debug(message)
file: util.py
import configparser
from pyspark import SparkConf
def get_spark_app_config(enable_delta_lake=False):
"""
It will read configuration from spark.conf file to create
an instance of SparkConf(). Can be used to create
SparkSession.builder.config(conf=conf).getOrCreate()
:return: instance of SparkConf()
"""
spark_conf = SparkConf()
config = configparser.ConfigParser()
config.read("spark.conf")
for (key, value) in config.items("SPARK_APP_CONFIGS"):
spark_conf.set(key, value)
if enable_delta_lake:
for (key, value) in config.items("DELTA_LAKE_CONFIGS"):
spark_conf.set(key, value)
return spark_conf
file: spark.conf
[SPARK_APP_CONFIGS]
spark.app.name = Hello Spark
spark.master = local[3]
spark.sql.shuffle.partitions = 3
[DELTA_LAKE_CONFIGS]
spark.jars.packages = io.delta:delta-core_2.12:0.7.0
spark.sql.extensions = io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog = org.apache.spark.sql.delta.catalog.DeltaCatalog
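A hypothetical usage of the helper above with the Delta Lake section enabled (module and file names as in the listing; the save path is just an example):
from pyspark.sql import SparkSession
from util import get_spark_app_config
conf = get_spark_app_config(enable_delta_lake=True)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
# quick check that the Delta data source resolved via spark.jars.packages works
spark.range(3).write.format("delta").mode("overwrite").save("/tmp/delta_demo")
spark.read.format("delta").load("/tmp/delta_demo").show()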
As an example in addition to Prateek's answer, I have had some success by adding the following to the spark-defaults.conf file to be loaded when starting a spark-shell session in client mode.
spark.jars jars_added/aws-java-sdk-1.7.4.jar,jars_added/hadoop-aws-2.7.3.jar,jars_added/sqljdbc42.jar,jars_added/jtds-1.3.1.jar
Adding that line to the spark-defaults.conf file will load the listed jar files, as long as they are stored in the jars_added folder when spark-shell is run from the specific directory (doing this seems to mitigate the need to have the jar files loaded onto the slaves in the specified locations as well). I created the 'jars_added' folder in my $SPARK_HOME directory, so whenever I run spark-shell I must run it from this directory (I have not yet worked out how to change the location that the spark.jars setting uses as the initial path; it seems to default to the current directory when launching spark-shell). As hinted at by Prateek, the jar files need to be comma separated.
I also had to set SPARK_CONF_DIR to $SPARK_HOME/conf (export SPARK_CONF_DIR="${SPARK_HOME}/conf") for spark-shell to recognise the location of my config file (i.e. spark-defaults.conf). I'm using PuTTY to ssh onto the master.
Just to clarify: once I have added the spark.jars jar1,jar2,jar3 line to my spark-defaults.conf file, I type the following to start my spark-shell session:
cd $SPARK_HOME  # navigate to the Spark home directory, which contains the jars_added folder
spark-shell
On startup, spark-shell then loads the specified jar files from the jars_added folder.
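To confirm the jars were actually picked up, you can inspect the resolved configuration from the session; a sketch from pyspark (the equivalent works from spark-shell):
print(spark.sparkContext.getConf().get("spark.jars", "not set"))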

BigQuery connector ClassNotFoundException in PySpark on Dataproc

I'm trying to run a script in PySpark, using Dataproc.
The script is kind of a merge between this example and what I need to do, as I wanted to check if everything works. Obviously, it doesn't.
The error I get is:
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat
I made sure I have all the jars and added some new jars as suggested in other similar posts. I also checked the SPARK_HOME variable.
Below you can see the code; the error appears when trying to instantiate table_data.
"""BigQuery I/O PySpark example."""
from __future__ import absolute_import
import json
import pprint
import subprocess
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext()
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)
conf = {
'mapred.bq.project.id': project,
'mapred.bq.gcs.bucket': bucket,
'mapred.bq.temp.gcs.path': input_directory,
'mapred.bq.input.project.id': 'publicdata',
'mapred.bq.input.dataset.id': 'samples',
'mapred.bq.input.table.id': 'shakespeare',
}
output_dataset = 'wordcount_dataset'
output_table = 'wordcount_output'
table_data = sc.newAPIHadoopRDD(
'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
'org.apache.hadoop.io.LongWritable',
'com.google.gson.JsonObject',
conf=conf)
As pointed out in the example, you need to include the BigQuery connector jar when submitting the job.
Through the Dataproc jobs API:
gcloud dataproc jobs submit pyspark --cluster=${CLUSTER} \
/path/to/your/script.py \
--jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar
or spark-submit from inside the cluster:
spark-submit --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar \
/path/to/your/script.py
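For completeness, a sketch of consuming table_data once the connector jar is on the classpath. JsonTextBigQueryInputFormat yields the row values as JSON strings, and word is a column of the public shakespeare table:
import json
words = table_data.map(lambda kv: json.loads(kv[1])).map(lambda rec: rec["word"])
print(words.take(5))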