I'm trying to write a dataframe to a BigQuery table. I have set up the SparkSession with the required parameters. However, when I perform the write I get an error:
Py4JJavaError: An error occurred while calling o114.save.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
The code is the following:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark2 = SparkSession.builder\
.config("spark.jars", "/Users/xyz/Downloads/gcs-connector-hadoop2-latest.jar") \
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0")\
.config("google.cloud.auth.service.account.json.keyfile", "/Users/xyz/Downloads/MyProject-cd7627f8ef9b.json") \
.getOrCreate()
spark2.conf.set("parentProject", "xyz")
data=spark2.createDataFrame(
[
("AAA", 51),
("BBB", 23),
],
['codiPuntSuministre', 'valor']
)
spark2.conf.set("temporaryGcsBucket","bqconsumptions")
data.write.format('bigquery') \
.option("credentialsFile", "/Users/xyz/Downloads/MyProject-xyz.json")\
.option('table', 'consumptions.c1') \
.mode('append') \
.save()
df=spark2.read.format("bigquery").option("credentialsFile", "/Users/xyz/Downloads/MyProject-xyz.json")\
.load("consumptions.c1")
I don't get any error if I take the write out of the code, so the error comes when trying to write, and it may be related to the auxiliary bucket used to operate with BigQuery.
The error suggests that Spark is not able to recognize the "gs" filesystem; you can add support for it with the configuration below. It happens because when you write to BigQuery the files are first loaded into a Google Cloud Storage bucket temporarily and then loaded into the BigQuery table.
spark._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
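In addition, you typically also need the AbstractFileSystem binding and the service-account credentials on the Hadoop configuration. A minimal sketch, reusing the key file path from the question (adjust it to your environment; here spark is your SparkSession, i.e. spark2 in the question):

# Sketch: remaining GCS connector settings on the Hadoop configuration
hconf = spark._jsc.hadoopConfiguration()
hconf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hconf.set("google.cloud.auth.service.account.enable", "true")
hconf.set("google.cloud.auth.service.account.json.keyfile", "/Users/xyz/Downloads/MyProject-cd7627f8ef9b.json")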
I am using the code below to read a Spark dataframe from HDFS:
from delta import *
from pyspark.sql import SparkSession
builder= SparkSession.builder.appName("MyApp") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark=configure_spark_with_delta_pip(builder).getOrCreate()
#change file path here
delta_df = spark.read.format("delta").load('hdfs://localhost:9000/final_project/data/2022-03-30/')
delta_df.show(10, truncate=False)
and the code below to use the pretrained pipeline:
from sparknlp.pretrained import PipelineModel
from pyspark.sql import SparkSession
import sparknlp
# spark session one way
spark = SparkSession.builder \
.appName("Spark NLP")\
.master("local[4]")\
.config("spark.driver.memory","16G")\
.config("spark.driver.maxResultSize", "0") \
.config("spark.kryoserializer.buffer.max", "2000M")\
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2").getOrCreate()
# alternate way #uncomment below to use
#spark=sparknlp.start(spark32=True)
# unzip the file and change path here
pipeline = PipelineModel.load("/home/sidd0613/final_project/classifierdl_bertwiki_finance_sentiment_pipeline_en_3.3.0_2.4_1636617651675")
print("-------")
# creating a spark data frame from the sentence
df=spark.createDataFrame([["As interest rates have increased, housing rents have also increased."]]).toDF('text')
# passing dataframe to the pipeline to derive sentiment
result = pipeline.transform(df)
#printing the result
print(result)
print("DONE!!!")
I wish to merge these two pieces of code, but the two Spark sessions are not merging and do not work for both tasks together. Please help!
I tried merging the .config() options of both Spark sessions and it did not work.
I also tried to create two Spark sessions, but that did not work either.
A normal Spark session is enough to read other file formats, but to read a Delta file I strictly had to use configure_spark_with_delta_pip(builder).
Is there any way to bypass this, or to make the code run?
configure_spark_with_delta_pip is just a shortcut to set up the correct parameters of the SparkSession... If you look into its source code, you'll see that all it does is configure spark.jars.packages. But because you're setting that option separately for Spark NLP, you're overwriting Delta's value.
Update 14.04.2022: it wasn't released at the time of this answer, but it is now available in version 1.2.0.
To handle such situations, configure_spark_with_delta_pip has an additional parameter extra_packages to specify additional packages to be configured. So in your case the code should look like the following:
builder = SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")
my_packages = ["com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2"]
spark=configure_spark_with_delta_pip(builder, extra_packages=my_packages) \
.getOrCreate()
Until that implementation with the extra parameter is released, you need to avoid using that function and simply configure all parameters yourself, like this:
scala_version = "2.12"
delta_version = "1.1.0"
all_packages = ["com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2",
f"io.delta:delta-core_{scala_version}:{delta_version}"]
spark = SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", ",".join(all_packages)) \
    .getOrCreate()
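With that single session you can run both workloads back to back. A quick sanity check, as a sketch reusing the paths from the question:

# Sketch: confirm both packages are registered, then exercise Delta and Spark NLP in the same session
print(spark.sparkContext.getConf().get("spark.jars.packages"))

delta_df = spark.read.format("delta").load("hdfs://localhost:9000/final_project/data/2022-03-30/")
delta_df.show(5, truncate=False)

from sparknlp.pretrained import PipelineModel
pipeline = PipelineModel.load("/home/sidd0613/final_project/classifierdl_bertwiki_finance_sentiment_pipeline_en_3.3.0_2.4_1636617651675")
df = spark.createDataFrame([["As interest rates have increased, housing rents have also increased."]]).toDF("text")
pipeline.transform(df).show()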
I'm getting an error when I try to write PySpark code from a Jupyter Notebook to connect with Snowflake. Here's the error I got:
Py4JJavaError: An error occurred while calling o526.load.
: java.lang.ClassNotFoundException: Failed to find data source: net.snowflake.spark.snowflake. Please find packages at http://spark.apache.org/third-party-projects.html
Spark-version: v2.4.5
Master: local[*]
Python 3.X
Here's my code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
sc = SparkContext.getOrCreate()
spark = SparkSession.builder \
.master("local") \
.appName("Test") \
.config('spark.jars','/Users/zhao/Downloads/snowflake-jdbc-3.5.4.jar,/Users/zhao/Downloads/spark-snowflake_2.11-2.3.2.jar') \
.getOrCreate()
sfOptions = {
"sfURL" : "xxx",
"sfUser" : "xxx",
"sfPassword" : "xxx",
"sfDatabase" : "xxx",
"sfSchema" : "xxx",
"sfWarehouse" : "xxx",
"sfRole": "xxx"
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "select * from CustomerInfo limit 10") \
.load()
I'd appreciate it if anyone can give me some ideas :)
How are you starting up your jupyter notebook server instance? Are you ensuring your PYTHONPATH and SPARK_HOME variables are properly set, and that Spark isn't pre-running an instance? Also, is your Snowflake Spark Connector jar using the right Spark and Scala version variants?
Here's a fully bootstrapped and tested run on a macOS machine, as a reference (uses homebrew):
# Install JDK8
~> brew tap adoptopenjdk/openjdk
~> brew cask install adoptopenjdk8
# Install Apache Spark (v2.4.5 as of post date)
~> brew install apache-spark
# Install Jupyter Notebooks (incl. optional CLI notebooks)
~> pip3 install --user jupyter notebook
# Ensure we use JDK8 (using very recent JDKs will cause class version issues)
~> export JAVA_HOME="/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home"
# Setup environment to allow discovery of PySpark libs and the Spark binaries
# (Uses homebrew features to set the paths dynamically)
~> export SPARK_HOME="$(brew --prefix apache-spark)/libexec"
~> export PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/build:${PYTHONPATH}"
~> export PYTHONPATH="$(brew list apache-spark | grep 'py4j-.*-src.zip$' | head -1):${PYTHONPATH}"
# Download jars for dependencies in notebook code into /tmp
# Snowflake JDBC (v3.12.8 used here):
~> curl --silent --location \
'https://search.maven.org/classic/remotecontent?filepath=net/snowflake/snowflake-jdbc/3.12.8/snowflake-jdbc-3.12.8.jar' \
> /tmp/snowflake-jdbc-3.12.8.jar
# Snowflake Spark Connector (v2.7.2 used here)
# But more importantly, a Scala 2.11 and Spark 2.4.x compliant one is fetched
~> curl --silent --location \
'https://search.maven.org/classic/remotecontent?filepath=net/snowflake/spark-snowflake_2.11/2.7.2-spark_2.4/spark-snowflake_2.11-2.7.2-spark_2.4.jar' \
> /tmp/spark-snowflake_2.11-2.7.2-spark_2.4.jar
# Run the jupyter notebook service (opens up in webbrowser)
~> jupyter notebook
Code run within a new Python 3 notebook:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
sfOptions = {
"sfURL": "account.region.snowflakecomputing.com",
"sfUser": "username",
"sfPassword": "password",
"sfDatabase": "db_name",
"sfSchema": "schema_name",
"sfWarehouse": "warehouse_name",
"sfRole": "role_name",
}
spark = SparkSession.builder \
.master("local") \
.appName("Test") \
.config('spark.jars','/tmp/snowflake-jdbc-3.12.8.jar,/tmp/spark-snowflake_2.11-2.7.2-spark_2.4.jar') \
.getOrCreate()
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "select * from CustomerInfo limit 10") \
.load()
df.show()
The above example uses a read method (to move data from Snowflake into Spark) but if you'd like to write a dataframe instead, see the documentation on moving data from Spark to Snowflake.
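For reference, the write direction looks roughly like this (a sketch; CustomerInfoCopy is a hypothetical target table and the save mode is up to you):

# Sketch: writing a dataframe back to Snowflake with the same connector and options
df.write.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("dbtable", "CustomerInfoCopy") \
    .mode("overwrite") \
    .save()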
You need to have the Spark Snowflake connector on your classpath. Follow the instructions from the official page:
https://docs.snowflake.com/en/user-guide/spark-connector-install.html
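If you prefer not to manage local jar files, here is a sketch that lets Spark resolve the connector from Maven instead (the coordinates below are the Scala 2.11 / Spark 2.4 variants used in the other answer; match them to your cluster):

from pyspark.sql import SparkSession

# Sketch: resolve the Snowflake connector and JDBC driver via spark.jars.packages
spark = SparkSession.builder \
    .master("local") \
    .appName("Test") \
    .config("spark.jars.packages",
            "net.snowflake:snowflake-jdbc:3.12.8,"
            "net.snowflake:spark-snowflake_2.11:2.7.2-spark_2.4") \
    .getOrCreate()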
I have followed the steps described in the answer from #user13472370, with the same library versions. Additionally, I am using the same Snowflake connection parameters to connect to Snowflake from SQL Workbench. However, I am still receiving the same error:
An error occurred while calling o43.load.
: java.lang.NoClassDefFoundError: scala/Product$class
at net.snowflake.spark.snowflake.Parameters$MergedParameters.<init>(Parameters.scala:288)
Update: Meanwhile I found an easy-to-implement solution using the AWS Glue service: https://aws.amazon.com/blogs/big-data/performing-data-transformations-using-snowflake-and-aws-glue/
Probably a bit late, but here is another alternative for future users: in my case this issue was due to version incompatibilities. After lots of trial and error, and seeing examples on the web, I found Snowflake JDBC, Spark connector, and Python versions compatible with our (old) Spark cluster version (2.4.4). With no code or credential changes, this error was gone.
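If you are unsure which variants to pick, a small check of the running Spark and Scala versions helps narrow it down; a sketch:

# Sketch: pick connector artifacts whose suffixes match these values,
# e.g. _2.11 for Scala 2.11 and -spark_2.4 for Spark 2.4.x
print(spark.version)
print(spark.sparkContext._jvm.scala.util.Properties.versionString())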
My Spark job reads a folder with parquet data partitioned by the column partition:
val spark = SparkSession
.builder()
.appName("Prepare Id Mapping")
.getOrCreate()
import spark.implicits._
spark.read
.parquet(sourceDir)
.filter($"field" === "ss_id" and $"int_value".isNotNull)
.select($"int_value".as("ss_id"), $"partition".as("date"), $"ct_id")
.coalesce(1)
.write
.partitionBy("date")
.parquet(idMappingDir)
I've noticed that only one task is created, so it's very slow. There are a lot of subfolders like partition=2019-01-07 inside the source folder, and each subfolder contains a lot of files with the extension snappy.parquet. I submit the job with --num-executors 2 --executor-cores 4, and RAM is not an issue. I tried reading from both S3 and the local filesystem. I tried adding .repartition(nPartitions), removing .coalesce(1) and .partitionBy("date"), but the result was the same.
Could you suggest how I can get Spark to read these parquet files in parallel?
Well, I've figured out the correct code. The key change is dropping .coalesce(1), which collapsed the read and write into a single partition and hence a single task; mergeSchema is enabled to reconcile the schemas of the partition files when they differ:
val spark = SparkSession
.builder()
.appName("Prepare Id Mapping")
.getOrCreate()
import spark.implicits._
spark.read
.option("mergeSchema", "true")
.parquet(sourceDir)
.filter($"field" === "ss_id" and $"int_value".isNotNull)
.select($"int_value".as("ss_id"), $"partition".as("date"), $"ct_id")
.write
.partitionBy("date")
.parquet(idMappingDir)
Hope this will save someone time in future.
I'm trying to start use DeltaLakes using Pyspark.
To be able to use Delta Lake, I invoke pyspark from the Anaconda shell prompt as follows:
pyspark --packages io.delta:delta-core_2.11:0.3.0
Here is the reference from Delta Lake: https://docs.delta.io/latest/quick-start.html
All Delta Lake commands work fine from the Anaconda shell prompt.
On Jupyter Notebook, referencing a Delta Lake table gives an error. Here is the code I am running on Jupyter Notebook:
df_advisorMetrics.write.mode("overwrite").format("delta").save("/DeltaLake/METRICS_F_DELTA")
spark.sql("create table METRICS_F_DELTA using delta location '/DeltaLake/METRICS_F_DELTA'")
Below is the code I am using at the start of the notebook to connect to PySpark:
import findspark
findspark.init()
findspark.find()
import pyspark
findspark.find()
Below is the error I get:
Py4JJavaError: An error occurred while calling o116.save.
: java.lang.ClassNotFoundException: Failed to find data source: delta. Please find packages at http://spark.apache.org/third-party-projects.html
Any suggestions?
I have created a Google Colab/Jupyter Notebook example that shows how to run Delta Lake.
https://github.com/prasannakumar2012/spark_experiments/blob/master/examples/Delta_Lake.ipynb
It has all the steps needed to run. This uses the latest spark and delta version. Please change the versions accordingly.
A potential solution is to follow the techniques noted in Import PySpark packages with a regular Jupyter notebook; a sketch of one such technique is shown below.
Another potential solution is to download the delta-core JAR and place it in the $SPARK_HOME/jars folder so when you run jupyter notebook it automatically includes the Delta Lake JAR.
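A sketch of the package-based technique, assuming you set PYSPARK_SUBMIT_ARGS before findspark starts the JVM (the delta-core coordinate must match your Spark's Scala version; 2.11/0.6.1 is shown only as an example):

import os
# Must be set before the SparkContext / JVM starts; pulls the Delta package at launch
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages io.delta:delta-core_2.11:0.6.1 pyspark-shell"

import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.format("delta").load("/DeltaLake/METRICS_F_DELTA")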
I use DeltaLake all the time from a Jupyter notebook.
Try the following in your Jupyter notebook running Python 3.x.
### import Spark libraries
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
### spark package maven coordinates - in case you are loading more than just delta
spark_packages_list = [
'io.delta:delta-core_2.11:0.6.1',
]
spark_packages = ",".join(spark_packages_list)
### SparkSession
spark = (
SparkSession.builder
.config("spark.jars.packages", spark_packages)
.config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.getOrCreate()
)
sc = spark.sparkContext
### Python library in delta jar.
### Must create sparkSession before import
from delta.tables import *
Assuming you have a Spark dataframe df:
HDFS
Save
### overwrite, change mode="append" if you prefer
(df.write.format("delta")
.save("my_delta_file", mode="overwrite", partitionBy="partition_column_name")
)
Load
df_delta = spark.read.format("delta").load("my_delta_file")
AWS S3 ObjectStore
Initial S3 setup
### Spark S3 access
import os
hdpConf = sc._jsc.hadoopConfiguration()
user = os.getenv("USER")
### Assuming you have your AWS credentials in a jceks keystore.
hdpConf.set("hadoop.security.credential.provider.path", f"jceks://hdfs/user/{user}/awskeyfile.jceks")
hdpConf.set("fs.s3a.fast.upload", "true")
### optimize s3 bucket-level parquet column selection
### un-comment to use
# hdpConf.set("fs.s3a.experimental.fadvise", "random")
### Pick one upload buffer option
hdpConf.set("fs.s3a.fast.upload.buffer", "bytebuffer") # JVM off-heap memory
# hdpConf.set("fs.s3a.fast.upload.buffer", "array") # JVM on-heap memory
# hdpConf.set("fs.s3a.fast.upload.buffer", "disk") # DEFAULT - directories listed in fs.s3a.buffer.dir
s3_bucket_path = "s3a://your-bucket-name"
s3_delta_prefix = "delta" # or whatever
Save
### overwrite, change mode="append" if you prefer
(df.write.format("delta")
.save(f"{s3_bucket_path}/{s3_delta_prefix}/", mode="overwrite", partitionBy="partition_column_name")
)
Load
df_delta = spark.read.format("delta").load(f"{s3_bucket_path}/{s3_delta_prefix}/")
Spark Submit
Not directly answering the original question, but for completeness, you can do the following as well.
Add the following to your spark-defaults.conf file
spark.jars.packages io.delta:delta-core_2.11:0.6.1
spark.delta.logStore.class org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
Refer to conf file in spark-submit command
spark-submit \
--properties-file /path/to/your/spark-defaults.conf \
--name your_spark_delta_app \
--py-files /path/to/your/supporting_pyspark_files.zip \
/path/to/your/pyspark_script.py
I'm trying to run a script in PySpark, using Dataproc.
The script is kind of a merge between this example and what I need to do, as I wanted to check if everything works. Obviously, it doesn't.
The error I get is:
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat
I made sure I have all the jars and added some new jars as suggested in other similar posts. I also checked the SPARK_HOME variable.
Below you can see the code; the error appears when trying to instantiate table_data.
"""BigQuery I/O PySpark example."""
from __future__ import absolute_import
import json
import pprint
import subprocess
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext()
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)
conf = {
'mapred.bq.project.id': project,
'mapred.bq.gcs.bucket': bucket,
'mapred.bq.temp.gcs.path': input_directory,
'mapred.bq.input.project.id': 'publicdata',
'mapred.bq.input.dataset.id': 'samples',
'mapred.bq.input.table.id': 'shakespeare',
}
output_dataset = 'wordcount_dataset'
output_table = 'wordcount_output'
table_data = sc.newAPIHadoopRDD(
'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
'org.apache.hadoop.io.LongWritable',
'com.google.gson.JsonObject',
conf=conf)
As pointed out in the example, you need to include the BigQuery connector jar when submitting the job.
Through the Dataproc jobs API:
gcloud dataproc jobs submit pyspark --cluster=${CLUSTER} \
/path/to/your/script.py \
--jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar
or spark-submit from inside the cluster:
spark-submit --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar \
/path/to/your/script.py