Apache Spark Pool Mongodb connector - mongodb

I have been trying to read from and write to a MongoDB Atlas server from Synapse Spark pools. I have tried PyMongo, but I'm more interested in using the MongoDB Spark connector. However, the install procedure uses this command:
./bin/pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred" \
--conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection" \
--packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
The problem I'm facing is that Synapse Spark pools allow Spark session configuration but not the --packages option or use of the Spark shell. How can I accomplish this installation inside a Spark pool?

This can be solved by installing the JAR as a workspace package: download the connector JAR, upload it to the Synapse workspace, and then add it to the Spark pool.
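Once the JAR is attached to the pool, the URIs that the question passes with --conf can be supplied per read from a notebook instead. The following is a minimal sketch, assuming connector 3.x (which registers the "mongo" data source name) and an existing SparkSession; mongo_read_options and read_collection are hypothetical helper names, not part of the connector:

```python
def mongo_read_options(uri, database, collection):
    """Build the per-read options for the MongoDB Spark connector 3.x."""
    return {"uri": uri, "database": database, "collection": collection}


def read_collection(spark, uri, database, collection):
    """Load one collection as a DataFrame using an existing SparkSession."""
    reader = spark.read.format("mongo")  # data source name used by connector 3.x
    for key, value in mongo_read_options(uri, database, collection).items():
        reader = reader.option(key, value)
    return reader.load()
```

In a Synapse notebook this would be called as read_collection(spark, "<atlas-uri>", "test", "myCollection"), with the URI being a placeholder. Unlike --packages, uploading a single JAR does not resolve transitive dependencies, so the MongoDB Java driver JARs may need to be uploaded as well.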

Related

Prevent pyspark from using in-memory session/docker

We are looking into using Spark as a big data processing framework in Azure Synapse Analytics with notebooks. I want to set up a similar local development environment/sandbox on my own computer, interacting with Azure Data Lake Storage Gen 2.
For installing Spark I'm using WSL with an Ubuntu distro (Spark seems to be easier to manage in Linux).
For notebooks I'm using Jupyter Notebook with Anaconda.
Both components work fine on their own, but I can't manage to connect the notebook to my local Spark cluster in WSL. I tried the following:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .master("local[1]") \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()
When examining the spark object it outputs
SparkSession - in-memory
SparkContext
Spark UI
Version v3.3.0
Master local[1]
AppName Python Spark SQL basic example
The Spark UI link points to http://host.docker.internal:4040/jobs/. Also, when examining the Spark UI in WSL, I can't see any connection. I think there is something I'm missing or not understanding about how PySpark works. Any help to clarify this would be much appreciated.
You are connecting to a local instance, which in this case is the native Windows process running Jupyter:
.master("local[1]")
Instead, you should connect to your WSL cluster:
.master("spark://localhost:7077") # assuming default port
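Applied to the builder from the question, the change might look like the sketch below. It assumes a standalone master is already running in WSL on the default port 7077 and is reachable from Windows as localhost; standalone_master_url and connect are hypothetical helper names:

```python
def standalone_master_url(host="localhost", port=7077):
    """Return the URL of a Spark standalone master (7077 is the default port)."""
    return f"spark://{host}:{port}"


def connect(app_name="Python Spark SQL basic example", host="localhost", port=7077):
    """Create a session attached to the standalone cluster instead of local[1]."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    return (
        SparkSession.builder
        .master(standalone_master_url(host, port))
        .appName(app_name)
        .getOrCreate()
    )
```

After spark = connect(), the application should be listed in the master's web UI rather than only in a local in-process UI.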

Using Spark-Submit to write to S3 in "local" mode using S3A Directory Committer

I'm currently running PySpark via local mode. I want to be able to efficiently output parquet files to S3 via the S3 Directory Committer. This PySpark instance is using the local disk, not HDFS, as it is being submitted via spark-submit --master local[*].
I can successfully write to my S3 Instance without enabling the directory committer. However, this involves writing staging files to S3 and renaming them, which is slow and unreliable. I would like for Spark to write to my local filesystem as a temporary store, and then copy to S3.
I have the following configuration in my PySpark conf:
self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
self.spark.conf.set("spark.hadoop.fs.s3a.committer.name", "directory")
self.spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
self.spark.conf.set("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
My spark-submit command looks like this:
spark-submit --master local[*] --py-files files.zip --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark.internal.io.cloud.PathOutputCommitProtocol --driver-memory 4G --name clean-raw-recording_data main.py
spark-submit gives me the following error, due to the requisite JAR not being in place:
java.lang.ClassNotFoundException: org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
My questions are:
1. Which JAR (specifically, the maven coordinates) do I need to include in spark-submit --packages in order to be able to reference PathOutputCommitProtocol?
2. Once I have (1) working, will I be able to use PySpark's local mode to stage temporary files on the local filesystem? Or is HDFS a strict requirement?
I need this to be running in local mode, not cluster mode.
EDIT:
I got this to work with the following configuration:
Using pyspark version 3.1.2 and the package
org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253.
I needed to add the cloudera repository using the --repositories option for spark-submit:
spark-submit --repositories https://repository.cloudera.com/artifactory/cloudera-repos/ --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253
You need the spark-hadoop-cloud module for the release of Spark you are using.
The committer is happy using the local filesystem (it's how the public integration test suites work: https://github.com/hortonworks-spark/cloud-integration). All that's needed is a "real" filesystem shared across all workers and the Spark driver, so the driver gets the manifests of each pending commit.
Print the _SUCCESS file after a job to see what the committer did: a 0-byte file means the old committer, JSON with diagnostics means the new one.
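The _SUCCESS check can be scripted. The sketch below only applies the rule stated above (empty marker versus JSON marker) to a copy of the file on a local path; the function names and the returned labels are hypothetical, and fetching the marker from S3 is left out:

```python
import json
from pathlib import Path


def classify_success_marker(data: bytes) -> str:
    """Report which kind of committer wrote a _SUCCESS marker."""
    if len(data) == 0:
        return "old committer (0-byte marker)"
    try:
        parsed = json.loads(data.decode("utf-8"))
    except ValueError:  # covers both bad UTF-8 and malformed JSON
        return "unrecognized marker contents"
    if isinstance(parsed, dict):
        return "new committer (JSON diagnostics)"
    return "unrecognized marker contents"


def inspect_success_file(output_dir: str) -> str:
    """Read <output_dir>/_SUCCESS from a local path and classify it."""
    return classify_success_marker((Path(output_dir) / "_SUCCESS").read_bytes())
```

For example, inspect_success_file("/tmp/output") would classify a marker that has been downloaded to /tmp/output/_SUCCESS (a placeholder path).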

jupyter notebook connecting to Apache Spark 3.0

I'm trying to connect my Scala kernel in a notebook environment to an existing Apache Spark 3.0 cluster.
I've tried the following methods of integrating Scala into a notebook environment:
Jupyter Scala (Almond)
Spylon Kernel
Apache Zeppelin
Polynote
In each of these Scala environments I've tried to connect to an existing cluster using the following script:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .master("spark://<ipaddress>:7077")
  .getOrCreate()
However when I go to the WebUI at localhost:8080 I don't see anything running on the cluster.
I am able to connect to the cluster using pyspark, but need help with connecting Scala to the cluster.

How do you get the driver and executors to load and recognize the postgres driver in EMR with spark-submit?

BACKGROUND
I am trying to run a spark-submit command that streams from Kafka and performs a JDBC sink into a postgres DB in AWS EMR (version 5.23.0) and using scala (version 2.11.12). The errors I see are
INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on <master-public-dns-name>, executor 1: java.sql.SQLException (No suitable driver found for jdbc:postgres://...
ERROR WriteToDataSourceV2Exec: Data source writer org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@44dd5258 is aborting.
19/06/20 06:11:26 ERROR WriteToDataSourceV2Exec: Data source writer org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@44dd5258 aborted.
HYPOTHESIS PROBLEM
I think the error is telling me that the jdbc postgres driver cannot be found on the executors, which is why it cannot sink to postgres.
PREVIOUS ATTEMPTS
I have already done the following:
Identified my driver in my structured streaming job as Class.forName("org.postgresql.Driver")
added --jars postgresql-42.1.4.jar \ to my spark-submit job in order to send the jars to the driver and executors. In this attempt, this postgres driver jar exists in my local /home/user_name/ directory
Also tried --jars /usr/lib/spark/jars/postgresql-42.1.4.jar \ to my spark-submit job, which is the location that spark in emr finds all the jars for execution
started my spark-submit job with spark-submit --driver-class-path /usr/lib/spark/jars/postgresql-42.1.4.jar:....
added the /usr/lib/spark/jars/postgresql-42.1.4.jar to the spark.driver.extraClassPath, spark.executor.extraClassPath, spark.yarn.dist.jars, spark.driver.extraLibraryPath, spark.yarn.secondary.jars, java.library.path, and to the System Classpath in general
My JDBC connection string, while working in Zeppelin, does not work in spark-submit. It is "jdbc:postgres://master-public-dns-name:5432/DBNAME"
EXPECTED RESULT:
I expect my executors to recognize the postgres driver and sink the data to the postgres DB.
PREVIOUS ATTEMPTS:
I've already used the following suggestions to no avail:
Adding JDBC driver to Spark on EMR
No Suitable Driver found Postgres JDBC
No suitable driver found for jdbc:postgresql://192.168.1.8:5432/NexentaSearch
Use --packages org.postgresql:postgresql:<VERSION> so that the driver is resolved from Maven and shipped to both the driver and the executors.
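With the driver supplied through --packages, the sink side might look like the PySpark sketch below (the question uses Scala; the option names are the same). It assumes the PostgreSQL JDBC driver's class name org.postgresql.Driver and its URL prefix jdbc:postgresql:, which differs from the jdbc:postgres:// prefix quoted in the question and may be worth checking. The helper names and the table, user, and password values are hypothetical:

```python
def postgres_jdbc_url(host, database, port=5432):
    """Build a URL with the prefix the PostgreSQL JDBC driver registers for."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def postgres_jdbc_options(host, database, table, user, password, port=5432):
    """Options for Spark's JDBC data source, naming the driver class explicitly."""
    return {
        "url": postgres_jdbc_url(host, database, port),
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def write_batch(batch_df, batch_id, options):
    """foreachBatch-style sink: append one micro-batch to PostgreSQL over JDBC."""
    batch_df.write.format("jdbc").options(**options).mode("append").save()
```

Setting the driver option explicitly means the executors load the class by name instead of relying on java.sql.DriverManager discovering it.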

Exception after Setting property 'spark.sql.hive.metastore.jars' in 'spark-defaults.conf'

Below are the versions of Spark and Hive installed on my system:
Spark : spark-1.4.0-bin-hadoop2.6
Hive : apache-hive-1.0.0-bin
I have configured the Hive installation to use MySQL as the Metastore. The goal is to access the MySQL Metastore and execute HiveQL queries inside spark-shell (using HiveContext).
So far I am able to execute HiveQL queries by accessing the Derby Metastore (as described here; I believe Spark 1.4 comes bundled with Hive 0.13.1, which in turn uses the internal Derby database as its Metastore).
Then I tried to point spark-shell to my external Metastore (MySQL in this case) by setting the property below (as suggested here) in $SPARK_HOME/conf/spark-defaults.conf:
spark.sql.hive.metastore.jars /home/mountain/hv/lib:/home/mountain/hp/lib
I have also copied $HIVE_HOME/conf/hive-site.xml into $SPARK_HOME/conf. But I am getting the following exception when I start the spark-shell
mountain@mountain:~/del$ spark-shell
Spark context available as sc.
java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError:
org/apache/hadoop/hive/ql/session/SessionState when creating Hive client
using classpath: file:/home/mountain/hv/lib/, file:/home/mountain/hp/lib/
Please make sure that jars for your version of hive and hadoop are
included in the paths passed to spark.sql.hive.metastore.jars.
Am I missing something (or) not setting the property spark.sql.hive.metastore.jars correctly?
Note: verified on Linux Mint.
If you are setting properties in spark-defaults.conf, Spark will pick up those settings only when you submit your job using spark-submit.
file: spark-defaults.conf
spark.driver.extraJavaOptions -Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=hello-spark
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1
In the terminal, run your job, for example wordcount.py:
spark-submit /path-to-file/wordcount.py
If you want to run your job in development mode from an IDE, use the config() method instead. Here we set the Kafka package:
spark = SparkSession.builder \
    .appName('Hello Spark') \
    .master('local[3]') \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
    .getOrCreate()
A corrupted hive-site.xml will also cause this. Copy the correct hive-site.xml into $SPARK_HOME/conf.