Could not instantiate EventHubSourceProvider for Azure Databricks - pyspark

Using the steps documented in the structured streaming PySpark guide, I'm unable to create a DataFrame in PySpark from the Azure Event Hub I have set up, so I cannot read the stream data.
The error message is:
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.eventhubs.EventHubsSourceProvider could not be instantiated
I have tried installing the following Maven libraries (com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12 is unavailable), but none of them appear to work:
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.6
I also tried ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString), but the error message returned is:
java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.$init$(Lorg/apache/spark/internal/Logging;)V
The connection string is correct: it is also used in a console application that writes to the Azure Event Hub, and that works.
Can someone point me in the right direction, please? The code in use is as follows:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Event Hub Namespace Name
NAMESPACE_NAME = "*myEventHub*"
KEY_NAME = "*MyPolicyName*"
KEY_VALUE = "*MySharedAccessKey*"
# The connection string to your Event Hubs Namespace
connectionString = "Endpoint=sb://{0}.servicebus.windows.net/;SharedAccessKeyName={1};SharedAccessKey={2};EntityPath=ingestion".format(NAMESPACE_NAME, KEY_NAME, KEY_VALUE)
ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
# For 2.3.15 version and above, the configuration dictionary requires that connection string be encrypted.
# ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()

To resolve the issue, I did the following:
1. Uninstall the installed Azure Event Hubs library versions.
2. Install the com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15 library version from Maven Central.
3. Restart the cluster.
4. Validate by re-running the code provided in the question.

I received this same error when installing libraries with the version number com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.* on a cluster running Spark 3.0 with Scala 2.12.
For anyone else finding this via Google: check that you have the correct Scala version of the library. In my case, the cluster runs Spark 3 with Scala 2.12.
Changing the "2.11" in the library version from the tutorial I was using to "2.12", so that it matches my cluster runtime version, fixed the issue.
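The suffix check described above can be done before installing anything. The following is a minimal sketch, not part of the connector: `scala_binary_version` and `eventhubs_coordinate` are hypothetical helper names, and the connector version passed in is whichever release you have chosen. On a cluster, the Scala version string can usually be read through the Py4J gateway, as noted in the comment.

```python
def scala_binary_version(scala_version):
    """Return the binary version ("2.12") for a full version such as "2.12.10"."""
    parts = scala_version.strip().split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError("unexpected Scala version: %r" % scala_version)
    return ".".join(parts[:2])


def eventhubs_coordinate(scala_version, connector_version):
    """Build the Maven coordinate whose suffix matches the cluster's Scala version."""
    return "com.microsoft.azure:azure-eventhubs-spark_{0}:{1}".format(
        scala_binary_version(scala_version), connector_version)


# Hypothetical usage on a cluster (needs a live SparkContext named `sc`):
#   scala_version = sc._jvm.scala.util.Properties.versionNumberString()
print(eventhubs_coordinate("2.12.10", "2.3.17"))
```

If the printed coordinate differs from the library already attached to the cluster, that mismatch is the first thing to fix.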

I had to take this a step further: in the format method I had to pass the provider class name directly:
.format("org.apache.spark.sql.eventhubs.EventHubsSourceProvider")

Check the cluster's Scala version against the library version.
Uninstall the older libraries and install:
com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17
in the shared workspace (right-click and install library) and also on the cluster.

Related

How to tell PySpark where the pymongo-spark package is located? [duplicate]

I'm having difficulty getting these components to knit together properly. I have Spark installed and working successfully: I can run jobs locally, standalone, and also via YARN. I have followed the documented steps to the best of my knowledge.
I'm working on Ubuntu and the various component versions I have are:
Spark spark-1.5.1-bin-hadoop2.6
Hadoop hadoop-2.6.1
Mongo 2.6.10
Mongo-Hadoop connector cloned from https://github.com/mongodb/mongo-hadoop.git
Python 2.7.10
I had some difficulty following the various steps, such as which jars to add to which path, so this is what I have added:
In /usr/local/share/hadoop-2.6.1/share/hadoop/mapreduce I have added mongo-hadoop-core-1.5.0-SNAPSHOT.jar
and the following environment variables:
export HADOOP_HOME="/usr/local/share/hadoop-2.6.1"
export PATH=$PATH:$HADOOP_HOME/bin
export SPARK_HOME="/usr/local/share/spark-1.5.1-bin-hadoop2.6"
export PYTHONPATH="/usr/local/share/mongo-hadoop/spark/src/main/python"
export PATH=$PATH:$SPARK_HOME/bin
My Python program is basic:
from pyspark import SparkContext, SparkConf
import pymongo_spark
pymongo_spark.activate()
def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)
    rdd = sc.mongoRDD(
        'mongodb://username:password@localhost:27017/mydb.mycollection')
if __name__ == '__main__':
    main()
I am running it using the command
$SPARK_HOME/bin/spark-submit --driver-class-path /usr/local/share/mongo-hadoop/spark/build/libs/ --master local[4] ~/sparkPythonExample/SparkPythonExample.py
and I am getting the following output as a result
Traceback (most recent call last):
File "/home/me/sparkPythonExample/SparkPythonExample.py", line 24, in <module>
main()
File "/home/me/sparkPythonExample/SparkPythonExample.py", line 17, in main
rdd = sc.mongoRDD('mongodb://username:password@localhost:27017/mydb.mycollection')
File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 161, in mongoRDD
return self.mongoPairRDD(connection_string, config).values()
File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 143, in mongoPairRDD
_ensure_pickles(self)
File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 80, in _ensure_pickles
orig_tb)
py4j.protocol.Py4JError
According to the Py4J documentation:
This exception is raised when an exception occurs in the Java client
code. For example, if you try to pop an element from an empty stack.
The instance of the Java exception thrown is stored in the
java_exception member.
Looking at the source code for pymongo_spark.py and the line throwing the error, it says
"Error while communicating with the JVM. Is the MongoDB Spark jar on
Spark's CLASSPATH? : "
In response, I have tried to make sure the right jars are being passed, but I might be doing this all wrong. See below:
$SPARK_HOME/bin/spark-submit --jars /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar --driver-class-path /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar --master local[4] ~/sparkPythonExample/SparkPythonExample.py
I have imported pymongo to the same python program to verify that I can at least access MongoDB using that, and I can.
I know there are quite a few moving parts here so if I can provide any more useful information please let me know.

Updates:
2016-07-04
Since the last update, the MongoDB Spark Connector has matured quite a lot. It provides up-to-date binaries and a data source based API, but it uses SparkConf configuration, so it is subjectively less flexible than Stratio/Spark-MongoDB.
2016-03-30
Since the original answer I found two different ways to connect to MongoDB from Spark:
mongodb/mongo-spark
Stratio/Spark-MongoDB
While the former seems to be relatively immature, the latter looks like a much better choice than the Mongo-Hadoop connector and provides a Spark SQL API.
# Adjust Scala and package version according to your setup
# although officially 0.11 supports only Spark 1.5
# I haven't encountered any issues on 1.6.1
bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.11:0.11.0
df = (sqlContext.read
.format("com.stratio.datasource.mongodb")
.options(host="mongo:27017", database="foo", collection="bar")
.load())
df.show()
## +---+----+--------------------+
## | x| y| _id|
## +---+----+--------------------+
## |1.0|-1.0|56fbe6f6e4120712c...|
## |0.0| 4.0|56fbe701e4120712c...|
## +---+----+--------------------+
It seems to be much more stable than mongo-hadoop-spark, supports predicate pushdown without static configuration and simply works.
The original answer:
Indeed, there are quite a few moving parts here. I tried to make it a little bit more manageable by building a simple Docker image which roughly matches described configuration (I've omitted Hadoop libraries for brevity though). You can find complete source on GitHub (DOI 10.5281/zenodo.47882) and build it from scratch:
git clone https://github.com/zero323/docker-mongo-spark.git
cd docker-mongo-spark
docker build -t zero323/mongo-spark .
or download an image I've pushed to Docker Hub (you can simply docker pull zero323/mongo-spark).
Start images:
docker run -d --name mongo mongo:2.6
docker run -i -t --link mongo:mongo zero323/mongo-spark /bin/bash
Start PySpark shell passing --jars and --driver-class-path:
pyspark --jars ${JARS} --driver-class-path ${SPARK_DRIVER_EXTRA_CLASSPATH}
And finally see how it works:
import pymongo
import pymongo_spark
mongo_url = 'mongodb://mongo:27017/'
client = pymongo.MongoClient(mongo_url)
client.foo.bar.insert_many([
{"x": 1.0, "y": -1.0}, {"x": 0.0, "y": 4.0}])
client.close()
pymongo_spark.activate()
rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
.map(lambda doc: (doc.get('x'), doc.get('y'))))
rdd.collect()
## [(1.0, -1.0), (0.0, 4.0)]
Please note that mongo-hadoop seems to close the connection after the first action. So calling for example rdd.count() after the collect will throw an exception.
Based on different problems I've encountered creating this image I tend to believe that passing mongo-hadoop-1.5.0-SNAPSHOT.jar and mongo-hadoop-spark-1.5.0-SNAPSHOT.jar to both --jars and --driver-class-path is the only hard requirement.
Notes:
This image is loosely based on jaceklaskowski/docker-spark
so please be sure to send some good karma to @jacek-laskowski if it helps.
If you don't require a development version including the new API, then using --packages is most likely a better option.
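The requirement above, passing the same jars to both --jars and --driver-class-path, is easy to get wrong because the two options use different separators: --jars takes a comma-separated list, while --driver-class-path is an ordinary JVM classpath. The sketch below builds both from one list. `mongo_hadoop_submit_args` is a hypothetical helper, and the jar paths are placeholders rather than paths from the Docker image.

```python
import os


def mongo_hadoop_submit_args(jar_paths, script, master="local[4]"):
    """Build a spark-submit argument list that passes the same jars twice.

    --jars takes a comma-separated list; --driver-class-path is an ordinary
    JVM classpath, so its entries are joined with os.pathsep (":" on Linux).
    """
    if not jar_paths:
        raise ValueError("at least one jar is required")
    return [
        "spark-submit",
        "--jars", ",".join(jar_paths),
        "--driver-class-path", os.pathsep.join(jar_paths),
        "--master", master,
        script,
    ]


# Placeholder paths for illustration only; substitute your own build output.
jars = [
    "/path/to/mongo-hadoop-1.5.0-SNAPSHOT.jar",
    "/path/to/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar",
]
print(" ".join(mongo_hadoop_submit_args(jars, "SparkPythonExample.py")))
```

Returning a list rather than a single string means the result can be handed to subprocess.run without shell quoting issues.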

Can you try using the --packages option instead of --jars ... in your spark-submit command:
spark-submit --packages org.mongodb.mongo-hadoop:mongo-hadoop-core:1.3.1,org.mongodb:mongo-java-driver:3.1.0 [REST OF YOUR OPTIONS]
Some of these jar files are not uber jars and need further dependencies to be downloaded before they can work.

I was having this same problem yesterday. I was able to fix it by placing mongo-java-driver.jar in $HADOOP_HOME/lib, and mongo-hadoop-core.jar and mongo-hadoop-spark.jar in $HADOOP_HOME/spark/classpath/emr (or any other folder that is in the $SPARK_CLASSPATH).
Let me know if that helps.

Good luck!
# see https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
from pyspark import SparkContext, SparkConf
import pymongo_spark
# Important: activate pymongo_spark.
pymongo_spark.activate()
def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)
    # Create an RDD backed by the MongoDB collection.
    # This RDD *does not* contain key/value pairs, just documents.
    # If you want key/value pairs, use the mongoPairRDD method instead.
    rdd = sc.mongoRDD('mongodb://localhost:27017/db.collection')
    # Save this RDD back to MongoDB as a different collection.
    rdd.saveToMongoDB('mongodb://localhost:27017/db.other.collection')
    # You can also read and write BSON:
    bson_rdd = sc.BSONFileRDD('/path/to/file.bson')
    bson_rdd.saveToBSON('/path/to/bson/output')
if __name__ == '__main__':
    main()

Starting KsqlRestApplication from Scala and getting NoSuchMethodError org.apache.kafka.streams.StreamsConfig.getConsumerConfigs

I am trying to write a program that lets me run predefined KSQL operations on Kafka topics from Scala, but I don't want to open the KSQL CLI every time. Therefore I want to start the KSQL "server" from within my Scala program. If I understand the KSQL source code correctly, I have to build and start a KsqlRestApplication:
def restServer = KsqlRestApplication.buildApplication(
  new KsqlRestConfig(defaultServerProperties),
  true,
  new VersionCheckerAgent {
    override def start(ksqlModuleType: KsqlModuleType, properties: Properties): Unit = ???
  })
But when I try doing that, I get the following error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.kafka.streams.StreamsConfig.getConsumerConfigs(Ljava/lang/String;Ljava/lang/String;)Ljava/util/Map;
at io.confluent.ksql.rest.server.BrokerCompatibilityCheck.create(BrokerCompatibilityCheck.java:62)
at io.confluent.ksql.rest.server.KsqlRestApplication.buildApplication(KsqlRestApplication.java:241)
I looked into the function call in BrokerCompatibilityCheck: its create function calls StreamsConfig.getConsumerConfigs() with two Strings as parameters instead of the parameters defined in
https://kafka.apache.org/0102/javadoc/org/apache/kafka/streams/StreamsConfig.html#getConsumerConfigs(StreamThread,%20java.lang.String,%20java.lang.String).
Are my KSQL and Kafka versions simply not compatible, or am I doing something wrong?
I am using KSQL version 4.1.0-SNAPSHOT and Kafka version 1.0.0.

Yes, NoSuchMethodError typically indicates a version incompatibility between libraries.
The link you posted is to the javadoc for Kafka 0.10.2. The method hasn't changed in 1.0, but in the upcoming 1.1 it indeed takes only 2 Strings:
https://kafka.apache.org/11/javadoc/org/apache/kafka/streams/StreamsConfig.html#getConsumerConfigs(java.lang.String,%20java.lang.String)
That suggests the version of KSQL you're using (4.1.0-SNAPSHOT) depends on version 1.1 of Kafka Streams, which is currently in the release candidate phase and, I believe, should be out soon:
https://lists.apache.org/thread.html/780c4458b16590e99261b69d7b41b9ec374a3226d72c8d38885a008a#%3Cusers.kafka.apache.org%3E
As per that email, you can find the latest (1.1.0-rc2) artifacts in the Apache staging repo:
https://repository.apache.org/content/groups/staging/
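The signature in a NoSuchMethodError is written in JVM descriptor notation, which is what makes it possible to tell that the caller expects the two-String overload. As an illustration, the sketch below decodes such a descriptor into readable type names. The function names are hypothetical, and the decoder covers only the standard descriptor grammar (primitives, object types, and arrays).

```python
_PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _read_type(desc, pos):
    """Decode one type starting at desc[pos]; return (name, next_pos)."""
    dims = 0
    while desc[pos] == "[":          # each "[" adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":             # object type: L<class name>;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:                            # single-letter primitive
        name = _PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def describe(descriptor):
    """Split a method descriptor into (parameter type names, return type name)."""
    if not descriptor.startswith("("):
        raise ValueError("not a method descriptor: %r" % descriptor)
    pos, params = 1, []
    while descriptor[pos] != ")":
        name, pos = _read_type(descriptor, pos)
        params.append(name)
    return_type, _ = _read_type(descriptor, pos + 1)
    return params, return_type


# Descriptor copied from the error message in the question.
params, ret = describe("(Ljava/lang/String;Ljava/lang/String;)Ljava/util/Map;")
print(params, "->", ret)
```

Comparing the decoded parameter list with the javadoc of the library version on the classpath shows whether the method being called exists there.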

Apache Spark: I always got org.apache.axis.AxisFault: (404)Not Found when using google-spark-adwords

I'm still a newbie in Apache Spark development.
I'm using Apache Spark to query data from Google AdWords using spark-google-adwords, but I always get org.apache.axis.AxisFault: (404)Not Found.
I'm using Scala 2.11 and the latest stable Apache Spark. I've tried to look for a solution to this problem, but I still couldn't find the cause.
Regards,

This issue was resolved by adding a copy of axis2.xml to the classpath and overriding a few connection manager params as follows:
HttpConnectionManagerParams params = new HttpConnectionManagerParams();
params.setDefaultMaxConnectionsPerHost(20); //SET VALUE BASED ON YOUR REQUIREMENTS/LOAD TESTING etc
MultiThreadedHttpConnectionManager multiThreadedHttpConnectionManager = new MultiThreadedHttpConnectionManager();
multiThreadedHttpConnectionManager.setParams(params);
HttpClient httpClient = new HttpClient(multiThreadedHttpConnectionManager);
ConfigurationContext configurationContext = ConfigurationContextFactory.createConfigurationContextFromFileSystem("**PATH TO COPY OF AXIS2.XML**");
configurationContext.setProperty(HTTPConstants.CACHED_HTTP_CLIENT, httpClient);
Credit: https://issues.apache.org/jira/browse/AXIS2-4807

How to check if ElasticSearch is running properly

I am new to Elasticsearch and I am facing issues while connecting to it. Details below:
The HQ plugin and the head plugin show different results:
[Screenshot: output of HQ plugin]
[Screenshot: output of head plugin]
When I try to connect from my Scala code, I get the following error:
org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes are available: []
at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:305)
at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:200)
at org.elasticsearch.client.transport.support.InternalTransportClient.execute(InternalTransportClient.java:106)
at org.elasticsearch.client.support.AbstractClient.index(AbstractClient.java:102)
at org.elasticsearch.client.transport.TransportClient.index(TransportClient.java:340)
at com.sksamuel.elastic4s.IndexDsl$IndexDefinitionExecutable$$anonfun$apply$1.apply(IndexDsl.scala:23)
at com.sksamuel.elastic4s.IndexDsl$IndexDefinitionExecutable$$anonfun$apply$1.apply(IndexDsl.scala:23)
at com.sksamuel.elastic4s.Executable$class.injectFuture(Executable.scala:21)
at com.sksamuel.elastic4s.IndexDsl$IndexDefinitionExecutable$.injectFuture(IndexDsl.scala:20)
at com.sksamuel.elastic4s.IndexDsl$IndexDefinitionExecutable$.apply(IndexDsl.scala:23)
at com.sksamuel.elastic4s.IndexDsl$IndexDefinitionExecutable$.apply(IndexDsl.scala:20)
at com.sksamuel.elastic4s.ElasticClient.execute(ElasticClient.scala:28)
Here is the code I use for the connection:
val settings = ImmutableSettings.settingsBuilder()
.put("cluster.name", "elasticsearch")
.build()
val client = ElasticClient.remote(settings, ElasticsearchClientUri("elasticsearch://10.50.xxx.xxx:9300"))
I also checked my connection, and I am able to telnet to 10.50.xxx.xxx successfully on both ports 9200 and 9300.
I read somewhere that the problem might be with http.cors, so I added the following lines to the /etc/elasticsearch/elasticsearch.yml file on the server:
http.cors.allow-origin: "/.*/"
http.cors.enabled: true
Please suggest what I am doing wrong.
-- Update --
Thanks @Evaldas Buinauskas, it was a version problem: I had installed Elasticsearch 2.0 but was using libraries and plugins for version 1.7. I downgraded Elasticsearch to version 1.7 and everything worked!

The issue comes from mismatched Elasticsearch, head plugin and Scala client versions.
Before 2.0, Elasticsearch still supported the deprecated _status endpoint (deprecated in 1.2.0).
Version 2.0 dropped it completely and replaced it with _recovery.
Neither the head plugin nor the Scala client had been upgraded, so both tried to call the dropped API.
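A version mismatch like this can be spotted up front by comparing the server's reported version with the client library version. The sketch below assumes the node's HTTP root (port 9200) returns JSON containing `version.number`, and that matching major versions is the compatibility rule you want to enforce. The helper names are hypothetical, and the client version must be supplied from your own build file.

```python
import json
from urllib.request import urlopen


def server_version(info):
    """Extract the version string from the JSON served at the HTTP root."""
    return info["version"]["number"]


def same_major(server, client):
    """True when server and client share a major version ("1.7.3" vs "1.7.0")."""
    return server.split(".")[0] == client.split(".")[0]


def fetch_info(base_url):
    """GET the root endpoint, e.g. http://<host>:9200/ (requires a reachable node)."""
    with urlopen(base_url) as response:
        return json.loads(response.read().decode("utf-8"))


# Offline example using a trimmed, made-up response body.
sample = {"cluster_name": "elasticsearch", "version": {"number": "2.0.0"}}
if not same_major(server_version(sample), "1.7.0"):
    print("client and server major versions differ")
```

Against a live cluster, `fetch_info` would replace the `sample` dictionary; it is not called here so that the example runs without a server.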

Apache Spark : JDBC connection not working

I have asked this question before but did not get any answer (Not able to connect to postgres using jdbc in pyspark shell).
I have successfully installed Spark 1.3.0 on my local Windows machine and ran sample programs to test it using the pyspark shell.
Now I want to run Correlations from MLlib on data that is stored in PostgreSQL, but I am not able to connect to PostgreSQL.
I have successfully added the required jar (I tested this jar) to the classpath by running
pyspark --jars "C:\path\to\jar\postgresql-9.2-1002.jdbc3.jar"
I can see that the jar is successfully added in the environment UI.
When I run the following in the pyspark shell:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")
I get this error:
>>> df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\pyspark\sql\context.py", line 482, in load
df = self._ssql_ctx.load(source, joptions)
File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.load.
: java.sql.SQLException: No suitable driver found for jdbc:postgresql://[host]/[dbname]
at java.sql.DriverManager.getConnection(DriverManager.java:602)
at java.sql.DriverManager.getConnection(DriverManager.java:207)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:94)
at org.apache.spark.sql.jdbc.JDBCRelation.<init> (JDBCRelation.scala:125)
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:114)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:290)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:667)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:619)

I had this exact problem with MySQL/MariaDB, and got a big clue from a related question.
So your pyspark command should be:
pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>
Also watch for errors when pyspark starts, such as "Warning: Local jar ... does not exist, skipping." and "ERROR SparkContext: Jar not found at ..."; these probably mean you spelled the path wrong.
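Because a misspelled jar path only produces a warning at startup, it can help to check the path before launching. The following sketch assembles the same three options for a single jar and fails early if the file is missing. `pyspark_jdbc_args` is a hypothetical helper, and the jar path in the usage lines is a placeholder.

```python
import os


def pyspark_jdbc_args(jar_path, master):
    """Return pyspark arguments that expose one JDBC jar to driver and executors."""
    if not os.path.isfile(jar_path):
        # Mirrors the "Local jar ... does not exist" warning, but fails early.
        raise FileNotFoundError("JDBC jar not found: %s" % jar_path)
    return [
        "pyspark",
        "--conf", "spark.executor.extraClassPath=%s" % jar_path,
        "--driver-class-path", jar_path,
        "--jars", jar_path,
        "--master", master,
    ]


# Placeholder path: prints the command if the jar exists, the error otherwise.
try:
    print(" ".join(pyspark_jdbc_args("/path/to/postgresql.jar", "local[*]")))
except FileNotFoundError as err:
    print(err)
```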

A slightly more elegant solution:
import java.util.Properties
val props = new Properties
props.put("driver", "org.postgresql.Driver")
sqlContext.read.jdbc("jdbc:postgresql://[host]/[dbname]", "[schema.table]", props)

As jake256 suggested, the
"driver", "org.postgresql.Driver"
key-value pair was missing. In my case, I launched pyspark as:
pyspark --jars /path/to/postgresql-9.4.1210.jar
with the following instructions:
from pyspark.sql import DataFrameReader
url = 'postgresql://192.168.2.4:5432/postgres'
properties = {'user': 'myUser', 'password': 'myPasswd', 'driver': 'org.postgresql.Driver'}
df = DataFrameReader(sqlContext).jdbc(
url='jdbc:%s' % url, table='weather', properties=properties
)
df.show()
+-------------+-------+-------+-----------+----------+
| city|temp_lo|temp_hi| prcp| date|
+-------------+-------+-------+-----------+----------+
|San Francisco| 46| 50| 0.25|1994-11-27|
|San Francisco| 43| 57| 0.0|1994-11-29|
| Hayward| 54| 37|0.239999995|1994-11-29|
+-------------+-------+-------+-----------+----------+
Tested on:
Ubuntu 16.04
PostgreSQL server version 9.5
PostgreSQL driver postgresql-9.4.1210.jar
Spark spark-2.0.0-bin-hadoop2.6 (I am confident it should also work on spark-2.0.0-bin-hadoop2.7)
Java JDK 1.8, 64-bit
Other JDBC drivers can be found at:
https://www.petefreitag.com/articles/jdbc_urls/
The tutorial I followed is at:
https://developer.ibm.com/clouddataservices/2015/08/19/speed-your-sql-queries-with-spark-sql/
A similar solution was also suggested in:
pyspark mysql jdbc load An error occurred while calling o23.load No suitable driver

This error seems to get thrown when you use the wrong version of JDBC driver. Check https://jdbc.postgresql.org/download.html to make sure that you have the right one.
Note in particular:
JDK 1.1 - JDBC 1. Note that with the 8.0 release JDBC 1 support has been removed, so look to update your JDK when you update your server.
JDK 1.2, 1.3 - JDBC 2.
JDK 1.3 + J2EE - JDBC 2 EE. This contains additional support for javax.sql classes.
JDK 1.4, 1.5 - JDBC 3. This contains support for SSL and javax.sql, but does not require J2EE as it has been added to the J2SE release.
JDK 1.6 - JDBC4. Support for JDBC4 methods is not complete, but the majority of methods are implemented.
JDK 1.7, 1.8 - JDBC41. Support for JDBC4 methods is not complete, but the majority of methods are implemented.
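The list above amounts to a lookup from JDK release to JDBC level. As a small sketch, it can be written as a table so that a `java.version` string is checked against the driver build you downloaded. The dictionary is transcribed from the list, and `jdbc_level` is a hypothetical helper name.

```python
# JDK release -> JDBC level, transcribed from the list above.
JDBC_LEVEL_BY_JDK = {
    "1.1": "JDBC 1",
    "1.2": "JDBC 2", "1.3": "JDBC 2",
    "1.4": "JDBC 3", "1.5": "JDBC 3",
    "1.6": "JDBC4",
    "1.7": "JDBC41", "1.8": "JDBC41",
}


def jdbc_level(java_version):
    """Map a java.version string such as "1.6.0_45" to the JDBC level listed above."""
    key = ".".join(java_version.split(".")[:2])
    if key not in JDBC_LEVEL_BY_JDK:
        raise KeyError("JDK %s is not covered by the list" % key)
    return JDBC_LEVEL_BY_JDK[key]


print(jdbc_level("1.6.0_45"))
```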

Just place your script after all the options on the command line.

That's pretty straightforward. To connect to an external database and retrieve data into Spark DataFrames, an additional jar file is required. E.g. with MySQL, the JDBC driver is required. Download the driver package and extract mysql-connector-java-x.yy.zz-bin.jar to a path that's accessible from every node in the cluster, preferably on a shared file system. E.g. with Pouta Virtual Cluster such a path would be under /shared_data; here I use /shared_data/thirdparty_jars/.
With direct Spark job submissions from the terminal, one can specify the --driver-class-path argument pointing to extra jars that should be provided to workers with the job. However, this does not work with this approach, so we must configure these paths for the front end and worker nodes in the spark-defaults.conf file, usually in the /opt/spark/conf directory.
spark.driver.extraClassPath /"your-path"/mysql-connector-java-5.1.35-bin.jar
spark.executor.extraClassPath /"your-path"/mysql-connector-java-5.1.35-bin.jar
