Connecting Azure Databricks to a Cosmos DB MongoDB API database - mongodb

I am trying to connect a Python notebook in an Azure Databricks cluster to a Cosmos DB MongoDB API database.
I'm using the MongoDB connector 2.11.2.4.2 with Python 3.
My code is as follows:
ReadConfig = {
    "Endpoint": "https://<my_name>.mongo.cosmos.azure.com:443/",
    "Masterkey": "<my_key>",
    "Database": "database",
    "preferredRegions": "West US 2",
    "Collection": "collection1",
    "schema_samplesize": "1000",
    "query_pagesize": "200000",
    "query_custom": "SELECT * FROM c"
}
df = spark.read.format("mongo").options(**ReadConfig).load()
df.createOrReplaceTempView("dfSQL")
The error I get is: Could not initialize class com.mongodb.spark.config.ReadConfig$.
How can I work this out?

Answer to my own question.
Using Maven as the source, I installed the right library on my cluster using the coordinates
org.mongodb.spark:mongo-spark-connector_2.11:2.4.0
for Spark 2.4.
An example of the code I used is as follows (for those who want to try):
# Read configuration
readConfig = {
    "URI": "<URI>",
    "Database": "<database>",
    "Collection": "<collection>",
    "ReadingBatchSize": "<batchSize>"
}
pipelineAccounts = "{'$sort' : {'account_contact': 1}}"
# Connect via the MongoDB Spark connector to create a Spark DataFrame
accountsTest = (spark.read
    .format("com.mongodb.spark.sql")
    .options(**readConfig)
    .option("pipeline", pipelineAccounts)
    .load())
accountsTest.select("account_id").show()
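For those who, like the original question, want to query the result with Spark SQL, the DataFrame can then be registered as a temporary view (the view name dfSQL is taken from the question and is just an example):
accountsTest.createOrReplaceTempView("dfSQL")
spark.sql("SELECT account_id FROM dfSQL").show()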

Make sure you are using the latest Azure Cosmos DB Spark connector.
Download the latest azure-cosmosdb-spark library for the version of Apache Spark you are running:
Spark 2.4: azure-cosmosdb-spark_2.4.0_2.11-2.1.2-uber.jar
Spark 2.3: azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar
Spark 2.2: azure-cosmosdb-spark_2.2.0_2.11-1.1.1-uber.jar
Upload the downloaded JAR files to Databricks following the instructions in Upload a Jar, Python Egg, or Python Wheel.
Install the uploaded libraries into your Databricks cluster.
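For reference, a minimal sketch of reading through azure-cosmosdb-spark from a Python notebook (not part of the original answer; note that this connector targets the Cosmos DB Core/SQL API, and all endpoint, key, database and collection values below are placeholders):
# Placeholder Cosmos DB account details for the azure-cosmosdb-spark (Core/SQL API) connector
cosmosConfig = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": "<primary-key>",
    "Database": "<database>",
    "Collection": "<collection>",
    "query_custom": "SELECT * FROM c"
}
dfCosmos = (spark.read
    .format("com.microsoft.azure.cosmosdb.spark")
    .options(**cosmosConfig)
    .load())
dfCosmos.show()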
Reference: Azure Databricks - Azure Cosmos DB

Related

Delta Live Tables with EventHub

I am trying to create a stream from Event Hubs using Delta Live Tables, but I am having trouble installing the library. Is it possible to install a Maven library with Delta Live Tables using %sh/%pip?
I would like to install
com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17
https://learn.microsoft.com/pl-pl/azure/databricks/spark/latest/structured-streaming/streaming-event-hubs
Right now it's not possible to use external connectors/Java libraries with Delta Live Tables. But for Event Hubs there is a workaround: you can connect to Event Hubs using the built-in Kafka connector; you just need to specify the correct options, as described in the documentation:
import dlt

@dlt.table
def eventhubs():
    # Event Hubs connection string (elided here); passed as the SASL password below
    readConnectionString = "Endpoint=sb://<....>.windows.net/;?.."
    eh_sasl = f'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="{readConnectionString}";'
    kafka_options = {
        "kafka.bootstrap.servers": "<eh-ns-name>.servicebus.windows.net:9093",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.security.protocol": "SASL_SSL",
        "kafka.request.timeout.ms": "60000",
        "kafka.session.timeout.ms": "30000",
        "startingOffsets": "earliest",
        "kafka.sasl.jaas.config": eh_sasl,
        "subscribe": "<topic-name>",
    }
    return (spark.readStream.format("kafka")
            .options(**kafka_options).load())
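As a hedged follow-up (not part of the original answer), the records arrive with a binary Kafka value column; a second DLT table could decode it to a string for downstream parsing:
import dlt
from pyspark.sql.functions import col

@dlt.table
def eventhubs_decoded():
    # Cast the raw Kafka payload to a string so it can be parsed further downstream
    return dlt.read_stream("eventhubs").select(col("value").cast("string").alias("body"))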

Confluent Cloud with MongoDB sink connector is not working

I'm trying to connect Confluent Kafka Connect with the MongoDB sink connector, but it's not working as expected; it throws a NullPointerException.
My Confluent development environment is running in a GCP VM instance.
I installed the connector with "confluent-hub install mongodb/kafka-connect-mongodb:latest".
Below is my sink configuration.
{
    "name": "today-menu-sink",
    "config": {
        "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
        "tasks.max": "1",
        "topics": "newuser",
        "connection.uri": "mongodb+srv://*********************.mongodb.net",
        "database": "BigBoxStore",
        "collection": "users",
        "key.converter": "org.apache.kafka.connect.json.JsonConverter",
        "key.converter.schemas.enable": false,
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://*******:8081",
        "value.converter.schemas.enable": true
    }
}
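Not from the original post, but one way to narrow down such a NullPointerException is to ask the Kafka Connect REST API for the connector's task status and stack trace; a minimal Python sketch, assuming Connect listens on localhost:8083:
import requests

# Query the Kafka Connect REST API for the connector's status; failed tasks carry a "trace" field
status = requests.get("http://localhost:8083/connectors/today-menu-sink/status").json()
print(status["connector"]["state"])
for task in status["tasks"]:
    print(task["id"], task["state"], task.get("trace", ""))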

Unable to access the Hive warehouse directory with Spark

I'm trying to connect with Spark in IntelliJ to the Hive warehouse directory, which is located at the following path:
hdfs://localhost:9000/user/hive/warehouse
In order to do this, I'm using the following code:
import org.apache.spark.sql.SparkSession

// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = "hdfs://localhost:9000/user/hive/warehouse"

val spark = SparkSession
  .builder()
  .appName("Spark Hive Local Connector")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .config("spark.master", "local")
  .enableHiveSupport()
  .getOrCreate()

spark.catalog.listDatabases().show(false)
spark.catalog.listTables().show(false)
spark.conf.getAll.mkString("\n")

import spark.implicits._
import spark.sql

sql("USE test")
sql("SELECT * FROM test.employee").show()
As you can see, I have created a database 'test' and a table 'employee' in this database using the Hive console. I want to get the result of the last query.
The 'spark.catalog.*' and 'spark.conf.*' calls are used to print the warehouse path and database settings.
spark.catalog.listDatabases().show(false) gives me:
name : default
description : Default Hive database
locationUri : hdfs://localhost:9000/user/hive/warehouse
spark.catalog.listTables().show(false) gives me an empty result, so something is wrong at this step.
At the end of the execution of the job, I obtained the following error:
> Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'test' not found;
I have also configured the hive-site.xml file for the Hive warehouse location:
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>hdfs://localhost:9000/user/hive/warehouse</value>
</property>
I have already created the database 'test' using the Hive console.
Below are the versions of my components:
Spark : 2.2.0
Hive : 1.1.0
Hadoop : 2.7.3
Any ideas?
Create a resources directory under src in your IntelliJ project and copy the Hive conf files into this folder. Build the project. Make sure the hive.metastore.warehouse.dir path is defined correctly; refer to hive-site.xml. If the log shows INFO metastore: Connected to metastore, then you are good to go. Example.
Note that connecting from IntelliJ and running the job will be slower than packaging the JAR and running it on your Hadoop cluster.
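As an illustration of the same idea (not from the answer, and in PySpark rather than Scala), a sketch that points the session at the metastore and warehouse explicitly; the thrift URI below is an assumed placeholder:
from pyspark.sql import SparkSession

# Point Spark at the Hive warehouse and metastore explicitly (placeholder/assumed values)
spark = (SparkSession.builder
    .appName("Spark Hive Local Connector")
    .master("local[*]")
    .config("spark.sql.warehouse.dir", "hdfs://localhost:9000/user/hive/warehouse")
    .config("hive.metastore.uris", "thrift://localhost:9083")  # assumed metastore URI
    .enableHiveSupport()
    .getOrCreate())

spark.sql("SHOW DATABASES").show(truncate=False)
spark.sql("SELECT * FROM test.employee").show()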

SparkSession returns nothing with a HiveServer2 connection through JDBC

I have an issue reading data from a remote HiveServer2 using JDBC and SparkSession in Apache Zeppelin.
Here is the code:
%spark
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
val prop = new java.util.Properties
prop.setProperty("user","hive")
prop.setProperty("password","hive")
prop.setProperty("driver", "org.apache.hive.jdbc.HiveDriver")
val test = spark.read.jdbc("jdbc:hive2://xxx.xxx.xxx.xxx:10000/", "tests.hello_world", prop)
test.select("*").show()
When I run this, I get no errors but no data either; I just retrieve the column names of the table, like this:
+--------------+
|hello_world.hw|
+--------------+
+--------------+
Instead of this:
+--------------+
|hello_world.hw|
+--------------+
+ data_here +
+--------------+
I'm running all of this on:
Scala 2.11.8,
OpenJDK 8,
Zeppelin 0.7.0,
Spark 2.1.0 (bde/spark),
Hive 2.1.1 (bde/hive)
I run this setup in Docker; each of these components has its own container, but they are connected to the same network.
Furthermore, it works when I use Beeline from Spark to connect to my remote Hive.
Have I forgotten something?
Any help would be appreciated.
Thanks in advance.
EDIT:
I've found a workaround, which is to share a Docker volume or data container between Spark and Hive (more precisely, the Hive warehouse folder) and to configure spark-defaults.conf. Then you can access Hive through SparkSession without JDBC. Here is how to do it step by step:
Share the Hive warehouse folder between Spark and Hive.
Configure spark-defaults.conf like this:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory Xg
spark.driver.cores X
spark.executor.memory Xg
spark.executor.cores X
spark.sql.warehouse.dir file:///your/path/here
Replace 'X' with your values.
Hope it helps.
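With the shared warehouse folder and spark.sql.warehouse.dir configured as above, a minimal PySpark sketch of reading the same table without JDBC (table name taken from the question):
from pyspark.sql import SparkSession

# Hive tables in the shared warehouse are visible directly to a Hive-enabled SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("SELECT * FROM tests.hello_world").show()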

Connecting to a PostgreSQL database using JDBC from the Bluemix Apache Spark service

I have a problem connecting to my PostgreSQL 8.4 database using the Apache Spark service on Bluemix.
My code is:
%AddJar https://jdbc.postgresql.org/download/postgresql-8.4-703.jdbc4.jar -f
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.load("jdbc", Map("url" -> "jdbc:postgresql://<ip_address>:5432/postgres?
user=postgres&password=<password>", "dbtable" -> "table_name"))
And I get the error:
Name: java.sql.SQLException
Message: No suitable driver found for jdbc:postgresql://:5432/postgres?user=postgres&password=
I've read around and it seems I need to add the JDBC driver to the Spark class path. I've no idea how to do this in the Bluemix Apache Spark service.
There is currently an issue with adding JDBC drivers to Bluemix Apache Spark. The team is working to resolve it. You can follow the progress here:
https://developer.ibm.com/answers/questions/248803/connecting-to-postgresql-db-using-jdbc-from-bluemi.html
Possibly have a look here? I believe the load() function is deprecated in Spark 1.4 [source].
You could try this instead:
val url = "jdbc:postgresql://:5432/postgres"
val prop = new java.util.Properties
prop.setProperty("user","postgres")
prop.setProperty("password","xxxxxx")
val table = sqlContext.read.jdbc(url,"table_name",prop)
The URL may or may not require the completed version, i.e.:
jdbc:postgresql://:5432/postgres?user=postgres&password=password
This worked for me on Bluemix
%AddJar https://jdbc.postgresql.org/download/postgresql-9.4.1208.jar -f
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
val df = sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:postgresql://:/",
  "user" -> "",
  "password" -> "",
  "dbtable" -> "",
  "driver" -> "org.postgresql.Driver")).load()
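For comparison, a hedged PySpark equivalent of the same JDBC read (all connection values below are placeholders, and the PostgreSQL JDBC driver JAR still has to be attached):
# Placeholder connection values; adjust host, port, database, table and credentials
df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/postgres")
    .option("dbtable", "table_name")
    .option("user", "postgres")
    .option("password", "<password>")
    .option("driver", "org.postgresql.Driver")
    .load())
df.show()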