I am trying to connect to Snowflake from a Databricks notebook using the externalbrowser authenticator, but without any success.
CMD1
sfOptions = {
"sfURL" : "xxxxx.west-europe.azure.snowflakecomputing.com",
"sfAccount" : "xxxxx",
"sfUser" : "ivan.lorencin#xxxxx",
"authenticator" : "externalbrowser",
"sfPassword" : "xxxxx",
"sfDatabase" : "DWH_PROD",
"sfSchema" : "APLSDB",
"sfWarehouse" : "SNOWFLAKExxxxx",
"tracing" : "ALL",
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
CMD2
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "select 1 as my_num union all select 2 as my_num") \
.load()
CMD2 never completes; it just shows ".. Running command ..." forever.
Can anybody tell me what is going wrong here? How can I establish the connection?
It looks like you're setting authenticator to externalbrowser, but according to the docs it should be sfAuthenticator - is this intentional? If you are trying to do an OAuth type of auth, why do you also have a password?
If your account/user requires OAuth to log in, I'd remove the sfPassword entry from sfOptions, rename that option to sfAuthenticator, and try again.
If that does not work, you should ensure that your Spark cluster can reach all the required Snowflake hosts (see SnowCD for assistance).
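For reference, a minimal sketch of what the corrected options could look like, based on the suggestion above (renaming the option to sfAuthenticator and dropping the password are the only changes; all values are the placeholders from the question):
sfOptions = {
    "sfURL" : "xxxxx.west-europe.azure.snowflakecomputing.com",
    "sfAccount" : "xxxxx",
    "sfUser" : "ivan.lorencin#xxxxx",
    "sfAuthenticator" : "externalbrowser",   # was "authenticator"; sfPassword removed
    "sfDatabase" : "DWH_PROD",
    "sfSchema" : "APLSDB",
    "sfWarehouse" : "SNOWFLAKExxxxx",
    "tracing" : "ALL",
}

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", "select 1 as my_num union all select 2 as my_num") \
    .load()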
I am using the Spark setup below. Details:
mongo-spark-connector:10.0.5
Spark version 3.1.3
And I configure the mongo-spark connector as follows:
spark = SparkSession.builder \
.appName("hello") \
.master("yarn") \
.config("spark.executor.memory", "4g") \
.config('spark.driver.memory', '2g') \
.config('spark.driver.cores', '4') \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector:10.0.5') \
.config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.27.1.jar') \
.enableHiveSupport() \
.getOrCreate()
My question is: how do I update and replace values in a MongoDB database?
I read the question Updating mongoData with MongoSpark, but that approach works for mongo-spark v2.x; with mongo-spark v10 and above it fails.
Example:
I have the following attributes:
from bson.objectid import ObjectId
data = {
'_id' : ObjectId("637367d5262dc89a8e318d09"),
'database' : database_name,
"table" : table,
"latestSyncAt": lastestSyncAt,
"lastest_id" : str(lastest_id)
}
df = spark.createDataFrame([data])  # createDataFrame expects a list of records, not a bare dict
How do I update or replace a document by its _id in MongoDB using the mongo-spark connector?
Thank you very much for your support.
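In case it helps, below is a minimal, untested sketch of what a replace-style write looks like with the v10 connector, using its write options operationType and upsertDocument; "mongodb" is the v10 data source name, and the connection URI, database and collection values are placeholders:
# Sketch only: replace matching documents (the connector matches on _id by default),
# inserting the document if no match exists. Values in <> are placeholders.
(df.write
    .format("mongodb")
    .mode("append")
    .option("connection.uri", "<connection-uri>")
    .option("database", "<db>")
    .option("collection", "<collection>")
    .option("operationType", "replace")   # replace instead of insert
    .option("upsertDocument", "true")     # insert when no match is found
    .save())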
I am new to using Snowpark, recently released by Snowflake. I am using IntelliJ to build UDFs (user-defined functions). However, I am struggling to get IntelliJ to connect to Snowflake through a proxy. Below are a few things I have already tried:
putting the proxy settings in IntelliJ (under Preferences)
adding the proxy before building the session:
System.setProperty("https.useProxy", "true")
System.setProperty("http.proxyHost", "xxxxxxx")
System.setProperty("http.proxyPort", "443")
System.setProperty("no_proxy", "snowflakecomputing.com")
Below is my code -
val configs = Map (
"URL" -> "xxxxx.snowflakecomputing.com:443",
"USER" -> "xxx",
"PASSWORD" -> "xxxx",
"ROLE" -> "ROLE_xxxxx",
"WAREHOUSE" -> "xxxx",
"DB" -> "xxxx",
"SCHEMA" -> "xxxx",
)
val session = Session.builder.configs(configs).create
Snowpark uses the JDBC driver to connect to Snowflake, therefore the proxy properties from the JDBC connector can be used here as well.
In your Map add:
"proxyHost" -> "proxyHost Value"
"proxyPort" -> "proxyPort Value"
More information here
If you're specifying a proxy by setting Java system properties, then you can call System.setProperty, like:
System.setProperty("http.useProxy", "true");
System.setProperty("http.proxyHost", "proxyHost Value");
System.setProperty("http.proxyPort", "proxyPort Value");
System.setProperty("https.proxyHost", "proxyHost HTTPS Value");
System.setProperty("https.proxyPort", ""proxyPort HTTPS Value"")
or pass them directly to the JVM, like:
-Dhttp.useProxy=true
-Dhttps.proxyHost=<proxy_host>
-Dhttp.proxyHost=<proxy_host>
-Dhttps.proxyPort=<proxy_port>
-Dhttp.proxyPort=<proxy_port>
More information here
How can I find the number of Request Units (RUs) consumed when using the "azure-cosmosdb-spark" connector?
And is it possible to keep reads from exceeding the provisioned RUs?
When querying Azure Cosmos DB from Databricks using the "azure-cosmosdb-spark" connector,
the load exceeds the provisioned RUs on the Cosmos DB side.
How can I restrict load() so that it does not exceed the provisioned RUs?
I didn't find a config option for this in the documentation:
https://github.com/Azure/azure-cosmosdb-spark/wiki/Configuration-references
READ_COSMOS_SPARK_CONN = {
"Endpoint" : "---",
"Masterkey" : --,
"Database" : "--",
"Collection" : "--",
}
READ_COSMOS_SPARK_CONFIG = {
**READ_COSMOS_SPARK_CONN,
"query_pagesize" : "2147483647",
"SamplingRatio" : "1.0",
"schema_samplesize" : "10",
}
df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**READ_COSMOS_SPARK_CONFIG).load()
dfResult = (
df
.filter(sf.col('timestamp') >= "2021-02-21T00:00:00")
.filter(sf.col('timestamp') < "2021-02-22T00:00:00")
)
display(dfResult)
Note:
I tried "query_custom", but it didn't help; it actually increased my RU consumption.
I can't query on the partition key, for complex reasons.
But I have applied an index on "timestamp".
I am trying to write data from Spark (using Databricks) to Azure Cosmos DB (MongoDB API). There are no errors when executing the notebook, but I get the error below when querying the collection.
I have used the jar from the Databricks website, azure-cosmosdb-spark_2.4.0_2.11-2.1.2-uber.jar. My Databricks runtime version is 6.5 (includes Apache Spark 2.4.5, Scala 2.11).
import org.joda.time.format._
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark.CosmosDBSpark
import com.microsoft.azure.cosmosdb.spark.config.Config
import org.apache.spark.sql.functions._
val configMap = Map(
"Endpoint" -> "https://******.documents.azure.com:***/",
"Masterkey" -> "****==",
"Database" -> "dbname",
"Collection" -> "collectionname"
)
val config = Config(configMap)
val df = spark.sql("select id,name,address from emp_db.employee")
CosmosDBSpark.save(df, config)
When I query the collection, I get the response below:
Error: error: {
"_t" : "OKMongoResponse",
"ok" : 0,
"code" : 1,
"errmsg" : "Unknown server error occurred when processing this request.",
"$err" : "Unknown server error occurred when processing this request."
}
Any help would be much appreciated. Thank you!!!
That error suggests you are using Cosmos DB with the MongoDB API.
The Spark connector for Cosmos DB only supports accounts that use the SQL API.
Instead, you should use the MongoDB connector:
https://learn.microsoft.com/en-us/azure/cosmos-db/spark-connector
Use this instead https://docs.mongodb.com/spark-connector/master/
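For example, a rough sketch of the same write going through the MongoDB Spark connector (the 2.x line, which matches Spark 2.4 / Scala 2.11); shown in PySpark, the Scala API is analogous, and the connection string, database and collection values are placeholders:
# Sketch only: write through the MongoDB connector instead of azure-cosmosdb-spark.
# <cosmos-mongo-connection-string> is the MongoDB connection string from the Cosmos DB portal.
df = spark.sql("select id,name,address from emp_db.employee")

(df.write
    .format("com.mongodb.spark.sql.DefaultSource")
    .mode("append")
    .option("uri", "<cosmos-mongo-connection-string>")
    .option("database", "dbname")
    .option("collection", "collectionname")
    .save())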
I have a PySpark file stored on S3. I am trying to run it using the Spark REST API.
I am running the following command:
curl -X POST http://<ip-address>:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
"action" : "CreateSubmissionRequest",
"appArgs" : [ "testing.py"],
"appResource" : "s3n://accessKey:secretKey/<bucket-name>/testing.py",
"clientSparkVersion" : "1.6.1",
"environmentVariables" : {
"SPARK_ENV_LOADED" : "1"
},
"mainClass" : "org.apache.spark.deploy.SparkSubmit",
"sparkProperties" : {
"spark.driver.supervise" : "false",
"spark.app.name" : "Simple App",
"spark.eventLog.enabled": "true",
"spark.submit.deployMode" : "cluster",
"spark.master" : "spark://<ip-address>:6066",
"spark.jars" : "spark-csv_2.10-1.4.0.jar",
"spark.jars.packages" : "com.databricks:spark-csv_2.10:1.4.0"
}
}'
and the testing.py file has a code snippet:
from pyspark.sql import SQLContext  # import needed by the snippet

myContext = SQLContext(sc)
format = "com.databricks.spark.csv"
dataFrame1 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location1).repartition(1)
dataFrame2 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location2).repartition(1)
outDataFrame = dataFrame1.join(dataFrame2, dataFrame1.values == dataFrame2.valuesId)
outDataFrame.write.format(format).option("header", "true").option("nullValue","").save(outLocation)
But on this line:
dataFrame1 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location1).repartition(1)
I get exception:
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
I was trying different things out, and one of them was to log into the <ip-address> machine and run this command:
./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
so that it would download spark-csv into the .ivy2/cache folder. But that didn't solve the problem. What am I doing wrong?
(Posted on behalf of the OP).
I first added spark-csv_2.10-1.4.0.jar on the driver and worker machines, and added:
"spark.driver.extraClassPath" : "absolute/path/to/spark-csv_2.10-1.4.0.jar",
"spark.executor.extraClassPath" : "absolute/path/to/spark-csv_2.10-1.4.0.jar",
Then I got the following error:
java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
Caused by: java.lang.ClassNotFoundException: org.apache.commons.csv.CSVFormat
And then I added commons-csv-1.4.jar on both machines and added:
"spark.driver.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
"spark.executor.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
And that solved my problem.
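Putting it together, the sparkProperties section of the submission request ends up looking roughly like this (the jar paths are placeholders for wherever the jars actually live on the driver and worker machines):
"sparkProperties" : {
    "spark.driver.supervise" : "false",
    "spark.app.name" : "Simple App",
    "spark.eventLog.enabled" : "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://<ip-address>:6066",
    "spark.jars" : "spark-csv_2.10-1.4.0.jar",
    "spark.jars.packages" : "com.databricks:spark-csv_2.10:1.4.0",
    "spark.driver.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
    "spark.executor.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar"
}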