I am trying to connect to Snowflake from a Databricks notebook using the externalbrowser authenticator, but without any success.
CMD1
sfOptions = {
"sfURL" : "xxxxx.west-europe.azure.snowflakecomputing.com",
"sfAccount" : "xxxxx",
"sfUser" : "ivan.lorencin#xxxxx",
"authenticator" : "externalbrowser",
"sfPassword" : "xxxxx",
"sfDatabase" : "DWH_PROD",
"sfSchema" : "APLSDB",
"sfWarehouse" : "SNOWFLAKExxxxx",
"tracing" : "ALL",
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
CMD2
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "select 1 as my_num union all select 2 as my_num") \
.load()
CMD2 never completes; it just shows ".. Running command ..." forever.
Can anybody tell me what is going wrong here? How can I establish the connection?
It looks like you're setting authenticator to externalbrowser, but according to the docs it should be sfAuthenticator - is this intentional? If you are trying to do an OAuth type of auth, why do you also have a password?
If your account/user requires OAuth to log in, I'd remove the sfPassword entry from sfOptions, rename that option to sfAuthenticator, and try again.
If that does not work, you should ensure that your Spark cluster can reach all the required Snowflake hosts (see SnowCD for assistance).
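For reference, a minimal sketch of what the corrected options could look like, based on the suggestion above (renaming the option to sfAuthenticator and dropping the password are the only changes; all values are the placeholders from the question):
sfOptions = {
    "sfURL" : "xxxxx.west-europe.azure.snowflakecomputing.com",
    "sfAccount" : "xxxxx",
    "sfUser" : "ivan.lorencin#xxxxx",
    "sfAuthenticator" : "externalbrowser",   # was "authenticator"; sfPassword removed
    "sfDatabase" : "DWH_PROD",
    "sfSchema" : "APLSDB",
    "sfWarehouse" : "SNOWFLAKExxxxx",
    "tracing" : "ALL",
}

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", "select 1 as my_num union all select 2 as my_num") \
    .load()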
I am using the Spark setup below. Details:
mongo-spark-connector:10.0.5
Spark version 3.1.3
And I configure the mongo-spark connector as follows:
spark = SparkSession.builder \
.appName("hello") \
.master("yarn") \
.config("spark.executor.memory", "4g") \
.config('spark.driver.memory', '2g') \
.config('spark.driver.cores', '4') \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector:10.0.5') \
.config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.27.1.jar') \
.enableHiveSupport() \
.getOrCreate()
My question is: how do I update and replace values in a MongoDB database?
I read the question Updating mongoData with MongoSpark, but that approach works for mongo-spark v2.x; with mongo-spark v10 and above it fails.
Example:
I have the following attributes:
from bson.objectid import ObjectId
data = {
'_id' : ObjectId("637367d5262dc89a8e318d09"),
'database' : database_name,
"table" : table,
"latestSyncAt": lastestSyncAt,
"lastest_id" : str(lastest_id)
}
df = spark.createDataFrame([data])  # createDataFrame expects a list of records, not a bare dict
How do I update or replace a document by its _id in MongoDB using the mongo-spark connector?
Thank you very much for your support.
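In case it helps, below is a minimal, untested sketch of what a replace-style write looks like with the v10 connector, using its write options operationType and upsertDocument; "mongodb" is the v10 data source name, and the connection URI, database and collection values are placeholders:
# Sketch only: replace matching documents (the connector matches on _id by default),
# inserting the document if no match exists. Values in <> are placeholders.
(df.write
    .format("mongodb")
    .mode("append")
    .option("connection.uri", "<connection-uri>")
    .option("database", "<db>")
    .option("collection", "<collection>")
    .option("operationType", "replace")   # replace instead of insert
    .option("upsertDocument", "true")     # insert when no match is found
    .save())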
I am new to using Snowpark, recently released by Snowflake. I am using IntelliJ to build UDFs (user-defined functions). However, I am struggling to get IntelliJ to connect to Snowflake through a proxy. Below are a few things I have already tried:
putting the proxy settings in IntelliJ (under Preferences)
adding the proxy before building the session:
System.setProperty("https.useProxy", "true")
System.setProperty("http.proxyHost", "xxxxxxx")
System.setProperty("http.proxyPort", "443")
System.setProperty("no_proxy", "snowflakecomputing.com")
Below is my code -
val configs = Map (
"URL" -> "xxxxx.snowflakecomputing.com:443",
"USER" -> "xxx",
"PASSWORD" -> "xxxx",
"ROLE" -> "ROLE_xxxxx",
"WAREHOUSE" -> "xxxx",
"DB" -> "xxxx",
"SCHEMA" -> "xxxx",
)
val session = Session.builder.configs(configs).create
Snowpark uses the JDBC driver to connect to Snowflake, therefore the proxy properties from the JDBC connector can be used here as well.
In your Map add:
"proxyHost" -> "proxyHost Value"
"proxyPort" -> "proxyPort Value"
More information here
If you're specifying a proxy by setting Java system properties, then you can call System.setProperty, like:
System.setProperty("http.useProxy", "true");
System.setProperty("http.proxyHost", "proxyHost Value");
System.setProperty("http.proxyPort", "proxyPort Value");
System.setProperty("https.proxyHost", "proxyHost HTTPS Value");
System.setProperty("https.proxyPort", ""proxyPort HTTPS Value"")
or pass them directly to the JVM, like:
-Dhttp.useProxy=true
-Dhttps.proxyHost=<proxy_host>
-Dhttp.proxyHost=<proxy_host>
-Dhttps.proxyPort=<proxy_port>
-Dhttp.proxyPort=<proxy_port>
More information here
How can I find the number of Request Units (RUs) consumed when using the "azure-cosmosdb-spark" connector?
And is it possible to keep reads from exceeding the provisioned RUs?
When querying Azure Cosmos DB from Databricks using the "azure-cosmosdb-spark" connector,
the load exceeds the provisioned RUs on the Cosmos DB side.
How can I restrict load() so that it does not exceed the provisioned RUs?
I didn't find a config option for this in the documentation:
https://github.com/Azure/azure-cosmosdb-spark/wiki/Configuration-references
READ_COSMOS_SPARK_CONN = {
"Endpoint" : "---",
"Masterkey" : --,
"Database" : "--",
"Collection" : "--",
}
READ_COSMOS_SPARK_CONFIG = {
**READ_COSMOS_SPARK_CONN,
"query_pagesize" : "2147483647",
"SamplingRatio" : "1.0",
"schema_samplesize" : "10",
}
df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**READ_COSMOS_SPARK_CONFIG).load()
dfResult = (
df
.filter(sf.col('timestamp') >= "2021-02-21T00:00:00")
.filter(sf.col('timestamp') < "2021-02-22T00:00:00")
)
display(dfResult)
Note:
I tried "query_custom", but it didn't help; it actually increased my RU consumption.
I can't query on the partition key, for complex reasons.
But I have applied an index on "timestamp".
I am trying to write data from Spark (using Databricks) to Azure Cosmos DB (MongoDB API). There are no errors when executing the notebook, but I get the error below when querying the collection.
I have used the jar from the Databricks website, azure-cosmosdb-spark_2.4.0_2.11-2.1.2-uber.jar. My Databricks runtime version is 6.5 (includes Apache Spark 2.4.5, Scala 2.11).
import org.joda.time.format._
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark.CosmosDBSpark
import com.microsoft.azure.cosmosdb.spark.config.Config
import org.apache.spark.sql.functions._
val configMap = Map(
"Endpoint" -> "https://******.documents.azure.com:***/",
"Masterkey" -> "****==",
"Database" -> "dbname",
"Collection" -> "collectionname"
)
val config = Config(configMap)
val df = spark.sql("select id,name,address from emp_db.employee")
CosmosDBSpark.save(df, config)
When I query the collection, I get the response below:
Error: error: {
"_t" : "OKMongoResponse",
"ok" : 0,
"code" : 1,
"errmsg" : "Unknown server error occurred when processing this request.",
"$err" : "Unknown server error occurred when processing this request."
}
Any help would be much appreciated. Thank you!!!
That error suggests you are using Cosmos DB with the MongoDB API.
The Spark connector for Cosmos DB only supports accounts that use the SQL API.
Instead, you should use the MongoDB connector:
https://learn.microsoft.com/en-us/azure/cosmos-db/spark-connector
Use this instead https://docs.mongodb.com/spark-connector/master/
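For example, a rough sketch of the same write going through the MongoDB Spark connector (the 2.x line, which matches Spark 2.4 / Scala 2.11); shown in PySpark, the Scala API is analogous, and the connection string, database and collection values are placeholders:
# Sketch only: write through the MongoDB connector instead of azure-cosmosdb-spark.
# <cosmos-mongo-connection-string> is the MongoDB connection string from the Cosmos DB portal.
df = spark.sql("select id,name,address from emp_db.employee")

(df.write
    .format("com.mongodb.spark.sql.DefaultSource")
    .mode("append")
    .option("uri", "<cosmos-mongo-connection-string>")
    .option("database", "dbname")
    .option("collection", "collectionname")
    .save())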
I have a PySpark file stored on S3. I am trying to run it using the Spark REST API.
I am running the following command:
curl -X POST http://<ip-address>:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
"action" : "CreateSubmissionRequest",
"appArgs" : [ "testing.py"],
"appResource" : "s3n://accessKey:secretKey/<bucket-name>/testing.py",
"clientSparkVersion" : "1.6.1",
"environmentVariables" : {
"SPARK_ENV_LOADED" : "1"
},
"mainClass" : "org.apache.spark.deploy.SparkSubmit",
"sparkProperties" : {
"spark.driver.supervise" : "false",
"spark.app.name" : "Simple App",
"spark.eventLog.enabled": "true",
"spark.submit.deployMode" : "cluster",
"spark.master" : "spark://<ip-address>:6066",
"spark.jars" : "spark-csv_2.10-1.4.0.jar",
"spark.jars.packages" : "com.databricks:spark-csv_2.10:1.4.0"
}
}'
and the testing.py file has a code snippet:
from pyspark.sql import SQLContext  # import needed by the snippet

myContext = SQLContext(sc)
format = "com.databricks.spark.csv"
dataFrame1 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location1).repartition(1)
dataFrame2 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location2).repartition(1)
outDataFrame = dataFrame1.join(dataFrame2, dataFrame1.values == dataFrame2.valuesId)
outDataFrame.write.format(format).option("header", "true").option("nullValue","").save(outLocation)
But on this line:
dataFrame1 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location1).repartition(1)
I get exception:
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
I was trying different things out, and one of them was to log into the <ip-address> machine and run this command:
./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
so that it would download spark-csv into the .ivy2/cache folder. But that didn't solve the problem. What am I doing wrong?
(Posted on behalf of the OP).
I first added spark-csv_2.10-1.4.0.jar on the driver and worker machines, and added:
"spark.driver.extraClassPath" : "absolute/path/to/spark-csv_2.10-1.4.0.jar",
"spark.executor.extraClassPath" : "absolute/path/to/spark-csv_2.10-1.4.0.jar",
Then I got the following error:
java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
Caused by: java.lang.ClassNotFoundException: org.apache.commons.csv.CSVFormat
And then I added commons-csv-1.4.jar on both machines and added:
"spark.driver.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
"spark.executor.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
And that solved my problem.
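Putting it together, the sparkProperties section of the submission request ends up looking roughly like this (the jar paths are placeholders for wherever the jars actually live on the driver and worker machines):
"sparkProperties" : {
    "spark.driver.supervise" : "false",
    "spark.app.name" : "Simple App",
    "spark.eventLog.enabled" : "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://<ip-address>:6066",
    "spark.jars" : "spark-csv_2.10-1.4.0.jar",
    "spark.jars.packages" : "com.databricks:spark-csv_2.10:1.4.0",
    "spark.driver.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
    "spark.executor.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar"
}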