How to use a proxy with the Snowpark session builder to connect to Snowflake - Scala

I am new to Snowpark, recently released by Snowflake. I am using IntelliJ to build UDFs (user-defined functions), but I am struggling to get IntelliJ to use a proxy when connecting to Snowflake. Below are a few things I have already tried:
putting the proxy in IntelliJ (under Preferences)
adding the proxy settings before building the session:
System.setProperty("https.useProxy", "true")
System.setProperty("http.proxyHost", "xxxxxxx")
System.setProperty("http.proxyPort", "443")
System.setProperty("no_proxy", "snowflakecomputing.com")
Below is my code -
val configs = Map (
"URL" -> "xxxxx.snowflakecomputing.com:443",
"USER" -> "xxx",
"PASSWORD" -> "xxxx",
"ROLE" -> "ROLE_xxxxx",
"WAREHOUSE" -> "xxxx",
"DB" -> "xxxx",
"SCHEMA" -> "xxxx",
)
val session = Session.builder.configs(configs).create

Snowpark uses the JDBC driver to connect to Snowflake, so the proxy properties from the JDBC connector can be used here as well.
In your Map, add:
"proxyHost" -> "proxyHost Value"
"proxyPort" -> "proxyPort Value"
More information here
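For example, a minimal sketch of the connection Map from the question with the proxy keys added (the proxy host and port values are placeholders; the useProxy flag is taken from the JDBC driver's proxy parameters and is an addition to the two keys above):
val configs = Map(
  "URL" -> "xxxxx.snowflakecomputing.com:443",
  "USER" -> "xxx",
  "PASSWORD" -> "xxxx",
  "ROLE" -> "ROLE_xxxxx",
  "WAREHOUSE" -> "xxxx",
  "DB" -> "xxxx",
  "SCHEMA" -> "xxxx",
  // proxy parameters passed through to the underlying JDBC driver
  "useProxy" -> "true",                 // JDBC flag that enables the proxy settings
  "proxyHost" -> "proxy.mycompany.com", // placeholder proxy host
  "proxyPort" -> "8080"                 // placeholder proxy port
)
val session = Session.builder.configs(configs).create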
If you're specifying a proxy by setting Java system properties, then you can call System.setProperty, like:
System.setProperty("http.useProxy", "true");
System.setProperty("http.proxyHost", "proxyHost Value");
System.setProperty("http.proxyPort", "proxyPort Value");
System.setProperty("https.proxyHost", "proxyHost HTTPS Value");
System.setProperty("https.proxyPort", ""proxyPort HTTPS Value"")
or pass them directly to the JVM, like:
-Dhttp.useProxy=true
-Dhttps.proxyHost=<proxy_host>
-Dhttp.proxyHost=<proxy_host>
-Dhttps.proxyPort=<proxy_port>
-Dhttp.proxyPort=<proxy_port>
More information here
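Putting the system-property route together with the session builder, a minimal sketch (the proxy host and port are placeholders; the properties must be set before the session is created):
// placeholder proxy values; set these before building the Snowpark session
System.setProperty("http.useProxy", "true")
System.setProperty("http.proxyHost", "proxy.mycompany.com")
System.setProperty("http.proxyPort", "8080")
System.setProperty("https.proxyHost", "proxy.mycompany.com")
System.setProperty("https.proxyPort", "8080")

val session = Session.builder.configs(configs).create  // configs as defined in the question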

Related

Spring Cloud Data Flow - Unable to set securityContext/allowPrivilegeEscalation while deploying a stream

Creating a simple TEST Stream in Spring Cloud Data Flow (2.9.4)
> stream create --name "words" --definition "http --server.port=9001 | splitter --expression=payload.split(' ') | log"
> stream deploy --name "words" --propertiesFile words-stream.properties
> cat words-stream.properties
app.http.server.port=9001
app.splitter.expression=payload.split(' ')
app.splitter.producer.partitionKeyExpression=payload
deployer.log.count=3
deployer.http.kubernetes.deployment-labels=applicationid:123456
deployer.log.kubernetes.deployment-labels=applicationid:123456
deployer.splitter.kubernetes.deployment-labels=applicationid:123456
deployer.http.kubernetes.podSecurityContext={allowPrivilegeEscalation: false}
deployer.log.kubernetes.podSecurityContext={allowPrivilegeEscalation: false}
deployer.splitter.kubernetes.podSecurityContext={allowPrivilegeEscalation: false}
I get the following error on deploy:
org.springframework.cloud.skipper.SkipperException: Could not install AppDeployRequest [[AppDeploymentRequest#2e8d33f commandlineArguments = list[[empty]], deploymentProperties = map['spring.cloud.deployer.appName' -> 'log', 'spring.cloud.deployer.count' -> '3', 'spring.cloud.deployer.group' -> 'words', 'spring.cloud.deployer.indexed' -> 'true', 'spring.cloud.deployer.kubernetes.deployment-labels' -> 'applicationid:123456', 'spring.cloud.deployer.kubernetes.podSecurityContext' -> '{allowPrivilegeEscalation: false}'], definition = [AppDefinition#62fe74f5 name = 'log-v5', properties = map['management.metrics.tags.application.type' -> '${spring.cloud.dataflow.stream.app.type:unknown}', 'spring.cloud.dataflow.stream.app.label' -> 'log', 'management.metrics.tags.stream.name' -> '${spring.cloud.dataflow.stream.name:unknown}', 'management.metrics.tags.application' -> '${spring.cloud.dataflow.stream.name:unknown}-${spring.cloud.dataflow.stream.app.label:unknown}-${spring.cloud.dataflow.stream.app.type:unknown}', 'spring.cloud.dataflow.stream.name' -> 'words', 'management.metrics.tags.instance.index' -> '${vcap.application.instance_index:${spring.cloud.stream.instanceIndex:0}}', 'spring.cloud.stream.bindings.input.consumer.partitioned' -> 'true', 'wavefront.application.service' -> '${spring.cloud.dataflow.stream.app.label:unknown}-${spring.cloud.dataflow.stream.app.type:unknown}-${vcap.application.instance_index:${spring.cloud.stream.instanceIndex:0}}', 'spring.cloud.stream.instanceCount' -> '3', 'spring.cloud.stream.bindings.input.group' -> 'words', 'management.metrics.tags.application.guid' -> '${spring.cloud.application.guid:unknown}', 'management.metrics.tags.application.name' -> '${vcap.application.application_name:${spring.cloud.dataflow.stream.app.label:unknown}}', 'spring.cloud.dataflow.stream.app.type' -> 'sink', 'spring.cloud.stream.bindings.input.destination' -> 'words.splitter', 'wavefront.application.name' -> '${spring.cloud.dataflow.stream.name:unknown}']], resource = Docker Resource [docker:xxx.yyy.com/springcloudstream/log-sink-kafka:3.2.0]]] to platform [default]. Error Message = [Invalid binding property '{allowPrivilegeEscalation: false}']
at org.springframework.cloud.skipper.server.deployer.DefaultReleaseManager.install(DefaultReleaseManager.java:152) ~[spring-cloud-skipper-server-core-2.8.4.jar:2.8.4]
at org.springframework.cloud.skipper.server.service.ReleaseService.install(ReleaseService.java:198) ~[spring-cloud-skipper-server-core-2.8.4.jar:2.8.4]
at org.springframework.cloud.skipper.server.service.ReleaseService.install(ReleaseService.java:184) ~[spring-cloud-skipper-server-core-2.8.4.jar:2.8.4]
at org.springframework.cloud.skipper.server.service.ReleaseService.install(ReleaseService.java:145) ~[spring-cloud-skipper-server-core-2.8.4.jar:2.8.4]
at org.springframework.cloud.skipper.server.service.ReleaseService$$FastClassBySpringCGLIB$$f1c5f0a2.invoke(<generated>) ~[spring-cloud-skipper-server-core-2.8.4.jar:2.8.4]
I am unable to create/deploy new stream/task pods that set securityContext/allowPrivilegeEscalation to false.
I am looking for guidance on getting this securityContext applied to the deployed objects.
Update 11/21/2022:
Spring Cloud Data Flow team - I was wondering if someone could look into this question and advise.
The reason I need the container security context is that our company mandates that securityContext/allowPrivilegeEscalation be set to false on all containers in a pod.
I need a way to pass and set this property in order to be compliant.
Error Message:
[psp-allow-privilege-escalation-container] OPA-GATEKEEPER CONSTRAINT: Container index-provider is attempting to run without a required securityContext/allowPrivilegeEscalation, Allowed = false.]
Please note I have tried with
deployer.http.kubernetes.containerSecurityContext={allowPrivilegeEscalation: false}
but the deployment does not recognize this property at all.
Update: 11/29/2022
Please note I removed the policy agent and checked the deployment YAML of the pod. The "log" component of the stream has the issue on its initContainer: there is an initContainer where busybox is started and a few entries are added to /config/application.properties. The policy checks securityContext/allowPrivilegeEscalation in the initContainers as well.
Please can you confirm whether this is a bug, or whether there is a way to overcome it with configuration.
The container security context is currently only applied to the main container.
It will be applied to the init container as well, to address spring-cloud-deployer-kubernetes/issues/512.
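For reference, a sketch of the container-level property format, following the same pattern as the podSecurityContext lines in the question (per the answer above, it currently affects only the main container, not the initContainer):
deployer.http.kubernetes.containerSecurityContext={allowPrivilegeEscalation: false}
deployer.log.kubernetes.containerSecurityContext={allowPrivilegeEscalation: false}
deployer.splitter.kubernetes.containerSecurityContext={allowPrivilegeEscalation: false}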

Connection issue: Databricks - Snowflake

I am trying to connect to Snowflake from a Databricks notebook through the externalbrowser authenticator, but without any success.
CMD1
sfOptions = {
"sfURL" : "xxxxx.west-europe.azure.snowflakecomputing.com",
"sfAccount" : "xxxxx",
"sfUser" : "ivan.lorencin#xxxxx",
"authenticator" : "externalbrowser",
"sfPassword" : "xxxxx",
"sfDatabase" : "DWH_PROD",
"sfSchema" : "APLSDB",
"sfWarehouse" : "SNOWFLAKExxxxx",
"tracing" : "ALL",
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
CMD2
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "select 1 as my_num union all select 2 as my_num") \
.load()
And CMD2 never completes; I just receive ".. Running command ..." that lasts forever.
Can anybody help with what is going wrong here? How can I establish a connection?
It looks like you're setting authenticator to externalbrowser, but according to the docs it should be sfAuthenticator - is this intentional? If you are trying to do an OAuth type of auth, why do you also have a password?
If your account/user requires OAuth to log in, I'd remove that password entry from sfOptions, change that one entry to sfAuthenticator and try again.
If that does not work, you should ensure that your Spark cluster can reach all the required Snowflake hosts (see SnowCD for assistance).
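As a minimal sketch of the suggested change (written here in Scala; the original notebook is Python, the option name follows the answer above, and all account values are placeholders):
val sfOptions = Map(
  "sfURL" -> "xxxxx.west-europe.azure.snowflakecomputing.com", // placeholder account URL
  "sfUser" -> "ivan.lorencin@xxxxx",                           // placeholder user
  "sfAuthenticator" -> "externalbrowser",                      // per the answer; no sfPassword entry
  "sfDatabase" -> "DWH_PROD",
  "sfSchema" -> "APLSDB",
  "sfWarehouse" -> "SNOWFLAKExxxxx"
)

val df = spark.read
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("query", "select 1 as my_num union all select 2 as my_num")
  .load()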

How to fix: 'TypeError: Cannot initialize connector undefined:' while use loopback-connector-firestore

I tried several answers from Stack Overflow but cannot solve the problem. I want to connect Firestore to LoopBack using the loopback-connector-firestore package (https://www.npmjs.com/package/loopback-connector-firestore). After creating the datasource using the lb datasource command and starting the system, the error below shows up:
TypeError: Cannot initialize connector undefined: Cannot read property 'replace' of undefined
LoopBack is already connected to other datasources. How can I add Firestore to it?
This is the datasources.json file:
{
-----other db datasources here-----
"Firestore": {
"name": "Firestore",
"projectId": "project id",
"clientEmail": "client email",
"privateKey": "key here",
"databaseName": "name here",
"connector": "loopback-connector-firestore"
}
}
In server.js file:
var ds = loopback.createDataSource({
connector: require('loopback-connector-firestore'),
provider: 'Firestore'
});
var storage = ds.createModel('storage');
app.model(storage);
The environment settings:
* Kubuntu 18.04
* nodejs v10.16
* npm v6.9
* loopback v3
I think for now some of the connectors aren't supported by LoopBack, and 'loopback-connector-firestore' is one of them.
You may need to check the list of community connectors for what is supported.
Reference: https://loopback.io/doc/en/lb3/Community-connectors.html

Having trouble making sense of vert.x config loading

I am trying to create a verticle using a config.json and am not experiencing what I expect from reading the docs. I will attempt to explain the steps I've taken as best I can, but I have tried many variations of the startup steps of my verticle, so I may not be 100% accurate. This is using Vert.x 3.7.0.
First, I have successfully used my config to launch my verticle when I include the config file in the expected location, conf/config.json:
{
"database" : {
"port" : 5432,
"host" : "127.0.0.1",
"name" : "linked",
"user" : "postgres",
"passwd" : "postgres",
"connectionPoolSize" : 5
},
"chatListener" : {
"port" : 8080,
"host" : "localhost"
}
}
and use the launcher to pass the config to start the verticle (pseudocode):
public static void main(String[] args){
//preprocessing
Launcher.executeCommand("run", "MyVerticle")
...
and
public static void main(String[] args){
//preprocessing
Launcher.executeCommand("run", "MyVerticle -config conf/config.json")
...
both work correctly. My config is loaded and I can pull the data from config() inside my verticle:
JsonObject chatDbOpts = new JsonObject().put( "config", config.getJsonObject( "database" ) );
....
But when I pass a file reference that is not in the default location to the launcher,
$ java -jar vert.jar -config /path/to/config.json
it ignores it and uses the built-in config, which is empty, ignoring my config. Yet the debug output from the Vert.x config loader indicates it is using the default location:
conf/config.json
which it doesn't actually do, because my config file is there. So the config loader isn't loading from the default location when a different config is specified on the CLI.
So I changed the code to digest the config in main and validated that the JSON file can be found and read. I then passed the file reference to the launcher but got the same behaviour. So I switched to using a DeploymentOptions object with deployVerticle.
Output from my preprocessor steps of loading the config and converting it to a JsonObject:
Command line arguments: [-conf, d:/cygwin64/home/rcoe/conf/config.json]
Launching application with the following config:
{
"database" : {
"port" : 5432,
"host" : "127.0.0.1",
"name" : "linked",
"user" : "postgres",
"passwd" : "postgres",
"connectionPoolSize" : 5
},
"chatListener" : {
"port" : 8080,
"host" : "localhost"
}
}
This JsonObject is used to create a DeploymentOptions reference:
DeploymentOptions options = new DeploymentOptions(jsonCfg);
Vertx.vertx().deployVerticle( getClass().getName(), options );
Didn't work.
So then I tried creating an empty DeploymentOptions reference and setting the config:
DeploymentOptions options = new DeploymentOptions();
Map<String,Object> m = new HashMap<>();
m.put("config", jsonObject);
JsonObject cfg = new JsonObject(m);
options.setConfig( cfg );
Vertx.vertx().deployVerticle( getClass().getName(), options );
which also fails to pass my desired config. Instead, it uses config from the default location.
Here's the output from the verticle's startup. It is using the conf/config.json file:
Config file path: conf\config.json, format:json
-Dio.netty.buffer.checkAccessible: true
-Dio.netty.buffer.checkBounds: true
Loaded default ResourceLeakDetector: io.netty.util.ResourceLeakDetector#552c2b11
Config options:
{
"port" : 5432,
"host" : "not-a-real-host",
"name" : "linked",
"user" : "postgres",
"passwd" : "postgres",
"connectionPoolSize" : 5
}
versus the config that is given to the DeploymentOptions reference:
Launching application with the following config:
{
"database" : {
"port" : 5432,
"host" : "127.0.0.1",
"name" : "linked",
"user" : "postgres",
"passwd" : "postgres",
"connectionPoolSize" : 5
},
...
Anyway, I hope these steps make sense and show that I've tried a variety of methods to load custom config. I have seen my config get passed into the Vert.x code that is responsible for invoking verticles, but by the time my verticle's start() method gets called, my config is gone.
Thanks.
As is usual, authoring a question leads to a better view of the problem. The solution, as I understand it, is to always create a map with a key called "config" whose value is the JsonObject you want to pass.
To deploy:
private void launch( final JsonObject jsonObject )
{
DeploymentOptions options = new DeploymentOptions();
Map<String, Object> m = new HashMap<>();
m.put( "config", jsonObject );
JsonObject cfg = new JsonObject( m );
options.setConfig( cfg );
Vertx.vertx().deployVerticle( MainVerticle.class.getName(), options );
}
@Override
public void start( final Future<Void> startFuture )
{
ConfigRetriever cfgRetriever = ConfigRetriever.create( vertx.getDelegate() );
cfgRetriever.getConfig( ar -> {
try {
if( ar.succeeded() ) {
JsonObject config = ar.result();
JsonObject cfg = config.getJsonObject( "config" );
JsonObject chatDbOpts = cfg.getJsonObject( "database" );
LOGGER.debug( "Launching ChatDbServiceVerticle with the following config:\n{}",
chatDbOpts.encodePrettily() );
JsonObject chatHttpOpts = cfg.getJsonObject( "chatListener" );
LOGGER.debug( "Launching HttpServerVerticle with the following config:\n{}",
chatHttpOpts.encodePrettily() );
...
produces the output:
Launching ChatDbServiceVerticle with the following config:
{
"port" : 5432,
"host" : "127.0.0.1",
"name" : "linked",
"user" : "postgres",
"passwd" : "postgres",
"connectionPoolSize" : 5
}
Launching HttpServerVerticle with the following config:
{
"port" : 8080,
"host" : "localhost"
}
But this begs the question of what the point of the DeploymentOptions(JsonObject) constructor is if config() ignores any object that can't be retrieved with that specific key. It required stepping through the debugger to find this; there's no hint of this requirement in the docs: https://vertx.io/blog/vert-x-application-configuration/.

Writing to a Mongo replica set from Spark (in Scala)

I'm trying to write from a Spark RDD to MongoDB using the mongo-spark-connector.
I'm facing two problems:
[main problem] I can't connect to Mongo if I define the host according to the documentation (using all instances in the mongo replica set)
[secondary/related problem] If I connect to the primary only, I can write... but I typically crash the primary writing the first collection
Environment:
mongo-spark-connector 1.1
spark 1.6
scala 2.10.5
First I'll set up a dummy example to demonstrate...
import org.bson.Document
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.rdd.RDD
/**
* fake json data
*/
val recs: List[String] = List(
"""{"a": 123, "b": 456, "c": "apple"}""",
"""{"a": 345, "b": 72, "c": "banana"}""",
"""{"a": 456, "b": 754, "c": "cat"}""",
"""{"a": 876, "b": 43, "c": "donut"}""",
"""{"a": 432, "b": 234, "c": "existential"}"""
)
val rdd_json_str: RDD[String] = sc.parallelize(recs, 5)
val rdd_hex_bson: RDD[Document] = rdd_json_str.map(json_str => Document.parse(json_str))
Some values that won't change...
// credentials
val user = ???
val pwd = ???
// fixed values
val db = "db_name"
val replset = "replset_name"
val collection_name = "collection_name"
Here's what does NOT work... in this case "url" would look something like machine.unix.domain.org and "ip" would look like... well, an IP address.
This is how the documentation says to define the host... with every machine in the replica set.
val host = "url1:27017,url2:27017,url3:27017"
val host = "ip_address1:27017,ip_address2:27017,ip_address3:27017"
I can't get either of these to work. Using every permutation I can think of for the uri...
val uri = s"mongodb://${user}:${pwd}#${host}/${db}?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}#${host}/?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}#${replset}/${host}/${db}"
val uri = s"mongodb://${user}:${pwd}#${replset}/${host}/${db}.${collection_name}"
val uri = s"mongodb://${user}:${pwd}#${host}" // setting db, collection, replica set in WriteConfig
val uri = s"mongodb://${user}:${pwd}#${host}/${db}" // this works IF HOST IS PRIMARY ONLY; not for hosts as defined above
EDIT
More detail on the error messages... the errors take two forms.
form 1
typically includes java.net.UnknownHostException: machine.unix.domain.org
also, it comes back with server addresses in URL form even when they were defined as IP addresses
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting
for a server that matches WritableServerSelector. Client view of cluster
state is {type=REPLICA_SET, servers=[{address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}]
form 2
(authentication error... though connecting with the same credentials to the primary only works fine)
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting
for a server that matches WritableServerSelector. Client view of cluster
state is {type=REPLICA_SET, servers=[{address=xx.xx.xx.xx:27017,
type=UNKNOWN, state=CONNECTING, exception=
{com.mongodb.MongoSecurityException: Exception authenticating
MongoCredential{mechanism=null, userName='xx', source='admin', password=
<hidden>, mechanismProperties={}}}, caused by
{com.mongodb.MongoCommandException: Command failed with error 18:
'Authentication failed.' on server xx.xx.xx.xx:27017. The full response is {
"ok" : 0.0, "errmsg" : "Authentication failed.", "code" : 18, "codeName" :
"AuthenticationFailed", "operationTime" : { "$timestamp" : { "t" :
1534459121, "i" : 1 } }, "$clusterTime" : { "clusterTime" : { "$timestamp" :
{ "t" : 1534459121, "i" : 1 } }, "signature" : { "hash" : { "$binary" :
"xxx=", "$type" : "0" }, "keyId" : { "$numberLong" : "123456" } } } }}}...
end EDIT
here's what DOES work... on the dummy data only... more on that below...
val host = s"${primary_ip_address}:27017" // primary only
val uri = s"mongodb://${user}:${pwd}#${host}/${db}"
val writeConfig: WriteConfig =
WriteConfig(Map(
"uri" -> uri,
"database" -> db,
"collection" -> collection_name,
"replicaSet" -> replset))
// write data to mongo
MongoSpark.save(rdd_hex_bson, writeConfig)
This... connecting to the primary only... works great for dummy data, but crashes the primary for real data (50 - 100GB from an RDD with 2700 partitions). My guess is that it opens up too many connections at once... it looks like it opens ~900 connections to write (this jibes with the default parallelism of 2700 based on 900 virtual cores and a parallelism factor of 3x).
I'm guessing if I repartition so it opens fewer connections, I'll have better luck... but I'm guessing this also ties in to writing to the primary only instead of spreading it over all instances.
I've read everything I can find here... but most examples are for single instance connections... https://docs.mongodb.com/spark-connector/v1.1/configuration/#output-configuration
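Regarding the repartitioning guess above, a minimal sketch of reducing the write parallelism before saving (the partition count of 100 is an arbitrary placeholder, not a recommendation):
// coalesce to fewer partitions so the write opens fewer simultaneous connections
val rdd_coalesced: RDD[Document] = rdd_hex_bson.coalesce(100)
MongoSpark.save(rdd_coalesced, writeConfig)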
It turns out there were two problems here. From the original question, these were referenced as errors of 'form 1' and 'form 2'.
error of 'form 1' - solution
The gist of the problem turned out to be a bug in the mongo-spark-connector. It turns out that it can't connect to a replica set using IP addresses... it requires URIs. Since the DNS servers in our cloud don't have these lookups, I got it working by modifying /etc/hosts on every executor and then using the connection string format like this:
val host = "URI1:27017,URI2:27017,URI3:27017"
val uri = s"mongodb://${user}:${pwd}#${host}/${db}?replicaSet=${replset}&authSource=${db}"
val writeConfig: WriteConfig =
WriteConfig(Map(
"uri"->uri,
"database"->db,
"collection"->collection,
"replicaSet"->replset,
"writeConcern.w"->"majority"))
this required first adding the following to /etc/hosts on every machine:
IP1 URI1
IP2 URI2
IP3 URI3
Now of course, I can't figure out how to use bootstrap actions in AWS EMR to update /etc/hosts when the cluster spins up, but that's another question (AWS EMR bootstrap action as sudo).
error of 'form 2' - solution
Adding &authSource=${db} to the URI solved this.