How do I query Spark JobServer and find where it stores my Jars?

I am trying to follow this documentation:
https://github.com/spark-jobserver/spark-jobserver#dependency-jars
Option 2 listed in the docs says:
The dependent-jar-uris can also be used in job configuration param when submitting a job. On an ad-hoc context this has the same effect as dependent-jar-uris context configuration param. On a persistent context the jars will be loaded for the current job and then for every job that will be executed on the persistent context.
curl -d "" 'localhost:8090/contexts/test-context?num-cpu-cores=4&memory-per-node=512m'
OK
curl 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample&context=test-context&sync=true' -d '{ dependent-jar-uris = ["file:///myjars/deps01.jar", "file:///myjars/deps02.jar"], input.string = "a b c a b see" }'
The jars /myjars/deps01.jar & /myjars/deps02.jar (present only on the SJS node) will be loaded and made available for the Spark driver & executors.
Is "file:///myjars/" directory the SJS node's JAR directory or some custom directory?
I have a client on a Windows box and a Spark JobServer on a Linux box. Next, I upload a JAR to SJS node. SJS node puts that Jar somewhere. Then, when I call to start a Job and set the 'dependent-jar-uris', the SJS node will find my previously uploaded JAR and run the job:
"dependent-jar-uris" set to "file:///tmp/spark-jobserver/filedao/data/simpleJobxxxxxx.jar"
This works fine, but I had to manually go searching around the SJS node to find this location (e.g. file:///tmp/spark-jobserver/filedao/data/simpleJobxxxxxx.jar) and then add it into my future requests to start the job.
Instead, how do I make a REST call from the client to get the path where Spark JobServer puts my jars when I upload them, so that I can set the file:/// path correctly in my 'dependent-jar-uris' property dynamically?

I don't think jars uploaded with "POST /jars" can be used in dependent-jar-uris. Since you are uploading the jars yourself, you already know their local paths; just use those.


How to enable custom stage library for streamsets data collector?

I have a custom stage library that I want to use in a StreamSets Data Collector pipeline. I have followed all the steps given in the link below to install the custom stage library, but I still cannot find the stage library in Data Collector.
Could you please help?
StreamSets documentation link:
docs.streamsets.com/datacollector/latest/help/datacollector/UserGuide/Configuration/CustomStageLibraries.html#concept_pmc_jk1_1x
There are only a few things to check:
First, ensure you've exported the USER_LIBRARIES_DIR environment variable so SDC can find the custom stage libraries:
export USER_LIBRARIES_DIR="/opt/sdc-user-libs/"
Then give the custom stage library the appropriate permissions in the JVM's security manager. (Verify that you've added a section like this example to the $SDC_CONF/sdc-security.policy file):
// custom stage library directory
grant codebase "file:///opt/sdc-user-libs/-" {
permission java.security.AllPermission;
};
Verify that you've restarted SDC to pick up the previously-mentioned changes.
After restarting SDC, ensure the custom stage's jar files have been opened by the JVM:
lsof -p (sdc pid) | grep (jar file name)
The documentation can be found here

Creating and using a custom kafka connect configuration provider

I have installed and tested Kafka Connect in distributed mode; it works now, connecting to the configured sink and reading from the configured source.
That being the case, I moved on to enhance my installation. The one area I think needs immediate attention is the fact that the only available means to create a connector is through REST calls, which means I need to send my information over the wire, unprotected.
In order to secure this, Kafka introduced the new ConfigProvider seen here.
This is helpful, as it allows setting properties on the server and then referencing them in the REST call, like so:
{
  ...
  "property": "${file:/path/to/file:nameOfThePropertyInFile}"
  ...
}
This works really well, just by adding the property file on the server and adding the following config to the distributed.properties file:
config.providers=file # multiple comma-separated provider types can be specified here
config.providers.file.class=org.apache.kafka.common.config.provider.FileConfigProvider
While this solution works, it really does not help to ease my concerns regarding security, as the information has gone from being sent over the wire to sitting in a repository, in plain text for everyone to see.
The Kafka team foresaw this issue and allows clients to provide their own configuration providers by implementing the ConfigProvider interface.
I have created my own implementation and packaged it in a jar, giving it the suggested final name:
META-INF/services/org.apache.kafka.common.config.ConfigProvider
and added the following entry in the distributed file:
config.providers=cust
config.providers.cust.class=com.somename.configproviders.CustConfigProvider
However, I am getting an error from Connect stating that a class implementing ConfigProvider with the name
com.somename.configproviders.CustConfigProvider
could not be found.
I am at a loss now, because the documentation on their site is not very explicit about how to configure custom config providers.
Has someone worked on a similar issue and could provide some insight into this? Any help would be appreciated.
I just went through this recently to set up a custom ConfigProvider. The official doc is ambiguous and confusing.
I have created my own implementation and packaged it in a jar, giving it the suggested final name:
META-INF/services/org.apache.kafka.common.config.ConfigProvider
You can name the final jar whatever you like, but it needs to be packed into jar format with a .jar suffix.
Here is the complete step by step. Suppose your custom ConfigProvider's fully-qualified name is com.my.CustomConfigProvider.MyClass.
1. Create a file at META-INF/services/org.apache.kafka.common.config.ConfigProvider. The file content is the fully-qualified class name:
com.my.CustomConfigProvider.MyClass
2. Include your source code and the META-INF folder above when generating the jar package. If you are using Maven, the service file typically lives under src/main/resources/META-INF/services/.
3. Put your final jar file, say custom-config-provider-1.0.jar, under the Kafka worker plugin folder (PLUGIN_PATH in the Kafka worker config file; the default is /usr/share/java).
4. Upload all the dependency jars to PLUGIN_PATH as well. Use the META-INF/MANIFEST.MF file inside your jar to configure the Class-Path of dependent jars that your code will use.
5. In the Kafka worker config, create two additional properties (shown here in the environment-variable form used by containerized deployments):
CONNECT_CONFIG_PROVIDERS: 'mycustom', // Alias name of your ConfigProvider
CONNECT_CONFIG_PROVIDERS_MYCUSTOM_CLASS:'com.my.CustomConfigProvider.MyClass',
6. Restart the workers.
7. Update your connector config file by POSTing it to the Kafka Connect REST API. In the connector config file, you can reference a value inside the ConfigData returned from ConfigProvider.get(path, keys) by using syntax like:
database.password=${mycustom:/path/pass/to/get/method:password}
Here ConfigData is a map that contains {password: 123}.
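For reference, here is roughly what such a provider could look like. This is only a sketch, written in Scala: the class name matches the hypothetical com.my.CustomConfigProvider.MyClass used above, and "read keys from a Java properties file at the requested path" is just one possible behaviour, not the only way to implement a provider.
package com.my.CustomConfigProvider

import java.io.FileInputStream
import java.util.Properties
import scala.jdk.CollectionConverters._   // Scala 2.13 converters; use scala.collection.JavaConverters on 2.12
import org.apache.kafka.common.config.ConfigData
import org.apache.kafka.common.config.provider.ConfigProvider

class MyClass extends ConfigProvider {

  override def configure(configs: java.util.Map[String, _]): Unit = ()

  // Backs ${mycustom:<path>:<key>} references in connector configs.
  override def get(path: String, keys: java.util.Set[String]): ConfigData = {
    val props = load(path)
    val data = keys.asScala.iterator
      .flatMap(k => Option(props.getProperty(k)).map(v => k -> v))
      .toMap
    new ConfigData(data.asJava)
  }

  // Returns every key at the path when no specific keys are requested.
  override def get(path: String): ConfigData = {
    val props = load(path)
    val data = props.stringPropertyNames().asScala.map(k => k -> props.getProperty(k)).toMap
    new ConfigData(data.asJava)
  }

  override def close(): Unit = ()

  private def load(path: String): Properties = {
    val props = new Properties()
    val in = new FileInputStream(path)
    try props.load(in) finally in.close()
    props
  }
}
With the mycustom alias registered as in step 5, the database.password example from step 7 resolves through this class.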
If you are still seeing a ClassNotFoundException, your classpath is probably not set up correctly.
Note:
• If you are using AWS ECS/EC2, you need to set the worker config through environment variables.
• The worker config and the connector config file are different things.

Azure batch Application package not getting copied to Working Directory of Task

I have created an Azure Batch pool with a Linux machine and specified an Application Package for the pool.
My command line is:
command='python $AZ_BATCH_APP_PACKAGE_scriptv1_1/tasks/XXX/get_XXXXX_data.py',
and it fails with:
python3: can't open file '$AZ_BATCH_APP_PACKAGE_scriptv1_1/tasks/XXX/get_XXXXX_data.py': [Errno 2] No such file or directory
When I connect to the node and look at the working directory, none of the Application Package files are present there.
How do I make sure that the files from the Application Package are available in the working directory, or how can I invoke/execute files under the Application Package from the command line?
Make sure that your async operations have proper awaits in place before you start using the package in your code.
Also, please share your design / pseudo-code scenario and how you are approaching it as a design.
Further to add:
It seems like this one is a pool-level package.
The error suggests that the application environment variable is either used incorrectly or that there is some other user-level issue. Please check out the link below, especially the section where the use of the environment variable is mentioned.
This seems like a user-level issue because, if there were an error downloading the package resource, it would be visible to you via an exception handler, at the tool level if you are using Batch Explorer / Batch Labs, or through code-level exception handling.
https://learn.microsoft.com/en-us/azure/batch/batch-application-packages
Reason / rationale:
If the pool-level or task-level application has an error, an error list will come back; if there was an error in the application package, it will be returned as a UserError or an AppPackageError, which will be visible in the exception handling of your code.
Key point: you can always RDP into your node and check the package availability; information here: https://learn.microsoft.com/en-us/azure/batch/batch-api-basics#connecting-to-compute-nodes
I once created a small sample to help people out, so that resource might help you check out the usage here.
Hope the rest helps.
On Linux, the environment variable for an application package with a version string is formatted as:
AZ_BATCH_APP_PACKAGE_{0}_{1}
where {0} is the application name and {1} is the version. On Windows it is formatted as:
AZ_BATCH_APP_PACKAGE_APPLICATIONID#version
$AZ_BATCH_APP_PACKAGE_scriptv1_1 will take you to the root folder where the application package was unzipped.
Does this "exact" path exist in that location?
tasks/XXX/get_XXXXX_data.py
You can see more information here:
https://learn.microsoft.com/en-us/azure/batch/batch-application-packages
Edit: Just saw this question: "or can I invoke/execute files under Application Package from command line"
Yes, you can invoke and execute files from the application package directory using the environment variable above.
If you type env on the node, you will see the environment variables that have been set.

How do I pass key/value config settings when submitting a Job to Snappy Job Server?

I have a job that loads a data file from a different location each time. I'd like to submit the same job JAR and just pass a different location to it using the Config.java parameter of the runJavaJob() API.
I do not see a way to pass key/value configuration in the snappy-job.sh usage.
How would I do this?
You can set the key/value pairs in the APP_PROPS environment variable before invoking snappy-job.sh. We have demonstrated this in our Getting Started section:
$ export APP_PROPS="consumerKey=,consumerSecret=,accessToken=,accessTokenSecret="
$ ./bin/snappy-job.sh submit --lead localhost:8090 --app-name TwitterPopularTagsJob --class io.snappydata.examples.TwitterPopularTagsJob --app-jar ./lib/quickstart-0.2.1-PREVIEW.jar --stream
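Inside the job, those pairs arrive in the com.typesafe.config.Config object that is passed to your job's run method, so you can read your file location from there. A minimal sketch follows; the dataLocation key, the default path, and the helper shown are made up for illustration, and the exact job trait and method signature depend on your SnappyData version.
import com.typesafe.config.Config

object JobConfigSupport {
  // Submitted with e.g.: export APP_PROPS="dataLocation=/data/input-2016-08-01.csv"
  // ("dataLocation" is a hypothetical key name for this sketch.)
  def resolveDataLocation(jobConfig: Config): String =
    if (jobConfig.hasPath("dataLocation")) jobConfig.getString("dataLocation")
    else "/data/default.csv" // fallback when the key was not supplied
}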

Two Configuration files in Scala-Spray framework

I have a REST API developed using Scala and the Spray framework. I am able to launch and exercise my API on localhost. The API is connected to a database; the IP address (localhost) and port of the database are read from the "application.conf" file under the resources.
Everything works fine until I start using Docker. In Docker I have:
1. One Docker container for the REST API
2. One Docker container for the database
The IP address of the database changes for each Docker instance, so I need to update my "application.conf" file (although I can use the hostname of the DB instance, which stays the same).
My issue is: can I have two "application.conf" files, one for localhost and one for the Docker instance? Is there a way to change the "application.conf" file at runtime?
P.S. I am using "sbt run" to run the application, and as per the documentation it does not support Java system properties or environment variables.
Yes, you can choose the config at runtime. Spray and Akka use the Typesafe Config library, which allows overriding single settings or the whole configuration using JVM properties.
From the documentation of config:
For applications using application.{conf,json,properties}, system properties can be used to force a different config source:
config.resource specifies a resource name - not a basename, i.e. application.conf not application
config.file specifies a filesystem path; again it should include the extension, not be a basename
config.url specifies a URL
These system properties specify a replacement for application.{conf,json,properties}, not an addition. They only affect apps using the default ConfigFactory.load() configuration. In the replacement config file, you can use include "application" to include the original default config file; after the include statement you could go on to override certain settings.
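So one option here, shown as a sketch (the "docker.conf" resource name and the "db.host" / "db.port" keys are made-up names), is to keep application.conf for local runs, add a second resource such as docker.conf that starts with include "application" and overrides only the database host, and select it with the config.resource system property when running in Docker. The application code does not change:
import com.typesafe.config.ConfigFactory

object Boot {
  // ConfigFactory.load() honours -Dconfig.resource / -Dconfig.file automatically,
  // so no code change is needed to switch between the two files.
  private val config = ConfigFactory.load()

  val dbHost: String = config.getString("db.host") // hypothetical key names
  val dbPort: Int    = config.getInt("db.port")
}
When running the packaged application in the Docker image, start it with something like java -Dconfig.resource=docker.conf -jar yourapp.jar; when launching through sbt, passing the property to sbt itself (sbt -Dconfig.resource=docker.conf run) works as long as run is not forked into a separate JVM.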