Google Cloud Dataproc - Submit Spark Jobs Via Scala

Is there a way to submit Spark jobs to Google Cloud Dataproc from within the Scala code?
val conf = new SparkConf()
  .setMaster("...")
What should the master URI look like?
What key-value pairs should be set to authenticate with an API key or keypair?

In this case, I'd strongly recommend an alternative approach. This type of connectivity has not been tested or recommended for a few reasons:
It requires opening firewall ports to connect to the cluster
Unless you use a tunnel, your data may be exposed
Authentication is not enabled by default
Is SSHing into the master node (the node which is named cluster-name-m) a non-starter? It is pretty easy to SSH into the master node to directly use Spark.
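For completeness, here is a minimal sketch of what the driver setup looks like once you are on the master node (assuming the YARN setup Dataproc provides out of the box; the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// On cluster-name-m, Spark is already wired to YARN, so the master URI is just "yarn"
// (or "yarn-client"/"yarn-cluster" on older Spark releases).
val conf = new SparkConf().setAppName("my-dataproc-job").setMaster("yarn")
val sc = new SparkContext(conf)

In practice you would run this through spark-shell or spark-submit on the master rather than setting a remote master URI from your own machine.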

Related

Get Databricks cluster ID (or get cluster link) in a Spark job

I want to get the cluster link (or the cluster ID to manually compose the link) inside a running Spark job.
This will be used to print the link in an alerting message, making it easier for engineers to reach the logs.
Is it possible to achieve that in a Spark job running in Databricks?
When a Databricks cluster starts, a number of Spark configuration properties are added. Most of them have names starting with spark.databricks. - you can find all of them in the Environment tab of the Spark UI.
The cluster ID is available as the spark.databricks.clusterUsageTags.clusterId property, and you can get it as:
spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
You can get the workspace host name via the dbutils.notebook.getContext().apiUrl.get call (in Scala), or dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get() (in Python).
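Putting the two together, a rough sketch (Scala, run inside a Databricks notebook or job, where spark and dbutils already exist) that composes a clickable link for an alert message; the #setting/clusters/.../configuration path is an assumption about the workspace UI layout and may need adjusting:

val clusterId = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
val workspaceUrl = dbutils.notebook.getContext().apiUrl.get
// The UI path below is a guess -- adapt it to however your workspace structures cluster pages.
val clusterLink = s"$workspaceUrl/#setting/clusters/$clusterId/configuration"
println(s"Cluster logs: $clusterLink")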

How to connect to a kerberized hdfs from Spark on Kubernetes?

I'm trying to connect to a kerberized HDFS, which fails with the error
org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
What additional parameters do I need to add while creating the Spark setup, apart from the standard configuration needed to spawn Spark worker containers?
Check the hadoop.security.authentication property in your hdfs-site.xml properties file.
In your case it should have value kerberos or token.
Or you can configure it from code by specifying property explicitly:
import org.apache.hadoop.conf.Configuration;
// Force Kerberos authentication on the Hadoop client configuration.
Configuration conf = new Configuration();
conf.set("hadoop.security.authentication", "kerberos");
You can find more information about secure connection to hdfs here
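As a hedged sketch of the same idea in Scala, explicitly enabling Kerberos and logging in from a keytab before touching HDFS (the principal and keytab path are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

val hadoopConf = new Configuration()
hadoopConf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(hadoopConf)
// Log in with a keytab so the client can obtain Kerberos credentials and delegation tokens.
UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/etc/security/keytabs/user.keytab")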
I have also asked a very similar question here.
Firstly, please verify whether this error is occurring on your driver pod or the executor pods. You can do this by looking at the logs of the driver and the executors as they start running. While I don't have any errors with my Spark job running only on the master, I do face this error when I summon executors. The solution is to use a sidecar image. You can see an implementation of this in ifilonenko's project, which he referred to in his demo.
The premise of this approach is to store the delegation token (obtained by running a kinit) into a shared persistent volume. This volume can then be mounted to your driver and executor pods, thus giving them access to the delegation token, and therefore, the kerberized hdfs. I believe you're getting this error because your executors currently do not have the delegation token necessary for access to hdfs.
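As a rough sketch of the idea only (not the exact mechanism the linked project uses): on Spark 2.4+ you can mount such a shared PersistentVolumeClaim into both driver and executors with the spark.kubernetes.*.volumes.* properties and point Hadoop at the token file; all names and paths below are placeholders:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Mount the PVC that holds the kinit-produced delegation token into the driver...
  .set("spark.kubernetes.driver.volumes.persistentVolumeClaim.krbtokens.mount.path", "/var/keytabs")
  .set("spark.kubernetes.driver.volumes.persistentVolumeClaim.krbtokens.options.claimName", "krb-token-pvc")
  // ...and into every executor.
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.krbtokens.mount.path", "/var/keytabs")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.krbtokens.options.claimName", "krb-token-pvc")
  // Hadoop picks up a delegation token file from this environment variable.
  .set("spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION", "/var/keytabs/hadoop.token")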
P.S. I'm assuming you've already had a look at Spark's kubernetes documentation.

How can I use dataproc to pull data from bigquery that is not in the same project as my dataproc cluster?

I work for an organisation that needs to pull data from one of our client's BigQuery datasets using Spark, and given that both we and the client use GCP, it makes sense to use Dataproc to achieve this.
I have read Use the BigQuery connector with Spark which looks very useful however it seems to make the assumption that the dataproc cluster, the bigquery dataset and the storage bucket for temporary BigQuery export are all in the same GCP project - that is not the case for me.
I have a service account key file that allows me to connect to and interact with our client's data stored in bigquery, how can I use that service account key file in conjunction with the BigQuery connector and dataproc in order to pull data from bigquery and interact with it in dataproc? To put it another way, how can I modify the code provided at Use the BigQuery connector with Spark to use my service account key file?
To use service account keyfile authorization you need to set the mapred.bq.auth.service.account.enable property to true and point the BigQuery connector to a service account JSON keyfile using the mapred.bq.auth.service.account.json.keyfile property (at the cluster or job level). Note that this property value is a local path, which is why you need to distribute the keyfile to all the cluster nodes beforehand, using an initialization action, for example.
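For example (Scala, assuming the keyfile has already been pushed to the same local path on every node; the path is a placeholder):

// sc is an existing SparkContext; the connector reads these settings from its Hadoop configuration.
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("mapred.bq.auth.service.account.enable", "true")
hadoopConf.set("mapred.bq.auth.service.account.json.keyfile", "/etc/google/client-keyfile.json")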
Alternatively, you can use any authorization method described here, but you need to replace the fs.gs properties prefix with mapred.bq for the BigQuery connector.

spark-jobserver - managing multiple EMR clusters

I have a production environment that consists of several (persistent and ad-hoc) EMR Spark clusters.
I would like to use one instance of spark-jobserver to manage the job JARs for this environment in general, and be able to specify the intended master right when I POST /jobs, rather than permanently in the config file (via the master = "local[4]" configuration key).
Obviously I would prefer to have spark-jobserver running on a standalone machine, and not on any of the masters.
Is this somehow possible?
You can write a SparkMasterProvider
https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server/src/spark.jobserver/util/SparkMasterProvider.scala
A more complex example is here: https://github.com/spark-jobserver/jobserver-cassandra/blob/master/src/main/scala/spark.jobserver/masterLocators/dse/DseSparkMasterProvider.scala
I think all you have to do is write one that returns the config input as the Spark master; that way you can pass it as part of the job config.
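Something along these lines (a sketch only; the trait and its method signature are taken from the linked source and may differ between spark-jobserver versions, and the spark.master key is my assumption for where the caller puts the intended master):

import com.typesafe.config.Config
import spark.jobserver.util.SparkMasterProvider

class ConfigDrivenMasterProvider extends SparkMasterProvider {
  // Return whatever master URI the submitted config carries under "spark.master".
  override def getSparkMaster(config: Config): String =
    config.getString("spark.master")
}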

Documentation for standalone REST WS

Using Spark 1.3.1, when a master node is started with ./sbin/start-master.sh, a RESTful webservice is started on that machine (for me, on port 6066). Is there any documentation on how to interact with that service?
I found this code, but I was not able to find the corresponding Scaladoc let alone some sort of guide.
Here's the JIRA ticket; it contains the Design Doc that motivated this feature.
The goal is to create a new submission gateway that is stable across Spark versions
Additionally,
It is also not a goal to expose the new gateway as a general mechanism
for users of Spark to submit their applications. The new gateway will
be used strictly internally between Spark submit and the standalone
Master.
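That said, if you just want to experiment against the endpoint, the request shape can be inferred from the CreateSubmissionRequest messages in the code linked above. A sketch (Scala, plain HTTP; the host, jar path and class are placeholders, and the field names are an assumption taken from that code, so they may change between Spark versions since the API is internal):

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

val payload =
  """{
    |  "action": "CreateSubmissionRequest",
    |  "clientSparkVersion": "1.3.1",
    |  "appResource": "hdfs:///jars/my-app.jar",
    |  "mainClass": "com.example.Main",
    |  "appArgs": [],
    |  "environmentVariables": { "SPARK_ENV_LOADED": "1" },
    |  "sparkProperties": {
    |    "spark.master": "spark://master-host:7077",
    |    "spark.app.name": "rest-submit-test",
    |    "spark.jars": "hdfs:///jars/my-app.jar"
    |  }
    |}""".stripMargin

// POST the submission request to the master's REST port (6066 by default).
val conn = new URL("http://master-host:6066/v1/submissions/create")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
val out = conn.getOutputStream
out.write(payload.getBytes(StandardCharsets.UTF_8))
out.close()
println(scala.io.Source.fromInputStream(conn.getInputStream).mkString)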