How to connect to a kerberized HDFS from Spark on Kubernetes?

I'm trying to connect to an HDFS cluster that is kerberized, and the connection fails with the error
org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
What additional parameters do I need to add to the Spark setup, beyond the standard configuration needed to spawn Spark worker containers?

Check the hadoop.security.authentication property in your Hadoop configuration files (it normally lives in core-site.xml).
In your case it should have the value kerberos.
Or you can configure it from code by setting the property explicitly:
import org.apache.hadoop.conf.Configuration;
// Switch the Hadoop client from SIMPLE to Kerberos authentication
Configuration conf = new Configuration();
conf.set("hadoop.security.authentication", "kerberos");
You can find more information about secure connections to HDFS here.

I have also asked a very similar question here.
Firstly, please verify whether this error is occurring on your driver pod or the executor pods. You can do this by looking at the logs of the driver and the executors as they start running. While I don't have any errors with my Spark job running only on the master, I do face this error when I spawn executors. The solution is to use a sidecar image. You can see an implementation of this in ifilonenko's project, which he referred to in his demo.
The premise of this approach is to store the delegation token (obtained by running kinit) in a shared persistent volume. This volume can then be mounted to your driver and executor pods, giving them access to the delegation token and therefore to the kerberized HDFS. I believe you're getting this error because your executors currently do not have the delegation token necessary to access HDFS.
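As a rough sketch of the wiring (not the exact setup from ifilonenko's project), the shared volume and token location can be passed through Spark's Kubernetes properties. The volume name krb-tokens, the claim token-pvc, the mount path /mnt/tokens and the token file name below are placeholders of my own, and the volume properties assume Spark 2.4+:
import org.apache.spark.SparkConf

// Mount a PVC holding the delegation token into the driver and executor pods,
// then point the Hadoop client at the token file via HADOOP_TOKEN_FILE_LOCATION.
// All names and paths below are placeholders.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.volumes.persistentVolumeClaim.krb-tokens.options.claimName", "token-pvc")
  .set("spark.kubernetes.driver.volumes.persistentVolumeClaim.krb-tokens.mount.path", "/mnt/tokens")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.krb-tokens.options.claimName", "token-pvc")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.krb-tokens.mount.path", "/mnt/tokens")
  .set("spark.kubernetes.driverEnv.HADOOP_TOKEN_FILE_LOCATION", "/mnt/tokens/hadoop.token")
  .set("spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION", "/mnt/tokens/hadoop.token")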
P.S. I'm assuming you've already had a look at Spark's kubernetes documentation.

Related

How to set spark executor memory in the Azure Data Factory Linked service

My Spark Scala code is failing due to a Spark out-of-memory issue. I am running the code from an ADF pipeline. In the Databricks cluster, the executor memory is set to 4g. I want to change this value at the ADF level instead of changing it at the cluster level. While creating a linked service, there are additional cluster settings where we can define the cluster Spark configuration. Could someone please let me know how to set the Spark executor memory in a linked service in ADF?
Thank you.
Add Name = spark.executor.memory and Value = 6g
Monitor core configuration settings to ensure your Spark jobs run in a predictable and performant way. These settings help determine the best Spark cluster configuration for your particular workloads.
Also refer - https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-settings
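As a quick sanity check (my own suggestion, not from the linked documentation), you can read the value back from inside the job to confirm that the setting from the linked service actually reached the cluster; here spark is assumed to be your SparkSession:
// Scala: confirm the effective executor memory on the cluster created by the linked service
val execMem = spark.conf.get("spark.executor.memory")
println(s"spark.executor.memory = $execMem")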

Get Databricks cluster ID (or get cluster link) in a Spark job

I want to get the cluster link (or the cluster ID to manually compose the link) inside a running Spark job.
This will be used to print the link in an alerting message, making it easier for engineers to reach the logs.
Is it possible to achieve that in a Spark job running in Databricks?
When a Databricks cluster starts, a number of Spark configuration properties are added. Most of them have names starting with spark.databricks., and you can find all of them in the Environment tab of the Spark UI.
The cluster ID is available as the spark.databricks.clusterUsageTags.clusterId property, and you can get it as:
spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
You can get the workspace host name via the dbutils.notebook.getContext().apiUrl.get call (for Scala), or dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get() (for Python).
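Putting the two together, here is a sketch in Scala that composes a clickable link for an alert message. The #setting/clusters/<cluster-id>/configuration path is the classic cluster-page URL pattern and is an assumption on my part; it may differ in newer workspace UIs:
// Build a cluster link from the workspace URL and cluster ID (run on Databricks)
val clusterId = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
val workspaceUrl = dbutils.notebook.getContext().apiUrl.get
val clusterLink = s"$workspaceUrl/#setting/clusters/$clusterId/configuration"
println(clusterLink)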

AWS EMR cluster via CloudFormation: not seeing HDFS on one

Okay, I have an EMR cluster which writes to HDFS, and I am able to view the directory and see the files via
hadoop fs -ls /user/hadoop/jobs
However, I am not seeing the /user/hive or jobs directory in Hadoop, even though it's supposed to be there.
I need to get into the spark shell and perform sparql, so I created an identical cluster with the same VPC, security groups, and subnet ID.
What I am supposed to see:
Why this is happening I am not sure, but I think this might be it? Any suggestions?
Could this be something to do with a stale rule?

How to configure Akka Pub/Sub to run on the same machine?

I am following the Distributed Publish Subscribe in Cluster example in Akka. However, I would like to run all the actors (publisher and subscribers) on the same node (my laptop). I am not sure I understand how to configure that; could somebody help me? Is it possible to use runOn, or should it be declared in a configuration file? Currently, I run into this error:
Caused by: akka.ConfigurationException: ActorSystem [akka://mySystem]
needs to have a 'ClusterActorRefProvider' enabled in the
configuration, currently uses [akka.actor.LocalActorRefProvider]
Your error is telling you what the problem is. In your application.conf you should set akka.actor.provider = "akka.cluster.ClusterActorRefProvider". If you want to run a one-node cluster on your laptop, you should also set akka.cluster.min-nr-of-members = 1.
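For example, here is a minimal Scala sketch that applies those settings programmatically for a single-node cluster on localhost. It assumes classic remoting (the akka-remote and akka-cluster artifacts on the classpath); port 2551 is an arbitrary choice of mine, and the system name matches the mySystem from your error:
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Single-node cluster on the local machine; everything else falls back to application.conf.
val config = ConfigFactory.parseString("""
  akka.actor.provider = "akka.cluster.ClusterActorRefProvider"
  akka.cluster.min-nr-of-members = 1
  akka.remote.netty.tcp.hostname = "127.0.0.1"
  akka.remote.netty.tcp.port = 2551
  akka.cluster.seed-nodes = ["akka.tcp://mySystem@127.0.0.1:2551"]
""").withFallback(ConfigFactory.load())

val system = ActorSystem("mySystem", config)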

Google Cloud Dataproc - Submit Spark Jobs Via Spark

Is there a way to submit Spark jobs to Google Cloud Dataproc from within the Scala code?
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("...")
What should the master URI look like?
What key-value pairs should be set to authenticate with an API key or keypair?
In this case, I'd strongly recommend an alternative approach. This type of connectivity has not been tested or recommended for a few reasons:
It requires opening firewall ports to connect to the cluster
Unless you use a tunnel, your data may be exposed
Authentication is not enabled by default
Is SSHing into the master node (the node which is named cluster-name-m) a non-starter? It is pretty easy to SSH into the master node to directly use Spark.