Spark job fails on image pull in Iguazio - mlops

I am using the code examples in the MLRun documentation for running a Spark job on the Iguazio platform. The docs say I can use a default Spark Docker image provided by the platform, but when I try to run the job the pod hangs with an ImagePullBackOff error. Here is the function spec item I am using:
my_func.spec.use_default_image = True
How do I configure Iguazio to use the default spark image that is supposed to be included in the platform?

You need to deploy the default images to the cluster Docker registry. There is one image for remote Spark and one for the Spark operator; each contains all the necessary dependencies for its runtime.
See the code below.
# This section has to be invoked once per MLRun/Iguazio upgrade

# Deploy the default image for the remote Spark runtime
from mlrun.runtimes import RemoteSparkRuntime
RemoteSparkRuntime.deploy_default_image()

# Deploy the default image for the Spark operator runtime
from mlrun.runtimes import Spark3Runtime
Spark3Runtime.deploy_default_image()
Once these images are deployed (to the cluster Docker registry), your function with use_default_image = True in its spec will be able to pull the image and deploy.
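For context, a minimal sketch of the consuming side, assuming a remote Spark function created with code_to_function and an Iguazio Spark service named "spark" (the file name, function name, and service name are placeholder assumptions, not values from the question):

import mlrun

# Hypothetical remote Spark function; file, name, and service are placeholders
my_func = mlrun.code_to_function(
    name="my-spark-func",
    kind="remote-spark",
    filename="spark_job.py",
)
my_func.with_spark_service(spark_service="spark")  # name of the Iguazio Spark service
my_func.spec.use_default_image = True              # now resolves to the deployed default image
my_func.run()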

Related

Get Databricks cluster ID (or get cluster link) in a Spark job

I want to get the cluster link (or the cluster ID to manually compose the link) inside a running Spark job.
This will be used to print the link in an alerting message, making it easier for engineers to reach the logs.
Is it possible to achieve that in a Spark job running in Databricks?
When a Databricks cluster starts, a number of Spark configuration properties are added. Most of them have names starting with spark.databricks. and you can find all of them in the Environment tab of the Spark UI.
The cluster ID is available as the spark.databricks.clusterUsageTags.clusterId property, and you can get it as:
spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
You can get the workspace host name via the dbutils.notebook.getContext().apiUrl.get call (in Scala), or dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get() (in Python).
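Putting the two together, a hedged Python sketch for composing the link (the cluster-page URL path is an assumption, so check it against how your workspace renders cluster URLs; note also that apiUrl returns the regional API host, which may differ from a vanity workspace URL):

# Cluster ID injected by Databricks into the Spark configuration
cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")

# Workspace host (regional API URL)
workspace_url = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()

# Assumed URL pattern for the cluster page; adjust to your workspace
cluster_link = f"{workspace_url}/#setting/clusters/{cluster_id}/configuration"
print(f"Cluster link for alerting: {cluster_link}")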

Google Cloud Composer pulls stale image from Google Container Registry

I'm trying to run an Airflow task through Google Cloud Composer with the KubernetesPodOperator in an environment built from an image in a private Google Container Registry. The Container Registry and Cloud Composer instances are under the same project, and everything worked fine until I updated the image the DAG refers to.
When I update the image in the Container Registry, Cloud Composer keeps using a stale image.
Concretely, in the code below
import datetime

import airflow
from airflow.contrib.operators import kubernetes_pod_operator

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

# Create the Airflow DAG for the pipeline
with airflow.DAG(
        'my_dag',
        schedule_interval=datetime.timedelta(days=1),
        start_date=YESTERDAY) as dag:

    my_task = kubernetes_pod_operator.KubernetesPodOperator(
        task_id='my_task',
        name='my_task',
        cmds=['echo', '0'],
        namespace='default',
        image=f'gcr.io/<my_private_repository>/<my_image>:latest')
if I update the image gcr.io/<my_private_repository>/<my_image>:latest in the Container Registry, Cloud Composer keeps using the stale image that is not present anymore in the Container Registry and throws an error.
Is this a bug?
Thanks a lot!
As mentioned in the documentation for KubernetesPodOperator, the default value for image_pull_policy is 'IfNotPresent'. You need to configure your pod spec to always pull the image.
The simplest way to do that is to set image_pull_policy to 'Always'.
A few more ways are mentioned in the Kubernetes Container Images documentation.
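Applied to the DAG from the question, that would look roughly like this (everything other than image_pull_policy is copied from the original snippet):

my_task = kubernetes_pod_operator.KubernetesPodOperator(
    task_id='my_task',
    name='my_task',
    cmds=['echo', '0'],
    namespace='default',
    image=f'gcr.io/<my_private_repository>/<my_image>:latest',
    # Re-pull the tag on every run instead of reusing the cached copy on the node
    image_pull_policy='Always')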

How do I make Scala code run on an EMR cluster using the SDK?

I wrote Scala code to launch a cluster in EMR. I also have a Spark application written in Scala, and I want to run that Spark application on the EMR cluster. Is it possible to do this from the first script (the one that launches the EMR cluster)? I want to do all of it with the SDK, not through the console or CLI. It has to be automated, not a one-off manual job (or at least minimize the manual work).
Basically:
Launch EMR cluster -> Run Spark job on EMR -> Terminate after the job finishes
How do I do it if possible?
Thanks.
// Build a step that runs spark-submit via command-runner.jar;
// params holds the spark-submit arguments (class, jar location, etc.)
HadoopJarStepConfig sparkStepConf = new HadoopJarStepConfig()
    .withJar("command-runner.jar")
    .withArgs(params);

final StepConfig sparkStep = new StepConfig()
    .withName("Spark Step")
    .withActionOnFailure("CONTINUE")
    .withHadoopJarStep(sparkStepConf);

// Add the step to an already-running cluster identified by clusterId
AddJobFlowStepsRequest request = new AddJobFlowStepsRequest(clusterId)
    .withSteps(new ArrayList<StepConfig>() {{ add(sparkStep); }});
AddJobFlowStepsResult result = emr.addJobFlowSteps(request);
return result.getStepIds().get(0);
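If you want the full launch-run-terminate flow in one script, a rough sketch of the same idea with the Python SDK (boto3) follows; the equivalent RunJobFlowRequest calls exist in the Java/Scala SDK. The release label, instance types, roles, main class, and S3 paths below are placeholder assumptions, not values from the question:

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

# Launch the cluster with a Spark step and let it terminate when the step is done
response = emr.run_job_flow(
    Name="my-spark-cluster",                      # placeholder name
    ReleaseLabel="emr-6.10.0",                    # assumed EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Terminate the cluster automatically once all steps have finished
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "Spark Step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--class", "com.example.MySparkApp",   # assumed main class
                    "s3://my-bucket/my-spark-app.jar",     # assumed jar location
                ],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])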
If you are just looking for automation, you should read about pipeline orchestration:
EMR is the AWS service that lets you run distributed applications.
AWS Data Pipeline is an orchestration tool that lets you run jobs (activities) on resources (EMR or even EC2).
If you'd just like to run a Spark job consistently, I would suggest creating a data pipeline and configuring it with one step: running the Scala Spark jar on the master node using a ShellCommandActivity. Another benefit is that the jar can be stored in AWS S3 (the object storage service); you just provide the S3 path to your Data Pipeline and it will pick up the jar, log onto the EMR cluster it has brought up (with the configuration you've provided), copy the jar to the master node, and run it with the configuration provided in the ShellCommandActivity. Once the job exits (successfully or with an error), it will shut down the EMR cluster so you aren't paying for it, and it will log the output.
Please read more about it: https://aws.amazon.com/datapipeline/ & https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
And if you'd like, you can trigger this pipeline via the AWS SDK or even set the pipeline to run on a schedule.

Unable to run spark-submit on a spark cluster running in docker container

I have a Spark cluster set up running on Docker, in which the following are running:
spark-master
three spark-workers (spark-worker-1, spark-worker-2, spark-worker-3)
To set up the Spark cluster I followed the instructions at:
https://github.com/big-data-europe/docker-spark
Now I want to launch a Spark application that can run on this cluster. For this, I am using bde2020/spark-scala-template and following the instructions at:
https://github.com/big-data-europe/docker-spark/tree/master/template/scala
But when I try to run the jar file, it starts running on the Spark master present in the bde2020/spark-scala-template image, not on the master of my cluster running in a different container.
Please help me to do that. I'm badly stuck.

spark-jobserver - managing multiple EMR clusters

I have a production environment that consists of several (persistent and ad-hoc) EMR Spark clusters.
I would like to use one instance of spark-jobserver to manage the job JARs for this environment in general, and be able to specify the intended master at the time I POST /jobs, rather than permanently in the config file (via the master = "local[4]" configuration key).
Obviously I would prefer to have spark-jobserver running on a standalone machine, and not on any of the masters.
Is this somehow possible?
You can write a SparkMasterProvider:
https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server/src/spark.jobserver/util/SparkMasterProvider.scala
A complex example is here https://github.com/spark-jobserver/jobserver-cassandra/blob/master/src/main/scala/spark.jobserver/masterLocators/dse/DseSparkMasterProvider.scala
I think all you have to do is write one that returns the config input as the Spark master; that way you can pass it as part of the job config.
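To make that concrete, a hedged Python sketch of what submitting a job with the intended master in the job config could look like. The REST call shape is standard spark-jobserver usage, but the spark.master key in the job config only takes effect once your custom SparkMasterProvider actually reads it, and the host names, app name, and class path below are placeholders:

import requests

JOBSERVER = "http://jobserver-host:8090"  # assumed address of the standalone jobserver machine

# Job configuration in Typesafe-config (HOCON) syntax; the custom
# SparkMasterProvider would pick up spark.master from here
job_config = """
spark.master = "spark://my-emr-master:7077"
input.path = "s3://my-bucket/input/"
"""

resp = requests.post(
    f"{JOBSERVER}/jobs",
    params={
        "appName": "my-app",               # JAR previously uploaded via POST /jars/my-app
        "classPath": "com.example.MyJob",  # assumed job class
        "sync": "false",
    },
    data=job_config,
)
print(resp.json())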