Azure Databricks - Running a Spark Jar from Gen2 Data Lake Storage - scala

I am trying to run a spark-submit from Azure Databricks. Currently I can create a job with the jar uploaded within the Databricks workspace and run it.
My queries are:
Is there a way to access a jar residing on a Gen2 Data Lake storage and do a spark-submit from the Databricks workspace, or even from Azure ADF? (Because the communication between the workspace and Gen2 storage is protected by "fs.azure.account.key".)
Is there a way to do a spark-submit from a Databricks notebook?

Is there a way to access a jar residing on a Gen2 Data Lake storage and do a spark-submit from the Databricks workspace, or even from Azure ADF? (Because the communication between the workspace and Gen2 storage is protected by "fs.azure.account.key".)
Unfortunately, you cannot access a jar residing on an Azure Storage account such as ADLS Gen2/Gen1.
Note: The --jars, --py-files, --files arguments support DBFS and S3 paths.
Typically, the Jar libraries are stored under dbfs:/FileStore/jars.
You need to upload the libraries to DBFS and pass them as parameters in the Jar activity.
For more details, refer "Transform data by running a jar activity in Azure Databricks using ADF".
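For illustration, one way to stage such a jar into DBFS is to copy it from a notebook running on a cluster that already has access to the ADLS account. This is only a sketch; the mount point and file names below are placeholders, not values from the question:

// Hypothetical paths: assumes the ADLS Gen2 container is already mounted at /mnt/datalake
// on a cluster configured with the storage key or a service principal.
dbutils.fs.cp(
  "dbfs:/mnt/datalake/libs/my-app.jar",   // jar sitting in ADLS Gen2, reachable via the mount
  "dbfs:/FileStore/jars/my-app.jar"       // DBFS location that jobs and --jars can use
)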
Is there a way to do a spark-submit from a Databricks notebook?
To answer the second question, you may refer to the job types below:
Reference: SparkSubmit and "Create a job"
Hope this helps.

Finally I figured out how to run this:
You can run a Databricks Jar activity from ADF and attach it to an existing cluster, which will have the ADLS key configured in its cluster configuration.
It is not possible to do a spark-submit from a notebook. But you can create a Spark job under Jobs, or you can use the Databricks Runs Submit API to do a spark-submit.
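As a rough sketch of that last option, the Runs Submit API (POST /api/2.0/jobs/runs/submit) accepts a spark_submit_task. The workspace URL, cluster spec, class name, and jar path below are placeholders, not values from the question:

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import scala.io.Source

// Placeholder workspace URL; the personal access token is kept outside the code.
val workspaceUrl = "https://<databricks-instance>"
val token        = sys.env("DATABRICKS_TOKEN")

// spark_submit_task.parameters mirrors the flags you would pass to spark-submit.
val payload =
  """{
    |  "run_name": "spark-submit-from-api",
    |  "new_cluster": {
    |    "spark_version": "6.4.x-scala2.11",
    |    "node_type_id": "Standard_DS3_v2",
    |    "num_workers": 2
    |  },
    |  "spark_submit_task": {
    |    "parameters": ["--class", "com.example.Main", "dbfs:/FileStore/jars/my-app.jar"]
    |  }
    |}""".stripMargin

val conn = new URL(s"$workspaceUrl/api/2.0/jobs/runs/submit")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("Authorization", s"Bearer $token")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))

// The response JSON contains a run_id that can be polled via /api/2.0/jobs/runs/get.
println(Source.fromInputStream(conn.getInputStream).mkString)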

Related

Custom spark log location configuration in Azure databricks

We execute Databricks notebook pipelines using Azure Data Factory. We have configured the 'Log delivery' option to get logs on DBFS. Currently, when two pipelines run simultaneously, we are not able to clearly segregate the logs per pipeline. Is it possible, using Spark, when the instance is readily available in Databricks, to point the log directory to, for example, /var/spark/{random_id}/logs/?

Mounting Azure Blob Storage to Azure Databricks without using cluster

We have a requirement that, while provisioning the Databricks service through a CI/CD pipeline in Azure DevOps, we should be able to mount a blob storage to DBFS without connecting to a cluster. Is it possible to mount object storage to DBFS by using a bash script from Azure DevOps?
I looked through various forums, but they all mention doing this using dbutils.fs.mount; the problem is that we cannot run this command in an Azure DevOps CI/CD pipeline.
Will appreciate any help on this.
Thanks
What you're asking is possible, but it requires a bit of extra work. In our organisation we've tried various approaches, and I've been working with Databricks for a while. The solution that works best for us is to write a bash script that makes use of the databricks-cli in your Azure DevOps pipeline. The approach we have is as follows:
Retrieve a Databricks token using the token API
Configure the Databricks CLI in the CI/CD pipeline
Use Databricks CLI to upload a mount script
Create a Databricks job using the Jobs API and set the mount script as file to execute
The steps above are all contained in a bash script that is part of our Azure DevOps pipeline.
Setting up the CLI
Setting up the Databricks CLI without any manual steps is now possible since you can generate a temporary access token using the Token API. We use a Service Principal for authentication.
https://learn.microsoft.com/en-US/azure/databricks/dev-tools/api/latest/tokens
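As a hedged sketch of that first step, a service principal can request an AAD access token for the Azure Databricks resource with the client-credentials grant; the environment-variable names below are assumptions, and the resulting access_token is then used as a Bearer token against the workspace (for example to call /api/2.0/token/create):

import java.net.{HttpURLConnection, URL, URLEncoder}
import java.nio.charset.StandardCharsets
import scala.io.Source

// Hypothetical service-principal credentials injected by the CI/CD pipeline.
val tenantId     = sys.env("AZURE_TENANT_ID")
val clientId     = sys.env("AZURE_CLIENT_ID")
val clientSecret = sys.env("AZURE_CLIENT_SECRET")

// 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d is the resource ID of the Azure Databricks service.
val form = Map(
  "grant_type"    -> "client_credentials",
  "client_id"     -> clientId,
  "client_secret" -> clientSecret,
  "resource"      -> "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"
).map { case (k, v) => s"$k=${URLEncoder.encode(v, "UTF-8")}" }.mkString("&")

val conn = new URL(s"https://login.microsoftonline.com/$tenantId/oauth2/token")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setDoOutput(true)
conn.getOutputStream.write(form.getBytes(StandardCharsets.UTF_8))

// The JSON response contains "access_token", which can then be used when
// configuring the Databricks CLI or calling the Token API in the pipeline.
println(Source.fromInputStream(conn.getInputStream).mkString)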
Create a mount script
We have a Scala script that follows the mount instructions. This can be Python as well. See the following link for more information:
https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html#mount-azure-data-lake-storage-gen2-filesystem.
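A minimal version of such a mount script might look like the following; the mount point, container, secret scope, and key names are placeholders, and the OAuth settings follow the ADLS Gen2 mount documentation linked above:

// Placeholder names throughout; secrets come from a Databricks secret scope ("ci-scope")
// rather than being hard-coded in the script.
val configs = Map(
  "fs.azure.account.auth.type"              -> "OAuth",
  "fs.azure.account.oauth.provider.type"    -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id"       -> dbutils.secrets.get("ci-scope", "sp-client-id"),
  "fs.azure.account.oauth2.client.secret"   -> dbutils.secrets.get("ci-scope", "sp-client-secret"),
  "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
)

// Only mount if the mount point is not already there, so the job stays idempotent.
if (!dbutils.fs.mounts().exists(_.mountPoint == "/mnt/data")) {
  dbutils.fs.mount(
    source       = "abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mountPoint   = "/mnt/data",
    extraConfigs = configs
  )
}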
Upload the mount script
In the Azure DevOps pipeline, the databricks-cli is configured by creating a temporary token using the Token API. Once this step is done, we're free to use the CLI to upload our mount script to DBFS or import it as a notebook using the Workspace API.
https://learn.microsoft.com/en-US/azure/databricks/dev-tools/api/latest/workspace#--import
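If you prefer to call the Workspace API directly rather than go through the CLI, the import endpoint expects the script as base64-encoded content; the local file name and target path below are assumptions:

import java.nio.file.{Files, Paths}
import java.util.Base64

// Hypothetical local file from the repo, imported as a Scala source notebook under
// /Shared so that jobs can see it regardless of user folders.
val content = Base64.getEncoder.encodeToString(
  Files.readAllBytes(Paths.get("mount_storage.scala")))

// Body for POST /api/2.0/workspace/import, sent with the same HTTP pattern as the
// token request above (Authorization: Bearer <token>, Content-Type: application/json).
val importPayload =
  s"""{
     |  "path": "/Shared/mount_storage",
     |  "format": "SOURCE",
     |  "language": "SCALA",
     |  "overwrite": true,
     |  "content": "$content"
     |}""".stripMargin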
Configure the job that actually mounts your storage
We have a JSON file that defines the job that executes the "mount storage" script. You can define a job to use the script/notebook that you've uploaded in the previous step. You can easily define a job using JSON, check out how it's done in the Jobs API documentation:
https://learn.microsoft.com/en-US/azure/databricks/dev-tools/api/latest/jobs#--
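As an example (names and cluster size are placeholders), a job definition that runs the imported mount notebook once on a small temporary cluster could look like this, POSTed to /api/2.0/jobs/create and then triggered with /api/2.0/jobs/run-now:

// Hypothetical job definition for the "mount storage" notebook imported above.
val mountJobJson =
  """{
    |  "name": "mount-adls-storage",
    |  "new_cluster": {
    |    "spark_version": "6.4.x-scala2.11",
    |    "node_type_id": "Standard_DS3_v2",
    |    "num_workers": 1
    |  },
    |  "notebook_task": {
    |    "notebook_path": "/Shared/mount_storage"
    |  }
    |}""".stripMargin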
At this point, triggering the job should create a temporary cluster that mounts the storage for you. You should not need to use the web interface, or perform any manual steps.
You can apply this approach to different environments and resource groups, as we do. For this we make use of Jinja templating to fill out variables that are environment- or project-specific.
I hope this helps you out. Let me know if you have any questions!

Spark Databricks local file API

I'm trying to build a summary/report of the processing done in Spark on Databricks.
I came across the piece of code below, which allows data to be written to DBFS as well as to ADLS (through a mount point). The issue arises when I package the code into a jar and try to execute it as a Databricks job: I get a FileNotFoundException, and I am now wondering how to write data to storage without using notebooks.
import java.io.File
import java.io.PrintWriter

// Writes through the local /dbfs FUSE path, which points at the DBFS mount /mnt/data.
val writer = new PrintWriter(new File("/dbfs/mnt/data/out-01/test-01"))
writer.write("Hello Developer, Welcome to Programming.")
writer.write("Hello Developer, Welcome to Programming 2.")
writer.close()
I came across DBUtils from Databricks, but haven't seen any sample code or documentation that I can use.
Any help on it will be appreciated.
If your notebook mounted ADLS to DBFS first (for example, with dbutils.fs.mount), then yes, you can write data directly to DBFS in the current session of your Databricks workspace.
So I think the necessary DBFS mount-point code is missing from the code you packaged into the jar file.
Please refer to the official documents below to see how to access ADLS Gen1 and Gen2 directly in your code:
Azure Data Lake Storage Gen1
Azure Data Lake Storage Gen2
Meanwhile, if you don't want to use the Databricks utilities, you can also change your code to use the ADLS SDK or REST APIs to write data without going through DBFS, and run it in Databricks.
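For reference, a minimal sketch of the direct-access approach from a packaged jar, using a Spark write over abfss:// instead of the local file API; the storage account, container, and environment-variable names are placeholders:

import org.apache.spark.sql.SparkSession

object WriteReport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Hypothetical account/container; in practice the key should come from a secret
    // scope or the cluster's Spark config rather than from source code.
    spark.conf.set(
      "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
      sys.env("ADLS_ACCOUNT_KEY"))

    // Writing through the abfss:// scheme needs no /dbfs path or mount point, so it
    // behaves the same when the code runs as a packaged jar job.
    Seq("Hello Developer, Welcome to Programming.").toDF("message")
      .write.mode("overwrite")
      .text("abfss://<container>@<storage-account>.dfs.core.windows.net/out-01/test-01")
  }
}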

Azure Data Factory using existing cluster in Databricks

I have created a pipeline in Azure Data Factory. I created a Databricks workspace, notebook (with some code), and a cluster. I created the connection from ADF to DB. I tested the connection. All lights are green. I published the ADF pipeline.
When I trigger the job, it says SUCCESS. But nothing happens in Databricks. No job is created in DB. The code in the notebook cell is apparently not executed. (I know this because the code prints the current time.)
Has anyone done this successfully?
To be clear, I want Data Factory to use an existing cluster in Databricks, not create a new one. I have named the cluster in the pipeline setup params.
Please reference this tutorial: Run a Databricks notebook with the Databricks Notebook Activity in Azure Data Factory.
In this tutorial, you use the Azure portal to create an Azure Data Factory pipeline that executes a Databricks notebook against the Databricks jobs cluster. It also passes Azure Data Factory parameters to the Databricks notebook during execution.
You perform the following steps in this tutorial:
Create a data factory.
Create a pipeline that uses Databricks Notebook Activity.
Trigger a pipeline run.
Monitor the pipeline run.
One difference is that you don't need to create a new job cluster; select 'use an existing cluster' instead.
Hope this helps.
Solved. The problem was that the notebook (containing my code) was within my User notebook folder. Data Factory did not have permission to see/use my notebook. I created the same notebook within the Shared folder and everything works fine.
I will point out that ADF should issue an error/warning if the named notebook cannot be seen or used. The ADF pipeline verified fine, reported a successful run, but just failed silently.

How to cache jars for DataProc Spark job submission

I am submitting a Spark job to Dataproc using either gcloud or the Google Cloud Dataproc API. One of the arguments is '--jars' (or its Java API equivalent), where I supply a comma-separated list of jar files to be provided to the executor and driver classpaths:
gs://google-storage-bucket/lib/x1.jar,gs://google-storage-bucket/lib/x2.jar, etc...
The same JAR files are copied from the Google storage bucket to the working directory for each SparkContext on the executor nodes every time I submit a job, and it takes about 2 minutes before the job really starts execution (I can see this on the Google Cloud console - https://console.cloud.google.com/dataproc/jobs/...).
Is it possible to somehow cache these jar files on Spark nodes and use them in the classpath with every job submission? That would save about 50% of the run time.
Thanks,
Victor
Indeed, if you pass in arguments of the form file:///your/path/on/the/cluster/nodes/filesystem then it will be interpreted as referring to files on the cluster nodes themselves.
You can either copy the files from GCS onto the nodes at cluster creation time using an initialization action, or run some kind of Spark job to do it on an existing cluster, and/or manually SSH in to stage those jars.