Mounting Azure Blob Storage to Azure Databricks without using cluster - azure-devops

We have a requirement that while provisioning the Databricks service thru CI/CD pipeline in Azure DevOps we should able to mount a blob storage to DBFS without connecting to a cluster. Is it possible to mount object storage to DBFS cluster by using a bash script from Azure DevOps ?
I looked thru various forums but they all mention about doing this using dbutils.fs.mount but the problem is we cannot run this command in Azure DevOps CI/CD pipeline.
Will appreciate any help on this.
Thanks

What you're asking is possible but it requires a bit of extra work. In our organisation we've tried various approaches and I've been working with Databricks for a while. The solution that works best for us is to write a bash script that makes use of the databricks-cli in your Azure Devops pipeline. The approach we have is as follows:
Retrieve a Databricks token using the token API
Configure the Databricks CLI in the CI/CD pipeline
Use Databricks CLI to upload a mount script
Create a Databricks job using the Jobs API and set the mount script as file to execute
The steps above are all contained in a bash script that is part of our Azure Devops pipeline.
Setting up the CLI
Setting up the Databricks CLI without any manual steps is now possible since you can generate a temporary access token using the Token API. We use a Service Principal for authentication.
https://learn.microsoft.com/en-US/azure/databricks/dev-tools/api/latest/tokens
Create a mount script
We have a scala script that follows the mount instructions. This can be Python as well. See the following link for more information:
https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html#mount-azure-data-lake-storage-gen2-filesystem.
Upload the mount script
In the Azure Devops pipeline the databricks-cli is configured by creating a temporary token using the token API. Once this step is done, we're free to use the CLI to upload our mount script to DBFS or import it as a notebook using the Workspace API.
https://learn.microsoft.com/en-US/azure/databricks/dev-tools/api/latest/workspace#--import
Configure the job that actually mounts your storage
We have a JSON file that defines the job that executes the "mount storage" script. You can define a job to use the script/notebook that you've uploaded in the previous step. You can easily define a job using JSON, check out how it's done in the Jobs API documentation:
https://learn.microsoft.com/en-US/azure/databricks/dev-tools/api/latest/jobs#--
At this point, triggering the job should create a temporary cluster that mounts the storage for you. You should not need to use the web interface, or perform any manual steps.
You can apply this approach to different environments and resource groups, as do we. For this we make use of Jinja templating to fill out variables that are environment or project specific.
I hope this helps you out. Let me know if you have any questions!

Related

Configure azure VM using terraform and ansible using azure cli and without ssh keys in azure pipelines

I'm using Azure Pipelines with terraform to create a dynamic infrastructure (unknown number of VMs, NIC..) in Azure.
I also tried to use Ansible in a separate stage to configure my VM, but I can't because the work is complicated because I had to create ssh keys and use them by Ansible, so I want to configure my Azure VM without using SSHAkey and use Azure CLI instead, I know that Ansible is agentless and use SSH to communicate with VM but I hope there is a module which allows us to configure VM using Azure CLI?
Alternatively, if anyone already has an existing project like this one and is well organised & less work, I would be grateful.

Job clusters in Databricks linked service Azure Data Factory are only uploading one init script

Job clusters in Databricks linked service Azure Data Factory are only uploading one init script even though I have two in my configuration. I believe this a recent bug in ADF as my setup was uploading the two scripts one week ago but it is not anymore. Also I tested Databricks clusters API and I can upload two scripts.
Databricks Init Scripts Set up in Azure Data Factory

How to add multiple containers in Azure Storage Container task in Azure Devops

I want to add multiple containers to Container Name section in Azure Storage container task in Azure Devops.
I want to create 5 containers and provide list of container name.
Thank you
You can't do that directly on the task, but this is doable on Agent job level.
Variable containersList should have comma separated list one,two,three,four and in Create blob task you refer to it as $(containersList)
However, in this way you will get Deploy json file run many times. If this is not what you need, please consider moving Create blob containers into separate job.
ANother option is to use Azure CLI task and az storage container create where you can just create foreach loop and call CLI command.

Azure Data Factory using existing cluster in Databricks

I have created a pipeline in Azure Data Factory. I created a Databricks workspace, notebook (with some code), and a cluster. I created the connection from ADF to DB. I tested the connection. All lights are green. I published the ADF pipeline.
When I trigger the job, it says SUCCESS. But nothing happens in Databricks. No job is created in DB. The code in the notebook cell is apparently not executed. (I know this because the code prints the current time.)
Has anyone done this successfully?
To be clear, I want Data Factory to use an existing cluster in Databricks, not create a new one. I have named the cluster in the pipeline setup params.
Please reference this tutorial: Run a Databricks notebook with the Databricks Notebook Activity in Azure Data Factory.
In this tutorial, you use the Azure portal to create an Azure Data Factory pipeline that executes a Databricks notebook against the Databricks jobs cluster. It also passes Azure Data Factory parameters to the Databricks notebook during execution.
You perform the following steps in this tutorial:
Create a data factory.
Create a pipeline that uses Databricks Notebook Activity.
Trigger a pipeline run.
Monitor the pipeline run.
One of the difference is you don't need to create new job cluster, select use an existing cluster.
Hope this helps.
Solved. The problem was that the notebook (containing my code) was within my User notebook folder. Data Factory did not have permission to see/use my notebook. I created the same notebook within the Shared folder and everything works fine.
I will point out that ADF should issue an error/warning if the named notebook cannot be seen or used. The ADF pipeline verified fine, reported a successful run, but just failed silently.

How to run a group of PowerShell scripts in Azure

I have a group of interdependent .ps1 scripts I want to run in Azure (trying to set up continuous deployment with git, Pester unit tests, etc., as outlined in this blog). How can I run these scripts in azure without needing to manage a server on which those scripts can run? E.g., can I put them in a storage account and execute them there, or do something similar?
Using an Azure automation account/runbook seems to be limited to a single script per runbook (granted, you can use modules, which is insufficient in my case).
Note that I need to use PowerShell version 5+ (I noticed Azure web apps and functions only have 4.x.)
Thanks in advance!
You were on the right track with Azure Functions. However, given that you need v5+ of PowerShell, you may want to look at Azure Container Instances (ACI) instead. It's a little different approach (via containers), but should not impose any limitations and will free you from having to manage a virtual machine.
Note: At this time ACI is in preview. Documentation is available here.
There is a PowerShell container image available on Docker Hub that you could start with. To execute multiple scripts in the container, you can override CMD in the docker file.