How do I perform this command in Azure PowerShell on my HDInsight cluster?

I have managed to run this command on my HDInsight cluster when I connect via SSH using the Azure CLI. However, I want to create an Azure PowerShell script that runs the following command, and I can't figure out how. I have searched online but can't find anything.
sudo -HE /usr/bin/anaconda/bin/conda install pandas

See the section titled "Apply a script action to a running cluster from Azure PowerShell" in this documentation. You will need to put your script in blob storage and then have the cluster execute it on each node using an HDInsight script action. The nice thing about script actions is that, when cluster maintenance patches the underlying servers and has to take down a node and bring up a new one (or when you scale the cluster), a persisted script action is also run on any new nodes.
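As a rough sketch of what the uploaded script could look like: the file name, the CONDA_BIN, PACKAGE, and DRY_RUN variables, and the run helper are my own assumptions, and only the conda path and package come from the question. Script actions run unattended, so the sketch adds conda's --yes flag to skip the confirmation prompt. By default it only prints the command; set DRY_RUN=0 in the copy you upload to blob storage. On the PowerShell side, the cmdlet the documentation uses for this is, as far as I know, Submit-AzHDInsightScriptAction, pointed at the blob URI of the script.

```shell
#!/bin/sh
# install-pandas.sh: hypothetical script action body (the file name is an assumption).
set -eu

# Conda path copied from the question; adjust it if your cluster image differs.
CONDA_BIN="${CONDA_BIN:-/usr/bin/anaconda/bin/conda}"
PACKAGE="${PACKAGE:-pandas}"

# run: print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# --yes answers conda's confirmation prompt, because nobody is there to
# type "y" when the script runs on each node.
install_package() {
  run sudo -HE "$CONDA_BIN" install --yes "$PACKAGE"
}

install_package
```

With the defaults this prints the command it would run, which makes it easy to check the script locally before handing it to the cluster.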

Related

How can I run a mongo setup script in a BitBucket Pipelines service container?

1. Does Bitbucket Pipelines spin up the service container with a build context of my source code? I ask because mongo images allow you to put setup scripts in a special folder to be run automatically.
2. Does Bitbucket Pipelines allow any other way to copy files or run commands on a service container? If the answer to question 1 is "no", I want to know whether I can copy my scripts over another way and run them within the service container.
3. Are there any other options for running my mongo shell script on the Bitbucket Pipelines service image, other than downloading and installing mongoshell on the primary build container and running it from there?

Mounting Azure Blob Storage to Azure Databricks without using cluster

We have a requirement that, while provisioning the Databricks service through a CI/CD pipeline in Azure DevOps, we should be able to mount blob storage to DBFS without connecting to a cluster. Is it possible to mount object storage to DBFS by using a bash script from Azure DevOps?
I looked through various forums, but they all suggest doing this with dbutils.fs.mount, and the problem is that we cannot run this command in an Azure DevOps CI/CD pipeline.
Will appreciate any help on this.
Thanks
What you're asking is possible, but it requires a bit of extra work. In our organisation we've tried various approaches, and I've been working with Databricks for a while. The solution that works best for us is a bash script that uses the databricks-cli in your Azure DevOps pipeline. Our approach is as follows:
1. Retrieve a Databricks token using the Token API
2. Configure the Databricks CLI in the CI/CD pipeline
3. Use the Databricks CLI to upload a mount script
4. Create a Databricks job using the Jobs API and set the mount script as the file to execute
The steps above are all contained in a bash script that is part of our Azure DevOps pipeline.
Setting up the CLI
Setting up the Databricks CLI without any manual steps is now possible since you can generate a temporary access token using the Token API. We use a Service Principal for authentication.
https://learn.microsoft.com/en-US/azure/databricks/dev-tools/api/latest/tokens
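A minimal sketch of the token step, assuming curl is available on the build agent and that you already hold an Azure AD access token for the service principal. The host value, the lifetime, and the function names are placeholders of mine, not part of our actual pipeline. The request goes to the Token API's create endpoint, and the response JSON should carry the new token in its token_value field.

```shell
#!/bin/sh
set -eu

# Placeholder values (assumptions); replace them with your workspace URL and lifetime.
DATABRICKS_HOST="${DATABRICKS_HOST:-https://example.azuredatabricks.net}"
TOKEN_LIFETIME_SECONDS="${TOKEN_LIFETIME_SECONDS:-3600}"

# Build the JSON body for POST /api/2.0/token/create.
# $1 is a free-text comment; it is not JSON-escaped, so keep it simple.
token_request_body() {
  printf '{"lifetime_seconds": %s, "comment": "%s"}' "$TOKEN_LIFETIME_SECONDS" "$1"
}

# Request a short-lived Databricks token.
# $1 is the Azure AD access token of the service principal.
create_token() {
  curl --silent --fail --request POST \
    --header "Authorization: Bearer $1" \
    --header "Content-Type: application/json" \
    --data "$(token_request_body 'ci-cd pipeline')" \
    "$DATABRICKS_HOST/api/2.0/token/create"
}

# Usage (not executed here): create_token "$AAD_ACCESS_TOKEN"
```

Exporting the returned token as DATABRICKS_TOKEN, together with DATABRICKS_HOST, is one way to configure the CLI without an interactive prompt.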
Create a mount script
We have a Scala script that follows the mount instructions. This can be Python as well. See the following link for more information:
https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html#mount-azure-data-lake-storage-gen2-filesystem
Upload the mount script
In the Azure DevOps pipeline, the databricks-cli is configured with a temporary token created through the Token API. Once this step is done, we're free to use the CLI to upload our mount script to DBFS or import it as a notebook using the Workspace API.
https://learn.microsoft.com/en-US/azure/databricks/dev-tools/api/latest/workspace#--import
Configure the job that actually mounts your storage
We have a JSON file that defines the job that executes the "mount storage" script. You can define a job that uses the script or notebook you uploaded in the previous step. Defining a job in JSON is straightforward; see the Jobs API documentation for how it's done:
https://learn.microsoft.com/en-US/azure/databricks/dev-tools/api/latest/jobs#--
At this point, triggering the job should create a temporary cluster that mounts the storage for you. You should not need to use the web interface, or perform any manual steps.
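The upload and job steps could be sketched roughly like this. It assumes the legacy databricks-cli (the pip package) with DATABRICKS_HOST and DATABRICKS_TOKEN already exported. The file names, the notebook path, and the run helper are hypothetical, and newer CLI versions use different flags, so check yours. By default the sketch only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
set -eu

# Hypothetical names; none of these come from our real pipeline.
MOUNT_SCRIPT_LOCAL="${MOUNT_SCRIPT_LOCAL:-mount_storage.py}"
MOUNT_NOTEBOOK_PATH="${MOUNT_NOTEBOOK_PATH:-/Shared/mount_storage}"
JOB_JSON="${JOB_JSON:-mount_job.json}"

# run: print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Import the local mount script into the workspace as a Python notebook.
upload_mount_script() {
  run databricks workspace import --language PYTHON --overwrite \
    "$MOUNT_SCRIPT_LOCAL" "$MOUNT_NOTEBOOK_PATH"
}

# Create the job from a JSON definition that points at the notebook.
create_mount_job() {
  run databricks jobs create --json-file "$JOB_JSON"
}

# Trigger the job; $1 is the job id returned by the create call.
run_mount_job() {
  run databricks jobs run-now --job-id "$1"
}

upload_mount_script
create_mount_job
# A real create call prints JSON containing the job id; pass that id here.
run_mount_job "${JOB_ID:-<job-id>}"
```

Keeping each step in its own function makes it easy to template the variables per environment, which is where Jinja comes in for us.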
You can apply this approach to different environments and resource groups, as do we. For this we make use of Jinja templating to fill out variables that are environment or project specific.
I hope this helps you out. Let me know if you have any questions!

How to copy yarn ssh logs automatically using scala to blob storage

We have a requirement to download the YARN logs to blob storage automatically. I found that the YARN logs do get added to the storage account under paths such as /app-logs/user/logs/, but they are in a binary format and there is no documented way to convert them to text. So we are trying to run the external command yarn logs -applicationId <application_id> from Scala at the end of our application run to capture the logs and save them to blob storage, but we are facing issues with that. We are looking for a way to get these logs copied to the storage account automatically as part of the Spark pipeline itself.
I tried redirecting the output of the yarn logs command to a temp file and then copying the file from local disk to blob storage. These commands work fine when I SSH into the head node of the Spark cluster and run them, but they do not work when executed from a Jupyter notebook or a Scala application.
("yarn logs -applicationId application_1561088998595_xxx > /tmp/yarnlog_2.txt") !!
("hadoop dfs -fs wasbs://dev52mss@sahdimssperfdev.blob.core.windows.net -copyFromLocal /tmp/yarnlog_2.txt /tmp/") !!
When I run these commands using jupyter notebook, the first command works fine to redirect to a local file but the second one to move the file to blob fails with the following error:
warning: there was one feature warning; re-run with -feature for details
java.lang.RuntimeException: Nonzero exit value: 1
at scala.sys.package$.error(package.scala:27)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:132)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:102)
... 56 elided
Initially I tried capturing the output of the command as a DataFrame and writing the DataFrame to blob storage. It succeeded for small logs, but for huge logs it failed with the following error (the code follows the message):
Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values
val yarnLog = Seq(Process("yarn logs -applicationId " + "application_1560960859861_0003").!!).toDF()
yarnLog.write.mode("overwrite").text("wasbs://container@storageAccount.blob.core.windows.net/Dev/Logs/application_1560960859861_0003.txt")
Note: You can directly access the log files using Azure Storage => Blobs => Select Container => app logs
Azure HDInsight stores its log files both in the cluster file system and in Azure storage. You can examine log files in the cluster by opening an SSH connection to the cluster and browsing the file system, or by using the Hadoop YARN Status portal on the remote head node server. You can examine the log files in Azure storage using any of the tools that can access and download data from Azure storage.
Examples are AzCopy, CloudXplorer, and the Visual Studio Server Explorer. You can also use PowerShell and the Azure Storage Client libraries, or the Azure .NET SDKs, to access data in Azure blob storage.
For more details, refer to "Manage logs for Azure HDInsight cluster".
Hope this helps.
Currently, you will need to use the 'yarn logs' command to view Yarn logs.
As for your requirement, there are two ways to achieve this:
Method 1:
Schedule a daily copy of the app-logs folder into a desired container within the blob storage. This will do a differential copy every day at a specific time of the day. For this one, I had to use Azure Data Factory to achieve the scheduling. Quite easy and no manual copy or coding required.
However, because the YARN application logs are stored in the TFile binary format and can only be read using the 'yarn logs' command, you will need another tool to read the files from the destination later on. You can use the tool here to read the files: https://github.com/shanyu/hadooplogparser
Alternatively, you can write your own simple script that converts the log to a readable file before the transfer. Sample script below:
yarn logs -applicationId application_15645293xxxxx > /tmp/source/applog_back.txt
hadoop dfs -fs wasbs://hdiblob@sandboxblob.blob.core.windows.net -copyFromLocal /tmp/source/applog_back.txt /tmp/destination
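If the temp file is a problem for large logs, a variation on the sample script is to stream the text output straight into blob storage. This is only a sketch: the container and account names, the yarn-logs-text folder, and the function names are placeholders I made up, and it relies on "hadoop fs -put -" reading from standard input. It does nothing unless you pass an application id as the first argument.

```shell
#!/bin/sh
set -eu

# Placeholder container and storage account names (assumptions).
CONTAINER="${CONTAINER:-mycontainer}"
ACCOUNT="${ACCOUNT:-mystorageaccount}"

# Destination blob path for one application's log; the folder name is hypothetical.
log_destination() {
  echo "wasbs://${CONTAINER}@${ACCOUNT}.blob.core.windows.net/yarn-logs-text/$1.txt"
}

# Stream the aggregated log as text into blob storage without a temp file.
# "-put -" reads from stdin and -f overwrites an existing blob.
# Note: in plain sh the pipeline's exit status is that of hadoop, not yarn.
copy_yarn_log() {
  yarn logs -applicationId "$1" | hadoop fs -put -f - "$(log_destination "$1")"
}

# Only act when an application id is passed as the first argument.
if [ "$#" -ge 1 ]; then
  copy_yarn_log "$1"
fi
```

Run it on a cluster node as, for example, sh copy_yarn_log.sh followed by your application id (the script name is again just an example).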
Method 2:
This is the simplest and cheapest method. You can disable the retention period of the YARN application logs, which means the logs will be retained indefinitely. To do this, change the config “yarn.log-aggregation.retain-seconds” to the value -1. This config can be found in yarn-site.xml.
Once this is done, you can read your YARN application logs from the cluster at any time using the YARN UI or CLI.
Hope this helps

How to run a group of PowerShell scripts in Azure

I have a group of interdependent .ps1 scripts I want to run in Azure (trying to set up continuous deployment with git, Pester unit tests, etc., as outlined in this blog). How can I run these scripts in Azure without needing to manage a server on which those scripts can run? E.g., can I put them in a storage account and execute them there, or do something similar?
Using an Azure automation account/runbook seems to be limited to a single script per runbook (granted, you can use modules, which is insufficient in my case).
Note that I need to use PowerShell version 5+ (I noticed Azure web apps and functions only have 4.x.)
Thanks in advance!
You were on the right track with Azure Functions. However, given that you need v5+ of PowerShell, you may want to look at Azure Container Instances (ACI) instead. It's a little different approach (via containers), but should not impose any limitations and will free you from having to manage a virtual machine.
Note: At this time ACI is in preview. Documentation is available here.
There is a PowerShell container image available on Docker Hub that you could start with. To execute multiple scripts in the container, you can override CMD in the Dockerfile.

Create multiple Azure VMs in parallel in a single PowerShell script

I am able to create an Azure VM using PowerShell.
I have to create 4 VMs in parallel.
Is there any feature in PowerShell for creating multiple VMs in parallel? Something like background jobs, or calling the same function for each VM using threads?
Have you considered VM Scale Sets? They automatically deploy VMs in parallel in a highly available configuration and make managing those VMs much easier (overview doc here: https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-overview). You can of course deploy a scale set or a bunch of VMs from PowerShell (doc for deploying a scale set via PowerShell here: https://learn.microsoft.com/en-us/azure/virtual-machines/windows/tutorial-create-vmss), but the PowerShell cmdlets require you to specify lots of related properties (e.g. virtual network, subnet, load balancer configs, etc.). The Azure CLI 2.0 (which you can use on both Windows and Linux!) gives lots of good defaults. For instance, in Azure CLI 2.0 this single command creates all of your VMs in parallel:
az vmss create --resource-group vmss-test-1 --name MyScaleSet --image UbuntuLTS --authentication-type password --admin-username azureuser --admin-password P@ssw0rd! --instance-count 4
(taken from the documentation here: https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-create#create-from-azure-cli)
Hope this helps! :)
No, there are no built-in Azure PowerShell cmdlets or features that do this for you. You can write your own routine for it; I use PS jobs.
You need to use Save-AzureRmContext and Import-AzureRmContext to authenticate PowerShell inside the jobs, or use some other form of automated login.
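For comparison, the same fan-out idea can be sketched with the Azure CLI and shell background jobs instead of PowerShell jobs. This is a different technique from the one I use, shown only to illustrate the pattern. The resource group, VM names, image alias, and run helper are placeholders, and image aliases vary between CLI versions. By default the sketch only prints the commands; set DRY_RUN=0 after az login to execute them.

```shell
#!/bin/sh
set -eu

# Placeholder values (assumptions); replace them with your own.
RESOURCE_GROUP="${RESOURCE_GROUP:-my-resource-group}"
VM_NAMES="${VM_NAMES:-vm1 vm2 vm3 vm4}"
IMAGE="${IMAGE:-Ubuntu2204}"   # image alias is an assumption; check your CLI version

# run: print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create a single VM; $1 is the VM name.
create_vm() {
  run az vm create --resource-group "$RESOURCE_GROUP" --name "$1" \
    --image "$IMAGE" --admin-username azureuser --generate-ssh-keys
}

# Start one background job per VM, then wait for all of them.
# Note: a bare "wait" does not report which individual jobs failed.
create_all() {
  for name in $VM_NAMES; do
    create_vm "$name" &
  done
  wait
}

create_all
```

Because the jobs run concurrently, the order of their output is not guaranteed.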
Thanks all, I have solved my issue using the PowerShell workflow parallel and sequence features.