List content of a directory in Spark code in Azure Synapse - scala

In Databricks Scala notebooks, the command dbutils.fs.ls lists the contents of a directory. However, I'm working in an Azure Synapse notebook, which doesn't have the dbutils package. What is the Spark equivalent of dbutils.fs.ls?
%%scala
dbutils.fs.ls("abfss://container@datalake.dfs.core.windows.net/outputs/wrangleddata")
%%spark
// list the content of a directory. ????

Just use mssparkutils; it's a rough equivalent of dbutils and is covered in the Microsoft Spark Utilities documentation for Azure Synapse. A simple example:
mssparkutils.fs.ls("/")
mssparkutils.fs.ls("abfss://container@datalake.dfs.core.windows.net/outputs/wrangleddata")

Related

Can't trigger a wheel in azure data factory

I created a Python wheel and tested it multiple times, both on my local machine and as a job on Azure Databricks, and it worked fine.
Now I'm trying to create an Azure Data Factory pipeline that triggers the wheel stored in Azure Databricks (dbfs:/..) every time a new file is stored in a blob storage container.
The wheel takes a parameter (-f) whose value is the new file name. In the previous tests I passed it to the wheel using argparse inside script.py and the parameters section of the Databricks job.
I created the pipeline and set two parameters, param and value, that I want to pass to the wheel; their values are -f and new-file.txt. See image here
Then I created a Databricks Python file in the ADF workspace and pasted the wheel path into the Python file section. Now I'm wondering if this is the right way to do this.
I passed the parameters as shown in the image below, and I didn't add any library because I had already attached the wheel in the upper section (I also tried adding the wheel as a library, but nothing changed). See image here
I created the trigger for blob storage and checked that the parameters exist in the trigger JSON file. When I tried to trigger the pipeline, I received this error: See image here
I checked the code for errors and changed the encoding to UTF-8, as suggested in other questions in the community, but nothing changed.
At this point, I think that either I didn't set up the blob storage trigger correctly or the wheel can't be attached the way I did it. I didn't add other resources in the workspace, so I only have the Databricks Python file.
Any advice is really appreciated,
thanks for the help!
If I understand correctly, your goal is to launch a wheel package from a Databricks Python notebook using Azure Data Factory, calling the notebook via the Databricks Python activity.
I think the problem you are facing occurs when calling the Python wheel from the notebook.
Here is an example I tried that is close to your needs, and it worked fine.
I created a hello.py script and put it in the path /dbfs/FileStore/jars/
Here is the content of hello.py (it just prints the provided arguments):
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('-f', help='file', type=str)
args = parser.parse_args()
print('You provided the value : ', args.f)
I created a Python notebook on Databricks that takes arguments and passes them to the hello.py script.
This code defines the parameters that the notebook can take (they correspond to the parameters you pass via Azure Data Factory when calling the Databricks activity):
dbutils.widgets.text("value", "new_file.txt")
dbutils.widgets.text("param", "-f")
This code retrieves the parameters passed to the Databricks notebook:
param = dbutils.widgets.get("param")
value = dbutils.widgets.get("value")
And finally we call the hello.py script to execute our custom code as follows:
!python /dbfs/FileStore/jars/hello.py $param $value
Pay attention to the ! at the beginning.
Hope this helps, and don't forget to mark the answer :)
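The !python line relies on notebook shell magic. As a minimal sketch, the same call can be made from plain Python with subprocess. The helper name run_script and the temporary copy of hello.py are hypothetical and exist only to keep the example self-contained; on Databricks the script path would be the /dbfs/FileStore/jars/hello.py location used above.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for the hello.py script above, written to a temporary folder
# so the sketch runs anywhere (hypothetical location, not /dbfs/...).
HELLO_SOURCE = """\
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('-f', help='file', type=str)
args = parser.parse_args()
print('You provided the value : ', args.f)
"""


def run_script(script_path, param, value):
    """Run a Python script with one flag/value pair and return its stdout."""
    result = subprocess.run(
        [sys.executable, str(script_path), param, value],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "hello.py"
    script.write_text(HELLO_SOURCE)
    # In the notebook, these two values would come from dbutils.widgets.get.
    output = run_script(script, "-f", "new_file.txt")
    print(output)
```

Passing the arguments as a list avoids shell quoting problems when the file name contains spaces, and check=True surfaces a failing script as an exception instead of a silent non-zero exit.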

Azure Databricks: How to delete files of a particular extension outside of DBFS using python

I am able to delete a file of a particular extension from the directory /databricks/driver using the bash command in databricks.
%%bash
rm /databricks/driver/file*.xlsx
But I am unable to figure out how to access and delete a file outside of DBFS in a Python script.
I think we cannot access files outside of DBFS using dbutils, and the command below outputs False because it's looking in DBFS.
dbutils.fs.rm("/databricks/driver/file*.xlsx")
I am eager to be corrected.
Not sure how to do it using dbutils, but I am able to delete it using glob:
import os
from glob import glob
for file in glob('/databricks/driver/file*.xlsx'):
    os.remove(file)
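The same idea can be wrapped in a small reusable function with pathlib. This is only a sketch: the function name delete_matching is hypothetical, and the demo runs against a temporary folder rather than /databricks/driver so that it is safe to try anywhere.

```python
import tempfile
from pathlib import Path


def delete_matching(directory, pattern):
    """Delete files in `directory` whose names match `pattern`.

    Returns the sorted names of the files that were removed.
    """
    deleted = []
    # Materialize the matches first so we don't delete while iterating.
    for path in list(Path(directory).glob(pattern)):
        if path.is_file():
            path.unlink()
            deleted.append(path.name)
    return sorted(deleted)


with tempfile.TemporaryDirectory() as tmp:
    # Create two spreadsheets and one unrelated file to show the filter.
    for name in ("file1.xlsx", "file2.xlsx", "notes.txt"):
        (Path(tmp) / name).touch()
    removed = delete_matching(tmp, "file*.xlsx")
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    print(removed)
    print(remaining)
```

On Databricks the call would be delete_matching('/databricks/driver', 'file*.xlsx'), matching the path and pattern from the question.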

Can I include a variable in an `sh` command in zeppelin?

I'm using Zeppelin with Hadoop on a Spark cluster.
I'd like to run a command to check files on s3 and I'd like to use a variable.
This is my code
%sh
aws s3 ls s3://my-bucket/my_folder/
Can I replace my-bucket/my_folder/ with a variable?
What do you mean by "a variable"? A Python variable? If so, I'm not sure. But if you just want to pull the path out onto another line, you can use a shell variable:
%sh
export AWS_FOLDER=my-bucket/my_folder/
aws s3 ls s3://$AWS_FOLDER

Tagging Azure Resources from .csv

Is there an easy way to read a .csv in a VSTS pipeline from a PowerShell script?
I have a script that can tag Azure Resources and it gets the key-value pairs from a .csv file. It works a charm when running it locally and running:
$csv = Import-Csv "d:\tagging\tags.csv"
But I'm struggling to find a way to reference the .csv in VSTS (Devops Services). I've put the .csv with the script in the same repo/folder, and I've created an Azure PowerShell script task.
I need to know what the Import-Csv should look like if it's in VSTS. Do I need to add additional steps so that the agent downloads the .csv when running the script?
This is the current error:
The hosted agent can't find the file and reports "Could not find file 'D:\a_tasks\AzurePowerShell_72s1a1931b-effb-4d2e-8fd8-f8472a07cb62\3.1.6\tags.csv'."
Let's say you put the file in your repo in the location /AwesomeCSV/MyCSV.csv. Your CSV's location, from a build perspective, would be $(Build.SourcesDirectory)/AwesomeCSV/MyCSV.csv.
So basically, pass in $(Build.SourcesDirectory)/AwesomeCSV/MyCSV.csv to the script as an argument, or reference it as an environment variable in your script as $env:BUILD_SOURCESDIRECTORY.

Jenkins Powershell Output

I would like to capture the output of some variables to be used elsewhere in the job using Jenkins Powershell plugin.
Is this possible?
My goal is to build the latest tag somehow, and the PowerShell script was meant to achieve that. Outputting to a text file would not help, and environment variables can't be used because, unfortunately, the process is seemingly forked.
Besides EnvInject, another common approach for sharing data between build steps is to store results in files located in the job workspace.
The idea is to skip using environment variables altogether and just write/read files.
It seems that the only solution is to combine with EnvInject plugin. You can create a text file with key value pairs from powershell then export them into the build using the EnvInject plugin.
You should make the workspace persistent for this job; then you can save the data you need to a file. Other jobs can then access this persistent workspace, or use it as their own, as long as they are on the same node.
Another option would be to use Jenkins' built-in artifact retention. At the end of the job's configure page there is an option to retain files specified by a match (e.g. *.xml or last_build_number). These are then given a specific address that can be used by other jobs regardless of which node they are on; the address can be on the master or the node, IIRC.
For the simple case of wanting to read a single object from Powershell you can convert it to a JSON string in Powershell and then convert it back in Groovy. Here's an example:
def pathsJSON = powershell(returnStdout: true, script: "ConvertTo-Json ((Get-ChildItem -Path *.txt) | select -Property Name)");
def paths = [];
if(pathsJSON != '') {
    paths = readJSON text: pathsJSON
}
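One thing to watch with this round trip: as far as I know, when exactly one file matches, ConvertTo-Json emits a single JSON object instead of an array, so the consumer has to handle both shapes. Below is a small sketch of that normalization, written in Python rather than Groovy purely for illustration. The function name parse_names and the sample strings are hypothetical; they imitate the output shape of the command above.

```python
import json


def parse_names(stdout_text):
    """Return a list of file names from ConvertTo-Json style output."""
    text = stdout_text.strip()  # captured stdout usually ends with a newline
    if not text:
        return []  # nothing was printed, so nothing matched
    data = json.loads(text)
    if data is None:
        return []  # a JSON null also means no results
    if isinstance(data, dict):
        data = [data]  # single match: wrap the lone object in a list
    return [item["Name"] for item in data]


print(parse_names('[{"Name": "a.txt"}, {"Name": "b.txt"}]'))
print(parse_names('{"Name": "a.txt"}\n'))
print(parse_names(""))
```

The same three cases (empty output, one object, a list of objects) apply to the value returned by readJSON in the pipeline.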