How do I use an Airflow variable inside a Databricks notebook? - pyspark

I have a Databricks PySpark notebook that gets called from an Airflow DAG.
I created a variable in Airflow by going to Admin - Variables and added a key-value pair.
I cannot find a way to use that Airflow variable in Databricks.
Edit: adding a sample of my code.
notebook_task = {
    'notebook_path': '/Users/email@example.com/myDAG',
    'base_parameters': {
        "token": token
    }
}
and the operator is defined here:
opr_submit_run = DatabricksSubmitRunOperator(
    task_id='run_notebook',
    existing_cluster_id='xxxxx',
    run_name='test',
    databricks_conn_id='databricks_xxx',
    notebook_task=notebook_task
)
What ended up working is using base_parameters instead of notebook_params, as documented at https://docs.databricks.com/dev-tools/api/latest/jobs.html,
and accessing it from the Databricks notebook with
my_param = dbutils.widgets.get("token")
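For completeness, a minimal sketch of how the value stored under Admin -> Variables can be pulled into the DAG file and handed to the notebook through base_parameters (assuming the Variable key is token; Variable.get is the standard Airflow API for reading it):
from airflow.models import Variable

# Read the value stored in Airflow under Admin -> Variables (key assumed to be "token")
token = Variable.get("token")

notebook_task = {
    'notebook_path': '/Users/email@example.com/myDAG',   # same notebook as above
    'base_parameters': {
        "token": token   # read inside the notebook with dbutils.widgets.get("token")
    }
}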

Extending the answer provided by Alex, since this question was asked in the context of Apache Airflow executing a Databricks notebook.
The DatabricksRunNowOperator (available in the Databricks provider) has a notebook_params argument, a dict of keys to values for jobs with a notebook task, e.g. "notebook_params": {"name": "john doe", "age": "35"}. The map is passed to the notebook and is accessible through the
dbutils.widgets.get function. As Alex explained, you can access the value from the Databricks notebook as:
my_param = dbutils.widgets.get("key")
An example usage would be:
notebook_run = DatabricksRunNowOperator(
    task_id='notebook_run',
    job_id=42,  # an existing Databricks job with a notebook task
    notebook_params={"name": "john doe", "age": "35"},
)
The issue now is how to pass a value from an Airflow Variable rather than a static value. For that we need notebook_params to be a templated field so the Jinja engine will render the value. The problem is that notebook_params is not listed in template_fields.
To overcome this we can create a custom version of the operator:
class MyDatabricksRunNowOperator(DatabricksRunNowOperator):
    template_fields = DatabricksRunNowOperator.template_fields + ('notebook_params',)
Then we can use the macro {{ var.value.my_var }}, which will be templated at run time:
notebook_run = MyDatabricksRunNowOperator(
    task_id='notebook_run',
    job_id=42,
    notebook_params={"var_value": "{{ var.value.my_var }}"},
)
The operator will get the value of the my_var Variable and pass it to your notebook.
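Inside the notebook, the templated value then arrives like any other notebook parameter and can be read with a widget lookup (a one-line sketch, using the var_value key from the example above):
my_var_value = dbutils.widgets.get("var_value")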

If you set it as a parameter to the notebook call (parameters inside notebook_task), then you need to use the dbutils.widgets.get function and put something like this at the beginning of the notebook:
my_param = dbutils.widgets.get("key")
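If the notebook is also run interactively, it can help to register the widget with a default value before reading it (a small sketch; the key name and default are assumptions):
dbutils.widgets.text("key", "")          # create the widget with an empty default
my_param = dbutils.widgets.get("key")    # returns the value passed by the job, or the default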

Related

Terraform: getting project id from azuredevops_project resource. Error data.azuredevops_projects.projectname.projects is set of object with 1 element

I am trying to access an Azure DevOps project from Terraform using the azuredevops_projects data source. Using that project, I want to access repositories and create a new repo. But I get an error in the second data block where I try to assign project_id, even though the output block prints the correct details.
data "azuredevops_projects" "sampleproject"{
name = "sample"
}
output "projectdetails"{
value = [for obj in data.azuredevops_projects.sampleproject.projects : obj.name]
}
The error I receive here is: Incorrect attribute value type. data.azuredevops_projects.sampleproject.projects is set of object with 1 element:
data "azuredevops_git_repository" "samplerepo"{
project_id = [for obj in data.azuredevops_projects.sampleproject.projects : obj.name]
name = "Services-Template"
}
I am new to Terraform, just practicing this for learning purposes.
Thanks for your answers. I tried everything, but the solution below is what worked.
Initial outcome:
+ projectdetails = [
    + "74899dhjhjk-8909-4a97-9e9b-73488hfjikjd9",
  ]
Outcome after the solution below, which picks the single element out of the set:
project_id = element(data.azuredevops_projects.sampleproject.projects.*.project_id, 0)
which evaluates to "74899dhjhjk-8909-4a97-9e9b-73488hfjikjd9"

How to return data from azure databricks notebook in Azure Data Factory

I have a requirement where I need to transform data in Azure Databricks and then return the transformed data. Below is sample notebook code where I am trying to return some JSON.
from pyspark.sql.functions import *
from pyspark.sql.types import *
import json
import pandas as pd

# Define a dictionary containing ICC rankings
rankings = {'test': ['India', 'South Africa', 'England',
                     'New Zealand', 'Australia'],
            'odi': ['England', 'India', 'New Zealand',
                    'South Africa', 'Pakistan'],
            't20': ['Pakistan', 'India', 'Australia',
                    'England', 'New Zealand']}

# Convert the dictionary into a DataFrame
rankings_pd = pd.DataFrame(rankings)

# Rename the columns
rankings_pd.rename(columns={'test': 'TEST'}, inplace=True)
rankings_pd.rename(columns={'odi': 'ODI'}, inplace=True)
rankings_pd.rename(columns={'t20': 'twenty-20'}, inplace=True)

# Return the transformed data as a JSON string
# print(rankings_pd.to_json())
dbutils.notebook.exit(rankings_pd.to_json())
In order to achieve this, I created a job under a cluster for this notebook, and I also had to create a custom connector following this article: https://medium.com/@poojaanilshinde/create-azure-logic-apps-custom-connector-for-azure-databricks-e51f4524ab27. Using the connector with the API endpoints '/2.1/jobs/run-now' and then '/2.1/jobs/runs/get-output' in an Azure Logic App, I am able to get the return value, but sometimes, even though the job eventually executes successfully, I just get the status as running with no output. I need to get the output once the job has executed successfully with the transformation.
Please suggest a better way for this if I am missing anything.
It looks like dbutils.notebook.exit() only accepts a string, so you can return the value as a JSON string and convert it to a JSON object in Data Factory or Logic App. See https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-utils#--notebook-utility-dbutilsnotebook
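On the "status is still running" part: /2.1/jobs/runs/get-output only carries notebook_output once the run has reached a terminal state, so the caller has to poll. A minimal Python sketch of that loop (in a Logic App the same check would sit in an Until loop; the host, token, and job_id values are placeholders, and the requests library is assumed to be available):
import time
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"   # placeholder workspace URL
TOKEN = "<personal-access-token>"                        # placeholder PAT
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Trigger the job (job_id is a placeholder for the job wrapping the notebook)
run = requests.post(f"{HOST}/api/2.1/jobs/run-now",
                    headers=HEADERS, json={"job_id": 123}).json()
run_id = run["run_id"]

# Poll until the run reaches a terminal state, then read the notebook's exit value
while True:
    out = requests.get(f"{HOST}/api/2.1/jobs/runs/get-output",
                       headers=HEADERS, params={"run_id": run_id}).json()
    state = out["metadata"]["state"]["life_cycle_state"]
    if state in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        break
    time.sleep(10)

# notebook_output.result holds the string passed to dbutils.notebook.exit()
result_json = out["notebook_output"]["result"]
print(result_json)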

Run same Databricks notebook for different arguments concurrently?

The following code (not mine) is able to run NotebookA and NotebookB concurrently. I need some help to figure out how to pass multiple arguments to the same notebooks.
I want to pass this list of arguments to each notebook:
args = {}
args["arg1"] = "some value"
args["arg2"] = "another value"
If I wanted to pass the arguments above to each of the running notebooks, what would I need to amend in the code below?
Here is the working code:
from multiprocessing.pool import ThreadPool
pool = ThreadPool(10)
inputs = [("NotebookA", "NotebookB")]
run_in_parallel = lambda x: dbutils.notebook.run(x, 1800)

from concurrent.futures import ThreadPoolExecutor, wait
pool = ThreadPoolExecutor(3)
results = []
with ThreadPoolExecutor(3) as pool:
    for x in inputs:
        results.extend(pool.map(run_in_parallel, list(x)))
dbutils.notebook.run accepts a 3rd argument as well: a map of parameters (see the documentation for more details). So in your case, you'll need to change the definition of run_in_parallel to something like this:
run_in_parallel = lambda x: dbutils.notebook.run(x, 1800, args)
and the rest of the code should stay the same.
If you want to pass different arguments to different notebooks, then you'll need a list of tuples, and pass this list to map, like this:
data = [('notebook 1', {'arg1': 'abc'}), ('notebook2', {'arg1': 'def', 'arg2': 'jkl'})]
...
run_in_parallel = lambda x: dbutils.notebook.run(x[0], 1800, x[1])
with ThreadPoolExecutor(3) as pool:
    results.extend(pool.map(run_in_parallel, data))
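Since the original question was about running the same notebook with different arguments concurrently, here is a small sketch of that variant (the notebook path and parameter values are placeholders):
from concurrent.futures import ThreadPoolExecutor

# Each entry is one set of parameters for the same notebook
param_sets = [
    {"arg1": "some value", "arg2": "another value"},
    {"arg1": "other value", "arg2": "yet another value"},
]

# dbutils.notebook.run(path, timeout_seconds, arguments)
run_notebook = lambda params: dbutils.notebook.run("NotebookA", 1800, params)

with ThreadPoolExecutor(3) as pool:
    results = list(pool.map(run_notebook, param_sets))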

How do I get a variable out of PowerShell in a Jenkins declarative pipeline?

steps {
    script {
        env.StorysTested = ''
        try {
            powershell('''
                # some code here
                foreach ( $item in $Comments )
                {
                    # some code here
                    # assigning a new value to the StorysTested env variable
                    $env:StorysTested = "some value"
                }
                # the line below works fine and displays the value
                Write-Output "Stories tested : $env:StorysTested"
            ''')
            // below, a null value is displayed for StorysTested
            echo " From Groovy : ${env.StorysTested}"
        }
        catch(err) {
            throw err
        }
    }
}
I am using a Jenkins declarative pipeline.
In the code above I am trying to use the value of $env:StorysTested in Groovy, which was assigned in PowerShell. Is there any way I can retain a variable value that was assigned in PowerShell after the PowerShell execution is over? Storing it in an environment variable was one way I thought of, but clearly that didn't work.
If you set an environment variable using $env:StorysTested = "some value", this variable is stored for the PowerShell process and is not permanent or visible outside that process.
To create more permanent environment variables (i.e., user-level or machine-level) you need to use the .NET Framework and the SetEnvironmentVariable method:
[Environment]::SetEnvironmentVariable("StorysTested", "some value", "User")
or
[Environment]::SetEnvironmentVariable("StorysTested", "some value", "Machine")
To delete from within PowerShell, you use the same .NET method and assign a $null value to the variable like this:
[Environment]::SetEnvironmentVariable("StorysTested",$null,"User") # or "Machine" of course
Hope that helps

How to Define and Use Variables in Pymongo Aggregation Script

I am trying to learn about mongodb aggregation. I've been able to get the commands to work for a single output. I am now working on a pymongo script to parse through a dirty collection and output sterilised data into a clean collection. I am stuck on how to define variables properly so that I can use them in the aggregation command. Please forgive me if this turns out to be a trivial matter. But I've been searching through online documents for a while now, but I've not had any luck.
This is the script so far:
from pymongo import MongoClient
import os, glob, json
#
var_Ticker = "corn"
var_Instrument = "Instrument"
var_Date = "Date"
var_OpenPrice = "prices.openPrice.bid"
var_HighPrice = "prices.highPrice.bid"
var_LowPrice = "prices.lowPrice.bid"
var_ClosePrice = "prices.closePrice.bid"
var_Volume = "prices.lastTradedVolume"
var_Unwind = "$prices"
#
#
client = MongoClient()
db = client.cmdty
col_clean = var_Ticker + "_clean"
col_dirty = var_Ticker + "_dirty"
db[col_dirty].aggregate([{$project:{_id:0,var_Instrument:1,var_Date:1,var_OpenPrice:1,var_HighPrice:1,var_LowPrice:1,var_ClosePrice:1,var_Volume:1}},{$unwind:var_Unwind},{$out:col_clean}])
This is the error that I get:
>>> db[col_dirty].aggregate([{$project:{_id:0,var_Instrument:1,var_Date:1,var_OpenPrice:1,var_HighPrice:1,var_LowPrice:1,var_ClosePrice:1,var_Volume:1}},{$unwind:var_Unwind},{$out:col_clean}])
File "<stdin>", line 1
db[col_dirty].aggregate([{$project:{_id:0,var_Instrument:1,var_Date:1,var_OpenPrice:1,var_HighPrice:1,var_LowPrice:1,var_ClosePrice:1,var_Volume:1}},{$unwind:var_Unwind},{$out:col_clean}])
^
SyntaxError: invalid syntax
If I take out the variables and use the proper values, the command works fine.
Any assistance would be greatly appreciated.
In Python you must wrap a literal string like "$project" in quotes:
db[col_dirty].aggregate([{"$project":{"_id":0,var_Instrument:1 ...
The same goes for "_id", which is a literal string. This is different from how JavaScript treats dictionary keys.
Note that you should not put quotes around var_Instrument, since that is not a string literal, it's a variable whose value is a string.
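Putting it together, a minimal sketch of the corrected call using the variables defined in the question's script: the stage operators ($project, $unwind, $out) and _id are quoted because they are string literals, while the var_* names stay unquoted so that their string values become the document keys.
pipeline = [
    {"$project": {
        "_id": 0,
        var_Instrument: 1,
        var_Date: 1,
        var_OpenPrice: 1,
        var_HighPrice: 1,
        var_LowPrice: 1,
        var_ClosePrice: 1,
        var_Volume: 1,
    }},
    {"$unwind": var_Unwind},   # var_Unwind already holds the "$prices" string
    {"$out": col_clean},
]
db[col_dirty].aggregate(pipeline)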