Best way to pass parameters to SparkSubmitOperator - pyspark

So far I have been hard-coding all required variables in the script referenced by the "application" field; this, however, feels a bit hacky.
So for example:
spark_clean_store_data = SparkSubmitOperator(
    task_id="my_task_id",
    application="/path/to/my/dags/scripts/clean_store_data.py",
    conn_id="spark_conn",
    dag=dag,
)
So the question is: what is the most Airflow-idiomatic/proper way to provide SparkSubmitOperator with parameters such as input data and output file paths?

As per the documentation, you might consider using the following parameters of the SparkSubmitOperator:
files : a comma-separated string that allows you to upload files into the working directory of each executor
application_args : a list of strings that allows you to pass arguments to the application, as sketched below
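For example, a minimal sketch of wiring input and output paths through application_args (the argument names and paths here are illustrative, not part of your DAG):

spark_clean_store_data = SparkSubmitOperator(
    task_id="my_task_id",
    application="/path/to/my/dags/scripts/clean_store_data.py",
    conn_id="spark_conn",
    application_args=[
        "--input", "/data/raw/store_data",     # hypothetical input path
        "--output", "/data/clean/store_data",  # hypothetical output path
    ],
    dag=dag,
)

Inside clean_store_data.py the values arrive as ordinary command-line arguments, so they can be read with argparse (or sys.argv):

# clean_store_data.py (sketch)
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--input")
parser.add_argument("--output")
args = parser.parse_args()  # use args.input and args.output in your Spark job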

Related

How to pass the outputs from a Get Metadata stage and use them for file name comparison in a Databricks notebook

I have 2 Get Metadata stages in ADF fetching file names from 2 different folders. I need to use these outputs for a file name comparison in a Databricks notebook and return true if all the files are present.
How do I pass the output from the Get Metadata stages to Databricks, perform the string comparison, and return true if all the files are present (and false if even 1 file is missing)?
Please find the below answer, which I explain with 1 Get Metadata stage; the same can be replicated for more than one.
Create an ADF pipeline with the below activities.
Now in the Get Metadata activity, add childItems to the Field list as an argument, to pass the output of Get Metadata to the Notebook, as shown below.
In the Databricks Notebook activity, add the below parameter as a Base Parameter; it captures the output of Get Metadata and passes it as an input parameter to the notebook. This output is an object by default, but I converted it to a string so the file names can be accessed in the notebook, as shown below. The base parameter name (here metadata_output) must match the widget name used in the notebook.
@string(activity('Get Metadata1').output.childItems)
Now we are able to access the Get Metadata output as a string in the notebook.
import ast

# Files we expect to find, compared against the Get Metadata output.
required_filenames = ['File1.csv', 'File2.csv', 'File3.csv']
# Read the Get Metadata output via the widget named after the base parameter.
metadata_value = dbutils.widgets.get('metadata_output')
# Convert the string representation back into a list of dictionaries.
metadata_list = ast.literal_eval(metadata_value)
# Collect the file names returned from blob storage.
blob_output_list = []
for i in metadata_list:
    blob_output_list.append(i['name'])
# True only if every required file appears in the blob output.
validateif = all(item in blob_output_list for item in required_filenames)
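To hand the result back to the pipeline, the notebook can exit with the value; a sketch (the activity name Notebook1 is illustrative):

# Return the result to ADF; downstream activities can read it from the
# Notebook activity's output, e.g. @activity('Notebook1').output.runOutput
dbutils.notebook.exit(str(validateif))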
I tried the above approach and was able to solve the provided requirement. Hope this helps.

Pass string parameter to remote process in kdb

I am trying to pass a variable that is a string into the IPC query. This does not work for me.
Example:
[`EDD.RDB; "?[`tab;enlist(like;`OrderId;",("string Number),");();(?:;`Actions)]"]
I am trying to query this RDB for rows where OrderId like Number (a string).
Number is a parameter, but when I pass it as a string to the remote process it is no longer a string. I tried putting string in front but still got the same result.
What I want to pass to remote process is this:
Number:"abc"
"?[`tab;enlist(like;`OrderId;"abc");();(?:;`Actions)]"
EDIT: as you have updated your question.
It's hard to give a solid answer here as your example is lacking information.
What you have posted is not a valid IPC call in KDB+. I suspect what you may be trying to run is something like:
h(`EDD.RDB; "?[`tab;enlist(like;`OrderId;",("string Number),");();(?:;`Actions)]"])
Assuming Number is an int (e.g. Number:123) then in that case you could rewrite it as:
h(`EDD.RDB;"select distinct Actions from t where orderID like \"",string[Number],"\"")
Which is easier to read and work with. Assuming Number is defined on the client side then the above should return an answer.
If you do want to use functional form then you could try something like:
"?[`tab;enlist (like;`orderID;string[",string[Number],"]);1b;(enlist`Actions)!enlist`Actions]"
As your query string.
If Number is already a string on your process, e.g. Number:"123", then you should be able to use either:
h(`EDD.RDB;"select distinct Actions from tab where OrderId like \"",Number,"\"")
OR
h(`EDD.RDB;"?[`tab;enlist (like;`OrderId;string[",Number,"]);1b;(enlist`Actions)!enlist`Actions]")
Does the IPC query have to be a string? Passing parameters would be cleaner using the (func;params) syntax for IPC:
handleToRdb ({[number] ?[`tab;enlist(like;`OrderId;number);();(?:;`Actions)]};"abc")
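If the same query is run often, you could also define it once on the RDB and call it by name over IPC. A sketch (getActions is a hypothetical name, and h is an open handle to the RDB):

/ defined on the RDB
getActions:{[number] ?[`tab;enlist(like;`OrderId;number);();(?:;`Actions)]}

/ called from the client
h(`getActions;"abc")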

Use CSV values in JMeter as request path

I have a JMeter User Defined Variable holding a comma-separated value: ${countries} = IN,US,CA,ALL.
(I was first trying to get it as a list/array - [IN,US,CA,ALL].)
I want to use the variable to test a web service - GET /${country}/info. Is it possible using a ForEach Controller or Loop Controller?
The only thing is that I want to save or read it as IN,US,..,ALL and use each value in the request path.
Thanks
The CSV should be as per the format mentioned in the image attached.
Refer to this link on how to use a CSV file with JMeter: http://ivetetecedor.com/how-to-use-a-csv-file-with-jmeter/
Thread Group Settings
No. of threads: 1
Ramp-up period: 1
Loop Count: 4
Hope this will help.
The CSV config is a red herring; you don't need it.
You can use a Regular Expression Extractor to split the variable up into numbered variables (e.g. MyVar), using something like:
([^,]+)
This matches each item between the commas. (A pattern like (.+?)[,\n] would miss the final ALL, since the string has no trailing comma or newline.) With Match No. set to -1, the values are placed in variables like MyVar_1, MyVar_2, etc., which is as close to an array as JMeter natively understands.
You can then loop over the matches using MyVar_matchNr, and MyVar_1 to MyVar_n (you will need the __V() function to access the 'array' contents by a computed name).
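Putting the pieces together, the configuration would look roughly like this (a sketch; field labels vary slightly between JMeter versions):

Regular Expression Extractor (Apply to: JMeter Variable -> countries)
  Reference Name: MyVar
  Regular Expression: ([^,]+)
  Template: $1$
  Match No.: -1

ForEach Controller
  Input variable prefix: MyVar
  Output variable name: country

HTTP Request (inside the ForEach Controller)
  Path: /${country}/info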

How to load two sheets at once with xlsread

I currently have this code, which uses a set of prompts to load and assign the appropriate data:
full=xlsread(input('File Name for Full data?\n'),input('Sheet Name for full?\n'));
empty=xlsread(input('File Name for Empty data?\n'),input('Sheet Name for empty?\n'));
xx1=full(:,1);
yy1=full(:,2);
ff1=full(:,3);
xx2=empty(:,1);
yy2=empty(:,2);
ff2=empty(:,3);
However, since the full and empty sheets are both in one spreadsheet, I would like there to be only one prompt for the file and then a prompt for each sheet, so something like:
everything=xlsread(input('File Name for Full data?\n'),input('Sheet Name for full?\n'),input('Sheet Name for empty?\n'));
xx1=everything(:,1);
yy1=everything(:,2);
ff1=everything(:,3);
xx2=everything(:,4);
yy2=everything(:,5);
ff2=everything(:,6);
What can I do to make this work out?
Just make the input calls before you use xlsread (note the 's' option, which returns the typed text as a string instead of evaluating it):
filename = input('File Name for Full data?\n', 's');
fullSheet = input('Sheet Name for full?\n', 's');
emptySheet = input('Sheet Name for empty?\n', 's');
full = xlsread(filename, fullSheet);
empty = xlsread(filename, emptySheet);
xx1=full(:,1);
yy1=full(:,2);
ff1=full(:,3);
xx2=empty(:,1);
yy2=empty(:,2);
ff2=empty(:,3);
Though xlsread does not support this directly, you can create a wrapper that calls xlsread in the right way.
Basically, just ask for the input arguments that you need and, based on them, make the calls to xlsread.
It is indeed a weakness that you cannot read multiple sheets simultaneously, but xlsread is just a very basic command. Personally, I think it is a greater weakness that you can only read contiguous ranges.
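A minimal sketch of such a wrapper (the function name is illustrative):

function varargout = xlsreadsheets(filename, varargin)
% XLSREADSHEETS Read several sheets from one workbook.
%   [A, B, ...] = XLSREADSHEETS(FILE, SHEET1, SHEET2, ...) calls xlsread
%   once per sheet name and returns the numeric data from each sheet.
varargout = cell(1, numel(varargin));
for k = 1:numel(varargin)
    varargout{k} = xlsread(filename, varargin{k});
end
end

which reduces the prompts to a single file name plus one per sheet:

filename = input('File Name?\n', 's');
[full, empty] = xlsreadsheets(filename, input('Sheet Name for full?\n', 's'), input('Sheet Name for empty?\n', 's'));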

Auto Complete User Input PowerShell 2.0

I have a large list of data (over 1000 different values) and I want the user to be able to select certain values from the list from a PowerShell console.
What is the easiest way from within the console to allow the user to quickly select values?
I would like to do something like tab completion or the ability to use the arrow keys to scroll through the values but I am not sure how to do either of these things.
Any advice would be greatly appreciated.
PowerShell tab completion can be extended to custom parameters and parameter values (in v3). However, this is a property of advanced functions; you can use the ValidateSet attribute to do that.
Check the Technet help topic on advanced functions: http://technet.microsoft.com/en-us/library/hh847806.aspx
You can replace the TabExpansion (v2) or TabExpansion2 (v3) function in PowerShell to auto-complete parameter values outside of advanced functions. You can get the default definition in PowerShell v3 by running:
Get-Content function:TabExpansion2
Here is an example showing a custom tab expansion function:
http://www.powershellmagazine.com/2012/11/29/using-custom-argument-completers-in-powershell-3-0/
But if you want the user to be able to auto-complete values for a Read-Host kind of input, you need to write a proxy for Read-Host to achieve that.
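If a full Read-Host proxy is more than you need, a simpler stopgap is a prompt loop that narrows the list by prefix (a sketch; Read-Choice is a hypothetical helper, not a built-in, and this is filtering rather than true tab completion):

function Read-Choice {
    param (
        [string[]]$Values,
        [string]$Prompt = 'Select a value'
    )
    do {
        $answer = Read-Host $Prompt
        # -like applied to an array returns the matching elements
        $candidates = @($Values -like "$answer*")
        if ($candidates.Count -eq 1) { return $candidates[0] }
        Write-Host ('Matches: ' + ($candidates -join ', '))
    } while ($true)
}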
You can, optionally, look at PowerTab module at http://powertab.codeplex.com/
For folks who are looking for a way to do this and are fortunate enough to be using PS v3 (and my apologies for all those required to stay with V2):
The easiest way to achieve this is using the "ValidateSet" option in your input parameters.
function Show-Hello {
    param (
        [ValidateSet("World", "Galaxy", "Universe")]
        [String]$noun
    )
    $greetingString = "Hello, " + $noun + "!"
    Write-Host "`t=>`t" $greetingString "`t<="
}
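Called with a value from the set, it prints the greeting (output shown approximately):

PS> Show-Hello World
    =>     Hello, World!     <=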
ValidateSet throws an error if a user attempts to use any other input:
Show-Hello "Solar System"
Show-Hello : Cannot validate argument on parameter 'noun'. The argument `
"Solar System" does not belong to the set "World,Galaxy,Universe" specified `
by the ValidateSet attribute. Supply an argument that is in the set and `
then try the command again.
It also adds tab completion to your function for that parameter. And if it is the FIRST parameter of your function, you don't even have to type "-noun" for tab completion to suggest its values.