Passing a value back to ADF from a Databricks Python node - azure-data-factory

In Azure Data Factory I run Python code using a Databricks Python node. What I am trying to do is return a value from the Python script (e.g. print(value_to_return)) and use it to set the value of a variable which will be used by another node (Web).
Is there any way of returning a value from a Python script that is run from an ADF pipeline? I know I can do it using the Databricks Notebook node and adding dbutils.notebook.exit() at the end, but I am really trying to achieve this using the Python node.
The recurring solution I see is to write the value to a file or database table and then read it back and set the variable value that way.

The answer is no. You can only call a Python script; it cannot return a custom value to the output of the Python node in ADF to be used as an input to another node in the pipeline.
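If the value is needed downstream anyway, here is a minimal sketch of the file workaround mentioned in the question, assuming the script can reach an Azure Storage account via the azure-storage-blob package; the connection string, container and blob names are placeholders:

# Hypothetical workaround: persist the value so a later activity can read it back.
from azure.storage.blob import BlobClient

value_to_return = "42"  # whatever the script actually computes

blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",  # placeholder, supply securely
    container_name="pipeline-state",         # placeholder container
    blob_name="python-node-output.txt",      # placeholder blob
)
blob.upload_blob(str(value_to_return), overwrite=True)

A Lookup activity over a dataset pointing at that blob can then read the value back into a pipeline variable for the downstream Web activity.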

Related

Persistable key value pair storage in Synapse or ADF

I am using Synapse and have a lot of scenarios where I need to read a value at the beginning of a pipeline and then save a value at the end of the pipeline as a key value pair (KVP). For example, when the pipeline begins I read a value from a KVP store to get the max date from the last time the pipeline ran, and I use that value to get all rows from a table whose modified date is greater than or equal to that datetime. When the pipeline finishes doing what it has to do, I save the max modified date from this run. Wash, rinse, repeat. I have a few ideas, like a parquet file or Redis (this seems a bit much). Just trying to see if anyone has come up with a more elegant/simple approach.
You can use Global Parameters, which can be referenced in different pipelines, and their values can be modified at run time.
Go to Manage in Azure Data Factory and click on Global Parameters in the left panel options. Then click on + New.
Create a new Global Parameter.
Later, you can use this global parameter in any pipeline and change its value at run time.
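Once the global parameter exists, it can be read in any pipeline expression; the parameter name LastRunDate below is only an example:

@pipeline().globalParameters.LastRunDate

Changing its value at run time happens outside the pipeline expression language, for example from the Manage hub or through the Data Factory management API.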

ADF - built-in copy task to create folders based on variables

I am using Azure Data Factory's built-in copy task, set on a daily schedule, to copy data into a container in Azure Data Lake Storage Gen2. For my destination, I'm trying to use date variables to create a folder structure for the data. In the resulting pipeline the expression looks like this:
Dir1/Dir2/#{formatDateTime(pipeline().parameters.windowStart,'yyyy')}/#{formatDateTime(pipeline().parameters.windowStart,'MM')}/#{formatDateTime(pipeline().parameters.windowStart,'dd')}
Unfortunately this is throwing an error:
Operation on target ForEach_h33 failed: Activity failed because an inner activity failed; Inner activity name: Copy_h33, Error: The function 'formatDateTime' expects its first parameter to be of type string. The provided value is of type 'Null'.
Everything I've created was generated by the tool; the folder path I used when following the tool was as suggested:
Dir1/Dir2/{year}/{month}/{day} (I was then able to set the format of each variable - e.g. yyyy, MM, dd - which suggests the tool understood what I was doing).
The only other thing I can think of is that the folder structure in the container only contains Dir1/Dir2/ - I am expecting the subdirectories to be created as the copy task runs.
I'll also add that everything runs fine if I just use the directory Dir1/Dir2/ - so the issue is with my variables.
There is nothing wrong with your built-in copy task. It will give the correct result when it runs at its scheduled trigger time.
But if you run it with a manual trigger, it will give an error like the one above.
When you run it with a manual trigger you must supply the windowStart parameter value yourself; the error is saying that the value is null.
Give the value in MM/DD/YYYY format. The schedule trigger supplies this value automatically when it runs daily, but in manual triggering we have to specify it.
Then click OK. Now you will not get that error, and you can create folders in the year/month/day format.
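As a concrete illustration (the date is arbitrary): supplying windowStart as 06/01/2022 makes the sink path expression from the question evaluate to:

Dir1/Dir2/2022/06/01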

How to use the User Input node in SPSS Modeler?

I am trying to make my SPSS flow a bit more dynamic. Although there is no way (that I am aware of) to take an input directly, is there a way to use the User Input node to take some parameter values and use those parameters in another Select node to perform tests on the data?
What I am trying to achieve is to run the same flow with some minor parameter changes. I have the flow running with some static values used in Select nodes. It would be great if there were a way to change these static Select nodes, even if it's not with a User Input node.
I believe what you are looking for is actually to create a parameter in SPSS Modeler and then enable/check the Prompt? option, which will ask the user for input every time the stream is executed.
In SPSS Modeler 18.2 onwards, I was able to see the Prompt option as the first item on the Parameters tab.
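For the Select node itself, a stream parameter can then be referenced inside the node's CLEM condition; the field name Age and parameter name min_age here are only illustrative:

Age >= '$P-min_age'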

Azure Data Factory - Can I set a variable from within a Copy Data task or by using the output?

I have a simple pipeline that has a Copy activity to populate a table. That task is based on a query and will only ever return one row.
The problem I am having is that I want to reuse the value from one of the columns (batch number) to set a variable so that at the end of the pipeline I can use a Stored Procedure to log that the batch was processed. I would rather avoid running the query a second time in a Lookup task, so can I make use of the data already being returned?
I have tried duplicating the column in the Copy activity and then mapping it to something like #BatchNo, but that fails. I have even tried adding a Set Variable task, but I can't figure out how to take a single column; #{activity('Populate Aleprstw').output} does not error, but I am not sure what that will actually do in this case.
Thanks, and sorry if it's a silly question.
Cheers
Mark
I always do it like this:
Generate a batch number (usually with a proc)
Use a lookup to grab it into a variable
Use the batch number in all activities (might be multiple copies, procs, etc.)
Write the batch completion
From your description it seems you have the batch number embedded in the data copy from the start, which is not typical.
If you must do it this way, is there really an issue with running a lookup again?
The Copy activity doesn't return data like that, so you won't be able to capture the results that way. With this design, running the query again in a Lookup is the best option.
Is the query in the Source running on the same Server as the Sink? If so, you could collapse the entire operation into a Stored Procedure that returns the data point you are trying to capture.
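If you do go the Lookup route, a Set Variable activity can pull the single column out of the Lookup's first row with an expression like the following, where the activity and column names are illustrative:

@activity('Lookup BatchNo').output.firstRow.BatchNo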

Do pipeline variables persist between runs?

I'm doing a simple data flow pipeline between two Cosmos DBs. The pipeline starts with the data flow, which grabs the pipeline variable "LastPipelineStartTime" and passes it as a parameter to the data flow for the query to use in order to get all new data where c._ts >= "LastPipelineStartTime". Then, on data flow success, it updates the variable via Set Variable to pipeline().TriggerTime. Essentially, I'm always grabbing new data between pipeline runs.
My question is: it looks like the variable reverts back to its default value of 0 during each debug run, and so the pipeline grabs everything each time. Am I misunderstanding or using pipeline variables wrong? Thanks!
As far as I know, a variable set in the Set Variable activity has its own life cycle: the current execution of the pipeline. Any change to the variable does not persist into the next execution.
To implement your needs, please refer to my workarounds below (example expressions follow the list):
1. If you execute the ADF pipeline on a schedule, you could just pass the schedule time as a parameter into it to make sure you grab new data.
2. If the frequency is random, persist the trigger time somewhere else (e.g. a simple file in blob storage), and before the data flow activity use a Lookup activity to grab that time from the blob storage file.
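For workaround 1, a schedule trigger exposes its fire time as a system variable that can be passed into a pipeline parameter when the trigger invokes the pipeline:

@trigger().scheduledTime

For workaround 2, the persisted value can be read back with a Lookup activity over a dataset pointing at that blob file, and the data flow parameter can then be set from the Lookup's firstRow output.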