My use case is the following: I am building an Oozie pipeline and I need to pass it an argument.
My Spark job must receive a string date as an argument, and it would be great to pass that argument to the Oozie workflow so it can be used in the spark-submit. Does anyone have an idea? I didn't find the answer on Google.
Thanks
Create a workflow.xml that references some variable inputDate.
Create a job.properties file that defines a default value for inputDate.
Run your job using the CLI, overriding the default value when needed:
oozie job -run -config job.properties -DinputDate=2017-08-19
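As a rough sketch (the application path, class and jar names below are placeholders, not taken from your project), the two files could look like this:

job.properties:

nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
oozie.wf.application.path=${nameNode}/apps/spark-date-wf
# default value, overridden at submit time with -DinputDate=...
inputDate=2017-01-01

workflow.xml (Spark action fragment that forwards the value to spark-submit):

<action name="spark-node">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn-cluster</master>
        <name>spark-with-date</name>
        <class>com.example.MySparkJob</class>
        <jar>${nameNode}/apps/spark-date-wf/lib/my-spark-job.jar</jar>
        <arg>${inputDate}</arg>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>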
I created and tested a Python wheel multiple times on my local machine and on Azure Databricks as a job, and it worked fine.
Now I'm trying to create an Azure Data Factory pipeline that triggers the wheel stored in Azure Databricks (dbfs:/..) every time a new file is stored in a blob storage container.
The wheel takes a parameter (-f) whose value is the new file name. In the previous tests I passed it to the wheel using argparse inside script.py and the parameters section of the Databricks job.
I created the pipeline and set two parameters, param and value, that I want to pass to the wheel, with the values -f and new-file.txt. See image here
Then I created a Databricks Python file in the ADF workspace and pasted the wheel path into the Python file section. Now I'm wondering if this is the right way to do this.
I passed the parameters in the way you can see in the image below, and I didn't add any library since I've already attached the wheel in the upper section (I also tried to add the wheel as a library, but nothing changed). See image here
I've created the trigger for blob storage and I've checked that the parameters exist in the trigger JSON file. Trying to trigger the pipeline, I received this error: See image here
I checked for errors in the code and changed the encoding to UTF-8 as suggested in other questions in the community, but nothing changes.
At this point, I think that I either didn't trigger the blob storage correctly or the wheel can't be attached in the way I've done it. I didn't add other resources in the workspace, so I only have the Databricks Python file.
Any advice is really appreciated,
thanks for the help!
If I understand correctly, your goal is to launch a wheel package from a Databricks Python notebook using Azure Data Factory, calling the notebook via the Databricks Python activity.
I think the problem you are facing is in calling the Python wheel from the notebook.
Here is an example I tried that is close to your needs, and it worked fine.
I created a hello.py script and put it on the path /dbfs/FileStore/jars/
Here is the content of hello.py (it just prints the provided argument):
import argparse

# declare the -f flag so the file name can be passed on the command line
parser = argparse.ArgumentParser()
parser.add_argument('-f', help='file', type=str)
args = parser.parse_args()
print('You provided the value : ', args.f)
I created a Python notebook on Databricks that takes arguments and passes them to the hello.py script.
This code defines the parameters that the notebook can take (which correspond to the parameters you pass via Azure Data Factory when calling the Databricks activity):
dbutils.widgets.text("value", "new_file.txt")
dbutils.widgets.text("param", "-f")
This code retrieves the parameters passed to the Databricks notebook:
param = dbutils.widgets.get("param")
value = dbutils.widgets.get("value")
And finally we call the hello.py script to execute our custom code as follows:
!python /dbfs/FileStore/jars/hello.py $param $value
Pay attention to the ! at the beginning.
Hope this helps your needs, and don't forget to mark the answer :)
I created a job that should be reusable for new files. All of the activities in the job, the mappings and everything else remain the same except for the file name. I already tried it once, but it seems that I need to re-load the file and remap everything again, which is inefficient. Is there any way for me to pass a different file to a job without remapping, reconfiguring and reloading anything?
You have multiple options for allowing a DataStage parallel job to use a different filename for input on each job run:
When using either the Sequential File stage or the File Connector stage, instead of typing the actual filename you can enter the name of a job parameter that has been defined on the Parameters tab of the job properties dialog. For example, if you define a string parameter myFile, then in the filename field of the input stage you would enter #myFile#, and at job run time that is replaced by the current value of the myFile parameter. If you run the job manually from the Director/Designer clients, the job run dialog lets you specify a value for each job parameter. If you start the job via the dsjob command, there are options to pass job parameters on the command line, as in the sketch below. You also have the option to use parameter set files that you can modify prior to the job run.
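For example, a dsjob invocation might look like this (the project, job and file names are placeholders):

dsjob -run -param myFile=/data/my_input_files/daily_extract.txt MyProject MyParallelJob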
Another option would be to use a file location and pattern instead of a specific file name. Both the Sequential File stage and the File Connector stage let you specify a pattern, for example: /data/my_input_files/*.txt
Then each time you run the job it will read any files at that location matching the pattern, so it can process multiple files. However, to prevent re-processing files from prior runs, you will want to clean up the files at that location after the job completes. When you have new files to process, just put them in that directory and re-run the job.
If all the files share the same data structure, you only need to implement one parallel job. And if the file names follow a similar pattern, such as 1234ab.xls, 1234vd.xls, 1234gd.xls, ..., you can pass the file name as 1234??.xls in the file name parameter of the sequence job that runs the parallel job (and use that parameter as the file name in the parallel job).
I'm using Talend Open Studio for ESB ver. 6.3.1 and created jobs that pull data from ALM into MongoDB. In them I used tRESTClient with a query parameter (date) to pull data from ALM and tMongoDBOutput to insert the ALM data. After that I built the job and imported it into Eclipse as a Java project. I tried to run the program with the 'Run as Java Application' option and it works fine.
In the above job I gave the query parameter value directly, like 'tRESTClient --> Basic settings --> Query parameters --> name = "query" & value = "{last-modified[>=(2017-04-19 13:02:15)]}"', so the job pulls records based on that query parameter value.
Now I have generated the Eclipse Talend job as a runnable jar file and am trying to pass the query parameter value from CMD as a parameter.
How can I pass the query parameter value as a parameter from CMD?
In the Windows command prompt you can write a command of the following form.
For example, if you have a jar file named test.jar and two parameters, name and query, you can run:
javaw -jar test.jar -n 'name' -q 'your query'
Then you can handle these two parameters, i.e. name and query, in your main method.
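As a minimal sketch (the class name and the flag handling below are only an illustration, not code generated by Talend):

public class TestRunner {
    public static void main(String[] args) {
        String name = null;
        String query = null;
        // walk the argument list and pick up the value that follows each flag
        for (int i = 0; i < args.length - 1; i++) {
            if ("-n".equals(args[i])) {
                name = args[i + 1];
            } else if ("-q".equals(args[i])) {
                query = args[i + 1];
            }
        }
        System.out.println("name  = " + name);
        System.out.println("query = " + query);
        // hand the query value to the tRESTClient query parameter from here
    }
}

Note that if the query is defined as a context variable in the Talend job, you can usually pass it on the command line as --context_param paramName=value instead of defining your own flags.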
I would like to capture the output of some variables to be used elsewhere in the job using the Jenkins PowerShell plugin.
Is this possible?
My goal is to build the latest tag somehow, and the PowerShell script was meant to achieve that. Outputting to a text file would not help, and environment variables can't be used because the process is seemingly forked, unfortunately.
Besides EnvInject, the other common approach for sharing data between build steps is to store results in files located in the job workspace.
The idea is to skip using environment variables altogether and just write and read files.
It seems that the only solution is to combine this with the EnvInject plugin. You can create a text file with key=value pairs from PowerShell and then export them into the build using the EnvInject plugin.
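For example (a sketch only: the git command assumes the workspace is a git checkout, and the file name build.properties is arbitrary):

# resolve the most recent tag and write it as a KEY=VALUE pair into the workspace
$latestTag = git describe --tags --abbrev=0
"LATEST_TAG=$latestTag" | Out-File -FilePath build.properties -Encoding ASCII

Then add an "Inject environment variables" build step (EnvInject) pointing at build.properties, and later build steps can read $env:LATEST_TAG (or %LATEST_TAG% in batch steps).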
You should make the workspace persistent for this job; then you can save the data you need to a file. Other jobs can then access this persistent workspace, or use it as their own, as long as they are on the same node.
Another option would be to use Jenkins' built-in artifact retention: at the end of the job's configure page there is an option to retain files matching a pattern (e.g. *.xml or last_build_number). These are then given a specific address that can be used by other jobs regardless of which node they are on; the address can be on the master or the node, IIRC.
For the simple case of wanting to read a single object from PowerShell, you can convert it to a JSON string in PowerShell and then convert it back in Groovy. Here's an example:
// run PowerShell and capture its stdout as a JSON string
def pathsJSON = powershell(returnStdout: true, script: "ConvertTo-Json ((Get-ChildItem -Path *.txt) | select -Property Name)")
def paths = []
if (pathsJSON != '') {
    // parse the JSON back into a Groovy object (readJSON comes from the Pipeline Utility Steps plugin)
    paths = readJSON text: pathsJSON
}
I am using Oozie to run my map-reduce job. I want to create the output file according to the date, but it takes the date as a string and ends up printing it literally instead of evaluating it:
/user/skataria/geooutput/$(date +"%m%d%Y%H%M%S")
Here is the Oozie properties file:
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
date=(date +"%m%d%Y%H%M%S")
oozie.wf.application.path=${nameNode}/services/advert/sid
inputDir=${nameNode}/user/${user.name}/geoinput/testgeo
outputDir=${nameNode}/user/${user.name}/geooutput/${date}
Somehow I can't add oozie as a tag because my reputation is under 1500.
It looks like you're trying to use a Linux shell command (date +"%m%d%Y%H%M%S") in a Java properties file - this isn't going to resolve.
One workaround, assuming this is part of a manually submitted workflow job (as opposed to a coordinator job), is to provide the date property from the command line using the -D key=value option, and Linux shell backquotes to resolve the output of a command inline:
oozie job -run -config job.properties -D date=`date +"%m%d%Y%H%M%S"`
You'll need to make sure your version of Oozie supports the -D key=value option.
Yes, I agree the shell option works, but that does not solve my use case. I want to run my map-reduce job daily and schedule it through Hue. The output directory needs to be parameterized as a job property to Oozie.
By the way, I find that Oozie has Expression Language functions.
Unfortunately the function timestamp() returns the current UTC date and time in W3C format down to the second (YYYY-MM-DDThh:mm:ss.sZ), e.g. 1997-07-16T19:20:30.45Z, which is completely unusable for creating a sub-directory name in HDFS.
So for now, I have a workaround: I am using the workflow EL function wf:id().
In workflow.xml
<property>
    <name>mapreduce.output.fileoutputformat.outputdir</name>
    <value>/user/sasubramanian/impressions/output/outpdir/${yyyy_mm_dd}/${wf:id()}</value>
</property>
This creates an output directory with a subdirectory such as:
/user/foouser/subdir1/output/outpdir/0000006-130321100837625-oozie-oozi-W
NOTE: You must specify this in the workflow.xml. It will not work if you specify it in job.properties.