Creating output file according to date in oozie - date

I am using oozie to run my map-reduce job. I want to create the output file according to the date. But it takes date as a string and ends up printing instead of taking date as the value :
/user/skataria/geooutput/$(date +"%m%d%Y%H%M%S")
Here is the oozie properties file:
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
date=(date +"%m%d%Y%H%M%S")
oozie.wf.application.path=${nameNode}/services/advert/sid
inputDir=${nameNode}/user/${user.name}/geoinput/testgeo
outputDir=${nameNode}/user/${user.name}/geooutput/${date}
Somehow i cant have oozie as tag because my reputation is under 1500

It looks like you're trying to use a linux shell command (date +"%m%d%Y%H%M%S") in a java properties file - this isn't going to resolve.
One work around, assuming this is part of a manually submitted Workflow job (as opposed to a Coordinator job) is to provide the date property from the command line using the -D key=value option, and linux shell back quotes to resolve the output of a command inline
oozie job -run -config job.properties -D date=`date +"%m%d%Y%H%M%S"`
You'll need to make sure your version of Oozie support the -D key=value option

Yes I agree the shell option works. But that does not solve my usecase. I want to run my map-reduce job daily and schedule this thru Hue. The output directory needs to be parameterized as an job property to Oozie.
By the way I find that Oozie has Expression language Functions,
Unfortunately the function timestamp() returns the UTC current date and time in W3C format down to the second (YYYY-MM-DDThh:mm:ss.sZ). i.e.: 1997-07-16T19:20:30.45Z and completely unusable for creating a sub-directory name in HDFS
So for now,
I have a workaround. I am using the Workflow EL Function wf:id()
In workflow.xml
<property>
<name>mapreduce.output.fileoutputformat.outputdir</name>
<value>/user/sasubramanian/impressions/output/outpdir/${yyyy_mm_dd}/${wf:id()}</value>
</property>
This creates a output directory with subdirectory as,
/user/foouser/subdir1/output/outpdir/0000006-130321100837625-oozie-oozi-W
NOTE: You must specify this in the workflow.xml. This will not work if you specified it in the job.properties

Related

running job with different file without reloading the file

I created a job that could be reusable for new files. The entire activities in the job, the maps and everything else will remain the same except for the file name. I already tried it once but it seems that i need to re "load" the file and remap everything again. It's inefficient. Is there any way for me to pass different file in a job without remaping, reconfiguring and reloading anything?
You have multiple options for allowing a DataStage parallel job to use a different filename for input on each job run:
When using either Sequential File stage or File Connector stage, in stead of typing the actual filename, you can input the name of a job parameter which has been defined on the Parameters tab of the job properties dialog. For example, if you define string parameter myFile, then in the filename field of input stage you would enter #myFile# and at job run-time that would be replaced by whatever is the current value of the myFile parameter. If you run job manually from Director/Designer clients, you will have job run dialog where you can specify a value for job parameters. If you start job via dsjob command, there are options to pass in job parameters on command line. You also have option to use parameterset files that you can modify prior to job run.
Another option would be to use a file location and pattern instead of a specific file name. Both Sequential File stage and File Connector stage let you specify a pattern, for example: /data/my_input_files/*.txt
Then, each time you run job it will input any files at that location matching the above pattern, so it can process multiple files. However, to prevent re-processing files from prior job runs, you will want to clean up any files at that location after job completes. Then when you have new files to process just put them in that directory and re-run the job.
In case if all the files contains a similar data structure, you need to implement one parallel job and if you have a similar pattern of file name for all file names Such as 1234ab.xls, 1234vd.xls, 1234gd.xls, ... you could pass the file name as 1234??.xls In the sequential job file name parameter (Use this as file name in parallel job) which contains the above parallel job to be executed.

How to Pass argument to Oozie

My use case is the following, I am building an Oozie Pipeline and i need to pass it an argument.
Indeed my spark job must receive a string date as an argument and it would be great to pass the argument to the Oozie Workflow in order to use it in the Spark Submit. Anyone got any idea ? I didn't find the answer on Google
Thanks
Create workflow.xml that references some variable inputDate
Create
file job.properties that defines default value for inputDate
Run
your job using CLI, overriding default value when is needed:
oozie job -run -config job.properties -DinputDate=2017-08-19

Jenkins Powershell Output

I would like to capture the output of some variables to be used elsewhere in the job using Jenkins Powershell plugin.
Is this possible?
My goal is to build the latest tag somehow and the powershell script was meant to achieve that, outputing to a text file would not help and environment variables can't be used because the process is seemingly forked unfortunately
Besides EnvInject the another common approach for sharing data between build steps is to store results in files located at job workspace.
The idea is to skip using environment variables altogether and just write/read files.
It seems that the only solution is to combine with EnvInject plugin. You can create a text file with key value pairs from powershell then export them into the build using the EnvInject plugin.
You should make the workspace persistant for this job , then you can save the data you need to file. Other jobs can then access this persistant workspace or use it as their own as long as they are on the same node.
Another option would be to use jenkins built in artifact retention, at the end of the jobs configure page there will be an option to retain files specified by a match (e.g *.xml or last_build_number). These are then given a specific address that can be used by other jobs regardless of which node they are on , the address can be on the master or the node IIRC.
For the simple case of wanting to read a single object from Powershell you can convert it to a JSON string in Powershell and then convert it back in Groovy. Here's an example:
def pathsJSON = powershell(returnStdout: true, script: "ConvertTo-Json ((Get-ChildItem -Path *.txt) | select -Property Name)");
def paths = [];
if(pathsJSON != '') {
paths = readJSON text: pathsJSON
}

Informatica Session Failing

I created a mapping that pulls data from a flat file that shows me usage data for specific SSRS reports. The file is overwritten each day with the previous days usage data. My issue is, sometimes the report doesn't have any usage for that day and my ETL sends me a "Failed" email because there wasn't any data in the Source. The job from running if there is no data in the source or to prevent it from failing.
--Thanks
A simple way to solve this is to create a "Passthrough" mapping that only contains a flat file source, source qualifier, and a flat file target.
You would create a session that runs this mapping at the beginning of your workflow and have it read your flat file source. The target can just be a dummy flat file that you keep overwriting. Then you would have this condition in the link to your next session that would actually process the file:
$s_Passthrough.SrcSuccessRows > 0
Yes, there are several ways, you can do this.
You can provide an empty file to ETL job when there is no source data. To do this, use a pre-session command like touch <filename> in the Informatica workflow. This will create an empty file with the <filename> if it is not present. The workflow will run successfully with 0 rows.
If you have a script that triggers the Informatica job, then you can put a check there as well like this:
if [ -e <filename> ]
then
pmcmd ...
fi
This will skip the job from executing.
Have another session before the actual dataload. Read the file, use a FALSE filter and some dummy target. Link this one to the session you already have and set the following link condition:
$yourDummySessionName.SrcSuccessRows > 0

Current working directory for SXPG_COMMAND_EXECUTE?

Is there a way to specify the current working directory for the system command executed by the function module SXPG_COMMAND_EXECUTE?
I do not see any parameter which would allow me to do that either by defining the command in transaction SM69 or on the list of IMPORTING parameters in SE37.
It looks like by default such commands are started in DIR_HOME which can be viewed by the transaction AL11. Do I have any control over that?
There isn't a way of doing it via `SM69' unfortunately. I think the only solution is to create a script and call that.
I was going to suggest wrapping the statements in a SM69 command defined as a call to sh with parameters of -c 'cd <dir> && /path/to/command' but unfortunately that doesn't work. According to note 401095 wildcards are not permitted. When I tested, && was translated into a single &, causing the command to fail.
Would be good if you access this information using FM FILE_GET_NAME_USING_PATH (export the script name for which you want to find the physical directory).
The recieving path can be used in SXPG_COMMAND_EXECUTE.
Because the external commands I called were actually .bat files I solved this by putting the following expression at the beginning of each and every one.
cd /d %~dp0
This Stackoverflow question helped a lot actually.