How to process files only for the past hour using Talend?

I have continuous sensor data arriving every 5 minutes in the form of files. I want to pick up only the files from the past hour and do the required processing.
For example: if the Talend job runs at 12:01 pm, it should pick up only the files from 11:00 am to 12:00 pm.
Can anyone please suggest an approach to make this happen within Talend? Is there any built-in component that can pick files from the previous hour?
Here is the flow.

Use tFileProperties, which provides a built-in schema with a column named mtime_string. This column gives you the last modified time of the file, and in a tJava or tJavaRow you can check whether this time lies within the past hour using TalendDate functions.
Iterate over all the files and write this code in the tJavaRow:
Date lastModifiedDate = TalendDate.parseDate("EEE MMM dd HH:mm:ss zzz yyyy", input_row.mtime_string);
Date currentDate = TalendDate.getCurrentDate();
// pass the file on only if it was modified within the past hour
if (TalendDate.diffDate(currentDate, lastModifiedDate, "HH") <= 1) {
    output_row.abs_path = input_row.abs_path;
}
This way you will get all the files modified within the past hour.
Hope this helps..
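Outside of the Talend components, the same check can be sketched in plain Java using File.lastModified(); this is only an illustration of the tFileProperties/tJavaRow logic above, with a hypothetical helper class name:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class RecentFileFilter {

    // Collect the files in a directory whose last-modified time
    // falls within the hour before 'nowMillis'.
    public static List<File> filesFromPastHour(File dir, long nowMillis) {
        long oneHourAgo = nowMillis - 60L * 60L * 1000L;
        List<File> recent = new ArrayList<>();
        File[] files = dir.listFiles();
        if (files == null) {
            return recent; // not a directory, or an I/O error
        }
        for (File f : files) {
            long modified = f.lastModified();
            if (modified >= oneHourAgo && modified <= nowMillis) {
                recent.add(f);
            }
        }
        return recent;
    }
}
```

Passing the current time in as a parameter (rather than calling System.currentTimeMillis() inside) keeps the window boundary consistent across one run, which matches the "job runs at 12:01, picks 11:00-12:00" requirement.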
here is the complete job design :
tFileList--->(iterate)---->tFileProperties---->(row1 main)---->tJavaRow---->if---->tFileInputDelimited---->main----->tMap---->main----->tFileOutput
For the context you are setting in the tJavaRow, check its nullability in the If condition:
context.getProperty("file") != null && !context.getProperty("file").isEmpty()
After this, use the context as you were doing.

There is no built-in component that will give you files based on time.
However, you can accomplish this with tFileList-->tFileProperties. Configure tFileList to sort by last modified date; tFileProperties then gives you the modified date. From there, filter on the date value: if the file is older than an hour, stop; otherwise process it.
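The sort-then-stop idea can be sketched in plain Java: order the files newest-first and break out at the first one older than the cutoff, instead of scanning everything. This is an illustration of the pattern, not Talend component code:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortedCutoff {

    // Return the files newer than 'cutoffMillis', scanning newest-first
    // and stopping at the first file that is too old.
    public static List<File> newerThan(File[] files, long cutoffMillis) {
        File[] sorted = Arrays.copyOf(files, files.length);
        Arrays.sort(sorted, (a, b) -> Long.compare(b.lastModified(), a.lastModified()));
        List<File> keep = new ArrayList<>();
        for (File f : sorted) {
            if (f.lastModified() < cutoffMillis) {
                break; // everything after this is older still
            }
            keep.add(f);
        }
        return keep;
    }
}
```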

Related

Azure Data Factory - Date Expression in component 'derived column' for last 7 Days

I am very new to Azure Data Factory. I have created a simple Pipeline using the same source and target table. The pipeline is supposed to take the date column from the source table, apply an expression to the column date (datatype date as shown in the schema below) in the source table, and it is supposed to either load 1 if the date is within the last 7 days or 0 otherwise in the column last_7_days (as in schema).
The schema for both source and target tables look like this:
Now, I am facing a challenge writing an expression in the DerivedColumn component. I have managed to find the date 7 days ago with the expression: .
In summary, the idea is to load the last_7_days column in the target table with value '1' if date >= current date - interval 7 day and date <= current date, as in SQL. I would be very grateful if anyone could help me with any tips and suggestions. If you require further information, please let me know.
Just for more information: the source/target table column date is static, with 10 years of dates from 2020 till 2030 in yyyy-mm-dd format. The ETL should run every day and only put value 1 in the last_7_days column for the last 7 days looking back from the current date. All other entries must receive value 0.
You can use the expression below:
case(date == currentDate(), 1, date >= subDays(currentDate(), 7), 1, date < subDays(currentDate(), 7), 0, date > currentDate(), 0)
If I were you, I would also choose the case() function to build the expression.
About your question in the comment: I'm afraid not, there isn't a more elegant way to do this. Data Flow expressions can be complex and may be composed of many functions; the case() function is the best one for you here.
It's very clear and easy to understand.
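To sanity-check the boundary behaviour of those case() conditions, the same last-7-days window can be expressed in Java; both today and the day exactly 7 days back are inside the window (class and method names are illustrative):

```java
import java.time.LocalDate;

public class Last7DaysFlag {

    // 1 if 'date' lies in [today - 7 days, today], else 0,
    // mirroring the inclusive bounds of the case() expression.
    public static int flag(LocalDate date, LocalDate today) {
        boolean inWindow = !date.isBefore(today.minusDays(7))
                        && !date.isAfter(today);
        return inWindow ? 1 : 0;
    }
}
```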

How to get the time difference in talend?

How do I get the difference in time by comparing each value with the previous one? Say, for example, there are:
2017-01-01 13:00:00
2017-01-01 13:15:00
I need the difference as 15 minutes. How do I find that difference?
Firstly, you'll have to use TalendDate.diffDate(column1,column2,"pattern") to get the time difference.
Then, if you want to compare the current value with the previous one (in the same column), you can set a sequence on your flow; it will help you identify which row is the previous value. Then you just have to read your flow twice, and do an inner join between the current sequence and current sequence - 1 to get the current date and the previous date.
First subjob :
YourFlow -> tMap -> tHashOutput
In tMap, add a new "sequence" column to your field and use Numeric.sequence("s1",1,1).
This way all lines will have an ID.
Then, read your hash twice, and join the flows on "sequence - 1":
tHashInput_1----|
|--tMap--->Output
tHashInput_2----|
Put the TalendDate.diffDate() method in the output, using the two Dates fields.
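The sequence / self-join idea boils down to pairing each row with the previous one; here is a plain-Java sketch of that pairing, computing the difference in minutes (the parsing pattern matches the timestamps in the question, and the class name is illustrative):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class PreviousRowDiff {

    // For each timestamp after the first, compute the difference
    // in minutes from the previous timestamp.
    public static List<Long> diffMinutes(List<String> timestamps) throws ParseException {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        List<Long> diffs = new ArrayList<>();
        Date previous = null;
        for (String ts : timestamps) {
            Date current = fmt.parse(ts);
            if (previous != null) {
                diffs.add((current.getTime() - previous.getTime()) / (60L * 1000L));
            }
            previous = current;
        }
        return diffs;
    }
}
```

For the two example timestamps this yields a single difference of 15 minutes, which is what the tMap join on "sequence - 1" produces row by row.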
Here is an alternative :
Start by recording the job's start time, this way (here in a tJava, but you can also use a tSetGlobalVar component):
globalMap.put("startDate", TalendDate.getDate("CCYY/MM/DD hh:mm:ss"));
The following code is then used later in the job, inside another tJava (note the format object must be defined to parse the stored strings):
java.text.SimpleDateFormat format = new java.text.SimpleDateFormat("yyyy/MM/dd HH:mm:ss");
String endDate = TalendDate.getDate("CCYY/MM/DD hh:mm:ss");
long executionTime = format.parse(endDate).getTime() - format.parse((String) globalMap.get("startDate")).getTime();
System.out.println("Execution Time : " + (executionTime / (60 * 60 * 1000)) + " Hour(s) " + (executionTime / (60 * 1000) % 60) + " Minute(s) " + (executionTime / 1000 % 60) + " second(s).");

Using Current Date Time in SAS

I am selecting the data from a table using a date string. I would like to select all rows that have a update time stamp greater than or equal to today.
The simplest way that I can think of is to put today's date in the string, and it works fine.
WHERE UPDATE_DTM >'29NOV2016:12:00'DT;
However, if I want to put something like today's date or system date, what should I put?
I used today(), but it returned all rows in the table. I am not sure if that's because today() in SAS returns a date counted in days from 1/1/1960 rather than a datetime? I also tried &sysdate, but it returned an error message that seems to require a date conversion.
WHERE UPDATE_DTM > TODAY();
Any ideas? Your thoughts are greatly appreciated!
DATETIME() is the datetime equivalent of TODAY() (but includes the current time). You could also use dhms(TODAY(),0,0,0) if you want effectively midnight (or, for your example above, dhms(TODAY(),12,0,0) to get noon today).
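The reason TODAY() matched everything is a units mismatch: SAS dates count days since 1960-01-01 while SAS datetimes count seconds, so any datetime value dwarfs any date value. A quick Java sketch of that arithmetic (modelling the SAS epoch and units, not SAS code itself):

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

public class SasEpoch {

    static final LocalDate EPOCH = LocalDate.of(1960, 1, 1);

    // SAS date: days since 1960-01-01.
    public static long sasDate(LocalDate d) {
        return ChronoUnit.DAYS.between(EPOCH, d);
    }

    // SAS datetime: seconds since 1960-01-01T00:00:00.
    public static long sasDatetime(LocalDateTime dt) {
        return ChronoUnit.SECONDS.between(EPOCH.atStartOfDay(), dt);
    }

    // dhms(date, h, m, s): lift a SAS date plus a time of day
    // into a SAS datetime, as the answer suggests.
    public static long dhms(long sasDate, int h, int m, int s) {
        return sasDate * 86400L + h * 3600L + m * 60L + s;
    }
}
```

This is why comparing UPDATE_DTM (seconds) against TODAY() (days) returns every row: the dhms() conversion, or DATETIME(), puts both sides in the same units.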

How to check max date of each input file , if it matches with previous week start date then only process?

I have data files for 25 countries with week-wise data in CSV format, which we get every Monday in an FTP location. I just need to consolidate all the files into one file, which I am able to do.
In each data file there is a "Week" column, and I now need to check whether the latest week's data is in the file or not; if not, send a mail saying the file does not have the latest data.
For example, if next Monday is on 16th March, the max week in the file should be 9th March.
How can I apply that logic?
Using tAggregateRow and tJavaRow I am able to get the max week of each file, but how do I design the job after that?
The basic steps you want to follow are:
1. Keep the expected max date in a global variable at the start of the job. In this example it should be 9th March.
2. Read each file one by one and get the max of the week date; if it matches the global variable then do not send the email, otherwise send it.
So an example job flow might look like:
tFileList---iterate--->tFileInputDelimited--->tAggregaterow--->tJavaRow---RUN IF condition(based on if SendEmailflag is Y)--->tSendMail
The tAggregateRow should get the max week date.
In the tJavaRow you should compare input_row.maxdate with the global max date (9th March) - use .equals() rather than == for a Date or String comparison - and based on this set another flag SendEmailFlag to Y or N, defaulting to N.
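Computing the expected max week (the Monday one week before the run Monday) can be sketched in Java and adapted inside a tJava to populate the global variable; the class and method names are illustrative, and the test dates use 2020, where 16 March falls on a Monday:

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

public class ExpectedWeek {

    // If the job runs on (or just after) a Monday, the latest week
    // the files should contain is the Monday one week earlier.
    public static LocalDate expectedMaxWeek(LocalDate runDate) {
        LocalDate thisMonday =
            runDate.with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY));
        return thisMonday.minusWeeks(1);
    }
}
```

Using previousOrSame(MONDAY) makes the calculation robust if the job happens to run a day or two late: it still anchors on the most recent Monday.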

Grouping Expert, Current date against start date

I am having trouble grouping certain results in a work-in-progress report that is arranged by start date. I have grouped using fixed values before, but because the dates keep moving I am unsure what to do.
The start date is WIP_Schedule.Start_Date
the groups I am trying to create are:
[Group1] Overdue = the current date has passed the start date.
[Group2] (Yet to be named) = the current date is within the two-week period prior to the start date.
[Group3] To Do = the current date is earlier than the two-week period prior to the start date.
I am after a works instruction on how to achieve this.
I know this isn't a lot of information, if you require any more please ask.
Thanks,
Daniel
This is a pretty straightforward requirement, so you should be able to figure it out by searching the web. However, I'll give you part of the answer and hopefully you can figure it out.
Start by creating a formula to work out the status of the date:
If CurrentDate > {WIP_Schedule.Start_Date} Then "Overdue"
Else ...
Then you can group based on that formula. All you have to do is figure out the rest of the formula.
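The three groups amount to a small status function; here is a plain-Java sketch of the same branching, which the Crystal formula would mirror (the Group 2 label "Due Soon" is a placeholder, since it is yet to be named):

```java
import java.time.LocalDate;

public class WipStatus {

    // Group 1: today is past the start date.
    // Group 2: today falls in the two weeks before the start date.
    // Group 3: today is earlier than that two-week window.
    public static String status(LocalDate startDate, LocalDate today) {
        if (today.isAfter(startDate)) {
            return "Overdue";
        }
        if (!today.isBefore(startDate.minusWeeks(2))) {
            return "Due Soon"; // placeholder name for Group 2
        }
        return "To Do";
    }
}
```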