How to check the max date of each input file and process it only if it matches the previous week's start date? - Talend

I have data files for 25 countries containing week-wise data in CSV format, which we receive every Monday at an FTP location. I just need to consolidate all the files into one file, which I am able to do.
Each data file has a "Week" column, and now I need to check whether the latest week's data is present in the file or not; if it is not, send an email saying the file does not have the latest data.
For example, next Monday is 16th March, so the max week in the file should be 9th March.
How can I apply that logic?
Using tAggregateRow and tJavaRow I am able to get the max week of each file, but how should I design the job after that?

The basic steps you want to follow are:
Keep the expected max date in a global variable at the start of the job; in this example it should be 9th March (see the sketch after these steps).
Read each file one by one and get the max of the week date; if it matches the global variable, do not send the email, otherwise send it.
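As a rough sketch of the first step, a tJava at the start of the job could compute the previous week's Monday and store it in globalMap. This is only an illustration: the key name expectedMaxDate and the assumption that the job runs on the Monday the files arrive are mine, not part of the original design.
// tJava sketch: work out the Monday of the previous week and keep it for later components.
java.util.Calendar cal = java.util.Calendar.getInstance();
cal.setTime(TalendDate.getCurrentDate());
// Step back at least one day, then keep stepping back until we reach a Monday.
// Run on Monday 16th March, this yields 9th March.
cal.add(java.util.Calendar.DAY_OF_MONTH, -1);
while (cal.get(java.util.Calendar.DAY_OF_WEEK) != java.util.Calendar.MONDAY) {
    cal.add(java.util.Calendar.DAY_OF_MONTH, -1);
}
globalMap.put("expectedMaxDate", cal.getTime());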
So an example job flow might look like:
tFileList---iterate--->tFileInputDelimited--->tAggregateRow--->tJavaRow---RUN IF condition (based on whether SendEmailFlag is Y)--->tSendMail
The tAggregateRow should get the max week date.
In the tJavaRow you should compare input_row.maxdate with the global max date (9th March) and, based on that, set another flag SendEmailFlag to Y or N, defaulting to N.
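A minimal sketch of that tJavaRow, assuming the aggregated column maxdate is a Date, the expected date was stored in globalMap as shown earlier, and the flag is also kept in globalMap so the Run If trigger can read it (these names are assumptions):
// tJavaRow sketch: compare the file's max week with the expected week start.
// Dates are compared as formatted strings; == on Date objects would compare references.
String expected = TalendDate.formatDate("yyyy-MM-dd", (java.util.Date) globalMap.get("expectedMaxDate"));
String fileMax = TalendDate.formatDate("yyyy-MM-dd", input_row.maxdate);
// Default to N; only ask for an email when the latest week is missing.
globalMap.put("SendEmailFlag", expected.equals(fileMax) ? "N" : "Y");
The Run If condition before tSendMail could then be "Y".equals((String) globalMap.get("SendEmailFlag")).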

Related

How can I calculate my win rate on one column based on the date on another?

I created a rudimentary Google Form to track my win rate at Starcraft. The first column on the resulting Google Sheet is Timestamp created by the form.
I have another column that has my win-loss, and I am able to calculate my percentage for the entire sheet (all games). However, I want to be able to see my daily win rate, and I can't figure out the correct way to go about it.
I tried COUNTIF and COUNTIFS with TODAY() and was able to count the games for a certain day, but I don't know how to tie that in with my win-loss column. What I currently do is adjust my daily formula to specify today's date before playing. I was hoping I wouldn't need to do this.
Please see Win-Loss Stats Sheet
Solution:
You can extend your formula to compare against the date in column A:
=(COUNTIFS(D2:D, "Win", ARRAYFORMULA(INT(A2:A)),TODAY())/((COUNTIFS(D2:D, "Win",ARRAYFORMULA(INT(A2:A)),TODAY()))+(COUNTIFS(D2:D, "Loss",ARRAYFORMULA(INT(A2:A)),TODAY()))))
The additional condition would be ARRAYFORMULA(INT(A2:A)),TODAY(), which converts the timestamps into dates and compares them to today's date.

Azure Data Factory - Date Expression in component 'derived column' for last 7 Days

I am very new to Azure Data Factory. I have created a simple pipeline using the same source and target table. The pipeline is supposed to take the date column from the source table, apply an expression to it (datatype date, as shown in the schema below), and load 1 into the last_7_days column if the date is within the last 7 days, or 0 otherwise.
The schema for both source and target tables looks like this:
Now, I am facing a challenge writing an expression in the Derived Column component. I have managed to find the date 7 days ago with the expression: .
In summary, the idea is to load the last_7_days column in the target table with the value '1' if date >= current date - interval 7 day and date <= current date, as in SQL. I would be very grateful if anyone could help me with any tips and suggestions. If you require further information, please let me know.
Just for more information: the source/target table column date is static, with 10 years of dates from 2020 to 2030 in yyyy-mm-dd format. The ETL should run every day and put the value 1 in the last_7_days column only for the last 7 days looking back from the current date. All other entries must receive the value 0.
You can use the expression below:
case(date == currentDate(), 1, date > currentDate(), 0, date >= subDays(currentDate(),7), 1, date < subDays(currentDate(),7), 0)
Note that the date > currentDate() check comes before the >= subDays(...) check, so future dates in the table receive 0 rather than 1.
If I were you, I would also choose the case() function to build the expression.
About your question in the comment, I'm afraid not; there isn't a more elegant way. To achieve this, Data Flow expressions can be complex and may be composed of many functions. The case() function is the best one for this.
It's very clear and easy to understand.
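Just to make the boundary conditions explicit, here is a small standalone Java sketch of the same last-7-days rule (the class and method names are hypothetical; it mirrors the stated requirement of 1 for dates between current date - 7 days and current date, 0 otherwise):
import java.time.LocalDate;

public class LastSevenDaysFlag {

    // Returns 1 when the date lies within the last 7 days, inclusive of today
    // and of exactly 7 days ago; 0 for anything older or in the future.
    static int lastSevenDays(LocalDate date, LocalDate today) {
        boolean inWindow = !date.isBefore(today.minusDays(7)) && !date.isAfter(today);
        return inWindow ? 1 : 0;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.now();
        System.out.println(lastSevenDays(today, today));               // 1
        System.out.println(lastSevenDays(today.minusDays(7), today));  // 1
        System.out.println(lastSevenDays(today.minusDays(8), today));  // 0
        System.out.println(lastSevenDays(today.plusDays(1), today));   // 0
    }
}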

How do I pull the week of the month from text strings in this Twilio format 2019-08-22 06:12:58 MDT?

I am using the Twilio log file to crunch some data and need to convert the Twilio date format into something that Google Sheets can recognize as a date, so I can then extract which week of the month the date refers to. It would also be helpful to get the syntax that converts the Twilio date into a recognizable date for Google Sheets in case there are other things I need to do with the date field.
Currently, this is the format in the log file: "2019-08-22 06:12:58 MDT"
I'm using =text(index(split(I2," "),,1),"mmmm") to determine the month, and am struggling to make this work with the WEEKNUM function in Google Sheets to get the number of the week the date is from. I've tried =DATE(index(split(I2," "),,1),"mmmm") and =WEEKNUM(index(split(I2," "),,1),"mmmm"), but I am terrible with the formula syntax and can't fix the date value.
I expect to see a value from 1-5.
The text() part of the formula turns the date input into text, so you can't use it to calculate weeknum().
=weeknum(index(split(I2," "),,1)) will get you closer, but it will give you the week of the year.
You may want to see this for a way to get from the week of the year to the week of the month.

How to process files only for the past hour using Talend?

I have continuous sensor data coming in every 5 mins in form of files. I want to pick files only for the past hour and do the required processing.
For example, if the Talend job runs at 12:01 pm, it picks up only the files from 11:00 am to 12:00 pm.
Can anyone please suggest the approach I should take to make this happen within Talend? Is there any built-in component that can pick files from the previous hour?
Here is the flow.
Use tFileProperties, which gives you a built-in schema with a column named mtime_string. This column holds the file's last modified time, and in a tJava or tJavaRow you can check whether this time falls within the past hour using TalendDate functions.
Iterate over all the files and write this code in the tJavaRow:
// mtime_string holds the last modified time in java.util.Date toString() format,
// e.g. "EEE MMM dd HH:mm:ss zzz yyyy", so parse it with the matching pattern.
Date lastModifiedDate = TalendDate.parseDate("EEE MMM dd HH:mm:ss zzz yyyy", input_row.mtime_string);
Date currentDate = TalendDate.getCurrentDate();
// Keep the file only if it was modified within the past hour.
if (TalendDate.diffDate(currentDate, lastModifiedDate, "HH") <= 1) {
    output_row.abs_path = input_row.abs_path;
}
This will give you all the files modified within the past hour.
Hope this helps.
Here is the complete job design:
tFileList--->(iterate)---->tFileProperties---->(row1 main)---->tJavaRow---->if---->tFileInputDelimited---->main----->tMap---->main----->tFileOutput
For the context variable you are setting in the tJavaRow, check it for null in the If condition:
context.getProperty("file") != null && !context.getProperty("file").isEmpty()
After this, use the context as you are doing.
There is no built-in component that will give you files based on time.
However, you can accomplish this by using tFileList-->tFileProperties. Configure tFileList to sort by last modified date, then tFileProperties will give you the modified date. From there, you can filter based on the date value - if older than an hour, stop, otherwise process.

Loading date or datetime into date dimension

Let's say I have a date dimension and from my business requirements I know that the most granular I would need to go is to examine the specific day of the month that an event occurred.
The data I am given provides the exact time that an event occurred (YYYY-MM-DD HH:MM:SS). I have two options:
Before loading the data into the date dimension, slice the HH:MM:SS from the date.
Create the time attributes in my date dimension and insert the full date time.
The way I see it, I should go with option 1. This would remove redundant data and save some space. However, if I go with option 2, then should the business requirements ever change, or should my manager suddenly want more granularity, I wouldn't need to modify my original design. Which option is more commonly used? Are there other options that I did not consider?
Update - follow up question
I receive new data every month. If I used a pre-built date dimension with all the dates, would I then need to run my script every month to populate the table with that month's new dates, or would I have a continuous process whereby every day I insert one row into the table, which would be that date?
I would agree with you and avoid option 2. A standard date dimension table is at the individual date level. If you did need to analyse by time of day, you could create an additional time of day dimension at the level of a second in a single day, and link to that from your fact table.
Your date dimension should be created by script automatically, rather than from the dates that events occurred. This allows you to analyse across a range of events from other facts, and on dates where no events occur, using a standard, prebuilt dimension.
I would also include the full date/time stamp as a column in the fact table, along with the 'DateKey' to the dimension table. This would allow you some visibility/analysis of the timestamp, you would not lose the data, and would still allow you to analyse by the date dimension.
Update - follow up question
Your pre-built date dimension (the standard way of doing it) would usually contain some dates in the future. There's no reason not to, for example, include another 5 years of dates in the table. But if you'd like it to grow gradually over time, you could have a script that runs once a day, once a month, or once a year to add new dates. It's totally up to you! There are many example scripts for building date dimensions - just google "date dimension script". They exist for the language of your choice, e.g. SQL, C#, Power Query, etc.
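As an illustration only, here is a minimal Java sketch that generates one row per day over a date range; the column names (DateKey, FullDate, Year, Month, DayOfMonth) and the range are assumptions, and in a real job the print would be replaced by an insert into your date dimension table:
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateDimensionBuilder {

    public static void main(String[] args) {
        // Hypothetical range: adjust the start and end to your needs,
        // e.g. today through five years ahead.
        LocalDate start = LocalDate.of(2020, 1, 1);
        LocalDate end   = LocalDate.of(2030, 12, 31);

        DateTimeFormatter keyFormat = DateTimeFormatter.ofPattern("yyyyMMdd");

        for (LocalDate d = start; !d.isAfter(end); d = d.plusDays(1)) {
            int dateKey = Integer.parseInt(d.format(keyFormat));   // surrogate key, e.g. 20200101
            // In a real job this would be an INSERT into the date dimension table;
            // here we just print the row that would be loaded.
            System.out.printf("%d,%s,%d,%d,%d%n",
                    dateKey,               // DateKey
                    d,                     // FullDate (yyyy-MM-dd)
                    d.getYear(),           // Year
                    d.getMonthValue(),     // Month
                    d.getDayOfMonth());    // DayOfMonth
        }
    }
}
Whether you run this once for several years ahead or schedule it to extend the table periodically, the loop is the same either way.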