I have a Talend job that creates a folder per account ID inside a specific directory (C/LogDetails).
The job runs every 5 minutes, and the directory has run out of space, which prevents the job from creating more account folders.
In short, because of the lack of space in C/LogDetails, the job fails.
I want to build a solution in Talend that deletes all folders whose modified date is earlier than today's date.
In tFileList, give the parent folder path C/LogDetails and select 'Directories' in the FileList Type dropdown.
In the tFileProperties component, use the global variable ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")) as the file to inspect. It will iterate over all the folders inside your parent folder because you selected Directories as the FileList Type in the tFileList component. Of the fixed tFileProperties schema, the two columns used below are abs_path (the absolute path) and mtime_string (the modification time as a formatted string).
In tJavaRow, use the code below (note the input row is referenced consistently as input_row, and the pattern matches the format of mtime_string):

if (TalendDate.compareDate(
        TalendDate.parseDate("yyyy-MM-dd", TalendDate.getDate("yyyy-MM-dd")),
        TalendDate.parseDate("EEE MMM dd HH:mm:ss zzz yyyy", input_row.mtime_string)) == 1) {
    // The folder was last modified before today: remember its path for deletion.
    context.abs_path = input_row.abs_path;
    System.out.println("if : " + context.abs_path);
}
Join tJavaRow to the tFileDelete component with a 'Run if' trigger; the condition should check that context.abs_path is not null or empty. Give context.abs_path as the path in tFileDelete and select the 'Delete folder' option.
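For reference, the 'Run if' condition would be written like this (assuming abs_path is declared as a String context variable; it is also worth resetting context.abs_path to null at the start of each iteration so a previous folder's path is not reused):

context.abs_path != null && !context.abs_path.trim().isEmpty()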
Hope this helps. :)
I have 2 Get Metadata stages in ADF, each fetching file names from a different folder. I need to use these outputs for file-name comparison in a Databricks notebook and return true if all the files are present.
How do I pass the output from the Get Metadata stages to Databricks, perform the string comparison, and return true if all files are present and false if even one file is missing?
How can I achieve this?
Please find the answer below, which I explain with one Get Metadata stage; the same can be replicated for more than one.
Create an ADF pipeline with the activities below.
Now, in the Get Metadata activity, add childItems in the Field list as an argument to pass the output of Get Metadata to the notebook, as shown below.
In the Databricks Notebook activity, add the parameter below as a Base Parameter; it captures the output of Get Metadata and passes it as an input parameter to the notebook. This parameter is normally of object datatype, but I converted it to string datatype to access the file names in the notebook, as shown below.
@string(activity('Get Metadata1').output.childItems)
Now we can access the Get Metadata output as a string in the notebook.
import ast

required_filenames = ['File1.csv','File2.csv','File3.csv']  ## The list to compare against the output of the Get Metadata activity.
metadata_value = dbutils.widgets.get('metadata_output')  ## Access the Get Metadata output through a Databricks widget and store it in a variable.
metadata_list = ast.literal_eval(metadata_value)  ## Convert the string back into a list of dicts.
blob_output_list = []  ## Empty list to collect the file names we get from the Get Metadata activity.
for i in metadata_list:
    blob_output_list.append(i['name'])  ## Add each file name from blob storage to the list created above.
validateif = all(item in blob_output_list for item in required_filenames)  ## True only if every required file name is present in the blob list.
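To hand the result back to the pipeline, one possible final step (an assumption; it is not shown in the original notebook) is to exit the notebook with the value, which ADF can then read from the Notebook activity's runOutput:

dbutils.notebook.exit(str(validateif))  ## Returns 'True' or 'False' to the calling pipeline.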
I tried it this way and was able to solve the stated requirement. Hope this helps.
Please upvote the answer if it helps with your requirement.
I have the following directory structure in HDFS:
logs_folder
|---2021-03-01
|   |---log1
|   |---log2
|   |---log3
|---2021-03-02
|   |---log1
|   |---log2
|---2021-03-03
|   |---log1
|   |---log2
...
Logs are made up of text data. There is no date in the data because it is already in the folder name. I want to read all the logs and save them in the following format:
date id
where id is a field from the log; the date must be taken from the folder name.
Expected output:
2021-03-01 id1
2021-03-01 id2
...
2021-03-02 id234
2021-03-02 id456
...
How do I add the date from the folder name to the output?
I found a close question about adding the full path name to the data on reading:
A = LOAD '/logs_folder/*' using PigStorage(',','-tagPath');
DUMP A ;
How can I incorporate the current input filename into my Pig Latin script?
It is very close, but how do I get the parent folder name only, instead of the full path?
Finally I used this approach:
Load the data using the `-tagPath` attribute - it adds a column to the loaded data containing the full path of every file.
Use a regex to extract the parent folder name only.
Code example:
hadoop_data = LOAD '/logs_folder/*' USING PigStorage(',', '-tagPath')
    AS (filepath:chararray, id:chararray, feature:chararray, value:chararray);
hadoop_data = FOREACH hadoop_data GENERATE id,
    (chararray)REGEX_EXTRACT(filepath, '.*\\/(.*)\\/', 1) AS path,
    feature, value;
My data consists of 3 fields (id, feature, value), but you can see there are 4 of them here: the filepath field was added! For a file such as /logs_folder/2021-03-01/log1, the regex captures the last directory in the path, so path becomes 2021-03-01.
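If the output should contain only the date and the id, as in the expected output above, a short follow-up could look like this (a sketch; the output location and separator are assumptions):

result = FOREACH hadoop_data GENERATE path AS date, id;
STORE result INTO '/logs_out' USING PigStorage(' ');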
I have a Talend job which takes its input from a CSV file. The CSV file has child job names and a start date. Right now, this is how the job is built:
I have a tFileInputDelimited which takes the input from the file and connects to a tFlowToIterate, which holds the key/value pairs.
Example:
c1, c2 and c3, which are:
C1 --> Job1
C2 --> J1
C3 --> 1/16/2017
J1 is the name of the child job and C3 has the date.
In the tRunJob I have checked "Use dynamic job", and the Context job is ((String)globalMap.get("c2")), which will execute all the child jobs.
Now I need to execute only those child jobs whose c3 value is today's date.
If your question is a continuation of this thread, Running Talend child jobs through a parent job, then you can follow the steps below.
Below is my input data:
ChildJob1, 1/16/2017
ChildJob2, 1/17/2017
ChildJob3, 1/17/2017
I have modified the same job from the previous answer with an additional tJava component, as shown below.
Below is the code inside the tJava component:
System.out.println("|-----------------Date from Input file is "+row5.Date.toString()+"------------|");
System.out.println("|-----------------Job name from Input file is "+context.JobName+"---------------------------|");
String input = TalendDate.getDate("DD/MM/yyyy");
SimpleDateFormat inputFormatter = new SimpleDateFormat("DD/MM/yyyy");
Date date = inputFormatter.parse(input); // Getting Today's date in DD/MM/YYYY format
context.IsTodayJob = TalendDate.compareDate(date,row5.Date) == 0 ? true : false;
In the tJava component, I set the context variable IsTodayJob by comparing today's date with the date value from the input file.
And I connected the tJava component to the tRunJob component through the 'Run if' option, with the condition below.
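The condition itself is not reproduced in the text (it was in a screenshot), but given the code above it would simply be the boolean context variable:

context.IsTodayJob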
This gave me the result below.
Hope this helps you out.
I have searched all over and read this post, but it doesn't seem complete and doesn't work.
The situation: I need to get the last modified file from a directory on the local machine, and then pass that file into a tFileInputDelimited component.
I currently have:
tFileList --> iterate --> tIterateToFlow --> tSampleRow
--> tFlowToIterate --> tFileInputDelimited --> tLogRow (just to make sure it's pulling the right file)
But it doesn't work. I have configured it so that tIterateToFlow has a column called
"FileName" with ((String)globalMap.get("CURRENT_FILE")) as the value,
"FileDirectory" with ((String)globalMap.get("CURRENT_FILEDIRECTORY")) as the value, and
"FileAndDirectory" with ((String)globalMap.get("CURRENT_FILEPATH")) as the value.
The tSampleRow is limited to "1".
The tFlowToIterate is set so that
"FileNameOnly" is the value of "FileName",
"FileDirectoryOnly" is "FileDirectory", and
"FilePathComplete" is "FileAndDirectory".
In the file location field of the tFileInputDelimited, I have ((String)globalMap.get("FilePathComplete")).
When it runs, I get an error saying it cannot find the file or path. If I cut out the file input component and send the flow straight to the tLogRow, it shows a single blank line.
Any ideas?
I'm not sure if you've just slightly misconfigured the job here, but it seems to work fine for me.
Here are a few screenshots showing my job design:
The only thing I can think of just by looking at your post is that you might have slightly messed up the key/value pair combinations in the tFlowToIterate. I tend to find that the default settings there work fine pretty much all of the time, and they also make it a little more obvious what the component is doing.
EDIT: Actually, it looks like you might be using the wrong values in your tIterateToFlow. tFileList puts the values for the file paths etc. into the globalMap, but it prefixes them with the unique component name. If you hit Ctrl+Space in the value window, it should prompt you with a list of available values (these are also listed in the "Outline" tab of the Studio). Talend typically makes an implicit conversion to String, but here you need to convert explicitly, so use .toString() instead of a (String) cast.
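So, assuming your tFileList component is named tFileList_1, the tIterateToFlow values would look like this:

FileName:         globalMap.get("tFileList_1_CURRENT_FILE").toString()
FileDirectory:    globalMap.get("tFileList_1_CURRENT_FILEDIRECTORY").toString()
FileAndDirectory: globalMap.get("tFileList_1_CURRENT_FILEPATH").toString()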
Another way to get the last modified file is as below:
tFileList (sorted DESC by file modified date) --> tFixedFlowInput (schema: filename, filenumber) --> tHashOutput
Here, in tFixedFlowInput:
filename = (String)globalMap.get("tFileList_1_CURRENT_FILEDIRECTORY") + "/" + (String)globalMap.get("tFileList_1_CURRENT_FILE")
filenumber = (Integer)globalMap.get("tFileList_1_NB_FILE")
What the above accomplishes is a list of all files in the directory with their number/rank, where the last modified file has filenumber = 1, the next one 2, and so on.
Now, on SubjobOK of the above tFileList, you can have a tHashInput which reads from the tHashOutput above and filters only the row where filenumber == 1 - which is the last modified file.
tHashInput (linked to tHashOutput) --> tFilterRow (filenumber == 1) --> tLogRow
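In tFilterRow, this is easiest to express in advanced mode (assuming the schema column is named filenumber as above):

input_row.filenumber == 1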
One reason why you are getting null is probably that you used globalMap.get("CURRENT_FILEPATH") instead of globalMap.get("tFileList_1_CURRENT_FILEPATH").
The simple solution to the above problem could be as below:
tFileList (sorted ASC by file modified date) --> tIterateToFlow --> tJava (just to end the subjob).
Then on
SubjobOK --> tFileInputDelimited (use (String)globalMap.get("tFileList_1_CURRENT_FILE") or (String)globalMap.get("tFileList_1_CURRENT_FILEPATH") as the file name/file path).
Explanation:
Since tFileList iterates over the files in ASC order, the globalMap always holds the latest file name after the last iteration. The list is only iterated up to tIterateToFlow, so after that component (String)globalMap.get("tFileList_1_CURRENT_FILE") will always give the last file name from the iterated list - which in our case is the latest file.
Main Flow:
Component View:
I have an Excel file which initially imports stock data from our cloud-based accounting program through an .iqy web query.
The column headings are:
A1 = Quantity, B1 = Item, C1 = Description, D1 = Bin Code
Now I have created a macro which:
Refreshes the data
Range("A1").QueryTable.Refresh False
Deletes all zero stock items
Dim intRow
Dim intLastRow
intLastRow = Range("A65536").End(xlUp).Row
For intRow = intLastRow To 1 Step -1
Rows(intRow).Select
If Cells(intRow, 1).Value = 0 Or Cells(intRow, 1) = "" Then
Cells(intRow, 1).Select
Selection.EntireRow.Delete
End If
Next intRow
Auto sorts by Bin Code
Range("A1:D1").Select
Selection.AutoFilter
Range("A2").Select
Range("A1:D1668").Sort Key1:=Range("D1"), Order1:=xlAscending, Header:= _
xlGuess, OrderCustom:=1, MatchCase:=False, Orientation:=xlTopToBottom, _
DataOption1:=xlSortNormal
Saves the Master list
Dim sFileName As String, sPath As String
sPath = "C:\stock\ms\Master List "
sFileName = Format(Now(), "dd'mm'yy")
ActiveWorkbook.SaveAs (sPath & sFileName)
Now this is the tricky bit.
At least 30 items a day need to be checked; however, a bin cannot be left incomplete! So once 30 items have been selected, the script needs to check whether the next item is in the same bin as the 30th item and, if so, include it in the extraction. Let's say item 30 is in bin 10A2, and so are items 31, 32, 33 and 34; then all in all 34 items (rows) need to be extracted into a new workbook and saved.
This process must start from the previous day's sample, so the mechanics should go like this:
look in c\stock\sl\ at the previous day's sample list (Sample List -1 dd'mm'yy) and take the last item's bin number, say 10A1,
take the next row's bin number, 10A2,
from the first row which has 10A2, select 30 rows,
continue until the bin number changes,
save that file as Sample List dd'mm'yy in c\stock\sl\,
email Sample List dd'mm'yy to NNN#NNN.com.
This should repeat daily. Also, the company is not open on Saturday and Sunday, so on Mondays it should look back to Friday, and so forth, also accounting for public holidays.
Any help with this would be a lifesaver! I don't mind if you want to change the file names so that this issue with holidays can be addressed; however, a time stamp needs to be placed somewhere in the file names.
You might want to check out the Dictionary object; it would probably help in this task. If you have any questions along the way, ask another question. I'm not sure whether someone else will want to give you a more thorough answer to this question.
Your project might be big enough that you would want to work with classes, too.
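For example, a Dictionary could group the row numbers belonging to each bin, which makes the "never split a bin" rule easy to enforce. A minimal sketch (names are assumptions; late binding is used so no extra references are needed):

Sub GroupRowsByBin()
    Dim binRows As Object
    Dim r As Long, lastRow As Long
    Dim bin As String
    Set binRows = CreateObject("Scripting.Dictionary")
    lastRow = Range("A65536").End(xlUp).Row
    For r = 2 To lastRow
        bin = CStr(Cells(r, 4).Value)        ' Bin Code is in column D
        If Not binRows.Exists(bin) Then binRows.Add bin, New Collection
        binRows(bin).Add r                   ' collect every row number for this bin
    Next r
    ' binRows("10A2") now holds all rows in bin 10A2, so whole bins can be extracted together
End Sub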
Please avoid every Select in your code.
For instance,
Range("A1:D1").Select
Selection.AutoFilter
can be replaced by:
Range("A1:D1").AutoFilter