I currently have 14 files in my directory and I want to upload them in batches of 7: the first 7 files, then the remaining 7.
Can you please tell me the flow?
The job design should look like this:
1- In tJava_1, add a dummy log:
System.out.println(((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")));
2- tFileProperties (we need its mtime value, i.e. when each file was last modified), configured with the file:
((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))
3- tBufferOutput stores the data.
4- tBufferInput retrieves the data stored by tBufferOutput (don't forget to set the same schema).
5- tSortRow sorts the rows by mtime.
6- tJava_2:
System.out.println(((Integer)globalMap.get("tBufferInput_1_NB_LINE")));
// to get the first 7 files
int j = ((Integer)globalMap.get("tBufferInput_1_NB_LINE")) - 7;
globalMap.put("j", j);
7- tLoop, with a While configuration:
Declaration: int i = 0
Condition: i < (int) globalMap.get("j")
Iteration: i++
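For reference, here is the same batching idea as a standalone Java sketch outside Talend (the file names and the upload step are illustrative): sort the files upstream, then take them in fixed-size groups of 7.

import java.util.Arrays;
import java.util.List;

public class BatchUpload {
    public static void main(String[] args) {
        // Already sorted by mtime upstream (tSortRow in the job)
        List<String> sortedFiles = Arrays.asList(
                "f01.csv", "f02.csv", "f03.csv", "f04.csv", "f05.csv",
                "f06.csv", "f07.csv", "f08.csv", "f09.csv", "f10.csv",
                "f11.csv", "f12.csv", "f13.csv", "f14.csv");
        int batchSize = 7;
        for (int start = 0; start < sortedFiles.size(); start += batchSize) {
            int end = Math.min(start + batchSize, sortedFiles.size());
            // Replace this println with the real upload of the batch
            System.out.println("Uploading batch: " + sortedFiles.subList(start, end));
        }
    }
}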
/////////////////////////////////////////////////////////////
Another way is to do it with a combination of the tFileList and tFileExist components:
1- Assume A1 is the source folder.
2- Create two other folders, A2 and A3.
3- Iterate over A1 with tFileList, and use tFileExist with the A2 directory path and the global variable "tFileList_1_CURRENT_FILE" to check whether each file is already in A2.
4- After tFileExist, use a Run if condition on tFileExist_2_EXISTS (negated, since you only want files that are not yet in A2) and copy those files to folder A3 with tFileCopy.
5- At the end of the run, sync A1 and A2.
6- After processing the new files in A3, archive/delete them.
7- This way, every run stores only the new files in A3.
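As a plain-Java sketch of the decision this flow makes (assuming folders A1, A2 and A3 exist; this is not the Talend job itself): a file is treated as new, and copied, only when A2 has no counterpart for it.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class CopyNewFiles {
    public static void main(String[] args) throws IOException {
        File a1 = new File("A1"), a2 = new File("A2"), a3 = new File("A3");
        for (File f : a1.listFiles(File::isFile)) {
            // A file is "new" when it has no counterpart in A2 yet
            if (!new File(a2, f.getName()).exists()) {
                Files.copy(f.toPath(), new File(a3, f.getName()).toPath(),
                        StandardCopyOption.REPLACE_EXISTING);
                // Sync into A2 so the next run sees it as already processed
                Files.copy(f.toPath(), new File(a2, f.getName()).toPath(),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}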
I have the following directory structure on an Azure container:
-dwh-prod
    -Main_Folder
        -2021-01
            -file1.parquet
        -2021-02
            -file2.parquet
            -file3.parquet
where the data is partitioned by year and month into subfolders. Within these subfolders, I have my data files. I want my data flow to load only the latest files, i.e. those added within one day of running my data flow pipeline.
I tried using currentUTC() as the End Time and one day earlier, AddDays(currentUTC(), -1), as the Start Time in the 'Filter by last modified' option provided in the source options, but it didn't work.
I also tried using currentTimestamp() instead, but to no avail.
How do I go about solving this?
Your expression is correct. Please change the folder path from MainFolder to Main_Folder in your dataset and set Main_Folder/*/*.parquet as the Wildcard path in your Source options. Then it will work.
I think your solution is close, but I'm not sure the folder name is sufficient. I'm also not familiar with "currentUTC". The correct function should be utcNow.
Below is an outline of how I would approach this problem.
Source Dataset
Add a Parameter for the subfolder (year-month), then set the Folder path to an expression that appends that parameter to the base folder (see the sketch below).
Pipeline
You could either pass in the subfolder or calculate it at runtime; my preference would be to pass it in as a parameter.
I would then add variables to calculate the start and end times. Since you are running this daily, I would be sure to force the times to the START of the day(s), which should handle any vagaries based on run time. I would also use the built-in getPastTime function.
Now use these objects in your Source configuration's 'Filter by last modified' settings, as sketched below.
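A rough sketch of the expressions (the parameter and variable names here are illustrative, not the exact ones from the original setup):

Dataset parameter:   SubFolder (String)
Dataset Folder path: @concat('Main_Folder/', dataset().SubFolder)

Pipeline variables:
  StartTime = @startOfDay(getPastTime(1, 'Day'))
  EndTime   = @startOfDay(utcNow())

Source options, 'Filter by last modified':
  Start time: StartTime
  End time:   EndTime

With the times forced to the start of the day, a run at any hour today picks up exactly the files modified during the previous day.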
I have a scenario where I have around 10K files inside a folder named '_out', and '_out' in turn contains other folders, inner files, and subfolders.
I am deleting around 6,000 files that are directly under the _out folder.
I delete the files using the following code snippet:
for (IFile iFile : files) {
    iFile.delete(true, new NullProgressMonitor());
}
When each file is deleted, the properties of the resource are deleted internally by Resource.deleteResource(boolean, MultiStatus) via a call to IPropertyManager.deleteResource(IResource). That API in turn calls PropertyManager2.deleteProperties(IResource, int depth) with INFINITE depth, so every folder and subfolder under "\workspace\.metadata\.plugins\org.eclipse.core.resources\.projects\projectName\.indexes\8f" is visited, each folder's 'properties.index' file is loaded, and each is checked for a property belonging to the given IFile. Here the '8f' folder is the bucket for the '_out' folder.
This operation is repeated for all 6,000 files, so deleting them takes around 15 minutes.
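For comparison, wrapping the same loop in one batched workspace operation is a common mitigation (a sketch; it batches resource-change notifications and autobuilds, but does not avoid the per-file property walk described above):

// 'files' as in the snippet above
IWorkspaceRunnable batchDelete = monitor -> {
    for (IFile iFile : files) {
        iFile.delete(true, monitor);
    }
};
IWorkspace workspace = ResourcesPlugin.getWorkspace();
// AVOID_UPDATE defers notifications until the whole batch is done
workspace.run(batchDelete, workspace.getRoot(), IWorkspace.AVOID_UPDATE,
        new NullProgressMonitor());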
Ideally, should we just visit the properties.index file under '8f', delete the property for the given file, and return?
My assumption is that the properties for any file directly under the '_out' folder are saved in the 'properties.index' file in the bucket corresponding to the '_out' folder (in my scenario, the 8f folder).
Apologies if I have misunderstood. Kindly clarify.
I have raised a bug for the same at link.
Thanks,
Palraj
I would like to save variables as .mat files on S3. The example on the official site shows a "tall table" only. Maybe I could use the "system" command to step outside MATLAB, but I am looking for a straightforward solution.
Any suggestions?
It does look like save does not support saving to remote filesystems.
You can, however, write matrices, cells, tables and timetables.
An example which uses writetable:
LastName = {'Smith';'Johnson';'Williams';'Jones';'Brown'};
Age = [38;43;38;40;49];
T = table(Age,LastName)
writetable(T,'s3://.../table.txt')
Note:
To write to a remote location, filename must contain the full path of
the file specified as a uniform resource locator (URL) of the form:
scheme_name://path_to_file/my_file.ext
To obtain the right URL for the bucket, you can navigate to the contents of the S3 bucket, select a file in there, choose Copy path, and remove the name of the file (e.g. table.txt).
The alternative is, as you mentioned, a system call:
a = rand(5);
save('matExample','a');
system('aws s3api put-object --bucket mybucket --key=s3mat.mat --body=matExample.mat')
The MAT-file matExample.mat is uploaded to the bucket as s3mat.mat.
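If you go the system-call route, it may be worth building the command with sprintf and checking the exit status (a sketch; the bucket and key names are illustrative):

a = rand(5);
tmp = fullfile(tempdir, 'matExample.mat');  % save locally first
save(tmp, 'a');

% Build the AWS CLI call and verify that it succeeded
cmd = sprintf('aws s3api put-object --bucket mybucket --key s3mat.mat --body "%s"', tmp);
status = system(cmd);
if status ~= 0
    error('Upload to S3 failed with exit code %d.', status);
end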
What is the best way to iterate over files and feed them into tMongoDBBulkLoad? It seems that you cannot feed into this component from a tFileList component (Iterate), which would make the most sense.
I want to import 80 files rather than create one massive file, which would be too large to open in Notepad if I have issues during the import.
Thanks
---Update----
I know how to do this with other components; my issue is that I cannot feed an Iterate link into tMongoDBBulkLoad.
The simplified job will look like this:
tFileList ---------iterate--------tMongoDBBulkLoad
and in the tMongoDBBulkLoad settings you set the Data file to :
((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))
Here, tFileList iterates over the files; in each iteration, tMongoDBBulkLoad is triggered to load the current file, which is indicated by the global variable.
--- Reply to the Update ---
To connect an Iterate trigger to the component, you can add a dummy tJava with no code, like this:
tFileList -----(iterate)-----tJava-------(onComponentOk)-------tMongoDBBulkLoad
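The dummy tJava does not have to stay empty, by the way; a single log line (a sketch) gives you visibility into which file each iteration loads:

// Optional: log the file about to be bulk-loaded in this iteration
System.out.println("Loading: " + ((String) globalMap.get("tFileList_1_CURRENT_FILEPATH")));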
What is the best way to apply logic to objects during a tFileList iteration?
The issue is that if I use tFileList to get a list of files, I am not able to use tJavaRow or tMap to build the new filename for the rename. Basically, I have zip files by year (2010, 2011, 2012, etc.), and each zip file contains files with the same names (f1.csv, f2.csv, f3.csv). I want to iterate through the compressed files, uncompress them, and rename the extracted files to
f1_2010.csv, f2_2010.csv, f3_2010.csv, ..., f1_2012.csv, etc.
Thanks!
Iterate links provide a way to execute components based on events or facts, while Main links transfer data between components.
With something like the following, you should be able to solve your problem:
tFileList_1 --iterate--> tFileUnarchive_1
|
onComponentOK
|
tFileList_2 -- iterate --> tFileCopy_1
|
onComponentOK
|
tFileArchive_1
Use ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")) in your tFileUnarchive to get the ZIP path.
In tFileCopy, use ((String)globalMap.get("tFileList_2_CURRENT_FILEPATH")) to get the path of the extracted file, and configure the component to perform a rename.
For the name modification, you can add a tJava on the onComponentOK link and store the year in the globalMap, e.g. globalMap.put("year", ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")).substring(x, y)) or more elaborate code, then use that variable in your other components' parameters. A sketch follows.
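As an illustration, assuming the archives are named like data_2010.zip (the name and the substring indices here are hypothetical; adapt them to your layout), the tJava could extract the year like this:

// Path of the ZIP currently processed by tFileList_1,
// e.g. "C:/in/data_2010.zip" (hypothetical layout)
String zipPath = (String) globalMap.get("tFileList_1_CURRENT_FILEPATH");

// Take the four digits just before ".zip" as the year
String year = zipPath.substring(zipPath.length() - 8, zipPath.length() - 4);
globalMap.put("year", year);

Then, in the tFileCopy Destination filename, something like ((String)globalMap.get("tFileList_2_CURRENT_FILE")).replace(".csv", "_" + globalMap.get("year") + ".csv") produces f1_2010.csv and so on.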