Fetch data from a share for two sides

There is a share with data on it for an ETL process. The problem is that two sides need to fetch this data before it can be deleted from the share.
What is the best practice for making sure both sides have fetched the data before moving it to the trash?

The data, when loaded, should be "registered". That is to say, a record should be written to a database to track that the file/data was loaded: filename, row count, date found, date loaded, location on the share, a unique identifier for the data, etc. All of this should be tracked. If you are doing this, and doing it at both locations where the data is to be loaded, then it is just a matter of having each destination query the other to see whether the file has been loaded.
Alternatively, you can have a separate, third "cleanup" process that deletes completed data/files. It would inspect the files located on the share and then check each destination to see whether the file/data has been loaded. Once it has been loaded everywhere, this third process would handle deleting the source data (or, preferably, archiving it if it was not already archived earlier).
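As a rough sketch of that cleanup idea, assuming each destination records its loads in a tracking table reachable over DBI (the DSNs, the load_registry table and the paths below are placeholders for illustration):
use strict;
use warnings;
use DBI;
use File::Copy qw(move);

# connect to the tracking database at each destination (connection details are placeholders)
my $dest_a = DBI->connect('dbi:Pg:dbname=warehouse_a', 'etl', 'secret', { RaiseError => 1 });
my $dest_b = DBI->connect('dbi:Pg:dbname=warehouse_b', 'etl', 'secret', { RaiseError => 1 });

my $share   = '/mnt/share/inbound';
my $archive = '/mnt/share/archive';

for my $file (glob "$share/*") {
    my ($name) = $file =~ m{([^/]+)\z};
    # only move a file off the share once BOTH destinations have registered it as loaded
    my ($loaded_a) = $dest_a->selectrow_array(
        'SELECT count(*) FROM load_registry WHERE filename = ?', undef, $name);
    my ($loaded_b) = $dest_b->selectrow_array(
        'SELECT count(*) FROM load_registry WHERE filename = ?', undef, $name);
    if ($loaded_a && $loaded_b) {
        move($file, "$archive/$name") or warn "could not archive $name: $!\n";
    }
}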

How to do duplicate file check in DataStage?

For instance:
File A is loaded, then the next day
File B is loaded, then the next day
File A is received again; this time the sequence should abort.
Can anyone help me out with this? Thanks.
There are multiple ways to solve this, but please don't build in intentional aborts; they tend to come back at you like a boomerang.
Keep track of filenames and file hashes (like an MD5 sum) in a table and compare against that list before loading. If the file is already known, handle or ignore it (a rough sketch of this follows below).
Just read the file again as if it were new or updated. Compare the old data with the new data using the Change Capture stage and handle the data as needed, e.g. write changed and new data to the target. (recommended)
I would not recommend writing a sequence that "should abort", as this is not the goal of an ETL process. If the file contains the very same content that is already known, just ignore it. If it has updated data, handle it as needed. Only abort if there is a technical issue, e.g. the file is formatted incorrectly. An abort of a job should indicate that something is wrong with the job. When you get a file twice, it's not the job that failed.
If an error is found in the data that needs to be fixed by others, write the information about it to a table. Have another independent process monitor that table and tell the data producer about it (via dashboard, email, ...).
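As a rough sketch of the hash-tracking idea, implemented as a check outside of DataStage itself (e.g. run before the job is triggered); the seen_files table, the SQLite file name and the columns are made up for illustration:
use strict;
use warnings;
use DBI;
use Digest::MD5;

sub file_md5 {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    return Digest::MD5->new->addfile($fh)->hexdigest;
}

my $dbh  = DBI->connect('dbi:SQLite:dbname=etl_meta.db', '', '', { RaiseError => 1 });
my $file = shift @ARGV or die "usage: $0 <file>\n";
my $md5  = file_md5($file);

# is this exact file (name + content hash) already registered?
my ($known) = $dbh->selectrow_array(
    'SELECT count(*) FROM seen_files WHERE filename = ? AND md5sum = ?',
    undef, $file, $md5);

if ($known) {
    print "$file has already been loaded with identical content, ignoring it\n";
} else {
    # ... trigger the actual load here, then register the file ...
    $dbh->do('INSERT INTO seen_files (filename, md5sum) VALUES (?, ?)', undef, $file, $md5);
}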

Nextcloud - mass removal of collaborative tags from files

Due to an oversight in a flow routine that was meant to tag certain folders on upload into the cloud, a huge number of unwanted files were also tagged in the process. Now there are thousands upon thousands of files that have the wrong tag and need to be untagged. Neither doing this by hand nor re-uploading with the correct flow routine is really a workable option. Is there a way to do the following:
Crawl through every entry in a folder
If it's a file, untag it; if it's a folder, don't
Everything I found about tags and Nextcloud was concerned with handling them when files are uploaded, but never with going over existing files and changing their tags.
Is this possible?
Nextcloud stores this data in the configured database, so you can simply remove the assignments from the db.
The assignments are stored in oc_systemtag_object_mapping, while the tags themselves are in oc_systemtag. Once you have found the ID of the tag to remove (let's say 4), you can simply remove all of its assignments from the db:
DELETE FROM oc_systemtag_object_mapping WHERE systemtagid = 4;
If you would like to do this only for a specific folder, it doesn't even get much more complicated. Files (including their folder structure!) are stored in oc_filecache, while oc_systemtag_object_mapping.objectid references oc_filecache.fileid. So with some joining and LIKEing, you can limit the rows to delete. If your tag is used for non-files, your condition should also include oc_systemtag_object_mapping.objecttype = 'files'.
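For example, a sketch assuming MySQL/MariaDB, the default oc_ table prefix, a tag ID of 4 and a hypothetical folder named Photos in the user's files (take a database backup before running anything like this):
DELETE m
FROM oc_systemtag_object_mapping AS m
JOIN oc_filecache AS f ON f.fileid = m.objectid
WHERE m.systemtagid = 4
  AND m.objecttype = 'files'
  AND f.path LIKE 'files/Photos/%'
  -- leave folders themselves tagged: exclude the directory mimetype
  AND f.mimetype <> (SELECT id FROM oc_mimetypes WHERE mimetype = 'httpd/unix-directory');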

What is the best practice to persist items inside a reorderablelist after change of order

I have a reorderable list where I read further information for each row, on tap, from a JSON file. Each row displays the corresponding JSON file's name. I read these files from a local folder on the user's device. The list shown here lets the user reorder the items. The problem is that I want the reordering to persist, by which I mean my app should remember the order the user chose the next time the app is launched. I cannot think of any way to do this. Do I store a local JSON file keeping all the file names and their corresponding row indexes? What would be a best practice for this? The list is expected to have 50 to 200 rows, so I need a scalable solution.

What are some options for keeping track of temporary results and re-using them after a restart, in case the program dies while running?

(Suggestions for improving the title of this question are welcomed.)
I have a perl script that uses web APIs to fetch a user's "liked" posts on various sites (tumblr, reddit, etc.), then download some portion of each post (for example, an image that's linked from the post).
Right now, I have a JSON-encoded file that keeps track of the posts that have already been fetched (for tumblr, it just records the total number of likes; for reddit, it records the "id" of the last post fetched) so that the script can just pick up with the newly "liked" items the next time it runs. This means that after the program finishes archiving a new batch of links, the new "stopping point" is recorded in the JSON file.
However, if the program croaks for some reason (or is killed with ctrl+c, say), the progress is not recorded (since the progress is only recorded at the end of the "fetching"). So the next time the program runs, it looks in the tracking file and gets the last recorded stopping point (the last time it successfully completed fetching and recorded the progress), and picks up there again, downloading duplicates up to the point where it croaked the last time.
My question is, what's the best (i.e. simplest, most efficient, take your pick--I'm open to options here) way to record progress with each incremental archived item, so that if the program dies for some reason, it always knows exactly where to pick up where it left off? Adapting the current method (literally print-ing to the tracking file at the end of each fetch) to do the same thing after each individual item is definitely not the best solution because it's got to be pretty inefficient.
Edited for clarity
Let me make clearer that the file used to track the downloaded posts is not large, and does not grow appreciably with each "fetch" operation. There is only one element for each api (tumblr, etc.) that contains either the total number of likes for the account (in other words, the number that we have already downloaded, so we query the api for the current total, subtract the number in the file, and we know how many new items to fetch), or the ID of the last item fetched (reddit uses this, so we can ask the api for all items "after" the one in the file and only get the new stuff).
My problem is not an ever growing list of fetched posts, rather it is writing to the tracking file every time one single post is downloaded (and there could be thousands of posts downloaded in a single run).
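For reference, the tracking data is roughly shaped like this before it is JSON-encoded (the key names and values here are illustrative, not the actual format):
my %progress = (
    tumblr => { total_likes => 1234 },          # how many likes have already been archived
    reddit => { last_id     => 't3_abc123' },   # id of the last post fetched
);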
Some ideas I would consider:
Write to the file more often, or use an interrupt handler to 'safely' handle the interrupt signal. When it's called, allow the script to write to your file so it's as current as possible, then quit gracefully (see the sketch after this list).
Use a better storage mechanism than a flat file. Depending on the need, I would consider using a database to store the ids. I groan when a database comes into play because of the complexity it adds, but it doesn't have to be complicated. I've used SQLite for queuing, but also consider DBD::CSV, which writes to a plain CSV file while allowing SQL syntax (I haven't used it myself). In your code you could then check whether an id is already in the database and know to skip it. I would imagine SQLite is also more 'efficient' than reading/writing a flat file and, imo, easier to code than writing the file handling yourself.
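A minimal sketch of the signal-handler idea; get_new_posts, archive_post and save_progress are placeholders for whatever the script already does:
my $interrupted = 0;
$SIG{INT} = sub { $interrupted = 1 };       # remember the Ctrl+C, but let the current item finish

my $last_done;
for my $post (get_new_posts()) {            # placeholder: list the newly liked items
    archive_post($post);                    # placeholder: download/save one item
    $last_done = $post;
    last if $interrupted;                   # stop cleanly after the item in flight
}
save_progress($last_done) if defined $last_done;   # write the stopping point exactly once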
I'd just use a hash, tied to an NDBM file, to keep track of what is loaded and what isn't.
When you start a new batch of URLs, you delete the NDBM file.
Then, in your code, at the start of the program, you do
use Fcntl; use NDBM_File;   # needed for the O_* constants and the tie class
tie(%visited, 'NDBM_File', 'visitedurls', O_RDWR|O_CREAT, 0666);
(don't worry about the O_CREAT, the file will remain intact if it exists unless you pass O_TRUNC as well)
Assuming your main loop looks like this:
while ($id = <INFILE>) {
    my $url = id_to_url($id);
    my $results = fetch($url);
    save_results($url, $results);
}
you change that to
while ($id = <INFILE>) {
    my $url = id_to_url($id);
    my $results;
    if ($visited{$url}) {
        $results = $visited{$url};
    } else {
        $results = fetch($url);
        $visited{$url} = $results;
    }
    save_results($url, $results);
}
So whenever you fetch a new URL, you write the results to the NDBM file, and whenever you restart your program, the results that have already been fetched will be in the NDBM file and fetched from there instead of reading the URL.
This assumes $results is a scalar, otherwise you won't be able to store/retrieve it this way. But as you're producing JSON anyway, the "partial JSON" for each URL is probably what you want to store.
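If $results is a reference rather than a plain string, one option (a sketch using the core JSON::PP module) is to serialize it on the way in and out of the tied hash:
use JSON::PP qw(encode_json decode_json);

$visited{$url} = encode_json($results);        # store the structure as a JSON string
my $restored   = decode_json($visited{$url});  # decode it again after a restart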

Need Core Data help to insert objects

First of all I want to show how I made this in SQL:
Both the location and environment tables will never contain more than those four rows. Each log can only be associated with 4 rows.
What I don't understand is how I even start writing code that will take whatever the user has chosen, based on state switches etc. in my UI, and persist it.
When the user is done, I want to store a "log record", and the log record may have location and environment rows associated with it. And what happens when the user, let's say, chooses all the location rows four times in a row... does it add the location to the location "entity" every time? Would I end up with a lot of duplicated data? I would appreciate any help that shows me how to do this. Thank you!
Looks like you need three entities. You'll have Location and Environment entities that have whichever attributes they need, and a Log entity that has relationships with both Environment and Location. I think you're asking whether instances of Location and Environment that happen to be the same will be duplicated in the Core Data store, or whether multiple Log instances will relate to the same Location and Environment instances. Is that right?
Answer: it's up to you. Say you want to save a Location instance that has a particular set of attributes. You could first search for one that has that exact set of attributes and associate it with your Log instance, or you could just create a new Location instance and not worry about the duplication. If you're storing zillions of these Log entries, the first plan might save a lot of space. If you're not saving them all that often, and particularly if the user can go back and change the data associated with a Log instance, you might want to use separate instances even if they happen to be the same.