What are some options for keeping track of temporary results and re-using them after a restart, in case the program dies while running? - perl

(Suggestions for improving the title of this question are welcomed.)
I have a perl script that uses web APIs to fetch a user's "liked" posts on various sites (tumblr, reddit, etc.), then download some portion of each post (for example, an image that's linked from the post).
Right now, I have a JSON-encoded file that keeps track of the posts that have already been fetched (for tumblr, it just records the total number of likes; for reddit, it records the "id" of the last post fetched) so that the script can just pick up with the newly "liked" items the next time it runs. This means that after the program is finished archiving a new batch of links, the new "stopping point" is recorded in the JSON file.
However, if the program croaks for some reason (or is killed with ctrl+c, say), the progress is not recorded (since the progress is only recorded at the end of the "fetching"). So the next time the program runs, it looks in the tracking file and gets the last recorded stopping point (the last time it successfully completed fetching and recorded the progress), and picks up there again, downloading duplicates up to the point where it croaked the last time.
My question is, what's the best (i.e. simplest, most efficient, take your pick--I'm open to options here) way to record progress with each incrementally archived item, so that if the program dies for some reason, it always knows exactly where it left off? Adapting the current method (literally print-ing to the tracking file at the end of each fetch) to do the same thing after each individual item is definitely not the best solution, because it's got to be pretty inefficient.
Edited for clarity
Let me make clearer that the file used to track the downloaded posts is not large, and does not grow appreciably with each "fetch" operation. There is only one element for each api (tumblr, etc.) that contains either the total number of likes for the account (in other words, the number that we have already downloaded, so we query the api for the current total, subtract the number in the file, and we know how many new items to fetch), or the ID of the last item fetched (reddit uses this, so we can ask the api for all items "after" the one in the file and only get the new stuff).
My problem is not an ever growing list of fetched posts, rather it is writing to the tracking file every time one single post is downloaded (and there could be thousands of posts downloaded in a single run).

Some ideas I would consider:
Write to the file more often, or use an interrupt handler to handle the interrupt signal 'safely'. When the handler is called, let the script write to your file so that it is as current as possible, then quit gracefully.
Use a better storage mechanism than a flat file. Depending on the need, I would consider using a database to store the ids. I groan when a database comes into play because of the complexity it adds, but it doesn't have to be complicated. I've used SQLite for queuing; also consider DBD::CSV, which just writes to a CSV file while allowing SQL syntax (I haven't used it myself). In your code you could then check whether the id is already in the database and know to skip it. I would imagine that SQLite is also more 'efficient' than reading and writing a flat file and, imo, it would be easier than writing the file-handling code yourself.
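To make the database idea concrete, here is a minimal sketch. It is written in Python purely for illustration (the same pattern should carry over to Perl with DBI and DBD::SQLite). The table name `fetched`, its columns, and all function names are my own assumptions, not anything from the original script. Each item is committed right after it is downloaded, so a crash should lose at most the item that was in flight.

```python
import sqlite3


def open_tracker(path=":memory:"):
    """Open (or create) the tracking database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fetched ("
        " site TEXT NOT NULL, post_id TEXT NOT NULL,"
        " PRIMARY KEY (site, post_id))"
    )
    conn.commit()
    return conn


def already_fetched(conn, site, post_id):
    """Return True if this post was recorded by an earlier (or the current) run."""
    row = conn.execute(
        "SELECT 1 FROM fetched WHERE site = ? AND post_id = ?", (site, post_id)
    ).fetchone()
    return row is not None


def mark_fetched(conn, site, post_id):
    """Record one archived post; INSERT OR IGNORE keeps the call idempotent."""
    conn.execute(
        "INSERT OR IGNORE INTO fetched (site, post_id) VALUES (?, ?)",
        (site, post_id),
    )
    conn.commit()  # commit per item so progress survives a crash


def archive(conn, site, post_ids, download):
    """Download only posts not yet recorded; `download` is supplied by the caller."""
    done = []
    for post_id in post_ids:
        if already_fetched(conn, site, post_id):
            continue
        download(post_id)
        mark_fetched(conn, site, post_id)
        done.append(post_id)
    return done
```

Passing a real file path instead of `":memory:"` is what makes the record persist between runs. Whether one commit per item is fast enough for thousands of posts is something you would have to measure; batching a few items per commit is the obvious trade-off.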

I'd just use a hash, tied to an NDBM file, to keep track of what is loaded and what isn't.
When you start a new batch of URLs, you delete the NDBM file.
Then, in your code, at the start of the program, you do
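use Fcntl; use NDBM_File;  # Fcntl provides O_RDWR and O_CREAT
tie(my %visited, 'NDBM_File', 'visitedurls', O_RDWR|O_CREAT, 0666) or die "Cannot tie visitedurls: $!";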
(don't worry about the O_CREAT, the file will remain intact if it exists unless you pass O_TRUNC as well)
Assuming your main loop looks like this:
while (my $id = <INFILE>) {
    chomp $id;
    my $url = id_to_url($id);
    my $results = fetch($url);
    save_results($url, $results);
}
you change that to
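while (my $id = <INFILE>) {
    chomp $id;
    my $url = id_to_url($id);
    my $results;
    if (exists $visited{$url}) {
        $results = $visited{$url};    # already fetched in an earlier run
    } else {
        $results = fetch($url);
        $visited{$url} = $results;    # stored in the NDBM file via the tie
    }
    save_results($url, $results);
}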
So whenever you fetch a new URL, you write the results to the NDBM file, and whenever you restart your program, the results that have already been fetched will be in the NDBM file and fetched from there instead of reading the URL.
This assumes $results is a scalar, else you won't be able to store/retrieve it in this way. But as you're producing JSON anyway, the "partial json" for each URL will probably be what you want to store.
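For comparison, the same check-the-cache-then-fetch pattern can be sketched in Python, with the standard `shelve` module standing in for the tied NDBM hash. This is only an illustration of the idea above; `fetch_all` and its parameters are hypothetical names, and `fetch` is whatever function the caller uses to retrieve a URL.

```python
import shelve


def fetch_all(urls, fetch, cache_path):
    """Return {url: result}, calling `fetch` only for URLs missing from the cache."""
    results = {}
    # shelve.open creates the file if it is missing and keeps it if it exists.
    with shelve.open(cache_path) as visited:
        for url in urls:
            if url in visited:
                results[url] = visited[url]   # reuse the result of an earlier run
            else:
                results[url] = fetch(url)
                visited[url] = results[url]   # remember it for the next run
                visited.sync()                # flush so a crash loses little
    return results
```

As with the NDBM version, deleting the cache file is how you start a fresh batch.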


How to do duplicate file check in DataStage?

For instance:
File A is loaded, then the next day
File B is loaded, then the next day
File A is received again. This time the sequence should abort.
Can anyone help me out with this?
There are multiple ways to solve this, but please don't abort intentionally, as such aborts are likely to come back at you like boomerangs.
Keep track of filenames and file hashes (like MD5sum) in a table and compare the list before loading. If the file is known, handle/ignore it.
Just read the file again as if it was new or updated. Compare old data with new data using the Change Capture stage, handle data as needed, e.g. write changed and new data to target. (recommended)
I would not recommend writing a sequence that "should abort", as this is not the goal of an ETL process. If the file contains the very same content that is already known, just ignore it. If it has updated data, handle it as needed. Only abort if there is a technical issue, e.g. the given file is malformed. An aborted job should indicate that something is wrong with the job itself. When you get a file twice, it's not the job that failed.
If an error is found in the data that needs to be fixed by others, write the information about it to a table. Have another, independent process monitor that table and tell the data producer about it (via dashboard, email, ...).
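The first option (tracking file names and hashes in a table) boils down to the logic below. This is a sketch in Python of what a pre-load step could do, not DataStage code; the `loaded_files` table and the function names are assumptions made up for the example, and SQLite merely stands in for whatever database holds your tracking table.

```python
import hashlib
import sqlite3


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file without reading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_registry(db_path=":memory:"):
    """Open (or create) the hypothetical tracking table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS loaded_files ("
        " md5 TEXT PRIMARY KEY, filename TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def should_load(conn, path):
    """Register the file and return True if its content is new, else return False."""
    checksum = file_md5(path)
    known = conn.execute(
        "SELECT 1 FROM loaded_files WHERE md5 = ?", (checksum,)
    ).fetchone()
    if known:
        return False  # same content seen before: skip it instead of aborting
    conn.execute(
        "INSERT INTO loaded_files (md5, filename) VALUES (?, ?)", (checksum, path)
    )
    conn.commit()
    return True
```

Keying on the checksum rather than the file name means a re-sent File A is recognised even if it arrives under a different name.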

How to Load Only One Track at a Time in Liquidsoap

I have a MySQL database that stores all my tracks and their associated information. One of the tables in the database is a queue table from which I pull a track for Liquidsoap to play. I am providing those tracks to play with Liquidsoap by using the request.dynamic.list.
def get_track() =
  # Get the first line of my external process
  result = list.hd(default="", get_process_lines(scripts ^ "get_track.py"))
  # Create and return a request using this result
  [request.create(result)]
end
# Create the source
sourcetrack = request.dynamic.list(id="play_queue", conservative=false, get_track)
The get_track.py script retrieves a record from a queue table in the database.
I noticed that Liquidsoap will grab two tracks when it starts up. Two get "accepted" and one is "prepared."
Is there a way to get Liquidsoap to only accept one track at a time and wait to accept the next one only when reaching near the end of the currently playing track?
I also have scheduled programs that get added to the queue table in the database and when this occurs, all tracks are cleared from the queue table in the database and the program is then added to the queue table.
Since Liquidsoap appears to have a track already loaded in its queue while playing the "prepared" track, is there a way to remove that track so that Liquidsoap will not play it next, but will instead call the get_track.py script again to load a new track from the queue table in the database?
Liquidsoap always prepares a stream's next items in advance; it's a fundamental principle of its scheduler. This allows it, for example, to start downloading a track before playing it. As long as you are using request.dynamic.list, the called script must take this into account. In other words, you can't rely only on clock time to decide which track to return.
As far as I understand your use case you might prefer using a request.queue source, and have your script push each request on time via the telnet server.
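A rough sketch of what "push each request via the telnet server" could look like from the Python side. Several things here are assumptions on my part: that the telnet server is enabled and listening on 127.0.0.1:1234, that the source is declared as request.queue(id="play_queue"), and that the command has the form `<id>.push <uri>`. Check all three against your Liquidsoap version before relying on this.

```python
import socket


def build_push_command(uri, queue_id="play_queue"):
    """Format the (assumed) telnet command that pushes one URI onto the queue."""
    return f"{queue_id}.push {uri}\n"


def push_track(uri, queue_id="play_queue", host="127.0.0.1", port=1234, timeout=5.0):
    """Send one push command to the telnet server and return its raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_push_command(uri, queue_id).encode("utf-8"))
        conn.sendall(b"quit\n")  # ask the server to close the session
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

With this arrangement the database side decides when a track is handed over, for example by calling `push_track` from the process that manages the queue table, instead of Liquidsoap polling get_track.py ahead of time.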

Can watchman send why a file changed?

Is watchman capable of telling the configured command why it is sending a file to that command?
For example:
a file is new to a folder would possibly be a FILE_CREATE flag;
a file that is deleted would send to the command the FILE_DELETE flag;
a file that's modified would send a FILE_MOD flag etc.
Perhaps even when a folder gets deleted (and therefore the files thereunder) would send a FOLDER_DELETE parameter naming the folder, as well as a FILE_DELETE to the files thereunder / FOLDER_DELETE to the folders thereunder
Is there such a thing?
No, it can't do that. The reasons why are pretty fundamental to its design.
The TL;DR is that it is a lot more complicated than you might think for a client to correctly process those individual events and in almost all cases you don't really want them.
Most file watching systems are abstractions that simply translate the system-specific notification information into some common form. They don't deal, either very well or at all, with the notification queue overflowing, and they don't give their clients a way to respond reliably to that situation.
In addition to this, the filesystem can be subject to many and varied changes in a very short amount of time, and from multiple concurrent threads or processes. This makes this area extremely prone to TOCTOU issues that are difficult to manage. For example, creating and writing to a file typically results in a series of notifications about the file and its containing directory. If the file is removed immediately after this sequence (perhaps it was an intermediate file in a build step), by the time you see the notifications about the file creation there is a good chance that it has already been deleted.
Watchman takes the input stream of notifications and feeds it into its internal model of the filesystem: an ordered list of observed files. Each time a notification is received watchman treats it as a signal that it should go and look at the file that was reported as changed and then move the entry for that file to the most recent end of the ordered list.
When you ask Watchman for information about the filesystem it is possible or even likely that there may be pending notifications still due from the kernel. To minimize TOCTOU and ensure that its state is current, watchman generates a synchronization cookie and waits for that notification to be visible before it responds to your query.
The combination of the two things above means that watchman result data has two important properties:
You are guaranteed to have observed all notifications that happened before your query
You receive the most recent information for any given file only once in your query results (the change results are coalesced together)
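A toy model may make the ordered-list idea and the coalescing property easier to picture. This is emphatically not watchman's implementation, just a sketch of the behaviour described above; the class and method names are invented for the illustration.

```python
from collections import OrderedDict


class ObservedFiles:
    """Toy model: files ordered by the tick at which they were last seen to change."""

    def __init__(self):
        self._tick = 0
        self._files = OrderedDict()  # path -> tick of most recent observed change

    def notify(self, path):
        """Record a change notification and return the new clock value."""
        self._tick += 1
        self._files.pop(path, None)      # drop the old position, if any
        self._files[path] = self._tick   # re-insert at the most recent end
        return self._tick

    def clock(self):
        """Return the current tick, usable as a 'since' marker for later queries."""
        return self._tick

    def since(self, tick):
        """Return each file changed after `tick` exactly once, oldest change first."""
        return [path for path, seen in self._files.items() if seen > tick]
```

If `a.txt` is notified three times between two queries, the second query still reports it once, and nothing in the model remembers whether those notifications were creates, writes, or deletes. That is the information the question is asking for, and it is already gone by the time the client asks.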
Let's talk about the overflow case. If your system is unable to keep up with the rate at which files are changing (e.g. you have a big project, are very quickly creating and deleting files, and the system is heavily loaded), the OS can't fit all of the pending notifications in the buffer resources allocated to the watches. When that happens, it blows those buffers and sends an overflow signal. What that means is that the client of the watching API has missed some number of events and is no longer in synchronization with the state of the filesystem. If that client maintains state about the filesystem, that state is no longer valid.
Watchman addresses this situation by re-examining the watched tree and synthetically marking all of the files as being changed. This causes the next query from the client to see everything in the tree. We call this a fresh instance result set because it is the same view you'd get when you are querying for the first time. We set a flag in the result so that the client knows that this has happened and can take appropriate steps to repair its own state. You can configure this behavior through query parameters.
In these fresh instance result sets, we don't know whether any given file really changed or not (it's possible that it changed in such a way that we can't detect via lstat) and even if we can see that its metadata changed, we don't know the cause of that change.
There can be multiple events that contribute to why a given file appears in the results delivered by watchman. We don't record them individually because we can't track them with unbounded history; imagine a file that is incrementally being written once every second all day long. Do we keep 86400 change entries for it per day on hand and deliver those to our clients? What if there are hundreds of thousands of files like this? We'd have to truncate that data, and at that point the loss in the data reduces how well you can reason about it.
At the end of all of this, it is very rare for a client to do much more than try to read a file or look at its metadata, and generally speaking, they want to do that only when the file has stopped changing. For this use case, watchman-wait, watchman-make and trigger all have the concept of a settle period that causes the change notifications to be delayed in delivery until after the filesystem has stopped changing.

Powershell - Copying CSV, Modifying Headers, and Continuously Updating New CSV

We have a log that tracks faxes sent through our fax server. It is a .csv that contains Date_Time, Duration, CallerID, Direction (i.e. inbound/outbound), Dialed#, and Answered#. This file is overwritten every 10 minutes with any new info that was tracked on the fax server. This cannot be changed to be appended.
Sometimes our faxes fail, and the duration on those will be equal to 00:00:00. We really don't know if they are failing until users let us know that they are getting complaints about missing faxes. I am trying to create a Powershell script that can read the file and notify us via email if there are n amount of failures.
I started working on it, but it quickly became a big mess as I ran into more problems. One issue I was trying to overcome was having it email us over and over about the same failures. Since I can't save anything in the original .csv, I was trying to perform these steps in the script:
Copy .csv with a new header titled "LoggedFailure". Create file if it doesn't exist.
Compare the two files, and add different data (i.e. updates on the original) to the copy.
Check copied .csv for Durations equal to 00:00:00. If it is, mark the LoggedFailure header as "Yes" or some value.
If there are n amount of failures, email us.
Have this script run as a scheduled task (every hour or so).
I'm having difficulty with maintaining the data. I haven't done a lot of work with scripting or programming, so I'm having trouble getting the logic right. I can look up cmdlets and understand them, but my main issue is logic. Does anyone have any tips or ideas on how best to update the data, track failures so as not to send duplicate information, and have it run?
I'd use a hash table with the Dialed# as the key. Create PSCustomObjects that have LastFail date and FailCount properties as the values. Read through the log in chronological order, and add/increment a new entry in the hash table every time it finds an entry with Duration of 00:00:00 that's newer than what's already in the hash table. If it finds a successful delivery event, delete the entry with that Dialed# key from the hash table if it exists.
When it's done, the hash table keys will be a collection of the Dialed numbers that are failing, and the objects in the values will tell you how many failures there have been, and when the last one was. Use that to determine if an alert needs to be sent, and what numbers to report.
When a problem with a given fax number is resolved, a successful fax to that number will clear the entry from the hash table, and stop the alerts.
Save the hash table between runs by exporting it as CLIXML, and re-import it at the beginning of each run.
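The logic of the hash table approach, sketched in Python rather than PowerShell so the steps are easy to follow. Assumptions that are mine, not from the log format: rows are dictionaries keyed by the CSV headers, Date_Time is an ISO-8601 string (so plain string comparison orders it correctly), and a JSON file stands in for the CLIXML export. The function names are hypothetical.

```python
import json


def update_failures(state, rows):
    """Apply log rows (processed oldest first) to `state`, keyed by dialed number."""
    for row in sorted(rows, key=lambda r: r["Date_Time"]):
        number, when = row["Dialed#"], row["Date_Time"]
        entry = state.get(number)
        if row["Duration"] == "00:00:00":
            if entry is None:
                state[number] = {"LastFail": when, "FailCount": 1}
            elif when > entry["LastFail"]:   # only count failures not seen before
                entry["LastFail"] = when
                entry["FailCount"] += 1
        elif entry is not None and when > entry["LastFail"]:
            del state[number]                # a later success clears the entry
    return state


def numbers_to_alert(state, threshold):
    """Return the dialed numbers with at least `threshold` recorded failures."""
    return sorted(n for n, e in state.items() if e["FailCount"] >= threshold)


def load_state(path):
    """Load the saved table from a previous run, or start empty."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


def save_state(path, state):
    """Persist the table so the next scheduled run can pick it up."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
```

Comparing each row against LastFail is what keeps a re-read of the same log from counting a failure twice, which in turn is what stops the repeated emails.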

Using Mojo Event Loop for a long-running script processing huge text files?

I have a script implemented as a Mojo::Command.
It reads a huge text file and extracts data from it. The file contains simple tab separated (C/TSV) records. One record per line.
How can I use the Mojo event loop to store those records in small files - one file per record - so that my script does not wait for each record to be stored but continues to the next record?
Here is a stripped down example:
package My::task;
use Mojo::Base 'Mojolicious::Command';
# in My::task::run
# use Text::CSV to open and read the file
while (!$csv->eof()) {
    my $row = $csv->getline($fh);
    do_something_time_consuming($row);
}
I was thinking the Mojo event loop could be used to avoid forking/threading.
I have successfully used Parallel::Forker before, but I was thinking Mojo might have something to offer to speed up the execution.
Is that possible? How?
It depends on the nature of do_something_time_consuming. If it keeps your process CPU-busy, then you're looking for parallelism, which an event loop doesn't try to give you. In that case you might want to feed each row to Redis (via Mojo::Redis) and have worker processes consume, process, and store each record. Throughput then comes down to how many parallel workers you can run.
On the other hand, if do_something_time_consuming involves a lot of waiting, eg post to a web service and wait for results, then an event loop (incl mojo's) can be a big win, and handle the concurrency that you want. It's hard to guess which of the non-blocking UserAgent examples is closest to your scenario, since you're short on detail. The gist is to create a callback that does what you want (eg store_the_record_somewhere) when it gets a response back from the remote service.
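To show the shape of the waiting-heavy case, here is a small sketch using Python's asyncio as a stand-in for the Mojo event loop; it is not Mojolicious code. The `process_record` step just sleeps to simulate waiting on a remote service, and every name in it is hypothetical.

```python
import asyncio


async def process_record(record, delay=0.01):
    """Stand-in for an I/O-bound step; the sleep simulates waiting on a service."""
    await asyncio.sleep(delay)
    return record.upper()


async def process_all(records, limit=10):
    """Process records concurrently on one event loop, at most `limit` in flight."""
    gate = asyncio.Semaphore(limit)

    async def guarded(record):
        async with gate:
            return await process_record(record)

    # gather returns the results in the same order as the input records
    return await asyncio.gather(*(guarded(r) for r in records))


def run(records, limit=10):
    """Synchronous entry point, e.g. for calling from a command-line script."""
    return asyncio.run(process_all(records, limit))
```

While one record is waiting, the loop moves on to the next, which is the win described above. If `process_record` were CPU-bound instead, the records would effectively be processed one after another and you would be back to needing worker processes.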