Using Mojo Event Loop for a long-running script processing huge text files? - perl

I have a script implemented as a Mojo::Command.
It reads a huge text file and extracts data from it. The file contains simple tab-separated (CSV/TSV) records, one record per line.
How can I use the Mojo event loop to store those records in small files (one file per record) so my script does not wait for each record to be stored, but continues on to the next record?
Here is a stripped down example:
package My::task;
use Mojo::Base 'Mojolicious::Command';

# in My::task::run
# use Text::CSV to open and read the file
while (!$csv->eof()) {
    my $row = $csv->getline($fh);
    do_something_time_consuming_and_store_the_record_somewhere($row);
}
I was thinking the Mojo event loop could be used to avoid forking/threading.
I have used Parallel::Forker successfully in the past, but I was wondering whether Mojo has something to offer to speed up the execution.
Is that possible? How?

It depends on the nature of do_something_time_consuming. If it is something that keeps your process CPU-busy, then you're looking for parallelism, which an event loop doesn't try to give you. In that case you might want to feed each row to Redis (via Mojo::Redis) and have worker processes consume, process, and store each record. Throughput then comes down to how many parallel workers you can run.
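Here is a minimal sketch of that producer side, assuming Mojo::Redis 3.x and hypothetical worker processes that BRPOP from a Redis list named "records" (the list name and the filename are placeholders):

use Mojo::Base -strict;
use Mojo::IOLoop;
use Mojo::JSON qw(encode_json);
use Mojo::Redis;
use Text::CSV;

my $csv = Text::CSV->new({ sep_char => "\t" });
open my $fh, '<', 'huge.tsv' or die "Cannot open input: $!";

my $db = Mojo::Redis->new('redis://localhost')->db;

# Queue each row as JSON; the workers pop, process, and store them.
while (my $row = $csv->getline($fh)) {
    $db->lpush_p(records => encode_json($row))
       ->catch(sub { warn "queueing failed: $_[0]" });
}

Mojo::IOLoop->start unless Mojo::IOLoop->is_running;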
On the other hand, if do_something_time_consuming involves a lot of waiting, e.g. posting to a web service and waiting for results, then an event loop (including Mojo's) can be a big win and can handle the concurrency you want. It's hard to guess which of the non-blocking UserAgent examples is closest to your scenario, since you're short on detail. The gist is to create a callback that does what you want (e.g. store_the_record_somewhere) when it gets a response back from the remote service.
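A minimal sketch of that callback style, where the service URL, the sample rows, and store_the_record_somewhere are placeholders:

use Mojo::Base -strict;
use Mojo::IOLoop;
use Mojo::UserAgent;

my @rows = ([ 'a', 1 ], [ 'b', 2 ]);   # stand-in for rows parsed with Text::CSV

sub store_the_record_somewhere {
    my ($row, $result) = @_;
    warn "stored record for $row->[0]\n";   # placeholder
}

my $ua = Mojo::UserAgent->new;
for my $row (@rows) {
    # Passing a callback makes the request non-blocking: the loop keeps
    # dispatching further rows while this response is still in flight.
    $ua->post('http://example.com/process' => json => $row => sub {
        my ($ua, $tx) = @_;
        store_the_record_somewhere($row, $tx->res->json);
    });
}

Mojo::IOLoop->start unless Mojo::IOLoop->is_running;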

Related

Azure Data Factory ForEach is seemingly not running data flow in parallel

In Azure Data Factory I am using a Lookup activity to get a list of files to download, then passing it to a ForEach where a data flow processes each file.
I do not have "Sequential" mode turned on, so I would assume the data flows should run in parallel. However, their runtimes are not equal; there is an almost constant gap between them (the first data flow ran 4 mins, the second 6, the third 8, and so on). It seems as if the second data flow waits for the first one to finish and then uses its cluster to process the file.
Is that intended behavior? I have a TTL set on the cluster, but that did not help much. If it is intended, what is a workaround? I am currently working on creating a list of files first and using that instead of a ForEach, but I am not sure whether I will see an increase in efficiency.
I have not been able to solve the issue of the parallel data flows not executing in parallel; however, I have managed to change the solution in a way that increases performance.
What I had before: a Lookup activity that gets a list of files to process, passed on to a ForEach loop containing a data flow activity.
What I am testing now: a data flow activity that gets the list of files and saves it in a text file in ADLS, then another data flow activity (the one previously inside the ForEach loop) with its source changed to use "List of Files" and pointed at that list.
The result was an increase in efficiency (using the same cluster, 40 files took around 80 mins with ForEach and only 2-3 mins with List of Files); however, debugging is not easy now that everything is in one data flow.
You can overwrite the list-of-files file, or use dynamic expressions and name the file after the pipeline ID or something else.

How to Load Only One Track at a Time in Liquidsoap

I have a MySQL database that stores all my tracks and their associated information. One of the tables in the database is a queue table from which I pull a track for Liquidsoap to play. I provide those tracks to Liquidsoap using request.dynamic.list.
def get_track() =
  # Get the first line of my external process
  result = list.hd(default="", get_process_lines(scripts ^ "get_track.py"))
  print(result)
  # Create and return a request using this result
  [request.create(result)]
end

# Create the source
sourcetrack = request.dynamic.list(id="play_queue", conservative=false, get_track)
The get_track.py script retrieves a record from a queue table in the database.
I noticed that Liquidsoap grabs two tracks when it starts up: two get "accepted" and one is "prepared."
Is there a way to get Liquidsoap to only accept one track at a time and wait to accept the next one only when reaching near the end of the currently playing track?
I also have scheduled programs that get added to the queue table in the database; when this occurs, all tracks are cleared from the queue table and the program is added to it.
Since Liquidsoap appears to have a track already loaded in its queue while playing the "prepared" track, is there a way to remove that loaded track so Liquidsoap will not play it next, but will instead call the get_track.py script again to load a new track from the queue table in the database?
Liquidsoap always prepares a stream's next items in advance; this is a fundamental principle of its scheduler. It allows, for example, starting a download before the downloaded track is due to play. As long as you are using request.dynamic.list, the called script must take this into account. In other words, you cannot rely on clock time alone to decide which track to return.
As far as I understand your use case, you might prefer using a request.queue source and having your script push each request at the right time via the telnet server.
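A minimal sketch of that push side in Perl, assuming the source is redefined as request.queue(id="play_queue") and the Liquidsoap telnet server is enabled on localhost:1234 (the port and track path are placeholders):

use strict;
use warnings;
use IO::Socket::INET;

# Next track to play, e.g. fetched from the MySQL queue table.
my $track = '/music/next_track.mp3';

my $sock = IO::Socket::INET->new(
    PeerAddr => 'localhost',
    PeerPort => 1234,
    Proto    => 'tcp',
) or die "Cannot reach Liquidsoap telnet server: $!";

# A request.queue with id="play_queue" exposes a "play_queue.push URI"
# telnet command; push exactly one track, then disconnect.
print {$sock} "play_queue.push $track\n";
print {$sock} "quit\n";
close $sock;

This way nothing is queued until your own code decides the next track is due.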

Mimic test_start/test_stop events in a distributed-mode worker

In my locustfile I defined test_start and test_stop event listeners to read a file needed for the test and to write detailed statistics to a CSV at the end of the test. When running in distributed mode, these events occur on the master, not the workers. I am assembling a list of detailed stats for each task in a task sequence, and writing them out as a CSV file when the test stops. I found a Stack Overflow question that references setup and teardown methods; I added these to my class User(HttpUser): but they appear not to be executed.
How can I mimic these events when the test is running on a worker in distributed mode?
Is there a better way?
I am already using the User's on_start and on_stop: my on_start calls a function to select a random user from a list that is created when the @events.test_start.add_listener listener fires, which only happens on the master and not on the workers, so the workers don't have any user login data.
It seems counterproductive to open the file, read it, select a user at random, and close it every time the User's on_start method is called. on_start also sets up the iteration list [] where I store the times per task.
When the task sequence is done, meaning the last task has executed, I call self.interrupt(), which runs on_stop; that is where I take the iteration times and put them into a second list, which is later written out using the CSV module. Maybe it would be better to just write the data to the CSV during on_stop.
The setup/teardown methods for individual Users have been removed (they were confusing: they ran on the first instance of that User class, and people who set properties on that instance were very confused that later instances didn't get them). Tbh, I wish they had just been replaced by class methods...
The User still has on_start/on_stop methods though, and if you combine those with a flag you may be able to do what you want. Something like this:
class MyUser(HttpUser):
    stopped = False
    ...
    def on_stop(self):
        if not MyUser.stopped:
            MyUser.stopped = True
            # write your csv
            # this doesn't guarantee that all your Users are finished though.
https://docs.locust.io/en/stable/writing-a-locustfile.html#on-start-and-on-stop-methods

Itemwriter output using same order that itemreader used to read file

We have a Spring Batch job that reads a file (FlatFileItemReader), processes it, and writes data to a queue (JmsItemWriter).
We have another job that reads the queue (JmsItemReader) and writes a file (FlatFileItemWriter). It's an asynchronous process (between the execution of the two jobs, some manual steps must be performed).
The flat file content doesn't have a line identifier, and we use a multi-threaded approach when reading the file ("throttle-limit"). So the queued messages do not maintain the order they had in the flat file.
The problem is that we should generate an output file respecting the original order: line 33 of the incoming file should become line 33 of the outgoing file (with the contents of the original line, plus some data).
Does Spring Batch natively provide a way to order the output, respecting the original read order? I say "natively" because one solution we thought of is to create an additional step just to add a line number to the file and use it at the end, but I was wondering if this reinvents the wheel...
We are using SB 3.0.3
TIA,
Bob
The use case you are describing requires maintaining order across multiple jobs, which is not supported. In theory (though not guaranteed), a single, single-threaded step would retain the order of the input file.
Since you are reading in a multithreaded manner, there really isn't a good way to guarantee the order of the items as they are read. The best you could do is synchronize the read method and add an id to each item as it is read. If the bottleneck you're attempting to address with multithreading is in the processor or writer, this may not be a bad option.

What are some options for keeping track of temporary results and re-use them after restart, in case the program dies while running?

(Suggestions for improving the title of this question are welcomed.)
I have a Perl script that uses web APIs to fetch a user's "liked" posts on various sites (Tumblr, Reddit, etc.), then downloads some portion of each post (for example, an image that's linked from the post).
Right now, I have a JSON-encoded file that keeps track of the posts that have already been fetched (for Tumblr it just records the total number of likes; for Reddit it records the "id" of the last post fetched) so that the script can pick up with the newly "liked" items the next time it runs. This means that after the program finishes archiving a new batch of links, the new "stopping point" is recorded in the JSON file.
However, if the program croaks for some reason (or is killed with Ctrl+C, say), the progress is not recorded, since progress is only recorded at the end of the fetch. So the next time the program runs, it looks in the tracking file, gets the last recorded stopping point (the last time it successfully completed fetching and recorded its progress), and picks up there again, downloading duplicates up to the point where it croaked the last time.
My question is: what's the best (i.e. simplest, most efficient, take your pick; I'm open to options here) way to record progress with each incrementally archived item, so that if the program dies for some reason it always knows exactly where to pick up where it left off? Adapting the current method (literally print-ing to the tracking file at the end of each fetch) to do the same thing after each individual item is definitely not the best solution, because it's got to be pretty inefficient.
Edited for clarity
Let me make it clearer that the file used to track the downloaded posts is not large and does not grow appreciably with each "fetch" operation. There is only one element for each API (Tumblr, etc.), containing either the total number of likes for the account (in other words, the number we have already downloaded; we query the API for the current total, subtract the number in the file, and know how many new items to fetch) or the ID of the last item fetched (Reddit uses this, so we can ask the API for all items "after" the one in the file and get only the new stuff).
My problem is not an ever-growing list of fetched posts; rather, it is writing to the tracking file every time a single post is downloaded (and there could be thousands of posts downloaded in a single run).
Some ideas I would consider:
Write to the file more often, or install an interrupt handler to "safely" handle the interrupt signal: when it fires, let the script write the tracking file so it's as current as possible, then quit gracefully (see the first sketch after this list).
Use a better storage mechanism than a flat file. Depending on the need, I would consider using a database to store the IDs. I groan when a database comes into play because of the complexity it adds, but it doesn't have to be complex. I've used SQLite for queuing, but also consider DBD::CSV, which writes to a plain CSV file while allowing SQL syntax (I haven't used it myself). In your code you could then check whether an id is already in the database and know to skip it (see the second sketch below). I would imagine SQLite is also more "efficient" than reading and rewriting a flat file and, imo, easier to code than writing the file handling yourself.
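A minimal sketch of the interrupt-handler idea, where save_progress() and the %progress structure are hypothetical stand-ins for however the script actually records its stopping points:

use strict;
use warnings;
use JSON::PP qw(encode_json);   # core module, no extra install needed

my %progress = (tumblr => 0, reddit => undef);   # placeholder stopping points

sub save_progress {
    open my $out, '>', 'progress.json' or die "Cannot write progress: $!";
    print {$out} encode_json(\%progress);
    close $out;
}

# On Ctrl+C, flush the current progress before exiting.
$SIG{INT} = sub { save_progress(); exit 1 };

# ... main fetch loop: update %progress after each archived item ...
save_progress();   # normal-exit path

And a minimal sketch of the database check, using DBI with DBD::SQLite (the table name "fetched" is a placeholder):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=progress.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS fetched (id TEXT PRIMARY KEY)');

sub already_fetched {
    my ($id) = @_;
    my ($seen) = $dbh->selectrow_array(
        'SELECT 1 FROM fetched WHERE id = ?', undef, $id);
    return $seen;
}

sub mark_fetched {
    my ($id) = @_;
    # One cheap INSERT per item replaces rewriting the whole flat file.
    $dbh->do('INSERT OR IGNORE INTO fetched (id) VALUES (?)', undef, $id);
}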
I'd just use a hash, tied to an NDBM file, to keep track of what is loaded and what isn't.
When you start a new batch of URLs, you delete the NDBM file.
Then, in your code, at the start of the program, you do
use Fcntl; use NDBM_File;   # imports needed for the O_* flags and the tie
tie(%visited, 'NDBM_File', 'visitedurls', O_RDWR|O_CREAT, 0666);
(don't worry about the O_CREAT; the file will remain intact if it already exists, unless you also pass O_TRUNC)
Assuming your main loop looks like this:
while ($id=<INFILE>) {
    my $url = id_to_url($id);
    my $results = fetch($url);
    save_results($url, $results);
}
you change that to
while ($id=<INFILE>) {
    my $url = id_to_url($id);
    my $results;
    if ($visited{$url}) {
        $results = $visited{$url};
    } else {
        $results = fetch($url);
        $visited{$url} = $results;
    }
    save_results($url, $results);
}
So whenever you fetch a new URL, you write the results to the NDBM file, and whenever you restart your program, anything already fetched is found in the NDBM file and taken from there instead of hitting the URL again.
This assumes $results is a scalar; otherwise you won't be able to store/retrieve it this way. But since you're producing JSON anyway, the "partial JSON" for each URL is probably what you want to store.