Perl scripts, to use forks or threads? - perl

I am writing a couple fo scripts that go and collect data from a number of servers, the number will grow and im trynig to future proof my scripts, but im a little stuck.
so to start off with I have a script that looks up an IP in a mysql database and then connects to each server grabs some information and then puts it into the database again.
What i have been thinknig is there is a limited amount of time to do this and if i have 100 servers it will take a little bit of time to go out to each server get the information and then push it to a db. So I have thought about either using forks or threads in perl?
Which would be the prefered option in my situation? And hs anyone got any examples?
Thanks!
Edit: Ok so a bit more inforamtion needed: Im running on Linux, and what I thought was i could get the master script to collect the db information, then send off each sub process / task to connect and gather information then push teh information back to the db.

Which is best depends a lot on your needs; but for what it's worth here's my experience:
Last time I used perl's threads, I found it was actually slower and more problematic for me than forking, because:
Threads copied all data anyway, as a thread would, but did it all upfront
Threads didn't always clean up complex resources on exit; causing a slow memory leak that wasn't acceptable in what was intended to be a server
Several modules didn't handle threads cleanly, including the database module I was using which got seriously confused.
One trap to watch for is the "forks" library, which emulates "threads" but uses real forking. The problem I faced here was many of the behaviours it emulated were exactly what I was trying to get away from. I ended up using a classic old-school "fork" and using sockets to communicate where needed.
Issues with forks (the library, not the fork command):
Still confused the database system
Shared variables still very limited
Overrode the 'fork' command, resulting in unexpected behaviour elsewhere in the software

Forking is more "resource safe" (think database modules and so on) than threading, so you might want to end up on that road.
Depending on your platform of choice, on the other hand, you might want to avoid fork()-ing in Perl. Quote from perlfork(1):
Perl provides a fork() keyword that
corresponds to the Unix system call of
the same name. On most Unix-like
platforms where the fork() system call
is available, Perl's fork() simply
calls it.
On some platforms such as Windows
where the fork() system call is not
available, Perl can be built to
emulate fork() at the interpreter
level. While the emulation is
designed to be as compatible as
possible with the real fork() at the
level of the Perl program, there are
certain important differences that
stem from the fact that all the pseudo
child "processes" created this way
live in the same real process as far
as the operating system is concerned.

Related

Powershell memory usage - expensive?

I am new to powershell but has written up a few scripts running on a windows2003 server. It's definitely more powerful than cmd scripting (maybe due to me having a programming background). However, when I delve further, I noticed that:
Each script launched will run under 1 powershell process, ie.
you see a new powershell process for each script.
the scripts I tested for memory are really simple, say, build a
string or query an environment variable, then Start-Sleep for 60
sec, So nothing needy (as to memory usage). But each process takes
around >30MB. Call me stingy, but as there are memory-intensive
applications scheduled to run everyday, and if I need to schedule a
few powershell scripts to run regularly and maybe some script
running continuously as a service, I'd certainly try to keep memory
consumption as low as possible. <-- This is because we recently
experienced a large application failure due to lack of memory.
I have not touched on C# yet, but would anyone reckon that it sometimes may be better to write the task in C#?
Meanwhile, I've seen posts regarding memory leak in powershell. Am I right to think that the memory created by the script will be withing the process space of powershell, so that when the script terminates hence powershell terminates, the memory created get cleared?
My PowerShell.exe 2.0 by itself (not running a script) is ~30MB on XP. This shouldn't worry you much with the average memory per machine these days. Regarding memory leaks, there have been cases where people use 3rd party libraries that have memory leaks when objects arn't properly disposed of. To address those you have to manually invoke the garbage collectorusing [gc]::Collect(), but this is rare. Other times i've seen people use Get-Content to read a very large file and assign it to a variable before using it. This will take alot of memory as well. In that case you can use the pipeline to read the file portions at a time to reduce your memory footprint.
1 - Yes, a new process is created. The same is true when running a cmd script, vb script, or C# compiled executable.
2 - Loading the powershell host and runtime will take some non-trivial amount of memory, which will vary from system to system and version to version. It will generally be a heavier-weight process than a cmd shell or a dedicated C# exe. For those MB, you are getting the rich runtime and library support that makes Powershell so powerful.
General comments:
The OS allocates memory per-process. Once a process terminates, all of its memory is reclaimed. This is the general design of any modern OS, and is not specific to Powershell or even Windows.
If your team is running business-critical applications on hardware such that a handful of 30MB processes can cause a catastrophic failure, you have bigger problems. Opening a browser and going to Facebook will eat more memory than that.
In the time it takes you to figure out some arcane batch script solution, you could probably create a better solution in Powershell, and your company could afford new dedicated hardware with the savings in billable hours :-)
You should use the tool which is most appropriate for the job. Powershell is often the right tool, but not always. It's great for automating administrative tasks in a Windows environment (file processing, working with AD, scheduled tasks, setting permissions, etc, etc). It's less great for high-performance, heavily algorithmic tasks, or for complex coding against raw .NET APIs. For these tasks, C# would make more sense.
Powershell has huge backing/support from Microsoft (and a big user community!), and it's been made very clear that it is the preferred scripting environment for Windows going forward. All new server-side tech for Windows has powershell support. If you are working in admin/IT, it would be a wise investment to build up some skills in Powershell. I would never discourage someone from learning C#, but if your role is more IT than dev then Powershell will be the right tool much more often, and your colleagues are more likely to also understand it.
Powershell requires (much) more resources (RAM) than cmd so if all you need is something quick and simple, it makes more sense to use cmd.
CMD uses native Win32 calls and Powershell uses the .Net framework. Powershell takes longer to load, and can consume a lot more RAM than CMD.
"I monitored a Powershell session executing Get-ChildItem. It grew to
2.5GB (all of it private memory) after a few minutes and was no way nearly finished. CMD “dir /o-d” with a small scrollback buffer
finished in about 2 minutes, and never took more than 300MB of
memory."
https://qr.ae/pGmwoe

Non-blocking / Asynchronous Execution in Perl

Is there a way to implement non-blocking / asynchronous execution (without fork()'ing) in Perl?
I used to be a Python developer for many years... Python has really great 'Twisted' framework that allows to do so (using DEFERREDs. When I ran search to see if there is anything in Perl to do the same, I came across POE framework - which seemed "close" enough to what I was searching for. But... after spending some time reading the documentation and "playing" with the code, I came against "the wall" - which is following limitation (from POE::Session documentation):
Callbacks are not preemptive. As long as one is running, no others will be dispatched. This is known as cooperative multitasking. Each session must cooperate by returning to the central dispatching kernel.
This limitation essentially defeats the purpose of asynchronous/parallel/non-blocking execution - by restricting to only one callback (block of code) executing at any given moment. No other callback can start running while another is already running!
So... is there any way in Perl to implement multi-tasking (parallel, non-blocking, asynchronous execution of code) without fork()'ing - similar to DEFERREDs in Python?
Coro is a mix between POE and threads. From reading its CPAN documentation, I think that IO::Async does real asynchronous execution. threads can be used too - at least Padre IDE successfully uses them.
I'm not very familiar with Twisted or POE, but basic parallel execution is pretty simple with threads. Interpreters are generally not compiled with threading support, so you would need to check for that. The forks package is a drop-in replacement for threading (implements the full API) but using processes seamlessly. Then you can do stuff like this:
my $thread = async {
print "you can pass a block of code as an arg unlike Python :p";
return some_func();
};
my $result = $thread->join();
I've definitely implemented callbacks from an event loop in an async process using forks and I don't see why it wouldn't work with threads.
Twisted also uses cooperative multi-tasking just like POE & Coro.
However it looks like Twisted Deferred does (or can) make use of threads. NB. See this answer from the SO question Twisted: Making code non-blocking
So you would need to go the same route with POE (though using fork is probably preferable).
So one POE solution would be to use: POE::Wheel::Run - portably run blocking code and programs in subprocesses.
For alternatives to POE take a look at AnyEvent and Reflex.
I believe you use select for that kind of thing. More similarly to forking, there's threading.
POE is fine if you want asynchronous processing but using only a single cpu (core) is fine.
For example if the app is I/O limited a single process will be enough most of the time.
No other callback can start running while another is already running!
As far as I can tell - this is the same with all languages (per CPU thread of course; modern web servers usually spawn a least one process or thread per CPU core, so it will look (to users) like stuff it working in parallel, but the long-running callback didn't get interrupted, some other core just did that work).
You can't interrupt an interrupt, unless the interrupted interrupt has been programmed specifically to accommodate it.
Imagine code that takes 1min to run, and a PC with 16 cores - now imagine a million people try to load that page, you can deliver working results to 16 people, and "time out" all the rest, or, you can crash your web server and give no results to anyone. Folks choose not to crash their web server, which is why they never permit callbacks to interrupt other callbacks (not that they could even if they tried - the caller never gets control back to make a new call before the prior one has ended anyhow...)

Recommended communication pattern for web frontend of command line app

I have a perl app which processes text files from the local filesystem (think about it as an overly-complicated grep).
I want to design a webapp which allows remote users to invoke the perl app by setting the required parameters.
Once it's running it would be desirable some sort of communication between the perl app and the webapp about the status of the process (running, % done, finished).
Which would be a recommended way of communication between the two processes? I was thinking in a database table, but I'm not really sure it's a good idea.
any suggestions are appreciated.
Stackers, go ahead and edit this answer to add code examples or links to them.
DrNoone, two approaches come to mind.
callback
Your greppy app needs to offer a callback function that returns the status and which is periodically called by the Web app.
event
This makes sense if you are already using a Web server/app framework which exposes an event loop usable from external applications (rather unlikely in Perl land). The greppy app fires events on status changes and the Web app attaches/listens to them and acts accordingly.
For IPC as you envision it, a plain database is not so suitable. Look into message queues instead. For great interop, pick AMPQ compliant implementation.
If you run the process using open($handle, "cmd |") you can read the results in real time and print them straight to STDOUT while your response is open. That's probably the simplest approach.

Which logging module to use under Perl's AnyEvent?

I am using the wonderful AnyEvent for creating an asynchronous TCP server (specifically, a MUD server).
In order to keep everything running smoothly and with as few blocking/synchronous pieces of code possible, I have replaced some modules I was using with their asynchronous counterpart, for example AnyEvent::Memcached and AnyEvent::Gearman. This allows the main program to be quite speedy, which is desirable. I have coded around the need for some of these calls to be synchronous.
One problem I currently have, and the focus of this question, is logging.
Before turning to AnyEvent for this server program, I was using Log::Log4perl as it allows me to fine-tune which modules or subroutines should be logged, at which level and to which log output (screen, file, etc).
The problem here is that the Log4perl actions (warn, info, etc) are currently performed synchronously but I have no requirement for that as long as the log lines eventually end up on the screen / file (and in the correct order).
Is Log::Log4perl still the right choice when using an asynchronous event handler such as AnyEvent, or should I look at a different module? If so, which is recommended?
AnyEvent::Log, which comes with AnyEvent, uses AnyEvent::IO, which appends to files asynchronously when IO::AIO is available (and synchronously when not).
What you are trying to avoid? If it's synchronous file IO (writing to log files/stdout etc.) then your problem would probably be solved with an asynchronous and/or buffering appender(s) rather than replacing all use of Log4perl in your code.
Log::Log4perl::Appender::Buffer seems like it might be a good start, but a completely async appender doesn't appear to exist anymore.

How can I fork a Perl CGI program to hive off long-running tasks?

I am writing a Bulk Mail scheduler controlled from a Perl/CGI Application and would like to learn abut "good" ways to fork a CGI program to run a separate task? Should one do it at all? Or is it better to suffer the overhead of running a separate job-queue engine like Gearman or TheSchwartz as has been suggested recently. Does the answer/perspective change when using an near-MVC framework like CGI::Application over vanilla CGI.pm? The last comes from a possible project that I have in mind for a CGI::Application Plugin - that would make "forking" a process relatively simple to call.
Look at Proc::Daemon - it's the simplest thing that works. From your CGI script, do the CGI business (getting input, returning a response to the browser), then call Proc::Daemon::init() which does the fork, daemonizes your process and makes the parent exit. Then your script (now a daemon) does its long-running tasks and exits when they're done.
You'll want to update something (file, database record) while running as a daemon, so subsequent CGI invocations can check what it did (or how it's progressing).
Would something like POE be useful? It's more event-driven than forked, but it may meet your needs.