Programmable arguments in Perl pipes

I'm gradually working my way up the Perl learning curve (with thanks to contributors to this REALLY helpful site), but am struggling with how to approach this particular issue.
I'm building a Perl utility which utilises three third-party (C++) programmes. Normally these are run as: A $file_list | B -args | C $file_out
where process A reads multiple files, process B modifies each individual file, and process C collects all input files from the pipe and produces a single output file, with a null input file signifying the end of the input stream.
The input files are large(ish) at around 100 MB each and around 10 in number. The processes are CPU intensive, and the whole sequence needs to be applied to thousands of groups of files each day, so the simple solution of reading and writing intermediate files to disk is simply too inefficient. In addition, the process above is only part of a processing sequence, where the input files are already in memory and the output file also needs to be in memory for further processing.
There are a number of solutions to this already well documented and I have a prototype version utilising IPC::Open3(). So far, so good. :)
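(For illustration, a minimal IPC::Open3 setup for one stage of such a pipeline might look like the sketch below; the program name 'B' and its flag are placeholders. A real version handling ~100 MB payloads would interleave reads and writes, e.g. with IO::Select, rather than write everything before reading, to avoid pipe-buffer deadlock.)

use strict;
use warnings;
use IPC::Open3;
use Symbol 'gensym';

my $file_data = "...";   # one in-memory input file (placeholder)

my $err = gensym;        # open3 will not autovivify the error handle
my $pid = open3(my $to_b, my $from_b, $err, 'B', '-args');

print {$to_b} $file_data;   # feed the file to B's stdin
close $to_b;                # EOF lets B finish its work

my $output = do { local $/; <$from_b> };   # slurp B's stdout
waitpid $pid, 0;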
However - when piping each file from process A through process B, I need to modify the arguments to process B for each input file without interrupting the forward flow to process C. This is where I come unstuck and am looking for some suggestions.
As further background:
Running on Ubuntu 16.04 LTS (currently within VirtualBox) and Perl v5.22.1.
The programme will be run on (and within) a single machine by one user (me!), i.e. no external network communication and no multi-user or public requirement - so simplicity of programming is preferred over strong security.
Since the process must run repeatedly without interruption, robust/reliable I/O handling is required.
I have access to the source code of each process, so that could be modified (although I'd prefer not to).
My apologies for the lack of "code to date", but I thought the question is more one of "How do I approach this?" rather than "How do I get my code to work?".
Any pointers or help would be very much appreciated.

You need a fourth program (call it D) that determines what the arguments to B should be and executes B with those arguments and with D's stdin and stdout connected to B's stdin and stdout. You can then replace B with D in your pipeline.
What language you use for D is up to you.
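For example, a minimal sketch of D in Perl might look like this (the control file /tmp/b_args and the name 'B' are assumptions for illustration; the key point is that exec replaces D with B, so B inherits D's stdin and stdout and the A | D | C flow is never interrupted):

#!/usr/bin/perl
use strict;
use warnings;

# Decide B's arguments; here they are read from a hypothetical control file.
open my $fh, '<', '/tmp/b_args' or die "no control file: $!";
chomp(my $args_line = <$fh>);
close $fh;
my @b_args = split ' ', $args_line;

# Replace this process with B; stdin/stdout stay connected to the pipeline.
exec 'B', @b_args or die "cannot exec B: $!";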

If you're looking to feed output from different programs into the pipes, I'd suggest what you want to look at is ... well, pipe.
This lets you set up a pipe that works much like the ones you get from IPC::Open3, but gives you a bit more control over what you read from and write to it.
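A minimal sketch of the built-in pipe plus fork (the file name written into the pipe is just an illustration):

use strict;
use warnings;

pipe(my $reader, my $writer) or die "pipe failed: $!";

my $pid = fork() // die "fork failed: $!";
if ($pid == 0) {                  # child: read end
    close $writer;
    while (my $line = <$reader>) {
        print "child got: $line";
    }
    exit 0;
}
close $reader;                    # parent: write end
print {$writer} "file-1.dat\n";   # e.g. name the next file to process
close $writer;                    # EOF so the child's read loop ends
waitpid $pid, 0;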

Related

Is it possible to send commands between two MATLAB windows open on the same computer?

I want to have two MATLAB windows open on the same computer. The desired scenario is as follows: MATLAB window 1 is continuously running a script that has nothing to do with MATLAB window 2. At the same time, MATLAB window 2 is running a script that continuously checks for a certain condition, and if it is met, then it will terminate the script running on MATLAB window 1, and then terminate its own script as well. I want to have two MATLAB windows instead of one since I believe it will be more time efficient for what I am trying to do. I found an interesting "KeyInject" program at http://au.mathworks.com/matlabcentral/fileexchange/40001-keyinject , but I was wondering if there is a simpler way already built into MATLAB.
Do you want simple, or a flexible, infinitely expandable version 1.0? Simple would be to trigger System A via a file created by System B.
Simple would have System B create a file, then System A would check for the file with the command
if exist ( fileName, 'file' )
then do your shutdown commands. On startup, System A would delete the file with
delete ( fileName );
The second option is to use the udp command. UDP allows any data to be sent between processes, whether on the same computer or over a network. (See https://www.mathworks.com/help/instrument/udp.html for more info).
I see several ways:
Restructure to avoid this XY problem
Use (mat) files (as Hoki suggested), possibly using the Parallel Computing Toolbox to keep everything in one MATLAB session.
Write some MEX functions that communicate with each other via a global pipe.
Write an Auto(Hot)key script.
Option 2 is probably the easiest. Take a look at events and listeners if you write OOP code; otherwise, you'd have to poll inside a loop.
Option 3 is harder and way more time consuming to implement, but allows for much faster detection of the condition, and much faster data transfer between the sessions. Use only if speed is essential...but I guess that doesn't apply :)
Option 4: the AutoHotkey solution is probably the most Horrible Thing® you could do on an already Horrible Construction®, but oh what fun!! In both MATLAB sessions, you create a (hidden) figure with the name Window1 or Window2, respectively. These window names are something that AutoHotkey can easily track. If the conditions are met, you update the corresponding window name, triggering the remainder of the AutoHotkey script: press a button in the other window! If you need to transfer data between the windows: you can create basic edit boxes in both GUIs, and copy-paste the data between them. If you're on Linux: you can use Autokey for the same purpose, but by then you're basically writing Python code doing the heavy lifting, so just use Python.
Or, you know, use KeyInject. Less fun.

Executing system commands safely while coding in Perl

Should one really use external commands while coding in Perl? I see several disadvantages to it: it's not system-independent, plus there might be security risks. What do you think? If there is no way around it and you have to use shell commands from Perl, then what is the safest way to execute a particular command (like checking the pid, uid, etc.)?
It depends on how hard it is going to be to replicate the functionality in Perl. If I needed to run the m4 macro processor on something, I'd not think of trying to replicate that functionality in Perl myself, and since there's no module on http://search.cpan.org/ that looks suitable, it would appear others agree with me. In that case, then, using the external program is sensible. On the other hand, if I needed to read the contents of a directory, then the combination of readdir() et al plus stat() or lstat() inside Perl is more sensible than futzing with the output of ls.
If you need to execute commands, think very carefully about how you invoke them. In particular, you probably want to avoid the shell interpreting the arguments, so use the array form of system (see also exec), etc, rather than a single string for the command plus arguments (which means the shell is used to process the command line).
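For instance (a sketch; $pattern and $file stand in for whatever your script computes):

use strict;
use warnings;

my ($pattern, $file) = @ARGV;

# Risky: the shell parses this string, so metacharacters in $file are live.
# system("grep -F $pattern $file");

# Safer: the list form bypasses the shell entirely.
system('grep', '-F', $pattern, $file) == 0
    or die "grep failed with status $?\n";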
Executing external commands can be expensive simply because it involves forking a new process and watching for its output if you need it.
Probably more importantly, should the external process fail for any reason, it may be difficult to understand what happened from within your script. Worse still, surprisingly often an external process can get stuck forever, and then so will your script. You can use special tricks like opening a pipe and watching for output in a loop, but this is itself error-prone.
Perl is very capable of doing many things. So, if you stick to using only Perl's native constructs and modules to accomplish your tasks, not only will it be faster because you never fork, but it will also be more reliable and easier to catch errors by looking at the native Perl objects and structures returned by library routines. And of course, it will be automatically portable to different platforms.
If your script runs under elevated permissions (like root or under sudo), you should be very careful as to what external programs you execute. One of the simple ways to ensure basic security is to always specify commands by full name, like /usr/bin/grep (but still think twice, and just do the grep in Perl itself!). However, even this may not be enough if an attacker is using the LD_PRELOAD mechanism to inject rogue shared libraries.
If you are willing to go very secure, it is suggested to enable taint checking with the -T flag, like this:
#!/usr/bin/perl -T
The taint flag will also be enabled by Perl automatically if your script is determined to have different real and effective user or group IDs.
Taint mode will severely limit your ability to do many things (like the system() call) without Perl complaining - see more at http://perldoc.perl.org/perlsec.html#Taint-mode - but it will give you much higher confidence in your security.
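A small sketch of what taint mode forces you to do (the filename whitelist pattern here is purely illustrative):

#!/usr/bin/perl -T
use strict;
use warnings;

$ENV{PATH} = '/usr/bin:/bin';   # taint mode refuses to run subprocesses with an unsafe PATH

my $file = shift // die "usage: $0 filename\n";

# Untaint by extracting only explicitly allowed characters.
my ($safe) = $file =~ /\A([\w.\-]+)\z/
    or die "unsafe filename: $file\n";

system('/usr/bin/grep', '-c', 'pattern', $safe) == 0
    or die "grep failed: $?\n";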
Should one really use external commands while coding in Perl?
There's no single answer to this question. It all depends on what you are doing within the wide range of potential uses of Perl.
Are you using Perl as a glorified shell script on your local machine, or just trying to find a quick-and-dirty solution to your problem? In that case, it makes a lot of sense to run system commands if that is the easiest way to accomplish your task. Security and speed are not that important; what matters is the ability to code quickly.
On the other hand, are you writing a production program? In that case, you want secure, portable, efficient code. It is often preferable to write the functionality in Perl (or use a module), rather than calling an external program. At least, you should think hard about the benefits and drawbacks.

"All programs are interpreted". How?

A computer scientist will correctly explain that all programs are
interpreted and that the only question is at what level. --perlfaq
How are all programs interpreted?
A Perl program is a text file read by the perl program which causes the perl program to follow a sequence of actions.
A Java program is a text file which has been converted into a series of byte codes which are then interpreted by the java program to follow a sequence of actions.
A C program is a text file which is converted via the C compiler into an assembly program which is converted into machine code by the assembler. The machine code is loaded into memory which causes the CPU to follow a sequence of actions.
The CPU is a jumble of transistors, resistors, and other electrical bits which is laid out by hardware engineers so that when electrical impulses are applied, it will follow a sequence of actions as governed by the laws of physics.
Physicists are currently working out what makes those rules and how they are interpreted.
Essentially, every computer program is interpreted by something else which converts it into something else which eventually gets translated into how the electrons in your local neighborhood fly around.
EDIT/ADDED: I know the above is a bit tongue-in-cheek, so let me add a slightly less goofy addition:
Interpreted languages are where you can go from a text file to something running on your computer in one simple step.
Compiled languages are where you have to take an extra step in the middle to convert the language text into machine- or byte-code.
The latter can easily be converted into the former by a simple transformation:
Make a program called interpreted-c, which takes one or more C files and runs the resulting program (one which doesn't take any arguments):
#!/bin/sh
# Compile the given C sources into a throwaway executable, run it, then clean up.
MYEXEC=/tmp/myexec.$$
gcc -o "$MYEXEC" ${1+"$@"} && "$MYEXEC"
rm -f "$MYEXEC"
Now which definition does your C program fall into? Compare & contrast:
$ perl foo.pl
$ interpreted-c foo.c
Machine code is interpreted by the processor at runtime: the same machine code supplied to processors of a given architecture (x86, PowerPC, etc.) should theoretically work the same regardless of the specific model's 'internal wiring'.
EDIT:
I forgot to mention that an architecture may add new instructions for things like accessing new registers, in which case code written to use them won't work on older processors in the range. Much like when you use an old version of a library and then try to use capabilities only found in newer versions.
Example: many Linux distros are released as 686-only, despite the fact that 686 is in the 'x86 family'. This is due to the use of new instructions.
My first thought was to look inside the CPU - see below - but that's not right. The answer is much, much simpler than that.
A high-level description of a CPU is:
1. execute the current op
2. grab the next op
3. goto 1
Compare it to Perl's interpreter:
while ((PL_op = op = op->op_ppaddr(aTHX))) {
}
(Yeah, that's the whole thing.)
There can be no doubt that the CPU is an interpreter.
It just goes to show how useless it is to classify something as interpreted or not.
Original answer:
Even at the CPU level, programs get rewritten into simpler instructions to allow the CPU to execute them more quickly. This is done by changing the order in which they are executed and executing them in parallel. For example, see Intel's Hyper-Threading.
Even deeper, each instruction is considered a program of its own, one that routes electronic signals. See microcode.
The levels of interpretation are really easy to explain:
2: Runtime languages (CLR, Java runtime, ...) and scripting languages (Python, Ruby, ...)
1: Assembly
0: Binary code
Edit: I moved scripting languages up to the same level as runtime languages. Thanks for the hint. :-)
I can write a Game Boy interpreter that works similarly to how the Java Virtual Machine works, treating the Z80 machine instructions as byte code. Assuming the original was written in C[1], does that mean C suddenly became an interpreted language just because I used it like one?
From another angle, gcc can compile C into machine code for a number of different processors. There's no reason the target machine has to be the same as the machine you're compiling on. In fact, this is a common way to compile C code for AVRs and other microcontrollers.
As a matter of abstraction, the compiler's job is to translate flat text into a structure, then translate that structure into something that can be executed somewhere. Whatever is doing the execution may have its own levels of breaking out the structure before really executing it.
A lot of power becomes available once you start thinking along these lines.
A good book on this is Structure and Interpretation of Computer Programs. Even if you only get through the first chapter (or half of the first chapter), I think you'll learn a lot.
[1] I think most Game Boy stuff was hand-coded ASM, but the principle remains.

How can I control an interactive Unix application programmatically through Perl?

I have inherited a 20-year-old interactive command-line unix application that is no longer supported by its vendor. We need to automate some tasks in this application.
The most troublesome of these is creating thousands of new records with slightly different parameters (e.g. different identifiers, different names). The records have to be created in sequence, one at a time, which would take many months (and therefore dollars) to do manually. In most cases, creating a record has a very predictable pattern of keying in commands, reading responses, keying in further commands, etc. However, some record creation operations will result in error conditions ('record with this identifier already exists') that require a different set of commands to exit gracefully.
I can see a few different ways to do this:
Named pipes. Write a Perl script that runs the target application with STDIN and STDOUT set to named pipes then sends the target application the sequence of commands to create a record with the required parameters, and then instructs the target application to exit and shut down. We then run the script as many times as required with different parameters.
Application. Find another Unix tool that can be used to script interactive programs. The only ones I have been able to find, though, are expect, which does not seem to be maintained; and chat, which I recall from ages ago and which seems to do more-or-less what I want, but appears to be only for controlling modems.
One more potential complication: I think the target application was written for a VT100 terminal and it uses some sort of escape sequences to do things like provide highlighting.
My question is what approach should I take? One of these, or something completely different? I quite like the idea of using named pipes and then having a Perl script that opens the FIFOs and reads and writes as required, as it provides a lot of flexibility, but from what I have read it seems like there's a lot of potential problems if I go down this path.
Thanks in advance.
I'd definitely stick to Perl for the extra flexibility, as chaos suggested. Are you aware of the Expect perl module? It's a lot nicer than the named pipe approach.
Note also that with named pipes, you can't force the output coming back from your legacy application to be unbuffered, which could be annoying. I think Expect.pm uses pseudo-ttys to get around this problem, but I'm not sure. See the discussion in perlipc in the section "Bidirectional Communication with Another Process" for more details.
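A rough sketch with Expect.pm (the program name, prompts, and recovery command are hypothetical; a real script would match your application's actual prompts):

use strict;
use warnings;
use Expect;

my $exp = Expect->spawn('legacy-app')
    or die "cannot spawn legacy-app: $!\n";

$exp->expect(30,
    [ qr/record ID:/ => sub {
        my ($self) = @_;
        $self->send("ID-0001\n");
        exp_continue;
    } ],
    [ qr/already exists/ => sub {
        my ($self) = @_;
        $self->send("cancel\n");   # take the error-recovery path
        exp_continue;
    } ],
    [ qr/main menu/ => sub { } ],  # back at the menu: stop matching
);

$exp->send("quit\n");
$exp->soft_close();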
expect is a lot more solid than you're probably giving it credit for, but if I were you I'd still go with the Perl option, wanting to have a full and familiar programming language for managing the process and having confidence that whatever weird issues arise, there will be ways of addressing them.
Expect, either with the Tcl or Perl implementations, would be my first attempt. If you are seeing odd sequences in the output because it's doing odd terminal things, just filter those from the output before you do your matching.
With named pipes, you're going to end up reinventing Expect anyway.

What is best module for parallel processing in Perl?

What is the best module for parallel processing in Perl? I have never done parallel processing in Perl.
What is a good Perl module for parallel processing that will be used for DB access and mailing?
I have looked at the Parallel::ForkManager module. Any ideas appreciated.
Well, it all depends on the particular case. Depending on what exactly you need to achieve, you might be better off with one of the following:
Parallel::TaskManager
POE
AnyEvent
For some tasks there are also specialized modules - for example, LWP::Parallel::UserAgent. Which basically means: you have to give us much more detail about what you want to achieve to get the best possible answer.
Parallel::ForkManager, as the POD says, can limit the number of processes forked off. You could then use the children to do any work. I remember using the module for a downloader.
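A minimal sketch of that pattern (the URL list and the download step are placeholders):

use strict;
use warnings;
use Parallel::ForkManager;

my @urls = ('http://example.com/a', 'http://example.com/b');

my $pm = Parallel::ForkManager->new(4);   # at most 4 children at once

for my $url (@urls) {
    $pm->start and next;   # true in the parent, so it moves on to the next URL
    # ... child downloads $url here ...
    $pm->finish;           # child exits
}
$pm->wait_all_children;    # parent waits for all downloads to complete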
The Many-Core Engine for Perl (MCE) has been posted on CPAN.
http://code.google.com/p/many-core-engine-perl/
https://metacpan.org/module/MCE
MCE comes with various examples showing real-world use-case scenarios for parallelizing something as small as cat (try with -n), grepping for patterns, and word-count aggregation.
barrier_sync.pl
    A barrier sync demonstration.
cat.pl
    Concatenation script, similar to the cat binary.
egrep.pl
    Egrep script, similar to the egrep binary.
wc.pl
    Word count script, similar to the wc binary.
findnull.pl
    A parallel-driven script to report lines containing null fields. It is many times faster than the binary egrep command. Try it against a large file containing very long lines.
flow_model.pl
    Demonstrates MCE::Flow, MCE::Queue, and MCE->gather.
foreach.pl, forseq.pl, forchunk.pl
    These take the same sqrt example from Parallel::Loops and measure the overhead of the engine. The number indicates the size of @input which can be submitted and results displayed in 1 second.
    Parallel::Loops:     600   Forking each @input is expensive
    MCE foreach....:  34,000   Sends result after each @input
    MCE forseq.....:  70,000   Loops through a sequence of numbers
    MCE forchunk...: 480,000   Chunking reduces overhead
interval.pl
    Demonstration of the interval option appearing in MCE 1.5.
matmult/matmult_base.pl, matmult_mce.pl, strassen_mce.pl
    Various matrix multiplication demonstrations benchmarking PDL, PDL + MCE, as well as a parallelized Strassen divide-and-conquer algorithm. Also included are 2 plain Perl examples.
scaling_pings.pl
    Performs a ping test and reports failing IPs to standard output.
seq_demo.pl
    A demonstration of the new sequence option appearing in MCE 1.3. Run with: seq_demo.pl | sort
tbray/wf_mce1.pl, wf_mce2.pl, wf_mce3.pl
    An implementation of wide finder utilizing MCE. As fast as MMAP IO when the file resides in the OS file-system cache, and 2x ~ 3x faster when reading directly from disk.
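To give a feel for the API, here is a minimal sketch using the MCE::Loop interface (the worker count is arbitrary, and gathered results may arrive out of order):

use strict;
use warnings;
use MCE::Loop;

MCE::Loop::init { max_workers => 4, chunk_size => 1 };

# Square the numbers 1..10 across the workers and gather the results.
my @squares = mce_loop {
    my ($mce, $chunk_ref) = @_;
    MCE->gather($_ * $_) for @{$chunk_ref};
} 1 .. 10;

print "@squares\n";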
You could also look at threads, Coro, Parallel::Iterator.