How to get first file in directory by alphabetical order? - perl

I want to execute $filename = `ls *.gz | head -n 1`; through perl but I think the pipe is causing an error. Execution of -e aborted due to compilation errors.
This will be part of the perl script rather than run via -e.
What would be the correct way to do this?

How about
my $filename = (sort glob "*.gz")[0];
You asked for alphabetical order, hence the sort, which by default uses the "standard string comparison order". Note that ls can be aliased, and its defaults may also depend on the system.
Going out to the shell would make sense only if you use some particular strengths of ls, that would take a lot of work to do in Perl. For mere sorting there is no reason to go out of your program. It is far less efficient and adds a whole list of new problems to solve.
A good point was raised by mob. Because of sort invocation, sort BLOCK|SUB LIST, one could wonder whether glob above could be taken in an unintended way, as a SUB. It's not, as the builtin certainly runs first. However, that's a little close and this is just clearer
my ($filename) = sort (glob "*.gz");

zdim's solution uses sort and glob, both of which do a lot of needless work.
The glob is something I explain in Wasting time thinking about wasted time, in which I look at some really bad benchmarks we had in Intermediate Perl.
The sort compares a bunch of files to each other, even if it already knows that the two files it wants to compare cannot be the one you want. You only care about which one comes first. A linear scan does just fine (and we show an example of that in Learning Perl to find the maximum number in a list):
opendir my $dh, $dir or die "Could not open dir $dir: $!";
my $first;
while( defined( my $file = readdir $dh ) ) {
    # take the first name seen, then any name that sorts before it
    $first = $file if !defined $first or $file lt $first;
}
It's a bit more complicated as you probably want to filter out some files (the virtual dirs . and .., and maybe all the hidden files), but a grep handles that:
my @files = grep { ! /\A\.\.?\z/ } readdir $dh;
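Putting the linear scan and the filter together could look something like the following. This is only a minimal sketch: the sub name first_entry and its two parameters are made up for illustration, and it assumes you want plain string order as given by lt.

```perl
use strict;
use warnings;

# Return the alphabetically first entry of $dir whose name matches $pattern,
# or undef if nothing matches. The sub and parameter names are illustrative.
sub first_entry {
    my ($dir, $pattern) = @_;
    opendir my $dh, $dir or die "Could not open dir $dir: $!";
    my $first;
    while (defined(my $name = readdir $dh)) {
        next if $name =~ /\A\.\.?\z/;     # skip the virtual dirs . and ..
        next unless $name =~ $pattern;    # keep only the wanted files
        $first = $name if !defined $first or $name lt $first;
    }
    closedir $dh;
    return $first;
}

my $first = first_entry('.', qr/\.gz\z/);
print defined $first ? "$first\n" : "no .gz files here\n";
```

The scan keeps one candidate at a time, so it never builds or sorts the full list of names.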
Even better, this wouldn't even have to know how to get the next file. Some other sort of iterator would provide that without high_water knowing how it works. That's a more complicated example that I won't present here (and is one of those areas where Perl should look to Python, and that I demonstrate in Object::Iterate).
But, sometimes the easier, less pure thing is a good enough solution. Maybe not for this problem, but for some other problem. For example, if you have an external process (ls) generate all the input, all its resources can be freed once it does its job. A Perl pipe can do that:
$ perl -e 'open my $ph, q(-|), q(/bin/ls *.gz); print scalar <$ph>'
a.gz
Or, with head as well (notice the disappearing scalar) so I print all the lines instead of the first line (even though in this example there is only one line):
$ perl -e 'open my $ph, q(-|), q(/bin/ls *.gz | /usr/bin/head -n 1); print <$ph>'
a.gz
In your simple case, where you only get one line of input, the backticks can do the job:
$ perl -e 'print `/bin/ls *.gz | /usr/bin/head -n 1`'
a.gz
I don't know what problems you were having with your one-liner, but that's something to consider: shell things tend to be fragile and difficult to get right. Then, when you get it right, someone else messes it up. Or simple translations from Perl strings to the shell don't come out as expected.
I suspect that you had quoting issues, which is why I use generalized quoting, q(), inside my Perl one-liners. It gets even more fun on Windows where you can't use single ticks to quote the argument to -e, but if you use double ticks in a unix shell, you get interpolation.
Remember though, asking for an external process means you have to be really careful. I used /bin/ls to be sure I got what I wanted and not some other thing in my path (although you can also limit $ENV{PATH}). I write much more about that in Mastering Perl, although perlsec has some advice too.
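As a rough illustration of that kind of care, the sketch below limits PATH and uses the list form of a piped open so that no shell ever parses the arguments. The sub name first_via_ls is hypothetical, and the PATH value assumes that ls lives in /bin or /usr/bin on your system.

```perl
use strict;
use warnings;

# Return the first name that ls prints for $dir, or undef if it prints nothing.
# first_via_ls is an illustrative name, not something from a module.
sub first_via_ls {
    my ($dir) = @_;
    local $ENV{PATH}   = '/bin:/usr/bin';   # assumed location of ls
    local $ENV{LC_ALL} = 'C';               # byte-wise sort order from ls
    # List form of open: the arguments go straight to ls, with no shell.
    open(my $ph, '-|', 'ls', '-1', '--', $dir) or die "Cannot run ls: $!";
    my @names = <$ph>;
    close $ph or die "ls failed with status $?";
    chomp @names;
    return $names[0];
}

my $name = first_via_ls('.');
print defined $name ? "$name\n" : "empty directory\n";
```

Because there is no shell, there is also no wildcard expansion, so any filtering such as *.gz would have to happen in Perl.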

Related

'Correct' way to have perl arguments interpreted by current shell

Sorry, this is pretty basic, and I suspect a duplicate, but after some searching I'm coming up empty:
Given the following script:
#!/usr/bin/perl
use strict;
use warnings;
use IPC::Run3;
my $stdout2;
print $ARGV[0];
print "\n";
my @cmd1 = split /\s+/, $ARGV[0] ;
run3 (\@cmd1, \undef, \$stdout2, \$stdout2);
print $stdout2
And running it like so:
£ perl comp.pl "md5sum *(.)"
md5sum *(.)
md5sum: '*(.)': No such file or directory
Fair enough. The *(.) isn't being interpreted by the shell and probably most would consider this a feature. But I would like it to be interpreted by the current shell (or zsh specifically would be fine).
The question is how I can do this without complicating the shell command to run the perl script.
Prepending "zsh" and "-c" to cmd1 is ok if that's a reasonable way to do it. It just seems like...it isn't.
My intention is also to pass slightly more complex commands to this script eventually, like so:
perl comp.pl 'md5sum *(.)' 'ssh remoteHost "md5sum *(.)"'
I have no objection to non-perl answers to the problem you can probably infer I'm trying to solve (I suspect rsync could do this) but I'm primarily interested in solving this through Perl as there'll eventually be business-specific logic in this comparison.
EDIT
I tried various forms of:
my $cmd = $ARGV[0];
run3 (\$cmd, \undef, \$stdout2, \$stdout2);
the documentation seems to think this would be ok, but I get:
Not an ARRAY reference at /usr/local/share/perl/5.22.1/IPC/Run3.pm line 320.
The IPC::Run3 docs say that one can pass a string instead of an arrayref for the command
run3($cmd, $stdin, $stdout, $stderr, \%options)
...
$cmd
Usually $cmd will be an ARRAY reference and the child is invoked via
system @$cmd;
But $cmd may also be a string in which case the child is invoked via
system $cmd;
In this case the string $cmd is passed to the shell if it contains shell metacharacters. So take input without splitting it, $cmd = $ARGV[0], or join it after validation, $cmd = join ' ', @cmd;
Even in general this is not the preferred way, and the docs warn to see system for "pitfalls" of it.
Things are yet much worse here since you'd be passing user input directly for execution! Never mind possible nefarious intents, just think of what a good typo can do. Even without that, there is simply a difference between typing a command at the terminal and passing it to a script, which may edit it, may get modified, pick up bugs, etc.
If nothing else, I'd urge to add code for substantial checks of submitted input. An analysis may involve identifying the known and accepted metacharacters while suitably quoting parts of input that shouldn't be interpreted, for example using String::ShellQuote.
But I'd really suggest to reconsider the design, so to not submit complete commands to the script. Rather, specify with keywords what should happen. Things like globbing (assembling a file list) are done from Perl really nicely and with a lot of control. Do outside only what is necessary; generally there'll be no need for the shell then.
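To make that suggestion concrete, here is one way the md5sum *(.) part might be done without any shell, using the core Digest::MD5 module. It is a sketch only: the sub name md5_of_plain_files is invented, and skipping dotfiles is an assumption meant to roughly mirror what the zsh pattern selects.

```perl
use strict;
use warnings;
use Digest::MD5;

# Map each plain, non-hidden file in $dir to its MD5 hex digest.
# md5_of_plain_files is a hypothetical helper name.
sub md5_of_plain_files {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my %digest;
    while (defined(my $name = readdir $dh)) {
        next if $name =~ /\A\./;            # no dotfiles, like a bare *
        my $path = "$dir/$name";
        next unless -f $path;               # plain files only
        open my $fh, '<:raw', $path
            or do { warn "Cannot read $path: $!"; next };
        $digest{$name} = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
    }
    closedir $dh;
    return \%digest;
}

my $sums = md5_of_plain_files('.');
print "$sums->{$_}  $_\n" for sort keys %$sums;
```

The script then only needs a directory name as its argument, and the comparison logic can work on the returned hash directly.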

How do you check for the existence of file names with a specific string in Perl

I am rewriting a Bash script in Perl in order to learn the latter.
The script creates a file using the current date in a custom format and a ".txt" extension but checks first to make sure no file with the date in question already exists.
In Bash, I accomplish this with ls |grep $customDate as a condition. That is, if ls |grep $customDate is true, a warning is issued and no file is created, while if ls |grep $customDate is false, the file gets created with the custom date plus a ".txt" extension.
How can I mimic this in Perl?
For testing purposes, I wrote the code below but it does not print out anything - even when I have created a file that meets the condition:
use POSIX qw( strftime );
$customDate = strftime "%Y_%m%b_%d%a", localtime;
opendir(DIR, ".") or die "$!";
my @FILES = grep { /${customDate}*/ } readdir(DIR);
closedir(DIR);
print "$_\n" for @FILES;
I apologize if my question is unclear
"I am rewriting a Bash script in Perl in order to learn the latter."
I think you're taking the wrong approach to learning Perl, or to learning any language.
While there are always a lot of similarities between procedural languages, it is always wrong to focus on those above the differences. Programming languages must be learned from scratch if you hope ever to be able to read and write them well
I regularly see Perl code on Stack Overflow that has clearly been written by someone with the wrong head on. For instance, the clearest signs of a C programmer are
Declaring everything in one block at the top of a source file
Over-use of scalar and parentheses
Under-use of the default variable $_ and regular expressions
Using the C-style for loop, which usually looks something like this in Perl
my $i;
for ($i=0; $i<=scalar(@data); $i++)
{
process($data[$i])
}
Apart from ignoring perlstyle completely, the author is grasping for something familiar instead of embracing the new language. In idiomatic Perl that should look like
process($_) for @data
Reaching further, it is easy to become complacent about the consequences of phrases you may be writing glibly in the shell
You need to be aware that your shell statement
ls |grep $customDate
is starting new processes to run /bin/ls and /bin/grep, and piping information between them and back to the shell process. The Linux shell and its supporting utilities are designed to get trivial jobs done easily, but I believe they are being used too much with elaborate shell script one-liners that are opaque and beyond debugging
It's very hard to tell exactly what it is that you need your program to do, but it's looking like you want to discover whether there are any files that contain a string like 2016_05May_30Mon in the current directory
I have to say that's a horrible date-time format and I'm struggling to believe that it's what you want, but I would prefer Perl's core Time::Piece module over POSIX any time
In this instance I would also make use of Perl's regular expressions, the -X file test operators, and Perl's glob operator instead of opendir, readdir, closedir. None of those have a direct equivalent in any shell language
So, assuming that my guesses about your intention are correct, I would write this
use strict;
use warnings 'all';
use feature 'say';
use Time::Piece;
my $dtime = localtime()->strftime('%Y_%m%b_%d%a');
say for grep { -f and /$dtime/ } glob '*.txt';
which isn't remotely like your translation from shell to Perl
The reason you're not getting what you expect is the * in the grep is looking for the last character of the "$customDate" repeated as many times as it likes (which is not what you expect from the * in this case).
If your file has a "somedata.txt" ext, you should update the code as such, which will look for your date string then any number of characters followed by a txt:
$customDate = strftime "%Y_%m%b_%d%a", localtime;
opendir(DIR, ".") or die "$!";
my @FILES = grep { /${customDate}.*\.txt/ } readdir(DIR);
closedir(DIR);
print "$_\n" for @FILES;
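The difference between a shell wildcard * and a regex quantifier * can be seen in a small experiment. The date value and the two file names below are made up for illustration, and \Q...\E is added here as an extra precaution so the date is always matched literally.

```perl
use strict;
use warnings;

my $customDate = '2016_05May_30Mon';   # sample value, assumed for illustration

# In a regex, the trailing * applies to the last character only:
# "2016_05May_30Mo" followed by zero or more "n".
my $loose  = qr/${customDate}*/;
# \Q...\E matches the date literally; .*\.txt\z then requires a .txt suffix.
my $strict = qr/\Q$customDate\E.*\.txt\z/;

for my $name ('2016_05May_30Mo.log', '2016_05May_30Mon_notes.txt') {
    printf "%-28s loose=%d strict=%d\n", $name,
        ($name =~ $loose  ? 1 : 0),
        ($name =~ $strict ? 1 : 0);
}
```

The first name lacks the final "n" and the .txt extension, yet the loose pattern still accepts it, while the strict pattern accepts only the second name.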

exit code of system() call with a single scalar argument in Perl

There is a system() call in a Perl script with multiple pipes, using a single scalar argument. The call looks more or less like this:
system("zcat /foo.gz | grep '^.{6}X|Y|Z' | awk '{print $2,$3,$4,$6}' | bzip2 > /foo.processed.bz2");
The file in question (foo.gz) is quite large, about 2GB compressed in size. I guess that's why it was originally done via a system call.
Questions:
The problem now is that this system call always seems to return 0, whether one of the system commands fails or not. I assume this is because it gets invoked via sh -c '...'. Is that correct?
Is there a way to check if a system() call was successful if only a single scalar argument is passed?
Is there a better way to process a large file like this, in a way that's equally or more efficient (in terms of speed mainly)?
Thanks for any hints as I am not really familiar with Perl.
Two things:
When you do a system call, the value returned is the last value in the pipeline. Thus, you're getting the status code of the bzip2 command.
The reason the program is doing this is because the people who wrote the program probably didn't know any better. I've seen Perl programs use system calls for finding the basename of the file, doing a find, and even doing a copy/rename/move. These are all things that can be done faster and easier inside the Perl program. And, you don't have the whole Windows/Unix compatibility issues.
You're always better off using Perl modules for things like this. In this case, I bet the Perl modules will be even faster than the shell pipeline, and you'll have more control over the entire operation.
There's a set called IO::Compress that can handle both Zip and BZip2.
I use Archive::Zip which is a great module, but you want to use the Bzip2 compression algorithm, and Archive::Zip can't handle that.
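A rough sketch of the pipeline with the IO::Compress family might look like this. The sub name filter_gz_to_bz2 is hypothetical, the caller supplies the regex, and the field list is an assumed translation of the awk print of fields 2, 3, 4 and 6. No speed claim is made here; it would need measuring on the real file.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);
use IO::Compress::Bzip2    qw($Bzip2Error);

# Stream a gzip file line by line into a bzip2 file, keeping only lines
# that match $keep and only whitespace-separated fields 2, 3, 4 and 6.
# filter_gz_to_bz2 is an illustrative name.
sub filter_gz_to_bz2 {
    my ($in, $out, $keep) = @_;
    my $gz = IO::Uncompress::Gunzip->new($in)
        or die "Cannot read $in: $GunzipError";
    my $bz = IO::Compress::Bzip2->new($out)
        or die "Cannot write $out: $Bzip2Error";
    while (defined(my $line = $gz->getline)) {
        next unless $line =~ $keep;
        my @field = split ' ', $line;
        # Missing fields become empty strings, as they would in awk.
        $bz->print(join(' ', map { $_ // '' } @field[1, 2, 3, 5]), "\n");
    }
    $gz->close;
    $bz->close or die "Cannot finish $out: $Bzip2Error";
    return;
}

# Hypothetical call, with an assumed pattern:
# filter_gz_to_bz2('/foo.gz', '/foo.processed.bz2', qr/^.{6}[XYZ]/);
```

Every failure point raises an error inside the Perl program, so there is no pipeline exit status to interpret.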
system() returns what the /bin/sh shell returns. When multiple commands are pipelined, the shell forks a new process for each of them and the status code of the last command in the chain is returned, in this case bzip2.
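The behaviour is easy to check with the standard true and false commands. In this sketch the helper name shell_exit_code is made up; it decodes the wait status that system returns, where the exit code is in the high byte.

```perl
use strict;
use warnings;

# Run a command string and return its exit code; die if it could not be
# started or was killed by a signal. shell_exit_code is an illustrative name.
sub shell_exit_code {
    my ($command) = @_;
    my $status = system($command);
    die "failed to run '$command': $!" if $status == -1;
    die "'$command' died from signal " . ($status & 127) if $status & 127;
    return $status >> 8;
}

print shell_exit_code('false | true'), "\n";   # prints 0, the status of true
print shell_exit_code('true | false'), "\n";   # prints 1, the status of false
```

If bash is available, running the pipeline under its pipefail option is one way to make a failure in an earlier command visible in the final status.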
Based on your comments and answers, I'd do it like that now:
$infile =~ s/(.*\.gz)\s*$/gzip -dc < $1|/;
open(OUTFH, "| /bin/bzip2 > $outfile") or die "Can't open $outfile: $!";
open(INFH, $infile) or die "Can't open $infile: $!";
while (my $line = <INFH>) {
if ($line =~ /^.{6}X|Y|Z/) {
# TODO: the awk part...
print OUTFH $line;
}
}
close(INFH);
close(OUTFH);
Please feel free to comment and vote up/down.
You'd be better doing the text processing from within perl itself - that's what perl's for :)
system() returns the exit status of the command (the same wait status that ends up in $?), not its output. To capture actual output, try calling it via backticks: `command` rather than system('command')

finding many thousands of files in a directory pattern in Perl

I would like to find a file pattern on a directory pattern in Perl that will return many thousands of entries, like this:
find ~/mydir/*/??/???/???? -name "\*.$refinfilebase.search" -print
I've been told there are different ways to handle it? I.e.:
File::Find
glob()
opendir, readdir, grep
Diamond operator, e.g.: my @files = <$refinfilebase.search>
Which one would be most adequate to be able to run the script on older versions of Perl or minimal installations of Perl?
For very large directories, opendir() is probably safest, as it doesn't need to read everything in or do any filtering on it. It can also be faster because it does not sort the entries; on very large directories, on some operating systems, sorting can be a performance hit. opendir is also built in on all systems.
Note the actual way it behaves may be different on different platforms, so you need to be careful in coding with it. This mainly affects what it returns for things like the parent and current directory, which you may need to treat specially.
glob() is more useful when you only want some files, matching by a pattern. File::Find is more useful when recursing through a set of nested directories. If you don't need either, opendir() is a good base.
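One way to combine these tools for a fixed-depth directory pattern is to expand the directories once and then scan each with readdir. This is a sketch under assumptions: find_by_suffix and its parameters are invented names, it looks only at files directly inside the matched directories (the find command would also recurse below them), and it uses bsd_glob from the core File::Glob module so that spaces in the pattern are not treated as separators.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);

# Expand a directory pattern, then scan each matching directory with readdir
# and collect the paths of entries whose names end in $suffix.
# find_by_suffix, $dir_pattern and $suffix are illustrative names.
sub find_by_suffix {
    my ($dir_pattern, $suffix) = @_;
    my @found;
    for my $dir (grep { -d } bsd_glob($dir_pattern)) {
        opendir my $dh, $dir or next;      # skip unreadable directories
        while (defined(my $name = readdir $dh)) {
            push @found, "$dir/$name" if $name =~ /\Q$suffix\E\z/;
        }
        closedir $dh;
    }
    return @found;
}

# Hypothetical call, with $base and $refinfilebase set elsewhere:
# my @hits = find_by_suffix("$base/*/??/???/????", ".$refinfilebase.search");
```

Within each directory the results come back in readdir order, so sort them if the order matters.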
Also you have DirHandle
DirHandle:
use DirHandle;
$d = new DirHandle ".";
if (defined $d) {
while (defined($_ = $d->read)) { something($_); }
$d->rewind;
while (defined($_ = $d->read)) { something_else($_); }
undef $d;
}
For use cases of readdir and glob see
What reasons are there to prefer glob over readdir (or vice-versa) in Perl?
I prefer to use glob to quickly grab a list of files in a dir (no subdirs) and process them like
map { process_bam($_) } glob("bam_files/*.bam");
This is more convenient because it skips . and .. even if you ask for (*), and it returns each name with the directory prefix if the glob pattern includes a directory.
Also you can use glob quickly as a oneliner piped to xargs or in a bash for loop when you need to preprocess the filenames of the list:
perl -lE 'print join("\n", map {s/srf\/(.+).srf/$1/;$_} glob("srf/198*.srf"))' | xargs -n 1.....
readdir has advantages in other scenarios, so you need to use the one that fits your actions better.

What reasons are there to prefer glob over readdir (or vice-versa) in Perl?

This question is a spin-off from this one. Some history: when I first learned Perl, I pretty much always used glob rather than opendir + readdir because I found it easier. Then later various posts and readings suggested that glob was bad, and so now I pretty much always use readdir.
After thinking over this recent question I realized that my reasons for one or the other choice may be bunk. So, I'm going to lay out some pros and cons, and I'm hoping that more experienced Perl folks can chime in and clarify. The question in a nutshell is are there compelling reasons to prefer glob to readdir or readdir to glob (in some or all cases)?
glob pros:
No dotfiles (unless you ask for them)
Order of items is guaranteed
No need to prepend the directory name onto items manually
Better name (c'mon - glob versus readdir is no contest if we're judging by names alone)
(From ysth's answer; cf. glob cons 4 below) Can return non-existent filenames:
@deck = glob "{A,K,Q,J,10,9,8,7,6,5,4,3,2}{\x{2660},\x{2665},\x{2666},\x{2663}}";
glob cons:
Older versions are just plain broken (but 'older' means pre 5.6, I think, and frankly if you're using pre 5.6 Perl, you have bigger problems)
Calls stat each time (i.e., useless use of stat in most cases).
Problems with spaces in directory names (is this still true?)
(From brian's answer) Can return filenames that don't exist:
$ perl -le 'print glob "{ab}{cd}"'
readdir pros:
(From brian's answer) opendir returns a filehandle which you can pass around in your program (and reuse), but glob simply returns a list
(From brian's answer) readdir is a proper iterator and provides functions to rewinddir, seekdir, telldir
Faster? (Pure guess based on some of glob's features from above. I'm not really worried about this level of optimization anyhow, but it's a theoretical pro.)
Less prone to edge-case bugs than glob?
Reads everything (dotfiles too) by default (this is also a con)
May convince you not to name a file 0 (a con also - see Brad's answer)
Anyone? Bueller? Bueller?
readdir cons:
If you don't remember to prepend the directory name, you will get bit when you try to do filetests or copy items or edit items or...
If you don't remember to grep out the . and .. items, you will get bit when you count items, or try to walk recursively down the file tree or...
Did I mention prepending the directory name? (A sidenote, but my very first post to the Perl Beginners mail list was the classic, "Why does this code involving filetests not work some of the time?" problem related to this gotcha. Apparently, I'm still bitter.)
Items are returned in no particular order. This means you will often have to remember to sort them in some manner. (This could be a pro if it means more speed, and if it means that you actually think about how and if you need to sort items.) Edit: Horrifically small sample, but on a Mac readdir returns items in alphabetical order, case insensitive. On a Debian box and an OpenBSD server, the order is utterly random. I tested the Mac with Apple's built-in Perl (5.8.8) and my own compiled 5.10.1. The Debian box is 5.10.0, as is the OpenBSD machine. I wonder if this is a filesystem issue, rather than Perl?
Reads everything (dotfiles too) by default (this is also a pro)
Doesn't necessarily deal well with a file named 0 (see pros also - see Brad's answer)
You missed the most important, biggest difference between them: glob gives you back a list, but opendir gives you a directory handle. You can pass that directory handle around to let other objects or subroutines use it. With the directory handle, the subroutine or object doesn't have to know anything about where it came from, who else is using it, and so on:
sub use_any_dir_handle {
my( $dh ) = @_;
rewinddir $dh;
...do some filtering...
return \@files;
}
With the dirhandle, you have a controllable iterator where you can move around with seekdir, whereas with glob you just get the next item.
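A filled-in version of that idea might look like the sketch below. The sub name visible_names and its dotfile filter are assumptions chosen for illustration; the point is only that the sub receives a handle and never learns which directory it belongs to.

```perl
use strict;
use warnings;

# Works with any directory handle, wherever it was opened.
# visible_names is an illustrative name.
sub visible_names {
    my ($dh) = @_;
    rewinddir $dh;                               # start from the first entry
    my @files = grep { !/\A\./ } readdir $dh;    # drop dotfiles, . and ..
    return \@files;
}

opendir my $dh, '.' or die "Cannot open current directory: $!";
my $first_pass  = visible_names($dh);
my $second_pass = visible_names($dh);   # same handle, rewound inside the sub
closedir $dh;
print scalar(@$first_pass), " then ", scalar(@$second_pass), " entries\n";
```

Because the sub rewinds the handle itself, callers can reuse one handle for several passes.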
As with anything though, the costs and benefits only make sense when applied to a certain context. They do not exist outside of a particular use. You have an excellent list of their differences, but I wouldn't classify those differences without knowing what you were trying to do with them.
Some other things to remember:
You can implement your own glob with opendir, but not the other way around.
glob uses its own wildcard syntax, and that's all you get.
glob can return filenames that don't exist:
$ perl -le 'print glob "{ab}{cd}"'
glob pros: Can return 'filenames' that don't exist:
my @deck = List::Util::shuffle glob "{A,K,Q,J,10,9,8,7,6,5,4,3,2}{\x{2660},\x{2665},\x{2666},\x{2663}}";
while (my @hand = splice @deck,0,13) {
say join ",", @hand;
}
__END__
6♥,8♠,7♠,Q♠,K♣,Q♦,A♣,3♦,6♦,5♥,10♣,Q♣,2♠
2♥,2♣,K♥,A♥,8♦,6♠,8♣,10♠,10♥,5♣,3♥,Q♥,K♦
5♠,5♦,J♣,J♥,J♦,9♠,2♦,8♥,9♣,4♥,10♦,6♣,3♠
3♣,A♦,K♠,4♦,7♣,4♣,A♠,4♠,7♥,J♠,9♥,7♦,9♦
glob makes it convenient to read all the subdirectories of a given fixed depth, as in glob "*/*/*". I've found this handy on several occasions.
Here is a disadvantage for opendir and readdir.
{
open my $file, '>', 0;
print {$file} 'Breaks while( readdir ){ ... }'
}
opendir my $dir, '.';
my $a = 0;
++$a for readdir $dir;
print $a, "\n";
rewinddir $dir;
my $b = 0;
++$b while readdir $dir;
print $b, "\n";
You would expect that code would print the same number twice, but it doesn't because there is a file with the name of 0. On my computer it prints 251, and 188, tested with Perl v5.10.0 and v5.10.1
This problem also makes it so that this just prints out a bunch of empty lines, regardless of the existence of file 0:
use 5.10.0;
opendir my $dir, '.';
say while readdir $dir;
Whereas this always works just fine:
use 5.10.0;
my $a = 0;
++$a for glob '*';
say $a;
my $b = 0;
++$b while glob '*';
say $b;
say for glob '*';
say while glob '*';
I fixed these issues, and sent in a patch which made it into Perl v5.11.2, so this will work properly with Perl v5.12.0 when it comes out.
My fix converts this:
while( readdir $dir ){ ... }
into this:
while( defined( $_ = readdir $dir ) ){ ... }
Which makes it work the same way that read has worked on files. Actually it is the same bit of code, I just added another element to the corresponding if statements.
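On any Perl version, writing the defined test yourself avoids the problem. A small sketch follows; the sub name count_entries is made up, and leaving out . and .. is a choice made here so the count reflects real entries only.

```perl
use strict;
use warnings;

# Count the entries of $dir, excluding . and .., with an explicit defined()
# test so that a file named "0" cannot end the loop early.
# count_entries is an illustrative name.
sub count_entries {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my $count = 0;
    while (defined(my $name = readdir $dh)) {
        next if $name eq '.' or $name eq '..';
        $count++;
    }
    closedir $dh;
    return $count;
}

print count_entries('.'), "\n";
```

Using a lexical variable instead of $_ also keeps the loop from clobbering a caller's $_.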
Well, you pretty much cover it. All that taken into account, I would tend to use glob when I'm throwing together a quick one-off script and its behavior is just what I want, and use opendir and readdir in ongoing production code or libraries where I can take my time and clearer, cleaner code is helpful.
That was a pretty comprehensive list. readdir (and readdir + grep) has less overhead than glob and so that is a plus for readdir if you need to analyze lots and lots of directories.
For small, simple things, I prefer glob. Just the other day, I used it and a twenty line perl script to retag a large portion of my music library. glob, however, has a pretty strange name. Glob? It's not intuitive at all, as far as a name goes.
My biggest hangup with readdir is that it treats a directory in a way that's somewhat odd to most people. Usually, programmers don't think of a directory as a stream, they think of it as a resource, or list, which glob provides. The name is better, the functionality is better, but the interface still leaves something to be desired.
glob pros:
3) No need to prepend the directory name onto items manually
Exception:
say for glob "*";
--output:--
1perl.pl
2perl.pl
2perl.pl.bak
3perl.pl
3perl.pl.bak
4perl.pl
data.txt
data1.txt
data2.txt
data2.txt.out
As far as I can tell, the rule for glob is: you must provide a full path to the directory to get full paths back. The Perl docs do not seem to mention that, and neither do any of the posts here.
That means that glob can be used in place of readdir when you want just filenames (rather than full paths), and you don't want hidden files returned, i.e. ones starting with '.'. For example,
chdir ("../..");
say for glob("*");
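The prefix behaviour can be compared side by side. In this sketch both sub names are invented, bsd_glob from the core File::Glob module stands in for glob so that a directory name containing spaces is not split, and the "/" separator assumes a Unix-style system.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);
use File::Spec;

# readdir returns bare names, so the directory has to be joined on by hand.
sub paths_via_readdir {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @paths = map  { File::Spec->catfile($dir, $_) }
                grep { !/\A\./ } readdir $dh;
    closedir $dh;
    return @paths;
}

# A glob pattern's own prefix is returned with each match, so "$dir/*"
# yields paths that already start with $dir.
sub paths_via_glob {
    my ($dir) = @_;
    return bsd_glob("$dir/*");
}

my @from_readdir = paths_via_readdir('.');
my @from_glob    = paths_via_glob('.');
print scalar(@from_readdir), " via readdir, ", scalar(@from_glob), " via glob\n";
```

Both subs skip dotfiles, so on the same directory they should return the same set of paths, possibly in a different order.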
On a similar note, File::Slurp has a function called read_dir.
Since I use File::Slurp's other functions a lot in my scripts, read_dir has also become a habit.
It also has following options: err_mode, prefix, and keep_dot_dot.
First, do some reading. Chapter 9.6. of the Perl Cookbook outlines the point I want to get to nicely, just under the discussion heading.
Secondly, do a search for glob and dosglob in your Perl directory. While many different sources (ways to get the file list) can be used, the reason why I point you to dosglob is that if you happen to be on a Windows platform (and using the dosglob solution), it is actually using opendir/readdir/closedir. Other versions use built-in shell commands or precompiled OS specific executables.
If you know you are targeting a specific platform, you can use this information to your advantage. Just for reference I looked into this on Strawberry Perl Portable edition 5.12.2, so things may be slightly different on newer or original versions of Perl.