Simple perl file copy methods without using File::Copy - perl

I have very little Perl experience and have been trying a few methods on OS X before I attempt to use MacPerl on a harder-to-access OS 9 machine with very limited memory.
I have been trying simple file copy methods without using File::Copy.
Both of the following appear to work:
open R,"<old";
open W,">new";
print W <R>;
open $r,"<old";
open $w,">new";
while (<$r>) { print $w $_ }
Why can't I use $r and $w with the first method?
How does the first method work without a 'while'?
Is there a better way to do this?

Is there a better way to do this?
There sure is... File::Copy is a core module (no installation required), so there's little reason to avoid using it.
use File::Copy;
copy('old', 'new');
Alternatively, you can use a system call to the underlying OS, in your case OS X:
system('cp', 'old', 'new');
UPDATE Oops, you're on OS9, so I'm not sure what system calls are available.
You can use your first method with lexical file handles, but you need to disambiguate a little.
open $r, '<', 'old';
open $w, '>', 'new';
print {$w} <$r>;
Bear in mind this is unidiomatic code, and if you just want to create a direct copy, the first method is preferable (EDIT If your memory constraints allow for it).
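If memory is the main constraint, one option is to copy in fixed-size chunks rather than slurping the whole file. The following is only a sketch: copy_in_chunks is a hypothetical helper name, the 4096-byte default is an assumed value, and it relies on lexical filehandles and three-argument open (Perl 5.6 or later).

```perl
use strict;
use warnings;

# copy_in_chunks is a hypothetical helper name, not part of any module.
sub copy_in_chunks {
    my ($from, $to, $chunk_size) = @_;
    $chunk_size ||= 4096;    # assumed default; tune to the memory available

    open my $in,  '<', $from or die "Cannot read $from: $!";
    open my $out, '>', $to   or die "Cannot write $to: $!";
    binmode $in;             # copy bytes verbatim, no newline translation
    binmode $out;

    my $buffer;
    while (1) {
        my $got = read $in, $buffer, $chunk_size;
        die "Read error on $from: $!" unless defined $got;
        last if $got == 0;   # end of file
        print {$out} $buffer or die "Write error on $to: $!";
    }

    close $in;
    close $out or die "Cannot close $to: $!";
    return 1;
}
```

Called as copy_in_chunks('old', 'new'), it never holds more than one chunk in memory at a time.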

Perl operators and functions can return different things depending on what their context expects.
The first method works because the print function creates what is called a list context for the <> operator - then the operator will "slurp in" the entire file.
In the second example, the <> operator is called in the condition of the loop, which creates a scalar context, returning one line at a time (some asterisks here, but that's another story.)
Here is some explanation about contexts: http://perlmaven.com/scalar-and-list-context-in-perl.
And, both methods should work with the R and W filehandles (which are old fashioned Perl filehandles that are different from regular variables), and with the $r/$w notation that actually denotes variables which hold a filehandle reference. The difference is subtle but in most everyday use cases these can be used interchangeably. Have you tried using $ variables in the first example?
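The context difference can be seen in a small sketch. The sample text is made up, and an in-memory filehandle (which assumes Perl 5.8 or later) stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical sample data; an in-memory filehandle stands in for a real file.
my $text = "one\ntwo\nthree\n";

# List context: <$fh> returns every remaining line at once.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @all_lines = <$fh>;
close $fh;

# Scalar context: each <$fh> call returns just the next line.
open $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $first_line  = <$fh>;
my $second_line = <$fh>;
close $fh;

print "list context read ", scalar(@all_lines), " lines\n";
print "scalar context read: $first_line";
```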

In addition to Hellmar Becker's answer:
The print $w <$r>; does not work (gives a syntax error) because if the FILEHANDLE argument is a variable, perl tries to interpret the beginning of print's argument list ($w <$r) as an operator (see the NOTE in http://perldoc.perl.org/functions/print.html). To disambiguate, put parentheses around the <$r>:
print $w (<$r>);

Related

What is the Perl's IO::File equivalent to open($fh, ">:utf8",$path)?

It's possible to write a UTF-8 encoded file as follows:
open my $fh,">:utf8","/some/path" or die $!;
How do I get the same result with IO::File, preferably in 1 line?
I got this one, but does it do the same and can it be done in just 1 line?
my $fh_out = IO::File->new($target_file, 'w');
$fh_out->binmode(':utf8');
For reference, the script starts as follows:
use 5.020;
use strict;
use warnings;
use utf8;
# code here
Yes, you can do it in one line.
open accepts one, two or three parameters. With one parameter, it is just a front end for the built-in open function. With two or three parameters, the first parameter is a filename that may include whitespace or other special characters, and the second parameter is the open mode, optionally followed by a file permission value.
[...]
If IO::File::open is given a mode that includes the : character, it passes all the three arguments to the three-argument open operator.
So you just do this.
my $fh_out = IO::File->new('/some/path', '>:utf8');
It is the same as your first open line because it gets passed through.
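As a rough check that the layer really is applied, the sketch below writes one non-ASCII character and looks at the file size on disk. The scratch directory and file name are hypothetical stand-ins for /some/path.

```perl
use strict;
use warnings;
use IO::File;
use File::Temp qw(tempdir);

# Hypothetical scratch location; '/some/path' in the answer plays the same role.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/out.txt";

# One-line open with an encoding layer, as described above.
my $fh_out = IO::File->new($path, '>:utf8') or die "Cannot open $path: $!";
$fh_out->print("caf\x{e9}\n");   # U+00E9 is one character, two bytes in UTF-8
$fh_out->close;

my $byte_count = -s $path;       # size on disk in bytes
print "wrote $byte_count bytes\n";
```

If the layer were missing, the character would be written as a single Latin-1 byte and the file would be one byte shorter.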
I would suggest to try out Path::Tiny. For example, to open and write out your file
use Path::Tiny;
path('/some/path')->spew_utf8(@data);
From the docs, on spew, spew_raw, spew_utf8
Writes data to a file atomically. [ ... ]
spew_raw is like spew with a binmode of :unix for a fast, unbuffered, raw write.
spew_utf8 is like spew with a binmode of :unix:encoding(UTF-8) (or PerlIO::utf8_strict ). If Unicode::UTF8 0.58+ is installed, a raw spew will be done instead on the data encoded with Unicode::UTF8.
The module integrates many tools for handling files and directories, paths and content. It is often simple calls like above, but also method chaining, recursive directory iterator, hooks for callbacks, etc. There is error handling throughout, consistent and thoughtful dealing with edge cases, flock on input/output handles, its own tiny and useful class for exceptions ... see docs.
Edit:
You could also use File::Slurp, if its use were not discouraged, e.g.:
use File::Slurp qw(write_file);
write_file( 'filename', {binmode => ':utf8'}, $buffer ) ;
The first argument to write_file is the filename. The next argument is
an optional hash reference and it contains key/values that can modify
the behavior of write_file. The rest of the argument list is the data
to be written to the file.
Some good reasons not to use it?
Not reliable
Has some bugs
And as @ThisSuitIsBlackNot said, File::Slurp is broken and wrong

How is a Perl filehandle a scalar if it can return multiple lines?

I have kind of fundamental question about scalars in Perl. Everything I read says scalars hold one value:
A scalar may contain one single value in any of three different
flavors: a number, a string, or a reference. Although a scalar may not
directly hold multiple values, it may contain a reference to an array
or hash which in turn contains multiple values.
--from perldoc
Was curious how the code below works
open( $IN, "<", "phonebook.txt" )
or die "Cannot open the file\n";
while ( my $line = <$IN> ) {
chomp($line);
my ( $name, $area, $phone ) = split /\|/, $line;
print "$name $phone $phone\n";
}
close $IN;
Just to clarify the code above is opening a pipe delimited text file in the following format name|areacode|phone
It opens the file up and then it splits them into $name $area $phone; how does it go through the multiple lines of the file and print them out?
Going back to the perldoc quote above ("A scalar may contain a single value of a string, number, reference"), I am assuming that it has to be a reference, but it doesn't really seem like one, and if it is, it looks like it would be a reference to a scalar. So I am wondering what is going on internally that allows Perl to iterate through all of the lines in the code.
Nothing urgent, just something I noticed and was curious about. Thanks.
It looks like Borodin zeroed in on the part you wanted, but I'll add to it.
There are variables, which store things for us, and there are operators, which do things for us. A file handle, the thing you have in $IN, isn't the file itself or the data in the file. It's a connection that the program uses to get information from the file.
When you use the line input operator, <>, you give it a file handle to tell it where to grab the next line from. By itself, it defaults to ARGV, but you can put any file handle in there. In this case, you have <$IN>. Borodin already explained the reference and bareword stuff.
So, when you use the line input operator, it looks at the connection you give it, then gets a line from that file and returns it. You might be able to grok this more easily with its function form:
my $line = readline( $IN );
The thing you get back doesn't come out of $IN, but the thing it points to. Along the way, $IN keeps track of where it is in the file. See seek and tell.
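A small sketch of the handle keeping its own position. The phonebook data is made up, and an in-memory handle (assuming Perl 5.8 or later) replaces phonebook.txt.

```perl
use strict;
use warnings;

# Hypothetical phonebook data held in memory instead of phonebook.txt.
my $data = "alice|555|1234\nbob|555|9876\n";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";

my $before = tell $in;          # byte offset before any read
my $line   = readline($in);     # same as <$in> in scalar context
my $after  = tell $in;          # the handle has advanced past one line

print "offset moved from $before to $after\n";
close $in;
```

The scalar $in still holds a single value throughout; only the position recorded behind the handle changes.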
Along the same lines are Perl's regexes. Many people call something like /foo.*bar/ a regular expression. They are slightly wrong. There's a regular expression inside the pattern match operator //. The pattern is the instructions, but it doesn't do anything by itself until the operator uses it.
I find in my classes if I emphasize the difference between the noun and verb parts of the syntax, people have a much easier time with this sort of stuff.
Old Answer
Through each iteration of the while loop, exactly one value is put into the scalar variables. When the loop is done with a line, everything is reset.
The value in $line is a single value: the entire line which you have not broken up yet. Perl doesn't care what that single value looks like. With each iteration, you deal with exactly one line and that's what's in $line. Remember, these are variables, which means you can modify and replace their values, so they can only hold one thing at a time, but there can be multiple times.
The scalars $name, $area, and $phone have single values, each produced by split. Those are lexical variables (my), so they are only visible inside the specific loop iteration where they are defined.
Beyond that, I'm not sure which scalar you might be confused about.
The old-fashioned way of opening files is to use a bare name for the file handle, like so
open IN, 'phonebook.txt'
A file handle is a special type of value, like scalar, hash, array etc. but it has no prefix symbol to differentiate it. (This isn't actually the full extent of the truth, but I am worried about confusing you if I add even more detail.)
Perl still works like this, but it is best avoided for a couple of reasons.
All such file handles are global, and there is no way to restrict access to them by scope
There is no way to pass the value to a subroutine or store it in a data structure
So Perl was enhanced several years ago so that you can use references to file handles. These can be stored in scalar variables, arrays, or hashes, and can be passed as subroutine parameters.
What happens now when you write
open my $in, '<', 'phonebook.txt'
is that perl autovivifies an anonymous file handle, and puts a reference to it in variable $in, so yes, you were right, it is a reference. (Another thing that was changed about the same time was the move to three-parameter open calls, which allow you to open a file called, say, >.txt for input.)
I hope that helps you to understand. It's an unnecessary level of detail, but it can often help you to remember the way Perl works to understand the underlying details.
Incidentally, it is best to keep to lower-case letters for lexical variables, even for file handle references. I often add fh to the end to indicate that the variable holds a file handle, like $in_fh. But there's no need to use capitals, which are generally reserved for global variables like Package::Names.
Update - The Rest of the Story
I thought I should add something to explain what I have missed out, for fear of misleading people who care about the gory detail.
Perl keeps a symbol table hash - a stash - that works very much like an ordinary Perl hash. There is one such stash for each package, including the default package main. Note that this has nothing to do with lexical variables - declared with my - which are stored entirely separately.
The indexes for the stashes are the names of the package variables, without the initial symbol. So, for example, if you have
our $val;
our @val;
our %val;
then the stash will have only a single element, with a key of val and a value which is a reference to an intermediate structure called a typeglob. This is another hash structure, with one element for each different type of variable that has been declared. In this case our val typeglob will have three elements, for the scalar, array, and hash varieties of the val variables.
One of these elements may also be an IO variable type, which is where file handles are kept. But, for historical reasons, the value that is passed around as a file handle is in fact a reference to the typeglob that contains it. That is why, if you write open my $in, '<', 'phonebook.txt' and then print $in you will see something like GLOB(0x269581c) - the GLOB being short for typeglob.
Apart from that, the account above is accurate. Perl autovivifies an anonymous typeglob in the current package, and uses only its IO slot for the file handle.
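The glob reference can be inspected directly. This is a sketch only; the in-memory file is an assumption made to keep it self-contained (Perl 5.8 or later).

```perl
use strict;
use warnings;

# An in-memory file keeps the sketch self-contained (assumes Perl 5.8+).
my $content = "line\n";
open my $in, '<', \$content or die "Cannot open in-memory file: $!";

my $kind    = ref $in;          # what sort of reference the scalar holds
my $io_slot = *{$in}{IO};       # the IO object inside the anonymous typeglob

print "reference type: $kind\n";
print "printed handle: $in\n";  # looks like GLOB(0x...)
close $in;
```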
Scalars in Perl are denoted by a $ and they can indeed contain the types of values you mention in your question, but they can also contain a file handle. You can create file handles in Perl in two ways. One way is lexical
open my $filehandle, '>', '/path/to/file' or die $!;
and the other is global
open FILEHANDLE, '>', '/path/to/file' or die $!;
You should use the Lexical version which is what you're doing.
The while loop in your code uses the <> operator on your lexical filehandle which returns a line out of your file every time it's called, until it's out of lines (when End Of File is reached) in which case it returns false.
I went into a bit more detail on file handles as it seems it's a concept you're not completely clear on.

Special character in the file exit the while loop in Perl

I wrote a simple parser for a .txt file with the following instructions:
my $file2 = "test.txt";
open ($process, "<",$file2) or die "couldn't manage to open the file:$file2!";
while (<$process>)
{
...
}
In some files that I am trying to parse there is a special character that is like the right arrow (->) and that I don't manage to paste here from the file.
Every time the parser hits that character (->), it exits the file without processing it till the end.
Is there a way to avoid it and continue processing the file till the very end?
I am using perl 5.6.1 (I cannot use a newer one) and the files that I need to process might have these special characters.
Thanks for your help.
I don't think it's perl that's causing your problem, but almost certainly something in that middle missing block. Are you using eval in the while block on the input from the file? This is a minimal example that shows that the stream containing -> doesn't cause difficulties:
#!/usr/bin/env perl
use warnings;
use strict;
while(<DATA>) {
print "Data[$.]: $_";
}
__DATA__
this is some data
this is also some data
-> this looks fine
foo->dingle also looks fine
This produces:
$ perl ./foo.pl
Data[1]: this is some data
Data[2]: this is also some data
Data[3]: -> this looks fine
Data[4]: foo->dingle also looks fine
So, the -> characters in perl are special:
"-> " is an infix dereference operator, just as it is in C and C++. If
the right side is either a [...] , {...} , or a (...) subscript, then
the left side must be either a hard or symbolic reference to an array,
a hash, or a subroutine respectively. (Or technically speaking, a
location capable of holding a hard reference, if it's an array or hash
reference being used for assignment.) See perlreftut and perlref.
Otherwise, the right side is a method name or a simple scalar variable
containing either the method name or a subroutine reference, and the
left side must be either an object (a blessed reference) or a class
name (that is, a package name). See perlobj.
So if I had to guess, you're probably using eval to try to parse your content and it's failing to dereference the left side of the operator and crashing. Please provide command line error messages or the code in the while loop if you want further assistance.

Why is the perl print filehandle syntax the way it is?

I am wondering why the perl creators chose an unusual syntax for printing to a filehandle:
print filehandle list
with no comma after filehandle. I see that it's to distinguish between "print list" and "print filehandle list", but why was the ad-hoc syntax preferred over creating two functions - one to print to stdout and one to print to given filehandle?
In my searches, I came across the explanation that this is an indirect object syntax, but didn't the print function exist in perl 4 and before, whereas the object-oriented features came into perl relatively late? Is anyone familiar with the history of print in perl?
Since the comma is already used as the list constructor, you can't use it to separate semantically different arguments to print.
open my $fh, ...;
print $fh, $foo, $bar
would just look like you were trying to print the values of 3 variables. There's no way for the parser, which operates at compile time, to tell that $fh is going to refer to a file handle at run time. So you need a different character to syntactically (not semantically) distinguish between the optional file handle and the values to actually print to that file handle.
At this point, it's no more work for the parser to recognize that the first argument is separated from the second argument by blank space than it would be if it were separated by any other character.
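The difference the comma makes can be observed by capturing both outputs. In this sketch the two in-memory handles and the sample values are assumptions for illustration; select is used only so the second print does not reach the terminal.

```perl
use strict;
use warnings;

my ($file_text, $default_text) = ('', '');
open my $fh,      '>', \$file_text    or die "Cannot open in-memory file: $!";
open my $default, '>', \$default_text or die "Cannot open in-memory file: $!";
my ($foo, $bar) = ('foo', 'bar');

my $previous = select $default;  # make $default the default output handle
print $fh $foo, $bar;            # no comma: $fh is the filehandle
print $fh, $foo, $bar;           # comma: $fh is just the first value to print
select $previous;                # restore the original default handle
close $fh;
close $default;
```

After this runs, $file_text holds the two values, while $default_text starts with the stringified handle (something like GLOB(0x...)) followed by the same two values.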
If Perl had used the comma to make print look more like a function, the filehandle would always have to be included if you are including anything to print besides $_. That is the way functions work: If you pass in a second parameter, the first parameter must also be included. There isn't one function I can think of in Perl where the first parameter is optional when the second parameter exists. Take a look at split. It can be written using zero to four parameters. However, if you want to specify a <limit>, you have to specify the first three parameters too.
If you look at other languages, they all include two different ways to print: One if you want STDOUT, and another if you're printing to something besides STDOUT. Thus, Python has both print and write. C has both printf and fprintf. However, Perl can do this with just a single statement.
Let's look at the print statement a bit more closely -- thinking back to 1987 when Perl was first written.
You can think of the print syntax as really being:
print <filehandle> <list_to_print>
To print to OUTFILE, you would say:
print OUTFILE "This is being printed to myfile.txt\n";
The syntax is almost English-like (PRINT to OUTFILE the string "This is being printed to myfile.txt\n").
You can also do the same thing with STDOUT:
print STDOUT "This is being printed to your console";
print STDOUT " unless you redirected the output.\n";
As a shortcut, if the filehandle was not given, it would print to STDOUT or whatever filehandle the select was set to.
print "This is being printed to your console";
print " unless you redirected the output.\n";
select OUTFILE;
print "This is being printed to whatever the filehandle OUTFILE is pointing to\n";
Now, we see the thinking behind this syntax.
Imagine I have a program that normally prints to the console. However, my boss now wants some of that output printed to various files when required instead of STDOUT. In Perl, I could easily add a few select statements, and my problems would be solved. In Python, Java, or C, I would have to modify each of my print statements, and have some logic to use a file write to STDOUT (which may involve some conniptions in file opening and dupping to STDOUT).
Remember that Perl wasn't written to be a full-fledged language. It was written to do the quick and dirty job of parsing text files more easily and flexibly than awk did. Over the years, people used it because of its flexibility, and new concepts were added on top of the old ones. For example, before Perl 5, there was no such thing as references, which meant there was no such thing as object-oriented programming. If we, back in the days of Perl 3 or Perl 4, needed something more complex than the simple list, hash, or scalar variable, we had to munge it ourselves. It's not like complex data structures were unheard of. C had struct since its initial beginnings. Heck, even Pascal had the concept with records back in 1969 when people thought bellbottoms were cool. (We plead insanity. We were all on drugs.) However, since neither Bourne shell nor awk had complex data structures, why would Perl need them?
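A minimal sketch of that select trick, with an in-memory handle standing in for OUTFILE (an assumption, so the example needs no real file):

```perl
use strict;
use warnings;

# Capture output in a string via an in-memory handle (a stand-in for OUTFILE).
my $captured = '';
open my $out, '>', \$captured or die "Cannot open in-memory file: $!";

my $previous = select $out;     # select returns the handle that was current
print "redirected line\n";      # no filehandle given, so this goes to $out
select $previous;               # restore the original default (usually STDOUT)
close $out;

print "back on the default handle\n";
```

The print statements themselves are untouched; only the currently selected handle changes.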
Answer to "why" is probably subjective and something close to "Larry liked it".
Do note however, that indirect object notation is not a feature of print, but a general notation that can be used with any object or class and method. For example with LWP::UserAgent.
use strict;
use warnings;
use LWP::UserAgent;
my $ua = new LWP::UserAgent;
my $response = get $ua "http://www.google.com";
my $response_content = decoded_content $response;
print $response_content;
Any time you write method object, it means exactly the same as object->method. Note also that the parser seems to work reliably only as long as you don't nest such notations or use complex expressions to get the object, so unless you want to have lots of fun with brackets and quoting, I'd recommend against using it anywhere except the common cases of print, close and the rest of the IO methods.
Why not? it's concise and it works, in perl's DWIM spirit.
Most likely it's that way because Larry Wall liked it that way.

Finding files with Perl

File::Find and the wanted subroutine
This question is much simpler than the original title ("prototypes and forward declaration of subroutines"!) lets on. I'm hoping the answer, however simple, will help me understand subroutines/functions, prototypes and scoping and the File::Find module.
With Perl, subroutines can appear pretty much anywhere and you normally don't need to make forward declarations (except if the sub declares a prototype, which I'm not sure how to do in a "standard" way in Perl). For what I usually do with Perl there's little difference between these different ways of running somefunction:
sub somefunction; # Forward declares the function
&somefunction;
somefunction();
somefunction; # Bare word warning under `strict subs`
I often use find2perl to generate code which I crib/hack into parts of scripts. This could well be bad style and now my dirty laundry is public, but so be it :-) For File::Find the wanted function is a required subroutine - find2perl creates it and adds sub wanted; to the resulting script it creates. Sometimes, when I edit the script I'll remove the "sub" from sub wanted and it ends up as &wanted; or wanted();. But without the sub wanted; forward declaration form I get this warning:
Use of uninitialized value $_ in lstat at findscript.pl line 29
My question is: why does this happen and is it a real problem? It is "just a warning", but I want to understand it better.
The documentation and code say $_ is localized inside of sub wanted {}. Why would it be undefined if I use wanted(); instead of sub wanted;?
Is wanted using prototypes somewhere? Am I missing something obvious in Find/File.pm?
Is it because wanted returns a code reference? (???)
My guess is that the forward declaration form "initializes" wanted in some way so that the first use doesn't have an empty default variable. I guess this would be how prototypes - even Perl prototypes, such as they exist - would work as well. I tried grepping through the Perl source code to get a sense of what sub is doing when a function is called using sub function instead of function(), but that may be beyond me at this point.
Any help deepening (and speeding up) my understanding of this is much appreciated.
EDIT: Here's a recent example script here on Stack Overflow that I created using find2perl's output. If you remove the sub from sub wanted; you should get the same error.
EDIT: As I noted in a comment below (but I'll flag it here too): for several months I've been using Path::Iterator::Rule instead of File::Find. It requires perl >5.10, but I never have to deploy production code at sites with odd, "never upgrade", 5.8.* only policies so Path::Iterator::Rule has become one of those modules I never want to do without. Also useful is Path::Class. Cheers.
I'm not a big fan of File::Find. It just doesn't work right. The find command doesn't return a list of files, so you either have to use a non-local array variable in your find to capture your list of files you've found (not good), or place your entire program in your wanted subroutine (even worse). Plus, the separate subroutine means that your logic is separate from your find command. It's just ugly.
What I do is inline my wanted subroutine inside my find command. Subroutine stays with the find. Plus, my non-local array variable is now just part of my find command and doesn't look so bad
Here's how I handle the File::Find -- assuming I want files that have a .pl suffix:
my @file_list;
find ( sub {
return unless -f; #Must be a file
return unless /\.pl$/; #Must end with `.pl` suffix
push @file_list, $File::Find::name;
}, $directory );
# At this point, @file_list contains all of the files I found.
This is exactly the same as:
my @file_list;
find ( \&wanted, $directory );
sub wanted {
return unless -f;
return unless /\.pl$/;
push @file_list, $File::Find::name;
}
# At this point, @file_list contains all of the files I found.
Inlining just looks nicer, and it keeps my code together. Plus, my non-local array variable doesn't look so freaky.
I also like taking advantage of the shorter syntax in this particular way. Normally, I don't like using the inferred $_, but in this case, it makes the code much easier to read. My original Wanted is the same as this:
sub wanted {
my $file_name = $_;
if ( -f $file_name and $file_name =~ /\.pl$/ ) {
push @file_list, $File::Find::name;
}
}
File::Find isn't that tricky to use. You just have to remember:
When you find a file you don't want, you use return to go to the next file.
$_ contains the file name without the directory, and you can use that for testing the file.
The file's full name is $File::Find::name.
The file's directory is $File::Find::dir.
And, the easiest way is to push the files you want into an array, and then use that array later in your program.
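Put together as something runnable, the points above might look like this. The temporary directory and the three file names are hypothetical, created only so the search has something to find.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a small hypothetical directory tree to search.
my $directory = tempdir(CLEANUP => 1);
mkdir "$directory/sub" or die "Cannot mkdir: $!";
for my $name ('a.pl', 'b.txt', 'sub/c.pl') {
    open my $fh, '>', "$directory/$name" or die "Cannot create $name: $!";
    close $fh;
}

my @file_list;
find(
    sub {
        return unless -f;                     # $_ is the name within the current dir
        return unless /\.pl$/;                # keep only names ending in .pl
        push @file_list, $File::Find::name;   # full path, starting from $directory
    },
    $directory
);

@file_list = sort @file_list;
print "$_\n" for @file_list;
```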
Removing the sub from sub wanted; just makes it a call to the wanted function, not a forward declaration.
However, the wanted function hasn't been designed to be called directly from your code - it's been designed to be called by File::Find. File::Find does useful stuff like populating $_ before calling it.
There's no need to forward-declare wanted here, but if you want to remove the forward declaration, remove the whole sub wanted; line - not just the word sub.
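The difference between the declaration and the call can be shown with a counter. This is only an illustration: the body of wanted here is a stand-in that counts calls, not a real File::Find callback.

```perl
use strict;
use warnings;

my $calls = 0;

sub wanted;                       # forward declaration only: nothing runs here
my $after_declaration = $calls;

wanted();                         # an actual call: the body runs now
my $after_call = $calls;

sub wanted { $calls++ }           # stand-in body that just counts invocations

print "calls after declaration: $after_declaration\n";
print "calls after call: $after_call\n";
```

In the find2perl script, that stray call runs the real wanted body before File::Find has set $_, which is where the uninitialized value warning comes from.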
Instead of File::Find, I would recommend using the find_wanted function from File::Find::Wanted.
find_wanted takes two arguments:
a subroutine that returns true for any filename that you would want.
a list of the directories you want to search.
find_wanted returns an array containing the list of filenames that it found.
I used code like the following to find all the JPEG files in certain directories on a computer:
my @files = find_wanted( sub { -f && /\.jpg$/i }, @dirs );
Explanation of some of the syntax, for those that might need it:
sub {...} is an anonymous subroutine, where ... is replaced with the code of the subroutine.
-f checks that a filename refers to a "plain file"
&& is boolean and
/\.jpg$/i is a regular expression that checks that a filename ends in .jpg (case insensitively).
@dirs is an array containing the directory names to be searched. A single directory could be searched as well, in which case a scalar works too (e.g. $dir).
Why not use open and invoke the shell find? The user can edit $findcommand (below) to be anything they want, or can define it in real time based on arguments and options passed to a script.
#!/usr/bin/perl
use strict; use warnings;
my $findcommand='find . -type f -mtime 0';
open(FILELIST,"$findcommand |")||die("can't open $findcommand |");
my @filelist=<FILELIST>;
close FILELIST;
my $Nfilelist = scalar(@filelist);
print "Number of files is $Nfilelist \n";