Perl '-s' file test operator problem - perl

I'm debugging a piece of code which uses the Perl '-s' function to get the size of some files.
my $File1 = '/myfolder/.../mysubfolder1/document.pdf';
my $File2 = '/myfolder/.../mysubfolder2/document.pdf';
my $File3 = '/myfolder/.../mysubfolder1/document2.pdf';  # $File3 is actually a link to /myfolder/.../mysubfolder2/document.pdf, a.k.a. $File2
The code which is buggy is:
my $size = int((-s $File)/1024);
where $File is replaced with each of $File1 through $File3 in turn.
For some reason I can't explain, this does not work on every file.
For $File1 and $File3 it works, but not for $File2. I could understand it if both $File2 and $File3 failed; that would mean the file /myfolder/.../mysubfolder2/document.pdf is somehow corrupt.
I even added a test if (-e $File) { before the -s to be sure the file exists, and all three files do exist.
There is an even stranger thing: there is an .htaccess in /myfolder/.../mysubfolder1/ but no .htaccess in /myfolder/.../mysubfolder2/. If it were the other way around, I would suspect the .htaccess was somehow blocking the -s call.
Any thoughts?

If -s fails, it returns undef and sets the error in $!. What is $!?

I suspect that if you check the size of that file with stat you will get something less than 1024 bytes :)
Your int((-s $fn)/1024) will return 0 if the size is less than 1024.

To address the end of your question: an .htaccess file controls access to files requested through the web server. Once the user requests a URL which executes a valid, permitted CGI/whatever script (I'm assuming your Perl code runs in a web context), THAT script has absolutely no permission issues regarding .htaccess (unless you actually code your Perl to read its contents and respect them explicitly by hand).
The only permissions that can screw up your Perl script are the file system permissions in your OS.
To get the file size, your web user needs:
Execute permission on the directory containing the file
Possibly, read permission on the directory containing the file (not sure whether the file size is stored in the inode)
Possibly, read permission on the file itself.
If all your 3 files (2 good and 1 bad) are in the same directory, check the file's read permissions.
If they are in different directories, check the file's read perms AND directory perms.

Change int((-s $file)/1024) to something like sprintf('%.2f', (-s $file)/1024) and you'll see something then; the file is probably under 1024 bytes, so int() will happily return 0.
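Putting the two answers together, a minimal sketch that both reports a failed -s via $! and keeps sub-kilobyte sizes visible (the error message wording is mine):
my $bytes = -s $File;
defined $bytes or die "-s failed for $File: $!";   # undef means the underlying stat failed
my $kb = sprintf '%.2f', $bytes / 1024;            # keeps sizes under 1 KB from showing as 0
print "$File is $kb KB ($bytes bytes)\n";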

Related

Is this a standard Perl language construction or a customization: open HANDLE, ">$fname"

Not a Perl guru, working with an ancient script, ran into a construct I didn't recognize that yields results I don't expect. Curious whether this is the standard language, or a PM customization of sorts:
open FILE1, ">./$disk_file" or die "Can't open file: $disk_file: $?";
From the looks of this, the file is to be opened for writing, but the log error says the file is not found. I thought Perl's file I/O expected 3 parameters, not 2. The log doesn't have the die output; instead it says: "File not found".
Confused a bit here.
EDIT: Made it work using the answers below. It seems I was running a cached version of the .pl for some time, instead of the newly-edited one. Finally it caught up with the 2-param open; thanks y'all for your help!
That is the old 2-argument form of open. The second argument is a bit magical:
if it starts with '>' the remainder of the string is used as the name of a file to open for writing
if it starts with '<' the remainder of the string is used as the name of a file to open for reading (this is the default if '<' is omitted)
if it ends with '|' the string up to that point is interpreted as a command which is executed with its STDOUT connected to a pipe which your script will open for reading
if it starts with '|' the string after that point is interpreted as a command which is executed with its STDIN connected to a pipe which your script will open for writing
This is a potential security vulnerability, because if your script accepts a filename as user input, the user can add a '|' at the beginning or end to trick your script into running a command (see the sketch below).
The 3-argument form of open was added in version 5.6, so it has been a standard part of Perl for a very long time.
The FILE1 part is known as a bareword filehandle - which is a global. Modern style would be to use a lexical scalar like my $file1 instead.
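To make the pipe case concrete, here is a small sketch; the filename is an invented example, not something from the question:
my $name = "echo gotcha |";                 # imagine this arrived as user input

# 2-argument open: the trailing '|' makes Perl run the string as a command
open my $fh, $name or die "open failed: $!";
print <$fh>;                                # prints "gotcha" - a command was executed
close $fh;

# 3-argument open: $name is only ever treated as a literal filename, so the
# trick fails (here the open simply fails, because no such file exists)
open my $fh2, '<', $name
    or warn "3-arg open refused to run it as a command: $!\n";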
See perldoc -f open for details but, in brief...
Perl's open() will accept either two or three parameters (there's even a one-parameter version - which no-one ever uses). The two-parameter version is a slightly older style where the open mode and the filename are joined together in the second parameter.
So what you have is equivalent to:
open FILE1, '>', "./$disk_file" or die "Can't open file: $disk_file: $?";
A couple of other points.
We prefer to use lexical variables as filehandles these days (so, open my $file1, ... instead of open FILE1, ...).
I think you'll find that $! will be more useful in the error message than $?. $? contains the error from a child process, but there's no child process here.
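Putting those two points together, a sketch of the modernised call (the print and close lines are only there to round out the example):
open my $file1, '>', "./$disk_file"
    or die "Can't open file: $disk_file: $!";
print {$file1} "whatever the script writes\n";   # placeholder payload
close $file1 or die "Can't close $disk_file: $!";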
Update: And none of this seems to be causing the problems that you're seeing. That seems to be caused by a file actually not being in the expected place. Can you please edit your question to add the exact error message that you're seeing?
The other answers here are correct: that's the two-argument syntax. They've done a good job covering why and how you should ideally change it, so I won't rehash that here.
However, they haven't tried to help you fix it, so let me try that...
This is a guess, but I suspect $disk_file contains a filename with a path (e.g. my_logs/somelog.log), and the directory part (my_logs in my entirely guessed example) doesn't exist, so the open is throwing an error. You could create that directory, or alter whatever sets that variable so it's writing to a location that does exist.
Bear in mind these paths will be relative to wherever you're running the script from - not relative to the script itself, so if there's a log directory (or whatever) in the same dir as the script you may want to cd to the script's dir first.

Change output filename from WGET when using input file option

I have a perl script that I wrote that gets some image URLs, puts the urls into an input file, and proceeds to run wget with the --input-file option. This works perfectly... or at least it did as long as the image filenames were unique.
I have a new company sending me data and they use a very TROUBLESOME naming scheme. All files have the same name, 0.jpg, in different folders.
for example:
cdn.blah.com/folder/folder/202793000/202793123/0.jpg
cdn.blah.com/folder/folder/198478000/198478725/0.jpg
cdn.blah.com/folder/folder/198594000/198594080/0.jpg
When I run my script with this, wget works fine and downloads all the images, but they are titled 0.jpg.1, 0.jpg.2, 0.jpg.3, etc. I can't just count them and rename them because files can be broken, not available, whatever.
I tried running wget once for each file with -O, but it's embarrassingly slow: starting the program, connecting to the site, downloading, and ending the program. Thousands of times. It's an hour vs minutes.
So, I'm trying to find a method to change the output filenames from wget without it taking so long. The original approach works so well that I don't want to change it too much unless necessary, but I am open to suggestions.
Additional:
LWP::Simple is too simple for this. Yes, it works, but very slowly. It has the same problem as running individual wget commands: each get() or getstore() call makes the system re-connect to the server. Since the files are so small (60 kB on average) and there are so many to process (1851 for this one test file alone), the connection time is considerable.
The filename I will be using can be found with /\/(\d+)\/(\d+\.jpg)/i, where the new filename will simply be $1 . $2, giving 2027931230.jpg. Not really important for this question.
I'm now looking at LWP::UserAgent with LWP::ConnCache, but it times out and/or hangs on my PC. I will need to adjust the timeout and retry values. The inaugural run of the code downloaded 693 images (43 MB) in just a couple of minutes before it hung. Using LWP::Simple, I only got 200 images in 5 minutes.
use LWP::UserAgent;
use LWP::ConnCache;
chomp(my @filelist = <INPUTFILE>);               # INPUTFILE is opened earlier in the script
my $browser = LWP::UserAgent->new;
$browser->conn_cache(LWP::ConnCache->new());
foreach (@filelist) {
    next unless /\/(\d+)\/(\d+\.jpg)/i;          # skip lines that don't match
    my $newfilename = $1 . $2;
    my $response = $browser->mirror($_, $folder . $newfilename);   # $folder is set elsewhere
    die 'response failure' if $response->is_error();
}
LWP::Simple's getstore function allows you to specify a URL to fetch from and the filename to store the data from it in. It's an excellent module for many of the same use cases as wget, but with the benefit of being a Perl module (i.e. no need to outsource to the shell or spawn off child processes).
use LWP::Simple;
# Grab the filename from the end of the URL
my $filename = (split '/', $url)[-1];
# If the file exists, increment its name
while (-e $filename)
{
    $filename =~ s{ (\d+)[.]jpg }{ $1+1 . '.jpg' }ex
        or die "Unexpected filename encountered";
}
getstore($url, $filename);
The question doesn't specify exactly what kind of renaming scheme you need, but this will work for the examples given by simply incrementing the filename until the current directory doesn't contain that filename.
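If you would rather use the naming scheme described in the question (the two trailing path components joined together) instead of incrementing, here is a rough sketch; @urls and $folder are assumed to come from the rest of your script:
use LWP::Simple;

for my $url (@urls) {
    my ($dir_part, $file_part) = $url =~ m{/(\d+)/(\d+\.jpg)$}i
        or next;                                         # skip URLs that don't match
    my $status = getstore($url, $folder . $dir_part . $file_part);
    warn "Failed to fetch $url (HTTP $status)\n" unless is_success($status);
}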

Perl File::Copy isn't working

I am trying to copy a file to a new filename using File::Copy but getting an error saying the file doesn't exist.
print "\nCopying $hash->{Filename1} to $hash->{Filename2}.\n";
copy( $hash->{Filename1}, $hash->{Filename2} ) or die "Unable to copy model. Copy failed: $!";
I have checked that both references are populated (by printing them) and that $hash->{Filename1} does actually exist - and it does.
My error message is this:
Unable to copy model. Copy failed: No such file or directory at B:\Script.pl line 467.
Anyone got any ideas of what I might have done wrong? I use this exact same line earlier in my script with no problems so I'm a bit confused.
Is there a file size limit on File::Copy?
Many thanks.
Filename1 may exist but what about Filename2?
Your error message states "No such file or directory at ..." so I'd be investigating the possibility that the directory you're trying to copy the file to is somehow deficient.
You may also want to check permissions if the destination directory and file do exist.
First step is to print out both file names before attempting the copy so that you can see what they are, and investigate the problem from that viewpoint. You should also post those file names in your question so that we can help further. It may well be that there's a dodgy character in one of the filenames, such as a newline you forgot to chomp off.
Regarding your question on file size limits, I don't believe the module itself imposes one. If you don't provide a buffer size, it uses a maximum of 2 MB for the chunks used for transferring data, but there's nothing in the module that restricts the overall size.
It may be that the underlying OS restricts it but, unless your file is truly massive or you're very low on disk space, that's not going to come into play. However, since you appear to be working from the B: drive, that may be a possibility you want to check. I wasn't even aware people used floppy disks any more :-)
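One quick way to rule out a missing or unwritable destination directory (these checks are my own sketch; the hash keys are from the question):
use File::Basename qw(dirname);

my $dest_dir = dirname( $hash->{Filename2} );
warn "Destination directory '$dest_dir' does not exist\n" unless -d $dest_dir;
warn "Destination directory '$dest_dir' is not writable\n" unless -w $dest_dir;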
Check that there is no extra whitespace or other hard to spot problems with your filename variables with:
use Data::Dumper;
$Data::Dumper::Useqq = 1;
print Dumper( { filename1 => $hash->{Filename1}, filename2 => $hash->{Filename2} } );

perl file size calculation not working

I am trying to write a simple perl script that will iterate through the regular files in a directory and calculate the total size of all the files put together. However, I am not able to get the actual size of the file, and I can't figure out why. Here is the relevant portion of the code. I put in print statements for debugging:
$totalsize = 0;
while ($_ = readdir (DH)) {
    print "current file is: $_\t";
    $cursize = -s _;
    print "size is: $cursize\n";
    $totalsize += $cursize;
}
This is the output I get:
current file is: test.pl size is:
current file is: prob12.pl size is:
current file is: prob13.pl size is:
current file is: prob14.pl size is:
current file is: prob15.pl size is:
So the file size remains blank. I tried using $cursize = -s $_ instead, but the only effect of that was to retrieve the file sizes for the current and parent directories as 4096 bytes each; it still didn't get any of the actual file sizes for the regular files.
I have looked online and through a couple of books I have on Perl, and it seems that Perl isn't able to get the file sizes because the script can't read the files. I tested this by putting in an if statement:
print "Cannot read file $_\n" if (! -r _);
Sure enough, for each file I got the message saying that the file could not be read. I do not understand why this is happening. The directory that has the files in question is a subdirectory of my home directory, and I am running the script as myself from another subdirectory in my home directory. I have read permissions on all the relevant files. I tried changing the mode on the files to 755 (from the previous 711), but I still got the "Cannot read file" output for each file.
I do not understand what's going on. Either I am mixed up about how permissions work when running a perl script, or I am mixed up about the proper way to use -s _. I appreciate your guidance. Thanks!
If it isn't just your typo (-s _ instead of the correct -s $_), then please remember that readdir returns file names relative to the directory you've opened with opendir. The proper way would be something like:
my $base_dir = '/path/to/somewhere';
opendir DH, $base_dir or die;
while ($_ = readdir DH) {
    print "size of $_: " . (-s "$base_dir/$_") . "\n";
}
closedir DH;
You could also take a look at the core module IO::Dir which offers a tie way of accessing both the file names and the attributes in a simpler manner.
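A rough sketch of that tie interface applied to the size-summing task from the question (the path is a placeholder):
use IO::Dir;
use Fcntl ':mode';

tie my %dir, 'IO::Dir', '/path/to/somewhere';    # hash values are File::stat objects
my $totalsize = 0;
for my $name (keys %dir) {
    my $stat = $dir{$name} or next;              # skip entries that can't be stat'ed
    next unless S_ISREG($stat->mode);            # regular files only
    $totalsize += $stat->size;
}
print "total: $totalsize bytes\n";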
You have a typo:
$cursize = -s _;
Should be:
$cursize = -s $_;

Can I get the MD5sum of a directory with Perl?

I am writing a Perl script (on Windows) that uses File::Find to index a network file system. It works great, but it takes a very long time to crawl the file system. I was thinking it would be nice to somehow get a checksum of a directory before traversing it, and if the checksum matches the checksum taken on a previous run, not traverse the directory. This would eliminate a lot of processing, since the files on this file system do not change often.
On my AIX box, I use this command:
csum -h MD5 /directory
which returns something like this:
5cfe4faf4ad739219b6140054005d506 /directory
The command takes very little time:
time csum -h MD5 /directory
5cfe4faf4ad739219b6140054005d506 /directory
real 0m0.00s
user 0m0.00s
sys 0m0.00s
I have searched CPAN for a module that will do this, but it looks like all the modules will give me the MD5sum for every file in a directory, not for the directory itself.
Is there a way to get the MD5sum for a directory in Perl, or even in Windows for that matter as I could call a Win32 command from Perl?
Thanks in advance!
Can you just read the last-modified dates of the files and folders? Surely that's going to be faster than building MD5s?
In order to get a checksum you must read the files, this means you will need to walk the filesystem, which puts you back in the same boat you are trying to get out of.
From what I know you cannot get an md5 of a directory. md5sum on other systems complains when you give it a directory. csum is most likely giving you a hash of the contents of the top-level directory file itself (its entry listing), not a hash computed by traversing the tree.
You can grab the modified times for the files and hash them how you like by doing something like this:
sub dirModified {
    my ($dir) = @_;
    opendir(DIR, $dir) or die "Cannot open $dir: $!";
    my @dircontents = readdir(DIR);
    closedir(DIR);
    foreach my $item (@dircontents) {
        if ( -f "$dir/$item" ) {                      # prepend $dir: readdir gives bare names
            my $age = -M "$dir/$item";                # age in days since last modification
            print "$age : $item - do stuff here\n";
        } elsif ( -d "$dir/$item" && $item !~ /^\.+$/ ) {
            dirModified("$dir/$item");
        }
    }
}
Yes it will take some time to run.
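If you want to roll those modification times up into a single per-directory value you can compare between runs, here is one possible sketch (the sub name and the string format are my own):
use Digest::MD5;

sub dir_signature {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @entries = sort grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    my $md5 = Digest::MD5->new;
    for my $name (@entries) {
        my $mtime = (stat "$dir/$name")[9] || 0;      # 0 if the stat fails
        $md5->add("$name:$mtime;");                   # feed name + mtime into the digest
    }
    return $md5->hexdigest;                           # changes if anything is added, removed, or touched
}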
In addition to the other good answers, let me add this: if you want a checksum, then please use a checksum algorithm instead of a (broken!) hash function.
I don't think you need a cryptographically secure hash function in your file indexer -- instead you need a way to see if there are changes in the directory listings without storing the entire listing. Checksum algorithms do that: they return a different output when the input is changed. They might do it faster since they are simpler than hash functions.
It is true that a user could change a directory in a way that wouldn't be discovered by the checksum. However, a user would have to change the file names like this on purpose since normal changes in file names will (with high probability) give different checksums. Is it then necessary to guard against this "attack"?
One should always consider the consequences of each attack and choose the appropriate tools.
I did one of these in Python, if you're interested:
http://akiscode.com/articles/sha-1directoryhash.shtml