Confusion about file counting - perl

I've seen numerous posts about using Perl to count specific files in a directory, one of which showed this code:
@filesInDirectory = glob("$directory$fileNameRegex");
$numberOfFiles = @filesInDirectory;
Then if I follow this with:
print LOGFILE $#filesInDirectory;
print LOGFILE $numberOfFiles;
The log will contain 01. My confusion is this: why is $numberOfFiles 1 instead of 0 when the file I'm looking for doesn't exist?

$numberOfFiles contains the number of elements in @filesInDirectory. Since @filesInDirectory contains one element, $numberOfFiles contains 1.
$#filesInDirectory contains the index of the last element in @filesInDirectory. Since @filesInDirectory contains one element ($filesInDirectory[0]), $#filesInDirectory contains 0.
Unless you mess with $[ (don't!!), $#a will always be one less than @a in scalar context.

Filename globs are not regular expressions, so if $fileNameRegex is what its name implies, it's the wrong thing to use here.
Globs are the pattern-matching language you use to expand filenames into a shell command line: * matches any string of characters (equivalent to .* in a regular expression), ? matches a single character (equivalent to . in a regular expression), and [] works pretty much the same as in regular expressions.
If you need to test filenames against a regular expression, you'll have to do a grep on the list you get back from glob (or readdir).
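To make the difference concrete, here is a minimal sketch. The file names and the temporary directory are made up for illustration. It relies on Perl's default glob behaviour, where a pattern without wildcard characters is handed back unchanged even if no such file exists (which would explain a count of 1 for a missing file), while a wildcard pattern that matches nothing yields an empty list. The last part shows the grep-on-readdir approach for a real regular expression.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Build a throwaway directory with hypothetical file names.
my $directory = tempdir(CLEANUP => 1);
for my $name (qw(log1.txt log22.txt notes.md)) {
    open my $fh, '>', "$directory/$name" or die "Cannot create $name: $!";
    close $fh;
}

# A wildcard pattern that matches nothing gives an empty list.
my @by_glob = glob("$directory/*.csv");

# A pattern with no wildcard is returned as-is, even though the file is missing.
my @literal = glob("$directory/missing.txt");

# For a real regular expression, read the names and filter them with grep.
opendir my $dh, $directory or die "Cannot open $directory: $!";
my @by_regex = grep { /^log\d+\.txt$/ } readdir $dh;
closedir $dh;

print scalar(@by_glob), " ", scalar(@literal), " ", scalar(@by_regex), "\n";
```

If a literal name has to be counted, filtering the glob result with a file test such as grep { -e } is one way to drop names that do not exist.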

If you're wondering why @filesInDirectory has something in it instead of being empty, then you already have a solution.
But if you're asking why $numberOfFiles is different from $#filesInDirectory, then it's because the first contains the number of elements in @filesInDirectory while the second contains the index of the last element of @filesInDirectory.
Perl indices start at zero, so if @array has exactly one element then $#array will be zero. If @array is empty then $#array will be -1.
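A small sketch of the count versus last-index distinction, using made-up arrays (the file name is only a placeholder):

```perl
use strict;
use warnings;

my @empty = ();
my @one   = ('report.txt');    # hypothetical file name

my $count_empty = @empty;      # array in scalar context: number of elements
my $last_empty  = $#empty;     # index of the last element
my $count_one   = @one;
my $last_one    = $#one;

print "$last_empty $count_empty\n";    # -1 0
print "$last_one $count_one\n";        # 0 1
```

Printing the two values back to back, as the question does, gives 01 for the one-element case and -10 for the empty case.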

Adding delimiters to end of file

I am working on a TPT script to process some large files we have. Right now, the fields in each record are separated by a delimiter, |.
The problem is that not all fields are used by each record. For example, record 1 may have 100 fields and record 2 may have 260. For TPT to work, we need a delimiter for each field, so for records that have fewer than 261 fields populated, I need to append the appropriate number of pipes to the end of the record.
So, taking my example above, record one would have 161 pipes appended to the end and record two would have 1.
I have a Perl script which will count the number of pipes in each record, but I am not sure how to take that info and append that many pipes to the record.
perl -ne 'print scalar(split(/\|/, $_)) . "\n"'
Any advice?
To get the number of pipe symbols, you can use the tr operator.
my $count = tr/|//;
Just subtract the number of pipe symbols from the maximal number to get the number of pipes to add, use the x (times) operator to get them:
perl -lne 'print $_, "|" x (260 - tr/|//)'
I'm not sure the number is correct; it depends on whether the pipes also start or end the line.
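The same idea as a small script, as a sketch only. The helper name pad_record and the sample record are hypothetical, and the target count is passed in because the right number depends on the file layout, as noted above. It also guards against records that already have more pipes than the target.

```perl
use strict;
use warnings;

# Hypothetical helper: append pipes until the record holds $target of them.
sub pad_record {
    my ($record, $target) = @_;
    my $have    = ($record =~ tr/|//);    # tr with an empty replacement only counts
    my $missing = $target - $have;
    $missing = 0 if $missing < 0;         # never try to remove pipes
    return $record . ('|' x $missing);
}

my $padded = pad_record('a|b|c', 5);      # 2 pipes present, 3 appended
print "$padded\n";                        # a|b|c|||
```

For the real files the call would use the actual target, for example pad_record($line, 260) after chomp, if 260 is indeed the right count.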

Convert Perl to Shell

I have a Perl script that I use to SNMP walk devices. However, the server I have available does not allow me to install all the modules needed, so I need to convert the script to shell (sh). I can run the script on individual devices, but I would like it to read from a text file like it did in Perl. The Perl script starts with:
open(TEST, "cat test.txt |");
@records=<TEST>;
close(TEST);
foreach $line (@records)
{
($field1, $field2, $field3)=split(/\s+/, $line);
# Run and record SNMP walk results.
Depending on exactly what the input is and what you are trying to do, that perl code fragment would likely translate to:
while read field1 field2 field3
do
# Run and record SNMP walk results.
echo "1=$field1 2=$field2 3=$field3"
done <text.txt
For example, if text.txt is:
$ cat text.txt
one two three
i ii iii
Then, the above code produces the output:
1=one 2=two 3=three
1=i 2=ii 3=iii
As you can see, the shell read command reads a line (record) at a time and also does splitting on whitespace. There are many options for read to control whether newlines or something else divide records (-d) and whether splitting is to be done on whitespace or something else (IFS) or whether backslashes in the input are to be treated as escape characters or not (-r). See man bash.
while read string; do
str1=${string%% *}
str3=${string##* }
temp=${string#$str1 }
str2=${temp%% *}
echo $str1 $str2 $str3
done <test.txt
alternate version
while read string; do
str1=${string%% *}
temp=${string#$str1 }
str2=${temp%% *}
temp=${string#$str1 $str2 }
str3=${temp%% *}
echo $str1 $str2 $str3
done <test.txt
POSIX substring parameter expansion
${parameter%word}
Remove Smallest Suffix Pattern. The word shall be expanded to produce
a pattern. The parameter expansion shall then result in parameter,
with the smallest portion of the suffix matched by the pattern
deleted.
${parameter%%word}
Remove Largest Suffix Pattern. The word shall be expanded to produce a
pattern. The parameter expansion shall then result in parameter, with
the largest portion of the suffix matched by the pattern deleted.
${parameter#word}
Remove Smallest Prefix Pattern. The word shall be expanded to produce
a pattern. The parameter expansion shall then result in parameter,
with the smallest portion of the prefix matched by the pattern
deleted.
${parameter##word}
Remove Largest Prefix Pattern. The word shall be expanded to produce a
pattern. The parameter expansion shall then result in parameter, with
the largest portion of the prefix matched by the pattern deleted.

What does =~/^0$/ mean in Perl?

I'm new to Perl and I have been learning about the Perl basics for past two days.
I'm converting a Perl script to Java program gradually.
In the Perl script, I came across this code.
if( $arr[$i]=~/^0$/ ){
...
...
}
I know that $arr[$i] means getting the ith element from the array arr.
But what does =~/^0$/ mean?
To what are they comparing the array's element?
I searched for this, but I couldn't find it.
Could someone please explain it to me?
FYI, the arr contains floating values.
if ($arr[$i] =~ /^0$/) is roughly equivalent to if ($arr[$i] eq "0"), but not exactly the same, as it will match both the strings "0" and "0\n". If $arr[$i] was read from a file or stdin and has not been chomped, this can be a very significant distinction.
if ($arr[$i] == 0), on the other hand, will match any string beginning with a non-numeric character or a string of zeroes/whitespace which is not followed by a numeric character, although it will generate a warning if the string contains non-whitespace, non-digit characters or contains only whitespace (and warnings are enabled, of course).
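The three tests can be compared side by side. This is a sketch with sample strings chosen for illustration; numeric warnings are switched off locally because "abc" == 0 would otherwise warn.

```perl
use strict;
use warnings;

my @samples = ("0", "0\n", "0.0", "00", "abc");    # illustrative inputs
my %result;

for my $s (@samples) {
    no warnings 'numeric';
    $result{$s} = join ',',
        ($s =~ /^0$/ ? 1 : 0),    # regex test
        ($s eq "0"   ? 1 : 0),    # exact string test
        ($s == 0     ? 1 : 0);    # numeric test
    (my $shown = $s) =~ s/\n/\\n/;
    print "$shown: $result{$s}\n";
}
```

Only "0" passes all three tests, "0\n" passes the regex but not eq, and the remaining samples pass only the numeric comparison.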
=~ is a binding operator.
"Binary "=~" binds a scalar expression to a pattern match"
/^0$/ on the right hand side is the regex
^ Match the beginning of the line
$ Match the end of the line (or before newline at the end)
And the zero has no special meaning.
^ and $ are regex anchors, which say that $arr[$i] should begin with 0 and that the end of the string comes immediately after it.
It can be written as
if ($arr[$i] eq "0" or $arr[$i] eq "0\n")

Split by dot using Perl

I use the split function in two ways. The first way (string argument to split):
my $string = "chr1.txt";
my @array1 = split(".", $string);
print $array1[0];
I get this error:
Use of uninitialized value in print
When I split the second way (regular expression argument to split), I don't get any errors.
my @array1 = split(/\./, $string); print $array1[0];
The first way of splitting fails only for the dot.
What is the reason behind this?
"\." is just "."; be careful with escape sequences.
If you want a backslash and a dot in a double-quoted string, you need "\\.". Or use single quotes: '\.'
If you just want to parse file names and get their suffixes, it is better to use the fileparse() function from File::Basename.
Additional details to the information provided by Mat:
In split "\.", ... the first parameter to split is first interpreted as a double-quoted string before being passed to the regex engine. As Mat said, inside a double-quoted string, a \ is the escape character, meaning "take the next character literally", e.g. for things like putting double quotes inside a double-quoted string: "\""
So your split gets passed "." as the pattern. A single dot means "split on any character". As you know, the split pattern itself is not part of the results. So you have several empty strings as the result.
But why is the first element undefined instead of empty? The answer lies in the documentation for split: if you don't impose a limit on the number of elements returned by split (its third argument) then it will silently remove empty results from the end of the list. As all items are empty the list is empty, hence the first element doesn't exist and is undefined.
You can see the difference with this particular snippet:
my @p1 = split "\.", "thing";
my @p2 = split "\.", "thing", -1;
print scalar(@p1), ' ', scalar(@p2), "\n";
It outputs 0 6.
The "proper" way to deal with this, however, is what @soulSurfer2010 said in his post.
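For comparison, a short sketch of the working alternatives mentioned above: a regex literal, a single-quoted string, and fileparse() from File::Basename. The suffix pattern passed to fileparse is one reasonable choice, not the only one.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

my $string = "chr1.txt";

my @by_regex  = split /\./, $string;    # regex literal: dot is escaped
my @by_single = split '\.', $string;    # single quotes keep the backslash

# fileparse returns (name, directory, suffix); the suffix pattern is our choice.
my ($name, $dir, $suffix) = fileparse($string, qr/\.[^.]*/);

print "$by_regex[0] $by_regex[1]\n";      # chr1 txt
print "$by_single[0] $by_single[1]\n";    # chr1 txt
print "$name $suffix\n";                  # chr1 .txt
```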

Why am I only getting one match?

I am parsing a log file, trying to pull out the lines where the value after "failures=" is a unique non-zero number.
The first part of my perl one liner will pull out all the lines where "failures" are greater than zero. But that part of the log file repeats until a new failure occurs, i.e., after the first failure the log entries will be "failures=1" until the second error then it will read, "failures=2".
What I'd like to do is pull only the first line where that value changes and I thought I had it with this:
cat -n Logstats.out | perl -nle 'print "Line No. $1: failures=$2; eventDelta=$3; tracking_id=$4" if /\s(\d+)\t.*failures=(\d+).*eventDelta=(.\d+).*tracking_id="(\d+)"/ && $2 > 0' | perl -ne 'print unless $a{/failures=\d+/}++'
However, that only pulls the first non-zero "failure" line and nothing else. What do I need to change for it to pull all the unique values for "failures"?
thanks in advance!
Update: The amount of text in each line up to the "tracking_id" is more text than I can post. Sorry!
2011-09-06 14:14:18 [INFO] [logStats]: name=somename id=d6e6f4f0-4c0d-93b6-7a71-8e3100000030
successes=1 failures=0 skipped=0 eventDelta=41 original=188 simulated=229
totalDelta=41 averageDelta=41 min=0 max=41 averageOriginalDuration=188 averageSimulatedDuriation=229(txid = b3036eca-6288-48ef-166f-5ae200000646
date = 2011-09-02 08:00:00 type = XML xml
=
perl -ne 'print unless $a{/failures=\d+/}++'
does not work because a hash subscript is evaluated in scalar context, and the m// operator does not return the match in scalar context. Instead, it returns 1. So (since every line matches), what you wrote is equivalent to:
perl -ne 'print unless $a{1}++'
and I think you can see the problem there.
There are a number of ways to work around that, but I'd use:
perl -ne 'print unless /(failures=\d+)/ and $a{$1}++'
However, I'd do the whole thing in one call to perl, including numbering the lines:
perl -nle '
  print "Line No. $.: failures=$1; eventDelta=$2; tracking_id=$3"
    if /failures=(\d+).*?eventDelta=(.\d+).*?tracking_id="(\d+)"/
    && $1 > 0
    && !$seen{$1}++
' Logstats.out
($. automatically counts input lines in Perl. The line breaks can be removed if desired, but it will actually work as is.)
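The scalar-context problem and the fix can be reproduced in a few lines. This is a sketch on hypothetical sample lines, since the real log format is only partly shown in the question.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the real log.
my @lines = (
    'successes=1 failures=0 eventDelta=41',
    'successes=1 failures=1 eventDelta=12',
    'successes=2 failures=1 eventDelta=17',
    'successes=2 failures=2 eventDelta=30',
);

my (%by_match_result, %seen, @buggy, @fixed);
for my $line (@lines) {
    # Buggy: the match is in scalar context, so the hash key is always 1.
    push @buggy, $line unless $by_match_result{ $line =~ /failures=\d+/ }++;

    # Fixed: capture the value and use the capture as the key.
    push @fixed, $line
        if $line =~ /failures=(\d+)/ && $1 > 0 && !$seen{$1}++;
}

print scalar(@buggy), " line kept by the buggy version\n";
print "$_\n" for @fixed;
```

The buggy version keeps only the first line it sees, while the fixed version keeps the first line for each distinct non-zero failures value.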
You could use a hash to store the results and print them:
perl -nle '$f{$2} ||= "Line No. $1: failures=$2; eventDelta=$3; tracking_id=$4" if /\s(\d+)\t.*failures=(\d+).*eventDelta=(.\d+).*tracking_id="(\d+)"/ && $2;END{ print $f{$_} for keys %f }' Logstats.out
(not tested due to missing input data...)
HTH,
Paul
Since your input does not match your regex, I can't really help you. But I can tell you that this is doing a lot of backtracking--and that's bad if there is a lot of data after the part that you're interested in.
So here are some alternative ideas:
qr{ \s                     # a single space
    failures=(\d+)         # the entry for failures
    \s+                    # at least one space
    skipped=\d+            # skipped
    \s+
    eventDelta=(.\d+)
    .*?                    # any number of non-newline characters *UNTIL* ...
    \btracking_id="(\d+)"  # another specified sequence of data
}x;
The parser will scan "skipped=" and then a group of digits much faster than it would scan the rest of the line and backtrack to 'eventDelta=' when that fails. If you know the field will always be there, it is better to put it in.
Since you don't show tracking_id in your example, I can't tell how it occurs, so in this case I used a non-greedy "any" match, which will always be looking for the next sequence. Again, if there is a lot of data in the line, you do not want to scan to the end and backtrack until you find that you've already read 'tracking_id="nnn"'. However, lookaheads cost processing time, so it is still better to spell out 'skipped=' and all possible values than to use a non-greedy "any match".
You'll also notice that after accepting any data, I specify that tracking_id should appear at a word boundary, which disambiguates it from the possible (though not likely) 'backtracking_id='.
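A quick way to try the pattern is to run it against a single record. The line below is invented for illustration, because the question does not show how tracking_id appears in the real log; the pattern itself is the one given above.

```perl
use strict;
use warnings;

# Hypothetical record; the real layout up to tracking_id is not shown in the post.
my $line = 'successes=1 failures=2 skipped=0 eventDelta=41 original=188 tracking_id="12345"';

my $re = qr{ \s                     # a single whitespace character
             failures=(\d+)         # the entry for failures
             \s+
             skipped=\d+            # spelled out rather than skipped with .*
             \s+
             eventDelta=(.\d+)
             .*?                    # anything up to the next part
             \btracking_id="(\d+)"  # tracking_id at a word boundary
           }x;

my ($failures, $delta, $id) = $line =~ $re;
print "failures=$failures eventDelta=$delta tracking_id=$id\n";
```

With this sample line the three captures are 2, 41 and 12345.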