Whats wrong with this code to read file? - perl

I have been trying to read a file called "perlthisfile.txt" which is basically the output of nmap on my computer.
I want to get only the ip addresses printed out, so i wrote the following code but it is not working:
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
print"\n running \n";
open (MYFILE, 'perlthisfile.txt') or die "Cannot open file\n";
while(<MYFILE>) {
chomp;
my #value = split(' ', <MYFILE>);
print"\n before foreach \n";
foreach my $val (#value) {
if (looks_like_number($val)) {
print "\n looks like number block \n";
if ($val == /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,5})/) {
print "\n$val\n";
}
}
}
}
close(MYFILE);
exit 0;
And when i ran this code the output was:
running
before foreach
before foreach
looks like number block
before foreach
looks like number block
before foreach
looks like number block
My perlthisfile.txt:
Starting Nmap 6.00 ( http://nmap.org ) at 2013-10-16 22:59 EST
Nmap scan report for BoB2.iiNet (10.1.1.1)
Nmap scan report for android-fbff3c3812154cdc (10.1.1.3)
All 1000 scanned ports on android-fbff3c3812154cdc (10.1.1.3) are closed
Nmap scan report for 10.1.1.5
All 1000 scanned ports on 10.1.1.5 are open|filtered
Nmap scan report for 10.1.1.6
All 1000 scanned ports on 10.1.1.6 are closed

Several issues here. As #toolic said, calling <MYFILE> inside the split is probably not what you want - it will read the next record from the file, use $_ instead.
Also, you are using == with a regex, you should use the binding operator, =~ (== is only used for numeric comparisons in Perl):
if ($val =~ /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,5})/){
I suggest that looks_like_number is redundant if the regex works. I suspect that you are using it because == gives something like isn't numeric in numeric eq (==) depending on the version of perl you are using.

You had a few errors, one of which is regex which should have optional part for port number (: and following \d{1,5})
#!/usr/bin/perl
use strict;
use warnings;
open (my $MYFILE, '<', 'perlthisfile.txt') or die $!;
my $looks_like_ip = qr/( \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} (?: : \d{1,5})? )/x;
while (<$MYFILE>) {
chomp;
my #value = split;
print"\n before foreach \n";
foreach my $val (#value) {
if (my ($match) = $val =~ /$looks_like_ip/){
print "\n$match\n";
}
# else { print "$val doesn't contain IP\n" }
}
}
close($MYFILE) or warn $!;

If this is what it looks to be, which is a quick hack to extract IPs, you might get away with something simple such as:
perl -nlwe '/((?:\d+\.)+\d+)/ && print $1' perlthisfile.txt
Which is to say, not a very strict regex by any means, it just matches numbers joined by periods. If you'd like to only print unique IPs, you can make use of a hash to dedupe:
perl -nlwe '/((?:\d+\.)+\d+)/ && !$seen{$1}++ && print $1" perlthisfile.txt
With a slightly tighter regex that also matches port numbers:
perl -nlwe '/((?:\d+[\.:]){3,4}\d+)/ && print $1' perlthisfile.txt
This will disallow shorter chains of numbers, and allow for a port number.
This last regex explained:
/( # opening parenthesis, starts a string capture
(?: # a non-capturing parenthesis
\d+ # match a number, repeated one or more times
[\.:] # [ ... ] is a character class, it matches one of the literal
# characters inside it, and only one time
){3,4} # closing the non-capturing parenthesis, adding a quantifier
# that says this parenthesis can match 3 or 4 times
\d+ # match one or more numbers
)/x # close capturing parenthesis (added `/x` switch)
The /x switch is just so that you can use the above regex as-is, with comments and whitespace.
The logic behind this is simply: We want a string consisting of a number followed by a period or a colon. We want this string 3 or 4 times. End with another number.
The + and {3,4} are quantifiers, they dictate how many times the item to the left of it is supposed to match. By default, every item matches one time, but by using a quantifier you can change that. + is shorthand for {1,}, and you also have:
? -> {1,0}
* -> {0,}
The syntax is {min,max}, and when a number is missing, that means as many times as possible.

Related

Extract preceding and trailing characters to a matched string from file in awk

I have a large string file seq.txt of letters, unwrapped, with over 200,000 characters. No spaces, numbers etc, just a-z.
I have a second file search.txt which has lines of 50 unique letters which will match once in seq.txt. There are 4000 patterns to match.
I want to be able to find each of the patterns (lines in file search.txt), and then get the 100 characters before and 100 characters after the pattern match.
I have a script which uses grep and works, but this runs very slowly, only does the first 100 characters, and is written out with echo. I am not knowledgeable enough in awk or perl to interpret scripts online that may be applicable, so I am hoping someone here is!
cat search.txt | while read p; do echo "grep -zoP '.{0,100}$p' seq.txt | sed G"; done > do.grep
Easier example with desired output:
>head seq.txt
abcdefghijklmnopqrstuvwxyz
>head search.txt
fgh
pqr
uvw
>head desiredoutput.txt
cdefghijk
mnopqrstu
rstuvwxyz
Best outcome would be a tab separated file of the 100 characters before \t matched pattern \t 100 characters after. Thank you in advance!
One way
use warnings;
use strict;
use feature 'say';
my $string;
# Read submitted files line by line (or STDIN if #ARGV is empty)
while (<>) {
chomp;
$string = $_;
last; # just in case, as we need ONE line
}
# $string = q(abcdefghijklmnopqrstuvwxyz); # test
my $padding = 3; # for the given test sample
my #patterns = do {
my $search_file = 'search.txt';
open my $fh, '<', $search_file or die "Can't open $search_file: $!";
<$fh>;
};
chomp #patterns;
# my #patterns = qw(bcd fgh pqr uvw); # test
foreach my $patt (#patterns) {
if ( $string =~ m/(.{0,$padding}) ($patt) (.{0,$padding})/x ) {
say "$1\t$2\t$3";
# or
# printf "%-3s\t%3s%3s\n", $1, $2, $3;
}
}
Run as program.pl seq.txt, or pipe the content of seq.txt to it.†
The pattern .{0,$padding} matches any character (.), up to $padding times (3 above), what I used in case the pattern $patt is found at a position closer to the beginning of the string than $padding (like the first one, bcd, that I added to the example provided in the question). The same goes for the padding after the $patt.
In your problem then replace $padding to 100. With the 100 wide "padding" before and after each pattern, when a pattern is found at a position closer to the beginning than the 100 then the desired \t alignment could break, if the position is lesser than 100 by more than the tab value (typically 8).
That's what the line with the formatted print (printf) is for, to ensure the width of each field regardless of the length of the string being printed. (It is commented out since we are told that no pattern ever gets into the first or last 100 chars.)
If there is indeed never a chance that a matched pattern breaches the first or the last 100 positions then the regex can be simplified to
/(.{$padding}) ($patt) (.{$padding})/x
Note that if a $patt is within the first/last $padding chars then this just won't match.
The program starts the regex engine for each of #patterns, what in principle may raise performance issues (not for one run with the tiny number of 4000 patterns, but such requirements tend to change and generally grow). But this is by far the simplest way to go since
we have no clue how the patterns may be distributed in the string, and
one match may be inside the 100-char buffer of another (we aren't told otherwise)
If there is a performance problem with this approach please update.
† The input (and output) of the program can be organized in a better way using named command-line arguments via Getopt::Long, for an invocation like
program.pl --sequence seq.txt --search search.txt --padding 100
where each argument may be optional here, with defaults set in the file, and argument names may be shortened and/or given additional names, etc. Let me know if that is of interest
One in awk. -v b=3 is the before context length -v a=3 is the after context length and -v n=3 is the match length which is always constant. It hashes all the substrings of seq.txt to memory so it uses it depending on the size of the seq.txt and you might want to follow the consumption with top, like: abcdefghij -> s["def"]="abcdefghi" , s["efg"]="bcdefghij" etc.
$ awk -v b=3 -v a=3 -v n=3 '
NR==FNR {
e=length()-(n+a-1)
for(i=1;i<=e;i++) {
k=substr($0,(i+b),n)
s[k]=s[k] (s[k]==""?"":ORS) substr($0,i,(b+n+a))
}
next
}
($0 in s) {
print s[$0]
}' seq.txt search.txt
Output:
cdefghijk
mnopqrstu
rstuvwxyz
You can tell grep to search for all the patterns in one go.
sed 's/.*/.{0,100}&.{0,100}/' search.txt |
grep -zoEf - seq.txt |
sed G >do.grep
4000 patterns should be easy peasy, though if you get to hundreds of thousands, maybe you will want to optimize.
There is no Perl regex here, so I switched from the nonstandard grep -P to the POSIX-compatible and probably more efficient grep -E.
The surrounding context will consume any text it prints, so any match within 100 characters from the previous one will not be printed.
You can try following approach to your problem:
load string input data
load into an array patterns
loop through each pattern and look for it in the string
form an array from found matches
loop through matches array and print result
NOTE: the code is not tested due absence of input data
use strict;
use warnings;
use feature 'say';
my $fname_s = 'seq.txt';
my $fname_p = 'search.txt';
open my $fh, '<', $fname_s
or die "Couldn't open $fname_s";
my $data = do { local $/; <$fh> };
close $fh;
open my $fh, '<', $fname_p
or die "Couln't open $fname_p";
my #patterns = <$fh>;
close $fh;
chomp #patterns;
for ( #patterns ) {
my #found = $data =~ s/(.{100}$_.{100})/g;
s/(.{100})(.{50})(.{100})/$1 $2 $3/ && say for #found;
}
Test code for provided test data (added latter)
use strict;
use warnings;
use feature 'say';
my #pat = qw/fgh pqr uvw/;
my $data = do { local $/; <DATA> };
for( #pat ) {
say $1 if $data =~ /(.{3}$_.{3})/;
}
__DATA__
abcdefghijklmnopqrstuvwxyz
Output
cdefghijk
mnopqrstu
rstuvwxyz

perl regex too greedy

I went through similar questions asked by other members and applied (or tried to apply) solutions from their inquiry but they did not work on my issue. My pattern match and grouping is too greedy and does not stop at first pipe(|). If I get more specific, I think it can but I'm trying to figure out how I can stop the pattern match at the first instance of the pipe?
Here are couple of lines
09:30:00.063|IN:|8=FIX.4.2|9=206|35=D|34=5159|49=CLIENT|52=20191024-13:30:00.050|56=SERV|57=DEST|1=05033|11=ABZ5702|15=USD|21=1|38=2000|40=2|44=92.48|47=A|54=5|55=RC|60=20191024-13:30:00.050|111=0|114=N|336=X|5700=AP|9281=SOV|10=202
09:37:21.208|IN:|8=FIX.4.2|9=170|35=D|34=5184|49=CLIENT|52=20191024-13:37:21.206|56=SERV|57=ATXB|1=J5129|11=136404|15=USD|21=1|38=100|40=2|44=1.39|47=A|54=2|55=DIW|59=2|60=20191024-13:30:00.206|10=029
I'm expecting my perl script to return the following output from the above data:
09:30:00.063|13:30:00.050|ABZ5702
09:37:21.208|13:37:21.206|136404
I tried all this and few other veriations but could not get it to produce the above output:
#$msg =~ s/([^|]*).*|52=([^|]*).*|11=([^|]*).*/$1|$2|$3/;
$msg =~ s/(.+)\|??.*|52=([^|]*).*|11=([^|]*).*/$1|$2|$3/;
#$msg =~ s/^([^|]*).??|52=([^|]*).??|11=([^|]*).*/$1|$2|$3/;
#$msg =~ s/^([^\|??]*).*|52=([^\|??]*).*|11=([^\|??]*).*/$1|$2|$3/;
#$msg =~ s/(.*\|??).*|52=(.+\|??).*|11=(.+\|??).*/one $1|two $2|three $3/;
#$msg =~ s/(.*?|).*|52=(.*?|).*|11=(.*|?).*/$1|$2|$3/;
#$msg =~ /(.*)|??.*|52=(.*)|??.*|11=(.*)|??.*/$1|$2|$3/;
#$msg =~ s/|.*-[0-3][0-9]:/|/;
print "$msg\n";```
I realize there are other more than one way to skin the cat but there are cases where I need to use the pattern match approach. How can I get it to produce the expected output using the pattern matching where it stops each group at first pipe(|)? Can someone tell me what am I doing wrong?
Try this:
s/(.*?)\|.*\|52=([^|]*).*\|11=([^|]*).*/$1 $2 $3/;
There were a couple of pipe delimiters that needed escaping.
You need to look at non-greedy matching https://docstore.mik.ua/orelly/perl/cookbook/ch06_16.htm
The first matching group is (.*?) instead of (.*). The ? means we match as little as possible.
In general, for parsing FIX in perl, as long as there are no repeating groups, I would recommend splitting on | first and then creating a hash of tag-value pairs.
I would do it a little bit different - split line into array and work on individual element of array.
The regex may be an acceptable solution for one particular case if format of line predetermined and will never change.
use strict;
use warnings;
use Data::Dumper;
my $debug = 0;
while( my $line = <DATA> ) {
my #array = split /\|/, $line;
print Dumper(\#array) if $debug;
$array[7] =~ s/.+?-//;
$array[11] =~ s/\d+=//;
printf "%s\n", join '|', #array[0,7,11];
}
__DATA__
09:30:00.063|IN:|8=FIX.4.2|9=206|35=D|34=5159|49=CLIENT|52=20191024-13:30:00.050|56=SERV|57=DEST|1=05033|11=ABZ5702|15=USD|21=1|38=2000|40=2|44=92.48|47=A|54=5|55=RC|60=20191024-13:30:00.050|111=0|114=N|336=X|5700=AP|9281=SOV|10=202
09:37:21.208|IN:|8=FIX.4.2|9=170|35=D|34=5184|49=CLIENT|52=20191024-13:37:21.206|56=SERV|57=ATXB|1=J5129|11=136404|15=USD|21=1|38=100|40=2|44=1.39|47=A|54=2|55=DIW|59=2|60=20191024-13:30:00.206|10=029

Program argument is 100 but returns the value as 0100

Right now I am trying to do an assignment where I have to
- Extract information from an HTML file
- Save it to a scalar
- Run a regular expression to find the number of seats available in the designated course (the program argument is the course number for example 100 for ICS 100)
- If the course has multiple sessions, I have to find the sum of the seats available and print
- The output is just the number of seats available
The problem here is that when I was debugging and checking to make sure that my variable I have the program arg saved to was storing the correct value, it was storing the values with an extra 0 behind it.
ex.) perl filename.pl 100
ARGV[0] returns as 0100
I've tried storing the True regular expression values to an array, saving using multiple scalar variables, and changing my regular expression but none worked.
die "Usage: perl NameHere_seats.pl course_number" if (#ARGV < 1);
# This variable will store the .html file contents
my $fileContents;
# This variable will store the sum of seats available in the array #seatAvailable
my $sum = 0;
# This variable will store the program argument
my $courseNum = $ARGV[0];
# Open the file to read contents all at once
open (my $fh, "<", "fa19_ics_class_availability.html") or die ("Couldn't open 'fa19_ics_class_availability.html'\n");
# use naked brakets to limit the $/
{
#use local $/ to get <$fh> to read the whole file, and not one line
local $/;
$fileContents = <$fh>;
}
# Close the file handle
close $fh;
# Uncomment the line below to check if you've successfully extracted the text
# print $fileContents;
# Check if the course exists
die "No courses matched...\n" if ($ARGV[0] !~ m/\b(1[0-9]{2}[A-Z]?|2[0-8][0-9][A-Z]?|29[0-3])[A-Z]?\b/);
while ($fileContents =~ m/$courseNum(.+?)align="center">(\d)</) {
my $num = $2;
$sum = $sum + $num;
}
print $sum;
# Use this line as error checking to make sure #ARGV[0] is storing proper number
print $courseNum;
The current output I am receiving when program argument is 100 is just 0, and I assume it's because the regular expression is not catching any values as true therefore the sum remains at a value of 0. The output should be 15...
This is a link to the .html page > https://laulima.hawaii.edu/access/content/user/emeyer/ics/215/FA19/01/perl/fa19_ics_class_availability.html
You're getting "0100" because you have two print() statements.
print $sum;
...
print $courseNum;
And because there are no newlines or other output between them, you get the two values printed out next to each other. $sum is '0' and $courseNum is '100'.
So why is $sum zero? Well, that's because your regex isn't picking up the data you want it to match. Your regex looks like this:
m/$courseNum(.+?)align="center">(\d)</
You're looking for $courseNum followed by a number of other characters, followed by 'align="center">' and then your digit. This doesn't work for a number of reasons.
The string "100" appears many times in your text. Many times it doesn't even mean a course number (e.g. "100%"). Perhaps you should look for something more precise (ICS $coursenum).
The .+? doesn't do what you think it does. The dot doesn't match newline characters unless you use the /s option on the match operator.
But even if you fix those first two problems, it still won't work as there are a number of numeric table cells for each course and you're doing nothing to ensure that you're grabbing the last one. Your current code will get the "Curr. Enrolled" column, not the "Seats Avail" one.
This is a non-trivial HTML parsing problem. It shouldn't be addressed using regexes (HTML should never be parsed using regexes). You should look at one of the HTML parsing modules from CPAN - I think I'd use Web::Query.
Update: An example solution using Web::Query:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use File::Basename;
use Web::Query;
my $course_num = shift
or die 'Usage: perl ' . basename $0 . " course_number\n";
my $source = 'fa19_ics_class_availability.html';
open my $fh, '<', $source
or die "Cannot open '$source': $!\n";
my $html = do { local $/; <$fh> };
my $count_free;
wq($html)
# Get each table row in the table
->find('table.listOfClasses tr')
->each(sub {
my ($i, $elem) = #_;
my #tds;
# Get each <td> in the <tr>
$elem->find('td')->each(sub { push #tds, $_[1] });
# Ignore rows that don't have 13 columns
return if #tds != 13;
# Ignore rows that aren't about the right course
return if $tds[2]->text ne "ICS $course_num";
# Add the number of available places
$count_free += $tds[8]->text;
});
say $count_free;

Matching of data from output table

We need to match certain data element by element that is an output in tabular form obtained on the command prompt.The following is the approach being currently followed wherein the $Var contains the output. Is there an optimal way of doing this without directing the command output to file.
Please share your thoughts.
$Var = "iSCSI Storage LHN StgMgmt Name IP Name
==============================================================
0 Storage_1 15.178.209.194 admin
1 acct-mgmt 15.178.209.194 storage1
2 acct-mgmt2 15.178.209.194 storage2";
#tab = split("\n",$Var);
foreach (#tab) {
next if ($_ !~ /^\d/);
$_ =~ s/\s+//g;
$first=0 if($_ =~ /Storage/i && /15.178.209.194/);
push(#Array, $_); }
$_ =~ /Storage/i && /15.178.209.194/ is silly. That gets broken up like this: ($_ =~ /Storage/i) && (/15.178.209.194/). Either use $_ consistently or don't - the // and s/// operators automatically operate on $_.
Also you should know that in the regex /15.178.209.194/, the .s are being interpreted as any character. Either escape them or use the index() function.
Additionally, I would recommend that you separate each line using split(). This allows you to compare each individual column. You can use split() with a regex like so: #array = split(/\s+/, $string);.
Finally, I'm not really sure what $first is for, but I notice that all three sample lines in that input trigger $first=0 as they all contain that IP and the string "storage".
If I understand you correctly you want to invoke your script like this:
./some_shell_command | perl perl_script.pl
What you want to use is the Perl diamond operator <>:
#!/usr/bin/perl
use strict;
use warnings;
my $first;
my #Array;
for (<>) {
next unless /^\d/;
s/\s+/ /g;
$first = 0 if /Storage/i && /15.178.209.194/;
push(#Array, $_);
}
I've removed the redundant uses of $_ and fixed your substitution, since you probably don't want to remove all spaces.

Extracting specific lines with Perl

I am writing a perl program to extract lines that are in between the two patterns i am matching. for example the below text file has 6 lines. I am matching load balancer and end. I want to get the 4 lines that are in between.
**load balancer**
new
old
good
bad
**end**
My question is how do you extract lines in between load balancer and end into an array. Any help is greatly appreciated.
You can use the flip-flop operator to tell you when you are between the markers. It will also include the actual markers, so you'll need to except them from the data collection.
Note that this will mash together all the records if you have several, so if you do you need to store and reset #array somehow.
use strict;
use warnings;
my #array;
while (<DATA>) {
if (/^load balancer$/ .. /^end$/) {
push #array, $_ unless /^(load balancer|end)$/;
}
}
print #array;
__DATA__
load balancer
new
old
good
bad
end
You can use the flip-flop operator.
Additionally, you can also use the return value of the flipflop to filter out the boundary lines. The return value is a sequence number (starting with 1) and the last number has the string E0 appended to it.
# Define the marker regexes separately, cuz they're ugly and it's easier
# to read them outside the logic of the loop.
my $start_marker = qr{^ \s* \*\*load \s balancer\*\* \s* $}x;
my $end_marker = qr{^ \s* \*\*end\*\* \s* $}x;
while( <DATA> ) {
# False until the first regex is true.
# Then it's true until the second regex is true.
next unless my $range = /$start_marker/ .. /$end_marker/;
# Flip-flop likes to work with $_, but it's bad form to
# continue to use $_
my $line = $_;
print $line if $range !~ /^1$|E/;
}
__END__
foo
bar
**load balancer**
new
old
good
bad
**end**
baz
biff
Outputs:
new
old
good
bad
If you prefer a command line variation:
perl -ne 'print if m{\*load balancer\*}..m{\*end\*} and !m{\*load|\*end}' file
For files like this, I often use a change in the Record Separator ( $/ or $RS from English )
use English qw<$RS>;
local $RS = "\nend\n";
my $record = <$open_handle>;
When you chomp it, you get rid of that line.
chomp( $record );