Extract preceding and trailing characters to a matched string from file in awk - perl

I have a large string file seq.txt of letters, unwrapped, with over 200,000 characters. No spaces, numbers etc, just a-z.
I have a second file search.txt which has lines of 50 unique letters which will match once in seq.txt. There are 4000 patterns to match.
I want to be able to find each of the patterns (lines in file search.txt), and then get the 100 characters before and 100 characters after the pattern match.
I have a script which uses grep and works, but this runs very slowly, only does the first 100 characters, and is written out with echo. I am not knowledgeable enough in awk or perl to interpret scripts online that may be applicable, so I am hoping someone here is!
cat search.txt | while read p; do echo "grep -zoP '.{0,100}$p' seq.txt | sed G"; done > do.grep
Easier example with desired output:
>head seq.txt
abcdefghijklmnopqrstuvwxyz
>head search.txt
fgh
pqr
uvw
>head desiredoutput.txt
cdefghijk
mnopqrstu
rstuvwxyz
Best outcome would be a tab separated file of the 100 characters before \t matched pattern \t 100 characters after. Thank you in advance!

One way
use warnings;
use strict;
use feature 'say';
my $string;
# Read submitted files line by line (or STDIN if #ARGV is empty)
while (<>) {
chomp;
$string = $_;
last; # just in case, as we need ONE line
}
# $string = q(abcdefghijklmnopqrstuvwxyz); # test
my $padding = 3; # for the given test sample
my #patterns = do {
my $search_file = 'search.txt';
open my $fh, '<', $search_file or die "Can't open $search_file: $!";
<$fh>;
};
chomp #patterns;
# my #patterns = qw(bcd fgh pqr uvw); # test
foreach my $patt (#patterns) {
if ( $string =~ m/(.{0,$padding}) ($patt) (.{0,$padding})/x ) {
say "$1\t$2\t$3";
# or
# printf "%-3s\t%3s%3s\n", $1, $2, $3;
}
}
Run as program.pl seq.txt, or pipe the content of seq.txt to it.†
The pattern .{0,$padding} matches any character (.), up to $padding times (3 above), what I used in case the pattern $patt is found at a position closer to the beginning of the string than $padding (like the first one, bcd, that I added to the example provided in the question). The same goes for the padding after the $patt.
In your problem then replace $padding to 100. With the 100 wide "padding" before and after each pattern, when a pattern is found at a position closer to the beginning than the 100 then the desired \t alignment could break, if the position is lesser than 100 by more than the tab value (typically 8).
That's what the line with the formatted print (printf) is for, to ensure the width of each field regardless of the length of the string being printed. (It is commented out since we are told that no pattern ever gets into the first or last 100 chars.)
If there is indeed never a chance that a matched pattern breaches the first or the last 100 positions then the regex can be simplified to
/(.{$padding}) ($patt) (.{$padding})/x
Note that if a $patt is within the first/last $padding chars then this just won't match.
The program starts the regex engine for each of #patterns, what in principle may raise performance issues (not for one run with the tiny number of 4000 patterns, but such requirements tend to change and generally grow). But this is by far the simplest way to go since
we have no clue how the patterns may be distributed in the string, and
one match may be inside the 100-char buffer of another (we aren't told otherwise)
If there is a performance problem with this approach please update.
† The input (and output) of the program can be organized in a better way using named command-line arguments via Getopt::Long, for an invocation like
program.pl --sequence seq.txt --search search.txt --padding 100
where each argument may be optional here, with defaults set in the file, and argument names may be shortened and/or given additional names, etc. Let me know if that is of interest

One in awk. -v b=3 is the before context length -v a=3 is the after context length and -v n=3 is the match length which is always constant. It hashes all the substrings of seq.txt to memory so it uses it depending on the size of the seq.txt and you might want to follow the consumption with top, like: abcdefghij -> s["def"]="abcdefghi" , s["efg"]="bcdefghij" etc.
$ awk -v b=3 -v a=3 -v n=3 '
NR==FNR {
e=length()-(n+a-1)
for(i=1;i<=e;i++) {
k=substr($0,(i+b),n)
s[k]=s[k] (s[k]==""?"":ORS) substr($0,i,(b+n+a))
}
next
}
($0 in s) {
print s[$0]
}' seq.txt search.txt
Output:
cdefghijk
mnopqrstu
rstuvwxyz

You can tell grep to search for all the patterns in one go.
sed 's/.*/.{0,100}&.{0,100}/' search.txt |
grep -zoEf - seq.txt |
sed G >do.grep
4000 patterns should be easy peasy, though if you get to hundreds of thousands, maybe you will want to optimize.
There is no Perl regex here, so I switched from the nonstandard grep -P to the POSIX-compatible and probably more efficient grep -E.
The surrounding context will consume any text it prints, so any match within 100 characters from the previous one will not be printed.

You can try following approach to your problem:
load string input data
load into an array patterns
loop through each pattern and look for it in the string
form an array from found matches
loop through matches array and print result
NOTE: the code is not tested due absence of input data
use strict;
use warnings;
use feature 'say';
my $fname_s = 'seq.txt';
my $fname_p = 'search.txt';
open my $fh, '<', $fname_s
or die "Couldn't open $fname_s";
my $data = do { local $/; <$fh> };
close $fh;
open my $fh, '<', $fname_p
or die "Couln't open $fname_p";
my #patterns = <$fh>;
close $fh;
chomp #patterns;
for ( #patterns ) {
my #found = $data =~ s/(.{100}$_.{100})/g;
s/(.{100})(.{50})(.{100})/$1 $2 $3/ && say for #found;
}
Test code for provided test data (added latter)
use strict;
use warnings;
use feature 'say';
my #pat = qw/fgh pqr uvw/;
my $data = do { local $/; <DATA> };
for( #pat ) {
say $1 if $data =~ /(.{3}$_.{3})/;
}
__DATA__
abcdefghijklmnopqrstuvwxyz
Output
cdefghijk
mnopqrstu
rstuvwxyz

Related

Can one concatenate two Perl scripts which use different input record separators?

Two Perl scripts, using different input record separators, work together to convert a LaTeX file into something easily searched for human-readable phrases and sentences. Of course, they could be wrapped together by a single shell script. But I am curious whether they can be incorporated into a single Perl script.
The reason for these scripts: It would be a hassle to find "two three" inside short.tex, for instance. But after conversion, grep 'two three' will return the first paragraph.
For any LaTeX file (here, short.tex), the scripts are invoked as follows.
cat short.tex | try1.pl | try2.pl
try1.pl works on paragraphs. It gets rid of LaTeX comments. It makes sure that each word is separated from its neighbors by a single space, so that no sneaky tabs, form feeds, etc., lurk between words. The resulting paragraph occupies a single line, consisting of visible characters separated by single spaces --- and at the end, a sequence of at least two newlines.
try2.pl slurps the entire file. It makes sure that paragraphs are separated from each other by exactly two newlines. And it ensures that the last line of the file is non-trivial, containing visible character(s).
Can one elegantly concatenate two operations such as these, which depend on different input record separators, into a single Perl script, say big.pl? For instance, could the work of try1.pl and try2.pl be accomplished by two functions or bracketed segments inside the larger script?
Incidentally, is there a Stack Overflow keyword for "input record separator"?
###File try1.pl:
#!/usr/bin/perl
use strict;
use warnings;
use 5.18.2;
local $/ = ""; # input record separator: loop through one paragraph at a time. position marker $ comes only at end of paragraph.
while (<>) {
s/[\x25].*\n/ /g; # remove all LaTeX comments. They start with %
s/[\t\f\r ]+/ /g; # collapse each "run" of whitespace to one single space
s/^\s*\n/\n/g; # any line that looks blank is converted to a pure newline;
s/(.)\n/$1/g; # Any line that does not look blank is joined to the subsequent line
print;
print "\n\n"; # make sure each paragraph is separated from its fellows by newlines
}
###File try2.pl:
#!/usr/bin/perl
use strict;
use warnings;
use 5.18.2;
local $/ = undef; # input record separator: entire text or file is a single record.
while (<>) {
s/[\n][\n]+/\n\n/g; # exactly 2 blank lines separate paragraphs. Like cat -s
s/[\n]+$/\n/; # last line is nontrivial; no blank line at the end
print;
}
###File short.tex:
\paragraph{One}
% comment
two % also 2
three % or 3
% comment
% comment
% comment
% comment
% comment
% comment
So they said%
that they had done it.
% comment
% comment
% comment
Fleas.
% comment
% comment
After conversion:
\paragraph{One} two three
So they said that they had done it.
Fleas.
To combine try1.pl and try2.pl into a single script you could try:
local $/ = "";
my #lines;
while (<>) {
[...] # Same code as in try1.pl except print statements
push #lines, $_;
}
$lines[-1] =~ s/\n+$/\n/;
print for #lines;
A pipe connects the output of one process to the input of another process. Neither one knows about the other nor cares how it operates.
But, putting things together like this breaks the Unix pipeline philosophy of small tools that each excel at a very narrow job. Should you link these two things, you'll always have to do both tasks even if you want one (although you could get into configuration to turn off one, but that's a lot of work).
I process a lot of LaTeX, and I control everything through a Makefile. I don't really care about what the commands look like and I don't even have to remember what they are:
short-clean.tex: short.tex
cat short.tex | try1.pl | try2.pl > $#
Let's do it anyways
I'll limit myself to the constraint of basic concatenation instead of complete rewriting or rearranging, most because there are some interesting things to show.
Consider what happens should you concatenate those two programs by simply adding the text of the second program at the end of the text of the first program.
The output from the original first program still goes to standard output and the second program now doesn't get that output as input.
The input to the program is likely exhausted by the original first program and the second program now has nothing to read. That's fine because it would have read the unprocessed input to the first program.
There are various ways to fix this, but none of them make much sense when you already have two working program that do their job. I'd shove that in the Makefile and forget about it.
But, suppose you do want it all in one file.
Rewrite the first section to send its output to a filehandle connected to a string. It's output is now in the programs memory. This basically uses the same interface, and you can even use select to make that the default filehandle.
Rewrite the second section to read from a filehandle connected to that string.
Alternately, you can do the same thing by writing to a temporary file in the first part, then reading that temporary file in the second part.
A much more sophisticated program would the first program write to a pipe (inside the program) that the second program is simultaneously reading. However, you have to pretty much rewrite everything so the two programs are happening simultaneously.
Here's Program 1, which uppercases most of the letters:
#!/usr/bin/perl
use v5.26;
$|++;
while( <<>> ) { # safer line input operator
print tr/a-z/A-Z/r;
}
and here's Program 2, which collapses whitespace:
#!/usr/bin/perl
use v5.26;
$|++;
while( <<>> ) { # safer line input operator
print s/\s+/ /gr;
}
They work serially to get the job done:
$ perl program1.pl
The quick brown dog jumped over the lazy fox.
THE QUICK BROWN DOG JUMPED OVER THE LAZY FOX.
^D
$ perl program2.pl
The quick brown dog jumped over the lazy fox.
The quick brown dog jumped over the lazy fox.
^D
$ perl program1.pl | perl program2.pl
The quick brown dog jumped over the lazy fox.
THE QUICK BROWN DOG JUMPED OVER THE LAZY FOX.
^D
Now I want to combine those. First, I'll make some changes that don't affect the operation but will make it easier for me later. Instead of using implicit filehandles, I'll make those explicit and one level removed from the actual filehandles:
Program 1:
#!/usr/bin/perl
use v5.26;
$|++;
my $output_fh = \*STDOUT;
while( <<>> ) { # safer line input operator
print { $output_fh } tr/a-z/A-Z/r;
}
Program 2:
#!/usr/bin/perl
$|++;
my $input_fh = \*STDIN;
while( <$input_fh> ) { # safer line input operator
print s/\s+/ /gr;
}
Now I have the chance to change what those filehandles are without disturbing the meat of the program. The while doesn't know or care what that filehandle is, so let's start by writing to a file in Program 1 and reading from that same file in Program 2:
Program 1:
#!/usr/bin/perl
use v5.26;
open my $output_fh, '>', 'program1.out' or die "$!";
while( <<>> ) { # safer line input operator
print { $output_fh } tr/a-z/A-Z/r;
}
close $output_fh;
Program 2:
#!/usr/bin/perl
$|++;
open my $input_fh, '<', 'program1.out' or die "$!";
while( <$input_fh> ) { # safer line input operator
print s/\h+/ /gr;
}
However, you can no longer run these in a pipeline because Program 1 doesn't use standard output and Program 2 doesn't read standard input:
% perl program1.pl
% perl program2.pl
You can, however, now join the programs, shebang and all:
#!/usr/bin/perl
use v5.26;
open my $output_fh, '>', 'program1.out' or die "$!";
while( <<>> ) { # safer line input operator
print { $output_fh } tr/a-z/A-Z/r;
}
close $output_fh;
#!/usr/bin/perl
$|++;
open my $input_fh, '<', 'program1.out' or die "$!";
while( <$input_fh> ) { # safer line input operator
print s/\h+/ /gr;
}
You can skip the file and use a string instead, but at this point, you've gone beyond merely concatenating files and need a little coordination for them to share the scalar with the data. Still, the meat of the program doesn't care how you made those filehandles:
#!/usr/bin/perl
use v5.26;
my $output_string;
open my $output_fh, '>', \ $output_string or die "$!";
while( <<>> ) { # safer line input operator
print { $output_fh } tr/a-z/A-Z/r;
}
close $output_fh;
#!/usr/bin/perl
$|++;
open my $input_fh, '<', \ $output_string or die "$!";
while( <$input_fh> ) { # safer line input operator
print s/\h+/ /gr;
}
So let's go one step further and do what the shell was already doing for us.
#!/usr/bin/perl
use v5.26;
pipe my $input_fh, my $output_fh;
$output_fh->autoflush(1);
while( <<>> ) { # safer line input operator
print { $output_fh } tr/a-z/A-Z/r;
}
close $output_fh;
while( <$input_fh> ) { # safer line input operator
print s/\h+/ /gr;
}
From here, it gets a bit tricky and I'm not going to go to the next step with polling filehandles so one thing can write and the the next thing reads. There are plenty of things that do that for you. And, you're now doing a lot of work to avoid something that was already simple and working.
Instead of all that pipe nonsense, the next step is to separate code into functions (likely in a library), and deal with those chunks of code as named things that hide their details:
use Local::Util qw(remove_comments minify);
while( <<>> ) {
my $result = remove_comments($_);
$result = minify( $result );
...
}
That can get even fancier where you simply go through a series of steps without knowing what they are or how many of them there will be. And, since all the baby steps are separate and independent, you're basically back to the pipeline notion:
use Local::Util qw(get_input remove_comments minify);
my $result;
my #steps = qw(get_input remove_comments minify)
while( ! eof() ) { # or whatever
no strict 'refs'
$result = &{$_}( $result ) for #steps;
}
A better way makes that an object so you can skip the soft reference:
use Local::Processor;
my #steps = qw(get_input remove_comments minify);
my $processer = Local::Processor->new( #steps );
my $result;
while( ! eof() ) { # or whatever
$result = $processor->$_($result) for #steps;
}
Like I did before, the meat of the program doesn't care or know about the steps ahead of time. That means that you can move the sequence of steps to configuration and use the same program for any combination and sequence:
use Local::Config;
use Local::Processor;
my #steps = Local::Config->new->get_steps;
my $processer = Local::Processor->new;
my $result;
while( ! eof() ) { # or whatever
$result = $processor->$_($result) for #steps;
}
I write quite a bit about this sort of stuff in Mastering Perl and Effective Perl Programming. But, because you can do it doesn't mean you should. This reinvents a lot that make can already do for you. I don't do this sort of thing without good reason—bash and make have to be pretty annoying to motivate me to go this far.
The motivating problem was to generate a "cleaned" version of a LaTeX file, which would be easy to search, using regex, for complex phrases or sentences.
The following single Perl script does the job, whereas previously I required one shell script and two Perl scripts, entailing three invocations of Perl. This new, single script incorporates three consecutive loops, each with a different input record separator.
First loop:
input = STDIN, or a file passed as argument; record separator=default, loop by line; print result to fileafterperlLIN, a temporary
file on the hard drive.
Second loop:
input = fileafterperlLIN;
record separator = "", loop by paragraph;
print result to fileafterperlPRG, a temporary file on the hard drive.
Third loop:
input = fileafterperlPRG;
record separator = undef, slurp entire file
print result to STDOUT
This has the disadvantage of printing to and reading from two files on the hard drive, which may slow it down. Advantages are that the operation seems to require only one process; and all the code resides in a single file, which should make it easier to maintain.
#!/usr/bin/perl
# 2019v04v05vFriv17h18m41s
use strict;
use warnings;
use 5.18.2;
my $diagnose;
my $diagnosticstring;
my $exitcode;
my $userName = $ENV{'LOGNAME'};
my $scriptpath;
my $scriptname;
my $scriptdirectory;
my $cdld;
my $fileafterperlLIN;
my $fileafterperlPRG;
my $handlefileafterperlLIN;
my $handlefileafterperlPRG;
my $encoding;
my $count;
sub diagnosticmessage {
return unless ( $diagnose );
print STDERR "$scriptname: ";
foreach $diagnosticstring (#_) {
printf STDERR "$diagnosticstring\n";
}
}
# Routine setup
$scriptpath = $0;
$scriptname = $scriptpath;
$scriptname =~ s|.*\x2f([^\x2f]+)$|$1|;
$cdld = "$ENV{'cdld'}"; # A directory to hold temporary files used by scripts
$exitcode = system("test -d $cdld && test -w $cdld || { printf '%\n' 'cdld not a writeable directory'; exit 1; }");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
$scriptdirectory = "$cdld/$scriptname"; # To hold temporary files used by this script
$exitcode = system("test -d $scriptdirectory || mkdir $scriptdirectory");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
diagnosticmessage ( "scriptdirectory=$scriptdirectory" );
$exitcode = system("test -w $scriptdirectory && test -x $scriptdirectory || exit 1;");
die "$scriptname: system returned exitcode=$exitcode: $scriptdirectory not writeable or not executable. bail\n" unless $exitcode == 0;
$fileafterperlLIN = "$scriptdirectory/afterperlLIN.tex";
diagnosticmessage ( "fileafterperlLIN=$fileafterperlLIN" );
$exitcode = system("printf '' > $fileafterperlLIN;");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
$fileafterperlPRG = "$scriptdirectory/afterperlPRG.tex";
diagnosticmessage ( "fileafterperlPRG=$fileafterperlPRG" );
$exitcode=system("printf '' > $fileafterperlPRG;");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
# This script's job: starting with a LaTeX file, which may compile beautifully in pdflatex but be difficult
# to read visually or search automatically,
# (1) convert any line that looks blank --- a "trivial line", containing only whitespace --- to a pure newline. This is because
# (a) LaTeX interprets any whitespace line following a non-blank or "nontrivial" line as end of paragraph, whereas
# (b) Perl needs two consecutive newlines to signal end of paragraph.
# (2) remove all LaTeX comments;
# (3) deal with the \unskip LaTeX construct, etc.
# The result will be
# (4) each LaTeX paragraph will occupy a unique line
# (5) exactly one pair of newlines --- visually, one blank line --- will divide each pair of consecutive paragraphs
# (6) first paragraph will be on first line (no opening blank line) and last paragraph will be on last line (no ending blank line)
# (7) whitespace in output will consist of only
# (a) a single space between readable strings, or
# (b) double newline between paragraphs
#
$handlefileafterperlLIN = undef;
$handlefileafterperlPRG = undef;
$encoding = ":encoding(UTF-8)";
diagnosticmessage ( "fileafterperlLIN=$fileafterperlLIN" );
open($handlefileafterperlLIN, ">> $encoding", $fileafterperlLIN) || die "$0: can't open $fileafterperlLIN for appending: $!";
# Loop 1 / line:
# Default input record separator: loop through one line at a time, delimited by \n
$count = 0;
while (<>) {
$count = $count + 1;
diagnosticmessage ( "line $count" );
s/^\s*\n/\n/mg; # Convert any trivial line to a pure newline.
print $handlefileafterperlLIN $_;
}
close($handlefileafterperlLIN);
open($handlefileafterperlLIN, "< $encoding", $fileafterperlLIN) || die "$0: can't open $fileafterperlLIN for reading: $!";
open($handlefileafterperlPRG, ">> $encoding", $fileafterperlPRG) || die "$0: can't open $fileafterperlPRG for appending: $!";
# Loop PRG / paragraph:
local $/ = ""; # Input record separator: loop through one paragraph at a time. position marker $ comes only at end of paragraph.
$count = 0;
while (<$handlefileafterperlLIN>) {
$count = $count + 1;
diagnosticmessage ( "paragraph $count" );
s/(?<!\x5c)[\x25].*\n/ /g; # Remove all LaTeX comments.
# They start with % not \% and extend to end of line or newline character. Join to next line.
# s/(?<!\x5c)([\x24])/\x2a/g; # 2019v04v01vMonv13h44m09s any $ not preceded by backslash \, replace $ by * or something.
# This would be only if we are going to run detex on the output.
s/(.)\n/$1 /g; # Any line that has something other than newline, and then a newline, is joined to the subsequent line
s|([^\x2d])\s*(\x2d\x2d\x2d)([^\x2d])|$1 $2$3|g; # consistent treatment of triple hyphen as em dash
s|([^\x2d])(\x2d\x2d\x2d)\s*([^\x2d])|$1$2 $3|g; # consistent treatment of triple hyphen as em dash, continued
s/[\x0b\x09\x0c\x20]+/ /gm; # collapse each "run" of whitespace other than newline, to a single space.
s/\s*[\x5c]unskip(\x7b\x7d)?\s*(\S)/$2/g; # LaTeX whitespace-collapse across newlines
s/^\s*//; # Any nontrivial line: No indenting. No whitespace in first column.
print $handlefileafterperlPRG $_;
print $handlefileafterperlPRG "\n\n"; # make sure each paragraph ends with 2 newlines, hence at least 1 blank line.
}
close($handlefileafterperlPRG);
open($handlefileafterperlPRG, "< $encoding", $fileafterperlPRG) || die "$0: can't open $fileafterperlPRG for reading: $!";
# Loop slurp
local $/ = undef; # Input record separator: entire file is a single record.
$count = 0;
while (<$handlefileafterperlPRG>) {
$count = $count + 1;
diagnosticmessage ( "slurp $count" );
s/[\n][\n]+/\n\n/g; # Exactly 2 blank lines (newlines) separate paragraphs. Like cat -s
s/[\n]+$/\n/; # Last line is visible or "nontrivial"; no trivial (blank) line at the end
s/^[\n]+//; # No trivial (blank) line at the start. The first line is "nontrivial."
print STDOUT;
}

Replace single space with multiple spaces in perl

I have a requirement of replacing a single space with multiple spaces so that the second field always starts at a particular position (here 36 is the position of second field always).
I have a perl script written for this:
while(<INP>)
{
my $md=35-index($_," ");
my $str;
$str.=" " for(1..$md);
$_=~s/ +/$str/;
print "$_" ;
}
Is there any better approach with just using the regex in =~s/// so that I can use it on CLI directly instead of script.
Assuming that the fields in your data are demarcated by spaces
while (<$fh>) {
my ($first, #rest) = split;
printf "%-35s #rest\n", $first;
}
The first field is now going to be 36 wide, aligned left due to - in the format of printf. See sprintf for the many details. The rest is printed with single spaces between the original space-separated fields, but can instead be done as desired (tab separated, fixed width...).
Or you can leave the "rest" after the first field untouched by splitting the line into two parts
while (<$fh>) {
my ($first, $rest) = /(\S+)\s+(.*)/;
printf "%-35s $rest\n", $first;
}
(or use split ' ', $_, 2 instead of regex)
Please give more detail if there are other requirements.
One approach is to use plain ol' Perl formats:
#!/usr/bin/perl
use warnings;
use strict;
my($first, $second, $remainder);
format STDOUT =
#<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< #<<<<<< #<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$first, $second,$remainder
.
while (<DATA>) {
($first, $second, $remainder) = split(/\s+/, $_, 3);
write;
}
exit 0;
__DATA__
ABCD TEST EFGH don't touch
FOO BAR FUD don't touch
Test output. I probably miscounted the columns, but you should get the idea:
$ perl dummy.pl
ABCD TEST EFGH don't touch
FOO BAR FUD don't touch
Other option would be Text::Table

How to remove non numeric charcter from numeric string?

I have following data. I would like to print last column without non numeric character from a string. Kindly help me
N THR K 149A
CA THR K 149A
C THR K 149A
O THR K 149A
CB THR K 149A
OG1 THR K 149A
CG2 THR K 149A
N SER K 149B
CA SER K 149B
C SER K 149B
O SER K 149B
CB SER K 149B
for solving the above problem I have tried by following program.
#!/usr/bin/perl -w
open(F1, "$ARGV[0]") or die;
chomp(#arr=<F1>);
close F1;
for($i=0;$i<=$#arr;$i++)
{
#pdb=split(/\h/,$arr[$i]);
if($pdb[3] =~ /[A-Z]/*$);{
$pdb[3] =~ s/\D//g;
print "$pdb[1] $pdb[2] $pdb[3]\n";
}
}
Ok, unless this is a typo, it is the thing wrong with your code.
if($pdb[3] =~ /[A-Z]/*$);{
In this code, you have placed the slash / in the middle of your regex, and also placed a semi-colon there which does not belong anywhere on that line. Also, you are using * as the quantifier, which will not work as intended, because it will allow a match on the empty string (zero matches), which will match all strings. The correct line is:
if($pdb[3] =~ /[A-Z]+$/) {
However, this entire line is incorrect, when taken in context:
if($pdb[3] =~ /[A-Z]*$/) {
$pdb[3] =~ s/\D//g;
Here you only remove non-digits if upper case letters are found. Besides the fact that you are checking for two different things, you do not need to check before substituting, because a substitution will not do anything if it does not match. So... something like this:
if ($foo =~ /A/) {
$foo =~ s/A//g;
is completely redundant, because s/A//g will not do anything unless there is already an A in the string.
Also, a few more things you should know:
Always use
use strict;
use warnings;
As it will help you prevent a lot of simple mistakes.
Use three argument open, with lexical file handle, and check the return value including the error:
open my $fh, "<", $file or die "Cannot open $file: $!";
You do not need to quote variables, such as with "$ARGV[0]". You leave out the quotes: $ARGV[0].
You are using a C-style for loop. Using a Perl-style loop is preferred, in my opinion:
for my $i (0 .. $#arr)
But you should not be using array indexes unless you need the indexes themselves, so the better loop is:
for my $line (#arr)
But again, as a general rule, it is better to read a file line-by-line than slurping it into an array. For this purpose you would use a while loop, which iterates over the file handle instead of exhausting it all at once:
while (<$fh>) {
# process line $_
}
Using /\h/ as the field delimiter for split is wrong, unless you intended that consecutive whitespace indicates empty fields. The default split is ' ', which splits on multiple whitespace /\s+/, and also strips leading whitespace. With CSV data, it is possibly correct to split on single delimiters, but in that case you should use the specific delimiter, and not a character class like \h.
Like I said before, using the * quantifier in a regex match is horribly wrong. You might notice that a regex such as /[A-Z]*/ matches anything if you try it out: perl -lnwe 'print /[A-Z]*/ ? "match!" : "no match";' That is because it is allowed to match the empty string, and all strings match the empty string.
And like I also said, you do not need to check before you substitute. At least not for the same thing. So, when simplified, your code becomes:
open my $fh, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";
while (<$fh>) { # short for while ($_ = <$fh>)
chomp; # short for chomp($_)
my #fields = split; # short for split(' ', $_)
$fields[3] =~ s/\D//g;
print "#fields[1,2,3]\n"; # quoting an array inserts spaces between elements
}
Note that I used an array slice, where we only use the elements with the indicated elements. You can also write this such as:
print join(" ", $fields[1], $fields[2], $fields[3]), "\n";
You might note also that this is possible to achieve using a one-liner:
perl -anlwe '$F[3] =~ s/\D//g; print "#F[1,2,3]"'
The -a switch autosplits the line on whitespace, storing the fields in #F. The -l switch chomps the line and adds newline to print. And the -n switch reads input from STDIN or argument files, whichever is supplied.
Try this
perl -ne 'print "$1\n" if m/(\d+)\D$/' datafile

Whats wrong with this code to read file?

I have been trying to read a file called "perlthisfile.txt" which is basically the output of nmap on my computer.
I want to get only the ip addresses printed out, so i wrote the following code but it is not working:
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
print"\n running \n";
open (MYFILE, 'perlthisfile.txt') or die "Cannot open file\n";
while(<MYFILE>) {
chomp;
my #value = split(' ', <MYFILE>);
print"\n before foreach \n";
foreach my $val (#value) {
if (looks_like_number($val)) {
print "\n looks like number block \n";
if ($val == /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,5})/) {
print "\n$val\n";
}
}
}
}
close(MYFILE);
exit 0;
And when i ran this code the output was:
running
before foreach
before foreach
looks like number block
before foreach
looks like number block
before foreach
looks like number block
My perlthisfile.txt:
Starting Nmap 6.00 ( http://nmap.org ) at 2013-10-16 22:59 EST
Nmap scan report for BoB2.iiNet (10.1.1.1)
Nmap scan report for android-fbff3c3812154cdc (10.1.1.3)
All 1000 scanned ports on android-fbff3c3812154cdc (10.1.1.3) are closed
Nmap scan report for 10.1.1.5
All 1000 scanned ports on 10.1.1.5 are open|filtered
Nmap scan report for 10.1.1.6
All 1000 scanned ports on 10.1.1.6 are closed
Several issues here. As #toolic said, calling <MYFILE> inside the split is probably not what you want - it will read the next record from the file, use $_ instead.
Also, you are using == with a regex, you should use the binding operator, =~ (== is only used for numeric comparisons in Perl):
if ($val =~ /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,5})/){
I suggest that looks_like_number is redundant if the regex works. I suspect that you are using it because == gives something like isn't numeric in numeric eq (==) depending on the version of perl you are using.
You had a few errors, one of which is regex which should have optional part for port number (: and following \d{1,5})
#!/usr/bin/perl
use strict;
use warnings;
open (my $MYFILE, '<', 'perlthisfile.txt') or die $!;
my $looks_like_ip = qr/( \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} (?: : \d{1,5})? )/x;
while (<$MYFILE>) {
chomp;
my #value = split;
print"\n before foreach \n";
foreach my $val (#value) {
if (my ($match) = $val =~ /$looks_like_ip/){
print "\n$match\n";
}
# else { print "$val doesn't contain IP\n" }
}
}
close($MYFILE) or warn $!;
If this is what it looks to be, which is a quick hack to extract IPs, you might get away with something simple such as:
perl -nlwe '/((?:\d+\.)+\d+)/ && print $1' perlthisfile.txt
Which is to say, not a very strict regex by any means, it just matches numbers joined by periods. If you'd like to only print unique IPs, you can make use of a hash to dedupe:
perl -nlwe '/((?:\d+\.)+\d+)/ && !$seen{$1}++ && print $1" perlthisfile.txt
With a slightly tighter regex that also matches port numbers:
perl -nlwe '/((?:\d+[\.:]){3,4}\d+)/ && print $1' perlthisfile.txt
This will disallow shorter chains of numbers, and allow for a port number.
This last regex explained:
/( # opening parenthesis, starts a string capture
(?: # a non-capturing parenthesis
\d+ # match a number, repeated one or more times
[\.:] # [ ... ] is a character class, it matches one of the literal
# characters inside it, and only one time
){3,4} # closing the non-capturing parenthesis, adding a quantifier
# that says this parenthesis can match 3 or 4 times
\d+ # match one or more numbers
)/x # close capturing parenthesis (added `/x` switch)
The /x switch is just so that you can use the above regex as-is, with comments and whitespace.
The logic behind this is simply: We want a string consisting of a number followed by a period or a colon. We want this string 3 or 4 times. End with another number.
The + and {3,4} are quantifiers, they dictate how many times the item to the left of it is supposed to match. By default, every item matches one time, but by using a quantifier you can change that. + is shorthand for {1,}, and you also have:
? -> {1,0}
* -> {0,}
The syntax is {min,max}, and when a number is missing, that means as many times as possible.

Summing a column of numbers in a text file using Perl

Ok, so I'm very new to Perl. I have a text file and in the file there are 4 columns of data(date, time, size of files, files). I need to create a small script that can open the file and get the average size of the files. I've read so much online, but I still can't figure out how to do it. This is what I have so far, but I'm not sure if I'm even close to doing this correctly.
#!/usr/bin/perl
open FILE, "files.txt";
##array = File;
while(FILE){
#chomp;
($date, $time, $numbers, $type) = split(/ /,<FILE>);
$total += $numbers;
}
print"the total is $total\n";
This is how the data looks in the file. These are just a few of them. I need to get the numbers in the third column.
12/02/2002 12:16 AM 86016 a2p.exe
10/10/2004 11:33 AM 393 avgfsznew.pl
11/01/2003 04:42 PM 38124 c2ph.bat
Your program is reasonably close to working. With these changes it will do exactly what you want
Always use use strict and use warnings at the start of your program, and declare all of your variables using my. That will help you by finding many simple errors that you may otherwise overlook
Use lexical file handles, the three-parameter form of open, and always check the return status of any open call
Declare the $total variable outside the loop. Declaring it inside the loop means it will be created and destroyed each time around the loop and it won't be able to accumulate a total
Declare a $count variable in the same way. You will need it to calculate the average
Using while (FILE) {...} just tests that FILE is true. You need to read from it instead, so you must use the readline operator like <FILE>
You want the default call to split (without any parameters) which will return all the non-space fields in $_ as a list
You need to add a variable in the assignment to allow for athe AM or PM field in each line
Here is a modification of your code that works fine
use strict;
use warnings;
open my $fh, '<', "files.txt" or die $!;
my $total = 0;
my $count = 0;
while (<$fh>) {
my ($date, $time, $ampm, $numbers, $type) = split;
$total += $numbers;
$count += 1;
}
print "The total is $total\n";
print "The count is $count\n";
print "The average is ", $total / $count, "\n";
output
The total is 124533
The count is 3
The average is 41511
It's tempting to use Perl's awk-like auto-split option. There are 5 columns; three containing date and time information, then the size and then the name.
The first version of the script that I wrote is also the most verbose:
perl -n -a -e '$total += $F[3]; $num++; END { printf "%12.2f\n", $total / ($num + 0.0); }'
The -a (auto-split) option splits a line up on white space into the array #F. Combined with the -n option (which makes Perl run in a loop that reads the file name arguments in turn, or standard input, without printing each line), the code adds $F[3] (the fourth column, counting from 0) to $total, which is automagically initialized to zero on first use. It also counts the lines in $num. The END block is executed when all the input is read; it uses printf() to format the value. The + 0.0 ensures that the arithmetic is done in floating point, not integer arithmetic. This is very similar to the awk script:
awk '{ total += $4 } END { print total / NR }'
First drafts of programs are seldom optimal — or, at least, I'm not that good a programmer. Revisions help.
Perl was designed, in part, as an awk killer. There is still a program a2p distributed with Perl for converting awk scripts to Perl (and there's also s2p for converting sed scripts to Perl). And Perl does have an automatic (built-in) variable that keeps track of the number of lines read. It has several names. The tersest is $.; the mnemonic name $NR is available if you use English; in the script; so is $INPUT_LINE_NUMBER. So, using $num is not necessary. It also turns out that Perl does a floating point division anyway, so the + 0.0 part was unnecessary. This leads to the next versions:
perl -MEnglish -n -a -e '$total += $F[3]; END { printf "%12.2f\n", $total / $NR; }'
or:
perl -n -a -e '$total += $F[3]; END { printf "%12.2f\n", $total / $.; }'
You can tune the print format to suit your whims and fancies. This is essentially the script I'd use in the long term; it is fairly clear without being long-winded in any way. The script could be split over multiple lines if you desired. It is a simple enough task that the legibility of the one-line is not a problem, IMNSHO. And the beauty of this is that you don't have to futz around with split and arrays and read loops on your own; Perl does most of that for you. (Granted, it does blow up on empty input; that fix is trivial; see below.)
Recommended version
perl -n -a -e '$total += $F[3]; END { printf "%12.2f\n", $total / $. if $.; }'
The if $. tests whether the number of lines read is zero or not; the printf and division are omitted if $. is zero so the script outputs nothing when given no input.
There is a noble (or ignoble) game called 'Code Golf' that was much played in the early days of Stack Overflow, but Code Golf questions are no longer considered good questions. The object of Code Golf is to write a program that does a particular task in as few characters as possible. You can play Code Golf with this and compress it still further if you're not too worried about the format of the output and you're using at least Perl 5.10:
perl -Mv5.10 -n -a -e '$total += $F[3]; END { say $total / $. if $.; }'
And, clearly, there are a lot of unnecessary spaces and letters in there:
perl -Mv5.10 -nae '$t+=$F[3];END{say$t/$.if$.}'
That is not, however, as clear as the recommended version.
#!/usr/bin/perl
use warnings;
use strict;
open my $file, "<", "files.txt";
my ($total, $cnt);
while(<$file>){
$total += (split(/\s+/, $_))[3];
$cnt++;
}
close $file;
print "number of files: $cnt\n";
print "total size: $total\n";
printf "avg: %.2f\n", $total/$cnt;
Or you can use awk:
awk '{t+=$4} END{print t/NR}' files.txt
Try doing this :
#!/usr/bin/perl -l
use strict; use warnings;
open my $file, '<', "my_file" or die "open error [$!]";
my ($total, $count);
while (<$file>){
chomp;
next if /^$/;
my ($date, $time, $x, $numbers, $type) = split;
$total += $numbers;
$count++;
}
print "the average is " . $total/$count . " and the total is $total";
close $file;
It is as simple as this:
perl -F -lane '$a+=$F[3];END{print "The average size is ".$a/$.}' your_file
tested below:
> cat temp
12/02/2002 12:16 AM 86016 a2p.exe
10/10/2004 11:33 AM 393 avgfsznew.pl
11/01/2003 04:42 PM 38124 c2ph.bat
Now the execution:
> perl -F -lane '$a+=$F[3];END{print "The average size is ".$a/$.}' temp
The average size is 41511
>
explanation:
-F -a says store the line in an array format.with the default separator as space or tab.
so nopw $F[3] has you size of the file.
sum up all the sizes in the 4th column untill all the lines are processed.
END will be executed after processing all the lines in the file.
so $. at the end will gives the number of lines.
so $a/$. will give the average.
This solution opens the file and loops through each line of the file. It then splits the file into the five variables in the line by splitting on 1 or more spaces.
open the file for reading, "<", and if it fails, raise an error or die "..."
my ($total, $cnt) are our column total and number of files added count
while(<FILE>) { ... } loops through each line of the file using the file handle and stores the line in $_
chomp removes the input record separator in $_. In unix, the default separator is a newline \n
split(/\s+/, $_) Splits the current line represented by$_, with the delimiter \s+. \s represents a space, the + afterward means "1 or more". So, we split the next line on 1 or more spaces.
Next we update $total and $cnt
#!/usr/bin/perl
open FILE, "<", "files.txt" or die "Error opening file: $!";
my ($total, $cnt);
while(<FILE>){
chomp;
my ($date, $time, $am_pm, $numbers, $type) = split(/\s+/, $_);
$total += $numbers;
$cnt++;
}
close FILE;
print"the total is $total and count of $cnt\n";`