How to capture a certain group of lines using perl regex? - perl

How to remove a trailing blank space in a regex substitution?
Here the Data
__Data__
Test - results
dkdkdkdkdkdkdkdkdkkkdkd
slsldldldldldldldldldll
Information
ddkdkdkeieieieieieieiei
eieieieieieieieieieieiei
Test - summary
dkdkdkdkdkdkdkdkkdkdkdk
dkdkdkdkdkdkdkdkdkdkdkk
What I would like to remove these lines shown above:
Information
ddkdkdkeieieieieieieiei
eieieieieieieieieieieiei
My attempt using regex expression
$/ = "__Data__";
$_ =~ s/^(Test.*[^\n]+)\n(Information.*[^\n]+)\n(Test.*Summary[^\n]+)/$1$3/ms;
print $_
The input of the data is the same as the output. In other words, nothing changes.

Why not this:
while (<DATA>) {
if ( m/^Information/..m/^Test/ ) {
next unless m/^Test/;
}
s{\s+$}{};
print "$_\n";
}

if you want to remove the information section use
$s =~ s/Information.*Test/Test/s;
if $s contains the data you gave then the following is returned
Test - results
dkdkdkdkdkdkdkdkdkkkdkd
slsldldldldldldldldldll
Test - summary
dkdkdkdkdkdkdkdkkdkdkdk
dkdkdkdkdkdkdkdkdkdkdkk
If you want the information only then use
$s =~ s/(.*?)(Information.*?)(Test.*)/$2/s;
In both cases notice the the "s" at the end of the substitution, that flag allows it to process multiple lines.

Related

Why is my last line is always output twice?

I have a uniprot document with a protein sequence as well as some metadata. I need to use perl to match the sequence and print it out but for some reason the last line always comes out two times. The code I wrote is here
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if($_=~m /^\s+(\D+)/) { #this is the pattern I used to match the sequence in the document
$seq=$1;
$seq=~s/\s//g;} #removing the spaces from the sequence
print $seq;
}
I instead tried $seq.=$1; but it printed out the sequence 4.5 times. Im sure i have made a mistake here but not sure what. Here is the input file https://www.uniprot.org/uniprot/P30988.txt
Here is your code reformatted and extra whitespace added between operators to make it clearer what scope the statements are running in.
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if ($_ =~ m /^\s+(\D+)/) {
$seq = $1;
$seq =~ s/\s//g;
}
print $seq;
}
The placement of the print command means that $seq will be printed for every line from the input file -- even those that don't match the regex.
I suspect you want this
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if ($_ =~ m /^\s+(\D+)/) {
$seq = $1;
$seq =~ s/\s//g;
# only print $seq for lines that match with /^\s+(\D+)/
# Also - added a newline to make it easier to debug
print $seq . "\n";
}
}
When I run that I get this
MRFTFTSRCLALFLLLNHPTPILPAFSNQTYPTIEPKPFLYVVGRKKMMDAQYKCYDRMQ
QLPAYQGEGPYCNRTWDGWLCWDDTPAGVLSYQFCPDYFPDFDPSEKVTKYCDEKGVWFK
HPENNRTWSNYTMCNAFTPEKLKNAYVLYYLAIVGHSLSIFTLVISLGIFVFFRSLGCQR
VTLHKNMFLTYILNSMIIIIHLVEVVPNGELVRRDPVSCKILHFFHQYMMACNYFWMLCE
GIYLHTLIVVAVFTEKQRLRWYYLLGWGFPLVPTTIHAITRAVYFNDNCWLSVETHLLYI
IHGPVMAALVVNFFFLLNIVRVLVTKMRETHEAESHMYLKAVKATMILVPLLGIQFVVFP
WRPSNKMLGKIYDYVMHSLIHFQGFFVATIYCFCNNEVQTTVKRQWAQFKIQWNQRWGRR
PSNRSARAAAAAAEAGDIPIYICHQELRNEPANNQGEESAEIIPLNIIEQESSA
You can simplify this a bit:
while (<IN>) {
next unless m/^\s/;
s/\s+//g;
print;
}
You want the lines that begin with whitespace, so immediately skip those that don't. Said another way, quickly reject things you don't want, which is different than accepting things you do want. This means that everything after the next knows it's dealing with a good line. Now the if disappears.
You don't need to get a capture ($1) to get the interesting text because the only other text in the line is the leading whitespace. That leading whitespace disappears when you remove all the whitespace. This gets rid of the if and the extra variable.
Finally, print what's left. Without an argument, print uses the value in the topic variable $_.
Now that's much more manageable. You escape that scoping issue with if causing the extra output because there's no scope to worry about.

Best way to parse string in perl

To achieve below task I have written below C like perl program (As I am new to Perl), But I am not sure if this is the best way to achieve.
Can someone please guide?
Note: Not with the full program, But where I can make improvement.
Thanks in advance
Input :
$str = "mail1, local<mail1#mail.local>, mail2#mail.local, <mail3#mail.local>, mail4 local<mail4#mail.local>"
Expected Output :
mail1, local<mail1#mail.local>
mail2#mail.local
<mail3#mail.local>
mail4, local<mail4#mail.local>
Sample Program
my $str="mail1, \#local<mail1\#mail.local>, mail2\#mail.local, <mail3\#mail.local>, mail4, local<mail4\#mail.local>";
my $count=0, #array, $flag=0, $tempStr="";
for my $c (split (//,$str)) {
if( ($count eq 0) and ($c eq ' ') ) {
next;
}
if($c) {
if( ($c eq ',') and ($flag eq 1) ) {
push #array, $tempStr;
$count=0;
$flag1=0;
$tempStr="";
next;
}
if( ($c eq '>' ) or ( $c eq '#' ) ) {
$flag=1;
}
$tempStr="$tempStr$c";
$count++;
}
}
if($count>0) {
push #array, $tempStr;
}
foreach my $var (#array) {
print "$var\n";
}
Edit:
Input:
Input is the output of above code.
Expected Output :
"mail1, local"<mail1#mail.local>
"mail4, local"<mail4#mail.local>
Sample Code:
$str =~ s/([^#>]+[#>][^,]+),\s*/$1\n/g;
my #addresses = split('\n',$str);
if(scalar #addresses) {
foreach my $address (#addresses) {
if (($address =~ /</) and ($address !~ /\"/) and ($address !~ /^</)){
$address="\"$address";
$address=~ s/</\"</g;
}
}
$str = join(',',#addresses);
}
print "$str\n";
As I see, you want to replace each:
comma and following spaces,
occurring after either # or >,
with a newline.
To make such replacement, instead of writing a parsing program, you can use
a regex.
The search part can be as follows:
([^#>]+[#>][^,]+),\s*
Details:
( - Start of the 1st capturing group.
[^#>]+ - A non-empty sequence of chars other than # or >.
[#>] - Either # or >.
[^,]+ - A non-empty sequence of chars other than a comma.
) - End of the 1st capturing group.
,\s* - A comma and optional sequence of spaces.
The replace part should be:
$1 - The 1st capturing group.
\n - A newline.
So the whole program, much shorter than yours, can be as follows:
my $str='mail1, local<mail1#mail.local>, mail2#mail.local, <mail3#mail.local>, mail4, local<mail4#mail.local>';
print "Before:\n$str\n";
$str =~ s/([^#>]+[#>][^,]+),\s*/$1\n/g;
print "After:\n$str\n";
To replace all needed commas I used g option.
Note that I put the source string in single quotes, otherwise Perl
would have complained about Possible unintended interpolation of #mail.
Edit
Your modified requirements must be handled different way.
"Ordinary" replacement is not an option, because now there are some
fragments to match and some framents to ignore.
So the basic idea is to write a while loop with a matching regex:
(\w+),?\s+(\w+)(<[^>]+>), meaning:
(\w+) - First capturing group - a sequence of word chars (e.g. mail1).
,?\s+ - Optional comma and a sequence of spaces.
(\w+) - Second capturing group - a sequence of word chars (e.g. local).
(<[^>]+>) - Third capturing group - a sequence of chars other than >
(actual mail address), enclosed in angle brackets, e.g. <mail1#mail.local>.
Within each execution of the loop you have access to the groups
captured in this particular match ($1, $2, ...).
So the content of this loop is to print all these captured groups,
with required additional chars.
The code (again much shorter than yours) should look like below:
my $str = 'mail1, local<mail1#mail.local>, mail2#mail.local, <mail3#mail.local>, mail4 local<mail4#mail.local>';
while ($str =~ /(\w+),?\s+(\w+)(<[^>]+>)/g) {
print "\"$1, $2\"$3\n";
}
Here is an approach using split, which in this case also needs a careful regex
use warnings;
use strict;
use feature 'say';
my $string = # broken into two parts for readabililty
q(mail1, local<mail1#mail.local>, mail2#mail.local, )
. q(<mail3#mail.local>, mail4, local<mail4#mail.local>);
my #addresses = split /#.+?\K,\s*/, $string;
say for #addresses;
The split takes a full regex in its delimiter specification. In this case I figure that each record is delimited by a comma which comes after the email address, so #.+?,
To match a pattern only when it is preceded by another brings to mind a negative lookbehind before the comma. But those can't be of variable length, which is precisely the case here.
We can instead normally match the pattern #.+? and then use the \K form (of the lookbehind) which drops all previous matches so that they are not taken out of the string. Thus the above splits on ,\s* when that is preceded by the email address, #... (what isn't consumed).
It prints
mail1, local<mail1#mail.local>
mail2#mail.local
<mail3#mail.local>
mail4, local<mail4#mail.local>
The edit asks about quoting the description preceding <...> when it's there. A simple way is to make another pass once addresses have been parsed out of the string as above. For example
my #addresses = split /#.+?\K,\s*/, $string; #/ stop syntax highlight
s/(.+?,\s*.+?)</"$1"</ for #addresses;
say for #addresses;
The regex in a loop is one way to change elements of an array. I use it for its efficiency (changes elements in place), conciseness, and as a demonstration of the following properties.
In a foreach loop the index variable (or $_) is an alias for the currently processed element – so changing it changes that element. This is a known source of bugs when allowed unknowingly, which was another reason to show it in the above form.
The statement also uses the statement modifier and it is equivalent to
foreach my $elem (#addresses) {
$elem =~ s/(.+?,\s*.+?)</"$1"</;
}
This is often considered a more proper way to write it but I find that the other form emphasizes more clearly that elements are being changed, when that is the sole purpose of the foreach.

Regular expression to print a string from a command outpout

I have written a function that uses regex and prints the required string from a command output.
The script works as expected. But it's does not support a dynamic output. currently, I use regex for "icmp" and "ok" and print the values. Now, type , destination and return code could change. There is a high chance that command doesn't return an output at all. How do I handle such scenarios ?
sub check_summary{
my ($self) = #_;
my $type = 0;
my $return_type = 0;
my $ipsla = $self->{'ssh_obj'}->exec('show ip sla');
foreach my $line( $ipsla) {
if ( $line =~ m/(icmp)/ ) {
$type = $1;
}
if ( $line =~ m/(OK)/ ) {
$return_type = $1;
}
}
INFO ($type,$return_type);
}
command Ouptut :
PSLAs Latest Operation Summary
Codes: * active, ^ inactive, ~ pending
ID Type Destination Stats Return Last
(ms) Code Run
-----------------------------------------------------------------------
*1 icmp 192.168.25.14 RTT=1 OK 1 second ago
Updated to some clarifications -- we need only the last line
As if often the case, you don't need a regex to parse the output as shown. You have space-separated fields and can just split the line and pick the elements you need.
We are told that the line of interest is the last line of the command output. Then we don't need the loop but can take the last element of the array with lines. It is still unclear how $ipsla contains the output -- as a multi-line string or perhaps as an arrayref. Since it is output of a command I'll treat it as a multi-line string, akin to what qx returns. Then, instead of the foreach loop
my #lines = split '\n', $ipsla; # if $ipsla is a multi-line string
# my #lines = #$ipsla; # if $ipsla is an arrayref
pop #lines while $line[-1] !~ /\S/; # remove possible empty lines at end
my ($type, $return_type) = (split ' ', $lines[-1])[1,4];
Here are some comments on the code. Let me know if more is needed.
We can see in the shown output that the fields up to what we need have no spaces. So we can split the last line on white space, by split ' ', $lines[-1], and take the 2nd and 5th element (indices 1 and 4), by ( ... )[1,4]. These are our two needed values and we assign them.
Just in case the output ends with empty lines we first remove them, by doing pop #lines as long as the last line has no non-space characters, while $lines[-1] !~ /\S/. That is the same as
while ( $lines[-1] !~ /\S/ ) { pop #lines }
Original version, edited for clarifications. It is also a valid way to do what is needed.
I assume that data starts after the line with only dashes. Set a flag once that line is reached, process the line(s) if the flag is set. Given the rest of your code, the loop
my $data_start;
foreach (#lines)
{
if (not $data_start) {
$data_start = 1 if /^\s* -+ \s*$/x; # only dashes and optional spaces
}
else {
my ($type, $return_type) = (split)[1,4];
print "type: $type, return code: $return_type\n";
}
}
This is a sketch until clarifications come. It also assumes that there are more lines than one.
I'm not sure of all possibilities of output from that command so my regular expression may need tweaking.
I assume the goal is to get the values of all columns in variables. I opted to store values in a hash using the column names as the hash keys. I printed the results for debugging / demonstration purposes.
use strict;
use warnings;
sub check_summary {
my ($self) = #_;
my %results = map { ($_,undef) } qw(Code ID Type Destination Stats Return_Code Last_Run); # Put results in hash, use column names for keys, set values to undef.
my $ipsla = $self->{ssh_obj}->exec('show ip sla');
foreach my $line (#$ipsla) {
chomp $line; # Remove newlines from last field
if($line =~ /^([*^~])([0-9]+)\s+([a-z]+)\s+([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)\s+([[:alnum:]=]+)\s+([A-Z]+)\s+([^\s].*)$/) {
$results{Code} = $1; # Code prefixing ID
$results{ID} = $2;
$results{Type} = $3;
$results{Destination} = $4;
$results{Stats} = $5;
$results{Return_Code} = $6;
$results{Last_Run} = $7;
}
}
# Testing
use Data::Dumper;
print Dumper(\%results);
}
# Demonstrate
check_summary();
# Commented for testing
#INFO ($type,$return_type);
Worked on the submitted test line.
EDIT:
Regular expressions allow you to specify patterns instead of the exact text you are attempting to match. This is powerful but complicated at times. You need to read the Perl Regular Expression documentation to really learn them.
Perl regular expressions also allow you to capture the matched text. This can be done multiple times in a single pattern which is how we were able to capture all the columns with one expression. The matches go into numbered variables...
$1
$2

Matching of data from output table

We need to match certain data element by element that is an output in tabular form obtained on the command prompt.The following is the approach being currently followed wherein the $Var contains the output. Is there an optimal way of doing this without directing the command output to file.
Please share your thoughts.
$Var = "iSCSI Storage LHN StgMgmt Name IP Name
==============================================================
0 Storage_1 15.178.209.194 admin
1 acct-mgmt 15.178.209.194 storage1
2 acct-mgmt2 15.178.209.194 storage2";
#tab = split("\n",$Var);
foreach (#tab) {
next if ($_ !~ /^\d/);
$_ =~ s/\s+//g;
$first=0 if($_ =~ /Storage/i && /15.178.209.194/);
push(#Array, $_); }
$_ =~ /Storage/i && /15.178.209.194/ is silly. That gets broken up like this: ($_ =~ /Storage/i) && (/15.178.209.194/). Either use $_ consistently or don't - the // and s/// operators automatically operate on $_.
Also you should know that in the regex /15.178.209.194/, the .s are being interpreted as any character. Either escape them or use the index() function.
Additionally, I would recommend that you separate each line using split(). This allows you to compare each individual column. You can use split() with a regex like so: #array = split(/\s+/, $string);.
Finally, I'm not really sure what $first is for, but I notice that all three sample lines in that input trigger $first=0 as they all contain that IP and the string "storage".
If I understand you correctly you want to invoke your script like this:
./some_shell_command | perl perl_script.pl
What you want to use is the Perl diamond operator <>:
#!/usr/bin/perl
use strict;
use warnings;
my $first;
my #Array;
for (<>) {
next unless /^\d/;
s/\s+/ /g;
$first = 0 if /Storage/i && /15.178.209.194/;
push(#Array, $_);
}
I've removed the redundant uses of $_ and fixed your substitution, since you probably don't want to remove all spaces.

How can i detect symbols using regular expression in perl?

Please how can i use regular expression to check if word starts or ends with a symbol character, also how to can i process the text within the symbol.
Example:
(text) or te-xt, or tex't. or text?
change it to
(<t>text</t>) or <t>te-xt</t>, or <t>tex't</t>. or <t>text</t>?
help me out?
Thanks
I assume that "word" means alphanumeric characters from your example? If you have a list of permitted characters which constitute a valid word, then this is enough:
my $string = "x1 .text1; 'text2 \"text3;\"";
$string =~ s/([a-zA-Z0-9]+)/<t>$1<\/t>/g;
# Add more to character class [a-zA-Z0-9] if needed
print "$string\n";
# OUTPUT: <t>x1</t> .<t>text1</t>; '<t>text2</t> "<t>text3</t>;"
UPDATE
Based on your example you seem to want to DELETE dashes and apostrophes, if you want to delete them globally (e.g. whether they are inside the word or not), before the first regex, you do
$string =~ s/['-]//g;
I am using DVK's approach here, but with a slight modification. The difference is that her/his code would also put the tags around all words that don't contain/are next to a symbol, which (according to the example given in the question) is not desired.
#!/usr/bin/perl
use strict;
use warnings;
sub modify {
my $input = shift;
my $text_char = 'a-zA-Z0-9\-\''; # characters that are considered text
# if there is no symbol, don't change anything
if ($input =~ /^[a-zA-Z0-9]+$/) {
return $input;
}
else {
$input =~ s/([$text_char]+)/<t>$1<\/t>/g;
return $input;
}
}
my $initial_string = "(text) or te-xt, or tex't. or text?";
my $expected_string = "(<t>text</t>) or <t>te-xt</t>, or <t>tex't</t>. or <t>text</t>?";
# version BEFORE edit 1:
#my #aux;
# take the initial string apart and process it one word at a time
#my #string_list = split/\s+/, $initial_string;
#
#foreach my $string (#string_list) {
# $string = modify($string);
# push #aux, $string;
#}
#
# put the string together again
#my $final_string = join(' ', #aux);
# ************ EDIT 1 version ************
my $final_string = join ' ', map { modify($_) } split/\s+/, $initial_string;
if ($final_string eq $expected_string) {
print "it worked\n";
}
This strikes me as a somewhat long-winded way of doing it, but it seemed quicker than drawing up a more sophisticated regex...
EDIT 1: I have incorporated the changes suggested by DVK (using map instead of foreach). Now the syntax highlighting is looking even worse than before; I hope it doesn't obscure anything...
This takes standard input and processes it to and prints on Standard output.
while (<>) {
s {
( [a-zA-z]+ ) # word
(?= [,.)?] ) # a symbol
}
{<t>$1</t>}gx ;
print ;
}
You might need to change the bit to match the concept of word.
I have use the x modifeid to allow the regexx to be spaced over more than one line.
If the input is in a Perl variable, try
$string =~ s{
( [a-zA-z]+ ) # word
(?= [,.)?] ) # a symbol
}
{<t>$1</t>}gx ;