Strange regular expression [closed] - perl

This regular expression will match exactly one / and one . in a line. But why does it match? Can anyone clearly explain each character's role in this regular expression?
if ($fp =~ m{^[^/]*/[^/]*$} and $fp =~ m{^[^.]*.[^.]$})
{
print $fp;
}

if($fp =~ m{^[^/]*/[^/]*$} and $fp =~ m{^[^.]*.[^.]$}) {
            ^\  /^^\  /^^
            | |  || |  ||
            | |  || |  |+-- end of line
            | |  || |  +--- zero or more
            | |  || +------ any char but /
            | |  |+-------- one /
            | |  +--------- zero or more
            | +------------ any char but /
            +-------------- begin line
So it searches for:
the beginning of the line (^),
followed by zero or more occurrences (*) of any char but / ([^/]),
followed by a /,
followed by zero or more occurrences (*) of any char but / ([^/]),
followed by the end of the line ($).
The "." search is similar, and the 'if' triggers if both are true.
Note that [...] is a character class: it matches any one of the listed characters. For instance, [abc] matches either an 'a', a 'b', or a 'c'. If the first char is '^', the test is inverted, so [^/] is any char but '/'.

While the previous answers are correct in explaining the regex, they do fail to point out that the 2nd regex is actually broken. As written it will match
start of line
followed by zero-or-more non-. (dot) characters
followed by ANY character, except \n
followed by ONE non-. (dot) character
end of line
Proof:
$ echo "This should NOT match" | perl -ne 'print if m{^[^.]*.[^.]$}'
This should NOT match <--- INCORRECT MATCH
$ echo "This should. match" | perl -ne 'print if m{^[^.]*.[^.]$}'
<--- INCORRECT MIS-MATCH
$ echo "This should match.!" | perl -ne 'print if m{^[^.]*.[^.]$}'
This should match.! <-- CORRECT (by luck)
$ echo "This should match." | perl -ne 'print if m{^[^.]*.[^.]$}'
This should match. <-- CORRECT
Two fixes are needed:
the . needs to be escaped (\.)
the 2nd character class needs a * so that it matches zero or more characters
$ echo "This should NOT match" | perl -ne 'print if m{^[^.]*\.[^.]*$}'
<-- CORRECT
$ echo "This should. match" | perl -ne 'print if m{^[^.]*\.[^.]*$}'
This should. match <-- CORRECT
$ echo "This should match.!" | perl -ne 'print if m{^[^.]*\.[^.]*$}'
This should match.! <-- CORRECT
$ echo "This should match." | perl -ne 'print if m{^[^.]*\.[^.]*$}'
This should match. <-- CORRECT
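The same comparison can be run from a script instead of the shell. This is only a sketch: the helper name has_one_slash_and_one_dot and the sample strings 'dir/file.txt' and 'a/b/c.txt' are made up for illustration and do not come from the question.

```perl
use strict;
use warnings;

# Broken pattern from the question: unescaped '.' and no '*' on the last class.
my $broken = qr{^[^.]*.[^.]$};
# Corrected pattern: exactly one literal dot anywhere in the line.
my $fixed = qr{^[^.]*\.[^.]*$};

# Hypothetical helper: true when $s has exactly one '/' and exactly one '.'.
sub has_one_slash_and_one_dot {
    my ($s) = @_;
    return ($s =~ m{^[^/]*/[^/]*$} && $s =~ $fixed) ? 1 : 0;
}

# Print how each sample fares against the broken, fixed, and combined tests.
for my $s ('This should NOT match', 'This should. match', 'dir/file.txt', 'a/b/c.txt') {
    printf "%-22s broken=%d fixed=%d both=%d\n", $s,
        ($s =~ $broken ? 1 : 0), ($s =~ $fixed ? 1 : 0),
        has_one_slash_and_one_dot($s);
}
```

Compiling the patterns with qr{} keeps the two variants side by side, which makes the effect of the escaped dot and the added * easy to see.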

The first expression: m is the match operator, { opens the expression, ^ is the start of the line, [^/]* is any character other than '/' zero or more times, / is a literal '/', then [^/]* again, $ is the end of the line, and } closes the expression.

Related

How to surround a string in double quotes

I have a file with the following
firsttext=cat secondtext=dog thirdtext=mouse
and I want it to return this string:
"firsttext=cat" "secondtext=dog" "thirdtext=mouse"
I have tried this one-liner, but it gives me an error.
cat oneline | perl -ne 'print \"$_ \" '
Can't find string terminator '"' anywhere before EOF at -e line 1.
I don't understand the error. Why can't it just add the quotation marks?
Also, if I have a variable in this string, I want it to be interpolated like:
firsttext=${animal} secondtext=${othervar} thirdtext=mouse
Which should output
"firsttext=cat" "secondtext=dog" "thirdtext=mouse"
perl -lne '@f = map qq/"$_"/, split; print "@f";' oneline
What you want is this:
cat oneline | perl -ne 'print join " ", map { qq["$_"] } split'
The -n option only loops over the input line by line; it won't split on arbitrary whitespace unless other options (such as -a) are set, which is why the one-liner calls split itself.
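Written out as a script, the split/map/join idea from the one-liners might look like the sketch below. The sub name quote_fields is hypothetical, and the sample line is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every whitespace-separated field in double quotes.
sub quote_fields {
    my ($line) = @_;
    # split ' ' splits on runs of whitespace and ignores leading whitespace;
    # qq("...") builds the quoted field without any backslash escapes.
    return join ' ', map qq("$_"), split ' ', $line;
}

print quote_fields("firsttext=cat secondtext=dog thirdtext=mouse\n"), "\n";
```

Using qq() for the quoted string avoids having to escape double quotes, which is also what keeps the shell one-liners readable.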

PERL : Using Text::Wrap and specify the end of line

Yes, I'm re-writing cowsay :)
#!/usr/bin/perl
use Text::Wrap;
$Text::Wrap::columns = 40;
my $FORTUNE = "The very long sentence that will be outputted by another command and it can be very long so it is word-wrapped The very long sentence that will be outputted by another command and it can be very long so it is word-wrapped";
my $TOP = " _______________________________________
/                                       \\
";
my $BOTTOM = "\\_______________________________________/
";
print $TOP;
print wrap('| ', '| ', $FORTUNE) . "\n";
print $BOTTOM;
Produces this
 _______________________________________
/                                       \
| The very long sentence that will be
| outputted by another command and it
| can be very long so it is
| word-wrapped The very long sentence
| that will be outputted by another
| command and it can be very long so it
| is word-wrapped
\_______________________________________/
How can I get this ?
 _______________________________________
/                                       \
| The very long sentence that will be   |
| outputted by another command and it   |
| can be very long so it is             |
| word-wrapped The very long sentence   |
| that will be outputted by another     |
| command and it can be very long so it |
| is word-wrapped                       |
\_______________________________________/
I could not find a way in the documentation, but you can apply a small hack if you save the string. It is possible to assign a new line ending by using a package variable:
$Text::Wrap::separator = "|$/";
You also need to prevent the module from converting runs of spaces back into tabs on output, which would throw off the character count:
$Text::Wrap::unexpand = 0;
The separator is simply a pipe | followed by the input record separator $/ (most often a newline). This adds a pipe to the end of each line, but no padding spaces, which have to be added manually:
my $text = wrap('| ', '| ', $FORTUNE) . "\n";
$text =~ s/(^.+)\K\|/' ' x ($Text::Wrap::columns - length($1)) . '|'/gem;
print $text;
This matches each line from its beginning up to the trailing |, then inserts the padding by repeating a space (columns minus the length of the matched string) times. We use the /m modifier so that ^ matches at the start of every line inside the string, not only at the start of the string. .+ by itself does not match newlines, which means each match stays within a single line. The /e modifier evaluates the replacement part as code rather than treating it as a string.
Note that it is somewhat of a quick hack, so bugs are possible.
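A different way to get the right-hand border, shown here only as a sketch, is to wrap the bare text first and let sprintf pad each line to a fixed width. The sub name boxed_lines and the width of 37 are assumptions chosen to match the 39-underscore box above; the sketch does not use the separator hack.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical helper: wrap $text to at most $width characters per line and
# return each line framed as "| text<padding> |".
sub boxed_lines {
    my ($text, $width) = @_;
    # wrap() keeps lines at or below columns - 1 characters.
    local $Text::Wrap::columns  = $width + 1;
    # Do not turn runs of spaces into tabs on output.
    local $Text::Wrap::unexpand = 0;
    my @lines = split /\n/, wrap('', '', $text);
    # %-*s left-justifies the line in a field of $width characters.
    return map { sprintf '| %-*s |', $width, $_ } @lines;
}

my $fortune = "The very long sentence that will be outputted by another "
            . "command and it can be very long so it is word-wrapped";

print " ", "_" x 39, "\n";
print "/", " " x 39, "\\\n";
print "$_\n" for boxed_lines($fortune, 37);
print "\\", "_" x 39, "/\n";
```

Padding with sprintf covers every line, including the last one, which the separator approach leaves without a closing pipe.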
If you're willing to download a more powerful module, you can use Text::Format. It has a lot more options for customizing, but the most relevant one is rightFill which fills the rest of the columns in each line with spaces.
Unfortunately, you can't customize the left and right sides with non-space characters. You can use a workaround by doing regex substitutions, just as Text::NWrap does in its source code.
#!/usr/bin/env perl
use utf8;
use Text::Format;
chop(my $FORTUNE = "The very long sentence that will be outputted by another command and it can be very long so it is word-wrapped " x 2);
my $TOP = "/" . '‾'x39 . "\\\n";
my $BOTTOM = "\\_______________________________________/\n";
my $formatter = Text::Format->new({ columns => 37, firstIndent => 0, rightFill => 1 });
my $text = $formatter->format($FORTUNE);
$text =~ s/^/| /mg;
$text =~ s/\n/ |\n/mg;
print $TOP;
print $text;
print $BOTTOM;

Printing reverse complement of DNA in single-line Perl

I want to write a quick single-line perl script to produce the reverse complement of a sequence of DNA. The following isn't working for me, however:
$ cat sample.dna.sequence.txt | perl -ne '{while (<>) {$seq = $_; $seq =~ tr /atcgATCG/tagcTAGC/; $revComp = reverse($seq); print $revComp;}}'
Any suggestions? I'm aware that
tr -d "\n " < input.txt | tr "[ATGCatgcNn]" "[TACGtacgNn]" | rev
works in bash, but I want to do it with perl for the practice.
Your problem is that you're using both -n and while (<>) { }, so you end up with while (<>) { while (<>) { } }.
If you know how to do <file.txt, why did you switch to cat file.txt|?!
perl -0777ne's/[\n ]//g; tr/ATGCatgcNn/TACGtacgNn/; print scalar reverse $_;' input.txt
or
perl -0777pe's/[\n ]//g; tr/ATGCatgcNn/TACGtacgNn/; $_ = reverse $_;' input.txt
Or, if you don't need to remove the newlines (-l strips each line's newline before the reverse and adds it back on output):
perl -lpe'tr/ATGCatgcNn/TACGtacgNn/; $_ = reverse $_;' input.txt
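If you need to use cat, the following one-liner should work for you. Note the scalar: as an argument to print, reverse is in list context, where it would reverse a one-element list and leave the string itself unchanged.
ewolf@~ $cat foo.txt
atNgNt
gatcGn
ewolf@~ $cat foo.txt | perl -lne '$seq = $_; $seq =~ tr/atcgATCG/tagcTAGC/; print scalar reverse $seq'
aNcNat
nCgatc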
Considering the DNA sequences in single-line format in a multifasta file:
cat multifasta_file.txt | while IFS= read -r L; do if [[ $L == ">"* ]]; then echo "$L"; else echo "$L" | rev | tr "ATGCatgc" "TACGtacg"; fi; done > output_file.txt
If your multifasta file is not in single-line format, you can transform your file to single-line before using the command above, like this:
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < multifasta_file.txt > multifasta_file_singleline.txt
Then,
cat multifasta_file_singleline.txt | while IFS= read -r L; do if [[ $L == ">"* ]]; then echo "$L"; else echo "$L" | rev | tr "ATGCatgc" "TACGtacg"; fi; done > output_file.txt
Hope it is useful for someone. It took me some time to build it.
The problem is that you're using the -n flag, yet you've written your own loop. -n wraps your supplied code in a loop like while (<>) {...}. So the outer loop has already read the first line before your inner loop starts reading, and that first line is never processed; with one-line input, your loop hits EOF (end of file) immediately and prints nothing. You either need to remove the n from -ne or remove the while loop from your code.
Incidentally, a complete complement tr pattern, including ambiguous bases, is:
tr/ATGCBVDHRYKMatgcbvdhrykm/TACGVBHDYRMKtacgvbhdyrmk/
Ambiguous bases have complements too. For example, a V stands for an A, C, or G. Their complements are T, G, and C, which is represented by the ambiguous base B. Thus, V and B are complementary.
You don't need to include any N's or n's in your tr pattern (as was demonstrated in another answer) because the complement is the same and leaving them out will leave them untouched. It's just extra processing to put them in the pattern.
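Putting the pieces together, a script version might look like the following sketch. The sub name reverse_complement is hypothetical, and the sample sequences are the two lines from foo.txt above.

```perl
use strict;
use warnings;

# Hypothetical helper: complement every base (including ambiguity codes),
# then reverse the string. N, S and W are left out of the tr lists because
# they are their own complements.
sub reverse_complement {
    my ($seq) = @_;
    (my $comp = $seq) =~ tr/ATGCBVDHRYKMatgcbvdhrykm/TACGVBHDYRMKtacgvbhdyrmk/;
    # scalar forces string reversal; in list context reverse would only
    # reverse the order of its arguments.
    return scalar reverse $comp;
}

for my $seq (qw(atNgNt gatcGn)) {
    print "$seq -> ", reverse_complement($seq), "\n";
}
```

Copying the sequence before applying tr leaves the caller's string untouched, which matters if the original sequence is still needed afterwards.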

splitting on pipe character in perl

I have a little problem. I want to split a line at every pipe character found using the split operator. Like in this example.
echo "000001d17757274585d28f3e405e75ed|||||||||||1||||||||||||||||||||||||" | \
perl -ane '$data = $_ ; chop $data ; @d = split(/\|/ , $data) ; print $#d+1,"\n" ;'
I would expect an output of 36,
as awk splitting with the delimiter | returns 36, but instead I get 12, as if the split stopped at the 1 character in the line.
echo "000001d17757274585d28f3e405e75ed|||||||||||1|||||||||||||||||||||||||||||||||||||||" | \
awk -F"|" '{print NF}'
Any ideas? I have tried many ways of quoting the |, but without success.
Many thanks in advance.
According to the documentation for split:
By default, empty leading fields are preserved, and empty trailing ones are deleted.
You need to specify a negative limit to the split to get the trailing ones:
split(/\|/, $data, -1)
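The effect of the limit can be checked with a short script. This is a sketch: the helper name count_fields is made up, and the test line is built to the same shape as the first sample line in the question.

```perl
use strict;
use warnings;

# Same shape as the sample line: an ID, 11 pipes, a "1", then 24 pipes.
my $line = '000001d17757274585d28f3e405e75ed' . ('|' x 11) . '1' . ('|' x 24);

# Hypothetical helper: number of fields split produces for a given limit.
# A limit of 0 behaves like omitting the limit altogether.
sub count_fields {
    my ($text, $limit) = @_;
    my @fields = split /\|/, $text, $limit;
    return scalar @fields;
}

print count_fields($line, 0), "\n";    # 12: trailing empty fields are dropped
print count_fields($line, -1), "\n";   # 36: a negative limit keeps them
```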

How to strip characters within a filename?

I am having trouble on stripping characters within a filename.
For example:
1326847080_MUNDO-Cinco-Cosas-Que-Aprendimos-Del-Debate-De-Los-Republicanos-1.xml
1326836220_PLANETACNN-Una-Granja-De-Mariposas-Ayuda-A-Reducir-La-Tala-De-Bosques-En-Tanzania-3.xml
This is the output I want:
1326847080_MUNDO-1.xml
1326836220_PLANETACNN-3.xml
for i in *.xml
do
j=$(echo "$i" | sed -e 's/-.*-/-/')
echo mv "$i" "$j"
done
or in one line:
for i in *.xml; do echo mv "$i" "$(echo "$i" | sed -e 's/-.*-/-/')"; done
Remove echo to actually perform the mv command.
Or, without sed, using bash builtin pattern replacement:
for i in *.xml; do echo mv "$i" "${i//-*-/-}"; done
rename to the rescue, with Perl regular expressions. This command will show which moves will be made; just remove -n to actually rename the files:
$ rename -n 's/([^-]+)-.*-([^-]+)/$1-$2/' *.xml
1326836220_PLANETACNN-Una-Granja-De-Mariposas-Ayuda-A-Reducir-La-Tala-De-Bosques-En-Tanzania-3.xml renamed as 1326836220_PLANETACNN-3.xml
1326847080_MUNDO-Cinco-Cosas-Que-Aprendimos-Del-Debate-De-Los-Republicanos-1.xml renamed as 1326847080_MUNDO-1.xml
The regular expression explained:
Save the part up to (but excluding) the first dash as match 1.
Save the part after the last dash as match 2.
Replace the part from the start of match 1 to the end of match 2 with match 1, a dash, and match 2.
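The substitution can also be tried on plain strings before touching any files. In this sketch the sub name short_name is hypothetical, and nothing is renamed; it only prints the old and new names.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the part before the first dash and the part
# after the last dash, dropping everything in between.
sub short_name {
    my ($name) = @_;
    # The greedy .* runs up to the last dash, so $2 is the final segment.
    (my $new = $name) =~ s/([^-]+)-.*-([^-]+)/$1-$2/;
    return $new;
}

for my $file (
    '1326847080_MUNDO-Cinco-Cosas-Que-Aprendimos-Del-Debate-De-Los-Republicanos-1.xml',
    '1326836220_PLANETACNN-Una-Granja-De-Mariposas-Ayuda-A-Reducir-La-Tala-De-Bosques-En-Tanzania-3.xml',
) {
    print "$file -> ", short_name($file), "\n";
}
```

A name with fewer than two dashes does not match the pattern and is returned unchanged.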
Sorry for the late reply, but I only saw this today.
I think you are looking for the following:
input file ::
cat > abc
1326847080_MUNDO-Cinco-Cosas-Que-Aprendimos-Del-Debate-De-Los-Republicanos-1.xml
1326836220_PLANETACNN-Una-Granja-De-Mariposas-Ayuda-A-Reducir-La-Tala-De-Bosques-En-Tanzania-3.xml
code : (it's a bit too basic, even for my liking)
while read line
do
echo $line ;
fname=`echo $line | cut -d"-" -f1`;
lfield=`echo $line | sed -n 's/\-/ /gp' | wc -w`;
lname=`echo $line | cut -d"-" -f${lfield}`;
new_name="${fname}-${lname}";
echo "new name is :: $new_name";
done < abc ;
output ::
1326847080_MUNDO-Cinco-Cosas-Que-Aprendimos-Del-Debate-De-Los-Republicanos-1.xml
new name is :: 1326847080_MUNDO-1.xml
1326836220_PLANETACNN-Una-Granja-De-Mariposas-Ayuda-A-Reducir-La-Tala-De-Bosques-En-Tanzania-3.xml
new name is :: 1326836220_PLANETACNN-3.xml