Perl: How to extract a string between brackets - perl

I have a file in moinmoin text format:
* [[ Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[ Abiword Wordprocessor]] (2010/10/27 20:17)
* [[ Sylpheed E-Mail]] (2010/03/30 21:49)
* [[ Kupfer]] (2010/05/16 20:18)
All the words between the '[[' and ']]' are the short description of the entry. I need to extract the whole entry, but not each individual word.
I found an answer for a similar question here: https://stackoverflow.com/a/2700749/819596
but can't understand the answer: "my #array = $str =~ /( \{ (?: [^{}]* | (?0) )* \} )/xg;"
Anything that works will be accepted but explanations would help greatly, ie: what (?0) or /xg does.

The code probably will look like this:
use warnings;
use strict;
my #subjects; # declaring a lexical variable to store all the subjects
my $pattern = qr/
\[ \[ # matching two `[` signs
\s* # ... and, if any, whitespace after them
([^]]+) # starting from the first non-whitespace symbol, capture all the non-']' symbols
]]
/x;
# main processing loop:
while (<DATA>) { # reading the source file line by line
if (/$pattern/) { # if line is matched by our pattern
push #subjects, $1; # ... push the captured group of symbols into our array
}
}
print $_, "\n" for #subjects; # print our array of subject line by line
__DATA__
* [[ Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[ Abiword Wordprocessor]] (2010/10/27 20:17)
* [[ Sylpheed E-Mail]] (2010/03/30 21:49)
* [[ Kupfer]] (2010/05/16 20:18)
As I see, what you need can be described as follows: in each line of file try to find this sequence of symbols...
[[, an opening delimiter,
then 0 or more whitespace symbols,
then all the symbols that make a subject (which should be saved),
then ]], a closing delimiter
As you see, this description quite naturally translates into a regex. The only thing that is probably not needed is /x regex modifier, which allowed me to extensively comment it. )

If the text will never contain ], you can simply use the following as previously recommended:
/\[\[ ( [^\]]* ) \]\]/x
The following allows ] in the contained text, but I recommend against incorporating it into a larger pattern:
/\[\[ ( .*? ) \]\]/x
The following allows ] in the contained text, and is the most robust solution:
/\[\[ ( (?:(?!\]\]).)* ) \]\]/x
For example,
if (my ($match) = $line =~ /\[\[ ( (?:(?!\]\]).)* ) \]\]/x) {
print "$match\n";
}
or
my #matches = $file =~ /\[\[ ( (?:(?!\]\]).)* ) \]\]/xg;
/x: Ignore whitespace in pattern. Allows spaces to be added to make the pattern readable without changing the meaning of the pattern. Documented in perlre.
/g: Find all matches. Documented in perlop.
(?0) was used to make the pattern recursive since the linked node had to deal with arbitrary nesting of curlies. * /g: Find all matches. Documented in perlre.

\[\[(.*)]]
\[ is a literal [,
] is a literal ],
.* means every sequence of 0 or more character,
something enclosed in parentheses is a capturing group, hence you can access it later in your script with $1 (or $2 .. $9 depending on how many groups you have).
Put all together you will match two [ then everything up to the last occurrence of two successive ]
Update
On a second read of your question I suddenly are confused, do you need the content between [[ and ]], or the whole line - in that case leave the parentheses out completely and just test if the pattern matches, no need to capture.

The answer you found is for recursive pattern matching, that i think you don't need.
/x allows to use meaningless spaces and comments in the regexp.
/g runs the regexp through all the string. Without it runs only till the first match.
/xg is /x and /g combined.
(?0) runs the regexp itself again (recursion)
If i understand ok, you need something like this:
$text="* [[ Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[ Abiword Wordprocessor]] (2010/10/27 20:17)
* [[ Sylpheed E-Mail]] (2010/03/30 21:49)
* [[ Kupfer]] (2010/05/16 20:18)
";
#array=($text=~/\[\[([^\]]*)\]\]/g);
print join(",",#array);
# this prints " Virtualbox Guest Additions, Abiword Wordprocessor, Sylpheed E-Mail, Kupfer"

I would recommend using "extract_bracketed" or "extract_delimited" from module Text::Balanced - see here: http://perldoc.perl.org/Text/Balanced.html

perl -pe 's/.*\[\[(.*)\]\].*/\1/g' temp
tested below:
> cat temp
* [[ Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[ Abiword Wordprocessor]] (2010/10/27 20:17)
* [[ Sylpheed E-Mail]] (2010/03/30 21:49)
* [[ Kupfer]] (2010/05/16 20:18)
>
> perl -pe 's/.*\[\[(.*)\]\].*/\1/g' temp
Virtualbox Guest Additions
Abiword Wordprocessor
Sylpheed E-Mail
Kupfer
>
s/.[[(.)]].*/\1/g
.*[[->match any charater till [[
(.*)]] store any charater after the string "[[" till "]]" in \1
.*->matching the rest of the line.
then since we have our data in \1 we can simply use it for printing on the console.

my #array = $str =~ /( \{ (?: [^{}]* | (?0) )* \} )/xg;
The 'x' flag means that whitespace is ignored in the regex, to allow for a more readable expression. The 'g' flag means that the result will be a list of all matches from left to right (match *g*lobally).
The (?0) represents the regular expression inside the first group of parentheses. It's a recursive regular expression, equivalent to a set of rules such as:
E := '{' ( NoBrace | E) '}'
NoBrace := [^{}]*

Related

Matching special character (###!~`%^&()[]}{;') and replace it with _ (underscore) in perl

I want to remove all special characters except this 2 character .-
$name=~s/[^\w\d\.-]/_/g ;
But the line above it not only removes the special character but also non-alphabet characters e.g Arabic or other none alphabet characters.
How to remove only these characters (###!~`%^&()[]}{;',)
There are a few things to consider here.
First, do \d and \w really do what you think they do? Recent perls are Unicode aware (and in some cases locale aware), and those character classes aren't the same in every situation.
Since you know what you want to exclude, you can just put that directly into the character class. You need escape only the ] so it doesn't end the character class:
use v5.10;
my $name = "(Hello] #&^% {World[} (###!~`%^&()[]}{;',)!";
$name =~ s/[(###!~`%^&()[\]}{;',)]/_/g;
say $name;
Mark Jason Dominus has written about the "American" and "Prussian" approaches to cleansing data. You can specify what to exclude, or what to include.
If you specify the things to exclude, you potentially pass through some things that you should have excluded but did not. This may be because you forgot or didn't even know you should exclude it. These unintended situations may bite you.
If you specify only the things that are safe, you potentially miss out on things you should pass through, but bad things don't get through by mistakes of omission.
You then might try this, where you don't use the character class shortcuts:
$name =~ s/[^0-9A-Za-z.-]/_/g;
But the output is a bit weird because this also replaces whitespace. You might add the \s shortcut in there:
$name =~ s/[^0-9A-Za-z\s.-]/_/g;
But the meaning of \s has also changed over time too (vertical tab!) and is also Unicode aware. You could list the whitespace you would accept:
$name =~ s/[^0-9A-Za-z\x20.-]/_/g;
But no this is getting a bit weird. There's another way. You can go back to the ASCII versions of the character class shortcuts with the /a flag:
$name =~ s/[^\d\w\s.-]/_/ga;
The regex operator flags are in perlop since they apply to an operator. But, for as long as I've been using Perl and telling that to people in classes, someone I still go to perlre first.
Transliterate
Second, the substitution operator may be more than you need though. If you want to change single characters into other single characters, the transliteration operator may be what you need. It changes the character on the left with the corresponding character on the right:
$name =~ tr/abc/XYZ/; # a -> X, b -> Y, c -> Z
If you don't have enough characters to match up on the right, it reuses the last character:
$name =~ tr/abc/XY/; # a -> X, b -> Y, c -> Y
So, in your case with one underscore:
$name =~ tr/##!~`%^&()[]}{;',/_/;
Since the sequence of characters in tr/// aren't a regular expression, you don't worry about metacharacters.
Just for giggles
If this pattern is something you want to use in multiple places, you might want to give it a name with a user-defined Unicode property. Once it has a name, you use that everywhere and can update for everyone at the same time:
use v5.10;
my $name = "(Hello] #&^% {World[} (###!~`%^&()[]}{;',)!";
$name =~ s/\p{IsForbidden}/_/g;
say $name;
sub IsForbidden {
# see https://perldoc.perl.org/perlunicode#User-Defined-Character-Properties
state $exclude = q|##!~`%^&()[]}{;',|;
state $string =
join '',
map { sprintf "%X\n", ord }
split( //, $exclude );
return $string;
}
Building on Gene's comment, specify what you want to replace but I'd escape each special character. Note, to replace #, use \#\# in character array as shown in line 2:
$name = "# # R ! ~## ` % ^ & ( O ){{();,'`## { } ;!!! ' N , ";
$name =~ s/[\#\!\~\`\%\&\^\(\)\{\}\;\'\,\#\#]//g;
$name =~ s/ *//g;
print $name;
### Outputs RON

Can somebody explain this obfuscated perl regexp script?

This code is taken from the HackBack DIY guide to rob banks by Phineas Fisher. It outputs a long text (The Sixth Declaration of the Lacandon Jungle). Where does it fetch it? I don't see any alphanumeric characters at all. What is going on here? And what does the -r switch do? It seems undocumented.
perl -Mre=eval <<\EOF
''
=~(
'(?'
.'{'.(
'`'|'%'
).("\["^
'-').('`'|
'!').("\`"|
',').'"(\\$'
.':=`'.(('`')|
'#').('['^'.').
('['^')').("\`"|
',').('{'^'[').'-'.('['^'(').('{'^'[').('`'|'(').('['^'/').('['^'/').(
'['^'+').('['^'(').'://'.('`'|'%').('`'|'.').('`'|',').('`'|'!').("\`"|
'#').('`'|'%').('['^'!').('`'|'!').('['^'+').('`'|'!').('['^"\/").(
'`'|')').('['^'(').('['^'/').('`'|'!').'.'.('`'|'%').('['^'!')
.('`'|',').('`'|'.').'.'.('`'|'/').('['^')').('`'|"\'").
'.'.('`'|'-').('['^'#').'/'.('['^'(').('`'|('$')).(
'['^'(').('`'|',').'-'.('`'|'%').('['^('(')).
'/`)=~'.('['^'(').'|</'.('['^'+').'>|\\'
.'\\'.('`'|'.').'|'.('`'|"'").';'.
'\\$:=~'.('['^'(').'/<.*?>//'
.('`'|"'").';'.('['^'+').('['^
')').('`'|')').('`'|'.').(('[')^
'/').('{'^'[').'\\$:=~/('.(('{')^
'(').('`'^'%').('{'^'#').('{'^'/')
.('`'^'!').'.*?'.('`'^'-').('`'|'%')
.('['^'#').("\`"| ')').('`'|'#').(
'`'|'!').('`'| '.').('`'|'/')
.'..)/'.('[' ^'(').'"})')
;$:="\."^ '~';$~='#'
|'(';$^= ')'^'[';
$/='`' |'.';
$,= '('
EOF
The basic idea of the code you posted is that each alphanumeric character has been replaced by a bitwise operation between two non-alphanumeric characters. For instance,
'`'|'%'
(5th line of the "star" in your code)
Is a bitwise or between backquote and modulo, whose codepoints are respectively 96 and 37, whose "or" is 101, which is the codepoint of the letter "e". The following few lines all print the same thing:
say '`' | '%' ;
say chr( ord('`' | '%') );
say chr( ord('`') | ord('%') );
say chr( 96 | 37 );
say chr( 101 );
say "e"
Your code starts with (ignore whitespaces which don't matter):
'' =~ (
The corresponding closing bracket is 28 lines later:
^'(').'"})')
(C-f this pattern to see it on the web-page; I used my editor's matching parenthesis highlighting to find it)
We can assign everything in between the opening and closing parenthesis to a variable that we can then print:
$x = '(?'
.'{'.(
'`'|'%'
).("\["^
'-').('`'|
'!').("\`"|
',').'"(\\$'
.':=`'.(('`')|
'#').('['^'.').
('['^')').("\`"|
',').('{'^'[').'-'.('['^'(').('{'^'[').('`'|'(').('['^'/').('['^'/').(
'['^'+').('['^'(').'://'.('`'|'%').('`'|'.').('`'|',').('`'|'!').("\`"|
'#').('`'|'%').('['^'!').('`'|'!').('['^'+').('`'|'!').('['^"\/").(
'`'|')').('['^'(').('['^'/').('`'|'!').'.'.('`'|'%').('['^'!')
.('`'|',').('`'|'.').'.'.('`'|'/').('['^')').('`'|"\'").
'.'.('`'|'-').('['^'#').'/'.('['^'(').('`'|('$')).(
'['^'(').('`'|',').'-'.('`'|'%').('['^('(')).
'/`)=~'.('['^'(').'|</'.('['^'+').'>|\\'
.'\\'.('`'|'.').'|'.('`'|"'").';'.
'\\$:=~'.('['^'(').'/<.*?>//'
.('`'|"'").';'.('['^'+').('['^
')').('`'|')').('`'|'.').(('[')^
'/').('{'^'[').'\\$:=~/('.(('{')^
'(').('`'^'%').('{'^'#').('{'^'/')
.('`'^'!').'.*?'.('`'^'-').('`'|'%')
.('['^'#').("\`"| ')').('`'|'#').(
'`'|'!').('`'| '.').('`'|'/')
.'..)/'.('[' ^'(').'"})';
print $x;
This will print:
(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})
The remaining of the code is a bunch of assignments into some variables; probably here only to complete the pattern: the end of the star is:
$:="\."^'~';
$~='#'|'(';
$^=')'^'[';
$/='`'|'.';
$,='(';
Which just assigns simple one-character strings to some variables.
Back to the main code:
(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})
This code is inside a regext which is matched against an empty string (don't forget that we had first '' =~ (...)). (?{...}) inside a regex runs the code in the .... With some whitespaces, and removing the string within the eval, this gives us:
# fetch an url from the web using curl _quitely_ (-s)
($: = `curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)
# replace end of paragraphs with newlines in the HTML fetched
=~ s|</p>|\n|g;
# Remove all HTML tags
$: =~ s/<.*?>//g;
# Print everything between SEXTA and Mexicano (+2 chars)
print $: =~ /(SEXTA.*?Mexicano..)/s
You can automate this unobfuscation process by using B::Deparse: running
perl -MO=Deparse yourcode.pl
Will produce something like:
'' =~ m[(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})];
$: = 'P';
$~ = 'h';
$^ = 'r';
$/ = 'n';
$, = '(';

how can i use perl to calculate the frequency of a variable

PASS AC=0;AF=0.048;
AN=2;
ASP;
BaseQRankSum=0.572;
CAF=[0.9605,.,0.03949];
CLNACC=RCV000111759.1,RCV000034730
I'm a new here.I want to know how to match CAF = [0.9605,.,0.03949] using regular expression,thank you.
while (<>) {
if (
/^CAF= # start of line, then literal 'CAF='
\[ # literal '['
[^\]]+ # 1+ characters different from ']'
\]; # closing ']'
/x
)
{
print;
}
}
The /x modifier allows for linebreaks and comments in the regex (to improve readability).
Or, as a one liner:
perl -ne 'print if (/^CAF=\[[^\]]+\];/);' <your_file>
This prints the complete lines containing the desired pattern.
You need to read the documentation for Perl regex. What you are asking doesn't look more complex than a beginner could match having read the docs:
http://perldoc.perl.org/perlre.html

How to truncate the extension for special case in Perl?

I'm working on a script to truncate all the extensions for a file using the regex as below but it seem doesn't works well as this command does remove some data that I want as it will basically removing everything whenever it see a dot.
The regex I use currently:-
/\..*?$/
It would remove some files like
b10_120.00c.current.all --> b10_120
abc_10.77.log.bac.temp.ls --> abc_10
but I'm looking for an output in b10_120.00c and abc_10.77
Aside from that, is there a way to printout the output such as it keep certain extension only? Such as for the above 2 examples, it will displays b10_120.00c.current and abc_10.77.log. Thank you very much.
The following will strip file name extensions off:
s/\.[^.]+$//;
Explanation
\. matches a literal .
[^.]+ matches every character that is not a .
$ till end of string
Update
my ($new_file_name) = ( $file_name =~ m/^( [^.]+ \. [^.]+ )/x );
Explanation
^ anchor at the start of the string
[^.]+ matches every character that is not a .
\. matches a literal .
[^.]+ matches every character that is not a .
Test
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More 'tests' => 2;
my %file_name_map = (
'b10_120.00c.current.all' => 'b10_120.00c',
'abc_10.77.log.bac.temp.ls' => 'abc_10.77',
);
sub new_file_name {
my $file_name = shift;
my ($new_file_name) = ( $file_name =~ m/^( [^.]+ \. [^.]+ )/x );
return $new_file_name;
}
for my $file_name ( keys %file_name_map ) {
is $file_name_map{$file_name}, new_file_name($file_name),
"Got $file_name_map{$file_name}";
}
$file =~ s/(\.[^.]+).*/$1/; # SO requires 30 chars in answer, that is stupid
You should use \. for the dot in the regular expression.
Also please explain in more details how you want to process file name.
Instead of a regex, I would suggest using this package:
http://perldoc.perl.org/File/Basename.html

regular expression is not working

my $pat = '^x.*d$';
my $dir = '/etc/inet.d';
if ( $dir =~ /$pat/xmsg ) {
print "found ";
}
how to make it sucess
Your pattern is looking for strings starting with x (^x) and ending in d (d$). The path you are trying does not match as it doesn't start with x.
You can use YAPE::Regex::Explain to help you understand regular expressions:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/^x.*d$/xms;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?msx-i:^x.*d$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?msx-i: group, but do not capture (with ^ and $
matching start and end of line) (with .
matching \n) (disregarding whitespace and
comments) (case-sensitive):
----------------------------------------------------------------------
^ the beginning of a "line"
----------------------------------------------------------------------
x 'x'
----------------------------------------------------------------------
.* any character (0 or more times (matching
the most amount possible))
----------------------------------------------------------------------
d 'd'
----------------------------------------------------------------------
$ before an optional \n, and the end of a
"line"
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
Also, you should not need the g modifier in this case. The documentation has plenty of information about regexes: perlre
There is an 'x' too much :
my $pat = '^.*d$';
my $dir = '/etc/inet.d';
if ( $dir =~ /$pat/xmsg ) {
print "found ";
}
My guess is that you're trying to list all files in /etc/init.d whose name matches the regular expression.
Perl isn't smart enough to figure out that when you name a string variable $dir, assign to it the full pathname of an existing directory, and pattern match against it, you don't intend to match against the pathname,
but against the filenames in that directory.
Some ways to fix this:
perldoc -f glob
perldoc -f readdir
perldoc File::Find
You may just want to use this:
if (glob('/etc/init.d/x*'))
{
warn "found\n";
}