Why doesn't '*' work as a perl regexp in my .Rbuildignore file? - perl

When I try to build a package with the following in my .Rbuildignore file,
*pdf
*Rdata
I get the errors:
Warning in readLines(ignore_file) :
incomplete final line found on '/home/user/project/.Rbuildignore'
and
invalid regular expression '*pdf'
I thought '*' was a wildcard for one or more characters?

There are two styles of pattern matching for files:
regular expressions. These are used for general string pattern matching. See ?regex
globs. These are typically used by UNIX shells. See ?Sys.glob
You seem to be thinking in terms of globs but .Rbuildignore uses regular expressions. To convert a glob to a regular expression try
> glob2rx("*pdf")
[1] "^.*pdf$"

See help(regex) for help on regular expressions, especially the Perl variant, and try
.*pdf
.*Rdata
instead. The 'dot' matches any character, and the 'star' says that it can repeat zero or more times. I just tried it on a package of mine and it did successfully ignore a PDF vignette, as asked.

In a perl regexp, use .*? as a wildcard.
But I think that what you actually want is pdf$ and Rdata$, because entries in .Rbuildignore also seem to affect files whose paths they match only partially. $ means "end of the path".

* is a quantifier that attaches to a previous expression to allow between 0 and infinite repetitions of it. Since you have not preceded the quantifier with an expression, this is an error.
. is an expression that matches any character. So I suspect that you want .*pdf, .*Rdata, etc.
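The points made in these answers can be checked in any regex engine. The sketch below uses Python's re module as a stand-in for R's engine (an assumption: error messages and details differ between the two), and fnmatch.translate as a rough analogue of glob2rx. The helper name is_valid_regex is made up for illustration.

```python
import fnmatch
import re


def is_valid_regex(pattern):
    """Return True if the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A leading '*' has no preceding expression to repeat, so it is rejected.
print(is_valid_regex("*pdf"))
# '.' gives the quantifier something to repeat.
print(is_valid_regex(".*pdf"))

# fnmatch.translate converts a glob into a regex, much like R's glob2rx.
glob_as_regex = fnmatch.translate("*pdf")
print(re.match(glob_as_regex, "guide.pdf") is not None)

# An end-anchored pattern also matches part of a longer path.
print(re.search(r"pdf$", "vignettes/guide.pdf") is not None)
```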

Oddities in fail2ban regex

This appears to be a bug in fail2ban, with different behaviour between the fail2ban-regex tool and a failregex filter
I am attempting to develop a new regex rule for fail2ban, to match:
\"%20and%20\"x\"%3D\"x
When using fail2ban-regex, this appears to produce the desired result:
^<HOST>.*GET.*\\"%20and%20\\"x\\"%3D\\"x.* 200.*$
As does this:
^<HOST>.*GET.*\\\"%20and%20\\\"x\\\"%3D\\\"x.* 200.*$
However, when I put either of these into a filter, I get the following error:
Failed during configuration: '%' must be followed by '%' or '(', found:…
To have this work in a filter you have to double-up the ‘%’, ie ‘%%’:
^<HOST>.*GET.*\\\"%%20and%%20\\\"x\\\"%%3D\\\"x.* 200.*$
While this gets the required hits running as a filter, it gets none running through fail2ban-regex.
I tried the \\\\ as Andre suggested below, but this gets no results in fail2ban-regex.
So, as this appears to be differential behaviour, I am going to file it as a bug.
According to Python's own documentation, a single literal backslash "\" has to be written as "\\\\" in the pattern string, and there is no mention of %.
Regular expressions use the backslash character ('\') to indicate
special forms or to allow special characters to be used without
invoking their special meaning. This collides with Python’s usage of
the same character for the same purpose in string literals; for
example, to match a literal backslash, one might have to write '\\\\'
as the pattern string, because the regular expression must be \\, and
each backslash must be expressed as \\ inside a regular Python string
literal.
I would just go with:
failregex = (?i)^<HOST> -.*"(GET|POST|HEAD|PUT).*20and.*3d.*$
The .* will match anything in between anyway, and (?i) makes the entire regex case-insensitive.
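On the '%' error quoted in the question: its wording matches the message that Python's configparser module raises during value interpolation. The sketch below only demonstrates configparser itself. That fail2ban reads filter files through the same mechanism, while fail2ban-regex given a bare regex does not, is an assumption that would explain the difference and is not confirmed here. The helper names and the shortened regex are illustrative.

```python
import configparser


def read_failregex(value):
    """Store value in a config section and read it back with interpolation."""
    parser = configparser.ConfigParser()
    parser.read_string("[Definition]\nfailregex = " + value + "\n")
    return parser.get("Definition", "failregex")


def interpolation_fails(value):
    """Return True if reading the value raises an interpolation syntax error."""
    try:
        read_failregex(value)
        return False
    except configparser.InterpolationSyntaxError:
        return True


# Shortened, illustrative version of the regex from the question.
single = r"^<HOST>.*GET.*%20and%20.* 200.*$"
doubled = single.replace("%", "%%")

print(interpolation_fails(single))        # a lone '%' is rejected on read
print(read_failregex(doubled) == single)  # '%%' is read back as one '%'
```

If that assumption holds, the doubled form is only correct inside the configuration file, and the single-% form is what the regex engine itself should see.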

Not able to understand a command in perl

I need help to understand what below command is doing exactly
$abc{hier} =~ s#/tools.*/dfII/?.*##g;
and $abc{hier} contains a path "/home/test1/test2/test3"
Can someone please let me know what the above command is doing exactly. Thanks
s/PATTERN/REPLACEMENT/ is Perl's substitution operator. It searches a string for text that matches the regex PATTERN and replaces it with REPLACEMENT.
By default, the substitution operator works on $_. To tell it to work on a different variable, you use the binding operator - =~.
The default delimiter used by the substitution operator is a slash (/) but you can change that to any other character. This is useful if your PATTERN or your REPLACEMENT contains a slash. In this case, the programmer has used # as the delimiter.
To recap:
$abc{hier} =~ s#PATTERN#REPLACEMENT#;
means "look for text in $abc{hier} that matches PATTERN and replace it with REPLACEMENT".
The substitution operator also has various options that change its behaviour. They are added by putting letters after the final delimiter. In this case we have a g. That means "make the substitution global" - or match and change all occurrences of PATTERN.
In your case, the REPLACEMENT string is empty (we have two # characters next to each other). So we're replacing the PATTERN with nothing - effectively deleting whatever matches PATTERN.
So now we have:
$abc{hier} =~ s#PATTERN##g;
And we know it means, "in the variable $abc{hier}, look for any string that matches PATTERN and replace it with nothing".
The last thing to look at is the PATTERN (or regular expression - "regex"). You can get the full definition of regexes in perldoc perlre. But to explain what we're using here:
/tools : is the fixed string "/tools"
.* : is zero or more of any character
/dfII : is the fixed string "/dfII"
/? : is an optional slash character
.* : is (again) zero or more of any character
So, basically, we're removing bits of a file path from a value that's stored in a hash.
This =~ means "Do a regex operation on that variable."
(Actually, as ikegami correctly reminds me, it is not necessarily only regex operations, because it could also be a transliteration.)
The operation in question is s#something#else#, which means replace the "something" with something "else".
The g at the end means "Do it for all occurrences of something."
Since the "else" is empty, the replacement has the effect of deleting.
The "something" is a definition according to regex syntax, roughly it means "Starting with '/tools' and later containing '/dfII', followed pretty much by anything until the end."
Note that the regex ends with /?.*. In detail, this means "a slash (/), or maybe not (?), and then absolutely anything (.) any number of times, including zero (*)". Strictly speaking, the "slash or not" part is unnecessary when it is followed by "anything, any number of times", because "anything" includes a slash and "any number of times" includes zero or one. In other words, the /? could be omitted without changing the behaviour.
(Thanks to ikegami for confirming.)
$abc{hier} =~ s#/tools.*/dfII/?.*##g;
The above command uses a regular expression to remove anything matching /tools.*/dfII or /tools.*/dfII/.* from the value of the hier member of the %abc hash.
It is pretty basic Perl, apart from the non-default regular expression delimiters (# instead of the usual /). They avoid having to escape every / inside the regular expression (s/\/tools.*\/dfII\/?.*//g).
My personal preferred style-guide would make it s{/tools.*/dfII/?.*}{}g .
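To see the effect of the pattern on concrete values, here is a minimal sketch using Python's re.sub as a stand-in for Perl's s###g (the pattern syntax used here behaves the same in both). The function name and the second path are hypothetical; only the first path comes from the question.

```python
import re

PATTERN = r"/tools.*/dfII/?.*"


def strip_tools_suffix(path):
    """Delete everything from '/tools' onward when '/dfII' follows later."""
    return re.sub(PATTERN, "", path)


# The path from the question contains neither '/tools' nor '/dfII',
# so nothing matches and the value is left unchanged.
print(strip_tools_suffix("/home/test1/test2/test3"))

# A hypothetical path that does match: the tail starting at '/tools' is removed.
print(strip_tools_suffix("/opt/cad/tools.lnx86/dfII/bin"))

# Dropping the optional '/?' gives the same result, as noted above.
print(re.sub(r"/tools.*/dfII.*", "", "/opt/cad/tools.lnx86/dfII/bin"))
```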

Extracting substring from inside bracketed string, where the substring may have spaces

I've got an application that has no useful api implemented, and the only way to get certain information is to parse string output. This is proving to be very painful...
I'm trying to achieve this in bash on SLES12.
Given I have the following strings:
QMNAME(QMTKGW01) STATUS(Running)
QMNAME(QMTKGW01) STATUS(Ended normally)
I want to extract the STATUS value, ie "Ended normally" or "Running".
Note that the line structure can move around, so I can't count on the "STATUS" being the second field.
The closest I have managed to get so far is to extract a single word from inside STATUS like so
echo "QMNAME(QMTKGW01) STATUS(Running)" | sed "s/^.*STATUS(\(\S*\)).*/\1/"
This works for "Running" but not for "Ended normally"
I've tried switching the \S* for [\S\s]* in both "grep -o" and "sed" but it seems to corrupt the entire regex.
This is purely a regex issue: \S matches only non-whitespace characters inside (..), so the failing case, which contains a space, cannot match. Keep it simple by explicitly listing the characters allowed inside (..) as [a-zA-Z ]*, i.e. zero or more upper- and lower-case letters and spaces.
sed 's/^.*STATUS(\([a-zA-Z ]*\)).*/\1/'
Or use character classes [:alnum:] if you want numbers too
sed 's/^.*STATUS(\([[:alnum:] ]*\)).*/\1/'
sed 's/.*STATUS(\([^)]*\)).*/\1/' file
Output:
Running
Ended normally
Extracting a substring matching a given pattern is a job for grep, not sed. We should use sed when we must edit the input string. (A lot of people use sed and even awk just to extract substrings, but that's wasteful in my opinion.)
So, here is a grep solution. We need to make some assumptions (in any solution) about your input - some are easy to relax, others are not. In your example the word STATUS is always capitalized, and it is immediately followed by the opening parenthesis (no space, no colon etc.). These assumptions can be relaxed easily. More importantly, and not easy to work around: there are no nested parentheses. You will want the longest substring of non-closing-parenthesis characters following the opening parenthesis, no matter what they are.
With these assumptions:
$ grep -oP '\bSTATUS\(\K[^)]*(?=\))' << EOF
> QMNAME(QMTKGW01) STATUS(Running)
> QMNAME(QMTKGW01) STATUS(Ended normally)
> EOF
Running
Ended normally
Explanation:
Command options: o to return only the matched substring; P to use Perl extensions (the \K marker and the lookahead). The regexp: we look for a word boundary (\b) - so the word STATUS is a complete word, not part of a longer word like SUBSTATUS; then the word STATUS and opening parenthesis. This is required for a match, but \K instructs that this part of the matched string will not be returned in the output. Then we seek zero or more non-closing-parenthesis characters ([^)]*) and we require that this be followed by a closing parenthesis - but the closing parenthesis is also not included in the returned string. That's a "lookahead" (the (?= ... ) construct).
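The same idea (a word boundary, the literal STATUS(, then everything up to the first closing parenthesis) can be sketched outside the shell. The version below uses Python's re module rather than grep or sed, with a capture group in place of \K and the lookahead. The function name extract_status is made up for illustration.

```python
import re

# \b keeps SUBSTATUS( from matching; [^)]* stops at the first ')'.
STATUS_RE = re.compile(r"\bSTATUS\(([^)]*)\)")


def extract_status(line):
    """Return the text inside STATUS(...), or None if the field is absent."""
    match = STATUS_RE.search(line)
    return match.group(1) if match else None


for line in (
    "QMNAME(QMTKGW01) STATUS(Running)",
    "QMNAME(QMTKGW01) STATUS(Ended normally)",
    "STATUS(Ended normally) QMNAME(QMTKGW01)",  # field order does not matter
):
    print(extract_status(line))
```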

Get prev directory path in a variable in linux

I am trying to get the parent directory of a given directory in a variable in linux script but I am unable to get it.
MN_CURR=/home/sshekhar/Desktop
MN_PREV=`$MN_CURR/..`
echo " Displayng $MN_PREV"
I am using CentOS. Can anyone please help?
Following on from my comment, when using POSIX shell, while the parameter expansions are limited compared to a more advanced shell such as bash, ksh, or zsh, POSIX shell does provide expansions to handle string length and substring removal.
In your case you want to remove the last component of the path (the suffix beginning with '/') leaving the parent directory. For that you can use:
MN_PREV=${MN_CURR%/*}
(which will remove all characters from the right -- up to and including the last '/')
The reference documentation for the POSIX shell parameter expansions can be found at POSIX Programmers Guide - 2.6.2 Parameter Expansion. The expansions concerning string length and substring removal are:
${#parameter}
String Length. The length in characters of the value of parameter shall be substituted. If parameter is '*' or '#', the result of the expansion is unspecified. If parameter is unset and set -u is in effect, the expansion shall fail.
${parameter%[word]}
Remove Smallest Suffix Pattern. The word shall be expanded to produce a pattern. The parameter expansion shall then result in parameter, with the smallest portion of the suffix matched by the pattern deleted. If present, word shall not begin with an unquoted '%'.
${parameter%%[word]}
Remove Largest Suffix Pattern. The word shall be expanded to produce a pattern. The parameter expansion shall then result in parameter, with the largest portion of the suffix matched by the pattern deleted.
${parameter#[word]}
Remove Smallest Prefix Pattern. The word shall be expanded to produce a pattern. The parameter expansion shall then result in parameter, with the smallest portion of the prefix matched by the pattern deleted. If present, word shall not begin with an unquoted '#'.
${parameter##[word]}
Remove Largest Prefix Pattern. The word shall be expanded to produce a pattern. The parameter expansion shall then result in parameter, with the largest portion of the prefix matched by the pattern deleted.
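To make the difference between the four removal expansions concrete, the sketch below emulates them in Python for the single patterns /* and */ only. It is an illustration of the semantics, not a general glob implementation, and the function names are hypothetical. In a shell script the expansions themselves are the right tool.

```python
def strip_shortest_suffix(path):
    """Emulates ${path%/*}: drop the last '/' and everything after it."""
    head, sep, _tail = path.rpartition("/")
    return head if sep else path


def strip_longest_suffix(path):
    """Emulates ${path%%/*}: drop the first '/' and everything after it."""
    head, sep, _tail = path.partition("/")
    return head if sep else path


def strip_shortest_prefix(path):
    """Emulates ${path#*/}: drop everything up to and including the first '/'."""
    _head, sep, tail = path.partition("/")
    return tail if sep else path


def strip_longest_prefix(path):
    """Emulates ${path##*/}: drop everything up to and including the last '/'."""
    _head, sep, tail = path.rpartition("/")
    return tail if sep else path


mn_curr = "/home/sshekhar/Desktop"
print(strip_shortest_suffix(mn_curr))  # the parent directory
print(strip_longest_prefix(mn_curr))   # the last path component
```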

Funky 'x' usage in perl

My usual 'x' usage was :
print("#" x 78, "\n");
Which repeats the string "#" 78 times. But recently I came across this code:
while (<>) { print if m{^a}x }
Which prints every line of input starting with an 'a'. I understand the regexp matching part (m{^a}), but I really don't see what that 'x' is doing here.
Any explanation would be appreciated.
It's a modifier for the regex. The x modifier tells perl to ignore whitespace and comments inside the regex.
In your example code it does not make a difference because there are no whitespace or comments in the regex.
The "x" in your first case, is a repetition operator, which takes the string as the left argument and the number of times to repeat as the right argument. Perl6 can replicate lists using the "xx" repetition operator.
Your second example uses the regular expression m{^a}x. While you may use many different types of delimiters, neophytes may like to use the familiar notation, which uses a forward slash: m/^a/x
The "x" in a regex is called a modifier or a flag and is but one of many optional flags that may be used. It is used to ignore whitespace in the regex pattern, but it also allows the use of normal comments inside. Because regex patterns can get really long and confusing, using whitespace and comments are very helpful.
Your example is very short (all it says is "the line starts with an 'a'"), so you probably wouldn't need whitespace or comments, but you could use them if you wanted to.
Example:
m/^a # first letter is an 'a'
# <-- you can put more regex on this line because whitespace is ignored
# <-- and more here if you want
/x
In this use case 'x' is a regex modifier which "Extends your pattern's legibility by permitting whitespace and comments." according to the perl documentation. However, it seems redundant here.
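Other regex engines offer the same facility. As a rough analogue (Python rather than Perl, so the spelling differs), re.VERBOSE plays the role of the x modifier, and the * operator on strings plays the role of Perl's x repetition operator. The names below are illustrative.

```python
import re

# re.VERBOSE ignores pattern whitespace and allows '#' comments,
# comparable to Perl's /x modifier.
STARTS_WITH_A = re.compile(
    r"""
    ^a   # first character of the line is an 'a'
    """,
    re.VERBOSE,
)


def lines_starting_with_a(lines):
    """Keep only the lines that begin with 'a'."""
    return [line for line in lines if STARTS_WITH_A.match(line)]


print(lines_starting_with_a(["apple", "banana", "avocado"]))

# String repetition, the counterpart of Perl's "#" x 78.
rule = "#" * 78
print(rule)
```

With a pattern this short, the verbose flag changes nothing, which is the same conclusion the answers reach for m{^a}x.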