How to escape Unicode escapes in Groovy's /pattern/ syntax

The following Groovy commands illustrate my problem.
First of all, this works (as seen on lotrepls.appspot.com) as expected (note that \u0061 is 'a').
>>> print "a".matches(/\u0061/)
true
Now let's say that we want to match \n, using the Unicode escape \u000A. The following, with the pattern written as a string literal, fails to compile:
>>> print "\n".matches("\u000A");
Interpreter exception: com.google.lotrepls.shared.InterpreterException:
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed,
Script1.groovy: 1: expecting anything but ''\n''; got it anyway
# line 1, column 21. 1 error
This is expected because in Java at least, Unicode escapes are processed early (JLS 3.3), so:
print "\n".matches("\u000A")
really is the same as:
print "\n".matches("
")
The fix is to escape the Unicode escape, and let the regex engine process it, as follows:
>>> print "\n".matches("\\u000A")
true
Now here's the question part: how can we get this to work with the Groovy /pattern/ syntax instead of a string literal?
Here are some failed attempts:
>>> print "\n".matches(/\u000A/)
Interpreter exception: com.google.lotrepls.shared.InterpreterException:
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed,
Script1.groovy: 1: expecting EOF, found '(' # line 1, column 19.
1 error
>>> print "\n".matches(/\\u000A/)
false
>>> print "\\u000A".matches(/\\u000A/);
true

~"[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]"
Appears to be working as it should. According to the docs I've seen, the double backslashes shouldn't be required with a slashy string, so I don't know why the compiler's not happy with them.

Firstly, it seems Groovy has changed in this regard in the meantime: at least on https://groovyconsole.appspot.com/ and in a local Groovy shell, "\n".matches(/\u000A/) works fine and evaluates to true.
In case you run into a similar situation again, encode the backslash itself as a Unicode escape, as in "\n".matches(/\u005Cu000A/). The Unicode escape conversion turns \u005C back into a backslash, so the sequence \u000A reaches the regex parser intact.
Another option is to separate the backslash from the u, for example with "\n".matches(/${'\\'}u000A/) or "\n".matches('\\' + /u000A/).

Related

Oddities in fail2ban regex

This appears to be a bug in fail2ban, with different behaviour between the fail2ban-regex tool and a failregex filter
I am attempting to develop a new regex rule for fail2ban, to match:
\"%20and%20\"x\"%3D\"x
When using fail2ban-regex, this appears to produce the desired result:
^<HOST>.*GET.*\\"%20and%20\\"x\\"%3D\\"x.* 200.*$
As does this:
^<HOST>.*GET.*\\\"%20and%20\\\"x\\\"%3D\\\"x.* 200.*$
However, when I put either of these into a filter, I get the following error:
Failed during configuration: '%' must be followed by '%' or '(', found:…
To make this work in a filter you have to double up the '%', i.e. '%%':
^<HOST>.*GET.*\\\"%%20and%%20\\\"x\\\"%%3D\\\"x.* 200.*$
While this gets the required hits running as a filter, it gets none running through fail2ban-regex.
I tried the \\\\ as Andre suggested below, but this gets no results in fail2ban-regex.
So, as this appears to be differential behaviour, I am going to file it as a bug.
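The error message quoted above has the same wording as the one raised by Python's standard configparser when it interpolates a value containing a bare '%'. A minimal sketch of that behaviour follows, assuming the filter file is read with configparser-style interpolation; it uses plain configparser rather than fail2ban's own reader, and the helper name read_failregex and the shortened regex are made up for illustration.

```python
import configparser


def read_failregex(raw_value):
    """Parse a one-option [Definition] section and return the interpolated value."""
    parser = configparser.ConfigParser()  # BasicInterpolation is the default
    parser.read_string("[Definition]\nfailregex = " + raw_value + "\n")
    return parser.get("Definition", "failregex")


# Shortened, hypothetical version of the regex from the question.
single = r'^<HOST>.*GET.*\\"%20and%20\\"x.* 200.*$'
doubled = single.replace("%", "%%")

try:
    read_failregex(single)
    single_error = None
except configparser.InterpolationSyntaxError as exc:
    single_error = str(exc)

print(single_error)             # complains that '%' must be followed by '%' or '('
print(read_failregex(doubled))  # '%%' collapses back to a single '%'
```

If that assumption holds, a regex passed directly on the fail2ban-regex command line would skip this interpolation step, which would explain why '%%' only matches when it is read from a filter file.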

According to Python's own site a single backslash "\" has to be written as "\\\\" and there's no mention of %.
Regular expressions use the backslash character ('\') to indicate
special forms or to allow special characters to be used without
invoking their special meaning. This collides with Python's usage of
the same character for the same purpose in string literals; for
example, to match a literal backslash, one might have to write '\\\\'
as the pattern string, because the regular expression must be \\, and
each backslash must be expressed as \\ inside a regular Python string
literal.
I would just go with:
failregex = (?i)^<HOST> -.*"(GET|POST|HEAD|PUT).*20and.*3d.*$
The .* will match anything in between anyway, and (?i) makes the entire regex case-insensitive.
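The two layers of escaping described in the quoted passage can be checked in a few lines. This is only a sketch with made-up sample strings: it shows that four backslashes in an ordinary string literal and two in a raw string produce the same pattern, and that '%' needs no escaping at the regex level.

```python
import re

text = "a\\b"      # three characters: a, one backslash, b

plain = "\\\\"     # ordinary literal: four typed backslashes become the regex \\
raw = r"\\"        # raw literal: two typed backslashes are already the regex \\

# Both spellings produce the same two-character pattern.
same_pattern = plain == raw

# The pattern matches exactly one literal backslash in the text.
backslash_match = re.search(plain, text).group(0)

# '%' has no special meaning to the regex engine.
percent_match = re.search("%20and%20", "x%20and%20y").group(0)

print(same_pattern, len(backslash_match), percent_match)
```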

In Perl, why does variable interpolation fail for a hexadecimal escape sequence?

In Perl, if I run the code:
print "Literal Hex: \x{50} \n";
I get this: "Literal Hex: P"
However, if I run the code:
my $hex_num = 50;
print "Interpolated Hex: \x{$hex_num}";
The variable does not interpolate properly and I get this: "Interpolated Hex:"
Similar failure results when I attempt to use variable interpolation in unicode and octal escape sequences.
Is it possible to use escape sequences (e.g. \x, \N) with interpolated string variables? I was under the impression that a $variable contained within double quotes is always interpolated, but is this the exception?
Note: Thanks to this question, I am aware of the workaround: chr(hex($hex_num)), but my above questions regarding variable interpolation for escape sequences still stand.

Interpolation is not recursive; everything is interpolated just once, from left to right. Therefore, when \x{$hex} is being processed, the following applies (cited from perlop):
If there are no valid digits between the braces, the generated character is the NULL
character ("\x{00}").
Zero is really there:
perl -MO=Deparse -e '$h=50;print "<\x{$h}>"'
$h = 50;
print "<\000>";
-e syntax OK

You should put the complete escape sequence in your variable:
my $hex_num = "\x50";
print "Interpolated Hex: $hex_num", "\n";

The issue I had was adding an escaped var into another variable such as:
$MYVAR = "20";
$myQuery = "\x02\x12\x10\x$MYVAR\x10";
Tried a number of \\x, \Q\x and various other escape sequences to no avail!!!
My workaround was not a direct escape but converting the var prior to adding to the string.
$MYVAR = chr(hex(20));
Did quite a bit of searching for a direct regex solution but had to run with this in the end.

print<<EOF in perl- To print $

Here I want to print the '$' sign. How to do that?
#!/perl/bin/perl
print <<EOF;
This sign $ is called dollar
It's a multiline
string
EOF
This gives me the following result:
This sign is called dollar
It's a multiline
string
I want to print $.

Using EOF is equivalent to "EOF": the here document is interpolated as if in double quotes. Backslash the dollar sign (\$) or explicitly use single quotes to suppress interpolation.
print << 'EOF';
...
EOF

Running your code with use warnings turned on gives me this:
Name "main::is" used only once: possible typo at foo.pl line 8.
Use of uninitialized value $is in concatenation (.) or string at foo.pl line 8.
This sign called dollar
It's a multiline
string
As you can see, the is is gone from the sentence, and so is the dollar sign. The warnings tell me why: a variable $is was found inside the string. Since it was empty, it was replaced by the empty string. Because you did not have warnings turned on, this was done quietly.
The moral is: Always use use warnings. Also beneficial in this case would have been use strict, as it would have caused the script to fail compilation due to an undeclared variable $is.
As for how to fix it, I believe choroba has the solution in his answer.

Behavior of . and , operators in Perl for concatenation and parsing

I was trying to play with the . and , operators in Perl and got something weird which I was unable to figure out:
If I run this:
print hello . this,isatest, program
the output is:
hellothisisatestprogram
What I could understand is that it is treating both the text before and after the dot operator as string and concatenating them.
But what about the commas? Why is it getting omitted and not concatenated?

The first period (.) is treated as the concatenation operator. The subsequent commas separate the multiple arguments of print. The result is the same: all parts are printed one after another. If you want to print literal commas, enclose this,isatest, program in quotes ("this,isatest, program") to form a single argument.
http://perldoc.perl.org/functions/print.html

I think this is what you want:
perl -e 'print "hello"." this,isatest,program"."\n"'
Run above code and check the output. If it gives you desired output then I guess we have an explanation.

New line character in Scala

Is there a shorthand for a new line character in Scala? In Java (on Windows) I usually just use "\n", but that doesn't seem to work in Scala - specifically
val s = """abcd
efg"""
val s2 = s.replace("\n", "")
println(s2)
outputs
abcd
efg
in Eclipse,
efgd
(sic) from the command line, and
abcdefg
from the REPL (GREAT SUCCESS!)
String.format("%n") works, but is there anything shorter?

A platform-specific line separator is returned by
sys.props("line.separator")
This will give you either "\n" or "\r\n", depending on your platform. You can wrap that in a val as terse as you please, but of course you can't embed it in a string literal.
If you're reading text that's not following the rules for your platform, this obviously won't help.
References:
scala.sys package scaladoc (for sys.props)
java.lang.System.getProperties javadoc (for "line.separator")

Your Eclipse is making the newline marker the standard Windows \r\n, so you've got "abcd\r\nefg". The replace call turns it into "abcd\refg", and the Eclipse console treats the \r slightly differently from how the Windows shell does. The REPL just uses \n as the newline marker, so it works as expected.
Solution 1: change Eclipse to just use \n newlines.
Solution 2: don't use triple-quoted strings when you need to control newlines; use ordinary double-quoted strings and explicit \n characters.
Solution 3: use a more sophisticated regex to replace \r\n, \n, or \r.

Try this interesting construction :)
import scala.compat.Platform.EOL
println("aaa"+EOL+"bbb")

If you're sure the file's line separator is the one used by this OS, do the following:
s.replaceAll(System.lineSeparator, "")
Otherwise your regex should detect all of the following newline sequences: "\n" (Linux), "\r" (classic Mac OS), "\r\n" (Windows):
s.replaceAll("(\r\n)|\r|\n", "")
The second one is shorter and, I think, is more correct.
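The same alternation can be tried outside Scala. The sketch below is Python rather than Scala, and the helper name strip_newlines is made up. It applies the regex to all three line-ending styles and also shows what is left over when only "\n" is removed from Windows-style text.

```python
import re

# Longest alternative first, so "\r\n" is consumed as one separator.
NEWLINE = re.compile(r"\r\n|\r|\n")


def strip_newlines(text):
    """Remove every line separator, whichever platform produced it."""
    return NEWLINE.sub("", text)


samples = ["abcd\nefg", "abcd\r\nefg", "abcd\refg"]
results = [strip_newlines(s) for s in samples]

# Removing only "\n" from Windows-style text leaves a stray carriage return.
leftover = "abcd\r\nefg".replace("\n", "")

print(results)
print(repr(leftover))
```

A terminal that honours the leftover \r moves the cursor back to the start of the line, so "efg" overwrites "abc" and the visible result is "efgd", as reported for the command line in the question.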

var s = """abcd
efg""".stripMargin.replaceAll("[\n\r]","")

Use \r\n instead
