How to express optional tokens in CUP

I'm using Java CUP to generate an LALR(1) parser for a tiny programming language. Is there a compact form to express optional non-terminals in a rule? For example, in the following rule
statement ::= IDENT WHITE EQ WHITE value WHITE SEMICOLON |
IDENT WHITE EQ WHITE value SEMICOLON |
IDENT WHITE EQ value SEMICOLON |
IDENT EQ value SEMICOLON |
block;
I repeat essentially the same thing four times, when it would be natural to write something like
statement ::= IDENT (WHITE) EQ (WHITE) value (WHITE) SEMICOLON |
block;

Powershell replace first instance of word starting on new line

So let's say I have a multi-line string as below.
#abc
abc def
abc
I want to replace only the first instance of abc that starts a line with xyz, ignoring any whitespace that might precede it (as in the example above).
So my replaced string should read
#abc
xyz def
abc
I'm not very good at regex, so I would appreciate suggestions. Thanks!
To do that, you need a regular expression that anchors to the beginning of a line, allows for any amount of leading whitespace, and uses a word boundary to make sure you do not replace part of a longer word.
$multilineText = @"
#abc
abc def
abc
"@
$toReplace = 'abc'
$replaceWith = 'xyz'
# create the regex string.
# Because the example `abc` in real life could contain characters that have special meaning in regex,
# you need to escape these characters in the `$toReplace` string.
$regexReplace = '(?m)^(\s*){0}\b' -f [regex]::Escape($toReplace)
# do the replacement and capture the result to write to a new file perhaps?
$result = ([regex]$regexReplace).Replace($multilineText, "`$1$replaceWith", 1)
# show on screen
$result
The above is case-sensitive; if you do not want that, simply change (?m) to (?mi) in the $regexReplace definition.
Output:
#abc
xyz def
abc
Regex details:
(?m) Match the remainder of the regex with the options: ^ and $ match at line breaks (m)
^ Assert position at the beginning of a line (at beginning of the string or after a line break character)
( Match the regular expression below and capture its match into backreference number 1
\s Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
* Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
){0}
\b Assert position at a word boundary
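For readers who want to try the pattern outside PowerShell, here is a minimal sketch of the same idea using Python's re module. Python is an assumption here (the answer itself relies on .NET regex), the function name replace_first_at_line_start is made up for illustration, and the two leading spaces in the sample are assumed as well.

```python
import re

def replace_first_at_line_start(text, to_replace, replace_with):
    """Replace the first line-leading occurrence of to_replace, keeping its indentation."""
    # re.MULTILINE plays the role of (?m): ^ matches after every line break.
    # re.escape mirrors [regex]::Escape, so special characters are taken literally.
    pattern = re.compile(r'^(\s*)' + re.escape(to_replace) + r'\b', re.MULTILINE)
    # A function replacement re-emits the captured whitespace (group 1) and
    # avoids any special treatment of backslashes in replace_with.
    return pattern.sub(lambda m: m.group(1) + replace_with, text, count=1)

sample = "#abc\n  abc def\nabc"
print(replace_first_at_line_start(sample, "abc", "xyz"))
```

The count=1 argument corresponds to the trailing 1 passed to Replace() in the PowerShell version: only the first match is rewritten, so the abc on the last line is left alone.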
Special Characters in Regex
Char   Description                            Meaning
\      Backslash                              Used to escape a special character
^      Caret                                  Beginning of a string
$      Dollar sign                            End of a string
.      Period or dot                          Matches any single character
|      Vertical bar or pipe symbol            Matches previous OR next character/group
?      Question mark                          Match zero or one of the previous
*      Asterisk or star                       Match zero, one or more of the previous
+      Plus sign                              Match one or more of the previous
( )    Opening and closing parenthesis        Group characters
[ ]    Opening and closing square bracket     Matches a range of characters
{ }    Opening and closing curly brace        Matches a specified number of occurrences of the previous
The 1 passed as the third argument to Replace() limits the replacement to the first instance of 'abc' within the string.
Write-Host "Replace Example One" -ForegroundColor Yellow -BackgroundColor DarkGreen
$test = " abc def abc "
[regex]$pattern = "abc"
$pattern.replace($test, "xyz", 1)
Write-Host "Replace Example Two" -ForegroundColor Green -BackgroundColor Blue
# -Raw reads the file as one string; without it Get-Content returns an array of lines
$test = Get-Content "c:\test\text.txt" -Raw
[regex]$pattern = "abc"
$x = $pattern.replace($test, "xyz", 1)
Write-Host $x
Write-Host "Replace Example Three" -ForegroundColor White -BackgroundColor Red
$multilineText = @"
#abc
abc def
abc
"@
[regex]$pattern = "abc"
$y = $pattern.replace($multilineText, "xyz", 1)
Write-Host $y

Can somebody explain this obfuscated perl regexp script?

This code is taken from the HackBack DIY guide to rob banks by Phineas Fisher. It outputs a long text (The Sixth Declaration of the Lacandon Jungle). Where does it fetch it? I don't see any alphanumeric characters at all. What is going on here? And what does the -r switch do? It seems undocumented.
perl -Mre=eval <<\EOF
''
=~(
'(?'
.'{'.(
'`'|'%'
).("\["^
'-').('`'|
'!').("\`"|
',').'"(\\$'
.':=`'.(('`')|
'#').('['^'.').
('['^')').("\`"|
',').('{'^'[').'-'.('['^'(').('{'^'[').('`'|'(').('['^'/').('['^'/').(
'['^'+').('['^'(').'://'.('`'|'%').('`'|'.').('`'|',').('`'|'!').("\`"|
'#').('`'|'%').('['^'!').('`'|'!').('['^'+').('`'|'!').('['^"\/").(
'`'|')').('['^'(').('['^'/').('`'|'!').'.'.('`'|'%').('['^'!')
.('`'|',').('`'|'.').'.'.('`'|'/').('['^')').('`'|"\'").
'.'.('`'|'-').('['^'#').'/'.('['^'(').('`'|('$')).(
'['^'(').('`'|',').'-'.('`'|'%').('['^('(')).
'/`)=~'.('['^'(').'|</'.('['^'+').'>|\\'
.'\\'.('`'|'.').'|'.('`'|"'").';'.
'\\$:=~'.('['^'(').'/<.*?>//'
.('`'|"'").';'.('['^'+').('['^
')').('`'|')').('`'|'.').(('[')^
'/').('{'^'[').'\\$:=~/('.(('{')^
'(').('`'^'%').('{'^'#').('{'^'/')
.('`'^'!').'.*?'.('`'^'-').('`'|'%')
.('['^'#').("\`"| ')').('`'|'#').(
'`'|'!').('`'| '.').('`'|'/')
.'..)/'.('[' ^'(').'"})')
;$:="\."^ '~';$~='@'
|'(';$^= ')'^'[';
$/='`' |'.';
$,= '('
EOF
The basic idea of the code you posted is that each alphanumeric character has been replaced by a bitwise operation between two non-alphanumeric characters. For instance,
'`'|'%'
(5th line of the "star" in your code)
is a bitwise OR of a backquote and a percent sign, whose code points are 96 and 37 respectively; 96 | 37 is 101, which is the code point of the letter "e". The following few lines all print the same thing:
say '`' | '%' ;
say chr( ord('`' | '%') );
say chr( ord('`') | ord('%') );
say chr( 96 | 37 );
say chr( 101 );
say "e"
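The same arithmetic can be checked outside Perl. The following is a small sketch in Python (an assumption; the answer itself uses Perl), with a hypothetical helper named combine, that decodes the first four character pairs of the obfuscated script.

```python
def combine(a, b, op):
    """Apply a bitwise operator to the code points of two one-character strings."""
    if op == '|':
        return chr(ord(a) | ord(b))
    if op == '^':
        return chr(ord(a) ^ ord(b))
    raise ValueError(f"unsupported operator: {op}")

# The first four pairs that follow '(?{' in the obfuscated script.
pairs = [('`', '%', '|'), ('[', '-', '^'), ('`', '!', '|'), ('`', ',', '|')]
decoded = ''.join(combine(a, b, op) for a, b, op in pairs)
print(decoded)  # prints "eval"
```

Every letter in the script is built this way, which is why no alphanumeric character has to appear in the source.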
Your code starts with (ignoring whitespace, which doesn't matter):
'' =~ (
The corresponding closing bracket is 28 lines later:
^'(').'"})')
(C-f this pattern to see it on the web-page; I used my editor's matching parenthesis highlighting to find it)
We can assign everything in between the opening and closing parenthesis to a variable that we can then print:
$x = '(?'
.'{'.(
'`'|'%'
).("\["^
'-').('`'|
'!').("\`"|
',').'"(\\$'
.':=`'.(('`')|
'#').('['^'.').
('['^')').("\`"|
',').('{'^'[').'-'.('['^'(').('{'^'[').('`'|'(').('['^'/').('['^'/').(
'['^'+').('['^'(').'://'.('`'|'%').('`'|'.').('`'|',').('`'|'!').("\`"|
'#').('`'|'%').('['^'!').('`'|'!').('['^'+').('`'|'!').('['^"\/").(
'`'|')').('['^'(').('['^'/').('`'|'!').'.'.('`'|'%').('['^'!')
.('`'|',').('`'|'.').'.'.('`'|'/').('['^')').('`'|"\'").
'.'.('`'|'-').('['^'#').'/'.('['^'(').('`'|('$')).(
'['^'(').('`'|',').'-'.('`'|'%').('['^('(')).
'/`)=~'.('['^'(').'|</'.('['^'+').'>|\\'
.'\\'.('`'|'.').'|'.('`'|"'").';'.
'\\$:=~'.('['^'(').'/<.*?>//'
.('`'|"'").';'.('['^'+').('['^
')').('`'|')').('`'|'.').(('[')^
'/').('{'^'[').'\\$:=~/('.(('{')^
'(').('`'^'%').('{'^'#').('{'^'/')
.('`'^'!').'.*?'.('`'^'-').('`'|'%')
.('['^'#').("\`"| ')').('`'|'#').(
'`'|'!').('`'| '.').('`'|'/')
.'..)/'.('[' ^'(').'"})';
print $x;
This will print:
(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})
The remainder of the code is a bunch of assignments to some variables, probably there only to complete the pattern. The end of the star is:
$:="\."^'~';
$~='@'|'(';
$^=')'^'[';
$/='`'|'.';
$,='(';
Which just assigns simple one-character strings to some variables.
Back to the main code:
(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})
This code is inside a regex that is matched against an empty string (don't forget that we started with '' =~ (...)). (?{...}) inside a regex runs the code between the braces. With some whitespace added, and with the string passed to eval unwrapped, this gives us:
# fetch a URL from the web using curl _quietly_ (-s)
($: = `curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)
# replace end of paragraphs with newlines in the HTML fetched
=~ s|</p>|\n|g;
# Remove all HTML tags
$: =~ s/<.*?>//g;
# Print everything between SEXTA and Mexicano (+2 chars)
print $: =~ /(SEXTA.*?Mexicano..)/s
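To see what the three text-processing steps do without touching the network, here is a rough Python sketch (Python is an assumption; the original is Perl). The function name extract_declaration and the sample HTML string are invented for illustration and are not the real page content.

```python
import re

def extract_declaration(html):
    """Mimic the deobfuscated Perl: </p> to newline, strip tags, extract the span."""
    text = re.sub(r'</p>', '\n', html)   # s|</p>|\n|g
    text = re.sub(r'<.*?>', '', text)    # s/<.*?>//g
    # re.DOTALL corresponds to the /s flag: '.' also matches newlines.
    match = re.search(r'(SEXTA.*?Mexicano..)', text, re.DOTALL)
    return match.group(1) if match else None

# Hypothetical stand-in for the output of curl.
sample_html = "<p>intro</p><p>SEXTA <b>DECLARACION</b></p><p>Ejercito Mexicano.</p><p>tail</p>"
print(extract_declaration(sample_html))
```

With this sample the function returns the text from SEXTA up to Mexicano plus the two characters that follow it, here a period and a newline.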
You can automate this deobfuscation process by using B::Deparse: running
perl -MO=Deparse yourcode.pl
Will produce something like:
'' =~ m[(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})];
$: = 'P';
$~ = 'h';
$^ = 'r';
$/ = 'n';
$, = '(';

How to replace character in Powershell with initial character + a space

Hopefully this question isn't already answered on the site. I want to replace every number in a string with that number and a space. So here's what I have right now:
"31222829" -replace ("[0-9]","$0 ")
The [0-9] looks for any digit, and the replacement should be that digit followed by a space. However, it doesn't work. I saw on another website that I should use $0, but I'm not sure what it means. The output I was looking for was
3 1 2 2 2 8 2 9
But it just gives me a blank line. Any suggestions?
This probably isn't the right way to do it, but it works.
("31222829").GetEnumerator() -join " "
The .GetEnumerator method iterates over each character in the string
The -join operator will then join all of those characters with the " " space
tl;dr
PS> "31222829" -replace '[0-9]', '$& '
3 1 2 2 2 8 2 9
Note that the output string has a trailing space, given that each digit in the input ([0-9]) is replaced by itself ($&) followed by a space.
As for what you tried:
"31222829" -replace ("[0-9]","$0 ")
While enclosing the two RHS operands in (...) doesn't impede functionality, it's not really helpful to conceive of them as an array - just enumerate them with ,, don't enclose them in (...).
Generally, use '...' rather than "..." for the RHS operands (the regex to match and the replacement operand), so as to prevent confusion between PowerShell's string expansion (interpolation) and what the -replace operator ultimately sees.
Case in point: Due to use of "..." in the replacement operand, PowerShell's string interpolation would actually expand $0 as a variable up front, which in the absence of a variable expands to the empty string - that is why you ultimately saw a blank string.
Had you used '...', $0 would have worked: in the replacement operand, $0 and $& both refer to the entire matched string, as explained in this answer.
To unconditionally separate ALL characters with spaces:
Drew's helpful answer definitely works.
Here's a more PowerShell-idiomatic alternative:
PS> [char[]] '31222829' -join ' '
3 1 2 2 2 8 2 9
Casting a string to [char[]] returns its characters as an array, which -join then joins with a space as the separator.
Note: Since -join only places the specified separator (' ') between elements, the resulting string does not have a trailing space.
You can use a regex with positive lookahead to avoid the trailing space. Lookahead and lookbehind are zero-length assertions similar to ^ and $ that match the start/end of a line. The regex \d(?=.) will match a digit when followed by another character.
PS> '123' -replace '\d(?=.)', '$0 '
1 2 3
To verify there's no trailing space:
PS> "[$('123' -replace '\d(?=.)', '$0 ')]"
[1 2 3]
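The three approaches in this thread (replace each digit with itself plus a space, join the characters, and replace with a lookahead) can be compared side by side. The sketch below uses Python's re module as an assumption, since the answers use PowerShell; \g<0> is Python's placeholder for the whole match, playing the role of $& and $0 above.

```python
import re

digits = "31222829"

# Every digit is replaced by itself plus a space, so a trailing space remains.
with_trailing = re.sub(r'[0-9]', r'\g<0> ', digits)

# Joining the characters puts the separator only between elements.
joined = ' '.join(digits)

# The lookahead (?=.) requires a following character, so the last digit is skipped.
lookahead = re.sub(r'\d(?=.)', r'\g<0> ', digits)

print(repr(with_trailing))  # '3 1 2 2 2 8 2 9 '
print(repr(joined))         # '3 1 2 2 2 8 2 9'
print(repr(lookahead))      # '3 1 2 2 2 8 2 9'
```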

Perl - partial pattern matching in a sequence of letters

I am trying to find a pattern using Perl, but I am only interested in the beginning and the end of the pattern. To be more specific, I have a sequence of letters and I would like to see if the following pattern exists. There are 23 characters, and I'm only interested in the beginning and the end of the sequence.
For example I would like to extract anything that starts with ab and ends with zt. There is always
So it can be
abaaaaaaaaaaaaaaaaaaazt
So that it detects this match
but not
abaaaaaaaaaaaaaaaaaaazz
So far I tried
if ($line =~ /ab[*]zt/) {
print "found pattern ";
}
thanks
* is a quantifier and meta character. Inside a character class bracket [ .. ] it just means a literal asterisk. You are probably thinking of .* which is a wildcard followed by the quantifier.
Matching entire string, e.g. "abaazt".
/^ab.*zt$/
Note the anchors ^ and $, and the wildcard character . followed by the zero or more * quantifier.
Match substrings inside another string, e.g. "a b abaazt c d"
/\bab\S*zt\b/
Using word boundary \b to denote beginning and end instead of anchors. You can also be more specific:
/(?<!\S)ab\S*zt(?!\S)/
Using a double negation to assert that no non-whitespace characters follow or precede the target text.
It is also possible to use the substr function
if (substr($string, 0, 2) eq "ab" and substr($string, -2) eq "zt")
You mention that the string is 23 characters, and if that is a fixed length, you can get even more specific, for example
/^ab.{19}zt$/
Which matches exactly 19 wildcards. The syntax for the {} quantifier is {min,max}; leaving max blank means no upper limit, i.e. {1,} is the same as + and {0,} is the same as *, meaning one or more and zero or more matches, respectively.
A * inside a character class, as in [*], only matches a literal *; if you want to match anything, you need to use .*.
if ($line =~ /^ab.*zt$/) {
print "found pattern ";
}
If you really want to capture the match, wrap the whole pattern in a capture group:
if (my ($string) = $line =~ /^(ab.*zt)$/) {
print "found pattern $string";
}
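As a quick cross-check of the patterns discussed in this thread, here is a minimal sketch in Python (an assumption; the thread itself is about Perl). The helper names is_ab_zt_23 and find_ab_zt_words are hypothetical.

```python
import re

def is_ab_zt_23(s):
    """True if s is exactly 23 characters, starting with ab and ending with zt."""
    # fullmatch anchors at both ends, like ^...$ in the Perl pattern.
    return re.fullmatch(r'ab.{19}zt', s) is not None

def find_ab_zt_words(s):
    """Find whitespace-delimited substrings that start with ab and end with zt."""
    return re.findall(r'(?<!\S)ab\S*zt(?!\S)', s)

print(is_ab_zt_23("ab" + "a" * 19 + "zt"))  # True
print(is_ab_zt_23("ab" + "a" * 19 + "zz"))  # False
print(find_ab_zt_words("a b abaazt c d"))   # ['abaazt']
```

One difference worth keeping in mind: Perl's $ also matches before a trailing newline, whereas fullmatch does not, so input lines may need to be stripped first.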

Trouble separating G0 and G1 rules in grammar

I'm trying to get what seems like a very basic Marpa grammar working. The code I use is below:
use strict;
use warnings;
use Marpa::R2;
use Data::Dumper;
my $grammar = Marpa::R2::Scanless::G->new(
{
source => \(<<'END_OF_SOURCE'),
:start ::= ExprSingle
ExprSingle ::= Expr AndExpr
Expr ~ word
AndExpr ~ word*
word ~ [\w]+
:discard ~ ws
ws ~ [\s]+
END_OF_SOURCE
}
);
my $reader = Marpa::R2::Scanless::R->new(
{
grammar => $grammar,
}
);
my $input = 'foo';
$reader->read(\$input);
my $value = $reader->value;
print Dumper $value;
This prints $VAR1 = \'foo';. So it recognizes one word just fine. But I want it to recognize a string of words
my $input='foo bar'
Now the script prints:
Error in SLIF G1 read: Parse exhausted, but lexemes remain, at position 4
I think this is because ExprSingle uses the ~ (match) operator, which makes it part of the tokenizing level, G0, instead of the structural level G1; the :discard rule allows space between G1 rules, not G0 ones. So I change the grammar like so:
ExprSingle ::= Expr AndExpr
Now no warning is printed, but the resulting value is undef instead of something containing 'foo' and 'bar'. I'm honestly not sure what that means, since, before, the failed parse threw an actual error.
I tried changing the grammar to separate what I think are G0 and G1 rules further, but still no luck:
:start ::= ExprSingle
ExprSingle ::= Expr AndExpr
Expr ::= token
AndExpr ::= token*
token ~ word
word ~ [\w]+
:discard ~ ws
ws ~ [\s]+
The final value is still undef. trace_terminals shows both 'foo' and 'bar' being accepted as tokens. What do I need to do to fix this grammar (by which I mean get a value containing the strings 'foo' and 'bar' instead of just undef)?
Rules by default return a value of undef, so in your case a return of \undef from $reader->value() means your parse succeeded. That is, a return of undef means failure, while a return of \undef means success where the parse evaluated to undef.
A good, fast way to start with a more helpful semantics is to add the following line:
:default ::= action => ::array
This causes the parse to generate an AST.