Laundering tainted data - perl

When I launder tainted data by checking whether it contains any bad characters, are there Unicode properties that will filter out the bad characters?

User-Defined Character Properties in perlunicode
package Characters::Sid_com;
sub InBad {
return <<"BAD";
0000\t10FFFF
BAD
}
sub InEvil {
return <<"EVIL";
0488
0489
EVIL
}
sub InStupid {
return <<"STUPID";
E630\tE64F
F8D0\tF8FF
STUPID
}
⋮
die 'No.' if $string =~ /
(?: \p{Characters::Sid_com::InBad}
| \p{Characters::Sid_com::InEvil}
| \p{Characters::Sid_com::InStupid}
)
/x;

I think "no" is an understatement for an answer, but there you have it. No, Unicode does not have a concept of "bad" or "good" characters (let alone "ugly" ones).

XML (and thus XHTML) can only contain these characters:
\x09 \x0A \x0D
\x{0020}-\x{D7FF}
\x{E000}-\x{FFFD}
\x{10000}-\x{10FFFF}
Of the above, the following should be avoided:
\x7F-\x84
\x86-\x9F
\x{FDD0}-\x{FDEF}
\x{1FFFE}-\x{1FFFF}
\x{2FFFE}-\x{2FFFF}
\x{3FFFE}-\x{3FFFF}
\x{4FFFE}-\x{4FFFF}
\x{5FFFE}-\x{5FFFF}
\x{6FFFE}-\x{6FFFF}
\x{7FFFE}-\x{7FFFF}
\x{8FFFE}-\x{8FFFF}
\x{9FFFE}-\x{9FFFF}
\x{AFFFE}-\x{AFFFF}
\x{BFFFE}-\x{BFFFF}
\x{CFFFE}-\x{CFFFF}
\x{DFFFE}-\x{DFFFF}
\x{EFFFE}-\x{EFFFF}
\x{FFFFE}-\x{FFFFF}
\x{10FFFE}-\x{10FFFF}
If you are generating XHTML, you need to escape the following:
& ⇒ &amp;
< ⇒ &lt;
> ⇒ &gt; (optional)
" ⇒ &quot; (optional except in attribute values delimited with ")
' ⇒ &apos; (optional except in attribute values delimited with ')
HTML should have the same if not looser requirements, so if you stick to this, you should be safe.
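As a rough illustration of the two lists above, here is a minimal Python sketch (Python rather than Perl, purely for brevity). The names is_xml_safe and to_xhtml_text are hypothetical helpers, not part of any library; the character ranges are copied from the list of allowed characters, and the escaping relies on the standard library's xml.sax.saxutils.escape.

```python
import re
from xml.sax.saxutils import escape

# Characters XML allows, copied from the ranges listed above.
_XML_ALLOWED = re.compile(
    "[\x09\x0A\x0D"
    "\x20-\uD7FF"
    "\uE000-\uFFFD"
    "\U00010000-\U0010FFFF]*"
)

def is_xml_safe(text: str) -> bool:
    """Return True if every character of text is in the allowed ranges."""
    return _XML_ALLOWED.fullmatch(text) is not None

def to_xhtml_text(text: str) -> str:
    """Reject disallowed characters, then escape &, <, > and both quotes."""
    if not is_xml_safe(text):
        raise ValueError("text contains characters not allowed in XML")
    # escape() handles &, < and >; the extra mapping covers the two quotes.
    return escape(text, {'"': "&quot;", "'": "&apos;"})
```

This sketch only checks the hard requirement; the "should be avoided" ranges are still accepted and would need a second character class if you want to reject them as well.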


ANTLR4 lexer rule creates errors or conflicts on perl grammar

I am having an issue with my Perl grammar. Here are the relevant parts of the grammar:
element
: element (ASTERISK_CHAR | SLASH_CHAR | PERCENT_CHAR) element
| word
;
SLASH_CHAR: '/';
REGEX_STRING
: '/' (~('/' | '\r' | '\n') | NEW_LINE)* '/'
;
fragment NEW_LINE
: '\r'? '\n'
;
If the rule REGEX_STRING is not commented out, then the following Perl doesn't parse:
$b = 1/2;
$c = 1/2;
<2021/08/20-19:24:37> <ERROR> [parsing.AntlrErrorLogger] - Unit 1: <unknown>:2:6: extraneous input '/2;\r\n$c = 1/' expecting {<EOF>, '=', '**=', '+=', '-=', '.=', '*=', '/=', '%=', CROSS_EQUAL, '&=', '|=', '^=', '&.=', '|.=', '^.=', '<<=', '>>=', '&&=', '||=', '//=', '==', '>=', '<=', '<=>', '<>', '!=', '>', '<', '~~', '++', '--', '**', '.', '+', '-', '*', '/', '%', '=~', '!~', '&&', '||', '//', '&', '&.', '|', '|.', '^', '^.', '<<', '>>', '..', '...', '?', ';', X_KEYWORD, AND, CMP, EQ, FOR, FOREACH, GE, GT, IF, ISA, LE, LT, OR, NE, UNLESS, UNTIL, WHEN, WHILE, XOR, UNSIGNED_INTEGER}
Note that it doesn't matter where the lexer rule REGEX_STRING is used; even if it is not referenced anywhere in the parser rules, its mere presence makes parsing fail (so the issue is on the lexer side).
If I remove the lexer rule REGEX_STRING, then the code above parses just fine, but then I can't parse:
$dateCalc =~ /^([0-9]{4})([0-9]{2})([0-9]{2})/
Also, I noticed that the following Perl parses, so there seems to be some kind of interaction between the first and the second '/':
$b = 12; # Removed the / between 1 and 2
$c = 1/2; # Removing the / here would work as well.
I can't work out how to write my regex lexer rule so that nothing else fails.
What am I missing? How can I parse both expressions?
The basic issue here is that ANTLR4, like many other parsing frameworks, performs lexical analysis independently of the syntax; the same tokens are produced regardless of which tokens might be acceptable to the parser. So it is the lexical analyser which must decide whether a given / is a division operator or the start of a regex, a decision which can really only be made using syntactic information. (There are parsing frameworks which do not have this limitation, and thus can be used to implement scannerless parsers. These include PEG-based parsers and GLR/GLL parsers.)
There's an example of solving this lexical ambiguity, which also shows up in parsing ECMAScript, in the ANTLR4 example directory. (That's a GitHub permalink so that the line numbers cited below continue to work.)
The basic strategy is to decide whether a / can start a regular expression based on the immediately previous token. This works in ECMAScript because the syntactic contexts in which an operator (such as / or /=) can appear are disjoint from the contexts in which an operand can appear. This will probably not translate directly into a Perl parser, but it might help show the possibilities.
Lines 780-782: The regex token itself is protected by a semantic guard:
RegularExpressionLiteral
: {isRegexPossible()}? '/' RegularExpressionBody '/' RegularExpressionFlags
;
Lines 154-182: The guard function itself is simple, but obviously required a certain amount of grammatical analysis to generate the correct test. (Note: The list of tokens has been abbreviated; see the original file for the complete list):
private boolean isRegexPossible() {
    if (this.lastToken == null) {
        return true;
    }
    switch (this.lastToken.getType()) {
        case Identifier:
        case NullLiteral:
        ...
            // After any of the tokens above, no regex literal can follow.
            return false;
        default:
            // In all other cases, a regex literal _is_ possible.
            return true;
    }
}
Lines 127-147: In order for that to work, the scanner must retain the previous token in the member variable lastToken. (Comments removed for space):
@Override
public Token nextToken() {
    Token next = super.nextToken();
    if (next.getChannel() == Token.DEFAULT_CHANNEL) {
        this.lastToken = next;
    }
    return next;
}
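To make the previous-token strategy concrete outside of ANTLR, here is a small hand-written tokenizer sketched in Python. It is only an illustration under simplifying assumptions: the token kinds, the OPERAND_KINDS set and the tiny operator list are all hypothetical and cover just the snippets from the question, not real Perl.

```python
import re

# Hypothetical token kinds after which a '/' must be a division operator.
OPERAND_KINDS = {"NUMBER", "IDENT", "RPAREN"}

TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"\$?[A-Za-z_]\w*"),
    ("RPAREN", r"\)"),
    ("OP", r"=~|[-+*%=(;]"),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pat})" for kind, pat in TOKEN_SPEC))
# A regex literal: '/', then non-slash characters or backslash escapes, then '/'.
REGEX = re.compile(r"/(?:[^/\\\n]|\\.)*/")

def tokenize(src):
    """Split src into (kind, text) pairs, using the previous token to classify '/'."""
    tokens, pos, last_kind = [], 0, None
    while pos < len(src):
        if src[pos] == "/":
            if last_kind in OPERAND_KINDS:
                # An operand came just before, so this is division.
                kind, text = "SLASH", "/"
            else:
                m = REGEX.match(src, pos)
                if m is None:
                    raise SyntaxError(f"unterminated regex at offset {pos}")
                kind, text = "REGEX", m.group()
        else:
            m = MASTER.match(src, pos)
            if m is None:
                raise SyntaxError(f"unexpected character at offset {pos}")
            kind, text = m.lastgroup, m.group()
        pos += len(text)
        if kind != "WS":  # whitespace never becomes the "previous token"
            tokens.append((kind, text))
            last_kind = kind
    return tokens
```

With this guard, `$b = 1/2;` yields a SLASH token between the two numbers, while the `/` after `=~` starts a REGEX token, because `=~` is not in the operand set.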

Scala Syntax Specification mismatch if-else with one line expression end by semicolon?

I'm studying the Scala Syntax Specification.
I'm confused by the if-else syntax:
Expr1 ::= ‘if’ ‘(’ Expr ‘)’ {nl} Expr [[semi] ‘else’ Expr]
| ...
How can it match the if-else below, where each one-line expression ends with a semicolon?
if (true) // \n
println(1); //\n
else //\n
println(2); //\n
Notice there are 4 lines, each followed by a '\n'. I have these questions:
If the 1st ; after println(1) matches the semi before else ([[semi] ‘else’ Expr]), what matches the 2nd '\n', which follows that ;?
What matches the 3rd '\n' after else?
What matches the 2nd ; and the 4th '\n' after println(2)? The if-else production doesn't match any ; or '\n' at its tail.
I think you are being confused by thinking that all newlines must match the nl token. That is not correct.
Newlines are in general simply treated as whitespace. There is a very long subsection on newlines in the Lexical Syntax chapter (section 1.2, Newline characters) which explains in detail exactly when a newline character is an nl token and when it isn't.
Only the first newline character in your example is an nl token, the other three are just whitespace.
In Scala, the semicolon ; is optional at the end of a line (it is inferred).
With braces, an if-else statement is as simple as:
if (true) {
  "\n" // evaluated, but its value is discarded
  println(1) // executed for its side effect
  "\n" // the last expression is the value of this branch
} else {
  "\n" // evaluated, but its value is discarded
  println(2) // executed for its side effect
  "\n" // the last expression is the value of this branch
}
Or you can write it without braces, but then each branch must be a single expression:
if (true)
  "\n" // the value of this branch; no other statement can follow here
else
  "\n"
Without comments: if (true) "\n" else "\n"
More about if-else in Scala

Clean string from html tags and special characters

I want to clean my text of HTML tags, HTML special characters, and characters like < > [ ] / \ * ,
I used $str = preg_replace("/&#?[a-zA-Z0-9]+;/i", "", $str);
It works well with HTML special characters, but some characters are not removed, such as:
( /*/*]]>*/ )
How can I remove these characters?
If you are really using PHP, as it looks like, you can just use:
$str = htmlspecialchars($str);
All HTML special characters will be escaped (which could be better than just stripping them). If you really just want to filter these characters out, list them, escaped, in a character class:
$str = preg_replace("/[\&#\?\]\[\/\\\<\>\*\:\(\);]*/i","",$str);
Notice there's just one character class, "/[...]*/i"; I removed a-zA-Z0-9 because you presumably want to keep those characters. You can also whitelist only the characters you want to keep (this will give you trouble with accented characters like á é ü if you use them; you have to specify every accepted character):
$str = preg_replace("/[^a-zA-Z0-9áÁéÉíÍãÃüÜõÕñÑ\.\+\-\_\%\$\#\!\=;]*/","",$str);
Note also that extra escaping rarely hurts, except in ranges: a-z is a range, whereas a\-z matches a, or -, or z.
I hope it helps. :)
A regular expression for HTML tags is:
/<[^>]*>/
so use something like this:
// The regular expression to remove HTML tags
$htmltagsregex = '/<[^>]*>/';
// what to substitute for each match
$nothing = '';
// the string I want to apply it to
$string = 'this is a string with <b>HTML tags</b> that I want to <strong>remove</strong>';
// DO IT
$result = preg_replace($htmltagsregex, $nothing, $string);
and it will return
this is a string with HTML tags that I want to remove
That's all
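For comparison, the same three-step cleanup (tags, then entities, then the leftover punctuation) can be sketched in Python. This is only an illustration: the function name clean and the exact set of unwanted characters are assumptions based on the question, and a regex is not a robust way to handle arbitrary HTML.

```python
import re

TAG = re.compile(r"<[^>]*>")            # one tag; [^>]* cannot run past the closing '>'
ENTITY = re.compile(r"&#?[a-zA-Z0-9]+;")  # same entity pattern as in the question
UNWANTED = re.compile(r"[<>\[\]/\\*]")  # leftover < > [ ] / \ * characters

def clean(text: str) -> str:
    """Remove tags first, then entities, then any remaining unwanted characters."""
    text = TAG.sub("", text)
    text = ENTITY.sub("", text)
    return UNWANTED.sub("", text)
```

Tags are removed first so that the character filter does not destroy the angle brackets the tag pattern relies on.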

Scala string pattern matching for mathematical symbols

I have the following code:
val z: String = tree.symbol.toString
z match {
case "method +" | "method -" | "method *" | "method ==" =>
println("no special op")
false
case "method /" | "method %" =>
println("we have the special div operation")
true
case _ =>
false
}
Is it possible to create a match for the primitive operations in Scala:
"method *".matches("(method) (+-*==)")
I know that the (+-*) signs are used as quantifiers. Is there a way to match them anyway?
Thanks from an avid Scala scholar!
Sure.
val z: String = tree.symbol.toString
val noSpecialOp = "method (?:[-+*]|==)".r
val divOp = "method [/%]".r
z match {
case noSpecialOp() =>
println("no special op")
false
case divOp() =>
println("we have the special div operation")
true
case _ =>
false
}
Things to consider:
I chose to match against single characters using [abc] instead of (?:a|b|c).
Note that - has to be the first character when using [], or it will be interpreted as a range. Likewise, ^ cannot be the first character inside [], or it will be interpreted as negation.
I'm using (?:...) instead of (...) because I don't want to extract the contents. If I did want to extract the contents -- so I'd know what was the operator, for instance, then I'd use (...). However, I'd also have to change the matching to receive the extracted content, or it would fail the match.
It is important not to forget () on the matches -- like divOp(). If you forget them, a simple assignment is made (and Scala will complain about unreachable code).
And, as I said, if you are extracting something, then you need something inside those parentheses. For instance, "method ([%/])".r would match divOp(op), but not divOp().
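The character-class and capture-group points above are not specific to Scala. As a hedged illustration, here is the same classification sketched in Python; the function name classify is hypothetical, and fullmatch is used because a Scala regex extractor pattern also has to match the whole string.

```python
import re

# '-' comes first inside [] so it is a literal, not a range; '==' needs an alternation.
NO_SPECIAL_OP = re.compile(r"method (?:[-+*]|==)")
# A capturing group, so the matched operator can be extracted.
DIV_OP = re.compile(r"method ([/%])")

def classify(symbol: str) -> bool:
    """Return True only for the special division-like operations."""
    if NO_SPECIAL_OP.fullmatch(symbol):
        print("no special op")
        return False
    m = DIV_OP.fullmatch(symbol)
    if m:
        print("we have the special div operation:", m.group(1))
        return True
    return False
```

Inside the character class, + and * lose their quantifier meaning, which is why no backslashes are needed there.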
Much the same as in Java. To escape a character in a regular expression, you prefix the character with \. However, backslash is also the escape character in standard Java/Scala strings, so to pass it through to the regular expression processing you must again prefix it with a backslash. You end up with something like:
scala> "+".matches("\\+")
res1 : Boolean = true
As James Iry points out in the comment below, Scala also has support for 'raw strings', enclosed in three quotation marks: """Raw string in which I don't need to escape things like \!""" This allows you to avoid the second level of escaping, that imposed by Java/Scala strings. Note that you still need to escape any characters that are treated as special by the regular expression parser:
scala> "+".matches("""\+""")
res1 : Boolean = true
Escaping characters in Strings works like in Java.
If you have larger Strings which need a lot of escaping, consider Scala's """.
E. g. """String without needing to escape anything \n \d"""
If you put three """ around your regular expression, you no longer need to double the backslashes; characters that are special to the regex engine still need a single backslash.

Lisp grammar in yacc

I am trying to build a Lisp grammar. Easy, right? Apparently not.
I present these inputs and receive errors...
( 1 1)
23 23 23
ui ui
This is the grammar...
%%
sexpr: atom {printf("matched sexpr\n");}
| list
;
list: '(' members ')' {printf("matched list\n");}
| '('')' {printf("matched empty list\n");}
;
members: sexpr {printf("members 1\n");}
| sexpr members {printf("members 2\n");}
;
atom: ID {printf("ID\n");}
| NUM {printf("NUM\n");}
| STR {printf("STR\n");}
;
%%
As near as I can tell, I need a single non-terminal defined as a program, upon which the whole parse tree can hang. But I tried it and it didn't seem to work.
edit - this was my "top terminal" approach:
program: slist;
slist: slist sexpr | sexpr;
But it allows problems such as:
( 1 1
Edit2: The FLEX code is...
%{
#include <stdio.h>
#include "a.yacc.tab.h"
int linenumber;
extern int yylval;
%}
%%
\n { linenumber++; }
[0-9]+ { yylval = atoi(yytext); return NUM; }
\"[^\"\n]*\" { return STR; }
[a-zA-Z][a-zA-Z0-9]* { return ID; }
.
%%
An example of the over-matching...
(1 1 1)
NUM
matched sexpr
NUM
matched sexpr
NUM
matched sexpr
(1 1
NUM
matched sexpr
NUM
matched sexpr
What's the error here?
edit: The error was in the lexer.
Lisp grammar cannot be represented as a context-free grammar, and yacc cannot parse all Lisp code.
This is because of Lisp features such as read-time evaluation and the programmable reader. So, just to read arbitrary Lisp code, you need to have a full Lisp running. This is not some obscure, unused feature; it is actually used, e.g. by CL-INTERPOL and CL-SQL.
If the goal is to parse a subset of Lisp, then the program text is a sequence of sexprs.
The error really is in the lexer. Your parentheses are matched by the final "." rule in the lexer, which has no action, so they are discarded and never reach the parser as parentheses.
Add rules like
\) { return RPAREN; }
\( { return LPAREN; }
to the lexer and change all occurrences of '(' and ')' to LPAREN and RPAREN respectively in the parser. (Also, you need to declare LPAREN and RPAREN with %token where you define your token list.)
Note: I'm not sure about the syntax, could be the backslashes are wrong.
You are correct in that you need to define a non-terminal. That would be defined as a set of sexpr. I'm not sure of the YACC syntax for that. I'm partial to ANTLR for parser generators and the syntax would be:
program: sexpr*
Indicating 0 or more sexpr.
Update with YACC syntax:
program : /* empty */
| program sexpr
;
Not in YACC, but it might be helpful anyway: here's a full grammar in ANTLR v3 that works for the cases you described (it excludes strings in the lexer because they're not important for this example, and uses C# console output because that's what I tested it with):
program: (sexpr)*;
sexpr: list
| atom {Console.WriteLine("matched sexpr");}
;
list:
'('')' {Console.WriteLine("matched empty list");}
| '(' members ')' {Console.WriteLine("matched list");}
;
members: (sexpr)+ {Console.WriteLine("members 1");};
atom: Id {Console.WriteLine("ID");}
| Num {Console.WriteLine("NUM");}
;
Num: ( '0' .. '9')+;
Id: ('a' .. 'z' | 'A' .. 'Z')+;
Whitespace : ( ' ' | '\r' '\n' | '\n' | '\t' ) {Skip();};
This won't work exactly as is in YACC because YACC generates an LALR parser while ANTLR is a modified recursive descent. There is a C/C++ output target for ANTLR if you wanted to go that way.
Do you necessarily need a yacc/bison parser? A "reads a subset of lisp syntax" reader isn't that hard to implement in C (start with a read_sexpr function, dispatch to a read_list when you see a '(', that in turn builds a list of contained sexprs until a ')' is seen; otherwise, call a read_atom that collects an atom and returns it when it can no longer read atom-constituent characters).
However, if you want to be able to read arbitrary Common Lisp, you'll need to (at the worst) implement a Common Lisp, as CL can modify the reader at run time (and even switch between different read-tables at run time under program control; quite handy when you're wanting to load code written in another language or dialect of lisp).
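The hand-written reader described above (read_sexpr dispatching to read_list or read_atom) can be sketched roughly as follows. This is in Python rather than C to keep it short, and it makes simplifying assumptions that are not in the original: input is split on whitespace and parentheses up front instead of being read character by character, and quoted strings are not handled. The helper read_program is a hypothetical addition that plays the role of the top-level non-terminal.

```python
def tokenize(text):
    """Split on whitespace, treating each parenthesis as its own token."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def read_sexpr(tokens, pos=0):
    """Read one s-expression at tokens[pos]; return (value, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] == "(":
        return read_list(tokens, pos + 1)
    if tokens[pos] == ")":
        raise SyntaxError("unexpected ')'")
    return read_atom(tokens, pos)

def read_list(tokens, pos):
    """Collect s-expressions until the matching ')' is seen."""
    items = []
    while pos < len(tokens) and tokens[pos] != ")":
        item, pos = read_sexpr(tokens, pos)
        items.append(item)
    if pos >= len(tokens):
        raise SyntaxError("missing ')'")
    return items, pos + 1  # skip the closing ')'

def read_atom(tokens, pos):
    """Numbers become ints; everything else stays a symbol name."""
    tok = tokens[pos]
    return (int(tok) if tok.isdecimal() else tok), pos + 1

def read_program(text):
    """A program is a sequence of zero or more s-expressions."""
    tokens, pos, forms = tokenize(text), 0, []
    while pos < len(tokens):
        form, pos = read_sexpr(tokens, pos)
        forms.append(form)
    return forms
```

Unlike the slist grammar in the question, an unbalanced input such as `( 1 1` is rejected here, because read_list insists on finding the closing parenthesis.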
It's been a long time since I worked with YACC, but you do need a top-level non-terminal. Could you be more specific about "tried it" and "it didn't seem to work"? Or, for that matter, what the errors are?
I'd also suspect that YACC might be overkill for such a syntax-light language. Something simpler (like recursive descent) might work better.
You could try this grammar here.
I just tried it; my "yacc lisp grammar" works fine:
%start exprs
exprs:
| exprs expr
/// if you prefer right recursion :
/// | expr exprs
;
list:
'(' exprs ')'
;
expr:
atom
| list
;
atom:
IDENTIFIER
| CONSTANT
| NIL
| '+'
| '-'
| '*'
| '^'
| '/'
;