Is there a rule to match unicode printable characters in parboiled2? - scala

As part of a larger parser, I am writing a rule to match strings like the following using parboiled2:
Italiana Relè
I would like to use something simple like the following:
CharPredicate.Printable
But the parser is failing with an org.parboiled2.ParseError because of the unicode character at the end of the string.
Is there a simple option that I'm not aware of for matching printable unicode characters?

Take a look at https://github.com/sirthias/parboiled2/blob/master/parboiled-core/src/main/scala/org/parboiled2/CharPredicate.scala#L112. The built-in predicates such as CharPredicate.Printable cover ASCII only, but it is very easy to define your own predicates, for instance:
val latinSupplementCharsPredicate = CharPredicate('\u00c0' to '\u00dc') ++ CharPredicate('\u00e0' to '\u00fd')
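As a quick sanity check on those two ranges, here is a small Python sketch (not parboiled2 code; the helper name in_latin_supplement is made up for illustration). It only confirms that the one non-ASCII character in the sample string, è (U+00E8), falls inside the second range:

```python
def in_latin_supplement(ch: str) -> bool:
    """Mirror the two code point ranges used by the predicate above."""
    code = ord(ch)
    return 0x00C0 <= code <= 0x00DC or 0x00E0 <= code <= 0x00FD

sample = "Italiana Rel\u00e8"  # "Italiana Relè"

# Characters outside printable ASCII (space through tilde).
non_ascii = [ch for ch in sample if not (0x20 <= ord(ch) <= 0x7E)]
covered = all(in_latin_supplement(ch) for ch in non_ascii)
```

Note that the ranges as written also admit × (U+00D7) and ÷ (U+00F7), which are symbols rather than letters.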

PHP: Using preg_replace to replace an unknown string between two known strings

I have $stringF. Contained within $stringF is the following (the string is all one line, not word-wrapped as below):
http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=
AFQjCNHWQk0M4bZi9xYO4OY4ZiDqYVt2SA&clid=
c3a7d30bb8a4878e06b80cf16b898331&cid=52779892300270&ei=
H4IAW6CbK5WGhQH7s5SQAg&url=https://abcnews.
go.com/Lifestyle/wireStory/latest-royal-wedding-thousands-streets-windsor-55280649
I want to locate that string and make it look like this:
https://abcnews.go.com/Lifestyle/wireStory/latest-royal-
wedding-thousands-streets-windsor-55280649
Basically I need to use preg_replace to find the following string:
http://news.google.com/news/url?sa= ***SOME UNKNOWN CONTENT*** &url=http
and replace it with the following string:
http
I'm a little rusty with my php, and even rustier with regular expressions, so I'm struggling to figure this one out. My code looks like this:
$stringG = preg_replace('http://news.google.com/news/url?sa=*&url=http','http',$stringH);
except I know I can't use wildcards like that, and I know I need to deal specially with the special characters (colon, forward slash, question mark, ampersand, etc.). Hoping someone can help me out here.
Also of note is that my $stringF contains multiple instances of such strings, so I need the preg_replace to be not greedy - otherwise it will replace a huge chunk of my string unnecessarily.
PHP has tools for that, so there is no need for a regex: parse_url returns the components of a URL (scheme, host, path, fragment, query, ...) and parse_str extracts the keys and values of the query part.
$url = 'http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNHWQk0M4bZi9xYO4OY4ZiDqYVt2SA&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52779892300270&ei=H4IAW6CbK5WGhQH7s5SQAg&url=https://abcnews.go.com/Lifestyle/wireStory/latest-royal-wedding-thousands-streets-windsor-55280649';
parse_str(parse_url($url, PHP_URL_QUERY), $arr);
echo $arr['url'];
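The question also mentions several wrapped links inside one larger string, which the single-URL approach above does not cover directly. A minimal sketch of the non-greedy replacement, written in Python rather than PHP and assuming every wrapper looks like the sample (the url parameter comes last and is not percent-encoded), might look like this. The function name unwrap_google_links and the example.com/example.org links are made up for illustration:

```python
import re

# Non-greedy ".*?" stops at the first "&url=" after each wrapper prefix,
# so several wrapped links in one string are handled independently.
WRAPPER = re.compile(r"http://news\.google\.com/news/url\?sa=.*?&url=(?=http)")

def unwrap_google_links(text: str) -> str:
    """Strip the Google News redirect prefix from every wrapped link."""
    return WRAPPER.sub("", text)

text = (
    "first http://news.google.com/news/url?sa=t&fd=R&url=https://example.com/a "
    "second http://news.google.com/news/url?sa=t&ct2=us&url=http://example.org/b"
)
result = unwrap_google_links(text)
```

The same pattern should carry over to preg_replace once it is wrapped in delimiters, since only the dots and the question mark need escaping.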

Coffeescript Regex interpolation

CoffeeScript supports string interpolation:
user = "world"
greeting = "Hello #{user}!"
Is it possible to use interpolation in regex just like in strings? E.g.
regex = /Hello #{user}/g
P.S. I know that I can use RegExp(greeting, 'g'), I just want a bit cleaner code.
Block Regular Expressions (Heregexes) support interpolation.
Block Regular Expressions
Similar to block strings and comments,
CoffeeScript supports block regexes — extended regular expressions
that ignore internal whitespace and can contain comments and
interpolation. Modeled after Perl's /x modifier, CoffeeScript's block
regexes are delimited by /// and go a long way towards making complex
regular expressions readable.
This coffeescript code:
name="hello"
test=///#{name}///
compiles to
var name, test;
name = "hello";
test = RegExp("" + name);

Trouble determining the pattern for NSRegularExpression...?

I am relatively new to NSRegularExpression and just can't come up with a pattern to find a string within a string.
Here is the string:
##$294#001#[12345-678[123-456-7#15665#2
I want to extract the string..
#001#[12345-678[123-456-7#
For more info: I know that there will be 3 digits (like 001) between two #'s and 20 characters between the last two #'s.
I have tried any number of combinations but nothing seems to work. Any help is appreciated.
How about something like this:
#[0-9]{3}#.{20}#
If you know that the 20 characters will always consist of digits, [ and -, your pattern would become:
#[0-9]{3}#[0-9\[\-]{20}#
Be careful with the backslashes: when you create the pattern from a string literal (@"..."), you need to add an extra backslash before each backslash.
You can test NSRegularExpression patterns without recompiling each time by using RegexTester https://github.com/liyanage/regextester
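Both patterns can also be checked quickly outside Xcode. The sketch below uses Python's re module rather than NSRegularExpression, on the assumption that these particular constructs (character classes and counted repetition) behave the same in both engines:

```python
import re

text = "##$294#001#[12345-678[123-456-7#15665#2"

# Three digits between two #'s, then any 20 characters, then a closing #.
loose = re.search(r"#[0-9]{3}#.{20}#", text)

# Same shape, but the 20 characters are restricted to digits, "[" and "-".
strict = re.search(r"#[0-9]{3}#[0-9\[\-]{20}#", text)
```

Both searches find the same substring here; the stricter class only matters if other characters can appear between the last two #'s.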

preg_match a keyword variable against a list of latin and non-latin chars keywords in a local UTF-8 encoded file

I have a bad words filter that uses a list of keywords saved in a local UTF-8 encoded file. This file includes both Latin and non-Latin chars (mostly English and Arabic). Everything works as expected with Latin keywords, but when the variable includes non-Latin chars, the matching does not seem to recognize these existing keywords.
How do I go about matching both Latin and non-Latin keywords.
The badwords.txt file includes one word per line as in this example
bad
nasty
racist
سفالة
وساخة
جنس
Code used for matching:
$badwords = file_get_contents("badwords.txt");
$badtemp = explode("\n", $badwords);
$badwords = array_unique($badtemp);
$badFlag = 0; // set to 1 once a bad word is found
$query = strtolower($query);
foreach ($badwords as $key => $val) {
    if (!empty($val)) {
        $val = trim($val);
        $regexp = "/\b" . $val . "\b/i";
        if (preg_match($regexp, $query))
            $badFlag = 1;
        if ($badFlag == 1) {
            // Bad word detected die...
        }
    }
}
I've read that iconv, multibyte functions (mbstring) and using the operator /u might help with this, and I tried a few things but do not seem to get it right. Any help would be much appreciated in resolving this, and having it match both Latin and non-Latin keywords.
The problem seems to relate to recognizing word boundaries; the \b construct is apparently not “Unicode aware.” This is what the answers to the question “php regex word boundary matching in utf-8” seem to suggest. I was able to reproduce the problem even with text containing Latin letters like “é” when \b was used. The problem seems to disappear (i.e., Arabic words get correctly recognized) when I set
$wstart = '(^|[^\p{L}])';
$wend = '([^\p{L}]|$)';
and modify the regexp as follows:
$regexp = "/" . $wstart . $val . $wend . "/iu";
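The same letter-boundary idea can be sketched in Python (not PHP). Python's re module has no \p{L}, so the class [\W\d_] is used here as an approximation of "not a letter"; that substitution, the helper name contains_word, and the use of re.escape to neutralize special characters in a keyword are all assumptions of this sketch rather than part of the PHP code above:

```python
import re

# Python's re has no \p{L}; for str patterns, [^\W\d_] approximates "a letter",
# so [\W\d_] approximates "not a letter".
NOT_LETTER = r"[\W\d_]"

def contains_word(word: str, text: str) -> bool:
    """True if word occurs in text with no letter directly before or after it."""
    pattern = rf"(?:^|{NOT_LETTER}){re.escape(word)}(?:{NOT_LETTER}|$)"
    return re.search(pattern, text, re.IGNORECASE) is not None

hit = contains_word("bad", "so BAD!")
miss = contains_word("bad", "badminton")
```

In PHP, preg_quote would play the role that re.escape plays here if a keyword in badwords.txt ever contains regex metacharacters.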
Some string functions in PHP cannot be used safely on UTF-8 strings. This was supposedly going to be fixed in version 6, but for now you need to be careful what you do with a string.
It looks like strtolower() is one of them; you need to use mb_strtolower($query, 'UTF-8') instead. If that doesn't fix it, you'll need to read through the code, find every point where you process $query or badwords.txt, and check the documentation for UTF-8 bugs.
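As a rough analogy only (this is Python, and PHP's strtolower() behaviour depends on version and locale), the difference between byte-wise and character-aware lowercasing can be seen like this:

```python
query = "\u00c9COLE"  # "ÉCOLE"

# Byte-wise lowercasing only touches ASCII letters, leaving the two-byte
# UTF-8 sequence for "É" as it was.
bytewise = query.encode("utf-8").lower().decode("utf-8")

# Character-aware lowercasing handles the accented letter too.
charwise = query.lower()
```

Here bytewise keeps the capital É while charwise lowercases it, so a case-sensitive comparison against a lowercased keyword list would only succeed for the second form.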
As far as I know, preg_match() is ok with UTF-8 strings, but there are some features disabled by default to improve performance. I don't think you need any of them.
Please also double check that badwords.txt is a UTF-8 file and that $query contains a valid UTF-8 string (if it's coming from the browser, you set it with a <meta> tag).
If you're trying to debug UTF-8 text, remember most web browsers do not default to the UTF-8 text encoding, so any PHP variable you print out for debugging will not be displayed correctly by the browser, unless you select UTF-8 (in my browser, with View -> Encoding -> Unicode).
You shouldn't need to use iconv or any of the other conversion APIs; most of them will simply replace all of the non-Latin characters with Latin ones, which is obviously not what you want.

How to output """ in the "here docs" of scala?

In Scala, "here docs" begin and end with three double quotes:
val str = """Hi,everyone"""
But what if the string itself contains """? How do I output Hi,"""everyone?
Unicode escaping via \u0022 in multi-line string literals won’t help you, because the escapes would be evaluated as the very same three quotes, so your only option is to concatenate like so:
"""Hi, """+"""""""""+"""everyone"""
The good thing is that the Scala compiler is smart enough to fold this into one single string when compiling.
At least, that’s what scala -print says.
object o {
val s = """Hi, """+"""""""""+"""everyone"""
val t = "Hi, \"\"\"everyone"
}
and using scala -print →
Main$$anon$1$o.this.s = "Hi, """everyone";
Main$$anon$1$o.this.t = "Hi, """everyone";
Note however, that you can’t input it that way. The format which scala -print outputs seems to be for internal usage only.
Still, there might be some easier, more straightforward way of doing this.
It's a total hack that I posted on a similar question, but it works here too: use Scala's XML structures as an intermediate format.
val str = <a>Hi,"""everyone</a> text
This will give you a string with three double quotation marks.
You can't.
Scala heredocs are raw strings and don't process any escape codes.
If you need triple quotes in a string, use string concatenation to add them.
You can't do it using triple quotes, as far as I know. The spec, section 1.3.5, states:
A multi-line string literal is a sequence of characters enclosed in triple quotes
""" ... """. The sequence of characters is arbitrary, except that it may contain
three or more consecutive quote characters only at the very end. Characters must
not necessarily be printable; newlines or other control characters are also permitted.
Unicode escapes work as everywhere else, but none of the escape sequences in
(§1.3.6) is interpreted.
So if you want to output three quotes in a string, you can still use a regular double-quoted string with escaping:
scala> val s = "Hi, \"\"\"everyone"
s: java.lang.String = Hi, """everyone