Lexer command 'more' in antlr doesn't match the expected value - command

I've been using different lexer modes in antlr and I've been encountered problems with the 'more' command in the lexer, as it doesn't match everything inside this respective token. To make things more clear, here is what my code looks like roughly:
//DEFAULT_MODE
fragment A: ('A'); //same done for A-Z
KEYWORD_CLASS: C L A S S;
NUM: [0-9];
KEYWORD_SMTH: S M T H->mode(NUMBER_MODE);
mode NUMBER_MODE;
NUMBER: NUM+ ->mode(ANOTHER_MODE);
NO_NUMBER: ~[0-9]->more, mode(DEFAULT_MODE);
Now when I try to test the parser rule
rule: KEYWORD_SMTH NUMBER? CLASS;
then I'm expecting to match the following phrase:
SMTH CLASS
But for some reason the first letter of the C is not matched to the Token. I have to type something like
SMTH gCLASS
in order to match the keyword CLASS. If I understand correctly, the 'more' command will match everything that is not a number and bring it back to default mode, so it can be part of another token. Can someone please tell me where my mistake is? Thanks.

Assuming you omitted the rule that skips/hides spaces, this is what happens when tokenising SMTH CLASS:
token KEYWORD_SMTH is created for the text text "SMTH"
the mode changes from DEFAULT_MODE to NUMBER_MODE
the beginning of a token is created for the text "C" (NO_NUMBER...)
the mode changes from NUMBER_MODE to DEFAULT_MODE
inside the DEFAULT_MODE, the "C" previously matched is glued to whatever "LASS" is tokenised as (note this will NOT match the KEYWORD_CLASS!)
So, assuming that "LASS" is tokenised as an IDENTIFIER token or similar, you will have ended up with 2 tokens:
KEYWORD_SMTH (text "SMTH")
IDENTIFIER (text "C" + "LASS")

Related

How to escape special charcters?

I am using a html purifier package for purifying my rich text from any xss before storing in database.
But my rich text allows for Wiris symbols which uses special character as → or  .
Problem is the package does not allow me to escape these characters. It removes them completely.
What should I do to escape them ??
Example of the string before purifying
<p><math xmlns="http://www.w3.org/1998/Math/MathML"><msup><mi>x</mi><mn>2</mn></msup><mo> </mo><mo>+</mo><mo> </mo><mmultiscripts><mi>y</mi><mprescripts/><none/><mn>2</mn></mmultiscripts><mo> </mo><mover><mo>→</mo><mo>=</mo></mover><mo> </mo><msup><mi>z</mi><mn>2</mn></msup><mo> </mo></math></p>
After purifying
<p><math xmlns="http://www.w3.org/1998/Math/MathML"><msup><mi>x</mi><mn>2</mn></msup><mo> </mo><mo>+</mo><mo> </mo><mmultiscripts><mi>y</mi><mprescripts></mprescripts><none><mn>2</mn></mmultiscripts><mo> </mo><mover><mo>→</mo><mo>=</mo></mover><mo> </mo><msup><mi>z</mi><mn>2</mn></msup><mo> </mo></math></p>
My guess is that these entities are failing the regexen that HTML Purifier is using to check for valid entities in HTMLPurifier_EntityParser, here:
$this->_textEntitiesRegex =
'/&(?:'.
// hex
'[#]x([a-fA-F0-9]+);?|'.
// dec
'[#]0*(\d+);?|'.
// string (mandatory semicolon)
// NB: order matters: match semicolon preferentially
'([A-Za-z_:][A-Za-z0-9.\-_:]*);|'.
// string (optional semicolon)
"($semi_optional)".
')/';
$this->_attrEntitiesRegex =
'/&(?:'.
// hex
'[#]x([a-fA-F0-9]+);?|'.
// dec
'[#]0*(\d+);?|'.
// string (mandatory semicolon)
// NB: order matters: match semicolon preferentially
'([A-Za-z_:][A-Za-z0-9.\-_:]*);|'.
// string (optional semicolon)
// don't match if trailing is equals or alphanumeric (URL
// like)
"($semi_optional)(?![=;A-Za-z0-9])".
')/';
Notice how it expects numeric entities to start with 0 currently. (Perfectly sane since it's designed to handle pure HTML, without add-ons, and to make that safe; but in your use-case, you want more entity flexibility.)
You could extend that class and overwrite the constructor (where these regexen are being defined, by instead defining your own where you remove the 0* from the // dec part of the regexen), instantiating that, try setting $this->_entity_parser on a Lexer created with HTMLPurifier_Lexer::create($config) to your instantiated EntityParser object (this is the part I am least sure about whether it would work; you might have to create a Lexer patch with extends as well), then supply the altered Lexer to the config using Core.LexerImpl.
I have no working proof-of-concept of these steps for you right now (especially in the context of Laravel), but you should be able to go through those motions in the purifier.php file, before the return.
I solved the problem by setting key Core.EscapeNonASCIICharacters to true
under my default key in my purifier.php file and the problem has gone.

Match part of a string with regex

I have two arrays of strings and I want to check if a string of array a matches a string from array b. Those strings are phone numbers that might come in different formats. For example:
Array a might have a phone number with prefix like so +44123123123 or 0044123123123
Array b have a standard format without prefixes like so 123123123
So I'm looking for a regex that can match a part of a string like +44123123123 with 123123123
Btw I'm using Swift but I don't think there's a native way to do it (at least a more straightforward solution)
EDIT
I decided to reactivate the question after experimenting with the library #Larme mentioned because of inconsistent results.
I'd prefer a simper solution as I've stated earlier.
SOLUTION
Thanks guys for the responses. I saw many comments saying that Regex is not the right solution for this problem. And this is partly true. It could be true (or false) depending on my current setup/architecture ( which thinking about it now I realise that I should've explained better).
So I ended up using the native solution (hasSuffix/contains) but to do that I had to do some refactoring on the way the entire flow was structured. In the end I think it was the least complicated solution and more performant of the two. I'll give the bounty to #Alexey Inkin for being the first to mention the native solution and the right answer to #Ωmega for providing a more complete solution.
I believe regex is not the right approach for this task.
Instead, you should do something like this:
var c : [String] = b.filter ({ (short : String) -> Bool in
var result = false
for full in a {
result = result || full.hasSuffix(short)
}
return result
})
Check this demo.
...or similar solution like this:
var c : [String] = b.filter ({ (short : String) -> Bool in
for full in a {
if full.hasSuffix(short) { return true }
}
return false
})
Check this demo.
As you do not mention requirements to prefixes, the simplest solution is to check if string in a ends with a string in b. For this, take a look at https://developer.apple.com/documentation/swift/string/1541149-hassuffix
Then, if you have to check if the prefix belongs to a country, you may replace ^00 with + and then run a whitelist check against known prefixes. And the prefix itself can be obtained as a substring by cutting b's length of characters. Not really a regex's job.
I agree with Alexey Inkin that this can also nicely be solved without regex. If you really want a regex, you can try something like the following:
(?:(\+|00)(93|355|213|1684|376))?(\d+)
^^^^^^^^^^^^^^^^^^^^^ Add here all your expected country prefixes (see below)
^^^ ^^ Match a country prefix if it exists but don't give it a group number
^^^^^^^ Match the "prefix-prefix" (+ or 00)
^^^^ Match the local phone number
Unfortunatly with this regex, you have to provide all the expected country prefixes. But you can surely get this list online, e.g. here: https://www.countrycode.org
With this regex above you will get the local phone number in matching group 3 (and the "prefix-prefix" in group 1 and the country code in group 2).

IBM Watson Assistant: Chatbot Entity Confusion over regular expression

I have an entity called "#material_number" in which two values are stored.
First value is "material_number1" with the pattern (\d{3}).(\d{3})
The second value is "material_number2" with the pattern (\d{3}).(\d{3}).(\d{3})
When the user enters a material number, I store the value with a context variable called "$materialnumber" and I set the value of this variable to "?#material_number.literal?". And at the end the bot responds "Oh okay, the material number is $materialnumber."
The problem is that when the user enters a material number like "123.123.123", the bot thinks that the material number is "123.123". Basically it neglects the last three digits and prompts back "Oh okay, the material number is 123.123".
What can I do in order to fix this confusion?
I quickly tested this and there are two problems. First, the dot (. is a special wildcard and needs to be escaped. Second, Watson Assistant does not support the full regex options and seems to match both numbers when typing in the longer number.
You can simply escape using a \ and change your definition or use mine:
num1: (\d{3}\.){1}\d{3}
num2: (\d{3}\.){2}\d{3}
Because of the trouble with the regex evaluation I solved that in the expression itself. Watson Assistant holds the longer match as second value (if matched). The following expression looks if the long number, material_number2, has been matched, then extracts the correct value for it. It assumes that the shorter (incorrect) match is stored first.
{
"context": {
"materialnumber": "<? #matrial_number:matnum2 ? entities.material_number[1].literal : entities.material_number[0].literal ?>"
}
}

Erlang mnesia equivalent of "select * from Tb"

I'm a total erlang noob and I just want to see what's in a particular table I have. I want to just "select *" from a particular table to start with. The examples I'm seeing, such as the official documentation, all have column restrictions which I don't really want. I don't really know how to form the MatchHead or Guard to match anything (aka "*").
A very simple primer on how to just get everything out of a table would be very appreciated!
For example, you can use qlc:
F = fun() ->
Q = qlc:q([R || R <- mnesia:table(foo)]),
qlc:e(Q)
end,
mnesia:transaction(F).
The simplest way to do it is probably mnesia:dirty_match_object:
mnesia:dirty_match_object(foo, #foo{_ = '_'}).
That is, match everything in the table foo that is a foo record, regardless of the values of the fields (every field is '_', i.e. wildcard). Note that since it uses record construction syntax, it will only work in a module where you have included the record definition, or in the shell after evaluating rr(my_module) to make the record definition available.
(I expected mnesia:dirty_match_object(foo, '_') to work, but that fails with a bad_type error.)
To do it with select, call it like this:
mnesia:dirty_select(foo, [{'_', [], ['$_']}]).
Here, MatchHead is _, i.e. match anything. The guards are [], an empty list, i.e. no extra limitations. The result spec is ['$_'], i.e. return the entire record. For more information about match specs, see the match specifications chapter of the ERTS user guide.
If an expression is too deep and gets printed with ... in the shell, you can ask the shell to print the entire thing by evaluating rp(EXPRESSION). EXPRESSION can either be the function call once again, or v(-1) for the value returned by the previous expression, or v(42) for the value returned by the expression preceded by the shell prompt 42>.

JQuery UI Autocomplete returning all values

I have the following code:
$("#auto").autocomplete({
source: "js/search.php",
minLength: "3" });
This code is assign to an input text box where i type a name and after 3 letters it should return the ones that have similar letters. For my case it is returning all values, even those not related to the 3 letters already typed. My question is:
How to send my search.php file the value inside the input so it should know what to search for. For the moment it searches for everything. I checked the value that was going to php and it was empty. Since the query to mysql uses LIKE '%VARIABLE%' and the variable is empty it searches for '%%' which is all cases.
How can i send the correct informacion from JS to PHP with the simplest form.
Here is the explanation :
http://www.simonbattersby.com/blog/jquery-ui-autocomplete-with-a-remote-database-and-php/
Regards