Can RegexParser support patterns with (white)spaces in them? - scala

We want to create regex patterns that contain whitespace. However, that seems to conflict with the token parsing done by RegexParsers: the input character stream is broken into separate tokens before the individual rules (parsers) ever see the input, so the rules can never match their intended inputs.
Is there any workaround or suggested approach for this?

RegexParsers skips whitespace by default:
The parsing methods call the method skipWhitespace (defaults to true) and, if true, skip any whitespace before each parser is called.
RegexParser Doc
Overriding skipWhitespace to return false should fix your issue.
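For example, a minimal sketch (the parser object and rules here are made up for illustration, not taken from your grammar):

import scala.util.parsing.combinator.RegexParsers

object WhitespaceAwareParser extends RegexParsers {
  // keep whitespace in the input so our own regexes can match it
  override def skipWhitespace = false

  // hypothetical rule: a word, one or more spaces, then another word
  def pair: Parser[(String, String)] =
    """[A-Za-z]+""".r ~ """\s+""".r ~ """[A-Za-z]+""".r ^^ {
      case left ~ _ ~ right => (left, right)
    }
}

// WhitespaceAwareParser.parseAll(WhitespaceAwareParser.pair, "hello world")
// succeeds and yields ("hello", "world")

With skipWhitespace left at its default of true, the \s+ rule above would never see any whitespace to match, because it is skipped before each parser runs.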

Related

How to properly parse a multi-character token in tree-sitter scanner function

Tree-sitter allows you to use an external scanner for those tokens that are tricky to parse or that depend on specific states like multiline strings.
The scanner takes a convenient lexer object with several methods that allow you to "scan" the document looking for the proper token characters.
Two of the key parts of this lexer are lookahead, which tells you the next character the lexer is "looking at", and advance, which moves the lexer pointer to the next character.
However, after reading the docs and several other parsers that make use of this, it's still not clear to me whether calling these methods will "affect" the overall tree-sitter parser or if they are just local to my function invocation.
Especially tricky is trying to parse a multi-character token (more than two characters, in fact), because you need to "advance" the lexer, consuming characters that may belong to other tokens. One possible escape is to just return false after consuming the characters and let tree-sitter go on to the next step in the parsing, but that may skip other valid tokens that depend on the characters I already consumed.
Of course, I could move this parsing to the bottom of the scan function, but then other, shorter tokens might shadow this longer one and also produce an incorrect parse.
As far as I know, there is no way to "rewind" the parser to undo the "consumption" of the characters, so I am not sure how to deal with this.
The tokens that I'm trying to parse are {js| for string opening and |js} for string closing.

Parsing commas in Sphinx

I have a field that can contain multiple commas, which are critical to some regex pattern matching.
Commas, however, are not indexed, and adding them to the charset breaks things (for a number of technical reasons related to how Sphinx searches/indexes).
I cannot change the character prior to indexing (e.g. to COMMA) so that I would have some anchor for the pattern, and I can't properly extract the pattern without one.
My only thought is to use exceptions to map , => COMMA (this won't process large text fields, so it's not a huge issue). I'm curious whether this will work and what the pipeline is, i.e. what it could possibly break that I'm not considering. AFAIK exceptions happen first and do not obey charset_table, so this might in fact work. I get that I can test it to see if it does, but I'm more concerned with what this might break, given my rudimentary knowledge of the Sphinx indexing pipeline.
Just use U+2C to add comma to your charset_table, e.g.
charset_table=a..z,A..Z,0..9,U+2C
You might also want to add it to blend_chars instead, to treat a comma as both a word separator and a word character at the same time.
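If it helps, here is roughly how the two options would look in the index definition in sphinx.conf (a sketch, not your actual config):

# index the comma (U+2C) as a regular word character
charset_table = a..z, A..Z, 0..9, U+2C

# or leave charset_table alone and treat the comma as both a word character and a separator
blend_chars = U+2C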

Algolia tag not searchable when ending with special characters

I'm coming across a strange situation where I cannot search on string tags that end with a special character. So far I've tried ) and ].
For example, given a Fruit index with a record with a tag apple (red), if you query (using the JS library) with tagFilters: "apple (red)", no results will be returned even if there are records with this tag.
However, if you change the tag to apple (red (so that it no longer ends with a special character), results will be returned.
Is this a known issue? Is there a way to get around this?
EDIT
I saw this FAQ on special characters. However, it seems as though even if I set () as separator characters to index, that only affects the attributes that are directly searchable, not the tags. Is this correct? Can I change the separator characters to index on tags?
You should try using the array syntax for your tags:
tagFilters: ["apple (red)"]
It is currently failing because of the syntax of tagFilters. When you pass a string, Algolia tries to parse it using a special syntax, documented here, where commas mean "AND" and parentheses delimit an "OR" group.
By the way, tagFilters is now deprecated for a much clearer syntax available with the filters parameter. For your specific example, you'd use it this way:
filters: '_tags:"apple (red)"'

Lua pattern matching for email address

I have the following code:
if not (email:match("[A-Za-z0-9%.]+#[%a%d]+%.[%a%d]+")) then
print(false)
end
It doesn't currently catch
"test#yahoo,ca" or "test#test1.test2,com"
as an error.
I thought that by limiting the input to %a (characters) and %d (digits), I would by default catch any punctuation, including commas.
But I guess I'm wrong. Or there's something else that I'm just not seeing.
A second pair of eyes would be appreciated.
In the example of "test#test1.test2,com", the pattern matches test#test1.test2 and stops because of the following ,. It's not lying; it does match, just not what you expected. To fix this, use anchors:
^[A-Za-z0-9%.]+#[%a%d]+%.[%a%d]+$
You can further simplify it to:
^[%w.]+#%w+%.%w+$
in which %w matches an alphanumeric character.
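To see the difference the anchors make, here is a rough analogue in Scala regex syntax (\w playing the role of Lua's %w; only the anchoring behaviour is the point):

val unanchored = """[\w.]+#\w+\.\w+""".r
val anchored   = """^[\w.]+#\w+\.\w+$""".r

unanchored.findFirstIn("test#test1.test2,com")  // Some("test#test1.test2") - the ",com" tail is silently ignored
anchored.findFirstIn("test#test1.test2,com")    // None - the whole string must match, so the input is rejected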
I had a hard time finding a true email validation function for Lua.
I couldn't find any that would allow some of the special cases that email addresses allow. Things like + or quotes are actually acceptable in emails.
I wrote my own Lua function that passes all the tests outlined in the spec for email addresses.
http://ohdoylerules.com/snippets/validate-email-with-lua
I also added a bunch of comments, so if there is some strange validation that you want to ignore, just remove the if statement for that particular check.

Need token delimiter that doesn't conflict with PowerShell or RegEx special characters

I have written a PowerShell function that expands tokens quite nicely. Unfortunately, I used % as my token delimiter, and now I find that % is a special character in PowerShell v3 that is likely to conflict. I am thinking I might use ~ as my delimiter, but I wonder if there is a best practice here? Something that is sure to not conflict with either PowerShell or RegEx, and since my users are not always IT folks, and only need to use my tool a few weeks out of the year, something that is really obvious would be helpful.
I am currently expanding tokens before I do any other RegEx processing, and before I do any $ExecutionContext.InvokeCommand.ExpandString(), but I keep finding new opportunities to expand the tool and I may be doing those things earlier at some point, so I want to future proof the data as much as possible.
Any advice is greatly appreciated!
Gordon
How about using # as your delimiter?
It's not used at all in regex. In PowerShell it designates a comment, but only if it appears outside the context of a quoted string literal. In this application you're using it as a delimiter inside strings, so it's always going to appear in a quoted string argument.
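The same idea in a short sketch (written in Scala rather than your PowerShell function, with made-up token names and lookup map): the # delimiter needs no escaping in the regex, and the tokens always live inside quoted string data.

val values = Map("NAME" -> "Gordon", "ENV" -> "prod")

def expandTokens(template: String): String =
  "#([A-Za-z0-9_]+)#".r.replaceAllIn(template, m =>
    scala.util.matching.Regex.quoteReplacement(
      values.getOrElse(m.group(1), m.matched)))  // leave unknown tokens untouched

// expandTokens("Deploy #NAME#'s build to #ENV#") == "Deploy Gordon's build to prod"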