Can actions in Lex access individual regex groups? - lex

Can actions in Lex access individual regex groups?
(NOTE: I'm guessing not, since according to the documentation the grouping characters, parentheses, are used only to change precedence. If that's the case, can you recommend an alternative C/C++ scanner generator that can do this? I'm not really keen on writing my own lexical analyzer.)
Example:
Let's say I have this input: foo [tagName attribute="value"] bar and I want to extract the tag using Lex/Flex. I could certainly write this rule:
\[[a-z]+[[:space:]]+[a-z]+=\"[a-z]+\"\] printf("matched %s", yytext);
But let's say I want to access certain parts of the string, e.g. the attribute value, without having to parse yytext again (since the string has already been scanned, it doesn't really make sense to scan part of it a second time). So something like this would be preferable (regex groups):
\[[a-z]+[[:space:]]+[a-z]+=\"([a-z]+)\"\] printf("matched attribute %s", $1);

You can split it across start conditions. Something like this:
%x VALUEPARSE ENDSTATE
%{
#include <string.h>
static char string_buf[100];   /* holds the extracted attribute value */
%}
%%
<INITIAL>\[[a-z]+[[:space:]]+[a-z]+=\"   { BEGIN(VALUEPARSE); }
<VALUEPARSE>[a-z]+                       { /* the attribute value is all of yytext here */
                                           strncpy(string_buf, yytext, sizeof string_buf - 1);
                                           string_buf[sizeof string_buf - 1] = '\0';
                                           BEGIN(ENDSTATE); }
<ENDSTATE>\"\]                           { BEGIN(INITIAL); }
%%
As for an alternative for C/C++ - I use the Qt class QRegularExpression for the same kind of task; it makes it very easy to get at a regex group after a match.
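For instance, a minimal sketch of that QRegularExpression approach (assuming Qt 5 or later, and reusing the tag format from the question) could look like this:

#include <QRegularExpression>
#include <QRegularExpressionMatch>
#include <QString>
#include <cstdio>

int main() {
    // Hypothetical input in the format from the question.
    const QString input = QStringLiteral("foo [tagName attribute=\"value\"] bar");

    // Same pattern idea as the lex rule, with the attribute value in group 1.
    const QRegularExpression re(
        QStringLiteral("\\[[a-zA-Z]+\\s+[a-zA-Z]+=\"([a-zA-Z]+)\"\\]"));

    const QRegularExpressionMatch match = re.match(input);
    if (match.hasMatch())
        std::printf("matched attribute %s\n", qPrintable(match.captured(1)));
    return 0;
}

match.captured(1) returns the text of the first parenthesized group, i.e. what $1 stands for in the hypothetical lex rule above.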

Certainly at least some versions of lex do.
But the default lex/flex downloadable from sourceforge.org does not seem to mention it in its documentation, and even in the example quoted below the full string is left in yytext.
From IBM's LEX documentation for AIX:
(Expression)
Matches the expression in the parentheses.
The () (parentheses) operator is used for grouping and causes the expression within parentheses to be read into the yytext array. A group in parentheses can be used in place of any single character in any other pattern.
Example: (ab|cd+)?(ef)* matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd, or abcdef.

Related

Searching for two Word wildcard strings that are nested

I'm having trouble finding the proper Word wildcard string to find numbers that fit the following patterns:
"NN NN NN" or "NN NN NN.NN" (where N is any number 0-9)
The trouble is that the first string is a subset of the second. My goal is to find a single wildcard string that will capture both. Unfortunately, that would require a zero-or-more-occurrences operator for the ".NN" portion, and Word wildcards don't have one.
I'm having to do two searches, and I'm using the following patterns:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9]{2}?[!0-9]
[0-9]{2}[^s ][0-9]{2}[^s ][0-9]{2}.[0-9]{2}
The problem is the first pattern. It works well unless the number is in a table or something and there is nothing after it to match (or not match, if you will) the [!0-9].
You could use a single wildcard Find:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9][0-9.]{1,4}
or:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9][0-9.]{1;4}
to capture both. Which you use depends on your regional settings.
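For comparison, a standard regular-expression engine (outside Word's wildcard dialect) can express the optional ".NN" part directly with an optional group; here is a small hypothetical C++ sketch, assuming plain spaces between the digit pairs:

#include <cstdio>
#include <regex>
#include <string>

int main() {
    // "NN NN NN" optionally followed by ".NN" - the zero-or-one group
    // that Word's wildcard syntax cannot express.
    const std::regex pattern(R"(\b\d{2} \d{2} \d{2}(\.\d{2})?)");

    for (const std::string text : {"ref 12 34 56 end", "ref 12 34 56.78 end"}) {
        std::smatch m;
        if (std::regex_search(text, m, pattern))
            std::printf("matched: %s\n", m.str(0).c_str());
    }
    return 0;
}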

How to extract special information from Watson Assistant (Conversation)?

I have the user input "What is the hostname of serial GX0211229342?". The serial can be a numeric or alphanumeric mix (e.g. 7842344 or H52WBD1 etc).
How can I extract GX0211229342 from the sentence and set it into context in Watson assistant (Watson Conversation)?
Your case is tricky because, if the ID is only letters, it could be any part of the sentence. By using the $ anchor you have told the regex processor to look for the pattern at the end of the sentence, so it only works in those cases.
What you could do is make use of a non-capturing group, which the RE2 syntax provides. There are some examples of non-capturing groups here on SO. Basically, search for something like the following (not tested):
(?:serial)(?:number)?[0-9a-zA-Z]+
The first group ("serial") would be matched but not captured, the optional "number" likewise, and then comes the alphanumeric part.
If the serial number can be described by one, two, or any number of regular expressions, then you have the option of creating a serial-number entity based on those regular expressions.
The Conversation service will then be able to identify serial numbers through the entity's pattern matching.
I figured it out using a Watson entity pattern; the regular expression is: ([0-9]+[a-zA-Z]+|[a-zA-Z]+[0-9]+)[0-9a-zA-Z]*
It is used to extract alphanumeric serials from the input.
One more pattern, [0-9]+, is used to extract purely numeric serials.
Thanks for all the help.
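If you want to sanity-check that alphanumeric pattern outside Watson before adding it as an entity pattern, here is a small hypothetical sketch using C++ std::regex (this particular expression uses only syntax that behaves the same way there):

#include <cstdio>
#include <regex>
#include <string>

int main() {
    const std::string question = "What is the hostname of serial GX0211229342?";

    // The alphanumeric-serial pattern from the answer above: at least one
    // digit and one letter in either order, then any alphanumeric tail.
    const std::regex serial(R"(([0-9]+[a-zA-Z]+|[a-zA-Z]+[0-9]+)[0-9a-zA-Z]*)");

    std::smatch m;
    if (std::regex_search(question, m, serial))
        std::printf("extracted serial: %s\n", m.str(0).c_str());  // GX0211229342
    return 0;
}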

Algolia tag not searchable when ending with special characters

I'm coming across a strange situation where I cannot search on string tags that end with a special character. So far I've tried ) and ].
For example, given a Fruit index with a record with a tag apple (red), if you query (using the JS library) with tagFilters: "apple (red)", no results will be returned even if there are records with this tag.
However, if you change the tag to "apple (red" (so it no longer ends with a special character), results will be returned.
Is this a known issue? Is there a way to get around this?
EDIT
I saw this FAQ on special characters. However, it seems that even if I set () as separator characters to index, that only affects the attributes that are searchable directly, not the tags. Is this correct? Can I change the separator characters to index on tags?
You should try using the array syntax for your tags:
tagFilters: ["apple (red)"]
The reason it currently fails is the syntax of tagFilters. When you pass a string, Algolia tries to parse it using a special syntax, documented here, in which commas mean "AND" and parentheses delimit an "OR" group.
By the way, tagFilters is now deprecated for a much clearer syntax available with the filters parameter. For your specific example, you'd use it this way:
filters: '_tags:"apple (red)"'

Any way to extend org-mode tags to include characters like "-"?

BRIEF: can the set of characters permitted in org-mode tags be extended in any way?
E.g. to include dashes (-)?
DETAIL:
I see
http://orgmode.org/org.html#Tags
Tags are normal words containing letters, numbers, ‘_’, and ‘#’. Tags
must be preceded and followed by a single colon, e.g., ‘:work:’.
I am somewhat surprised that this is not extensible. Is it, and I have missed it?
TODO keywords can include dashes. Occasionally I would like to treat TODO keywords and tags as interchangeable - e.g. turn a TODO into a tag, and vice versa - but this syntax difference gets in the way.
Before I start coding, does anyone know why dashes are not allowed? I conjecture confusion with timestamps.
Not certain, but it is likely because certain agenda search strings use "-" as an operator (to exclude a tag), and there is not yet a way to escape it.
I have used hyphens in properties and spaces in TODO keywords with no problems for my purposes, but I have not tried them in searches.
The new exporter will be included in Org 8.0, and with it a detailed description of the syntax. If you want to file a feature request to allow hyphens, perhaps by allowing escaping in search strings, now is the time to do it.

Include slashes and parentheses in tokens

Background
I have search indexes containing Greek characters. Many people don't know how to type Greek, so they enter something called "beta-code". Beta-code can be converted into Greek. For example, the beta-code "NO/MOU" would be converted to "νόμου". Characters such as slashes and parentheses are used to indicate accents.
Desired Behavior
I want users to be able to search using either beta-code or text in the Greek script. I figured out that the Whoosh Variations class provides the mechanism I need and it almost solves my problem.
Problem
The Variations class works well except when a slash or a parenthesis is used to indicate an accent in a user's query. The problem is that the query is parsed in such a way that the special characters used to denote the accent cause the words to be split up. For example, a search for "NO/MOU" results in the Variations class being asked to find variations of "no" and "mou" instead of "NO/MOU".
Question
Is there a way to influence how the query is parsed so that slashes and parentheses are included in the search words (i.e. so that a search for "NO/MOU" results in a search for the token "NO/MOU" instead of "no" and "mou")?
The search parser uses a Tokenizer class to break the search string into individual terms, and Whoosh will use the class that is associated with the schema. For example, in the case below, SimpleAnalyzer() will be used when searching the "content" field.
Schema(verse_id=NUMERIC(unique=True, stored=True),
       content=TEXT(analyzer=SimpleAnalyzer()))
By default, SimpleAnalyzer() uses the following regular expression to tokenize search terms: \w+(\.?\w+)*
To use a different regular expression, pass it as the first argument to SimpleAnalyzer. For example, to include the beta-code characters (slashes, parentheses, etc.) in tokens, use the following SimpleAnalyzer:
SimpleAnalyzer( rcompile(r"[\w/*()=\+|&']+(\.?[\w/*()=\+|&']+)*") )
Searches will now allow terms to include the special beta-code characters, and the Variations class will be able to convert each term to its Unicode version.