How to use regex to include/exclude some input files in sc.textFile? - scala

I am trying to filter input files by date in Apache Spark, inside the file-to-RDD function sc.textFile(). I have attempted the following:
sc.textFile("/user/Orders/201507(2[7-9]{1}|3[0-1]{1})*")
This should match the following:
/user/Orders/201507270010033.gz
/user/Orders/201507300060052.gz
Any idea how to achieve this?

Looking at the accepted answer, it appears to use some form of glob syntax. It also reveals that the API exposes Hadoop's FileInputFormat.
Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPath "may represent a file, a directory, or, by using glob, a collection of files and directories". Presumably, SparkContext uses those APIs to set the path as well.
The syntax of the glob includes:
* (matches 0 or more characters)
? (matches a single character)
[ab] (character class)
[^ab] (negated character class)
[a-b] (character range)
{a,b} (alternation)
\c (escape character)
Following the example in the accepted answer, it is possible to write your path as:
sc.textFile("/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*")
It's not clear how the alternation syntax can be used here, since the comma is used to delimit a list of paths (as shown above). According to zero323's comment, no escaping is necessary:
sc.textFile("/user/Orders/201507{2[7-9],3[0-1]}*")

Related

What's the common denominator for regex "pattern" in OpenAPI?

I'm using FastAPI, which allows pattern=re.compile("(?P<foo>[42a-z]+)...").
https://editor.swagger.io/ shows an error for this pattern.
My guess is that Python's named group syntax (?P<name>...) is different from the ES2018 syntax (?<name>...).
But, come to think of it, the idea of OpenAPI is interoperability, and some other language, especially a compiled one, may use yet another notation, or may not support named groups in regular expressions at all.
What common denominator of regular expression syntax should I use?
OpenAPI uses JSON Schema, and the JSON Schema spec defines regex as "A regular expression, which SHOULD be valid according to the ECMA-262 regular expression dialect." Here is the relevant ECMA-262 section.
Of course, non-JavaScript implementations probably won't care too much about that and will just use the default regex library of their platform. So good luck figuring out the common denominator :)
I suggest using regexes that are as simple as possible, and adding tests for them using the library you use in production.
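For instance, a minimal sketch of such a test in Python (the pattern and sample inputs are made up):
import re
PATTERN = r"^[0-9a-f]{8}$"  # sticks to the portable subset: classes, anchors, {m,n}
assert re.fullmatch(PATTERN, "deadbeef")
assert re.fullmatch(PATTERN, "DEADBEEF") is None  # uppercase is rejected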
JSON Schema recommends a specific subset of regular expressions, because the authors accept that most implementations will not support the full ECMA-262 syntax:
https://json-schema.org/understanding-json-schema/reference/regular_expressions.html
A single unicode character (other than the special characters below) matches itself.
.: Matches any character except line break characters. (Be aware that what constitutes a line break character is somewhat dependent on your platform and language environment, but in practice this rarely matters).
^: Matches only at the beginning of the string.
$: Matches only at the end of the string.
(...): Group a series of regular expressions into a single regular expression.
|: Matches either the regular expression preceding or following the | symbol.
[abc]: Matches any of the characters inside the square brackets.
[a-z]: Matches the range of characters.
[^abc]: Matches any character not listed.
[^a-z]: Matches any character outside of the range.
+: Matches one or more repetitions of the preceding regular expression.
*: Matches zero or more repetitions of the preceding regular expression.
?: Matches zero or one repetitions of the preceding regular expression.
+?, *?, ??: The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behavior isn't desired and you want to match as few characters as possible; adding ? after a qualifier makes it perform the match in this non-greedy, minimal fashion.
(?!x), (?=x): Negative and positive lookahead.
{x}: Match exactly x occurrences of the preceding regular expression.
{x,y}: Match at least x and at most y occurrences of the preceding regular expression.
{x,}: Match x occurrences or more of the preceding regular expression.
{x}?, {x,y}?, {x,}?: Lazy versions of the above expressions.
P.S. Kudos to @erosb for the idea of how to find this recommendation.
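As an illustration, here is a sketch of trimming the question's pattern down to that subset; Python's re is used only as a convenient test bed:
import re
python_only = re.compile(r"(?P<foo>[42a-z]+)")  # (?P<name>...) is Python-specific syntax
portable = re.compile(r"([42a-z]+)")            # a plain group stays inside the subset
print(portable.search("abc42").group(1))        # -> "abc42"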

Searching for two Word wildcard strings that are nested

I'm having trouble finding the proper Word wildcard string to find numbers that fit the following patterns:
"NN NN NN" or "NN NN NN.NN" (where N is any number 0-9)
The trouble is that the first string is a subset of the second. My goal is to find a single wildcard string that captures both. Unfortunately, that would require a zero-or-more-occurrences operator for the ".NN" portion, and Word's wildcards don't have one.
I'm having to do two searches, and I'm using the following patterns:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9]{2}?[!0-9]
[0-9]{2}[^s ][0-9]{2}[^s ][0-9]{2}.[0-9]{2}
The problem is the first pattern. It works well unless the number is in a table or similar and there is nothing after it to match (or not match, if you will) the [!0-9].
You could use a single wildcard Find:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9][0-9.]{1,4}
or:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9][0-9.]{1;4}
to capture both. Which one you use depends on your regional settings: the two patterns differ only in the list separator, which is a comma in some locales and a semicolon in others.
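As a side note, outside Word a standard regex engine has exactly the optional-group operator the question is missing, so a single pattern covers both shapes. A quick Python sketch with made-up sample numbers:
import re
pattern = re.compile(r"\d{2} \d{2} \d{2}(\.\d{2})?")
for text in ["12 34 56", "12 34 56.78", "12 34 5"]:
    print(text, "->", bool(pattern.fullmatch(text)))  # True, True, False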

Algolia tag not searchable when ending with special characters

I'm coming across a strange situation where I cannot search on string tags that end with a special character. So far I've tried ) and ].
For example, given a Fruit index with a record with a tag apple (red), if you query (using the JS library) with tagFilters: "apple (red)", no results will be returned even if there are records with this tag.
However, if you change the tag to apple (red (not ending with a special character), results will be returned.
Is this a known issue? Is there a way to get around this?
EDIT
I saw this FAQ on special characters. However, it seems that even if I set () as separator characters to index, that only affects the searchable attributes themselves, not the tags. Is this correct? Can I change the separator characters to index on tags?
You should try using the array syntax for your tags:
tagFilters: ["apple (red)"]
The reason it is currently failing is because of the syntax of tagFilters. When you pass a string, it tries to parse it using a special syntax, documented here, where commas mean "AND" and parentheses delimit an "OR" group.
By the way, tagFilters is now deprecated for a much clearer syntax available with the filters parameter. For your specific example, you'd use it this way:
filters: '_tags:"apple (red)"'
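For reference, a sketch of both variants using the algoliasearch Python client (v2-style API); the application ID, API key, and index name are placeholders:
from algoliasearch.search_client import SearchClient
client = SearchClient.create("YourApplicationID", "YourSearchOnlyAPIKey")  # placeholders
index = client.init_index("Fruit")
# Array form: each element is taken literally, so "(" and ")" are safe.
res = index.search("", {"tagFilters": ["apple (red)"]})
# Equivalent query with the newer filters parameter.
res = index.search("", {"filters": '_tags:"apple (red)"'})
print(res["hits"])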

Include slashes and parentheses in tokens

Background
I have search indexes containing Greek characters. Many people don't know how to type Greek, so they enter something called "beta-code". Beta-code can be converted into Greek; for example, beta-code "NO/MOU" would be converted to "νόμου". Characters such as a slash or a parenthesis are used to indicate an accent.
Desired Behavior
I want users to be able to search using either beta-code or text in the Greek script. I figured out that the Whoosh Variations class provides the mechanism I need and it almost solves my problem.
Problem
The Variations class works well except when a slash or a parenthesis is used to indicate an accent in a user's query. The problem is that the query is parsed such that the special characters used to denote the accent cause the words to be split up. For example, a search for "NO/MOU" results in the Variations class being asked to find variations of "no" and "mou" instead of "NO/MOU".
Question
Is there a way to influence how the query is parsed so that slashes and parentheses are included in the search words (i.e. so that a search for "NO/MOU" results in a search for the single token "NO/MOU" instead of "no" and "mou")?
The search parser uses a Tokenizer class to break the search string into individual terms. Whoosh will use the class that is associated with the schema. For example, in the case below, SimpleAnalyzer() will be used when searching the "content" field.
Schema(verse_id=NUMERIC(unique=True, stored=True),
       content=TEXT(analyzer=SimpleAnalyzer()))
By default, SimpleAnalyzer() uses the following regular expression to tokenize search terms: r"\w+(\.?\w+)*"
To use a different regular expression, pass it as the first argument to SimpleAnalyzer. For example, to include the beta-code characters (slashes, parentheses, etc.) in tokens, use the following analyzer:
SimpleAnalyzer(rcompile(r"[\w/*()=\+|&']+(\.?[\w/*()=\+|&']+)*"))
Searches will now allow terms to include the special beta-code characters, and the Variations class will be able to convert each term to its Unicode version.
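A minimal end-to-end sketch of the above, assuming Whoosh is installed; the index directory and sample document are made up, and re.compile stands in for Whoosh's rcompile helper:
import os
import re
from whoosh.analysis import SimpleAnalyzer
from whoosh.fields import NUMERIC, TEXT, Schema
from whoosh.index import create_in
from whoosh.qparser import QueryParser

beta_expr = re.compile(r"[\w/*()=\+|&']+(\.?[\w/*()=\+|&']+)*")
schema = Schema(verse_id=NUMERIC(unique=True, stored=True),
                content=TEXT(analyzer=SimpleAnalyzer(beta_expr)))

os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)
with ix.writer() as writer:
    writer.add_document(verse_id=1, content="NO/MOU")

with ix.searcher() as searcher:
    query = QueryParser("content", schema).parse("NO/MOU")
    print(list(searcher.search(query)))  # "NO/MOU" survives tokenization as one term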

Can actions in Lex access individual regex groups?

Can actions in Lex access individual regex groups?
(NOTE: I'm guessing not, since according to the documentation the group characters - parentheses - are used to change precedence. But if so, can you recommend an alternative C/C++ scanner generator that can do this? I'm not really hot on writing my own lexical analyzer.)
Example:
Let's say I have this input: foo [tagName attribute="value"] bar and I want to extract the tag using Lex/Flex. I could certainly write this rule:
\[[a-z]+[[:space:]]+[a-z]+=\"[a-z]+\"\] printf("matched %s", yytext);
But let's say I want to access certain parts of the string, e.g. the attribute value, without having to parse yytext again (since the string has already been scanned, it doesn't really make sense to scan part of it a second time). Something like this would be preferable (regex groups):
\[[a-z]+[[:space:]]+[a-z]+=\"([a-z]+)\"\] printf("matched attribute %s", $1);
You can separate it into start conditions. Something like this:
%x VALUEPARSE ENDSTATE
%%
    char string_buf[100];   /* code at the top of the rules section must be indented */

<INITIAL>\[[a-z]+[[:space:]]+[a-z]+=\"   { BEGIN(VALUEPARSE); }
<VALUEPARSE>[a-z]+                       { /* capture the value text (assumes yyleng < 100) */
                                           strncpy(string_buf, yytext, yyleng);
                                           string_buf[yyleng] = '\0';
                                           BEGIN(ENDSTATE); }
<ENDSTATE>\"\]                           { BEGIN(INITIAL); }
%%
As for an alternative C/C++ scanner generator: I use the Qt class QRegularExpression for the same thing; it makes it very easy to get a regex group after a match.
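For comparison, a sketch of the same extraction with capture groups in Python's re (the tag-name class is widened to [A-Za-z] so it matches the question's sample input):
import re
tag = re.compile(r'\[([A-Za-z]+)[ \t]+([A-Za-z]+)="([a-z]+)"\]')
m = tag.search('foo [tagName attribute="value"] bar')
if m:
    print("matched attribute", m.group(3))  # -> value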
Certainly at least some forms of lex do.
But the default lex/flex downloadable from sourceforge.org does not seem to list it in its documentation, and this example leaves the full string in yytext.
From IBM's LEX documentation for AIX:
(Expression)
Matches the expression in the parentheses.
The () (parentheses) operator is used for grouping and causes the expression within parentheses to be read into the yytext array. A group in parentheses can be used in place of any single character in any other pattern.
Example: (ab|cd+)?(ef)* matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd, or abcdef.
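That example is easy to cross-check with any regex engine; a quick Python sketch using the doc's own match and non-match lists:
import re
pat = re.compile(r"(ab|cd+)?(ef)*")
for s in ["abefef", "efefef", "cdef", "cddd", "abc", "abcd", "abcdef"]:
    print(s, bool(pat.fullmatch(s)))  # True for the first four, False for the rest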