What does the "?" operator do in Elixir?

The Ecto source code makes use of expressions ?0, ?1, etc. You can see how they evaluate:
iex(14)> ?0
48
iex(15)> ?1
49
iex(16)> ?2
50
What does that mean though? This is very hard to search for. What does the ?<character> actually do?

From: https://elixir-lang.org/getting-started/binaries-strings-and-char-lists.html#unicode-and-code-points
In Elixir you can use a ? in front of a character literal to reveal its code point.
If you aren't familiar with code points:
Unicode organizes all of the characters in its repertoire into code charts, and each character is given a unique numerical index. This numerical index is known as a Code Point.
The ?<character> can also be used in interesting ways for pattern matching and guard clauses.
defp parse_unsigned(<<digit, rest::binary>>) when digit in ?0..?9,
do: parse_unsigned(rest, false, false, <<digit>>)
...
defp parse_unsigned(<<?., digit, rest::binary>>, false, false, acc) when digit in ?0..?9,
do: parse_unsigned(rest, true, false, <<acc::binary, ?., digit>>)
The Elixir docs also clarify that it is only syntax. As @sabiwara points out:
Those constructs exist only at the syntax level. quote do: ?A just returns 65 and doesn't show any ? operator
As @Everett noted in the comments, there is a helpful package called Xray that provides some handy utility functions to help understand what's happening.
For example, Xray.codepoint(some_char) does what ?<char> does, but it works for variables, whereas ? only works with literals. Xray.codepoints(some_string) will do the whole string.
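For readers more familiar with other languages, the same idea can be sketched in Python. This is only an analogy, not Elixir: ord() plays the role of ?<character>, and the helper is_ascii_digit is a made-up name that mirrors the guard digit in ?0..?9.

```python
# Python analogy for Elixir's ?<character>: ord() returns a character's code point.
ZERO = ord("0")  # 48, the same value as ?0 in Elixir
NINE = ord("9")  # 57, the same value as ?9 in Elixir


def is_ascii_digit(byte: int) -> bool:
    """Hypothetical helper mirroring the guard `digit in ?0..?9`."""
    return ZERO <= byte <= NINE


# Iterating over a bytes object yields integers, loosely comparable to
# matching <<digit, rest::binary>> one byte at a time.
digits = [b for b in b"12a3" if is_ascii_digit(b)]
print(ZERO, NINE, digits)
```

As in Elixir, the comparison is done on plain integers; nothing about the characters survives beyond their numeric code points.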

Related

What Is The Character For A Wildcard While Searching In VSCode

I'm deobfuscating some code and I forget the operator to use a wildcard while searching for text in VSCode. By this I mean in VSCode whenever you search for code (CMD/CTRL + F), what is the character for a wildcard (e.g. searching for "date{WILDCARD HERE}" would return "date1", "date2", "date", etc.)
I don't recall a wildcard option (I've never used it at least). But the search feature supports using regular expressions.
Given your examples of date1, date2, date, etc., and assuming they follow a pattern of date<n> where n is a number (or nothing in the case of just "date"), the regular expression date[0-9]* should achieve what you want. (Use [0-9] rather than [1-9] so that numbers containing a zero, such as date10, are matched in full.)
You can test the expression out on this site. Input the regular expression and some sample data and see how it matches.
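A quick way to sanity-check such a pattern outside the editor is a few lines of Python. This is only a sketch: the sample text is invented, and VS Code's regex engine is not Python's, although this pattern uses only syntax that is common to most engines.

```python
import re

# Assumed pattern: the literal "date" followed by zero or more digits.
pattern = re.compile(r"date[0-9]*")

# Invented sample data; "update10" is included to show that the pattern
# also matches inside longer words unless it is anchored.
sample = "date1 date2 date update10 day"
matches = pattern.findall(sample)
print(matches)
```

If matches inside longer words are unwanted, a word boundary (\bdate[0-9]*\b) is one option in engines that support it.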

What's the common denominator for regex "pattern" in OpenAPI?

I'm using FastAPI, which allows pattern=re.compile("(?P<foo>[42a-z]+)...").
https://editor.swagger.io/ shows an error for this pattern.
My guess is that Python's named group syntax (?P<name>...) is different from ES2018 (?<name>...).
But, come to think of it, the idea of OpenAPI is interoperability, and some other language, esp. a compiled language may use yet another notation, or may not support named groups in the regular expressions at all.
What common denominator of regular expression syntax should I use?
OpenAPI uses JSON Schema, and the JSON Schema spec defines regex as "A regular expression, which SHOULD be valid according to the ECMA-262 regular expression dialect." Here is the relevant ECMA-262 section.
Of course, non-JavaScript implementations probably won't care too much about it and will just use the default regex library of their platform. So good luck with figuring out the common denominator :)
I suggest keeping regexes as simple as possible and adding some tests for them, using the library that you use in production.
JSON Schema recommends a specific subset of regular expressions because the authors accept that most implementations will not support full ECMA-262 syntax:
https://json-schema.org/understanding-json-schema/reference/regular_expressions.html
A single unicode character (other than the special characters below) matches itself.
.: Matches any character except line break characters. (Be aware that what constitutes a line break character is somewhat dependent on your platform and language environment, but in practice this rarely matters).
^: Matches only at the beginning of the string.
$: Matches only at the end of the string.
(...): Group a series of regular expressions into a single regular expression.
|: Matches either the regular expression preceding or following the | symbol.
[abc]: Matches any of the characters inside the square brackets.
[a-z]: Matches the range of characters.
[^abc]: Matches any character not listed.
[^a-z]: Matches any character outside of the range.
+: Matches one or more repetitions of the preceding regular expression.
*: Matches zero or more repetitions of the preceding regular expression.
?: Matches zero or one repetitions of the preceding regular expression.
+?, *?, ??: The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behavior isn’t desired and you want to match as few characters as possible.
(?!x), (?=x): Negative and positive lookahead.
{x}: Match exactly x occurrences of the preceding regular expression.
{x,y}: Match at least x and at most y occurrences of the preceding regular expression.
{x,}: Match x occurrences or more of the preceding regular expression.
{x}?, {x,y}?, {x,}?: Lazy versions of the above expressions.
P.S. Kudos to @erosb for the idea of how to find this recommendation.
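As a rough illustration of the "keep it simple and test it" advice, the sketch below compares the Python-only named group from the question with an equivalent pattern that sticks to the subset listed above. Only the visible part of the question's pattern is used (the elided "..." is left out), and the sample strings are invented.

```python
import re

# Python-specific named group, as in the question (visible part only).
named = re.compile(r"(?P<foo>[42a-z]+)")
# Same character class, written with a plain group from the recommended subset.
portable = re.compile(r"([42a-z]+)")

samples = ["abc42", "4", "ABC", ""]

# Both patterns should accept and reject exactly the same strings.
for s in samples:
    assert bool(named.fullmatch(s)) == bool(portable.fullmatch(s))

accepted = [s for s in samples if portable.fullmatch(s)]
print(accepted)
```

fullmatch is used here only to keep the comparison strict; a schema validator may apply the pattern as an unanchored search, so explicit ^ and $ anchors are worth considering in the published pattern.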

Difference between STRPOS and POSITION in postgresql

Beginner here.
Is there any other difference apart from syntax in position and strpos function?
If not then why do we have two functions which can achieve the same thing just with a bit of syntax difference?
Those functions do exactly the same thing and differ only in syntax. The documentation for strpos() says:
Location of specified substring (same as position(substring in string), but note the
reversed argument order)
The reason they both exist and differ only in syntax is that POSITION(str1 IN str2) is defined by the ANSI SQL standard. If PostgreSQL had only strpos(), it wouldn't be able to run ANSI SQL queries and scripts that use POSITION.
You can use both commands in order to reach the same goal, i.e. finding the location of a substring in a given string. However, they have different syntaxes and order of arguments:
strpos(String, Substring);
position(Substring in String);
Take a look at all string functions and operators of PostgreSQL here
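To make the shared behaviour concrete, here is a rough Python model of the two functions. It is not PostgreSQL code, only a sketch of the semantics described above: both return a 1-based index, return 0 when the substring is absent, and differ only in argument order.

```python
def strpos(string: str, substring: str) -> int:
    """Rough model of strpos(string, substring): 1-based index, 0 if absent."""
    # str.find returns -1 when nothing is found, so adding 1 yields 0.
    return string.find(substring) + 1


def position(substring: str, string: str) -> int:
    """Rough model of position(substring in string): same result, arguments reversed."""
    return strpos(string, substring)


print(strpos("high", "ig"), position("ig", "high"), strpos("high", "x"))
```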

Force CL-Lex to read whole word

I'm using CL-Lex to implement a lexer (as input for CL-YACC) and my language has several keywords such as "let" and "in". However, while the lexer recognizes such keywords, it does too much. When it finds words such as "init", it returns the first token as IN, while it should return a "CONST" token for the "init" word.
This is a simple version of the lexer:
(define-string-lexer lexer
(...)
("in" (return (values :in $#)))
("[a-z]([a-z]|[A-Z]|\_)" (return (values :const $#))))
How do I force the lexer to fully read the whole word until some whitespace appears?
This is both a correction of Kaz's errors, and a vote of confidence for the OP.
In his original response, Kaz states the order of Unix lex precedence exactly backward. From the lex documentation:
Lex can handle ambiguous specifications. When more than one expression can
match the current input, Lex chooses as follows:
The longest match is preferred.
Among rules which matched the same number of characters, the rule given
first is preferred.
In addition, Kaz is wrong to criticize the OP's solution of using Perl-regex word-boundary matching. As it happens, you are allowed (free of tormenting guilt) to match words in any way that your lexer generator will support. CL-LEX uses Perl regexes, which use \b as a convenient syntax for the more cumbersome lex approximation:
%{
#include <stdio.h>
%}
WC [A-Za-z']
NW [^A-Za-z']
%start INW NIW
{WC} { BEGIN INW; REJECT; }
{NW} { BEGIN NIW; REJECT; }
<INW>a { printf("'a' in word\n"); }
<NIW>a { printf("'a' not in word\n"); }
All things being equal, finding a way to unambiguously match his words is probably better than the alternative.
Despite Kaz wanting to slap him, the OP has answered his own question correctly, coming up with a solution that takes advantage of the flexibility of his chosen lexer generator.
Your example lexer above has two rules, both of which match a sequence of exactly two characters. Moreover, they have common matches (the language matched by the second is a strict superset of the first).
In the classic Unix lex, if two rules both match the same length of input, precedence is given to the rule which occurs first in the specification. Otherwise, the longest possible match dominates.
(Although without RTFM, I can't say that that is what happens in CL-LEX, it does make a plausible hypothesis of what is happening in this case.)
It looks like you're missing a regex Kleene operator to match a longer token in the second rule.
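The classic lex rule quoted above (longest match first, then the earlier rule on a tie) can be sketched in a few lines of Python. This models Unix lex only, not CL-LEX, whose matching strategy is not confirmed here; the rule table and the function name next_token are made up for illustration, and the second rule includes the Kleene star that the question's version lacks.

```python
import re

# Rules in specification order: (token name, regex).
# The second rule uses a Kleene star so it can match identifiers of any length.
RULES = [
    ("IN", re.compile(r"in")),
    ("CONST", re.compile(r"[a-z][a-zA-Z_]*")),
]


def next_token(text: str, pos: int = 0):
    """Return (token, lexeme) using longest match; ties go to the earlier rule."""
    best = None
    for name, regex in RULES:
        m = regex.match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so an earlier rule keeps its place when lengths are equal.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


print(next_token("init"))
print(next_token("in"))
```

With this strategy "init" is tokenized as a whole identifier, while a bare "in" still resolves to the keyword because its rule is listed first.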

How can I write a regex that matches words that overlap themselves?

I'm trying to match a word forwards and backwards in a string but it isn't catching all matches. For example, searching for the word "AB" in the string "AAABAAABAAA", I create and use the regex /AB|BA/, but it only matches the two "AB" substrings, and ignores the "BA" substrings.
I'm using RegexKitLite on the iPhone, but I think this is a more general regex problem (I see the same behavior in online regex testers). Nevertheless, here's the code I'm using to enumerate the matches:
[@"AAABAAABAAA" enumerateStringsMatchedByRegex:@"AB|BA" usingBlock:
^(NSInteger captureCount,
NSString * const capturedStrings[captureCount],
const NSRange capturedRanges[captureCount],
volatile BOOL * const stop) {
NSLog(@"%@", capturedStrings[0]);
}];
Output:
AB
AB
I don't know which online tester you tried, but http://www.regextester.com/ (for example) will not consider the same character for multiple matches. In this case, since ABA matches AB, the B is not considered for the BA match. It's purely a guess that RegexKitLite is implemented similarly.
Even if you don't consider the mirrored variant, the original search string may overlap with itself. For example, if you search for ABCA|ACBA in ABCABCACBACBA, you'll get only two of the four matches, and scanning in either direction gives the same result.
It should be possible to find matches incrementally, but perhaps not with RegexKitLite.
I would say that's not possible in one pass. The regex matches the given pattern and "eats" the matched characters. So if you search for AB|BA in ABA, the first match found is AB, and the regex then continues searching from the third character, A.
So it is not possible to find overlapping matches with a plain AB|BA alternation, because each match consumes its characters.
I'm not sure how you'd accomplish exactly what I think you're asking for without reversing the string and testing twice.
However, I suppose it depends on what you're after exactly. If you're simply trying to determine if the pattern occurs in the string backwards or forwards, and not so much how it occurs, then you could do something like this:
ABA?|BAB?
The ? makes the last character optional on each side of the |. In the case of AAABAAABAAA, it'll find ABA twice. In the case of AB it'll find AB, and in the case of BA it'll find BA.
Here it is with test cases...
http://regexhero.net/tester/?id=a387ae0a-1707-4d9e-856b-ebe2176679bb
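A different technique, not taken from the answers above, is commonly used in engines that support lookahead: wrap the alternation in a zero-width lookahead with a capture group, so that no characters are consumed and every starting position is tried. The sketch below uses Python's re module; whether RegexKitLite's enumeration behaves the same way was not verified here.

```python
import re

# The lookahead (?=...) matches at a position without consuming characters;
# the inner group captures the text that would have matched there.
pattern = re.compile(r"(?=(AB|BA))")

text = "AAABAAABAAA"
overlapping = [(m.start(), m.group(1)) for m in pattern.finditer(text)]
print(overlapping)
```

Each tuple holds the start index and the captured word, so the "BA" occurrences that share a B with a preceding "AB" are reported as well.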