Constructing a simple regex, with only +, *, (), and | [closed] - pcre

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Let {a,b,c} be the alphabet. I have to construct a regex that matches any input over this alphabet, and if aa appears in the input, then cc must appear as well (somewhere in the input).
No look ahead, no look behind, no backreferences, just by using the quantifiers + and *, grouping via parentheses and alternatives via |.
The problem is I don't know how to approach this. For instance these inputs must match:
abba
bccb
bccaa
"" (empty input)
bccbaa
ccbaabb
The following must not match:
aa
abaaab
baaa
caac
How can I construct such a regex, using only these tools?
Update
I have thought about
((cc(b|c)*aa)|(aa(b|c)*cc))+|(ab|ba|ca|ca|bb|bc|cc)*
What do you think, does this fulfill the specification?

(b|c|a(b|c))*(a|)|(a|b|c)*(aa(a|b|c)*cc|cc(a|b|c)*aa)(a|b|c)*
Will match:
Any number of bs or cs (even zero), or an a if followed by a b or c, plus an optional unaccompanied a at the end. These rules together ensure that two as are always separated by a b or c, and will match the empty string and single chars as well.
A string that includes an aa somewhere, followed eventually by a cc
A string that includes a cc somewhere, followed eventually by an aa
(For reference, if you need each aa to match up with a cc, you're kinda screwed. That's no longer regular. A string like ccccaaaa would require counting how many ccs have been seen so far, and FSAs can't count.)
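A quick way to sanity-check this is to run the pattern against the example inputs from the question. The sketch below uses Python's re module with fullmatch standing in for anchors; the names ANSWER, ATTEMPT and accepts are only illustrative. It also tries the pattern from the question's update, which turns out to reject bccaa even though that input must match.

```python
import re

# Pattern from the answer above; fullmatch() plays the role of ^...$.
ANSWER = re.compile(
    r"(b|c|a(b|c))*(a|)|(a|b|c)*(aa(a|b|c)*cc|cc(a|b|c)*aa)(a|b|c)*"
)
# Pattern from the question's update, copied verbatim.
ATTEMPT = re.compile(r"((cc(b|c)*aa)|(aa(b|c)*cc))+|(ab|ba|ca|ca|bb|bc|cc)*")

MUST_MATCH = ["abba", "bccb", "bccaa", "", "bccbaa", "ccbaabb"]
MUST_NOT_MATCH = ["aa", "abaaab", "baaa", "caac"]


def accepts(pattern, text):
    """True if the whole input is matched by the pattern."""
    return pattern.fullmatch(text) is not None


for text in MUST_MATCH + MUST_NOT_MATCH:
    print(repr(text), accepts(ANSWER, text), accepts(ATTEMPT, text))
```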

It's more-or-less trivial for the given set of params, I suppose:
/^((b|c|ab|ac|a$)*|(a|b|c)*(cc(a|b|c)*aa|aa(a|b|c)*cc)(a|b|c)*)$/;
Explanation: obviously you need to match for three cases here:
the whole string doesn't contain 'aa' sequence. This condition is expressed with the following pattern:
/^(b|c|ab|ac|a$)*$/
...that is: "match any number of any combination of b and c symbols, ab, ac sequences or single a item at the end of the string".
the whole string does contain 'aa' sequence, followed (somewhere) by 'cc' sequence - and it's still composed of [abc] range only:
/^(a|b|c)* aa(a|b|c)* cc(a|b|c)* $/
(somehow without whitespace * are treated as italic text markers even within the <code> section; you obviously don't need it in the regex)
the whole string does contain 'aa' sequence, preceded (somewhere) by 'cc' sequence - and it's still composed of [abc] range only:
/^(a|b|c)* cc(a|b|c)* aa(a|b|c)* $/
Now you have three parts of the regex, and it's quite easy to combine it into the simple pattern, I suppose.
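As a rough check of that combination, the three parts can be joined with | and compared against the plain-language rule ("no aa, or cc somewhere") on every string over {a,b,c} up to some length. This is only a sketch in Python: the names NO_AA, AA_THEN_CC, CC_THEN_AA, spec and disagreements are made up for illustration, and an exhaustive test up to length 6 is evidence, not a proof.

```python
import re
from itertools import product

# The three cases from the explanation above, without the display-only spaces.
NO_AA = r"(b|c|ab|ac|a$)*"
AA_THEN_CC = r"(a|b|c)*aa(a|b|c)*cc(a|b|c)*"
CC_THEN_AA = r"(a|b|c)*cc(a|b|c)*aa(a|b|c)*"
COMBINED = re.compile(rf"^({NO_AA}|{AA_THEN_CC}|{CC_THEN_AA})$")


def spec(text):
    """The requirement stated directly: if aa appears, cc must appear too."""
    return "aa" not in text or "cc" in text


def disagreements(max_len=6):
    """Every string up to max_len where the regex and the spec disagree."""
    bad = []
    for length in range(max_len + 1):
        for chars in product("abc", repeat=length):
            text = "".join(chars)
            if bool(COMBINED.match(text)) != spec(text):
                bad.append(text)
    return bad


print(disagreements())
```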

Regex match invalid pattern ios swift 4 [duplicate]

How to rewrite the [a-zA-Z0-9!$* \t\r\n] pattern to match hyphen along with the existing characters ?
The hyphen is usually a normal character in regular expressions. Only if it’s in a character class and between two other characters does it take a special meaning.
Thus:
[-] matches a hyphen.
[abc-] matches a, b, c or a hyphen.
[-abc] matches a, b, c or a hyphen.
[ab-d] matches a, b, c or d (only here the hyphen denotes a character range).
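These rules are easy to try out. The sketch below uses Python's re module rather than the .NET engine the question is about; the two agree on these particular cases as far as I know, but treat that as an assumption. The helper name matching_chars is made up for the example.

```python
import re


def matching_chars(char_class, candidates="abcd-"):
    """Return the candidate characters that the character class matches."""
    return [ch for ch in candidates if re.fullmatch(char_class, ch)]


for char_class in ["[-]", "[abc-]", "[-abc]", "[ab-d]", r"[abc\-]"]:
    print(char_class, matching_chars(char_class))
```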
Escape the hyphen.
[a-zA-Z0-9!$* \t\r\n\-]
UPDATE:
Never mind this answer - you can add the hyphen to the group but you don't have to escape it. See Konrad Rudolph's answer instead which does a much better job of answering and explains why.
It’s less confusing to always use an escaped hyphen, so that it doesn't have to be positionally dependent. That’s a \- inside the bracketed character class.
But there’s something else to consider. Some of those enumerated characters should possibly be written differently. In some circumstances, they definitely should.
This comparison of regex flavors says that C♯ can use some of the simpler Unicode properties. If you’re dealing with Unicode, you should probably use the general category \p{L} for all possible letters, and maybe \p{Nd} for decimal numbers. Also, if you want to accommodate all that dash punctuation, not just HYPHEN-MINUS, you should use the \p{Pd} property. You might also want to write that sequence of whitespace characters simply as \s, assuming that’s not too general for you.
All together, that works out to a pattern of [\p{L}\p{Nd}\p{Pd}!$*] to match any one character from that set.
I’d likely use that anyway, even if I didn’t plan on dealing with the full Unicode set, because it’s a good habit to get into, and because these things often grow beyond their original parameters. Now when you lift it to use in other code, it will still work correctly. If you hard‐code all the characters, it won’t.
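To see what those categories actually cover, here is a small sketch in Python. Python's built-in re module does not support \p{...}, so this emulates the class [\p{L}\p{Nd}\p{Pd}!$*] with unicodedata.category instead of a regex; the function name in_class is made up for the example.

```python
import unicodedata


def in_class(ch):
    """Emulate [\\p{L}\\p{Nd}\\p{Pd}!$*] for a single character."""
    category = unicodedata.category(ch)
    return category.startswith("L") or category in ("Nd", "Pd") or ch in "!$*"


samples = {
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "ARABIC-INDIC DIGIT THREE": "\u0663",
    "HYPHEN-MINUS": "-",
    "EN DASH": "\u2013",
    "MINUS SIGN": "\u2212",  # category Sm, so not covered by \p{Pd}
    "SPACE": " ",
}
for name, ch in samples.items():
    print(name, unicodedata.category(ch), in_class(ch))
```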
[-a-z0-9]+, [a-z0-9-]+ and [a-z-0-9]+ are all equivalent. A hyphen between two ranges is treated as a literal symbol. The regex [a-z0-9-+()]+ also allows a hyphen.
Use \p{Pd} to match any type of hyphen. The '-' character is just one type of hyphen, which also happens to be a special character in regex.
Is this what you are after?
MatchCollection matches = Regex.Matches(mystring, "-");

Why would the Swift language designers choose to separate numbers by underscores instead of commas [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
Numeric literals can contain extra formatting to make them easier to read. Both integers and floats can be padded with extra zeroes and can contain underscores to help with readability. Neither type of formatting affects the underlying value of the literal:
let paddedDouble = 000123.456
let oneMillion = 1_000_000
let justOverOneMillion = 1_000_000.000_000_1
Well for one thing not all locales separate numbers with commas. Moreover, using comma separators could become confusing syntactically; consider the following function call:
foo(123,456)
Is it one literal 123,456, or two distinct arguments 123 and 456?
Reason #1: Commas would be ambiguous – it would be impossible to distinguish the following cases:
var prices: [Double] = [1,234.00, 99.99]
// evaluates to
var prices: [Double] = [1.00, 234.00, 99.99]
However underscores are not ambiguous in this case:
var prices: [Double] = [1_234.00, 99.99]
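The same ambiguity can be demonstrated in any language where the comma already separates list elements and arguments. The sketch below uses Python purely as a stand-in, since it happens to accept the same underscore syntax in numeric literals; it says nothing about how the Swift compiler is implemented, and foo is a placeholder function.

```python
def foo(*args):
    """Return the arguments exactly as the parser split them."""
    return args


# The comma is already taken as a separator, so this is three elements.
prices_with_comma = [1,234.00, 99.99]
# The underscore is ignored inside the literal, so this is two elements.
prices_with_underscore = [1_234.00, 99.99]

print(len(prices_with_comma), len(prices_with_underscore))
print(foo(123,456))
print(1_000_000 == 1000000)
```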
Reason #2: The underscore is generally used to indicate a discarded value which makes sense in this context (it is essentially a discarded digit).
Reason #3: Swift is inspired by Ruby, which does the same thing.
I am not a Swift language designer (and I'm not sure anyone who is posts on SO in any official capacity, so this might not be the best place for directing questions at them), but I have a couple of guesses:
the comma is an operator in Swift (works about the same as in C)
not everybody uses comma as the thousands separator
you can use underscore to break up any numeric literal in any way that helps it be readable to you, not just as a thousands separator

Fully correct Unicode visual string reversal

[Inspired largely by trying to explain the problems with Character Encoding independent character swap, but also these other questions neither of which contain a complete answer: How to reverse a Unicode string, How to get a reversed String (unicode safe)]
Doing a visual string reversal in Unicode is much harder than it looks. In any storage format other than UTF-32 you have to pay attention to codepoint boundaries rather than going byte-by-byte. But that's not good enough, because of combining glyphs; the spec has a concept of "grapheme cluster" that's closer to the basic unit you want to be reversing. But that's still not good enough; there are all sorts of special case characters, like bidi overrides and final forms, that will have to be fixed up.
This pseudo-algorithm handles all the easy cases I know about:
Segment the string into an alternating list of words and word-separators (some word-separators may be the empty string)
Reverse the order of this list.
For each string in the list:
Segment the string into grapheme clusters.
Reverse the order of the grapheme clusters.
Check the initial and final cluster in the reversed sequence; their base characters may need to be reassigned to the correct form (e.g. if U+05DB HEBREW LETTER KAF is now at the end of the sequence it needs to become U+05DA HEBREW LETTER FINAL KAF, and vice versa)
Join the sequence back into a string.
Recombine the list of reversed words to produce the final reversed string.
... But it doesn't handle bidi overrides and I'm sure there's stuff I don't know about, as well. Can anyone fill in the gaps?
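For concreteness, here is a rough Python sketch of the steps listed above, using only the standard library. It makes several simplifying assumptions that are mine, not part of the question: words are separated by whitespace, a "grapheme cluster" is approximated as a base character plus any following combining marks (category M*), and the only form fix-up is the Hebrew final-letter table. It does not handle bidi overrides or full UAX #29 segmentation. All function names are made up.

```python
import re
import unicodedata

# Hebrew letters with a distinct word-final form: regular -> final.
TO_FINAL = {
    "\u05DB": "\u05DA",  # KAF -> FINAL KAF
    "\u05DE": "\u05DD",  # MEM -> FINAL MEM
    "\u05E0": "\u05DF",  # NUN -> FINAL NUN
    "\u05E4": "\u05E3",  # PE -> FINAL PE
    "\u05E6": "\u05E5",  # TSADI -> FINAL TSADI
}
TO_REGULAR = {final: regular for regular, final in TO_FINAL.items()}


def clusters(text):
    """Approximate grapheme clusters: a base char plus trailing combining marks."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def fix_forms(cluster_list):
    """Reassign the base character of the first and last cluster if needed."""
    if len(cluster_list) < 2:
        return cluster_list
    first, last = cluster_list[0], cluster_list[-1]
    cluster_list[0] = TO_REGULAR.get(first[0], first[0]) + first[1:]
    cluster_list[-1] = TO_FINAL.get(last[0], last[0]) + last[1:]
    return cluster_list


def visual_reverse(text):
    # Alternating list of words and whitespace separators (some may be empty).
    parts = re.split(r"(\s+)", text)
    parts.reverse()
    return "".join("".join(fix_forms(clusters(p)[::-1])) for p in parts)


print(visual_reverse("abc de\u0301f"))
```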

_iVar vs. iVar_ for variable naming [closed]

Closed 11 years ago.
I used to use prefix underscore for instance variable naming, to distinguish from local variables.
I happened to see the "Google Objective-C Style Guide", and found that it suggests using trailing underscores, but without any detailed explanation of why.
I know it is a coding style issue, but I do wonder: are there any advantages to using trailing underscores?
Related: Question about #synthesize (see the blockquotes at the bottom of the answer)
The advantage is: _var is a convention (C99, Cocoa guidelines) for private, but it is so common (even on Apple templates) that Apple uses double underscore __var. Google solves it by using trailing underscore instead.
I updated the other answer with a couple more reasons...
Leading underscore is also discouraged in C++ (see
What are the rules about using an underscore in a C++ identifier?) and Core Data properties
(try adding a leading underscore in the model and you'll get "Name
must begin with a letter").
Trailing underscore seems the sensible choice, but if you like
something else, collisions are unlikely to happen, and if they do,
you'll get a warning from the compiler.
The prefix underscore is usually used in system / SDK libraries. So using a prefix underscore may accidentally override a variable in a superclass, and a bug like this is not so easy to find.
Take a look at any class provided by system, like NSView, and you will find that.
Apple uses single leading underscore for ivars so that their variable names won't collide with ours. When you name your ivars, use anything but a single leading underscore.
There is no advantage as such to using trailing underscores. We follow a coding style because it makes the code easier to read. The underscore, in this case, helps us differentiate between iVars and local variables. As far as I know, _iVar is more prominent than iVar_.
Almost all programming languages are based on English, where we write and read from left to right.
Using leading underscores makes it easier to find iVars, and to recognize that a variable is an iVar.

Japanese COBOL Code: rules for G literals and identifiers?

We are processing IBM Enterprise Japanese COBOL source code.
The rules that describe exactly what is allowed in G type literals,
and what are allowed for identifiers are unclear.
The IBM manual indicates that a G'....' literal
must have a SHIFT-OUT as the first character inside the quotes,
and a SHIFT-IN as the last character before the closing quote.
Our COBOL lexer "knows" this, but objects to G literals
found in real code. Conclusion: the IBM manual is wrong,
or we are misreading it. The customer won't let us see the code,
so it is pretty difficult to diagnose the problem.
EDIT: Revised/extended below text for clarity:
Does anyone know the exact rules of G literal formation,
and how they (don't) match what the IBM reference manuals say?
The ideal answer would be a regular expression for the G literal.
This is what we are using now (coded by another author, sigh):
#token non_numeric_literal_quote_g [STRING]
"<G><squote><ShiftOut> (
(<NotLineOrParagraphSeparatorNorShiftInNorShiftOut>|<squote><squote>|<ShiftOut>)
(<NotLineOrParagraphSeparator>|<squote><squote>)
| <ShiftIn> ( <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut>|
<ShiftIn>|<ShiftOut>)
| <squote><squote>
)* <ShiftIn><squote>"
where <name> is a macro that is another regular expression. Presumably they
are named well enough so you can guess what they contain.
Here is the IBM Enterprise COBOL Reference.
Chapter 3 "Character Strings", subheading "DBCS literals" page 32 is relevant reading.
I'm hoping that by providing the exact reference, an experienced IBMer can tell us how we misread it :-{ I'm particularly unclear on what the phrase "DBCS-characters" means
when it says "one or more characters in the range X'00...X'FF for either byte"
How can DBCS-characters be anything but pairs of 8-bit character codes?
The existing RE matches 3 types of pairs of characters if you examine it.
One answer below suggests that the <squote><squote> pairing is wrong.
OK, I might believe that, but that means the RE would only reject
literal strings containing single <squote>s. I don't believe that's
the problem we are having as we seem to trip over every instance of a G literal.
Similarly, COBOL identifiers can apparently be composed
with DBCS characters. What is allowed for an identifier, exactly?
Again a regular expression would be ideal.
EDIT2: I'm beginning to think the problem might not be the RE.
We are reading Shift-JIS encoded text. Our reader converts that
text to Unicode as it goes. But DBCS characters are really
not Shift-JIS; rather, they are binary-coded data. Likely
what is happening is that the DBCS data is getting translated
as if it were Shift-JIS, and that would muck up the ability
to recognize "two bytes" as a DBCS element. For instance,
if a DBCS character pair were :81 :1F, a ShiftJIS reader
would convert this pair into a single Unicode character,
and its two-byte nature is then lost. If you can't count pairs,
you can't find the end quote. If you can't find the end quote,
you can't recognize the literal. So the problem would appear
to be that we need to switch input-encoding modes in the middle
of the lexing process. Yuk.
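To illustrate the "count pairs to find the end quote" idea, here is a hedged Python sketch that scans a G literal on the raw bytes, before any Shift-JIS decoding. Everything specific in it is an assumption for illustration: the G and apostrophe are taken to be ASCII bytes, SO/SI default to 0x0E/0x0F, and a Shift-In byte sitting on a pair boundary is taken to end the DBCS run. That is one reading of the manual, not a confirmed rule, and scan_g_literal is a made-up name.

```python
def scan_g_literal(data, start=0, so=0x0E, si=0x0F):
    """Return the index just past a G'<SO>..pairs..<SI>' literal, else None."""
    prefix = b"G'" + bytes([so])
    if not data.startswith(prefix, start):
        return None
    i = start + len(prefix)
    while i < len(data):
        if data[i] == si:
            # Shift-In on a pair boundary: the closing quote must follow.
            if data[i + 1:i + 2] == b"'":
                return i + 2
            return None
        # Otherwise skip one two-byte DBCS element without inspecting it,
        # so a quote byte inside a pair is never mistaken for the end quote.
        i += 2
    return None


# The second byte of the first pair is 0x27 (an apostrophe) on purpose.
sample = b"G'\x0e\x81\x27\x82\xa0\x0f' MOVE"
print(scan_g_literal(sample))
```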
Try to add a single quote in your rule to see if it passes by making this change,
<squote><squote> => <squote>{1,2}
If I remember it correctly, one difference between N and G literals is that G allows single quote. Your regular expression doesn't allow that.
EDIT: I thought you got all other DBCS literals working and just having issues with G-string so I just pointed out the difference between N and G. Now I took a closer look at your RE. It has problems. In the Cobol I used, you can mix ASCII with Japanese, for example,
G"ABC<ヲァィ>" <> are Shift-out/shift-in
Your RE assumes DBCS only. I would loosen this restriction and try again.
I don't think it's possible to handle G literals entirely with a regular expression. There is no way to keep track of matching quotes and SO/SI with a finite state machine alone. Your RE is so complicated because it's trying to do the impossible. I would just simplify it and take care of mismatching tokens manually.
You could also face encoding issues. The code could be in EBCDIC (Katakana) or UTF-16, treating it as ASCII will not work. SO/SI sometimes are converted to 0x1E/0x1F on Windows.
I am just trying to help you shoot in the dark without seeing the actual code :)
Does <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut> also include single and double quotation marks, or just apostrophes? That would be a problem, as it would consume the literal closing character sequence >' ...
I would check the definition of all other macros to make sure. The only obvious problem that I can see is the <squote><squote> that you already seem to be aware of.