Regex for detecting two consecutive words when the first word starts with #

I want to know the regex that detects names starting with #. For example, in the sentence "Hi #Steve Rogers, how are you?", I want to extract #Steve Rogers. I tried Pattern.compile("#\\s*(\\w+)").matcher(text), but only "#Steve" gets detected. What else should I use?
Thanks

Try (#[\w\s]+)
It captures only word characters and whitespace after the #.
See example at https://regex101.com/r/4Pv9bu/1

If you don't want to match a lone # (one followed only by a space), and a second word may follow the first one:
(?<!\S)#\w+(?:\h+\w+)?
Explanation
(?<!\S) Assert a whitespace boundary to the left
# Match literally
\w+ Match 1+ word characters
(?:\h+\w+)? Optionally match 1+ horizontal whitespace chars and 1+ word chars
In Java
String regex = "(?<!\\S)#\\w+(?:\\h+\\w+)?";
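As a rough illustration of how that pattern behaves on the sentence from the question, here is a minimal sketch. The class name HashtagNameDemo and the helper extract are made up for this example and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagNameDemo {
    // Pattern from the answer: '#' at a whitespace boundary, one word,
    // then optionally horizontal whitespace and a second word.
    static final Pattern NAME = Pattern.compile("(?<!\\S)#\\w+(?:\\h+\\w+)?");

    // Hypothetical helper: collects every match of NAME in the text.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = NAME.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [#Steve Rogers]
        System.out.println(extract("Hi #Steve Rogers, how are you?"));
    }
}
```

Note that \h requires Java 8 or later, and that the optional second group will also pick up an ordinary word that happens to follow a one-word name.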

Related

How to capture a word boundary in Swift NSRegularExpression?

I would like to capture all words within a string which start with some prefix. For example all words which start with a t
if let regex = try? NSRegularExpression(pattern: #"t[^ ]+"#, options: NSRegularExpression.Options.caseInsensitive) {
    let input = "this is the best test"
    // NSRange counts UTF-16 code units, so use utf16.count rather than count
    let matches = regex.matches(in: input, options: [], range: NSRange(location: 0, length: input.utf16.count))
    for match in matches {
        print((input as NSString).substring(with: match.range))
    }
}
In the code above I am using a simple space as delimiter (#"t[^ ]+"#) and the output is as expected:
this
the
test
However, not only spaces but all word boundaries should be considered. So I replace the space with \b to match all boundaries (#"t[^\b]+"#). However, this does not work:
this is the
t test
It seems that this code does not look for word boundaries but simply for b... Why is this?
I thought using # before and after the regex would create a raw string and thus deliver the \ correctly to the regex system. So #"t[^\b]+"# should be the same as "t[^\\b]+" and be translated to t[^\b]+, shouldn't it?
Or is the word boundary operator \b not available in Swift regex?
EDIT:
According to the ICU Documentation \b matches a word boundary, thus [^\b] (anything but a word boundary) should not be the same as [^b] (anything but a b), should it?
However, it seems that \b cannot be used in sets, can it? But \B should do the same (anything but a word boundary).
So I tried using #"t\B+"# instead. However this does not find any match at all.
The question remains: How to match a word boundary in Swift NSRegularExpression?
The #"t[^\b]+"# string literal results in a t[^\b]+ regex and it simply matches t and then one or more chars other than a b char (the [^\b] is equal to [^b] in ICU regex flavor).
To match a t and then one or more word chars (that is, up to the next leftmost word boundary), you can use
pattern: #"t\w+"#
where \w+ will match one or more word chars.
A [...] is a character set/class. And a character class is meant to match characters. \b is a word boundary only outside a character class, because a word boundary is not a character, it is a zero-width assertion that matches a certain position in a string. All zero-width assertions lose their special, "zero-width" meaning in a character class. [.$] doesn't mean a . or end of string, it matches either a . or $ char. [.\z] does not match . or the very end of string, it matches . or z as \ is omitted since \z is not a valid escape sequence.
Also, t\B+ makes very little sense as \B, also a zero-width assertion, matches a location in the string that is not a word boundary position. Note that zero-width assertions do not consume text, i.e. no text is added to the overall match memory buffer, and the regex index remains where it was before trying the zero-width assertion pattern. By adding + after \B, you just tell the regex engine to match a location after t that is not a word boundary, so the regex engine matches t\B+ the same way as if it were a t\B, i.e. it only matches a t that is followed with a word char (letter, digit, connector punctuation).
\w matches (and consumes) word chars, so if you need to match (and really get as a result) any chars after t till the first word boundary, you just need to use this \w pattern, t\w* or t\w+ (if there must be at least one word char after t).
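The difference between consuming word characters and asserting a position can be sketched as follows. This uses Java's java.util.regex rather than NSRegularExpression, on the assumption that \w and \B behave the same way for this plain ASCII input; the class name WordPrefixDemo and the helper findAll are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPrefixDemo {
    // Hypothetical helper: returns all case-insensitive matches of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "this is the best test";
        // \w+ consumes the word characters after t: prints [this, the, test]
        System.out.println(findAll("t\\w+", input));
        // \B only asserts a position and consumes nothing: prints [t, t, t]
        System.out.println(findAll("t\\B", input));
    }
}
```

The final t of "best" and of "test" is skipped in both cases, because it is followed by a word boundary rather than by another word character.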

How to negate or subtract a regex from another regex result in just one line of regex

I am trying to write a regex to find all cases of force unwrapping in Swift. It searches the entire code base for words followed by an exclamation point. However, the regex I have so far also matches implicitly unwrapped variable declarations, which I am trying to exclude.
This is the regex that I already have.
(:\s)?\w+(?<!as)\)*!
And it works fine. It finds "variableName!", "(variableName)!", and "hello.hello!". The exclusion of force casting also works: it avoids cases like "hello as! UIView". But I am also trying to exclude other cases, such as "var hello: UIView!", which also contain an exclamation point. That's the problem I am having. I tried negative lookahead and negative lookbehind, and neither solved this kind of case.
This is the sample regex I am working on
(:\s)?\w+(?<!as)\)*!
And this is the result
testing.(**test)))!**
Details lists capture **groups!**
hello as! hello
**Hello!**
**testing!**
testing**.test!**
Hello != World
var noNetworkBanner**: StatusBarNotificationBanner!** <-- need to exclude
"var noNetworkBanner**: StatusBarNotificationBanner!**" <-- need to exclude
You may use
(?<!:\s)\b\w+(?<!\bas)\b\)*!
I added \b word boundaries to match whole words only, and changed the (:\s)? optional group to a negative lookbehind, (?<!:\s), that disallows a : + space before the word you need to match.
Details
(?<!:\s) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a : and a whitespace
\b - word boundary
\w+ - 1+ word chars
(?<!\bas) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a whole word as
\b - word boundary
\)* - 0 or more ) chars
! - a ! char.
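The pattern can be tried against the sample lines from the question with a short sketch like the one below. It runs the regex through Java's java.util.regex, which supports the same lookbehinds; the class name ForceUnwrapDemo and the helper findAll are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ForceUnwrapDemo {
    // Pattern from the answer: a whole word not preceded by ": ",
    // not equal to "as", then any number of ')' and a '!'.
    static final Pattern FORCE_UNWRAP =
            Pattern.compile("(?<!:\\s)\\b\\w+(?<!\\bas)\\b\\)*!");

    // Hypothetical helper: returns every match found in one line of source.
    static List<String> findAll(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FORCE_UNWRAP.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findAll("testing.(test)))!"));   // [test)))!]
        System.out.println(findAll("hello as! hello"));     // []
        System.out.println(findAll("Hello != World"));      // []
        System.out.println(findAll(
                "var noNetworkBanner: StatusBarNotificationBanner!")); // []
    }
}
```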

What does this Perl while loop mean?

while ($aaa =~ m/= "(\D.*?)"/g)
I figured that it matches while $aaa is like anything = "something" it returns something (without the quotation mark).
But what does this piece of code mean?
m/= "(\D.*?)"/
You seem to have figured out most of it. The =, the space, and the " all match those characters literally. The () capture a part of the matched string and make it available as $1. The part inside the parentheses matches a non-digit character (\D), followed by zero or more (*?) non-newline characters (.), as few as possible, up to the next ". A plain * would also match zero or more times, but it prefers to match more characters, so it would end up matching up to the last " in the string instead of the next one, as *? does.
All of this is documented in perlre.
The equals sign and quotation mark are taken literally, \D means any non-digit, and .*? means zero or more characters of any kind (except newline), matched as few times as possible.
From left to right:
m/= "(\D.*?)"/g
match operator,
start regex:
equals sign, whitespace, double quotation mark,
start group:
one non-digit character, zero or more characters,
end group,
double quotation mark,
end regex
match globally
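The effect of the /g loop and of the lazy quantifier can be sketched in Java, where calling find() repeatedly plays the role of Perl's while (... =~ m/.../g). The input string, the class name GlobalMatchDemo, and the helper captures are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GlobalMatchDemo {
    // Hypothetical helper: returns capture group 1 of every successive match,
    // similar to reading $1 on each pass of the Perl while loop.
    static List<String> captures(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "name = \"alice\" id = \"42\" city = \"Oslo\"";
        // Lazy .*? stops at the next quote; "42" is skipped because \D
        // rejects a leading digit. Prints [alice, Oslo]
        System.out.println(captures("= \"(\\D.*?)\"", input));
        // Greedy .* runs to the last quote, giving one long capture.
        System.out.println(captures("= \"(\\D.*)\"", input));
    }
}
```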

searching a word with a particular character in it in perl

I am trying to find words that start with a capital letter and end with a zero, in Perl.
For example
ABC0
XYZ0
EIU0
QW0
What I have tried -
$abc =~ /^[A-Z].+0$/
But I am not getting proper output for this. Can anybody help me please?
The ^ anchors at the start of a string, the $ at the end. .+ matches as many non-newline characters as possible. Therefore
"ABC0 XYZ0 EIU0 QW0" =~ /^[A-Z].+0$/
matches the whole string.
The \b assertion matches at word edges: everywhere a word character and a non-word character are adjacent. The \w character class holds only word characters, and the \S character class holds all non-space characters. Either of these is better than the dot (.).
So you may want to use /\b[A-Z]\w*0\b/.
This might work :
$abc =~ /\b[A-Z].*0\b/
\b matches word boundaries.
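To compare the anchored pattern from the question with a word-boundary pattern built on \w, here is a small sketch in Java. The class name TrailingZeroDemo, the helper findAll, and the second input string are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingZeroDemo {
    // Hypothetical helper: returns all matches of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String words = "ABC0 XYZ0 EIU0 QW0";
        // Anchored with a greedy dot: one match covering the whole string.
        System.out.println(findAll("^[A-Z].+0$", words));
        // Word boundaries plus \w: prints [ABC0, XYZ0, EIU0, QW0]
        System.out.println(findAll("\\b[A-Z]\\w*0\\b", words));
        // Words not ending in 0 or not starting with a capital are skipped:
        // prints [ABC0, QW0]
        System.out.println(findAll("\\b[A-Z]\\w*0\\b", "ABC0 XYZ1 QW0 x0"));
    }
}
```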

How to get a perfect match for a regexp pattern in Perl?

I've to match a regular-expression, stored in a variable:
#!/bin/env perl
use warnings;
use strict;
my $expr = qr/\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)/sx;
my $str = "abcd[3] xyzg[4:0]";   # "my" is required under "use strict"
if ($str =~ m/$expr/) {
    print "\n%%%%%%%%% $`-----$&-----$'\n";
}
else {
    print "\n********* NOT MATCHED\n";
}
But I'm getting this output, with $& in the middle:
%%%%%%%%% -----abcd[3] xyzg-----[4:0]
I expected it not to enter the if clause at all.
What is intended is:
if $str = "abcd xyzg" => %%%%%%%%% -----abcd xyzg----- (CORRECT)
if $str = "abcd[2] xyzg" => %%%%%%%%% -----abcd[2] xyzg----- (CORRECT)
if $str = "abcd[2] xyzg[3]" => %%%%%%%%% -----abcd[2] xyzg[3]----- (CORRECT)
if $str = "abcd[2:0] xyzg[3]" => ********* NOT MATCHED (CORRECT)
if $str = "abcd[2:0] xyzg[3:0]" => ********* NOT MATCHED (CORRECT)
if $str = "abcd[2] xyzg[3:0]" => ********* NOT MATCHED (CORRECT/INTENDED)
but output is %%%%%%%%% -----abcd[2] xyzg-----[3:0] (WRONG)
Or, better to say, this is not what I intended.
In this case, I expect it to go to the else block.
I also don't understand why $& takes only a portion of the string (abcd[2] xyzg), leaving [3:0] in $'.
It should match the full string, not a part of it as above. If it can't, it shouldn't enter the if clause.
Can anyone please help me change my $expr pattern so that I get what is intended?
By default, Perl regexes only look for a matching substring of the given string. In order to force comparison against the entire string, you need to indicate that the regex begins at the beginning of the string and ends at the end by using ^ and $:
my $expr = qr/^\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)$/;
(Also, there's no reason to have the /x modifier, as your regex doesn't include any literal whitespace or # characters, and there's no reason for the /s modifier, as you're not using the dot (.).)
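The difference the anchors make can be checked with a small sketch. It is written in Java rather than Perl, on the assumption that the two engines treat this particular pattern the same way; the class name AnchoredMatchDemo, the constant CORE, and the helper firstMatch are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredMatchDemo {
    // The pattern from the question, without anchors.
    static final String CORE = "\\s*(\\w+(\\[\\d+\\])?)\\s+(\\w+(\\[\\d+\\])?)";

    // Hypothetical helper: returns the matched text (Perl's $&), or null.
    static String firstMatch(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String anchored = "^" + CORE + "$";
        // Unanchored: a substring is enough. Prints abcd[3] xyzg
        System.out.println(firstMatch(CORE, "abcd[3] xyzg[4:0]"));
        // Anchored: the whole string must fit. Prints null
        System.out.println(firstMatch(anchored, "abcd[3] xyzg[4:0]"));
        // Anchored, valid input. Prints abcd[2] xyzg[3]
        System.out.println(firstMatch(anchored, "abcd[2] xyzg[3]"));
    }
}
```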
EDIT: If you don't want the regex to match against the entire string, but you want it to reject anything in which the matching portion is followed by something like "[0:0]", the simplest way would be to use lookahead:
my $expr = qr/^\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\]|(?=[^[\w])|$ ))/x;
This will match anything that takes the following form:
beginning of the string (which your example in the comments seems to imply you want)
zero or more whitespace characters
one or more word characters
optional: [, one or more digits, ]
one or more whitespace characters
one or more word characters
one of the following, in descending order of preference:
[, one or more digits, ]
an empty string followed by (but not including!) a character that is neither [ nor a word character (The exclusion of word characters is to keep the regex engine from succeeding on "a[0] bc[1:2]" by only matching "a[0] b".)
end of string (A space is needed after the $ to keep it from merging with the following ) to form the name of a special variable, and this entails the reintroduction of the /x option.)
Do you have any more unstated requirements that need to be satisfied?
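For comparison, the lookahead variant can be sketched in Java as well. This is a translation, not the answer's exact Perl: the /x flag and the space after $ are dropped because Java does not need them, and the [ inside the character class is escaped because Java would otherwise read it as a nested class. The class name LookaheadDemo and the helper firstMatch are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Second word must be followed by [digits], by a character that is
    // neither '[' nor a word character (lookahead only), or by end of string.
    static final Pattern EXPR = Pattern.compile(
            "^\\s*(\\w+(\\[\\d+\\])?)\\s+(\\w+(\\[\\d+\\]|(?=[^\\[\\w])|$))");

    // Hypothetical helper: returns the matched text, or null if none.
    static String firstMatch(String s) {
        Matcher m = EXPR.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("abcd[2] xyzg[3:0]"));      // null
        System.out.println(firstMatch("a[0] bc[1:2]"));           // null
        System.out.println(firstMatch("abcd[2] xyzg[3] rest"));   // abcd[2] xyzg[3]
        System.out.println(firstMatch("abcd xyzg, rest"));        // abcd xyzg
    }
}
```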
The short answer is your regexp is wrong.
We can't fix it for you without you explaining what you need exactly, and the community is not going to write a regexp exactly for your purpose because that's just too localized a question that only helps you this one time.
You need to ask something more general about regexps that we can explain to you, that will help you fix your regexp, and help others fix theirs.
Here's my general answer when you're having trouble testing your regexp: use a regexp tool, such as RegexBuddy.
So I'm going to give a specific answer about what you're overlooking here:
Let's make this example smaller:
Your pattern is a(bc+d)?. It will match abcd, abccd, etc. It will not match bcd or bzd. In the case of abzd, it will match, but only the a, because the whole group bc+d is optional. Similarly, it will match abcbcd as just a, dropping the whole optional group that couldn't be matched (it fails at the second b).
Regexps match as much of the string as they can, and report a successful match once they have satisfied the entire pattern. If you make something optional, they will leave it out when they have to, including it only when it's present and matches.
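The small pattern above is easy to try out. The following sketch uses Java's java.util.regex, assuming it backtracks the same way as Perl for this pattern; the class name OptionalGroupDemo and the helper firstMatch are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupDemo {
    static final Pattern P = Pattern.compile("a(bc+d)?");

    // Hypothetical helper: returns the matched text, or null if none.
    static String firstMatch(String s) {
        Matcher m = P.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("abcd"));   // abcd
        System.out.println(firstMatch("abccd"));  // abccd
        System.out.println(firstMatch("abzd"));   // a   (optional group dropped)
        System.out.println(firstMatch("abcbcd")); // a   (group fails at second b)
        System.out.println(firstMatch("bcd"));    // null (no a at all)
    }
}
```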
Here's what you tried:
qr/\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)/sx
First, s and x aren't needed modifiers here.
Second, this regex can match:
Any or no whitespace followed by
a word of at least one word character (letter, digit, or underscore) followed by
optionally a grouped square bracketed number with at least one digit (eg [0] or [9999]) followed by
at least one white space followed by
a word of at least one word character followed by
optionally a square bracketed number with at least one digit.
Clearly when you ask it to match abcd[0] xyzg[0:4] the colon ends the \d+ pattern but doesn't satisfy the \] so it backtracks the whole group, and then happily finds the group was optional. So by not matching the last optional group, your pattern has matched successfully.