One regex to capture words separated by one space character, in combination with runs of more than one space character

I would like a single regex that captures words separated by one space character and, alongside each such group of words, the run of more than one space character that follows it.
I would like to have the following example covered:
This line with  sometimes more than  1  space needs to be captured in 3 matches with 2 groups.
I expect the following groups:
([This line with][  ])([sometimes more than][  ])([1][  ])space needs to be captured in 3 matches with 2 groups.
Capturing either one on its own is no problem.
For example,
to capture more than one space character:
([\s]{2,})
and to capture words separated by only one space character (see https://stackoverflow.com/a/60288115/3710053):
\S+(?:\s\S+)*

You might use an alternation to match either a word followed by a repeating pattern of a single space and a word, or 2 or more spaces:
\S+(?: \S+)*| {2,}
Explanation
\S+ Match 1+ non whitespace chars
(?: \S+)* Repeat 0+ times matching a space and 1+ non whitespace chars
| Or
{2,} (preceded by a literal space) Match that space 2 or more times
If you want to match whitespace chars instead, you could replace the space with \s, but note that \s also matches newlines.
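As a rough illustration, the alternation can be tried with Python's re module. The sample sentence assumes exactly two spaces at each gap (the real counts are not known), and the variable names are made up for this sketch.

```python
import re

# Assumed sample: two spaces after "with", "than" and "1".
text = "This line with  sometimes more than  1  space needs to be captured"

# Either words joined by single spaces, or a run of 2+ spaces.
pattern = re.compile(r"\S+(?: \S+)*| {2,}")

# findall returns the matched pieces in order, word groups and gaps alike.
tokens = pattern.findall(text)
print(tokens)
```

Because both alternatives live in one pattern, word groups and space runs come back interleaved in document order.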
Edit
For the updated question, you could use 2 capturing groups:
(\S+(?: \S+)*)( {2,})
Explanation
( Capture group 1
\S+ Match 1+ non whitespace chars
(?: \S+)* Repeat 0+ times matching a space and 1+ non whitespace chars
) Close group 1
( Capture group 2
{2,} (preceded by a literal space) Match 2 or more spaces
) Close group 2
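A small Python sketch of the two-group pattern, assuming two spaces at each gap in the sample sentence (the exact counts are an assumption):

```python
import re

# Assumed sample: two spaces after "with", "than" and "1".
text = ("This line with  sometimes more than  1  space needs "
        "to be captured in 3 matches with 2 groups.")

# Group 1: words joined by single spaces. Group 2: the run of 2+ spaces after them.
pattern = re.compile(r"(\S+(?: \S+)*)( {2,})")

# With two groups, findall returns one (group1, group2) tuple per match.
pairs = pattern.findall(text)
for words, gap in pairs:
    print(repr(words), repr(gap))
```

The trailing "space needs to be captured ..." part is not matched, since no run of 2 or more spaces follows it.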

Related

RegEx: matching 20 letters with 1 optional dash

I'm struggling to write a regex that matches the following requirements:
up to 20 characters (English letters and numbers)
may have one optional dash ( - ) but can't start or end with it
I could come up with this pattern: ^[a-zA-Z0-9-]{0,20}$ but this one allows multiple dashes, and the dash can be entered at the beginning or end of the input string.
You can use
^(?=.{0,20}$)(?:[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)?)?$
See the regex demo.
Details:
^ - start of string
(?=.{0,20}$) - zero to twenty chars allowed in the string
(?: - a non-capturing group start:
[a-zA-Z0-9]+ - one or more alphanumeric chars
(?:-[a-zA-Z0-9]+)? - an optional sequence of a - and one or more alphanumeric chars
)? - end of the non-capturing group, repeat one or zero times (i.e. the pattern match is optional)
$ - end of string.
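A minimal Python check of this pattern; the helper name is_valid is made up for the example.

```python
import re

# Lookahead caps the total length at 20; the body allows at most one inner dash.
pattern = re.compile(r"^(?=.{0,20}$)(?:[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)?)?$")

def is_valid(s: str) -> bool:
    """Return True if s satisfies the pattern above."""
    return pattern.match(s) is not None

for candidate in ["abc-123", "", "-abc", "abc-", "a-b-c", "a" * 21]:
    print(repr(candidate), is_valid(candidate))
```

Note that the empty string is accepted, because the whole body is optional. In Python, $ also matches just before a trailing newline; \Z can be used instead if that matters.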
Have a try with:
^(?:[^\W_]{1,20}|(?!.{22})[^\W_]+-[^\W_]+)$
See an online demo
^ - Start-line anchor;
(?: - Open non-capture group;
[^\W_]{1,20} - Match between 1-20 alphanumeric characters;
| - Or;
(?!.{22})[^\W_]+-[^\W_]+ - Negative lookahead to assert the position is not followed by 22 characters, then match 1+ alphanumeric characters on each side of a single hyphen;
)$ - Close non-capture group before matching end-line anchor.
Note that the above assumes up to 20 alphanumeric characters, plus one optional hyphen that would take the maximum length to 21 characters.
Another idea by use of a lookahead and word boundary at the end.
^(?!.{21})[A-Za-z\d]+-?[A-Za-z\d]*\b$
^(?!.{21}) the lookahead checks at start for max 20 characters
[A-Za-z\d]+ starting with one or more alphanumeric characters
-?[A-Za-z\d]* optional hyphen followed by any number of alphanumeric characters
\b$ the word boundary forces to end with an alphanumeric char
See this demo at regex101
FYI: If \pL (letter) is supported, it can be used to shorten the pattern (note that it matches any Unicode letter, not only English ones): ^(?!.{21})[\pL\d]+-?[\pL\d]*\b$
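The three suggested patterns agree on ordinary input but differ on edge cases. The following Python sketch compares them side by side; the labels and the verdicts helper are invented for the example, and re.ASCII is my addition so that \d, \W and \b stay limited to ASCII. The \pL variant is left out because Python's re module does not support \p classes.

```python
import re

patterns = {
    "lookahead":   r"^(?=.{0,20}$)(?:[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)?)?$",
    "alternation": r"^(?:[^\W_]{1,20}|(?!.{22})[^\W_]+-[^\W_]+)$",
    "boundary":    r"^(?!.{21})[A-Za-z\d]+-?[A-Za-z\d]*\b$",
}

def verdicts(s: str) -> dict:
    """Map each pattern label to whether it accepts s."""
    return {name: re.search(p, s, re.ASCII) is not None
            for name, p in patterns.items()}

# 10 letters, a hyphen, 10 letters: 21 characters in total.
with_hyphen_21 = "a" * 10 + "-" + "a" * 10

for sample in ["abc-123", "abc-", "", with_hyphen_21]:
    print(repr(sample), verdicts(sample))
```

Only the first pattern accepts the empty string, and only the second accepts the 21-character input, since it counts the hyphen separately from the 20 alphanumeric characters.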

How to replace a character using sed with different lengths in preceding string

I have a file in which I want to replace the "_" string with "-" in cases where it makes up a part of my gene name. Examples of the gene names and my intended output are:
aa1c1_123 -> aa1c1-123
aa1c2_456 -> aa1c2-456
aa1c10_789 -> aa1c10-789
In essence, the first four characters are fixed, followed by 1 or 2 characters depending on the chromosome, an underscore, and then the remainder of the gene ID, which can vary in length and content. Importantly, this gene information column also contains other strings with underscores (e.g. "gene_id", "transcript_id", "five_prime_utr"), so using sed -i.bak 's/_/-/g' file.gtf
can't be done.
Perhaps not the most elegant way, but this should work:
sed -i.bak 's/\([0-9a-z]\{4\}[0-9][0-9]\?\)_/\1-/g' file.gtf
That is, capture a group (referenced by \1 in the substitution) of 4 characters consisting of lowercase letters and digits, followed by one digit and optionally a second digit, which is followed by an underscore; if found, replace it with the group's content and a dash. This should leave alone your other occurrences, which consist of only letters and an underscore. Note that \? is a GNU extension to basic regular expressions; \{0,1\} is the portable spelling.
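To see what the substitution does without touching a file, the same pattern can be tried in Python with re.sub. The attribute line below is a made-up example, not taken from a real GTF file.

```python
import re

# Hypothetical GTF-like attribute text, invented for illustration.
line = 'gene_id "aa1c10_789"; transcript_id "aa1c2_456.1"; five_prime_utr'

# 4 lowercase letters/digits, then 1 or 2 digits, captured; the underscore after them becomes "-".
fixed = re.sub(r"([0-9a-z]{4}[0-9][0-9]?)_", r"\1-", line)
print(fixed)
```

The underscores in gene_id, transcript_id and five_prime_utr are untouched because none of them is directly preceded by a digit.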

Regex expression for detecting 2 consecutive words when first word starts with #

I wanted to know the regex that detects names starting with #. For example, in the sentence "Hi #Steve Rogers, how are you?", I want to extract #Steve Rogers using regex. I tried using Pattern.compile("#\\s*(\\w+)").matcher(text), but only "#Steve" gets detected. What else should I use?
Thanks
Try (#[\w\s]+)
It will only capture word characters and whitespace after the #
See example at https://regex101.com/r/4Pv9bu/1
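A quick Python check of this pattern. The second sentence is made up to show a caveat: since [\w\s]+ is greedy, it keeps going through every following word until it meets punctuation or the end of the text.

```python
import re

pattern = re.compile(r"(#[\w\s]+)")

# The comma stops the match right after the surname.
name = pattern.search("Hi #Steve Rogers, how are you?").group(1)

# Without punctuation, the match runs on to the end of the text.
greedy = pattern.search("Ask #Steve about the plan").group(1)

print(name)
print(greedy)
```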
If you don't want to match a # sign followed only by a space (like "# "), and the name can have an optional second word after the first:
(?<!\S)#\w+(?:\h+\w+)?
Explanation
(?<!\S) Assert a whitespace boundary to the left
# Match literally
\w+ Match 1+ word characters
(?:\h+\w+)? Optionally match 1+ horizontal whitespace chars and 1+ word chars
In Java
String regex = "(?<!\\S)#\\w+(?:\\h+\\w+)?";
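For experimenting outside Java, here is a Python sketch of the same idea. Python's re module has no \h, so [ \t] stands in for horizontal whitespace; that substitution is an assumption of this example, not part of the answer. The helper name find_mention is made up.

```python
import re

# [ \t] approximates Java's \h (horizontal whitespace), which Python's re lacks.
pattern = re.compile(r"(?<!\S)#\w+(?:[ \t]+\w+)?")

def find_mention(text: str):
    """Return the first #mention (one or two words), or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(find_mention("Hi #Steve Rogers, how are you?"))
print(find_mention("Ask #Steve about the plan"))
print(find_mention("# alone"))
print(find_mention("mail a#b"))
```

The second call shows that the optional group takes whichever word follows, so a single-word name is still returned together with the next word.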

ios regex mention until special character + avoid space

I have a string like the one below
#a b#mail.com has joined
I want to show #a b c as a mention, but the space is hard to detect, so I wrap the string like below before using the regex
#``a b#mail.com`` has joined
So I want to detect text that starts with
#``
and ends with
``
Can anyone help me with the regex? I have tried a lot, but it still isn't working. Here is the regex I'm testing:
^#``.*``{3,}
You might use a capture group and capture any char except an # using a negated character class.
^#``([^#\r\n]+).*?``
The pattern matches
^#`` Start of string, match # and 2 backticks
( Capture group 1
[^#\r\n]+ Match 1+ occurrences of any char except # or a newline
) Close group 1
.*?`` Match as few chars as possible until 2 backticks
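The question is about iOS, but the pattern itself can be checked quickly in Python. This sketch uses the wrapped string from the question; the variable names are made up.

```python
import re

text = "#``a b#mail.com`` has joined"

# Group 1 stops at the second "#"; the lazy .*? then runs up to the closing backticks.
pattern = re.compile(r"^#``([^#\r\n]+).*?``")

m = pattern.search(text)
mention = m.group(1)   # the part before the second "#"
wrapped = m.group(0)   # everything from "#" through the closing backticks

print(mention)
print(wrapped)
```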

create a generic regex for a string in perl

I have tried to create regex for the below:
STRING sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv|FATAL|red|1h||fw_alert
REGEX----> /^[^#]\w+\|[^\|]+\|\w+\|\w+\|\w*\|\w*\|([^\|]+|)\|\w*$/
I am unable to figure out the mistake here.
I created the above by adapting another regex, which works fine and is given below:
/^[^#]\w+\|[^\|]+\|([^\|]+|)\|[rm]\|(in|out|old|new|arch|missing)\|\w+\|([0-9-,]+|)\|\w*\|\w*$/
sou_u02_mlpv0747_CCF_ASB001_LU_ODR|/opt/app/medvhs/mvs/applications/cm_vm5/components/CCF_ASB001_LU/SPOOL/ODR||r|out|30m|0400-1959|30m|gprs_in_stag
Can someone please help me? Any leads would be highly appreciated.
Let's start from a brief look at your source text (the first that you included).
It is composed of "sections" separated with | char.
This char (|) must be matched by \|. Remember about the preceding
backslash, otherwise, a "bare" | would mean the alternative separator
(you used it in one place).
And now take a look at each section (between |):
Some of them contain only a sequence of word chars (and can be matched
by \w+).
Other sections, however, contain also other chars, e.g. slashes,
backslash, braces and dots, so each such section is actually a sequence
of chars other than "|" and must be matched by [^|]+ (here,
between [ and ], the vertical bar may be unescaped).
Now let's write each section and its "type":
1. sou_u02_..._FW_ALERT - word chars.
2. /opt/app/.../UnifiedLogging - other chars (because of slashes).
3. UL_\d{8}_..._Primary.log.csv - other chars (because of \d{8} and dots).
4. FATAL|red|1h - 3 sections composed of word chars.
5. An empty section, between 2 consecutive | chars.
6. fw_alert - word chars.
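The classification above can be reproduced mechanically. The question is about Perl, but this rough Python sketch splits the sample line on | and labels each section; the kind helper and its labels are made up for the example.

```python
import re

# Sample line from the question (raw string, so \d stays a literal backslash + d).
line = (r"sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT"
        r"|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging"
        r"|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv"
        r"|FATAL|red|1h||fw_alert")

def kind(field: str) -> str:
    """Label a section as 'empty', 'word' (only \\w chars) or 'other'."""
    if field == "":
        return "empty"
    return "word" if re.fullmatch(r"\w+", field) else "other"

kinds = [kind(f) for f in line.split("|")]
print(kinds)
```

Any section labelled "other" needs [^|]+ rather than \w+ in the final regex.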
And now, how to match these groups, and the separating |:
Point 1: \w+\| - word chars and (escaped) vertical bar.
Point 2 and 3 (together): (?:[^|]+\|){2} - a non-capturing
group - (?:...), containing a sequence of "other" chars - [^|]+
and a vertical bar - \|, occurring 2 times {2}.
Point 4 (three "word char" groups): (?:\w+\|){3} - similar to
the previous point.
Point 5: Just as in your solution - ([^|]+|)\|, a capturing group -
(...), with 2 alternatives ...|.... The first alternative is
[^|]+ (a sequence of "other" chars), and the second alternative
is empty. After the capturing group there is \| to match the vertical
bar.
Point 6: \w+ - word chars. This time no \|, as this is the last
section.
The regex assembled so far must be:
prepended with a ^ (start of string) and
appended with a $ (end of string).
So the whole regex, matching your source text can be:
^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]+|)\|\w+$
Actually, the only capturing group can be written another way,
as ([^|]*) - without alternatives, but with * as the
repetition count, allowing also empty content.
Your choice, which variant to apply.
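As a sanity check, the assembled regex and the ([^|]*) variant can both be run against the sample line. This is a Python sketch rather than Perl; the pattern syntax used here is the same in both.

```python
import re

# Sample line from the question (raw string, so \d stays a literal backslash + d).
line = (r"sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT"
        r"|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging"
        r"|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv"
        r"|FATAL|red|1h||fw_alert")

# Variant with an empty alternative in the capturing group.
with_alternation = re.compile(r"^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]+|)\|\w+$")
# Variant with * instead of the alternation.
with_star = re.compile(r"^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]*)\|\w+$")

m1 = with_alternation.match(line)
m2 = with_star.match(line)
print(m1 is not None, repr(m1.group(1)))
print(m2 is not None, repr(m2.group(1)))
```

Both variants match, and the capturing group holds the empty seventh section.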
The third field
UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv
contains a backslash (\), braces ({ }) and dots (.). None of these can be matched by \w.
Note also that there is no need to escape a pipe | inside a character class: [^|]+ is fine.
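To illustrate the point, the sketch below runs the regex from the question against the sample line, then a version in which only the third section's \w+ is swapped for [^|]+. That minimal change is my own illustration of the diagnosis, shown in Python rather than Perl.

```python
import re

# Sample line from the question (raw string, so \d stays a literal backslash + d).
line = (r"sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT"
        r"|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging"
        r"|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv"
        r"|FATAL|red|1h||fw_alert")

# Regex from the question: the third section is \w+, which cannot cross "\", "{", "}" or ".".
original = re.compile(r"^[^#]\w+\|[^\|]+\|\w+\|\w+\|\w*\|\w*\|([^\|]+|)\|\w*$")

# Same regex with the third section changed to [^|]+ (pipes left unescaped inside classes).
adjusted = re.compile(r"^[^#]\w+\|[^|]+\|[^|]+\|\w+\|\w*\|\w*\|([^|]+|)\|\w*$")

original_matches = original.match(line) is not None
adjusted_matches = adjusted.match(line) is not None
print(original_matches, adjusted_matches)
```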