How to keep the first 4 words of a text? - postgresql

I want to shorten the field to its first 4 words:
unaccent('unaccent', lower(regexp_replace(titre, '[^\w]+','_','g')))

If you don't need the remaining words to be separated by exactly the same amount of whitespace as in the input, you could turn the string into an array, take the first four elements, and convert that back into a string in which the words are delimited by a single space:
array_to_string((regexp_split_to_array(titre, '[^\w]+'))[1:4], ' ')
regexp_split_to_array(titre, '[^\w]+') will turn e.g. the string one two three four five six into an array with six elements. [1:4] then extracts the first four elements (or all of them if there are fewer than four), and array_to_string converts this back into a string. So one two three four five six will be converted to one two three four.
However, the same input with several spaces between the words will also be converted to one two three four, with single spaces.
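The same split, slice and join idea can be sketched outside the database. The snippet below uses Python's re module purely as an illustration (it is not PostgreSQL, and the function name first_words is made up for this example). It splits on runs of non-word characters (\W+), so irregular spacing collapses to single spaces.

```python
import re

def first_words(text: str, n: int = 4) -> str:
    """Keep the first n words of text, joined by single spaces."""
    # Split on runs of non-word characters and drop empty pieces that
    # appear when the text starts or ends with a separator.
    parts = [p for p in re.split(r"\W+", text) if p]
    return " ".join(parts[:n])

print(first_words("one two three four five six"))
print(first_words("one   two  three four   five six"))
```

Both calls print the same four words, because the amount of whitespace in the input is discarded by the split.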

You can use this in Google Sheets ({5} keeps the first six words; use {3} to keep the first four):
=regexextract(A2,"\w+(?:\W+\w+){5}")
EXPLANATION
--------------------------------------------------------------------------------
\w+      word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
(?:      group, but do not capture (5 times):
--------------------------------------------------------------------------------
\W+      non-word characters (all but a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
\w+      word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
){5}     end of grouping
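To see what the pattern keeps, here is a small check in Python, whose re module accepts this pattern unchanged (the spreadsheet function itself is not involved, and the helper name leading_words is hypothetical). The quantifier is the word count minus one, so {5} yields six words and {3} yields four.

```python
import re
from typing import Optional

def leading_words(text: str, count: int) -> Optional[str]:
    """Return the first `count` words with their original separators, or None."""
    # One word, then (count - 1) repetitions of "separator + word".
    pattern = r"\w+(?:\W+\w+){%d}" % (count - 1)
    match = re.search(pattern, text)
    return match.group(0) if match else None

print(leading_words("one two three four five six seven", 6))
print(leading_words("one two three four five six seven", 4))
```

Unlike the array slice approach, this pattern does not match at all when the text has fewer words than requested, so the helper returns None in that case.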

Related

RegEx: matching 20 letters with 1 optional dash

I'm struggling to write a regex that matches the following requirements:
up to 20 characters (English letters and numbers)
may have one optional dash ( - ) but can't start or end with it
I could come up with this pattern: ^[a-zA-Z0-9-]{0,20}$, but it allows multiple dashes, and the dash can be entered at the beginning or end of the input string.
You can use
^(?=.{0,20}$)(?:[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)?)?$
See the regex demo.
Details:
^ - start of string
(?=.{0,20}$) - zero to twenty chars allowed in the string
(?: - a non-capturing group start:
[a-zA-Z0-9]+ - one or more alphanumeric chars
(?:-[a-zA-Z0-9]+)? - an optional sequence of a - and one or more alphanumeric chars
)? - end of the non-capturing group, repeat one or zero times (i.e. the pattern match is optional)
$ - end of string.
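A quick way to convince yourself that this pattern meets the requirements is to run it against a few edge cases. The sketch below uses Python's re module, which supports the same lookahead syntax; the sample strings and the function name is_valid are invented for illustration.

```python
import re

PATTERN = re.compile(r"^(?=.{0,20}$)(?:[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)?)?$")

def is_valid(value: str) -> bool:
    """True if value has at most 20 chars and at most one inner dash."""
    return PATTERN.search(value) is not None

# Print each sample next to the verdict of the pattern.
for sample in ["abc123", "abc-123", "-abc", "abc-", "a-b-c", "a" * 21, ""]:
    print(repr(sample), is_valid(sample))
```

Note that the empty string is accepted, because the whole non-capturing group is optional and the lookahead allows zero characters.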
Have a try with:
^(?:[^\W_]{1,20}|(?!.{22})[^\W_]+-[^\W_]+)$
See an online demo
^ - Start-line anchor;
(?: - Open non-capture group;
[^\W_]{1,20} - Match between 1-20 alphanumeric characters;
| - Or;
(?!.{22})[^\W_]+-[^\W_]+ - Negative lookahead asserting that the position is not followed by 22 characters, then 1+ alphanumeric characters, a hyphen, and 1+ alphanumeric characters;
)$ - Close non-capture group before matching end-line anchor.
Note that the above assumes up to 20 alphanumeric characters plus one optional hyphen, which takes the maximum length to 21 characters.
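The length note can be checked directly. In this Python sketch the sample strings and the function name accepts are invented, and re.ASCII is an added assumption so that [^\W_] means only English letters and digits (Python's \w is Unicode-aware by default). A 21-character string passes only when one of its characters is the hyphen.

```python
import re

PATTERN = re.compile(r"^(?:[^\W_]{1,20}|(?!.{22})[^\W_]+-[^\W_]+)$", re.ASCII)

def accepts(value: str) -> bool:
    """True if the pattern matches the whole value."""
    return PATTERN.search(value) is not None

print(accepts("a" * 20))                    # 20 alphanumerics, no hyphen
print(accepts("a" * 10 + "-" + "a" * 10))   # 20 alphanumerics plus one hyphen
print(accepts("a" * 21))                    # 21 alphanumerics, no hyphen
```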
Another idea is to use a lookahead at the start and a word boundary at the end.
^(?!.{21})[A-Za-z\d]+-?[A-Za-z\d]*\b$
^(?!.{21}) the lookahead checks at start for max 20 characters
[A-Za-z\d]+ starting with one or more alphanumeric characters
-?[A-Za-z\d]* optional hyphen followed by any number of alphanumeric characters
\b$ the word boundary forces to end with an alphanumeric char
See this demo at regex101
FYI: If \pL (letter) is supported by your regex flavor, it can be used to shorten the pattern: ^(?!.{21})[\pL\d]+-?[\pL\d]*\b$
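The role of the trailing \b is easiest to see by trying inputs that end in a hyphen. A minimal Python sketch with made-up sample strings follows; the function name matches is hypothetical, and the \pL variant is left out because Python's built-in re module does not support \p classes.

```python
import re

PATTERN = re.compile(r"^(?!.{21})[A-Za-z\d]+-?[A-Za-z\d]*\b$")

def matches(value: str) -> bool:
    """True if value satisfies the lookahead + word boundary pattern."""
    return PATTERN.search(value) is not None

# "abc-" fails: after a trailing hyphen there is no word boundary before the end.
for sample in ["abc", "abc-def", "abc-", "-abc", "a-b-c", "a" * 20, "a" * 21]:
    print(repr(sample), matches(sample))
```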

How to replace a character using sed with different lengths in preceding string

I have a file in which I want to replace the "_" character with "-" wherever it is part of a gene name. Examples of the gene names and my intended output are:
aa1c1_123 -> aa1c1-123
aa1c2_456 -> aa1c2-456
aa1c10_789 -> aa1c10-789
In essence, the first four characters are fixed, followed by 1 or 2 characters depending on the chromosome, an underscore, and then the remainder of the gene ID, which can vary in length and characters. Importantly, the gene information column also contains other strings with underscores (e.g. "gene_id", "transcript_id", "five_prime_utr"), so a blanket sed -i.bak 's/_/-/g' file.gtf
can't be used.
Perhaps not the most elegant way, but this should work:
sed -i.bak 's/\([0-9a-z]\{4\}[0-9][0-9]\?\)_/\1-/g' file.gtf
i.e. capture a group (referenced by \1 in the substitution) of 4 characters consisting of lowercase letters and digits, followed by exactly one digit and optionally another digit, all followed by an underscore; if found, replace it with the group's content and a dash. This should exclude your other occurrences, which consist only of letters and an underscore.
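To try the substitution on a sample line before editing the file in place, the same capture-and-replace logic can be sketched in Python. The sample attribute text and the function name fix_gene_names are invented for this illustration; the pattern is a direct translation of the sed expression into Python's regex syntax.

```python
import re

# Four lowercase letters/digits, one digit, an optional second digit, then "_".
GENE_PREFIX = re.compile(r"([0-9a-z]{4}[0-9][0-9]?)_")

def fix_gene_names(line: str) -> str:
    """Replace the underscore after a gene-name prefix with a dash."""
    return GENE_PREFIX.sub(r"\1-", line)

sample = 'gene_id "aa1c10_789"; transcript_id "aa1c10_789.1"; five_prime_utr'
print(fix_gene_names(sample))
```

Keywords such as gene_id are left alone because no digit precedes their underscore.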

One regex to capture words separated by one space-character in combination with the opposite capture occurrences more than one space-character

I would like a single regex that captures runs of words separated by exactly one space character and, conversely, also captures each occurrence of more than one space character.
I would like to have the following example covered:
This line with sometimes more than 1 space needs to be captured in 3 matches with 2 groups.
I expect the following groups:
([This line with][ ])([sometimes more than][ ])([1][ ])space needs to be captured in 3 matches with 2 groups.
Capturing either of the two on its own is no problem, e.g. to capture more than one space character:
([\s]{2,})
and to capture words separated by only one space character (see https://stackoverflow.com/a/60288115/3710053):
\S+(?:\s\S+)*
You might use an alternation to match either a word followed by a repeating pattern of a single space and a word OR match 2 or more spaces
\S+(?: \S+)*| {2,}
Explanation
\S+ Match 1+ non whitespace chars
(?: \S+)* Repeat 0+ times matching a space and 1+ non whitespace chars
| Or
' {2,}' Match a space 2 or more times (the quotes only make the leading space visible)
Regex demo
If you want to match whitespace chars instead, you could replace the space with \s but note that it could also possibly match newlines.
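The alternation can be tried out with a short Python sketch. The sample sentence, with its particular runs of 2, 3 and 4 spaces, is an assumption made for this example, as is the function name split_runs.

```python
import re
from typing import List

# Either a run of single-spaced words, or a gap of two or more spaces.
TOKEN = re.compile(r"\S+(?: \S+)*| {2,}")

def split_runs(text: str) -> List[str]:
    """Return word runs and multi-space gaps in order of appearance."""
    return TOKEN.findall(text)

sample = "This line with  sometimes more than   1    space needs to be captured"
for piece in split_runs(sample):
    print(repr(piece))
```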
Edit
For the updated question, you could use 2 capturing groups:
(\S+(?: \S+)*)( {2,})
Explanation
( Capture group 1
\S+ Match 1+ non whitespace chars
(?: \S+)* Repeat 0+ times matching a space and 1+ non whitespace chars
) Close group 1
( Capture group 2
' {2,}' Match 2 or more spaces (the quotes only make the leading space visible)
) Close group 2
Regex demo
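With two capturing groups, each match pairs a run of words with the gap that follows it. A minimal Python sketch, using an invented sample sentence and a hypothetical function name word_gap_pairs:

```python
import re

# Group 1: single-spaced run of words. Group 2: the 2+ spaces that follow it.
PAIR = re.compile(r"(\S+(?: \S+)*)( {2,})")

def word_gap_pairs(text):
    """Return (words, gap) tuples for every run that is followed by a gap."""
    return PAIR.findall(text)

sample = "This line with  sometimes more than   1    space needs to be captured"
for words, gap in word_gap_pairs(sample):
    print(repr(words), len(gap))
```

The trailing words after the last gap are not returned, since the pattern requires two or more spaces after group 1.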

How to match only whitespace and letter with regex in flutter?

I have this input
2019-12-04T21:24:24 or 2019-12-04 21:24:24
I am trying to check whether a "T" or a " " is present.
I see two solutions:
match everything between 10 and 11 characters in length
match only letters and whitespace
I tried these, but nothing happens:
^[a-zA-Z]{10,11}$
^.{10,11}$
I think there's a misunderstanding in your regex: what you've written means "Is the entire input a sequence of 10 or 11 characters?", which will always be false for a DateTime. You should select the 11th character and then check whether it matches (T|\s) (either the letter T or a whitespace character).
You want
^[0-9]{4}-[0-9]{2}-[0-9]{2}[T ][0-9]{2}:[0-9]{2}:[0-9]{2}$
See the regex demo.
Details:
^ - start of string
[0-9]{4}-[0-9]{2}-[0-9]{2} - four digits, -, two digits, -, two digits
[T ] - T or a space
[0-9]{2}:[0-9]{2}:[0-9]{2} - two digits, :, two digits, :, two digits
$ - end of string.
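The pattern uses only basic regex syntax, so it can be checked outside Flutter as well. Below is a small sketch in Python rather than Dart; the capturing group around [T ] is an addition made here to report which separator was found, and the function name separator is hypothetical.

```python
import re
from typing import Optional

# Same pattern as above, with [T ] wrapped in a capturing group (an addition).
TIMESTAMP = re.compile(
    r"^[0-9]{4}-[0-9]{2}-[0-9]{2}([T ])[0-9]{2}:[0-9]{2}:[0-9]{2}$"
)

def separator(value: str) -> Optional[str]:
    """Return "T" or " " for a well-formed timestamp, otherwise None."""
    match = TIMESTAMP.search(value)
    return match.group(1) if match else None

print(repr(separator("2019-12-04T21:24:24")))
print(repr(separator("2019-12-04 21:24:24")))
```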
Just add this code to your TextField, for example:
inputFormatters: [FilteringTextInputFormatter.allow(RegExp("[ آ-ی]"))],
To allow spaces, just include a space in the RegExp character class.

create a generic regex for a string in perl

I have tried to create regex for the below:
STRING sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv|FATAL|red|1h||fw_alert
REGEX----> /^[^#]\w+\|[^\|]+\|\w+\|\w+\|\w*\|\w*\|([^\|]+|)\|\w*$/
I am unable to figure out the mistake here.
I created the above by referring to another regex, which works fine and is given below
/^[^#]\w+\|[^\|]+\|([^\|]+|)\|[rm]\|(in|out|old|new|arch|missing)\|\w+\|([0-9-,]+|)\|\w*\|\w*$/
sou_u02_mlpv0747_CCF_ASB001_LU_ODR|/opt/app/medvhs/mvs/applications/cm_vm5/components/CCF_ASB001_LU/SPOOL/ODR||r|out|30m|0400-1959|30m|gprs_in_stag
Can someone please help me? Any leads would be highly appreciated.
Let's start with a brief look at your source text (the first one that you included).
It is composed of "sections" separated by the | char.
This char (|) must be matched by \|. Remember the preceding backslash; otherwise a "bare" | would mean the alternation separator (you used it in one place).
Now take a look at each section (between | chars):
Some of them contain only a sequence of word chars (and can be matched by \w+).
Other sections, however, also contain other chars, e.g. slashes, a backslash, braces and dots, so each such section is actually a sequence of chars other than "|" and must be matched by [^|]+ (here, between [ and ], the vertical bar may be unescaped).
Now let's write each section and its "type":
1. sou_u02_..._FW_ALERT - word chars.
2. /opt/app/.../UnifiedLogging - other chars (because of slashes).
3. UL_\d{8}_..._Primary.log.csv - other chars (because of \d{8} and dots).
4. FATAL|red|1h - 3 sections composed of word chars.
5. An empty section, between 2 consecutive | chars.
6. fw_alert - word chars.
And now, how to match these sections and the separating |:
Point 1: \w+\| - word chars and an (escaped) vertical bar.
Point 2 and 3 (together): (?:[^|]+\|){2} - a non-capturing group - (?:...), containing a sequence of "other" chars - [^|]+ and a vertical bar - \|, occurring 2 times {2}.
Point 4 (three "word char" sections): (?:\w+\|){3} - similar to the previous point.
Point 5: Just as in your solution - ([^|]+|)\|, a capturing group - (...), with 2 alternatives ...|.... The first alternative is [^|]+ (a sequence of "other" chars), and the second alternative is empty. After the capturing group there is \| to match the vertical bar.
Point 6: \w+ - word chars. This time no \|, as this is the last section.
The regex assembled so far must be:
prepended with a ^ (start of string) and
appended with a $ (end of string).
So the whole regex, matching your source text can be:
^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]+|)\|\w+$
Actually, the only capturing group can be written another way, as ([^|]*) - without alternatives, but with * as the repetition count, which also allows empty content.
It is your choice which variant to apply.
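The assembled regex and the one from the question can be compared on the sample line. The sketch below uses Python instead of Perl (both patterns use syntax common to the two), and it assumes that the leading word STRING in the question is a label and not part of the data.

```python
import re

# Sample line from the question, kept raw so that \d{8} stays literal text.
LINE = (
    r"sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT"
    r"|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging"
    r"|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv"
    r"|FATAL|red|1h||fw_alert"
)

ASSEMBLED = re.compile(r"^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]+|)\|\w+$")
ORIGINAL = re.compile(r"^[^#]\w+\|[^\|]+\|\w+\|\w+\|\w*\|\w*\|([^\|]+|)\|\w*$")

print("assembled matches:", ASSEMBLED.search(LINE) is not None)
print("original matches:", ORIGINAL.search(LINE) is not None)
```

The question's regex fails at the third section, where \w+ cannot get past the backslash that follows UL_.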
The third field
UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv
contains a backslash \, braces { } and dots (.). None of these can be matched by \w.
Note also that there is no need to escape a pipe | inside a character class: [^|]+ is fine.