create a generic regex for a string in perl - perl

I have tried to create a regex for the string below:
STRING sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv|FATAL|red|1h||fw_alert
REGEX----> /^[^#]\w+\|[^\|]+\|\w+\|\w+\|\w*\|\w*\|([^\|]+|)\|\w*$/
I am unable to figure out the mistake here.
I created the above by adapting another regex, which works fine and is given below, followed by a string it matches:
/^[^#]\w+\|[^\|]+\|([^\|]+|)\|[rm]\|(in|out|old|new|arch|missing)\|\w+\|([0-9-,]+|)\|\w*\|\w*$/
sou_u02_mlpv0747_CCF_ASB001_LU_ODR|/opt/app/medvhs/mvs/applications/cm_vm5/components/CCF_ASB001_LU/SPOOL/ODR||r|out|30m|0400-1959|30m|gprs_in_stag
Can someone please help me? Any leads would be highly appreciated.

Let's start with a brief look at your source text (the first one that you included).
It is composed of "sections" separated with | char.
This char (|) must be matched by \|. Remember about the preceding
backslash, otherwise, a "bare" | would mean the alternative separator
(you used it in one place).
And now take a look at each section (between |):
Some of them contain only a sequence of word chars (and can be matched
by \w+).
Other sections, however, contain also other chars, e.g. slashes,
backslash, braces and dots, so each such section is actually a sequence
of chars other than "|" and must be matched by [^|]+ (here,
between [ and ], the vertical bar may be unescaped).
Now let's write each section and its "type":
sou_u02_..._FW_ALERT - word chars.
/opt/app/.../UnifiedLogging - other chars (because of slashes).
UL_\d{8}_..._Primary.log.csv - other chars (because of \d{8}
and dots).
FATAL|red|1h - 3 sections composed of word chars.
An empty section, between 2 consecutive | chars.
fw_alert - word chars.
And now, how to match these groups, and the separating |:
Point 1: \w+\| - word chars and (escaped) vertical bar.
Point 2 and 3 (together): (?:[^|]+\|){2} - a non-capturing
group - (?:...), containing a sequence of "other" chars - [^|]+
and a vertical bar - \|, occurring 2 times {2}.
Point 4 (three "word char" groups): (?:\w+\|){3} - similar to
the previous point.
Point 5: Just as in your solution - ([^|]+|)\|, a capturing group -
(...), with 2 alternatives ...|.... The first alternative is
[^|]+ (a sequence of "other" chars), and the second alternative
is empty. After the capturing group there is \| to match the vertical
bar.
Point 6: \w+ - word chars. This time no \|, as this is the last
section.
The regex assembled so far must be:
prepended with a ^ (start of string) and
appended with a $ (end of string).
So the whole regex, matching your source text can be:
^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]+|)\|\w+$
Actually, the only capturing group can be written another way,
as ([^|]*) - without alternatives, but with * as the
repetition count, allowing also empty content.
Your choice, which variant to apply.
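As a quick check of the assembled regex, here is a minimal sketch in Python rather than Perl (the constructs used here are also supported by Python's re module). The variable names are illustrative only; the sample line is the one from the question.

```python
import re

# The sample line from the question; raw strings keep \d{8} as literal text.
line = (r"sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT"
        r"|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging"
        r"|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv"
        r"|FATAL|red|1h||fw_alert")

# Variant with an empty alternative in the capturing group.
with_alternation = re.compile(r"^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]+|)\|\w+$")
# Variant with * instead of the alternation.
with_star = re.compile(r"^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]*)\|\w+$")

for pattern in (with_alternation, with_star):
    match = pattern.match(line)
    # Both variants match, and the captured section is the empty one.
    print(bool(match), repr(match.group(1)))
```

Both variants capture the empty seventh section, so the choice between them is a matter of readability.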

The third field
UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv
contains a backslash (\), braces ({ }) and dots (.). None of these can be matched by \w.
Note also that there is no need to escape a pipe | inside a character class: [^|]+ is fine.
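To see which characters trip up \w+ in that field, a small sketch in Python can help (used here instead of Perl for illustration; the variable names are made up for this example):

```python
import re

third_field = r"UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv"

# \w+ cannot cover the whole field, while [^|]+ can.
word_only = re.fullmatch(r"\w+", third_field)
not_pipe = re.fullmatch(r"[^|]+", third_field)

# Collect the distinct non-word characters that make \w+ fail.
rejected = sorted(set(re.findall(r"\W", third_field)))

print(word_only is None, not_pipe is not None, rejected)
```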

Related

RegEx: matching 20 letters with 1 optional dash

I'm struggling to write a regex that matches the following requirements:
up to 20 characters (English letters and numbers)
may have one optional dash ( - ) but can't start or end with it
I came up with this pattern: ^[a-zA-Z0-9-]{0,20}$ but it allows multiple dashes, and the dash may appear at the beginning or end of the input string.
You can use
^(?=.{0,20}$)(?:[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)?)?$
See the regex demo.
Details:
^ - start of string
(?=.{0,20}$) - zero to twenty chars allowed in the string
(?: - a non-capturing group start:
[a-zA-Z0-9]+ - one or more alphanumeric chars
(?:-[a-zA-Z0-9]+)? - an optional sequence of a - and one or more
alphanumeric chars
)? - end of the non-capturing group, repeat one or zero times (i.e. the pattern match is optional)
$ - end of string.
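The behaviour described above can be checked with a short sketch in Python (the sample inputs are made up for illustration). Note that the lookahead counts the dash, so the total length is capped at 20, and that the empty string is accepted.

```python
import re

pattern = re.compile(r"^(?=.{0,20}$)(?:[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)?)?$")

# Hypothetical sample inputs mapped to the expected verdict.
samples = {
    "abc-def": True,
    "a" * 20: True,
    "": True,            # empty input is accepted by this pattern
    "-abc": False,       # leading dash
    "abc-": False,       # trailing dash
    "a-b-c": False,      # more than one dash
    "a" * 21: False,     # too long
}

for text, expected in samples.items():
    assert bool(pattern.match(text)) == expected, text
```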
Have a try with:
^(?:[^\W_]{1,20}|(?!.{22})[^\W_]+-[^\W_]+)$
See an online demo
^ - Start-line anchor;
(?: - Open non-capture group;
[^\W_]{1,20} - Match between 1-20 alphanumeric characters;
| - Or;
(?!.{22})[^\W_]+-[^\W_]+ - Negative lookahead to assert that the position is not followed by 22 characters, then match 1+ alphanumeric characters, a hyphen, and 1+ alphanumeric characters;
)$ - Close non-capture group before matching end-line anchor.
Note that the above assumes up to 20 alphanumeric characters, with one optional hyphen that would take the max count to 21 characters.
Another idea by use of a lookahead and word boundary at the end.
^(?!.{21})[A-Za-z\d]+-?[A-Za-z\d]*\b$
^(?!.{21}) the lookahead checks at start for max 20 characters
[A-Za-z\d]+ starting with one or more alphanumeric characters
-?[A-Za-z\d]* optional hyphen followed by any amount alnum
\b$ the word boundary forces to end with an alphanumeric char
See this demo at regex101
FYI: If your regex flavor supports \pL (any letter), it can be used to shorten the pattern: ^(?!.{21})[\pL\d]+-?[\pL\d]*\b$ (note that \pL also matches non-English letters).
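The last two patterns differ in whether the hyphen counts towards the limit of 20. A minimal sketch in Python makes this visible; the names allow_21 and max_20 are made up for this example, and re.ASCII is an added assumption to restrict \w, \d and \b to ASCII.

```python
import re

# Up to 20 alphanumerics, plus an optional hyphen (21 characters in total).
allow_21 = re.compile(r"^(?:[^\W_]{1,20}|(?!.{22})[^\W_]+-[^\W_]+)$", re.ASCII)
# At most 20 characters in total, hyphen included.
max_20 = re.compile(r"^(?!.{21})[A-Za-z\d]+-?[A-Za-z\d]*\b$", re.ASCII)

with_hyphen_21 = "a" * 10 + "-" + "b" * 10  # 20 alphanumerics plus one hyphen
inputs = (with_hyphen_21, "abc-def", "abc-", "-abc", "a" * 21)

results = {
    name: [bool(pattern.match(text)) for text in inputs]
    for name, pattern in (("allow_21", allow_21), ("max_20", max_20))
}
print(results)
```

Only the first input is treated differently, so the choice depends on whether the hyphen should count as one of the 20 characters.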

How to replace a character using sed with different lengths in preceding string

I have a file in which I want to replace the "_" string with "-" in cases where it makes up a part of my gene name. Examples of the gene names and my intended output are:
aa1c1_123 -> aa1c1-123
aa1c2_456 -> aa1c2-456
aa1c10_789 -> aa1c10-789
In essence, the first four characters are fixed, followed by 1 or 2 characters depending on the chromosome, an underscore and then the remainder of the gene ID, which can vary in length and characters. Importantly, this gene information column also contains other strings with underscores (e.g. "gene_id", "transcript_id", "five_prime_utr"), so a blanket sed -i.bak 's/_/-/g' file.gtf
can't be used.
Perhaps not the most elegant way, but this should work:
sed -i.bak 's/\([0-9a-z]\{4\}[0-9][0-9]\?\)_/\1-/g' file.gtf
i.e. capture a group (referenced by \1 in the substitution) of 4 characters consisting of lower-case letters and digits, followed by one digit and optionally a second digit; the group must be followed by an underscore. If found, replace the match with the group's content and a dash. This should exclude your other occurrences, which consist only of letters and an underscore. Note that \? is a GNU sed extension.
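The same substitution can be sketched in Python to check it against the examples. The function name and the sample attribute line are hypothetical, not taken from the actual file.

```python
import re

# Four lower-case letters/digits, one or two digits, then the underscore to replace.
GENE_UNDERSCORE = re.compile(r"([0-9a-z]{4}[0-9][0-9]?)_")


def fix_gene_names(text: str) -> str:
    """Replace the underscore in gene names, leaving other underscores alone."""
    return GENE_UNDERSCORE.sub(r"\1-", text)


# Hypothetical attribute line, for illustration only.
sample = 'gene_id "aa1c10_789"; transcript_id "aa1c10_789"; five_prime_utr'
print(fix_gene_names(sample))
```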

What is the difference between '>-' and '|-' in yaml?

I want to know exactly what the difference is between '>-' and '|-', especially in Kubernetes YAML manifests.
Newlines in folded block scalars (>) are subject to line folding, newlines in literal block scalars (|) are not.
Line folding replaces a single newline between non-empty lines with a space, and in the case of empty lines, reduces the number of newline characters between the surrounding non-empty lines by one:
a: > # folds into "one two\nthree four\n\nfive\n"
  one
  two

  three
  four


  five
Line folding does not occur between lines when at least one line is more indented, i.e. contains whitespace at the beginning that is not part of the block's general indentation:
a: > # folds into "one\n two\nthree four\n\n five\n"
  one
   two
  three
  four

   five
Adding - after either | or > will strip the newline character from the last line:
a: >- # folded into "one two"
  one
  two
b: >- # folded into "one\ntwo"
  one

  two
In contrast, | emits every newline character as-is, the sole exception being the last one if you use -.
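The folding and stripping rules can be mimicked with a simplified sketch in Python. This is not a YAML parser: the fold and literal helpers are hypothetical, assume the block has already been de-indented, and ignore the special handling of more-indented lines.

```python
def fold(block: str, strip: bool = False) -> str:
    """Mimic '>' (or '>-' when strip=True) for equally indented lines only."""
    pieces = []
    blank_run = 0
    for line in block.strip("\n").split("\n"):
        line = line.strip()
        if not line:
            blank_run += 1
            continue
        if pieces:
            # One newline between text lines becomes a space;
            # n empty lines in between become n newlines.
            pieces.append("\n" * blank_run if blank_run else " ")
        pieces.append(line)
        blank_run = 0
    text = "".join(pieces)
    return text if strip else text + "\n"


def literal(block: str, strip: bool = False) -> str:
    """Mimic '|' (or '|-' when strip=True): inner newlines are kept as-is."""
    text = block.rstrip("\n")
    return text if strip else text + "\n"


print(repr(fold("one\ntwo\n\nthree\nfour\n\n\nfive\n")))
print(repr(fold("one\ntwo\n", strip=True)))
print(repr(literal("one\ntwo\n", strip=True)))
```

For real manifests, a YAML library should be used to check how a scalar is loaded; the sketch only illustrates the rule.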
Ok I got one main difference between > and | from here: https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html
Values can span multiple lines using | or >. Spanning multiple lines
using a “Literal Block Scalar” | will include the newlines and any
trailing spaces. Using a “Folded Block Scalar” > will fold newlines to
spaces; it’s used to make what would otherwise be a very long line
easier to read and edit. In either case the indentation will be
ignored.
Examples are:
include_newlines: |
  exactly as you see
  will appear these three
  lines of poetry

fold_newlines: >
  this is really a
  single line of text
  despite appearances
In fact, as I understand it, ">" is the equivalent of the escape character '\' at the end of a line in a bash script, for example.
If someone can tell me what the "-" is used for in Kubernetes YAML manifests, it will complete my understanding :)

Copy selected content from one sheet of Notepad++ to another

I have data which is pipe-separated, e.g.
1|2|3|4|5|6|7|8|9|10|
I have to copy and paste (to a new sheet) only the part between pipes 6 and 9.
I have 10,000 rows like this.
How can we do this? How can we write a macro for it? Is there any other solution?
Copy the entire text into a new buffer, then edit the text to remove the unwanted parts. You can do that with a regular-expression replace-all of ^(?:[^|\r\n]*\|){5}([^|\r\n]*)\|.*$ with \1.
Explanation
^ - start of line
(?: - start of a non-capturing group
[^|\r\n]* - zero or more characters that are not a | or newlines or carriage returns
\| - a |
){5} - exactly 5 occurrences of the previous group
-- the effect of the above is to match the unwanted leading characters
([^|\r\n]*) - a group containing the characters to keep
-- the wanted part of the line is saved in capture group 1
\|.*$ - a | then everything else to the end of the line
-- matches the unwanted right-hand part of the line
The final $ is not strictly needed. But, when considered with the opening ^, it serves to document that the regular expression looks at the whole line.
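To check what the replacement keeps, here is a minimal sketch in Python (re.MULTILINE plays the role of Notepad++'s line-by-line anchors; the sample rows use plain \n line endings). As written, the pattern keeps the sixth field. The second pattern, FIELDS_6_TO_9, is a hypothetical variant, not part of the answer above, for the case where fields 6 through 9 are wanted.

```python
import re

# The pattern from the explanation above: keeps only the sixth field.
SIXTH_FIELD = re.compile(r"^(?:[^|\r\n]*\|){5}([^|\r\n]*)\|.*$", re.MULTILINE)

# Hypothetical variant: capture three more "field|" groups plus one field.
FIELDS_6_TO_9 = re.compile(
    r"^(?:[^|\r\n]*\|){5}((?:[^|\r\n]*\|){3}[^|\r\n]*)\|.*$", re.MULTILINE
)

rows = "1|2|3|4|5|6|7|8|9|10|\na|b|c|d|e|f|g|h|i|j|"

sixth = SIXTH_FIELD.sub(r"\1", rows)
six_to_nine = FIELDS_6_TO_9.sub(r"\1", rows)
print(sixth)
print(six_to_nine)
```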

sed - remove specific subscript from string

Please provide a sed one-liner which produces this output:
sdc3 sdc2
for the input:
sdc3[1] sdc2[0]
I mean, remove every subscript value from the string.
sed 's/\[[^]]*\]//g'
This reads: replace any string consisting of a literal "[", followed by zero or more characters that aren't a "]", and then the closing "]", with an empty string.
You need the [^]] bit to prevent greedy matching treating "[1] sdc2[0]" as a single match in your sample string.
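The effect of the negated class can be compared with a greedy .* in a small Python sketch (an illustration only; the brackets inside the class are escaped here, which Python also accepts):

```python
import re

text = "sdc3[1] sdc2[0]"

# Negated class: each [...] subscript is matched separately.
stripped = re.sub(r"\[[^\]]*\]", "", text)

# Greedy .* instead: one match runs from the first "[" to the last "]".
greedy = re.sub(r"\[.*\]", "", text)

print(repr(stripped), repr(greedy))
```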
As for your comment:
sed 's#\([^[ ]*\)\[[^]]*\]#/dev/\1#g'
I switch the separator from the usual '/' to '#', just to avoid escaping the /dev/ bit you asked for (I won't say "for clarity")
the \(...\) bit matches a subgroup, here sdc2 or whatever, so we can refer to it in the replacement
the subgroup uses a similar character class to the one we used discarding the index: [^[ ] means any character except an "[" (again, to avoid greedily matching the index) or a space (assuming your values are space-delimited as per your post)
the replacement is now the literal "/dev/" followed by the first (and only) subgroup match
the g flag at the end tells it to perform multiple matches per line, instead of stopping at the first one
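The same substitution can be sketched in Python to confirm the result; re.sub replaces every match by default, which corresponds to the g flag. The brackets inside the character classes are escaped here to keep Python from warning about a possible nested set.

```python
import re

text = "sdc3[1] sdc2[0]"

# Group 1 is the device name; the [...] subscript after it is dropped.
with_dev = re.sub(r"([^\[ ]*)\[[^\]]*\]", r"/dev/\1", text)

print(with_dev)
```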