Regex - replacing part of a matched group possible? - sed

Is it possible to replace half or x characters of a matched group?
I have had a request for a partial email capturing, so something like example123#abcdef.com becomes ***mple123#***def.com
I can do this if the characters before and after the # are 3 characters long,
([^|#]{0,3})([^]{0,3})
This captures 123#abc.com perfectly and I can substitute ***#***.com for it, but if there are more than 3 characters on each side, for instance example123#abcdef.com, it becomes ******e123#abcdef.com
The other way I can see is capture everything until the # and everything to the . but then this won't be a partial capture. Is this possible?

You can use
sed -E 's/^[^#]{0,3}|(#)[^.]{0,3}/\1***/g'
Details
-E - enables the POSIX ERE syntax that does not require too much escaping here
^[^#]{0,3} - zero to three occurrences of any char other than a # at the start of the string
| - or
(#) - Group 1: a # char
[^.]{0,3} - zero to three occurrences of any char other than a .
\1*** replaces with Group 1 value + ***.
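
As a rough cross-check of the substitution above, here is a minimal Python sketch using the same pattern. The names MASK_RE and mask are made up for this illustration, and Python's re module stands in for sed, so this is only an approximation of what sed -E does:

```python
import re

# Same alternation as in the sed command: either up to three leading chars
# that are not '#', or a '#' (group 1) plus up to three chars that are not '.'.
MASK_RE = re.compile(r'^[^#]{0,3}|(#)[^.]{0,3}')


def mask(address: str) -> str:
    # Group 1 is None when the first alternative matched, so fall back to ''.
    return MASK_RE.sub(lambda m: (m.group(1) or '') + '***', address)


print(mask('example123#abcdef.com'))  # ***mple123#***def.com
print(mask('123#abc.com'))            # ***#***.com
```

The replacement is written as a function rather than a template string only to make the "empty Group 1" case explicit.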

Related

ios regex mention until special character + avoid space

I have a string like the one below:
#a b#mail.com has joined
I want to show #a b c as a mention, but the space is hard to detect, so I wrap the string like this before applying the regex:
#``a b#mail.com`` has joined
So I want to detect text that starts with
#``
and ends with
``
Can anyone help me with the regex? I have tried a lot but it is still not working. Here is the regex I am testing:
^#``.*``{3,}
You might use a capture group and capture any char except a # using a negated character class.
^#``([^#\r\n]+).*?``
The pattern matches
^#`` Start of string, match # and 2 backticks
( Capture group 1
[^#\r\n]+ Match 1+ occurrences of any char except # or a newline
) Close group 1
.*?`` Match as few chars as possible until 2 backticks
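
To see what Group 1 holds, here is a small sketch in Python (used only for illustration; the question is about iOS, and the pattern would still need string-literal escaping there). MENTION_RE and the variable names are assumptions of this example:

```python
import re

# '#' and two backticks, then group 1 = everything up to the next '#',
# then lazily skip ahead to the closing two backticks.
MENTION_RE = re.compile(r'^#``([^#\r\n]+).*?``')

text = '#``a b#mail.com`` has joined'
m = MENTION_RE.search(text)
mention = m.group(1) if m else None
print(mention)  # a b
```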

Regex for currency - Exclude commas from the count limit

I am using the following regex in my app:
^(([0-9|(\\,)]{0,10})?)?(\\.[0-9]{0,2})?$
So it allows 10 characters before the decimal point and 2 characters after it.
But I am adding the functionality of formatting the text field as currency while typing. So if I have 1234567, it becomes 1,234,567 after formatting. The regex fails when I enter 10 characters instead of 10 digits. Ideally, the regex should ignore the commas when counting to 10.
I tried this too ^(([0-9|(\\,)]{0,13})?)?(\\.[0-9]{0,2})?$ but it doesn't seem to be the right approach.
Can anyone help me get a proper regex instead of using this tweak?
You may use
"^(?:,*[0-9]){0,10}(?:\.[0-9]{0,2})?$"
Or, if there must be a digit after . in the fractional part use
"^(?:,*[0-9]){0,10}(?:\.[0-9]{1,2})?$"
The (?:,*[0-9]){0,10} part is what does the job: it matches 0 or more , chars followed by a single digit, 0 to 10 times. If , can also appear right before ., add ,* after (?:,*[0-9]){0,10}.
Details
^ - start of string
(?:,*[0-9]){0,10} - 0 to 10 occurrences of 0+ commas followed with a digit
(?:\.[0-9]{0,2})? - an optional sequence of:
\. - a period
[0-9]{0,2} - 0 to 2 digits (if there must be a digit after . use [0-9]{1,2})
$ - end of string.
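
A quick way to check the digit counting is a sketch like the following, written in Python purely for illustration (the pattern is the first one from the answer; AMOUNT_RE and is_valid are hypothetical names):

```python
import re

# Up to 10 digits, each optionally preceded by commas, then an optional
# fractional part with at most 2 digits.
AMOUNT_RE = re.compile(r'^(?:,*[0-9]){0,10}(?:\.[0-9]{0,2})?$')


def is_valid(amount: str) -> bool:
    return AMOUNT_RE.match(amount) is not None


print(is_valid('1,234,567'))         # True: 7 digits
print(is_valid('1,234,567,890.12'))  # True: 10 digits, commas not counted
print(is_valid('12,345,678,901'))    # False: 11 digits
print(is_valid('1234.567'))          # False: 3 fractional digits
```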

create a generic regex for a string in perl

I have tried to create regex for the below:
STRING sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv|FATAL|red|1h||fw_alert
REGEX----> /^[^#]\w+\|[^\|]+\|\w+\|\w+\|\w*\|\w*\|([^\|]+|)\|\w*$/
I am unable to figure out the mistake here.
I created the above by referring to another regex, which works fine and is given below, together with a string it matches:
/^[^#]\w+\|[^\|]+\|([^\|]+|)\|[rm]\|(in|out|old|new|arch|missing)\|\w+\|([0-9-,]+|)\|\w*\|\w*$/
sou_u02_mlpv0747_CCF_ASB001_LU_ODR|/opt/app/medvhs/mvs/applications/cm_vm5/components/CCF_ASB001_LU/SPOOL/ODR||r|out|30m|0400-1959|30m|gprs_in_stag
Can someone please help me? Any leads would be highly appreciated.
Let's start with a brief look at your source text (the first one that you included).
It is composed of "sections" separated by the | char.
This char (|) must be matched by \|. Remember the preceding backslash; otherwise, a "bare" | is the alternation separator (you used it in one place).
And now take a look at each section (between | chars):
Some of them contain only a sequence of word chars (and can be matched by \w+).
Other sections, however, also contain other chars, e.g. slashes, a backslash, braces and dots, so each such section is actually a sequence of chars other than "|" and must be matched by [^|]+ (here, between [ and ], the vertical bar may be unescaped).
Now let's write down each section and its "type":
1. sou_u02_..._FW_ALERT - word chars.
2. /opt/app/.../UnifiedLogging - other chars (because of slashes).
3. UL_\d{8}_..._Primary.log.csv - other chars (because of \d{8} and dots).
4. FATAL|red|1h - 3 sections composed of word chars.
5. An empty section, between 2 consecutive | chars.
6. fw_alert - word chars.
And now, how to match these groups and the separating |:
Point 1: \w+\| - word chars and an (escaped) vertical bar.
Points 2 and 3 (together): (?:[^|]+\|){2} - a non-capturing group - (?:...), containing a sequence of "other" chars - [^|]+ and a vertical bar - \|, occurring 2 times - {2}.
Point 4 (three "word char" groups): (?:\w+\|){3} - similar to the previous point.
Point 5: Just as in your solution - ([^|]+|)\|, a capturing group - (...), with 2 alternatives - ...|.... The first alternative is [^|]+ (a sequence of "other" chars), and the second alternative is empty. After the capturing group there is \| to match the vertical bar.
Point 6: \w+ - word chars. This time no \|, as this is the last section.
The regex assembled so far must be:
prepended with a ^ (start of string) and
appended with a $ (end of string).
So the whole regex matching your source text can be:
^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]+|)\|\w+$
Actually, the only capturing group can be written another way, as ([^|]*) - without alternatives, but with * as the repetition count, which also allows empty content.
It is your choice which variant to apply.
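
The difference between the original and the assembled regex can be checked with a short sketch. It is written in Python rather than Perl purely for illustration (the constructs used here behave the same way in both), and the variable names are made up for this example:

```python
import re

# The source text from the question, split over several raw strings.
line = (r'sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT'
        r'|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging'
        r'|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv'
        r'|FATAL|red|1h||fw_alert')

# The regex from the question: the third section is matched with \w+.
original = re.compile(r'^[^#]\w+\|[^\|]+\|\w+\|\w+\|\w*\|\w*\|([^\|]+|)\|\w*$')
# The regex assembled in the answer, and the ([^|]*) variant.
revised = re.compile(r'^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]+|)\|\w+$')
star_variant = re.compile(r'^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]*)\|\w+$')

print(original.match(line))                 # None: \w+ cannot match "\d{8}"
print(repr(revised.match(line).group(1)))   # '' (the empty section)
print(repr(star_variant.match(line).group(1)))  # ''
```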
The third field
UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv
contains a backslash \, braces { } and dots (.). None of these can be matched by \w.
Note also that there is no need to escape a pipe | inside a character class: [^|]+ is fine.

Regex rule match up to string

I need to use grep / egrep / sed to extract certain parts out of a SNORT rule string.
Given a string that can be in the format:
alert tcp any any -> any any (msg:"Some message";
content:"c1"; content:"GET /blah"; offset:0; depth:9; content:"something else";)
How would I go about extracting just the following:
content:"GET /blah"; offset:0; depth:9;
Given that the following are true:
It must match up until the start of the next content match (if there is one)
A rule may only have this content term, it may have more and they may be in any order
Other modifiers may be applied before, after or in between the offset and depth operators, they must also be extracted as follows:
content:"GET "; offset:5; http_uri; depth:12;
Rules can be "malformed" i.e. instead of having a single semicolon after the content term it may have two or more.
What I have so far which I believe would work in other regex systems is:
(GET|POST).*?(?=content)
The idea behind this is that .*? is a non-greedy match on any character any number of times, followed by a non-consuming match (a lookahead) on the next term "content".
I believe this breaks, though, if there is no following content term, and it also doesn't seem to extract anything in grep or egrep.
Not sure what to do, any ideas?
This should do the trick:
grep -Po '\bcontent\s*:\s*"(GET|POST)\b[^"]*"((?!;\s*content\s*:)[^"]|"[^"]*")*;'
Sample input:
alert tcp any any -> any any (msg:"Some message";
content:"c1"; content:"GET /blah"; offset:0; depth:9; content:"something else";)
content:"GET "; offset:5; http_uri; depth:12;
Output:
content:"GET /blah"; offset:0; depth:9;
content:"GET "; offset:5; http_uri; depth:12;
Explanation:
Instead of looking ahead for the next content, I am using a negative lookahead to consume anything other than the word content. This way, end of line also qualifies as the end of the match.
The regex in detail:
\b - word boundary (to prevent matching e.g. othercontent)
content\s*:\s* - literally: content followed by a colon; with optional spaces
" - opening quote
(GET|POST) - either one of these verbs
\b - word boundary (to prevent matching e.g. POSTAL)
[^"]*" - everything up to and including the closing quote
( - begin repeating subpattern
(?!;\s*content\s*:) - negative lookahead, to make sure we stop before any subsequent content
[^"] - any non-quote; spaces, letters, colons, semicolons...
| - or...
"[^"]*" - some attribute string; matching this as a whole to prevent the negative lookahead from picking up something between quotes
)* - end repeating subpattern; zero or more times
; - closing semicolon
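
For readers without a PCRE-enabled grep, the same pattern can be tried in Python as a rough equivalent of grep -Po. This is only a sketch: the groups are made non-capturing here (an adjustment for this example, not part of the answer), and CONTENT_RE, lines and matches are hypothetical names:

```python
import re

# Same pattern as the grep command, split in two for readability.
CONTENT_RE = re.compile(
    r'\bcontent\s*:\s*"(?:GET|POST)\b[^"]*"'
    r'(?:(?!;\s*content\s*:)[^"]|"[^"]*")*;'
)

# The sample input, one string per line, as grep would see it.
lines = [
    'alert tcp any any -> any any (msg:"Some message";',
    'content:"c1"; content:"GET /blah"; offset:0; depth:9; content:"something else";)',
    'content:"GET "; offset:5; http_uri; depth:12;',
]

matches = [m.group(0) for line in lines for m in CONTENT_RE.finditer(line)]
for match in matches:
    print(match)
```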

Copy selected content from one sheet of Notepad++ to another

I have data which is pipe-separated, e.g.
1|2|3|4|5|6|7|8|9|10|
I have to copy and paste (to a new sheet) only the part that is between pipes 6 - 9.
I have 10,000 rows like this.
How can we do this? How can we write a macro for it? Is there any other solution?
Copy the entire text into a new buffer, then edit the text to remove the unwanted parts. You can do that with a regular expression replace-all of ^(?:[^|\r\n]*\|){5}([^|\r\n]*)\|.*$ with \1.
Explanation
^ - start of line
(?: - start of a non-capturing group
[^|\r\n]* - zero or more characters that are not a | or newlines or carriage returns
\| - a |
){5} - exactly 5 occurrences of the previous group
-- the effect of the above is to match the unwanted leading characters
([^|\r\n]*) - a group containing the characters to keep
-- the wanted part of the line is saved in capture group 1
\|.*$ - a | then everything else to the end of the line
-- matches the unwanted right-hand part of the line
The final $ is not strictly needed. But, when considered with the opening ^, it serves to document that the regular expression looks at the whole line.
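
Outside Notepad++, the same replace-all can be approximated with a few lines of Python. This is a sketch under the assumption that the rows are available as one multi-line string; FIELD_RE, rows and result are names invented for the example, and the second row is made-up sample data:

```python
import re

# Same expression as above; MULTILINE makes ^ and $ work per line,
# which is how the editor applies them.
FIELD_RE = re.compile(r'^(?:[^|\r\n]*\|){5}([^|\r\n]*)\|.*$', re.MULTILINE)

rows = '1|2|3|4|5|6|7|8|9|10|\na|b|c|d|e|f|g|h|i|j|'
result = FIELD_RE.sub(r'\1', rows)
print(result)
```

As written, the expression keeps only the text in capture group 1, which is the sixth field of each row.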