Match alphanumeric characters in case statement - sh

I need to check in my script whether $1 is a valid hostname, a hostname with a port specified (host:port), or something else.
The first case works fine, but the second, ([a-z0-9.-]+), does not:
case $1 in
(*:*)
foo
;;
([a-z0-9.-]+)
bar
;;
(*)
asdf
;;
esac
How can I match only strings consisting of [a-z0-9.-] in a case statement?

The plus sign doesn't have any special meaning in pattern matching notation.
In this case, you need to take the opposite approach and handle invalid strings before valid ones. For example:
case $1 in
*[!a-z0-9.-]*)
# handle string that contains non-*alphanumeric* characters
;;
*)
# handle string that consists of all *alphanumeric* characters
;;
esac
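As a minimal sketch of this invalid-first approach, the check can be wrapped in a small function. The name check_chars and the rejection of an empty argument are assumptions added for illustration; they are not part of the snippet above.

```sh
#!/bin/sh
# Sketch only: check_chars is a hypothetical helper name.
LC_ALL=C   # keep a-z and 0-9 limited to ASCII

check_chars() {
    case $1 in
        '')             echo invalid ;;  # assumption: empty input is rejected
        *[!a-z0-9.-]*)  echo invalid ;;  # at least one character outside the set
        *)              echo valid ;;    # every character is inside the set
    esac
}

check_chars example.com   # prints: valid
check_chars 'exa mple'    # prints: invalid
check_chars 'a+'          # prints: invalid ('+' is just another character)
```

The third call shows why ([a-z0-9.-]+) did not behave like a regex: in a case pattern the + only ever matches a literal plus sign.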
Concerning your actual question, my naive attempt would be:
# exclude non-ASCII characters from a-z and 0-9
LC_ALL=C
case $1 in
*:*:*)
# handle string that contains multiple colons
;;
*:*[!0-9]*)
# handle string that contains non-digit characters after the colon
;;
*[!a-z0-9.-]*:*)
# handle string that consists of an invalid hostname and a valid port
;;
*:*)
# handle string that consists of a valid hostname and port
# perform further validation of hostname and port
;;
*[!a-z0-9.-]*)
# handle string that forms an invalid hostname
;;
*)
# handle string that forms a valid hostname
# perform further validation of hostname
;;
esac
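For what it's worth, the same ordering can be tried out as a function that prints a label instead of running a handler. This is only a sketch: classify_target and its three labels are hypothetical names, and the first branch (rejecting an empty argument, an empty host, or an empty port) is an extra assumption that the case statement above does not make.

```sh
#!/bin/sh
# Sketch only: classify_target and the labels it prints are hypothetical.
LC_ALL=C   # exclude non-ASCII characters from a-z and 0-9

classify_target() {
    case $1 in
        ''|:*|*:)         echo invalid ;;    # assumption: empty host or port is rejected
        *:*:*)            echo invalid ;;    # more than one colon
        *:*[!0-9]*)       echo invalid ;;    # non-digit after the colon
        *[!a-z0-9.-]*:*)  echo invalid ;;    # bad character in the host part
        *:*)              echo host:port ;;  # still needs further validation
        *[!a-z0-9.-]*)    echo invalid ;;    # bad character, no port
        *)                echo host ;;       # still needs further validation
    esac
}

classify_target example.com        # prints: host
classify_target example.com:8080   # prints: host:port
classify_target example.com:80a    # prints: invalid
```

The order of the branches matters: every "invalid" pattern has to come before the broader pattern that would otherwise accept the same string.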

Related

What does this $tok =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge; (Perl code) mean?

What does this mean?
$tok =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge;
This comes from a CVE study blog whose code is written in Perl. I know this is a regular expression and that the content of the second {} should replace what the first {} matches, but I do NOT get what '\\'.($2 || $1) means.
$tok =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge;
It is the substitution operator s/// applied to the string $tok, with the modifiers sge. The delimiters of the operator have been changed from / to {}. Let's break that regex down:
s{
\\(.) # (1) match a backslash followed by 1 character, capture
| # (2) or
( # (3) start capture parens
[\$\#] # (4) either a literal $ or #
| # (5) or
\\$ # (6) backslash at the end of line (including newline)
) # end capture parens
}{ # replace with
'\\'.($2 || $1) # (7) backslash concatenated with either capture 2 or 1
}sge; # (8) s = . matches newline, g = match multiple times, e = eval
Judging (at a glance) from the rest of that blog code, this code is not written by someone skilled at Perl. So I will take their comments at face value:
# must protect unescaped "$" and "#" symbols, and "\" at end of string
The eval (8) is apparently to concatenate a backslash with either capture group 2 (2) or 1 (1), depending on which is "true". Or rather, which one matched the string.
Looking closer at the code, (1) and (6) are very similar. The latter one will trigger only at the end of a line that does not have a newline, whereas the first one will handle all other cases, including end of line with a newline (because of /s modifier).
(1) will match any escaped character, so \1, or \$ or \\ anything with a backslash followed by a character. If we look at the replacement part (7), we see that this capture group is the fallback, which will only trigger if the second capture group fails. The second capture group also only matches if the first fails. Confusing? Maybe a little.
(2) triggers if the matching character is not a backslash followed by a character. Now we are looking for a literal $ or #. Or failing that, a backslash at the end of line. But wait a minute, we already checked for backslash? Yes, but this is an edge case.
In the case of (1) matching, $2 will be undefined, and $1, the first capture group, a single character, will be put back into the text. The backslash that was before it will be removed in (1), and then put back in (7). This will not really do anything, just make the regex not destroy already escaped characters.
In the case of (2) matching, it will either be an end of line backslash that is consumed (6) and put back (7), or it will be a $ or # which is consumed (4) and put back (7), with a backslash in front.
So basically what the OP says in the comment is happening.

perl regex - pattern matching

Can anyone explain what is being done below?
$name=~m,common/([^/]+)/run.*/([^/]+)/([^/]+)$,;
common, run and / match themselves.
() captures.
[^/]+ matches 1 or more characters that aren't /.
.* matches 0 or more characters that aren't Line Feeds.[1]
$ is equivalent to (?=\n?\z).[2]
\n optionally matches a Line Feed.
\z matches the end of the string.
I think it's trying to match a path of one or both of the following forms:
.../common/XXX/runYYY/XXX/XXX
common/XXX/runYYY/XXX/XXX
Where
XXX is a sequence of at least one character that doesn't contain /.
YYY is a sequence of any number of characters (incl zero) that doesn't contain /.
It matches more than that, however.
It matches uncommon/XXX/runYYY/XXX/XXX
It matches common/XXX/runYYY/XXX/XXX/XXX/XXX/XXX/XXX
The three XXX parts are captured (available to the caller as $1, $2 and $3).
[1] When the s flag isn't used.
[2] When the m flag isn't used.

Regex - replacing part of a matched group possible?

Is it possible to replace half or x characters of a matched group?
I have had a request for a partial email capturing, so something like example123#abcdef.com becomes ***mple123#***def.com
I can do this if the characters before and after the # are 3 characters long,
([^|#]{0,3})([^]{0,3})
This captures 123#abc.com perfectly and I can substitute for ***#***.com but if it's over 3 characters long on each end, for instance example123#abcdef.com it becomes ******e123#abcdef.com
The other way I can see is capture everything until the # and everything to the . but then this won't be a partial capture. Is this possible?
You can use
sed -E 's/^[^#]{0,3}|(#)[^.]{0,3}/\1***/g'
Details
-E - enables the POSIX ERE syntax that does not require too much escaping here
^[^#]{0,3} - zero to three occurrences of any char other than a # at the start of the string
| - or
(#) - Group 1: a # char
[^.]{0,3} - zero to three occurrences of any char other than a .
\1*** replaces with Group 1 value + ***.
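For a quick test, the command can be wrapped in a function and fed the two addresses from the question. The name mask_email is only an illustration. The sketch relies on sed substituting an empty string for \1 when the first alternative matched, which is how the command above is meant to work.

```sh
#!/bin/sh
# Sketch only: mask_email is a hypothetical helper name.
mask_email() {
    printf '%s\n' "$1" | sed -E 's/^[^#]{0,3}|(#)[^.]{0,3}/\1***/g'
}

mask_email 'example123#abcdef.com'   # prints: ***mple123#***def.com
mask_email '123#abc.com'             # prints: ***#***.com
```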

Regex rule match up to string

I need to use grep / egrep / sed to extract certain parts out of a SNORT rule string.
Given a string that can be in the format:
alert tcp any any -> any any (msg:"Some message";
content:"c1"; content:"GET /blah"; offset:0; depth:9; content:"something else";)
How would I go about extracting just the following:
content:"GET /blah"; offset:0; depth:9;
Given that the following are true:
It must match up until the start of the next content match (if there is one)
A rule may only have this content term, it may have more and they may be in any order
Other modifiers may be applied before, after or in between the offset and depth operators, they must also be extracted as follows:
content:"GET "; offset:5; http_uri; depth:12;
Rules can be "malformed" i.e. instead of having a single semicolon after the content term it may have two or more.
What I have so far which I believe would work in other regex systems is:
(GET|POST).*?(?=content)
The idea behind this being that .*? is an ungreedy match on any character any number of times and a non grabbing (not sure if that's the term) match on the next term "content".
I believe this breaks though if there is no following content term and also doesn't seem to extract anything in grep or egrep.
Not sure what to do, any ideas?
This should do the trick:
grep -Po '\bcontent\s*:\s*"(GET|POST)\b[^"]*"((?!;\s*content\s*:)[^"]|"[^"]*")*;'
Sample input:
alert tcp any any -> any any (msg:"Some message";
content:"c1"; content:"GET /blah"; offset:0; depth:9; content:"something else";)
content:"GET "; offset:5; http_uri; depth:12;
Output:
content:"GET /blah"; offset:0; depth:9;
content:"GET "; offset:5; http_uri; depth:12;
Explanation:
Instead of looking ahead for the next content, I am using a negative lookahead to consume anything other than the word content. This way, end of line also qualifies as the end of the match.
The regex in detail:
\b - word boundary (to prevent matching e.g. othercontent)
content\s*:\s* - literally: content followed by a colon; with optional spaces
" - opening quote
(GET|POST) - either one of these verbs
\b - word boundary (to prevent matching e.g. POSTAL)
[^"]*" - everything up to and including the closing quote
( - begin repeating subpattern
(?!;\s*content\s*:) - negative lookahead, to make sure we stop before any subsequent content
[^"] - any non-quote; spaces, letters, colons, semicolons...
| - or...
"[^"]*" - some attribute string; matching this as a whole to prevent the negative lookahead to pick up something between quotes
)* - end repeating subpattern; zero or more times
; - closing semicolon
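Putting the pieces back together, a small sketch like the following can be used to try the pattern on individual rules. The function name extract_content is invented here, and -P needs a grep built with PCRE support (GNU grep usually is).

```sh
#!/bin/sh
# Sketch only: extract_content is a hypothetical helper name.
# Reads rule text on stdin and prints each matching content clause.
extract_content() {
    grep -Po '\bcontent\s*:\s*"(GET|POST)\b[^"]*"((?!;\s*content\s*:)[^"]|"[^"]*")*;'
}

printf '%s\n' 'content:"c1"; content:"GET /blah"; offset:0; depth:9; content:"something else";)' |
    extract_content
# prints: content:"GET /blah"; offset:0; depth:9;
```

A "malformed" rule such as content:"GET /x";; content:"y"; should come out as content:"GET /x";; because the lookahead only stops at the semicolon that directly precedes the next content term.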

sed - remove specific subscript from string

Please provide a sed one-liner that produces this output:
sdc3 sdc2
for this input:
sdc3[1] sdc2[0]
That is, remove every subscript value from the string.
sed 's/\[[^]]*\]//g'
reads: substitute any string with literal "[" followed by zero or more characters that aren't a "]", and then the closing "]", with an empty string.
You need the [^]] bit to prevent greedy matching treating "[1] sdc2[0]" as a single match in your sample string.
As for your comment:
sed 's#\([^[ ]*\)\[[^]]*\]#/dev/\1#g'
I switch the separator from the usual '/' to '#', just to avoid escaping the /dev/ bit you asked for (I won't say "for clarity")
the \(...\) bit matches a subgroup, here sdc2 or whatever, so we can refer to it in the replacement
the subgroup uses a similar character class to the one we used discarding the index: [^[ ] means any character except an "[" (again, to avoid greedily matching the index) or a space (assuming your values are space-delimited as per your post)
the replacement is now the literal "/dev/" followed by the first (and only) subgroup match
the g flag at the end tells it to perform multiple matches per line, instead of stopping at the first one
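Both commands can be checked against the sample input with a short sketch like this one. The function names strip_index and to_dev_paths are only illustrative.

```sh
#!/bin/sh
# Sketch only: strip_index and to_dev_paths are hypothetical helper names.
strip_index() {
    # Drop every [...] subscript.
    printf '%s\n' "$1" | sed 's/\[[^]]*\]//g'
}

to_dev_paths() {
    # Drop the subscript and prefix each name with /dev/.
    printf '%s\n' "$1" | sed 's#\([^[ ]*\)\[[^]]*\]#/dev/\1#g'
}

strip_index  'sdc3[1] sdc2[0]'   # prints: sdc3 sdc2
to_dev_paths 'sdc3[1] sdc2[0]'   # prints: /dev/sdc3 /dev/sdc2
```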