Text file search for match strings regex - powershell

I am trying to understand how regex works and what can be done with it.
I have a txt file and I am trying to search it for 8-character strings that contain numbers. For now I use a quite simple option:
clear
Get-ChildItem random.txt | Select-String -Pattern [0-9][a-z] | foreach {$_.line}
It sort of works, but I am trying to find a better option. At the moment it takes too long to read through the output, since it writes entire lines and does not filter the matches by length.

You can use a lookahead to assert that a string contains at least 1 digit, then specify the length of the match, and finally anchor it with ^ (start of string) and $ (end of string) if the string is on a line of its own, or with \b (word boundary) if it's part of an HTML document, as your comments seem to suggest. In the \b case, write the lookahead as \w*\d rather than .*\d, so that the digit has to occur inside the 8-character word and not merely somewhere later on the line:
Get-ChildItem C:\files\ |Select-String -Pattern '^(?=.*\d)\w{8}$'
Get-ChildItem C:\files\ |Select-String -Pattern '\b(?=\w*\d)\w{8}\b'
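To see the word-boundary variant at work outside PowerShell, here is a minimal Python sketch. The function name and the sample text are made up for illustration, and the lookahead is written as (?=\w*\d) on the assumption that the digit must sit inside the 8-character word itself.

```python
import re

# Assumption: a "string" here is a run of \w characters (letters, digits, _).
# (?=\w*\d) requires a digit somewhere in the word that starts at this point;
# \w{8}\b then requires that word to be exactly 8 characters long.
TOKEN = re.compile(r"\b(?=\w*\d)\w{8}\b")

def find_tokens(text):
    """Return every 8-character word that contains at least one digit."""
    return TOKEN.findall(text)

# Hypothetical sample line, loosely shaped like a fragment of HTML.
sample = "<td>abc12345</td> password abcdefgh 123456789 x9y8z7w6"
print(find_tokens(sample))
```

Only the matched words are returned, not the whole line, which addresses the complaint about having to read through entire lines.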

The pattern [0-9][a-z] matches a digit followed by a letter. If you want to match a sequence of 8 characters use .{8}. The dot in regular expressions matches any character except newlines. A number in curly brackets matches the preceding expression the given number of times.
If you want to match non-whitespace characters use \S instead of .. If you want to match only digits and letters use [0-9a-z] (a character class) instead of ..
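As a rough illustration of how the three variants differ, the following Python sketch runs each of them over one made-up sample string. Python's re module is used only as a stand-in here; the constructs behave the same way in PowerShell for this input.

```python
import re

# Hypothetical sample: two 8-character chunks separated by one space.
text = "a1b2c3d4 x-y-z-w!"

def chunks(pattern):
    """Return all non-overlapping matches of pattern in the sample text."""
    return re.findall(pattern, text)

any_char = chunks(r".{8}")            # any 8 characters, spaces included
non_space = chunks(r"\S{8}")          # 8 non-whitespace characters
alnum_only = chunks(r"[0-9a-z]{8}")   # 8 digits or lowercase letters

print(any_char, non_space, alnum_only)
```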
For a more thorough introduction please go find a tutorial. The subject is way too complex to be covered by a single answer on SO.

What you're currently searching for is a single digit (0-9) followed by a single letter (a-z); Select-String ignores case unless you pass -CaseSensitive.
This, for example, will match any run of 8 word characters (letters, digits, or underscores), including one that sits inside a longer word:
\w{8}
I often forget what some regex classes are, so I use this as a point of reference, and it may be useful to you as a learning tool: http://regexr.com/
It also evaluates the pattern as you type it into a text field, so you can see whether what you're doing works.
If you need more of a tutorial than a reference, I found this extremely useful when I was learning: regexone.com

Related

powershell -ilike operations too similar

Is there a way to say that between the * wildcards there may only be a numeric value of maybe two digits?
I want to select items, but the ways in which I can differentiate them are very limited.
I want to store values like "31.04.2003" with the following line of code:
$contentDateReal = $content_ -ilike '*"*.*.*",'
This works for me most of the time, but sometimes I get values like: "Installation Acrobat Reader 10.0.1 "
Those also fit the -ilike filter, but I don't want them. Is there a way to say that I only want values that contain numbers, with exactly 2 digits ("xx") before the first dot, 2 digits ("xx") after the first dot, and 4 digits after the second dot, like "xxxx" or "2020"?
While you can use character ranges such as [0-9] to match a character (digit) in that range, PowerShell's wildcard expressions do not support matching a varying number of these characters.
That is, '10' -like '[0-9][0-9]' is $true, but '2' -like '[0-9][0-9]' is not.
Note: -ilike is just an alias for -like, which is case-insensitive by default, as all PowerShell operators are; conversely, use -clike for case-sensitive matching. This naming convention applies to all operators that (also) process text.
While you do want to match fixed numbers of digits, matching with a fixed number of [0-9] ranges may still yield false positives if additional digits are present at the start or at the end, so to rule these out you need to use the more sophisticated matching that regular expressions (regexes) provide:
PowerShell supports regexes via the -match operator (among others), so you could use the following:
('Some Software 31.04.2003', 'Installation Acrobat Reader 10.0.1').ForEach({
if ($_ -match '\b(\d{2}\.\d{2}\.\d{4})\b') {
"'$_' matched; extracted version number: $($Matches[1])"
}
})
The above yields the following, because only the first string matched:
'Some Software 31.04.2003' matched; extracted version number: 31.04.2003
Explanation of the regex:
\b matches at word boundaries, which means that a word character (a letter, a digit, or _) must be on one side of that position and something other than a word character on the other (which can include the start and end of the string).
\d matches a digit (roughly equivalent to [0-9], the latter limiting matching to the decimal digits in the ASCII sub-range of Unicode); {2}, for instance, stipulates that exactly 2 instances of digits must be present.
\. represents a verbatim . (it must be \-escaped, because . is a regex metacharacter representing any character).
Enclosing a subexpression in (...) creates a so-called capture group, which additionally captures what the subexpression matched, and makes that available starting with index 1 (for the first of potentially multiple (unnamed) capture groups) in the automatic $Matches variable.
Note that -match - unlike -like - matches substrings by default, so there's no need to also match what comes before or after the version number.
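The same pattern can be tried in Python, shown here purely as an illustrative sketch. The helper name and the third sample string are assumptions; re.ASCII is added so that \d is limited to 0-9, as [0-9] would be.

```python
import re

# Same pattern as in the -match example; re.ASCII keeps \d and \b ASCII-only.
DATE = re.compile(r"\b(\d{2}\.\d{2}\.\d{4})\b", re.ASCII)

def extract_date(text):
    """Return the first dd.dd.dddd substring, or None if there is none."""
    m = DATE.search(text)          # search() matches substrings, like -match
    return m.group(1) if m else None

samples = [
    "Some Software 31.04.2003",
    "Installation Acrobat Reader 10.0.1",
    "build 131.04.20031",          # extra digits on both sides: rejected by \b
]
for s in samples:
    print(repr(s), "->", extract_date(s))
```

The third sample shows why the \b anchors matter: without them, 31.04.2003 would be picked out of the middle of a longer run of digits.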

Design Powershell script for find the Numbers which contain file

Please help me design a script to find the numbers that the file contains.
For example:
20200514_EE#998501_12.
I need the number 12, which should then be written to a txt file.
The content will be generated with different sequence numbers.
For example: #20200514_EE#998501_123.#
So here I need the number 123, which should then be written to the txt file.
How can I write this script in PowerShell or as a bat file?
Much appreciated!
Thanks
Tony
You can do the following as a start. You have not provided enough information/examples to work through any issues you are experiencing.
'#20200514_EE#998501_123.#' -replace '^.*?(\d+)\D*$','$1'
'#20200514_EE#998501_123' -replace '^.*?(\d+)\D*$','$1'
-replace uses regex matching and then replaces the match with a string and/or matched substitute. ^ is the start of the string. .*? lazily matches as few characters as possible. (\d+) matches one or more digits and, because of the enclosing (), captures them in a group. \D* matches zero or more non-digits. $ matches the end of the string. Because \D*$ allows only non-digits between the captured digits and the end of the string, the group ends up holding the last run of digits. For the replacement, $1 is capture group 1, which is what was captured by (\d+).
You can use the .Split() method also in combination with -replace.
'#20200514_EE#998501_123.#'.Split('_')[-1] -replace '\D+$'
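For comparison, both approaches can be sketched in Python. The function names are hypothetical, and the second function assumes, as the .Split() version does, that the number always follows the last underscore.

```python
import re

def trailing_number(name):
    """Return the last run of digits in name, or None if it has no digits."""
    # (\d+) captures digits; \D*$ allows only non-digits up to the end.
    m = re.search(r"(\d+)\D*$", name)
    return m.group(1) if m else None

def trailing_number_split(name):
    """Split on '_' first, then strip trailing non-digits from the last part."""
    return re.sub(r"\D+$", "", name.split("_")[-1])

for s in ["#20200514_EE#998501_123.#", "20200514_EE#998501_12."]:
    print(s, "->", trailing_number(s), trailing_number_split(s))
```

Writing the result to a text file is then an ordinary file write and is left out of the sketch.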

Extracting substring from inside bracketed string, where the substring may have spaces

I've got an application that has no useful api implemented, and the only way to get certain information is to parse string output. This is proving to be very painful...
I'm trying to achieve this in bash on SLES12.
Given I have the following strings:
QMNAME(QMTKGW01) STATUS(Running)
QMNAME(QMTKGW01) STATUS(Ended normally)
I want to extract the STATUS value, ie "Ended normally" or "Running".
Note that the line structure can move around, so I can't count on the "STATUS" being the second field.
The closest I have managed to get so far is to extract a single word from inside STATUS like so
echo "QMNAME(QMTKGW01) STATUS(Running)" | sed "s/^.*STATUS(\(\S*\)).*/\1/"
This works for "Running" but not for "Ended normally"
I've tried switching the \S* for [\S\s]* in both "grep -o" and "sed" but it seems to corrupt the entire regex.
This is purely a regex issue. By using \S you asked to match only non-whitespace characters within (..), but the failing case has a space in the value, so it does not fit the pattern. Keep it simple by explicitly listing the characters to match inside (..) as [a-zA-Z ]*, i.e. zero or more upper- and lower-case letters and spaces.
sed 's/^.*STATUS(\([a-zA-Z ]*\)).*/\1/'
Or use character classes [:alnum:] if you want numbers too
sed 's/^.*STATUS(\([[:alnum:] ]*\)).*/\1/'
sed 's/.*STATUS(\([^)]*\)).*/\1/' file
Output:
Running
Ended normally
Extracting a substring matching a given pattern is a job for grep, not sed. We should use sed when we must edit the input string. (A lot of people use sed and even awk just to extract substrings, but that's wasteful in my opinion.)
So, here is a grep solution. We need to make some assumptions (in any solution) about your input - some are easy to relax, others are not. In your example the word STATUS is always capitalized, and it is immediately followed by the opening parenthesis (no space, no colon etc.). These assumptions can be relaxed easily. More importantly, and not easy to work around: there are no nested parentheses. You will want the longest substring of non-closing-parenthesis characters following the opening parenthesis, no matter what they are.
With these assumptions:
$ grep -oP '\bSTATUS\(\K[^)]*(?=\))' << EOF
> QMNAME(QMTKGW01) STATUS(Running)
> QMNAME(QMTKGW01) STATUS(Ended normally)
> EOF
Running
Ended normally
Explanation:
Command options: o to return only the matched substring; P to use Perl extensions (the \K marker and the lookahead). The regexp: we look for a word boundary (\b) - so the word STATUS is a complete word, not part of a longer word like SUBSTATUS; then the word STATUS and opening parenthesis. This is required for a match, but \K instructs that this part of the matched string will not be returned in the output. Then we seek zero or more non-closing-parenthesis characters ([^)]*) and we require that this be followed by a closing parenthesis - but the closing parenthesis is also not included in the returned string. That's a "lookahead" (the (?= ... ) construct).
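If the output ends up being parsed from a script instead of in the shell, the same idea carries over. The following is a minimal Python sketch with a hypothetical helper name; Python's re module has no \K, so a capture group does the job that \K and the lookahead do in the grep version.

```python
import re

# \b keeps SUBSTATUS(...) from matching; [^)]* takes everything up to the
# first closing parenthesis, spaces included (no nested parentheses assumed).
STATUS = re.compile(r"\bSTATUS\(([^)]*)\)")

def status_of(line):
    """Return the text inside STATUS(...), or None if the field is absent."""
    m = STATUS.search(line)
    return m.group(1) if m else None

lines = [
    "QMNAME(QMTKGW01) STATUS(Running)",
    "STATUS(Ended normally) QMNAME(QMTKGW01)",   # field order may vary
]
for line in lines:
    print(status_of(line))
```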

Alternating string

I am trying to use regular expressions to match against a string that starts with 7 numbers, then has a "K" inbetween it, and then 3 numbers again. For example:
1234567K890.
I currently have $_a -match '^\d{7}K\d{3}'. However, this does not work for my purposes. Does anyone have a solution?
PS C:\> "1234567K890" -match "\d{7}(k)\d{3}"
Here \d{7} matches 7 digits, (k) matches the letter k (in either case, since -match is case-insensitive by default), and \d{3} matches the last three digits.
Tested this, works for your example and some others:
$string = "1234567K890"
$string -match '^[0-9]{7}(k)[0-9]{3}$'
It matches exactly 7 digits, then K (casing does not matter), then exactly 3 digits. The ^ and $ characters at the beginning and the end of the pattern anchor it to the whole string, so leading or trailing whitespace (or any other extra character) makes the match fail -- if you want that to be allowed, you can just remove them.
Here's a powershell regex reference, which may help in the future.
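A quick way to check the anchored pattern against a handful of inputs is a small Python sketch like the one below. The function name and test strings are made up; re.IGNORECASE stands in for the case-insensitivity that -match has by default, and fullmatch() plays the role of ^ and $.

```python
import re

# 7 digits, the letter K in either case, then 3 digits.
CODE = re.compile(r"[0-9]{7}k[0-9]{3}", re.IGNORECASE)

def is_code(s):
    """True only if the whole string has the form 1234567K890."""
    return CODE.fullmatch(s) is not None

for s in ["1234567K890", "1234567k890", " 1234567K890", "1234567K8901"]:
    print(repr(s), is_code(s))
```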

How to get a perfect match for a regexp pattern in Perl?

I've to match a regular-expression, stored in a variable:
#!/bin/env perl
use warnings;
use strict;
my $expr = qr/\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)/sx;
my $str = "abcd[3] xyzg[4:0]";
if ($str =~ m/$expr/) {
print "\n%%%%%%%%% $`-----$&-----$'\n";
}
else {
print "\n********* NOT MATCHED\n";
}
But I'm getting this output, with the following in $&:
%%%%%%%%% -----abcd[3] xyzg-----[4:0]
I expected that it would not enter the if clause at all.
What is intended is:
if $str = "abcd xyzg" => %%%%%%%%% -----abcd xyzg----- (CORRECT)
if $str = "abcd[2] xyzg" => %%%%%%%%% -----abcd[2] xyzg----- (CORRECT)
if $str = "abcd[2] xyzg[3]" => %%%%%%%%% -----abcd[2] xyzg[3]----- (CORRECT)
if $str = "abcd[2:0] xyzg[3]" => ********* NOT MATCHED (CORRECT)
if $str = "abcd[2:0] xyzg[3:0]" => ********* NOT MATCHED (CORRECT)
if $str = "abcd[2] xyzg[3:0]" => ********* NOT MATCHED (CORRECT/INTENDED)
but output is %%%%%%%%% -----abcd[2] xyzg-----[3:0] (WRONG)
Or, better to say, this is not intended.
In this case, I expect it to go to the else block.
I also don't understand why $& holds only a portion of the string (abcd[2] xyzg) while $' holds [3:0].
How?
It should match the full string, not a part of it like above. If the full string doesn't match, it shouldn't enter the if clause.
Can anyone please help me to change my $expr pattern, so that I can have what is intended?
By default, Perl regexes only look for a matching substring of the given string. In order to force comparison against the entire string, you need to indicate that the regex begins at the beginning of the string and ends at the end by using ^ and $:
my $expr = qr/^\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)$/;
(Also, there's no reason to have the /x modifier, as your regex doesn't include any literal whitespace or # characters, and there's no reason for the /s modifier, as you're not using ..)
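The effect of the anchors is easy to check side by side. Here is a small Python sketch, not Perl, with a hypothetical helper name; the pattern is the one from the question, and for these inputs Python's re module treats it the same way Perl does.

```python
import re

# The pattern from the question, unchanged.
CORE = r"\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)"

def matches(s, anchored):
    """True if s matches; with anchored=True the whole string must match."""
    pattern = "^" + CORE + "$" if anchored else CORE
    return re.search(pattern, s) is not None

cases = ["abcd xyzg", "abcd[2] xyzg[3]", "abcd[2:0] xyzg[3]", "abcd[2] xyzg[3:0]"]
for s in cases:
    print(f"{s!r}: unanchored={matches(s, False)} anchored={matches(s, True)}")
```

Only the last case changes its result: unanchored it matches the substring abcd[2] xyzg, while the anchored version rejects it.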
EDIT: If you don't want the regex to match against the entire string, but you want it to reject anything in which the matching portion is followed by something like "[0:0]", the simplest way would be to use lookahead:
my $expr = qr/^\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\]|(?=[^[\w])|$ ))/x;
This will match anything that takes the following form:
beginning of the string (which your example in the comments seems to imply you want)
zero or more whitespace characters
one or more word characters
optional: [, one or more digits, ]
one or more whitespace characters
one or more word characters
one of the following, in descending order of preference:
[, one or more digits, ]
an empty string followed by (but not including!) a character that is neither [ nor a word character (The exclusion of word characters is to keep the regex engine from succeeding on "a[0] bc[1:2]" by only matching "a[0] b".)
end of string (A space is needed after the $ to keep it from merging with the following ) to form the name of a special variable, and this entails the reintroduction of the /x option.)
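The lookahead version can be exercised the same way. This is a Python transcription offered as a sketch (the helper name and the sample strings with trailing text are assumptions); the /x flag and the space after $ are not needed in Python, and [ is escaped inside the character class.

```python
import re

# After the second word: either [digits], or a position followed by something
# that is neither '[' nor a word character, or the end of the string.
EXPR = re.compile(r"^\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\]|(?=[^\[\w])|$))")

def matched_part(s):
    """Return the matched prefix of s, or None if the pattern rejects s."""
    m = EXPR.search(s)
    return m.group(0) if m else None

for s in ["abcd[2] xyzg", "abcd[2] xyzg[3] tail", "abcd xyzg, more",
          "abcd[2] xyzg[3:0]", "a[0] bc[1:2]"]:
    print(repr(s), "->", matched_part(s))
```

The last two inputs are rejected, while trailing text that does not start with [ is tolerated, which is the difference from the fully anchored pattern.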
Do you have any more unstated requirements that need to be satisfied?
The short answer is your regexp is wrong.
We can't fix it for you without you explaining what you need exactly, and the community is not going to write a regexp exactly for your purpose because that's just too localized a question that only helps you this one time.
You need to ask something more general about regexps that we can explain to you, that will help you fix your regexp, and help others fix theirs.
Here's my general answer when you're having trouble testing your regexp. Use a regexp tool, like the regex buddy one.
So I'm going to give a specific answer about what you're overlooking here:
Let's make this example smaller:
Your pattern is a(bc+d)?. It will match abcd, abccd, etc. It will not match bcd or bzd. In the case of abzd it will still match, but only the a, because the whole group bc+d is optional. Similarly, it will match abcbcd as just a, dropping the whole optional group that couldn't be matched (it fails at the second b).
Regexps will match as much of the string as they can and return a true match when they can match something and have satisfied the entire pattern. If you make something optional, they will leave it out when they have to, including it only when it's present and matches.
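The small example is simple enough to run. Here is a Python sketch of it (the helper name is made up); the behaviour of the optional group is the same as in Perl.

```python
import re

def matched_text(s):
    """Return what a(bc+d)? matches in s, or None if nothing matches."""
    m = re.search(r"a(bc+d)?", s)
    return m.group(0) if m else None

# The optional group is included when it fits and silently dropped otherwise.
for s in ["abcd", "abccd", "abzd", "abcbcd", "bcd"]:
    print(s, "->", matched_text(s))
```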
Here's what you tried:
qr/\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)/sx
First, s and x aren't needed modifiers here.
Second, this regex can match:
Any or no whitespace followed by
a word of at least one word character (a letter, digit, or underscore) followed by
optionally a grouped square bracketed number with at least one digit (eg [0] or [9999]) followed by
at least one white space followed by
a word of at least one word character followed by
optionally a square bracketed number with at least one digit.
Clearly when you ask it to match abcd[0] xyzg[0:4] the colon ends the \d+ pattern but doesn't satisfy the \] so it backtracks the whole group, and then happily finds the group was optional. So by not matching the last optional group, your pattern has matched successfully.
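One way to watch this happen is to look at which capture groups actually took part in the match. The following Python sketch uses the question's pattern as-is; the function and the dictionary keys are hypothetical names, and the left-over text corresponds to what Perl puts in $'.

```python
import re

PATTERN = re.compile(r"\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)")

def dissect(s):
    """Show the matched text, both optional index groups, and the remainder."""
    m = PATTERN.search(s)
    if m is None:
        return None
    return {
        "matched": m.group(0),        # like $& in Perl
        "first_index": m.group(2),    # None if the optional group was skipped
        "second_index": m.group(4),
        "left_over": s[m.end():],     # like $' in Perl
    }

print(dissect("abcd[0] xyzg[0:4]"))
```

A second_index of None together with a non-empty left_over is the sign that the engine gave up on the last optional group and still reported success.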