Extract anchor tag link text using Powershell - powershell

I'm attempting to extract the link text from something like the line below using PowerShell.
Entertainment, Intimate Apparel/Swimsuit, and Suspicious
I've tried the following but it's only matching the first result and is including the > and < which I don't want. I'm sure it's an issue with the Regex but I don't know it well enough to see what's wrong. Note the string above is $result.categorization
$result.categorization -match '(\>(.*?)\<)'
This returns
Name,Value
2,Entertainment
1,>Entertainment<
0,>Entertainment<
I want to return
Name,Value
2,Suspicious
1,Intimate Apparel/Swimsuit
0,Entertainment
I also tried the regex from "Regular expression to extract link text from anchor tag", but it didn't match anything.

I don't know where the headers and numbers in the output come from, but here's a solution that extracts the link texts from the single-line input exactly as specified:
$str = @'
Entertainment, Intimate Apparel/Swimsuit, and Suspicious
'@
$str -split ', and |, ' -replace '.*?>([^<]*).*', '$1'
$str -split ', and |, ' splits the input line into individual <a> elements.
-replace then operates on each <a> element individually:
'.*?>([^<]*).*' matches the entire line, but captures only the link text in the one and only capture group, (...).
Replacement text $1 then replaces the entire line with what the capture group matched, i.e., effectively only returning the link text.
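The sample line above appears without its original markup, so here is a minimal, self-contained sketch of the same split-then-replace approach. The href values are hypothetical placeholders, not taken from the question:

```powershell
# Hypothetical input: the href values are placeholders, not from the original question.
$str = '<a href="/cat/1">Entertainment</a>, <a href="/cat/2">Intimate Apparel/Swimsuit</a>, and <a href="/cat/3">Suspicious</a>'

# Split into one <a> element per item, then reduce each element to its link text.
$linkTexts = $str -split ', and |, ' -replace '.*?>([^<]*).*', '$1'

$linkTexts
```

Listing ', and ' before ', ' in the alternation matters: the regex engine tries alternatives left to right, so the longer separator gets the first chance to match.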
As for what you tried:
-match never extracts part of its input - it returns a Boolean indicating whether a match was found with a scalar LHS, or a filtered sub-array of matching items with an array as the LHS.
That said, the automatic $Matches variable does contain information about what parts matched, but only with a scalar LHS.
'(\>(.*?)\<)' contains two nested capture groups that match literal > followed by any number of characters (matching non-greedily), followed by literal <.
It is the inner capture group that would capture the link text.
However:
There is no need for the outer capture group.
> and < do not need \-escaping in a regular expression (although it does no harm).
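To illustrate the points above, here is a hedged sketch of what -match and $Matches give you with a scalar LHS, followed by one possible way to collect every link text via [regex]::Matches(). The input string and the '<a[^>]*>(.*?)</a>' pattern are assumptions made for illustration; they are not part of the original question or answer:

```powershell
# Hypothetical input: the href values are placeholders.
$str = '<a href="/cat/1">Entertainment</a>, <a href="/cat/2">Intimate Apparel/Swimsuit</a>, and <a href="/cat/3">Suspicious</a>'

# With a scalar LHS, -match returns a Boolean and populates $Matches,
# but only with the FIRST match.
$found = $str -match '<a[^>]*>(.*?)</a>'
$first = $Matches[1]

# [regex]::Matches() returns ALL matches; group 1 holds the link text of each.
$all = [regex]::Matches($str, '<a[^>]*>(.*?)</a>') |
  ForEach-Object { $_.Groups[1].Value }

$found
$first
$all
```

Matching the whole <a> element rather than just '>(.*?)<' avoids also capturing the ', ' text that sits between a closing tag and the next opening tag.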

Related

Powershell split text file into pages by delimiter

New to PowerShell here. Have a large text file with many similar pages overlapping at the moment. Wish to use the delimiter: "TESTING/TEST SYSTEM" which appears at the top of every page to separate them into individual pages. The raw original source always have a 1 and 0. 1 on the first line, then 0 on the second line, probably off some old mainframe system, I do not wish to use the 1 and 0 as delimiter, as I have other files I wish to run this command against with different delimiter, which do not have 1 and 0.
Here's what I found so far on StackOverflow, and is partially working:
(Get-Content -Raw inFile.txt) -split '(TESTING/TEST SYSTEM)'|
Set-Content -LiteralPath { 'c:\test\outFile{0}.txt' -f $script:index++ }
However, this keeps creating two extra files. The first file only contains the 1 and 0. The second file contains just the delimiter, stripped from the rest of the content of each page. The third file has the rest of the content. This repeats until all the pages are separated, creating 3 files for each section. I just need the delimiter to be part of each page. The 1 and 0 can be part of it as well, or removed, whichever is easier. Thanks so much for your help!
(Get-Content -Raw inFile.txt) -split '(?=TESTING/TEST SYSTEM)' |
Set-Content -LiteralPath { 'c:\test\outFile{0}.txt' -f $script:index++ }
Note:
-split invariably matches something before the first separator match; if the input starts with a separator, the first array element returned is '' (the empty string).
If no other tokens are empty, or if it is acceptable / desired to eliminate all empty tokens, you can simply append -ne '' to the -split operation.
If you want to make splitting case-sensitive, use -csplit instead of -split.
If you want to ensure that the regex only matches at the start of a line, use '(?m)(?=^TESTING/TEST SYSTEM)'.
(?=...) in the separator regex is a (positive) look-ahead assertion that causes the separator to be included as part of each token, as explained below.
The binary form of the -split operator:
By default excludes what the (first) RHS operand - the separator regex - matches from the array of tokens it returns:
'a#b#c' -split '#' # -> 'a', 'b', 'c'
If you use a capture group ((...)) in the separator regex, what the capture group matches is included in the return array, as separate tokens:
'a#b#c' -split '(#)' # -> 'a', '#', 'b', '#', 'c'
If you want to include what the separator regex matches as part of each token, you must use a look-around assertion:
With a look-ahead assertion ((?=...)) at the start of each token:
'a#b#c' -split '(?=#)' # -> 'a', '#b', '#c'
With a look-behind assertion ((?<=...)) at the end of each token:
'a#b#c' -split '(?<=#)' # -> 'a#', 'b#', 'c'
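As a rough in-memory sketch of how these pieces combine for the page-splitting case, assuming a made-up two-page input that starts with the delimiter (the page contents are hypothetical):

```powershell
# Hypothetical sample text standing in for the contents of inFile.txt.
$text = "TESTING/TEST SYSTEM`npage one`nTESTING/TEST SYSTEM`npage two`n"

# (?m) makes ^ match at the start of every line; the look-ahead keeps the
# delimiter as part of each page. Because the input starts with the delimiter,
# -split returns an empty first token, which -ne '' filters out.
$pages = $text -split '(?m)(?=^TESTING/TEST SYSTEM)' -ne ''

$pages.Count
$pages[0]
```

Each element of $pages now starts with the delimiter line and could be written to its own file, as the Set-Content command above does.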

Powershell replace command not removing newline

I have text that prints out like this:
mdbAppText_Arr: [0]: The cover is open. {goes to next line here}
Please close the cover. and [1] Backprinter cover open
46
I tried getting rid of the newline after open., and it's still there. Any idea of a better way or fix for what I'm doing? I need to get rid of the newline because it's going to a csv file, and messing up formatting (going to newline there).
This is my code:
$mdbAppText_Arr = $mdbAppText.Split("|")
$mdbAppText_Arr[0].replace("`r",";").replace("`n",";").replace("`t",";").replace("&",";")
#replace newline/carriage return/tab with semicolon
if($alarmIdDef -eq "12-7")
{
Write-Host "mdbAppText_Arr: [0]: $($mdbAppText_Arr[0]) and [1] $($mdbAppText_Arr[1]) "
[byte] $mdbAppText_Arr[0][31]
}
I've been looking at:
replace
replace - this one has a link reference to look up in the ASCII table, but it's unclear to me which column of the table/link holds the byte equivalent.
I'm using PowerShell 5.1.
-replace is a regex operator, so you need to supply a valid regular expression pattern as the right-hand side operand.
You can replace most newline sequences with a pattern describing a substring consisting of:
an optional carriage return (\r? in regex), followed by
a (non-optional) newline character (\n in regex):
$mdbAppText_Arr = $mdbAppText_Arr -replace '\r?\n'
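A small sketch of the same idea that substitutes a semicolon instead of removing the newline, as the question's comment asks for. The sample text is made up; note that -replace (like the .Replace() method) returns a new string, so the result has to be assigned to take effect:

```powershell
# Hypothetical sample text containing a Windows-style (CRLF) newline.
$text = "The cover is open.`r`nPlease close the cover."

# Code point of the character right after 'open.' (13 is a carriage return).
$codePoint = [int] $text[18]

# Replace each newline sequence (CRLF or LF) with a single semicolon
# and capture the result; $text itself is left unchanged.
$clean = $text -replace '\r?\n', ';'

$codePoint
$clean
```

Treating CRLF as one unit avoids the doubled ';;' that separate replacements of `r and `n would produce.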

powershell -ilike operations too similar

Is there a way to say that between the * wildcards there is only a numeric value of maybe two digits?
I want to select items, but the ways I can differentiate them are very limited.
I want to store values like "31.04.2003" with the following line of code:
$contentDateReal = $content_ -ilike '*"*.*.*",'
This works for me most of the time, but sometimes I get values like: "Installation Acrobat Reader 10.0.1 "
Those also fit the -ilike filter, but I don't want them. Is there a way to say that I only want values that contain numbers, with exactly 2 characters ("xx") before the first dot, 2 ("xx") after the first dot, and 4 after the second one, like "xxxx" or "2020"?
While you can use character ranges such as [0-9] to match a character (digit) in that range, PowerShell's wildcard expressions do not support matching a varying number of these characters.
That is, '10' -like '[0-9][0-9]' is $true, but '2' -like '[0-9][0-9]' is not.
Note: -ilike is just an alias for -like, which is case-insensitive by default, as all PowerShell operators are; conversely, use -clike for case-sensitive matching. This naming convention applies to all operators that (also) process text.
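A quick sketch of the naming convention just described, using made-up strings:

```powershell
# -like and -ilike are both case-insensitive; -clike is case-sensitive.
$insensitive = 'README.TXT' -like  '*.txt'
$explicit    = 'README.TXT' -ilike '*.txt'
$sensitive   = 'README.TXT' -clike '*.txt'

$insensitive, $explicit, $sensitive
```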
While you do want to match fixed numbers of digits, matching with a fixed number of [0-9] ranges may still yield false positives if additional digits are present at the start or at the end, so to rule these out you need to use the more sophisticated matching that regular expressions (regexes) provide:
PowerShell supports regexes via the -match operator (among others), so you could use the following:
('Some Software 31.04.2003', 'Installation Acrobat Reader 10.0.1').ForEach({
if ($_ -match '\b(\d{2}\.\d{2}\.\d{4})\b') {
"'$_' matched; extracted version number: $($Matches[1])"
}
})
The above yields the following, because only the first string matched:
'Some Software 31.04.2003' matched; extracted version number: 31.04.2003
Explanation of the regex:
\b matches at word boundaries, which means that something other than a word character (a letter, a digit, or _) must occur at that position (which can include the start and end of the string).
\d matches a digit (roughly equivalent to [0-9], the latter limiting matching to the decimal digits in the ASCII sub-range of Unicode); {2}, for instance, stipulates that exactly 2 instances of digits must be present.
\. represents a verbatim . (it must be \-escaped, because . is a regex metacharacter representing any character).
Enclosing a subexpression in (...) creates a so-called capture group, which additionally captures what the subexpression matched, and makes that available starting with index 1 (for the first of potentially multiple (unnamed) capture groups) in the automatic $Matches variable.
Note that -match - unlike -like - matches substrings by default, so there's no need to also match what comes before or after the version number.
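Applied to the question's scenario, a possible sketch looks like this. It assumes that $content_ is an array of lines shaped like the ones below (the sample values are hypothetical) and relies on -match acting as a filter when the LHS is an array:

```powershell
# Hypothetical sample data; the real $content_ comes from the asker's file.
$content_ = @(
  '"31.04.2003",'
  '"Installation Acrobat Reader 10.0.1 ",'
  '"15.11.2020",'
)

# With an array LHS, -match returns the matching elements instead of a Boolean.
# The quotes and trailing comma are kept in the pattern, as in the -ilike attempt.
$contentDateReal = $content_ -match '"\d{2}\.\d{2}\.\d{4}",'

$contentDateReal
```

Note that with an array LHS the $Matches variable is not populated, so extracting the date itself would still require matching each element individually.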

Design Powershell script for find the Numbers which contain file

Could everyone help me design a script to find the numbers contained in a file?
For example:
20200514_EE#998501_12.
I need the number 12, which should then be written to a txt file.
The content will be generated with different sequence numbers.
For example: #20200514_EE#998501_123.#
So I need the number 123, which should then be written to a txt file.
How can I write this script in PowerShell or as a bat file?
Much appreciated!
Thanks
Tony
You can do the following as a start. You have not provided enough information/examples to work through any issues you are experiencing.
'#20200514_EE#998501_123.#' -replace '^.*?(\d+)\D*$','$1'
'#20200514_EE#998501_123' -replace '^.*?(\d+)\D*$','$1'
-replace uses regex matching and then replaces with a string and/or matched substitute. ^ is the start of the string. .*? lazily matches all characters. \d+ matches one or more digits in a capture group due to the encapsulating (). \D* matches zero or more non-digits. $ matches the end of the string. For the replacement, $1 is capture group 1, which is what was captured by (\d+).
You can also use the .Split() method in combination with -replace.
'#20200514_EE#998501_123.#'.Split('_')[-1] -replace '\D+$'
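Since the question also asks for the numbers to be written to a text file, here is a hedged sketch that applies the first -replace approach to both sample strings and saves the results. The output location (a numbers.txt file in the temp directory) is an arbitrary choice made for illustration:

```powershell
# The two sample strings from the question.
$names = '20200514_EE#998501_12.', '#20200514_EE#998501_123.#'

# With an array LHS, -replace processes each element; only the last run of
# digits (followed solely by non-digits) survives.
$numbers = $names -replace '^.*?(\d+)\D*$', '$1'

# Hypothetical output path: numbers.txt in the current user's temp directory.
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'numbers.txt'
Set-Content -LiteralPath $outFile -Value $numbers

Get-Content -LiteralPath $outFile
```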

Text file search for match strings regex

I am trying to understand how regex works and what are the possibilities of working with it.
So I have a txt file and I am trying to search for 8 char long strings containing numbers. for now I use a quite simple option:
clear
Get-ChildItem random.txt | Select-String -Pattern [0-9][a-z] | foreach {$_.line}
It sort of works but I am trying to find a better option. ATM it takes too long to read through the left out text since it writes entire lines and it does not filter them by length.
You can use a lookahead to assert that a string contains at least 1 digit, then specify the length of the match and finally anchor it with ^ (start of string) and $ (end of string) if the string is on a line of its own, or \b (word boundary) if it's part of an HTML document as your comments seem to suggest:
Get-ChildItem C:\files\ |Select-String -Pattern '^(?=.*\d)\w{8}$'
Get-ChildItem C:\files\ |Select-String -Pattern '\b(?=.*\d)\w{8}\b'
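If only the matching strings are wanted rather than entire lines, one option is to combine -AllMatches with the Matches property of the objects that Select-String emits. The sketch below uses made-up input lines and a slightly stricter variant of the pattern, (?=\w*\d), which is my own assumption: it requires the digit to be inside the 8-character word itself rather than anywhere later on the line:

```powershell
# Hypothetical input lines; in practice these would come from the files.
$lines = 'id abc12345 ok', 'password hunter22 secret99', 'nodigits abcdefgh'

# -AllMatches finds every match per line; .Matches.Value yields just the matched text.
$tokens = $lines |
  Select-String -Pattern '\b(?=\w*\d)\w{8}\b' -AllMatches |
  ForEach-Object { $_.Matches.Value }

$tokens
```

Here 'password', 'nodigits' and 'abcdefgh' are skipped because they contain no digit, and 'id' and 'ok' are skipped because they are not 8 characters long.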
The pattern [0-9][a-z] matches a digit followed by a letter. If you want to match a sequence of 8 characters use .{8}. The dot in regular expressions matches any character except newlines. A number in curly brackets matches the preceding expression the given number of times.
If you want to match non-whitespace characters use \S instead of .. If you want to match only digits and letters use [0-9a-z] (a character class) instead of ..
For a more thorough introduction please go find a tutorial. The subject is way too complex to be covered by a single answer on SO.
What you're currently searching for is a single number ranging from 0-9 followed by a single lowercase letter ranging from a-z.
This, for example, will match any run of 8 word characters (letters, digits, or underscores):
\w{8}
I often forget what some regex classes are, and it may be useful to you as a learning tool, but I use this as a point of reference: http://regexr.com/
It can also validate what you're typing inline via a text field so you can see if what you're doing works or not.
If you need more of a tutorial than a reference, I found this extremely useful when I learned: regexone.com