Powershell split text file into pages by delimiter

New to PowerShell here. I have a large text file in which many similar pages currently run together. I want to use the delimiter "TESTING/TEST SYSTEM", which appears at the top of every page, to separate them into individual pages. The raw original source always has a 1 on the first line, then a 0 on the second line, probably from some old mainframe system. I do not want to use the 1 and 0 as the delimiter, because I have other files with different delimiters that I want to run this command against, and those do not have the 1 and 0.
Here's what I found so far on StackOverflow; it is partially working:
(Get-Content -Raw inFile.txt) -split '(TESTING/TEST SYSTEM)'|
Set-Content -LiteralPath { 'c:\test\outFile{0}.txt' -f $script:index++ }
However, this keeps creating two extra files. The first file contains only the 1 and 0. The second file contains just the delimiter, stripped from the rest of the content of each page. The third file has the rest of the content. This repeats until all the pages are separated, creating 3 files for each section. I just need the delimiter to be part of each page. The 1 and 0 can be part of it as well, or removed, whichever is easier. Thanks so much for your help!

(Get-Content -Raw inFile.txt) -split '(?=TESTING/TEST SYSTEM)' |
Set-Content -LiteralPath { 'c:\test\outFile{0}.txt' -f $script:index++ }
Note:
-split invariably matches something before the first separator match; if the input starts with a separator, the first array element returned is '' (the empty string).
If no other tokens are empty, or if it is acceptable / desired to eliminate all empty tokens, you can simply append -ne '' to the -split operation.
If you want to make splitting case-sensitive, use -csplit instead of -split.
If you want to ensure that the regex only matches at the start of a line, use '(?m)(?=^TESTING/TEST SYSTEM)'
(?=...) in the separator regex is a (positive) look-ahead assertion that causes the separator to be included as part of each token, as explained below.
The binary form of the -split operator:
By default excludes what the (first) RHS operand - the separator regex - matches from the array of tokens it returns:
'a#b#c' -split '#' # -> 'a', 'b', 'c'
If you use a capture group ((...)) in the separator regex, what the capture group matches is included in the return array, as separate tokens:
'a#b#c' -split '(#)' # -> 'a', '#', 'b', '#', 'c'
If you want to include what the separator regex matches as part of each token, you must use a look-around assertion:
With a look-ahead assertion ((?=...)) at the start of each token:
'a#b#c' -split '(?=#)' # -> 'a', '#b', '#c'
With a look-behind assertion ((?<=...)) at the end of each token:
'a#b#c' -split '(?<=#)' # -> 'a#', 'b#', 'c'
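Putting the notes above together, here is a minimal sketch on an in-memory sample. The $sample text and the $pages variable are made up for illustration, and writing the pages to files is left out.

```powershell
# Hypothetical two-page sample; a real file would be read with Get-Content -Raw.
$sample = "TESTING/TEST SYSTEM`npage one`nTESTING/TEST SYSTEM`npage two`n"

# Split before each delimiter that starts a line; -ne '' removes any empty
# token, such as the one produced when the input starts with the delimiter.
$pages = $sample -split '(?m)(?=^TESTING/TEST SYSTEM)' -ne ''

$pages.Count   # number of pages found
$pages[1]      # second page, with its delimiter line included
```

Each element of $pages starts with the delimiter line, so it can be piped to Set-Content as in the command above.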

Related

Powershell replace command not removing newline

I have text that prints out like this:
mdbAppText_Arr: [0]: The cover is open. {goes to next line here}
Please close the cover. and [1] Backprinter cover open
46
I tried getting rid of the newline after open., and it's still there. Any idea of a better way or fix for what I'm doing? I need to get rid of the newline because it's going to a csv file, and messing up formatting (going to newline there).
This is my code:
$mdbAppText_Arr = $mdbAppText.Split("|")
$mdbAppText_Arr[0].replace("`r",";").replace("`n",";").replace("`t",";").replace("&",";")
#replace newline/carriage return/tab with semicolon
if($alarmIdDef -eq "12-7")
{
Write-Host "mdbAppText_Arr: [0]: $($mdbAppText_Arr[0]) and [1] $($mdbAppText_Arr[1]) "
[byte] $mdbAppText_Arr[0][31]
}
I've been looking at:
replace
replace - this one has a link reference to look up in the ASCII table, but it's unclear to me which column of the table/link holds the byte equivalent.
I'm using PowerShell 5.1.

-replace is a regex operator, so you need to supply a valid regular expression pattern as the right-hand side operand.
You can replace most newline sequences with a pattern describing a substring consisting of:
an optional carriage return (\r? in regex), followed by
a (non-optional) newline character (\n in regex):
$mdbAppText_Arr = $mdbAppText_Arr -replace '\r?\n'
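A small sketch of the operator on a made-up value ($field and $clean are illustrative names, not from the question). It also assigns the result: both -replace and the .Replace() method return a new string rather than modifying the original, so the unassigned .replace(...) call in the question leaves $mdbAppText_Arr[0] unchanged.

```powershell
# Hypothetical field value containing a Windows-style line break.
$field = "The cover is open.`r`nPlease close the cover."

# Replace each CRLF or bare LF with a single space and keep the result.
$clean = $field -replace '\r?\n', ' '

$clean
```

Omitting the replacement operand, as in the line above, removes the newline instead of substituting a space.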

Swap string order in one line or swap lines order in powershell

I need to swap the positions of 2 or more regex-matched strings within one line, or of several lines, in a txt file using PowerShell.
In Notepad++ I just find ^(String 1.*)\r\n(String 2.*)\r\n(String 3.*)$ and replace with \3\r\n\1\r\n\2:
String 1 aksdfh435##%$dsf
String 2 aksddfgdfg$dsf
String 3 aksddfl;gksf
Turns to:
String 3 aksddfl;gksf
String 1 aksdfh435##%$dsf
String 2 aksddfgdfg$dsf
So how can I do it in Powershell? And if possible can I use the command by calling powershell -command in cmd?

It's basically exactly the same in PowerShell, e.g.:
$Content = @'
Unrelated data 1
Unrelated data 2
aksdfh435##%$dsf
aksddfgdfg$dsf
aksddfl;gksf
Unrelated data 3
'@
$LB = [System.Environment]::NewLine
$String1= [regex]::Escape('aksdfh435##%$dsf')
$String2= [regex]::Escape('aksddfgdfg$dsf')
$String3= [regex]::Escape('aksddfl;gksf')
$RegexMatch = "($String1.*)$LB($String2.*)$LB($String3.*)$LB"
$Content -replace $RegexMatch,"`$3$LB`$1$LB`$2$LB"
outputs:
Unrelated data 1
Unrelated data 2
aksddfl;gksf
aksdfh435##%$dsf
aksddfgdfg$dsf
Unrelated data 3
I used [System.Environment]::NewLine because it returns the default line break of whatever system you're on, and bound it to a variable to make the code easier to read. Either \r\n or `r`n would have worked as well: the former in a single-quoted regex pattern, the latter (using backticks) in a double-quoted string. Note that \r\n is only interpreted in the search pattern; the replacement operand does not process backslash escapes, so it needs `r`n. The backtick is also what I use to escape $1, $2 and so on in the double-quoted replacement string, that being the format for referring to the first, second and third capture groups of the regex.
I also use the [regex]::Escape('STRING') method to escape the strings to avoid special characters messing things up.
To use file input instead replace $Content with something like this:
$Content = Get-Content -Path 'C:\script\lab\Tests\testfile.txt' -Raw
and replace the last line with something like:
$Content -replace $RegexMatch,"`$3$LB`$1$LB`$2$LB" | Set-Content -Path 'C:\script\lab\Tests\testfile.txt'

In PowerShell it is not very different.
The replacement string needs to be inside double quotes (") here because of the newline characters, and because of that you need to backtick-escape the backreference variables $1, $2 and $3:
$str -replace '^(String 1.*)\r?\n(String 2.*)\r?\n(String 3.*)$', "`$3`r`n`$1`r`n`$2"
This is assuming your $str is a single multiline string as the question implies.
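As a hedged variation on the same idea, the sketch below uses [^\r\n]* instead of .* so that a captured line never includes a trailing carriage return when the input uses CRLF line endings. The sample text and the variable names $pattern and $swapped are illustrative assumptions.

```powershell
# Hypothetical three-line input with Windows (CRLF) line endings.
$str = "String 1 aaa`r`nString 2 bbb`r`nString 3 ccc"

# [^\r\n]* stops before any CR or LF, so each group holds only the line text.
$pattern = '(?m)^(String 1[^\r\n]*)\r?\n(String 2[^\r\n]*)\r?\n(String 3[^\r\n]*)'

# Backtick-escaped $3, $1, $2 reach the regex engine as group references.
$swapped = $str -replace $pattern, "`$3`r`n`$1`r`n`$2"

$swapped
```

The (?m) option lets ^ match at the start of any line, so the same pattern also works when the three lines sit in the middle of a larger string.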

Get rid of repeated fields in a log line

I have log lines in which the information fields are repeated: the first time they are separated by a comma and a space, the second time by a semicolon. I want to get rid of the second occurrence. The word (SECOND) is not in the log; I put it there to make the example clearer.
targets:somehost state:Memory\Buffers=398672, Memory\Cached=4620216, Memory\MemFree=833748, Memory\MemTotal=8001352 (SECOND) Memory\Buffers=398672;Memory\Cached=4620216;Memory\MemFree=833748;Memory\MemTotal=8001352 type:Unix Resources
I was thinking of using -replace:
%{$_ -replace "Memory\\Buffers=([0-9]+);Memory\\Cached=([0-9]+);Memory\\MemFree=([0-9]+);Memory\\MemTotal=([0-9]+)",""}
but the log has a lot more fields, which I didn't include here to keep it readable.
Is there a better way to do this?

Assuming all your log lines follow the pattern of your sample line, and assuming that all fields following state: are repeated, you can use the following regex (using a simplified input string, in which the fields Me\Bu=398672 and Me\Ca=4620216 are repeated):
'a:b c:Me\Bu=398672, Me\Ca=4620216 Me\Bu=398672;Me\Ca=4620216 d:e f' | % {
$_ -replace '[^ ]+;[^ ]+ '
}
The above yields:
a:b c:Me\Bu=398672, Me\Ca=4620216 d:e f
[^ ]+; matches the first field in the second occurrence of the fields list.
[^ ]+ matches all remaining fields in the second list.

You can remove everything starting from (and including) the second occurrence of Memory\Buffers= with a positive look-behind assertion ((?<=...)):
$string -replace '(?<=Memory\\Buffers=.*?)\s*Memory\\Buffers.*$'
As you've found, with a positive look-ahead assertion ((?=...)) you can then specify where the second sequence stops:
$string -replace '(?<=Memory\\Buffers=.*?)\s*Memory\\Buffers.*(?=\stype:)'
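For illustration, here is the look-behind/look-ahead pattern applied to the sample line from the question, with the (SECOND) marker left out since it is not part of the real log. The variable names $line and $deduped are assumptions made for this sketch.

```powershell
# Sample log line from the question, without the explanatory (SECOND) marker.
$line = 'targets:somehost state:Memory\Buffers=398672, Memory\Cached=4620216, Memory\MemFree=833748, Memory\MemTotal=8001352 Memory\Buffers=398672;Memory\Cached=4620216;Memory\MemFree=833748;Memory\MemTotal=8001352 type:Unix Resources'

# Remove the second field list: it starts at the second Memory\Buffers
# and ends just before the whitespace that precedes "type:".
$deduped = $line -replace '(?<=Memory\\Buffers=.*?)\s*Memory\\Buffers.*(?=\stype:)'

$deduped
```

This relies on Memory\Buffers being the first field of both lists and on type: following the second list, as in the sample line.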

How to replace a string preceded by zero, one or more spaces in PowerShell

I'm using the .Replace() function to replace line feeds in the file I'm working on with a carriage return and a line feed but I would also like to match any number of spaces preceding the line feed. Can this be done in the same operation using a regular expression?
I've tried various combinations of "\s +*" but none have worked, except with a fixed number of manually typed spaces.
This version works for the one space case:
.Replace(" `n","`r`n")
For example, a file like this:
...end of line one\n
...end of line two \n
would look like:
...end of line one\r\n
...end of line two\r\n

The .Replace() method of the .NET [string] type performs literal string replacements.
By contrast, PowerShell's -replace operator is based on regexes (regular expressions), so it allows you to match a variable number of spaces (including none) with ' *' (a space followed by *):
"...end of line two `n" -replace ' *\n', "`r`n"

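A minimal sketch applying the same -replace approach to a multi-line string, using made-up sample text and the illustrative names $text and $fixed. The optional \r? is an addition not in the answer above; it is there so that lines already ending in CRLF do not get a second carriage return.

```powershell
# Hypothetical content: one line ends in LF, the other in spaces plus LF.
$text = "...end of line one`n...end of line two   `n"

# Strip any run of spaces before the line break and normalize it to CRLF.
$fixed = $text -replace ' *\r?\n', "`r`n"

$fixed
```

For a file, the input would typically come from Get-Content -Raw so that the whole content is a single string.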
Extract anchor tag link text using Powershell

I'm attempting to extract the link text from something like the line below using PowerShell.
Entertainment, Intimate Apparel/Swimsuit, and Suspicious
I've tried the following but it's only matching the first result and is including the > and < which I don't want. I'm sure it's an issue with the Regex but I don't know it well enough to see what's wrong. Note the string above is $result.categorization
$result.categorization -match '(\>(.*?)\<)'
This returns
Name,Value
2,Entertainment
1,>Entertainment<
0,>Entertainment<
I want to return
Name,Value
2,Suspicious
1,Intimate Apparel/Swimsuit
0,Entertainment
I also tried the regex listed in "Regular expression to extract link text from anchor tag", but that didn't match anything.

I don't know where the headers and numbers in the output come from, but here's a solution that extracts the link texts from the single-line input exactly as specified:
$str = @'
Entertainment, Intimate Apparel/Swimsuit, and Suspicious
'@
$str -split ', and |, ' -replace '.*?>([^<]*).*', '$1'
$str -split ', and |, ' splits the input line into individual <a> elements.
-replace then operates on each <a> element individually:
'.*?>([^<]*).*' matches the entire line, but captures only the link text in the one and only capture group, (...).
Replacement text $1 then replaces the entire line with what the capture group matched, i.e., effectively only returning the link text.
As for what you tried:
-match never extracts part of its input - it returns a Boolean indicating whether a match was found with a scalar LHS, or a filtered sub-array of matching items with an array as the LHS.
That said, the automatic $Matches variable does contain information about what parts matched, but only with a scalar LHS.
'(\>(.*?)\<)' contains two nested capture groups that match literal > followed by any number of characters (matching non-greedily), followed by literal <.
It is the inner capture group that would capture the link text.
However:
There is no need for the outer capture group.
> and < do not need \-escaping in a regular expression (although it does no harm).
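As an alternative sketch, every link text can be collected in one pass with the .NET [regex]::Matches() method, which returns all matches rather than only the first. The href values below are placeholders invented for illustration, and $html and $texts are assumed names.

```powershell
# Hypothetical input; the example.com URLs are placeholders, not real data.
$html = '<a href="https://example.com/1">Entertainment</a>, ' +
        '<a href="https://example.com/2">Intimate Apparel/Swimsuit</a>, and ' +
        '<a href="https://example.com/3">Suspicious</a>'

# One capture group holds the text between an opening <a ...> tag and </a>.
$texts = @([regex]::Matches($html, '<a\b[^>]*>([^<]*)</a>') |
    ForEach-Object { $_.Groups[1].Value })

$texts
```

Like the -replace approach, this is only suitable for simple, predictable markup; it is not a general HTML parser.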