Remove list of phrases if they are present in a text file using Powershell - powershell

I have a list of phrases (over 100) that I want to remove from a text file (products.txt), which contains lines of text (tab separated / one per line). Lines that do not match any phrase in the list should be written back to the same file.
#cd .\Desktop\
$productlist = @(
'example',
'juicebox',
'telephone',
'keyboard',
'manymore')
foreach ($product in $productlist) {
get-childitem products.txt | Select-String -Pattern $product -NotMatch | foreach {$_.line} | Out-File -FilePath .\products.txt
}
The above code does not remove the words listed in $productlist; it simply outputs all lines in products.txt again.
The lines inside of products.txt file are these:
productcatalog
product1example
juicebox038
telephoneiphone
telephoneandroid
randomitem
logitech
coffeetable
razer
Thank you for your help.

Here's my solution. You need the parentheses otherwise the input file will be in use when trying to write to the file. Select-string accepts an array of patterns. I wish I could pipe 'path' to set-content but it doesn't work.
$productlist = 'example', 'juicebox', 'telephone', 'keyboard', 'manymore'
(Select-String $productlist products.txt -NotMatch) | % line |
set-content products.txt

here's one way to do what you want. it's somewhat more direct than what you used. [grin] it uses the way that PoSh can act on an entire collection when it is on the LEFT side of an operator.
what it does ...
fakes reading in a text file
when ready to do this in real life, replace the whole #region/#endregion block with a call to Get-Content.
builds the exclude list
converts that into a regex OR pattern
filters out the items that match the unwanted list
shows that resulting list
the code ...
#region >>> fake reading in a text file
# when ready to do this for real, replace the whole "#region/#endregion" block with a call to Get-Content
$ProductList = @'
productcatalog
product1example
juicebox038
telephoneiphone
telephoneandroid
randomitem
logitech
coffeetable
razer
'@ -split [System.Environment]::NewLine
#endregion >>> fake reading in a text file
$ExcludedProductList = @(
'example'
'juicebox'
'telephone'
'keyboard'
'manymore'
)
$EPL_Regex = $ExcludedProductList -join '|'
$RemainingProductList = $ProductList -notmatch $EPL_Regex
$RemainingProductList
output ...
productcatalog
randomitem
logitech
coffeetable
razer
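One caveat with joining the phrases into a regex OR pattern: if a phrase contains regex metacharacters (for example `+` or `.`), it would be interpreted as regex syntax rather than literal text. A minimal sketch of one way around that, escaping each phrase with `[regex]::Escape()` before joining. The sample phrases and product names below are made up for illustration.

```powershell
# Hypothetical exclude list; 'c++' and 'a.b' contain regex metacharacters.
$ExcludedProductList = 'example', 'c++', 'a.b'
# Hypothetical product lines.
$ProductList = 'productcatalog', 'product1example', 'c++compiler', 'axb', 'a.b.c'

# Escape each phrase so it is matched literally, then join into one OR pattern.
$EPL_Regex = ($ExcludedProductList | ForEach-Object { [regex]::Escape($_) }) -join '|'

# Keep only the lines that match none of the escaped phrases.
$RemainingProductList = $ProductList -notmatch $EPL_Regex
$RemainingProductList
```

Here 'axb' is kept because the escaped dot in 'a.b' no longer matches an arbitrary character.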

Related

how to parse this (inner)xml?

i’m very new to powershell, and i’m a bit stuck.
I have this innerXML:
<sl-test.protocol>HTTP</sl-test.protocol>
<sl-test.responseTimeout>14000</sl-test.responseTimeout>
<env>${myenv}</env>
<http.port>8081</http.port>
And i want to convert it into a .properties file in this format:
sl-test.protocol=HTTP
sl-test.responseTimeout=14000
env=${myenv}
http.port=8081
i have the part to create the .properties file (hardcoded value right now) which works:
$test = New-Item -Name "mule-app.properties" -ItemType "file" -Value "test.prop=testprop`ntest2.prop=test2prop"
So basically i need to go from the innerXML to a big string of key/values separated by `n
But also i need to escape any $ with a backtick
desired string:
sl-test.protocol=HTTP`nsl-test.responseTimeout=14000`nenv=`${myenv}`nhttp.port=8081
But right now i cant even seem to iterate through all the keys and values.
Note: the keys and values will be dynamic, it will not always be those 4
Any help will be greatly appreciated.
The .ChildNodes property of the nodes in an [xml] (System.Xml.XmlDocument) instance allows you to loop over a given XML element's (System.Xml.XmlElement) child elements.
# Sample XML input.
[xml] $xml = @'
<el>
<sl-test.protocol>HTTP</sl-test.protocol>
<sl-test.responseTimeout>14000</sl-test.responseTimeout>
<env>${myenv}</env>
<http.port>8081</http.port>
</el>
'@
# Loop over all child elements of the document element.
$xml.el.ChildNodes |
ForEach-Object {
# Create and output a line for the output file, based on the
# element's name and inner text, with "$" escaped as "`$"
'{0}={1}' -f $_.Name, $_.InnerText.Replace('$', '`$')
} # | Set-Content out.properties -Encoding utf8
Uncomment and adapt the Set-Content call as needed.

Powershell - randomize same string in huge file using all random strings from array

I am looking for a way to randomize a specific string in a huge file by using predefined strings from array, without having to write temporary file on disk.
There is a file which contains the same string, e.g. "ABC123456789" at many places:
<Id>ABC123456789</Id><tag1>some data</tag1><Id>ABC123456789</Id><Id>ABC123456789</Id><tag2>some data</tag2><Id>ABC123456789</Id><tag1>some data</tag1><tag3>some data</tag3><Id>ABC123456789</Id><Id>ABC123456789</Id>
I am trying to randomize that "ABC123456789" string using an array or list of defined strings, e.g. "@('foo','bar','baz','foo-1','bar-1')". Each ABC123456789 should be replaced by a randomly picked string from the array/list.
I have ended up with the following solution, which works "fine". But it is definitely not the right approach, as it saves to disk many times (once for each replaced string) and is therefore very slow:
$inputFile = Get-Content 'c:\temp\randomize.xml' -raw
$checkString = Get-Content -Path 'c:\temp\randomize.xml' -Raw | Select-String -Pattern '<Id>ABC123456789'
[regex]$pattern = "<Id>ABC123456789"
while($checkString -ne $null) {
$pattern.replace($inputFile, "<Id>$(Get-Random -InputObject @('foo','bar','baz','foo-1','bar-1'))", 1) | Set-Content 'c:\temp\randomize.xml' -NoNewline
$inputFile = Get-Content 'c:\temp\randomize.xml' -raw
$checkString = Get-Content -Path 'c:\temp\randomize.xml' -Raw | Select-String -Pattern '<Id>ABC123456789'
}
Write-Host All finished
The output is randomized, e.g.:
<Id>foo
<Id>bar
<Id>foo
<Id>foo-1
However, I would like to achieve this kind of output without having to write file to disk in each step. For thousands of the string occurrences it takes a lot of time. Any idea how to do it?
=========================
Edit 2023-02-16
I tried the solution from zett42 and it works fine with simple XML structure. In my case there is some complication which was not important in my text processing approach.
Root and some other elements names in the structure of processed XML file contain colon and there must be some special setting for "-XPath" for this situation. Or, maybe the solution is outside of Powershell scope.
<?xml version='1.0' encoding='UTF-8'?>
<C23A:SC777a xmlns="urn:C23A:xsd:$SC777a" xmlns:C23A="urn:C23A:xsd:$SC777a" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:C23A:xsd:$SC777a SC777a.xsd">
<C23A:FIToDDD xmlns="urn:iso:std:iso:20022:tech:xsd:pacs.008.001.02">
<CxAAA>
<DxBBB>
<ABC>
<Id>ZZZZZZ999999</Id>
</ABC>
</DxBBB>
<CxxCCC>
<ABC>
<Id>ABC123456789</Id>
</ABC>
</CxxCCC>
</CxAAA>
<CxAAA>
<DxBBB>
<ABC>
<Id>ZZZZZZ999999</Id>
</ABC>
</DxBBB>
<CxxCCC>
<ABC>
<Id>ABC123456789</Id>
</ABC>
</CxxCCC>
</CxAAA>
</C23A:FIToDDD>
<C23A:PmtRtr xmlns="urn:iso:std:iso:20022:tech:xsd:pacs.004.001.02">
<GrpHdr>
<TtREEE Abc="XV">123.45</TtREEE>
<SttlmInf>
<STTm>ABCA</STTm>
<CLss>
<PRta>SIII</PRta>
</CLss>
</SttlmInf>
</GrpHdr>
<TxInf>
<OrgnlTxRef>
<DxBBB>
<ABC>
<Id>YYYYYY888888</Id>
</ABC>
</DxBBB>
<CxxCCC>
<ABC>
<Id>ABC123456789</Id>
</ABC>
</CxxCCC>
</OrgnlTxRef>
</TxInf>
</C23A:PmtRtr>
</C23A:SC777a>
As commented, it is not recommended to process XML like a text file. This is a brittle approach that depends too much on the formatting of the XML. Instead, use a proper XML parser to load the XML and then process its elements in an object-oriented way.
# Use XmlDocument (alias [xml]) to load the XML
$xml = [xml]::new(); $xml.Load(( Convert-Path -LiteralPath input.xml ))
# Define the ID replacements
$searchString = 'ABC123456789'
$replacements = 'foo','bar','baz','foo-1','bar-1'
# Process the text of all ID elements that match the search string, regardless how deeply nested they are.
$xml | Select-Xml -XPath '//Id/text()' | ForEach-Object Node |
Where-Object Value -eq $searchString | ForEach-Object {
# Replace the text of the current element by a randomly chosen string
$_.Value = Get-Random $replacements
}
# Save the modified document to a file
$xml.Save( (New-Item output.xml -Force).Fullname )
$xml | Select-Xml -XPath '//Id/text()' selects the text nodes of all Id elements, regardless how deeply nested they are in the XML DOM, using the versatile Select-Xml command. The XML nodes are selected by specifying an XPath expression.
Regarding your edit, when you have to deal with XML namespaces, use the parameter -Namespace to specify a namespace prefix to use in the XPath expression for the given namespace URI. In this example I've simply chosen a as the namespace prefix:
$xml | Select-Xml -XPath '//a:Id/text()' -Namespace @{a = 'urn:iso:std:iso:20022:tech:xsd:pacs.008.001.02'}
ForEach-Object Node selects the Node property from each result of Select-Xml. This simplifies the following code.
Where-Object Value -eq $searchString selects the text nodes that match the search string.
Within ForEach-Object, the variable $_ stands for the current text node. Assign to its Value property to change the text.
The Convert-Path and New-Item calls make it possible to use a relative PowerShell path (PSPath) with the .NET XmlDocument class. In general .NET APIs don't know anything about the current directory of PowerShell, so we have to convert the paths before passing to .NET API.
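In the XML from the question's edit, the Id elements sit under two different default namespaces (pacs.008.001.02 and pacs.004.001.02), so a single prefix only covers one of them. One possible workaround, sketched here under the assumption that matching by element name alone is acceptable, is to ignore namespaces with the XPath `local-name()` function. The miniature document and its `urn:example:*` namespace URIs are placeholders, not the real file.

```powershell
# Made-up miniature document with two different default namespaces.
[xml] $xml = @'
<r:root xmlns:r="urn:example:root">
  <r:a xmlns="urn:example:one"><ABC><Id>ABC123456789</Id></ABC></r:a>
  <r:b xmlns="urn:example:two"><ABC><Id>ABC123456789</Id></ABC><ABC><Id>KEEP</Id></ABC></r:b>
</r:root>
'@

$searchString = 'ABC123456789'
$replacements = 'foo', 'bar', 'baz'

# local-name() compares only the element name, so Id elements are found
# regardless of the namespace they belong to.
$xml | Select-Xml -XPath "//*[local-name()='Id']/text()" | ForEach-Object Node |
    Where-Object Value -eq $searchString | ForEach-Object {
        $_.Value = Get-Random -InputObject $replacements
    }

$xml.OuterXml
```

If elements with the same local name but a different meaning exist in other namespaces, passing several prefixes in the -Namespace hashtable and combining the XPath expressions with `|` is probably the safer route.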

Powershell- match split and replace based on index

I have a file
AB*00*Name1First*Name1Last*test
BC*JCB*P1*Church St*Texas
CD*02*83*XY*Fax*LM*KY
EF*12*Code1*TX*1234*RJ
I need to replace the 5th element in the CD segment alone from LM to ET in each of the files in the folder. The element delimiter is *, as shown in the sample file content above. I am new to PowerShell and tried the code below, but unfortunately it does not give the desired results. Can any of you please provide some help?
foreach($xfile in $inputfolder)
{
If ($_ match "^CD\*")
{
[System.IO.File]::ReadAllText($xfile).replace(($_.split("*")[5],"ET") | Set-Content $xfile
}
[System.IO.File]::WriteAllText($xfile),((Get-Content $xfile -join("~")))
}
here's a slightly different way to get there ... [grin] what it does ...
fakes reading in a test file
when ready to do this for real, remove the entire #region/#endregion block and use Get-Content.
sets the constants
iterates thru the imported text file lines
checks for a line that starts with the target pattern
if found ...
== escapes the old value with [regex]::Escape() to deal with the asterisks
== replaces the escaped old value with the new value
== outputs the new version of that line
if NOT found, outputs the line as-is
stores all the lines into the $OutStuff var
displays that on screen
the code ...
#region >>> fake reading in a plain text file
# in real life, use Get-Content
$InStuff = @'
AB*00*Name1First*Name1Last*test
BC*JCB*P1*Church St*Texas
CD*02*83*XY*Fax*LM*KY
EF*12*Code1*TX*1234*RJ
'@ -split [System.Environment]::NewLine
#endregion >>> fake reading in a plain text file
$TargetLineStart = 'CD*'
$OldValue = '*LM*'
$NewValue = '*ET*'
$OutStuff = foreach ($IS_Item in $InStuff)
{
if ($IS_Item.StartsWith($TargetLineStart))
{
$IS_Item -replace [regex]::Escape($OldValue), $NewValue
}
else
{
$IS_Item
}
}
$OutStuff
output ...
AB*00*Name1First*Name1Last*test
BC*JCB*P1*Church St*Texas
CD*02*83*XY*Fax*ET*KY
EF*12*Code1*TX*1234*RJ
i will leave saving that to a new file [or overwriting the old one] to the user. [grin]
You could capture everything that comes before the match in group 1, and then match LM.
In the replacement, use $1ET
^(CD\*(?:[^*\r\n]+\*){4})LM\b
Note that the * after CD has to be escaped to match it literally, and that {4} skips the four elements between the CD segment name and the target element.
If you don't want to match LM literally, you could also match any char other than * or a newline.
^(CD\*(?:[^*\r\n]+\*){4})[^*\r\n]+\b
Replace example
$allText = Get-Content -Raw file.txt
$allText -replace '(?m)^(CD\*(?:[^*\r\n]+\*){4})LM\b','$1ET'
Output
AB*00*Name1First*Name1Last*test
BC*JCB*P1*Church St*Texas
CD*02*83*XY*Fax*ET*KY
EF*12*Code1*TX*1234*RJ
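Since the question asks for a replacement by index, here is a rough sketch of a split-and-join variant that touches only element 5 of CD lines. It assumes the file lines are already in memory (in real life they would come from Get-Content), and the variable names are made up for illustration.

```powershell
# Sample lines from the question; in real life, read them with Get-Content.
$lines = @(
    'AB*00*Name1First*Name1Last*test'
    'BC*JCB*P1*Church St*Texas'
    'CD*02*83*XY*Fax*LM*KY'
    'EF*12*Code1*TX*1234*RJ'
)

$result = foreach ($line in $lines) {
    if ($line.StartsWith('CD*')) {
        # Split into elements; index 0 is the segment name, index 5 the target.
        $parts = $line.Split('*')
        if ($parts.Count -gt 5 -and $parts[5] -eq 'LM') {
            $parts[5] = 'ET'
        }
        $parts -join '*'
    }
    else {
        # Other segments pass through unchanged.
        $line
    }
}
$result
```

Compared with a plain text replace, this leaves an LM that happens to appear at another position in the CD segment untouched.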

Extract string between two special characters in powershell

I need to extract a list with strings that are between two special characters (= and ;).
Below is an example of the file with line types and the needed strings in bold.
File is a quite big one, type is xml.
<type="string">data source=**HOL4624**;integrated sec>
<type="string">data source=**HOL4625**;integrated sec>
I managed to find the lines matching “data source=”, but how to get the name after?
Used code is below.
Get-content regsrvr.txt | select-string -pattern "data source="
Thank you very much!
<RegisteredServers:ConnectionStringWithEncryptedPassword type="string">data source=HOL4624;integrated security=True;pooling=False;multipleactiveresultsets=False;connect timeout=30;encrypt=False;trustservercertificate=False;packet size=4096</RegisteredServers:ConnectionStringWithEncryptedPassword>
<RegisteredServers:ConnectionStringWithEncryptedPassword type="string">data source=HOL4625;integrated security=True;pooling=False;multipleactiveresultsets=False;connect timeout=30;encrypt=False;trustservercertificate=False;packet size=4096</RegisteredServers:ConnectionStringWithEncryptedPassword>
The XML is not valid, so it can't be parsed cleanly. You can use string splitting with a regex match instead:
$html = @"
<RegisteredServers:ConnectionStringWithEncryptedPassword type="string">data source=HOL4624;integrated security=True;pooling=False;multipleactiveresultsets=False;connect timeout=30;encrypt=False;trustservercertificate=False;packet size=4096</RegisteredServers:ConnectionStringWithEncryptedPassword>
<RegisteredServers:ConnectionStringWithEncryptedPassword type="string">data source=HOL4625;integrated security=True;pooling=False;multipleactiveresultsets=False;connect timeout=30;encrypt=False;trustservercertificate=False;packet size=4096</RegisteredServers:ConnectionStringWithEncryptedPassword>
"@
$html -split '\n' | % {$null = $_ -match 'data source=.*?;';$Matches[0]} |
% {($_ -split '=')[1] -replace ';'}
HOL4624
HOL4625
Since the connectionstring is for SQL Server, let's use .Net's SqlConnectionStringBuilder to do all the work for us. Like so,
# Test data, XML extraction is left as an exercise
$str = 'data source=HOL4624;integrated security=True;pooling=False;multipleactiveresultsets=False;connect timeout=30;encrypt=False;trustservercertificate=False;packet size=4096'
$builder = new-object System.Data.SqlClient.SqlConnectionStringBuilder($str)
# Check some parameters
$builder.DataSource
HOL4624
$builder.IntegratedSecurity
True
You can expand your try at using Select-String with a better use of regex. Also, you don't need to use Get-Content first. Instead you can use the -Path parameter of Select-String.
The following Code will read the given file and return the value between the = and ;:
(Select-String -Path "regsrvr.txt" -pattern "(?:data source=)(.*?)(?:;)").Matches | % {$_.groups[1].Value}
Pattern Explanation (RegEx):
You can use -pattern to capture a String with a matching RegEx. The RegEx can be described as follows:
(?: opens an non-capturing Group
data source= matches the characters data source=
) closes the non-capturing Group
(.*?) matches any amount of characters and saves them in a Group. The ? is the lazy operator. This will stop the matching part at the first occurrence of the following group (in this case the ;).
(?:;) is the final non-capturing Group for the closing ;
Structuring the Output
Select-String returns a Microsoft.PowerShell.Commands.MatchInfo-Object.
You can find the matched Strings (the whole String and all captured groups) in there. We can also loop through this Output and return the Value of the captured Groups: | % {$_.groups[1].Value}
% is just an Alias for ForEach-Object.
For more information, look at the Select-String documentation and try your luck with some RegEx.
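As a variation on the capture-group pattern, a lookbehind lets the match itself be just the server name, so no group index is needed. This is only a sketch: the two input strings are shortened stand-ins for the real file lines, and `$names` is a made-up variable.

```powershell
# Shortened stand-ins for the lines in regsrvr.txt.
$lines = @(
    '<x type="string">data source=HOL4624;integrated security=True;pooling=False</x>'
    '<x type="string">data source=HOL4625;integrated security=True;pooling=False</x>'
)

# (?<=data source=) asserts the prefix without including it in the match;
# [^;]+ then takes everything up to (but not including) the next semicolon.
$names = $lines |
    Select-String -Pattern '(?<=data source=)[^;]+' |
    ForEach-Object { $_.Matches[0].Value }

$names
```

With a real file, the same pattern could be passed to Select-String together with -Path instead of piping strings in.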

Remove blank lines after specific text (without using -notmatch)

We have a script that uses a function to go through a text file and replace certain words with either other words or with nothing. The spots that get replaced with nothing leave behind a blank line, which we need to remove in some cases (but not all). I've seen several places where people mention using things like -notmatch to copy over everything to a new file except what you want left behind, but there are a lot of blank lines we want left in place.
For example:
StrangerThings: A Netflix show
'blank line to be removed'
'blank line to be removed'
Cast: Some actors
Crew: hard-working people
'blank line left in place'
KeyGrip
'blank line to be removed'
Gaffer
'blank line left in place'
So that it comes out like this:
StrangerThings: A Netflix show
Cast: Some actors
Crew: hard-working people
KeyGrip
Gaffer
We've tried doing a -replace, but that doesn't get rid of the blank line. Additionally, we have to key off of the text to the left of the ":" in each line. The data to the right in most cases is dynamic, so we can't hard-code anything in for that.
function FormatData {
#FUNCTION FORMATS DATA BASED ON SECTIONS
#This is where we're replacing some words in the different sections
#Some of these we replace leave the blank lines behind
$data[$section[0]..$section[1]] -replace $oldword,$newword
$output | Set-Content $outputFile
}
$oldword = "oldword"
$newword = "newword"
FormatData
$oldword = "oldword1"
$newword = "" #leaves a blank line
FormatData
$oldword = "Some phrase: "
$newword = "" #leaves a blank line
FormatData
We just need a pointer in the right direction on how to delete/remove a blank line (or several lines) after specific text, please.
Since it looks like you are reading in an array and doing replacements, the array index will not go away. You can change the value to blank or white space, and it will still appear as a blank line when it is output to a file or console. Using the -replace operator with no replacement string, replaces the regex match with an empty string.
One approach could be to read the data in raw like Get-Content -Raw and then the text is read into memory as is, but you lose array indexing. At that point, you have full control over replacing newline characters if you choose to do so. A second approach would be to mark the blank lines you want to keep initially (<#####> in this example), do the replacements, remove the blank spaces, and then clean up the markings.
# Do this before any new word replacements happen. Pass this object into any functions.
$data = $data -replace "^\s*$","<#####>"
$data[$section[0]..$section[1]] -replace $oldword,$newword
($output | Where-Object {$_}) -replace "<#####>" | Set-Content $outputFile
Explanation:
A value that is an empty string or $null evaluates to false in a PowerShell boolean conditional statement. Since the Where-Object script block performs a boolean conditional evaluation, you can simply check the pipeline object ($_). Any line that is not empty or $null evaluates to true. Note that a string containing only white space is not empty, so it also evaluates to true and is not filtered out by this check.
Below is a trivial example of the behavior:
$arr = "one","two","three"
$arr
one
two
three
$arr -replace "two"
one

three
$arr[1] = "two"
$arr
one
two
three
$arr -replace "two" | Where-Object {$_}
one
three
You can set a particular array value to $null and have it appear to go away. When writing to a file, it will appear as if the line has been removed. However, the array will still have that $null value. So you have to be careful.
$arr[1] = $null
$arr
one
three
$arr.count
3
If you use another collection type that supports resizing, you have the Remove method available. At that point though, you are adding extra logic to handle index removals and can't be enumerating the collection while you are changing its size.
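Another way to look at the original problem is to walk the lines once and drop blank lines only while the most recent non-blank line matches one of the "key" texts. The following is a rough sketch of that idea, not the approach from the answer above. The sample data mirrors the question, and the `$dropBlanksAfter` pattern is a made-up example of which lines should have their trailing blanks removed.

```powershell
# Sample data modelled on the question ('' stands for a blank line).
$lines = @(
    'StrangerThings: A Netflix show', '', '',
    'Cast: Some actors',
    'Crew: hard-working people', '',
    'KeyGrip', '',
    'Gaffer', ''
)

# Hypothetical rule: remove blank lines that follow these lines.
$dropBlanksAfter = '^(StrangerThings:|KeyGrip$)'

$skipping = $false
$result = foreach ($line in $lines) {
    if ($line -match '^\s*$') {
        # Blank line: emit it only if the previous text line was not a key line.
        if (-not $skipping) { $line }
    }
    else {
        # Text line: remember whether blanks after it should be dropped.
        $skipping = $line -match $dropBlanksAfter
        $line
    }
}
$result
```

The blank lines after "Crew: hard-working people" and "Gaffer" survive, while those after "StrangerThings: ..." and "KeyGrip" are removed.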
If all you are doing is parsing a text file:
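function FormatData {
    # Apply the current $oldword/$newword pair to each piped-in string.
    process { $_ -replace $oldword, $newword }
}
# -Raw reads the file as one string, so the patterns can match line breaks.
$FileContent = Get-Content "C:\TextFile.txt" -Raw
$OutputFile = "C:\TextOutput.txt"
$oldword = "oldword"
$newword = "newword"
$FileContent = $FileContent | FormatData
# In single quotes a backtick is literal, so line breaks are written as \r?\n.
$oldword = '(?m)^(Crew: hard-working people)(\r?\n).*oldword1.*\r?\n'
$newword = '$1$2$2' # Leaves a blank line after Crew: hard-working people
$FileContent = $FileContent | FormatData
$oldword = '(?m)^.*oldword1.*\r?\n'
$newword = '' # Does not leave a blank line
$FileContent = $FileContent | FormatData
# The raw text already ends with its own line break, hence -NoNewline.
$FileContent | Set-Content $OutputFile -NoNewline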