Search for and print information from .html files in PowerShell

I want to use Powershell to search in .html documents for specific strings and print them out.
Let me explain my first function which works:
I use this function to search for all .html documents in the path which contain the string "Tag". After that I search for the string "ID:", skip the tag "</TD><TD>" and use the following regular expression to print out the following 32 characters, which is the ID. Below you see a part of the html file and then my function.
<TR VALIGN=TOP><TD>Lokation:</TD><TD>\Test1\blabla\asdf\1234\WS Auswertungen</TD></TR>
<TR VALIGN=TOP><TD>Beschreibung:</TD><TD></TD></TR>
<TR VALIGN=TOP><TD>Eigentümer:</TD><TD><IMG ALIGN=MIDDLE SRC="file:///C:\Users\D0262290\AppData\Local\Temp\23\User.bmp"> Wilmes, Tanja</TD></TR>
<TR VALIGN=TOP><TD>ID:</TD><TD>55C7B7F411E2661E001000806C38EBA0</TD></TR>
</TABLE></TD><TD><IMG ALIGN=MIDDLE SRC="file:///C:\Users\D0262290\AppData\Local\Temp\23\User.bmp">
The function:
Function searchStringID {
    Get-ChildItem -Path C:\Users\blub\lala\Dokus -Filter *.html |
        Select-String -Pattern "Tag" |
        select Path |
        Get-ChildItem |
        foreach {
            if ((Get-Content -Raw -Path $_.FullName) -replace "<.*?>|\s" -match "(?s)ID:(?<Id>[a-z0-9]{32})" ) {
                printToOutputLog
            }
        }
}
All this works fine.
Now I need to extract two more pieces of information, and I can't figure out the regular expressions to use because the values have no fixed length.
I always have to check for the string "Tag" in my problems below.
My first problem:
I have to get the location of the file, so I have to search for the string "Lokation:" (see the HTML excerpt I posted above).
To get this information I again have to skip the tags </TD><TD> and use a regular expression to capture the location. My problem is that I have no idea how to handle the variable number of characters. Is there a way to print out the characters between "Lokation:</TD><TD>" and "</TD></TR>"?
The tags are all the same in the other html files, so I just need a solution which works for my example.
My second problem:
I have to read out the object's name. In the HTML document it is stored in a comment, as shown below. The object's name begins after "[OBJECT:" and ends with "]". Here again, I can't figure out which expression to use. Any of the special characters in the example object's name below can occur.
<!-- ################################################################## -->
<!-- # [OBJECT: NAME BLA bla/ BLA_BLA 1 22:34] # -->
<!-- ################################################################## -->
I would be so thankful if anyone could help me. Every hint is useful to me because my brain is really stuck here.
Thanks and cheers

Ok, this one gets the contents of each file and runs each line through a Switch to match against three RegEx expressions. It worked for me against your sample data. It assigns each match to a variable for each of the three things you are looking for, and then outputs an object for each.
Function searchStringID {
    Get-ChildItem -Path C:\Users\blub\lala\Dokus -Filter *.html |
        Select-String -Pattern "Tag" |
        select Path |
        Get-ChildItem |
        foreach {
            Switch -Regex (Get-Content -Path $_.FullName){
                "((?<=ID:.+?)[a-z0-9]{32})" {$ID = $Matches[1]}
                "Lokation:.+?>(\\[^<]+)" {$Location = $Matches[1]}
                "OBJECT: ?([^\]]+)" {$Object = $Matches[1]}
            }
            [PSCustomObject][Ordered]@{
                'ID' = $ID
                'Location' = $Location
                'Name' = $Object
            }
        }
}
So then you could assign that to a variable and have an array of results to do with as you please (output to CSV? Sure! Display to the screen as a table? Can do! Email to the entire company? Um, yeah, but I wouldn't recommend that.)
Here's what it gave me when I ran it against your sample:
ID                               Location                                Name
--                               --------                                ----
55C7B7F411E2661E001000806C38EBA0 \Test1\blabla\asdf\1234\WS Auswertungen NAME BLA bla/ BLA_BLA 1 22:34
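To see how the two variable-length captures behave on their own, here is a minimal sketch that runs them against lines copied from the question. The variable names ($html, $location, $object) are only for illustration, and the patterns are slightly stricter variants of the ones in the function above, not a replacement for it.

```powershell
# Sample lines copied from the question; $html stands in for Get-Content -Raw output.
$html = @'
<TR VALIGN=TOP><TD>Lokation:</TD><TD>\Test1\blabla\asdf\1234\WS Auswertungen</TD></TR>
<TR VALIGN=TOP><TD>ID:</TD><TD>55C7B7F411E2661E001000806C38EBA0</TD></TR>
<!-- # [OBJECT: NAME BLA bla/ BLA_BLA 1 22:34] # -->
'@

# [^<]+ consumes everything up to the next tag, so no fixed length is needed.
$location = if ($html -match 'Lokation:</TD><TD>([^<]+)</TD></TR>') { $Matches[1] }

# [^\]]+ consumes everything up to the closing bracket inside the comment.
$object = if ($html -match '\[OBJECT:\s*([^\]]+)\]') { $Matches[1] }

$location
$object
```

The negated character classes are what make the length irrelevant: each capture simply stops at the first character that cannot belong to the value.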

Related

Powershell - randomize same string in huge file using all random strings from array

I am looking for a way to randomize a specific string in a huge file by using predefined strings from array, without having to write temporary file on disk.
There is a file which contains the same string, e.g. "ABC123456789" at many places:
<Id>ABC123456789</Id><tag1>some data</tag1><Id>ABC123456789</Id><Id>ABC123456789</Id><tag2>some data</tag2><Id>ABC123456789</Id><tag1>some data</tag1><tag3>some data</tag3><Id>ABC123456789</Id><Id>ABC123456789</Id>
I am trying to randomize that "ABC123456789" string using an array, or list of defined strings, e.g. "@('foo','bar','baz','foo-1','bar-1')". Each ABC123456789 should be replaced by a randomly picked string from the array/list.
I have ended up with the following solution, which works "fine". But it is definitely not the right approach, because it writes to disk once for each replaced string and is therefore very slow:
$inputFile = Get-Content 'c:\temp\randomize.xml' -raw
$checkString = Get-Content -Path 'c:\temp\randomize.xml' -Raw | Select-String -Pattern '<Id>ABC123456789'
[regex]$pattern = "<Id>ABC123456789"
while($checkString -ne $null) {
    $pattern.replace($inputFile, "<Id>$(Get-Random -InputObject @('foo','bar','baz','foo-1','bar-1'))", 1) | Set-Content 'c:\temp\randomize.xml' -NoNewline
    $inputFile = Get-Content 'c:\temp\randomize.xml' -raw
    $checkString = Get-Content -Path 'c:\temp\randomize.xml' -Raw | Select-String -Pattern '<Id>ABC123456789'
}
Write-Host All finished
The output is randomized, e.g.:
<Id>foo
<Id>bar
<Id>foo
<Id>foo-1
However, I would like to achieve this kind of output without having to write file to disk in each step. For thousands of the string occurrences it takes a lot of time. Any idea how to do it?
=========================
Edit 2023-02-16
I tried the solution from zett42 and it works fine with a simple XML structure. My case has a complication that did not matter in my text-processing approach.
The root element and some other element names in the processed XML file contain a colon, so "-XPath" presumably needs some special setting for this situation. Or maybe the solution is outside the scope of PowerShell.
<?xml version='1.0' encoding='UTF-8'?>
<C23A:SC777a xmlns="urn:C23A:xsd:$SC777a" xmlns:C23A="urn:C23A:xsd:$SC777a" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:C23A:xsd:$SC777a SC777a.xsd">
<C23A:FIToDDD xmlns="urn:iso:std:iso:20022:tech:xsd:pacs.008.001.02">
<CxAAA>
<DxBBB>
<ABC>
<Id>ZZZZZZ999999</Id>
</ABC>
</DxBBB>
<CxxCCC>
<ABC>
<Id>ABC123456789</Id>
</ABC>
</CxxCCC>
</CxAAA>
<CxAAA>
<DxBBB>
<ABC>
<Id>ZZZZZZ999999</Id>
</ABC>
</DxBBB>
<CxxCCC>
<ABC>
<Id>ABC123456789</Id>
</ABC>
</CxxCCC>
</CxAAA>
</C23A:FIToDDD>
<C23A:PmtRtr xmlns="urn:iso:std:iso:20022:tech:xsd:pacs.004.001.02">
<GrpHdr>
<TtREEE Abc="XV">123.45</TtREEE>
<SttlmInf>
<STTm>ABCA</STTm>
<CLss>
<PRta>SIII</PRta>
</CLss>
</SttlmInf>
</GrpHdr>
<TxInf>
<OrgnlTxRef>
<DxBBB>
<ABC>
<Id>YYYYYY888888</Id>
</ABC>
</DxBBB>
<CxxCCC>
<ABC>
<Id>ABC123456789</Id>
</ABC>
</CxxCCC>
</OrgnlTxRef>
</TxInf>
</C23A:PmtRtr>
</C23A:SC777a>
As commented, it is not recommended to process XML like a text file. This is a brittle approach that depends too much on the formatting of the XML. Instead, use a proper XML parser to load the XML and then process its elements in an object-oriented way.
# Use XmlDocument (alias [xml]) to load the XML
$xml = [xml]::new(); $xml.Load(( Convert-Path -LiteralPath input.xml ))
# Define the ID replacements
$searchString = 'ABC123456789'
$replacements = 'foo','bar','baz','foo-1','bar-1'
# Process the text of all ID elements that match the search string, regardless how deeply nested they are.
$xml | Select-Xml -XPath '//Id/text()' | ForEach-Object Node |
    Where-Object Value -eq $searchString | ForEach-Object {
        # Replace the text of the current element by a randomly chosen string
        $_.Value = Get-Random $replacements
    }
# Save the modified document to a file
$xml.Save( (New-Item output.xml -Force).Fullname )
$xml | Select-Xml -XPath '//Id/text()' selects the text nodes of all Id elements, regardless how deeply nested they are in the XML DOM, using the versatile Select-Xml command. The XML nodes are selected by specifying an XPath expression.
Regarding your edit, when you have to deal with XML namespaces, use the parameter -Namespace to specify a namespace prefix to use in the XPath expression for the given namespace URI. In this example I've simply chosen a as the namespace prefix:
$xml | Select-Xml -XPath '//a:Id/text()' -Namespace @{a = 'urn:iso:std:iso:20022:tech:xsd:pacs.008.001.02'}
ForEach-Object Node selects the Node property from each result of Select-Xml. This simplifies the following code.
Where-Object Value -eq $searchString selects the text nodes that match the search string.
Within ForEach-Object, the variable $_ stands for the current text node. Assign to its Value property to change the text.
The Convert-Path and New-Item calls make it possible to use a relative PowerShell path (PSPath) with the .NET XmlDocument class. In general .NET APIs don't know anything about the current directory of PowerShell, so we have to convert the paths before passing to .NET API.
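The XML in the question's edit puts its Id elements under two different default namespaces (pacs.008.001.02 and pacs.004.001.02), so a single prefix covers only one of them. One possible way around that, sketched below under the assumption that every Id element should be treated alike regardless of namespace, is the XPath function local-name(). The sample document and its urn:example:* namespaces are made up for illustration.

```powershell
# Made-up sample with Id elements in two different default namespaces.
[xml]$xml = @'
<r:Root xmlns:r="urn:example:root">
  <r:A xmlns="urn:example:ns1"><Id>ABC123456789</Id></r:A>
  <r:B xmlns="urn:example:ns2"><Id>ABC123456789</Id><Id>YYYYYY888888</Id></r:B>
</r:Root>
'@
$searchString = 'ABC123456789'
$replacements = 'foo', 'bar', 'baz', 'foo-1', 'bar-1'

# local-name() compares only the part of the element name after any prefix,
# so no -Namespace table is needed.
$xml | Select-Xml -XPath "//*[local-name()='Id']/text()" |
    ForEach-Object Node |
    Where-Object Value -eq $searchString |
    ForEach-Object { $_.Value = Get-Random -InputObject $replacements }

$xml.OuterXml
```

The trade-off is precision: this matches any element named Id in any namespace, so the -Namespace approach above remains the better choice when only one namespace should be touched.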

PowerShell script that searches for a string in a .txt and if it finds it, looks for the next line containing another string and does a job with it

I have the line
Select-String -Path ".\*.txt" -Pattern "6,16" -Context 20 | Select-Object -First 1
that would return 20 lines of context looking for a pattern of "6,16".
I need to look for the next line containing the string "ID number:" after the line of "6,16", read what is the text right next to "ID number:", find if this exact text exists in another "export.txt" file located in the same folder (so in ".\export.txt"), and see if it contains "6,16" on the same line as the one containing the text in question.
I know it may seem confusing, but what I mean is for example:
example.txt:5218: ID number:0002743284
shows whether this is true:
export.txt:9783: 0002743284 *some text on the same line for example* 6,16
If I understand the question correctly, you're looking for something like:
Select-String -List -Path *.txt -Pattern '\b6,16\b' -Context 0, 20 |
    ForEach-Object {
        if ($_.Context.PostContext -join "`n" -match '\bID number:(\d+)') {
            Select-String -List -LiteralPath export.txt -Pattern "$($Matches[1]).+$($_.Pattern)"
        }
    }
Select-String's -List switch limits the matching to one match per input file; -Context 0,20 also includes the 20 lines following the matching one in the output (but none (0) before).
Note that I've placed \b, a word-boundary assertion at either end of the search pattern, 6,16, to rule out accidental false positives such as 96,169.
$_.Context.PostContext contains the array of lines following the matching line (which itself is stored in $_.Line):
-join "`n" joins them into a multi-line string, so as to ensure that the subsequent -match operation reports the captured results in the automatic $Matches variable, notably reporting the ID number of interest in $Matches[1], the text captured by the first (and only) capture group ((\d+)).
The captured ID is then used in combination with the original search pattern to form a regex that looks for both on the same line, and is passed to a second Select-String call that searches through export.txt
Note: An object representing the matching line, if any, is output by default; to return just $true or $false, replace -List with -Quiet.
There's a lot wrong with what you're expecting and the code you've tried so let's break it down and get to the solution. Kudos for attempting this on your own. First, here's the solution, read below this code for an explanation of what you were doing wrong and how to arrive at the code I've written:
# Get matching lines plus the following line from the example.txt seed file
$seedMatches = Select-String -Path .\example.txt -Pattern "6,\s*16" -Context 0, 2
# Obtain the ID number from the line following each match
$idNumbers = foreach( $match in $seedMatches ) {
    $postMatchFields = $match.Context.PostContext -split ":\s*"
    # Note: .IndexOf(object) is case-sensitive when looking for strings
    # Returns -1 if not found
    $idFieldIndex = $postMatchFields.IndexOf("ID number")
    # Return the "ID number" to `$idNumbers` if "ID number" is found in $postMatchFields
    if( $idFieldIndex -gt -1 ) {
        $postMatchFields[$idFieldIndex + 1]
    }
}
# Match lines in export.txt where both the $id and "6,16" appear
$exportMatches = foreach( $id in $idNumbers ) {
    Select-String -Path .\export.txt -Pattern "^(?=.*\b$id\b)(?=.*\b6,\s*16\b).*$"
}
mklement0's answer essentially condenses this into less code, but I wanted to break this down fully.
First, Select-String -Path ".\*.txt" will look in all .txt files in the current directory. You'll want to narrow that down to a specific naming pattern you're looking for in the seed file (the file we want to find the ID to look for in the other files). For this example, I'll use example.txt and export.txt for the paths which you've used elsewhere in your question, without using globbing to match on filenames.
Next, -Context gives context of the surrounding lines from the match. You only care about the next line match so 0, 1 should suffice for -Context (0 lines before, 1 line after the match).
Finally, I've added \s* to the -Pattern to match on whitespace, should the 16 ever be padded from the ,. So now we have our Select-String command ready to go:
$seedMatches = Select-String -Path .\example.txt -Pattern "6,\s*16" -Context 0, 2
Next, we will need to loop over the matching results from the seed file. You can use foreach or ForEach-Object, but I'll use foreach in the example below.
For each $match in $seedMatches we'll need to get the $idNumbers from the lines following each match. When $match is ToString()'d, it will spit out the matched line and any surrounding context lines. Since we only have one line following the match for our context, we can grab $match.Context.PostContext for this.
Now we can get the $idNumber. We can split example.txt:5218: ID number:0002743284 into an array of strings by using the -split operator to split the string on the :\s* pattern (\s* matches on any or no whitespace). Once we have this, we can get the index of "ID number" (the lookup is case-sensitive) and get the value of the field immediately following it. Now we have our $idNumbers. I'll also add some protection below to ensure the ID number field is actually found before continuing.
$idNumbers = foreach( $match in $seedMatches ) {
    $postMatchFields = $match.Context.PostContext -split ":\s*"
    # Note: .IndexOf(object) is case-sensitive when looking for strings
    # Returns -1 if not found
    $idFieldIndex = $postMatchFields.IndexOf("ID number")
    # Return the "ID number" to `$idNumbers` if "ID number" is found in $postMatchFields
    if( $idFieldIndex -gt -1 ) {
        $postMatchFields[$idFieldIndex + 1]
    }
}
Now that we have $idNumbers, we can look in export.txt for lines that contain both an ID number and "6,\s*16", once again using Select-String. This time, I'll put the code first since it's nothing new, then explain the regex a bit:
$exportMatches = foreach( $id in $idNumbers ) {
    Select-String -Path .\export.txt -Pattern "^(?=.*\b$id\b)(?=.*\b6,\s*16\b).*$"
}
$exportMatches will now contain the lines which contain both the target ID number and the 6,16 value on the same line. Note that order wasn't specified so the expression uses positive lookaheads to find both the $id and 6,16 values regardless of their order in the string. I won't break down the exact expression but if you plug ^(?=.*\b0123456789\b)(?=.*\b6,\s*16\b).*$ into https://regexr.com it will break down and explain the regex pattern in detail.
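To get a feel for the lookahead pattern without touching any files, here is a small sketch that applies it to a handful of in-memory strings with -match. The sample lines are invented for illustration; only the pattern itself comes from the code above.

```powershell
$id = '0002743284'
$pattern = "^(?=.*\b$id\b)(?=.*\b6,\s*16\b).*$"

# Invented sample lines: two that should match, two near misses.
$lines = @(
    '0002743284 some text 6,16'      # ID first, then value
    '6, 16 other text 0002743284'    # value first (with padding), then ID
    '0002743284 some text 96,169'    # value is only part of a longer number
    '9990002743284 some text 6,16'   # ID is only part of a longer number
)

# With an array on the left-hand side, -match acts as a filter.
$matched = $lines -match $pattern
$matched
```

Only the first two lines come back: each lookahead is checked from the start of the line, so the order of the two values does not matter, while the \b assertions reject the two near misses.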
The full code is at the top of this answer.

Get-GPOReport and Search For Matched Name Value

I'm trying to use the PowerShell command 'Get-GPOReport' to get GPO information in XML string format so I can search it for sub-Element values with unknown and different Element tag names (I don't think XML Object format will work for me, so I didn't perform a cast with "[xml]"), but I haven't been able to parse the XML output so that I can grab the line or two after a desired "Name" Element line that matches the text I'm searching for.
Afterwards, I tried to use 'Select-String' or 'Select-XML' with XPath (the formatting is unclear to me, and I don't know whether one format can cover the various policy record locations) to match text and grab a value, but I haven't had any luck.
Also, if anyone knows how to search for GPMC GUI names (e.g. "Enforce password history") instead of first having to locate the equivalent back-end names (e.g. "PasswordHistorySize"), that would be even more helpful.
The following initial code is the part that works:
$String = "PasswordHistorySize" # This is an example string, as I will search for various strings eventually from a file, but I'm not sure if I could search for equivalent Group Policy GUI text "Enforce password history", if anyone knows how to do that.
$CurrentGPOReport = Get-GPOReport -Guid $GPO.Id -ReportType Xml -Domain $Domain -Server $NearestDC
If ($CurrentGPOReport -match $String)
{
    Write-Host "Policy Found: ""$($String)""" -Foregroundcolor Green
    #
    #
    # The following code is what I've tried to use to get value data, without any luck:
    #
    $ValueLine1 = $($CurrentGPOReport | Select-String -Pattern $String -Context 0,2)
    $Value = $($Pattern = ">(.*?)</" ; [regex]::match($ValueLine1, $Pattern).Groups[1].Value)
}
I've been looking at this since yesterday and didn't understand why Select-String wasn't working, and I figured it out today... The report is stored as a multi-line string, rather than an array of strings. You could do a -match against it for the value, but Select-String doesn't like the multi-line formatting it seems. If you -split '[\r\n]+' on it you can get Select-String to find your string.
If you want to use RegEx to just snipe the setting value you can do it with a multi-line regex search like this:
$String = "PasswordHistorySize" # This is an example string, as I will search for various strings eventually from a file, but I'm not sure if I could search for equivalent Group Policy GUI text "Enforce password history", if anyone knows how to do that.
$CurrentGPOReport = Get-GPOReport -Guid $GPO.Id -ReportType Xml -Domain $Domain -Server $NearestDC
$RegEx = '(?s)' + [RegEx]::Escape($String) + '.+?Setting.*?>(.*?)<'
If($CurrentGPOReport -match $RegEx)
{
    Write-Host "Policy Found: ""$String""" -Foregroundcolor Green
    $Value = $Matches[1]
}
I'm not sure how to match the GPMC name, sorry about that, but this should get you closer to your goals.
Edit: To get every setting separated out into its own chunk of text, rather than working on just that one policy, I had to alter my RegEx a bit. The output of this one is a little messier, but I think it can be cleaned up simply enough. This will split a GPO into individual settings:
$Policies = $CurrentGPOReport -split '(\<(q\d+:.+?>).+?\<(?:\/\2))' | Where { $_ -match ':Name' }
That will get you a collection of things that look like this:
<q1:Account>
<q1:Name>PasswordHistorySize</q1:Name>
<q1:SettingNumber>21</q1:SettingNumber>
<q1:Type>Password</q1:Type>
</q1:Account>
From there you just have to filter for whatever setting you're looking for.
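Both techniques from this answer can be tried without a domain controller. The sketch below uses a hard-coded snippet as a hypothetical stand-in for the string that Get-GPOReport returns; the snippet mirrors the sample output above and is not real report data.

```powershell
# Hypothetical stand-in for the multi-line string returned by Get-GPOReport -ReportType Xml.
$report = @'
<q1:Account>
  <q1:Name>PasswordHistorySize</q1:Name>
  <q1:SettingNumber>21</q1:SettingNumber>
  <q1:Type>Password</q1:Type>
</q1:Account>
'@
$String = 'PasswordHistorySize'

# Technique 1: split into lines first, so Select-String can return context lines.
$hit = $report -split '\r?\n' | Select-String -Pattern $String -Context 0, 1
$nextLine = $hit.Context.PostContext[0].Trim()

# Technique 2: one multi-line regex against the unsplit string.
$RegEx = '(?s)' + [regex]::Escape($String) + '.+?Setting.*?>(.*?)<'
$Value = if ($report -match $RegEx) { $Matches[1] }

$nextLine
$Value
```

The first technique returns the whole following line, which still needs parsing; the second returns just the text between the tags of the first element after the name whose tag contains "Setting".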
I have tried this with XPath, as you'll have more control navigating in the XML nodes:
[string]$SearchQuery = "user"
[xml]$Xml = Get-GPOReport -Name "Default Domain Policy" -ReportType xml
[array]$Nodes = Select-Xml -Xml $Xml -Namespace @{gpo="http://www.microsoft.com/GroupPolicy/Settings"} -XPath "//*"
$Nodes | Where-Object -FilterScript {$_.Node.'#text' -match $SearchQuery} | ForEach-Object -Process {
    $_.Name #Name of the found node
    $_.Node.'#text' #text in between the tags
    $_.Node.ParentNode.ChildNodes.LocalName #other nodes on the same level
}
After testing we found that the setting names in the XML output of the Get-GPOReport cmdlet do not always match those in the HTML output. For example: "Log on as a service" is found as "SeServiceLogonRight" in the XML output.

Remove list of phrases if they are present in a text file using Powershell

I have a list of phrases (over 100) that I want to remove from a text file (products.txt) which has lines of text inside it (they are tab separated / new line each). The lines that do not match any phrase in the list should be written back to the same file.
#cd .\Desktop\
$productlist = @(
    'example',
    'juicebox',
    'telephone',
    'keyboard',
    'manymore')
foreach ($product in $productlist) {
    get-childitem products.txt | Select-String -Pattern $product -NotMatch | foreach {$_.line} | Out-File -FilePath .\products.txt
}
The above code does not remove the words listed in $productlist; it simply outputs all lines in products.txt again.
The lines inside of products.txt file are these:
productcatalog
product1example
juicebox038
telephoneiphone
telephoneandroid
randomitem
logitech
coffeetable
razer
Thank you for your help.
Here's my solution. You need the parentheses otherwise the input file will be in use when trying to write to the file. Select-string accepts an array of patterns. I wish I could pipe 'path' to set-content but it doesn't work.
$productlist = 'example', 'juicebox', 'telephone', 'keyboard', 'manymore'
(Select-String $productlist products.txt -NotMatch) | % line |
set-content products.txt
here's one way to do what you want. it's somewhat more direct than what you used. [grin] it uses the way that PoSh can act on an entire collection when it is on the LEFT side of an operator.
what it does ...
fakes reading in a text file
when ready to do this in real life, replace the whole #region/#endregion block with a call to Get-Content.
builds the exclude list
converts that into a regex OR pattern
filters out the items that match the unwanted list
shows that resulting list
the code ...
#region >>> fake reading in a text file
# when ready to do this for real, replace the whole "#region/#endregion" block with a call to Get-Content
$ProductList = @'
productcatalog
product1example
juicebox038
telephoneiphone
telephoneandroid
randomitem
logitech
coffeetable
razer
'@ -split [System.Environment]::NewLine
#endregion >>> fake reading in a text file
$ExcludedProductList = @(
    'example'
    'juicebox'
    'telephone'
    'keyboard'
    'manymore'
    )
$EPL_Regex = $ExcludedProductList -join '|'
$RemainingProductList = $ProductList -notmatch $EPL_Regex
$RemainingProductList
output ...
productcatalog
randomitem
logitech
coffeetable
razer
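One thing to keep in mind with a list of 100+ phrases: joining them with '|' turns every phrase into a piece of a regex, so a phrase containing characters such as ( ) . + or ? would change the meaning of the pattern. A minimal sketch of one way to guard against that, using [regex]::Escape on each phrase first; the '(net)' phrase and 'price (net)' line are invented to show the effect.

```powershell
# Invented sample data; '(net)' contains regex metacharacters.
$lines = 'productcatalog', 'product1example', 'juicebox038', 'price (net)', 'razer'
$phrases = 'example', 'juicebox', '(net)'

# Escape each phrase so it is matched literally, then build the OR pattern.
$pattern = ($phrases | ForEach-Object { [regex]::Escape($_) }) -join '|'

# With an array on the left-hand side, -notmatch keeps only the non-matching lines.
$remaining = $lines -notmatch $pattern
$remaining
```

This leaves productcatalog and razer. Without the escaping, '(net)' would be read as a capture group matching the bare text net, which also removes lines that merely contain "net".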

Create Out-File-names using array elements

I need to create .txt/.sap files/shortcuts with changing content. I use a do-until loop, and the file names should be built from the strings in an array. I cannot use square brackets to access the array because PowerShell interprets them as wildcard characters.
The following code shows the principle:
$strSAPSystems = @("Production", "Finance", "Example")
$i = 0
do{
    "text1" | Out-File .\SAP_$strSAPSystems[$i].sap
    "text2" | Out-File .\SAP_$$strSAPSystems[$i].sap -Append
    $i++
}
until($i -eq $strSAPSystems.length)
results in an error: "out-file : cannot perform operation because the wildcard path ... did not resolve to a file"
I tried to add the -literalPath parameter but it didn't work. I am new to Powershell, is there a better way to have the files named after the SAP systems?
Thank you
You need to wrap the array access in a subexpression, $(...), to extract the value of a single element inside a path string. At the moment the whole array is expanded, so the path becomes something like .\SAP_Production Finance Example[0].sap, and the remaining square brackets are treated as wildcard characters.
Also, you have an extra $ in the second Out-File. Personally I would rewrite everything to:
$strSAPSystems = @("Production", "Finance", "Example")
$strSAPSystems | ForEach-Object {
    "text1" | Out-File ".\SAP_$($_).sap"
    "text2" | Out-File ".\SAP_$($_).sap" -Append
}
$_ is the current item in the array, and since it's a single object, I don't really need the subexpression $(), but I included it because it makes it easier to see the static and dynamic parts of the path.
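The difference between the two forms is easiest to see by expanding the strings without writing any files. A minimal sketch; the variable names $wrong, $right and $names are only for illustration.

```powershell
$strSAPSystems = @('Production', 'Finance', 'Example')
$i = 0

# Without a subexpression the whole array is expanded (joined with spaces)
# and the index is left over as literal text in brackets.
$wrong = ".\SAP_$strSAPSystems[$i].sap"

# With $(...) only the indexed element is inserted.
$right = ".\SAP_$($strSAPSystems[$i]).sap"

# Building every file name up front; nothing is written to disk here.
$names = $strSAPSystems | ForEach-Object { ".\SAP_$_.sap" }

$wrong
$right
$names
```

$wrong comes out as .\SAP_Production Finance Example[0].sap and $right as .\SAP_Production.sap, which is why only the second form gives Out-File a usable path.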