Exclude HTML comment from text file - powershell

I have a config file from which I need to extract some text and convert it to CSV. I am stuck at the first step: the file contains a few HTML comments that must be excluded, and only the remaining text should be exported to CSV.
An HTML comment looks like the following:
<!--<add name= />
<add name= />
<add name= />-->
I have tried different regexes to solve this, with no luck. The closest I have got is excluding the first and third lines with the regex below, but that doesn't solve the issue because the second line is still present:
(Get-Content -Path C:\Pathtothefile) -notmatch "^\s*(<!--)|>*(-->)$"
This regex takes out the lines that start with <!-- or end with -->, but not the middle one, which is also part of the comment. I have multiple files with multiple comments.
Tried several different combos ("<!--[^>]*(-->)$"), no luck so far.

In the documents you need to process, will the <!-- always be at the start of a line and the --> at the end? If so, you probably need to get the content and run it through a loop that processes the document line by line, toggling a state variable between content and comment.
$data=@"
<!--<add name= />
<add name= />
<add name= />-->
a,b,c,d
1,2,3,4
"@
$state='content'
# split on LF, tolerating a preceding CR so '-->$' still matches
$data -split "`r?`n" |
ForEach-Object {
    If ($_ -match '^<!--') {
        # a comment that also closes on this line never changes the state
        If ($_ -notmatch '-->$') { $state='comment' }
        return # because `continue` doesn't work in a foreach-object
    }
    If ($_ -match '-->$') {
        $state='content'
        return
    }
    If ($state -eq 'content') {
        $_
    }
}
Results
a,b,c,d
1,2,3,4

Not knowing the content of your config file, and despite jscott's hint, here is a regex approach.
To have a regex match span several lines, you have to get the raw content (Get-Content -Raw).
Then you need to specify regex options so the pattern can match across line terminators:
SingleLine mode (. matches any char including line feed), as well as
Multiline mode (^ and $ match at embedded line terminators), e.g.
(?smi) - note the "i" is to ignore case
The ? after .* makes the match ungreedy; otherwise the start of the first comment could match up to the end of the last comment.
(Get-Content .\config.html -raw) -replace '(?smi)^\<!--.*?--\>?'
Checked this on Regex101
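As a rough sketch of how the regex approach could feed the original goal (strip the comments, then convert the rest to CSV), the following uses an inline sample rather than a file. The sample data and the variable names $raw, $clean and $rows are assumptions made for illustration only.

```powershell
# Sample content standing in for the config file (hypothetical data).
$raw = @"
<!--<add name= />
<add name= />
<add name= />-->
a,b,c,d
1,2,3,4
"@

# (?s) lets . span line breaks; .*? stops at the first --> it finds.
$clean = $raw -replace '(?s)<!--.*?-->', ''

# Drop the blank lines left behind, then parse what remains as CSV.
$rows = $clean -split '\r?\n' | Where-Object { $_.Trim() } | ConvertFrom-Csv
$rows
```

With a real file, $raw would come from Get-Content -Raw, and $rows could be piped to Export-Csv.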

Related

IIS URL Rewrite - Bulk entry - Command line or another way?

I am decommissioning a company website (HTML); the company has rebranded and has a new site on a different domain/platform. I have to redirect (301) almost 1000 individual pages, and I really don't want to use the GUI to add every page, so I am trying to find out whether there is a command-line facility I can use to script the changes. At the moment the source and destination URLs are stored in a CSV.
Any thoughts or pearls of wisdom gratefully received.
Rob
SOLUTION
Thanks to Peter putting me on the right track, I was able to get this working. To help others in the future, these are the steps.
Install the URL Rewrite module into IIS. I created a dummy rule so I could see where the XML elements needed to go.
Create a CSV with the following headers: Name, Source, Target
Name has to be unique, the Source is the HTML page name and the Target is where you want the page to redirect to.
Create a PS1 file by copying the following:
$docTemplate = @'
<contact $($contacts -join "`n") />
'@
$entryTemplate = @'
<rule name="$($redirect.Name)" stopProcessing="true">
<match url="$($redirect.Source)" />
<action type="Redirect" url="$($redirect.Target)" redirectType="Permanent" />
</rule>
'@
Import-Csv C:\temp\Sites.csv -Delimiter ',' | Group-Object Id -ov grp | ForEach-Object {
    $contacts = foreach ($Redirect in $_.Group) {
        $ExecutionContext.InvokeCommand.ExpandString($entryTemplate)
    }
    $ExecutionContext.InvokeCommand.ExpandString($docTemplate) } |
    Set-Content -LiteralPath file.xml
I did steal this from another Stackoverflow post (link) and modify it to fit my requirements, you can ignore the "contact" element, it needs to remain but isn't used.
When you run the PS1 file you will end up with a file called "file.XML" that you can then copy all the rules and paste them into the web.config file.
Warning, if you are doing lots of re-writes you will hit an issue with the web.config file being over 250k in size - results in a 500 error. To fix this you need to edit the registry and add a couple of keys to allow for a bigger web.config file:
Key1:
HKLM\SOFTWARE\Microsoft\InetStp\Configuration\MaxWebConfigFileSizeInKB
This is a DWORD; set the base to decimal and change the value to a bit larger than your web.config.
Key2:
HKLM\SOFTWARE\Wow6432Node\Microsoft\InetStp\Configuration\MaxWebConfigFileSizeInKB
Same as above, DWORD and set to a bit bigger than web.config file.
Finally, do an IISRESET to pick up the changes.
Cheers
Rob

How to remove a multi line block of text from $pattern in Powershell

I'm getting the contents of a text file which is partly created by gsutil and I'm trying to put its contents in $body but I want to omit a block of text that contains special characters. The problem is that I'm not able to match this block of text in order for it to be removed. So when I print out $body it still contains all the text that I'm trying to omit.
Here's a part of my code:
$pattern = @"
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this you and any
users that download such composite files will need to have a compiled
crcmod installed (see "gsutil help crcmod").
"@
$pattern = ([regex]::Escape($pattern))
$body = Get-Content -Path C:\temp\file.txt -Raw | Select-String -Pattern $pattern -NotMatch
So basically I need it to display everything inside the text file except for the block of text in $pattern. I tried without -Raw and without ([regex]::Escape($pattern)) but it won't remove that entire block of text.
It has to be because of the special characters, probably the " , . () because if I make the pattern simple such as:
$pattern = @"
NOTE: You are uploading one or more
"@
then it works and this part of text is removed from $body.
It'd be nice if everything inside $pattern between the @" and "@ was treated literally. I'd like the simplest solution without functions, etc. I'd really appreciate it if someone could help me out with this.
With the complete text of your question stored in file .\SO_55538262.txt
This script with manually escaped pattern:
$pattern = '(?sm)^==\> NOTE: You .*?"gsutil help crcmod"\)\.'
$body = (Get-Content .\SO_55538262.txt -raw) -replace $pattern
$body
Returns here:
I'm getting the contents of a text file which is partly created by gsutil and I'm trying to put its contents in $body but I want to omit a block of text that contains special characters. The problem is that I'm not able to match this block of text in order for it to be removed. So when I print out $body it still contains all the text that I'm trying to omit.
Here's a part of my code:
$pattern = @"
"@
$pattern = ([regex]::Escape($pattern))
$body = Get-Content -Path C:\temp\file.txt -Raw | Select-String -Pattern $pattern -NotMatch
So basically I need it to display everything inside the text file except for the block of text in $pattern. I tried without -Raw and without ([regex]::Escape($pattern)) but it won't remove that entire block of text.
It has to be because of the special characters, probably the " , . () because if I make the pattern simple such as:
$pattern = @"
NOTE: You are uploading one or more
"@
then it works and this part of text is removed from $body.
It'd be nice if everything inside $pattern between the @" and "@ was treated literally. I'd like the simplest solution without functions, etc.
Explanation of the RegEx from regex101.com:
(?sm)^==\> NOTE: You .*?"gsutil help crcmod"\)\.
(?sm) match the remainder of the pattern with the following effective flags: gms
s modifier: single line. Dot matches newline characters
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
^ asserts position at start of a line
== matches the characters == literally (case sensitive)
\> matches the character > literally (case sensitive)
NOTE: You matches the characters NOTE: You literally (case sensitive)
.*?
. matches any character
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
"gsutil help crcmod" matches the characters "gsutil help crcmod" literally (case sensitive)
\) matches the character ) literally (case sensitive)
\. matches the character . literally (case sensitive)
An easy way to tackle this task (without regex) would be using the -notin operator. Since Get-Content is returning your file content as a string[]:
#requires -Version 4
$set = @('==> NOTE: You are uploading one or more large file(s), which would run'
'significantly faster if you enable parallel composite uploads. This'
'feature can be enabled by editing the'
'"parallel_composite_upload_threshold" value in your .boto'
'configuration file. However, note that if you do this you and any'
'users that download such composite files will need to have a compiled'
'crcmod installed (see "gsutil help crcmod").')
$filteredContent = @(Get-Content -Path $path).
Where({ $_.Trim() -notin $set }) # trim added for misc whitespace
v2 compatible solution:
@(Get-Content -Path $path) |
Where-Object { $set -notcontains $_.Trim() }

Tokenization based pattern replacement in web.config.token

I am using Release Manager 2015 to deploy my application.
I am using Microsoft's Extension Utilities pack to do this:
Extension Utility Pack - Documentation
This simply states:
Tokenization based pattern replacement
This task finds the pattern __<pattern>__ and replaces the same with the value from the variable with name <pattern>.
Eg. If you have a variable defined as foo with value bar,
on running this task on a file that contains __foo__ will be changed to bar.
So in my web.config.token file I simply add:
<add name="ADConnectionString" connectionString="__ADConnectionString__" />
and in release manager under variables created a variable with the name ADConnectionString which is then picked up during the step and replaced.
My question is that I cannot figure out a way to replace a tokenized string within a string.
<add name="CreateTextWriter" initializeData="directory=D:\__WEBLOGDIR__\__ENVIRONMENT__; basename=Web" />
This will work however
<host name="cache1.__ENVIRONMENT__.__DOMAIN__" cachePort="1"/>
will not. This is due to the RegEx being used for the matching.
$regex = '__[A-Za-z0-9._-]*__'
$matches = select-string -Path $tempFile -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value }
This will match the whole string rather than each tokenized string. To get around this I have changed the RegEx slightly to not be greedy in its selection.
$regex = '__[A-Za-z0-9._-]*?__'
Hope this helps someone else.
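A small sketch of the difference between the two patterns, run against the host line from above. The variable names $line, $greedy and $lazy are made up for this illustration, and [regex]::Matches is used instead of Select-String so that no temp file is needed.

```powershell
$line = '<host name="cache1.__ENVIRONMENT__.__DOMAIN__" cachePort="1"/>'

# Greedy: the class allows '_' and '.', so one match swallows both tokens.
$greedy = @([regex]::Matches($line, '__[A-Za-z0-9._-]*__') | ForEach-Object { $_.Value })

# Lazy: *? stops at the first closing __, giving one match per token.
$lazy = @([regex]::Matches($line, '__[A-Za-z0-9._-]*?__') | ForEach-Object { $_.Value })

"greedy: $($greedy -join ' | ')"
"lazy:   $($lazy -join ' | ')"
```

The directory example works with either pattern because the backslash between its two tokens is not in the character class, so a greedy match cannot run across it.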

Extract specific data

Please help. I am trying to extract multiple filenames from the following .xml file. I then need to copy the list of files from one folder to another. A part of the XML I have posted below:
<component>
<altname>HP Broadcom Online Firmware Upgrade Utility for VMware 5.x</altname>
<filename>CP021404.scexe</filename>
<name>HP Broadcom Online Firmware Upgrade Utility for VMware 5.x</name>
<description>This package contains vSphere 5.1 and VMware </description>
<component>
<component>
<altname>Online ROM Flash - Power Management Controller </altname>
<filename>CP021615.scexe</filename>
I used Windows PowerShell as below and got the output, but the output contains filenames (CP021404.scexe, CP021614.scexe below), line# and symbol still in it. What am I doing wrong on my first PS attempt?
PowerShell
$input_path = 'C:\PowerShell\hpsum_inventory.xml'
$output_file = 'C:\powershell\hpsum_inventory-o.xml'
$regex = ".exe"
select-string -Path $input_path -Pattern $regex -AllMatches > $output_file
Output
PowerShell\hpsum_inventory.xml:8: <filename>CP021404.scexe</filename>
PowerShell\hpsum_inventory.xml:18: <filename>CP021614.scexe</filename>
The problem is that you're using a RegEx match and the period character in RegEx matches any character except Line Feed/New Line characters, so it's matching any character followed by 'exe'. Really what you want to do is read the file as XML, and just output the <filename> nodes.
$input_path = 'C:\PowerShell\hpsum_inventory.xml'
$output_file = 'C:\powershell\hpsum_inventory-o.xml'
$regex = "exe$"
(Select-Xml -Path $input_path -XPath //filename).node.InnerText | ?{$_ -match $regex} | out-file $output_file
Edit: Ok, you need to incorporate that into a string, that's easy enough. We'll add a ForEach loop (I use the alias % for that) to the last line to insert the file name into a string.
(Select-Xml -Path $input_path -XPath //filename).node.InnerText | ?{$_ -match $regex} | %{"copy c:\powershell\$_ x:\firmware\"} | out-file $output_file
Edit2: Ok, so you want the knowledge in general of how to match text in a file. Can do! Select string will do what you want actually, it just wasn't the best method in general for the example you gave earlier. This gets a bit more interesting, since you need to be familiar with RegEx matching patterns, but other than that it's fairly straight forward. You want to use the -Pattern match again, but let me suggest a better pattern:
"filename>(.*?)<"
That looks for the filename tag, including closing > on it, and grabs everything up to the next < character. The () denote a capturing group, so the rest is ignored as far as the capture goes. Then we pipe to a ForEach loop, and for each line that it finds that matches we select the Matches property, and the second Group property of that (the first contains the whole text, including the filename> and < bits). So it looks like this:
$input_path = 'C:\PowerShell\hpsum_inventory.xml'
$output_file = 'C:\powershell\hpsum_inventory-o.xml'
$regex = "filename>(.*?)<"
select-string -Path $input_path -Pattern "filename>(.*?)<"|%{$_.matches.groups[1].value}
Now that only gets the file names. If we want to incorporate the rest of your thing about inserting it into text you enclose the part in the ForEach loop inside a sub-expression $() and then put that into your double quoted string like such:
select-string -Path $input_path -Pattern "filename>(.*?)<"|%{"copy c:\powershell\$($_.matches.groups[1].value) x:\firmware"}|Out-File $output_file
Personally I would suggest not doing that directly as it limits you. I'd collect the data in an array, then pipe that array into a process that does what you want, but then at least you have the collection so you can do with it what you want.
$input_path = 'C:\PowerShell\hpsum_inventory.xml'
$output_file = 'C:\powershell\hpsum_inventory-o.xml'
$regex = "filename>(.*?)<"
$Filenames = select-string -Path $input_path -Pattern "filename>(.*?)<"|%{$_.matches.groups[1].value}
$Filenames|%{"copy c:\powershell\$_ x:\firmware"}|Out-File $output_file
Why do it that way? What if you don't want to over-write something? Then you can do something like:
$Filenames|?{$_ -notin (GCI X:\firmware -file|select -expand name)}|%{"copy c:\powershell\$_ x:\firmware"}|Out-File $output_file
For your collection of serial numbers, try the regex pattern of:
"Serial Number: (\S*)"
In RegEx there are a few escaped characters that have special meaning, and capitalizing them inverts that meaning. \s means whitespace, so spaces, tabs, what not. Doing it as a capital means something that is NOT whitespace. The asterisk means however many of the previous thing (not whitespace) it can find. So this looks for 'Serial Number: ' and then captures everything after that until it reaches the end of the line or encounters whitespace. Check out this link to see how it works.
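A minimal sketch of that serial-number pattern in use. The input lines here are invented, since the real file layout is not shown in the thread, and $lines and $serials are illustrative names.

```powershell
# Hypothetical input lines; the actual file format is an assumption.
$lines = 'Serial Number: CZ1234ABCD  Model: X',
         'no match here',
         'Serial Number: MX5678EFGH'

# Group 1 holds the run of non-whitespace characters after the label.
$serials = $lines |
    Select-String -Pattern 'Serial Number: (\S*)' |
    ForEach-Object { $_.Matches[0].Groups[1].Value }

$serials
```

Against a file, the strings would be replaced by Select-String -Path, as in the filename examples above.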

Edit and save changes in file with powershell script

Please tell me how to edit the content of an element in an XML file with a PowerShell script.
<application>
<name>My Application</name>
<platforms>
<platform>android</platform>
<icon gap:density="ld" src="/icon-1.png" />
<icon gap:density="md" src="/icon-2.png" />
</platforms>
</application>
I tried this, but it's not what I want. I want to edit based on the name of the element (name, platform...), but I don't know how to do that in PowerShell.
$editfiles=get-childitem . *.xml -rec
foreach ($file in $editfiles)
{
(get-content $file.pspath) |
foreach-object {$_ -replace "My Application", "My New App"} | set-content $file.pspath }
Tks
Many tks for your help
It is better to edit XML documents using an XML Api rather than text search/replace. Try this:
[xml]$xml = Get-Content foo.xml
$xml.application.name = "new name"
$xml.Save("$pwd\foo.xml")
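For illustration, here is a sketch of the same XML API idea on an inline document, so it can be tried without a file. It uses a simplified copy of the sample: the gap:density attributes are left out, because the [xml] cast rejects a namespace prefix that is not declared in the document (the real file may well declare it). The new values are arbitrary.

```powershell
# Simplified stand-in for the question's file (gap: attributes omitted).
[xml]$xml = @"
<application>
  <name>My Application</name>
  <platforms>
    <platform>android</platform>
  </platforms>
</application>
"@

# Elements are addressed by name, not by their current text.
$xml.application.name = 'My New App'
$xml.application.platforms.platform = 'ios'

$xml.OuterXml
```

Calling $xml.Save() with a full path would then write the modified document back to disk, as in the answer above.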
$newitem = $xml.CreateElement("Value")
$newitem.set_InnerXML("111")
$xml.Items.ReplaceChild($newitem, $_)
something like this i think.. i didn't try it so i could be off track
Please use Keith Hill's answer, I'm only leaving mine here for reference. He's right, it's better to modify it through an XML API. I never use XML, I'm not familiar with it, so I didn't even think of it.
I gotta ask, did you try anything to do this? Did you look for an answer? This is pretty basic stuff that just a minute or two on Google probably would have gotten you an answer for.
(Get-Content "C:\Source\SomeFile.XML") -replace "My Application","Shiny New App"|Set-Content "C:\Source\SomeFile.XML"
Or if you wanted to change something less specific, such as the word "android" for the platform tag you could just include the tags to make sure it gets the right thing. (some shorthand used in this example)
(GC "C:\Source\SomeFile.XML") -replace "<platform>android</platform>","<platform>tacos</platform>"|SC "C:\Source\SomeFile.XML"
Seriously though, at least try and help yourself before coming and asking to be spoon-fed answers. I just hit up Google and searched for "powershell replace text in file" and the very first link would have given you the answer.
Edit:
Without knowing what you are looking for and going based solely off tags you will need to perform a RegEx (Regular Expression) search.
(GC "C:\Source\SomeFile.XML") -replace "(?<=`<platform`>).*?(?=`</platform`>)", 'New Platform'
That will pull the content of the file, look for any length of text that is preceded by <platform> and followed by </platform>, and replace that text with 'New Platform'. Note that the Greater Than and Less Than symbols are escaped with a grave character (to the left of the 1, and above the Tab on your keyboard). Here's a breakdown of the RegEx:
(?<=<platform>) Checks that immediately preceding the string that we're looking for is the string <platform>. This will not be replaced, it just makes sure we have the right starting point.
.*? searches for any number of characters except a new line, and accepts the possibility that it may be blank. This is our match that will be replaced.
(?=</platform>) Checks that immediately following the string it just found should be the string </platform>. This will not be replaced, it just makes sure our match ends at the correct place.
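A quick way to try the lookaround pattern is to run it on a single string. This sketch uses a single-quoted pattern without the backticks; as far as I can tell they are not needed, since < and > are not special in a PowerShell string or in a .NET regex outside a group construct. $text and $result are illustrative names.

```powershell
$text = '<platform>android</platform>'

# Lookbehind and lookahead anchor the match without being part of it,
# so only the text between the tags is replaced.
$result = $text -replace '(?<=<platform>).*?(?=</platform>)', 'New Platform'
$result
```

Because Get-Content returns one string per line, the same -replace applied to a file only matches tags that open and close on the same line.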