Strip lines from text file based on content - powershell

I like to use one of the packaged HOSTS files (MVPS, for example) to protect myself from some of the nastier domains. Unfortunately, these files are sometimes a bit overzealous for me (blocking googleadservices is a pain sometimes). I want an easy way to strip certain lines out of these files. In Linux I use:
cat hosts |grep -v <pattern> >hosts.new
And the file is rewritten minus the lines referencing the pattern I specified in the grep. So I just set it up to replace hosts with hosts.new on reboot and I'm done.
Is there an easy way to do this in PowerShell?

In PowerShell you'd do
(Get-Content hosts) -notmatch $pattern | Out-File hosts.new
or
(cat hosts) -notmatch $pattern > hosts.new
for short.
Of course, since Out-File (and with it the redirection operator) defaults to Unicode (UTF-16LE) output in Windows PowerShell, you may actually want to use Set-Content instead of Out-File:
(Get-Content hosts) -notmatch $pattern | Set-Content hosts.new
or
(gc hosts) -notmatch $pattern | sc hosts.new
And since the input file is read in a grouping expression (the parentheses around Get-Content hosts) you could actually write the output back to the source file:
(Get-Content hosts) -notmatch $pattern | Set-Content hosts
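A minimal sketch of the in-place variant, run against a scratch copy rather than the real hosts file. The file name, the sample entries, and the pattern are made up for illustration, and -Encoding Ascii is an assumption that suits a typical hosts file:

```powershell
# Scratch file standing in for the real hosts file (contents are illustrative only)
$hostsFile = Join-Path ([System.IO.Path]::GetTempPath()) 'hosts.sample'
@(
    '127.0.0.1 localhost'
    '0.0.0.0 ads.example.com'
    '0.0.0.0 tracker.example.net'
) | Set-Content -Path $hostsFile -Encoding Ascii

# -notmatch takes a regex; escape the literal text so '.' is not treated as a wildcard
$pattern = [regex]::Escape('ads.example.com')

# @(...) reads the whole file before anything is written, so the same path can be the target.
# It also guarantees an array, so -notmatch filters even if the file has only one line.
@(Get-Content -Path $hostsFile) -notmatch $pattern |
    Set-Content -Path $hostsFile -Encoding Ascii

Get-Content -Path $hostsFile
```

With a plain (Get-Content hosts) and a one-line file, the left-hand side is a single string, and -notmatch then returns a Boolean instead of filtering, which is why the sketch uses @(...).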

To complement Ansgar Wiechers' helpful answer (which offers pragmatic and concise solutions based on reading the entire input file into memory up-front):
PowerShell's grep equivalent is the Select-String cmdlet and, just like grep, it directly accepts a filename argument (PSv3+ syntax):
Select-String -NotMatch <pattern> hosts | ForEach-Object Line | Set-Content hosts.new
Select-String -NotMatch <pattern> hosts is short for
Select-String -NotMatch -Pattern <pattern> -LiteralPath hosts and is the virtual equivalent of
grep -v <pattern> hosts
However, Select-String doesn't output strings, it outputs [Microsoft.PowerShell.Commands.MatchInfo] instances that wrap matching lines (stored in property .Line) along with metadata about the match.
ForEach-Object Line extracts just the matching lines (the value of property .Line) from these objects.
Set-Content hosts.new writes the matching lines to file hosts.new, using "ANSI" encoding in Windows PowerShell - i.e., it uses the legacy code page implied by the active system locale, typically a supranational 8-bit superset of ASCII - and UTF-8 encoding (without BOM) in PowerShell Core.
Use the -Encoding parameter to specify a different encoding.
>, by contrast (an effective alias of the Out-File cmdlet), creates:
UTF-16LE ("Unicode") files by default in Windows PowerShell.
UTF-8 files (without BOM) in PowerShell Core; in other words, in PowerShell Core, using > hosts.new in lieu of | Set-Content hosts.new will do.
Note: While both > / Out-File and Set-Content are suitable for sending string input to an output file, they are not generally suitable for sending other data types to a file for programmatic processing: > / Out-File render objects the way they would print to the console / terminal, which is a format meant for display, whereas Set-Content stringifies (simply put: calls .ToString() on) the input objects, which often results in loss of information.
For non-string data, consider a (more) structured data format such as XML (Export-CliXml), JSON (ConvertTo-Json) or CSV (Export-Csv).
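As a rough illustration of the structured formats mentioned above, the following sketch round-trips two sample objects through CSV and JSON. The property names and the temp-file name are hypothetical:

```powershell
# Sample objects; the property names are made up for illustration
$items = @(
    [pscustomobject]@{ Name = 'alpha'; Size = 1 }
    [pscustomobject]@{ Name = 'beta';  Size = 2 }
)

$csvFile = Join-Path ([System.IO.Path]::GetTempPath()) 'items-demo.csv'

# CSV keeps each property as a named column (all values come back as strings)
$items | Export-Csv -LiteralPath $csvFile -NoTypeInformation
$fromCsv = @(Import-Csv -LiteralPath $csvFile)

# JSON keeps the property names and keeps Size as a number
$fromJson = $items | ConvertTo-Json | ConvertFrom-Json

$fromCsv[1].Name
$fromJson[1].Size
```

Both round trips preserve the individual properties, which piping the same objects to Set-Content or Out-File would not do in a form that is easy to parse back.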

Related

Issues with specific characters in outfile

I have a script that merges files, and that works fine, but characters like åäö do not look right in the output file.
Here is the complete script:
$startOfToday = (Get-Date).Date
Get-ChildItem "C:\TEST" -include *.* -Recurse |
Where-Object LastWriteTime -gt $startOfToday | ForEach-Object {gc $_; ""} |
Out-File "C:\$(Get-Date -Format 'yyyy-MM-dd').txt"
In the input files it looks like this, for example:
Order ID 1
Order ID 2
This is för får
In the output, the last row comes out like this:
Order ID 1
Order ID 2
får för fär
Is there a way to make those characters appear in the output file as they appear in the input files?
The implication is that your input files are UTF-8-encoded without a BOM, which in Windows PowerShell are (mis)interpreted to be ANSI-encoded (using the system's active ANSI code page, such as Windows-1252).
The solution is to tell gc (Get-Content) explicitly what encoding to use, via the -Encoding parameter:
Get-ChildItem C:\TEST -include *.* -Recurse |
Where-Object LastWriteTime -gt $startOfToday |
ForEach-Object { Get-Content -Encoding Utf8 $_; ""} |
Out-File "C:\$(Get-Date -Format 'yyyy-MM-dd').txt"
Note that PowerShell never preserves the input encoding automatically. Therefore, unless you pass -Encoding to Out-File, its default encoding is used, which is "Unicode" (UTF-16LE) in Windows PowerShell.
While PowerShell (Core) 7+ also doesn't preserve input encodings, it consistently defaults to BOM-less UTF-8, so your original code would work as-is there.
For more information about default encodings in Windows PowerShell vs. PowerShell (Core) 7+, see this answer.
Note: As AdminOfThings suggests in a comment, simply replacing Out-File with Set-Content in your original code also works in this particular case, because the same misinterpretation of the encoding is then performed on both in- and output, and the data is simply being passed through. This isn't a general solution, however, notably not if you need to process the strings in memory first, before saving them to a file.
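A small sketch of the explicit-encoding approach on both the input and the output side. The file names are hypothetical, and the sample text is built from a code point so that the example does not depend on how the script file itself is saved:

```powershell
$dir = [System.IO.Path]::GetTempPath()
$src = Join-Path $dir 'order-utf8.txt'
$dst = Join-Path $dir 'merged-utf8.txt'

# "för", with ö built from its code point (U+00F6)
$text = 'f' + [char]0xF6 + 'r'

# Write BOM-less UTF-8, the kind of input file the answer assumes
$utf8NoBom = New-Object System.Text.UTF8Encoding -ArgumentList $false
[System.IO.File]::WriteAllText($src, $text, $utf8NoBom)

# Decode explicitly as UTF-8 on the way in, and choose the output encoding explicitly too
Get-Content -LiteralPath $src -Encoding UTF8 |
    Out-File -LiteralPath $dst -Encoding UTF8

# Round-trip check: True when the text read back equals the original
(Get-Content -LiteralPath $dst -Encoding UTF8) -eq $text
```

In Windows PowerShell, -Encoding UTF8 on Out-File writes a BOM; that is harmless for this check because Get-Content recognizes and strips it.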

ANSI Encoding via PowerShell [duplicate]

In PowerShell, what's the difference between Out-File and Set-Content? Or Add-Content and Out-File -append?
I've found if I use both against the same file, the text is fully mojibaked.
(A minor second question: > is an alias for Out-File, right?)
Here's a summary of what I've deduced, after a few months experience with PowerShell, and some scientific experimentation. I never found any of this in the documentation :(
[Update: Much of this now appears to be better documented.]
Read and write locking
While Out-File is running, another application can read the log file.
While Set-Content is running, other applications cannot read the log file. Thus never use Set-Content to log long running commands.
Encoding
Out-File saves in the Unicode (UTF-16LE) encoding by default (though this can be specified), whereas Set-Content defaults to ASCII (US-ASCII) in PowerShell 3+ (this may also be specified). In earlier PowerShell versions, Set-Content wrote content in the Default (ANSI) encoding.
Editor's note: PowerShell as of version 5.1 still defaults to the culture-specific Default ("ANSI") encoding, despite what the documentation claims. If ASCII were the default, non-ASCII characters such as ü would be converted to literal ?, but that is not the case: 'ü' | Set-Content tmp.txt; (Get-Content tmp.txt) -eq '?' yields $False.
PS > $null | out-file outed.txt
PS > $null | set-content set.txt
PS > md5sum *
f3b25701fe362ec84616a93a45ce9998 *outed.txt
d41d8cd98f00b204e9800998ecf8427e *set.txt
This means the defaults of two commands are incompatible, and mixing them will corrupt text, so always specify an encoding.
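To illustrate the advice to always specify an encoding, this sketch writes the same string with both cmdlets using an explicit -Encoding and compares the raw bytes. The file names are hypothetical:

```powershell
$dir = [System.IO.Path]::GetTempPath()
$viaOutFile    = Join-Path $dir 'via-out-file.txt'
$viaSetContent = Join-Path $dir 'via-set-content.txt'

# Same text, same explicit encoding, two different cmdlets
'hello' | Out-File    -LiteralPath $viaOutFile    -Encoding Ascii
'hello' | Set-Content -LiteralPath $viaSetContent -Encoding Ascii

# Dump the raw bytes as hex so the two files can be compared directly
$hexA = [System.BitConverter]::ToString([System.IO.File]::ReadAllBytes($viaOutFile))
$hexB = [System.BitConverter]::ToString([System.IO.File]::ReadAllBytes($viaSetContent))
$hexA
$hexB

# True when both cmdlets produced byte-identical files
$hexA -eq $hexB
```

Dropping -Encoding from both commands in Windows PowerShell is a quick way to see the UTF-16LE versus ANSI difference described above.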
Formatting
As Bartek explained, Out-File saves the fancy formatting of the output, as seen in the terminal. So in a folder with two files, the command dir | out-file out.txt creates a file with 11 lines.
Whereas Set-Content saves a simpler representation. In that folder with two files, the command dir | set-content sc.txt creates a file with two lines. To emulate the output in the terminal:
PS > dir | ForEach-Object {$_.ToString()}
out.txt
sc.txt
I believe this formatting has a consequence for line breaks, but I can't describe it yet.
File creation
Set-Content doesn't reliably create an empty file when Out-File would:
In an empty folder, the command dir | out-file out.txt creates a file, while dir | set-content sc.txt does not.
Pipeline Variable
Set-Content takes the filename from the pipeline; allowing you to set a number of files' contents to some fixed value.
Out-File takes the data from the pipeline, updating a single file's content.
Parameters
Set-Content includes the following additional parameters:
Exclude
Filter
Include
PassThru
Stream
UseTransaction
Out-File includes the following additional parameters:
Append
NoClobber
Width
For more information about what those parameters are, please refer to help; e.g. get-help out-file -parameter append.
Out-File has the behavior of overwriting the output path unless the -NoClobber and/or the -Append flag is set. Add-Content will append content if the output path already exists by default (if it can). Both will create the file if one doesn't already exist.
Another interesting difference is that Add-Content will create an ASCII encoded file by default and Out-File will create a little-endian Unicode encoded file by default.
> is syntactic sugar for Out-File. It's Out-File with some pre-defined parameter settings.
Well, I would disagree... :)
Out-File has -Append (-NoClobber is there to avoid overwriting), which appends like Add-Content does. But this is not the same beast.
command | Add-Content will use .ToString() method on input. Out-File will use default formatting.
so:
ls | Add-Content test.txt
and
ls | Out-File test.txt
will give you totally different results.
And no, '>' is not an alias, it's the redirection operator (same as in other shells). And it has a very serious limitation: it will cut lines the same way they are displayed. Out-File has a -Width parameter that helps you avoid this. Also, with redirection operators you can't decide what encoding to use.
HTH
Bartek
Set-Content supports -Encoding Byte (in Windows PowerShell; PowerShell 6+ uses the -AsByteStream switch instead), while Out-File does not.
So when you want to write binary data or the result of an Encoding.GetBytes() call to a file, you should use Set-Content.
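A sketch of writing a byte array with Set-Content that picks the parameter by PowerShell version. The file name is hypothetical:

```powershell
$file  = Join-Path ([System.IO.Path]::GetTempPath()) 'bytes-demo.bin'
$bytes = [System.Text.Encoding]::UTF8.GetBytes('abc')

if ($PSVersionTable.PSVersion.Major -ge 6) {
    # PowerShell 6+ replaced -Encoding Byte with the -AsByteStream switch
    Set-Content -LiteralPath $file -Value $bytes -AsByteStream
} else {
    # Windows PowerShell
    Set-Content -LiteralPath $file -Value $bytes -Encoding Byte
}

# The file holds exactly the three bytes, with no newline appended
[System.IO.File]::ReadAllBytes($file).Length
```

Because parameters are bound at run time, the branch that does not apply to the running version is never evaluated and causes no error.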
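Wanted to add a note about the difference in default encodings:
Windows with PowerShell 5.1:
Out-File - Default encoding is utf-16le
Set-Content - Default encoding is the system's ANSI code page (looks like us-ascii for ASCII-only text)
Linux with PowerShell 7.1:
Out-File - Default encoding is utf-8 without BOM (looks like us-ascii for ASCII-only text)
Set-Content - Default encoding is utf-8 without BOM (looks like us-ascii for ASCII-only text)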
Out-File -Append or >> can actually mix two encodings in the same file. Even if the file is originally ASCII or ANSI, Windows PowerShell will append UTF-16LE ("Unicode") text by default at the bottom of it. Add-Content will check the existing encoding and match it before appending. By the way, in Windows PowerShell Export-Csv defaults to ASCII (no accents), and Set-Content/Add-Content default to ANSI.
TL;DR: use Set-Content, as it's more consistent than Out-File.
For ASCII-only text, Set-Content produces the same bytes across different PowerShell versions.
Out-File, as @JagWireZ says, produces different encodings with the default settings, even on the same OS (Windows): the docs for PowerShell 5.1 and PowerShell 7.3 state that the default encoding changed from unicode to utf8NoBOM.
Some issues, like malformed XML, arise from using Out-File. They can of course be fixed by setting the desired encoding, but it's easy to forget to set the encoding and end up with issues.

PowerShell mangling ASCII text

I'm getting extra characters and lines when trying to modify hosts files. For example, this select string does not take anything out, but the two files are different:
get-content -Encoding ascii C:\Windows\system32\drivers\etc\hosts |
select-string -Encoding ascii -notmatch "thereisnolinelikethis" |
out-file -Encoding ascii c:\temp\testfile
PS C:\temp> (get-filehash C:\windows\system32\drivers\etc\hosts).hash
C54C246D2941F02083B85CE2774D271BD574F905BABE030CC1BB41A479A9420E
PS C:\temp> (Get-FileHash C:\temp\testfile).hash
AC6A1134C0892AD3C5530E58759A09C73D8E0E818EC867C9203B9B54E4B83566
I can confirm that your commands do inexplicably result in extra line breaks in the output file, at the start and at the end. PowerShell also converts the tabs in the original file into four spaces.
While I cannot explain why, these commands do the same thing without these issues:
Get-Content -Path C:\Windows\System32\drivers\etc\hosts -Encoding Ascii |
Where-Object { -not $_.Contains("thereisnolinelikethis") } |
Out-File -FilePath "c:\temp\testfile" -Encoding Ascii
I think this is more of an issue with PowerShell's F&O (formatting & output) engine. Keep in mind that Select-String outputs a rich object called MatchInfo. When that object reaches the end of the output it needs to be rendered to a string. I think it is that rendering/formatting that injects the extra line. One of the properties on MatchInfo is the line that was matched (or notmatched). If you pass just the Line property down the pipe, it seems to work better (hashes match):
Get-Content C:\Windows\system32\drivers\etc\hosts |
Select-String -notmatch "thereisnolinelikethis" |
Foreach {$_.Line} |
Out-File -Encoding ascii c:\temp\testfile
BTW you only need to specify ASCII encoding when outputting back to the file. Everywhere else in PowerShell, just let the string flow as Unicode.
All that said, I would use Where-Object instead of Select-String for this scenario. Where-Object is a filtering command which is what you want. Select-String takes input of one form (string) and converts it to a different object (MatchInfo).
Out-File adds a trailing NewLine ("`r`n") to the testfile file.
C:\Windows\System32\drivers\etc\hosts does not contain a trailing newline out of the box, which is why you get a different FileHash
If you open the files with a StreamReader, you'll see that the underlying stream differs in length (due to the trailing newline in the new file):
PS C:\> $Hosts = [System.IO.StreamReader]"C:\Windows\System32\drivers\etc\hosts"
PS C:\> $Tests = [System.IO.StreamReader]"C:\temp\testfile"
PS C:\> $Hosts.BaseStream.Length
822
PS C:\> $Tests.BaseStream.Length
824
PS C:\> $Tests.BaseStream.Position = 822; $Tests.Read(); $Tests.Read()
13
10
ASCII characters 13 (0x0D) and 10 (0x0A) correspond to [System.Environment]::NewLine or CR+LF
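The trailing-newline effect can be reproduced on scratch files without touching the real hosts file. In this sketch the file names and contents are made up:

```powershell
$dir = [System.IO.Path]::GetTempPath()
$src = Join-Path $dir 'no-trailing-newline.txt'
$dst = Join-Path $dir 'copied-by-out-file.txt'

# Two lines, deliberately without a newline after the last one
$nl = [System.Environment]::NewLine
[System.IO.File]::WriteAllText($src, "first${nl}second", [System.Text.Encoding]::ASCII)

# Get-Content strips the line terminators; Out-File writes one after every line
Get-Content -LiteralPath $src | Out-File -LiteralPath $dst -Encoding Ascii

# True when the copy is longer by exactly one platform newline
$delta = (Get-Item -LiteralPath $dst).Length - (Get-Item -LiteralPath $src).Length
$delta -eq $nl.Length
```

On Windows the difference is the two bytes (CR+LF) seen in the StreamReader output above.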

How do I concatenate two text files in PowerShell?

I am trying to replicate the functionality of the cat command in Unix.
I would like to avoid solutions where I explicitly read both files into variables, concatenate the variables together, and then write out the concatenated variable.
Simply use the Get-Content and Set-Content cmdlets:
Get-Content inputFile1.txt, inputFile2.txt | Set-Content joinedFile.txt
You can concatenate more than two files with this style, too.
If the source files are named similarly, you can use wildcards:
Get-Content inputFile*.txt | Set-Content joinedFile.txt
Note 1: PowerShell 5 and older versions allowed this to be done more concisely using the aliases cat and sc for Get-Content and Set-Content respectively. However, these aliases are problematic because cat is a system command in *nix systems, and sc is a system command in Windows systems - therefore using them is not recommended, and in fact sc is no longer even defined as of PowerShell Core (v7). The PowerShell team recommends against using aliases in general.
Note 2: Be careful with wildcards - if you try to output to inputFiles.txt (or similar that matches the pattern), PowerShell will get into an infinite loop! (I just tested this.)
Note 3: Outputting to a file with > does not preserve character encoding! This is why using Set-Content is recommended.
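A self-contained sketch of the wildcard variant, using a scratch directory and made-up file contents. The explicit -Encoding Ascii is an assumption chosen for the sample data:

```powershell
$dir = Join-Path ([System.IO.Path]::GetTempPath()) 'concat-demo'
New-Item -ItemType Directory -Path $dir -Force | Out-Null

'one' | Set-Content -LiteralPath (Join-Path $dir 'inputFile1.txt') -Encoding Ascii
'two' | Set-Content -LiteralPath (Join-Path $dir 'inputFile2.txt') -Encoding Ascii

# The output name must not match the input wildcard (see Note 2)
$joined = Join-Path $dir 'joinedFile.txt'
Get-Content -Path (Join-Path $dir 'inputFile*.txt') |
    Set-Content -LiteralPath $joined -Encoding Ascii

Get-Content -LiteralPath $joined
```

Keeping the output in a file whose name falls outside the inputFile*.txt pattern is what avoids the runaway loop described in Note 2.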
Do not use >; it messes up the character encoding. Use:
Get-Content files.* | Set-Content newfile.file
In cmd, you can do this:
copy one.txt+two.txt+three.txt four.txt
In PowerShell this would be:
cmd /c copy one.txt+two.txt+three.txt four.txt
While the PowerShell way would be to use gc, the above will be pretty fast, especially for large files. And it can be used on non-ASCII files too, using the /B switch.
You could use the Add-Content cmdlet. Maybe it is a little faster than the other solutions, because it doesn't read the content of the first file.
gc .\file2.txt| Add-Content -Path .\file1.txt
To concat files in command prompt it would be
type file1.txt file2.txt file3.txt > files.txt
PowerShell aliases type to Get-Content, which means you will get an error when using the cmd-style type command in PowerShell, because Get-Content requires commas separating the file names. The same command in PowerShell would be
Get-Content file1.txt,file2.txt,file3.txt | Set-Content files.txt
I used:
Get-Content c:\FileToAppend_*.log | Out-File -FilePath C:\DestinationFile.log -Encoding ASCII -Append
This appended fine. I added the ASCII encoding to remove the nul characters Notepad++ was showing without the explicit encoding.
If you need to order the files by specific parameter (e.g. date time):
gci *.log | sort LastWriteTime | % {$(Get-Content $_)} | Set-Content result.log
You can do something like:
get-content input_file1 > output_file
get-content input_file2 >> output_file
Here > is effectively shorthand for "Out-File", and >> for "Out-File -Append".
Since most of the other replies often get the formatting wrong (due to the piping), the safest thing to do is as follows:
add-content $YourMasterFile -value (get-content $SomeAdditionalFile)
I know you wanted to avoid reading the content of $SomeAdditionalFile into a variable, but in order to preserve, for example, your newline formatting, I do not think there is a proper way to do it without.
A workaround would be to loop through your $SomeAdditionalFile line by line and piping that into your $YourMasterFile. However this is overly resource intensive.
To keep encoding and line endings:
Get-Content files.* -Raw | Set-Content newfile.file -NoNewline
Note: AFAIR, those parameters aren't supported by old PowerShell versions (<3? <4?)
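A sketch of the -Raw / -NoNewline combination on scratch files, checking that line endings and a missing final newline survive. The file names and contents are made up, and the sample is ASCII-only so that the default encodings do not come into play:

```powershell
$dir = Join-Path ([System.IO.Path]::GetTempPath()) 'raw-concat-demo'
New-Item -ItemType Directory -Path $dir -Force | Out-Null
$partA  = Join-Path $dir 'part-a.txt'
$partB  = Join-Path $dir 'part-b.txt'
$merged = Join-Path $dir 'merged.txt'

# LF line endings in the first file, no final newline in the second
[System.IO.File]::WriteAllText($partA, "a1`na2`n")
[System.IO.File]::WriteAllText($partB, 'b1')

# -Raw reads each file as one string; -NoNewline stops Set-Content from adding separators
Get-Content -LiteralPath $partA, $partB -Raw |
    Set-Content -LiteralPath $merged -NoNewline

# True when the merged file is the exact concatenation of the two inputs
[System.IO.File]::ReadAllText($merged) -eq "a1`na2`nb1"
```

The text is still decoded and re-encoded along the way, so for non-ASCII input it is safer to pass a matching -Encoding to both cmdlets.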
I think the "powershell way" could be :
set-content destination.log -value (get-content c:\FileToAppend_*.log )