Powershell Out-File force end of line character - powershell

I discovered that I could force a Unicode file to ASCII using the script below, which is really great. I assume it's based on my environment or Windows default, but it's adding a CR and LF at the end of each line. Is there a way to force just a LF character rather than both without loading the entire file into memory? I have seen some solutions that load the entire file into memory and basically do a string replace, which won't work because some of my files are multiple GB.
Thanks!
get-content -encoding utf8 $inputFile | Out-file -force -encoding ASCII $outputFile

I suggest you use .NET System.File.IO classes from within your script. In particular the System.File.IO.StreamWriter class has a property, NewLine which you can set to whatever characters you want the line terminator characters to be. (Although to be readable by StreamReader the line terminator chars must be \n or \r\n (in C/C++ notation because of conflict with SO and PS on backtick)).
Secondary benefit of using IO.StreamWriter, according to this blog is much better perf.
Basic code flow is something like this (not tested):
# Note that IO.StreamWriter will use process's current working directory,
# not PS's. So safer to specify full paths
$inStream = [System.IO.StreamReader] "c:\temp\orig.txt"
$outStream = new-object System.IO.StreamWriter "c:\temp\copy.txt",
[text.encoding]::ASCII
$outStream.NewLine = '`n'
while (-not $inStream.endofstream) {
$outStream.WriteLine( $instream.Readline())
}
$inStream.close()
$outStream.close()
This script should have constant memory requirements, but hard to know what .NET might do under the covers.

Related

Powershell: Efficient way to delete first 10 rows of a HUGE textfile

I need to delete the first couple of lines of a .txt-file in powershell. There are plenty of questions and answers already on SA how to do it. Most of them copy the whole filecontent into memory, cut out the first x lines and then save the content into the textfile again.
However, in my case the textfiles are huge (500MB+), so loading them completly into memory, just to delete the first couple of lines, takes very long and feels like a huge waste of resources.
Is there a more elegant approach? If you only want to read the first x lines, you can use
Get-Content in.csv -Head 10
, which only reads the first 10 lines. Is there something similar for deletion?
Here is a another way to do it using StreamReader and StreamWriter, as noted in comments, it's important to know the encoding of your file for this use case.
See Remarks from the Official Documentation:
The StreamReader object attempts to detect the encoding by looking at the first four bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, big-endian Unicode, little-endian UTF-32, and big-endian UTF-32 text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
If you need to specify an Encoding you can target the StreamReader(String, Encoding) Constructor. For example:
$reader = [System.IO.StreamReader]::new('path\to\input.csv', [System.Text.Encoding]::UTF8)
As noted previously in Remarks, this might not be needed for common encodings.
An alternative to below code, could be the use of $reader.ReadToEnd() as Brice points out in his comment, after skipping the first 10 lines, this would read the entire contents of the file in memory before writing to the new file. I haven't used this method for this answer since, mklement0's helpful answer provides a very similar solution to the problem and this answer was intended to be a memory friendly solution.
try {
$reader = [System.IO.StreamReader]::new('absolute\path\to\input.csv')
$writer = [System.IO.StreamWriter]::new('absolute\path\to\output.csv')
# skip 10 lines
foreach($i in 1..10) {
$null = $reader.ReadLine()
}
while(-not $reader.EndOfStream) {
$writer.WriteLine($reader.ReadLine())
}
}
finally {
($reader, $writer).foreach('Dispose')
}
It's very also worth noting zett42's helpful comment using $reader.ReadBlock(Char[], Int32, Int32) method and $writer.Write(..) instead of $write.WriteLine(..) could be an even faster and still memory friendly alternative to read and write in chunks.
You're essentially attempting to remove the starting bytes of the file without modifying the remaining bytes, Raymond C has a good read posted here about why that can't be done.
The underlying abstract model for storage of file contents is in the form of a chunk of bytes, each indexed by the file offset. The reason appending bytes and truncating bytes is so easy is that doing so doesn’t alter the file offsets of any other bytes in the file. If a file has ten bytes and you append one more, the offsets of the first ten bytes stay the same. On the other hand, deleting bytes from the front or middle of a file means that all the bytes that came after the deleted bytes need to “slide down” to close up the space. And there is no “slide down” file system function.
As Mike Anthony's helpful answer explains, there is no system-level function that efficiently implements what you're trying to do, so you have no choice but to rewrite your file.
While memory-intensive, the following solution is reasonably fast:
Read the file as a whole into memory, as a single string, using Get-Content's -Raw switch...
This is orders of magnitude faster than the line-by-line streaming that Get-Content performs by default.
... then use regex processing to strip the first 10 lines ...
... and save the trimmed content back to disk.
Important:
Since this rewrites the file in place, be sure to have a backup copy of your file.
Use -Encoding with Get-Content / Set-Content to correctly interpret the input / control the output character encoding (PowerShell fundamentally doesn't preserve the information about the character encoding of a file that was read with Get-Content). Without -Encoding, the default encoding is the system's active ANSI code page in Windows PowerShell, and, more sensibly, BOM-less UTF-8 in PowerShell (Core) 7+.
# Use -Encoding as needed.
(Get-Content -Raw in.csv) -replace '^(?:.*\r?\n){10}' |
Set-Content -NoNewLine in.csv
If the file is too large to fit into memory:
If you happen to have WSL installed, an efficient, streaming tail solution is possible:
Note:
Your input file must use a character encoding in which a LF character is represented as a single 0xA byte - which is true of most single-byte encodings and also of the variable-width UTF-8 encoding, but not of, say, UTF-16.
You must output to a different file (which you can later replace the input file with).
bash.exe -c 'tail +11 in.csv > out.csv'
Otherwise, line-by-line processing is required.
Note: I'm leaving aside other viable approaches, namely those that either read and write the file in large blocks, as zett42 recommends, or an approach that collects (large) groups of output lines before writing them to the output file in a single operation, as shown in Theo's helpful answer.
Caveat:
All line-by-line processing approaches risk inadvertently changing the newline format of the original file: on writing the lines back to a file, it is invariably the platform-native newline format that is used (CLRF on Windows, LF on Unix-like platforms).
Also, the information as to whether the input file had a trailing newline or not is lost.
Santiago's helpful answer shows a solution based on .NET APIs, which performs well by PowerShell standards.
Brice came up with an elegant and significant optimization that lets a .NET method perform the (lazy) iteration over the file's lines, which is much faster than looping in PowerShell code:
[System.IO.File]::WriteAllLines(
"$pwd/out.csv",
[Linq.Enumerable]::Skip(
[System.IO.File]::ReadLines("$pwd/in.csv"),
10
)
)
For the sake of completeness, here's a comparatively slower, PowerShell-native solution using a switch statement with the -File parameter for fast line-by-line reading (much faster than Get-Content):
& {
$i = 0
switch -File in.csv {
default { if (++$i -ge 11) { $_ } }
}
} | Set-Content out.csv # use -Encoding as needed
Note:
Since switch doesn't allow specifying a character encoding for the input file, this approach only works if the character encoding is correctly detected / assumed by default. While BOM-based files will be read correctly, note that switch makes different assumptions about BOM-less files based on the PowerShell edition: in Windows PowerShell, the system's active ANSI code page is assumed; in PowerShell (Core) 7+, it is UTF-8.
Because language statements cannot directly serve as pipeline input, the switch statement must be called via a script block (& { ... })
Streaming the resulting lines to Set-Content via the pipeline is what slows the solution down. Passing the new file content as an argument, to Set-Content's -Value parameter would drastically speed up the operation - but that would again require that the file fit into memory as a whole:
# Faster reformulation, but *input file must fit into memory as whole*.
# `switch` offers a lot of flexibility. If that isn't needed
# and reading the file in full is acceptable, the
# the Get-Content -Raw solution at the top is the fastest Powershell solution.
Set-Content out.csv $(
$i = 0
switch -File in.csv {
default { if (++$i -ge 11) { $_ } }
}
)
There may be another alternative by using switch to read the files line-by line and buffering a certain maximum amount of lines in a List.
This would be lean on memory consumtion and at the same time limit the number of disk writes to speed up the process.
Something like this perhaps
$maxBuffer = 10000 # the maximum number of lines to buffer
$linesBuffer = [System.Collections.Generic.List[string]]::new()
# get an array of the files you need to process
$files = Get-ChildItem -Path 'X:\path\to\the\input\files' -Filter '*.txt' -File
foreach ($file in $files) {
# initialize a counter for omitting the first 10 lines lines and clear the buffer
$omitCounter = 0
$linesBuffer.Clear()
# create a new file path by appending '_New' to the input file's basename
$outFile = '{0}\{1}_New{2}' -f $file.DirectoryName, $file.BaseName, $file.Extension
switch -File $file.FullName {
default {
if ($omitCounter -ge 10) {
if ($linesBuffer.Count -eq $maxBuffer) {
# write out the buffer to the new file and clear it for the next batch
Add-Content -Path $outFile -Value $linesBuffer
$linesBuffer.Clear()
}
$linesBuffer.Add($_)
}
else { $omitCounter++ } # no output, just increment the counter
}
}
# here, check if there is still some data left in the buffer
if ($linesBuffer.Count) { Add-Content -Path $outFile -Value $linesBuffer }
}

Encode File save utf8

I'm breaking my head: D
I am trying to encode a text file that will be saved in the same way as Notepad saves
It looks exactly the same but it's not the same only if I go into the file via Notepad and save again it works for me what could be the problem with encoding? Or how can I solve it? Is there an option for a command that opens Notepad and saves again?
i use now
(Get-Content 000014.log) | Out-FileUtf8NoBom ddppyyyyy.txt
and after this
Get-ChildItem ddppyyyyy.txt | ForEach-Object {
# get the contents and replace line breaks by U+000A
$contents = [IO.File]::ReadAllText($_) -replace "`r`n?", "`n"
# create UTF-8 encoding without signature
$utf8 = New-Object System.Text.UTF8Encoding $false
# write the text back
[IO.File]::WriteAllText($_, $contents, $utf8)
}
When you open a file with notepad.exe it autodetects the encoding (or do you open the file explicitly File->Open.. as UTF-8?). If your file is actually not UFT-8 but something else notepad could be able to work around this and converts it to the required encoding when the file is resaved. So, when you do not specify the correct input encoding in your PoSh script things are will go wrong.
But that's not all; notepad also drops erroneous characters when the file is saved to create a regular text file. For instance, your text file might contain a NULL character that only gets removed when you use notepad. If this is the case it is highly unlikely that your input file is UTF-8 encoded (unless it is broken). So, it looks like your problem is your source file is UTF16 or similar; try to find the right input encoding and rewrite it, e.g. UTF-16 to UTF-8
Get-Content file.foo -Encoding Unicode | Set-Content -Encoding UTF8 newfile.foo
Try it like this:
Get-ChildItem ddppyyyyy.txt | ForEach-Object {
# get the contents and replace Windows line breaks by U+000A
$raw= (Get-Content -Raw $_ -Encoding UTF8) -replace "`r?`n", "`n" -replace "`0", ""
# create UTF-8 encoding without BOM signature
$utf8NoBom = New-Object System.Text.UTF8Encoding $false
# write the text back
[System.IO.File]::WriteAllLines($_, $raw, $utf8NoBom)
}
If you are struggling with the Byte-order-mark it is best to use a hex editor to check the file header manually; checking your file after I have saved it like shown above and then opening it with Notepad.exe and saving it under a new name shows no difference anymore:
The hex-dumped beginning of a file with BOM looks like this instead:
Also, as noted, while your regex pattern should work it want to convert Windows newlines to Unix style it is much more common and safer to make the CR optional: `r?`n
Als noted by mklement0 reading the file using the correct encoding is important; if your file is actually in Latin1 or something you will end up with a broken file if you carelessly convert it to UTF-8 in PoSH.
Thus, I have added the -Encoding UTF8 param to the Get-Content Cmdlet; adjust as needed.
Update: There is nothing wrong with the code in the question, the true problem was embedded NUL characters in the files, which caused problems in R, and which opening and resaving in Notepad implicitly removed, thereby resolving the problem (assuming that simply discarding these NULs works as intended) - see also: wp78de's answer.
Therefore, modifying the $contents = ... line as follows should fix your problem:
$contents = [IO.File]::ReadAllText($_) -replace "`r`n", "`n" -replace "`0"
Note: The code in the question uses the Out-FileUtf8NoBom function from this answer, which allows saving to BOM-less UTF-8 files in Windows PowerShell; it now supports a -UseLF switch, which would simplify the OP's command to (additional problems notwithstanding):
Get-Content 000014.log | Out-FileUtf8NoBom ddppyyyyy.txt -UseLF
There's a conceptual flaw in your regex, though it is benign in this case: instead of "`r`n?" you want "`r?`n" (or, expressed as a pure regex, '\r?\n') in order to match both CRLF ("`r`n") and LF-only ("`n") newlines.
Your regex would instead match CRLF and CR-only(!) newlines; however, as wp78de points out, if your input file contains only the usual CRLF newlines (and not also isolated CR characters), your replacement operation should still work.
In fact, you don't need a regex at all if all you need is to replace CRLF sequences with LF: -replace "`r`n", "`n"
Assuming that your original input files are ANSI-encoded, you can simplify your approach as follows, without the need to call Out-FileUtf8NoBom first (assumes Windows PowerShell):
# NO need for Out-FileUtf8NoBom - process the ANSI-encoded files directly.
Get-ChildItem *SomePattern*.txt | ForEach-Object {
# Get the contents and make sure newlines are LF-only
# [Text.Encoding]::Default is the encoding for the active ANSI code page
# in Windows PowerShell.
$contents = [IO.File]::ReadAllText(
$_.FullName,
[Text.Encoding]::Default
) -replace "`r`n", "`n"
# Write the text back with BOM-less UTF-8 (.NET's default)
[IO.File]::WriteAllText($_.FullName, $contents, $utf8)
}
Note that replacing the content of files in-place bears a risk of data loss, so it's best to create backup copies of the original files first.
Note: If you wanted to perform the same operation in PowerShell [Core] v6+, which is built on .NET Core, the code must be modified slightly, because [Text.Encoding]::Default no longer reflects the active ANSI code page and instead invariably returns a BOM-less UTF-8 encoding.
Therefore, the $contents = ... statement would have to change to (note that this would work in Windows PowerShell too):
$contents = [IO.File]::ReadAllText(
$_.FullName,
[Text.Encoding]::GetEncoding(
[cultureinfo]::CurrentCulture.TextInfo.AnsiCodePage
)
) -replace "`r`n", "`n"

Powershell Out-file special characters

I have a script that processes data from files and writes result based on a condition to txt. Given data are strings with words like: "Distribución" or "México". When processed, those special characters like "é" and "ó" are broken (typical white square or question mark).
How can i encode the output file to make it work with those characters? I tried encoding in Utf8, utf8 without BOM, it doesn't work. Here is to file writing line:
...| Out-file -encoding XXX .\result.txt
in XXX i tried ASCII, Utf8, nothing works :/
Out-File will always add a BOM. It's a particularly annoying "feature" of that Cmdlet. Unfortunately - to my knowledge - there is no quick way to save a file using UTF8 WITHOUT a BOM in powershell. You can, however, leverage .Net to do this. This isn't really production ready, but here's a quick example:
$outputPath = "D:\temp.txt"
$data = "Distribución or México"
[System.IO.File]::WriteAllLines($outputPath, $data)
Wrap it in a Cmdlet, function and / or module to make it reusable. Of course you can take more control over the file encoding with .Net too.

Find and Replace in a Large File

I want to find a piece of text in a large xml file and want to replace with some other text. The size of the file is around (50GB). I want to do this in command line. I am looking at PowerShell and want to know if it can handle the large size.
Currently I am trying something like this but it does not like it
Get-Content C:\File1.xml | Foreach-Object {$_ -replace "xmlns:xsi=\"http:\/\/www\.w3\.org\/2001\/XMLSchema-instance\"", ""} | Set-Content C:\File1.xml
The text I want to replace is xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" with an empty string "".
Questions
Can PowerShell handle large
files
I don't want the replace to happen in
memory and prefer streaming assuming
that will not bring the server to
its knees.
Are there any other approaches I can take (different
tools/strategy?)
Thanks
I had a similar need (and similar lack of powershell experience) but cobbled together a complete answer from the other answers on this page plus a bit more research.
I also wanted to avoid the regex processing, since I didn't need it either -- just a simple string replace -- but on a large file, so I didn't want it loaded into memory.
Here's the command I used (adding linebreaks for readability):
Get-Content sourcefile.txt
| Foreach-Object {$_.Replace('http://example.com', 'http://another.example.com')}
| Set-Content result.txt
Worked perfectly! Never sucked up much memory (it very obviously didn't load the whole file into memory), and just chugged along for a few minutes then finished.
Aside from worrying about reading the file in chunks to avoid loading it into memory, you need to dump to disk often enough that you aren't storing the entire contents of the resulting file in memory.
Get-Content sourcefile.txt -ReadCount 10000 |
Foreach-Object {
$line = $_.Replace('http://example.com', 'http://another.example.com')
Add-Content -Path result.txt -Value $line
}
The -ReadCount <number> sets the number of lines to read at a time. Then the ForEach-Object writes each line as it is read. For a 30GB file filled with SQL Inserts, I topped out around 200MB of memory and 8% CPU. While, piping it all into Set-Content at hit 3GB of memory before I killed it.
It does not like it because you can't read from a file and write back to it at the same time using Get-Content/Set-Content. I recommend using a temp file and then at the end, rename file1.xml to file1.xml.bak and rename the temp file to file1.xml.
Yes as long as you don't try to load the whole file at once. Line-by-line will work but is going to be a bit slow. Use the -ReadCount parameter and set it to 1000 to improve performance.
Which command line? PowerShell? If so then you can invoke your script like so .\myscript.ps1 and if it takes parameters then c:\users\joe\myscript.ps1 c:\temp\file1.xml.
In general for regexes I would use single quotes if you don't need to reference PowerShell variables. Then you only need to worry about regex escaping and not PowerShell escaping as well. If you need to use double-quotes then the back-tick character is the escape char in double-quotes e.g. "`$p1 is set to $ps1". In your example single quoting simplifies your regex to (note: forward slashes aren't metacharacters in regex):
'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'
Absolutely you want to stream this since 50GB won't fit into memory. However, this poses an issue if you process line-by-line. What if the text you want to replace is split across multiple lines?
If you don't have the split line issue then I think PowerShell can handle this.
This is my take on it, building on some of the other answers here:
Function ReplaceTextIn-File{
Param(
$infile,
$outfile,
$find,
$replace
)
if( -Not $outfile)
{
$outfile = $infile
}
$temp_out_file = "$outfile.temp"
Get-Content $infile | Foreach-Object {$_.Replace($find, $replace)} | Set-Content $temp_out_file
if( Test-Path $outfile)
{
Remove-Item $outfile
}
Move-Item $temp_out_file $outfile
}
And called like so:
ReplaceTextIn-File -infile "c:\input.txt" -find 'http://example.com' -replace 'http://another.example.com'
The escape character in powershell strings is the backtick ( ` ), not backslash ( \ ). I'd give an example, but the backtick is also used by the wiki markup. :(
The only thing you should have to escape is the quotes - the periods and such should be fine without.

PowerShell search script that ignores binary files

I am really used to doing grep -iIr on the Unix shell but I haven't been able to get a PowerShell equivalent yet.
Basically, the above command searches the target folders recursively and ignores binary files because of the "-I" option. This option is also equivalent to the --binary-files=without-match option, which says "treat binary files as not matching the search string"
So far I have been using Get-ChildItems -r | Select-String as my PowerShell grep replacement with the occasional Where-Object added. But I haven't figured out a way to ignore all binary files like the grep -I command does.
How can binary files be filtered or ignored with Powershell?
So for a given path, I only want Select-String to search text files.
EDIT: A few more hours on Google produced this question How to identify the contents of a file is ASCII or Binary. The question says "ASCII" but I believe the writer meant "Text Encoded", like myself.
EDIT: It seems that an isBinary() needs to be written to solve this issue. Probably a C# commandline utility to make it more useful.
EDIT: It seems that what grep is doing is checking for ASCII NUL Byte or UTF-8 Overlong. If those exists, it considers the file binary. This is a single memchr() call.
On Windows, file extensions are usually good enough:
# all C# and related files (projects, source control metadata, etc)
dir -r -fil *.cs* | ss foo
# exclude the binary types most likely to pollute your development workspace
dir -r -exclude *exe, *dll, *pdb | ss foo
# stick the first three lines in your $profile (refining them over time)
$bins = new-list string
$bins.AddRange( [string[]]#("exe", "dll", "pdb", "png", "mdf", "docx") )
function IsBin([System.IO.FileInfo]$item) { !$bins.Contains($item.extension.ToLower()) }
dir -r | ? { !IsBin($_) } | ss foo
But of course, file extensions are not perfect. Nobody likes typing long lists, and plenty of files are misnamed anyway.
I don't think Unix has any special binary vs text indicators in the filesystem. (Well, VMS did, but I doubt that's the source of your grep habits.) I looked at the implementation of Grep -I, and apparently it's just a quick-n-dirty heuristic based on the first chunk of the file. Turns out that's a strategy I have a bit of experience with. So here's my advice on choosing a heuristic function that is appropriate for Windows text files:
Examine at least 1KB of the file. Lots of file formats begin with a header that looks like text but will bust your parser shortly afterward. The way modern hardware works, reading 50 bytes has roughly the same I/O overhead as reading 4KB.
If you only care about straight ASCII, exit as soon you see something outside the character range [31-127 plus CR and LF]. You might accidentally exclude some clever ASCII art, but trying to separate those cases from binary junk is nontrivial.
If you want to handle Unicode text, let MS libraries handle the dirty work. It's harder than you think. From Powershell you can easily access the IMultiLang2 interface (COM) or Encoding.GetEncoding static method (.NET). Of course, they are still just guessing. Raymond's comments on the Notepad detection algorithm (and the link within to Michael Kaplan) are worth reviewing before deciding exactly how you want to mix & match the platform-provided libraries.
If the outcome is important -- ie a flaw will do something worse than just clutter up your grep console -- then don't be afraid to hard-code some file extensions for the sake of accuracy. For example, *.PDF files occasionally have several KB of text at the front despite being a binary format, leading to the notorious bugs linked above. Similarly, if you have a file extension that is likely to contain XML or XML-like data, you might try a detection scheme similar to Visual Studio's HTML editor. (SourceSafe 2005 actually borrows this algorithm for some cases)
Whatever else happens, have a reasonable backup plan.
As an example, here's the quick ASCII detector:
function IsAscii([System.IO.FileInfo]$item)
{
begin
{
$validList = new-list byte
$validList.AddRange([byte[]] (10,13) )
$validList.AddRange([byte[]] (31..127) )
}
process
{
try
{
$reader = $item.Open([System.IO.FileMode]::Open)
$bytes = new-object byte[] 1024
$numRead = $reader.Read($bytes, 0, $bytes.Count)
for($i=0; $i -lt $numRead; ++$i)
{
if (!$validList.Contains($bytes[$i]))
{ return $false }
}
$true
}
finally
{
if ($reader)
{ $reader.Dispose() }
}
}
}
The usage pattern I'm targeting is a where-object clause inserted in the pipeline between "dir" and "ss". There are other ways, depending on your scripting style.
Improving the detection algorithm along one of the suggested paths is left to the reader.
edit: I started replying to your comment in a comment of my own, but it got too long...
Above, I looked at the problem from the POV of whitelisting known-good sequences. In the application I maintained, incorrectly storing a binary as text had far worse consequences than vice versa. The same is true for scenarios where you are choosing which FTP transfer mode to use, or what kind of MIME encoding to send to an email server, etc.
In other scenarios, blacklisting the obviously bogus and allowing everything else to be called text is an equally valid technique. While U+0000 is a valid code point, it's pretty much never found in real world text. Meanwhile, \00 is quite common in structured binary files (namely, whenever a fixed-byte-length field needs padding), so it makes a great simple blacklist. VSS 6.0 used this check alone and did ok.
Aside: *.zip files are a case where checking for \0 is riskier. Unlike most binaries, their structured "header" (footer?) block is at the end, not the beginning. Assuming ideal entropy compression, the chance of no \0 in the first 1KB is (1-1/256)^1024 or about 2%. Luckily, simply scanning the rest of the 4KB cluster NTFS read will drive the risk down to 0.00001% without having to change the algorithm or write another special case.
To exclude invalid UTF-8, add \C0-C1 and \F8-FD and \FE-FF (once you've seeked past the possible BOM) to the blacklist. Very incomplete since you're not actually validating the sequences, but close enough for your purposes. If you want to get any fancier than this, it's time to call one of the platform libraries like IMultiLang2::DetectInputCodepage.
Not sure why \C8 (200 decimal) is on Grep's list. It's not an overlong encoding. For example, the sequence \C8 \80 represents Ȁ (U+0200). Maybe something specific to Unix.
Ok, after a few more hours of research I believe I've found my solution. I won't mark this as the answer though.
Pro Windows Powershell had a very similar example. I had completely forgot that I had this excellent reference. Please buy it if you are interested in Powershell. It went into detail on Get-Content and Unicode BOMs.
This Answer to a similar questions was also very helpful with the Unicode identification.
Here is the script. Please let me know if you know of any issues it may have.
# The file to be tested
param ($currFile)
# encoding variable
$encoding = ""
# Get the first 1024 bytes from the file
$byteArray = Get-Content -Path $currFile -Encoding Byte -TotalCount 1024
if( ("{0:X}{1:X}{2:X}" -f $byteArray) -eq "EFBBBF" )
{
# Test for UTF-8 BOM
$encoding = "UTF-8"
}
elseif( ("{0:X}{1:X}" -f $byteArray) -eq "FFFE" )
{
# Test for the UTF-16
$encoding = "UTF-16"
}
elseif( ("{0:X}{1:X}" -f $byteArray) -eq "FEFF" )
{
# Test for the UTF-16 Big Endian
$encoding = "UTF-16 BE"
}
elseif( ("{0:X}{1:X}{2:X}{3:X}" -f $byteArray) -eq "FFFE0000" )
{
# Test for the UTF-32
$encoding = "UTF-32"
}
elseif( ("{0:X}{1:X}{2:X}{3:X}" -f $byteArray) -eq "0000FEFF" )
{
# Test for the UTF-32 Big Endian
$encoding = "UTF-32 BE"
}
if($encoding)
{
# File is text encoded
return $false
}
# So now we're done with Text encodings that commonly have '0's
# in their byte steams. ASCII may have the NUL or '0' code in
# their streams but that's rare apparently.
# Both GNU Grep and Diff use variations of this heuristic
if( $byteArray -contains 0 )
{
# Test for binary
return $true
}
# This should be ASCII encoded
$encoding = "ASCII"
return $false
Save this script as isBinary.ps1
This script got every text or binary file I tried correct.
i agree that the other answers are more 'complete' but - because i do not know what file extensions i will encounter within a folder and i want to look thru them all, this is the easiest solution for me.
how about instead of avoiding searching thru binary files you just ignore the errors that you get from searching thru binary files?
it doesn't take long to run a search even if there are binary files within the folder being searched.
in the end, all that you care about is the strings that match the pattern (which there is next to no chance of it would find a string that matches the pattern inside of a binary file).
GCI -Recurse -Force -ErrorAction SilentlyContinue | ForEach-Object { GC $_ -ErrorAction SilentlyContinue | Select-String -Pattern "Pattern" } | Out-File -FilePath C:\temp\grep.txt -Width 999999