Forcing Input/Output encoding in Powershell to specific locale/codepage?

Working with files named in Japanese, and having trouble getting the encoding to process right. After running
chcp 50222
$OutputEncoding = [console]::outputencoding
[Console]::OutputEncoding = [Text.Encoding]::GetEncoding(50222)
I can view Japanese properly on the console and see things like "【お題箱】琴浦さん"; it shows up fine in the dir listing, as it should, and when I redirect it to a file it is stored properly.
HOWEVER, when I try to pipe things through the Tee-Object cmdlet, to see them on the console and feed them to a file at the same time, I get "・・・・・。・・・エ・オヲ・ケ・・セ・・ュ・" instead.
Best I can tell, it's being re-encoded to something else between being output to the console and being fed into tee, so what can I do to fix that? Or is there something that would do this better than tee?
(I've also noticed that things fed into tee from a 3rd-party download manager I have show up on screen with a significant delay. It will pause a while, show a few screens in a burst, pause for a while, show another few screens, and so on.)

Based on Get-Help Tee-Object -Full, the command always uses Unicode (meaning UTF-16 LE or code page 1200) encoding. Code page 50222 (iso-2022-jp/Japanese (JIS-Allow 1 byte Kana - SO/SI)) isn't a standard encoding that Add-Content or Out-File supports, either, so the common workaround of Add-Content -Passthru won't work. I suspect you'll have to use a StreamWriter to even be able to write this encoding to a file.
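For encodings that Add-Content does support (UTF8, Unicode, and so on), that workaround looks roughly like the sketch below; this is only the general pattern, not a fix for code page 50222, and the output path is a placeholder.
previousCommand.exe | Add-Content -Path .\out.txt -Encoding Unicode -PassThru
-PassThru sends each line back down the pipeline (so it still reaches the console) while it is also appended to the file in the requested encoding.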
I also have no idea if the console host that PowerShell uses actually respects chcp.exe or if it supports code page 50222.
Keep in mind, too, that internally all strings in .Net are Unicode (code page 1200). If there are glyphs that cannot be represented with code page 1200 that can be represented with code page 50222, you may have problems.
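As a quick sanity check of whether a given string survives conversion to code page 50222, you can round-trip it through the encoding. This is just a sketch of my own, using the sample string from the question:
$enc       = [System.Text.Encoding]::GetEncoding(50222)
$original  = '【お題箱】琴浦さん'
# Encode to 50222 bytes and decode back; unmappable characters become '?'
$roundTrip = $enc.GetString($enc.GetBytes($original))
$original -eq $roundTrip   # $false means something was lost in the conversion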
Try duplicating Tee-Object with ForEach-Object and a StreamWriter:
$OutputFile = '.\output.txt'   # placeholder path for the file you want to tee into
$Encoding = [System.Text.Encoding]::GetEncoding(50222)
$Append = $true
$StreamWriter = New-Object System.IO.StreamWriter -ArgumentList $OutputFile, $Append, $Encoding

previousCommand.exe | ForEach-Object {
    $StreamWriter.WriteLine($_)   # write the line to the file in the target encoding
    $_                            # and pass it through to the console
}
$StreamWriter.Close()
My real suspicion, though, is that you may end up having to work very hard to get the system to take input in this encoding and to treat it correctly.

Related

Powershell: Out-Printer, unicode and font sizing

powershell "get-content foo.txt|Out-Printer"
As long as foo.txt is English, everything is fine (well, mostly).
If foo.txt contains Unicode characters, e.g. राष्ट्र, then what gets printed is stuff like °à¥€ ब.
I tried passing the -Encoding option to Get-Content but it did not change the result.
Is it possible to ensure that Unicode text gets printed properly without launching Word/IE etc. in the background to print it?
My second question is: is it possible to control which font (type and size) is used for printing by Out-Printer?
In my environment (Windows 10 2004 Build 19041.985, Japanese locale), I got the correct result with the following situation:
Save the .txt file in UTF-8 with BOM, and print with Get-Content .\foo.txt | Out-Printer
Save the .txt file in UTF-8 without BOM, and print with Get-Content .\foo.txt -Encoding UTF8 | Out-Printer
I got the incorrect result (like 爨ー爨セ爨キ爭財、游・財、ー) with the following situation:
Save the .txt file in UTF-8 without BOM, and print with Get-Content .\foo.txt | Out-Printer
So it looks like an encoding problem. Please check what @RavenKnit said first.
Is it possible to ensure that Unicode text gets printed properly without launching Word/IE etc. in the background to print it?
I couldn't find a way to do this with Get-PrinterPort, Get-WmiObject Win32_Printer, prnmngr.vbs, or prnqctl.vbs. If you just don't want to show a window while you print the content of a file, you can use the Print verb. It runs notepad.exe for .txt files, winword.exe for .docx files, etc.
Start-Process .\foo.txt -Verb Print -WindowStyle Hidden -Wait
Is it possible to control which font (type and size) is used for printing by out-printer?
According to the source of Out-Printer, the default font is embedded in the .resx file, so it looks like you cannot control the default font.

Powershell Out-file special characters

I have a script that processes data from files and writes the result, based on a condition, to a txt file. The data are strings with words like "Distribución" or "México". When processed, special characters like "é" and "ó" come out broken (the typical white square or question mark).
How can I encode the output file so it works with those characters? I tried encoding in UTF8 and UTF8 without BOM; it doesn't work. Here is the file-writing line:
...| Out-file -encoding XXX .\result.txt
In place of XXX I tried ASCII and UTF8; nothing works :/
Out-File will always add a BOM. It's a particularly annoying "feature" of that cmdlet. Unfortunately, to my knowledge, there is no quick way to save a file using UTF-8 WITHOUT a BOM in PowerShell. You can, however, leverage .NET to do this. This isn't really production ready, but here's a quick example:
$outputPath = "D:\temp.txt"
$data = "Distribución or México"

# WriteAllLines writes UTF-8 without a BOM by default
[System.IO.File]::WriteAllLines($outputPath, $data)
Wrap it in a cmdlet, function, and/or module to make it reusable. Of course you can take more control over the file encoding with .NET too.
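If you want to be explicit about the encoding, WriteAllLines also accepts an Encoding object; passing $false to UTF8Encoding suppresses the BOM. A small sketch building on the example above:
# Explicitly request UTF-8 without a BOM
$utf8NoBom = New-Object System.Text.UTF8Encoding($false)
[System.IO.File]::WriteAllLines($outputPath, $data, $utf8NoBom)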

Using Powershell to Strip Content from PDF

Using Powershell to Strip Content from PDF While Keeping PDF Format.
My Task:
I have been attempting to perform what would be a simple task if the documents were not in PDF format. I have a bunch of PDFs that have unwanted data before the bulk of usable data starts, this is anything that comes before ‘%PDF’ in the documents. A script that pulls all the desired data and exports it to a new file was needed. That part was super easy.
The Problem:
The data that is exported appears to be formatted correctly, except it doesn't open as a PDF anymore. I can open it in Notepad++ and it looks identical to one that was cleaned manually and works. Examining the raw code of the PowerShell-altered PDF, it appears that the 'lines' are much shorter than they should be.
$Path = 'C:\FileLocation'
$Output = '.\MyFile.pdf'
$LineArr = @()

$Target = Get-ChildItem -Path $Path -Filter *.pdf -Recurse -ErrorAction SilentlyContinue |
    Get-Content -Encoding Default | Out-String -Stream

$Target.Where({ $_ -like '*%PDF*' }, 'SkipUntil') | ForEach-Object {
    if ($_.Contains('%PDF')) {
        $LineArr += '%' + $_.Split('%')[1]
    }
    else {
        $LineArr += $_
    }
}

$LineArr | Out-File -Encoding Default -FilePath $Output
I understand the PDF format doesn't really use lines, so that might be where the problem is being created. The PDF format is probably being broken either when the data is initially put into an array, or when it's being written back out. Is there a way to retain the format of the PDF while it is modified and then saved? It's probably the case that I'm missing something simple.
So I was about to start looking at iTextSharp and decided to give an older language a try first, Winbatch. (bleh!) I almost made a screen scraper to do the work but the shame of taking that route got the better of me. So, the function library was the next stop.
This is just a little blurb I spit out, with no error checking or logging going on at this point. All that will be added in, along with file searches, later. All in all it manages to clear all the unwanted extras in the PDF while keeping the exact format that PDFs require.
strPDFdoco = "C:\TestPDFs\Test.pdf"
strPDFString = "%%PDF"
strPDFendString = "%%%%END"
If FileExist(strPDFdoco)
strPDFName = ItemExtract(-1, strPDFdoco, "\")
strFixedPDFFullPath = ("C:\TestPDF\Fixed\": strPDFName)
strCurrentPDFFileSize = FileSize(strPDFdoco) ; Get size of PDF file
hndOldPDFFile = BinaryAlloc(strCurrentPDFFileSize) ; Allocate memory for reading PDF file
BinaryRead(hndOldPDFFile, strPDFdoco) ; Read PDF file
strStartIndex = BinaryIndexEx(hndOldPDFFile, 0, strPDFString, #FWDSCAN, #FALSE) ; Find start point for copy
strEndIndex = BinaryEodGet(hndOldPDFFile) ; find eof
strCount = strEndIndex - strStartIndex
strWritePDF = BinaryWriteEx( hndOldPDFFile, strStartIndex, strFixedPDFFullPath, 0, strCount)
BinaryFree(hndOldPDFFile)
ENDIF
Now that I have an idea how this works, making a tool to do this in PS sounds more doable. There's a PS function out there in the wild called Get-HexDump that might be a good base to educate myself on bits and hex in PS. Since this works in Winbatch I assume there is some sort of equivalent in AutoIt and it could be reproduced in most basic languages.
There appear to be a lot of people out there trying to clear crud from before the header and after the end of their PDF docos, so hopefully this helps. I've got a half mill to hit with whatever script I morph this into. I might update with a PS version if I decide to go that route again, and if I remember.
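For reference, a minimal PowerShell sketch of the same byte-level idea might look something like this (my own untested take, with placeholder paths modeled on the WinBatch example): read the whole PDF as bytes, locate the %PDF marker, and write everything from there onward to a new file.
$source = 'C:\TestPDFs\Test.pdf'
$dest   = 'C:\TestPDFs\Fixed\Test.pdf'

$bytes = [System.IO.File]::ReadAllBytes($source)

# Decode the bytes as ISO-8859-1 so every byte maps to exactly one character,
# which lets IndexOf find the marker without corrupting anything.
$raw   = [System.Text.Encoding]::GetEncoding(28591).GetString($bytes)
$start = $raw.IndexOf('%PDF')

if ($start -ge 0) {
    $trimmed = New-Object byte[] ($bytes.Length - $start)
    [System.Array]::Copy($bytes, $start, $trimmed, 0, $trimmed.Length)
    [System.IO.File]::WriteAllBytes($dest, $trimmed)
}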

Powershell Out-File force end of line character

I discovered that I could force a Unicode file to ASCII using the script below, which is really great. I assume it's based on my environment or the Windows default, but it's adding a CR and LF at the end of each line. Is there a way to force just an LF character rather than both, without loading the entire file into memory? I have seen some solutions that load the entire file into memory and basically do a string replace, which won't work because some of my files are multiple GB.
Thanks!
get-content -encoding utf8 $inputFile | Out-file -force -encoding ASCII $outputFile
I suggest you use the .NET System.IO classes from within your script. In particular, the System.IO.StreamWriter class has a NewLine property which you can set to whatever characters you want the line terminator to be (although to be readable by StreamReader the terminator must be "`n" or "`r`n").
A secondary benefit of using IO.StreamWriter, according to this blog, is much better performance.
Basic code flow is something like this (not tested):
# Note that IO.StreamWriter will use the process's current working directory,
# not PowerShell's, so it is safer to specify full paths.
$inStream  = [System.IO.StreamReader] "c:\temp\orig.txt"
$outStream = New-Object System.IO.StreamWriter("c:\temp\copy.txt", $false, [System.Text.Encoding]::ASCII)
$outStream.NewLine = "`n"   # double quotes: single quotes would write a literal backtick-n

while (-not $inStream.EndOfStream) {
    $outStream.WriteLine($inStream.ReadLine())
}

$inStream.Close()
$outStream.Close()
This script should have constant memory requirements, but hard to know what .NET might do under the covers.

PowerShell search script that ignores binary files

I am really used to doing grep -iIr on the Unix shell but I haven't been able to get a PowerShell equivalent yet.
Basically, the above command searches the target folders recursively and ignores binary files because of the "-I" option. This option is also equivalent to the --binary-files=without-match option, which says "treat binary files as not matching the search string"
So far I have been using Get-ChildItem -r | Select-String as my PowerShell grep replacement, with the occasional Where-Object added. But I haven't figured out a way to ignore all binary files like the grep -I command does.
How can binary files be filtered or ignored with PowerShell?
So for a given path, I only want Select-String to search text files.
EDIT: A few more hours on Google produced this question How to identify the contents of a file is ASCII or Binary. The question says "ASCII" but I believe the writer meant "Text Encoded", like myself.
EDIT: It seems that an isBinary() needs to be written to solve this issue. Probably a C# commandline utility to make it more useful.
EDIT: It seems that what grep is doing is checking for ASCII NUL Byte or UTF-8 Overlong. If those exists, it considers the file binary. This is a single memchr() call.
On Windows, file extensions are usually good enough:
# all C# and related files (projects, source control metadata, etc)
dir -r -fil *.cs* | sls foo   # sls = Select-String

# exclude the binary types most likely to pollute your development workspace
dir -r -exclude *exe, *dll, *pdb | sls foo

# stick the first three lines in your $profile (refining them over time)
$bins = New-Object System.Collections.Generic.List[string]
$bins.AddRange( [string[]]@(".exe", ".dll", ".pdb", ".png", ".mdf", ".docx") )
function IsBin([System.IO.FileInfo]$item) { $bins.Contains($item.Extension.ToLower()) }
dir -r -file | ? { !(IsBin $_) } | sls foo
But of course, file extensions are not perfect. Nobody likes typing long lists, and plenty of files are misnamed anyway.
I don't think Unix has any special binary vs text indicators in the filesystem. (Well, VMS did, but I doubt that's the source of your grep habits.) I looked at the implementation of Grep -I, and apparently it's just a quick-n-dirty heuristic based on the first chunk of the file. Turns out that's a strategy I have a bit of experience with. So here's my advice on choosing a heuristic function that is appropriate for Windows text files:
Examine at least 1KB of the file. Lots of file formats begin with a header that looks like text but will bust your parser shortly afterward. The way modern hardware works, reading 50 bytes has roughly the same I/O overhead as reading 4KB.
If you only care about straight ASCII, exit as soon as you see something outside the character range [31-127 plus CR and LF]. You might accidentally exclude some clever ASCII art, but trying to separate those cases from binary junk is nontrivial.
If you want to handle Unicode text, let MS libraries handle the dirty work. It's harder than you think. From PowerShell you can easily access the IMultiLang2 interface (COM) or the Encoding.GetEncoding static method (.NET); see the short sketch after this list. Of course, they are still just guessing. Raymond's comments on the Notepad detection algorithm (and the link within to Michael Kaplan) are worth reviewing before deciding exactly how you want to mix & match the platform-provided libraries.
If the outcome is important -- i.e. a flaw will do something worse than just clutter up your grep console -- then don't be afraid to hard-code some file extensions for the sake of accuracy. For example, *.PDF files occasionally have several KB of text at the front despite being a binary format, leading to the notorious bugs linked above. Similarly, if you have a file extension that is likely to contain XML or XML-like data, you might try a detection scheme similar to Visual Studio's HTML editor. (SourceSafe 2005 actually borrows this algorithm for some cases.)
Whatever else happens, have a reasonable backup plan.
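As a small illustration of the "let MS libraries handle the dirty work" point above (a sketch of my own, not part of the original answer), you can let StreamReader sniff a byte order mark and report what it settled on; $path here is whatever file you want to inspect:
# The third constructor argument turns on BOM detection; CurrentEncoding is
# only meaningful after the first read or peek.
$reader = New-Object System.IO.StreamReader($path, [System.Text.Encoding]::Default, $true)
$null = $reader.Peek()
$reader.CurrentEncoding.EncodingName
$reader.Dispose()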
As an example, here's the quick ASCII detector:
function IsAscii([System.IO.FileInfo]$item)
{
    begin
    {
        # bytes considered "plain ASCII text": LF, CR, and the printable range
        $validList = New-Object System.Collections.Generic.List[byte]
        $validList.AddRange([byte[]] (10,13) )
        $validList.AddRange([byte[]] (31..127) )
    }

    process
    {
        try
        {
            $reader = $item.Open([System.IO.FileMode]::Open)
            $bytes = New-Object byte[] 1024
            $numRead = $reader.Read($bytes, 0, $bytes.Count)

            for ($i = 0; $i -lt $numRead; ++$i)
            {
                if (!$validList.Contains($bytes[$i]))
                    { return $false }
            }
            $true
        }
        finally
        {
            if ($reader)
                { $reader.Dispose() }
        }
    }
}
The usage pattern I'm targeting is a where-object clause inserted in the pipeline between "dir" and "ss". There are other ways, depending on your scripting style.
Improving the detection algorithm along one of the suggested paths is left to the reader.
edit: I started replying to your comment in a comment of my own, but it got too long...
Above, I looked at the problem from the POV of whitelisting known-good sequences. In the application I maintained, incorrectly storing a binary as text had far worse consequences than vice versa. The same is true for scenarios where you are choosing which FTP transfer mode to use, or what kind of MIME encoding to send to an email server, etc.
In other scenarios, blacklisting the obviously bogus and allowing everything else to be called text is an equally valid technique. While U+0000 is a valid code point, it's pretty much never found in real world text. Meanwhile, \00 is quite common in structured binary files (namely, whenever a fixed-byte-length field needs padding), so it makes a great simple blacklist. VSS 6.0 used this check alone and did ok.
Aside: *.zip files are a case where checking for \0 is riskier. Unlike most binaries, their structured "header" (footer?) block is at the end, not the beginning. Assuming ideal entropy compression, the chance of no \0 in the first 1KB is (1-1/256)^1024 or about 2%. Luckily, simply scanning the rest of the 4KB cluster NTFS read will drive the risk down to 0.00001% without having to change the algorithm or write another special case.
To exclude invalid UTF-8, add \C0-C1 and \F8-FD and \FE-FF (once you've seeked past the possible BOM) to the blacklist. Very incomplete since you're not actually validating the sequences, but close enough for your purposes. If you want to get any fancier than this, it's time to call one of the platform libraries like IMultiLang2::DetectInputCodepage.
Not sure why \C8 (200 decimal) is on Grep's list. It's not an overlong encoding. For example, the sequence \C8 \80 represents Ȁ (U+0200). Maybe something specific to Unix.
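To make the blacklist idea concrete, here is a rough sketch of my own (not from the answer above, and it skips the seek-past-the-BOM step it mentions): flag a file as binary if its first 1KB contains NUL or byte values that can never occur in valid UTF-8.
function Test-LooksBinary([string]$path) {
    # 0x00 plus bytes that are invalid anywhere in UTF-8 (0xC0, 0xC1, 0xF8-0xFF)
    $blacklist = [byte[]]((0x00, 0xC0, 0xC1) + (0xF8..0xFF))
    $stream = [System.IO.File]::OpenRead($path)
    try {
        $buffer = New-Object byte[] 1024
        $count  = $stream.Read($buffer, 0, $buffer.Length)
        for ($i = 0; $i -lt $count; $i++) {
            if ($blacklist -contains $buffer[$i]) { return $true }
        }
        return $false
    }
    finally { $stream.Dispose() }
}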
Ok, after a few more hours of research I believe I've found my solution. I won't mark this as the answer though.
Pro Windows PowerShell had a very similar example. I had completely forgotten that I had this excellent reference. Please buy it if you are interested in PowerShell. It went into detail on Get-Content and Unicode BOMs.
This answer to a similar question was also very helpful with the Unicode identification.
Here is the script. Please let me know if you know of any issues it may have.
# The file to be tested
param ($currFile)

# encoding variable
$encoding = ""

# Get the first 1024 bytes from the file
$byteArray = Get-Content -Path $currFile -Encoding Byte -TotalCount 1024

if( ("{0:X2}{1:X2}{2:X2}" -f $byteArray) -eq "EFBBBF" )
{
    # Test for the UTF-8 BOM
    $encoding = "UTF-8"
}
elseif( ("{0:X2}{1:X2}{2:X2}{3:X2}" -f $byteArray) -eq "FFFE0000" )
{
    # Test for the UTF-32 BOM (check the 4-byte UTF-32 BOMs before UTF-16,
    # since FF FE is a prefix of FF FE 00 00)
    $encoding = "UTF-32"
}
elseif( ("{0:X2}{1:X2}{2:X2}{3:X2}" -f $byteArray) -eq "0000FEFF" )
{
    # Test for the UTF-32 Big Endian BOM
    $encoding = "UTF-32 BE"
}
elseif( ("{0:X2}{1:X2}" -f $byteArray) -eq "FFFE" )
{
    # Test for the UTF-16 BOM
    $encoding = "UTF-16"
}
elseif( ("{0:X2}{1:X2}" -f $byteArray) -eq "FEFF" )
{
    # Test for the UTF-16 Big Endian BOM
    $encoding = "UTF-16 BE"
}

if($encoding)
{
    # File is text encoded
    return $false
}

# So now we're done with text encodings that commonly have '0's
# in their byte streams. ASCII may have the NUL or '0' code in
# its stream, but that's rare apparently.
# Both GNU Grep and Diff use variations of this heuristic.
if( $byteArray -contains 0 )
{
    # Test for binary
    return $true
}

# This should be ASCII encoded
$encoding = "ASCII"
return $false
Save this script as isBinary.ps1
This script got every text or binary file I tried correct.
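One possible way to wire it into the original search pipeline (a sketch of my own; adjust the script path to wherever you saved it):
# Skip files the script classifies as binary, then search the rest
Get-ChildItem -Recurse -File |
    Where-Object { -not (& .\isBinary.ps1 $_.FullName) } |
    Select-String -Pattern "foo"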
I agree that the other answers are more 'complete', but because I do not know what file extensions I will encounter within a folder, and I want to look through them all, this is the easiest solution for me.
How about, instead of avoiding searching through binary files, you just ignore the errors that you get from searching through them?
It doesn't take long to run a search even if there are binary files within the folder being searched.
In the end, all that you care about is the strings that match the pattern (and there is next to no chance of finding a string that matches the pattern inside a binary file).
GCI -Recurse -Force -ErrorAction SilentlyContinue | ForEach-Object { GC $_ -ErrorAction SilentlyContinue | Select-String -Pattern "Pattern" } | Out-File -FilePath C:\temp\grep.txt -Width 999999