Encoding issue with Powershell `Get-Clipboard` - powershell

I would like to retrieve HTML from the clipboard via the command line and am struggling to get the encoding right.
For instance, if you open a command prompt/WSL, copy the following text, ⇧Shift+⭾TAB, and run:
powershell.exe Get-Clipboard
The correct text is retrieved (⇧Shift+⭾TAB).
But if you then try to retrieve the clipboard as html:
powershell.exe "Get-Clipboard -TextFormatType html"
The following mangled text is retrieved instead:
...â‡§Shift+â­¾TAB...
This seems to be an encoding confusion on the part of the Get-Clipboard cmdlet. How can I work around this?
Edit: As @Zilog80 indicates in the comments, the encoding of the text indeed does not match the encoding the cmdlet assumes it has. I can rectify this in Ruby, for instance, using:
out = `powershell.exe Get-Clipboard -TextFormatType html`
puts out.encode('cp1252').force_encoding('utf-8')
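That round trip can be verified in isolation, without touching the clipboard, by simulating what Get-Clipboard does (decoding UTF-8 bytes as Windows-1252):

```ruby
# What the clipboard actually holds: UTF-8 encoded text
original = "⇧Shift+⭾TAB"

# Simulate the cmdlet's mistake: treat the raw UTF-8 bytes as cp1252 text
mojibake = original.b.force_encoding('cp1252').encode('utf-8')

# The workaround from above: re-encode to cp1252 (recovering the original
# bytes), then relabel the result as UTF-8
repaired = mojibake.encode('cp1252').force_encoding('utf-8')

raise unless repaired == original
```

This only works cleanly because every byte involved is defined in cp1252; the answer below discusses that caveat.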
Any idea for how to achieve the same on the command line?

This is indeed a shortcoming of Get-Clipboard. The HTML format is documented to support only UTF-8, regardless of the source encoding of the page, so the cmdlet should interpret it as such, but it doesn't.
I'm speculating as to the encoding PowerShell is going to be using when decoding the data, but it's probably whatever the system default ANSI encoding is. In that case
[Text.Encoding]::UTF8.GetString([Text.Encoding]::Default.GetBytes( `
(Get-Clipboard -TextFormatType Html -Raw) `
))
will recode the text, but with the caveat that if the default ANSI encoding does not cover all code points from 0-255, some characters might get lost. Fortunately Windows-1252 (the most common default) does cover all code points.

Related

Change Unicode to UTF-8 | PowerShell script

When I use Write-Host in my PowerShell script, the output looks like this: ????? ?????.
This happens because I'm passing Arabic strings to Write-Host, and it seems that PowerShell doesn't support Arabic.
How do I print Unicode (UTF-8) text, which supports Arabic, using Write-Host?
Example: Write-Host "مرحباً بالعالم"
The output in this case will be: ????? ?????
Any solutions?
Fixed
You need to set a font that supports those characters. Like Cascadia Code PL
Note: The non-PL version didn't work, so get the PL one.
You might have to set the console encoding as well. Unless you really need a different encoding, defaulting to UTF-8 is a good idea:
$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
Before
I didn't actually need wt (Windows Terminal) here, but I'd suggest it. Windows Terminal is a modern terminal that is now an inbox app, meaning it ships by default on Windows going forward.
There are UTF-8 characters that don't render in the legacy console but that wt supports (both using the Cascadia Code PL font).

Powershell - remove metadata from text

I want to write a Powershell script that, when run, will remove all formatting and similar metadata from the text in the clipboard.
I'm talking about things like copying some text from Microsoft Word and pasting it into Excel: normally this pastes it with the bold, italic, etc. formatting the text had in Word, but I want to paste it as if I had copied it to Notepad and copied it again from there.
I'd prefer to avoid having to emulate opening notepad.exe, pasting there, and copying again, as I hope there is a more elegant/intelligent option.
I know there is a "Text only" copy option in Office apps, but not only does it not always work as you'd expect/want, copying in other applications doesn't offer that option at all.
I know how to get the text from the clipboard with "Get-Clipboard" and subsequently set it with "Set-Clipboard" but I have no idea WHERE the darn formatting information is stored.
tl;dr
Just calling Get-Clipboard should give you the desired plain-text representation.
Add -Raw if you want the text returned as a single, multi-line string rather than as an array of lines.
Background information:
Applications that copy rich text-based formats to the clipboard, such as Word and Excel copying RTF and HTML, usually also copy a plain-text representation of the same content.
PowerShell's Get-Clipboard cmdlet retrieves the plain-text representation:
- by default in Windows PowerShell; to get one of the rich formats (if present), use the -TextFormatType parameter with the appropriate enumeration value.[1]
- invariably in PowerShell (Core) v7+, where Get-Clipboard supports only plain-text retrieval.
Separately, in both PowerShell editions, you can use the -Raw switch to request that multi-line text on the clipboard be returned as a single, multi-line string rather than an array of lines, which is the default.
[1] To express the default behavior with explicit arguments:
Get-Clipboard -Format Text -TextFormatType UnicodeText; the documentation doesn't specify if and how enumeration value Text differs.
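The default vs. -Raw distinction is just array-of-lines vs. single string; a Ruby sketch of the two shapes (illustrative only, no clipboard involved):

```ruby
clip_text = "line1\r\nline2\r\nline3"   # pretend this is the clipboard text

# Default behaviour: an array of lines, line endings stripped
lines = clip_text.split(/\r?\n/)

# With -Raw: one multi-line string, line endings preserved
raw = clip_text

raise unless lines == ["line1", "line2", "line3"]
raise unless raw.count("\n") == 2
```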

Powershell keeps converting to ascii

I've followed the guide here: Use Windows PowerShell to Look for and Replace a Word in a Microsoft Word Document
My problem is that if I put a UTF-8 string into the search field, it gets converted to ASCII before going into the Word application.
If you simply copy and paste his code and change the find text to something like Japanese: カルシウム
It will go into Word and search for the ASCII equivalent: ?????
I have tried every suggestion about setting input and output to UTF8 that I can find but nothing seems to be working. I can't even get the powershell console to actually display Japanese characters, all I get are boxes. I think that might have something to do with the fact that I only have 3 fonts and perhaps none of them can display the Japanese characters in the console...but I don't care about that, I want to be able to send the Japanese characters in UTF8 for the find and replace.
Any Help?
For people who keep getting output encoded as ASCII or the wrong Unicode flavor: you can set $OutputEncoding to whatever encoding you want (see the Microsoft blog post on $OutputEncoding):
PS C:\> $OutputEncoding                                  # tells you what the default encoding is
PS C:\> $OutputEncoding = [Console]::OutputEncoding      # match the console, or set it explicitly:
PS C:\> $OutputEncoding = New-Object -TypeName System.Text.UTF8Encoding   # or whatever you want, e.g. a Japanese encoding
PS C:\> $OutputEncoding                                  # verify the encoding
The answer is actually quite easy. If the PowerShell script is saved as plain UTF-8, the characters are not decoded correctly. You need to save the .ps1 script as "UTF-8 with BOM" to get the characters right.
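The "with BOM" variant differs only in three leading bytes (EF BB BF, the UTF-8 encoding of U+FEFF), which is what PowerShell keys on; a small Ruby sketch of producing and detecting them:

```ruby
# The UTF-8 byte order mark is U+FEFF, encoded as the bytes EF BB BF.
script   = "Write-Host \"مرحباً بالعالم\"\n"   # the Arabic example from above
with_bom = "\uFEFF" + script

# "UTF-8 with BOM" starts with the marker bytes...
raise unless with_bom.bytes.first(3) == [0xEF, 0xBB, 0xBF]
# ...while a plain UTF-8 file starts straight with the content.
raise unless script.bytes.first(3) != [0xEF, 0xBB, 0xBF]
```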

How to do proper Unicode and ANSI output redirection on cmd.exe?

If you are doing automation on Windows and you are redirecting the output of different commands (internal cmd.exe ones or external), you'll discover that your log files contain mixed Unicode and ANSI output (meaning that they are invalid and will not load well in viewers/editors).
Is it possible to make cmd.exe work with UTF-8? This question is not about display, it's about stdin/stdout/stderr redirection and Unicode.
I am looking for a solution that would allow you to:
redirect the output of the internal commands to a file using UTF-8
redirect output of external commands supporting Unicode to the files but encoded as UTF-8.
If it is impossible to obtain this kind of consistency using batch files, is there another way of solving the problem, like using Python scripting? In that case, I would like to know whether it is possible to do the Unicode detection alone (the user of the script should not have to remember whether the called tools output Unicode or not; the script should just convert the output to UTF-8).
For simplicity, we'll assume that if the tool's output is not Unicode, it will be treated as UTF-8 (no code page conversion).
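The detection part can be sketched in Ruby under the question's own simplification. This assumes UTF-16LE output is recognizable by a leading FF FE byte order mark, which is not guaranteed for every tool (cmd /u redirection in particular may omit it), so treat it as a heuristic:

```ruby
# Hypothetical helper: normalize captured command output bytes to UTF-8.
def to_utf8(raw)
  raw = raw.dup.force_encoding('binary')
  if raw.start_with?("\xFF\xFE".b)
    # UTF-16LE BOM found: strip it and transcode to UTF-8
    raw[2..-1].force_encoding('utf-16le').encode('utf-8')
  else
    # Per the question's simplification: treat non-Unicode output as UTF-8
    raw.force_encoding('utf-8')
  end
end
```

A log-writing script would run each tool, capture its raw output, and pass it through to_utf8 before appending to the file, so the file ends up uniformly UTF-8.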
You can use chcp to change the active code page. This will be used for redirecting text as well:
chcp 65001
Keep in mind, though, that this will have no effect if cmd was started with the /u switch which forces Unicode (UTF-16 in this case) redirection output. If that switch is active then all output will be in UTF-16LE, regardless of the codepage set with chcp.
Also note that the console will be unusable for interactive output when set to Raster Fonts. I'm getting fun error messages in that case:
C:\Users\Johannes Rössel\Documents>x
Active code page: 65001
The system cannot write to the specified device.
So either use a sane setup (TrueType font for the console) or don't pull this stunt when using the console interactively and having a path that contains non-ASCII characters.
In Perl,
binmode(STDOUT, ":unix");
without
use encoding 'utf8';
helped me. With the latter I got a "wide character in print" warning.

ja chars in windows batch file

What is the secret to Japanese characters in a Windows XP .bat file?
We have a script that opens a file off disk in kiosk mode:
@ECHO OFF
"%ProgramFiles%\Internet Explorer\iexplore.exe" -K "%CD%\XYZ.htm"
It works fine when the OS is english, and it works fine for the japanese OS when XYZ is made up of english characters, but when XYZ is made up of japanese characters, they are getting mangled into gibberish by the time IE tries to find the file.
If the batch file is saved as Unicode or Unicode big-endian, the script won't even run.
I have tried various ways of encoding the Japanese characters. An ampersand escape does not work (&#12345;).
A percent escape does not work either: %xx%xx%xx
ABC works, AB%43 becomes AB3 in the error message, so it looks like the percent escape is trying to do parameter substitution. This is confirmed because %043 puts in the name of the script !
One thing that does work is pasting the ja characters into a command prompt.
@ECHO OFF
CD "%ProgramFiles%\Internet Explorer\"
Set /p URL="file to open: "
start iexplore.exe -K %URL%
This tells me that iexplore.exe will accept and parse the parameter correctly when it has ja characters, but not when they are written into the script.
So it would be nice to know what the secret may be to getting the parameter into IE successfully via the batch file, as opposed to via the clipboard and an environment variable.
Any suggestions greatly appreciated !
best regards
Richard Collins
P.S.
another post has made this suggestion, which I have yet to follow up on:
You might have more luck in cmd.exe if you opened it in UNICODE mode. Use "cmd /U".
Batch renaming of files with international chars on Windows XP
I will need to find out if this can be from inside the script.
For the record, a simple answer has been found for this question.
If the batch file is saved as ANSI, it works!
First of all: Batch files are pretty limited in their internationalization support. There is no direct way of telling cmd what codepage a batch file is in. UTF-16 is out anyway, since cmd won't even parse that.
I have detailed an option in my answer to the following question:
Batch file encoding
which might be helpful for your needs.
In principle it boils down to the following:
Use an encoding which has single-byte mappings for ASCII
Put a chcp ... at the start of the batch file
Use the set codepage for the rest of the file
You can use codepage 65001, which is UTF-8 but make sure that your file doesn't include the U+FEFF character at the start (used as byte-order mark in UTF-16 and UTF-32 and sometimes used as marker for UTF-8 files as well). Otherwise the first command in the file will produce an error message.
So just use the following:
@echo off
chcp 65001
"%ProgramFiles%\Internet Explorer\iexplore.exe" -K "%CD%\XYZ.htm"
and save it as UTF-8 without BOM (Note: Notepad won't allow you to do that) and it should work.
cmd /u won't do anything here, that advice is pretty much bogus. The /U switch only specifies that Unicode will be used for redirection of input and output (and piping). It has nothing to do with the encoding the console uses for output or reading batch files.
URL encoding won't help you either. cmd is hardly a web browser, and outside of HTTP and the web, URL encoding isn't exactly widespread (hence the name). cmd uses percent signs for environment variables and arguments to batch files and subroutines.
"Ampersand escapes", also known as character entities from HTML and XML, won't work either, because cmd is not HTML or XML. The ampersand is instead used to execute multiple commands on a single line.
I too suffered this frustrating problem in batch/cmd files. However, so far as I can see, no one yet has stated the reason why this problem occurs, here or in other, similar posts at StackOverflow. The nearest statement addressing this was:
“First of all: Batch files are pretty limited in their internationalization support. There is no direct way of telling cmd what codepage a batch file is in.”
Here is the basic problem. Cmd files are the Windows-2000+ successor to MS-DOS and IBM-DOS bat(ch) files. MS and IBM DOS (1984 vintage) were written in the IBM-PC character set (code page 437). There, the 8th-bit codes were assigned (or “clothed” with) characters different from those assigned to the corresponding codes of Windows, ANSI, or Unicode. The presumption of CP437 encoding is unalterable (except, as previously noted, through cmd.exe /u). Where the characters of the IBM-PC set have exact counterparts in the Unicode set, Windows Explorer remaps them to the Unicode counterparts. Alas, even Windows-1252 characters like š and ¾ have no counterpart in code page 437.
Here is another way to see the problem. Try opening your batch/cmd script using the Windows Edit.com program (at C:\Windows\system32\Edit.com). The Windows-1252 character 0146 ’ (Unicode 8217) instead appears as IBM-PC 146 Æ. A batch command to rename Mary'sFile.txt as Mary’sFile.txt fails, as it is interpreted as MaryÆsFile.txt.
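The divergence is easy to reproduce at the byte level in Ruby (IBM437 is Ruby's name for code page 437):

```ruby
# The same bytes decode to different characters under the two code pages.
[0x91, 0x92].each do |b|
  byte = b.chr                                            # a single raw byte
  ansi = byte.dup.force_encoding('Windows-1252').encode('utf-8')
  dos  = byte.dup.force_encoding('IBM437').encode('utf-8')
  puts format('0x%02X  Windows-1252: %s  CP437: %s', b, ansi, dos)
end
# 0x91 is ‘ in Windows-1252 but æ in CP437; 0x92 is ’ vs Æ.
```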
This problem can be avoided in the case of copying a file named Mary’sFile.txt: cite it as Mary?sFile.txt, e.g.:
xCopy Mary?sFile.txt Mary?sLastFile.txt
You will see a similar treatment (substitution of question marks) in a DIR list of files having Unicode characters.
Obviously, this is useless unless an extant file has the Unicode characters. This solution’s range is paltry and inadequate, but please make what use of it you can.
You can try using the Shift-JIS encoding (code page 932, the ANSI code page on Japanese Windows).