I want to write a Powershell script that, when run, will remove all formatting and similar metadata from the text in the clipboard.
I'm talking about things like copying some text from Microsoft Word and pasting it into Excel: normally this pastes it with the bold, italic, etc. formatting the text had in Word, but I want to paste it as if I had copied it into Notepad and copied it again from there.
I'd prefer to avoid having to emulate opening notepad.exe, pasting there and copying again, as I hope there is a more elegant/intelligent option.
I know there is a copy option "Text only" in Office apps, but not only does it not always work as you'd expect/want, copying into other applications doesn't offer that option at all.
I know how to get the text from the clipboard with "Get-Clipboard" and subsequently set it with "Set-Clipboard" but I have no idea WHERE the darn formatting information is stored.
tl;dr
Just calling Get-Clipboard should give you the desired plain-text representation.
Add -Raw if you want the text returned as a single, multi-line string rather than as an array of lines.
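For the use case in the question, a minimal sketch: round-trip the clipboard through its plain-text representation, which discards any rich formats.

# Read the clipboard's plain-text representation and write it back,
# replacing whatever rich formats (RTF, HTML, ...) were on the clipboard.
Set-Clipboard -Value (Get-Clipboard -Raw)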
Background information:
Applications that copy rich text-based formats to the clipboard, such as Word and Excel copying RTF and HTML, usually also copy a plain-text representation of the same content.
PowerShell's Get-Clipboard cmdlet retrieves the plain-text representation:
by default in Windows PowerShell, where you can instead use the -TextFormatType parameter with the appropriate enumeration value to get one of the rich formats (if present);[1]
invariably in PowerShell (Core) v7+, where Get-Clipboard supports only plain-text retrieval.
Separately, in both PowerShell editions, you can use the -Raw switch to request that multi-line text on the clipboard be returned as a single, multi-line string rather than an array of lines, which is the default.
[1] To express the default behavior with explicit arguments:
Get-Clipboard -Format Text -TextFormatType UnicodeText; the documentation doesn't specify whether and how enumeration value Text differs.
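For instance, to inspect the rich HTML representation that Word or a browser puts on the clipboard (Windows PowerShell only; a sketch that assumes HTML data is actually present):

# Returns the raw CF_HTML payload, including its Version/StartHTML/EndHTML header.
Get-Clipboard -TextFormatType Html -Raw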
Related
I'm trying to add content to a txt file with the Add-Content command, but within the text I have several characters like "", ,, [] and commands that I don't want PowerShell to run, and I can't get PowerShell to treat it all as text.
My question: could you add some kind of delimiter, as in MySQL, that indicates that everything inside should be treated as text?
To prevent PowerShell from interpreting those characters, you must use backticks before each character (e.g.: `[, `\, `"). Do not confuse the backtick with the apostrophe (').
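Alternatively, a single-quoted string treats everything between the quotes literally; a minimal sketch, assuming a hypothetical file name:

# Single quotes suppress variable expansion and escape processing, so
# double quotes, brackets, backslashes and $ signs are written as-is.
Add-Content -Path .\notes.txt -Value 'Raw text with "", [], \ and $variables left untouched'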
I would like to retrieve HTML from the clipboard via the command line and am struggling to get the encoding right.
For instance, if you open a command prompt/WSL, copy the following ⇧Shift+⭾TAB and run:
powershell.exe Get-Clipboard
The correct text is retrieved (⇧Shift+⭾TAB).
But if you then try to retrieve the clipboard as html:
powershell.exe "Get-Clipboard -TextFormatType html"
The following text is retrieved:
...⇧Shift+â¾TAB...
This seems to be an encoding confusion on the part of the Get-Clipboard cmdlet. How can I work around this?
Edit: As @Zilog80 indicates in the comments, the encoding of the text indeed does not match the encoding the text is assumed to have. I can rectify this in Ruby, for instance, using:
out = `powershell.exe Get-Clipboard -TextFormatType html`
puts out.encode('cp1252').force_encoding('utf-8')
Any idea for how to achieve the same on the command line?
This is indeed a shortcoming of Get-Clipboard. The HTML format is documented to support only UTF-8, regardless of the source encoding of the page, so the cmdlet should interpret it as such, but it doesn't.
I'm speculating as to the encoding PowerShell will use when decoding the data, but it's probably whatever the system default ANSI encoding is. In that case
# Undo the wrong decoding: convert the mis-decoded string back to bytes using
# the system's default ANSI encoding, then decode those bytes as UTF-8.
[Text.Encoding]::UTF8.GetString(
    [Text.Encoding]::Default.GetBytes(
        (Get-Clipboard -TextFormatType Html -Raw)
    )
)
will recode the text, but with the caveat that if the default ANSI encoding does not cover all code points from 0-255, some characters might get lost. Fortunately Windows-1252 (the most common default) does cover all code points.
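To check which encoding that is on a given machine, a sketch for Windows PowerShell (where [Text.Encoding]::Default is the system ANSI code page; in PowerShell 7+ it is always UTF-8):

# e.g. windows-1252 on most Western European and US systems
[Text.Encoding]::Default.WebName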
I am exporting some Access tables to txt files and there are a lot of problems with the txt files. One of those problems is line breaks that are not visible in the txt file itself. If I copy a line with such a line break from Notepad into Notepad++, it breaks into 2 lines.
So I believe this may be an encoding problem, but I can't find the proper one to resolve it. I'm currently exporting with the default Western European encoding, but should I export to UTF, Unicode, ASCII or something else?
When exporting from MS Access (or VB/VBA in general), make sure you're using the vbCrLf constant (carriage return plus line feed) for line breaks. That constant corresponds to hex values 0D 0A.
In Windows, the convention is to use those 2 characters together as a line break, while many other platforms, such as Unix/Linux/macOS, typically use just 0A.
That brings up an issue: Notepad, the standard Windows text file viewer, cannot deal with 0A alone and does not treat such characters as line breaks. More advanced editors, such as Notepad++ or UltraEdit, display such files correctly, though.
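If you already have exported files with lone line-feed breaks, a minimal PowerShell sketch (hypothetical file path) can normalize them to CR+LF so Notepad displays them correctly:

# Match CRLF first so existing Windows line breaks are left alone,
# then rewrite lone LF or lone CR as CRLF.
$text = [System.IO.File]::ReadAllText('.\export.txt')
$normalized = [regex]::Replace($text, '\r\n|\r|\n', "`r`n")
[System.IO.File]::WriteAllText('.\export.txt', $normalized)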
The CSV export function in Microsoft Office applications (Excel, Access) terminates a data row with CR+LF and writes just LF into the file for a line break within a data value (multi-line string). (I think just CR was written into the CSV file for a line break in versions of Office older than Office 2007.)
Most text editors detect those LF without CR (or CR without LF) and convert them to CR+LF on loading the CSV file. Viewed that way, the CSV file appears to have wrong lines, because the number of data values is not correct on data rows whose values contain a line break.
However, newline characters within a double-quoted value in a CSV file are correct according to the CSV specification, as described in the Wikipedia article on comma-separated values.
But most applications that support importing from a CSV file do not support newline characters within a double-quoted value, and therefore some data values are imported incorrectly. Regular-expression replaces also can't be run on such a CSV file, because the number of separator characters is not constant across lines.
UltraEdit has a special configuration setting for editing CSV files that use only LF (or CR) for a line break within a data value. At Advanced - Configuration - File Handling - DOS/Unix/Mac Handling, select either Never prompt to convert files to DOS format, or Prompt to convert if file is not DOS format and click No when the prompt is displayed; additionally, enable Only recognize DOS terminated lines (CR/LF) as new lines for editing.
With those settings, a CSV file with CR+LF at the end of each data row and only LF (or CR) for line breaks within data values is loaded in UltraEdit with the number of lines equal to the number of data rows. The line feeds without carriage return (or carriage returns without line feed) are displayed in the lines as a small rectangle, because fonts define no glyph for a carriage return or line feed; they are whitespace characters with no width. A Perl regular expression find searching for \r(?!\n)|(?<!\r)\n can now be used to find those line breaks within data values and replace them with something different, such as a space character, or remove them.
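The same find/replace can be scripted; a minimal PowerShell sketch (hypothetical file name) that replaces line breaks inside data values with a space while keeping the CR+LF row terminators:

# Lone CR (not followed by LF) and lone LF (not preceded by CR) are breaks
# inside data values; intact CR+LF row terminators are left untouched.
$csv = [System.IO.File]::ReadAllText('.\export.csv')
$fixed = [regex]::Replace($csv, '\r(?!\n)|(?<!\r)\n', ' ')
[System.IO.File]::WriteAllText('.\export.csv', $fixed)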
Which character encoding (ASCII, ANSI, Unicode (UTF-16), UTF-8) to use on export depends on which characters can occur in the string values. A Unicode encoding is necessary if string values can contain characters not included in the local code page.
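To decide whether the local code page suffices, one can round-trip a sample value through the ANSI encoding; a sketch for Windows PowerShell (PowerShell 7+ would first need the CodePagesEncodingProvider registered):

# If the round trip changes the string, it contains characters outside the
# local ANSI code page and a Unicode encoding (UTF-8/UTF-16) is required.
$value = 'カルシウム'
$ansi  = [Text.Encoding]::GetEncoding([Globalization.CultureInfo]::CurrentCulture.TextInfo.ANSICodePage)
$ansi.GetString($ansi.GetBytes($value)) -eq $value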
I've followed the guide here: Use Windows PowerShell to Look for and Replace a Word in a Microsoft Word Document
My problem is that if I put a UTF-8 string into the search field, it gets converted to ASCII before going into the Word application.
If you simply copy and paste his code and change the find text to something like Japanese: カルシウム
It will go into Word and search for the ASCII equivalent: カルシウãƒ
I have tried every suggestion about setting input and output to UTF-8 that I can find, but nothing seems to be working. I can't even get the PowerShell console to actually display Japanese characters; all I get are boxes. That might have something to do with the fact that I only have 3 fonts and perhaps none of them can display Japanese characters in the console. But I don't care about that: I want to be able to send the Japanese characters in UTF-8 for the find and replace.
Any Help?
For people who keep getting ASCII or Unicode output encoding all the time: you can set the output encoding to whatever encoding you want, as described in the Microsoft blog post on $OutputEncoding:

PS C:\> $OutputEncoding                                   # shows the current default encoding
PS C:\> $OutputEncoding = [Console]::OutputEncoding       # match the console's encoding, or ...
PS C:\> $OutputEncoding = New-Object -TypeName System.Text.UTF8Encoding   # ... set whatever you want (look up Japanese encodings)
PS C:\> $OutputEncoding                                   # verify the encoding
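To display and exchange Japanese text reliably, a sketch that puts the console and $OutputEncoding on UTF-8 in one go (a console font with Japanese glyphs is still required for display):

# Route output to and input from external programs through UTF-8.
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding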
The answer is actually quite easy. If the PowerShell script is saved as plain UTF-8, the characters are not encoded correctly. You need to save the .ps1 script encoded as "UTF-8 with BOM" in order to get the characters right.
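With the script saved as UTF-8 with BOM, a minimal sketch of the Word find-and-replace itself, via COM automation; the document path and search/replacement strings here are placeholders:

# Find.Execute argument order: FindText, MatchCase, MatchWholeWord, MatchWildcards,
# MatchSoundsLike, MatchAllWordForms, Forward, Wrap, Format, ReplaceWith, Replace.
# Wrap 1 = wdFindContinue, Replace 2 = wdReplaceAll.
$word = New-Object -ComObject Word.Application
$doc  = $word.Documents.Open('C:\docs\sample.docx')
$find = $doc.Content.Find
[void]$find.Execute('カルシウム', $false, $false, $false, $false, $false, $true, 1, $false, 'Calcium', 2)
$doc.Save()
$doc.Close()
$word.Quit()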
What is the secret to japanese characters in a Windows XP .bat file?
We have a script to open a file off disk in kiosk mode:
@ECHO OFF
"%ProgramFiles%\Internet Explorer\iexplore.exe" -k "%CD%\XYZ.htm"
It works fine when the OS is English, and it works fine on the Japanese OS when XYZ consists of English characters, but when XYZ consists of Japanese characters, they get mangled into gibberish by the time IE tries to find the file.
If the batch file is saved as Unicode or Unicode big endian, the script won't even run.
I have tried various ways of encoding the Japanese characters. Ampersand escaping does not work (e.g. &#12345;).
Percent escaping does not work either: %xx%xx%xx.
ABC works, but AB%43 becomes AB3 in the error message, so it looks like the percent escape is being treated as parameter substitution. This is confirmed because %043 puts the name of the script in (%0 expands to the script name)!
One thing that does work is pasting the ja characters into a command prompt.
@ECHO OFF
CD "%ProgramFiles%\Internet Explorer\"
Set /p URL="file to open: "
start iexplore.exe -k %URL%
This tells me that iexplore.exe will accept and parse the parameter correctly when it has ja characters, but not when they are written into the script.
So it would be nice to know what the secret may be to getting the parameter into IE successfully via the batch file, as opposed to via the clipboard and an environment variable.
Any suggestions greatly appreciated !
best regards
Richard Collins
P.S.
Another post has made this suggestion, which I have yet to follow up:
You might have more luck in cmd.exe if you opened it in UNICODE mode. Use "cmd /U".
Batch renaming of files with international chars on Windows XP
I will need to find out if this can be done from inside the script.
For the record, a simple answer has been found for this question.
If the batch file is saved as ANSI, it works!
First of all: Batch files are pretty limited in their internationalization support. There is no direct way of telling cmd what codepage a batch file is in. UTF-16 is out anyway, since cmd won't even parse that.
I have detailed an option in my answer to the following question:
Batch file encoding
which might be helpful for your needs.
In principle it boils down to the following:
Use an encoding which has single-byte mappings for ASCII
Put a chcp ... at the start of the batch file
Use the codepage you set for the rest of the file
You can use codepage 65001, which is UTF-8, but make sure that your file doesn't include the U+FEFF character at the start (used as a byte-order mark in UTF-16 and UTF-32 and sometimes as a marker for UTF-8 files as well). Otherwise the first command in the file will produce an error message.
So just use the following:
@echo off
chcp 65001
"%ProgramFiles%\Internet Explorer\iexplore.exe" -k "%CD%\XYZ.htm"
and save it as UTF-8 without BOM (Note: Notepad won't allow you to do that) and it should work.
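Saving without a BOM can itself be scripted; a PowerShell sketch (hypothetical output path) that writes the batch file as UTF-8 without BOM:

# UTF8Encoding($false) emits no byte-order mark, unlike Windows PowerShell's
# default UTF-8 file handling, which prepends one.
$batchText = @"
@echo off
chcp 65001
"%ProgramFiles%\Internet Explorer\iexplore.exe" -k "%CD%\XYZ.htm"
"@
$utf8NoBom = New-Object System.Text.UTF8Encoding($false)
[System.IO.File]::WriteAllText('C:\scripts\kiosk.cmd', $batchText, $utf8NoBom)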
cmd /u won't do anything here, that advice is pretty much bogus. The /U switch only specifies that Unicode will be used for redirection of input and output (and piping). It has nothing to do with the encoding the console uses for output or reading batch files.
URL encoding won't help you either. cmd is hardly a web browser, and outside of HTTP and the web, URL encoding isn't exactly widespread (hence the name). cmd uses percent signs for environment variables and for arguments to batch files and subroutines.
"Ampersand escapes", also known as character entities in HTML and XML, won't work either, because cmd is not HTML or XML. The ampersand is used to execute multiple commands in a single line.
I too suffered this frustrating problem in batch/cmd files. However, so far as I can see, no one yet has stated the reason why this problem occurs, here or in other, similar posts at StackOverflow. The nearest statement addressing this was:
“First of all: Batch files are pretty limited in their internationalization support. There is no direct way of telling cmd what codepage a batch file is in.”
Here is the basic problem. Cmd files are the Windows-2000+ successor to MS-DOS and IBM-DOS bat(ch) files. MS and IBM DOS (1984 vintage) were written in the IBM-PC character set (code page 437). There, the 8th-bit codes were assigned (or “clothed” with) characters different from those assigned to the corresponding codes of Windows, ANSI, or Unicode. The presumption of CP437 encoding is unalterable (except, as previously noted, through cmd.exe /u). Where the characters of the IBM-PC set have exact counterparts in the Unicode set, Windows Explorer remaps them to the Unicode counterparts. Alas, even Windows-1252 characters like š and ¾ have no counterpart in code page 437.
Here is another way to see the problem. Try opening your batch/cmd script using the Windows Edit.com program (at C:\Windows\system32\Edit.com). The Windows-1252 character 0146 ’ (Unicode 8217) instead appears as IBM-PC 146 Æ. A batch command to rename Mary'sFile.txt as Mary’sFile.txt fails, as it is interpreted as MaryÆsFile.txt.
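The remapping is easy to demonstrate from PowerShell (Windows PowerShell; PowerShell 7+ needs the CodePagesEncodingProvider registered before code page 437 can be loaded):

# The same byte, decoded under Windows-1252 and under IBM PC code page 437.
$b = [byte[]](146)
[Text.Encoding]::GetEncoding(1252).GetString($b)   # ’ (Unicode 8217)
[Text.Encoding]::GetEncoding(437).GetString($b)    # Æ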
This problem can be avoided in the case of copying a file named Mary’sFile.txt: cite it as Mary?sFile.txt, e.g.:
xCopy Mary?sFile.txt Mary?sLastFile.txt
You will see a similar treatment (substitution of question marks) in a DIR list of files having Unicode characters.
Obviously, this is useless unless an extant file has the Unicode characters. This solution’s range is paltry and inadequate, but please make what use of it you can.
You can try to use Shift-JIS encoding (code page 932), which is the ANSI code page on Japanese Windows.