Generating Big5 characters in PowerShell

I want to create big5-encoded Chinese characters and save them into a txt file with powershell.
I know that in windows cmd.exe, I can easily create big5 characters with something like this:
echo 信 > testBig5.txt
However, in PowerShell the same command creates a file in which the characters are encoded in UTF-16LE.
I use a binary editor (e.g. UltraEdit or Notepad++'s HEX-Editor plugin) to check whether the characters are encoded in Big5.
You may use this tool to view the Big5 encoding of a Chinese character.
=========================== edit 2017/3/31 =============================
I found something interesting:
The Big5 encoding of 信 is ab 48.
echo 信 | out-file -filepath abc.txt -encoding Default
echo 信信 | out-file -filepath abc.txt -encoding Default
echo 信信信 | out-file -filepath abc.txt -encoding Default
echo 信信信信 | out-file -filepath abc.txt -encoding Default
All of the 信s generated by the above code are ab 48.
However, once the string is longer than 4 characters, 信 is encoded as e4 bf a1, which is its UTF-8 encoding.
echo 信信信信信 | out-file -filepath abc.txt -encoding Default # 信 is encoded as e4 bf a1 here.
It seems that the final encoding depends on the length of the string.
Now the question becomes how to generate long Big5-encoded strings with PowerShell.
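One way to avoid depending on whatever -Encoding Default resolves to is to name the Big5 code page explicitly through .NET. The following is a minimal sketch, not a confirmed explanation of the length-dependent behavior described above. It assumes code page 950 (Big5) is available in the session, builds 信 from its code point U+4FE1 so that the script file's own encoding does not matter, and writes to a hypothetical file in the temp directory.

```powershell
# Assumption: code page 950 (Big5) is available in this PowerShell session.
$big5 = [System.Text.Encoding]::GetEncoding(950)

# Build the text from the code point U+4FE1 (信), repeated five times.
$text = ([string][char]0x4FE1) * 5

# Hypothetical output path, chosen only for this example.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'testBig5.txt'

# WriteAllText encodes the string with the given encoding and adds no trailing newline.
[System.IO.File]::WriteAllText($path, $text, $big5)

# Dump the bytes as hex to check the result without a hex editor.
$hex = ([System.IO.File]::ReadAllBytes($path) | ForEach-Object { $_.ToString('x2') }) -join ' '
$hex
```

If 信 is ab 48 in Big5 as stated above, $hex should read ab 48 repeated five times, regardless of the string length.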

Related

ASCII encoding does not work in powershell

I just can't get ASCII encoding to work in PowerShell. I have tried a bunch of different approaches.
Whatever I try, I get a UTF-8 encoded file (that is what Notepad++ tells me):
$newLine = "Ein Test öäü"
$newLine | Out-File -FilePath "c:\temp\check.txt" -Encoding ascii
PSVersion = 5.1.14393.5066
Any hint is welcome!
ASCII is a 7-bit character set and doesn't contain any accented character, so obviously storing öäü in ASCII doesn't work. If you need UTF-8 then you need to specify encoding as utf8
$newLine = "Ein Test öäü"
$newLine | Out-File -FilePath "c:\temp\check.txt" -Encoding utf8
If you need another encoding then specify it accordingly. For example to get the ANSI code page use this
$newLine = "Ein Test öäü"
$newLine | Out-File -FilePath "c:\temp\check.txt" -Encoding default
-Encoding default will save the file in the current ANSI code page and -Encoding oem will use the current OEM code page. Just press Tab after -Encoding and PowerShell will cycle through the list of supported encodings. For encodings not in that list you can trivially deal with them using System.Text.Encoding
Note that "ANSI code page" is a misnomer and the actual encoding changes depending on each environment so it won't be reliable. For example if you change the code page manually then it won't work anymore. For a more reliable behavior you need to explicitly specify the encoding (typically Windows-1252 for Western European languages). In older PowerShell use
[IO.File]::WriteAllLines("c:\temp\check.txt", $newLine, [Text.Encoding]::GetEncoding(1252))
and in PowerShell Core you can use
$newLine | Out-File -FilePath "check2.txt" -Encoding ([Text.Encoding]::GetEncoding(1252))
See How do I change my Powershell script so that it writes out-file in ANSI - Windows-1252 encoding?
Found the solution:
$file = "c:\temp\check-ansi.txt"
$newLine = "Ein Test öÖäÄüÜ"
Remove-Item $file
[IO.File]::WriteAllLines($file, $newLine, [Text.Encoding]::GetEncoding(28591))
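To see what the encodings mentioned in these answers do to the umlauts, the encoded bytes can be inspected directly instead of going through a file and an editor. This is a small sketch: the helper name ConvertTo-HexString is made up for the example, and the characters are built from code points so that the script file's own encoding plays no role.

```powershell
# Build "öäü" from code points (U+00F6, U+00E4, U+00FC).
$text = -join ([char]0xF6, [char]0xE4, [char]0xFC)

# Hypothetical helper: format a byte array as space-separated hex.
function ConvertTo-HexString([byte[]]$Bytes) {
    ($Bytes | ForEach-Object { $_.ToString('x2') }) -join ' '
}

# ASCII has no umlauts, so each character is replaced by '?' (3f).
$ascii  = ConvertTo-HexString ([System.Text.Encoding]::ASCII.GetBytes($text))

# Windows-1252 stores each umlaut in a single byte.
$cp1252 = ConvertTo-HexString ([System.Text.Encoding]::GetEncoding(1252).GetBytes($text))

# UTF-8 needs two bytes for each of these characters.
$utf8   = ConvertTo-HexString ([System.Text.Encoding]::UTF8.GetBytes($text))

"ASCII:  $ascii"    # 3f 3f 3f
"CP1252: $cp1252"   # f6 e4 fc
"UTF-8:  $utf8"     # c3 b6 c3 a4 c3 bc
```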

Powershell Chinese characters encoding error

I have a file called test.txt which contains a single Chinese character, 中, in it.
In a hex editor, this character appears as the bytes e4 b8 ad.
If I run Get-Content test.txt | Out-File test_output.txt, the content of test_output.txt is different from test.txt. Why is this happening?
I've tried all the encoding parameters listed here ("Unicode", "UTF7", "UTF8", "UTF32", "ASCII", "BigEndianUnicode", "Default", and "OEM"), but none of them correctly converts the Chinese character.
How can I correctly convert Chinese characters using Get-Content and Out-File?
The encoding, e4 b8 ad, looks like the URL encoding of 中. Is this why none of the encoding parameters is compatible with this Chinese character?
I use Notepad++ and Notepad++'s hex-editor plugin as my text-editor and hex-editor, respectively.
I tried get-content test.txt -encoding UTF8 | Out-File test_output.txt -encoding UTF8
My test.txt is "e4 b8 ad 0a". And the output is "ef bb bf e4 b8 ad 0d 0a"
test.txt is in UTF-8.
In Windows PowerShell, Get-Content doesn't recognize UTF-8 unless the file starts with a BOM, and Out-File writes UTF-16LE by default.
So specifying the encoding for both commands is necessary.
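In the output quoted above, the leading ef bb bf is the UTF-8 byte order mark written by -Encoding UTF8 in Windows PowerShell, and 0d 0a is the CRLF line ending that Out-File appends. If a byte-for-byte copy matters, one option is to bypass the cmdlets and use .NET with a BOM-less UTF-8 encoding. This is a sketch that recreates the question's input in the temp directory; the paths are hypothetical.

```powershell
# UTF8Encoding($false) is a UTF-8 encoding that emits no byte order mark.
$utf8NoBom = New-Object System.Text.UTF8Encoding($false)

$dir = [System.IO.Path]::GetTempPath()
$src = Join-Path $dir 'test.txt'
$dst = Join-Path $dir 'test_output.txt'

# Recreate the input from the question: 中 (U+4E2D) followed by LF, i.e. e4 b8 ad 0a.
$bytes = [byte[]](0xE4, 0xB8, 0xAD, 0x0A)
[System.IO.File]::WriteAllBytes($src, $bytes)

# Read as UTF-8 and write back as UTF-8 without a BOM; the line ending is left untouched.
$text = [System.IO.File]::ReadAllText($src, $utf8NoBom)
[System.IO.File]::WriteAllText($dst, $text, $utf8NoBom)

$hex = ([System.IO.File]::ReadAllBytes($dst) | ForEach-Object { $_.ToString('x2') }) -join ' '
$hex   # e4 b8 ad 0a
```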
In my case, the Unicode encoding solved my problem with the Chinese characters. The file I was modifying contained C# code on a TFS server.
$path = "test.cs"
$content = Get-Content -Path $path -Encoding Unicode
# ... modify $content here ...
$content | Set-Content -Path $path -Encoding Unicode
It might help somebody else.

Why does appending to a file insert whitespace (NUL)

I am running this command
Get-Content generated\no_animate.css >> generated\all.css
I want to append the contents of no_animate.css to all.css.
This is working if I run it like this from a cmd prompt:
powershell Get-Content generated\no_animate.css >> generated\all.css
If I put the exact same code into a .ps1 file and run that, it copies the contents but inserts whitespace (shown as NUL in my text editor) between every character.
Why would it be doing this? How do I prevent it?
In PowerShell the redirection operators > and >> are shorthand for Out-File and Out-File -Append respectively. In Windows PowerShell, Out-File uses Unicode (specifically little-endian UTF-16) as its default encoding. With this encoding, each of these characters is represented by 2 bytes instead of just 1. For ASCII characters (characters from the Basic Latin block) the second, high-order byte has the value 0, which is what the editor shows as NUL.
Running powershell Get-Content generated\no_animate.css >> generated\all.css from CMD uses the CMD redirection operator instead of the PowerShell one, which doesn't transform the text to Unicode.
If you want to use PowerShell and your input file is ascii-encoded use Add-Content (available in PowerShell v3 or newer):
Get-Content generated\no_animate.css | Add-Content generated\all.css
or Out-File with explicit encoding.
Get-Content generated\no_animate.css |
Out-File generated\all.css -Append -Encoding Ascii
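The NUL bytes are easy to reproduce without any files. As a small illustration, the following compares the bytes of the same two-character string in UTF-16LE and in ASCII.

```powershell
$text = 'ab'

# UTF-16LE: two bytes per character, low-order byte first, so each ASCII letter is followed by 00.
$utf16 = ([System.Text.Encoding]::Unicode.GetBytes($text) | ForEach-Object { $_.ToString('x2') }) -join ' '

# ASCII: one byte per character.
$ascii = ([System.Text.Encoding]::ASCII.GetBytes($text) | ForEach-Object { $_.ToString('x2') }) -join ' '

"UTF-16LE: $utf16"   # 61 00 62 00
"ASCII:    $ascii"   # 61 62
```

Appending UTF-16LE output to a file that an editor reads as single-byte text makes those 00 bytes show up between the characters.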

UTF8 Character Encoding Error

I'm currently outputting a long string through powershell as unicode using the below syntax (reference for doing it this way):
$string | out-file $path -encoding unicode
If I try to import this file in mongo, or another process that can't read UTF8 characters, I get an "Invalid UTF8 character detected." Is this the incorrect syntax?
In PowerShell, -Encoding Unicode means UTF-16LE, which is not the same encoding as UTF-8. Have you tried -Encoding ASCII or -Encoding UTF8?

Powershell: Setting Encoding for Get-Content Pipeline

I have a file saved as UCS-2 Little Endian. I want to change the encoding, so I ran the following code:
cat tmp.log -encoding UTF8 > new.log
The resulting file is still in UCS-2 Little Endian. Is this because the pipeline is always in that format? Is there an easy way to pipe this to a new file as UTF8?
As suggested here:
Get-Content tmp.log | Out-File -Encoding UTF8 new.log
I would do it like this:
get-content tmp.log -encoding Unicode | set-content new.log -encoding UTF8
My understanding is that the -Encoding option selects the encoding that the file should be read or written in.
To load content from an XML file with a specific encoding:
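The read-with-one-encoding, write-with-another approach can be tried end to end on a throwaway file. This sketch creates its own UTF-16LE input in the temp directory; the file contents are invented for the example. Whether the UTF-8 output starts with a BOM depends on the PowerShell version, so the check below only looks for NUL bytes.

```powershell
$dir = [System.IO.Path]::GetTempPath()
$src = Join-Path $dir 'tmp.log'
$dst = Join-Path $dir 'new.log'

# Create a UTF-16LE sample input (the content is made up for this example).
'first line', 'second line' | Set-Content -Path $src -Encoding Unicode

# Decode as UTF-16LE on the way in, encode as UTF-8 on the way out.
Get-Content -Path $src -Encoding Unicode | Set-Content -Path $dst -Encoding UTF8

# UTF-8 output of ASCII-range text contains no NUL bytes, unlike its UTF-16 form.
$srcHasNul = [System.IO.File]::ReadAllBytes($src) -contains 0
$dstHasNul = [System.IO.File]::ReadAllBytes($dst) -contains 0
"source has NUL bytes: $srcHasNul"
"result has NUL bytes: $dstHasNul"
```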
(Get-Content -Encoding UTF8 $fileName)
If you are reading an XML file, here's an even better way that adapts to the encoding of your XML file:
$xml = New-Object -Typename XML
$xml.load('foo.xml')
PowerShell's get-content/set-content encoding flag doesn't handle all encoding types. You may need to use IO.File, for example to load a file using Windows-1252:
$myString = [IO.File]::ReadAllText($filePath, [Text.Encoding]::GetEncoding(1252))
See [Text.Encoding]::GetEncoding and [Text.Encoding]::GetEncodings.
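As a brief illustration of these two methods, an encoding can be looked up either by code page number or by name, and GetEncodings lists what the current session reports. This sketch assumes Big5 (code page 950) is available. The list returned by GetEncodings can vary by runtime, so no particular output is claimed for it.

```powershell
# Look up the same encoding by code page number and by name.
$byNumber = [System.Text.Encoding]::GetEncoding(950)
$byName   = [System.Text.Encoding]::GetEncoding('big5')

"Name for code page 950: $($byNumber.WebName)"
"Code page for 'big5':   $($byName.CodePage)"

# List a few of the encodings this session reports (the result varies by runtime).
[System.Text.Encoding]::GetEncodings() | Select-Object -First 5 -Property CodePage, Name
```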