I need to export some data using PowerShell to an ASCII-encoded file.
My problem is that Scandinavian characters like Æ, Ø and Å turn into ??? in the output file.
Example:
$str = "ÆØÅ"
$str | Out-File C:\test\test.txt -Encoding ascii
In the output file the result of this is: ???
It seems as though you have conflicting requirements.
Save the text in ASCII encoding
Save characters outside the ASCII character range
ASCII encoding does not support the characters you mention, which is the reason they do not work as you expect them to. The MSDN documentation on ASCII Encoding states that:
ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F.
And further that
If your application requires 8-bit encoding (which is sometimes incorrectly referred to as "ASCII"), the UTF-8 encoding is recommended over the ASCII encoding. For the characters 0-7F, the results are identical, but use of UTF-8 avoids data loss by allowing representation of all Unicode characters that are representable. Note that the ASCII encoding has an 8th bit ambiguity that can allow malicious use, but the UTF-8 encoding removes ambiguity about the 8th bit.
You can read more about ASCII encoding on the Wikipedia page regarding ASCII Encoding (this page also includes tables showing all possible ASCII characters and control codes).
You need to either use a different encoding (such as UTF-8) or accept that you can't use characters which fall outside the ASCII range.
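For example, switching to UTF-8, or to a single-byte "ANSI" code page such as Windows-1252 (which does contain Æ, Ø and Å), would look roughly like this; a sketch reusing the path from the question:
$str = "ÆØÅ"
$str | Out-File C:\test\test.txt -Encoding utf8   # UTF-8; Windows PowerShell prepends a BOM
# Alternatively, write the "ANSI" code page Windows-1252 via .NET
# (in Windows PowerShell; PowerShell Core may need the code-pages encoding provider registered):
[System.IO.File]::WriteAllText('C:\test\test.txt', $str, [System.Text.Encoding]::GetEncoding(1252))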
Related
I am a very non-technical Programme Manager, and a client is requesting that we deliver data in either ASCII or ANSI encoding.
Our programmers have used Unicode (UTF-16), so my question is whether Unicode (UTF-16) is compatible with ASCII or ANSI. Or am I misunderstanding this? Do we need to change the encoding?
We haven't tried anything yet.
In short: ASCII encoding contains 128 characters. ANSI encoding contains 256 characters. UTF-16 encoding has the capacity for 1,112,064 character codes. There is some nuance, such as the number of bytes used to store each character, but I don't think that is relevant here.
You can certainly convert a UTF-16 document down to ANSI or ASCII encoding, but any characters beyond what those encodings can represent will be lost (typically replaced with a substitution character such as ?).
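A quick way to see that substitution in action (a .NET sketch; the sample string is purely illustrative):
$ansi  = [System.Text.Encoding]::GetEncoding(1252)   # a common "ANSI" code page
$bytes = $ansi.GetBytes("Æ日")                        # Æ fits Windows-1252; 日 does not
$ansi.GetString($bytes)                               # "Æ?" - the unmappable character became '?'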
For you, as a manager, there are some questions. At minimum:
Why does the client need this particular encoding? Can it be accommodated in some other way?
Are any characters in your data beyond the scope of ASCII/ANSI? Most (all?) programming languages provide a way to retrieve an integer representation of a character and to check whether it falls outside the range of the desired encoding. This can be leveraged to discover how many characters in your data are incompatible with the desired encoding; see the sketch after this list.
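A hypothetical PowerShell sketch of that scan (the file name is assumed; note that "> 255" is only a rough proxy for "beyond ANSI", since each ANSI code page maps its upper 128 slots differently):
$text = Get-Content -Raw .\data.txt
$beyondAscii = ($text.ToCharArray() | Where-Object { [int]$_ -gt 127 }).Count
$beyondAnsi  = ($text.ToCharArray() | Where-Object { [int]$_ -gt 255 }).Count
"Characters beyond ASCII: $beyondAscii; beyond the 8-bit range: $beyondAnsi"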
I am trying to generate strings with 1 of every ASCII character. I started with
32..255 | % { [char]$_ | Out-File -FilePath .\outfile.txt -Encoding ASCII -Append }
I expected the list of printable characters, but I got different characters.
Can anyone point me to either a better way to get my expected result or an explanation as to why I'm getting these results?
[char[]] (32..255) | Set-Content outfile.txt
In Windows PowerShell this will create an "ANSI"-encoded file. The term "ANSI" encoding is an umbrella term for the set of fixed-width, single-byte, 8-bit encodings on Windows that are a superset of ASCII encoding. The specific "ANSI" encoding that is used is implied by the code page associated with the legacy system locale in effect on your system[1]; e.g., Windows-1252 on US-English systems.
See the bottom section for why "ANSI" encoding should be avoided.
If you were to do the same thing in PowerShell Core, you'd get a UTF-8-encoded file without a BOM, which is the best encoding to use for cross-platform and cross-locale compatibility.
In Windows PowerShell, adding -Encoding utf8 would give you a UTF-8 file too, but with a BOM.
If you used -Encoding Unicode or simply used redirection operator > or Out-File, you'd get a UTF-16LE-encoded file.
(In PowerShell Core, by contrast, > produces BOM-less UTF-8 by default, because the latter is the consistently applied default encoding).
Note: With strings and numbers, Set-Content and > / Out-File can be used interchangeably (encoding differences in Windows PowerShell aside); for other types, only > / Out-File produces meaningful representations, albeit suitable only for human eyeballs, not programmatic processing - see this answer for more.
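To check which encoding you actually got, a byte-level peek with Format-Hex (available in PowerShell 5+) is handy; a sketch using the file from above:
[char[]] (32..255) | Set-Content outfile.txt
Format-Hex -Path .\outfile.txt | Select-Object -First 4   # no BOM, one byte per character => "ANSI" here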
ASCII code points are limited to 7-bit values, i.e., the range 0x0 - 0x7f (127).
Therefore, your input values 128 - 255 cannot be represented as ASCII characters, and using -Encoding ASCII results in invalid input characters getting replaced with literal ? characters (code point 0x3f / 63), resulting in loss of information.
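You can observe the substitution directly (a small sketch):
[System.Text.Encoding]::ASCII.GetBytes([char[]] (65, 200))   # yields 65, 63
# 65 ('A') survives; 200 (U+00C8, È) is outside the 7-bit range and becomes 63 ('?').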
Important:
In memory, casting numbers such as 32 (0x20) or 255 (0xFF) to [char] (System.Char) instances causes the numbers to be interpreted as UTF-16 code units, representing Unicode characters[2] such as U+0020 and U+00FF as 2-byte sequences using the native byte order, because that's what characters are in .NET.
Similarly, instances of the .NET [string] type System.String are sequences of one or more [char] instances.
On output to a file or during serialization, re-encoding of these UTF-16 strings may occur, depending on the implied or specified output encoding.
If the output encoding is a fixed single-byte encoding, such as ASCII, Default ("ANSI"), or OEM, loss of information may occur, namely if the string to output contains characters that cannot be represented in the target encoding.
Choose one of the Unicode-based encoding formats to guarantee that:
no information is lost,
the resulting file is interpreted the same on all systems, irrespective of their system locale.
UTF-8 is the most widely recognized encoding, but note that Windows PowerShell (unlike PowerShell Core) invariably prepends a BOM to such files, which can cause problems on Unix-like platforms and with utilities of Unix heritage. It is a format optimized for backward compatibility with ASCII that uses between 1 and 4 bytes to encode a single character.
UTF-16LE (which PowerShell calls Unicode) is a direct representation of the in-memory code units, but note that each character is encoded with (at least) 2 bytes, which results in files up to twice the size of UTF-8 for strings that primarily contain characters in the ASCII range.
UTF-16BE (which PowerShell calls bigendianunicode) reverses the byte order in each code unit.
UTF-32LE (which PowerShell calls UTF32) represents each Unicode character as a fixed 4-byte sequence; even more so than with UTF-16, this typically results in unnecessarily large files.
UTF-7 should be avoided altogether, as it is not part of the Unicode standard.
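A small sketch contrasting two of these in Windows PowerShell (the file names are assumed):
'hello' | Out-File utf8.txt  -Encoding utf8      # 3-byte BOM + 1 byte per ASCII-range character
'hello' | Out-File utf16.txt -Encoding Unicode   # 2-byte BOM + 2 bytes per character
(Get-Item utf8.txt, utf16.txt) | Select-Object Name, Length   # utf16.txt is noticeably larger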
[1] Among the legacy code pages supported on Windows, there are also fixed double-byte as well as variable-width encodings, but only for East Asian locales; sometimes they're (incorrectly) collectively referred to as DBCS (Double-Byte Character Set), as opposed to SBCS (Single-Byte Character Set); see the list of all Windows code pages.
[2] Strictly speaking, a UTF-16 code unit identifies a Unicode code point, but not every code point by itself is a complete Unicode character, because some (rare) Unicode characters have a code point value that falls outside the range a single 16-bit integer can represent; such code points are instead represented by a sequence of two 16-bit code units, known as a surrogate pair.
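For instance, U+1F600 (😀) lies outside the 16-bit range and therefore occupies two code units; a sketch:
$ch = [char]::ConvertFromUtf32(0x1F600)          # returns a [string] of two [char]s
$ch.Length                                       # 2
'{0:X4} {1:X4}' -f [int]$ch[0], [int]$ch[1]      # D83D DE00 - the surrogate pair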
I'm currently reading mails from file and processing some of the header information. Non-ASCII characters are encoded according to RFC 2047 in quoted-printable or Base64, so the files contain no non-ASCII characters. If the file is encoded in UTF-8, Windows-1252 or one of the ISO-8859-* character encodings, I won't run into problems, because ASCII is embedded at the same place in all of these charsets (so 0x41 is an A in all of them).
But what if the file is encoded using an encoding that does not embed ASCII in that way? Do encodings like this even exist? And if so, is there even a reliable way of detecting them?
There is a Charset-detector of Mozilla based on this very interesting article. It can detect a very large amount of different encodings. There is also a port to C# available on GitHub which I used before. It turned out to be quite reliable. But of course, when the text just contains ASCII characters, it cannot distinguish between the different encodings that encode ASCII in the same way. But any encodings that encode ASCII in a different way should be detected correctly with this library.
What is an example of a character encoding which is not compatible with ASCII, and why isn't it?
Also, what are other encodings which have upward compatibility with ASCII (apart from UTF and ISO 8859, which I already know about), and for what reason?
There are EBCDIC-based encodings that are not compatible with ASCII. For example, I recently encountered an email that was encoded using CP1026, aka EBCDIC 1026. If you look at its character table, letters and numbers are encoded at very different offsets than in ASCII. This was throwing off my email parser, particularly because LF is encoded as 0x25 instead of as 0x0A in ASCII.
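You can reproduce that divergence from .NET (a sketch; on PowerShell Core you may first have to register the code-pages encoding provider to make legacy code pages available):
$ebcdic = [System.Text.Encoding]::GetEncoding(1026)             # EBCDIC 1026
$ebcdic.GetBytes("A`n") | ForEach-Object { '0x{0:X2}' -f $_ }   # 0xC1, 0x25 - not ASCII's 0x41, 0x0A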
I have a string whose Unicode form is "hao123--我的上网主页"; as a UTF-8 string in my C++ program it displays as "hao123锛嶏紞鎴戠殑涓婄綉涓婚〉". But I need to write it to a file in this format: "hao123\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875". How can I do that? I know little about these encodings. Can anyone help? Thanks!
You seem to mix up UTF-8 and UTF-16 (or possibly UCS-2). UTF-8 encoded characters have a variable length of 1 to 4 bytes. Contrary to this, you seem to want to write UTF-16 or UCS-2 to your files (I am guessing this from the \uxxxx character references in your file output string).
For an overview of these character sets, have a look at Wikipedia's article on UTF-8 and browse from there.
Here are the very basics (heavily simplified):
UCS-2 stores all characters as exactly 16 bits. It therefore cannot encode all Unicode characters, only the so-called "Basic Multilingual Plane".
UTF-16 stores the most frequently-used characters in 16 bits, but some characters must be encoded in 32 bits.
UTF-8 encodes characters with a variable length of 1 to 4 bytes. Only characters from the original 7-bit ASCII charset are encoded as 1 byte.
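A sketch of producing those \uXXXX escapes (shown in PowerShell rather than C++, since each .NET [char] is already a UTF-16 code unit; characters outside the BMP would emit two escapes, one per surrogate):
$s = "hao123--我的上网主页"   # the question's string; its dashes may actually be fullwidth U+FF0D
$escaped = -join ($s.ToCharArray() | ForEach-Object {
    if ([int]$_ -le 0x7F) { "$_" } else { '\u{0:X4}' -f [int]$_ }   # ASCII passes through; the rest is escaped
})
$escaped   # e.g. hao123--\u6211\u7684\u4E0A\u7F51\u4E3B\u9875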