Casting int to chars in Powershell has unexpected results - powershell

I am trying to generate strings with 1 of every ASCII character. I started with
32..255| %{[char]$_ | Out-File -filepath .\outfile.txt -Encoding ASCII -Append}
I expected the list of printable characters, but I got different characters.
Can anyone point me to either a better way to get my expected result or an explanation as to why I'm getting these results?

[char[]] (32..255) | Set-Content outfile.txt
In Windows PowerShell this will create an "ANSI"-encoded file. The term "ANSI" encoding is an umbrella term for the set of fixed-width, single-byte, 8-bit encodings on Windows that are a superset of ASCII encoding. The specific "ANSI" encoding that is used is implied by the code page associated with the legacy system locale in effect on your system[1]; e.g., Windows-1252 on US-English systems.
See the bottom section for why "ANSI" encoding should be avoided.
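To illustrate why the "ANSI" label is ambiguous, here is a minimal Python sketch (Python is used only for illustration; cp1252 and cp1251 are Python's codec names for two Windows code pages). It decodes the same single byte under two different code pages:

```python
# One byte, two "ANSI" code pages: the meaning depends on the system locale.
raw = bytes([0xE6])

as_western = raw.decode("cp1252")   # Windows-1252 (Western European)
as_cyrillic = raw.decode("cp1251")  # Windows-1251 (Cyrillic)

print(as_western, as_cyrillic)  # æ ж
```

The file content is identical in both cases; only the interpretation changes.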
If you were to do the same thing in PowerShell Core, you'd get a UTF-8-encoded file without a BOM, which is the best encoding to use for cross-platform and cross-locale compatibility.
In Windows PowerShell, adding -Encoding utf8 would give you a UTF-8 file too, but with a BOM.
If you used -Encoding Unicode or simply used redirection operator > or Out-File, you'd get a UTF-16LE-encoded file.
(In PowerShell Core, by contrast, > produces BOM-less UTF-8 by default, because the latter is the consistently applied default encoding).
Note: With strings and numbers, Set-Content and > / Out-File can be used interchangeably (encoding differences in Windows PowerShell aside); for other types, only > / Out-File produces meaningful representations, albeit suitable only for human eyeballs, not programmatic processing - see this answer for more.
ASCII code points are limited to 7-bit values, i.e., the range 0x0 - 0x7f (127).
Therefore, your input values 128 - 255 cannot be represented as ASCII characters, and using -Encoding ASCII results in invalid input characters getting replaced with literal ? characters (code point 0x3f / 63), resulting in loss of information.
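The replacement effect can be reproduced outside PowerShell. The following is a rough Python sketch, assuming that Python's errors="replace" handler is a fair stand-in for what -Encoding ASCII does (both substitute a literal ?):

```python
# Code points 32..255 as one string, similar to [char[]] (32..255) in PowerShell.
text = "".join(chr(cp) for cp in range(32, 256))

# errors="replace" substitutes "?" for anything ASCII cannot represent.
ascii_bytes = text.encode("ascii", errors="replace")

# Subtract the one genuine "?" (code point 63) that was already in the input.
replaced = ascii_bytes.count(b"?") - text.count("?")
print(replaced)  # 128
```

All 128 characters in the range 128 - 255 end up as the same byte, so the original values cannot be recovered.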
Important:
In memory, casting numbers such as 32 (0x20) or 255 (0xFF) to [char] (System.Char) instances causes the numbers to be interpreted as UTF-16 code units, representing Unicode characters[2] such as U+0020 and U+00FF as 2-byte sequences using the native byte order, because that's what characters are in .NET.
Similarly, instances of the .NET [string] type System.String are sequences of one or more [char] instances.
On output to a file or during serialization, re-encoding of these UTF-16 strings may occur, depending on the implied or specified output encoding.
If the output encoding is a fixed single-byte encoding, such as ASCII, Default ("ANSI"), or OEM, loss of information may occur, namely if the string to output contains characters that cannot be represented in the target encoding.
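As a small illustration of the code-unit view, here is a Python sketch. Note the assumption: Python strings are not stored as UTF-16 the way .NET strings are, so the UTF-16LE bytes are produced explicitly here; the helper name utf16le_bytes is made up for this example.

```python
def utf16le_bytes(number: int) -> bytes:
    """Return the UTF-16LE bytes of the character with this code point."""
    return chr(number).encode("utf-16-le")

print(utf16le_bytes(32).hex(" "))   # 20 00
print(utf16le_bytes(255).hex(" "))  # ff 00
```

Each of these characters occupies one 2-byte code unit, with the low byte first on a little-endian system.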
Choose one of the Unicode-based encoding formats to guarantee that:
no information is lost,
the resulting file is interpreted the same on all systems, irrespective of their system locale.
UTF-8 is the most widely recognized encoding, but note that Windows PowerShell (unlike PowerShell Core) invariably prepends a BOM to such files, which can cause problems on Unix-like platforms and with utilities of Unix heritage; it is a format focused on and optimized for backward compatibility with ASCII encoding that uses between 1 - 4 bytes to encode a single character.
UTF-16LE (which PowerShell calls Unicode) is a direct representation of the in-memory code units, but note that each character is encoded with (at least) 2 bytes, which results in up to twice the size of UTF-8 files for strings that primarily contain characters in the ASCII range.
UTF-16BE (which PowerShell calls bigendianunicode) reverses the byte order in each code unit.
UTF-32LE (which PowerShell calls UTF32), represents each Unicode character as a fixed 4-byte sequence; even more so than with UTF-16, this typically results in unnecessarily large files.
UTF-7 should be avoided altogether, as it is not part of the Unicode standard.
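To get a feel for the size trade-offs listed above, here is a minimal Python sketch that encodes the characters 32 - 255 from the question in several Unicode encodings. The codec names are Python's; the "-le" codecs write no BOM, whereas PowerShell's Unicode and UTF32 encodings do, so actual file sizes would be slightly larger.

```python
text = "".join(chr(cp) for cp in range(32, 256))  # 224 characters

sizes = {
    "utf-8": len(text.encode("utf-8")),
    "utf-8-sig": len(text.encode("utf-8-sig")),  # UTF-8 preceded by a 3-byte BOM
    "utf-16-le": len(text.encode("utf-16-le")),
    "utf-32-le": len(text.encode("utf-32-le")),
}
for name, size in sizes.items():
    print(f"{name:10} {size} bytes")
```

For this input the results are 352, 355, 448, and 896 bytes respectively: the 96 ASCII characters take 1 byte each in UTF-8 and the remaining 128 take 2 bytes each.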
[1] Among the legacy code pages supported on Windows, there are also fixed double-byte as well as variable-width encodings, but only for East Asian locales; sometimes they're (incorrectly) collectively referred to as DBCS (Double-Byte Character Set), as opposed to SBCS (Single-Byte Character Set); see the list of all Windows code pages.
[2] Strictly speaking, a UTF-16 code unit identifies a Unicode code point, but not every code unit by itself is a complete Unicode character, because some (rare) Unicode characters have a code point value that falls outside the range that can be represented with a 16-bit integer; these code points are instead represented by a sequence of 2 code units, known as a surrogate pair.

Related

ASCII or ANSI with Unicode (UTF-16)

I am a very stupid Programme Manager and I have a client requesting us to send in either ASCII or ANSI encoding format.
Our programmers have used Unicode (UTF-16), so my question is whether Unicode (UTF-16) is compatible with ASCII or ANSI. Or am I understanding this incorrectly? Do we need to change the encoding?
We haven't tried anything yet.
In short: ASCII encoding contains 128 characters. ANSI encoding contains 256 characters. UTF-16 encoding has the capacity for 1,112,064 character codes. There is some nuance such as the bytes used to store each character, but I don't think that is relevant here.
You can certainly convert a UTF-16 document down to ANSI or ASCII encoding, but any characters that are beyond their specification will be lost (typically replaced with a substitute character such as ?, depending on the converter).
For you, as a manager, there are some questions. At minimum:
Why does the client need this particular encoding? Can it be accommodated in some other way?
Are any characters in your data beyond the scope of ASCII/ANSI? Most (all?) programming languages provide a method to retrieve an integer representation of a character and determine if it is beyond the range of the desired encoding. This could be leveraged to discover how many instances exist of a character not compatible with the desired encoding.
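As a rough sketch of that last check, here is how it might look in Python for the ASCII case. The function name non_ascii_report and the sample string are made up for illustration; for an "ANSI" target you would test against the specific code page instead of the fixed limit of 127.

```python
from collections import Counter

def non_ascii_report(text: str) -> Counter:
    """Count each character whose code point is above 127 (outside ASCII)."""
    return Counter(ch for ch in text if ord(ch) > 127)

sample = "Café menu: crème brûlée, 5 €"  # hypothetical sample data
report = non_ascii_report(sample)
print(sum(report.values()), dict(report))
```

An empty report means the data can be converted to ASCII without any loss.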

Understanding encoding schemes

I cannot understand some key elements of encoding:
Is ASCII only a character set, or does it also have its own encoding scheme algorithm?
Do other Windows code pages such as Latin1 have their own encoding algorithm?
Are UTF-7, 8, 16, 32 the only encoding algorithms?
Are the UTF algorithms used only with the Unicode set?
Given the ASCII text: Hello World, if I want to convert it into Latin1 or BIG5, which encoding algorithms are being used in this process? More specifically, do Latin1/Big5 use their own encoding algorithm, or do I have to use a UTF algorithm?
1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.
Refer to https://www.ascii.codes/ to see the full set and inspect the characters.
There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.
2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.
See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.
As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.
3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.
Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.
4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.
Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).
I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.
Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.
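To make the "Hello World" case concrete, here is a small Python sketch (Python is just an illustration choice; "latin-1" is Python's codec name for ISO-8859-1, and the second string is a made-up example):

```python
text = "Hello World"

ascii_bytes = text.encode("ascii")
latin1_bytes = text.encode("latin-1")
print(ascii_bytes == latin1_bytes)  # True

# A character above 127 exists in Latin-1 but not in ASCII.
accented = "Hållo"
print(accented.encode("latin-1"))                  # b'H\xe5llo'
print(accented.encode("ascii", errors="replace"))  # b'H?llo'
```

The last line shows the lossy case: the å is replaced because ASCII has no slot for it.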
I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).
If you're curious about utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out though. they don't use the compression/decompression metaphor in their documentation):
http://unicode.org
You probably won't need anything else.
... except maybe a decent codepoint lookup tool: https://www.unicode.codes/
You can roll your own code based on the unicode documentation, or use the official unicode library:
http://site.icu-project.org/home
Hope this helps.
In general, most encoding schemes like ASCII or Latin-1 are simply big tables mapping characters to specific byte sequences. There may or may not be some specific algorithm how the creators came up with those specific character⟷byte associations, but there's generally not much more to it than that.
One of the innovations of Unicode specifically is the indirection of assigning each character a unique number first and foremost, and worrying about how to encode that number into bytes secondarily. There are a number of encoding schemes for how to do this, from the UCS and GB 18030 encodings to the most commonly used UTF-8/UTF-16 encodings. Some are largely defunct by now like UCS-2. Each one has their pros and cons in terms of space tradeoffs, ease of processing and transportability (e.g. UTF-7 for safe transport over 7-bit system like email). Unless otherwise noted, they can all encode the full set of current Unicode characters.
To convert from one encoding to another, you pretty much need to map bytes from one table to another. Meaning, if you look at the EBCDIC table and the Windows-1250 table, the bytes 0xC1 and 0x41 respectively both represent the same character "A", so when converting between the two encodings, you'd map those bytes as equivalent. Yes, that means there needs to be one such mapping between each possible encoding pair.
Since that is obviously rather laborious, modern converters virtually always go through Unicode as a middleman. This way each encoding only needs to be mapped to the Unicode table, and the conversion can be done with encoding A → Unicode code point → encoding B. In the end you just want to identify which characters look the same/mean the same, and change the byte representation accordingly.
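The "Unicode as middleman" approach can be sketched in a few lines of Python. This assumes Python's cp037 codec as the EBCDIC variant (there are several EBCDIC code pages) and cp1250 for Windows-1250; the helper name convert is made up for this example.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes by decoding to Unicode first, then encoding to the target."""
    return data.decode(source).encode(target)

ebcdic_a = b"\xc1"  # "A" in EBCDIC (code page 037)
print(convert(ebcdic_a, "cp037", "cp1250"))  # b'A'
```

Only two tables are consulted (EBCDIC to Unicode, Unicode to Windows-1250); no direct EBCDIC to Windows-1250 table is needed.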
A character encoding is a mapping from a sequence of characters to a sequence of bytes (in the past there were also encodings to a sequence of bits - they are falling out of fashion). Usually this mapping is one-to-one but not necessarily onto. This means there may be byte sequences that don't correspond to a character sequence in this encoding.
The domain of the mapping defines which characters can be encoded.
Now to your questions:
1: ASCII is both: it defines 128 characters (some of them are control codes) and how they are mapped to the byte values 0 to 127.
2: Each encoding may define its own set of characters and how they are mapped to bytes.
3: No, there are others as well: ASCII, ISO-8859-1, ...
4: Unicode uses a two-step mapping: first the characters are mapped to (relatively) small integers called "code points", then these integers are mapped to a byte sequence. The first step is the same for all UTF encodings; the second step differs. Unicode has the ambition to contain all characters. This means most characters are in the "UNICODE set".
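The two-step mapping can be observed directly in Python, shown here as a minimal sketch (the character é is an arbitrary example; the big-endian codecs are used so that the bytes read in the same order as the code point):

```python
ch = "é"
code_point = ord(ch)          # step 1: character -> code point
print(f"U+{code_point:04X}")  # U+00E9

# step 2: the same code point -> a different byte sequence per encoding
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}
for name, data in encoded.items():
    print(name, data.hex(" "))
```

The code point is the same in all three cases; only the byte sequence differs (c3 a9, 00 e9, and 00 00 00 e9).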
Every character has been assigned a Unicode value (numbered from 0 upward), and that value is unique. How the value is used is up to you: you can use it directly, or apply a well-known encoding scheme such as UTF-8 or UTF-16. An encoding scheme maps the Unicode value to a specific bit sequence (currently between 1 and 4 bytes long) so that it can be uniquely identified within that scheme.
For example, ASCII is an encoding scheme that encodes only 128 of those characters. It uses one byte per character, which is identical to the UTF-8 representation of the same characters. GSM7 is another format, which uses 7 bits per character to encode 128 characters from the Unicode character list.
UTF-8:
It uses 1 byte for characters whose Unicode value is at most 127.
Beyond that, it has its own way of representing Unicode values.
It uses 2 bytes for Cyrillic, then 3 bytes for Hindi characters.
UTF-16:
It uses 2 bytes for characters whose Unicode value is at most 127,
and it also uses 2 bytes for Cyrillic and Hindi characters.
The variable-width UTF schemes fix the leading bits in a specific pattern [e.g. 110|restbits], and the remaining bits [e.g. initialbits|11001] carry the Unicode value, which makes each representation unique.
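As an illustration of the fixed-prefix idea, here is a Python sketch that hand-encodes a two-byte UTF-8 sequence and compares it against the built-in codec. It covers only the two-byte range (U+0080 to U+07FF); the function name encode_2byte_utf8 is made up for this example.

```python
def encode_2byte_utf8(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only two-byte code points are handled in this sketch")
    lead = 0b11000000 | (code_point >> 6)           # fixed prefix 110 + top 5 bits
    trail = 0b10000000 | (code_point & 0b00111111)  # fixed prefix 10 + low 6 bits
    return bytes([lead, trail])

cyrillic_d = "Д"  # U+0414
manual = encode_2byte_utf8(ord(cyrillic_d))
print(manual.hex(" "))                       # d0 94
print(manual == cyrillic_d.encode("utf-8"))  # True
```

Three-byte and four-byte sequences follow the same principle with the prefixes 1110 and 11110 on the first byte.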
The Wikipedia articles on UTF-8, UTF-16, and Unicode explain this in more detail.
I wrote a UTF translator that converts incoming UTF-8 text in any language into its equivalent UTF-16 text.

How do code pages work in the case of Chinese

How do code pages work in the case of Chinese / Japanese?
It is impossible to encode all of the characters of these languages within the limits of one byte, so how does it work?
Note that I'm talking about pre-Unicode times.
I'm most familiar with Japanese, but in general the strategy is the same for any language that needs more characters than fit in a single byte - you use a variable width multibyte encoding where some bytes are recognized as starting a "wide" character and ASCII is left alone.
In the early days so-called "ASCII-safe" encodings were useful. These used only seven bits (the high bit was always 0) so they worked with a variety of systems (including hardware) that expected only control characters to set the high bit in any byte. ISO-2022-JP is one of these and is still used in email quite often (mostly on feature phones).
Here's what ISO-2022-JP looks like if you don't decode it:
echo "日本語" | iconv -f utf8 -t iso2022jp | cat -v
^[$BF|K\8l^[(B
Note that every byte in the output is valid ASCII; ^[ is an ASCII escape character. Plain ASCII text in the input, such as "test", would come through unchanged. (ISO-2022 also has 8-bit versions, but the 7-bit is the most commonly used variety.)
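The same experiment can be run without iconv. This is a small Python sketch, assuming Python's built-in iso2022_jp codec matches the iconv conversion above:

```python
encoded = "日本語".encode("iso2022_jp")
print(encoded)  # b'\x1b$BF|K\\8l\x1b(B'

seven_bit = all(byte < 0x80 for byte in encoded)
print(seven_bit)                    # True: no byte has the high bit set
print("test".encode("iso2022_jp"))  # b'test'
```

The leading escape sequence switches into the Japanese character set and the trailing one switches back to ASCII.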
Later variable width encodings, like EUC, Shift-JIS, and UTF-8 all work on the same principle except they use binary (non-ASCII) escapes, so the first byte of any multi-byte character has the high bit set (that is, the unsigned byte value is 128 or greater). The Wikipedia article for UTF-8 has a nice table explaining how UTF-8 handles this. Just like the older ASCII-safe encodings, these leave ASCII strings unmodified.
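A quick way to check the high-bit rule is to look at the first byte a few of these encodings produce. This is a Python sketch using Python's codec names euc_jp and shift_jis; the helper lead_byte is made up for this example.

```python
CODECS = ("euc_jp", "shift_jis", "utf-8")

def lead_byte(ch: str, codec: str) -> int:
    """Return the first byte of the character's encoding in the given codec."""
    return ch.encode(codec)[0]

for codec in CODECS:
    # Print the lead byte of a kanji and show that plain ASCII is unchanged.
    print(codec, hex(lead_byte("日", codec)), "abc".encode(codec))
```

In all three encodings the lead byte of the kanji is 0x80 or above, while "abc" encodes to the same three ASCII bytes.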
There also exist fixed-width multibyte encodings, but they're relatively uncommon. There was an attempt to popularize an encoding that just used two bytes for everything, called "UCS-2", but it ended up not having room for enough characters and was mostly superseded by variable width UTF-16 in the 1990s. UTF-16 is (practically speaking) the internal encoding used in Java and JavaScript, but due to the history with UCS-2 sometimes things like string length work in strange ways.
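The string length oddity can be demonstrated by counting UTF-16 code units, which is what a length measured in 16-bit units reports. Here is a Python sketch (Python itself counts code points, so the UTF-16 count is computed explicitly; utf16_length is a made-up helper name):

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units needed for the text."""
    return len(text.encode("utf-16-le")) // 2

emoji = "😀"  # U+1F600, outside the 16-bit range
print(len(emoji))           # 1 (Python counts code points)
print(utf16_length(emoji))  # 2 (a surrogate pair)
```

A single visible character therefore has length 2 in any environment that counts UTF-16 code units.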
Technically fixed-width UTF-32 exists, but it's not widely used and I've never personally encountered it in the wild.

Scandinavian characters when encoding to Ascii in Powershell

I need to export some data using PowerShell to an ASCII-encoded file.
My problem is that Scandinavian characters like Æ, Ø and Å turn into ? ? ? in the output file.
Example:
$str = "ÆØÅ"
$str | Out-File C:\test\test.txt -Encoding ascii
In the output file the result of this is: ???
It seems as though you have conflicting requirements.
Save the text in ASCII encoding
Save characters outside the ASCII character range
ASCII encoding does not support the characters you mention, which is the reason they do not work as you expect them to. The MSDN documentation on ASCII Encoding states that:
ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F.
And also further that
If your application requires 8-bit encoding (which is sometimes incorrectly referred to as "ASCII"), the UTF-8 encoding is recommended over the ASCII encoding. For the characters 0-7F, the results are identical, but use of UTF-8 avoids data loss by allowing representation of all Unicode characters that are representable. Note that the ASCII encoding has an 8th bit ambiguity that can allow malicious use, but the UTF-8 encoding removes ambiguity about the 8th bit.
You can read more about ASCII encoding on the Wikipedia page regarding ASCII Encoding (this page also includes tables showing all possible ASCII characters and control codes).
You need to either use a different encoding (such as UTF-8) or accept that you can't use characters which fall outside the ASCII range.
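The trade-off is easy to see in a few lines. This is a Python sketch rather than PowerShell, assuming Python's errors="replace" handler as a stand-in for the ? substitution seen in the question:

```python
text = "ÆØÅ"

ascii_bytes = text.encode("ascii", errors="replace")
print(ascii_bytes)  # b'???'

utf8_bytes = text.encode("utf-8")
print(len(utf8_bytes))                     # 6 (two bytes per character)
print(utf8_bytes.decode("utf-8") == text)  # True: nothing is lost
```

With ASCII all three characters collapse into the same byte; with UTF-8 the text survives a round trip.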

ASCII vs Unicode + UTF-8

Was reading Joel Spolsky's 'The Absolute Minimum' about character encoding.
It is my understanding that ASCII is a Code-point + Encoding scheme, and in modern times, we use Unicode as the Code-point scheme and UTF-8 as the Encoding scheme. Is this correct?
In modern times, ASCII is now a subset of UTF-8, not its own scheme. UTF-8 is backwards compatible with ASCII.
Yes, except that UTF-8 is an encoding scheme. Other encoding schemes include UTF-16 (with two different byte orders) and UTF-32. (For some confusion, a UTF-16 scheme is called “Unicode” in Microsoft software.)
And, to be exact, the American National Standard that defines ASCII specifies a collection of characters and their coding as 7-bit quantities, without specifying a particular transfer encoding in terms of bytes. In the past, it was used in different ways, e.g. so that five ASCII characters were packed into one 36-bit storage unit or so that 8-bit bytes used the extra bit for checking purposes (parity bit) or for transfer control. But nowadays ASCII is used so that one ASCII character is encoded as one 8-bit byte with the first bit set to zero. This is the de facto standard encoding scheme and implied in a large number of specifications, but strictly speaking not part of the ASCII standard.
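To make the parity-bit usage concrete, here is a Python sketch of one such historical convention, even parity in the eighth bit. The function name with_even_parity is made up, and real systems differed in which parity they used:

```python
def with_even_parity(code: int) -> int:
    """Set bit 7 so that the total number of 1 bits in the byte is even."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    ones = bin(code).count("1")
    return code | 0x80 if ones % 2 else code

print(hex(with_even_parity(ord("A"))))  # 0x41 (two 1 bits, already even)
print(hex(with_even_parity(ord("C"))))  # 0xc3 (three 1 bits, parity bit set)
```

With the modern convention the eighth bit is simply always zero, so both characters would be stored as 0x41 and 0x43.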
Unicode and ASCII both define code points together with an encoding scheme.
Unicode (UTF-8) is a superset of ASCII, as it's backward compatible with ASCII.
Conversion and representation (in binary/hexadecimal) of a string:
A string is a sequence of graphemes (a character is a kind of grapheme).
The sequence of graphemes (characters) is converted into code points.
The code points are encoded (converted) to binary/hex using an encoding scheme:
for Unicode text this is a scheme such as UTF-8 or UTF-32; for ASCII characters it is ASCII itself.
Unicode supports 1,112,064 valid character code points (covering most of the graphemes from different languages).
ASCII supports 128 character code points (mostly English).
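For reference, the 1,112,064 figure can be derived with simple arithmetic: it is the size of the Unicode code space minus the code points reserved for UTF-16 surrogates. A short Python check (the constant names are made up for this example):

```python
CODE_SPACE = 0x110000             # U+0000 .. U+10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1  # 2,048 code points reserved for surrogate pairs

valid_code_points = CODE_SPACE - SURROGATES
print(f"{valid_code_points:,}")  # 1,112,064

ascii_code_points = 2 ** 7       # 7 bits
print(ascii_code_points)         # 128
```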