why "mov %crN, %eax" can be encoded when crN is not exist? - x86-64

Intel manual volume 3, said that there's only CR0,2,3,4 + CR8 in 32E mode, and CR1 is reserved. But when I compile instruction in title, N could be any value < 16. I disassemble the obj file and found that it's encoded just follow reference when N<8. And when 7< N <16, it's encoded same as before but a LOCK prefix is added(make it a "serializing instruction" as NOTED in MOV cr version?).
Why it's encoded without any complain and is this encoding legal? Are those CRs really exist or they are just alias for other registers?

I. About "LOCK" prefix: current x86 really doesn't have fixed prefix meaning. There is the set of one-byte prefixes which all have their initial meaning (LOCK, REP, CS segment, etc.) but the latter is applied only to instructions where it has direct sense. But there are other usages when an instruction doesn't have such sense. For example, F3 (REPE) before BSR/BSF converts them to LZCNT/TZCNT respectively. With ordinary memory readings, the same REPE converts to XRELEASE. Segment prefixes 2Eh and 3Eh before branch instructions are prediction hints (used in Pentium 4 line). So, don't treat basic prefix roles as thorough principle.
II. Ongoing question: why reserved CRs should not be encoded, to your mind? I don't see any principal violation in this. For analogy, why don't you require IO instructions to encode only existing ports? You'd accept CR space in the same manner as IO or MSR space, and this won't be a problem anymore. :)

Related

Can you create a programming language with just one symbol?

Can you create a programming language with just one symbol like brainfuck.
Yes, it has been done before - see Unary.
Basically it's a strange encoding of brainfuck. Treat each BF command as a number. The whole program is then also a number, created by concatenating the commands together (with an extra 1 at front, for unambiguous decoding). Convert the number to unary numeric system (aka the number of digits is your number) and you're done.
Note however the programs in this tend to be very large - a cat implemented in Unary is (according to the information on page) 56623 characters long.
MGIFOS, Lenguage and Ellipsis follow the same principle. Note that e.g. a hello world in MGIFOS
has more characters than particles in the observable universe
Then Len(language,encoding) extends this principle to any language.
They are called OISC One Instruction Set Compiler.
The first one know of is Melzak's Arithmetic Machine (1961), with the instruction:
z = x-y or jump if y>x
You also have Zero Instruction Set Computer, which are more like neural nets.
Not forgetting the amazing FRACTRAN of Conway & Guy (1996), with no instruction but interprets a series of fractions (the program) in a Tuning complete way.

What is the limit to encoding base in case of Unicode strings as opposed to base64 having base = 64?

This is actually related to code golf in general, but also appliable elsewhere. People commonly use base64 encoding to store large amounts of binary data in source code.
Assuming all programming languages to be happy to read Unicode source code, what is the max N, for which we can reliably devise a baseN encoding?
Reliability here means being able to encode/decode any data, so every single combination of input bytes can be encoded, and then decoded. The encoded form is free from this rule.
The main goal is to minimize the character count, regardless of byte-count.
Would it be base2147483647 (32-bit) ?
Also, because I know it may vary from browser-to-browser, and we already have problems with copy-pasting code from codegolf answers to our editors, the copy-paste-ability is also a factor here. I know there is a Unicode range of characters that are not displayed.
NOTE:
I know that for binary data, base64 usually expands data, but here the character-count is the main factor.
It really depends on how reliable you want the encoding to be. Character encodings are designed with trade-offs, and in general the more characters allowed, the less likely it is to be universally accepted i.e. less reliable. Base64 isn't immune to this. RFC 3548, published in 2003, mentions that case sensitivity may be an issue, and that the characters + and / may be problematic in certain scenarios. It describes Base32 (no lowercase) and Base16 (hex digits) as potentially safer alternatives.
It does not get better with Unicode. Adding that many characters introduces many more possible points of failure. Depending on how stringent your requirements are, you might have different values for N. I'll cover a few possibilities from large N to small N, adding a requirement each time.
1,114,112: Code points. This is the number of possible code points defined by the Unicode Standard.
1,112,064: Valid UTF. This excludes the surrogates which cannot stand on their own.
1,111,998: Valid for exchange between processes. Unicode reserves 66 code points as permanent non-characters for internal use only. Theoretically, this is the maximum N you could justifiably expect for your copy-paste scenario, but as you noted, in practice many other Unicode strings will fail that exercise.
120,503: Printable characters only, depending on your definition. I've defined it to be all characters outside of the Other and Separator general categories. Also, starting from this bullet point, N is subject to change in future versions of Unicode.
103,595: NFKD normalized Unicode. Unfortunately, many processes automatically normalize Unicode input to a standardized form. If the process used NFKC or NFKD, some information may have been lost. For more reliability, the encoding should thus define a normalization form, with NFKD being better for increasing character count
101,684: No combining characters. These are "characters" which shouldn't stand on their own, such as accents, and are meant to be combined with another base character. Some processes might panic if they are left standing alone, or if there are too many combining characters on a single base character. I've now excluded the Mark category.
85: ASCII85, aka. I want my ASCII back. Okay, this is no longer Unicode, but I felt like mentioning it because it's a lesser known ASCII-only encoding. It's mainly used in Adobe's PostScript and PDF formats, and has a 5:4 encoded data size increase, rather than Base64's 4:3 ratio.

char * ptr = "String"; What encoding is used by default?

What is the encoding that is used here.
If I go peak at ptr[0] - will 'S' be encoded in ASCII/utf-8.
How do I know what is the encoding that is used ?
Please help
For the letter S, UTF-8 is ASCII. For the more general case, the ISO C standard (assuming you're talking C) does not mandate the encoding.
It merely states that a certain minimum number of source characters have to be made available, for both the source and execution environments (the first is where you develop your code, the second where you run it - they may be vastly different beasts).
That's all covered in, for C11, 5.2.1 Character sets, which specifies source and execution character sets. Later sections of 5.2 cover trigraphs, multibyte characters and so on.

BCPL octal numerical constants

I've been digging into the history of BCPL due to a question I was asked about the reasoning behind using the prefix "0x" for the representation hexadecimal numbers.
In my search I stumbled upon a really good explanation of the history behind this token. (Why are hexadecimal numbers prefixed with 0x?)
From this post, however, another questions sparked:
For octal constants, did BCPL use 8 <digit> (As per specs: http://cm.bell-labs.com/cm/cs/who/dmr/bcpl.pdf) or did it use #<digit> (As per http://rabbit.eng.miami.edu/info/bcpl_reference_manual.pdf) or were both of these syntaxes valid in different implementations of the language?
I've also been able to find a second answer here that used the # syntax which further intrigued me in the subject. (Why are leading zeroes used to represent octal numbers?)
Any historical insights are greatly appreciated.
There were many slight variations on syntax in BCPL.
For example, while the one we used had 16-bit cells (so that x!y gave you the 16-bit word from a word address at x + y (a word address being half of the byte address), we also had a need to extract from byte address and byte values (since we were primarily creating OS and control software on a 6809 byte-addressable CPU).
Hence in addition to:
x!y - get word from byte address (x + y) * 2
we also had
x!%y - get byte from byte address (x * 2) + y
x%!y - get word from byte address x + (y * 2)
x%%y - get byte from byte address x + y
I'm pretty certain they were implementation-specific as I never saw them anywhere else. And BCPL was around long before language standards were as important as they are today.
The canonical language specification would have been the earlier one from Richards since he wrote the language (and your second document is for the Essex BCPL implementation about a decade later). But keep in mind that Project MAC was the earliest iteration - there were plenty of advancements after that as well.
For example, there's a 2013 revision of the BCPL User Guide (see Martin's home page) which specifies #b, #o and #x as prefixes for various non-decimal bases.

What Unicode normalization (and other processing) is appropriate for passwords when hashing?

If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function?
Goals
Without normalization, if someone sets their password to "mañana" (ma\u00F1ana) on one computer and tries to log in with "mañana" (ma\u006E\u0303ana) on another computer, the hashes will be different and the login will fail. This is under the control of the user-agent or its operating system.
I'd like to ensure that those hash to the same thing.
I am not concerned about homoglyphs such as Α, А, and A (Greek, Cyrillic, Latin).
Reference
Unicode normalization forms: http://unicode.org/reports/tr15/#Norm_Forms
Considerations
Any normalization procedure may cause collisions, e.g. "office" == "office".
Normalization can change the number of bytes in the string.
Further questions
What happens if the server receives a byte sequence that is not valid UTF-8 (or other format)? Reject, since it can't be normalized?
What happens if the server receives characters that are unassigned in its version of Unicode?
Normalization is undefined in case of malformed inputs, such as alleged UTF-8 text that contains illegal byte sequences. Illegal bytes may be interpreted differently in different environments: Rejection, replacement, or omission.
Recommendation #1: If possible, reject inputs that do not conform to the expected encoding. (This may be out of the application's control, however.)
The Unicode Annex 15 guarantees normalization stability when the input contains assigned characters only:
11.1 Stability of Normalized Forms
For all versions, even prior to Unicode 4.1, the following policy is followed:
A normalized string is guaranteed to be stable; that is, once normalized, a string is normalized according to all future versions of Unicode.
More precisely, if a string has been normalized according to a particular version of Unicode and contains only characters allocated in that version, it will qualify as normalized according to any future version of Unicode.
Recommendation #2: Whichever normalization form is used must use the Normalization Process for Stabilized Strings, i.e., reject any password inputs that contain unassigned characters, since their normalization is not guaranteed stable under server upgrades.
The compatibility normalization forms seem to handle Japanese better, collapsing several decompositions into the same output where the canonical forms do not.
The spec warns:
Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text.
However, semantics and round-tripping are not of concern here.
Recommendation #3: Apply NFKC or NFKD before hashing.
As of November 2022, the currently relevant authority from IETF is RFC 8265, “Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords,” October 2017. This document about usernames and passwords is a special case of the more-general PRECIS specification in the still-authoritative RFC 8264, “PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols,” October 2017.
RFC 8265, § 4.1:
This document specifies that a password is a string of Unicode code points [Unicode] that is conformant to the OpaqueString profile (specified below) of the PRECIS FreeformClass defined in Section 4.3 of [RFC8264] and expressed in a standard Unicode Encoding Form (such as UTF-8 [RFC3629]).
RFC 8265, § 4.2 defines the OpaqueString profile, the enforcement of which requires that the following rules be applied in the following order:
the string must be prepared to ensure that it consists only of Unicode code point explicitly allowed by the FreeformClass string class defined in RFC 8264, § 4.3. Certain characters are specified as:
Valid: traditional letters and number, all printable, non-space code points from the 7-bit ASCII range, space code points, symbol code points, punctuation code points, “[a]ny code point that is decomposed and recomposed into something other than itself under Unicode Normalization Form KC, i.e., the HasCompat (‘Q’) category defined under Section 9.17,” and “[l]etters and digits other than the ‘traditional’ letters and digits allowed in IDNs, i.e., the OtherLetterDigits (‘R’) category defined under Section 9.18.”
Invalid: Old Hangul Jamo code points, control code points, and ignorable code points. Further, any currently unassigned code points are considered invalid.
“Contextual Rule Required”: a number of code points from an “
Exceptions” category and “joining code points.” (“Contextual Rule Required” means: “Some characteristics of the code point, such as its being invisible in certain contexts or problematic in others, require that it not be used in a string unless specific other code points or properties are present in the string.”)
Width Mapping Rule: Fullwidth and halfwidth code points MUST NOT be mapped to their decomposition mappings.
Additional Mapping Rule: Any instances of non-ASCII space MUST be mapped to SPACE (U+0020).
Unicode Normalization Form C (NFC) MUST be applied to all strings.
I can’t speak for any other programming language, but the Python package precis-i18n implements the PRECIS framework described in RFCs 8264, 8265, 8266.
Here’s an example of how simple it is to enforce the OpaqueString profile on a password string:
# pip install precis-i18n
>>> import precis_i18n
>>> precis_i18n.get_profile('OpaqueString').enforce('😳å∆3⨁ucei=The4e-iy5am=3iemoo')
'😳å∆3⨁ucei=The4e-iy5am=3iemoo'
>>>
I found Paweł Krawczyk’s “PRECIS, the next step in Unicode validation” a very helpful introduction and source of Python examples.