Is there a number prefix like '0' or '0x' for base 10? - number-formatting

I just wrote a script which parses numbers of a defined length, which often include leading 0s
Of course, if I don't remove those 0s, they are interpreted as a prefix meaning my number is written in octal, which isn't handy at all (and already caused me troubles, in other contexts).
i.e. 011 is read as "9" in decimal
For that reason, I was wondering whether such a prefix exists, and in which languages.
Seems to me that "obviousness" led to incompletness.
I thought prefixing numbers with a "d" would make them read as base 10, but it doesn't seem to be the case in bash. (Maybe in C or Java ?)
Octal and hexadecimal prefixes work about everywhere, and thinking about it, I'm quite surprised such a prefix doesn't seem to exist for decimal number.
Is that a fact, or do I ignore an easy way of avoiding such parsing problems ?

Related

Is the file hashing/checksum value case insensitive?

My question is only about file hashing rather than hashing function in general. My assumption is that the value of a file checksum/hashing is case insensitive. My concern is that I cannot find any online documentation to confirm that. I only got the following two points to support my claim.
This link contains some file hash values. None of them contains any capital letter. https://www.virtualbox.org/download/hashes/6.1.2/SHA256SUMS
When I use Powershell Get-FileHash cmdlet, all returns are capitals. https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/get-filehash?view=powershell-7
Can anyone help me confirm my assumption, and provide some documentation on files in Windows as well as in Linux OS?
Hashes and checksums are often presented in hexadecimal notation. Although it is common to use upper case A-F instead of lower case a-f, it does not make any difference.
As for a reference, the question is so basic that it's hard to find a solid reference. One is ISO/IEC 9899 standard for the C programming language:
A hexadecimal constant consists of the prefix 0x or 0X followed by a
sequence of the decimal digits and the letters a (or A) through f (or
F) with values 10 through 15 respectively.
In some use cases, such as CSS lower case might be preferred, as it is more pleasant to read among other lower case characters. .Net's Int32.ToString supports standard numeric formaters. x for lower case, X for upper case.
In System.Convert, there's ToInt32 that will convert values from one base into 32 bit integers. Let's see how hex digit AA is converted to decimal in different cases. Like so,
[convert]::toint32("aa", 16)
170
[convert]::toint32("AA", 16)
170
[convert]::toint32("aA", 16)
170
[convert]::toint32("Aa", 16)
170
Every letter case combination represents the same decimal value, 170. Don't try this on hashes though, as those are usually larger than 32 bit integers.
My question is only about file hashing rather than hashing function in general. My assumption is that the value of a file checksum/hashing is case insensitive.
Hashes are byte sequences, they don't have case at all.
Hashes are generally encoded as hexadecimal for display, for which the 6 "letters" (a to f) can be either case. That's mostly a style issue though I've known system which did object when getting the "wrong" case (some would only accept lowercase, others only uppercase).
Also beware that e.g. it's not unheard of to store or show hashes as base64 where case is relevant. Without knowing why you're asking (e.g. is it idle musing, or do you have an actual use case) it's hard to answer completely categorically.

Are there any real alternatives to unicode?

As a C++ developer supporting unicode is, putting it mildly, a pain in the butt. Unicode has a few unfortunate properties that makes it very hard to determine the case of a letter, convert them or pretty much anything beyond identifying a single known codepoint or so (which may or may not be a letter). The only real rescue, it seems, is ICU for those who are unfortunate enough to not have unicode support builtin the language (i.e. C and C++). Support for unicode in other languages may or may not be good enough.
So, I thought, there must be a real alternative to unicode! i.e. an encoding that does allow easy identification of character classes, besides having a lookup datastructure (tree, table, whatever), and identifying the relationship between characters? I suspect that any such encoding would likely be multi-byte for most text -- that's not a real concern to me, but I accept that it is for others. Providing such an encoding is a lot of work, so I'm not really expecting any such encoding to exist 😞.
Short answer: not that I know of.
As a non-C++ developer, I don't know what specifically is a pain about Unicode, but since you didn't tag the question with C++, I still dare to attempt an answer.
While I'm personally very happy about Unicode in general, I agree that some aspects are cumbersome.
Some of them could arguably be improved if Unicode was redesigned from scratch, eg. by removing some redundancies like the "Latin Greek" math letters besides the actual Greek ones (but that would also break compatibility with older encodings).
But most of the "pains" just reflect the chaotic usage of writing in the first place.
You mention yourself the problem of uppercase "i", which is "I" in some, "İ" in other orthographies, but there are tons of other difficulties – eg. German "ß", which is lowercase, but has no uppercase equivalent (well, it has now, but is rarely used); or letters that look different in final position (Greek "σ"/"ς"); or quotes with inverted meaning («French style» vs. »Swiss style«, “English” vs. „German style“)... I could continue for a while.
I don't see how an encoding could help with that, other than providing tables of character properties, equivalences, and relations, which is what Unicode does.
You say in comments that, by looking at the bytes of an encoded character, you want it to tell you if it's upper or lower case.
To me, this sounds like saying: "When I look at a number, I want it to tell me if it's prime."
I mean, not even ASCII codes tell you if they are upper or lower case, you just memorised the properties table which tells you that 41..5A is upper, 61..7A is lower case.
But it's hard to memorise or hardcode these ranges for all 120k Unicode codepoints. So the easiest thing is to use a table look-up.
There's also a bit of confusion about what "encoding" means.
Unicode doesn't define any byte representation, it only assigns codepoints, ie. integers, to character definitions, and it maintains the said tables.
Encodings in the strict sense ("codecs") are the transformation formats (UTF-8 etc.), which define a mapping between the codepoints and their byte representation.
Now it would be possible to define a new UTF which maps codepoints to bytes in a way that provides a pattern for upper/lower case.
But what could that be?
Odd for upper, even for lower case?
But what about letters without upper-/lower-case distinction?
And then, characters that aren't letters?
And what about all the other character categories – punctuation, digits, whitespace, symbols, combining diacritics –, why not represent those as well?
You could put each in a predefined range, but what happens if too many new characters are added to one of the categories?
To sum it up: I don't think what you ask for is possible.

Standard radix for decimal base?

If I need to identify hex, octal, or binary numbers, I can just use prefixes 0x, 0, 0b. They aren't necessarily universal but are pretty recognizable in the programming world.
Is there an identifier like that for decimal (base 10) numbers? I would like to be able to explicitly denote a decimal base number and would like to use a somewhat standard notation if possible.
If you need a general form, then (number)b(base) is used in a couple of projects. 1b10, 1b16, 1b2, etc. can give you some standard way to express bases.
I've never seen 8x used anywhere tbh. 0 as a prefix - yes.
The standard prefix for octal is a leading "0" (in the C / C++ / Perl worlds at least). So I wouldn't recommend using "8x".

Linux Sort vs Perl String Comparison

Because I was dealing with very large files, I sorted my base and candidate files before comparing them to see what lines were missing from the other. I did this to avoid keeping the records in memory. The sorting was done by using the Linux command-line tool, sort.
In my Perl script, I would look at whether the string in the line was lt, gt, or eq to the line in the other file, advancing the pointers in the file where necessary. However, I hit a problem when I noticed that my string comparison thought the strings in the base file were lt a string in the candidate file which contained special characters.
Is there a surefire way of making sure my Linux sort and Perl string comparisons are using the same type of string comparator?
The sort command uses the current locale, as specified by the environment variable LC_ALL, to determine the sort order for characters. Usually the easiest way to fix sorting issues is to manually set this to the C locale, which treats each 8-bit byte as a single character and compares by simple numeric value. In most shells this can be done as a one-off just for a single command by prefixing it like so:
LC_ALL=C sort < infile > outfile
This will also solve similar problems for some other text-processing programs. (E.g. I recall problems working with CSV files on a German person's computer -- this was traced back to the fact that Germans use a comma instead of a decimal point. Putting LC_ALL=C in front of the relevant commands fixed that issue too.)
[EDIT] Although Perl can be directed to treat some strings as Unicode, by default it still treats input and output as streams of 8-bit bytes, so the above approach should produce an order that is the same as Perl's sort() function. (Thanks to Ven'Tatsu for this nugget.)

How do I determine the character set of a string?

I have several files that are in several different languages. I thought they were all encoded UTF-8, but now I'm not so sure. Some characters look fine, some do not. Is there a way that I can break out the strings and try to identify the character sets? Perhaps split on white space then identify each word? Finally, is there an easy way to translate characters from one set to UTF-8?
If you don't know the character set for sure You can only guess, basically. utf8::valid might help you with that, but you can't really know for sure. If you know that if it isn't unicode it must be a specific character set (Like Latin-1), you lucky. If you have no idea, you're screwed. In any case, you should always assume the whole file is in the same character set, unless otherwise specified. You will lose your sanity if you don't.
As for your question how to convert between character sets: Encode is there to do that for you
Determining whether a file is probably UTF-8 or not should be pretty easy. Determining the encoding if it is not UTF-8 would be very difficult in general.
If the file is encoded with UTF-8, the high bits of each byte should follow a pattern. If a character is one byte, its high bit will be cleared (zero). Otherwise, an n byte character (where n is 2–4) will have the high n bits of the first byte set to one, followed by a single zero bit. The following n - 1 bytes should all have the highest bit set and the second-highest bit cleared.
If all the bytes in your file follow these rules, it's probably encoded with UTF-8. I say probably, because anyone can invent a new encoding that happens to follow the same rules, deliberately or by chance, but interprets the codes differently.
Note that a file encoded with US-ASCII will follow these rules, but the high bit of every byte is zero. It's okay to treat such a file as UTF-8, since they are compatible in this range. Otherwise, it's some other encoding, and there's not an inherent test to distinguish the encoding. You'll have to use some contextual knowledge to guess.
Take a look at iconv
http://www.gnu.org/software/libiconv/
Text::Iconv