While dealing with Unicode-encoded characters in Java, I used Normalizer to normalize them and convert them to a String. Below is the code I used:
String input = "¼";
input = Normalizer.normalize(input, Normalizer.Form.NFKD); // java.text.Normalizer
Output: 1⁄4
The slash that the method produced is "⁄", whose Unicode encoding is \u2044, as opposed to the regular forward slash "/" (\u002f) that I can type on my keyboard.
What is the difference between these and when should one be used over another?
Unicode these days contains heaps of variations of the common non-letter characters, and slashes are no exception. (That's not even all of them - search for "solidus" to get some more.) You've got fraction slashes (your one), full-width slashes, division slashes (yup, that's separate from the fraction one), thick slashes, extra-thick slashes - the list goes on.
The good news is you get to decide what slash is appropriate for your context.
If you want to normalise just because you don't want fractions to appear squashed into a single character, or you want all fractions to display identically (Unicode obviously can't have a character for every possible fraction), then this fraction slash is probably what you want to go with.
On the other hand, if you want to normalise because you want to reduce the set of available characters to those that can be easily typed on a standard keyboard, it's likely the standard forward slash you should go with.
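For that second case, here's a minimal Java sketch (building on the question's snippet; the class and variable names are mine) that normalises with NFKD and then maps the fraction slash back to the ASCII slash:

import java.text.Normalizer;

public class SlashNormalization {
    public static void main(String[] args) {
        String input = "¼";
        // NFKD decomposes the vulgar fraction into digits separated by U+2044.
        String nfkd = Normalizer.normalize(input, Normalizer.Form.NFKD);
        System.out.println(nfkd);                  // 1⁄4

        // If the goal is keyboard-typable output, swap the fraction slash
        // for the ordinary ASCII solidus afterwards.
        String ascii = nfkd.replace('\u2044', '/');
        System.out.println(ascii);                 // 1/4
    }
}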
As Michael Berry mentioned, \u2044 is the fraction slash character.
It isn’t just a slash that looks a little different; it has specific rendering behavior. From the Unicode specification, section 6.2, “Other Punctuation”:
Fraction Slash. U+2044 FRACTION SLASH is used between digits to form numeric fractions, such as 2/3 and 3/9. The standard form of a fraction built using the fraction slash is defined as follows: any sequence of one or more decimal digits (General Category = Nd), followed by the fraction slash, followed by any sequence of one or more decimal digits. Such a fraction should be displayed as a unit, such as ³⁄₄. The precise choice of display can depend on additional formatting information.
If the displaying software is incapable of mapping the fraction to a unit, then it can also be displayed as a simple linear sequence as a fallback (for example, 3/4). If the fraction is to be separated from a previous number, then a space can be used, choosing the appropriate width (normal, thin, zero width, and so on). For example, 1 + THIN SPACE + 3 + FRACTION SLASH + 4 is displayed as 1 ³⁄₄.
Personally, I prefer the use of the fraction slash, as it makes fractions look better, like they’re professionally typeset. But there are some contexts where an ASCII slash is better, such as monospaced text, or wanting all-ASCII output, or as Michael mentioned, limiting text to characters which can be typed on a keyboard.
Related
I learned today that while common fractions have dedicated Unicode values, in order to form less common fractions like ³/₁₆ you have to use superscript/subscript characters followed by a slash. This is confirmed here and here.
This works for ¹¹/₁₆ and ¹³/₁₆, but it gets messed up with ¹⁵/₁₆. Do you see how the 5 rises higher than the one? I imagine this is because in order to show the number 5 clearly as a superscript, it requires more height than 1 and 3.
Well, that creates a problem. How do you display the fraction 15/16 nicely as Unicode characters? Unfortunately I can't use the sup and sub tags. I'm not displaying it in an HTML page. Rather, we're passing a string to a Java application that will then render these values. I know it renders Unicode values fine, but it wouldn't recognize HTML tags. Is there a Unicode solution?
The “proper” way of composing arbitrary vulgar fractions in Unicode is to not use the subscript and superscript digits at all, but to utilise the special properties of the character U+2044 FRACTION SLASH. You would simply type the regular ASCII digits and separate them with the slash like so: 15⁄16. The rendering engine will then automatically select the correct forms of the numbers, producing a clean, uniform look.
I put the word ‘proper’ in quotation marks because this method is not guaranteed to be supported on all systems, and some that do support it do so incorrectly or incompletely. If you absolutely need to make sure that 100% of recipients regardless of system will definitely see something that looks more or less right, I would therefore still (begrudgingly) recommend using the preformatted subscripts and superscripts as a substitute. As the other answer explained, the problem you are having is a font issue and cannot be solved if you do not have control over font settings.
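To illustrate that "proper" composition, here's a minimal Java sketch (the helper name is mine): it simply joins ordinary ASCII digits with U+2044 and leaves the stacking entirely to the rendering engine:

public class FractionSlashDemo {
    /** Joins plain ASCII digit strings with U+2044 FRACTION SLASH;
        how (and whether) it is stacked is up to the font/renderer. */
    static String fraction(int numerator, int denominator) {
        return numerator + "\u2044" + denominator;
    }

    public static void main(String[] args) {
        System.out.println(fraction(15, 16)); // 15⁄16
    }
}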
This is indeed a font issue; however, the problem arises from the fact that, in Unicode, ¹, ², and ³ belong to the Latin-1 Supplement block, while the other superscript digits belong to the Superscripts and Subscripts block, so some font substitution occurs.
Please see Why the display of Unicode characters for superscripted digits are not at the same height? for extra details; it is tagged as iOS, but I have the same problem on macOS too.
I found this site, Unicode Fraction Creator: https://lights0123.com/fractions/
Here's an example: ³⁄₂
Which is:
U+00B3 superscript three
U+2044 fraction slash
U+2082 subscript two
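If you end up needing the preformatted superscript/subscript fallback discussed above, here is a hedged sketch of the digit mapping (the code-point tables are standard Unicode; the helper itself is mine, not part of that site):

public class SupSubFraction {
    // U+2070 and U+2074..U+2079 cover ⁰ and ⁴..⁹; ¹ ² ³ live in Latin-1 Supplement.
    private static final char[] SUPERSCRIPT =
        {'\u2070', '\u00B9', '\u00B2', '\u00B3', '\u2074',
         '\u2075', '\u2076', '\u2077', '\u2078', '\u2079'};
    // Subscript digits are contiguous: U+2080..U+2089.
    private static final char[] SUBSCRIPT =
        {'\u2080', '\u2081', '\u2082', '\u2083', '\u2084',
         '\u2085', '\u2086', '\u2087', '\u2088', '\u2089'};

    static String fraction(int numerator, int denominator) {
        StringBuilder sb = new StringBuilder();
        for (char c : Integer.toString(numerator).toCharArray())
            sb.append(SUPERSCRIPT[c - '0']);
        sb.append('\u2044');                       // FRACTION SLASH
        for (char c : Integer.toString(denominator).toCharArray())
            sb.append(SUBSCRIPT[c - '0']);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fraction(3, 2));   // ³⁄₂
        System.out.println(fraction(15, 16)); // ¹⁵⁄₁₆
    }
}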
For a general answer on displaying fractions nicely, copy, paste, and change.
Character    Name            Decimal value
⁄            Fraction Slash  8260
0            digit 0         48
1            digit 1         49
2            digit 2         50
3            digit 3         51
4            digit 4         52
5            digit 5         53
6            digit 6         54
7            digit 7         55
8            digit 8         56
9            digit 9         57
(8260 is the decimal form of U+2044; 48-57 are the ASCII digits.)
Example: 1/0 = 1⁄0
What's the best way to round the result of a division in intersystems cache?
There are several functions used to format numbers; they will also round the value if necessary:
$justify(expression,width[,decimal]) - Caché rounds or pads the number of fractional digits in expression to this value.
write $justify(5/3,0,3)
1.667
$fnumber(inumber,format,decimal)
write $fnumber(5/3,"",3)
1.667
$number(num,format,min,max)
write $number(5/3,3)
1.667
$normalize(num,scale)
w $normalize(5/3,3)
1.667
You can just pick whichever of them is most suitable for you. They do different things, but the result can be the same.
In standard MUMPS (which Caché ObjectScript is backwards compatible with)
there are three "division"-related operators. The first is the single character "/" (i.e. forward slash). This is a real-number divide: 5/2 is 2.5, 10.5/5 is 2.1, etc. It takes two numbers (each possibly including a decimal point and a fraction) and returns a number, possibly with a fraction. A useful thing to remember is that this numeric divide yields results that are as simple as they can be. If there are leading zeros in front of the decimal point, like 0007, it will treat the number as 7.
If there are trailing zeros after the decimal point, they will be trimmed as well.
So 2.000 gets trimmed to 2 (notice no decimal point) and 00060.0100 would be trimmed to just 60.01
In the past, many implementors would guarantee that 3/3 would always be 1 (not .99999) and that math was done as exactly as could be done. This is not an emphasis now, but there used to be special libraries to handle Binary Coded Decimal, (BCD) to guarantee as close to possible that fractions of a penny were never generated.
The next division operator is the single character "\" (i.e. backward slash).
This operator is called integer division or "div" by some folks. It
does the division and throws away any remainder. The interesting thing about this is that it always results in an integer, but the inputs don't have to be integers. So 10\2 is 5, but 23\2.3 is 10 and so is 23.3\2.33. If there is a fraction left over, it is just dropped, so 23.3\2.3 is 10 as well. The full divide operator would give you the fractional part: 23.3/2.3 is 10.130434 etc.
The final division operator is remainder (or "mod" or "modulo"), symbolized by the single character "#" (sometimes called hash, pound sign, or octothorpe). To get the answer for this one, the integer division "\" is calculated, and whatever is left over after that integer division is the result. In our example of 23\2 the answer is 11 and the remaining value is 1, so 23#2 is 1,
and 23.3#2.3 is .3. You may notice that (number#divisor)+((number\divisor)*divisor) will always give you your original number back.
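This isn't Caché code, but as a sanity check of that identity, here is a small Java analogy: for non-negative whole numbers, Java's integer / and % behave like MUMPS \ and # (for decimals or negative values the semantics differ, so don't read too much into it):

public class DivideRemainderDemo {
    public static void main(String[] args) {
        int number = 23, divisor = 2;
        int quotient = number / divisor;   // like 23\2 -> 11 (fraction dropped)
        int remainder = number % divisor;  // like 23#2 -> 1
        // (number # divisor) + ((number \ divisor) * divisor) == number
        System.out.println(remainder + quotient * divisor == number); // true
    }
}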
Hope this helps you make this idea clear in your programming.
Occasionally I've seen the "plus or minus" symbol written in fractional form: a superscript plus over a subscript minus, separated by a fraction bar.
Is there a Unicode character for this?
Note: I already know about the standard "plus-minus sign" symbol, but it won't work in this context. I'm specifically looking for a version with the fraction bar.
You can approximate it to some extent with a superscript plus (U+207A), a division slash (U+2215) and a subscript minus (U+208B):
⁺∕₋
However, it requires font support to get right. Especially the super- and subscript +/− are not available in most fonts, so it might just render horribly.
For reference, on my system it still renders somewhat broken (though better than it did five years ago). Using Cambria Math in Word 2010, however, it renders as a properly stacked plus over minus, which is probably exactly how it should look (it follows the same typesetting rules as fractions).
This is the only one I have seen in unicode (plus over minus):
±
HTML/XML character reference:
&#177;
HTML named entity:
&plusmn;
This symbol is used to indicate the precision of an approximation.
You mean like ± (U+00B1 / "\x00b1")?
Edit: speaking specifically to a design which uses a solidus, the best I could find was ⁺⁄₋ which is U+207a (superscript plus sign) U+2044 (fraction slash) U+208b (subscript minus). The fraction slash has negative kerning in some fonts, which causes the appearance of composition. See this JSFiddle for an example of how this works with a larger font size.
<div style="font-size:20em;">⁺⁄₋</div>
+⁄−
<sup>+</sup>⁄<sub>−</sub>
In UTF-8: 0xC2 0xB1
For other encodings see:
http://www.fileformat.info/info/unicode/char/b1/index.htm
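Putting the options above together, a small Java sketch showing both the precomposed sign and the solidus-based approximation (the escape values are the ones quoted in the answers above):

public class PlusMinusDemo {
    public static void main(String[] args) {
        String precomposed = "\u00B1";           // ± PLUS-MINUS SIGN
        String composed = "\u207A\u2044\u208B";  // ⁺⁄₋ superscript plus, fraction slash, subscript minus
        System.out.println(precomposed + "  " + composed);

        // UTF-8 bytes of U+00B1 are 0xC2 0xB1, as noted above.
        for (byte b : precomposed.getBytes(java.nio.charset.StandardCharsets.UTF_8))
            System.out.printf("0x%02X ", b);
    }
}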
I need to choose a checksum algorithm to detect when users mistype a 4-character [A-Z0-9] code, by adding 1 check character (also in [A-Z0-9]) at the end of the code.
Summing ASCII codes and applying a modulo is a bad solution, since transposing 2 keystrokes won't be noticed.
I would probably use the Fletcher algorithm, but I would like to know if anyone knows of an algorithm designed for this use case (a very, very small number of bytes, position dependent)?
You can try the ISO 7064 Mod x,y algorithms. According to the ISO description:
The check character systems specified in ISO/IEC 7064:2002 can detect ( http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=31531 ):
all single substitution errors (the substitution of a single character for another, for example 4234 for 1234);
all or nearly all single (local) transposition errors (the transposition of two single characters, either adjacent or with one character between them, for example 12354 or 12543 for 12345);
all or nearly all shift errors (shifts of the whole string to the left or right);
a high proportion of double substitution errors (two separate single substitution errors in the same string, for example 7234587 for 1234567);
a high proportion of all other errors.
There are some partial implementations you can find like:
http://code.google.com/p/checkdigits/wiki/CheckDigitSystems (includes Java and JavaScript implementations of several checksum algorithms).
http://www.codeproject.com/Articles/16540/Error-Detection-Based-on-Check-Digit-Schemes (explains and includes VC implementations).
For example, you could use ISO 7064 Mod 37,36, which can use 0-9 and A-Z (for both the data and the check character). The detailed description of the algorithm (if you don't feel like buying the ISO) can be found in the documents below; a sketch of the algorithm follows after the links:
http://www.cdfa.ca.gov/ahfss/animal_health/pdfs/NAIS/Program_Standard_and_Technical_Reference10-07.pdf (it's used for animal identification)
http://www.ifpi.org/content/library/GRid_Standard_v2_1.pdf (also used by the music industry)
http://www.ddex.net/sites/default/files/DDEX-DPID-10-2006.pdf (other media companies)
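Here is a hedged Java sketch of the MOD 37,36 hybrid scheme as it is commonly described in documents like those above; it is my own reading of that description, so verify it against the ISO text before relying on it:

public class Mod3736 {
    // Character value = index in this alphabet (0-9, then A-Z).
    private static final String ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    /** Returns the check character for a string of [0-9A-Z] data characters. */
    static char checkCharacter(String data) {
        int m = 36;                           // size of the alphabet
        int product = m;
        for (char c : data.toCharArray()) {
            int value = ALPHABET.indexOf(c);
            int sum = (product + value) % m;
            if (sum == 0) sum = m;
            product = (2 * sum) % (m + 1);    // 37 acts as the second modulus
        }
        return ALPHABET.charAt((m + 1 - product) % m);
    }

    /** A string with its check character appended is valid when this returns true. */
    static boolean isValid(String dataWithCheck) {
        int m = 36;
        int product = m;
        int sum = 0;
        for (char c : dataWithCheck.toCharArray()) {
            int value = ALPHABET.indexOf(c);
            sum = (product + value) % m;
            if (sum == 0) sum = m;
            product = (2 * sum) % (m + 1);
        }
        return sum == 1;   // a correct check character always leaves sum == 1
    }

    public static void main(String[] args) {
        String code = "A1B2";
        char check = checkCharacter(code);
        System.out.println(code + check + " valid? " + isValid(code + check));
    }
}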
The closest contenders that I could find so far are yEnc (2%) and ASCII85 (25% overhead). There seem to be some issues around yEnc mainly around the fact that it uses an 8-bit character set. Which leads to another thought: is there a binary to text encoding based on the UTF-8 character set?
This really depends on the nature of the binary data, and the constraints that "text" places on your output.
First off, if your binary data is not compressed, try compressing before encoding. We can then assume that the distribution of 1/0 or individual bytes is more or less random.
Now: why do you need text? Typically, it's because the communication channel does not pass through all characters equally. e.g. you may require pure ASCII text, whose printable characters range from 0x20-0x7E. You have 95 characters to play with. Each character can theoretically encode log2(95) ~= 6.57 bits per character. It's easy to define a transform that comes pretty close.
But: what if you need a separator character? Now you only have 94 characters, etc. So the choice of an encoding really depends on your requirements.
To take an extremely stupid example: if your channel passes all 256 characters without issues, and you don't need any separators, then you can write a trivial transform that achieves 100% efficiency. :-) How to do so is left as an exercise for the reader.
UTF-8 is not a good transport for arbitrarily encoded binary data. It is able to transport values 0x01-0x7F with only 14% overhead. I'm not sure if 0x00 is legal; likely not. But anything from 0x80 up expands to multiple bytes in UTF-8. I'd treat UTF-8 as a constrained channel that passes 0x01-0x7F, or 127 unique characters. If you don't need delimiters then you can transmit about 6.99 bits per character.
A general solution to this problem: assume an alphabet of N characters whose binary encodings are 0 to N-1. (If the encodings are not as assumed, then use a lookup table to translate between our intermediate 0..N-1 representation and what you actually send and receive.)
Assume 95 characters in the alphabet. Now: some of these symbols will represent 6 bits, and some will represent 7 bits. If we have A 6-bit symbols and B 7-bit symbols, then:
A+B=95 (total number of symbols)
2A+B=128 (total number of 7-bit prefixes that can be made. You can start 2 prefixes with a 6-bit symbol, or one with a 7-bit symbol.)
Solving the system, you get: A=33, B=62. You now build a table of symbols:
Raw Encoded
000000 0000000
000001 0000001
...
100000 0100000
1000010 0100001
1000011 0100010
...
1111110 1011101
1111111 1011110
To encode, first shift off 6 bits of input. If those six bits are greater than or equal to 100001, then shift another bit. Then look up the corresponding 7-bit output code, translate to fit in the output space, and send. You will be shifting 6 or 7 bits of input each iteration.
To decode, accept a byte and translate to raw output code. If the raw code is less than 0100001 then shift the corresponding 6 bits onto your output. Otherwise shift the corresponding 7 bits onto your output. You will be generating 6-7 bits of output each iteration.
For uniformly distributed data I think this is optimal. If you know that you have more zeros than ones in your source, then you might want to map the 7-bit codes to the start of the space so that it is more likely that you can use a 7-bit code.
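Here's a rough Java sketch of that scheme, under two assumptions of mine that the answer leaves open: the 95-symbol alphabet is printable ASCII 0x20-0x7E, and the final partial group is padded with zero bits (so the decoder needs to know the original byte length):

import java.util.ArrayList;
import java.util.List;

public class SixSevenBitCodec {
    private static final int SIX_BIT_SYMBOLS = 33; // raw values 0..32 use 6 bits, the rest use 7

    static String encode(byte[] data) {
        List<Integer> bits = new ArrayList<>();    // all input bits, MSB-first (clear, not fast)
        for (byte b : data)
            for (int i = 7; i >= 0; i--) bits.add((b >> i) & 1);

        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < bits.size()) {
            int six = 0;                           // peek the next 6 bits, zero-padded past the end
            for (int i = 0; i < 6; i++)
                six = (six << 1) | (pos + i < bits.size() ? bits.get(pos + i) : 0);

            int symbol;
            if (six < SIX_BIT_SYMBOLS) {           // 6-bit symbol: raw 000000..100000 -> 0..32
                symbol = six;
                pos += 6;
            } else {                               // 7-bit symbol: raw 1000010..1111111 -> 33..94
                int seventh = pos + 6 < bits.size() ? bits.get(pos + 6) : 0;
                symbol = ((six << 1) | seventh) - 66 + SIX_BIT_SYMBOLS;
                pos += 7;
            }
            out.append((char) (' ' + symbol));     // map 0..94 onto printable ASCII 0x20..0x7E
        }
        return out.toString();
    }

    static byte[] decode(String text, int originalLength) {
        List<Integer> bits = new ArrayList<>();
        for (int k = 0; k < text.length(); k++) {
            int symbol = text.charAt(k) - ' ';
            int raw = symbol < SIX_BIT_SYMBOLS ? symbol : symbol - SIX_BIT_SYMBOLS + 66;
            int len = symbol < SIX_BIT_SYMBOLS ? 6 : 7;
            for (int i = len - 1; i >= 0; i--) bits.add((raw >> i) & 1);
        }
        byte[] out = new byte[originalLength];     // drop the zero-bit padding at the end
        for (int i = 0; i < originalLength * 8 && i < bits.size(); i++)
            out[i / 8] |= bits.get(i) << (7 - (i % 8));
        return out;
    }

    public static void main(String[] args) {
        byte[] original = "hello, world".getBytes();
        String encoded = encode(original);
        byte[] roundTrip = decode(encoded, original.length);
        System.out.println(encoded + " -> " + new String(roundTrip));
    }
}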
The short answer would be: No, there still isn't.
I ran into this problem when trying to encode as much information as possible into a JSON string, meaning UTF-8 without control characters, backslash, and quotes.
I went out and researched how many bits you can squeeze into valid UTF-8 bytes. I disagree with the answers stating that UTF-8 brings too much overhead. It's not true.
If you take into account only one-byte sequences, it's as powerful as standard ASCII. Meaning 7 bits per byte. But if you cut out all special characters you'll be left with something like Ascii85.
But there are fewer control characters in higher planes. So if you use 6-byte chunks you'll be able to encode 5 bytes per chunk. In the output you'll get any combination of UTF-8 characters of any length (for 1 to 6 bytes).
This will give you a better result than Ascii85: 5/6 instead of 4/5, 83% efficiency instead of 80%. In theory it'll get even better with higher chunk length: about 84% at 19-byte chunks.
In my opinion the encoding process becomes too complicated while providing very little gain. So Ascii85 or some modified version of it (I'm looking at Z85 now) would be better.
I searched for the most efficient binary-to-text encoding last year. I realized for myself that compactness is not the only criterion. The most important thing is where you are able to use the encoded string. For example, yEnc has 2% overhead, but it is an 8-bit encoding, so its usage is very, very limited.
My choice is Z85. It has an acceptable 25% overhead, and the encoded string can be used almost everywhere: XML, JSON, source code, etc. See the Z85 specification for details.
Finally, I've written a Z85 library in C/C++ and use it in production.
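To give a feel for how Z85 reaches its 25% overhead (every 4 bytes become 5 characters), here is a minimal encoder sketch; the alphabet is the one given in the Z85 specification (ZeroMQ RFC 32), but double-check it against the spec before using this:

public class Z85EncodeDemo {
    // 85-character alphabet from the Z85 specification (ZeroMQ RFC 32).
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyz" +
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#";

    /** Encodes data whose length is a multiple of 4 (a Z85 requirement). */
    static String encode(byte[] data) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < data.length; i += 4) {
            long value = ((data[i] & 0xFFL) << 24) | ((data[i + 1] & 0xFFL) << 16)
                       | ((data[i + 2] & 0xFFL) << 8) | (data[i + 3] & 0xFFL);
            char[] chunk = new char[5];
            for (int j = 4; j >= 0; j--) {         // emit most significant base-85 digit first
                chunk[j] = ALPHABET.charAt((int) (value % 85));
                value /= 85;
            }
            out.append(chunk);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The spec's test vector: 0x86 0x4F 0xD2 0x6F 0xB5 0x59 0xF7 0x5B encodes to "HelloWorld".
        byte[] data = {(byte) 0x86, 0x4F, (byte) 0xD2, 0x6F, (byte) 0xB5, 0x59, (byte) 0xF7, 0x5B};
        System.out.println(encode(data));
    }
}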
According to Wikipedia
basE91 produces the shortest plain ASCII output for compressed 8-bit binary input.
Currently base91 is the best encoding if you're limited to ASCII characters only and don't want to use non-printable characters. It also has the advantage of lightning-fast encoding/decoding speed, because a lookup table can be used, unlike base85 which has to be decoded using slow divisions.
Going above that, base122 will help increase efficiency a little bit, but it's not 8-bit clean. However, because it's based on UTF-8 encoding, it should be fine to use for many purposes. And 8-bit cleanliness is just meaningless nowadays.
Note that base122 is in fact base-128, because the 6 invalid values (128 − 122) are encoded specially, so that a series of 14 bits can always be represented with at most 2 bytes, exactly like base-128 where 7 bits are encoded in 1 byte; in reality it can be optimized to be more efficient than base-128.
Base-122 Encoding
Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx. The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR with 0x80 (this can likely be improved, see §4). Figure 3 summarizes the complete base-122 encoding.
http://blog.kevinalbs.com/base122
See also How viable is base128 encoding for scenarios like JavaScript strings?
Next to the ones listed on Wikipedia, there is Bommanews:
B-News (or bommanews) was developed to lift the weight of the overhead inherent to UUEncode and Base64 encoding: it uses a new encoding method to stuff binary data in text messages. This method eats more CPU resources, but it manages to lower the loss from approximately 40% for UUEncode to 3.5% (the decimal point between those digits is not dirt on your monitor), while still avoiding the use of ANSI control codes in the message body.
It's comparable to yEnc: source
yEnc is less CPU-intensive than B-News and reaches about the same low level of overhead, but it doesn't avoid the use of all control codes, it just leaves out those that were (experimentally) observed to have undesired effects on some servers, which means that it's somewhat less RFC compliant than B-News.
http://b-news.sourceforge.net/
http://www.iguana.be/~stef/
http://bnews-plus.sourceforge.net/
If you are looking for an efficient encoding for large alphabets, you might want to try escapeless. Both escapeless252 and yEnc have 1.6% overhead, but with the former it's fixed and known in advance, while with the latter it actually ranges from 0 to 100% depending on the distribution of bytes.