Does Unicode have a defined maximum number of code points? - unicode

I have read many articles trying to find out the maximum number of Unicode code points, but I did not find a definitive answer.
I understand that the Unicode code point range was limited so that the UTF-8, UTF-16 and UTF-32 encodings can all handle the same set of code points. But what is that number of code points?
The most frequent answer I encountered is that Unicode code points are in the range 0x000000 to 0x10FFFF (1,114,112 code points), but I have also read in other places that it is 1,112,114 code points. So is there one number to give, or is the issue more complicated than that?

The maximum valid code point in Unicode is U+10FFFF, which makes it a 21-bit code set (but not all 21-bit integers are valid Unicode code points; specifically the values from 0x110000 to 0x1FFFFF are not valid Unicode code points).
This is where the number 1,114,112 comes from: U+0000 .. U+10FFFF is 1,114,112 values.
However, there is also a set of code points that are the surrogates for UTF-16. These are in the range U+D800 .. U+DFFF: 2,048 code points that are reserved for UTF-16.
1,114,112 - 2,048 = 1,112,064
There are also 66 noncharacters. These are defined in part in Corrigendum #9: 34 values of the form U+nFFFE and U+nFFFF (where n is 0x00000, 0x10000, … 0xF0000, 0x100000), and 32 values in the range U+FDD0 .. U+FDEF. Subtracting those too yields 1,111,998 allocatable characters.
There are three ranges reserved for private use: U+E000 .. U+F8FF, U+F0000 .. U+FFFFD, and U+100000 .. U+10FFFD.
The number of values actually assigned depends on the version of Unicode you're looking at. You can find information about the latest version at the Unicode Consortium. Amongst other things, the Introduction there says:
The Unicode Standard, Version 7.0, contains 112,956 characters
So only about 10% of the available code points have been allocated.
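A quick sanity check of those counts, as a throwaway JavaScript snippet (JavaScript only because the rest of this page uses it):
var total = 0x10FFFF + 1;        // 1,114,112 code points, U+0000 .. U+10FFFF
var scalars = total - 2048;      // minus the UTF-16 surrogates  -> 1,112,064
var allocatable = scalars - 66;  // minus the 66 noncharacters   -> 1,111,998
console.log(total, scalars, allocatable);
console.log((112956 / total * 100).toFixed(1) + '%'); // about 10.1% assigned as of Unicode 7.0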
I can't account for why you found 1,112,114 as the number of code points.
Incidentally, the upper limit U+10FFFF is chosen so that all the values in Unicode can be represented in one or two 2-byte coding units in UTF-16, using one high surrogate and one low surrogate to represent values outside the BMP or Basic Multilingual Plane, which is the range U+0000 .. U+FFFF.
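To illustrate that, here is a small JavaScript sketch of the UTF-16 surrogate encoding (the function name toSurrogatePair is mine): a supplementary code point is offset by 0x10000 and the resulting 20-bit value is split into two 10-bit halves.
function toSurrogatePair(codePoint) {
    var offset = codePoint - 0x10000;      // 20-bit value, 0x00000 .. 0xFFFFF
    var high = 0xD800 + (offset >> 10);    // top 10 bits  -> high surrogate
    var low = 0xDC00 + (offset & 0x3FF);   // low 10 bits  -> low surrogate
    return [high.toString(16), low.toString(16)];
}
console.log(toSurrogatePair(0x1F600));  // ["d83d", "de00"], the same pair "\uD83D\uDE00" produced by String.fromCodePoint(0x1F600)
console.log(toSurrogatePair(0x10FFFF)); // ["dbff", "dfff"], the last available surrogate combination
That last line is exactly why U+10FFFF is the ceiling: dbff/dfff is the highest pair the surrogate ranges can express.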

Yes, all the code points that can't be represented in UTF-16 (including using surrogates) have been declared invalid.
U+10FFFF is the highest code point, but the surrogates and the noncharacters such as U+FFFE and U+FFFF aren't usable code points, so the total count is a bit lower.

I have written a small routine that prints a long table on screen, from a start value up to start + range, where the start value can be customized by the user. This is the snippet:
function getVal()
{
    var start = parseInt(document.getElementById('start').value);
    var range = parseInt(document.getElementById('range').value);
    var end = start + range;
    return [start, range, end];
}

function next()
{
    var values = getVal();
    document.getElementById('start').value = values[2];
    document.getElementById('ok').click();
}

function prev()
{
    var values = getVal();
    document.getElementById('start').value = values[0] - values[1];
    document.getElementById('ok').click();
}

function renderCharCodeTable()
{
    var values = getVal();
    var start = values[0];
    var end = values[2];
    const MINSTART = 0;          // Allowed range
    const MAXEND = 4294967294;   // Allowed range
    start = start < MINSTART ? MINSTART : start;
    end = end < MINSTART ? (MINSTART + 1) : end;
    start = start > MAXEND ? (MAXEND - 1) : start;
    end = end >= MAXEND ? (MAXEND + 1) : end;
    var tr = [];
    var unicodeCharSet = document.getElementById('unicodeCharSet');
    var cCode;
    var cPoint;
    for (var c = start; c < end; c++)
    {
        try
        {
            cCode = String.fromCharCode(c);
        }
        catch (e)
        {
            cCode = 'fromCharCode max val exceeded';
        }
        try
        {
            cPoint = String.fromCodePoint(c);
        }
        catch (e)
        {
            cPoint = 'fromCodePoint max val exceeded';
        }
        tr[c] = '<tr><td>' + c + '</td><td>' + cCode + '</td><td>' + cPoint + '</td></tr>';
    }
    unicodeCharSet.innerHTML = tr.join('');
}
function startRender()
{
    console.time('renderCharCodeTable');
    setTimeout(function () {
        renderCharCodeTable();
        console.timeEnd('renderCharCodeTable');
    }, 100);
}

window.addEventListener("load", startRender); // pass the function itself; calling it would run it immediately
body
{
    margin-bottom: 50%;
}

form
{
    position: fixed;
}

table *
{
    border: 1px solid black;
    font-size: 1em;
    text-align: center;
}

table
{
    margin: auto;
    border-collapse: collapse;
}

td:hover
{
    padding-bottom: 1.5em;
    padding-top: 1.5em;
}

tbody > tr:hover
{
    font-size: 5em;
}
<form>
Start Unicode: <input type="number" id="start" value="0" onchange="renderCharCodeTable()" min="0" max="4294967294" title="Set a number from 0 to 4294967294" >
<p></p>
Show <input type="number" id="range" value="30" onchange="renderCharCodeTable()" min="1" max="1000" title="Range to show. Insert a value from 1 to 1000" > symbols at once.
<p></p>
<input type="button" id="pr" value="◄◄" onclick="prev()" title="Mostra precedenti" >
<input type="button" id="nx" value="►►" onclick="next()" title="Mostra successivi" >
<input type="button" id="ok" value="OK" onclick="startRender()" title="Ok" >
<input type="reset" id="rst" value="X" onclick="startRender()" title="Reset" >
</form>
<table>
<thead>
<tr>
<th>CODE</th>
<th>Symbol fromCharCode</th>
<th>Symbol fromCodePoint</th>
</tr>
</thead>
<tbody id="unicodeCharSet">
<tr><td colspan="2">Rendering...</td></tr>
</tbody>
</table>
Run it a first time, then open the code and set the start value to a very high number, just a little lower than the MAXEND constant. The following is what I obtained:
code equivalent symbol
{~~~ first execution output example ~~~~~}
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33 !
34 "
35 #
36 $
37 %
38 &
39 '
40 (
41 )
42 *
43 +
44 ,
45 -
46 .
47 /
48 0
49 1
50 2
51 3
52 4
53 5
54 6
55 7
56 8
57 9
{~~~ second execution output example ~~~~~}
4294967275 →
4294967276 ↓
4294967277 ■
4294967278 ○
4294967279 ￯
4294967280 ￰
4294967281 ￱
4294967282 ￲
4294967283 ￳
4294967284 ￴
4294967285 ￵
4294967286 ￶
4294967287 ￷
4294967288 ￸
4294967289 
4294967290 
4294967291 
4294967292 
4294967293 �
4294967294
The output is of course truncated (between the first and the second execution) because it is too long.
After 4294967294 (= 2^32 - 2) the function inexorably stops, so I suppose it has reached its maximum possible value: I interpret this as the maximum value of the char code table. Of course, as said in other answers, not all char codes have an equivalent symbol; frequently they are empty, as the example shows. Also, a lot of symbols are repeated multiple times at different points between char codes 0 and 4294967294.
Edit: improvements
(thanks #duskwuff)
Now it is also possible to compare the behaviour of both String.fromCharCode and String.fromCodePoint. Notice that the first accepts values up to 4294967294, but its output repeats every 65536 (16 bits = 2^16). The latter stops working at code 1114111 (since the list of Unicode chars starts from 0, that gives a total of 1,114,112 Unicode code points; as said in other answers, not all of them are valid in the sense that some are empty points). Also remember that to display a certain Unicode char you need a font that has the corresponding glyph defined; if not, you will see an empty char or an empty square.
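As a quick check of the two behaviours just described (a sketch to paste into any modern browser console):
console.log(String.fromCharCode(0x10041) === String.fromCharCode(0x41)); // true: the argument is taken modulo 65536, both give "A"
console.log(String.fromCodePoint(0x10FFFF)); // works: the last valid code point
try {
    String.fromCodePoint(0x110000); // one past the Unicode range
} catch (e) {
    console.log(e instanceof RangeError); // true: fromCodePoint rejects it
}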
Notice:
I have noticed that on some Android systems, using Chrome for Android, String.fromCodePoint returns an error for all code points.

Related

Find the number at the nth position in the infinite sequence

Given the infinite sequence s = 1234567891011...
find the digit at position n (n <= 10^18).
E.g. n = 12 => 1; n = 15 => 2
import Foundation

func findNumber(n: Int) -> Character {
    var i = 1
    var z = ""
    while i < n + 1 {
        z.append(String(i))
        i += 1
    }
    print(z)
    return z[z.index(z.startIndex, offsetBy: n - 1)]
}

print(findNumber(n: 12))
That's my code, but when I try to find the digit at the 100,000th position it returns an error; I think I appended too many numbers i to the z string.
Can anyone help me, in swift language?
The problem we have here looks fairly straightforward. Take the list of all the numbers from 1 to infinity and concatenate them into a string. Then find the nth digit. A straightforward problem to understand. The issue, though, is that we do not have an infinite amount of memory nor time, so we cannot reasonably do this in a computer program. We must find an alternative way around this that does not just append the numbers onto a string and then find the nth digit.
The first thing we can say is that we know what the entire list is. It will always be the same. So can we use any properties of this list to help us?
Let's call the input number n. This is the position of the digit that we want to find. Let's call the output digit d.
Well, first off, let's look at some examples.
We know all the single digit numbers are just in the same position as the number itself.
So, for n<10 ... d = n
What about for two digit numbers?
Well, we know that 10 starts at position 10. (Because there are 9 single digit numbers before it). 9 + 1 = 10
11 starts at position 12. Again, 9 single digits + one 2 digit number before it. 9 + 2 + 1 = 12
So how about, say... 25? Well that has 9 single digit numbers and 15 two digit numbers before it. So 25 starts at 9*1 + 15*2 + 1 = 40 (+ 1 as the sum gets us to the end of 24 not the start of 25).
So... 99 starts at? 9*1 + 89*2 + 1 = 188.
Then we do the same for the three digit numbers...
100... 9*1 + 90*2 + 1 = 190
300... 9*1 + 90*2 + 200*3 + 1 = 790
1000...? 9*1 + 90*2 + 900*3 + 1 = 2890
OK... so now I'm seeing a pattern here that seems to need to know the number of digits in each number. Well... I can get the number of digits in a number as floor(log10(number)) + 1 (for the examples below, rounding up log10 gives the same result, but the floor form also handles exact powers of 10).
rounding up log base 10 of 5 = 1
rounding up log base 10 of 23 = 2
rounding up log base 10 of 99 = 2
rounding up log base 10 of 627 = 3
OK... so I think I need something like...
// in pseudo code
let lengthOfNumber = getLengthOfNumber(n)
var result = 0
for each i from 0 to lengthOfNumber - 2 {
result += 9 * 10^i * (i + 1) // this give 9*1 + 90*2 + 900*3 + ...
}
let remainder = n - 10^(lengthOfNumber - 1) // the same-length numbers before n, not counted in the loop above
result += remainder * lengthOfNumber
result += 1 // the sums above end just before n; + 1 is the position of its first digit
So, in the above pseudo code you can give it any number and it will return the position in the list that that number starts on.
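For reference, a rough, runnable JavaScript sketch of that pseudo code (the question is in Swift, but the arithmetic is identical; startPosition is a name I made up). It returns the 1-based position in the 1234567891011... sequence at which the number n begins:
function startPosition(n) {
    var len = String(n).length;  // number of digits in n
    var result = 0;
    for (var i = 0; i <= len - 2; i++) {
        result += 9 * Math.pow(10, i) * (i + 1); // all shorter numbers: 9*1 + 90*2 + 900*3 + ...
    }
    result += (n - Math.pow(10, len - 1)) * len; // same-length numbers that come before n
    return result + 1;                           // + 1 lands on the first digit of n
}
console.log(startPosition(25));   // 40
console.log(startPosition(100));  // 190
console.log(startPosition(1000)); // 2890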
This isn't the exact same as the problem you are trying to solve. And I don't want to solve it for you.
This is just a leg up on how I would go about solving it. Hopefully, this will give you some guidance on how you can take this further and solve the problem that you are trying to solve.

Incrementing numbers in REXX

I have a requirement where I take a SUBSTR (e.g. st_detail = "%BOM0007992739871P", st_digit = SUBSTR(st_detail, 8, 10)). I have to do some validations on st_digit and, if it is valid, change "%BOM0002562186P" to "%BOM0002562186C". My code works fine up to this point. But I was then asked to increment st_digit (I used st_digit = st_digit + 1) and print 100 valid st_digits, each appended with C. So I put the code in a loop and display st_detail. But when I ran it I got "%BOM0007.99273987E+9C" after the first increment. Please help on how to display "%BOM0007992739872C". (NOTE: this is a reference only and I can't show the validation logic here; my code works fine, and the extra code I added is what is shown here.)
out_ctr = 1
DO WHILE out_ctr < 101
  /* validations */
  IF valid THEN
    say st_digit " is valid"
  ELSE
    say st_digit " is invalid"
  st_digit = st_digit + 1
  out_ctr = out_ctr + 1
END
It seems the NUMERIC setting was "9 0 SCIENTIFIC". I changed it to NUMERIC DIGITS 12, so now it works.
parse numeric my_numeric_settings
say my_numeric_settings /* 9 0 SCIENTIFIC */
NUMERIC DIGITS 16
parse numeric my_numeric_settings
say my_numeric_settings /* 16 0 SCIENTIFIC */
It's because I used SUBSTR(st_detail, 8, 10), so st_digit has length 10, which is greater than the default setting of "9 0 SCIENTIFIC". By changing it to either NUMERIC DIGITS 10 or NUMERIC DIGITS 12, the code worked.

Creating an optimal selection of overlapping time intervals

A car dealer rents out the rare 1956 Aston Martin DBR1 (of which Aston Martin only ever made 5).
Since there are so many rental requests, the dealer decides to place bookings for an entire year in advance.
He collects the requests and now needs to figure out which requests to take.
Make a script that selects the rental requests such that the greatest number of individual customers
can drive the rare Aston Martin.
The input of the script is a matrix of days of the year, each row representing the starting and ending
days of the request. The output should be the indices of the customers and their day ranges.
It is encouraged to plan your code first and write your own functions.
At the top of the script, add a comment block with a description of how your code works.
Example of a list with these time intervals:
list = [10 20; 9 15; 16 17; 21 100;];
(It should also work for a list with 100 time intervals)
We could select customers 1 and 4, but then 2 and 3 are impossible, resulting in two happy customers.
Alternatively we could select requests 2, 3 and 4. Hence three happy customers is the optimum here.
The output would be:
customers = [2, 3, 4],
days = [9, 15; 16, 17; 21, 100]
All I can think of is checking if intervals intersect, but I have no clue how to make an overall optimal selection.
My idea:
1) Sort them by start date
2) Make an array of intersections for each one
3) Start rejecting from the ones which have the biggest intersection array, removing the rejected item from the arrays of the intersected units
4) Repeat step 3 until only units with empty arrays remain
In your example we will get data
10 20 [9 15, 16 17]
9 15 [10 20]
16 17 [10 20]
21 100 []
so we reject 10 20 as it has 2 intersections, and then only items with empty arrays remain
9 15 []
16 17 []
21 100 []
so the search is finished
Code in JavaScript:
const inputData = ' 50 74; 6 34; 147 162; 120 127; 98 127; 120 136; 53 68; 145 166; 95 106; 242 243; 222 250; 204 207; 69 79; 183 187; 198 201; 184 199; 223 245; 264 291; 100 121; 61 61; 232 247'

// convert string to array of objects
const orders = inputData.split(';')
    .map((v, index) => (
        {
            id: index,
            start: Number(v.split(' ')[1]),
            end: Number(v.split(' ')[2]),
            intersections: []
        }
    ))

// sort them by start value
orders.sort((a, b) => a.start - b.start)

// find intersections for each one and add them to the intersections array
orders.forEach((item, index) => {
    for (let i = index + 1; i < orders.length; i++) {
        if (orders[i].start <= item.end) {
            item.intersections.push(orders[i])
            orders[i].intersections.push(item)
        } else {
            break
        }
    }
})

// sort by intersections count
orders.sort((a, b) => a.intersections.length - b.intersections.length)

// loop while at least one item still has intersections
while (orders[orders.length - 1].intersections.length > 0) {
    const rejected = orders.pop()
    // remove the rejected item from the others' intersections
    rejected.intersections.forEach(item => {
        item.intersections = item.intersections.filter(
            other => other.id !== rejected.id
        )
    })
    // sort by intersections count
    orders.sort((a, b) => a.intersections.length - b.intersections.length)
}

// sort by start value
orders.sort((a, b) => a.start - b.start)

// show result
orders.forEach(item => { console.log(item.start + ' - ' + item.end) })
Wanted to expand/correct a little bit on the accepted answer.
You should start by sorting by the start date.
Then accept the very last customer.
Go through the list descending from there and accept all requests that do not overlap with the already accepted ones.
That's the optimal solution.
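For example, a hedged JavaScript sketch of that greedy, using the sample list from the question: sort by start day, walk from the last request backwards, and keep every request that ends before the earliest start already accepted.
const requests = [[10, 20], [9, 15], [16, 17], [21, 100]]
    .map((r, i) => ({ customer: i + 1, start: r[0], end: r[1] }))

requests.sort((a, b) => a.start - b.start)

const accepted = []
let earliestAcceptedStart = Infinity
for (let i = requests.length - 1; i >= 0; i--) {
    if (requests[i].end < earliestAcceptedStart) { // no overlap with what we already kept
        accepted.unshift(requests[i])
        earliestAcceptedStart = requests[i].start
    }
}
console.log(accepted.map(r => r.customer))       // [2, 3, 4]
console.log(accepted.map(r => [r.start, r.end])) // [[9, 15], [16, 17], [21, 100]]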

Which data structure should I use for bit stuffing?

I am trying to implement bitstuffing for a project I am working on, namely a simple software AFSK modem. The simplified protocol looks something like this:
0111 1110 # burst sequence
0111 1110 # 16 times 0b0111_1110
...
0111 1110
...
... # 80 bit header (CRC, frame counter, etc.)
...
0111 1110 # header delimiter
...
... # data
...
0111 1110 # end-of-frame sequence
Now I need to find the reserved sequence 0111 1110 in the received data and therefore must ensure that neither the header nor the data contains six consecutive 1s. This can be done by bit stuffing, e.g. inserting a zero after every sequence of five 1s:
11111111
converts to
111110111
and
11111000
converts to
111110000
If I want to implement this efficiently, I guess I should not use arrays of 1s and 0s, where I have to convert the data bytes to 1s and 0s, populate an array, etc.; but bit fields of static size do not seem to fit either, because the length of the content is variable due to the bit stuffing.
Which data structure can I use to do bit stuffing more efficiently?
I just saw this question now and seeing that it is unanswered and not deleted I'll go ahead and answer. It might help others who see this question and also provide closure.
Bit stuffing: here the maximum contiguous run of 1s is five; after five 1s, a 0 must be inserted.
Here is the C program that does that:
#include <stdio.h>

typedef unsigned long long int ulli;

int main()
{
    ulli buf = 0x0fffff01,                     // data to be stuffed
         temp2 = 1ull << ((sizeof(ulli)*8)-1), // single-bit mask, starts at the MSB
         temp3 = 0;                            // temporary
    int count = 0;                             // length of the current run of 1s

    while(temp2)
    {
        if((buf & temp2) && count <= 5) // the bit is 1 and the run is still short enough
        {
            count++;
            if(count == 5)
            {
                temp3 = buf & (~(temp2 - 1ull)); // keep the MS bits, from the current position upwards
                temp3 <<= 1;                     // shift them up one place to accommodate the 0
                temp3 |= buf & (temp2 - 1ull);   // put the untouched LS bits back, producing the stuffed data
                buf = temp3;
                count = 0;                       // reset the run counter
                printf("%llx\n", temp3);         // debug only
            }
        }
        else
        {
            count = 0; // this was what took 95% of my debug time: I had not put this else clause :-)
        }
        temp2 >>= 1;   // move on to the next bit
    }
    printf("ans = %llx\n", buf); // finally
    return 0;
}
The problem with this is that if there are more than 10 runs of 5 consecutive 1s, the result might overflow the 64-bit buffer. It's better to divide the data, bit stuff each part, and repeat.
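If the fixed-width integer buffer is a concern, a different data structure sidesteps the overflow entirely: do the stuffing on a growable sequence of bits. A hedged sketch in JavaScript (just to keep it short; the same idea ports directly to C with a dynamic byte buffer), using a plain array of 0/1 values:
function stuffBits(bits) {     // bits: array of 0s and 1s
    const out = [];
    let run = 0;               // length of the current run of 1s
    for (const b of bits) {
        out.push(b);
        run = b === 1 ? run + 1 : 0;
        if (run === 5) {       // five 1s in a row -> insert a 0
            out.push(0);
            run = 0;
        }
    }
    return out;
}
console.log(stuffBits([1,1,1,1,1,1,1,1]).join('')); // "111110111"
console.log(stuffBits([1,1,1,1,1,0,0,0]).join('')); // "111110000"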

T-SQL Decimal Division Accuracy

Does anyone know why, using SQL Server 2005,
SELECT CONVERT(DECIMAL(30,15),146804871.212533)/CONVERT(DECIMAL (38,9),12499999.9999)
gives me 11.74438969709659,
but when I increase the decimal places on the denominator to 15, I get a less accurate answer:
SELECT CONVERT(DECIMAL(30,15),146804871.212533)/CONVERT(DECIMAL (38,15),12499999.9999)
gives me 11.74438969
For multiplication we simply add the number of decimal places in each argument together (using pen and paper) to work out output dec places.
But division just blows your head apart. I'm off to lie down now.
In SQL terms though, it's exactly as expected.
--Precision = p1 - s1 + s2 + max(6, s1 + p2 + 1)
--Scale = max(6, s1 + p2 + 1)
--Scale = 15 + 38 + 1 = 54
--Precision = 30 - 15 + 9 + 54 = 78
--Max P = 38, P & S are linked, so (78,54) -> (38,14)
--So, we have 38,14 output = 11.74438969709659
SELECT CONVERT(DECIMAL(30,15),146804871.212533)/CONVERT(DECIMAL (38,9),12499999.9999)
--Scale = 15 + 38 + 1 = 54
--Precision = 30 - 15 + 15 + 54 = 84
--Max P = 38, P & S are linked, so (84,54) -> (38,8)
--So, we have 38,8 output = 11.74438969
SELECT CONVERT(DECIMAL(30,15),146804871.212533)/CONVERT(DECIMAL (38,15),12499999.9999)
You can do the same math if you follow this rule too, treating each number pair as
146804871.212533000000000 and 12499999.999900000
146804871.212533000000000 and 12499999.999900000000000
To put it shortly, use DECIMAL(25,13) and you'll be fine with all calculations - you'll get the precision right as declared: 12 digits before the decimal dot and 13 after.
The rule is: p + s must equal 38 and you will be on the safe side!
Why is this?
Because of very bad implementation of arithmetic in SQL Server!
Until they fix it, follow that rule.
I've noticed that if you cast the dividing value to float, it gives you the correct answer, i.e.:
select 49/30 (result = 1)
would become:
select 49/cast(30 as float) (result = 1.63333333333333)
We were puzzling over the magic transition,
P & S are linked, so:
(78,54) -> (38,14)
(84,54) -> (38,8)
The following is the math:
i. 78 - 38 = 40,
ii. 54 - 40 = 14
i. 84 - 38 = 46,
ii. 54 - 46 = 8
And this is the reasoning:
i. Output precision less max precision is the digits we're going to throw away.
ii. Then output scale less what we're going to throw away gives us... remaining digits in the output scale.
Hope this helps anyone else trying to make sense of this.
Convert the expression, not the arguments.
select CONVERT(DECIMAL(38,36),146804871.212533 / 12499999.9999)
Using the following may help:
SELECT COL1 * 1.0 / COL2