I am printing barcodes of different lengths (12 or 16 characters) and cannot find a way to center the barcode depending on its length. If I lay out the label for 16 characters by default and then need to print a 12-character barcode, it is not centered but cut from the right. Any ideas on how I can center it, the way the ^FB command does for text fields?
I recently read the GPT-2 paper, which says:
This would result in a base vocabulary of over 130,000 before any multi-symbol tokens are added. This is prohibitively large compared to the 32,000 to 64,000 token vocabularies often used with BPE. In contrast, a byte-level version of BPE only requires a base vocabulary of size 256.
I really don't understand this passage. Unicode represents about 130K characters, so how can this be reduced to 256? Where do the remaining ~129K characters go? What am I missing? Does byte-level BPE allow different characters to share the same representation?
I don't understand the logic. Below are my questions:
Why is the vocabulary size reduced (from 130K to 256)?
What is the logic behind BBPE (byte-level BPE)?
Detailed question
Thank you for your answer, but I still don't get it. Let's say we have 130K unique characters. What we want (and what BBPE does) is to reduce this basic (unique) vocabulary. Each Unicode character can be converted to 1 to 4 bytes using UTF-8 encoding. The original BBPE paper (Neural Machine Translation with Byte-Level Subwords) says:
Representing text at the level of bytes and using the 256 bytes set as vocabulary is a potential solution to this issue.
Each byte can represent 256 values (2^8), and we only need 2^17 (131,072) values to represent the unique Unicode characters. In this case, where did the 256 bytes in the original paper come from? I understand neither the logic nor how to derive this result.
Here are my questions again, in more detail:
How does BBPE work?
Why is the vocabulary size reduced (from 130K to 256 bytes)?
Either way, don't we still need space for 130K entries in the vocabulary? What is the difference between representing unique characters as Unicode code points and as bytes?
Since I have little knowledge of computer architecture and programming, please let me know if there's something I missed.
Sincerely, thank you.
Unicode code points are integers in the range 0..1,114,111 (hexadecimal 10FFFF), of which roughly 130k are in use at the moment. Every assigned code point corresponds to a character, like "a" or "λ" or "龙", which is handy to work with in many cases (but there are a lot of complicated details, e.g. combining marks).
When you save text data to a file, you use one of the UTFs (UTF-8, UTF-16, UTF-32) to convert code points (integers) to bytes. For UTF-8 (the most popular file encoding), each character is represented by 1, 2, 3, or 4 bytes (there's some inner logic to discriminate single- and multi-byte characters).
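A byte can take only 256 distinct values, so a byte-level base vocabulary needs just 256 entries, and every character is expressed as a sequence of those bytes. So when the base vocabulary consists of bytes, rare characters will be encoded with multiple BPE segments.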
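As a rough illustration of the point above, the following Python sketch encodes a few sample characters with UTF-8 and checks how many bytes each one needs. The sample characters are the ones mentioned earlier plus an emoji; the variable names are only illustrative.

```python
# Each character is a single code point, but takes 1 to 4 bytes in UTF-8.
samples = ["a", "λ", "龙", "👍"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: code point U+{ord(ch):04X}, UTF-8 bytes {encoded.hex(' ')}")

# Number of UTF-8 bytes per sample character.
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

# Every individual byte is a value below 256, so 256 base symbols
# are enough to spell out any of these characters.
all_bytes_fit = all(b < 256 for ch in samples for b in ch.encode("utf-8"))
```

The code-point vocabulary has to list every character it wants to handle, while the byte vocabulary only lists the 256 possible byte values and spells rarer characters out as short byte sequences.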
Example
Let's consider a short example sentence like “That’s great 👍”.
With a base vocabulary of all Unicode characters, the BPE model starts off with something like this:
T 54
h 68
a 61
t 74
’ 2019
s 73
(space) 20
g 67
r 72
e 65
a 61
t 74
(space) 20
👍 1F44D
(The first column is the character, the second its codepoint in hexadecimal notation.)
If you first encode this sentence with UTF-8, then this sequence of bytes is fed to BPE instead:
T 54
h 68
a 61
t 74
� e2
� 80
� 99
s 73
(space) 20
g 67
r 72
e 65
a 61
t 74
(space) 20
� f0
� 9f
� 91
� 8d
The typographic apostrophe "’" and the thumbs-up emoji are represented by multiple bytes.
With either input, the BPE segmentation (after training) may end with something like this:
Th|at|’s|great|👍
(This is a hypothetical segmentation, but it's possible that capitalised “That” is too rare to be represented as a single segment.)
The number of BPE operations is different though: to arrive at the segment ’s, only one merge step is required for code-point input, but three steps for byte input.
With byte input, the BPE segmentation is likely to end up with sub-character segments for rare characters.
The down-stream language model will have to learn to deal with that kind of input.
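The merge counts mentioned above for the segment ’s can be checked with a small Python sketch. It does not train a BPE model; it only counts the starting symbols under each base vocabulary, relying on the fact that joining n symbols into one segment takes n - 1 pairwise merges.

```python
# Typographic apostrophe (U+2019) followed by "s".
segment = "\u2019s"

# Base symbols when the vocabulary is Unicode code points: 2 symbols.
code_point_symbols = list(segment)

# Base symbols when the vocabulary is bytes: e2 80 99 73, so 4 symbols.
byte_symbols = list(segment.encode("utf-8"))

# Joining n symbols into a single segment takes n - 1 pairwise merges.
merges_code_points = len(code_point_symbols) - 1
merges_bytes = len(byte_symbols) - 1

print(merges_code_points, merges_bytes)
```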
So you already know BPE, right? Byte-level BPE changes how the base vocabulary is defined. Recall that Unicode contains 143,859 characters, yet the GPT-2 vocabulary size is just 50,257. Starting from a base vocabulary of about 144K would make the vocabulary even larger during training (where frequently co-occurring characters are merged into new tokens).
To solve this issue, GPT-2 uses a byte-level process with a base vocabulary of just 256 symbols, with which any Unicode character can be represented by one or more byte-level symbols. I still don't know the process by which a Unicode character is converted to its byte-level representation.
Does this explanation make it clear why we use a byte-level representation? Once again, GPT-2 starts from this 256-symbol base vocabulary and grows the vocabulary by adding frequently co-occurring sequences.
I learned today that while common fractions have dedicated Unicode values, in order to form less common fractions like ³/₁₆ you have to use superscript/subscript characters followed by a slash. This is confirmed here and here.
This works for ¹¹/₁₆ and ¹³/₁₆, but it gets messed up with ¹⁵/₁₆. Do you see how the 5 rises higher than the one? I imagine this is because in order to show the number 5 clearly as a superscript, it requires more height than 1 and 3.
Well, that creates a problem. How do you display the fraction 15/16 nicely as Unicode characters? Unfortunately I can't use the sup and sub tags. I'm not displaying it in an HTML page. Rather, we're passing a string to a Java application that will then render these values. I know it renders Unicode values fine, but it wouldn't recognize HTML tags. Is there a Unicode solution?
The “proper” way of composing arbitrary vulgar fractions in Unicode is to not use the subscript and superscript digits at all, but to utilise the special properties of the character U+2044 FRACTION SLASH. You would simply type the regular ASCII digits and separate them with the slash like so: 15⁄16. The rendering engine will then automatically select the correct forms of the numbers, producing a clean, uniform look.
I put the word ‘proper’ in quotation marks because this method is not guaranteed to be supported on all systems, and some that do support it do so incorrectly or incompletely. If you absolutely need to make sure that 100% of recipients regardless of system will definitely see something that looks more or less right, I would therefore still (begrudgingly) recommend using the preformatted subscripts and superscripts as a substitute. As the other answer explained, the problem you are having is a font issue and cannot be solved if you do not have control over font settings.
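A minimal Python sketch of the fraction-slash approach: it only builds the string from ASCII digits and U+2044, and whether the result is drawn as a proper fraction still depends on the font and rendering engine of the receiving application. The function name is made up for this example.

```python
FRACTION_SLASH = "\u2044"  # U+2044 FRACTION SLASH

def fraction(numerator: int, denominator: int) -> str:
    """Compose a fraction from plain ASCII digits and U+2044."""
    return f"{numerator}{FRACTION_SLASH}{denominator}"

text = fraction(15, 16)
print(text)                                # rendering depends on the font
print([f"U+{ord(c):04X}" for c in text])   # the code points actually sent
```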
This is indeed a font issue; however, the problem arises from the fact that, in Unicode, ¹, ², and ³ belong to the Latin-1 Supplement block, while the other superscript digits belong to the Superscripts and Subscripts block, so some font substitution occurs.
Please see Why the display of Unicode characters for superscripted digits are not at the same height? for extra details; it is tagged as iOS, but I have the same problem on macOS too.
I found this site, Unicode Fraction Creator: https://lights0123.com/fractions/
Here's an example: ³⁄₂
Which is:
U+00B3 superscript three
U+2044 fraction slash
U+2082 subscript two
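The same composition can be sketched in Python with translation tables for the superscript and subscript digits. The helper name is hypothetical, and the uneven heights discussed above can still appear, because ¹, ² and ³ come from a different Unicode block than the other superscript digits.

```python
# Superscript digits 0-9: note that 1, 2 and 3 live in Latin-1 Supplement.
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
# Subscript digits 0-9 are contiguous: U+2080 to U+2089.
SUBSCRIPT_DIGITS = "".join(chr(0x2080 + i) for i in range(10))

TO_SUPERSCRIPT = str.maketrans("0123456789", SUPERSCRIPT_DIGITS)
TO_SUBSCRIPT = str.maketrans("0123456789", SUBSCRIPT_DIGITS)

def preformatted_fraction(numerator: int, denominator: int) -> str:
    """Superscript numerator + U+2044 fraction slash + subscript denominator."""
    return (
        str(numerator).translate(TO_SUPERSCRIPT)
        + "\u2044"
        + str(denominator).translate(TO_SUBSCRIPT)
    )

print(preformatted_fraction(3, 2))
print(preformatted_fraction(15, 16))
```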
For a general answer on displaying fractions nicely, copy, paste, and change.
| Character | Name | Decimal value |
|---|---|---|
| ⁄ | Fraction slash | 8260 |
| 0 | Digit 0 | 48 |
| 1 | Digit 1 | 49 |
| 2 | Digit 2 | 50 |
| 3 | Digit 3 | 51 |
| 4 | Digit 4 | 52 |
| 5 | Digit 5 | 53 |
| 6 | Digit 6 | 54 |
| 7 | Digit 7 | 55 |
| 8 | Digit 8 | 56 |
| 9 | Digit 9 | 57 |
example: 1/0 =
1⁄0
So I'm studying for the upcoming exam, and there's this question: given a binary file with the size of 31 bytes what will its size be, after encoding it to base64?
The solution teacher gave us was (40 + 4) bytes as it needs to be a multiple of 4.
I'm not able to arrive at this solution, and I have no idea how to solve this, so I was hoping somebody could help me figure it out.
Base64 encoding divides the input data into 6-bit blocks, and each block is encoded as one ASCII character.
With 31 bytes of input you have 31 × 8 / 6 ≈ 41.3 six-bit blocks to encode. As a rule of thumb, every 3 bytes of input produce 4 bytes of output.
If the input is not a multiple of six bits, Base64 encoding fills the last block with 0 bits.
In your example you have 42 six-bit blocks, with the last one filled up with the missing 0 bits.
Base64 implementations then pad the encoded data with '=' symbols so that the final length is a multiple of 4, which gives 42 + 2 = 44 bytes.
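The arithmetic above can be checked with Python's standard `base64` module. This is only a sanity check using 31 zero bytes as stand-in data; the content of the file does not affect the encoded length.

```python
import base64
import math

data = bytes(31)                      # 31 zero bytes as stand-in input
encoded = base64.b64encode(data)

# 31 * 8 = 248 bits, split into 6-bit blocks, rounded up.
blocks = math.ceil(len(data) * 8 / 6)

# Every 3 input bytes (or part thereof) produce 4 output bytes.
padded_len = 4 * math.ceil(len(data) / 3)

# Number of '=' symbols appended to reach a multiple of 4.
padding = encoded.count(b"=")

print(blocks, padding, len(encoded))
```

The teacher's "40 + 4" can be read the same way: the first 30 bytes encode to 40 characters, and the remaining single byte takes one more group of 4 (2 characters plus 2 padding symbols).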
I'm reading the powerpoint specification and I came across a table like this:
Do tables like these have a name? How do I read this?
I'm pretty sure it means that the first 4 bits identify the recVer and the next 12 identify the recInstance, but what about recLen? Do all 32 bits pull double duty and identify the recLen, or does that mean the next 32 bits do that?
It looks like some type of packet header. The numbers at the top are the bit positions. It is read left to right, top to bottom, so it is telling you that the header is made up of 4 bits interpreted as the recVer, followed by 12 bits interpreted as the recInstance, followed by 16 bits that are the recType, followed by 32 bits that are the recLen.
This is a common way to show the header structure, as can be seen on Wikipedia's TCP page.
This is just part of the binary format for the PowerPoint file. The 0, 1, 2, etc. are the bit numbers, so you can see that bits 0 - 3 inclusive are the recVer, and so on.
The specification will tell you what recVer, recInstance and recType mean.
I think recLen should be obvious but it'll be in the spec.
To read it, you'd read in the bytes and then do bit manipulation to decode those fields. You don't say what language you'll be using but you can do bit manipulation in a number of languages.
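As a sketch of the bit manipulation, here is how the 8-byte header could be decoded in Python. It assumes the fields are stored little-endian and that recVer occupies the low 4 bits of the first 16-bit word, with recInstance in the upper 12 bits; both are assumptions to confirm against the specification, and the function name and sample values are made up for illustration.

```python
import struct

def parse_record_header(header: bytes) -> dict:
    """Decode an 8-byte record header (assumed little-endian layout)."""
    # "<HHI": little-endian uint16, uint16, uint32 = 8 bytes in total.
    ver_instance, rec_type, rec_len = struct.unpack("<HHI", header[:8])
    return {
        "recVer": ver_instance & 0x000F,   # low 4 bits (assumed bit order)
        "recInstance": ver_instance >> 4,  # remaining 12 bits
        "recType": rec_type,               # next 16 bits
        "recLen": rec_len,                 # final 32 bits
    }

# Hypothetical sample header, built only to exercise the parser.
sample = struct.pack("<HHI", (0x123 << 4) | 0xF, 0x03E8, 100)
fields = parse_record_header(sample)
print(fields)
```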
Not sure about an official/standard name, but this looks like a record layout map.
You read it left to right, every box is a single bit.
The record is composed of
4 bits recVer
12 bits recInstance
16 bits recType
32 bits recLen
I'd like to squeeze or compress the hash value produced by MD5 or SHA-1 in a server-side application so that the client can decompress it. Is this possible? It's a usability issue for my application.
No, hash values cannot be compressed. By design their bits are highly random and have maximum entropy, so there is no redundancy to compress.
If you want to make the hash values easier to read for users you can use different tricks, such as:
Displaying fewer digits. Instead of 32 digits just show 16.
Using a different base. For instance, if you used base 62 using all the uppercase and lowercase letters plus numbers 0-9 as digits then you could show a 128-bit hash using 22 letters+digits versus 32 hex digits:
$\log_{62}(2^{128}) \approx 21.5$
Adding whitespace or punctuation. You'll commonly see CD keys printed with dashes like AX7T4-BZ41O-JK3FF-QOZ96. It's easier for users to read this than 20 digits all jammed together.
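A rough Python sketch of the last two tricks: re-encoding a 128-bit MD5 digest in base 62 and grouping the result with dashes. The alphabet order, the group size and the helper names are arbitrary choices for illustration, not a standard encoding.

```python
import hashlib
import string

# 62 symbols: digits, then uppercase, then lowercase (arbitrary order).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def to_base62(data: bytes) -> str:
    """Interpret the bytes as one big integer and write it in base 62."""
    n = int.from_bytes(data, "big")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def group(text: str, size: int = 5, sep: str = "-") -> str:
    """Insert a separator every `size` characters for readability."""
    return sep.join(text[i:i + size] for i in range(0, len(text), size))

digest = hashlib.md5(b"hello world").digest()   # 16 bytes = 128 bits
hex_form = digest.hex()                         # 32 hex digits
b62_form = to_base62(digest)                    # at most 22 characters

print(hex_form)
print(group(b62_form))
```

Because this conversion drops leading zeros, the base-62 string can occasionally be shorter than 22 characters; left-padding it to a fixed width would be needed if the client has to convert it back to the exact 16 bytes.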
Hash values are quite short; attempting compression on these (quite random and highly varied) values is difficult and inefficient. If you want to save space, truncating the value could help, but keep in mind that doing so increases the chance of collisions (and shrinks the key space).