How many URLs for this configuration? - encoding

I am using YOURLS with Base32 encoding to send shortened links within an SMS. The URL is preceded by a message, and since an SMS is limited to 160 characters and my messages are approximately 140 characters, I need to be very careful about character count.
My question is this: how do I calculate how many URLs I can fit within a 4-character limit using Base32 encoding?

I'm not sure if you are asking about permutations.
Each character in base32 encoding can have 32 values ([A-Z] and [2-7]). If you use the form http://yoursite.com/xxxx, where xxxx is the short URL, four digits can contain 32^4 permutations. That is, 1,048,576.
If you also include URLs with three digits (e.g. http://yoursite.com/xxx), you can have 32^3 = 32,768 more. Together with the four-digit URLs, you get a total of 1,081,344.
If you also use 2-digit URLs (e.g. http://yoursite.com/xx), you get an additional 32^2 = 1,024 URLs, totalling 1,082,368. And including single digits (e.g. http://yoursite.com/x) will give you an additional 32, totalling 1,082,400.
But you don't need to use only [A-Z] and [2-7]. As per RFC 3986, you can use the characters ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=. That's 84 different characters. With this:
http://yoursite.com/xxxx 49,787,136
http://yoursite.com/xxx added 50,379,840 (+592,704)
http://yoursite.com/xx added 50,386,896 (+ 7,056)
http://yoursite.com/x added 50,386,980 (+ 84)
Even if you leave out the characters -._~:/?#[]@!$&'()*+,;=, because they would really not fit in a shortened URL, you'll still get 62 characters. With that:
http://yoursite.com/xxxx 14,776,336
http://yoursite.com/xxx added 15,014,664 (+238,328)
http://yoursite.com/xx added 15,018,508 (+ 3,844)
http://yoursite.com/x added 15,018,570 (+ 62)
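All of those totals are just sums of powers of the alphabet size. A minimal Python sketch to reproduce them (url_count is a made-up helper name):

def url_count(alphabet_size, max_length):
    # distinct short codes of length 1 through max_length
    return sum(alphabet_size ** k for k in range(1, max_length + 1))

print(url_count(32, 4))   # 1082400  (Base32 alphabet)
print(url_count(84, 4))   # 50386980 (full RFC 3986 set)
print(url_count(62, 4))   # 15018570 (letters and digits only)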

Related

Complex Regex for password field

I'm having difficulty assembling password validation rules as complex as these:
Shall be 8-20 characters
Shall have at least one number (0-9)
Shall contain at least one special character (!, @, #, $, % ... etc.)
No repetition of a single character more than 3 times
No repetition of a sequence of characters/numbers more than 2 times
No sequence of 4 or more characters (e.g. abcd)
No sequence of 4 or more numbers (e.g. 1234)
No sequence of 4 or more keyboard characters (e.g. qwer)
May not have any of the above spelled backwards.
The maximum I could achieve is the below; it's not working as expected. Could someone advise?
^[A-Za-z0-9\\S+](?=.*[$@$@!%*?&]){8,20}
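For what it's worth, rules like these are much easier to enforce as a series of small checks than as a single regex. A sketch in Python (the special-character set and the exact reading of the two "repetition" rules are assumptions):

import re

SPECIALS = "!@#$%^&*?"                               # assumed special-character set
KEYBOARD_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def has_char_run(pw, n):
    # one character repeated n or more times in a row
    return re.search(r"(.)\1{%d,}" % (n - 1), pw) is not None

def has_repeated_block(pw, times):
    # a block of 2+ characters occurring 'times' or more times in a row
    return re.search(r"(.{2,})\1{%d,}" % (times - 1), pw) is not None

def has_ascending_run(s, n):
    # n or more consecutive code points (abcd, 1234)
    s = s.lower()
    return any(all(ord(s[i + j + 1]) - ord(s[i + j]) == 1 for j in range(n - 1))
               for i in range(len(s) - n + 1))

def has_keyboard_run(s, n):
    # n or more adjacent characters from one keyboard row (qwer)
    s = s.lower()
    return any(row[i:i + n] in s
               for row in KEYBOARD_ROWS for i in range(len(row) - n + 1))

def is_valid(pw):
    return (8 <= len(pw) <= 20
            and any(c.isdigit() for c in pw)
            and any(c in SPECIALS for c in pw)
            and not has_char_run(pw, 4)               # no char more than 3 times in a row
            and not has_repeated_block(pw, 3)         # no block more than 2 times
            and not has_ascending_run(pw, 4)          # abcd, 1234
            and not has_ascending_run(pw[::-1], 4)    # dcba, 4321 (backwards)
            and not has_keyboard_run(pw, 4)           # qwer
            and not has_keyboard_run(pw[::-1], 4))    # rewq (backwards)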

Collision-proof hash-like identifier

I need to generate a 6-character (letters and digits) id to identify a SaaS workspace (unique per user). Of course I could just go with numbers, but it shouldn't give the end user any clear idea of the real workspace number.
So even for id 1 it should be 6 characters long, something like fX8gz6, and fully decodable back to 1 or 000001 or something I can parse to the real workspace id. And of course it has to be collision-proof.
What would be the best approach for that?
This is similar to what Amazon uses for its cloud assets, except it uses 8 chars. Actually 8 chars is suitable, as that is the output length of Base64-encoding 6 binary bytes.
Assuming you have the flexibility to use 8 characters (the original question said 6 chars, but again, assuming), here is a possible scheme:
Number your assets as an unsigned Int32, possibly in auto-increment fashion. Call it real-id. Use this real-id for all your internal purposes.
When you need to display it, follow something like this:
Convert your integer to 4 binary bytes. Every language has a library to extract the bytes out of integers and vice versa. Call it real-id-bytes.
Take a two-byte random number. Again, you can use libraries to generate an exact 16-bit random number; a cryptographic random number generator gives better results, but plain rand is just fine. Call it rand-bytes.
Obtain the 6-byte display-id-bytes = array-concat(rand-bytes, real-id-bytes).
Obtain display-id = Base64(display-id-bytes). This is exactly 8 chars long and has a mix of lowercase, uppercase and digits.
Now you have a seemingly random 8-character display-id which can be mapped back to the real-id. To convert back:
Take the 8-character display-id
display-id-bytes= Base64Decode(display-id)
real-id-bytes= Discard-the-2-random-bytes-from(display-id-bytes)
real-id= fromBytesToInt32(real-id-bytes)
Simple. Now, if you really cannot go for an 8-char display-id, then you have to develop some custom Base64-like algorithm. You might also restrict yourself to only 1 random byte. Also note that this is just an encoding scheme, NOT an encryption scheme, so anyone with knowledge of your scheme can effectively break/decode the ID. You need to decide whether that is acceptable; if not, then I guess you have to do some form of encryption, and whatever that is, surely 6 chars will be far from sufficient.
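For illustration, a minimal sketch of the scheme above in Python (function names are made up; it uses the URL-safe Base64 alphabet so the id never contains + or /):

import base64
import os
import struct

def to_display_id(real_id):
    # 2 random bytes + 4-byte big-endian unsigned int = 6 bytes -> 8 Base64 chars
    rand_bytes = os.urandom(2)
    real_id_bytes = struct.pack(">I", real_id)
    return base64.urlsafe_b64encode(rand_bytes + real_id_bytes).decode("ascii")

def from_display_id(display_id):
    display_id_bytes = base64.urlsafe_b64decode(display_id)
    # discard the 2 random bytes, decode the remaining 4 back to an Int32
    return struct.unpack(">I", display_id_bytes[2:])[0]

d = to_display_id(1)
print(d)                     # e.g. 'q9wAAAAB' -- 8 chars, looks random
print(from_display_id(d))    # 1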

BCPL octal numerical constants

I've been digging into the history of BCPL due to a question I was asked about the reasoning behind using the prefix "0x" for the representation of hexadecimal numbers.
In my search I stumbled upon a really good explanation of the history behind this token. (Why are hexadecimal numbers prefixed with 0x?)
That post, however, sparked another question:
For octal constants, did BCPL use 8 <digit> (as per the spec: http://cm.bell-labs.com/cm/cs/who/dmr/bcpl.pdf), did it use #<digit> (as per http://rabbit.eng.miami.edu/info/bcpl_reference_manual.pdf), or were both of these syntaxes valid in different implementations of the language?
I've also been able to find a second answer here that used the # syntax, which further intrigued me. (Why are leading zeroes used to represent octal numbers?)
Any historical insights are greatly appreciated.
There were many slight variations on syntax in BCPL.
For example, while the one we used had 16-bit cells (so that x!y gave you the 16-bit word at word address x + y, a word address being half of the byte address), we also had a need to work with byte addresses and byte values (since we were primarily creating OS and control software on the 6809, a byte-addressable CPU).
Hence in addition to:
x!y - get word from byte address (x + y) * 2
we also had
x!%y - get byte from byte address (x * 2) + y
x%!y - get word from byte address x + (y * 2)
x%%y - get byte from byte address x + y
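To make those concrete, here is a toy model of the four operators in Python (byte-addressable memory; the 16-bit big-endian word layout, as on the 6809, is an assumption):

mem = bytearray(64)        # toy byte-addressable memory

def word_at(byte_addr):
    # 16-bit big-endian word starting at a byte address
    return (mem[byte_addr] << 8) | mem[byte_addr + 1]

def get_word(x, y):        # x!y   - word at byte address (x + y) * 2
    return word_at((x + y) * 2)

def get_byte_mixed(x, y):  # x!%y  - byte at byte address (x * 2) + y
    return mem[(x * 2) + y]

def get_word_mixed(x, y):  # x%!y  - word at byte address x + (y * 2)
    return word_at(x + (y * 2))

def get_byte(x, y):        # x%%y  - byte at byte address x + y
    return mem[x + y]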
I'm pretty certain they were implementation-specific as I never saw them anywhere else. And BCPL was around long before language standards were as important as they are today.
The canonical language specification would have been the earlier one from Richards, since he wrote the language (your second document is for the Essex BCPL implementation, from about a decade later). But keep in mind that the Project MAC version was the earliest iteration - there were plenty of advancements after that as well.
For example, there's a 2013 revision of the BCPL User Guide (see Martin Richards' home page) which specifies #b, #o and #x as prefixes for the various non-decimal bases.

Encoding that minimizes misreading / mistyping / misspeaking?

Let's say you have a system in which a fairly long key value can be accurately communicated to a user on-screen, via email or via paper; but the user needs to be able to communicate the key back to you accurately by reading it over the phone, or by reading it and typing it back into some other interface.
What is a "good" way to encode the key to make reading / hearing / typing it easy & accurate?
This could be an invoice number, a document ID, a transaction ID or some other abstract value. Let's say for the sake of this discussion the underlying key value is a big number, say 40 digits in base 10.
Some thoughts:
Shorter keys are generally better
a 40-digit base 10 value may not fit in the space given, and it is easy to lose your place in the middle of it
the same value could be represented in base 16 in 33-34 digits
the same value could be represented in base 36 in 26 digits
the same value could be represented in base 64 in 22-23 digits
Characters that can't be visually confused with each other are better
e.g. an encoding that includes both O (oh) and 0 (zero), or S (ess) and 5 (five), could be bad
This issue depends on the font / face used to display the key, which you may be able to control in some cases (like printing on paper) but can't control in others (like web pages and email).
Also depends on whether you can control the exclusive use of upper and / or lower case -- e.g. capital D (dee) may look like O (oh) but lower case d (dee) would not; while lower case l (ell) looks like a 1 (one) while capital L (ell) would not. (With exceptions for especially exotic fonts / faces).
Characters that can't be verbally / aurally confused with each other are better
a (ay) 8 (eight)
B (bee) C (cee) D (dee) E (ee) g (gee) p (pee) t (tee) v (vee) z (zee) 3 (three)
This issue depends on the audio quality of the end-to-end channel -- bigger challenge if the expected user base could have a speech impediment, or may have to speak through a gas mask, or the communication channel could include CB radios or choppy VOIP phone systems.
Adding a check digit or two would detect errors but not help resolve errors.
An alpha - bravo - charlie - delta type dialog can help with hearing errors, but not reading errors.
Possible choices of encoding:
Base 64 -- compact, but too many hard-to-verbalize characters (underscore, dash etc.)
Base 34 -- 0-9 and A-Z, but with O (oh) and I (eye) left out as the easiest to confuse with digits
Base 32 -- same as base 34 but leave out the 0 (zero) and 1 (one) as well
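For scale, a quick Python sketch of the base 34 / base 32 alphabets just described, and the key lengths they yield for a 40-digit number (encode is just an illustrative helper):

import string

BASE34 = "".join(c for c in string.digits + string.ascii_uppercase if c not in "OI")
BASE32 = "".join(c for c in BASE34 if c not in "01")

def encode(n, alphabet):
    out = []
    while True:
        n, r = divmod(n, len(alphabet))
        out.append(alphabet[r])
        if n == 0:
            return "".join(reversed(out))

key = 10 ** 40 - 1                    # a 40-digit base-10 key
print(len(encode(key, BASE34)))       # 27 characters
print(len(encode(key, BASE32)))       # 27 characters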
Is there a generally recognized encoding that is a reasonable solution for this scenario?
When I first heard of it, I liked the article A Proposal for Proquints: Identifiers that are Readable, Spellable, and Pronounceable. It encodes data as a sequence of consonants and vowels. It's tied to the English language, though. (In German, for example, f and v sound the same, so they should not both be used.) But I like the general idea.
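The 16-bit building block of that proposal is tiny to implement. A sketch in Python (proquint16 is my name for it; the alphabets and bit layout are from the proposal):

CONSONANTS = "bdfghjklmnprstvz"   # 16 consonants carry 4 bits each
VOWELS = "aiou"                   # 4 vowels carry 2 bits each

def proquint16(word):
    # one 16-bit value -> consonant-vowel-consonant-vowel-consonant
    return (CONSONANTS[(word >> 12) & 0xF] +
            VOWELS[(word >> 10) & 0x3] +
            CONSONANTS[(word >> 6) & 0xF] +
            VOWELS[(word >> 4) & 0x3] +
            CONSONANTS[word & 0xF])

print(proquint16(0x7F00))   # 'lusab' -- the first half of the proposal's 127.0.0.1 example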

In Unicode, why are there two representations for the Arabic digits?

I was reading the specification of Unicode @ Wikipedia (Arabic Unicode)
and I see that each of the Arabic digits has 2 Unicode code points.
For example, 1 is defined both as U+0661 and as U+06F1.
Which one should I use?
According to the code charts, U+0660 .. U+0669 are ARABIC-INDIC DIGIT values 0 through 9, while U+06F0 .. U+06F9 are EXTENDED ARABIC-INDIC DIGIT values 0 through 9.
In the Unicode 3.0 book (5.2 is the current version, but these things don't change much once set), the U+066n series of glyphs are marked 'Arabic-Indic digits' and the U+06Fn series of glyphs are marked 'Eastern Arabic-Indic digits (Persian and Urdu)'.
It also notes:
U+06F4 - 'different glyphs in Persian and Urdu'
U+06F5 - 'Persian and Urdu share glyph different from Arabic'
U+06F6 - 'Persian glyph different from Arabic'
U+06F7 - 'Urdu glyph different from Arabic'
For comparison:
U+066n: ٠١٢٣٤٥٦٧٨٩
U+06Fn: ۰۱۲۳۴۵۶۷۸۹
Or:
Digit U+066n U+06Fn
0 ٠ ۰
1 ١ ۱
2 ٢ ۲
3 ٣ ۳
4 ٤ ۴
5 ٥ ۵
6 ٦ ۶
7 ٧ ۷
8 ٨ ۸
9 ٩ ۹
(Whether you can see any of those, and how clearly they are differentiated may depend on your browser and the fonts installed on your machine as much as anything else. I can see the difference on 4 and 6 clearly; 5 looks much the same in both.)
Based on this information, if you are working with Arabic from the Middle East, use the U+066n series of digits; if you are working with Persian or Urdu, use the U+06Fn series of digits. As a Unicode application, you should accept either set of codes as valid digits (but you might look askance at a sequence that mixed the two sets of digits - or you might just leave well alone).
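A minimal Python sketch of that choice, mapping ASCII digits onto either series (the translation-table names are mine; the offsets come from the code charts above):

ARABIC_INDIC   = {ord("0") + d: 0x0660 + d for d in range(10)}   # U+0660..U+0669
EASTERN_ARABIC = {ord("0") + d: 0x06F0 + d for d in range(10)}   # U+06F0..U+06F9

print("2024".translate(ARABIC_INDIC))     # ٢٠٢٤
print("2024".translate(EASTERN_ARABIC))   # ۲۰۲۴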
In general you should not hard-code such info in your application.
On Windows you can use GetLocaleInfo with LOCALE_SNATIVEDIGITS.
On Mac CFNumberFormatterCopyProperty with kCFNumberFormatterZeroSymbol.
Or use something like ICU.
There are Arabic countries that don't use the Arabic-Indic digits by default. So there is no direct mapping saying Arabic -> Arabic-Indic digits.
And the user might have changed the defaults in the Control Panel anyway.
Which code do you prefer for representing the number 4, U+0664 or U+06F4?
(٤ or ۴)?
To be consistent, let this choice guide which codes you use for 1, 2, and the other duplicate codes.