I am working on software to encode postal addresses using the PostBar barcode symbology in use in Canada.
I can't find the relevant information for these codes. Wikipedia does describe PostBar, but with a caveat saying that the article is about the D12 type, whereas Canada Post actually uses the types D52.01/D82.01/S52.40 and S82.39, which are different and undocumented. (I also know the "CANADA POST CORPORATION 4-STATE BAR CODE HANDBOOK" document, which doesn't help.)
I need the specifics of the encoding of the fields (DCI, Postal Code, Address Locator...) and the parameters of Reed-Solomon parity bits.
I am not after an implementation, which I am able to craft myself. Thank you in advance for any tip.
This is the only thing I could find on the subject. It is not much, I'm afraid:
https://en.wikipedia.org/wiki/Canada_Post#Barcodes
Canada Post uses a 13-character barcode for their pre-printed labels. Barcodes consist of two letters, followed by eight sequence digits, and a ninth digit which is the check digit. The last two characters are the letters CA. The check digit seems to ignore the letters and only concern itself with the first 8 numeric digits. The scheme is to multiply each of those 8 digits by a different weighting factor (8, 6, 4, 2, 3, 5, 9, 7), add up the total of all of these multiplications, and divide by 11. The remainder after dividing by 11 gives a number from 0 to 10. Subtracting this from 11 gives a number from 1 to 11. That result is the check digit, except in the two cases where it is 10 or 11: if 10, it is changed to a 0, and if 11, it is changed to a 5. The check digit may be used to verify whether a barcode scan is correct, or whether a manual entry of the barcode is correct.
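In case it's useful, here is that check-digit calculation as a minimal Python sketch, based purely on the description above (the function name and the example digits are mine, nothing official):

    def check_digit(sequence_digits: str) -> int:
        """Check digit for the 8 sequence digits of a pre-printed label."""
        weights = (8, 6, 4, 2, 3, 5, 9, 7)
        remainder = sum(int(d) * w for d, w in zip(sequence_digits, weights)) % 11
        result = 11 - remainder          # gives 1..11
        if result == 10:
            return 0
        if result == 11:
            return 5
        return result

    # e.g. check_digit("12345678") -> 5 under this scheme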
And as a bonus, an explanation of the barcodes, in Dutch:
https://www.postnl.nl/Images/Brochure-KIX-code-van-PostNL_tcm10-10210.pdf
I don't think we (Canada Post) use PostBar anymore. Management made adoption too much of a pain for the mailer, so it died. I haven't seen one on an envelope in years. Now that OCR tech is so good, it wouldn't help that much to include a PostBar anyway.
What they should have done is give away software that printed the address labels in alphanumeric order of the postal code and printed a bunch of positional marks on the top fold of the envelope based on that same postal code. That way a postal clerk wouldn't even need to take the mail out of the box to see where it should be shipped. LVMs (large volume mailers) would do this for a rebate on their bill.
As for smaller businesses or the general public, we should have just sold them prepaid envelopes in 2 or 3 standard sizes for a dime less than the cost of a stamp alone. A standard envelope can have a dedicated spot for a machine-readable postal code. I would have gone with good old public-domain Braille! Printed, or in Sharpie :-) Oh well, I'm rambling now; I'll stop.
I'm reading the powerpoint specification and I came across a table like this:
Do tables like these have a name? How do I read this?
I'm pretty sure it means that the first 4 bits identify the recVer and the next 12 identify the recInstance, but what about recLen? Do all 32 bits pull double duty and identify the recLen, or does that mean the next 32 bits do that?
It looks like some type of packet header. The numbers at the top are the bit positions. It is read left to right, top to bottom, so it is telling you that the header is made up of 4 bits interpreted as the recVer, followed by 12 bits interpreted as the recInstance, followed by 16 bits that are the recType, followed by 32 bits which are the recLen.
This is a common way to show the header structure, as can be seen on Wikipedia's TCP page.
This is just part of the binary format for the PowerPoint file. The 0, 1, 2, etc. are the bit numbers, so you can see that bits 0-3 inclusive are the recVer, and so on.
The specification will tell you what recVer, recInstance and recType mean.
I think recLen should be obvious, but it'll be in the spec.
To read it, you'd read in the bytes and then do bit manipulation to decode those fields. You don't say what language you'll be using but you can do bit manipulation in a number of languages.
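For instance, in Python it could look something like the sketch below. Note the assumptions: it reads the header little-endian with recVer in the low 4 bits, which is how these Office binary record headers are normally stored, but verify against the spec for your exact record:

    import struct

    def parse_record_header(data: bytes):
        """Split an 8-byte header into recVer (4 bits), recInstance (12 bits),
        recType (16 bits) and recLen (32 bits)."""
        ver_instance, rec_type, rec_len = struct.unpack_from("<HHI", data)
        rec_ver = ver_instance & 0x000F        # low 4 bits
        rec_instance = ver_instance >> 4       # remaining 12 bits
        return rec_ver, rec_instance, rec_type, rec_len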
Not sure about an official/standard name, but this looks like a record layout map.
You read it left to right, every box is a single bit.
The record is composed of:
4 bits recVer
12 bits recInstance
16 bits recType
32 bits recLen
I am currently working on remaking an old invoicing program that was originally written in VB6.
It has two parts: one on an Android tablet, the other on a PC. The old database stored derived values because there was a chance that the calculations would not come out the same if repeated.
For example, if one sold 5 items whose price was 10 euros, at a 10% discount and a tax rate of 23%, it would store the above 4 values but also the result of the calculation (5 * (10 * 1.23)) * 0.9.
I do not really like having duplicate or derivable information in my database, but the actual sale value must be the same whether it is viewed on a tablet or a PC.
So my question is: is there a chance (even the slightest one) that the above calculation (to a three-decimal precision) would have different results on different operating systems (such as an Android device and a desktop computer)?
Thanks in advance for any help you can provide
Yes, it's possible. Floating-point arithmetic is always subject to rounding errors and different languages (and architectures) deal with those errors in different ways. There are best practices in dealing with these issues, though I don't consider myself knowledgeable enough to speak to them. But here are a couple of options for you.
Use a data type meant for exact decimal arithmetic. For example, VB6 has Single and Double types for floating point, but also a Currency type for accurate decimal math.
Scale your floating-point values to integers and perform your calculations on these integer values. You can even store the results as integers in your DB. The ERP system we use does this and includes a data dictionary that defines how each type was scaled so that it can be "unscaled" before display.
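To illustrate the second option, here is a rough Python sketch using the numbers from the question. The scaling factors (mills for money, basis points for percentages) and the truncating rounding rule are just choices I made for the example; the important part is that every platform applies exactly the same ones:

    # Money in integer "mills" (thousandths of a euro), percentages in basis points.
    PRICE_MILLS = 10_000      # 10.000 euros
    QTY = 5
    TAX_BP = 2_300            # 23.00% tax
    DISCOUNT_BP = 1_000       # 10.00% discount

    subtotal = QTY * PRICE_MILLS                              # 50_000
    with_tax = subtotal * (10_000 + TAX_BP) // 10_000         # 61_500
    total = with_tax * (10_000 - DISCOUNT_BP) // 10_000       # 55_350 mills = 55.350 euros

Since it's all integer arithmetic, the result is bit-for-bit identical on the tablet and the PC.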
Hope that helps.
"Suppose you are given a computer with a 16-bit virtual address and a page size of 256 bytes. The system uses 1-level page tables that start at address hex 400. Maybe you want DMA...who knows? The first few pages are reserved for hardware flags, etc. Assume page table entries have 8 status bits. The 8 status bits would be..."
http://www.youtube.com/watch?v=-3Rt2_9d7Jg
Can someone explain why the answer is as Mark/Jesse described?
According to this page documenting some technical inaccuracies of The Social Network, the question is (badly) derived from a problem from an actual Harvard course.
A sample problem: Suppose we are given a computer with 16-bit
virtual addresses, and a page size of 256 bytes. The system uses
one-level page tables, which start at address 0x0400. (The first few
pages are reserved for hardware flags, etc. Maybe you wanted to have
DMA on your 16-bit system, who knows?) Assume page table entries have
eight status bits: 1 valid bit, 1 modify bit, 1 reference bit, and 5
permissions bits (this is a very secure system).
How many pages are there? How much memory do the page tables require?
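For what it's worth, the two questions the quoted problem actually asks come down to simple arithmetic; a sketch, assuming 16-bit physical addresses so that a page table entry is the 8 status bits plus an 8-bit frame number:

    virtual_bits = 16
    offset_bits = 8                               # 256-byte pages -> 8-bit offset
    pages = 2 ** (virtual_bits - offset_bits)     # 256 pages
    entry_bits = 8 + 8                            # status bits + assumed 8-bit frame number
    table_bytes = pages * entry_bits // 8         # 512 bytes for the page table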
The 8 status bits are architecture dependent and, in this particular problem, are made up as an assumption for an imaginary computer. The movie producers simply took the problem description and turned one of the assumptions into the question - a question that doesn't make sense to ask in the first place.
To more easily understand this, imagine that you have a question like the following
A car traveled across a road for 1 hour. Assuming the car's speed is
100km/h, how much distance did the car travel?
and the question turned into
A car traveled across a road for 1 hour. The speed of the car was...?
Edit: Didn't realise the original article used a similar analogy to mine.
I am using UUIDs, but they are not particularly nice to read, write and communicate. So I would like to encode them. I could use base64 or base32, but they would not be much easier anyway: base64 has capitalized letters and symbols. Base32 is a bit better, but you can still end up with clumsy stuff.
I was wondering if there's a nice and clean way to encode a number into palatable phonemes, so to achieve better readability and hopefully a bit of compression.
I hope you don't use this idea: The Automated Curse Generator :)
Bubble Babble is a good one to try. It generates nonsensical but readable output like:
xesef-disof-gytuf-katof-movif-baxux
This question is very old; interestingly, as old as the solution I'm about to present, but it hasn't been mentioned here yet.
It's Proquint. Similar to Bubble Babble, but the differences make the results easier to read, in my opinion.
Here's how it works, from their documentation:
In sum, we propose encoding a 16-bit string as a proquint [PRO-nouncable QUINT-uplet] of alternating consonants and vowels as follows.
Four-bits as a consonant:
0 1 2 3 4 5 6 7 8 9 A B C D E F
b d f g h j k l m n p r s t v z
Two-bits as a vowel:
0 1 2 3
a i o u
Whole 16-bit word, where "con" = consonant, "vo" = vowel:
0 1 2 3 4 5 6 7 8 9 A B C D E F
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|con    |vo |con    |vo |con    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Separate proquints using dashes, which can go un-pronounced or be pronounced "eh". The suggested
optional magic number prefix to a sequence of proquints is "0q-".
Here are some IP dotted-quads and their corresponding proquints.
127.0.0.1 lusab-babad
63.84.220.193 gutih-tugad
63.118.7.35 gutuk-bisog
140.98.193.141 mudof-sakat
64.255.6.200 haguz-biram
128.30.52.45 mabiv-gibot
147.67.119.2 natag-lisaf
212.58.253.68 tibup-zujah
216.35.68.215 tobog-higil
216.68.232.21 todah-vobij
198.81.129.136 sinid-makam
12.110.110.204 budov-kuras
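Here's a minimal Python sketch of that 16-bit mapping, just to make it concrete (the function name is mine; decoding is the obvious inverse):

    CONSONANTS = "bdfghjklmnprstvz"   # 4-bit values 0..15
    VOWELS = "aiou"                   # 2-bit values 0..3

    def proquint16(word: int) -> str:
        """Encode one 16-bit word as con-vo-con-vo-con."""
        return (CONSONANTS[(word >> 12) & 0xF]
                + VOWELS[(word >> 10) & 0x3]
                + CONSONANTS[(word >> 6) & 0xF]
                + VOWELS[(word >> 4) & 0x3]
                + CONSONANTS[word & 0xF])

    # 127.0.0.1 is the two 16-bit words 0x7F00 and 0x0001:
    print(proquint16(0x7F00), proquint16(0x0001))   # lusab babad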
Why not use something similar to what PGP does to create readable keys? Simply find a nice list of words that are distinctive. Let's say you're using 128-bit UUIDs: a list of 256 words (2^8) means 16 words per UUID.
Stupid question, but why are people reading/writing UUIDs etc. with respect to your application?
If all you want is a way to communicate hex values readably (i.e., over the phone, or when instructing someone verbally what to type), then I suggest you use one of the various phonetic alphabets, such as the NATO Phonetic Alphabet or the US Army/Navy Phonetic Alphabet.
In the latter, the letters A-F are spoken as "able", "baker", "charlie", "dog", "easy", and "fox", respectively, so you would read the hex sequence "3fd2cc0e" as "three fox dog two charlie charlie zero easy". A uuid would be read out in exactly the same fashion.
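A trivial Python sketch of that readout; the table is just the subset of the Army/Navy alphabet needed for hex digits:

    HEX_WORDS = {
        "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
        "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
        "a": "able", "b": "baker", "c": "charlie", "d": "dog",
        "e": "easy", "f": "fox",
    }

    def speak_hex(s: str) -> str:
        return " ".join(HEX_WORDS[c] for c in s.lower().replace("-", ""))

    print(speak_hex("3fd2cc0e"))   # three fox dog two charlie charlie zero easy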
Bubble babble and base32 are inefficient, especially in your case. I suggest that you make your own algorithm. Since there are 20 consonants and 6 vowels (including 'y') you can have approx. 20*6*2+6*6=276 consonant/vowel-vowel/consonant pairs. So every byte of your number can be represented by a pair. With a bit of tweaking your algorithm could produce pronounceable words much shorter than bubble babble. You could even play dice and replace all odd digits with a consonant/vowel. For example, 0123456789ABCDEF (hex) encodes to ABECIDOFUGYHKRM. 3141592654 (dec) encodes to HHIA-ROIR. You are left with ten spare consonants which can be paired with vowels to replace some double consonants etc.
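One way to realize that idea in Python is to build a table of 256 pronounceable two-letter pairs (consonant-vowel first, then vowel-consonant, then vowel-vowel) and index it by byte value. This is only a sketch of the approach, not the exact letter mapping described above:

    CONSONANTS = "bcdfghjklmnpqrstvwxz"   # 20 consonants
    VOWELS = "aeiouy"                     # 6 vowels, counting 'y'

    # 20*6 CV + 6*20 VC + 6*6 VV = 276 pairs; keep the first 256, one per byte value.
    PAIRS = ([c + v for c in CONSONANTS for v in VOWELS]
             + [v + c for v in VOWELS for c in CONSONANTS]
             + [v + w for v in VOWELS for w in VOWELS])[:256]

    def pronounceable(data: bytes) -> str:
        return "".join(PAIRS[b] for b in data)   # two letters per byte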
S/KEY uses a dictionary of 2048 words to map 64 bit numbers to a sequence of 6 predefined words/syllables. (People will always find swear words if they are looking for them ;) )
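A rough Python sketch of that mapping, with a placeholder word list (the real scheme, RFC 2289, fills the two padding bits with a checksum and uses a fixed, standardized 2048-word dictionary):

    WORDS = ["w%04d" % i for i in range(2048)]   # stand-in for the real dictionary

    def key_to_words(key: int):
        """Map a 64-bit integer to six words: pad to 66 bits, take six 11-bit indices."""
        bits66 = key << 2            # low 2 bits left as padding (a checksum in real S/KEY)
        return [WORDS[(bits66 >> (11 * i)) & 0x7FF] for i in reversed(range(6))]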
Urbit's phonetic naming system wasn't mentioned yet. It uses 3 characters for 8 bits, 6 for 16, so it's less efficient than Proquint or Bubble Babble, but more divisible.
and hopefully a bit of compression
Not sure exactly what you mean there; making something "readable" or "pronounceable" will inevitably expand the space required for it. Maybe you meant "hopefully a bit of redundancy"? It would be good if, even if the user makes a small mistake, the system can detect and perhaps even correct it.
Really it depends very much on how big your UUIDs are and how they are most often communicated. If they need to be communicated over phone or VoIP, you want more audible redundancy. If they need to be entered into mobile devices with numeric keypads, it tends to be difficult to enter alphabetic characters, more so if they are case-sensitive. If they are written down a lot, you need to worry about characters that look similar (O and 0 and o, for instance). If they need to be memorised, then probably strings of real words are the best (have a look at the PGP Word List).
However, I think a great all-round solution is just using numeric digits. They're a lot harder to confuse with each other (both when spoken and written) than some alphabetic characters. They're easy to enter on mobile devices, and people aren't too bad at memorising numbers.
And the length of the string is not too bad either. Let's compare base32 with base 10 (decimal). The length of a decimal string is log_10(32) times the length of the corresponding base32 string, or about 1.5 times as long. Ten characters of base32 correspond to 15 decimal digits.
Not much of a penalty, IMO, seeing as in base 32 it's easy to confuse C and T, or S, F and X (when spoken), and someone speaking with a foreign accent is more likely to cause trouble.
If they were easy to read they probably wouldn't be particularly unique.
This is more of a computer science / information theory question than a straightforward programming one, so if anyone knows of a better site to post this, please let me know.
Let's say I have an N-bit piece of data that will be sent redundantly in M messages, where at least M-1 of those messages will be received successfully. I am interested in different ways of encoding the N-bit piece of data in fewer bits per message. (This is similar to RAID, but at a much smaller level, where N = 8 or 16 or 32.)
Example: suppose N = 16 and M = 4. Then I could use the following algorithm:
1st and 3rd message: send "0" + bits 0-7
2nd and 4th message: send "1" + bits 8-15
If I can guarantee that 3 messages of the 4 will get through, then at least one message from each group will get through. Thus I can make this work with 9 bits or less per message; there's probably a way to do this with fewer total bits, but I'm not sure how.
Are there some simple encoding/decoding algorithms to do this kind of thing? Does this problem have a name? (if I know what it's called, I can google it!)
note: in my particular case, the messages either arrive correctly or do not arrive at all (no messages arrive with errors).
(edit: moved 2nd part to a separate question)
(Incomplete answer follows. I may add more later.)
The term you may be interested in is channel coding: adding redundancy to a source in order to make it robust during transmission over a noisy channel. In information theory, the complementary problem to channel coding is source coding: reducing the redundancy in a source to represent it using fewer bits. (The combination of these two problems is called joint source-channel coding.)
Your first question asks to find a channel code. The simple example you give is similar to a repetition code, i.e., you send the same message more than once (usually an odd number of times), and then the message which is received most often is accepted as the original message.
This code is inefficient. To use standard notation, let k = number of bits in original message, and n = number of bits in the transmitted message. For your example, k = 16 and n = 36. A measure of coding efficiency is k/n, where higher means more efficient. In your case, k/n = 0.44. This is low.
The repetition code is a simple kind of block code, i.e., redundancy is added to each block of k bits to create a codeword of n bits. So are the Hamming and Reed-Solomon codes as others mentioned. Hamming codes are relatively easy to understand with some basic linear algebra.
These should be enough terms for you to search on your own. Good luck.
I'm not sure if I understood all the details of your question correctly, but your problem is definitely about designing some kind of error-correcting code. This is a vast area of computer science and thick tomes have been written about it. Start with Wikipedia and see if you can get any simple schemes (like Hamming or Reed-Solomon codes) to work in your case.
If you want to deal not only with symbol corruption but also with deletion of symbols, you should look at erasure codes; this is definitely a more difficult task, but good methods exist in many cases.
EDIT: This material from hackersdelight.org seems like a nice introduction.
See erasure codes.
You're looking for a packet erasure code. There are only two useful packet erasure codes that are not totally encumbered by patents, and there's only one open-source library to implement those. Find it here: http://planete-bcast.inrialpes.fr/rubrique.php3?id_rubrique=5
Here's a trivially simple scheme that's almost twice as efficient as your example.
You chopped the message into blocks of (N/M)*2 bits. Instead, chop it into N/(M-1)-bit blocks (round up if necessary), giving M-1 source blocks src[0] .. src[M-2] to spread over M messages. The first block encodes as itself: enc[0] = src[0]. So does the last: enc[M-1] = src[M-2]. Each of the other blocks gets XORed with its left neighbor: enc[i] = src[i-1] ^ src[i].
Prefix each encoded block with a log(M)-bit sequence number, essentially as you did, so the receiver can tell which was dropped. (If you can be sure that whichever blocks arrive will arrive in order, then a 1-bit sequence number will do. Just alternate 0 and 1.)
To decode, successively XOR from the left and the right until you hit the dropped block. E.g. src[1] == enc[0]^enc[1]. (Dropping one of the endpoint blocks isn't a special case -- e.g. if the first block is dropped, the scan from the right recovers it, and the scan from the left is of length 0.)
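Here's a small Python sketch of this scheme, operating on integer blocks for brevity (the names are mine):

    def encode(src):
        """XOR-chain M-1 source blocks into M packets, any one of which may be lost."""
        enc = [src[0]]
        for i in range(1, len(src)):
            enc.append(src[i - 1] ^ src[i])
        enc.append(src[-1])
        return enc

    def decode(enc, lost):
        """Recover the source blocks when packet number `lost` was dropped."""
        m = len(enc)                       # M packets -> M-1 source blocks
        src, acc = [None] * (m - 1), 0
        for i in range(lost):              # scan from the left
            acc ^= enc[i]
            src[i] = acc
        acc = 0
        for i in range(m - 1, lost, -1):   # scan from the right
            acc ^= enc[i]
            src[i - 1] = acc
        return src

    blocks = [0b101010, 0b110011, 0b001111]   # 16-ish bits chopped into three 6-bit blocks
    packets = encode(blocks)                  # 4 packets; tag each with its sequence number
    # decode() never reads packets[lost], so passing the full list is fine for this test:
    assert all(decode(packets, k) == blocks for k in range(4))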