Base-n integer representations library in Isabelle/HOL? - radix

Context: I am trying to translate math contest problems into Isabelle/HOL. A lot of these contest problems involve questions like "what are the last two digits of this number when written in base 7"?
Isabelle/HOL has a rudimentary treatment of base-10 integer representations in the ThreeDivides theory. It also has a far more elaborate theory for binary numeral representations. And arguably the fundamental nat datatype is a base-one representation. But so far I have been unable to find any theory file that deals with base-n representations in general. Am I missing anything?
I'd like it to include basic theorems like the existence and uniqueness of the representation (which ThreeDivides actually does not show for its decimal representations), an easy way to write things down (e.g., "[4, 3]⇘base 5⇙ = 23"), and basic rules for manipulation. I can create such a theory file myself of course, but I'd rather not waste the time if it already exists.
Is there a standard theory in Isabelle/HOL for stating and proving facts about the digits of integers in arbitrary natural number bases?
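To make the target concrete, here is the decomposition whose existence and uniqueness I'd want as lemmas, sketched in Python rather than Isabelle (the names digits/undigits are mine, not from any existing theory file):

def digits(n: int, base: int) -> list[int]:
    """Little-endian digits of n in the given base (base >= 2)."""
    ds = []
    while n > 0:
        ds.append(n % base)
        n //= base
    return ds or [0]

def undigits(ds: list[int], base: int) -> int:
    """Inverse: sum of ds[i] * base**i."""
    return sum(d * base ** i for i, d in enumerate(ds))

assert digits(23, 5) == [3, 4]   # big-endian "43" in base 5
assert undigits([3, 4], 5) == 23
# The contest-style question "last two digits in base 7" is just n mod 7**2:
assert undigits(digits(123, 7)[:2], 7) == 123 % 49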

Related

Picking a check digit algorithm

I am generating random OTP-style strings that serve as a short-term identifier to link two otherwise unrelated systems (which have authentication at each end). These need to be read and re-entered by users, so in order to reduce the error rate and reduce the opportunities for forgery, I'd like to make one of the digits a check digit. At present my random string conforms to the pattern (removing I and O to avoid confusion):
^[ABCDEFGHJKLMNPQRSTUVWXYZ][0-9]{4}$
I want to append one extra decimal digit for the check. So far I've implemented this as a BLAKE2 hash (from libsodium) that's converted to decimal and truncated to 1 char. This gives only 10 possibilities for the check digit, which isn't much. My primary objective is to detect single character errors in the input.
This approach kind of works, but it seems that one digit is not enough to detect single-character errors, and undetected errors are quite easy to find: for example, K37705 and K36705 are both considered valid.
I do not have a time value baked into this OTP; instead it's purely random and I'm relying on keeping a record of the OTPs that have been generated recently for each user, which are deleted periodically, and I'm reducing opportunities for brute-forcing by rate and attempt-count limiting.
I'm guessing that BLAKE2 isn't a good choice here, but given there are only 10 possibilities for the result, I don't know that others will be better. What would be a better algorithm/approach to use?
Frame challenge
Why do you need a check digit?
It doesn't improve security, and five digits are trivial for most humans to get correct. Check server-side and return an error message if it's wrong.
Normal TOTP tokens are commonly 6 digits, and actors such as Google have determined that people in general manage to get them correct.
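(If, despite the frame challenge above, you still want a check character that provably catches every single-character error, the textbook construction is Luhn mod N over the full alphabet. A minimal sketch in Python, under the assumption that an alphanumeric check character is acceptable instead of a decimal digit; all names here are illustrative, not from the question:)

ALPHABET = "0123456789ABCDEFGHJKLMNPQRSTUVWXYZ"  # 34 symbols, I and O removed
N = len(ALPHABET)

def luhn_check_char(body: str) -> str:
    # Luhn mod N: walk right to left, doubling every other value,
    # starting with the symbol next to the future check character.
    total, double = 0, True
    for c in reversed(body):
        v = ALPHABET.index(c)
        if double:
            v *= 2
            v = v // N + v % N  # add the base-N "digits" of the doubled value
        total += v
        double = not double
    return ALPHABET[(N - total % N) % N]

def verify(code: str) -> bool:
    # Recompute the check character from the body and compare.
    return len(code) > 1 and luhn_check_char(code[:-1]) == code[-1]

code = "K3770" + luhn_check_char("K3770")
assert verify(code)
assert not verify("K3670" + code[-1])  # a single-character error is now caught

A single decimal check digit can catch at most about 9 out of 10 random single-character errors, which matches what you're seeing; widening the check character to the full 34-symbol alphabet is what makes the all-single-errors guarantee possible.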

Term for diff/delta on multiple files or data structures

I would like to know whether there is a proper term to describe "diffing" of / obtaining the delta between multiple files or data structures, such that the resulting "diff" contains first a description of the parts common to all files/structures, then descriptions of how this "base" file/structure must be modified to obtain the individual ones, ideally in a hierarchical fashion if some files/structures are more similar to each other than others.
There are some questions and answers about how to do this with certain tools (e.g. DIFF utility works for 2 files. How to compare more than 2 files at a time?), but as I want to do this for a specific type of data structure (namely JSON), I'm at a loss as to what I should even search for.
This type of problem seems to me like it should be common enough to have a name such as "hierarchical diff" (which however seems to be reserved for 2-way diffs on hierarchical data structures), "commonality finding", or something like that.
I guess a related concept about hierarchical ordering of commonalities and differences is formal concept analysis, but this operates on sets of properties rather than hierarchical data structures and won't help me much.
There are multiple valid terms:
Data comparison (or Sequence comparison)
Delta encoding
Delta compression (or Differential compression)
Algorithms:
An O(ND) Difference Algorithm and Its Variations (Eugene Myers)
A technique for isolating differences between files (Paul Heckel)
The String-to-String Correction Problem with Block Moves (Walter Tichy)
Good Wikipedia links
Longest common subsequence problem
Comparison of file comparison tools
diff (Unix utility)
Some implementations
diff-match-patch (Neil Fraser - Google)
jsdifflib
jsondiffpatch
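For the specific JSON case in the question, the factoring can be prototyped directly. A minimal sketch for flat JSON objects (common_base and delta are illustrative names, and nested structures would need a recursive version):

import json

def common_base(docs):
    """Key/value pairs that appear, with equal values, in every document."""
    shared = set.intersection(*(set(d) for d in docs))
    return {k: docs[0][k] for k in shared if all(d[k] == docs[0][k] for d in docs)}

def delta(base, doc):
    """Edits that turn `base` into `doc`."""
    return {"set": {k: v for k, v in doc.items() if k not in base or base[k] != v},
            "unset": [k for k in base if k not in doc]}

docs = [json.loads(s) for s in (
    '{"host": "a", "port": 80, "tls": false}',
    '{"host": "b", "port": 80, "tls": false}',
    '{"host": "c", "port": 80, "tls": true}',
)]
base = common_base(docs)                 # {"port": 80}
deltas = [delta(base, d) for d in docs]  # one "set"/"unset" record per file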

NLP Date Parsing

I've been experimenting with a number of NLP text parsers, but have found that most fail at even some of the simplest tasks that occur in actual texts (i.e., texts that haven't been preprocessed to show how "great" the systems are). An example is the following:
From Sundays until Thursdays every week
I've yet to find a single parser that can parse this correctly. I've tried with quite a number including Stanford's sutime. Can anyone recommend software that can handle natural text dates?
I did not find one either when I went looking so I wrote my own. It's part of my natural language engine for .NET.
Here's what the demo shows when you enter that phrase (qualified to next week rather than every week; it can handle 'every week' too, but that sequence is infinite):
Some comments:
1) Handling all possible English-language temporal expressions is a huge task. I've been working on this problem for years to come up with a clean way to represent temporal expressions, plus the many rules needed to parse English expressions of time.
2) In addition to finding a way to represent typical calendar date-times and ranges of such, you also need ways to represent infinite sequences like 'every Monday' and half-infinite sequences like 'every weekday before ...'. And then you'll need an algebra on top of that for combining temporal expressions.
3) Temporal expressions are often ambiguous in the English language and interpretation may vary from culture to culture.
4) The result must often be interpreted in the context of the sentence and/or the conversation history. "Who called Monday?" is a different Monday from "Remind me on Monday" and is different again from "Show me statistics for Monday".
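To illustrate point 2, here is a toy sketch (mine, not the .NET engine described above) of what 'from Sundays until Thursdays every week' can denormalize to: an infinite weekly recurrence, realized lazily as concrete date ranges.

from datetime import date, timedelta
from itertools import islice

SUNDAY, THURSDAY = 6, 3  # Python's date.weekday(): Monday is 0

def weekly_ranges(start_day, end_day, anchor):
    """Yield one (start, end) date pair per week, forever."""
    # Advance to the first occurrence of start_day on or after anchor.
    d = anchor + timedelta((start_day - anchor.weekday()) % 7)
    span = timedelta((end_day - start_day) % 7)  # Sunday to Thursday = 4 days
    while True:
        yield d, d + span
        d += timedelta(7)

# "From Sundays until Thursdays every week", qualified to three weeks:
for start, end in islice(weekly_ranges(SUNDAY, THURSDAY, date(2024, 1, 1)), 3):
    print(start, "to", end)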

User expectations and unicode normalization

This is a bit of a soft question, feel free to let me know if there's a better place for this.
I'm developing some code that accepts a password that requires international characters - so I'll need to compare an input unicode string with a stored unicode string. Easy enough.
My question is this: do users of international character sets generally expect normalization in such a case? My Google searches show some conflicting opinions, from 'always do it' (http://unicode.org/faq/normalization.html) to 'don't bother'. Are there any pros/cons to not normalizing? (i.e., a password is less likely to be guessable, etc.)
I would recommend that, if your password field accepts Unicode input (presumably UTF-8 or UTF-16), you normalize it before hashing and comparing. If you don't normalize it, and people access it from different systems (different operating systems, or different browsers if it's a web app, or with different locales), then you may get the same password represented with different normalization. This would mean that your user would type the correct password, but have it rejected, and it would not be obvious why, nor would they have any way to fix it.
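A minimal illustration of that failure mode, using Python's standard unicodedata module (the password value is made up):

import unicodedata

composed   = "caf\u00e9"    # 'é' as one code point (what NFC produces)
decomposed = "cafe\u0301"   # 'e' plus a combining acute accent (NFD form)

assert composed != decomposed                       # byte-for-byte: different
assert unicodedata.normalize("NFC", composed) == \
       unicodedata.normalize("NFC", decomposed)     # after NFC: identical
# So: normalize once, consistently, before hashing and before comparing.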
I wouldn't bother, for a couple of reasons:
You're going to make things less secure. If two or more code-point sequences are stored in your DB as the same thing, then there are fewer possible passwords for the site. (Though this probably isn't a huge deal, since the number of possible passwords is still pretty huge.)
You will be building code into your program that does complicated work that is (probably) part of a library you didn't write...and eventually somebody won't be able to log in as a result. Better in my mind to keep things simple, and to trust that people using different character sets know how to type them properly. That said, I've never implemented this in an international password form, so I couldn't tell you what the standard design pattern is.

Best Dijkstra papers to explain this quote?

I was enjoying "The Humble Programmer" earlier today and ran across this choice quote:
Therefore, for the time being and perhaps forever, the rules of the second kind present themselves as elements of discipline required from the programmer. Some of the rules I have in mind are so clear that they can be taught and that there never needs to be an argument as to whether a given program violates them or not. Examples are the requirements that no loop should be written down without providing a proof for termination nor without stating the relation whose invariance will not be destroyed by the execution of the repeatable statement.
I'm looking for which of Dijkstra's 1300+ writings best describe, in further detail, rules such as those he mentions above.
Page 5 through 18: http://userweb.cs.utexas.edu/users/EWD/ewd02xx/EWD249.PDF
Mid. page 3 through end: http://userweb.cs.utexas.edu/users/EWD/ewd04xx/EWD473.PDF
End page 5 through end: http://userweb.cs.utexas.edu/users/EWD/ewd06xx/EWD641.PDF
All: http://userweb.cs.utexas.edu/users/EWD/transcriptions/EWD02xx/EWD261.html (Dutch; translation below)
Note: Dijkstra numbers his pages starting at 0. The page numbers given above start at 1, i.e., they are PDF page numbers, not the written page numbers.
My translation of EWD261 in English:
How to program mathematically
A (well-defined) programme is structured just like a (well-defined) mathematical theory. The programmer's work is no different from that of a creative mathematician.
There are small, but important, differences, though:
There are not many basic concepts in programming, and they are not difficult to comprehend (though they are deceptively simple); this is why programming is ideal for practising development. (Besides this, there is the demand for correctness: the programme should really work!)
With most mathematical education one learns existing theorems, viz. equipping the student with a specific (detailed) set of concepts; a programmer, however, has to develop the needed concepts himself. Programming requires the abstraction that leads to a kind of creativity, whereas in mathematics the same is often limited to applying existing theorems.
Because programmes are big and nevertheless have to work, programmers must learn to develop carefully and consciously. This is exactly what one should teach! Teaching extensive knowledge is, to me, not justified.
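As a concrete illustration of the rule in the quoted passage (my example, not from any EWD): a loop written the way Dijkstra demands, with its invariant stated and its termination argument given.

def total(xs: list[int]) -> int:
    s, i = 0, 0
    # Invariant: s == sum(xs[:i]) and 0 <= i <= len(xs).
    # Termination: len(xs) - i is a natural number that strictly
    # decreases on every iteration, so the loop cannot run forever.
    while i < len(xs):
        s += xs[i]
        i += 1
    # At exit i == len(xs), so the invariant yields s == sum(xs).
    return s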