Lex reserved word rules versus lookup table

This web page suggests that if your lex program "has a large number of reserved words, it is more efficient to let lex simply match a string and determine in your own code whether it is a variable or reserved word."
My question is: More efficient where, and why? If it means compiling the lexer is faster, I don't really care about that because it is one step removed from the program which uses the lexer to parse input.
It seems to me that lex just uses your description to build a state machine that processes one character at a time. It does not seem logical that increasing the size of the state machine would necessarily make it any slower than using one rule for identifiers and then doing several string comparisons.
Additionally, if it turns out that there is some logical reason for this to make sense as an optimization, what would be considered a large number of reserved words? I have approximately 20, as compared to about 30 other rules for various things. Would that be considered a large number of reserved words? Should I attempt to use the same strategy for some of the other symbols?
I have attempted to google for a result, but the only relevant articles I found stated this strategy as though it were well-known without giving any reason.
In case it is relevant, I am using flex 2.5.35.
Edit: Here is another reference which claims that lex produces an inefficient scanner when asked to match several long literal strings. It also does not give a reason.

According to the flex manual, "[t]he speed of the scanner is independent of the number of rules or ... how complicated the rules are with regard to operators such as '*' and '|'."
The main performance losses are due to backtracking. This can be avoided by (among other things) using catch-all rules which will match tokens that "start with" the offending token. For example, if you have a list of reserved words made up of the characters [a-zA-Z_], followed by a rule for matching identifiers of the form [a-zA-Z_][a-zA-Z_0-9]*, the identifier rule will catch any identifier that starts with the name of a reserved word without having to back up and try to match again.
According to the faq, flex generates a deterministic finite automaton which "does all the matching simultaneously, in parallel." The result of this is, as was said above, that the speed of the scanner is independent of the number of rules. On the other hand, string comparison is linear in the number of rules.
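For comparison, here is a rough C sketch of what the lookup-table approach from the question amounts to (the token names and the classify_identifier helper are made up for illustration): the single identifier rule fires, and its action then scans a keyword table, which is linear in the number of keywords (hashing or binary search only softens that), whereas the DFA has already done the equivalent discrimination during its single pass over the input.

    #include <string.h>

    /* Hypothetical token codes; a real scanner would define its own. */
    enum { TOK_IDENTIFIER = 256, TOK_IF, TOK_ELSE, TOK_WHILE };

    static const struct { const char *word; int token; } keywords[] = {
        { "if", TOK_IF }, { "else", TOK_ELSE }, { "while", TOK_WHILE },
        /* ...one entry per reserved word... */
    };

    /* Called from the identifier rule's action: cost grows with the
     * number of keywords, work the generated DFA would otherwise do
     * as part of its single pass over the input. */
    int classify_identifier(const char *text)
    {
        for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
            if (strcmp(text, keywords[i].word) == 0)
                return keywords[i].token;
        return TOK_IDENTIFIER;
    }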
As a result, the reserved word rules should, in fact, be considerably faster than a lookup table.

Picking a check digit algorithm

I am generating random OTP-style strings that serve as a short-term identifier to link two otherwise unrelated systems (which have authentication at each end). These need to be read and re-entered by users, so in order to reduce the error rate and reduce the opportunities for forgery, I'd like to make one of the digits a check digit. At present my random string conforms to the pattern (removing I and O to avoid confusion):
^[ABCDEFGHJKLMNPQRSTUVWXYZ][0-9]{4}$
I want to append one extra decimal digit for the check. So far I've implemented this as a BLAKE2 hash (from libsodium) that's converted to decimal and truncated to 1 char. This gives only 10 possibilities for the check digit, which isn't much. My primary objective is to detect single character errors in the input.
This approach kind of works, but it seems that one digit is not enough to detect single-character errors, and undetected errors are quite easy to find; for example, K37705 and K36705 are both considered valid.
I do not have a time value baked into this OTP; instead it's purely random and I'm relying on keeping a record of the OTPs that have been generated recently for each user, which are deleted periodically, and I'm reducing opportunities for brute-forcing by rate and attempt-count limiting.
I'm guessing that BLAKE2 isn't a good choice here, but given there are only 10 possibilities for the result, I don't know that others will be better. What would be a better algorithm/approach to use?
Frame challenge
Why do you need a check digit?
It doesn't improve security, and five digits is trivial for most humans to get correct. Check it server-side and return an error message if it's wrong.
Normal TOTP tokens are commonly 6 digits, and actors such as Google have determined that people in general manage to get them correct.
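That said, if a check digit is still wanted, a position-sensitive scheme such as the Luhn algorithm is a more conventional choice than truncating a cryptographic hash. Below is a rough C sketch under an assumption of my own: the leading letter is first mapped to its two-digit index in the 24-letter alphabet so the whole code becomes decimal digits. This detects every single-digit error in the numeric part, but not every letter substitution, since 24 letters cannot map one-to-one onto 10 check values.

    #include <stdio.h>
    #include <string.h>

    /* Allowed leading letters (I and O removed), as in the question. */
    static const char ALPHABET[] = "ABCDEFGHJKLMNPQRSTUVWXYZ";

    /* Standard Luhn check digit over a string of decimal digits. */
    static int luhn_check_digit(const char *digits)
    {
        int sum = 0, len = (int)strlen(digits);
        for (int i = 0; i < len; i++) {
            int d = digits[len - 1 - i] - '0';
            if (i % 2 == 0) {          /* every second digit from the right */
                d *= 2;
                if (d > 9) d -= 9;
            }
            sum += d;
        }
        return (10 - sum % 10) % 10;
    }

    /* Map e.g. "K3770" to "09" + "3770", then Luhn the digit string.
     * Assumes the input already matches the pattern in the question. */
    static int check_digit_for(const char *code)
    {
        char digits[16];
        int idx = (int)(strchr(ALPHABET, code[0]) - ALPHABET);
        snprintf(digits, sizeof digits, "%02d%s", idx, code + 1);
        return luhn_check_digit(digits);
    }

    int main(void)
    {
        printf("%d\n", check_digit_for("K3770"));   /* digit to append */
        return 0;
    }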

Isolating run sequence definition seems inconsistent

I've been reading Unicode Standard Annex #9, which describes the Unicode Bidirectional Algorithm, and I've hit a snag in section BD13, the definition of an isolating run sequence.
An isolating run sequence is a maximal sequence of level runs such that for all level runs except the last one in the sequence, the last character of the run is an isolate initiator whose matching PDI is the first character of the next level run in the sequence.
How can this be the case? Section BD9, which defines the matching PDI of an isolate initiator, says that
an isolate initiator and its matching PDI are always assigned the same explicit embedding level
and section BD7 defines a level run as
a maximal substring of characters that have the same embedding level
This would seem to indicate that an isolate initiator and its matching PDI cannot belong to consecutive level runs. Either they should belong to the same level run, or there should be at least one level run between them.
What's the resolution of this apparent inconsistency? Is there a distinction between an "explicit embedding level" and an "embedding level"? Is one of the quotes I'm relying on non-normative and not completely accurate? Are isolating run sequences just always one level run long?
In case the wording changes, this question is based on revision 33, the current revision of the annex.
While writing up this question, I worked through the examples under section BD13 and figured out how they resolve the issue. An isolating run sequence need not be contiguous. Its elements are not required to be consecutive level runs; indeed, successive runs in the sequence must be separated by level runs belonging to other isolating run sequences. This should be made more explicit.

Standard mangling for reserved column or table names

I am wary of using quoted identifiers in PostgreSQL when my "business" table or column names happen to be reserved words. Are there any standard or best-practice mangling approaches? I would like a consistent mangling mechanism, as there's also some code generation in place and I need to be able to automatically translate back and forth between the mangled names used in PostgreSQL and the non-mangled names used in code.
The simplest, and probably best, way to "mangle" terms to guarantee all names are not reserved words is to prefix them with an underscore character.
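For example, if the code generator prefixes every generated identifier rather than just the clashing ones, the round trip stays mechanical and needs no reserved-word list; a minimal C sketch (the helper names are mine, not from any library):

    #include <stdio.h>
    #include <string.h>

    /* Sketch: prefix every generated database identifier with '_', so the
     * mapping is trivially reversible and no reserved-word list is needed. */
    static void to_db_name(const char *business, char *out, size_t cap)
    {
        snprintf(out, cap, "_%s", business);
    }

    static const char *to_business_name(const char *db_name)
    {
        return db_name[0] == '_' ? db_name + 1 : db_name;
    }

    int main(void)
    {
        char db[64];
        to_db_name("user", db, sizeof db);          /* "user" is reserved */
        printf("%s -> %s -> %s\n", "user", db, to_business_name(db));
        return 0;
    }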
The best way is to not use reserved words - let the DBA choose the names for any clashes between business terms and DB terms.
Also consider terms that might clash with reserved words in your application language.
I like to reserve the underscore prefix (_name) for parameters and variables in server-side functions to rule out conflicts. I have seen that a lot lately among people working with functions.
For identifiers of database objects I pick the names by hand. Working with non-english terms (reduced to ASCII letters) avoids most of the conflicts.
Another alternative is to use the plural form for table names, since most reserved words are in singular form (a few exceptions!).
Most 1-letter prefixes would do the job without exception. Like: n_name. But since I hand-pick identifiers, there is no need for automation.
Anything is good that avoids the error-prone need for double-quotes. I never use reserved words even if Postgres would allow them.

How to start working with a large decision table

Today I've been presented with a fun challenge and I want your input on how you would deal with this situation.
So the problem is the following (I've converted it to demo data as the real problem wouldn't make much sense without knowing the company dictionary by heart).
We have a decision table that has a minimum of 16 conditions. Because it is an impossible feat to manage all of them (2^16 possibilities) we've decided to only list the exceptions. Like this:
As an example I've only added 10 conditions but in reality there are (for now) 16. The basic idea is that we have one baseline (the default) which is valid for everyone and all the exceptions to this default.
Example:
You have a foreigner who is also a pirate.
You go through all the exceptions one by one and, condition by condition, remove the exceptions that have at least one condition that fails. In the end you'll end up with the following two exceptions that are valid for our case. The match is on the IsPirate and the IsForeigner conditions. But as you can see there are 2 results here, well 3 actually if you count the default.
Our solution
What we came up with to solve this is that the GUI where you add these exceptions should run an algorithm which checks for such cases and forces you to define the exception more specifically. This is still only a theory and hasn't been tested, but we think it could work this way.
My Question
I'm looking for alternative solutions that make the rules manageable and prevent the problem I've shown in the example.
Your problem seems to be resolution of conflicting rules. When multiple rules match your input (your foreigner who is also a pirate) and they end up recommending different things (your CanGetJob and CanBeEvicted), you need a strategy for resolving the conflict.
What you mentioned is one way of resolution -- which is to remove the conflict in the first place. However, this may not always be possible, and not always desirable because when a user adds a new rule that conflicts with a set of old rules (which he/she did not write), the user may not know how to revise it to remove the conflict.
Another possible resolution method is prioritization. Mark a priority on each rule (based on things like the user's own authority etc.), sort the matching rules according to priority, and apply in ascending sequence of priority. This usually works and is much simpler to manage (e.g. everybody knows that the top boss's rules are final!)
Prioritization may also be used to mark a certain rule as a "global override". In your example, you may want to make "IsPirate" an override rule -- which means that it overrides settings for normal people. In other words, once you're a pirate, you're treated differently. This makes it very easy to design a system in which you have a bunch of normal business rules governing 90% of the cases, then a set of "exceptions" that are treated differently, automatically overriding certain things. In this case, you should also consider making "?" available in the output columns as well.
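A rough C sketch of priority-based resolution, with "?" encoded as -1 so it matches anything (the conditions, outputs, and rule data are invented for illustration):

    #include <stdio.h>

    #define NUM_CONDITIONS 3      /* e.g. IsPirate, IsForeigner, HasChildren */

    struct rule {
        int priority;                 /* higher wins */
        int cond[NUM_CONDITIONS];     /* 0, 1, or -1 for "?" (don't care) */
        int can_get_job;
        int can_be_evicted;
    };

    /* The default plus two made-up exceptions. */
    static const struct rule rules[] = {
        { 0, { -1, -1, -1 }, 1, 0 },  /* baseline: applies to everyone    */
        { 1, {  1, -1, -1 }, 0, 1 },  /* pirates                          */
        { 2, {  1,  1, -1 }, 0, 0 },  /* foreign pirates: more specific   */
    };

    static int matches(const struct rule *r, const int input[NUM_CONDITIONS])
    {
        for (int i = 0; i < NUM_CONDITIONS; i++)
            if (r->cond[i] != -1 && r->cond[i] != input[i])
                return 0;
        return 1;
    }

    int main(void)
    {
        int person[NUM_CONDITIONS] = { 1, 1, 0 };   /* a foreign pirate */
        const struct rule *best = NULL;

        /* Of all matching rules, keep the one with the highest priority. */
        for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++)
            if (matches(&rules[i], person) &&
                (best == NULL || rules[i].priority > best->priority))
                best = &rules[i];

        printf("CanGetJob=%d CanBeEvicted=%d\n",
               best->can_get_job, best->can_be_evicted);
        return 0;
    }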
One other possible resolution method is to include attributes in each of your conditions. For example, certain conditions must have no "zeros" in order to pass (? doesn't matter). Some conditions must have at least one "one" in order to pass. In other words, mark each condition as either "AND", "OR", or "XOR". Some popular file-system security uses this model. For example, CanGetJob may be AND (you want to be stringent on rights-to-work). CanBeEvicted may be OR -- you may want to evict even a foreigner if he is also a pirate.
An enhancement on the AND/OR method is to provide a threshold that the total result must meet before passing that condition. For example, if you put CanGetJob at a threshold of 2, then it must get at least two 1's in order to return 1. This is sometimes useful for conditions that are not clearly black-and-white.
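A minimal sketch of the threshold idea, again with invented inputs; note that a threshold of 1 behaves like OR, and a threshold equal to the number of relevant conditions behaves like AND:

    /* Returns 1 if at least `threshold` of the flagged conditions are 1.
     * `relevant` marks which conditions feed this output (invented data). */
    int passes_threshold(const int input[], const int relevant[],
                         int n, int threshold)
    {
        int ones = 0;
        for (int i = 0; i < n; i++)
            if (relevant[i] && input[i] == 1)
                ones++;
        return ones >= threshold;
    }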
You can mix resolution methods: e.g. first prioritize, then use AND/OR to resolve rules with similar priorities.
The possibilities are limitless and really depends on what your actual needs are.
To me this problem is reminiscent of a business rules engine, where there is no known algorithm to derive outputs from inputs (e.g. using boolean logic), but the user (typically some sort of administrator) has to define all or some of the logic themselves.
This might sound like a bit of overkill, but OTOH it provides virtually limitless extension capabilities: you don't have to code any new business logic, just define a new rule set.
As I understand your problem, you are looking for a nice way to visualise the editing for these rules. But this all depends on your programming language and the tool you select for this. Java, for example, has JBoss Drools. Quoting their page:
Drools Guvnor provides a (logically centralized) repository to store your business knowledge, and a web-based environment that allows business users to view and (within certain constraints) possibly update the business logic directly.
You could possibly use this generic tool or write your own.
Everything depends on what your actual rules will look like. Rules like 'IF has an even number of these properties THEN' would be painful to represent in this format, whereas rules like 'IF pirate and not geek THEN' are easy.
You can 'avoid the ambiguity' by stating that you'll always be taking the first actual match, in other words your rules have a priority. You'd then want to flag rules which have no effect because they are 'shadowed' by rules higher up. They're not hard to find, so it's something your program should do.
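For instance, encoding "?" as -1, a rule is completely shadowed by an earlier one exactly when every value the earlier rule pins down is pinned to the same value by the later rule; a small C sketch of that test (partial overlaps need a finer check):

    /* An earlier rule fully shadows a later one if every input that
     * matches the later rule also matches the earlier rule: wherever
     * the earlier rule pins a value (not -1 / "?"), the later rule
     * pins the same value. */
    int shadows(const int earlier[], const int later[], int n)
    {
        for (int i = 0; i < n; i++)
            if (earlier[i] != -1 && earlier[i] != later[i])
                return 0;
        return 1;
    }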
Your interface could also indicate groups of rules where rules within the group can be in any order without changing the outcomes. This will add clarity to what the rules are really saying.
If some of your outputs are relatively independent of the others, you will also get a more compact and much clearer table by allowing question marks in the output. In that design the scan for first matching rule is done once for each output. Consider for example if 'HasChildren' is the only factor relevant to 'Can Be Evicted'. With question marks in the outputs (= no effect) you could be halving the number of exception rules.
My background for this is circuit logic design, not business logic. What you're designing is similar to, but not the same as, a PLA. As long as your actual rules are close to sum of products then it can work well. If your rules aren't, for example the 'even number of these properties' rule, then the grid like presentation will break down in a combinatorial explosion of cases. Your best hope if your rules are arbitrary is to get a clearer more compact presentation with either equations or with diagrams like a circuit diagram. To be avoided, if you can.
If you are looking for a Decision Engine with a GUI, then you can try this one: http://gandalf.nebo15.com/
We just released it, it's open source and production ready.
You probably need some kind of inference engine. Think about doing it in Prolog.

Core Data Query slow

What's the secret to pulling up items that match characters typed into the search bar that react instantaneously? For instance, if I type in a letter "W" in the search bar, all phrases that contain a letter "W" in any character position within the phrase are returned immediately.
So if a database of 20,000 phrases contains 500 phrases with the letter "W", they would appear as soon as the user typed the first character. Then as additional characters are typed, the list would automatically get shorter.
I can send queries up to a SQL server from the iPhone and get this type of response; however, no matter what we try, including the suggestions of other users, we still can't get good response times when the database is stored locally on the iPhone.
I know that this performance is available, because there are many other apps out there that display results as soon as you start typing.
Please note that this isn't the same as indexing all the words in every phrase, as that will only bring up matches where a word starts with the characters typed in. In this case, we're looking for characters within words.
I think asynchronous results filtering is the answer. Instead of updating the search results every time the user types a new character, put the db query on a background thread when the first character is typed. If a new character is typed before the query is finished, cancel the old query and start a new one. Finally, you will get to the point where the user stops typing long enough for the query to return. That way, the query itself never blocks the user's typing.
I believe the UISearchDisplayController class offers this type of asynchronous search, though whether you want to use that class or just adopt the asynchronous design pattern from it is up to you.
If you're willing to get away from the database for this, you could use a generalized suffix tree with all the terms in your phrases. You can build a suffix tree in linear time and, I believe, use it to find all occurrences of a substring very quickly. The web has lots of pages about suffix trees and suffix arrays. Wikipedia is probably a good place to start.
I have a fun scheme for you. You can build an index of the characters that exist in each phrase via a 32-bit integer. Flip the bits [0-25] to represent the characters (case-insensitive) a-z that exist in the phrase. Build a second bitmap of the query string. Now you can do comparisons via bitwise operations (& and |) to determine matches. This is very fast and believe it or not SQLite actually supports bitwise operations in queries - so you can even use this scheme to go straight to the database. I have working code that does this built into one of our iPhone applications - Alphagram.
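A small C sketch of that bitmap scheme (the function names are mine): a phrase can only contain the typed characters if its letter mask covers the query's letter mask, so the cheap bitwise test filters candidates before any real substring search.

    #include <ctype.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Set bits 0-25 for each (case-insensitive) letter present in the string. */
    static uint32_t letter_mask(const char *s)
    {
        uint32_t mask = 0;
        for (; *s; s++)
            if (isalpha((unsigned char)*s))
                mask |= 1u << (tolower((unsigned char)*s) - 'a');
        return mask;
    }

    /* Cheap pre-filter: a phrase can only contain the query if its mask
     * covers every letter of the query; confirm with a real substring test. */
    static int may_match(uint32_t phrase_mask, uint32_t query_mask)
    {
        return (phrase_mask & query_mask) == query_mask;
    }

    int main(void)
    {
        const char *phrase = "water bottle";
        const char *query  = "wat";
        if (may_match(letter_mask(phrase), letter_mask(query)))
            printf("candidate: %s\n",
                   strstr(phrase, query) ? "exact hit" : "letters only");
        return 0;
    }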