What's the secret to instantaneously pulling up items that match the characters typed into the search bar? For instance, if I type the letter "W" into the search bar, all phrases that contain a "W" at any character position within the phrase are returned immediately.
So if a database of 20,000 phrases contains 500 phrases with the letter "W", they would appear as soon as the user typed the first character. Then, as additional characters are typed, the list automatically gets shorter.
I can send queries to a SQL server from the iPhone and get this kind of response time; however, no matter what we try, even taking the suggestions of other users, we still can't get good response times when the database is stored locally on the iPhone.
I know that this performance is available, because there are many other apps out there that display results as soon as you start typing.
Please note that this isn't the same as indexing every word in every phrase, since that only brings up matches where a word starts with the characters typed. In this case, we're looking for characters anywhere within words.
I think asynchronous results filtering is the answer. Instead of updating the search results every time the user types a new character, put the db query on a background thread when the first character is typed. If a new character is typed before the query is finished, cancel the old query and start a new one. Eventually the user will stop typing long enough for the query to return. That way, the query itself never blocks the user's typing.
I believe the UISearchDisplayController class offers this type of asynchronous search, though whether you want to use that class or just adopt the asynchronous design pattern from it is up to you.
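Roughly, the pattern looks like this; here is a minimal Python sketch of the idea (the names AsyncSearcher, query_fn and on_results are placeholders, and on iOS the equivalent would be a background dispatch queue plus a main-thread check that the search text is still current):

    import threading

    class AsyncSearcher:
        """Run each search on a background thread; stale results are discarded."""

        def __init__(self, query_fn, on_results):
            self.query_fn = query_fn      # slow lookup, e.g. a local SQLite query
            self.on_results = on_results  # callback that refreshes the result list
            self._generation = 0
            self._lock = threading.Lock()

        def text_changed(self, text):
            # Every keystroke starts a new "generation"; older queries become stale.
            with self._lock:
                self._generation += 1
                generation = self._generation
            threading.Thread(target=self._run, args=(text, generation), daemon=True).start()

        def _run(self, text, generation):
            results = self.query_fn(text)   # runs off the UI thread
            with self._lock:
                if generation != self._generation:
                    return                  # a newer keystroke superseded this query
            self.on_results(results)

The old query isn't literally killed here; its results are simply thrown away if a newer keystroke has arrived, which is usually enough to keep the UI responsive.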
If you're willing to get away from the database for this, you could use a generalized suffix tree built from all the terms in your phrases. You can build a suffix tree in linear time and, I believe, use it to find all occurrences of a substring very quickly. The web has lots of pages about suffix trees and suffix arrays. Wikipedia is probably a good place to start.
I have a fun scheme for you. You can build an index of the characters that exist in each phrase as a 32-bit integer: flip bits 0-25 to represent the letters a-z (case-insensitive) that exist in the phrase. Build a second bitmask from the query string. Now you can compare the two with bitwise operations: a phrase is a candidate match when (phraseMask & queryMask) = queryMask. This is very fast, and believe it or not SQLite actually supports bitwise operations in queries, so you can even use this scheme to go straight to the database. I have working code that does this built into one of our iPhone applications - Alphagram.
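Here is a minimal sketch of the scheme using Python's built-in sqlite3 module (the phrases table and its columns are made up for illustration; the test is that the phrase's mask contains every bit of the query's mask):

    import sqlite3

    def letter_mask(text):
        """Set one of bits 0-25 for each distinct letter a-z present (case-insensitive)."""
        mask = 0
        for ch in text.lower():
            if 'a' <= ch <= 'z':
                mask |= 1 << (ord(ch) - ord('a'))
        return mask

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE phrases (text TEXT, mask INTEGER)")
    for phrase in ["Night Watch", "Water Lilies", "The Scream"]:
        conn.execute("INSERT INTO phrases VALUES (?, ?)", (phrase, letter_mask(phrase)))

    query = letter_mask("wa")   # the user has typed "wa"
    # A phrase is a candidate only if it contains every letter of the query:
    rows = conn.execute("SELECT text FROM phrases WHERE (mask & ?) = ?",
                        (query, query)).fetchall()
    print(rows)   # [('Night Watch',), ('Water Lilies',)]

Because the bitmask only records which letters are present, it acts as a very cheap pre-filter; if you need the typed characters to appear as a contiguous substring, a LIKE test over the (now much smaller) candidate set can finish the job.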
Let's say I have a huge text file with rows in a format like
Id Name Address
All the records are sorted by Id. If I am searching for an Id, how can I search more efficiently, either using findstr or by writing something better than findstr?
As a native application, I would not be surprised if findstr has better search performance than most anything one could implement in PowerShell code or even a compiled .NET module. The problem with findstr is that it is oblivious to the structure of your data. That is, if you search for the record with ID 123, it will happily return records with ID 1234 or address "123 Main Street" as false positives. You could potentially use the /B or /R switches to combat this, but that still doesn't help in the case where you search for an ID that doesn't exist; findstr only stops searching when it reaches the end of the file.
Your ability to perform an optimized search depends on the specific format of the text file. If lines are fixed-length, meaning you can instantly seek to the $nth line by simply calculating $n * $lineLength, then you could quickly search the file for an ID using a binary search.
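For the fixed-length case the binary search is roughly this (a Python sketch; it assumes line_length includes the line terminator and that the ID is the first whitespace-separated field, as in the Id Name Address layout above):

    def find_fixed(path, target_id, line_length):
        """Binary search a file of fixed-length lines sorted by numeric ID."""
        with open(path, "rb") as f:
            f.seek(0, 2)                       # jump to the end to get the file size
            count = f.tell() // line_length    # number of records
            lo, hi = 0, count - 1
            while lo <= hi:
                mid = (lo + hi) // 2
                f.seek(mid * line_length)      # seek straight to the nth line
                line = f.read(line_length).decode()
                record_id = int(line.split()[0])
                if record_id == target_id:
                    return line
                if record_id < target_id:
                    lo = mid + 1
                else:
                    hi = mid - 1
        return None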
If lines are variable-length, then there's really no simple way to efficiently search the file other than line-by-line. Even if you've read enough of a line to know the ID doesn't match, you still need to read the rest of the line to know where the next line begins. At best, since the lines are sorted by ID you know that if you encounter a line with an ID greater than the one you're searching for you can abort the search immediately because that ID won't be found.
In the past I have been able to employ a binary search on text files with variable-length lines (fixed-size characters would be very helpful, too, if not required). The key is that on each iteration of the search you calculate your next offset, and if it happens to land on the beginning of a line, great; if not, seek backwards until you find the character that begins the line (e.g. one preceded by a CrLf). Once you've positioned yourself at the start of a line, you can read the ID and determine whether it's a match or in which direction the next iteration of the search needs to look.
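The "seek backwards to the start of the line" step might look roughly like this (a Python sketch; f is a file opened in binary mode, the byte-at-a-time walk is for clarity, and the ID is assumed to be the first whitespace-separated field):

    def read_record_at(f, offset):
        """Snap an arbitrary byte offset back to the start of its line and read that record."""
        # Walk backwards until we are just past a newline (or at the start of the file).
        while offset > 0:
            f.seek(offset - 1)
            if f.read(1) == b"\n":
                break
            offset -= 1
        f.seek(offset)
        line = f.readline()
        return offset, int(line.split()[0]), line.decode()

Each iteration of the binary search then picks a byte offset between its current low and high bounds, calls read_record_at, compares the ID, and moves the bounds to just before or just after that line.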
It's definitely not a quick and simple solution (to write), but, depending on how huge "huge" is, it could yield significant results when searching your file. Although, at that point it might be better to invest your development time in switching to a more search-friendly way of storing your data, if at all possible.
We're developing a REST API for our platform. Let's say we have organisations and projects, and projects belong to organisations.
After reading this answer, I would be inclined to use numerical IDs in the URL, so that some of the URLs would become (say, with a prefix of /api/v1):
/organisations/1234
/organisations/1234/projects/5678
However, we want to use the same URL structure for our front end UI, so that if you type these URLs in the browser, you will get the relevant webpage in the response instead of a JSON file. Much in the same way you see relevant names of persons and organisations in sites like Facebook or Github.
Using this, we could get something like:
/organisations/dutchpainters
/organisations/dutchpainters/projects/nightwatch
It looks like Github actually exposes their API in the same way.
The advantages and disadvantages I can come up with for using names instead of IDs for URL definitions, are the following:
Advantages:
More intuitive URLs for end users
1 to 1 mapping of front end UI and JSON API
Disadvantages:
Have to use unique names
Have to take care of conflicts with reserved names, such as count, so that later on you can still develop an API endpoint like /organisations/count and actually get the number of organisations instead of the organisation called count.
Especially the latter seems like it could become a pain in the rear. Still, after reading this answer, I'm almost convinced to use the string identifier, since it doesn't seem to make a difference from a convention point of view.
My questions are:
Did I miss important advantages / disadvantages of using strings instead of numerical IDs?
Did Github develop their string-based approach after their platform matured, or did they know from the start that it would imply some limitations (like the one I mentioned earlier; it seems they did not implement such functionality)?
It's common to use a combination of both:
/organisations/1234/projects/5678/nightwatch
where the last part is simply ignored but used to make the URL more readable.
In your case, with multiple levels of collections you could experiment with this format:
/organisations/1234/dutchpainters/projects/5678/nightwatch
If somebody writes
/organisations/1234/germanpainters/projects/5678/wanderer
it would still map to the Rembrandt, but that should be OK. That leaves room for editing the names without breaking URLs already out there. Also, the names don't have to be unique if you don't really need that.
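To illustrate, route matching where only the numeric IDs are authoritative and the name parts are cosmetic could look something like this (a Python sketch; the regex and route shape are just one possible layout):

    import re

    ROUTE = re.compile(
        r"^/organisations/(?P<org_id>\d+)(?:/[^/]+)?"
        r"(?:/projects/(?P<project_id>\d+)(?:/[^/]+)?)?/?$"
    )

    def parse(path):
        # Return the numeric IDs; trailing name segments are ignored entirely.
        m = ROUTE.match(path)
        if not m:
            return None
        return {k: int(v) for k, v in m.groupdict().items() if v is not None}

    print(parse("/organisations/1234/dutchpainters/projects/5678/nightwatch"))
    # {'org_id': 1234, 'project_id': 5678}
    print(parse("/organisations/1234/germanpainters/projects/5678/wanderer"))
    # {'org_id': 1234, 'project_id': 5678} - same project, wrong name is harmless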
Reserved HTTP characters: such as ":", "/", "?", "#", "[" and "]" - these characters and others are "reserved" in the HTTP protocol to have "special" meaning in the implementation syntax, so that they are distinguishable from other data in the URL. If a variable value within the path contains one or more of these reserved characters, it will break the path and generate a malformed request. You can work around reserved characters in query string parameters by URL encoding them or sometimes by double escaping them, but you cannot in path parameters.
https://www.serviceobjects.com/blog/path-and-query-string-parameter-calls-to-a-restful-web-service
Consecutive numerical IDs are not recommended anymore because they make it very easy to guess records in your database, and someone might use that to obtain info they do not have access to.
Numerical IDs are used because they are stored as fixed-length values in the database, which makes indexing easy. For example, an INT is 4 bytes in MySQL and a BIGINT is 8 bytes, so every number has the same length in memory (100 stored as an INT takes the same space as 200), which makes it very easy to index and search for records.
If you have a lot of entries in the database, then indexing a VARCHAR field is a bad idea. You should use a fixed-width field like CHAR(32) and pad the difference with spaces, but then you have to add logic in your program to handle the padding when searching the database.
Another idea would be to use slugs, but here you should take into consideration that some records might end up with the same slug, depending on what you use to form the slug. https://en.wikipedia.org/wiki/Semantic_URL#Slug
I would recommend using UUIDs, since they all have the same length and solve these issues easily.
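For example, with Python's standard uuid module (shown only to illustrate the fixed length and non-guessability):

    import uuid

    org_id = uuid.uuid4()
    print(f"/organisations/{org_id}")   # e.g. /organisations/8c3f0f9e-1d2b-4c47-9d6a-2f6d7f7f3a1b
    print(len(str(org_id)))             # always 36 characters, and not guessable like 1, 2, 3...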
This web page suggests that if your lex program "has a large number of reserved words, it is more efficient to let lex simply match a string and determine in your own code whether it is a variable or reserved word."
My question is: More efficient where, and why? If it means compiling the lexer is faster, I don't really care about that because it is one step removed from the program which uses the lexer to parse input.
It seems that lex just uses your description to build a state machine that processes one character at a time. It does not seem logical that increasing the size of the state machine would necessarily cause it to become any slower than using one rule for identifiers and then doing several string comparisons.
Additionally, if it turns out that there is some logical reason for this to make sense as an optimization, what would be considered a large number of reserved words? I have approximately 20, as compared to about 30 other rules for various things. Would that be considered a large number of reserved words? Should I attempt to use the same strategy for some of the other symbols?
I have attempted to google for a result, but the only relevant articles I found stated this strategy as though it were well-known without giving any reason.
In case it is relevant, I am using flex 2.5.35.
Edit: Here is another reference which claims that lex produces an inefficient scanner when asked to match several long literal strings. It also does not give a reason.
According to the flex manual, "[t]he speed of the scanner is independent of the number of rules or ... how complicated the rules are with regard to operators such as '*' and '|'."
The main performance losses are due to backtracking. This can be avoided by (among other things) using catch-all rules which will match tokens that "start with" the offending token. For example, if you have a list of reserved words made up of the characters [a-zA-Z_], and then a rule for matching identifiers of the form [a-zA-Z_][a-zA-Z_0-9]*, the identifier rule will catch any identifier that starts with the name of a reserved word without having to back up and try to match again.
According to the faq, flex generates a deterministic finite automaton which "does all the matching simultaneously, in parallel." The result of this is, as was said above, that the speed of the scanner is independent of the number of rules. On the other hand, string comparison is linear in the number of rules.
As a result, the reserved word rules should, in fact, be considerably faster than a lookup table.
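For reference, the lookup-table approach the question quotes boils down to something like this (a toy regex-based sketch in Python rather than real lex output, with a made-up keyword set):

    import re

    KEYWORDS = {"if", "else", "while", "for", "return"}   # hypothetical reserved words
    IDENT = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

    def classify(token):
        # One rule matches every word; a table lookup decides keyword vs. identifier.
        return "KEYWORD" if token in KEYWORDS else "IDENTIFIER"

    for tok in IDENT.findall("while counter do return iffy"):
        print(tok, classify(tok))
    # while KEYWORD, counter IDENTIFIER, do IDENTIFIER, return KEYWORD, iffy IDENTIFIER

With separate reserved-word rules, flex folds those keywords into the DFA itself, so no per-token lookup like classify is needed at all.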
In our web application we display a list of pulses, but for linking and such we make every pulse uniquely available. In our CouchDB we give every pulse a unique id by md5'ing its unique attributes, e.g.: www.foo.com/bar/
These md5 sums are extremely long, though, and make for ugly URLs. Is there another way to hash the attributes that requires fewer characters but still guarantees uniqueness?
Thanks a lot
Instead of creating an ugly md5, you could use a method like this: create a random string of a given length containing certain characters and store it alongside the md5, then use that 'pretty URL' string to retrieve the data from the database. One thing to think about would be to take the vowels out of the possible characters, since with them you could end up with bad words :) Also, make sure the string does not already exist in the database, of course, and if it does just create another one... that won't happen very often, though.
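A minimal sketch of that idea (Python; the character set and length are arbitrary choices, and exists_in_db stands in for whatever uniqueness check your database layer provides):

    import random
    import string

    # Consonants and digits only, so the generated strings can't spell real words.
    ALPHABET = "".join(c for c in string.ascii_lowercase if c not in "aeiou") + string.digits

    def pretty_id(length=8):
        return "".join(random.choice(ALPHABET) for _ in range(length))

    def unique_pretty_id(exists_in_db, length=8):
        # Regenerate on the rare collision with an existing record.
        while True:
            candidate = pretty_id(length)
            if not exists_in_db(candidate):
                return candidate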
Looking for advice on handling ampersands and the word "and" in Lucene queries. My test queries are (including quotes):
"oil and gas field" (complete phrase)
"research and development" (complete phrase)
"r&d" (complete phrase)
Ideally, I'd like to use the QueryParser as the input is coming from the user.
During testing and doc reading, I found that using the StandardAnalyzer doesn't work for what I want. For the first two queries, a QueryParser.Parse converts them to:
contents:"oil gas field"
contents:"research development"
Which isn't what I want. If I use a PhraseQuery instead, I get no results (presumably because "and" isn't indexed).
If I use a SimpleAnalyzer, then I can find the phrases but QueryParser.Parse converts the last term to:
contents:"r d"
Which again, isn't quite what I'm looking for.
Any advice?
If you want to search for "and", you have to index it. Write your own Analyzer or remove "and" from the list of stop words. The same applies to "r&d": write your own Analyzer that creates three tokens from the text: "r", "d", "r&d".
Step one of working with Lucene is to accept that pretty much all of the work is done at the time of indexing. If you want to search for something then you index it. If you want to ignore something then you don't index it. It is this that allows Lucene to provide such high speed searching.
The upshot of this is that for an index to work effectively you have to anticipate what your analyzer needs to do up front. In this case I would write my own analyzer that doesn't strip any stop words and also transforms & to 'and' (and optionally @ to 'at', etc.). In the case of r&d matching research & development you are almost certainly going to have to implement some domain-specific logic.
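The analyzer itself would be Java, but the normalization it performs can be sketched as a plain pre-processing step applied to both documents and queries before tokenization (a Python illustration; the substitutions are just the ones mentioned above):

    import re

    SUBSTITUTIONS = {
        "&": " and ",   # so "oil & gas" and "oil and gas" index identically
        "@": " at ",
    }

    def normalize(text):
        # Expand symbols, collapse whitespace, lower-case; stop words are kept intact.
        for symbol, replacement in SUBSTITUTIONS.items():
            text = text.replace(symbol, replacement)
        return re.sub(r"\s+", " ", text).strip().lower()

    print(normalize("Oil & Gas field"))   # "oil and gas field"
    print(normalize("R&D budget"))        # "r and d budget"

Whether "r and d" should then also match "research and development" is exactly the kind of domain-specific mapping mentioned above; a simple substitution table won't get you there on its own.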
There are other ways of dealing with this. If you can differentiate between phrase searches and normal keyword searches then there is no reason you can't maintain two or more indexes to handle different types of search. This gives very quick searching but will require some more maintenance.
Another option is to use the high speed of Lucene to filter your initial results down to something more manageable using an analyzer that doesn't give false negatives. You can then run some detailed filtering over the full text of those documents that it does find to match the correct phrases.
Ultimately I think you are going to find that Lucene sacrifices accuracy in more advanced searches in order to provide speed; it is generally good enough for most people. You are probably in uncharted waters trying to tweak your analyzer this much.