I am curious about the performance difference between calling findOne() ten times and calling find() once with ten arguments. Is the latter significantly better?
In general it makes sense to test this in your environment with your data, but the theoretical answer is: it depends, though generally yes - a single find() improves performance because it reduces the number of round-trips to the database, and network latency is usually the dominant cost unless your query is highly inefficient. The effect is more pronounced in production, especially across multiple data centers, and less of an issue on localhost.
The question is where your ten arguments come from. If you want to select a set of 10 elements by id, an $in-query is more elegant and blazing fast. Alternatively, you could combine ten different queries into a single $or-query and still reduce the number of round-trips, though there are some pitfalls with $or and indexes that are described well in the documentation.
Reducing the number of round-trips is also known as the N+1 problem in the context of pseudo-joins, and there is wide agreement that keeping the number of round-trips low is key.
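For illustration, here is a rough sketch of the two approaches using the mongocxx driver; the database name, collection name, and id values are placeholders, not anything from the original question.

#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>
#include <bsoncxx/builder/basic/array.hpp>
#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>

using bsoncxx::builder::basic::kvp;
using bsoncxx::builder::basic::make_array;
using bsoncxx::builder::basic::make_document;

int main() {
    mongocxx::instance inst{};  // driver setup, once per process
    mongocxx::client client{mongocxx::uri{"mongodb://localhost:27017"}};
    auto coll = client["mydb"]["items"];  // placeholder names

    // Ten round-trips: one query per id.
    for (int id = 0; id < 10; ++id) {
        auto doc = coll.find_one(make_document(kvp("_id", id)));
    }

    // One round-trip: a single $in query covering all ten ids.
    auto cursor = coll.find(make_document(kvp("_id",
        make_document(kvp("$in", make_array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))))));
    for (auto&& doc : cursor) {
        // process doc
    }
}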
For my project I need to cycle through an array of values. The number of elements and the values are chosen at compile time. Currently I use mod to cycle through these values in various ways (i.e. not necessarily a simple i++).
However, I looked up the cost of mod() and it seems to be an expensive operation on most architectures, including ATmega Arduinos, and my application is time-sensitive.
I've come up with two potential solutions, both with pitfalls.
Overflow the index counter, exploiting the fact that an unsigned integer wraps around to zero on overflow. This has the advantage of being very fast.
Disadvantages: I need exactly as many array elements as the counter type has distinct values (256 for a one-byte counter, more for wider types). Also, the code is hard to read, since most people wouldn't assume the overflow is intentional.
An if statement that subtracts size_of_array whenever the index equals or exceeds it. The advantage is that size_of_array can be anything. Disadvantage: the if statement is slower (but how much slower?).
In both cases, the edge cases that mod would handle correctly (e.g. taking the modulus of a very large number) would never be encountered.
Are there any pitfalls to either solution that I have not thought of? Is there a better solution?
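For concreteness, here is a minimal sketch of the two options I have in mind; the array contents, sizes and step are placeholders.

#include <stdint.h>

// Option 1: let an 8-bit index wrap on overflow; requires exactly 256 elements.
uint8_t values256[256];
uint8_t idx8 = 0;

uint8_t next_wrapping() {
    return values256[idx8++];      // 255 + 1 rolls over to 0, no mod needed
}

// Option 2: conditional subtraction; works for any array size.
const uint16_t SIZE = 100;         // chosen at compile time
uint8_t values[SIZE];
uint16_t idx = 0;

uint8_t next_conditional(uint16_t step) {
    uint8_t v = values[idx];
    idx += step;                   // step need not be 1
    if (idx >= SIZE) {
        idx -= SIZE;               // one subtraction suffices as long as step <= SIZE
    }
    return v;
}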
How can I compare methods of collision resolution (i.e. linear probing, quadratic probing and double hashing) in hash tables? What data would best show the differences between them? Maybe someone has seen such comparisons.
There is no simple approach that's also universally meaningful.
That said, a good approach if you're tuning an actual app is to instrument (collect stats for) the hash table implementation you're using in the actual application of interest, with the real data it processes, and for whichever functions are of interest (insert, erase, find, etc.). When those functions are called, record whatever you want to know about the collisions that happen: depending on how thorough you want to be, that might include the number of collisions before the element was inserted or found, the number of CPU/memory cache lines touched during that probing, the elapsed CPU or wall-clock time, and so on.
If you want a more general impression, instrument an implementation and throw large quantities of random data at it - but be aware that any conclusions you draw are only as applicable to the real world as the random data is similar to your real-world data.
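As an illustration of that kind of instrumentation, here is a stripped-down open-addressing table with linear probing that counts probes; all names are made up, and you would swap in a quadratic or double-hashing step to compare strategies.

#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Toy open-addressing table with linear probing, instrumented to count probes.
// Assumes the table never becomes completely full (a real table would resize).
class InstrumentedTable {
public:
    explicit InstrumentedTable(std::size_t buckets) : slots_(buckets) {}

    void insert(const std::string& key) {
        std::size_t i = std::hash<std::string>{}(key) % slots_.size();
        while (slots_[i] && *slots_[i] != key) {   // bucket taken by another key
            ++insert_collisions_;                  // record each extra probe
            i = (i + 1) % slots_.size();           // linear probing; change this step
        }                                          // to compare other strategies
        slots_[i] = key;
    }

    bool contains(const std::string& key) {
        std::size_t i = std::hash<std::string>{}(key) % slots_.size();
        while (slots_[i]) {
            ++find_probes_;
            if (*slots_[i] == key) return true;
            i = (i + 1) % slots_.size();
        }
        ++find_probes_;                            // the final (empty) probe
        return false;
    }

    std::uint64_t insert_collisions() const { return insert_collisions_; }
    std::uint64_t find_probes() const { return find_probes_; }

private:
    std::vector<std::optional<std::string>> slots_;
    std::uint64_t insert_collisions_ = 0;
    std::uint64_t find_probes_ = 0;
};

Feeding this your real keys and dumping the counters after a run gives exactly the kind of per-operation stats described above.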
There are also other, more subtle implications of the choice of collision-handling mechanism: for example, linear probing allows an implementation to clean up "tombstone" buckets left by deleted elements, which takes time now but speeds up later operations, so the mix of deletions amongst other operations can affect the stats you collect.
At the other extreme, you could attempt a mathematical comparison of the properties of the different collision-handling schemes - that's way beyond what I'm able or interested in covering here.
I've heard quite a few times that KDB can deal with millions of rows in nearly no time. Why is it that fast? Is that solely because the data is all organized in memory?
Another thing: are there alternatives to this? Do any big database vendors provide in-memory databases?
A quick Google search came up with the answer:
Many operations are more efficient with a column-oriented approach. In particular, operations that need to access a sequence of values from a particular column are much faster. If all the values in a column have the same size (which is true, by design, in kdb), things get even better. This type of access pattern is typical of the applications for which q and kdb are used.
To make this concrete, let's examine a column of 64-bit, floating point numbers:
q).Q.w[] `used
108464j
q)t: ([] f: 1000000 ? 1.0)
q).Q.w[] `used
8497328j
q)
As you can see, the memory needed to hold one million 8-byte values is only a little over 8MB. That's because the data are being stored sequentially in an array. To clarify, let's create another table:
q)u: update g: 1000000 ? 5.0 from t
q).Q.w[] `used
16885952j
q)
Both t and u are sharing the column f. If q organized its data in rows, the memory usage would have gone up another 8MB. Another way to confirm this is to take a look at k.h.
Now let's see what happens when we write the table to disk:
q)`:t/ set t
`:t/
q)\ls -l t
"total 15632"
"-rw-r--r-- 1 kdbfaq staff 8000016 May 29 19:57 f"
q)
16 bytes of overhead. Clearly, all of the numbers are being stored sequentially on disk. Efficiency is about avoiding unnecessary work, and here we see that q does exactly what needs to be done when reading and writing a column - no more, no less.
OK, so this approach is space efficient. How does this data layout translate into speed?
If we ask q to sum all 1 million numbers, having the entire list packed tightly together in memory is a tremendous advantage over a row-oriented organization, because we'll encounter fewer misses at every stage of the memory hierarchy. Avoiding cache misses and page faults is essential to getting performance out of your machine.
Moreover, doing math on a long list of numbers that are all together in memory is a problem that modern CPU instruction sets have special features to handle, including instructions to prefetch array elements that will be needed in the near future. Although those features were originally created to improve PC multimedia performance, they turned out to be great for statistics as well. In addition, the same synergy of locality and CPU features enables column-oriented systems to perform linear searches (e.g., in where clauses on unindexed columns) faster than indexed searches (with their attendant branch prediction failures) up to astonishing row counts.
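To make the locality point concrete outside of q, here is a rough C++ sketch contrasting the same sum over a row-oriented and a column-oriented layout; the struct and field names are invented for illustration.

#include <vector>

// Row-oriented: every row carries all columns, so summing one column strides
// over unrelated fields and wastes most of each cache line it pulls in.
struct Row {
    double f;
    double g;
    long long id;
};

double sum_row_oriented(const std::vector<Row>& rows) {
    double total = 0.0;
    for (const Row& r : rows) total += r.f;    // ~24 bytes loaded per useful 8
    return total;
}

// Column-oriented: the column is one contiguous array of doubles, so every
// cache line fetched is fully useful and the loop vectorizes/prefetches well.
double sum_column_oriented(const std::vector<double>& f_column) {
    double total = 0.0;
    for (double v : f_column) total += v;
    return total;
}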
Source(s): http://www.kdbfaq.com/kdb-faq/tag/why-kdb-fast
As for speed, the in-memory organization does play a big part, but there are several other factors: fast reads from disk for the HDB, splayed tables, etc. From personal experience I can say you can get pretty good speeds from C++, provided you are willing to write that much code. With kdb you get all that and then some.
Another aspect of speed is speed of coding: the learning curve is steep, but once you get it, complex problems can be coded in minutes.
As for alternatives, you can look at OneTick, or google for in-memory databases.
kdb is fast but really expensive. Plus, it's a pain to learn Q. There are a few alternatives such as DolphinDB, Quasardb, etc.
I have a static dictionary containing millions of keys which refer to values in a sparse data structure stored out-of-core. The number of keys is a small fraction, say 10%, of the number of values. The key size is typically 64-bit. The keys are linearly ordered and the queries will often consist of keys which are close together in this order. Data compression is a factor, but it is the values which are expected to be the biggest contributor to data size rather than the keys. Key compression helps, but is not critical. Query time should be constant, if possible, and fast since a user is interacting with the data.
Given these conditions I would like to know an effective way to query the dictionary to determine whether a specific key is contained in it. Query speed is the top priority, construction time is not as critical.
Currently I'm looking at cache-oblivious b+-trees and order-preserving minimal perfect hashes tied to external storage.
At this point CHD or some other form of hashing seems like a candidate. Since the keys are queried in approximately linear order, it seems that an order-preserving hash would avoid cache misses, but I'm not knowledgeable enough to say whether CHD can preserve the order of the keys. Constant-time queries are also desirable: the lookup is O(1), but the upper bound on query time across the key space is unknown.
Trees seem less attractive. Although there are some cache-oblivious and cache-specific approaches I think much of the effort is aimed at range queries on dynamic dictionaries rather than constant-time membership queries. Processors and memories, in general, don't like branches.
There have been a number of questions asked along these lines, but this case (hopefully) constrains the problem in a manner that might be useful to others.
Any feedback would be appreciated, thanks.
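For reference, whichever construction is used (CHD or another minimal perfect hash), a membership test on a static key set typically hashes to a slot and then verifies against a stored key or fingerprint, since a minimal perfect hash maps keys outside the build set to arbitrary slots. A rough sketch, with a trivial placeholder standing in for the real hash structure:

#include <cstdint>
#include <vector>

// Placeholder for a prebuilt structure such as CHD; a real implementation
// would map each member key to a unique slot in 0..n-1.
struct PerfectHashStub {
    std::size_t n;
    std::size_t operator()(std::uint64_t key) const { return key % n; }
};

// Membership test: hash to a slot, then confirm the slot really holds the
// queried key - otherwise non-member keys would be reported as present.
bool contains(const PerfectHashStub& h,
              const std::vector<std::uint64_t>& keys_by_slot,  // indexed by slot
              std::uint64_t key) {
    std::size_t slot = h(key);
    return slot < keys_by_slot.size() && keys_by_slot[slot] == key;
}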
I got asked this question at an interview and said to use a second hash function, but the interviewer kept probing me for other answers. Anyone have other solutions?
best way to resolve collisions in hashing strings
"with continuous inserts"
Assuming the inserts are of strings whose contents can't be predicted, then reasonable options are:
Use a displacement list, so you try a number of offsets from the hashed-to bucket until you find a free bucket (modding by table size). Displacement lists might look something like { 3, 5, 11, 19... } etc. - ideally you want to have the difference between displacements not be the sum of a sequence of other displacements.
Rehash using a different algorithm (but then you'd need yet another algorithm if you happen to clash twice, etc.).
Root a container in the buckets, such that colliding strings can be searched for. Typically the number of buckets should be similar to or greater than the number of elements, so elements per bucket will be fairly small and a brute-force search through an array/vector is a reasonable approach, but a linked list is also credible.
Comparing these, displacement lists tend to be fastest: adding an offset is cheaper than calculating another hash or supporting separate heap allocation, and in most cases the first one or two displacements (which can reasonably be by a small number of buckets) are enough to find an empty bucket, so the locality of memory use is reasonable. They are, however, more collision-prone than an alternative hashing algorithm (which should approach a #elements/#buckets chance of further collisions). With both displacement lists and rehashing you have to provide enough retries that in practice you won't expect a complete failure, add some last-resort handling for failures, or accept that failures may happen.
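A bare-bones sketch of the displacement-list insert, assuming a fixed-size table and an illustrative (not tuned) offset sequence:

#include <array>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Offsets tried after the home bucket; these particular values are only illustrative.
constexpr std::array<std::size_t, 4> DISPLACEMENTS = {3, 5, 11, 19};

// Probe the home bucket, then each displaced bucket, modding by table size.
// Returns false when every probe position is occupied - the "last resort"
// case mentioned above, where a real table would grow or rehash instead.
bool insert(std::vector<std::optional<std::string>>& table, const std::string& key) {
    const std::size_t home = std::hash<std::string>{}(key) % table.size();
    std::size_t pos = home;
    for (std::size_t attempt = 0; attempt <= DISPLACEMENTS.size(); ++attempt) {
        if (!table[pos] || *table[pos] == key) {
            table[pos] = key;
            return true;
        }
        if (attempt < DISPLACEMENTS.size())
            pos = (home + DISPLACEMENTS[attempt]) % table.size();
    }
    return false;
}

Keeping the first displacements small is what preserves the memory locality mentioned above, at the cost of more clustering than rehashing would give.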
Use a linked list as the hash bucket, so any collisions are handled gracefully.
Alternative approach: you might want to consider using a trie instead of a hash table for dictionaries of strings.
The upside of this approach is that you get O(|S|) worst-case complexity for seeking/inserting each string [where |S| is the length of that string]. Note that a hash table only gives you an average case of O(|S|), where the worst case is O(|S|*n) [where n is the size of the dictionary]. A trie also does not require rehashing when the load factor is too high.
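A minimal trie sketch, assuming lowercase ASCII keys purely to keep each node small:

#include <array>
#include <memory>
#include <string>

// Trie node over 'a'..'z'; insert and contains are both O(|S|) in the worst case.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child{};
    bool is_word = false;
};

void insert(TrieNode& root, const std::string& s) {
    TrieNode* node = &root;
    for (char c : s) {
        auto& next = node->child[c - 'a'];
        if (!next) next = std::make_unique<TrieNode>();  // grow the path as needed
        node = next.get();
    }
    node->is_word = true;
}

bool contains(const TrieNode& root, const std::string& s) {
    const TrieNode* node = &root;
    for (char c : s) {
        node = node->child[c - 'a'].get();
        if (!node) return false;   // path breaks off: the string was never inserted
    }
    return node->is_word;
}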
Assuming we are not using a perfect hash function (which you usually don't have), the hash tells you that:
if the hashes are different, the objects are distinct
if the hashes are the same, the objects are probably the same (if a good hashing function is used), but may still be distinct.
So in a hash table, the collision will be resolved with some additional checking of whether the objects are actually the same or not (this brings some performance penalty, but according to Amdahl's law, you still gain a lot, because collisions rarely happen with good hashing functions). In a dictionary you just need to resolve those rare collision cases and ensure you get the right object out.
Using another non-perfect hash function will not resolve anything; it just reduces the chance of (another) collision.
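That extra check, sketched with a simple chained bucket (the names here are illustrative):

#include <functional>
#include <string>
#include <vector>

struct Entry {
    std::string key;
    int value;
};

using Bucket = std::vector<Entry>;

// Even after hashing to the right bucket, each candidate's key is compared
// in full - this is the "additional checking" that actually resolves collisions.
const int* lookup(const std::vector<Bucket>& table, const std::string& key) {
    const Bucket& bucket = table[std::hash<std::string>{}(key) % table.size()];
    for (const Entry& e : bucket) {
        if (e.key == key) return &e.value;   // exact match, not just an equal hash
    }
    return nullptr;                           // hash led here, but the key is absent
}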