Hash algorithm used in Lucene Indexing

Any idea which hash algorithm is used while indexing each word in Lucene?

Lucene doesn't use hashing to look up terms; terms are stored lexicographically in a file called the Term Dictionary. Another file, called the Term Info Index, is loaded into memory to provide random access into the Term Dictionary (which is basically a skip list).
More information is available on the Lucene website.
Currently, the Term Info Index stores the position of every indexdivisor-th term (indexdivisor=128 typically) in memory, meaning you can look up a term by performing one binary search on the Term Info Index (in memory) and at most 128 entry scans on the Term Dictionary (on disk).
http://lucene.472066.n3.nabble.com/Understanding-lucene-indexes-and-disk-I-O-td714698.html
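To illustrate the idea, here is a simplified, hypothetical Java sketch of such a two-level lookup (this is not Lucene's actual code; the array-based layout is an assumption for illustration):
import java.util.Arrays;
// Simplified illustration of the two-level lookup described above (NOT Lucene's code).
// indexTerms holds every 128th term of the sorted term dictionary, indexOffsets the
// position of that term inside the dictionary, and dictionary stands in for the on-disk file.
class TermLookupSketch {
    static final int INDEX_DIVISOR = 128;
    static int lookup(String[] indexTerms, int[] indexOffsets, String[] dictionary, String term) {
        // 1) binary search the in-memory index for the last sampled term <= the query term
        int pos = Arrays.binarySearch(indexTerms, term);
        if (pos < 0) {
            pos = -pos - 2;          // binarySearch returns -(insertionPoint) - 1 when not found
            if (pos < 0) return -1;  // query term sorts before the first indexed term
        }
        // 2) scan at most INDEX_DIVISOR entries of the term dictionary (the on-disk part)
        int end = Math.min(indexOffsets[pos] + INDEX_DIVISOR, dictionary.length);
        for (int i = indexOffsets[pos]; i < end; i++) {
            if (dictionary[i].equals(term)) return i;  // found: return the term's position
        }
        return -1;                                     // term does not occur in the index
    }
}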
There is an optimisation, currently available in trunk, for this Term Info Index, which uses a prefix trie in order to perform lookups and performs much better for terms dictionary-intensive queries.
https://issues.apache.org/jira/browse/LUCENE-3030

Does a Mongo full collection scan read every single word in a collection?

Let's say that you don't have something indexed for some legitimate reason (like maybe you maxed out the 64 allowable indexes) and you are searching for values within only certain fields.
To go to an extreme, let's say each object has an authorName field, a bookTitles field, and a bookFullText field (where the content of all their novels was collected).
If there were no index and you looked for a list of authorNames, would it have to read through all the content of all the fields in the entire collection, or would it read just the authorName fields, and the names (but not the content) of the other fields?
Fields in a document are ordered. The server stores documents as lists of key-value pairs. Therefore, I would expect that, if the server is doing a collection scan and field comparison, it will (see the sketch after this list):
Skip over all of the fields preceding the field in question, one field at a time (which requires the server to perform string comparisons over each field name), and
Skip over the fields after the field in question in a particular document (jump to the next document in the collection).
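A rough sketch of that field-by-field scan (illustrative Java, not MongoDB server code):
import java.util.List;
import java.util.Map;
// One document modeled as an ordered list of key-value pairs, roughly how BSON lays out fields.
class FieldScanSketch {
    static Object findField(List<Map.Entry<String, Object>> document, String target) {
        for (Map.Entry<String, Object> field : document) {
            if (field.getKey().equals(target)) {
                return field.getValue();  // stop at the field in question
            }
            // otherwise the value is skipped and the scan moves on to the next field name
        }
        return null;                      // field not present in this document
    }
}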
The above applies to comparisons. What about reads from disk?
The basic database design I am familiar with separates logical records (documents in the case of MongoDB, table rows in an RDBMS) from physical pages. For performance reasons the database generally will not read documents from disk, but will read pages. As such, it seems unlikely to me that the database will skip over some of the fields when it maps documents to pages. I expect that when any field of a document is needed, the entire document will be read from disk.
Further supporting this hypothesis is MongoDB's 16 MB document limit. This is rather low, and I suspect it is set so that the server can read documents into memory completely without worrying that they might be very large. Postgres, for example, stores oversized column values out of line (via TOAST) rather than inside the table row, presumably to avoid this exact issue of having to read them from disk whenever any other column value is needed.
I am not a MongoDB server engineer though so the above could be wrong.
In the common case (WiredTiger with snappy compression), BSON documents are kept in 32 KB blocks inside 64 MB (default size) chunks on storage. If your document's compressed size is 48 KB, two 32 KB blocks must be loaded into memory, uncompressed, and searched for your non-indexed field, which is an expensive operation. Moreover, if you search multiple documents, they are usually not written in sequential blocks, which increases the IOPS demanded of your backend storage. This is why it is best to do some initial analysis and create indexes on the fields you will search most often. Indexes (B-trees) are very effective, since they are kept in memory most of the time, are compressed (prefix compression), and are very fast for field searches.
There are text indexes in MongoDB that are enough for some simple text searches, or you can use regular expressions.
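As an illustration, a minimal sketch with the MongoDB Java driver that creates a regular index plus a text index and then queries by the indexed field (the database, collection, and field names are taken from the question above and are otherwise just placeholders):
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;
public class IndexExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> books = client.getDatabase("library").getCollection("books");
            // regular B-tree index on the field that is queried most often
            books.createIndex(Indexes.ascending("authorName"));
            // text index for simple full-text searches on the large text field
            books.createIndex(Indexes.text("bookFullText"));
            // this query can now use the authorName index instead of a collection scan
            for (Document d : books.find(Filters.eq("authorName", "Jane Austen"))) {
                System.out.println(d.toJson());
            }
        }
    }
}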
If you will be doing full-text search most of the time, you are better off putting a search engine like Elasticsearch, which supports inverted indexes, in front of the database, since the inverted index has your full-text results already calculated and can return them many times faster than a similar operation using standard B-tree indexes.
If you use Atlas (the MongoDB cloud service), there is already a Lucene engine (inverted index) integrated that can do the full-text search for you.
I hope my answer throws some light on the subject. :)

Text equality operator performance in Postgresql

How does this query work in terms of string comparison performance (assume there is a standard B-tree index on last_name)?
select * from employee where last_name = 'Wolfeschlegelsteinhausenbergerdorff';
So as it walks the B-tree, I am assuming it doesn't do a linear search on each character in the last_name field. E.g., it doesn't start by checking that the first letter starts with a W... Assuming it doesn't do a linear comparison, what does it do?
I ask because I am considering writing my own duplicate prevention mechanism, but I want the performance to be sound. I was originally thinking of hashing each string (into some primitive datatype, probably a Long) that comes in through an API, and storing the hash codes in a set/cache (each entry expires after 5 minutes). Any collisions would/could prompt a true duplicate check, where the already processed strings are stored in PostgreSQL. But I'm wondering whether it would be better to simply query PostgreSQL instead of maintaining my own memory-based set of hashes that flushes old entries after 5-10 minutes. I would probably use Redis for scalability, since multiple nodes will be reading different streams. Is my set of memory-cached hash codes going to be faster than just querying indexed Postgres string columns (matching the full text, not text searching)?
When strings are compared for equality, the function texteq is called.
If you look up the function in src/backend/utils/adt/varlena.c, you will find that the comparison is made using the C library function memcmp. I doubt that you can get faster than that.
When you look up the value in a B-tree index, it is compared to the values stored in the index pages on the path from the root page to the leaf page, which is at most 5 or 6 pages.
Frankly, I doubt that you can manage to be faster than that, but I wish you luck trying.
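If you do end up simply querying Postgres as suggested, a minimal JDBC sketch of that indexed equality lookup could look like this (connection details and credentials are placeholders):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
public class LastNameLookup {
    public static void main(String[] args) throws SQLException {
        // placeholder connection details; assumes the employee table with a B-tree index on last_name
        String url = "jdbc:postgresql://localhost:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT * FROM employee WHERE last_name = ?")) {
            ps.setString(1, "Wolfeschlegelsteinhausenbergerdorff");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("last_name"));  // rows found via the index scan
                }
            }
        }
    }
}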

are hashed indexes in mongodb field-size limited?

In our DB we have a large text field which we want to filter on an exists/does-not-exist basis. So we don't need to perform any text search in it.
We assumed an index would help, but it's not guaranteed the field won't exceed 1024 bytes, so an ordinary index is not an option.
Does a hashed index on such a field support $exists-filtering queries?
Do hashed indexes impose any field-size limitations? (In our experiments, a hashed index is well capable of indexing fields where an ordinary index fails.) We haven't found any explicit statement on this in the docs, though.
Is the chosen approach as a whole the correct one?
Yes, your approach is the correct one given the constraints. However, there are some caveats.
The performance advantage of an index compared to a collection scan is limited by the RAM available, since mongod tries to keep indices in RAM. If it can't (due to queries, for example), even an index will be read from disk, more or less eliminating the performance advantage of using it. So you should test whether the additional index does not push the RAM needed beyond the limits of your planned deployment.
The other, more severe problem is that you cannot use said index to reliably distinguish unique documents, since there is no guarantee of uniqueness for hashes. Albeit a bit theoretical, you have to keep that in mind.
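For reference, a minimal sketch of the approach with the MongoDB Java driver (the database, collection, and field names are assumptions; whether the planner actually uses the hashed index for a given $exists filter is worth verifying with explain()):
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;
public class HashedIndexSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> docs = client.getDatabase("test").getCollection("docs");
            // hashed index on the large text field (hypothetical field name)
            docs.createIndex(Indexes.hashed("bigText"));
            // filter on presence/absence of the field; no text search involved
            long withField = docs.countDocuments(Filters.exists("bigText", true));
            long withoutField = docs.countDocuments(Filters.exists("bigText", false));
            System.out.println(withField + " documents have bigText, " + withoutField + " do not");
        }
    }
}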

Why are key names stored in the document in MongoDB

I'm curious about this quote from Kyle Banker's MongoDB In Action:
It’s important to consider the length of the key names you choose, since key names are stored in the documents themselves. This contrasts with an RDBMS, where column names are always kept separate from the rows they refer to. So when using BSON, if you can live with dob in place of date_of_birth as a key name, you’ll save 10 bytes per document. That may not sound like much, but once you have a billion such documents, you’ll have saved nearly 10 GB of storage space just by using a shorter key name. This doesn’t mean you should go to unreasonable lengths to ensure small key names; be sensible. But if you expect massive amounts of data, economizing on key names will save space.
I am interested in the reason why this is not optimized on the database server side. Would an in-memory lookup table with all the key names in the collection impose too much of a performance penalty to be worth the potential space savings?
What you are referring to is often called "key compression"*. There are several reasons why it hasn't been implemented:
If you want it done, you can currently do it at the Application/ORM/ODM level quite easily.
It's not necessarily a performance** advantage in all cases — think collections with lots of key names, and/or key names that vary wildly between documents.
It might not provide a measurable performance** advantage at all until you have millions of documents.
If the server does it, the full key names still have to be transmitted over the network.
If compressed key names are transmitted over the network, then readability really suffers using the javascript console.
Compressing the entire JSON document might offer an even better performance advantage.
Like all features, there's a cost benefit analysis for implementing it, and (at least so far) other features have offered more "bang for the buck".
Full document compression was originally being considered for a future MongoDB version; it is now available as of version 3.0 (see the update below).
* An in-memory lookup table for key names is basically a special case of LZW style compression — that's more or less what most compression algorithms do.
** Compression provides both a space advantage and a performance advantage. Smaller documents means that more documents can be read per IO, which means that in a system with fixed IO, more documents per second can be read.
Update
MongoDB versions 3.0 and up now have full document compression capability with the WiredTiger storage engine.
Two compression algorithms are available: snappy, and zlib. The intent is for snappy to be the best choice for all-around performance, and for zlib to be the best choice for maximum storage capacity.
In my personal (non-scientific, but related to a commercial project) experimentation, snappy compression (we didn't evaluate zlib) offered significantly improved storage density at no noticeable net performance cost. In fact, there was slightly better performance in some cases, roughly in line with my previous comments/predictions.
I believe one of the original reasons behind storing the key names with the documents is to allow a more easily scalable schema-less database. Each document is self-contained to a greater extent, in that if you move the document to another server (for example, via replication or sharding) you can index the contents of the document without having to reference separate or centralized metadata such as a mapping of key names to more compact key IDs.
Since there is no enforced schema for a MongoDB collection, the field names can potentially be different for every document in the same collection. In a sharded environment, inserts to each shard are (intentionally) independent so at a document level the raw data could end up differing unless the key mapping was able to be consistent per shard.
Depending on your use case, the key names may or may not consume a significant amount of space relative to the accompanying data. You could always work around the storage concern in the application / ODM implementation by mapping YourFriendlyKeyNames to shorter DB key equivalents.
There is an open MongoDB Jira issue (and some further discussion) about having the server tokenize field names, which you can vote on to help prioritize including this feature in a future release.
MongoDB's current design goals include performance with dynamic schemas, replication & high availability, auto-sharding, and in-place updates, with one potential tradeoff being some extra disk usage.
Having to look this up within the database for each and every query would be a serious penalty.
Most drivers allow you to specify an ElementName, so that MyLongButReadablePropertyName in your domain model becomes mlbrpn in MongoDB.
Therefore, when you query in your application, it's the application that transforms the query that would have been:
db.myCollection.find({"MyLongButReadablePropertyName" : "some value"})
into
db.myCollection.find({"mlbrpn" : "some value"})
Efficient drivers, like the C# driver, cache this mapping, so they don't need to look it up for each and every query.
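For Java users, the official driver's POJO support offers a similar mapping via the @BsonProperty annotation; a minimal sketch (the class and property names are made up):
import org.bson.codecs.pojo.annotations.BsonProperty;
// Hypothetical domain class: the readable Java property name maps to the short key
// "mlbrpn" in the stored document, analogous to the C# driver's ElementName.
// (Using POJOs this way also requires registering the driver's PojoCodecProvider.)
public class Book {
    @BsonProperty("mlbrpn")
    private String myLongButReadablePropertyName;
    public String getMyLongButReadablePropertyName() { return myLongButReadablePropertyName; }
    public void setMyLongButReadablePropertyName(String v) { this.myLongButReadablePropertyName = v; }
}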
Coming back to the title of your question:
Why are key names stored in the document in MongoDB
This is the only way documents can be searched.
Without the key names stored, there'd be no key to search on.
Hope this helps

How to build field hash based sharding in MongoDB

I'm looking for a good approach to do the following :
Given a document with some field F, I want to set up sharding so that my application can generate a static hash for that field's value (meaning the hash will always be the same if the value is the same) and then use that hash to target the appropriate shard in a normal MongoDB sharded setup.
Questions :
Is this a safe/good approach?
What is a good way to go about implementing it?
Are there any gotchas concerning shard cluster setup that I should be aware of?
Thanks!
I've actually implemented this and it's very doable and results in very good write performance. I'm assuming you have the same reasons for implementing it as I did (instant shard targeting without warmup/balancing, write throughput, no performance degradation during chunk moves/splits, etc.).
Your questions :
Yes, it is, provided you implement it correctly.
What I do in our in-house ORM layer is mark certain fields in a document as hash-sharded fields. Our ORM will then automatically generate a hash value for that field's value just prior to writing or reading the document. Outgoing queries are then decorated with that hash value (in our case always called "hash"), which MongoDB sharding then uses for shard targeting. Obviously, in this scenario "hash" is always the only sharding key.
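A rough sketch of that decoration using the MongoDB Java driver (the database, collection, and field names are made up, and stableHash stands in for whatever consistent hash function your ORM layer uses):
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
public class HashShardedWrites {
    // stands in for whatever consistent, well-spread hash function the ORM layer uses
    static long stableHash(String value) {
        return value.hashCode();  // illustration only; see the notes below about choosing a good hash
    }
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders = client.getDatabase("shop").getCollection("orders");
            String customerId = "C-42";
            long hash = stableHash(customerId);
            // the ORM adds the precomputed hash field before writing ...
            orders.insertOne(new Document("customerId", customerId)
                    .append("hash", hash)
                    .append("total", 99.90));
            // ... and decorates outgoing queries with the same hash, so mongos can target one shard
            Document match = orders.find(Filters.and(
                    Filters.eq("hash", hash),
                    Filters.eq("customerId", customerId))).first();
            System.out.println(match);
        }
    }
}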
Most important by far is to generate good hashes. A lot of field values (most commonly an ObjectId-based _id field) are incremental, so your hash algorithm must be such that the hashes generated for incremental values hit different shards. Other issues include selecting the appropriate chunk size.
Some downsides to consider :
Default MongoDB chunk balancing becomes less useful since you typically set up your initial cluster with a lot of chunks (this facilitates adding shards to your cluster while maintaining good chunk spread across all shards). This means the balancer will only start splitting if you have enough data in your premade chunks to require splitting.
It's likely to become an officially supported MongoDB feature in the near future which may make this entire effort a bit wasteful. Like me you may not have the luxury of waiting though.
Good luck.
UPDATE 25/03/2013: As of version 2.4, MongoDB supports hashed indexes (and hashed shard keys) natively.
This is a good and safe idea.
However, the choice of the hash function is crucial:
Do you want it to be uniform? (You smooth the load across all your shards, but you lose some semantic bulk access.)
Do you want it human readable? (You lose efficiency compared to a binary hash, which is very fast, but you gain, well, readability.)
Can you make it consistent? (Beware of language-provided hash functions.)
Can you enforce uniqueness if you want to?
I have successfully chosen uniformity, binary form, consistency, and uniqueness with a MurmurHash3 function:
value -> murmurHash3(valueInBinaryForm) followed by valueInBinaryForm
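A hedged Java sketch of that construction, using Guava's MurmurHash3 implementation (the helper name and exact key layout are illustrative):
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
// A MurmurHash3 prefix (for uniform spread across shards) followed by the value's own
// bytes (so the full key stays unique even if two values ever produce the same prefix).
// Uses Guava's murmur3_128; any consistent MurmurHash3 implementation would do.
public class ShardKeyBuilder {
    public static byte[] buildKey(String value) {
        byte[] valueBytes = value.getBytes(StandardCharsets.UTF_8);
        byte[] prefix = Hashing.murmur3_128().hashBytes(valueBytes).asBytes();  // 16-byte hash
        byte[] key = new byte[prefix.length + valueBytes.length];
        System.arraycopy(prefix, 0, key, 0, prefix.length);
        System.arraycopy(valueBytes, 0, key, prefix.length, valueBytes.length);
        return key;
    }
}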