NoSQL: indexing and keyword-based searching - mongodb

I have an application that stores items (e.g. web documents). Each item can feature a arbitrary large set of tags. And typical a common query is to retrieve all documents with given set of tags. Well, a pretty common Web application.
Now I'm thinking about a NoSQL database as persistent storage. Various NoSQL systems (e.g. MongoDB) support secondary indexes and with that keyword-based searches. Examples showing how to do it in different systems are easy to find. The problem is, I would like to know what's going on "under the hood", i.e. how/where the secondary indexes stored, and how a query with a list of tags is actually executed. Particularly in systems with many nodes.
I'm aware of solutions based on Map/Reduce or similar. But here I'm interested how the indexing works. Questions I have, for example, are:
Does the secondary index only store the item/object id or more?
If a query contains k tags, are k subqueries - one for each tag - executed and the k partial results are combined one the initiating node?
Where can I find such information for different NoSQL systems? Thanks a lot for any hints.
Christian

In MongoDB an index on tags would be done by utilizing the multi-keys feature whereby the database tries to match documents against each element of an array. You would index this tags attribute of a given document which would create a btree that is constructed out of ranges of tags in that array.
You can learn more about multikeys here and can get more information about indexing in MongoDB by watching this presentation: MongoDB Internals
Does the secondary index only store the item/object id or more?
The indexes consist of the indexed field (lets say it's a tags array in your case, then the field would be a single tag) and an offset used to efficiently locate the document in memory. It also has some padding + other overhead as described here
If a query contains k tags, are k subqueries - one for each tag - executed and the k partial results are combined one the initiating node?
It depends, but if, for example, the query were using an $or on the tag field I think the queries are performed in parallel, each in O(log n) time, and the results are combined to form the result set but I'm not sure about this though.

Related

MongoDB vs Elasticsearch - indexing parallel arrays

I have an application that needs to do filter the data based on more than 7+ fields.
2+ of these fields are array and currently stored on MongoDB (each of them individually store almost thousands of hexadecimal id). In MongoDB it's not possible to create parallel indexes (for very understandable reasons) Therefore, I'm just able to index based on one single field. In the following thread, the similar issue has been already discussed.
elasticsearch v.s. MongoDB for filtering application
The answer provides some good insights about how ElasticSearch differs from NoSQL databases. But I'm still confused about, will ElasticSearch be performant if I just create the nested mappings for two array fields.
Will the described "Vector Space Model" help me on filtering based on multiple array fields with a good performance when I do exact match / range searches?

mongoDB vs. elasticsearch query/aggregation performance comparison

This question is about choosing the type of database to run queries on for an application. Keeping other factors aside for the moment, and given that the choice is between mongodb and elastic, the key criterion is that the query should be resolved in near real time. The queries will be ad-hoc and as such can contain any of the fields in the JSON objects and will likely contain aggregations and subaggregations. Furthermore, there will not be nested objects and none of the fields will be containing 'descriptive' text (like movie reviews etc.), i.e., all the fields will be keyword type fields like State, Country, City, Name etc.
Now, I have read that elasticsearch performance is near real time and that elasticsearch uses inverted indices and creates them automatically for every field.
Given all the above, my questions are as follows.
(there is a similar question posted in stack but I do not think it answers my questions
elasticsearch v.s. MongoDB for filtering application)
1) Since the fields in the use case I mentioned do not contain descriptive text and hence would not require the full-text search capability and other additional features that elastic provides (especially for text search), what would be a better choice between elastic and mongo? How would elastic search and mongo query/aggregation performance compare if I were to create single field indices on all the available fields in mongo?
2) I am not familiar with advanced indexing, so I am assuming that it would be possible to create indices on all available fields in mongo (either using multiple single field indices or maybe compound indices?). I understand that this will come with a cost for storage and write speed which is true for elastic as well.
3) Also, in elastic the user can trade off write speed (indexing rate) with the speed with which the written document becomes available (refresh_interval) for a query. Is there a similar feature in mongo?
I think the size of your data set is also a very important aspect about choosing DB engine. According to this benckmark (2015), if you have over 10 millions of documents, Elasticsearch could be a better choice. If your data set is small there should be no obvious different about performance between Elasticsearch and MongoDB.

What data structure does Google Firebase Firestore use for it's default index

I'm curious if anyone knows, or can guess, the data structure Google's Firestore is using to index arbitrary NoSQL documents by every field. I'm looking to build something similar, making it as efficient as possible.
Some info about how their default index works:
all fields are indexed by default, but only works for equality searches not range (<,>)
any range searches require extra indexes
Source: https://firebase.google.com/docs/firestore/query-data/indexing
It's unlikely it's a standard btree index per field because the range searches would work without adding the requirement for another index. Plus if you added a new field (easy with document storage), it would take time to build an index and collections with billions of items.
One theory: 1 big index per document. Index "field_name:value" for every field in every document. The index maps to a sorted list document IDs which contain that field/value pair. It would be able to to equality search (my merging the sorted doc-ids for every equality requirement), but not a range search. Basically an inverted index.
Any suggestion for a better ways of implementing a pattern like this?
Clarification, single field indexes do support range/inequality queries, composite indexes are about combining multiple field filters in a single query. See this page for more on index types:
https://firebase.google.com/docs/firestore/query-data/index-overview
Each field index is stored in it's own key range with contiguous regions assigned to a server with compute and storage scaling independently under the covers. Cloud Firestore handles indexes fairly similar to Cloud Datastore (but not 100% the same).
You can see a basic overview on my Cloud Next conference session from last year.

MongoDB Find performance: single compound index VS two single field indexes

I'm looking for an advice about which indexing strategy to use in MongoDb 3.4.
Let's suppose we have a people collection of documents with the following shape:
{
_id: 10,
name: "Bob",
age: 32,
profession: "Hacker"
}
Let's imagine that a web api to query the collection is exposed and that the only possibile filters are by name or by age.
A sample call to the api will be something like: http://myAwesomeWebSite/people?name="Bob"&age=25
Such a call will be translated in the following query: db.people.find({name: "Bob", age: 25}).
To better clarify our scenario, consider that:
the field name was already in our documents and we already have an index on that field
we are going to add the new field age due to some new features of our application
the database is only accessible via the web api mentioned above and the most important requirement is to expose a super fast web api
all the calls to the web api will apply a filter on both the fields name and age (put another way, all the calls to the web api will have the same pattern, which is the one showed above)
That said, we have to decide which of the following indexes offer the best performance:
One compound index: {name: 1, age: 1}
Two single-field indexes: {name: 1} and {age: 1}
According to some simple tests, it seems that the single compound index is much more performant than the two single-field indexes.
By executing a single query via the mongo shell, the explain() method suggests that using a single compound index you can query the database nearly ten times faster than using two single fields indexes.
This difference seems to be less drammatic in a more realistic scenario, where instead of executing a single query via the mongo shell, multiple calls are made to two different urls of a nodejs web application. Both urls execute a query to the database and return the fetched data as a json array, one using a collection with the single compound index and the other using a collection with two single-field indexes (both collections having exactly the same documents).
In this test the single compound index still seems to be the best choice in terms of performance, but this time the difference is less marked.
According to test results, we are considering to use the single compound index approach.
Does anyone has experience about this topic ? Are we missing any important consideration (maybe some disadvantage of big compound indexes) ?
Given a plain standard query (with no limit() or sort() or anything fancy applied) that has a filter condition on two fields (as in name and age in your example), in order to find the resulting documents, MongoDB will either:
do a full collection scan (read every document in the entire collection, parse the BSON, find the values in question, test them against the input and return/discard each document): This is super I/O intense and hence slow.
use one index that holds one of the fields (use index tree to locate relevant subset of documents followed by a scan of them): Depending on your data distribution/index selectivity this can be very fast or barely provide any benefit (imagine an index on age in a dataset of millions of people between 30 and 40 years --> every lookup would still yield an endless number of documents).
use two indexes that together contain both fields in question (load both indexes, perform key lookups, then calculate the intersection of the results): Again, depending on your data distribution, this may or may not give you great(er) performance. It should, however, in most cases be faster than #2. I would, however, be surprised if it was really 10x slower then #4 (as you mentioned).
use a compound index (two subsequent key lookups immediately lead to the required documents): This will be the fastest option of all given that it requires the least and cheapest operations to get to the right documents. In order to ensure the greatest level of reuse (not performance which won't be affected by this) you should in general start with the most selective field first, so in your case probably name and not age given that a lot of people will have the same age (so low selectivity) compared to name (higher selectivity). But that choice also depends on your concrete scenario and the queries you intend to run against your database. There is a pretty good article on the web about how to best define a compound index taking various aspects of your specific situation into account: https://emptysqua.re/blog/optimizing-mongodb-compound-indexes
Other aspects to consider are: Index updates come at a certain price. However, if all you care about is raw read speed and you only have a few updates every now and again, then you should go for more/bigger indexes.
And last but not least (!) the well over-used bottom line advice: Profile the hell out of your system using real data and perhaps even realistic load scenarios. And also keep measuring as your data/system changes over time.
Additional reads:
https://docs.mongodb.com/manual/core/query-optimization/index.html
https://dba.stackexchange.com/questions/158240/mongodb-index-intersection-does-not-eliminate-the-need-for-creating-compound-in
Index intersection vs. compound index?
mongodb compund index vs. index intersect
How does the order of compound indexes matter in MongoDB performance-wise?
In MongoDB, I am using a large query, how I will create compound index or single index, So My response time boost up

Mongodb : multiple specific collections or one "store-it-all" collection for performance / indexing

I'm logging different actions users make on our website. Each action can be of different type : a comment, a search query, a page view, a vote etc... Each of these types has its own schema and common infos. For instance :
comment : {"_id":(mongoId), "type":"comment", "date":4/7/2012,
"user":"Franck", "text":"This is a sample comment"}
search : {"_id":(mongoId), "type":"search", "date":4/6/2012,
"user":"Franck", "query":"mongodb"} etc...
Basically, in OOP or RDBMS, I would design an Action class / table and a set of inherited classes / tables (Comment, Search, Vote).
As MongoDb is schema less, I'm inclined to set up a unique collection ("Actions") where I would store these objects instead of multiple collections (collection Actions + collection Comments with a link key to its parent Action etc...).
My question is : what about performance / response time if I try to search by specific columns ?
As I understand indexing best practices, if I want "every users searching for mongodb", I would index columns "type" + "query". But it will not concern the whole set of data, only those of type "search".
Will MongoDb engine scan the whole table or merely focus on data having this specific schema ?
If you create sparse indexes mongo will ignore any rows that don't have the key. Though there is the specific limitation of sparse indexes that they can only index one field.
However, if you are only going to query using common fields there's absolutely no reason not to use a single collection.
I.e. if an index on user+type (or date+user+type) will satisfy all your querying needs - there's no reason to create multiple collections
Tip: use date objects for dates, use object ids not names where appropriate.
Here is some useful information from MongoDB's Best Practices
Store all data for a record in a single document.
MongoDB provides atomic operations at the document level. When data
for a record is stored in a single document the entire record can be
retrieved in a single seek operation, which is very efficient. In some
cases it may not be practical to store all data in a single document,
or it may negatively impact other operations. Make the trade-offs that
are best for your application.
Avoid Large Documents.
The maximum size for documents in MongoDB is 16MB. In practice most
documents are a few kilobytes or less. Consider documents more like
rows in a table than the tables themselves. Rather than maintaining
lists of records in a single document, instead make each record a
document. For large media documents, such as video, consider using
GridFS, a convention implemented by all the drivers that stores the
binary data across many smaller documents.