MongoDB Online Archive - Do Partition Fields need to follow the ESR rule?

According to the official documentation, we can set up partition fields to speed up query performance when using Online Archive.
The order of fields listed in the path is important in the same way as
it is in Compound Indexes. Data in the specified path is partitioned
first by the value of the first field, and then by the value of the
next field, and so on. Atlas supports queries on the specified fields
using the partitions.
For example, suppose you are configuring the online archive for the
movies collection in the sample_mflix database. If your archived field
is the released date field, which you moved to the third position,
your first queried field is title, and your second queried field is
plot, your partition will look similar to the following:
/title/plot/released
Atlas creates partitions first for the title field, followed by the plot field, and then the released field. Atlas uses the partitions for queries on the following fields:
the title field,
the title field and the plot field,
the title field and the plot field and the released field.
Atlas can also use the partitions to support a query on the title and
released fields. However, in this case, Atlas would not be as
efficient in supporting the query as it would be if the query were on
the title and plot fields only. Partitions are parsed in order; if a
query omits a particular partition, Atlas is less efficient in making
use of any partitions that follow that. Since a query on title and
released omits plot, Atlas uses the title partition more efficiently
than the released partition to support this query.
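To make the quoted example concrete, these are the query shapes in mongosh (all values here are hypothetical) that the /title/plot/released partitions serve, from most to least efficient:

// Served efficiently by the /title/plot/released partition order:
db.movies.find({ title: "Inception" })
db.movies.find({ title: "Inception", plot: "A thief enters dreams." })
db.movies.find({ title: "Inception", plot: "A thief enters dreams.", released: { $gte: ISODate("2010-01-01") } })
// Usable, but less efficient, because the plot partition is skipped:
db.movies.find({ title: "Inception", released: { $gte: ISODate("2010-01-01") } })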
Here is a simplified version of my situation. I need to query by:
1. title(eq)/plot(eq)/released(range)
2. title(eq)/released(range)
According to the document, /title/plot/released supports query 1 but is inefficient for query 2. If I change the order to /released/title/plot, it seems to cover both, but that violates the ESR rule.
Do partition fields need to follow the ESR rule? What is the correct way to handle this requirement? Any deep-dive explanations are welcome.
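For context, the partition order is declared in the archiving rule itself. Below is a minimal sketch of such a rule as an Atlas Admin API payload for the online-archive endpoint; the criteria values are made up, and the exact field names may differ between API versions:

{
  "dbName": "sample_mflix",
  "collName": "movies",
  "criteria": { "type": "DATE", "dateField": "released", "dateFormat": "ISODATE", "expireAfterDays": 365 },
  "partitionFields": [
    { "fieldName": "title", "order": 0 },
    { "fieldName": "plot", "order": 1 },
    { "fieldName": "released", "order": 2 }
  ]
}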

Related

How to maintain order of a Mongo collection by sorting on an indexed field efficiently

ObjectId _id <--- index
String UserName
int Points <--- Descending index
Using this document structure as a simple example, we have a collection of users, each with a name and a "points" value. The collection has the usual _id index but also a "descending index" on Points.
Problem
The sample use case would be maintaining a ranking scoreboard (something like the League of Legends/DOTA ranking system or a chess Elo system). Each user's Points field would be constantly changing, but the scoreboard is viewed very frequently and thus needs to be accurately maintained.
My current unoptimized solution
I'm not sure what "ascending/descending sort order" means in the Mongo docs, but apparently it doesn't matter for single-field indexes anyway.
So currently I'm just using a brute-force solution: re-sorting the collection each time a user's Points field gets updated. At least the field is indexed, so for a smaller userbase this shouldn't be too bad. However, sorting the entire userbase on each update/insertion just seems wrong in general.
Other things I'm considering
There are data structures traditionally used for maintaining order during inserts/updates, such as search trees, but implementing one without putting the entire collection in memory seems like a huge project in itself.
I tried to search for some built-in functionality of Mongo indices that automatically maintains order in the collection for you but I couldn't really find anything like that.
Maybe some logic to re-sort only the chunk of documents directly above and below the insertion/update? This solution seems pretty dependent on the expected spread of Points across the userbase and the use cases of this system.
You don't need to re-sort anything: indexes in MongoDB are kept in sorted order automatically. When you create an index you specify the direction it should be sorted in (ascending (1) or descending (-1)), so when you search for multiple documents by some field, the results can be returned already sorted in that index's order.
Of course, you can explicitly request the results in reverse order, or sorted by another field.
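In mongosh, that looks like the sketch below (collection and field names taken from the question; someId is a placeholder). The read never re-sorts the collection; it just walks the index:

// One-time setup: a descending index on Points.
db.users.createIndex({ Points: -1 })

// An update only touches the index entries for the changed document:
db.users.updateOne({ _id: someId }, { $inc: { Points: 25 } })

// Reading the scoreboard walks the index in order; no collection-wide sort happens:
db.users.find({}, { UserName: 1, Points: 1 }).sort({ Points: -1 }).limit(100)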

DynamoDB equivalent to find().sort()

In MongoDB one can get all the documents sorted ascending/descending on a particular field as follows:
db.collection_name.find().sort({field: sort_order})
query() requires one to specify the partition key, and if one wants to query on a non-key attribute, a GSI can be created and queried, as explained here: Query on non-key attribute
scan() would do the job but doesn't provide an option to sort on any field.
One solution, as described here: Dynamodb scan in sorted order, is to give all documents a common key and create a GSI on the attribute.
But as noted in the comments on one of the solutions, and I quote: "The performance characteristics of a DynamoDB table apply the same to GSIs. A GSI with a single hash key of "OK" will only ever use one partition. This loses all scaling characteristics of DynamoDB".
Is there a way to achieve the above that scales well?
The only sorting applied by DynamoDB is by range key within a partition. There is no feature for sorting results by an arbitrary field; you are expected to sort the results in your application code, i.e. do a scan and sort the results on the client side.
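A minimal Node.js sketch of that client-side approach with AWS SDK v3, assuming a hypothetical Scores table with a numeric points attribute. Note that Scan reads, and bills for, the whole table:

const { DynamoDBClient, ScanCommand } = require("@aws-sdk/client-dynamodb");

const client = new DynamoDBClient({});

async function topScores() {
  // Scan returns unsorted items; a full implementation would also
  // page through LastEvaluatedKey until the table is exhausted.
  const { Items } = await client.send(new ScanCommand({ TableName: "Scores" }));
  // Sort client-side; DynamoDB returns numbers as strings in attribute-value maps.
  return Items.sort((a, b) => Number(b.points.N) - Number(a.points.N));
}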

What data structure does Google Firebase Firestore use for its default index

I'm curious if anyone knows, or can guess, the data structure Google's Firestore is using to index arbitrary NoSQL documents by every field. I'm looking to build something similar, making it as efficient as possible.
Some info about how their default index works:
all fields are indexed by default, but the default index only works for equality searches, not range searches (<, >)
any range searches require extra indexes
Source: https://firebase.google.com/docs/firestore/query-data/indexing
It's unlikely to be a standard B-tree index per field, because then range searches would work without requiring another index. Plus, if you added a new field (easy with document storage), building its index would take time on collections with billions of items.
One theory: one big shared index over all documents. Index "field_name:value" for every field in every document. Each index entry maps to a sorted list of document IDs that contain that field/value pair. It would be able to do equality searches (by merging the sorted doc IDs for every equality requirement), but not range searches. Basically an inverted index.
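A toy sketch of that merge step: each posting list is a sorted array of document IDs, and an equality query with several conditions intersects them (all names and IDs here are made up):

// Hypothetical posting lists from an inverted index, keyed as "field:value".
const index = {
  "type:search": [1, 4, 7, 9],
  "user:Franck": [2, 4, 9, 12],
};

// Intersect two sorted doc-ID lists in O(n + m).
function intersect(a, b) {
  const out = [];
  let i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { out.push(a[i]); i++; j++; }
    else if (a[i] < b[j]) i++;
    else j++;
  }
  return out;
}

intersect(index["type:search"], index["user:Franck"]); // [4, 9]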
Any suggestions for better ways of implementing a pattern like this?
Clarification: single-field indexes do support range/inequality queries; composite indexes are about combining multiple field filters in a single query. See this page for more on index types:
https://firebase.google.com/docs/firestore/query-data/index-overview
Each field index is stored in its own key range, with contiguous regions assigned to a server, and with compute and storage scaling independently under the covers. Cloud Firestore handles indexes fairly similarly to Cloud Datastore (but not 100% the same).
You can see a basic overview in my Cloud Next conference session from last year.

Choosing the right database index type

I have a very simple Mongo database for a personal nodejs project. It's basically just records of registered users.
My most important field is an alpha-numeric string (let's call it user_id and assume it can't be only numeric) of about 15 to 20 characters.
Now the most important operation is checking whether the user exists or not. I do this by querying db.collection.find({user_id: "testuser-123"});
if no record comes back, I save the user along with some other, less important data like first name, last name, and signup date.
Now I obviously want to make user_id an index.
I read the Indexing Tutorials on the official MongoDB Manual.
First I tried setting a text index because I thought that would fit the alphanumeric field. I also tried setting language: none. But it turned out that my query returned in ~12 ms with that index, instead of ~6 ms without any index.
Then I tried just setting an ordered index like {user_id: 1}, but I haven't seen any difference (does it only work for numeric values?).
Can anyone recommend the best type of index for this case, or the quickest query to check if the user exists? Or is MongoDB perhaps not the best match for this?
Some random thoughts first:
A text index is used for full-text search. Given your description, this is not what is needed here, since, if I understand correctly, you need an exact match of the whole field.
Without an index, MongoDB will use a linear search; in big-O notation, that is an O(n) operation. With an (ordered) index, the search is performed in O(log n). That means an index will dramatically speed up queries once you have many documents, but you will not necessarily see any improvement while you have few. For a small collection, the O(n) scan can even beat the O(log n) index lookup because of constant factors. Some database management systems don't even bother using an index if the optimizer estimates that it will not provide enough benefit. I don't know whether MongoDB does that, though.
Given your use case, I think the proper index is a unique index. This is an ordered index that prevents the insertion of two documents with the same user_id.
In your application, do not test before inserting. In a real application, that could lead to race conditions when you have concurrent inserts. If you use a unique index, just try to insert, and be prepared to gracefully handle the error caused by a duplicate key.
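In mongosh, that whole pattern is only a few lines; 11000 is MongoDB's duplicate-key error code:

// One-time setup: a unique index on user_id.
db.users.createIndex({ user_id: 1 }, { unique: true })

// Insert unconditionally and treat a duplicate key as "user already exists":
try {
  db.users.insertOne({ user_id: "testuser-123", first_name: "Test", signup_date: new Date() })
} catch (e) {
  if (e.code === 11000) {
    // user_id already taken; handle gracefully
  } else {
    throw e
  }
}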

Mongodb : multiple specific collections or one "store-it-all" collection for performance / indexing

I'm logging the different actions users perform on our website. Each action can be of a different type: a comment, a search query, a page view, a vote, etc. Each of these types has its own schema plus some common fields. For instance:
comment : {"_id":(mongoId), "type":"comment", "date":4/7/2012,
"user":"Franck", "text":"This is a sample comment"}
search : {"_id":(mongoId), "type":"search", "date":4/6/2012,
"user":"Franck", "query":"mongodb"} etc...
Basically, in OOP or an RDBMS, I would design an Action class / table and a set of inherited classes / tables (Comment, Search, Vote).
As MongoDB is schemaless, I'm inclined to set up a single collection ("Actions") where I would store these objects, instead of multiple collections (a collection Actions plus a collection Comments with a link key to its parent Action, etc.).
My question is: what about performance / response time if I try to search by type-specific fields?
As I understand indexing best practices, if I want "every user searching for mongodb", I would index the fields "type" + "query". But that index only concerns part of the data set: the documents of type "search".
Will the MongoDB engine scan the whole collection, or only consider the documents having this specific schema?
If you create sparse indexes, Mongo will ignore any documents that don't have the indexed key. (At the time of writing, sparse indexes were limited to a single field; newer MongoDB versions allow sparse compound indexes and, from 3.2, partial indexes.)
However, if you are only going to query on common fields, there's absolutely no reason not to use a single collection.
I.e. if an index on user+type (or date+user+type) will satisfy all your querying needs, there's no reason to create multiple collections.
Tip: use date objects for dates, and use ObjectIds rather than names where appropriate.
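As a sketch in mongosh, the indexing options mentioned above look like this; partialFilterExpression requires MongoDB 3.2+ and was not available when this question was asked:

// Compound index on the common fields the answer suggests:
db.actions.createIndex({ user: 1, type: 1, date: 1 })

// Partial index that only covers search actions (a query must include
// { type: "search" } in its filter for this index to be eligible):
db.actions.createIndex({ query: 1 }, { partialFilterExpression: { type: "search" } })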
Here is some useful information from MongoDB's Best Practices
Store all data for a record in a single document.
MongoDB provides atomic operations at the document level. When data
for a record is stored in a single document the entire record can be
retrieved in a single seek operation, which is very efficient. In some
cases it may not be practical to store all data in a single document,
or it may negatively impact other operations. Make the trade-offs that
are best for your application.
Avoid Large Documents.
The maximum size for documents in MongoDB is 16MB. In practice most
documents are a few kilobytes or less. Consider documents more like
rows in a table than the tables themselves. Rather than maintaining
lists of records in a single document, instead make each record a
document. For large media documents, such as video, consider using
GridFS, a convention implemented by all the drivers that stores the
binary data across many smaller documents.
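For completeness, a minimal sketch of the GridFS convention through the Node driver's GridFSBucket (the database, connection string, and file names here are made up):

const { MongoClient, GridFSBucket } = require("mongodb");
const fs = require("fs");

async function uploadVideo() {
  const client = await MongoClient.connect("mongodb://localhost:27017");
  const bucket = new GridFSBucket(client.db("media"));
  // GridFS splits the stream into ~255 KB chunk documents plus one files document.
  fs.createReadStream("./clip.mp4")
    .pipe(bucket.openUploadStream("clip.mp4"))
    .on("finish", () => client.close());
}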