MongoDB's geospatial index: how fast is it?

I'm running a within-box query on a collection of ~40K documents. The query takes ~0.3 s and fetching the documents takes ~0.6 s (there are ~10K documents in the result set).
The documents are fairly small (~100 bytes each) and I limit the result to return the lat/lon only.
It seems very slow. Is this about right or am I doing something wrong?

It does seem very slow. A roughly equivalent search I did on PostgreSQL, for example, is almost too fast to measure (i.e. probably faster than 1 ms).
I don't know much about MongoDB, but are you certain that the geospatial index is actually turned on? (I ask because in RDBMSs it's easy to define a table with geometrical/geographical columns yet not define the actual indexing appropriately, and so you get roughly the same performance as what you describe).
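If you're unsure, it's worth checking from the shell. A minimal sketch, assuming the coordinates live in a single field called loc and the collection is called places (both names are placeholders, not taken from the question):
// list existing indexes; a geospatial index shows up with a "2d" (or "2dsphere") key
db.places.getIndexes()
// if it is missing, create a 2d index on the location field
db.places.ensureIndex({ loc: "2d" })
// a bounding-box query that can use the 2d index; corners are [lowerLeft, upperRight]
// ($within is spelled $geoWithin in newer MongoDB versions)
db.places.find({ loc: { $within: { $box: [[-74.1, 40.6], [-73.7, 40.9]] } } })
If the index was missing, the box query falls back to scanning every document, which would be consistent with the timings you describe.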

Related

MongoDB gets faster when processing repeated queries

I am developing a web application with MongoDB. I have noticed a phenomenon: when executing the same query repeatedly, the processing speed increases and finally converges at a certain point. For example, when querying all documents in the collection, the time per query goes 100000 ms -> 20000 ms -> 9000 ms -> ... -> 500 ms.
I am wondering: what is the reason behind this speed-up, and how can the convergence point be estimated?
There are many possible reasons; I can give a few points.
First of all, MongoDB tries to choose the best index for your query. To do so, MongoDB uses query plans, and this operation takes time:
If there are no matching entries, the query planner generates candidate plans for evaluation over a trial period. The query planner chooses a winning plan, creates a cache entry containing the winning plan, and uses it to generate the result documents.
https://docs.mongodb.com/manual/core/query-plans/
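You can see which plan was picked, and whether it came from the plan cache, from the shell. A minimal sketch with a placeholder collection and filter; the exact plan-cache helpers vary between MongoDB versions:
// show the winning plan the query planner selected for this query shape
db.mycoll.find({ status: "active" }).explain("queryPlanner")
// list the cached query shapes (older shell helper; newer versions expose the same
// information through the $planCacheStats aggregation stage)
db.mycoll.getPlanCache().listQueryShapes()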
Indexes also have to be loaded into memory in order to speed up performance, and this takes time too. Try to look at the touch command:
https://docs.mongodb.com/manual/reference/command/touch/
The touch command loads data from the data storage layer into memory. touch can load the data (i.e. documents), the indexes, or both documents and indexes.
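For example, to warm a collection up after a restart (collection name is a placeholder; note that touch only applies to the MMAPv1 storage engine and was removed in later MongoDB releases):
// pre-load both the documents and the indexes of a collection into memory
db.runCommand({ touch: "mycoll", data: true, index: true })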
Another possible reason is that the indexes do not fit into memory. To find out whether this is your case, you can check with totalIndexSize:
https://docs.mongodb.com/manual/reference/method/db.collection.totalIndexSize/#db.collection.totalIndexSize
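A quick check from the shell (placeholder collection name); compare the numbers against the RAM available to mongod:
// total size of all indexes on the collection, in bytes
db.mycoll.totalIndexSize()
// per-index breakdown, also in bytes
db.mycoll.stats().indexSizes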
I would focus more on the query-planner side, since MongoDB doesn't always make the best decision for you.
In my opinion, all of this should be evaluated carefully to avoid performance degradation.
Good Luck!

Database for filtering XML documents

I would like to hear some suggestions on implementing a database solution for the problem below:
1) There are 100 million XML documents saved to the database per day.
2) The database holds a maximum of 3 days of data.
3) There are 1 million query requests per day.
4) The values through which the documents are filtered are stored in a separate table and mapped to the corresponding XML document ID.
5) The documents are requested by date range, by matching a list of IDs, as the top 10 new documents, and as records that are new since the previous request.
Here is what I have done so far:
1) Checked whether I can use Redis. It is limited to a few data types and cannot apply multiple where-style conditions to filter a hash. Indexing is needed on date and lots of other fields, and I am unable to choose the right data structure to store this in a hash.
2) Investigated DynamoDB. It is again a key-value store, where all the filter conditions would have to be stored as one value. I am not sure it would be efficient to query a JSON document in order to filter out the right XML document.
3) Investigated Cassandra, and it looks like it may fit my requirements, but it has the limitation that read operations might be slow. Cassandra has the advantage of faster write operations over changing data. This looks like the best possible solution so far.
Currently we are using SQL Server, and since there is a performance problem we are looking for a better solution.
Please suggest, thanks.
It's not that reads in Cassandra might be slow, but that it's hard to guarantee an SLA for reads (usually they will be fast, but some of them will be slow).
Cassandra also doesn't have the search capabilities you may need in the future (ordering, searching by many fields, ranked searching). You can probably achieve that with Cassandra, but with an obviously greater amount of effort than with a database built for search operations.
I suggest looking at Lucene/Elasticsearch. Let me quote the features of Lucene from their main website:
Scalable
High-Performance Indexing
over 150GB/hour on modern hardware
small RAM requirements -- only 1MB heap
incremental indexing as fast as batch indexing
index size roughly 20-30% the size of text indexed
Powerful, Accurate and Efficient Search Algorithms
ranked searching -- best results returned first
many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g. title, author, contents)
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
flexible faceting, highlighting, joins and result grouping
fast, memory-efficient and typo-tolerant suggesters
pluggable ranking models, including the Vector Space Model and Okapi BM25
configurable storage engine (codecs)

Optimizing for random reads

First of all, I am using MongoDB 3.0 with the new WiredTiger storage engine. Also using snappy for compression.
The use case I am trying to understand and optimize for, from a technical point of view, is the following:
I have a fairly large collection, with about 500 million documents that takes about 180 GB including indexes.
Example document:
{
    _id: 123234,
    type: "Car",
    color: "Blue",
    description: "bla bla"
}
Queries consist of finding documents with a specific field value, like so:
thing.find( { type: "Car" } )
In this example the type field should obviously be indexed. So far so good. However, the access pattern for this data will be completely random. At any given time I have no idea what range of documents will be accessed. I only know that they will be queried on indexed fields, returning at most 100,000 documents at a time.
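For reference, the indexes involved would look something like this sketch (field names from the example above; in newer shells createIndex replaces ensureIndex):
// single-field index so that find({ type: ... }) does not scan the whole collection
db.thing.ensureIndex({ type: 1 })
// color is also queried on (see the edit below), so it gets its own index
db.thing.ensureIndex({ color: 1 })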
What this means in my mind is that the caching in MongoDB/WiredTiger is pretty much useless. The only thing that needs to fit in the cache are the indexes. An estimation of the working set is hard if not impossible?
What I am looking for is mostly tips on what kinds of indexes to use and how to configure MongoDB for this kind of use case. Would other databases work better?
Currently I find MongoDB to work quite well on somewhat limited hardware (16 GB RAM, non-SSD disk). Queries return in decent time, and obviously instantly if the result set is already in the cache. But as already stated, this will most likely not be the typical case. It is not critical that the queries are lightning fast, more that they are dependable and that the database runs in a stable manner.
EDIT:
Guess I left out some important things. The database will be mostly for archival purposes. As such, data arrives from another source in bulk, say once a day. Updates will be very rare.
The example I used was a bit contrived but in essence that is what queries look like. When I mentioned multiple indexes I meant the type and color fields in that example. So documents will be queried on using these fields. As it is now, we only care about returning all documents that have a specific type, color etc. Naturally, the plan we have is to only query on fields that we have an index for. So ad-hoc queries are off the table.
Right now the index sizes are quite manageable. For the 500 million documents, each of these indexes is about 2.5 GB and fits easily in RAM.
Regarding the average data size of an operation, I can only speculate at this point. As far as I know, typical operations return about 20k documents, with an average object size in the range of 1200 bytes. This is the stat reported by db.stats(), so I guess it refers to the compressed data on disk, and not how much it actually takes up once in RAM.
Hope this bit of extra info helped!
Basically, if you have a consistent rate of reads that are uniformly random over type (which is what I take "I have no idea what range of documents will be accessed" to mean), then you will see stable performance from the database. It will be doing some stable proportion of reads from cache, just by good luck, and another stable proportion by reading from disk, especially if the number and size of documents are about the same between different type values. I don't think there's a special index or anything else to help you besides better hardware. Indexes should remain in RAM because they'll constantly be in use.
I suppose more information would help, as you mention only one simple query on type but then talk about having multiple indexes to worry about keeping in RAM. How much data does the average operation return? Do you ever care to return a subset of documents of a certain type, or always all of them? What do inserts and updates to this collection look like?
Also, if the documents being read are truly completely random over the dataset, then the working set is all of the data.

How to efficiently add a compound index with the _id field in MongoDB

I am doing a range query on _id and need to return only one particular field ("data") from the found documents. I would like to make this query indexOnly for optimal performance.
Here is the query:
db.collection.find({_id:{$gte:"c",$lte:"d"}},{_id:0,data:1})
This query is of course not indexOnly so I need to add another index:
db.collection.ensureIndex({_id:1,data:1})
and tell MongoDB to use that index with:
db.collection.find({_id:{$gte:"c",$lte:"d"}},{_id:0,data:1}).hint({_id:1,data:1})
(The hint is needed because otherwise MongoDB will use the standard _id index for the query.)
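You can confirm that the query is covered by appending explain() to it; the exact output depends on the MongoDB version (older versions report an indexOnly flag, newer ones show a winning plan without a FETCH stage):
// check that the query is answered from the index alone
db.collection.find({_id:{$gte:"c",$lte:"d"}},{_id:0,data:1}).hint({_id:1,data:1}).explain()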
This works as expected and makes the query indexOnly. However, one cannot delete the standard _id index even though it is no longer needed, which leads to a lot of wasted space for the duplicated index. It is also annoying to be forced to always use hint() in the query.
So I am wondering if there is a smarter way to do this.
I don't believe that there is any way to do what you want. The _id index cannot be removed, and you need to have the second index in order to perform a covered (indexOnly) query on your data.
Do you really have the need to have only a single index? I would suspect that you probably only have the requirement for either increased speed or reduced disk usage, but not both. If you do really have a requirement for both increased speed and reduced disk usage, you may need to look for a different database solution, since all of the techniques used to speed up MongoDB queries (indexes, covered queries, sharding, etc) tend to trade increased disk usage in order to gain the speed boost they provide.
EDIT:
Also, if the call to hint() is bugging you, you can probably leave it off, since MongoDB will eventually re-optimize its query plan, at which point it will switch over to your new index if it really is faster.

How complete should MongoDB indexes be?

For example, I have documents with only three fields: user, date, status. Since I select by user and sort by date, I have those two fields as an index. That is the proper thing to do. However, since each date only has one status, I am essentially indexing everything. Is it okay to not index all fields in a query? Where do you draw the line?
What makes this question more difficult is the complete opposite approach to indexes between read-heavy and write-heavy collections. If yours is somewhere in between, how do you determine the proper approach when it comes to indexes?
Is it okay to not index all fields in a query?
Yes, but you'll want to avoid this for frequently used queries. Anything not indexed will imply a "table scan". This means accessing each possible document individually, which will be slow.
Where do you draw the line?
Also note that if you sort by an un-indexed field, MongoDB will "yell at you" (abort the query) if you try to sort too much data in memory. So you have to have some awareness of how much data is "outside of" the index.
If yours is somewhere in between, how do you determine the proper approach when it comes to indexes?
Monitoring, instrumenting, experimenting and experience.
There is no hard and fast rule here, it's all going to be about trade-offs. CPU vs. RAM vs. Disk IO vs. Responsiveness, etc.
The perfect situation is to store everything in a single index. By everything I mean all the fields you query on, sort by, and retrieve. This will ensure that you get maximum performance (if the index fits in RAM).
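For the user/date/status example from the question, that "everything in one index" setup might look like this sketch (assuming the query filters on user, sorts by date, and only needs status back):
// one compound index covering the filter (user), the sort (date) and the returned field (status)
db.collection.ensureIndex({ user: 1, date: 1, status: 1 })
// covered query: filter on user, sort by date, return only status (and exclude _id)
db.collection.find({ user: "someUser" }, { _id: 0, status: 1 }).sort({ date: 1 })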
This situation is not always possible, so you'll have to make choices.
Here are 3 tips to keep the index size to a minimum:
Does each of your queries return a lot of results or only a few? => A few: you do not have to index all the fields you retrieve (only the query and sort fields, because few results mean few disk accesses).
Are your query results often the same (i.e. your working set is small)? => Don't index the fields you retrieve, because the results are cached by MongoDB.
Is one query field more selective than another? => Index only the more selective field.