Best strategy for creating MongoDB indexes for Parse Server? - mongodb

I have a Parse application with a number of possible queries triggered by Cloud functions. I understand that creating indexes can speed up those queries. But I'm wondering what the best strategy for indexing is.
for example, let's say these are some typical queries:
new Parse.Query('Customer')
.equalTo('email', email)
.equalTo('zipCode', zipCode)
new Parse.Query('Customer')
.equalTo('email', email)
.equalTo('areaCode', areaCode)
I could make a compound index for every such query (email+zipCode and email+areaCode, etc), but I have a lot of different queries, so this would be difficult to maintain, and my understanding is that too many indexes can hurt write performance.
If I instead make a single index for every field that is included in queries (email, areaCode, and zipCode for these examples), will all these compound queries be able to make use of the single indexes? I read about "index intersection" in the MongoDB docs, but it's not clear whether this is something that will happen automatically in a Parse query.

Related

In mongo if I have a bunch of IDs, is it more performant to query per ID, or by $in: IDs?

I'm wondering how $in works behind the scenes, and what optimizations are made. Does it loop through the database, looking for the required items, or know immediately where those are? Do indexes matter in those operations?
I'm trying to be efficient as possible, by making one query, and querying the documents I need in one go, but maybe when providing a single ID, which is guaranteed to be indexed, it's faster, and worth the multiple queries.
I guess there is a factor of how many documents we're talking about, in my case it's only a few. I assume with a lot of IDs it may worth it to just query them in one go, but maybe not. I'm not too experienced in mongo.
Generally, It is always better to reduce network roundtrip to the database.
In your case, using $in operator is better because if you make many requests to the database for each id, you will have so many roundtrips.
when you send your query to the database, it will try to create the most efficient execution plan for your query and if there are any indices that can help to achieve a more efficient execution plan, the database will use them.
Mongo creates an index on _id filed of the document by default.

Use simultaneously Elasticsearch and Postgres to perform queries

I'm working on a querying tool that would allow to make complex relational queries along to fulltext search on quite big datasets. My data is stored in a Postgres database with a elasticsearch engine for convenient and efficient fulltext search. However, some queries require complex join, cardinality tests, filters on joined data ...
My dilemma is that I cannot use only ElasticSearch or only Postgres. I need both to answer specific needs. But combining them seems to be a really difficult task.
My approach
... was to perform the ES query first and then use the results' id as a filter for the SQL query. The problem is that ES has a max_result_window that prevents to get all the matching data at once. Even worse, ES may return the first 10K results matching the fulltext search, but the subsequent SQL query may narrow those results number to something ridiculously small, while there's actually much more matching items but the 10K limit acted as bottleneck.
Taking the other way around is no better since if we use the result of the SQL query as the document ids as filters for the ElasticSearch query, the max_clause_count limit will be easily reached and the ES query wouldn't be able to be performed on more than (default) 1024 items.
Maybe my logic isn't good. Is there any other approach to combine both ES and Postgres queries simultaneously ? Thanks.

MongoDB Find performance: single compound index VS two single field indexes

I'm looking for an advice about which indexing strategy to use in MongoDb 3.4.
Let's suppose we have a people collection of documents with the following shape:
{
_id: 10,
name: "Bob",
age: 32,
profession: "Hacker"
}
Let's imagine that a web api to query the collection is exposed and that the only possibile filters are by name or by age.
A sample call to the api will be something like: http://myAwesomeWebSite/people?name="Bob"&age=25
Such a call will be translated in the following query: db.people.find({name: "Bob", age: 25}).
To better clarify our scenario, consider that:
the field name was already in our documents and we already have an index on that field
we are going to add the new field age due to some new features of our application
the database is only accessible via the web api mentioned above and the most important requirement is to expose a super fast web api
all the calls to the web api will apply a filter on both the fields name and age (put another way, all the calls to the web api will have the same pattern, which is the one showed above)
That said, we have to decide which of the following indexes offer the best performance:
One compound index: {name: 1, age: 1}
Two single-field indexes: {name: 1} and {age: 1}
According to some simple tests, it seems that the single compound index is much more performant than the two single-field indexes.
By executing a single query via the mongo shell, the explain() method suggests that using a single compound index you can query the database nearly ten times faster than using two single fields indexes.
This difference seems to be less drammatic in a more realistic scenario, where instead of executing a single query via the mongo shell, multiple calls are made to two different urls of a nodejs web application. Both urls execute a query to the database and return the fetched data as a json array, one using a collection with the single compound index and the other using a collection with two single-field indexes (both collections having exactly the same documents).
In this test the single compound index still seems to be the best choice in terms of performance, but this time the difference is less marked.
According to test results, we are considering to use the single compound index approach.
Does anyone has experience about this topic ? Are we missing any important consideration (maybe some disadvantage of big compound indexes) ?
Given a plain standard query (with no limit() or sort() or anything fancy applied) that has a filter condition on two fields (as in name and age in your example), in order to find the resulting documents, MongoDB will either:
do a full collection scan (read every document in the entire collection, parse the BSON, find the values in question, test them against the input and return/discard each document): This is super I/O intense and hence slow.
use one index that holds one of the fields (use index tree to locate relevant subset of documents followed by a scan of them): Depending on your data distribution/index selectivity this can be very fast or barely provide any benefit (imagine an index on age in a dataset of millions of people between 30 and 40 years --> every lookup would still yield an endless number of documents).
use two indexes that together contain both fields in question (load both indexes, perform key lookups, then calculate the intersection of the results): Again, depending on your data distribution, this may or may not give you great(er) performance. It should, however, in most cases be faster than #2. I would, however, be surprised if it was really 10x slower then #4 (as you mentioned).
use a compound index (two subsequent key lookups immediately lead to the required documents): This will be the fastest option of all given that it requires the least and cheapest operations to get to the right documents. In order to ensure the greatest level of reuse (not performance which won't be affected by this) you should in general start with the most selective field first, so in your case probably name and not age given that a lot of people will have the same age (so low selectivity) compared to name (higher selectivity). But that choice also depends on your concrete scenario and the queries you intend to run against your database. There is a pretty good article on the web about how to best define a compound index taking various aspects of your specific situation into account: https://emptysqua.re/blog/optimizing-mongodb-compound-indexes
Other aspects to consider are: Index updates come at a certain price. However, if all you care about is raw read speed and you only have a few updates every now and again, then you should go for more/bigger indexes.
And last but not least (!) the well over-used bottom line advice: Profile the hell out of your system using real data and perhaps even realistic load scenarios. And also keep measuring as your data/system changes over time.
Additional reads:
https://docs.mongodb.com/manual/core/query-optimization/index.html
https://dba.stackexchange.com/questions/158240/mongodb-index-intersection-does-not-eliminate-the-need-for-creating-compound-in
Index intersection vs. compound index?
mongodb compund index vs. index intersect
How does the order of compound indexes matter in MongoDB performance-wise?
In MongoDB, I am using a large query, how I will create compound index or single index, So My response time boost up

mongodb indexing user-defined schemas

We are currently using MongoDB to allow tenants in a SaaS application to define entities that they can use in the application. We do not know know how each tenant is going to define the fields for the entities that they are creating upfront. Each entity will have a collection dynamically created for it in a separate database that belongs to the tenant.
For example, One tenant might define a Customer as First Name, Last Name, Email. Another tenant might define Shipment as Shipment Ref, Ship Date, Owner etc... Each tenant will have many entities/collections in their tenant database.
We have one field (ID) which we will always force the user to include in each entity/collection. We will index this field upfront when creating the collection.
However, how do we handle the case where we want to allow the tenant to sort/search/order/query large collections/entities quickly when/if the dataset becomes too large?
That is, since we do not know upfront what fields the user will be sorting/filtering/ordering by, what is the indexing strategy to use in this case with Mongo?
First of all Mongo requires you to have _id for each document and it indexes it automatically. You should take advantage of this and not create yet another ID field in case you require your clients to have ID field. I'm not sure if that's the case in your application.
What you are asking for can't have a perfect solution or even the most optimal one, but I can suggest couple options:
Create single field index for each field in the document. Let Mongo query optimizer decide which index to use depending on query. Disadvantages - takes lots of space on disk and in memory. Makes inserts slower. Mongo can use only 1 index in condition clause, so it will not be able to use compound index. You can easily extract schema with a tool like this. I wrote this little prototype that analyzes and prints Mongo schema.
Let your application learn what indexes to create. Get slow queries from Mongo profiler (in Mongo log), analyze common parts (automatically?) and create indexes on most commonly used fields. That's not so easy to implement and efficiency might change with time if your client changes queries or data. Application will be slow in the start until it learns about itself :).
Would just like to emphasize in choosing your design that the ID and not _id field you mention is actually some unique entity identifier then you are better of putting this in _id.
The reason here is that the performance trade-off for using another unique index over the required _id is a considerable overhead. Thinking about this, since _id is required it is the first thing that MongoDB looks for when determining which index to use. Otherwise consider a compound _id field containing your entity information and some other useful uniqueness.
As for the user defined fields, which is kind of the essence of mongo documents, for my money I would make it part of the API to set up indexes as required. Depending on the type of searching that is happening you'll probably want compound indexes and generated queries that make sense to these.
Simply indexing every field will probably have limited use as only one index is going to be picked for the find anyhow, and the query optimizer is going to try all of them. As has been mentioned, a long option could be to set indexes according to the usage patterns. But it could take some work to do.

What is the fundmental difference between MongoDB / NoSQL which allows faster aggregation (MapReduce) compared to MySQL

Greeting!
I have the following problem. I have a table with huge number of rows which I need to search and then group search results by many parameters. Let's say the table is
id, big_text, price, country, field1, field2, ..., fieldX
And we run a request like this
SELECT .... WHERE
[use FULLTEXT index to MATCH() big_text] AND
[use some random clauses that anyway render indexes useless,
like: country IN (1,2,65,69) and price<100]
This we be displayed as search results and then we need to take these search results and group them by a number of fields to generate search filters
(results) GROUP BY field1
(results) GROUP BY field2
(results) GROUP BY field3
(results) GROUP BY field4
This is a simplified case of what I need, the actual task at hand is even more problematic, for example sometimes the first results query does also its own GROUP BY. And example of such functionality would be this site
http://www.indeed.com/q-sales-jobs.html
(search results plus filters on the left)
I've done and still doing a deep research on how MySQL functions and at this point I totally don't see this possible in MySQL. Roughly speaking MySQL table is just a heap of rows lying on HDD and indexes are tiny versions of these tables sorted by the index field(s) and pointing to the actual rows. That's a super oversimplification of course but the point is I don't see how it is possible to fix this at all, i.e. how to use more than one index, be able to do fast GROUP BY-s (by the time query reaches GROUP BY index is completely useless because of range searches and other things). I know that MySQL (or similar databases) have various helpful things such index merges, loose index scans and so on but this is simply not adequate - the queries above will still take forever to execute.
I was told that the problem can be solved by NoSQL which makes use of some radically new ways of storing and dealing with data, including aggregation tasks. What I want to know is some quick schematic explanation of how it does this. I mean I just want to have a quick glimpse at it so that I could really see that it does that because at the moment I can't understand how it is possible to do that at all. I mean data is still data and has to be placed in memory and indexes are still indexes with all their limitation. If this is indeed possible, I'll then start studying NoSQL in detail.
PS. Please don't tell me to go and read a big book on NoSQL. I've already done this for MySQL only to find out that it is not usable in my case :) So I wanted to have some preliminary understanding of the technology before getting a big book.
Thanks!
There are essentially 4 types of "NoSQL", but three of the four are actually similar enough that an SQL syntax could be written on top of it (including MongoDB and it's crazy query syntax [and I say that even though Javascript is one of my favorite languages]).
Key-Value Storage
These are simple NoSQL systems like Redis, that are basically a really fancy hash table. You have a value you want to get later, so you assign it a key and stuff it into the database, you can only query a single object at a time and only by a single key.
You definitely don't want this.
Document Storage
This is one step up above Key-Value Storage and is what most people talk about when they say NoSQL (such as MongoDB).
Basically, these are objects with a hierarchical structure (like XML files, JSON files, and any other sort of tree structure in computer science), but the values of different nodes on the tree can be indexed. They have a higher "speed" relative to traditional row-based SQL databases on lookup because they sacrifice performance on joining.
If you're looking up data in your MySQL database from a single table with tons of columns (assuming it's not a view/virtual table), and assuming you have it indexed properly for your query (that may be you real problem, here), Document Databases like MongoDB won't give you any Big-O benefit over MySQL, so you probably don't want to migrate over for just this reason.
Columnar Storage
These are the most like SQL databases. In fact, some (like Sybase) implement an SQL syntax while others (Cassandra) do not. They store the data in columns rather than rows, so adding and updating are expensive, but most queries are cheap because each column is essentially implicitly indexed.
But, if your query can't use an index, you're in no better shape with a Columnar Store than a regular SQL database.
Graph Storage
Graph Databases expand beyond SQL. Anything that can be represented by Graph theory, including Key-Value, Document Database, and SQL database can be represented by a Graph Database, like neo4j.
Graph Databases make joins as cheap as possible (as opposed to Document Databases) to do this, but they have to, because even a simple "row" query would require many joins to retrieve.
A table-scan type query would probably be slower than a standard SQL database because of all of the extra joins to retrieve the data (which is stored in a disjointed fashion).
So what's the solution?
You've probably noticed that I haven't answered your question, exactly. I'm not saying "you're finished," but the real problem is how the query is being performed.
Are you absolutely sure you can't better index your data? There are things such as Multiple Column Keys that could improve the performance of your particular query. Microsoft's SQL Server has a full text key type that would be applicable to the example you provided, and PostgreSQL can emulate it.
The real advantage most NoSQL databases have over SQL databases is Map-Reduce -- specifically, the integration of a full Turing-complete language that runs at high speed that query constraints can be written in. The querying function can be written to quickly "fail out" of non-matching queries or quickly return with a success on records that meet "priority" requirements, while doing the same in SQL is a bit more cumbersome.
Finally, however, the exact problem you're trying to solve: text search with optional filtering parameters, is more generally known as a search engine, and there are very specialized engines to handle this particular problem. I'd recommend Apache Solr to perform these queries.
Basically, dump the text field, the "filter" fields, and the primary key of the table into Solr, let it index the text field, run the queries through it, and if you need the full record after that, query your SQL database for the specific index you got from Solr. It uses some more memory and requires a second process, but will probably best suite your needs, here.
Why all of this text to get to this answer?
Because the title of your question doesn't really have anything to do with the content of your question, so I answered both. :)