MongoDB vs. Elasticsearch query/aggregation performance comparison

This question is about choosing the type of database to run queries on for an application. Setting other factors aside for the moment, and given that the choice is between MongoDB and Elasticsearch, the key criterion is that queries should be resolved in near real time. The queries will be ad hoc and as such can contain any of the fields in the JSON objects, and will likely contain aggregations and sub-aggregations. Furthermore, there will be no nested objects, and none of the fields will contain 'descriptive' text (like movie reviews etc.), i.e., all the fields will be keyword-type fields like State, Country, City, Name etc.
Now, I have read that Elasticsearch offers near-real-time performance, and that it uses inverted indices and creates them automatically for every field.
Given all the above, my questions are as follows.
(There is a similar question posted on Stack Overflow, but I do not think it answers my questions:
elasticsearch v.s. MongoDB for filtering application)
1) Since the fields in the use case I mentioned do not contain descriptive text and hence would not require the full-text search capability and other additional features that elastic provides (especially for text search), which would be the better choice between elastic and mongo? How would Elasticsearch and mongo query/aggregation performance compare if I were to create single-field indexes on all the available fields in mongo?
2) I am not familiar with advanced indexing, so I am assuming that it would be possible to create indexes on all available fields in mongo (either using multiple single-field indexes or maybe compound indexes?); see the first sketch after this list. I understand that this will come with a cost in storage and write speed, which is true for elastic as well.
3) Also, in elastic the user can trade off write speed (indexing rate) against the speed with which a written document becomes available for queries (refresh_interval); see the second sketch after this list. Is there a similar feature in mongo?
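
For question 2), a minimal sketch of what "indexes on all available fields" could look like in MongoDB, using pymongo; the collection name and field list here are invented for illustration:

    from pymongo import MongoClient, ASCENDING

    # Hypothetical collection; "records" and the field names are examples.
    coll = MongoClient("mongodb://localhost:27017")["demo"]["records"]

    # One single-field index per queryable field; MongoDB can intersect
    # these for ad-hoc filters, at a cost in storage and write speed.
    for field in ["state", "country", "city", "name"]:
        coll.create_index([(field, ASCENDING)])

    print(coll.index_information())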
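
And for question 3), a sketch of the Elasticsearch side of that trade-off (the index name is an example, and the call uses the classic elasticsearch-py body= style; newer clients take a settings= keyword instead):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # The default refresh_interval is "1s". Raising it makes indexing
    # cheaper, but written documents take longer to become searchable;
    # "-1" disables refresh entirely (useful during bulk loads).
    es.indices.put_settings(index="records",
                            body={"index": {"refresh_interval": "30s"}})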

I think the size of your data set is also a very important factor when choosing a DB engine. According to this benchmark (2015), if you have over 10 million documents, Elasticsearch could be a better choice. If your data set is small, there should be no obvious difference in performance between Elasticsearch and MongoDB.

Related

MongoDB vs Elasticsearch - indexing parallel arrays

I have an application that needs to filter data based on more than 7 fields.
Two or more of these fields are arrays, and the data is currently stored in MongoDB (each of the array fields individually stores as many as thousands of hexadecimal IDs). In MongoDB it's not possible to index parallel arrays (for very understandable reasons), so I'm only able to index on one of these fields. The similar issue has already been discussed in the following thread.
elasticsearch v.s. MongoDB for filtering application
The answer provides some good insights about how Elasticsearch differs from NoSQL databases. But I'm still unsure whether Elasticsearch will be performant if I just create nested mappings for the two array fields.
Will the described "Vector Space Model" help me filter on multiple array fields with good performance when I do exact-match / range searches?
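
For concreteness, a hedged sketch of how two array-valued ID fields might be mapped and filtered (all names here are invented). Note that any Elasticsearch field can hold multiple values, so plain keyword fields often suffice for exact matching; nested mappings are only needed when sub-fields of array objects must stay correlated. Also, exact-match filters are unscored, so the Vector Space Model does not come into play for them:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Two array fields of hexadecimal IDs, mapped for exact matching.
    es.indices.create(index="items", body={
        "mappings": {"properties": {
            "group_ids":  {"type": "keyword"},
            "member_ids": {"type": "keyword"},
        }}
    })

    # Exact-match filtering across both array fields in one query.
    es.search(index="items", body={
        "query": {"bool": {"filter": [
            {"term": {"group_ids": "5f3a9c"}},
            {"term": {"member_ids": "9b1c2d"}},
        ]}}
    })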

Database for filtering XML documents

I would like to hear some suggestions on implementing a database solution for the problem below.
1) There are 100 million XML documents saved to the database per day.
2) The database holds a maximum of 3 days of data.
3) There are 1 million query requests per day.
4) The values through which the documents are filtered are stored in a separate table and mapped to the corresponding XML document IDs.
5) The documents are requested based on a date range, documents matching a list of IDs, the top 10 new documents, and records that are new since the previous request.
Here is what I have done so far:
1) Checked whether I can use Redis: it is limited to a few data types, and you cannot use multiple WHERE-style conditions to filter a hash in Redis. With indexing needed on date and lots of other fields, I am unable to choose the right data structure to store the data in a hash.
2) Investigated DynamoDB: it is again a key-value store where all the filter conditions would have to be stored as one value. I am not sure it would be efficient to query a JSON document to filter out the right XML document.
3) Investigated Cassandra: it looks like it may fit my requirements, but with the limitation that read operations might be slow. Cassandra has the advantage of faster write operations over changing data. This looks like the best possible solution so far.
Currently we are using SQL Server, and there is a performance problem, so we are looking for a better solution.
Please suggest, thanks.
It's not that reads in Cassandra might be slow, but that it's hard to guarantee an SLA for reads (usually they will be fast, but some of them will be slow).
Cassandra also doesn't have the search capabilities which you may need in the future (ordering, searching by many fields, ranked searching). You can probably achieve that with Cassandra, but with an obviously greater amount of effort than with a database suited to search operations.
I suggest looking at Lucene/Elasticsearch. Let me quote the features of Lucene from their main website:
Scalable
High-Performance Indexing
over 150GB/hour on modern hardware
small RAM requirements -- only 1MB heap
incremental indexing as fast as batch indexing
index size roughly 20-30% the size of text indexed
Powerful, Accurate and Efficient Search Algorithms
ranked searching -- best results returned first
many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g. title, author, contents)
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
flexible faceting, highlighting, joins and result grouping
fast, memory-efficient and typo-tolerant suggesters
pluggable ranking models, including the Vector Space Model and Okapi BM25
configurable storage engine (codecs)

Does Elasticsearch have the same index functionality that MongoDB has?

I want to know: just as we have the index-creation feature in MongoDB to speed up queries (https://docs.mongodb.org/v3.0/indexes/), what do we have in Elasticsearch for this purpose? I googled it but was unable to find any suitable information. I used indexing in MongoDB on the most frequently used fields to speed up queries, and now I want to do the same in Elasticsearch; is there anything that Elasticsearch provides? Thanks.
Elasticsearch also has indices: https://www.elastic.co/blog/what-is-an-elasticsearch-index
They are also used as part of the database's key features to provide swift search capabilities.
It is annoying that "index" is used in a different sense in ES than in many other databases. I'm not as familiar with MongoDB, so I'll refer to their documentation at v3.0/core/index-types.
Basically, Elasticsearch was designed to serve efficient "filtering" (yes/no queries) and "scoring" (relevance ranking via tf-idf etc.), and it uses Lucene as the underlying inverted index.
MongoDB concepts and their ES counterparts:
Single Field Index: trivially supported, perhaps as not_analyzed fields for exact matching
Compound Index: Lucene applies AND filter conditions via efficient bitmaps and can ad-hoc merge any "single field" indexes
Multikey Index: transparently supported; there is no difference between a single value and an array of values
Geospatial Index: directly supported via geo-shapes
Text Index: in some ways ES was optimized for exactly this use case, via analyzed field types
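
A sketch of what such a mapping might look like through the Python client (field names invented; keyword and text are the modern successors of not_analyzed and analyzed string fields):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    es.indices.create(index="demo", body={
        "mappings": {"properties": {
            "city":     {"type": "keyword"},    # single-field, exact match
            "tags":     {"type": "keyword"},    # multikey: arrays need no special mapping
            "location": {"type": "geo_shape"},  # geospatial
            "review":   {"type": "text"},       # analyzed full-text
        }}
    })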
In my view, in search applications relevance is more important than plain filtering of the results, since some words occur in almost every document and are thus less relevant when searching.
Elasticsearch has other very useful concepts as well, such as aggregations, nested documents and child/parent relationships.

should I use elasticsearch for non-free-text searches

I use Postgres as a data warehouse. I need to do free-text search on many of the fields. My DBA recommends not using Postgres for free-text searches. I am now considering Elasticsearch. The question is what to do if the user filters both by free text and by some structured dimension. Should I query both elastic and Postgres and take the intersection, or can I serve the whole query from elastic? And what if there is no free text in the filter: is elastic appropriate for my general-purpose querying?
EDIT: as requested, some more information. The database will contain a few million rows. I cannot give concrete details about the data, except that a row will contain ~30 columns, half of them strings of between one word and a few sentences. The reasons to use elastic are not just the DBA's objection to a full-text index in Postgres; elastic also gives result ranking and specific text-search semantics.
It is true that elasticsearch is great for full-text search, since it uses lucene under the covers, but it's also very good at structured search through filters. One other great thing you can do with it is data analytics, which lets you visualize aggregations of your data.
That said, you don't necessarily need full-text search requirements in order to make good use of elasticsearch. There are many use cases where elasticsearch is used for only one of the three aspects I mentioned: full-text search, structured search or data analytics. The next step is to combine them.
Your use case is quite common, and I would suggest going ahead and running structured queries against elasticsearch too, instead of querying two systems. The only obstacle I can foresee is document relations, which need to be properly represented and handled within elasticsearch.
Have a look at the elasticsearch query DSL, which is used to represent queries and can effectively combine structured and unstructured search.
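
As a rough illustration of that combination (index and field names are made up), a single bool query can mix a ranked full-text clause with unscored structured filters:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    resp = es.search(index="warehouse", body={
        "query": {"bool": {
            "must":   [{"match": {"description": "leather wallet"}}],  # scored, ranked
            "filter": [                                                # exact, unscored
                {"term":  {"country": "DE"}},
                {"range": {"price": {"lte": 50}}},
            ],
        }}
    })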

NoSQL: indexing and keyword-based searching

I have an application that stores items (e.g. web documents). Each item can feature an arbitrarily large set of tags, and a typical query is to retrieve all documents with a given set of tags. A pretty common web application, in other words.
Now I'm thinking about a NoSQL database as the persistent storage. Various NoSQL systems (e.g. MongoDB) support secondary indexes and, with that, keyword-based searches. Examples showing how to do this in different systems are easy to find. The problem is, I would like to know what's going on "under the hood", i.e. how/where the secondary indexes are stored, and how a query with a list of tags is actually executed, particularly in systems with many nodes.
I'm aware of solutions based on Map/Reduce or similar, but here I'm interested in how the indexing works. Questions I have, for example, are:
Does the secondary index only store the item/object id or more?
If a query contains k tags, are k subqueries - one for each tag - executed, and the k partial results combined on the initiating node?
Where can I find such information for different NoSQL systems? Thanks a lot for any hints.
Christian
In MongoDB an index on tags would be handled by the multikey feature, whereby the database tries to match documents against each element of an array. Indexing the tags attribute of a document creates a B-tree constructed out of the ranges of tags in that array.
You can learn more about multikeys here, and you can get more information about indexing in MongoDB by watching this presentation: MongoDB Internals
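
A minimal sketch of that multikey behaviour with pymongo (collection and field names are examples):

    from pymongo import MongoClient, ASCENDING

    coll = MongoClient()["demo"]["documents"]

    # "tags" holds arrays, so this index is automatically multikey:
    # each tag of each document gets its own B-tree entry.
    coll.create_index([("tags", ASCENDING)])

    coll.insert_one({"title": "doc1", "tags": ["nosql", "indexing", "mongodb"]})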
Does the secondary index only store the item/object id or more?
The indexes consist of the indexed field (let's say it's a tags array in your case; then the field would be a single tag) and an offset used to efficiently locate the document in memory. They also carry some padding + other overhead, as described here.
If a query contains k tags, are k subqueries - one for each tag - executed and the k partial results are combined one the initiating node?
It depends, but if, for example, the query used an $or on the tags field, I think the subqueries are performed in parallel, each in O(log n) time, and the results are combined to form the result set; I'm not sure about this, though.
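
For illustration, the two query shapes discussed here might look like this in pymongo (names invented); both can use the multikey index on tags:

    from pymongo import MongoClient

    coll = MongoClient()["demo"]["documents"]

    # ANY of k tags ($in, the set form of $or): one index lookup per tag,
    # with the partial results merged.
    any_match = list(coll.find({"tags": {"$in": ["nosql", "indexing"]}}))

    # ALL of k tags ($all): the per-tag index results are intersected.
    all_match = list(coll.find({"tags": {"$all": ["nosql", "indexing"]}}))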