Choosing a Database to Store Report JSON - MongoDB

I am trying to figure out which DB to use for a project with the following requirements.
Requirements:
Scalability and availability should both be high
The data format is JSON documents, several MBs each in size
Query capabilities are my least concern; it is more of a key-value use case
High performance / low latency
I considered MongoDB, Cassandra, Redis, PostgreSQL (jsonb), a few other document-oriented DBs, and embedded databases (a small footprint would be a plus).
Please help me figure out which DB would be the best choice.
I won't need document/row-wise comparison queries at all; at most I will need to pick a subset of fields from a document. What I am looking for is a lightweight DB with a small footprint, low latency, and high scalability. Very limited query capabilities are acceptable. Should I be choosing an embedded DB? What are the points to consider here?
Thanks for the help!

If you use documents (JSON), use a document database, especially if the documents differ in structure.
PostgreSQL does not scale horizontally out of the box. Have a look at CockroachDB if you like.
Cassandra can do key-value at scale, as can Redis, but neither is really a document database.
I would suggest MongoDB or CouchDB; which one is a good match depends on your needs. In CAP terms, MongoDB favors consistency and partition tolerance, while CouchDB favors availability and partition tolerance.
If you can live with some limits on querying and want high availability, try out CouchDB.
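For a key-value style document workload like this, the access pattern in MongoDB boils down to an upsert and a lookup by `_id`. A minimal pymongo sketch, assuming a local server and invented database/collection names:

```python
from pymongo import MongoClient

# Hypothetical connection and names; adjust to your deployment.
client = MongoClient("mongodb://localhost:27017")
reports = client["reportdb"]["reports"]

# Store a report JSON document under its natural key.
# Using _id as the key gives you a unique primary index for free.
report = {"_id": "report-2024-01", "meta": {"source": "etl"}, "rows": [1, 2, 3]}
reports.replace_one({"_id": report["_id"]}, report, upsert=True)

# Key lookup with a projection: fetch only a subset of the document,
# which covers the "subset pick" requirement without reading the whole blob.
doc = reports.find_one({"_id": "report-2024-01"}, {"meta": 1})
print(doc)  # {'_id': 'report-2024-01', 'meta': {'source': 'etl'}}
```

Note that an embedded DB would give the smaller footprint, but replication in a client-server store is what provides the high availability the question asks for.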

Related

What are the production best practices for storing a large number of documents in MongoDB?

I need to store application transaction logs and decided to use MongoDB. Almost 200,000 documents are stored every day in a single-node MongoDB.
We have some reports and operations (if something happened, then do something) that depend on those logs, so I need to find documents matching different criteria. At that pace, is the setup vulnerable? Will queries become slow to execute?
Any suggestions for making this an efficient use of MongoDB?
By the way, all of that data is in a single collection, and the MongoDB server version is 4.2.6.
Mongo collections can grow to many terabytes without much issue. To be able to query that data quickly, you will have to analyze your queries and create indexes for the fields used in them.
Indexes are not free, though: they take disk space and use up RAM, because for indexes to be useful they need to fit entirely in RAM.
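To make the "analyze your queries, then index" advice concrete, here is a hedged pymongo sketch; the collection and field names (`transaction_logs`, `event_type`, `created_at`) are invented for illustration:

```python
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")
logs = client["appdb"]["transaction_logs"]  # hypothetical names

# Compound index matching a typical query pattern:
# filter on event_type, then sort newest-first on created_at.
logs.create_index([("event_type", ASCENDING), ("created_at", DESCENDING)])

# explain() shows whether the query actually uses the index (look for
# IXSCAN rather than COLLSCAN in the winning plan).
plan = (
    logs.find({"event_type": "payment_failed"})
    .sort("created_at", DESCENDING)
    .explain()
)
print(plan["queryPlanner"]["winningPlan"])
```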
In most cases, if indexes and collections grow beyond what your hardware can handle, you will have to archive/evict old data and trim down the collections.
If your reports need to include that evicted data, you will have to keep another collection of summarized values for the evicted records, which you then combine with the current data when generating the reports, as sketched below.
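One hedged way to build such a summary collection is an aggregation that rolls old documents up and writes them out with $merge (available from MongoDB 4.2, so it fits the asker's version); all names and the 90-day cutoff below are illustrative:

```python
from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["appdb"]
cutoff = datetime.utcnow() - timedelta(days=90)  # illustrative retention window

# Roll soon-to-be-evicted log documents up into a per-day summary
# collection; the originals can then be archived or deleted.
db["transaction_logs"].aggregate([
    {"$match": {"created_at": {"$lt": cutoff}}},
    {"$group": {
        "_id": {
            "day": {"$dateToString": {"format": "%Y-%m-%d", "date": "$created_at"}},
            "event_type": "$event_type",
        },
        "count": {"$sum": 1},
    }},
    {"$merge": {"into": "log_daily_summary", "whenMatched": "replace"}},
])
```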
Alternatively, sharding can help with big data, but there are some limitations on the queries you can run against sharded collections.

Elasticsearch & Mongo

A very newbie question, I assume. I started playing around with ES and MongoDB, and I'm trying to move data out of a SQL DB as an exercise.
I can't help but wonder: what data would I store in Mongo and what in ES? Can I store everything in ES? Assume a big data load, as in price trends.
To begin with, MongoDB is a so-called document store. The key feature of this concept is that it stores schema-dynamic documents:
Each record in a collection can have a different structure
The types of each record's fields can differ
Document properties (columns) can have nested structures
It's not schema-free; it's schema-dynamic (a flexible schema). To get into the concept, there is a great tutorial here: https://docs.mongodb.org/manual/data-modeling/
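As a small illustration of that flexibility (names invented here), documents with different structures, different field types, and nested properties can live in the same collection:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
people = client["demo"]["people"]  # illustrative names

# Two records in one collection: different fields, a different type
# for "age", and a nested "address" property.
people.insert_many([
    {"name": "Ada", "age": 36, "address": {"city": "London", "zip": "N1"}},
    {"name": "Grace", "age": "unknown", "languages": ["COBOL", "FORTRAN"]},
])
```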
MongoDB is the most widely used document store - see http://db-engines.com/en/system/MongoDB.
It has drivers for most programming languages, enabling rapid development. You can dive into Mongo quite quickly: there are a lot of tutorials, and the official MongoDB University offers great courses for developers and DBAs.
In short, it supports indexing, aggregations, filters, load balancing, sharding, replication (replica sets), etc. Data is stored and transferred in the BSON format (http://bsonspec.org/).
A good comparison of MongoDB vs RDBMS concepts can be found in this official reference: https://docs.mongodb.org/manual/reference/sql-comparison/
What is it good for? It enables agile development where the schema can change over time - especially form-based data, user-generated content, location-based data, user profiles and more. It also allows storing large documents (up to 16 MB each).
Now, Elasticsearch is not a database. It is a search engine with some great aggregation capabilities (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html - make sure you check out the metrics, bucket, and pipeline aggregations).
A typical RDBMS is not designed for full-text search or loosely structured data. Queries in ES can return results much faster than any database (e.g. seconds in an RDBMS compared to milliseconds in ES). You need to remember that the key is to design your indexes well, and that they will take up disk space.
There is a very detailed article comparing the two with regard to performance; you may find it useful: http://blog.quarkslab.com/mongodb-vs-elasticsearch-the-quest-of-the-holy-performances.html.
You can actually use both successfully - MongoDB stores your data, while ES is used as the serving layer (search, aggregations, etc.).
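A minimal sketch of that split, assuming the pymongo and elasticsearch-py 8.x clients and invented collection/index names: MongoDB holds the document of record, while ES indexes only the searchable fields, keyed by the Mongo `_id`:

```python
from bson import ObjectId
from elasticsearch import Elasticsearch
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
es = Elasticsearch("http://localhost:9200")
products = mongo["shop"]["products"]  # illustrative names

# Write the full document to MongoDB (system of record)...
doc_id = products.insert_one(
    {"title": "Mechanical keyboard", "price": 89.0, "specs": {"layout": "ISO"}}
).inserted_id

# ...and index just the searchable fields in Elasticsearch.
es.index(index="products", id=str(doc_id),
         document={"title": "Mechanical keyboard", "price": 89.0})

# Search in ES, then resolve hits back to the authoritative documents.
hits = es.search(index="products", query={"match": {"title": "keyboard"}})
for hit in hits["hits"]["hits"]:
    full_doc = products.find_one({"_id": ObjectId(hit["_id"])})
```

In production you would keep the two in sync with something like change streams or a message queue rather than dual writes, but the division of labor stays the same.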
There is a big difference between MongoDB and ES.
MongoDB is a database designed to store data and query it, while Elasticsearch is a Lucene-based indexer in which you should only index data for search - you should not treat it as your primary store. Even though you can use store: true in Elasticsearch, it is not recommended, and I wouldn't rely on it for important data.

MongoDB sharding for data warehouse

Sharding provides scalable throughput and storage. Scalable throughput and storage is a kind of paradise for analytics. However, there is a big trade-off that I am thinking about.
If I use a hashed shard key:
- writes will be very scalable
- however, sequential reads of the facts will be expensive, since every read has to hit all servers
If I use a ranged shard key, e.g. on field A:
- writes might be scalable, as long as we are not sharding on a timestamp field
- however, sequential reads will not be scalable unless they follow field A
In my opinion, it won't be very scalable as a data warehouse. However, I have no idea what else would make a MongoDB data warehouse scalable.
Is MongoDB sharding really suitable for making a data warehouse scale?
Erm, if you read a lot of data, it is most likely that you will exhaust the physical read capacity of a single server. You want the reads to be done in parallel - unless I have a very wrong understanding of data warehousing and the limitations of today's HDDs and SSDs.
What you would do first is select the subset of the data you want to analyze, right? If you have a lot of data, it makes sense for that matching to be done in parallel. Once the subset is selected, further analysis is applied, right? This is exactly what MongoDB does in the aggregation framework: an early $match is run on all affected shards, and the results are sent to the primary shard for that database, where the remaining stages of the pipeline are applied.
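In pipeline form (collection and field names invented for illustration), that early parallel match looks like this:

```python
from pymongo import MongoClient

# Connect through mongos in a sharded cluster; names are placeholders.
client = MongoClient("mongodb://localhost:27017")
facts = client["dwh"]["facts"]

pipeline = [
    # The leading $match is executed on every affected shard in parallel...
    {"$match": {"region": "EU", "year": 2020}},
    # ...and the later stages are applied where the results are merged.
    {"$group": {"_id": "$product", "revenue": {"$sum": "$amount"}}},
    {"$sort": {"revenue": -1}},
]
for row in facts.aggregate(pipeline):
    print(row)
```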

MongoDB - Store Indexes on SSD & Data Collections on Magnetic?

Is it possible to store index collections on separate high-performance storage (i.e. flash/SSD) while keeping data collections on lower-cost conventional storage? The performance issues I am seeing using MongoDB appear to be related to index maintenance operations, and I am having to partition my data across database buckets on a single instance in order to avoid drastic dips in write performance - a solution that will only scale for so long. Therefore I would like to use SSD for indexes, but it doesn't make sense to pay for high-performance storage where it's not warranted (data collections).
The only discussion I have found on this subject is somewhat dated:
https://jira.mongodb.org/browse/SERVER-965
You cannot do what you are asking with MongoDB.
The workaround is splitting your collections into multiple databases. MongoDB cannot do joins, so there is no real trade-off at the data level; your application will just have to manage connections to multiple databases.
Put your high-performance collections on SSD servers with larger RAM allocations.
Put your lower-performance collections on servers with slower spinning disks or AWS EBS.
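A hedged sketch of what that workaround looks like from the application side, with invented hostnames and names; the "join" becomes two lookups glued together in code:

```python
from pymongo import MongoClient

# Two deployments on different hardware tiers; MongoDB will not join
# across them, so the application owns both connections.
hot = MongoClient("mongodb://ssd-host:27017")["app_hot"]    # SSD, more RAM
cold = MongoClient("mongodb://hdd-host:27017")["app_cold"]  # spindle/EBS

hot["sessions"].insert_one({"user_id": 42, "token": "abc"})
cold["audit_log"].insert_one({"user_id": 42, "action": "login"})

# Application-level "join": look up in the hot tier, then the cold tier.
session = hot["sessions"].find_one({"user_id": 42})
history = list(cold["audit_log"].find({"user_id": session["user_id"]}))
```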

MongoDB: Billions of documents in a collection

I need to load 6.6 billion bigrams into a collection, but I can't find any information on the best way to do this.
Loading that many documents onto a single primary-key index would take forever, but as far as I'm aware, Mongo doesn't support the equivalent of partitioning?
Would sharding help? Should I try to split the data set over many collections and build that logic into my application?
It's hard to say what the optimal bulk insert batch size is -- it partly depends on the size of the objects you're inserting and other hard-to-measure factors. You could try a few batch sizes and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be JSON or CSV. There's also mongorestore, if the data is in BSON format.
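A hedged sketch of batched bulk inserts with pymongo (the names and the 10,000 batch size are illustrative; tune the size empirically, as suggested above):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
bigrams = client["corpus"]["bigrams"]  # illustrative names

def batches(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# ordered=False lets the server continue past individual errors
# (e.g. duplicate keys) instead of aborting the whole batch.
source = ({"w1": "w%d" % i, "w2": "w%d" % (i + 1), "n": 1}
          for i in range(1_000_000))  # stand-in for the real bigram stream
for chunk in batches(source, 10_000):
    bigrams.insert_many(chunk, ordered=False)
```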
Mongo can easily handle billions of documents in a single collection, but remember that the maximum document size is 16 MB. There are plenty of people with billions of documents in MongoDB, and there are lots of discussions about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read if you change your mind and want multiple collections instead. Note that the more collections you have, the more indexes you will have as well, which is probably not what you want.
Here's a presentation from Craigslist on inserting billions of documents into MongoDB, along with the author's blog post.
It does look like sharding would be a good solution for you, but typically sharding is used for scaling across multiple servers; a lot of people do it because they want to scale their writes or because they are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a shard or replica set as your data grows or you need extra redundancy and resilience.
However, some users run multiple mongods to get around the write-locking limits of a single mongod. It's obvious but still worth saying that a multi-mongod setup is more complex to manage than a single server. If your IO or CPU isn't maxed out here, your working set is smaller than RAM, and your data is easy to keep balanced (pretty randomly distributed), you should see an improvement from sharding on a single server. As an FYI, there is potential for memory and IO contention. With 2.2 improving concurrency through database-level locking, I suspect there will be much less reason for such a deployment.
You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way, it's best to pre-split and turn off the balancer: it would be counter-productive to be moving data around to keep things balanced, which means you need to decide up front how to split the data. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.
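For illustration, pre-splitting and pausing the balancer might look like this via the admin commands (the namespace and shard key are invented; balancerStop/balancerStart need a mongos on MongoDB 3.4+):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # connected to mongos
admin = client.admin

# Shard on a hashed key; numInitialChunks pre-splits the collection
# so the bulk load spreads across shards from the start.
admin.command("enableSharding", "corpus")
admin.command("shardCollection", "corpus.bigrams",
              key={"bigram": "hashed"}, numInitialChunks=128)

# Keep the balancer off during the load so chunks are not migrated
# mid-import; turn it back on afterwards.
admin.command("balancerStop")
# ... bulk load here ...
admin.command("balancerStart")
```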
Here's some good links -
Choosing a Shard Key
Blog post on shard keys
Overview presentation on sharding
Presentation on Sharding Best Practices
You can absolutely shard data in MongoDB (which partitions it across N servers on the shard key). In fact, that's one of its core strengths. There is no need to reimplement that in your application.
For 6.6 billion documents I would strongly recommend it for most use cases. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.