I want to implement MongoDB as a distributed database, but I cannot find good tutorials for it. Whenever I search for "distributed database" with MongoDB, I get links about sharding, so I am confused: are the two the same thing?
Generally speaking, if you have a read-heavy system, you may want to use replication: one primary with at most 50 secondaries. The secondaries share the read load while the primary takes care of writes. It is an auto-failover system, so when the primary goes down, one of the secondaries takes over and becomes the new primary.
Sharding, however, is more flexible. All the shards share both write and read load; that is to say, data is distributed across different shards. And each shard can itself consist of a replica set, with auto-failover working as described above.
I would choose replication first because it is simple and sufficient for most scenarios. Once it is no longer enough, you can convert from replication to sharding.
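Below is a minimal sketch (Python with pymongo; the host names, replica-set name, and database are illustrative assumptions) of what "secondaries share the read load" looks like from the client side:

```python
from pymongo import MongoClient

# Hypothetical hosts/replica set: reads are allowed to go to secondaries.
client = MongoClient(
    "mongodb://host1:27017,host2:27017,host3:27017/"
    "?replicaSet=rs0&readPreference=secondaryPreferred"
)

db = client["shop"]
db.orders.insert_one({"item": "book", "qty": 1})  # writes always hit the primary
print(db.orders.count_documents({}))              # reads may be served by a secondary
```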
There is also another discussion of differences between replication and sharding for your reference.
Just some perspective on distributed databases:
In the early nineties, a lot of applications were desktop based and had a local database containing MBs/GBs of data.
Now, with the advent of web-based applications, there can be millions of users who use and store their data, and this data can run into GB/TB/PB. Storing all of it on a single server is economically expensive, so it is spread across a cluster of servers (or commodity hardware) and horizontally partitioned. Sharding is another term for horizontal partitioning of data.
For example, say you have a Customer table that contains 100 rows and you want to shard it across 4 servers. You can pick key-based sharding, in which customers are distributed as follows: SHARD-1 (1-25), SHARD-2 (26-50), SHARD-3 (51-75), SHARD-4 (76-100). A sketch follows the list below.
Sharding can be done in 2 ways:
Hash based
Key based
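To make the two approaches concrete, here is a minimal, self-contained sketch (plain Python; the shard count and the choice of md5 are illustrative assumptions) that routes the 100 customers above:

```python
import hashlib

NUM_SHARDS = 4

def key_based_shard(customer_id: int) -> int:
    """Key/range based: ids 1-25 -> shard 0, 26-50 -> shard 1, and so on."""
    return (customer_id - 1) * NUM_SHARDS // 100

def hash_based_shard(customer_id: int) -> int:
    """Hash based: hashing the key spreads ids evenly, regardless of ranges."""
    digest = hashlib.md5(str(customer_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(key_based_shard(30))   # -> 1, i.e. SHARD-2, which holds ids 26-50
print(hash_based_shard(30))  # -> whichever shard the hash lands on
```

Key-based sharding keeps ranges together (good for range scans) but can create hot spots; hash-based sharding spreads load evenly but scatters ranges.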
Use Case: Currently we have "offices" in places around the world with very intermittent internet access. Sometimes it's great, but sometimes it can go off for hours at a time.
Right now we are using CouchDB with a master database in the cloud, and we have documents with an office_id. We then do a filtered sync to all of these "offices" to only send over documents that have that office_id and are less than a month old.
With CouchDB you can edit these documents and add new ones on the offline CouchDB server in these offices. At this time, we have a cron that does a replication sync to our master database in the cloud every 5 minutes or so.
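For context, the kind of filtered replication described above can be triggered against CouchDB's _replicate endpoint; here is a hedged sketch (Python with requests; the hosts, database, filter name, and credentials are illustrative assumptions, not the asker's actual setup):

```python
import requests

# Hypothetical names; the filter function would live in a design document.
requests.post(
    "http://cloud-master:5984/_replicate",
    json={
        "source": "http://cloud-master:5984/main",
        "target": "http://office-7:5984/main",
        "filter": "sync/by_office",
        "query_params": {"office_id": "office-7"},
    },
    auth=("admin", "password"),
    timeout=60,
)
```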
Problem: While CouchDB makes it really easy to sync, I'm afraid of some scalability issues with CouchDB. Before it gets too late, I'm trying to explore different database avenues and ways I could do this.
Amazon seems to be doubling down on their DocumentDB offering, which supports MongoDB, so I'm curious: has anyone done multi-master syncing with MongoDB or a NoSQL equivalent?
I don't want to get into a scenario where scaling puts me in a corner.
Amazon DocumentDB uses shared storage, which isn't at all what you are after and doesn't solve your problem. MongoDB would be a very poor choice for your scenario, and master-master replication is a really hard problem to deal with. CouchDB (which you already use) is the first that comes to mind, and you should really search for that explicit feature if you are looking for a replacement. Also note that many multi-master setups assume that when a partition occurs between masters, clients can still connect to all or some of them; that isn't your case, because each of your clients has only a single reachable master.
Another option would be to build your replication yourself using a queue system or similar, but that would require even more infrastructure at each location (since the problem is the internet connection going down), and it would only be "easy" if different offices rarely or never edit the same documents, because dealing with merges manually is a pain.
You don't explain what your scaling worries are, but I don't really see MongoDB or any other NoSQL database as having much different scalability traits from CouchDB.
EDIT: What you are actually after is Optimistic Replication (or Lazy Replication, Eventual Consistency)
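To illustrate the heart of optimistic replication: when two masters have edited the same document while partitioned, some deterministic merge rule must pick a winner once they reconnect. A minimal sketch (plain Python, no particular database; the field names are illustrative) of the common last-write-wins rule:

```python
def merge(local: dict, remote: dict) -> dict:
    """Last-write-wins: keep the revision with the newer timestamp."""
    return remote if remote["updated_at"] > local["updated_at"] else local

local_doc  = {"_id": "doc1", "office_id": "nyc", "qty": 3, "updated_at": 1700000100}
remote_doc = {"_id": "doc1", "office_id": "nyc", "qty": 5, "updated_at": 1700000200}

print(merge(local_doc, remote_doc))  # the remote revision wins: written later
```

Real systems (CouchDB revision trees, Couchbase XDCR timestamps) are more elaborate, but the trade-off is the same: availability now, conflict resolution later.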
Since you mentioned a "NoSQL equivalent", I would like to explain how Couchbase can accomplish this in 2 different ways:
1) Cross Data Center Replication (XDCR) - Allows clusters of different sizes to synchronize data between them. The replication can be paused/resumed without any issues (conflicts can be resolved via timestamp or document revision). Replications can be uni- or bidirectional, and you can also filter which documents should be synchronized between clusters using Filtering Expressions (a simplified query system): https://docs.couchbase.com/server/6.5/xdcr-reference/xdcr-filtering-expressions.html
2) Sync Gateway - A middleware originally designed to synchronize your main database with databases on the edge. In this architecture, we assume the connection will be down most of the time: https://blog.couchbase.com/couchbase-mobile-embedded-java-write-throughput/ . Your application could simply consume the Sync Gateway stream and insert the changes into the cluster replica. (Although I think XDCR should be enough for you.)
Couchbase started as a fork of CouchDB a long time ago and most of the code has been rewritten, but some core concepts are still present in Couchbase.
Finally, a big plus of moving to Couchbase is that it is a distributed and highly scalable database (from 1 to >100 nodes in a single cluster), and you will be able to query your data using N1QL: https://query-tutorial.couchbase.com/tutorial/#1
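As a taste of N1QL from application code, a hedged sketch (the Couchbase Python SDK 4.x API is assumed; the bucket name, credentials, and fields are illustrative):

```python
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

cluster = Cluster(
    "couchbase://localhost",
    ClusterOptions(PasswordAuthenticator("user", "password")),
)

# N1QL is SQL over JSON documents; `office-docs` is a hypothetical bucket.
rows = cluster.query(
    "SELECT d.* FROM `office-docs` d WHERE d.office_id = 'nyc' LIMIT 10"
)
for row in rows:
    print(row)
```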
I'm wondering which technology would do better for a typical product catalog of a webshop. I'm writing my master's thesis about NoSQL in the enterprise environment, and I think I have focused on document stores for too long now.
I've read a lot of articles that recommend document stores because of their flexibility, which is needed to model thousands of different products. But as far as I know, column-family stores like Cassandra offer the same flexibility.
What I like most about the idea of using Cassandra is what nosql-database.org says about it (I marked the most interesting features):
massively scalable, partitioned row store, masterless architecture, linear scale performance, no single points of failure, read/write support across multiple data centers & cloud availability zones. API / Query Method: CQL and Thrift, replication: peer-to-peer, written in: Java, Concurrency: tunable consistency, Misc: built-in data compression, MapReduce support, primary/secondary indexes, security features.
In the end I will focus on building a prototype of a highly available and scalable multishop system that makes use of polyglot persistence: say, K/V stores for sessions, a document store or column-family store for the product catalog, and maybe an RDBMS for inventory/pricing, as Sadalage and Fowler mention in their book "NoSQL Distilled".
If possible, provide scientific papers or other reliable sources for your answers.
Thanks!
Document Store's Achilles Heel
Stuart Halloway mentioned that a document store is the biggest schema-lock solution and way too inflexible, which I agree with. Couch/Mongo and others try to mitigate that by providing workarounds: ways to create secondary indices, the ability (and necessity) to be aware of plain object ids, etc. And of course, once you think about versioning (i.e. add a "time" variable to your system), document stores fail to provide smooth support for it and for time travel.
Column Store: Problem Relevance
Cassandra is a really compelling solution for building "scalable"/"distributed" systems, with real examples such as Netflix, where 500 Cassandra nodes can be brought up in AWS in several minutes and all requests hit a Cassandra ring.
However, given the problem as stated in your question, Cassandra would be unnecessary overkill. Not just because it is a bit more complex than "others", or because it is mentally harder to create a solid data model on top of column-oriented stores, but also because a "product catalog" problem is not exactly rocket science. It can become one if you want to add machine learning later to predict/recognize/etc., but a catalog itself is not, and simpler stores such as PostgreSQL would solve it easily.
Simple Desire to NoSQL
If you really want to use NoSQL for a product catalog, I would definitely consider 3 solutions to fit your prototype:
Riak as a "K/V for Sessions"
Datomic to solve "Product Catalog, Inventory and Pricing"
Depending on the size and nature of the problem and the final solution, I would consider Redis to cache those sessions, while having Datomic comfortably sit on top of Riak as its storage service.
Practice vs. Theory
Two classical NoSQL papers that made NoSQL sound real in practice for the first time are Dynamo and BigTable. I consider Datomic to be the next evolutionary step in the DB universe, introducing a hybrid data model with true indices and relations without a schema lock, and immutability from which everything follows: safe time travel, caching, local db values, etc.
Practically, if it weren't a master's thesis, then depending on the real problem's scale and definition, I would be choosing between Datomic and PostgreSQL to solve catalog, inventory, pricing, etc.
A big advantage of Datomic here is time travel. In practice it is very important to be able to safely and easily do that in a "Shopping System".
A big advantage of PostgreSQL is its familiarity and SQL tools availability for analytics and reporting.
By now I think that column-family stores are not well suited for product catalogs.
That's because products often contain collections of some kind: tags, tracklists for music records, different sizes for clothes, and so on.
Cassandra supports collections by now, BUT they are not searchable! That is a must-have feature, for tags for example.
In contrast, MongoDB offers the $in operator to search in nested arrays...
I don't want to say it is not possible to model a product catalog in Cassandra, but I think it is much more straightforward to do in a document store.
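For what that array lookup looks like in practice, a minimal sketch (Python with pymongo; the database, collection, and tags are illustrative assumptions):

```python
from pymongo import MongoClient

products = MongoClient("mongodb://localhost:27017").catalog.products

products.insert_one({"name": "Blue Train", "tags": ["jazz", "vinyl"]})

# $in matches documents whose tags array contains any of the listed values.
for doc in products.find({"tags": {"$in": ["jazz", "rock"]}}):
    print(doc["name"])
```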
I am working on a solution deployed on an Amazon EC2 server instance that has its region set to US West. The solution uses MongoDB for data storage and contains a web service that is used by a mobile application. The user base of the mobile application is split 40:60 between the US and Asia; as such, I need to set up another EC2 instance in the Asia Pacific region to lower their latency and connection time.
Since the data storage is located on an instance in US West, how would I set up a new instance in Asia Pacific that can share the same data with the US West instance? I am open to moving the MongoDB database elsewhere, but I do not want to change to a different NoSQL solution.
There are various different solutions here. I will try to provide a few.
Replica Sets
Perhaps the easiest solution would be to use a Replica Set, where you have two servers in US-WEST and one in ASIA. Replica Sets in MongoDB require a minimum of three nodes to work, and as you have a higher share of users near US-WEST, it makes sense to put two of them there.
Now, with just the three nodes, you only make the data available closer to ASIA on one of the nodes. You then need to use Read Preferences to instruct your application to read from either the US-WEST or the ASIA nodes. I have written an article about how PHP deals with those Read Preferences at http://derickrethans.nl/readpreferences.html — other language drivers will have a similar solution.
All drivers maintain connections to each of the Replica Set nodes, so connection overhead should not be too much of a problem. At least you can do reads from the nearest node to reduce latency; writes still always have to go to the primary (which will likely be in US-WEST).
Pros: Fairly easy to set up, only three nodes required
Cons: Only good for directing reads, but not writes
Sharding
Sharding is a method in MongoDB that allows you to separate your whole data set into smaller pieces, making it possible to fit a huge dataset into MongoDB without being constrained by the resources of one server. Typically, a sharded set-up consists of at least two shards, each containing a (3-node) replica set, but it is also possible for a replica set to consist of only one node, in which case you'd end up with two shards, each containing one data node.
Sharding in MongoDB supports "Tag Aware Sharding" (http://mongodb.onconfluence.com/display/DOCS/Tag+Aware+Sharding), which makes it possible to route specific documents to specific shards depending on a field in your document. If your documents, for example, carry user IDs or country codes, you can use that field to send documents to the correct shard.
Setting this up, however, is not very easy, as it requires quite a good understanding of sharding with MongoDB. There is a really nice introduction at http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/
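For a flavor of what the setup involves, here is a hedged sketch (Python with pymongo, run against a mongos router) using zone sharding, the modern successor to the tag-aware sharding described above; the shard names, namespace, and country-based shard key are illustrative assumptions:

```python
from pymongo import MongoClient

# Assumes a sharded cluster whose shard key starts with a country field.
admin = MongoClient("mongodb://mongos-host:27017").admin

# Tag each shard with the region it lives in.
admin.command("addShardToZone", "shardUS", zone="US")
admin.command("addShardToZone", "shardASIA", zone="ASIA")

# Route documents whose country falls in this range to the ASIA shard.
admin.command(
    "updateZoneKeyRange",
    "appdb.users",
    min={"country": "JP"},
    max={"country": "JP\uffff"},
    zone="ASIA",
)
```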
Pros: Allows you to have data localized to one specific location for both read and write
Cons: Not easy to set up; you need two data nodes, config servers, and proxy (mongos) servers.
Hope this helps!
If your application is read-heavy, I would use MongoDB's replica sets.
I'm looking for opinions on replacing an existing data grid (i.e. Oracle Coherence) with a document store alternative, e.g. NoSQL MongoDB. I was thinking about the most important pros and cons and came up with:
NoSQL
Pros:
No additional database
No ORM mapping necessary
Although the best query efficiency is achieved when looking up by ID, other queries can be satisfied by map/reduce-style queries (see the sketch after this list)
Cons:
Quite difficult to achieve data consistency when updating multiple collections or even multiple documents in the same collection
Slower response time? (I suspect that Coherence response time might be better)
A read operation can return old data
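A minimal sketch of such a non-ID query (Python with pymongo; an aggregation pipeline is the usual modern stand-in for map/reduce, and the database, collection, and fields here are illustrative assumptions):

```python
from pymongo import MongoClient

orders = MongoClient("mongodb://localhost:27017").grid.orders

# Map/reduce in spirit: emit (product, qty) pairs, then sum per product.
pipeline = [
    {"$group": {"_id": "$product", "total_qty": {"$sum": "$qty"}}},
    {"$sort": {"total_qty": -1}},
]
for row in orders.aggregate(pipeline):
    print(row)
```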
Data Grid
Pros
With a data grid it seems easier to keep data consistent, e.g. the data grid becomes the SOR (System of Record)
As the data grid becomes the SOR, all data should always be available in the grid
Remote Executors
Cons
Additional database means additional overhead & system/application requirements
With a huge amount of data and sharding in place, any kind of query can take a lot of time
Couchbase Server is a very good replacement for Oracle Coherence, particularly for enterprise-class applications. Orbitz is a great example, where a large number of Coherence nodes were replaced by 70 Couchbase nodes.
You can read more about the Coherence replacement here: http://gigaom.com/cloud/balancing-oracle-and-open-source-at-orbitz/
Slides from an Orbitz presentation about Couchbase are also available here: http://www.slideshare.net/Couchbase/t1-s6-oww-usescouchbase
Pros:
High availability of nodes using replication and failover (avoid cold cache scenarios)
Sub-millisecond latencies (built-in object-level cache based on memcached)
High read/write throughput (very low granularity of locking) (http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-708169.pdf)
Strong consistency at a document / item level
TTL / Expiry per document / item
Cons:
Difficult to achieve consistency across multi-document updates; it can be achieved using sentinels (http://www.amainhobbies.com/FromTheCEO/2012/09/09/invalidating-couchbase-cache-entries-with-sentinels/)
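For the document-level consistency listed in the pros, here is a hedged sketch of optimistic concurrency via CAS (the Couchbase Python SDK 4.x API is assumed; the bucket, key, and credentials are illustrative):

```python
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions, ReplaceOptions

cluster = Cluster(
    "couchbase://localhost",
    ClusterOptions(PasswordAuthenticator("user", "password")),
)
collection = cluster.bucket("grid").default_collection()

result = collection.get("order::42")
doc = result.content_as[dict]
doc["status"] = "shipped"

# Fails if another writer changed the document since our get (CAS mismatch).
collection.replace("order::42", doc, ReplaceOptions(cas=result.cas))
```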
It can, but so can a pen and paper system.
The question is whether it will be an acceptable replacement. That wholly depends on the situation. In some cases a NoSQL solution is faster and more scalable than a relational solution, but in some situations it is essential to have some kind of support for longer-running transactions and relational constraints.
It depends.
You already gave the pros and cons in detail...
As iwein said, it depends...
What queries did the existing relational system force on you?
We know that partitioning in NoSQL DBs is easier than in relational DBs...
So if you switch to Mongo, you can extend your system's performance in a cheaper and quicker way...
If people are happy with your Oracle system now, don't touch it :)
Yes - NoSQL can replace it. But a lot depends on what you are trying to do.
If you just need a simple document store with easy key based lookups - NoSQL is a no-brainer.
If you need an enterprise-class solution with paid-for support and features such as custom aggregation, entry processors, etc., maybe Coherence is what you want.
I've seen people build custom NoSQL solutions on top of Coherence - which is a really expensive thing to do.
What are the pros and cons of MongoDB (document-based), HBase (column-based) and Neo4j (graph-based)?
I'm particularly interested to know some of the typical use cases for each one.
What are good examples of problems that graphs can solve better than the alternative?
Maybe any Slideshare or Scribd worthy presentation?
MongoDB
Scalability: Highly available and consistent, but sucks at relations and many distributed writes. Its primary benefit is storing and indexing schemaless documents. Document size is capped at 4 MB (16 MB in later versions), and indexing only makes sense for limited depth. See http://www.paperplanes.de/2010/2/25/notes_on_mongodb.html
Best suited for: Tree structures with limited depth
Use Cases: Diverse Type Hierarchies, Biological Systematics, Library Catalogs
Neo4j
Scalability: Highly available but not distributed. Powerful traversal framework for high-speed traversals in the node space. Limited to graphs around several billion nodes/relationships. See http://highscalability.com/neo4j-graph-database-kicks-buttox
Best suited for: Deep graphs with unlimited depth and cyclical, weighted connections
Use Cases: Social Networks, Topological analysis, Semantic Web Data, Inferencing
HBase
Scalability: Reliable, consistent storage in the petabytes and beyond. Supports very large numbers of objects with a limited set of sparse attributes. Works in tandem with Hadoop for large data processing jobs. http://www.ibm.com/developerworks/opensource/library/os-hbase/index.html
Best suited for: directed, acyclic graphs
Use Cases: Log analysis, Semantic Web Data, Machine Learning
I know this might seem like an odd place to point to, but Heroku has recently gone nuts with their NoSQL offerings and has an OK overview of many of the current projects. It is in no way a SlideShare presentation, but it will help you start the comparison process:
http://blog.heroku.com/archives/2010/7/20/nosql/?utm_medium=email&utm_source=EmailBlast&utm_content=619506254&utm_campaign=HerokuSeptemberNewsletter-VersionB&utm_term=NoSQLHerokuandYou
Check out this at-a-glance comparison of NoSQL DBs:
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
MongoDB:
MongoDB is a document database, unlike a relational database. Documents store semi-structured data such as JSON objects (schema-free). A short sketch of this schema freedom follows after the lists below.
Key features:
Schema can change over evolution of application
Full indexing
Load balancing & Data sharding
Data replication
Consistency & partition tolerance (CP) in CAP theory (Consistency, Availability, Partition tolerance)
When to use:
Real time analytics
High speed logging
Semi structured data management
When not to use:
Highly transactional applications with strong ACID properties (Atomicity, Consistency, Isolation & Durability); an RDBMS is preferred in this use case
Operating on data sets involving relations: foreign keys, etc.
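The promised sketch of schema-free storage (Python with pymongo; the database, collection, and documents are illustrative assumptions):

```python
from pymongo import MongoClient

events = MongoClient("mongodb://localhost:27017").app.events

# Two documents with different shapes live in the same collection.
events.insert_one({"type": "click", "x": 10, "y": 20})
events.insert_one({"type": "signup", "email": "a@example.com",
                   "plan": {"name": "free", "trial_days": 14}})

print(events.count_documents({"type": "signup"}))
```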
HBASE:
HBase is an open-source, non-relational, distributed column-family database
Key features:
It provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection)
Supports variable schema where each row is different
Can serve as the input and output for MapReduce job
Compression, in-memory operation, and Bloom filters on a per-column basis (a Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set)
Achieves CP in CAP theory
When to use HBase:
If you're loading data by key, searching data by key (or range), serving data by key, or querying data by key (see the sketch after this list)
Storing data by row that doesn’t conform well to a schema (variable schema)
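A hedged sketch of that key-based access pattern from Python via the happybase client (assumes an HBase Thrift server is running and a 'users' table with column family 'info' already exists):

```python
import happybase

connection = happybase.Connection("localhost")  # talks to the Thrift server
table = connection.table("users")

# Write and read a row by key; columns live under a column family.
table.put(b"user#1001", {b"info:name": b"Ada", b"info:country": b"UK"})
print(table.row(b"user#1001"))

# Range scan over a contiguous block of row keys.
for key, data in table.scan(row_start=b"user#1000", row_stop=b"user#2000"):
    print(key, data)
```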
When not to use HBase:
For relational analytics
Full table scans
Data to be aggregated, analyzed by rows instead of columns
Neo4j:
Neo4j is a graph database using the property graph data model (data is stored as a graph: nodes and relationships, both with properties). A sketch of a traversal query follows at the end of this section.
Key features:
Supports full ACID(Atomicity, Consistency, Isolation and Durability) rules
Supports indexes by using Apache Lucene
Schema free, bottom-up data model design
High scalability has been achieved due to compact storage and memory caching available for graphs
When to use:
Master data management
Network and IT Operations
Real time recommendations
Fraud detection
Social networks (like Facebook)
When not to use:
Bulk queries/Scans
If your application requires Partitioning & Sharding of data
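The promised traversal sketch (the official neo4j Python driver is assumed; the URI, credentials, and graph contents are illustrative):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Friends-of-friends: the kind of traversal a graph model handles natively.
    result = session.run(
        "MATCH (me:Person {name: $name})-[:FRIEND]->()-[:FRIEND]->(fof) "
        "RETURN DISTINCT fof.name AS name",
        name="Alice",
    )
    for record in result:
        print(record["name"])

driver.close()
```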
Have a look at a comparison of various NoSQL technologies in this article
Sources:
Wikipedia, SlideShare, Cloudera, Tutorials Point, Neo4j
You could also evaluate a multi-model DBMS, the second generation of NoSQL products. With a multi-model DBMS you don't have to make all the compromises of choosing just one model; you can use more than one.
The first multi-model NoSQL is OrientDB.
A pretty decent article here on MongoDB and NoRM (.NET extensions for MongoDB):
http://lukencode.com/2010/07/09/getting-started-with-mongodb-and-norm/