I am trying to understand the distributed architecture of OrientDB from:
https://github.com/orientechnologies/orientdb/wiki/Distributed-Architecture
https://github.com/orientechnologies/orientdb/wiki/Distributed-Routing
https://github.com/orientechnologies/orientdb/wiki/Replication
It's quite clear that replication and routing are supported, but I don't understand the following:
1. Can data be sharded across the nodes of a cluster? Each node can be 'responsible' for a particular portion of the graph, but that's still not the same as storing only a portion of the graph.
2. A query can be routed to the appropriate node, but is it possible to execute a query in parallel across nodes? For instance, to process a traversal query, every node would independently process the portion of the graph it is responsible for, and the results would later be combined.
Regarding the first question: yes, since version 1.6. Autosharding will be implemented in OrientDB 2.0.
Sources:
http://orientechnologies.blogspot.it/2013/09/new-orientdb-replication-engine-and.html
https://github.com/orientechnologies/orientdb/issues/1522
What's the best AWS database for the requirements below?
I need to store around 50,000 - 100,000 entries in the database.
Each entry would have a String as the key and a JSON array as the value.
I should be able to retrieve the JSON array using the key.
The size of the JSON data is around 20-30 KB.
I expect around 10,000 - 40,000 reads per hour.
Around 50,000 - 100,000 writes per week.
I have to consider the cost as well.
Ease of integration/development
I am a bit confused between MongoDB, DynamoDB, and PostgreSQL. Please share your thoughts on this.
DynamoDB:-
DynamoDB is a fully managed proprietary NoSQL database service that supports key-value and document data structures. For the typical use case you have described in the OP, it would serve the purpose.
DynamoDB can handle more than 10 trillion requests per day and support
peaks of more than 20 million requests per second.
DynamoDB has a good AWS SDK for all operations. The read and write capacity units can be configured per table.
DynamoDB tables using on-demand capacity mode automatically adapt to
your application’s traffic volume. On-demand capacity mode instantly
accommodates up to double the previous peak traffic on a table. For
example, if your application’s traffic pattern varies between 25,000
and 50,000 strongly consistent reads per second where 50,000 reads per
second is the previous traffic peak, on-demand capacity mode instantly
accommodates sustained traffic of up to 100,000 reads per second. If
your application sustains traffic of 100,000 reads per second, that
peak becomes your new previous peak, enabling subsequent traffic to
reach up to 200,000 reads per second.
One point to note is that it doesn't allow you to query the table on non-key attributes. This means that if you don't know the hash key, you may need to do a full table scan to get the data. However, there is a Secondary Index option which you can explore to get around this problem. You should know all the query access patterns of your use case before you design the table, so you can make an informed decision.
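As a rough illustration of that access pattern, here is a minimal boto3 sketch that stores a JSON array under a string key and reads it back by that key. The `entries` table name and `entry_key` partition key are hypothetical; the table is assumed to already exist with `entry_key` as its hash key.

```python
import json
import boto3

# Hypothetical table: the partition (hash) key is the string attribute "entry_key".
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("entries")

def put_entry(key, json_array):
    # Store the JSON array serialized as a string attribute under the key.
    table.put_item(Item={"entry_key": key, "payload": json.dumps(json_array)})

def get_entry(key):
    # Look the item up by its partition key; returns None if it does not exist.
    response = table.get_item(Key={"entry_key": key})
    item = response.get("Item")
    return json.loads(item["payload"]) if item else None
```

Anything you need to look up by an attribute other than `entry_key` would need a secondary index or a scan, which is why knowing your query access patterns up front matters.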
MongoDB:-
MongoDB is not a fully managed service on AWS. However, you can set up the database yourself using AWS services such as EC2, VPC, IAM, EBS, etc. This requires some AWS cloud experience. The other option is to use the MongoDB Atlas service.
MongoDB is more flexible in terms of querying, and it has powerful aggregation functions. There are also lots of tools available to query the database directly and explore the data, much like you would with SQL.
In terms of a Java API, Spring Data MongoDB can be used to perform the typical database operations. There are also plenty of open source frameworks for MongoDB in other languages (for example Mongoose for Node.js).
MongoDB has support for many programming languages, and the APIs are mature as well.
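For comparison, a minimal PyMongo sketch of the same key-to-JSON-array use case might look like this; the connection string, database, and collection names are assumptions.

```python
import pymongo

# Hypothetical connection string, database, and collection.
client = pymongo.MongoClient("mongodb://localhost:27017")
entries = client["appdb"]["entries"]

def put_entry(key, json_array):
    # Use the string key as _id and upsert so repeated writes overwrite the value.
    entries.replace_one({"_id": key}, {"_id": key, "payload": json_array}, upsert=True)

def get_entry(key):
    # Look the document up by its _id; returns None if it does not exist.
    doc = entries.find_one({"_id": key})
    return doc["payload"] if doc else None
```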
PostgreSQL:-
PostgreSQL is available as a fully managed database on AWS through Amazon RDS.
PostgreSQL has become the preferred open source relational database
for many enterprise developers and start-ups, powering leading
geospatial and mobile applications. Amazon RDS makes it easy to set
up, operate, and scale PostgreSQL deployments in the cloud.
I don't think I need to write much about this database and its APIs. It is a very mature database and has good APIs.
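If you went the RDS PostgreSQL route, the same use case could be covered by a single key-to-JSONB table. A minimal psycopg2 sketch follows; the endpoint, credentials, and table name are made up for illustration.

```python
import json
import psycopg2

# Hypothetical RDS endpoint and credentials.
conn = psycopg2.connect(host="mydb.example.rds.amazonaws.com",
                        dbname="appdb", user="app", password="secret")

with conn, conn.cursor() as cur:
    # One row per entry: a text key and a JSONB payload.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS entries (
            entry_key TEXT PRIMARY KEY,
            payload   JSONB NOT NULL
        )
    """)

def put_entry(key, json_array):
    with conn, conn.cursor() as cur:
        # Upsert so repeated writes for the same key overwrite the value.
        cur.execute("""
            INSERT INTO entries (entry_key, payload) VALUES (%s, %s)
            ON CONFLICT (entry_key) DO UPDATE SET payload = EXCLUDED.payload
        """, (key, json.dumps(json_array)))

def get_entry(key):
    with conn, conn.cursor() as cur:
        cur.execute("SELECT payload FROM entries WHERE entry_key = %s", (key,))
        row = cur.fetchone()
        # psycopg2 converts the JSONB column back into a Python list/dict.
        return row[0] if row else None
```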
Points to consider:-
Query Access Pattern
Easy setup
Database maintenance
API and frameworks
Community support
I have used MongoDB but am new to Cassandra. The applications I have worked on with MongoDB are not very large, and their read and write operations are not very intensive, so MongoDB worked well for me in that scenario. Now I am building a new application (with some features like Stack Overflow's: voting, total views, suggestions, comments, etc.) that will have lots of concurrent write operations on the same item in the database (in the future!). According to the information I gathered online, MongoDB is not the best choice for this (but Cassandra is). The problem I am finding with Cassandra is picking the right data model:
Construct Models around your queries. Not around relations and
objects.
I also looked at the option of using Mongo + Redis. Is it efficient to update the MongoDB database first and then update Redis for every one of the concurrent write requests to the same data item?
I want to verify which one will be best for solving this problem: Mongo + Redis, or Cassandra?
Any help would be highly appreciated.
Picking a database is very subjective. I'd say that modern MongoDB 3.2+ using the new WiredTiger Storage Engine handles concurrency pretty well.
When selecting a distributed NoSQL (or SQL) datastore, you can generally only pick two of these three:
Consistency (all nodes see the same data at the same time)
Availability (every request receives a response about whether it succeeded or failed)
Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)
This is called the CAP Theorem.
MongoDB has C and P, Cassandra has A and P. Cassandra is also a Column-Oriented Database, and will take a bit of a different approach to storing and retrieving data than, say, MongoDB does (which is a Document-Oriented Database). The reality is that either database should be able to scale to your needs easily. I would worry about how well the data storage and retrieval semantics fit your application's data model, and how useful the features provided are.
Deciding which database is best for your app is highly subjective, and borders on an "opinion-based question" on Stack Overflow.
Using Redis as an LRU cache is definitely a component of an effective scaling strategy. The typical model is, when reading cacheable data, to first check if the data exists in the cache (Redis), and if it does not, to query it from the database, store the result in the cache, and return it. While maybe appropriate in some cases, it's not common to just write everything to both Redis and the database. You need to figure out what's cacheable and how long each cached item should live, and either cache it at read time as I explained above, or at write time.
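A minimal sketch of that cache-aside pattern with redis-py and PyMongo might look like the following. The connection details, key prefix, and TTL are assumptions, and on writes it invalidates the cached entry rather than writing the new value into both stores.

```python
import json
import redis
import pymongo

# Hypothetical connection details; documents are cached under "item:<id>" keys.
cache = redis.Redis(host="localhost", port=6379)
items = pymongo.MongoClient("mongodb://localhost:27017")["appdb"]["items"]

CACHE_TTL_SECONDS = 300  # assumed lifetime of a cached item

def get_item(item_id):
    # 1. Try the cache first.
    cached = cache.get(f"item:{item_id}")
    if cached is not None:
        return json.loads(cached)
    # 2. On a miss, read from MongoDB, the source of truth ...
    doc = items.find_one({"_id": item_id})
    if doc is None:
        return None
    # 3. ... and populate the cache for subsequent reads.
    cache.setex(f"item:{item_id}", CACHE_TTL_SECONDS, json.dumps(doc, default=str))
    return doc

def update_item(item_id, fields):
    # Write to MongoDB first, then drop the cached copy so the next read refreshes it.
    items.update_one({"_id": item_id}, {"$set": fields})
    cache.delete(f"item:{item_id}")
```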
It really depends on what your application is for. For write-intensive apps it is much better to go with Cassandra.
We call Greenplum and Redshift MPP, or "shared nothing", but I really don't understand why.
Does it mean that during a multi-level join query each host computes only on its own data, so the hosts never exchange data with each other and there is no shuffle? Or is it something else?
What is the crux of "shared nothing"?
Shared nothing means that no two servers share the same data (aside from mirrors for high availability). A simple example would be a two-node cluster where the data is distributed by gender_code: node1 would have all of the males and node2 would have all of the females.
In the real world, you have way more nodes than just two so you distribute the data by something like an ID column. This gives you an even distribution of data across all of the nodes.
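To make the distribution key concrete, here is a small sketch of what that looks like as Greenplum DDL, issued via psycopg2 since Greenplum speaks the PostgreSQL wire protocol; the host, database, and table names are made up.

```python
import psycopg2

# Hypothetical Greenplum master host and database.
conn = psycopg2.connect(host="gp-master", dbname="analytics", user="gpadmin")

with conn, conn.cursor() as cur:
    # DISTRIBUTED BY chooses the column whose hash decides which segment stores
    # each row: an ID column spreads rows evenly across all segments, whereas a
    # low-cardinality column like gender_code would lump rows onto a few segments.
    cur.execute("""
        CREATE TABLE orders (
            order_id    BIGINT,
            customer_id BIGINT,
            amount      NUMERIC
        ) DISTRIBUTED BY (order_id)
    """)
```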
As you can probably guess, the optimizer has to be pretty smart to reduce the amount of data movement needed to execute a query. It also needs to slice the query into multiple parts so that it can execute the multiple slices of the query at once. Greenplum has been around for over 10 years and has a mature optimizer which can execute a wide variety of queries pretty well.
"Shared Nothing" is a description of what resources are shared between the processes running in parallel. So you may have shared memory approaches running on a single host, shared storage between multiple hosts or self contained systems with their own processing, RAM and storage. A deployment based a a few of these self contained systems would be described as "shared nothing".
In a shared-nothing system each node will store a subset of the data. Query planners in these systems try to do as much work as possible on the same host the data is stored on and move or shuffle as little data as possible (on Greenplum systems these steps in the query plan are called motions).
We call MPP "shared nothing" as a way to compare Greenplum to something with a "shared everything" architecture like Oracle RAC, which also has multiple servers in a cluster, but they all connect back to the same set of data files.
For my application I need to periodically move old data from one MongoDB server to another (i.e., two distinct servers). I also want to be able to query that data as if it were all in the same database.
In short, I want to be able to see two MongoDB instances (on two different servers) as one, and to be able to control when and where the data is stored.
I read about the concept of sharding and chunks, and quickly came across the moveChunk function, which seems able to do exactly what I want.
The problem is that it seems to be impossible to configure such an architecture in MongoDB. Am I missing something here?
Archiving Deleted Documents
For the problem of keeping deleted documents, there is no way to achieve this with built-in features/mechanisms like sharding or replication. The only way to do it is to handle that case manually, e.g. keep a separate collection for deleted documents and simply move documents to that collection instead of deleting them.
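A minimal PyMongo sketch of that manual approach (database and collection names are assumptions) might be:

```python
import datetime
import pymongo

# Hypothetical database; "documents" is the live collection, "documents_archive"
# holds everything that would otherwise be deleted.
db = pymongo.MongoClient("mongodb://localhost:27017")["appdb"]

def archive_documents(query):
    # Copy each matching document into the archive collection, then remove it
    # from the live collection instead of deleting it outright.
    for doc in db["documents"].find(query):
        doc["archived_at"] = datetime.datetime.utcnow()
        db["documents_archive"].insert_one(doc)
        db["documents"].delete_one({"_id": doc["_id"]})
```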
For your overall problem of moving the data, you have the following two options:
Sharding
Using sharding, you will split your data into pieces which will be stored on two (in your case) different servers. In this scenario you can use the moveChunk method, as you mentioned. But this method is very tricky, because you will need to disable the built-in automatic balancer to have full manual control over your chunks. In any case, this is not recommended by MongoDB:
Only use the moveChunk in special circumstances such as preparing your sharded cluster for an initial ingestion of data, or a large bulk import operation. In most cases allow the balancer to create and balance chunks in sharded clusters
Besides, this will only let you split the data; to actually reach your goal you would end up with one full shard and one empty one.
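For completeness, and keeping the warning above in mind, a manual chunk move issued from PyMongo against a mongos router might look roughly like this; the namespace, shard key value, and shard name are made up for illustration.

```python
import pymongo

# Connect to a mongos router (hypothetical host); moveChunk is an admin command.
client = pymongo.MongoClient("mongodb://mongos-host:27017")

# Move the chunk containing this shard-key value to the shard named "shardB".
client.admin.command(
    "moveChunk",
    "appdb.documents",                 # sharded namespace (assumed)
    find={"archive_date": "2015-01"},  # any document in the chunk to move
    to="shardB",                       # destination shard name
)
```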
Replication
The replication approach is much safer and easier to achieve. You can simply configure a replica set and add your second server to it.
If the data is too big, you can configure your second server as hidden, so that no reads will be sent to that server and no inconsistent data will be returned. Once the initial replication has finished, you will have a copy of your data on both servers.
As for using both servers as a single one: if you need to balance the requests between the two, you can set your readPreference to secondary, which will ensure that all reads are sent to the secondary server, while writes by default go to the primary.
In this case your code will be unaware of which server it is querying. You just call your client methods, and the rest is handled behind the scenes by the driver.
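A minimal PyMongo sketch of that setup (the hosts and replica set name are assumptions) might be:

```python
import pymongo

# Hypothetical replica set of the two servers; readPreference=secondary routes
# all reads to the secondary, while writes always go to the primary.
client = pymongo.MongoClient(
    "mongodb://server1:27017,server2:27017/?replicaSet=rs0",
    readPreference="secondary",
)

entries = client["appdb"]["entries"]
doc = entries.find_one({"_id": "some-key"})            # read served by the secondary
entries.insert_one({"_id": "new-key", "payload": []})  # write goes to the primary
```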
Conclusion
So my advice would be to use the replication approach, as it is the cleaner, less painful, and safer solution.
I have a system writing logs into MongoDB (about 1 million logs per day). On a weekly basis I need to calculate some statistics on those logs. Since the calculations are very processor- and memory-intensive, I want to copy the collection I'm working on to a powerful offsite machine. How do I keep the offsite collection up to date without copying everything? I also modify the offsite collection by storing statistics within its elements, i.e. adding fields like {"algorithm_1": "passed"} or {"stat1": 3.1415}. Is replication right for my use case, or should I investigate other alternatives?
As to your question, yes, replication does partially resolve your issue, with limitations.
So there are several ways I know to resolve your issue:
The half-database, half-application way.
Replication keeps your data up to date, but it doesn't allow you to modify the secondary nodes (which hold what you call the "offsite collection"). So you have to do the calculation on the secondary and write the results to the primary: you need an application that runs the aggregation against the secondary and writes the result back to the primary.
This requires that you run an application, PHP, .NET, Python, whatever.
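A minimal Python sketch of that pattern; the URIs, database and collection names, and the aggregation itself are only illustrative.

```python
import pymongo
from pymongo import ReadPreference

# Hypothetical replica set: reads for the heavy aggregation are pinned to the
# secondary, while the computed statistics are written through the primary.
client = pymongo.MongoClient("mongodb://primary:27017,secondary:27017/?replicaSet=rs0")

logs = client["logs"]["entries"].with_options(read_preference=ReadPreference.SECONDARY)
weekly_stats = client["logs"]["weekly_stats"]  # writes always go to the primary

# Example aggregation: count log entries per level (illustrative field name).
pipeline = [{"$group": {"_id": "$level", "count": {"$sum": 1}}}]
results = list(logs.aggregate(pipeline))

# Persist the result back through the primary.
weekly_stats.insert_one({"week": "2016-W01", "counts": results})
```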
The full-server way
Since you are going to have multiple servers anyway, you can consider using sharding for faster storage and do the calculation online directly. This way you don't even need to run a separate application: Map/Reduce does the calculation and writes the output into a new collection. I DON'T recommend this solution though, because of the Map/Reduce performance issues in current versions.
The full-application way
Basically you still use replication for reading, but the server doesn't do any calculation beyond querying the data. You can use a capped collection or a TTL index to remove expired data, and you simply enumerate the documents one by one in your application and do the calculation yourself.
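A short PyMongo sketch of that last approach; the collection and field names, and the 30-day expiry, are assumptions.

```python
import pymongo

# Hypothetical log database and collection.
db = pymongo.MongoClient("mongodb://localhost:27017")["logs"]

# A TTL index removes each document 30 days after its "created_at" timestamp.
db["entries"].create_index("created_at", expireAfterSeconds=30 * 24 * 3600)

# The application then walks the remaining documents and computes its own statistics
# (here: a simple sum over an assumed "duration_ms" field).
total_duration = 0
for doc in db["entries"].find({}, {"duration_ms": 1}):
    total_duration += doc.get("duration_ms", 0)
```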