Sharing an object between Node.js servers with a memcached / Couchbase cluster

I was looking for a way to share an object across a cluster of several nodes, and after a bit of research I thought it would be best to use Redis pub/sub. Then I saw that Redis doesn't support clustering yet, which means that a system based on Redis would have a single point of failure. Since high availability is a key requirement for me, that solution is not applicable.
At the moment, I am looking into 2 other solutions for this issue:
Memcached
Couchbase
I have 2 questions:
On top of which solution would it be more efficient to simulate pub/sub?
Which is better when keeping clusters in mind?
I was hoping that someone out there has faced similar issues and can share their experience.

I think it's a bad idea to use memcached or Couchbase for pub/sub. Neither provides built-in pub/sub functions, and implementing pub/sub on the app side can generate a lot of ops/sec against the memcached/Couchbase server, which results in slow performance.
Couchbase stores data on disk, so for temporary storage it's better to use memcached. It will be faster and will not load your disk.
If you can avoid that "pub/sub" and use memcached/Couchbase just as a simple HA shared key-value store, do it. It will be much better than pub/sub.
When you install Couchbase Server it provides 2 types of buckets: couchbase (with disk persistence, the ability to create views, etc.) and memcached (in-memory key-value storage only). Both types of buckets act the same way in clusters. Couchbase also supports memcached API calls, so you don't need to change code to test both variants.
I've tried to use a memcached provider for socket.io "pub/sub" sharing, but as I mentioned before it's ugly. In my case there were only a few node.js servers with socket.io, so instead of sharing I implemented something like "p2p messaging" between the servers on top of sockets.
UPD: If you have such a big amount of data, it may be better not to have one shared store but to use something like sharding with a "predictable" data location.
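A "predictable" data location can be sketched as hashing each key to pick its server, so that every node computes the same owner without asking a shared store. This is only a minimal illustration: the server addresses and the `shard_for` helper are hypothetical and do not come from the answer above.

```python
import hashlib
from typing import Sequence


def shard_for(key: str, servers: Sequence[str]) -> str:
    """Pick the server that owns `key` using a stable hash of the key."""
    # hashlib gives the same digest in every process, unlike the built-in
    # hash(), which is randomised per interpreter run for strings.
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


# Hypothetical server list; every application node must use the same order.
SERVERS = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]

if __name__ == "__main__":
    for key in ("session:1", "session:2", "session:3"):
        print(key, "->", shard_for(key, SERVERS))
```

Plain modulo hashing remaps most keys whenever the server list changes; consistent hashing is the usual refinement if servers are added or removed often.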

Related

Distributed database which allows custom CRDT merging

I'm rather new to distributed databases, though I have already studied related literature (e.g. CAP theorem, CRDT) and implemented some POCs to allow scaling my application horizontally.
However, I now face a challenging problem. In order to scale the app horizontally, communication between services is done via a distributed queue. As background, I require a custom CRDT method to keep the data eventually consistent, and I require my application to work like a cache (remotely related to Redis).
The challenge now is that I also need to persist the data. That requires me to keep the data in the application cache and the database eventually consistent. I've checked Cassandra and saw a ticket [1] where somebody tried to add support for custom CRDT merge functions (which, as I mentioned, I require for a reason). That never made it into Cassandra, and it seems to have a few issues to resolve.
What are my options, either in the form of a concrete distributed database engine allowing custom merging, or an algorithm that could help solve the problem (e.g. in the form of a DB trigger or something like this)?
[1] https://issues.apache.org/jira/browse/CASSANDRA-6412
As far as I know, there are very few databases that allow you to specify your own custom conflict resolution algorithms. Tbh, the only one I really found - disclaimer: I'm not a Microsoft advocate - is Azure Cosmos DB. It has a MongoDB-compatible API and can be configured to use a master-master replication strategy, where you need to specify your own conflict resolution algorithm (using JavaScript). You can use it to define your own merge operation.
If you look beyond database-native solutions to application-level ones, there are several tools, e.g. Akka (available in both JVM and .NET versions), which enables you to write custom CRDTs inside its distributed-data module. The JVM version additionally supports multi-datacenter persistence, which is conceptually closer to how commutative CRDTs work and can be integrated with a Cassandra backend.
I've implemented a MerkleClock CRDT in my merkle-crdt repository.
One approach: when you update a database record's column, fetch the column's current value, merge it with the CRDT holding your current state, and then, when you save, serialise the merged CRDT as JSON and store it in the database.
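The fetch, merge, serialise and store cycle described above can be sketched with a much simpler CRDT than a MerkleClock. The example below uses a grow-only counter (G-Counter) as a stand-in, and a plain dict plays the role of the database column. All names here (`merge`, `save_merged`, the `db` dict) are hypothetical and are not taken from the merkle-crdt repository.

```python
import json
from typing import Dict


def merge(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """G-Counter merge: keep the per-replica maximum.

    Taking the maximum is commutative, associative and idempotent, which is
    what lets replicas converge regardless of merge order.
    """
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


def value(state: Dict[str, int]) -> int:
    """The counter's value is the sum of all per-replica counts."""
    return sum(state.values())


def save_merged(db: Dict[str, str], row_id: str, local: Dict[str, int]) -> Dict[str, int]:
    """Fetch the stored JSON, merge it with the local state, store the result."""
    stored = json.loads(db.get(row_id, "{}"))  # fetch the column's value
    merged = merge(stored, local)              # merge with the current state
    db[row_id] = json.dumps(merged, sort_keys=True)  # serialise and save
    return merged


if __name__ == "__main__":
    db: Dict[str, str] = {}  # stand-in for a table column holding JSON text
    save_merged(db, "page-views", {"replica-a": 2})
    merged = save_merged(db, "page-views", {"replica-a": 1, "replica-b": 3})
    print(db["page-views"], value(merged))
```

Against a real database, the read-merge-write step would presumably need to run inside a transaction or use a compare-and-set, otherwise two concurrent writers can overwrite each other's merge result.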

OrientDB in Azure

We would like to use OrientDB Graph in an Azure environment. Does anybody have experience using it? We would also like to know whether OrientDB's own high availability is required in the Azure cloud. Azure already offers high availability for Azure Storage, Azure Drive and SQL. I understand that they have replication and load balancing built in.
This is super important because we prefer not to get into the business of replication and infrastructure management.
Thanks
You can spin up 2 or more machines, install OrientDB on them, and then configure them together as a distributed cluster. However, I haven't been able to find any simpler or easier way. I am interested in this topic too.
Azure does have features such as geo-replication, which protects your data against a major data-center incident but doesn't provide any performance benefit and will not make the database highly available.
Although Azure is pretty reliable, Microsoft will occasionally reboot servers for updates, so to protect against downtime you can use availability sets so that, of your 2 or more servers, one will always be online. This does, however, need to be used in conjunction with database replication and ideally load balancing.
It's also worth noting that OrientDB recommends clusters have an odd number of servers as this can prevent conflicts when synchronising data after a communication issue between the servers.
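One common reason for the odd-number advice shows up in simple quorum arithmetic. The sketch below assumes the cluster uses a majority quorum to decide which side of a network split may keep accepting writes; that is an assumption made for illustration, not something stated above, and the helper names are hypothetical.

```python
def majority(n: int) -> int:
    """Smallest number of servers that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many servers can be lost while a majority is still reachable."""
    return n - majority(n)


def split_has_winner(side_a: int, side_b: int) -> bool:
    """After a network split, does either side still hold a majority?"""
    total = side_a + side_b
    return side_a >= majority(total) or side_b >= majority(total)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, "servers: majority", majority(n), "tolerates", tolerated_failures(n))
```

Under that assumption a fourth server tolerates no more failures than three do, and an even cluster can split into two equal halves with no majority on either side, which an odd cluster cannot.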
I am using it on Amazon, and I had to create a Java project to monitor HTTP requests, inserts and queries. Queries are very fast, but inserting data takes longer.
I recommend this type of graph database model to decrease query time. Also, OrientDB handles empty fields very well compared to other databases.
If you need help with the Java project, reply to this post and I'll help you.
I hope it helps. Good luck.

RDBMS persistence for couchbase

Folks,
We are evaluating distributed caching solutions for our application. We started by looking at Memcache, then expanded to look at Couchbase. One of our key requirements is the ability to back up the (in-memory) cache reliably to an RDBMS and to restore from it in case of node/cluster failure.
Our preferred option would be to have a configuration switch in couchbase that would cause it to back up new entries to RDBMS.
What we would like to avoid is writing application code that sends cache entries/refreshes explicitly to RDBMS.
Can anyone tell me if couchbase (cluster) can be configured to do so?
Thanks.
-Raj
Couchbase cannot be configured to write through to an RDBMS for backup. What you should take a look at is the Couchbase bucket, not the memcached bucket. The Couchbase bucket uses the memcached layer as a cache and provides replication and persistence out of the box. With this setup you do not need a separate RDBMS, because Couchbase takes care of all of the persistence for you and replicates your data, so that if you have server failures you can simply fail over any failed nodes and promote replica nodes to active ones. Take a look at this page http://www.couchbase.com/couchbase-server/features and, if you have any other architecture questions, I would recommend posting them on the Couchbase forums http://www.couchbase.com/forums where some of the developers can give you more in-depth answers.

NoSQL as local storage for logging and tracing

We are developing an application that will run on many physical servers. We want to use NoSQL for logging and tracing since it does not require structured data.
We don't want centralized logging.
Can we install a NoSQL database (any one) on each server and store logging/tracing details there? Will NoSQL impact my actual processes on the server? Is it a good idea to do this?
Problem 1: Data Collection
Many people use NoSQL solutions for storing application logs. The first challenge you may face is how to collect a huge amount of data from various data sources reliably and with ease of management. One concern with not having a log collection layer is lock contention in the database caused by high write throughput.
So having a log collection layer is recommended. There are several open-source log collector implementations, such as syslog, Fluentd, Scribe, and Flume :)
Problem 2: Storage & Processing
The next big problem is how to store and process the data. The backend infrastructure requires a lot of changes as the data volume increases. At first you can use MongoDB to store all of your data, but at some point you may end up using Apache Hadoop to build a massively scalable architecture.
Here's an example architecture that uses Fluentd for log collection and MongoDB for log storage and processing.
Here are some links on putting Apache logs into Amazon S3, MongoDB, or Hadoop HDFS with Fluentd.
Store Apache Logs into Amazon S3
Store Apache Logs into MongoDB
Fluentd + HDFS: Instant Big Data Collection
Disclaimer: I'm a committer on the Fluentd project.
This is definitely a good idea; NoSQL suits this better than SQL.
In logging and tracing the volume of data is high, and the rate of data retrieval is also high.
For logging and tracing you need complex reports for analysis, so NoSQL is the better fit for you.
NoSQL also supports distributed environments, so you can set up infrastructure in different geographic locations.

Is NoSQL suitable for Selling Tickets Web Application?

I want to write a highly scalable web application for selling event tickets. I want to use a NoSQL database, like Bigtable or MongoDB, and a cloud service like Google App Engine (GAE) or Amazon Elastic Compute Cloud (Amazon EC2).
Is it possible, using this type of database, to be sure that two clients will not be able to buy a ticket for the same seat simultaneously? Or will I have to use an RDBMS and forget about Google App Engine?
Things like GAE's datastore can still support transactional semantics, for example:
http://code.google.com/appengine/docs/python/datastore/transactions.html
So yes, it is possible to do what you're seeking to do. (Note - GAE's Datastore is not exactly NoSQL, since it uses SQL-like queries.)
I have a problem with this question. Not all NoSQL databases are created equal, and different NoSQL databases store data in different ways. Generally, the thing you should be worried about is whether data is actually written to disk and not just into memory. Most NoSQL databases can do this, but not by default. Let's just say this is not a problem: you can usually tell a database like MongoDB or Cassandra to write data to disk, and you can even tell it the minimum number of servers the data should be written to.
The problem is that you may not get true transactional support. When you deal with ecommerce it's important to have all-or-nothing transactions, where several operations either all succeed or are all rolled back. There must be absolutely no chance that only part of your data is saved. For example, if you need to write data to more than one table (collection or document in NoSQL lingo) and the server goes down in the middle of the process, so that your data is only written to one table, that's usually unacceptable in ecommerce.
I am not familiar with all NoSQL databases, but the ones I know don't have this option yet.
MySQL, on the other hand, does.
If transactional support, or the lack of it, does not bother you, then I think it's OK to use NoSQL as long as you tell it to save data to disk and not just into memory.
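The all-or-nothing behaviour described in this answer can be sketched with a relational transaction. The example uses Python's built-in sqlite3 module rather than MySQL so that it runs without a server, and the two-table schema (`seats`, `orders`) and the function names are hypothetical, chosen only for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with one unsold seat (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE seats (seat TEXT PRIMARY KEY, sold INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE orders (seat TEXT NOT NULL, buyer TEXT NOT NULL);
        INSERT INTO seats (seat) VALUES ('A1');
        """
    )
    return conn


def buy_ticket(conn: sqlite3.Connection, seat: str, buyer: str) -> bool:
    """Write to both tables, or to neither if the seat is already sold."""
    try:
        # The connection context manager commits if the block succeeds and
        # rolls back every statement in it if an exception escapes.
        with conn:
            conn.execute("INSERT INTO orders (seat, buyer) VALUES (?, ?)", (seat, buyer))
            cur = conn.execute(
                "UPDATE seats SET sold = 1 WHERE seat = ? AND sold = 0", (seat,)
            )
            if cur.rowcount != 1:
                raise ValueError("seat already sold")
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    conn = make_db()
    print(buy_ticket(conn, "A1", "alice"))
    print(buy_ticket(conn, "A1", "bob"))
    print(conn.execute("SELECT buyer FROM orders").fetchall())
```

The second purchase fails at the `UPDATE`, and the rollback also discards the order row it had already inserted, so no partial write survives.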
The answer is 'maybe.'
Depending on what you're trying to build, you may be able to use some of the techniques in this post:
http://kylebanker.com/blog/2010/06/07/mongodb-inventory-transactions/
Using something like get_or_insert you can easily ensure that two clients are not receiving the same resource simultaneously on Google App Engine. However, there are big differences between GAE and an RDBMS, so make sure you study them further before you make a decision.
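The idea behind get-or-insert can be illustrated without App Engine: the first caller to claim a key creates the record, and every later caller gets that existing record back. The sketch below is an in-process stand-in guarded by a lock, not the GAE datastore API; `SeatRegistry` and `try_reserve` are hypothetical names.

```python
import threading
from typing import Dict


class SeatRegistry:
    """In-memory stand-in for a store offering an atomic get-or-insert."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._owners: Dict[str, str] = {}

    def get_or_insert(self, seat: str, buyer: str) -> str:
        """Return the seat's owner, recording `buyer` only if it had none."""
        # The lock makes the check and the write a single atomic step.
        with self._lock:
            return self._owners.setdefault(seat, buyer)


def try_reserve(registry: SeatRegistry, seat: str, buyer: str) -> bool:
    """A buyer has the seat only if the stored owner is that same buyer."""
    return registry.get_or_insert(seat, buyer) == buyer


if __name__ == "__main__":
    registry = SeatRegistry()
    print(try_reserve(registry, "A1", "alice"))
    print(try_reserve(registry, "A1", "bob"))
```

On App Engine the datastore itself has to provide the atomicity that the lock provides here, which is why the check and the write must happen in one datastore operation rather than as a separate read followed by a write.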