How can I do Memcached HA? - nosql

I use memcached to save about 5MB data. About forty percent of the data updates ever seconds, that causes about 280 qps between memcached client and server, with get and set each takes half of the queries. Except the realization of such great data transaction, I also meet the HA problem.
Before I choose memcached, I've also looked at Redis. But it seems to be only one thread and not likely to performs well on data persistence. Also the client for Redis is not that easy and reach when it stands with Memcached.
But how can I do HA with memcached? How should I keep data duplication between the Master and slave memcached server; And when a memcached server crashed, it follows with the data consistency problem. Are there already some good tools for memcached HA, or if there is a better NoSql Database instead of memcached?

Maybe you can use repcached(repcached.lab.klab.org) or magent(https://code.google.com/p/memagent/)

Related

Redis Streams for implementing a Messaging System (chat) app versus traditional approaches

I'm implementing a chat app, which will support both one-on-one conversation and Group conversations.
So far the direction was to use Redis Pub/Sub with PostgreSQL as the cold storage, and WebSocket being the transport.
Every user will fetch the history from postgresql upon launch (up until the timestamp of the WebSocket+redis connection), and then subscribe to channels that go by their own user_id.
However, having a roundtrip to a DMBS with each new message sounds a bit strange, while definitely doable and legit.
So I decided to examine other approaches. One possible approach was to use Kafka and eliminate the need for an DBMS altogether.
It sounds viable and comes with its own set of advantages.
But turns out there's a new kid on the block - Redis Streams.
From what I gather, it is actually quite similar to Kafka in this specific scenario (chat).
It has many nice features that sound very convenient for implementing a chat system.
And now I am trying to understand whether Streams + disk persistency is the wise way to go versus Kafka versus PostgreSQL+Redis pub/sub
The main aspects in consideration are:
Performance. Postgres and Kafka both operate on disk, meaning slower than the in-memory operations in the case of redis. On the other hand , obviously the messages must be persisted and available at all times and events, so redis will be persisted to disk. Wouldn't that negate the whole in-memory performance gain?
And even if not - would the performance gain under peak load and a big data base be noticeable?
Memory / Costs. With redis these two are closely tied together. As a small startup, the efforts are focused on being ready to cope with sudden scale peaks (up to a million users), but at the same time - the costs should be minimized.
Is storing millions of messages in Streams going to be too memory-costly which in turn will translate to financially-costly?
Recovery, Reliability & Availability, Persistency. with Postgres, even a single instance can handle a big traffic load, but it can also offer master-slave setups and also consistency. Can Redis be a match to that? Also, with a DMBS I can be assured that the data is there to stay. Can I know that with redis?
Scaling.

In database terms, what is the difference between replication and decentralisation?

I am currently researching different databases to use for my next project. I was wanting to use a decentralized database. For example Apache Cassandra claims to be decentralized. MongoDB however says it uses replication. From what I can see, as far as these databases are concerned, replication and decentralization are basically the same thing. Is that correct or is there some difference/feature between decentralization and replication that I'm missing?
Short answer, no, replication and decentralization are two different things. As a simple example, let's say you have three instances (i1, i2 and i3) that replicate the same data. You also have a client that fetches data from only i1. If i1 goes down you will still have the data replicated to i2 and i3 as a backup. But since i1 is down the client has no way of getting the data. This an example of a centralized database with single point of failure.
A centralized database has a centralized location that the majority of requests goes through. It could, as in Mongo DB's case be instances that route queries to instances that can handle the query.
A decentralized database is obviously the opposite. In Cassandra any node in a cluster can handle any request. This node is called the coordinator for the request. The node then reads/writes data from/to the nodes that are responsible for that data before returning a result to the client.
Decentralization means that there should be no single point of failure in your application architecture. These systems will provide deployment scheme, where there's no leader (or master) elected during the service life-cycle. These are often deliver services in a peer-to-peer fashion.
Replication means, that simply your data is copied over to another server instance to ensure redundancy and failure tolerance. Client requests can still be served from copies, but your system should ensure some level of "consistency", when making copies.
Cassandra serves requests in a peer-to-peer fashion. Meaning that clients can initiate requests to any node participating in the cluster. It also provides replication and tunable consistency.
MongoDB offers master/slave deployment, so it's not considered as decentralized. You can deliver a multi-master, to ensure that requests can still be served if master node goes down. It also provides replication out-of-the box.
Links
Cassandra's tunable consistency
MongoDB's master-slave configuration
Introduction to Cassandra's architecture

Propagate change in distributed in-memory cache

I've an application deployed on a cluster of 1000 commodity boxes. While starting, each instance of the application loads a non-trivial amount of data from database and uses this as cache. During a day, around 20%of this cached data needs to be updated.
What are the efficient ways of near simultaneous update of in-memory data of entire cluster? I thought of JMX, Zookeeper, but not sure if that would be really efficient/fast enough.
Well assuming you're using Memcached's consistent hashing, go a step further and have each cache replicate to their closest successor. This can lessen the problem but not entirely alleviate it but it's a simple solution, Gossip + CRDTs are another solution, Dynamo and Riak use a combination of Gossip, Consistent Hashing, and CRDTs.

MongoDB - how to best achieve active/active configuration?

I have an application which is very low on writes. I'm therefore interested in deploying a mongo installation which maximizes the read throughput for the hardware I have (3 database servers in one location). I don't really care for redundancy (backups), but would like automatic failover. Additionally, I'm fine with "eventual consistency", and don't mind if data which isn't the latest data is returned.
I've looked into both sharding and replica sets, and as far as I can tell, I don't really need to use sharding as its benefits suit more for applications with many writes.
I therefore went ahead and installed a replica set on the three servers I have, and I then set the reading preference to "Nearest", as that would allow reads to take place on any server.
The problem is, I later read that the client is "sticky" and basically once it has chosen a "nearest" mongo server, it's not likely to change it. Besides, even if it were to "check for nearest" again, it'll probably choose the same one over. This pretty much results in an active/passive configuration, without any load-balancing. I do have two application servers, so if they choose different mongo servers, it might work ok, but say I wanted to have more than 3 mongo servers in the replica set, then any servers besides specific two would be passive.
Basically my question is, what's the best way to have an active/active configuration for my deployment? All I want is for requests to go to free mongo servers rather than busy ones.
One way to force this which I thought of is to create three sharded-clusters (each server participating in all three), where each server is the primary in one of these clusters - but this is still not optimal, because besides the relative complexity involved in this configuration, this also doesn't guarantee complete load balancing (for example, in case all requests at a given moment happen to go to one specific shard).
What's the right way to achieve what I want? If it's not possible to achieve this kind of load balancing with mongo, would you recommend that I go with the sharded-clusters solution?
As you already suspected, scaling reads is not a "one size fits all" problem. Everything will depend on your data, your access patterns, your requirements and probably a few other things only you can determine.
In a nutshell, the main thing to consider is why a single server can't handle your read load. If it's because of the size of your data set and the size of your indexes then sharding your data across three shards will reduce the RAM requirements of each of them (or to put it another way will give you the combined RAM of all three systems). As long as you pick a good shard key (one that will distribute the load approximately evenly across all the systems) you will get almost three times the throughput on targeted queries.
If the main requirement for your reads is to reduce as much as possible the latency of reading the data, then a replica set can serve your purposes well as reading from the "nearest" node will reduce the network round-trip time without changing the duration of the operation on the MongoDB server. This assumes that your writes are infrequent enough or that your application has tolerance of possibly stale data.

PostgreSQL consuming large amount of memory for persistent connection

I have a C++ application which is making use of PostgreSQL 8.3 on Windows. We use the libpq interface.
We have a multi-threaded app where each thread opens a connection and keeps using without PQFinish it.
We notice that for each query (especially the SELECT statements) postgres.exe memory consumption would go up. It goes up as high as 1.3 GB. Eventually, postgres.exe crashes and forces our program to create a new connection.
Has anyone experienced this problem before?
EDIT: shared_buffer is currently set to be 128MB in our conf. file.
EDIT2: a workaround that we have in place right now is to call PQfinish for every transaction. But then, this slows down our processing a bit since establishing a connection every time is quite slow.
In PostgreSQL, each connection has a dedicated backend. This backend not only holds connection and session state, but is also an execution engine. Backends aren't particularly cheap to leave lying around, and they cost both memory and synchronization overhead even when idle.
There's an optimum number of actively working backends for any given Pg server on any given workload, where adding more working backends slows things down rather than speeding it up. You want to find that point, and limit the number of backends to around that level. Unfortunately there's no magic recipe for this, it mostly involves benchmarking - on your hardware and with your workload.
If you need more connections than that, you should use a proxy or pooling system that allows you to separate "connection state" from "execution engine". Two popular choices are PgBouncer and PgPool-II . You can maintain light-weight connections from your app to the proxy/pooler, and let it schedule the workload to keep the database server working at its optimum load. If too many queries come in, some wait before being executed instead of competing for resources and slowing down all queries on the server.
See the postgresql wiki.
Note that if your workload is read-mostly, and especially if it has items that don't change often for which you can determine a reliable cache invalidation scheme, you can also potentially use memcached or Redis to reduce your database workload. This requires application changes. PostgreSQL's LISTEN and NOTIFY will help you do sane cache invalidation.
Many database engines have some separation of execution engine and connection state built in to the core database engine's design. Sybase ASE certainly does, and I think Oracle does too, but I'm not too sure about the latter. Unfortunately, because of PostgreSQL's one-process-per-connection model it's not easy for it to pass work around between backends, making it harder for PostgreSQL to do this natively, so most people use a proxy or pool.
I strongly recommend that you read PostgreSQL High Performance. I don't have any relationship/affiliation with Greg Smith or the publisher*, I just think it's great and will be very useful if you're concerned about your DB's performance.
* ... well, I didn't when I wrote this. I work for the same company now.
The memory usage is not necessarily a problem. PostgreSQL uses shared memory for some caching, and this memory does not count towards the size of the process memory usage until it's actually used. The more you use the process, the larger parts of the shared buffers will be active in it's address space.
If you have a large value for shared_buffers, this will happen. If you have it too large, the process can run out of address space and crash, yes.
The problem is probably that you don't close the transaction,
In PostgreSQL even if you do only selects without DML it runs in transaction which need to be rollback.
By adding rollback at the end of the transaction will reduce your memory problem