Master turning into slave after redis sentinel failover - kubernetes

I am trying out redis master slave replication using sentinels.
I have 1 master and 2 slaves and 3 sentinels. All running as different pods.
My issue is:
1) When I delete the master pod, one of the slaves turns to master.
2) Ideally, there should be a new master now with only one slave. For some reason, the master IP that I deleted turns into a slave of the newly elected master.
3) Is this a desirable behaviour? Because when the sentinel shows there are 2 slaves to the newly elected master, in fact there exists only 1 slave pod because the master pod is deleted.
Below are the logs:
:M 29 May 2020 07:32:19.569 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
8:M 29 May 2020 07:32:19.569 # Server initialized
8:M 29 May 2020 07:32:19.569 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
8:M 29 May 2020 07:32:19.569 * Ready to accept connections
8:M 29 May 2020 07:33:22.329 * Replica 172.16.2.12:6379 asks for synchronization
8:M 29 May 2020 07:33:22.329 * Full resync requested by replica 172.16.2.12:6379
8:M 29 May 2020 07:33:22.329 * Starting BGSAVE for SYNC with target: disk
8:M 29 May 2020 07:33:22.330 * Background saving started by pid 12
12:C 29 May 2020 07:33:22.333 * DB saved on disk
12:C 29 May 2020 07:33:22.334 * RDB: 2 MB of memory used by copy-on-write
8:M 29 May 2020 07:33:22.355 * Background saving terminated with success
8:M 29 May 2020 07:33:22.356 * Synchronization with replica 172.16.2.12:6379 succeeded
8:M 29 May 2020 07:33:23.092 * Replica 172.16.4.48:6379 asks for synchronization
8:M 29 May 2020 07:33:23.092 * Full resync requested by replica 172.16.4.48:6379
8:M 29 May 2020 07:33:23.092 * Starting BGSAVE for SYNC with target: disk
8:M 29 May 2020 07:33:23.092 * Background saving started by pid 13
13:C 29 May 2020 07:33:23.097 * DB saved on disk
13:C 29 May 2020 07:33:23.097 * RDB: 2 MB of memory used by copy-on-write
8:M 29 May 2020 07:33:23.158 * Background saving terminated with success
8:M 29 May 2020 07:33:23.158 * Synchronization with replica 172.16.4.48:6379 succeeded
8:M 29 May 2020 07:36:26.866 # Connection with replica 172.16.2.12:6379 lost.
8:M 29 May 2020 07:36:27.871 # Connection with replica 172.16.4.48:6379 lost.
8:S 29 May 2020 07:36:37.926 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
8:S 29 May 2020 07:36:37.927 * REPLICAOF 172.16.2.12:6379 enabled (user request from 'id=21 addr=172.16.3.135:56721 fd=9 name=sentinel-5261eb21-cmd age=10 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=151 qbuf-free=32617 obl=36 oll=0 omem=0 events=r cmd=exec')
8:S 29 May 2020 07:36:37.933 # CONFIG REWRITE executed with success.
8:S 29 May 2020 07:36:38.284 * Connecting to MASTER 172.16.2.12:6379
8:S 29 May 2020 07:36:38.284 * MASTER <-> REPLICA sync started
8:S 29 May 2020 07:36:38.284 * Non blocking connect for SYNC fired the event.
8:S 29 May 2020 07:36:38.285 * Master replied to PING, replication can continue...
8:S 29 May 2020 07:36:38.285 * Trying a partial resynchronization (request 563ca4b5f67f1e24c129729eaa74800b108902a3:52568).
8:S 29 May 2020 07:36:38.321 * Full resync from master: f21b8c35187b109b621605b375ef62e61b301834:52901
8:S 29 May 2020 07:36:38.321 * Discarding previously cached master state.
8:S 29 May 2020 07:36:38.356 * MASTER <-> REPLICA sync: receiving 178 bytes from master
8:S 29 May 2020 07:36:38.356 * MASTER <-> REPLICA sync: Flushing old data
8:S 29 May 2020 07:36:38.356 * MASTER <-> REPLICA sync: Loading DB in memory
8:S 29 May 2020 07:36:38.356 * MASTER <-> REPLICA sync: Finished with success
I am using redis 5.0. Earlier I was using redis 4.0 but I did not face such issue.

Related

Redis standalone stop working (timeouts) on master replica sync

We use a redis cache in out kubernetes cluster which stops working really randomly. It's a Standalone version based on this image: bitnami/redis:6.0.15
As custom parameters we use:
MASTER true
REDIS_AOF_ENABLED no
Every time when the redis stop working I see the following logs:
Jul 5 13:30:27 redis-0 redis 1:M 05 Jul 2022 11:30:27.060 * 10000 changes in 60 seconds. Saving...
Jul 5 13:30:27 redis-0 redis 1:M 05 Jul 2022 11:30:27.090 * Background saving started by pid 364
Jul 5 13:31:34 redis-0 redis 364:C 05 Jul 2022 11:31:34.307 * DB saved on disk
Jul 5 13:31:34 redis-0 redis 364:C 05 Jul 2022 11:31:34.341 * RDB: 431 MB of memory used by copy-on-write
Jul 5 13:31:34 redis-0 redis 1:M 05 Jul 2022 11:31:34.488 * Background saving terminated with success
Jul 5 13:32:35 redis-0 redis 1:M 05 Jul 2022 11:32:35.022 * 10000 changes in 60 seconds. Saving...
Jul 5 13:32:35 redis-0 redis 1:M 05 Jul 2022 11:32:35.052 * Background saving started by pid 365
-----
Jul 5 13:32:40 redis-0 redis 1:S 05 Jul 2022 11:32:40.436 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
Jul 5 13:32:40 redis-0 redis 1:S 05 Jul 2022 11:32:40.436 * REPLICAOF 178.20.40.200:8886 enabled (user request from 'id=71457 addr=10.0.16.46:14072 fd=12 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=47 qbuf-free=32721 argv-mem=24 obl=0 oll=0 omem=0 tot-mem=61488 events=r cmd=slaveof user=default')
Jul 5 13:32:41 redis-0 redis 1:S 05 Jul 2022 11:32:41.316 * Connecting to MASTER 178.20.40.200:8886
Jul 5 13:32:41 redis-0 redis 1:S 05 Jul 2022 11:32:41.316 * MASTER <-> REPLICA sync started
Jul 5 13:32:41 redis-0 redis 1:S 05 Jul 2022 11:32:41.362 * Non blocking connect for SYNC fired the event.
Jul 5 13:32:41 redis-0 redis Error 1:S 05 Jul 2022 11:32:41.409 # Error reply to PING from master: '-Reading from master: Connection reset by peer'
Jul 5 13:32:42 redis-0 redis 1:S 05 Jul 2022 11:32:42.316 * Connecting to MASTER 178.20.40.200:8886
Jul 5 13:32:42 redis-0 redis 1:S 05 Jul 2022 11:32:42.317 * MASTER <-> REPLICA sync started
Jul 5 13:32:42 redis-0 redis 1:S 05 Jul 2022 11:32:42.366 * Non blocking connect for SYNC fired the event.
Jul 5 13:32:42 redis-0 redis Error 1:S 05 Jul 2022 11:32:42.415 # Error reply to PING from master: '-Reading from master: Connection reset by peer'
Jul 5 13:32:43 redis-0 redis 1:S 05 Jul 2022 11:32:43.317 * Connecting to MASTER 178.20.40.200:8886
Jul 5 13:32:43 redis-0 redis 1:S 05 Jul 2022 11:32:43.317 * MASTER <-> REPLICA sync started
Jul 5 13:32:43 redis-0 redis 1:S 05 Jul 2022 11:32:43.366 * Non blocking connect for SYNC fired the event.
Jul 5 13:32:43 redis-0 redis Error 1:S 05 Jul 2022 11:32:43.416 # Error reply to PING from master: '-Reading from master: Connection reset by peer'
Jul 5 13:32:44 redis-0 redis 1:S 05 Jul 2022 11:32:44.320 * Connecting to MASTER 178.20.40.200:8886
Jul 5 13:32:44 redis-0 redis 1:S 05 Jul 2022 11:32:44.320 * MASTER <-> REPLICA sync started
Jul 5 13:32:44 redis-0 redis 1:S 05 Jul 2022 11:32:44.370 * Non blocking connect for SYNC fired the event.
Then I see that the queue increase, but I need to kill the pod to restart redis otherwise it will not work anymore.
next: GET 6126674261995698486,
inst: 1,
qu: 0, // queue => waiting operations
qs: 17,
aw: False,
rs: ReadAsync,
ws: Idle,
in: 0, // bytes waiting from input stream
in-pipe: 0,
out-pipe: 0,
serverEndpoint: redis.default.svc.cluster.local:6379,
mc: 1/1/0,
mgr: 10 of 10 available, // tread pool
clientName: production-9bbd94544-nlmv7,
IOCP: (Busy=0,Free=1000,Min=5,Max=1000), // no busy threads
WORKER: (Busy=14,Free=32753,Min=256,Max=32767),
v: 2.2.4.27433```
```Timeout performing GET (3000ms),
next: 2865582319381864083,
inst: 0,
qu: 0,
qs: 333,
aw: False,
rs: ReadAsync,
ws: Idle,
in: 0,
in-pipe: 0,
out-pipe: 0,
serverEndpoint: redis.default.svc.cluster.local:6379,
mc: 1/1/0,
mgr: 10 of 10 available,
clientName: production-58c7874fd8-tdcpz,
IOCP: (Busy=0,Free=1000,Min=1,Max=1000),
WORKER: (Busy=3,Free=32764,Min=256,Max=32767),
v: 2.2.4.27433
next: GET 6126674261995698486,
inst: 47,
qu: 0,
qs: 21368,
aw: False,
rs: ReadAsync,
ws: Idle,
in: 0,
in-pipe: 0,
out-pipe: 0,
serverEndpoint: redis.default.svc.cluster.local:6379,
mc: 1/1/0,
mgr: 10 of 10 available,
clientName: production-9bbd94544-nlmv7,
IOCP: (Busy=0,Free=1000,Min=5,Max=1000),
WORKER: (Busy=162,Free=32605,Min=256,Max=32767),
v: 2.2.4.27433```
Has anyone an idea?
Thank you.
Can you check if the redis service is running? kubectl get service/redis. It seems like the service is unable to receive traffic, which is possible if there are no pods to receive it.

MongoDB data corruption on a replica set

I am working with a MongoDB database running in a replica set.
Unfortunately, I noticed that the data appears to be corrupted.
There should be over 10,000 documents in the database. However, there are several thousand records that are not being returned in queries.
The total count DOES show the correct total.
db.records.find().count()
10793
And some records are returned when querying by RecordID (a custom sequence integer).
db.records.find({"RecordID": 10049})
{ "_id" : ObjectId("5dfbdb35c1c2a400104edece")
However, when querying for a records that I know for a fact should exist, it does not return anything.
db.records.find({"RecordID": 10048})
db.records.find({"RecordID": 10047})
db.records.find({"RecordID": 10046})
The issue appears to be very sporadic, and in some cases entire ranges of records are missing. The entire range from RecordIDs 1500 to 8000 is missing.
Questions: What could be the cause of the issue? What can I do to troubleshoot this issue further and recover the corrupted data? I looked into running repairDatabase but that is for standalone instances only.
UPDATE:
More info on replication:
rs.printReplicationInfo()
configured oplog size: 5100.880859375MB
log length start to end: 14641107secs (4066.97hrs)
oplog first event time: Wed Mar 03 2021 05:21:25 GMT-0500 (EST)
oplog last event time: Thu Aug 19 2021 17:19:52 GMT-0400 (EDT)
now: Thu Aug 19 2021 17:20:01 GMT-0400 (EDT)
rs.printSecondaryReplicationInfo()
source: node2-examplehost.com:27017
syncedTo: Thu Aug 19 2021 17:16:42 GMT-0400 (EDT)
0 secs (0 hrs) behind the primary
source: node3-examplehost.com:27017
syncedTo: Thu Aug 19 2021 17:16:42 GMT-0400 (EDT)
0 secs (0 hrs) behind the primary
UPDATE 2:
We did a restore from a backup and somehow it looks like it fixed the issue.
We did a restore from a backup and somehow it looks like it fixed the issue.

Recovering Kafka Cluster from a disk full error

We have a 3-node Kafka cluster. For data storage we have 2 mounted disks - /data/disk1 and /data/disk2 in each of the 3 nodes. The log.dirs setting in kafka.properties is:
log.dirs=/data/disk1/kafka-logs,/data/disk2/kafka-logs
It so happened that in one of nodes Node1, the disk partition /data/disk2/kafka-logs got 100% full.
The reason this happened is - we were replaying data from filebeat to a kafka topic and a lot of data got pushed in a very short time. I've temporarily changed the retention for that topic to 1 day from 7 days and so the topic size has become normal.
The problem is - in Node1 which has got /data/disk2/kafka-logs 100% full, kafka process just wouldn't start and emit the error:
Jul 08 12:03:29 broker01 kafka[23949]: [2019-07-08 12:03:29,093] INFO Recovering unflushed segment 0 in log my-topic-0. (kafka.log.Log)
Jul 08 12:03:29 broker01 kafka[23949]: [2019-07-08 12:03:29,094] INFO Completed load of log my-topic-0 with 1 log segments and log end offset 0 in 2 ms (kafka.log.Log)
Jul 08 12:03:29 broker01 kafka[23949]: [2019-07-08 12:03:29,095] ERROR There was an error in one of the threads during logs loading: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code (kafka.log.LogManager)
Jul 08 12:03:29 broker01 kafka[23949]: [2019-07-08 12:03:29,101] FATAL [Kafka Server 1], Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
Jul 08 12:03:29 broker01 kafka[23949]: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
Jul 08 12:03:29 broker01 kafka[23949]: at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
Jul 08 12:03:29 broker01 kafka[23949]: at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
Jul 08 12:03:29 broker01 kafka[23949]: at org.apache.kafka.common.record.FileLogInputStream$FileChannelLogEntry.loadRecord(FileLogInputStream.java:135)
Jul 08 12:03:29 broker01 kafka[23949]: at org.apache.kafka.common.record.FileLogInputStream$FileChannelLogEntry.record(FileLogInputStream.java:149)
Jul 08 12:03:29 broker01 kafka[23949]: at kafka.log.LogSegment.$anonfun$recover$1(LogSegment.scala:22
The replication factor for most topics is either 2 or 3. So, I'm wondering if I can do the following:
Change replication factor to 2 for all the topics (Node 2 and Node 3 are running fine)
delete some stuff from Node1.
Restart Node 1
Change replication factor back to 2 or 3 as was the case initially.
Does anyone know of a better way or a better suggestion?
Update: Step 1 and 4 not needed. Just 2 and 3 are enough if you have replicas.
Your problem (and solution accordingly) is similar to that described in this question: kafka 0.9.0.1 fails to start with fatal exception
The easiest and fastest way is to delete part of the data. When the broker started, the data is replicated with the new retention.
So, I'm wondering if I can do the following...
Answering your question specifically - yes, you can do the steps you described in sequence and this will help to return the cluster to a consistent state.
To prevent this from happening in the future, you can try using the parameter log.retention.bytes instead of log.retention.hours, although I believe that the use of size-based retention policy for logs is not the best choice, because as my practice shows in most cases it is necessary to know the time at least of which the topic will be stored in the cluster.

MongoDB SECONDARY becoming RECOVERING at nighttime

I am running a conventional MongoDB Replica Set consisting of 3 members (member1 in datacenter A, member2 and member3 in datacenter B).
member1 is the current PRIMARY and I am adding members 2 and 3 via rs.add(). They are performing their initial sync and become SECONDARY very soon. Everything is fine all day long and the replication delay of both members is 0 seconds until 2 AM at nighttime.
Now: Every night at 2 AM both members shift into the RECOVERING state and stop replication at all, which leads to a replication delay of hours when I am having a look into rs.printSlaveReplicationInfo() in the morning hours. At around 2 AM there are no massive inserts or maintenance tasks known to me.
I get the following log entries on the PRIMARY:
2015-10-09T01:59:38.914+0200 [initandlisten] connection accepted from 192.168.227.209:59905 #11954 (37 connections now open)
2015-10-09T01:59:55.751+0200 [conn11111] warning: Collection dropped or state deleted during yield of CollectionScan
2015-10-09T01:59:55.869+0200 [conn11111] warning: Collection dropped or state deleted during yield of CollectionScan
2015-10-09T01:59:55.870+0200 [conn11111] getmore local.oplog.rs cursorid:1155433944036 ntoreturn:0 keyUpdates:0 numYields:1 locks(micros) r:32168 nreturned:0 reslen:20 134ms
2015-10-09T01:59:55.872+0200 [conn11111] end connection 192.168.227.209:58972 (36 connections now open)
And, which is more interesting, I get the following log entries on both SECONDARYs:
2015-10-09T01:59:55.873+0200 [rsBackgroundSync] repl: old cursor isDead, will initiate a new one
2015-10-09T01:59:55.873+0200 [rsBackgroundSync] replSet syncing to: member1:27017
2015-10-09T01:59:56.065+0200 [rsBackgroundSync] replSet error RS102 too stale to catch up, at least from member1:27017
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet our last optime : Oct 9 01:59:23 5617035b:17f
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet oldest at member1:27017 : Oct 9 01:59:23 5617035b:1af
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet error RS102 too stale to catch up
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet RECOVERING
Which is also striking - the start of the oplog "resets" itself every night at around 2 AM:
configured oplog size: 990MB
log length start to end: 19485secs (5.41hrs)
oplog first event time: Fri Oct 09 2015 02:00:33 GMT+0200 (CEST)
oplog last event time: Fri Oct 09 2015 07:25:18 GMT+0200 (CEST)
now: Fri Oct 09 2015 07:25:26 GMT+0200 (CEST)
I am not sure if this is somehow correlated to the issue. I am also wondering that such a small delay (Oct 9 01:59:23 5617035b:17f <-> Oct 9 01:59:23 5617035b:1af) lets the members become stale.
Could this also be a server (VM host) time issue or is it something completely different? (Why is the first oplog event being "resetted" every night and not "shifting" to a timestamp like NOW minus 24 hrs?)
What can I do to investigate and to avoid?
Upping the oplog size should solve this (per our comments).
Some references for others who run into this issue
Workloads that Might Require a Larger Oplog Size
Error: replSet error RS102 too stale to catch up link1 & link2

Couchbase: 20k items stuck in Tap Queue

We are currently evaluating couchbase as a memcached replacement in the first place. Our setup looks like this:
php -> localhost moxi -> couchbase bucket (Total bucket size = 10240 MB (2048 MB x 5 nodes with replica count 1))
The Servers have 16GB RAM and are SSD backed.
We were inserting at about 400 ops/s and had no problem for a few days. When we reached about 13 million items. We found out that we forgot to implement the delete function in our testsetup and a lot of keys had no expiration set.
To start over again we flushed the bucket through the webinterface. This where our problems began.
We started to see that we had temp ooms, back-offs, and tap queue was filled with 20k items. the drain and fill rate was nearly the same. See attached screenshot
What also catched our eye was that node 4 had only 220k items, where everyone else had around 1.39M
Somehow it looks like the replication messed up something, but im relatively new to couchbase. Any hints, suggestions? - See more at: http://www.couchbase.com/communities/q-and-a/20k-items-stuck-tap-queue#sthash.v9MxNnTk.dpuf
The problem was solved for a short time, after removing the failing node from the cluster.
So now with this four nodes left in the cluster, after some hours the same happend again with another node. We tried setting the now failing node into FailOver state. That fixed the problem again, but after Re-Adding the node, the same phenomenon happened again on that node.
Other things we realized are:
* Three out of four nodes have thousands of items in their TAP replication queue, but one
("the failing one") has 0.
* Also three out of four nodes have a back-off rate of around 400, but one ("the failing one") has 0.
* Only the failing one has a massive amount of "Temp OOMs per second", but the other three have 0.
The phenomenon seems to disappear, if we lower the load to the servers by disabling the couchbase-writes for one out of two software project writing to couchbase.
But if we enable the writes again, after around 10 minutes we can see this in the memcached.log on the failing node:
Tue Dec 17 12:29:05.010547 CET 3: (CENSORED) Received error[86] from mccouch for unknown
Tue Dec 17 12:29:05.010576 CET 3: (CENSORED) Retry notify CouchDB of update, vbucket=277 rev=522
Tue Dec 17 12:29:08.748103 CET 3: (CENSORED) Received error[86] from mccouch for unknown
Tue Dec 17 12:29:08.748257 CET 3: (CENSORED) Retry notify CouchDB of update, vbucket=321 rev=948
Tue Dec 17 12:40:17.354448 CET 3: (CENSORED) Received error[86] from mccouch for unknown
Tue Dec 17 12:40:17.354476 CET 3: (CENSORED) Retry notify CouchDB of update, vbucket=303 rev=491
This error then happens around 5 times within four hours:
Tue Dec 17 14:19:32.145071 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
And after these four hours it starts spamming this instantly (Maybe, because the load increased heavily, because in the evening our page generates much more load than in the morning/noon) together with this "error from mccouch":
Tue Dec 17 16:42:30.875343 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:42:36.493317 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:25.239876 CET 3: (CENSORED) Received error[86] from mccouch for unknown
Tue Dec 17 16:43:25.240052 CET 3: (CENSORED) Retry notify CouchDB of update, vbucket=296 rev=483
Tue Dec 17 16:43:25.903997 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:31.906178 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:36.913045 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:42.919114 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:48.920354 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:54.924017 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:44:00.928572 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
We have no clue what is happening here, why this failing node seems to reject every replication and throwing this error.
Do you have any idea?
Thanks for all your help and greetings from Cologne,
Andy!
Seeing as you just want to delete all items in the Bucket have you tried just deleting and re-creating the bucket?
This will be much faster than flush, as flush actually needs to send a delete request for every document in the bucket.
I can't find it in the docs at the moment, but I think Flush is not really recommended with the latest versions.
you are not writing what is your operating system. If it's Linux try to check maximum amount of open sockets for user running the Couchbase. Check the file /etc/security/limits.conf.
the command for check on Linux is: ulimit -Hn.
Hope that helps.
Daniel
I think you should try these settings:
http://docs.couchbase.com/couchbase-manual-2.1/#specifying-backoff-for-replication