Distributed state-machine's zookeeper ensemble fails while processing parallel regions with error KeeperErrorCode = BadVersion - apache-zookeeper

Background :
Diagram :
Statemachine uml state diagram
We have a normal state machine as depicted in diagram that monitors spring-BATCH micro-service(deployed on streams source/processor/sink design) ,for each batch that is started .
We receive sequence of REST calls to internally fire events per batch id on respective batch's machine object. i.e. per batch id the new state machine object is created .
And each machine is having n number of parallel regions(representing spring batch's chunks ) also as shown in the diagram.
REST calls made are using multi-threaded environment where 2 simultaneous calls of same batchId may come for different region Ids of BATCHPROCESSING state .
Up till now we had a single node(single installation) running of this state machine micro-service but now we want to deploy it on multiple instances ; to receive REST calls .
For this , the Distributed State Machine is what we want to introduce . We have below configuration in place for Running Distributed State Machine .
#Configuration
#EnableStateMachine
public class StateMachineUMLWayConfiguration extends
StateMachineConfigurerAdapter<String, String> {
..
..
#Override
public void configure(StateMachineModelConfigurer<String,String> model)
throws Exception {
model
.withModel()
.factory(stateMachineModelFactory());
}
#Bean
public StateMachineModelFactory<String,String> stateMachineModelFactory() {
StorehubBatchUmlStateMachineModelFactory factory =null;
try {
factory = new StorehubBatchUmlStateMachineModelFactory
(templateUMLInClasspath,stateMachineEnsemble());
} catch (Exception e) {
LOGGER.info("Config's State machine factory got exception
:"+factory);
}
LOGGER.info("Config's State machine factory method Called:"+factory);
factory.setStateMachineComponentResolver(stateMachineComponentResolver());
return factory;
}
#Override
public void configure(StateMachineConfigurationConfigurer<String,
String>
config) throws Exception {
config
.withDistributed()
.ensemble(stateMachineEnsemble());
}
#Bean
public StateMachineEnsemble<String, String> stateMachineEnsemble() throws
Exception {
return new ZookeeperStateMachineEnsemble<String, String>(curatorClient(), "/batchfoo1", true, 512);
}
#Bean
public CuratorFramework curatorClient() throws Exception {
CuratorFramework client =
CuratorFrameworkFactory.builder().defaultData(new byte[0])
.retryPolicy(new ExponentialBackoffRetry(1000, 3))
.connectString("localhost:2181").build();
client.start();
return client;
}
StorehubBatchUmlStateMachineModelFactory's build method:
#Override
public StateMachineModel<String, String> build(String batchChunkId) {
Model model = null;
try {
model = UmlUtils.getModel(getResourceUri(resolveResource(batchChunkId)).getPath());
} catch (IOException e) {
throw new IllegalArgumentException("Cannot build model from resource " + resource + " or location " + location, e);
}
UmlModelParser parser = new UmlModelParser(model, this);
DataHolder dataHolder = parser.parseModel();
ConfigurationData<String, String> configurationData = new ConfigurationData<String, String>( null, new SyncTaskExecutor(),
new ConcurrentTaskScheduler() , false, stateMachineEnsemble,
new ArrayList<StateMachineListener<String, String>>(), false,
null, null,
null, null, false,
null , batchChunkId, null,
null ) ;
return new DefaultStateMachineModel<String, String>(configurationData, dataHolder.getStatesData(), dataHolder.getTransitionsData());
}
Created new custom service interface level method in place of DefaultStateMachineService.acquireStateMachine(machineId)
#Override
public StateMachine<String, String> acquireDistributedStateMachine(String machineId, boolean start) {
synchronized (distributedMachines) {
DistributedStateMachine<String,String> distributedStateMachine = distributedMachines.get(machineId);
StateMachine<String,String> distMachineDelegateX = null;
if (distributedStateMachine == null) {
StateMachine<String, String> machine = stateMachineFactory.getStateMachine(machineId);
distributedStateMachine = (DistributedStateMachine<String, String>) machine;
}
distributedMachines.put(machineId, distributedStateMachine);
return handleStart(distributedStateMachine, start);
}
}
Problem :
Now problem is that , micro service deployed on single instance runs successfully even for events received by it are from multi threaded environment where one thread hits with the event REST call belonging to Region 1 and simultaneously other thread comes for region 2 of same batch . Machine goes ahead in synch ,with successful parallel regions' processing , till its last state i.e BATCHCOMPLETED .
Also we checked at zookeeper side that at last the BATCHCOMPLETED STATE was being recorded in node's current version.
But , besides 1st instance , when we keep same micro service app-jar deployed on some other location to treat it as a 2nd instance of micro-service that is also now running to accept event REST calls(say by listening at another tomcat port 9002) ; it fails in middle somewhere randomly . This failure happens randomly after any one of the events among parallel regions is fired and when ensemble.setState() is being called internally on state change of that event .
It gives following error:
[36mo.s.s.support.AbstractStateMachine [0;39m [2m:[0;39m Interceptors threw exception, skipping state change
org.springframework.statemachine.StateMachineException: Error persisting data; nested exception is org.springframework.statemachine.StateMachineException: Error persisting data; nested exception is org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
at org.springframework.statemachine.zookeeper.ZookeeperStateMachineEnsemble.setState(ZookeeperStateMachineEnsemble.java:241) ~[spring-statemachine-zookeeper-2.0.1.RELEASE.jar!/:2.0.1.RELEASE]
at org.springframework.statemachine.ensemble.DistributedStateMachine$LocalStateMachineInterceptor.preStateChange(DistributedStateMachine.java:209) ~[spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.StateMachineInterceptorList.preStateChange(StateMachineInterceptorList.java:101) ~[spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.AbstractStateMachine.callPreStateChangeInterceptors(AbstractStateMachine.java:859) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.AbstractStateMachine.switchToState(AbstractStateMachine.java:880) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.AbstractStateMachine.access$500(AbstractStateMachine.java:81) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.AbstractStateMachine$3.transit(AbstractStateMachine.java:335) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.DefaultStateMachineExecutor.handleTriggerTrans(DefaultStateMachineExecutor.java:286) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.DefaultStateMachineExecutor.handleTriggerTrans(DefaultStateMachineExecutor.java:211) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.DefaultStateMachineExecutor.processTriggerQueue(DefaultStateMachineExecutor.java:449) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.DefaultStateMachineExecutor.access$200(DefaultStateMachineExecutor.java:65) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.DefaultStateMachineExecutor$1.run(DefaultStateMachineExecutor.java:323) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:50) [spring-core-4.3.13.RELEASE.jar!/:4.3.13.RELEASE]
at org.springframework.statemachine.support.DefaultStateMachineExecutor.scheduleEventQueueProcessing(DefaultStateMachineExecutor.java:352) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.DefaultStateMachineExecutor.execute(DefaultStateMachineExecutor.java:163) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.AbstractStateMachine.sendEventInternal(AbstractStateMachine.java:603) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.AbstractStateMachine.sendEvent(AbstractStateMachine.java:218) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.ensemble.DistributedStateMachine.sendEvent(DistributedStateMachine.java:108)
..skipping Lines....
Caused by: org.springframework.statemachine.StateMachineException: Error persisting data; nested exception is org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
at org.springframework.statemachine.zookeeper.ZookeeperStateMachinePersist.write(ZookeeperStateMachinePersist.java:113) ~[spring-statemachine-zookeeper-2.0.1.RELEASE.jar!/:2.0.1.RELEASE]
at org.springframework.statemachine.zookeeper.ZookeeperStateMachinePersist.write(ZookeeperStateMachinePersist.java:50) ~[spring-statemachine-zookeeper-2.0.1.RELEASE.jar!/:2.0.1.RELEASE]
at org.springframework.statemachine.zookeeper.ZookeeperStateMachineEnsemble.setState(ZookeeperStateMachineEnsemble.java:235) ~[spring-statemachine-zookeeper-2.0.1.RELEASE.jar!/:2.0.1.RELEASE]
... 73 common frames omitted
Caused by: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) ~[zookeeper-3.4.8.jar!/:3.4.8--1]
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1006) ~[zookeeper-3.4.8.jar!/:3.4.8--1]
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:910) ~[zookeeper-3.4.8.jar!/:3.4.8--1]
at org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159)
Question :
1.So is the configuration mentioned above needs something more to be configured to avoid that exception mentioned above??
Because Both state-machine micro-service instances were tested with the case when they both were connecting to same instance i.e. same string .connectString("localhost:2181").build() or case when they were made to connect to different zookeeper instances(i.e. 'localhost:2181' , 'localhost:2182').
Same exception of BAD VERSION occurs during state machine ensemble's processing in both cases .
2.Also If Batches would run in parallel so their respective machines would need to be created to run in parallel at state-machine micro-service end .
So here , technically new State machine we need for new batchId , running simultaneously .
But looking at the ZookeeperStateMachineEnsemble , One znode path seems to be associated with one ensemble , whenever ensemble object is instantiated once in the main config class ("StateMachineUMLWayConfiguration") .
So is it expected to only use that singleton ensemble instance only? Can't multiple ensembles be created at run-time referencing different znode paths run in parallel to log their respective Distributed State Machine's states to their respective znode paths??
a. Because batches running in parallel would need separate znode paths to be created . Thus due to our attempt of keeping separate znode path per batch , we need separate ensemble to be instantiated per batch's machine. But that seems to be getting into the lock condition while getting connection to znode through curator client.
b. REST call fired for event triggering does not complete , as the machine it acquired is stuck in ensemble to connect .
Thanks in advance .

Related

define non retryable exception in kafka

hi i have a kafka consumer application that uses spring kafka. It is consuming messages batch wise. But before that it was consuming sequentially. When i consume sequentially i used below annotation
#RetryableTopic(
attempts = "3}",
backoff = #Backoff(delay = 1000, multiplier = 2.0),
autoCreateTopics = "false",
topicSuffixingStrategy = TopicSuffixingStrategy.SUFFIX_WITH_INDEX_VALUE,
exclude = {CustomNonRetryableException.class})
In my code i throw CustomNonRetryableException whenever i dont need to retry an exception scenario. For other exceptions it will retry 3 times.
But when i switched to batch processing, i removed above code and used below kafkalistenercontainerfactory.
#Bean
public ConcurrentKafkaListenerContainerFactory<String, Object> kafkaListenerContainerFactory() {
ConcurrentKafkaListenerContainerFactory<String, Object> factory =
new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(consumerFactory());
factory.getContainerProperties().setCommitLogLevel(LogIfLevelEnabled.Level.DEBUG);
factory.getContainerProperties().setMissingTopicsFatal(false);
factory.setBatchListener(true);
factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.BATCH);
factory.setCommonErrorHandler(new DefaultErrorHandler(new DeadLetterPublishingRecoverer(kafkaTemplate(),
(r, e) -> {
return new TopicPartition(r.topic() + "-dlt", r.partition());
}),new FixedBackOff(1000L, 2L)));
return factory;
}
Now what i'm trying to do is apply that CustomNonRetryableException class so that it wont retry 3 times. I want only CustomNonRetryableException throwing scenarios to be retried one time and send to the dlt topic. How can i achieve it?
It's a bug - see this answer Spring Boot Kafka Batch DefaultErrorHandler addNotRetryableExceptions?
It is fixed and will be available in the next release.
Also see the note in that answer about the preferred way to handle errors when using a batch listener, so that only the failed record is retried, instead of the whole batch.

Apache Curator : No leader is getting selected intermittently

I am using Apache Curator Leader Election Recipe : https://curator.apache.org/curator-recipes/leader-election.html in my application.
Zookeeper version : 3.5.7
Curator : 4.0.1
Below are the sequence of steps:
1. Whenever my tomcat server instance is getting up, I create a single CuratorFramework instance(single instance per tomcat server) and start it :
CuratorFramework client = CuratorFrameworkFactory.newClient(connectionString, retryPolicy);
client.start();
if(!client.blockUntilConnected(10, TimeUnit.MINUTES)){
LOGGER.error("Zookeeper connection could not establish!");
throw new RuntimeException("Zookeeper connection could not establish");
}
Create an instance of LSAdapter and start it:
LSAdapter adapter = new LSAdapter(client, <some_metadata>);
adapter.start();
Below is my LSAdapter class :
public class LSAdapter extends LeaderSelectorListenerAdapter implements Closeable {
//<Class instance variables defined>
public LSAdapter(CuratorFramework client, <some_metadata>) {
leaderSelector = new LeaderSelector(client, <path_to_be_used_for_leader_election>, this);
leaderSelector.autoRequeue();
}
public void start() throws IOException {
leaderSelector.start();
}
#Override
public void close() throws IOException {
leaderSelector.close();
}
#Override
public void takeLeadership(CuratorFramework client) throws Exception {
final int waitSeconds = (int) (5 * Math.random()) + 1;
LOGGER.info(name + " is now the leader. Waiting " + waitSeconds + " seconds...");
LOGGER.debug(name + " has been leader " + leaderCount.getAndIncrement() + " time(s) before.");
while (true) {
try {
Thread.sleep(TimeUnit.SECONDS.toMillis(waitSeconds));
//do leader tasks
} catch (InterruptedException e) {
LOGGER.error(name + " was interrupted.");
//cleanup
Thread.currentThread().interrupt();
} finally {
}
}
}
}
When server instance is getting down, close LSAdapter instance(which application is using) and close CuratorFramework client created
CloseableUtils.closeQuietly(lsAdapter);
curatorFrameworkClient.close();
The issue I am facing is that at times, when server is restarted, no leader gets elected. I checked that by tracing the log inside takeLeadership(). I have two tomcat server instances with above code, connecting to same zookeeper quorum and most of the times one of the instance becomes leader but when this issue happens, both of them becomes follower. Please suggest what am I doing wrong.
As I answered on Curator's Jira, you are swallowing the interrupted exception. When you get InterruptedException you must exit your takeLeadership(). In your code example, you are merely resetting the interrupted state and continuing the loop - this will cause an infinite loop of interrupted exceptions, btw. After calling Thread.currentThread().interrupt(); you should exit the while loop.

Modify connector config in KafkaConnect before sending to task

I'm writing a SinkConnector in Kafka Connect and hitting an issue. This connector has a configuration as such :
{
"connector.class" : "a.b.ExampleFileSinkConnector",
"tasks.max" : '1',
"topics" : "mytopic",
"maxFileSize" : "50"
}
I define the connector's config like this :
#Override public ConfigDef config()
{
ConfigDef result = new ConfigDef();
result.define("maxFileSize", Type.STRING, "10", Importance.HIGH, "size of file");
return result;
}
In the connector, I start the tasks as such :
#Override public List<Map<String, String>> taskConfigs(int maxTasks) {
List<Map<String, String>> result = new ArrayList<Map<String,String>>();
for (int i = 0; i < maxTasks; i++) {
Map<String, String> taskConfig = new HashMap<>();
taskConfig.put("connectorName", connectorName);
taskConfig.put("taskNumber", Integer.toString(i));
taskConfig.put("maxFileSize", maxFileSize);
result.add(taskConfig);
}
return result;
}
and all goes well.
However, when starting the Task (in taskConfigs()), if I add this :
taskConfig.put("epoch", "123");
this breaks the whole infrastructure : all connectors are stopped and restarted in an endless loop.
There is no exception or error whatsoever in the connect log file that can help.
The only way to make it work is to add "epoch" in the connector config, which I don't want to do since it is an internal parameter that the connector has to send to the task. It is not intended to be exposed to the connector's users.
Another point I noticed is that it is not possible to update the value of any connector config parameter, apart to set it to the default value. Changing a parameter and sending it to the task produces the same behavior.
I would really appreciate any help on this issue.
EDIT : here is the code of SinkTask::start()
#Override public void start(Map<String, String> taskConfig) {
try {
connectorName = taskConfig.get("connectorName");
log.info("{} -- Task.start()", connectorName);
fileNamePattern = taskConfig.get("fileNamePattern");
rootDir = taskConfig.get("rootDir");
fileExtension = taskConfig.get("fileExtension");
maxFileSize = SimpleFileSinkConnector.parseIntegerConfig(taskConfig.get("maxFileSize"));
maxTimeMinutes = SimpleFileSinkConnector.parseIntegerConfig(taskConfig.get("maxTimeMinutes"));
maxNumRecords = SimpleFileSinkConnector.parseIntegerConfig(taskConfig.get("maxNumRecords"));
taskNumber = SimpleFileSinkConnector.parseIntegerConfig(taskConfig.get("taskNumber"));
epochStart = SimpleFileSinkConnector.parseLongConfig(taskConfig.get("epochStart"));
log.info("{} -- fileNamePattern: {}, rootDir: {}, fileExtension: {}, maxFileSize: {}, maxTimeMinutes: {}, maxNumRecords: {}, taskNumber: {}, epochStart : {}",
connectorName, fileNamePattern, rootDir, fileExtension, maxFileSize, maxTimeMinutes, maxNumRecords, taskNumber, epochStart);
if (taskNumber == 0) {
checkTempFilesForPromotion();
}
computeInitialFilename();
log.info("{} -- Task.start() END", connectorName);
} catch (Exception e) {
log.info("{} -- Task.start() EXCEPTION : {}", connectorName, e.getLocalizedMessage());
}
}
We found the root cause of the issue. The Kafka Connect Framework is actually behaving as designed - the problem has to do with how we are trying to use the taskConfigs configuration framework.
The Problem
In our design, the FileSinkConnector sets an epoch in its start() lifecycle method, and this epoch is passed down to its tasks by way of the taskConfigs() lifecycle method. So each time the Connector's start() lifecycle method runs, different configuration is generated for the tasks - which is the problem.
Generating different configuration each time is a no-no. It turns out that the Connect Framework detects differences in configuration and will restart/rebalance upon detection - stopping and restarting the connector/task. That restart will call the stop() and start() methods of the connector ... which will (of course) produces yet another configuration change (because of the new epoch), and the vicious cycle is on!
This was an interesting and unexpected issue ... due to a behavior in Connect that we had no appreciation for. This is the first time we tried to generate task configuration that was not a simple function of the connector configuration.
Note that this behavior in Connect is intentional and addresses real issues of dynamically-changing configuration - like a JDBC Sink Connector that spontaneously updates its configuration when it detects a new database table it wants to sink.
Thanks to those who helped us !

How do I collect from a flux without closing the stream

My usecase is to create an reactive endpoint like this :
public Flux<ServerEvent> getEventFlux(Long forId){
ServicePoller poller = new ServicePollerImpl();
Map<String,Object> params = new HashMap<>();
params.put("id", forId);
Flux<Long> interval = Flux.interval(Duration.ofMillis(pollDuration));
Flux<ServerEvent> serverEventFlux =
Flux.fromStream(
poller.getEventStream(url, params) //poll a given endpoint after a fixed duration.
);
Flux<ServerEvent> sourceFlux= Flux.zip(interval, serverEventFlux)
.map(Tuple2::getT2); // Zip the two streams.
/* Here I want to store data from sourceFlux into a collection whenever some data arrives without disturbing the downstream processing in Spring. So that I can access collection later on without polling again */
This sends back the data to front end as soon as it is available , however my second use case is to pool that data as it arrives into a separate collection , so that if a similar request arrives later on , I can offload the whole data from the pool without hitting the service again .
I tried to subscribe the flux , buffer , cache and collect the flux before returning from the original flux the controller , but all of that seems to close the stream hence Spring cant process it.
What are my options to tap into the flux and store values into a collection as and when they arrive without closing the flux stream ?
Exception encountered :
java.lang.IllegalStateException: stream has already been operated upon
or closed at
java.util.stream.AbstractPipeline.spliterator(AbstractPipeline.java:343)
~[na:1.8.0_171] at
java.util.stream.ReferencePipeline.iterator(ReferencePipeline.java:139)
~[na:1.8.0_171] at
reactor.core.publisher.FluxStream.subscribe(FluxStream.java:57)
~[reactor-core-3.1.7.RELEASE.jar:3.1.7.RELEASE] at
reactor.core.publisher.Flux.subscribe(Flux.java:6873)
~[reactor-core-3.1.7.RELEASE.jar:3.1.7.RELEASE] at
reactor.core.publisher.FluxZip$ZipCoordinator.subscribe(FluxZip.java:573)
~[reactor-core-3.1.7.RELEASE.jar:3.1.7.RELEASE] at
reactor.core.publisher.FluxZip.handleBoth(FluxZip.java:326)
~[reactor-core-3.1.7.RELEASE.jar:3.1.7.RELEASE]
poller.getEventStream returns a Java 8 stream that can be consumed only once. You can either convert the stream to a collection first or defer the execution of poller.getEventStream by using a supplier:
Flux.fromStream(
() -> poller.getEventStream(url, params)
);
Solution that worked for me as suggested by #a better oliver
public Flux<ServerEvent> getEventFlux(Long forId){
ServicePoller poller = new ServicePollerImpl();
Map<String,Object> params = new HashMap<>();
params.put("id", forId);
Flux<Long> interval = Flux.interval(Duration.ofMillis(pollDuration));
Flux<ServerEvent> serverEventFlux =
Flux.fromStream(
()->{
return poller.getEventStream(url, params).peek((se)->{reactSink.addtoSink(forId, se);});
}
);
Flux<ServerEvent> sourceFlux= Flux.zip(interval, serverEventFlux)
.map(Tuple2::getT2);
return sourceFlux;
}

Spymemcache- Memcache/Membase Faileover

Platform: 64 Bit windows OS, spymemcached-2.7.3.jar, J2EE
We want to use two memcache/membase servers for caching solution. We want to allocate 1 GB memory to each memcache/membase server so total we can cache 2 GB data.
We are using spymemcached java client for setting and getting data from memcache. We are not using any replication between two membase servers.
We loading memcacheClient object at the time of our J2EE application startup.
URI server1 = new URI("http://192.168.100.111:8091/pools");
URI server2 = new URI("http://127.0.0.1:8091/pools");
ArrayList<URI> serverList = new ArrayList<URI>();
serverList.add(server1);
serverList.add(server2);
client = new MemcachedClient(serverList, "default", "");
After that we are using memcacheClient to get and set value in memcache/membase server.
Object obj = client.get("spoon");
client.set("spoon", 50, "Hello World!");
Looks like memcacheClient is setting and getting and value only from server1.
If we stop server1, it fails to get/set value. Should it not use server2 in case of server1 down? Please let me know if we are doing anything wrong here...
aspymemcached java client dos not handle membase failover for particular node.
Ref : https://blog.serverdensity.com/handling-memcached-failover/
We need to handle it manually(by our code)
We can do this by using ConnectionObserver
Here is my code :
public static void main(String a[]) throws InterruptedException{
try {
URI server1 = new URI("http://192.168.100.111:8091/pools");
URI server2 = new URI("http://127.0.0.1:8091/pools");
final ArrayList<URI> serverList = new ArrayList<URI>();
serverList.add(server1);
serverList.add(server2);
final MemcachedClient client = new MemcachedClient(serverList, "bucketName", "");
client.addObserver(new ConnectionObserver() {
#Override
public void connectionLost(SocketAddress arg0) {
//method call when connection lost
for(MemcachedNode node : client.getNodeLocator().getAll()){
if(!node.isActive()){
client.shutdown();
//re init your client here, and after re-init it will connect to your secodry node
break;
}
}
}
#Override
public void connectionEstablished(SocketAddress arg0, int arg1) {
//method call when connection established
}
});
Object obj = client.get("spoon");
client.set("spoon", 50, "Hello World!");
} catch (Exception e) {
}
}
client.get() would use first available node and therefore your value would be stored/updated on one node only.
You seems to be a bit contradicting in your requirements - first you're saying that 'we want to allocate 1 GB memory to each memcache/membase server so total we can cache 2 GB data' which implies distributed cache model (particular key is stored on one node in the cache farm) and then you expect to fetch it if that node is down, which obviously won't happen.
If you need your cache farm to survive node failure without losing data cached on that node you should use replication, which is available in MemBase but obviously you would pay the price of storing the same values multiple times so your desire of '1GB per server...total 2GB of cache' won't be possible.