TargetReplicaSelector RandomSecondaryReplica endpoint not found - azure-service-fabric

No endpoint found for the service '{serviceB}' partition '{guid}' that matches the specified TargetReplicaSelector : 'RandomSecondaryReplica'
This error does not always show up, but it does sometimes.
I'm calling stateful service B from another stateful service A via service remoting, asking for a random secondary replica in order to read state written to the primary.
I can see in Service Fabric Explorer that the partition is there and shows OK, with a primary and two ActiveSecondaries.
Service B has the following:
protected override IEnumerable<ServiceReplicaListener> CreateServiceReplicaListeners()
{
    return new[]
    {
        new ServiceReplicaListener(
            context => this.CreateServiceRemotingListener(context),
            listenOnSecondary: true)
    };
}
I get proxies for all the partitions like this:
return Enumerable.Range(0, PartitionConstants.Partitions).Select(x =>
    ServiceProxy.Create<IServiceB>(
        ServiceBUri,
        new ServicePartitionKey(x),
        TargetReplicaSelector.RandomSecondaryReplica));
The overall configuration must be OK, since it does work sometimes, and I know the primary is responding because I have saved state there.
So what could cause this error when I can actually see the partition there, with its secondary replicas?
Update 1: Restarting the calling service made the connection work. But both services started together, and the problem persisted well after both had been running and working, until I restarted. How come?
Update 2: This happens when the whole cluster is started. At startup, the Service A primaries call the Service B primaries for some registration. A polls B to confirm that B has initialized its internal state before doing this.
When this is complete, Service A goes on to check whether its own internal state needs updating, and if so it calls Service B again to retrieve state. Since it does no writing to B's state, it calls the secondary replicas. And this is when the endpoint is not found.
When I restart Service A, the endpoints are found.
Could it be that the primaries are up and OK, but the secondaries are not yet?
How can I ascertain this? Is there some Service Fabric class that I can query to know whether a secondary will be found before I call it?

Using a service primer found here solved this issue. It seems that not all partition replicas were ready when they were being called.
Basically, the primer counts the replicas of all partitions via FabricClient until the expected count is found.
Here is the code:
public async Task WaitForStatefulService(Uri serviceInstanceUri, CancellationToken token)
{
    StatefulServiceDescription description =
        await this.Client.ServiceManager.GetServiceDescriptionAsync(serviceInstanceUri) as StatefulServiceDescription;

    int targetTotalReplicas = description.TargetReplicaSetSize;
    if (description.PartitionSchemeDescription is UniformInt64RangePartitionSchemeDescription)
    {
        targetTotalReplicas *= ((UniformInt64RangePartitionSchemeDescription)description.PartitionSchemeDescription).PartitionCount;
    }

    ServicePartitionList partitions = await this.Client.QueryManager.GetPartitionListAsync(serviceInstanceUri);

    int replicaTotal = 0;
    while (replicaTotal < targetTotalReplicas && !token.IsCancellationRequested)
    {
        await Task.Delay(this.interval);
        //ServiceEventSource.Current.ServiceMessage(this, "CountyService waiting for National Service to come up.");

        replicaTotal = 0;
        foreach (Partition partition in partitions)
        {
            ServiceReplicaList replicaList = await this.Client.QueryManager.GetReplicaListAsync(partition.PartitionInformation.Id);
            replicaTotal += replicaList.Count(x => x.ReplicaStatus == System.Fabric.Query.ServiceReplicaStatus.Ready);
        }
    }
}
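For completeness, here is a hedged usage sketch from inside the calling service's RunAsync. The ServicePrimer wrapper type and its constructor are assumptions, not part of the linked sample; they stand in for whatever holds the Client and interval fields used above.

// Hypothetical wrapper exposing WaitForStatefulService and holding the
// FabricClient (this.Client) and poll interval (this.interval) used above.
var primer = new ServicePrimer(new FabricClient(), TimeSpan.FromSeconds(5));

// Wait until every replica of every Service B partition reports Ready...
await primer.WaitForStatefulService(ServiceBUri, cancellationToken);

// ...and only then create the RandomSecondaryReplica proxies.
var proxies = Enumerable.Range(0, PartitionConstants.Partitions).Select(x =>
    ServiceProxy.Create<IServiceB>(
        ServiceBUri,
        new ServicePartitionKey(x),
        TargetReplicaSelector.RandomSecondaryReplica));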

Related

how to implement high availability service (2 nodes) using zookeeper

I need to provide an H/A mechanism.
I understand that ZooKeeper can be used for leader election.
I'm looking for the right pattern for this flow:
I need to implement a service that invokes a flow.
When it starts the flow (the flow runs in a loop), it must validate that it is the leader (say, by its IP address).
I understand that I can put a value into ZooKeeper that identifies the entering instance, and dispose of it when one loop iteration ends, or after a period of time.
Is that the right pattern?
It also seems there are race-condition issues if I use something like:
...
...
List<String> names = zk.getChildren(path, false);
String id = null;
// See whether we have already run for election in this process
for (String name : names) {
    if (name.startsWith(myIP)) {
        id = name;
        break;
    }
}
if (id == null) {
    id = zk.create(path + "/" + myIP, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
}
boolean isLeader = id != null;
For example:
two services both read null,
then the second one overwrites the first one's entry, and both of them run the task.
Can you help?
Thanks
Using ZooKeeper correctly can be difficult and takes some studying of its APIs and semantics. There are nice high-level libraries, such as Curator, that have already implemented common algorithms such as leader election. The ZooKeeper documentation also has a recipe for leader election.
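As an illustration only (not part of the original answer), here is a minimal leader-election sketch using Curator's LeaderLatch; the connection string, latch path, and participant id are placeholders:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and retry policy.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // The latch creates an ephemeral sequential node under the path;
        // ZooKeeper orders the candidates, so there is no read-then-create race.
        LeaderLatch latch = new LeaderLatch(client, "/my-service/leader", "instance-1");
        latch.start();
        latch.await(); // blocks until this instance becomes leader

        try {
            while (latch.hasLeadership()) {
                // run one iteration of the looping flow here
            }
        } finally {
            latch.close();
            client.close();
        }
    }
}

Because the latch node is ephemeral, leadership moves to another instance automatically if this process dies.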

Service Fabric Reliable Queues FabricNotReadableException

I have a stateful service with 1000 partitions and 1 replica.
In its RunAsync method, this service has an infinite while loop where I call a reliable queue to get messages.
If there are no messages, I wait 5 seconds, then retry.
I used to do exactly that with Azure Storage queues, with success.
But with Service Fabric I'm getting thousands of FabricNotReadableExceptions; the service becomes unstable and I'm not able to update or delete it, so I have to tear down the entire cluster.
I tried to update it, and after 18 hours it was still stuck, so there is something terribly wrong in what I'm doing.
This is the method code:
public async Task<QueueObject> DeQueueAsync(string queueName)
{
    var q = await StateManager.GetOrAddAsync<IReliableQueue<string>>(queueName);
    using (var tx = StateManager.CreateTransaction())
    {
        try
        {
            var dequeued = await q.TryDequeueAsync(tx);
            if (dequeued.HasValue)
            {
                await tx.CommitAsync();
                var result = dequeued.Value;
                return JSON.Deserialize<QueueObject>(result);
            }
            else
            {
                return null;
            }
        }
        catch (Exception e)
        {
            ServiceEventSource.Current.ServiceMessage(this, $"!!ERROR!!: {e.Message} - Partition: {Partition.PartitionInfo.Id}");
            return null;
        }
    }
}
This is the RunAsync method:
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    while (true)
    {
        var message = await DeQueueAsync("MyQueue");
        if (message != null)
        {
            //process, takes around 500ms
        }
        else
        {
            Thread.Sleep(5000);
        }
    }
}
I also replaced Thread.Sleep(5000) with Task.Delay and got thousands of "A task was canceled" errors.
What am I missing here?
Is the loop too fast, so that SF cannot update the other replicas in time?
Should I remove all the replicas, leaving just one?
Should I use the new ConcurrentQueue instead?
I have the problem both in production and locally, with 50 or 1000 partitions; it doesn't matter.
I'm stuck and confused.
Thanks
You need to honor the cancellationToken that is passed in to your RunAsync implementation. Service Fabric will cancel the token when it wants to stop your service for any reason - including upgrades - and it will wait indefinitely for RunAsync to return after cancelling the token. This could explain why you couldn't upgrade your application.
I would suggest checking cancellationToken.IsCancellationRequested inside your loop and breaking out when cancellation has been requested.
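A minimal sketch of that loop, based on the RunAsync from the question (passing the token to Task.Delay is an additional suggestion so the idle wait also ends promptly on cancellation):

protected override async Task RunAsync(CancellationToken cancellationToken)
{
    while (!cancellationToken.IsCancellationRequested)
    {
        var message = await DeQueueAsync("MyQueue");
        if (message != null)
        {
            // process the message (takes around 500 ms)
        }
        else
        {
            // The delay observes the token, so the service stops promptly when asked to.
            await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);
        }
    }
}

Letting an OperationCanceledException bubble out of RunAsync after cancellation has been requested is also acceptable.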
FabricNotReadableException can happen for a variety of reasons - the answer to this question has a comprehensive explanation, but the takeaway is
You can consider FabricNotReadableException retriable. If you see it, just try the call again and eventually it will resolve into either NotPrimary or Granted.
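As an illustration only (not from the linked answer), here is a hypothetical retry wrapper around the DeQueueAsync call from the question; it assumes the catch-all inside DeQueueAsync is removed so the exception can propagate:

private async Task<QueueObject> DeQueueWithRetryAsync(string queueName, CancellationToken token)
{
    while (!token.IsCancellationRequested)
    {
        try
        {
            return await DeQueueAsync(queueName);
        }
        catch (FabricNotReadableException)
        {
            // Transient: this replica cannot serve the read right now; back off briefly and retry.
            await Task.Delay(TimeSpan.FromMilliseconds(200), token);
        }
    }
    return null;
}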

KafkaSpout (idle) generates huge network traffic

After developing and executing my Storm (1.0.1) topology with a KafkaSpout and a couple of bolts, I noticed huge network traffic even when the topology is idle (no messages on Kafka, no processing done in the bolts). So I started commenting out my topology piece by piece to find the cause, and now I have only the KafkaSpout in my main:
....
final SpoutConfig spoutConfig = new SpoutConfig(
        new ZkHosts(zkHosts, "/brokers"),
        "files-topic",          // topic
        "/kafka",               // ZK chroot
        "consumer-group-name");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
spoutConfig.startOffsetTime = OffsetRequest.LatestTime();

topologyBuilder.setSpout(
        "kafka-spout-id",
        new KafkaSpout(spoutConfig),
        1);
....
When this (useless) topology executes, even in local mode, even the very first time, the network traffic always grows a lot. In my Activity Monitor I see:
An average of 432 KB of data received per second
After a couple of hours of the topology running (idle), data received is 1.26 GB and data sent is 1 GB
(Important: Kafka is not running as a cluster; it is a single instance running on the same machine, with a single topic and a single partition. I just downloaded Kafka onto my machine, started it, and created a simple topic. When I put a message on the topic, everything in the topology works without any problem at all.)
Obviously the cause is in the KafkaSpout.nextTuple() method (below), but I don't understand why, without any messages in Kafka, I should see such traffic. Is there something I haven't considered? Is this the expected behaviour? I looked at the Kafka logs and the ZK logs: nothing. I cleaned up the Kafka and ZK data: nothing, still the same behaviour.
@Override
public void nextTuple() {
    List<PartitionManager> managers = _coordinator.getMyManagedPartitions();
    for (int i = 0; i < managers.size(); i++) {
        try {
            // in case the number of managers decreased
            _currPartitionIndex = _currPartitionIndex % managers.size();
            EmitState state = managers.get(_currPartitionIndex).next(_collector);
            if (state != EmitState.EMITTED_MORE_LEFT) {
                _currPartitionIndex = (_currPartitionIndex + 1) % managers.size();
            }
            if (state != EmitState.NO_EMITTED) {
                break;
            }
        } catch (FailedFetchException e) {
            LOG.warn("Fetch failed", e);
            _coordinator.refresh();
        }
    }

    long diffWithNow = System.currentTimeMillis() - _lastUpdateMs;
    /*
     As far as the System.currentTimeMillis() is dependent on System clock,
     additional check on negative value of diffWithNow in case of external changes.
     */
    if (diffWithNow > _spoutConfig.stateUpdateIntervalMs || diffWithNow < 0) {
        commit();
    }
}
Put a sleep of one second (1000 ms) in the nextTuple() method and observe the traffic again. For example:
@Override
public void nextTuple() {
    try {
        Thread.sleep(1000);
    } catch (Exception ex) {
        log.error("Exception while sleeping...", ex);
    }
    List<PartitionManager> managers = _coordinator.getMyManagedPartitions();
    for (int i = 0; i < managers.size(); i++) {
        ...
        ...
        ...
        ...
    }
}
The reason is that Kafka consumers work on a pull model: consumers pull data from the Kafka brokers. So from the consumer's point of view, the KafkaSpout continuously issues fetch requests to the Kafka broker, each of which is a TCP network request. That is why you see such large sent/received statistics: even though the consumer doesn't consume any messages, the pull requests and the empty responses are still counted in the network sent/received statistics. Your network traffic will be lower if your sleep time is longer. There are also some network-related configurations for the brokers and for the consumer; researching those settings may help. Hope this helps you.
Is your bolt receiving messages? Does your bolt inherit BaseRichBolt?
Comment out the line m.fail(id.offset) in KafkaSpout and check it. If your bolt doesn't ack, then your spout assumes the message failed and tries to replay the same message.
public void fail(Object msgId) {
    KafkaMessageId id = (KafkaMessageId) msgId;
    PartitionManager m = _coordinator.getManager(id.partition);
    if (m != null) {
        //m.fail(id.offset);
    }
}
Also try halting nextTuple() for a few milliseconds and check again.
Let me know if it helps.

Wait for EC2 instance to start

I have a custom AMI which runs my service. Using the AWS Java SDK, I create an EC2 instance from the AMI using RunInstancesRequest. Now, before I begin to use my service, I must ensure that the newly created instance is up and running. I poll the instance using:
var transitionCompleted = false
while (!transitionCompleted) {
  val currentState = instance.getState.getName
  if (currentState == desiredState) {
    transitionCompleted = true
  }
  if (!transitionCompleted) {
    try {
      Thread.sleep(TRANSITION_INTERVAL)
    } catch {
      case e: InterruptedException => e.printStackTrace()
    }
  }
}
So when the currentState of the instance turns into desiredState (which is running), I conclude the instance is ready. However, any newly created instance, despite being in the running state, is not available for immediate use because it is still initializing.
How do I ensure that I return only when I'm able to access the instance and its services? Are there any specific status checks to make?
PS: I use Scala.
You are checking the instance state, while what you are actually interested in are the instance status checks. You could use the describeInstanceStatus method from the AWS Java SDK, but instead of implementing your own polling (in non-idiomatic Scala) it's better to use a ready-made solution from the SDK: the EC2 waiters.
import com.amazonaws.services.ec2._, model._, waiters._

val ec2client: AmazonEC2 = ...

val request = new DescribeInstanceStatusRequest().withInstanceIds(instanceID)

ec2client.waiters.instanceStatusOk.run(
  new WaiterParameters()
    .withRequest(request)
    // Optionally, you can tune the PollingStrategy:
    // .withPollingStrategy(...)
)
To customize polling delay and retry strategies of the waiter check the PollingStrategy documentation.

how to see properties of a JmDNS service on the receiver side?

One way of creating JmDNS services is:
ServiceInfo.create(type, name, port, weight, priority, props);
where props is a Map that describes some properties of the service. Does anybody have an example illustrating the use of these properties, for instance how to use them on the receiver side?
I've tried:
Hashtable<String,String> settings = new Hashtable<String,String>();
settings.put("host", "hhgh");
settings.put("web_port", "hdhr");
settings.put("secure_web_port", "dfhdyhdh");
ServiceInfo info = ServiceInfo.create("_workstation._tcp.local.", "service6", 80, 0, 0, true, settings);
But then, on a machine receiving this service, what can I do to see those properties?
I would appreciate any help...
ServiceInfo info = jmDNS.getServiceInfo(serviceEvent.getType(), serviceEvent.getName());
Enumeration<String> ps = info.getPropertyNames();
while (ps.hasMoreElements()) {
    String key = ps.nextElement();
    String value = info.getPropertyString(key);
    System.out.println(key + " " + value);
}
It has been a while since this was asked, but I had the same question. One problem with the original question is that the host and ports should not be put into the text field, and in this case there should actually be two service types, one secure and one insecure (or perhaps make use of subtypes).
Here is an incomplete example that gets a list of running workstation services:
ServiceInfo[] serviceInfoList = jmdns.list("_workstation._tcp.local.");
if (serviceInfoList != null) {
    for (int index = 0; index < serviceInfoList.length; index++) {
        int port = serviceInfoList[index].getPort();
        int priority = serviceInfoList[index].getPriority();
        int weight = serviceInfoList[index].getWeight();
        InetAddress address = serviceInfoList[index].getInetAddresses()[0];
        String someProperty = serviceInfoList[index].getPropertyString("someproperty");
        // Build a UI or use some logic to decide if this service provider is the
        // one you want to use based on priority, properties, etc.
        ...
    }
}
Due to the way JmDNS is implemented, the first call to list() for a given type is slow (several seconds), but subsequent calls will be pretty fast. Providers of services can change the properties by calling info.setText(settings), and the changes will be propagated to the listeners automatically.
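To illustrate that last point, here is a minimal provider-side sketch; the property name and values are placeholders, and it assumes the setText(Map) overload referred to above:

import java.util.HashMap;
import java.util.Map;
import javax.jmdns.JmDNS;
import javax.jmdns.ServiceInfo;

public class PropertyUpdateSketch {
    public static void main(String[] args) throws Exception {
        Map<String, String> settings = new HashMap<>();
        settings.put("someproperty", "initial-value");

        JmDNS jmdns = JmDNS.create();
        ServiceInfo info = ServiceInfo.create("_workstation._tcp.local.", "service6", 80, 0, 0, true, settings);
        jmdns.registerService(info);

        // Later: update the TXT record; JmDNS propagates the change to listeners automatically.
        settings.put("someproperty", "new-value");
        info.setText(settings);
    }
}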