Service Fabric redeployment does not reach healthy state

For deploying an application on our Windows Fabric standalone cluster, we use the Service Fabric API.
Currently we are having issues with redeployment. A fresh deployment works fine, but when the application already exists we remove it first, and the new application then never reaches a healthy state.
Any suggestions on what might be wrong? Do we need to wait between removing and redeploying?
// Clean up previous deployment.
var application = (await fabricClient.QueryManager.GetApplicationListAsync(applicationUri).ConfigureAwait(false)).SingleOrDefault();
if (application != null)
{
    // Removing application instance...
    await fabricClient.ApplicationManager.DeleteApplicationAsync(new DeleteApplicationDescription(applicationName)
    {
        ForceDelete = true
    }).ConfigureAwait(false);
}

var applicationType = (await fabricClient.QueryManager.GetApplicationTypeListAsync(applicationTypeName).ConfigureAwait(false)).SingleOrDefault();
if (applicationType != null)
{
    // Unregistering application type...
    await fabricClient.ApplicationManager.UnprovisionApplicationAsync(applicationTypeName, applicationTypeVersion).ConfigureAwait(false);
}

// Create the new deployment.
// Copying application package...
fabricClient.ApplicationManager.CopyApplicationPackage(imageStoreConnectionString, applicationPackageDirectoryPath, applicationTypeName);

// Registering application type...
await fabricClient.ApplicationManager.ProvisionApplicationAsync(applicationTypeName).ConfigureAwait(false);

// Creating application...
await fabricClient.ApplicationManager.CreateApplicationAsync(new ApplicationDescription(applicationUri, applicationTypeName, applicationTypeVersion)).ConfigureAwait(false);
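One thing worth checking is whether the delete and unprovision calls have actually completed before the new package is copied and provisioned. Rather than a fixed timeout, a polling sketch along these lines could be used; the helper name WaitForCleanupAsync and the two-second delay are assumptions, while the FabricClient calls are the same ones used above.
// Sketch: poll until the previous application instance and application type are really gone
// before provisioning the new package.
private static async Task WaitForCleanupAsync(FabricClient fabricClient, Uri applicationUri,
    string applicationTypeName, TimeSpan timeout)
{
    var deadline = DateTime.UtcNow + timeout;
    while (DateTime.UtcNow < deadline)
    {
        var applications = await fabricClient.QueryManager.GetApplicationListAsync(applicationUri).ConfigureAwait(false);
        var applicationTypes = await fabricClient.QueryManager.GetApplicationTypeListAsync(applicationTypeName).ConfigureAwait(false);
        if (applications.Count == 0 && applicationTypes.Count == 0)
        {
            return; // Cleanup is complete; safe to copy, provision and create again.
        }
        await Task.Delay(TimeSpan.FromSeconds(2)).ConfigureAwait(false);
    }
    throw new TimeoutException("Previous application or application type was not removed in time.");
}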

Related

Azure Mobile Services for Xamarin Forms - Conflict Resolution

I'm supporting a production Xamarin Forms app with offline sync feature implemented using Azure Mobile Services.
We have a lot of production issues related to users losing data or general instability, which go away if they reinstall the app. After having a look through, I think the issues are around how conflict resolution is handled in the app.
For every entity that tries to sync we handle MobileServicePushFailedException and then traverse through the errors returned and take action.
catch (MobileServicePushFailedException ex)
{
    foreach (var error in ex.PushResult.Errors) // These are MobileServiceTableOperationErrors
    {
        var status = error.Status; // HTTP status code returned
        // Take action based on this status.
        // If it's 409 or 412, we go into conflict resolution and try to decide whether the client or server version wins.
    }
}
The conflict resolution seems too custom to me, and I'm checking to see whether there are general guidelines.
For example, we seem to be getting empty values for the 'CreatedAt' and 'UpdatedAt' timestamps on both the local and server versions of the entities returned, which is weird.
var serverItem = error.Result;
var clientItem = error.Item;
// Sometimes serverItem.UpdatedAt or clientItem.UpdatedAt is null. Since we use these two fields to determine who wins, we are stumped here.
If anyone can point me to some guidelines or sample code on how these conflicts should generally be handled using the information from the MobileServiceTableOperationError, that would be highly appreciated.
I came across the following code snippet in the docs.
// Simple error/conflict handling.
if (syncErrors != null)
{
    foreach (var error in syncErrors)
    {
        if (error.OperationKind == MobileServiceTableOperationKind.Update && error.Result != null)
        {
            // Update failed, reverting to server's copy.
            await error.CancelAndUpdateItemAsync(error.Result);
        }
        else
        {
            // Discard local change.
            await error.CancelAndDiscardItemAsync();
        }

        Debug.WriteLine(@"Error executing sync operation. Item: {0} ({1}). Operation discarded.",
            error.TableName, error.Item["id"]);
    }
}
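The snippet above is server-wins. If the SDK version in use exposes UpdateOperationAsync on MobileServiceTableOperationError, a client-wins variant is also possible. A rough sketch (not from the docs), reusing the same error objects as above:
// Sketch: client wins. Take the server's version so the next push passes the
// precondition check, then re-queue the update with the local values.
// (On older Mobile Services the system column may be "__version" instead of "version".)
if (error.OperationKind == MobileServiceTableOperationKind.Update
    && error.Result != null && error.Item != null)
{
    var localItem = (JObject)error.Item.DeepClone();
    localItem[MobileServiceSystemColumns.Version] = error.Result[MobileServiceSystemColumns.Version];
    await error.UpdateOperationAsync(localItem);
}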
An example of surfacing conflicts to the UI, which I found in this doc:
private async Task ResolveConflict(TodoItem localItem, TodoItem serverItem)
{
    // Ask the user to choose the resolution between versions.
    MessageDialog msgDialog = new MessageDialog(
        String.Format("Server Text: \"{0}\" \nLocal Text: \"{1}\"\n",
            serverItem.Text, localItem.Text),
        "CONFLICT DETECTED - Select a resolution:");

    UICommand localBtn = new UICommand("Commit Local Text");
    UICommand serverBtn = new UICommand("Leave Server Text");
    msgDialog.Commands.Add(localBtn);
    msgDialog.Commands.Add(serverBtn);

    localBtn.Invoked = async (IUICommand command) =>
    {
        // To resolve the conflict, update the version of the item being committed.
        // Otherwise, you will keep catching a MobileServicePreConditionFailedException.
        localItem.Version = serverItem.Version;

        // Updating recursively here just in case another change happened while the user was making a decision.
        UpdateToDoItem(localItem);
    };

    serverBtn.Invoked = async (IUICommand command) =>
    {
        RefreshTodoItems();
    };

    await msgDialog.ShowAsync();
}
I hope this helps provide some direction. Although the Azure Mobile docs have been deprecated, the SDK hasn't changed and should still be relevant. If this doesn't help, let me know what you're using for a backend store.

How to get a Slack notification on change of Kubernetes pod status?

How can I get a Slack notification whenever any k8s pod's status changes? I can't use kube bots, as they're not allowed in my organisation.
You can use "Alertmanager" from the Prometheus stack for such notifications.
Once you have the prometheus stack up and running, you can configure custom alerts based on any property of objects in kubernetes and forward them to slack
https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/alerting.md
Updated:
In case you can't deploy any external tool, you could write a simple shell script that gets the pod status via kubectl.
Something like:
kubectl get pods mypod -ojson | jq .status.phase
You can poll this command and use a Slack incoming webhook to send a notification when the value changes.
I implemented a solution using the Kubernetes API:
https://github.com/kubernetes-client/csharp/tree/master/examples/watch
Basically, it checks for pods that are not running and notifies via a Microsoft Teams webhook. It also notifies when a pod that was initially not running comes back to Running status (a recovered pod).
The C# code snippet is below, with the Main and Notify functions.
static async Task Main(string[] args)
{
    // Load from the default kubeconfig on the machine.
    var config = KubernetesClientConfiguration.BuildConfigFromConfigFile();

    // Use the config object to create a client.
    var client = new Kubernetes(config);

    try
    {
        var podlistResp = await client.ListNamespacedPodWithHttpMessagesAsync(Namespace, watch: true);
        using (podlistResp.Watch<V1Pod, V1PodList>(async (type, item) =>
        {
            Console.WriteLine(type);
            Console.WriteLine("==on watch event==");

            var message = $"Namespace: {Namespace} Pod: {item.Metadata.Name} Type: {type} Phase: {item.Status.Phase}";
            var remessage = $"Namespace: {Namespace} Pod: {item.Metadata.Name} Type: {type} back to Phase: {item.Status.Phase}";
            Console.WriteLine(message);

            // Notify for any pod that is neither Running nor Succeeded.
            if (!item.Status.Phase.Equals("Running") && !item.Status.Phase.Equals("Succeeded"))
            {
                Console.WriteLine("==on watch event==");
                await Notify(message);
                Console.WriteLine("==on watch event==");
            }

            // Notify again when a pod transitions back to Running (recovered pod).
            if (type == WatchEventType.Modified && item.Status.Phase.Equals("Running"))
            {
                Console.WriteLine("==on watch event==");
                await Notify(remessage);
                Console.WriteLine("==on watch event==");
            }
        }))
        {
            Console.WriteLine("press ctrl + c to stop watching");
            var ctrlc = new ManualResetEventSlim(false);
            Console.CancelKeyPress += (sender, eventArgs) => ctrlc.Set();
            ctrlc.Wait();
        }
    }
    catch (System.Exception ex)
    {
        Console.Error.WriteLine($"An error happened Message: {ex.Message}", ex);
    }
}
private static async Task Notify(string message)
{
    using (var client = new HttpClient())
    {
        client.BaseAddress = new Uri("https://outlook.office.com");
        var body = new { text = message };
        var content = new StringContent(JsonConvert.SerializeObject(body));
        var result = await client.PostAsync("https://outlook.office.com/webhook/xxxx/IncomingWebhook/xxx", content);
        result.EnsureSuccessStatusCode();
    }
}
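Since the question asks about Slack specifically: the same Notify shape should work against a Slack incoming webhook, which also accepts a JSON body with a text field. A sketch, with a placeholder webhook URL:
private static async Task NotifySlack(string message)
{
    using (var client = new HttpClient())
    {
        // Slack incoming webhooks accept {"text": "..."} posted as JSON.
        var body = new { text = message };
        var content = new StringContent(JsonConvert.SerializeObject(body), Encoding.UTF8, "application/json");
        var result = await client.PostAsync("https://hooks.slack.com/services/T000/B000/XXXX", content);
        result.EnsureSuccessStatusCode();
    }
}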
You can try to use kwatch, which sends Slack notifications on crashes -
https://github.com/abahmed/kwatch

TargetReplicaSelector RandomSecondaryReplica endpoint not found

No endpoint found for the service '{serviceB}' partition '{guid}' that matches the specified TargetReplicaSelector : 'RandomSecondaryReplica'
This error has not always shown up, but it does sometimes.
I'm calling a stateful service B from another stateful service A, with service remoting, asking for a random secondary replica in order to access state written to the primary.
I can see in Service Fabric Explorer that the partition is there and shows OK, and that it has a primary and two ActiveSecondaries.
Service B has the following:
protected override IEnumerable<ServiceReplicaListener> CreateServiceReplicaListeners()
{
    return new[]
    {
        new ServiceReplicaListener(context =>
            this.CreateServiceRemotingListener(context), listenOnSecondary: true)
    };
}
I get proxies for all the partitions like this:
return Enumerable.Range(0, PartitionConstants.Partitions).Select(x =>
ServiceProxy.Create<IServiceB>(
ServiceBUri,
new ServicePartitionKey(x),
TargetReplicaSelector.RandomSecondaryReplica));
And the overall settings must be OK since sometimes it does work. And I know the primary is responding because I have saved state there.
So, what could cause this error when I can actually see the partition there, with the secondary replicas?
Update 1: Restarting the calling service made the connection work. But they started together, and well after both had been running and working, the problem persisted until I restarted. How come?
Update 2: This happens when the whole cluster is started. At startup, Service A primaries call Service B primaries for some registration. Service A polls B to know that B has initialized its internal state before doing this.
Then, when this is complete, Service A goes on to check whether its own internal state needs updating, and if so, it calls Service B again to retrieve state. Since it will not do any writing to B's state, it calls the secondary replicas. And this is when the endpoint is not found.
When I restart Service A, the endpoints are found.
Could it be that the primaries are working and OK, but the secondaries are not yet OK?
How can I ascertain this? Is there some Service Fabric class I can access to know whether the secondary will be found if I call for it?
Using a service primer found here solved this issue. It seems not all partition replicas were ready when they were being called.
Basically, it counts the ready replicas of all partitions via FabricClient until the expected count is found.
Here is the code:
public async Task WaitForStatefulService(Uri serviceInstanceUri, CancellationToken token)
{
StatefulServiceDescription description =
await this.Client.ServiceManager.GetServiceDescriptionAsync(serviceInstanceUri) as StatefulServiceDescription;
int targetTotalReplicas = description.TargetReplicaSetSize;
if (description.PartitionSchemeDescription is UniformInt64RangePartitionSchemeDescription)
{
targetTotalReplicas *= ((UniformInt64RangePartitionSchemeDescription)description.PartitionSchemeDescription).PartitionCount;
}
ServicePartitionList partitions = await this.Client.QueryManager.GetPartitionListAsync(serviceInstanceUri);
int replicaTotal = 0;
while (replicaTotal < targetTotalReplicas && !token.IsCancellationRequested)
{
await Task.Delay(this.interval);
//ServiceEventSource.Current.ServiceMessage(this, "CountyService waiting for National Service to come up.");
replicaTotal = 0;
foreach (Partition partition in partitions)
{
ServiceReplicaList replicaList = await this.Client.QueryManager.GetReplicaListAsync(partition.PartitionInformation.Id);
replicaTotal += replicaList.Count(x => x.ReplicaStatus == System.Fabric.Query.ServiceReplicaStatus.Ready);
}
}
}
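For illustration, Service A could call the primer before building the secondary-replica proxies. The wiring below is an assumption, reusing ServiceBUri and the proxy code from the question:
// Sketch: make sure Service B's replicas are ready before creating the proxies.
await WaitForStatefulService(ServiceBUri, cancellationToken);

var proxies = Enumerable.Range(0, PartitionConstants.Partitions).Select(x =>
    ServiceProxy.Create<IServiceB>(
        ServiceBUri,
        new ServicePartitionKey(x),
        TargetReplicaSelector.RandomSecondaryReplica));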

Service Fabric Reliable Queues FabricNotReadableException

I have a Stateful service with 1000 partitions and 1 replica.
In its RunAsync method, this service has an infinite while loop where I call a reliable queue to get messages.
If there are no messages I wait 5 seconds, then retry.
I used to do exactly that with Azure Storage Queue with success.
But with Service Fabric I'm getting thousands of FabricNotReadableExceptions, the service becomes unstable, and I'm not able to update or delete it; I have to take down the entire cluster.
I tried to update it and after 18 hours it was still stuck, so there is something terribly wrong in what I'm doing.
This is the method code:
public async Task<QueueObject> DeQueueAsync(string queueName)
{
    var q = await StateManager.GetOrAddAsync<IReliableQueue<string>>(queueName);
    using (var tx = StateManager.CreateTransaction())
    {
        try
        {
            var dequeued = await q.TryDequeueAsync(tx);
            if (dequeued.HasValue)
            {
                await tx.CommitAsync();
                var result = dequeued.Value;
                return JSON.Deserialize<QueueObject>(result);
            }
            else
            {
                return null;
            }
        }
        catch (Exception e)
        {
            ServiceEventSource.Current.ServiceMessage(this, $"!!ERROR!!: {e.Message} - Partition: {Partition.PartitionInfo.Id}");
            return null;
        }
    }
}
This is the RunAsync:
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    while (true)
    {
        var message = await DeQueueAsync("MyQueue");
        if (message != null)
        {
            // process, takes around 500ms
        }
        else
        {
            Thread.Sleep(5000);
        }
    }
}
I also replaced Thread.Sleep(5000) with Task.Delay and got thousands of "A task was canceled" errors.
What am I missing here?
Is the loop too fast, so that SF cannot update the other replicas in time?
Should I remove all the replicas, leaving just one?
Should I use the new ConcurrentQueue instead?
I have the problem both in production and locally, with 50 or 1000 partitions; it doesn't matter.
I'm stuck and confused.
Thanks
You need to honor the cancellationToken that is passed in to your RunAsync implementation. Service Fabric will cancel the token when it wants to stop your service for any reason - including upgrades - and it will wait indefinitely for RunAsync to return after cancelling the token. This could explain why you couldn't upgrade your application.
I would suggest checking cancellationToken.IsCancellationRequested inside your loop and breaking out if it has been cancelled.
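A minimal sketch of what the adjusted loop could look like (same DeQueueAsync as in the question; the 5-second delay is kept from the original):
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    while (!cancellationToken.IsCancellationRequested)
    {
        var message = await DeQueueAsync("MyQueue");
        if (message != null)
        {
            // process the message
        }
        else
        {
            // Task.Delay honors the token, so a shutdown or upgrade is not blocked for 5 seconds.
            // An OperationCanceledException thrown here after cancellation is expected and fine.
            await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);
        }
    }
}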
FabricNotReadableException can happen for a variety of reasons - the answer to this question has a comprehensive explanation, but the takeaway is
You can consider FabricNotReadableException retriable. If you see it, just try the call again and eventually it will resolve into either NotPrimary or Granted.
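As a rough illustration of treating it as retriable (the retry count and delay are arbitrary, and this assumes DeQueueAsync is changed to let the exception bubble up instead of catching it and returning null):
// Sketch: treat FabricNotReadableException as transient and retry the dequeue.
QueueObject message = null;
for (var attempt = 0; attempt < 5; attempt++)
{
    try
    {
        message = await DeQueueAsync("MyQueue");
        break;
    }
    catch (FabricNotReadableException)
    {
        // Transient: the replica is not readable right now; wait briefly and retry.
        await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
    }
}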

Wait for EC2 instance to start

I have a custom AMI which runs my service. Using the AWS Java SDK, I create an EC2 instance using RunInstancesRequest from the AMI. Now before I begin to use my service, I must ensure that the newly created instance is up and running. I poll the instance using:
var transitionCompleted = false
while (!transitionCompleted) {
  val currentState = instance.getState.getName
  if (currentState == desiredState) {
    transitionCompleted = true
  }
  if (!transitionCompleted) {
    try {
      Thread.sleep(TRANSITION_INTERVAL)
    } catch {
      case e: InterruptedException => e.printStackTrace()
    }
  }
}
So when the currentState of the instance turns into desiredState (which is running), I treat the instance as ready. However, any newly created instance, despite being in the running state, is not available for immediate use as it is still initializing.
How do I ensure that I return only when I'm able to access the instance and its services? Are there any specific status checks to make?
PS: I use Scala
You are checking instance state, while what you are actually interested in are the instance status checks. You could use describeInstanceStatus method from the Amazon Java SDK, but instead of implementing your own polling (in a non-idiomatic Scala) it's better to use a ready solution from the SDK: the EC2 waiters.
import com.amazonaws.services.ec2._, model._, waiters._

val ec2client: AmazonEC2 = ...

val request = new DescribeInstanceStatusRequest().withInstanceIds(instanceID)

ec2client.waiters.instanceStatusOk.run(
  new WaiterParameters()
    .withRequest(request)
  // Optionally, you can tune the PollingStrategy:
  // .withPollingStrategy(...)
)
To customize polling delay and retry strategies of the waiter check the PollingStrategy documentation.