Service Fabric - Warning: Failed to create infrastructure coordinator - azure-service-fabric

I have deployed a Service Fabric cluster using ARM templates, and when I go to Service Fabric Explorer it shows all nodes as healthy, but under the system tree my two node types have a warning saying:
Unhealthy event: SourceId='System.InfrastructureService', Property='CoordinatorStatus', HealthState='Warning', ConsiderWarningAsError=false.
Failed to create infrastructure coordinator: Microsoft.WindowsAzure.ServiceRuntime.Management.DeploymentManagementEndpointNotFoundException: Could not find the deployment management endpoint: ManagementUri
   at Microsoft.WindowsAzure.ServiceRuntime.Management.DeploymentManagementServer.CreateChannelFactory()
   at Microsoft.WindowsAzure.ServiceRuntime.Management.DeploymentManagementServer.Initialize(IDeploymentManagementServer server)
   at Microsoft.WindowsAzure.ServiceRuntime.Management.DeploymentManagementClient..ctor(IDeploymentManagementServer server)
   at Microsoft.WindowsAzure.ServiceRuntime.Management.DeploymentManagementClient.CreateInstanceImpl(IDeploymentManagementServer server)
   at System.Fabric.InfrastructureService.ManagementClientFactory.Create()
   at System.Fabric.InfrastructureService.WindowsAzureInfrastructureCoordinatorFactory.Create()
   at System.Fabric.InfrastructureService.ServiceFactory.CreateCoordinatorByReflection(String assemblyName, String factoryTypeName, Object[] factoryCreateMethodArgs)
   at System.Fabric.InfrastructureService.DelayLoadCoordinator.d__c.MoveNext()
Any idea? I would really appreciate the help.
Thanks

It could be that your ARM template has a mismatch in the durabilityLevel setting. It needs to be set in two places for each node type:
a. the VM extension resource section
b. the Service Fabric cluster resource section
Could you please check that both of those sections have the same value for each node type, e.g. "durabilityLevel": "Gold"?
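For reference, a trimmed sketch of what those two locations look like in the template. This is only an illustration: the node type name nt0 is a placeholder and most other required properties are omitted.
"resources": [
  {
    "type": "Microsoft.Compute/virtualMachineScaleSets",
    "properties": {
      "virtualMachineProfile": {
        "extensionProfile": {
          "extensions": [
            {
              "name": "ServiceFabricNodeVmExt_nt0",
              "properties": {
                "publisher": "Microsoft.Azure.ServiceFabric",
                "type": "ServiceFabricNode",
                "settings": {
                  "durabilityLevel": "Gold"
                }
              }
            }
          ]
        }
      }
    }
  },
  {
    "type": "Microsoft.ServiceFabric/clusters",
    "properties": {
      "nodeTypes": [
        {
          "name": "nt0",
          "durabilityLevel": "Gold"
        }
      ]
    }
  }
]
Both values must match for every node type defined in the cluster.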

Consul agent on kubernetes, on node or pod?

I deployed an AWS EKS cluster via Terraform. I also deployed Consul following HashiCorp's tutorial, and I see the nodes in Consul's UI.
Now I'm wondering how all the Consul agents will know about the pods I deploy. I deploy something and it's not shown anywhere in Consul.
I can't find any documentation on how to register pods (services) with Consul via the node's Consul agent. Do I need to configure that somewhere? Should I not use the node's agent and instead register the service straight from the pod? HashiCorp discourages this since it may increase resource utilization depending on how many pods one deploys on a given node. But then how does the node's agent know about the services deployed on that node?
Moreover, when I deploy a pod on a node, SSH into the node, and install Consul there, that Consul agent can't find the Consul server (as opposed to the node, which can find it).
EDIT:
Bottom line is I can't find WHERE to add the configuration. If I execute ON THE POD:
consul members
It works properly and I get:
Node Address Status Type Build Protocol DC Segment
consul-consul-server-0 10.0.103.23:8301 alive server 1.10.0 2 full <all>
consul-consul-server-1 10.0.101.151:8301 alive server 1.10.0 2 full <all>
consul-consul-server-2 10.0.102.112:8301 alive server 1.10.0 2 full <all>
ip-10-0-101-129.ec2.internal 10.0.101.70:8301 alive client 1.10.0 2 full <default>
ip-10-0-102-175.ec2.internal 10.0.102.244:8301 alive client 1.10.0 2 full <default>
ip-10-0-103-240.ec2.internal 10.0.103.245:8301 alive client 1.10.0 2 full <default>
ip-10-0-3-223.ec2.internal 10.0.3.249:8301 alive client 1.10.0 2 full <default>
But if I execute:
# consul agent -datacenter=voip-full -config-dir=/etc/consul.d/ -log-file=log-file -advertise=$(wget -q -O - http://169.254.169.254/latest/meta-data/local-ipv4)
I get the following error:
==> Starting Consul agent...
Version: '1.10.1'
Node ID: 'f10070e7-9910-06c7-0e12-6edb6cc4c9b9'
Node name: 'ip-10-0-3-223.ec2.internal'
Datacenter: 'voip-full' (Segment: '')
Server: false (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
Cluster Addr: 10.0.3.223 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
==> Log data will now stream in as it occurs:
2021-08-16T18:23:06.936Z [WARN] agent: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
2021-08-16T18:23:06.936Z [WARN] agent: Node name "ip-10-0-3-223.ec2.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-08-16T18:23:06.946Z [WARN] agent.auto_config: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
2021-08-16T18:23:06.947Z [WARN] agent.auto_config: Node name "ip-10-0-3-223.ec2.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-08-16T18:23:06.948Z [INFO] agent.client.serf.lan: serf: EventMemberJoin: ip-10-0-3-223.ec2.internal 10.0.3.223
2021-08-16T18:23:06.948Z [INFO] agent.router: Initializing LAN area manager
2021-08-16T18:23:06.950Z [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=udp
2021-08-16T18:23:06.950Z [WARN] agent.client.serf.lan: serf: Failed to re-join any previously known node
2021-08-16T18:23:06.950Z [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=tcp
2021-08-16T18:23:06.951Z [INFO] agent: Starting server: address=127.0.0.1:8500 network=tcp protocol=http
2021-08-16T18:23:06.951Z [WARN] agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
2021-08-16T18:23:06.953Z [INFO] agent: started state syncer
2021-08-16T18:23:06.953Z [INFO] agent: Consul agent running!
2021-08-16T18:23:06.953Z [WARN] agent.router.manager: No servers available
2021-08-16T18:23:06.954Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2021-08-16T18:23:34.169Z [WARN] agent.router.manager: No servers available
2021-08-16T18:23:34.169Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
So where do I add the config?
I also tried adding a Kubernetes Service pointing to the pod, but the service doesn't show up in Consul's UI...
What do you guys recommend?
Thanks
Consul knows where these services are located because each service registers with its local Consul client. Operators can register services manually, configuration management tools can register services when they are deployed, or container orchestration platforms can register services automatically via integrations.
If you are planning to use the manual option, you have to register the service with Consul, something like:
echo '{
  "service": {
    "name": "web",
    "tags": ["rails"],
    "port": 80
  }
}' > ./consul.d/web.json
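After writing the file, the agent has to re-read its configuration directory for the registration to take effect. Assuming the agent was started with -config-dir pointing at that directory, this is usually just:
consul reload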
You can find a good example at: https://thenewstack.io/implementing-service-discovery-of-microservices-with-consul/
This is also a very nice document covering detailed configuration of health checks and service discovery: https://cloud.spring.io/spring-cloud-consul/multi/multi_spring-cloud-consul-discovery.html
Official documentation: https://learn.hashicorp.com/tutorials/consul/get-started-service-discovery
BTW, I was finally able to figure out the issue.
consul-dns is not deployed by default; I had to deploy it manually and then forward all .consul requests from CoreDNS to consul-dns.
All is working now. Thanks!
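For anyone hitting the same thing, the CoreDNS change is roughly the stanza below, added to the coredns ConfigMap in kube-system. This is a sketch: the Helm release name (consul) and the ClusterIP are assumptions; get the real IP from kubectl get svc consul-consul-dns.
# 10.0.0.10 is a placeholder for the ClusterIP of the consul-consul-dns Service
consul:53 {
    errors
    cache 30
    forward . 10.0.0.10
}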

Failed to send instantiate transaction and get notifications within the timeout period. undefined [fabric1.0 k8s]

I am trying to deploy Hyperledger Fabric 1.0.5 on k8s and use the balance-transfer sample to test it. Everything works until instantiate-chaincode, where I get this:
[2019-01-02 23:23:14.392] [ERROR] instantiate-chaincode - Failed to send instantiate transaction and get notifications within the timeout period. undefined
[2019-01-02 23:23:14.393] [ERROR] instantiate-chaincode - Failed to order the transaction. Error code: undefined
I used kubectl logs to get peer0's log, which looks like this:
[ConnProducer] NewConnection -> ERRO 61a Failed connecting to orderer2.orderer1:7050 , error: context deadline exceeded
[ConnProducer] NewConnection -> ERRO 61b Failed connecting to orderer1.orderer1:7050 , error: context deadline exceeded
[ConnProducer] NewConnection -> ERRO 61c Failed connecting to orderer0.orderer1:7050 , error: context deadline exceeded
[deliveryClient] connect -> DEBU 61d Connected to
[deliveryClient] connect -> ERRO 61e Failed obtaining connection: Could not connect to any of the endpoints: [orderer2.orderer1:7050 orderer1.orderer1:7050 orderer0.orderer1:7050]
I checked the connectivity of orderer0:7050 and found no problem.
What should I do next?
Thanks for the help!
You didn't describe what runbook you followed to deploy Hyperledger Fabric, but it looks like your pods cannot find each other through DNS. If you are following Kubernetes conventions, your pods should be in the orderer1 namespace, and hopefully you have Kubernetes Services for orderer0, orderer1, and orderer2.
You can read more about communication between the Fabric components here in the "Communication between Fabric components" section. Also read the "Work around the chaincode sandbox" section, which shows a workaround using --dns-search.
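A quick way to check that assumption from inside the cluster (the service and namespace names below are taken from the error above, so adjust them if yours differ):
# are there Services named orderer0/1/2 in the orderer1 namespace?
kubectl get svc -n orderer1
# does the name the peer uses actually resolve?
kubectl run dns-test --rm -it --restart=Never --image=busybox -- nslookup orderer0.orderer1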
It looks like a firewall problem.
In my case, to run HLF on k8s, I disabled the firewall service.
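In case it helps, assuming the nodes run firewalld (the RHEL/CentOS default), that amounts to:
sudo systemctl stop firewalld
sudo systemctl disable firewalld
Opening only the required ports would be the safer choice for anything beyond a test cluster.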

Service Fabric on premise: Partition is below target replica or instance count

I have an error in Service Fabric on premises, on my local cluster. I see it in the Service Fabric Explorer web portal for one of my stateless services. I can't find any exception in the application, and my disk is not running out of space. I also tried to look for logs in C:\SfDevCluster\Log\Traces, but I can't see any errors connected with this. What can be the source, and where should I look for a solution?
Error event: SourceId='System.FM', Property='State'.
Partition is below target replica or instance count.
fabric:/localinstance/TestService 1 1 66712658-4aa3-48ce-80c6-b6e9cf258e56
InBuild _Node_2 131789911365352647
(Showing 1 out of 1 instances. Total available instances: 0)
Unhealthy event: SourceId='System.RA', Property='ReplicaOpenStatus', HealthState='Warning', ConsiderWarningAsError=false.
Replica had multiple failures during open on _Node_2. API call: IStatelessServiceInstance.Open(); Error = System.TypeInitializationException (-2146233036)
The type initializer for 'Microsoft.ServiceFabric.Services.ServiceTrace' threw an exception.
at Microsoft.ServiceFabric.Services.ServiceTrace.GetTraceIdForReplica(Guid partitionId, Int64 replicaId)
at Microsoft.ServiceFabric.Services.Runtime.StatelessServiceInstanceAdapter..ctor(StatelessServiceContext context, IStatelessUserServiceInstance userServiceInstance)
at Microsoft.ServiceFabric.Services.Runtime.StatelessServiceInstanceFactory.System.Fabric.IStatelessServiceFactory.CreateInstance(String serviceTypeName, Uri serviceName, Byte[] initializationData, Guid partitionId, Int64 instanceId)
at System.Fabric.ServiceFactoryBroker.<>c__DisplayClass9_0.b__0(IStatelessServiceFactory factory, ServiceInitializationParameters initParams)
at System.Fabric.ServiceFactoryBroker.CreateHelper[TFactory,TReturnValue](IntPtr nativeServiceType, IntPtr nativeServiceName, UInt32 initializationDataLength, IntPtr nativeInitializationData, Guid partitionId, Func`3 creationFunc, Action`2 initializationFunc, ServiceInitializationParameters initializationParameters)
For more information see: http://aka.ms/sfhealth

Red Hat OSP10 deploy fails on node profile tag, even though it is configured

I am trying to deploy RHOSP 10, and when I get to the "openstack overcloud deploy" phase, I get these errors:
Error: only 0 of 1 requested ironic nodes are tagged to profile control (for flavor control)
Recommendation: tag more nodes using ironic node-update <NODE ID> replace properties/capabilities=profile:control,boot_option:local
Error: only 0 of 5 requested ironic nodes are tagged to profile compute (for flavor compute)
Recommendation: tag more nodes using ironic node-update <NODE ID> replace properties/capabilities=profile:compute,boot_option:local
Not enough nodes - available: 0, requested: 6
Configuration has 3 errors, fix them before proceeding. Ignoring these errors is likely to lead to a failed deploy.
However, I configured 1 node to use the control profile and 5 to use the compute profile. For example:
[stack@rhosp-1-director ~]$ openstack baremetal node show 4e153e0a-4c7b-4ee9-afb5-9036e263949b|grep prop
| properties | {u'cpu_arch': u'x86_64', u'root_device': {u'serial': u'600508b1001c7b0731bc32edbb3a8369'}, u'cpus': u'48', u'capabilities': u'profile:control,boot_option:local', u'memory_mb': u'131072', u'local_gb': u'744'} |
[stack@rhosp-1-director ~]$ openstack baremetal node show 4989038d-de10-4365-8051-44fd42fd0ec7|grep prop
| properties | {u'cpu_arch': u'x86_64', u'root_device': {u'serial': u'600508b1001c73b9fa55f385cd1a4008'}, u'cpus': u'48', u'capabilities': u'profile:compute,boot_option:local', u'memory_mb': u'131072', u'local_gb': u'744'} |
Another thing is that the following command yields no output:
openstack overcloud profiles list
I am following their manual from https://access.redhat.com/documentation/en/red-hat-openstack-platform/10-beta/single/director-installation-and-usage/#sect-Registering_Nodes_for_the_Overcloud step by step, so I don't know what I'm doing wrong.
The problem ended up being Ironic automated cleaning. Introspection never completed successfully; I'm not sure why, but disabling automated cleaning in ironic.conf right after the undercloud install, followed by a reboot (so all Ironic services pick up the change) and then the next steps, was successful (including introspection).
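For reference, a minimal sketch of the change, assuming the stock undercloud path /etc/ironic/ironic.conf:
[conductor]
# disable automated node cleaning
automated_clean = false
Then reboot (or restart the Ironic services) so the setting is picked up, as noted above.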

Kubernetes Replication Controller Integration Test Failure

I am seeing the following Kubernetes integration tests fail pretty consistently, about 90% of the time, on RHEL 7.2, Fedora 24, and CentOS 7.1:
test/integration/garbagecollector
test/integration/replicationcontroller
They seem to be due to an etcd failure. My online queries lead me to believe this may also involve an apiserver issue. My setup is simple: I install and start Docker, install Go, clone the kubernetes repo from GitHub, use hack/install-etcd.sh from the repo and add it to PATH, get ginkgo, gomega, and go-bindata, then run 'make test-integration'. I don't manually change anything or add any custom files/configs. Has anyone run into these issues and found a solution? The only mention of this issue I have seen online was deemed a flake and has no listed solution, but I run into it on almost every single test run. Pieces of the error are below; I can give more if needed:
Garbage Collector:
(many lines from garbagecollector.go that look good)
I0920 14:42:39.725768 11823 garbagecollector.go:479] create storage for resource { v1 secrets}
I0920 14:42:39.725786 11823 garbagecollector.go:479] create storage for resource { v1 serviceaccounts}
I0920 14:42:39.725803 11823 garbagecollector.go:479] create storage for resource { v1 services}
I0920 14:43:09.565529 11823 trace.go:61] Trace "List *rbac.ClusterRoleList" (started 2016-09-20 14:42:39.565113203 -0400 EDT):
[2.564µs] [2.564µs] About to list etcd node
[30.000353492s] [30.000350928s] Etcd node listed
[30.000361771s] [8.279µs] END
E0920 14:43:09.566770 11823 cacher.go:258] unexpected ListAndWatch error: pkg/storage/cacher.go:198: Failed to list *rbac.RoleBinding: client: etcd cluster is unavailable or misconfigured
(repeats over and over with a different resource failing to list)
Replication Controller:
I0920 14:35:16.907283 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907293 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907298 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907303 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907307 10482 replication_controller.go:481] replication controller worker shutting down
E0920 14:35:16.948417 10482 util.go:45] Metric for replication_controller already registered
--- FAIL: TestUpdateLabelToBeAdopted (30.07s)
replicationcontroller_test.go:270: Failed to create replication controller rc: Timeout: request did not complete within allowed duration
E0920 14:44:06.820506 12053 storage_rbac.go:116] unable to initialize clusterroles: client: etcd cluster is unavailable or misconfigured
There are no files in /var/log that even start with kube.
Thanks in advance!
I increased the limit on the number of file descriptors and haven't seen this issue since, so I'm going to go ahead and call this solved.
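For anyone else who hits this, raising the limit looks roughly like the following (65536 is an arbitrary value chosen for illustration, not one from the original run):
# check the current per-process limit
ulimit -n
# raise it for the current shell, then re-run the tests
ulimit -n 65536
make test-integration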