Node-RED on Bluemix giving 404 for http input node - ibm-cloud

I've put three http input nodes onto a tab in Node-RED on Bluemix. I've tried naming them various things, and I can only get one of them to actually work. The others give a 404. I've put the name from an iteration that worked on the failing nodes and they still fail. I've put the name from a failing node on the working node and it still works. I've tried deleting the nodes, re-adding them, stopping/restarting Node-RED. Is there something obvious that I should try?

Related

Connection from VS Code to Kubernetes failing

I am receiving an error message when trying to access details from VS Code of my Azure Kubernetes Cluster. This problem prevents me from attaching a debugger to the pod.
I receive the following error message:
Error loading document: Error: cannot open k8smsx://loadkubernetescore/pod-kube-system%20%20%20coredns-748cdb7bf4-q9f9x.yaml?ns%3Dall%26value%3Dpod%2Fkube-system%20%20%20coredns-748cdb7bf4-q9f9x%26_%3D1611398456559. Detail: Unable to read file 'k8smsx://loadkubernetescore/pod-kube-system coredns-748cdb7bf4-q9f9x.yaml?ns=all&value=pod/kube-system coredns-748cdb7bf4-q9f9x&_=1611398456559' (Get command failed: error: there is no need to specify a resource type as a separate argument when passing arguments in resource/name form (e.g. 'kubectl get resource/<resource_name>' instead of 'kubectl get resource resource/<resource_name>'
)
My Setup
I have VS Code with the "Kubernetes", "Bridge to Kubernetes" and "Azure Kubernetes Service" extensions installed
I have connected to my cluster through az login and can already access some information (e.g. my nodes, etc.)
When trying to access the workloads / pods on my cluster, I receive the above error message - and in the Kubernetes View in VS Code I get an error for the details of the pod.
Error in Kubernetes-View in VS Code
What I tried
I tried reinstalling the AKS cluster and logging in to it again from scratch
I tried to reinstall all extensions mentioned above in VS Code
Browsing the internet, I could not find any comparable error message
The strange thing is that it used to work two weeks ago - and I did not change or update anything (as far as I remember)
Any ideas / hints that I can try further?
Thank you
As #mdaniel wrote, the Node view is just for human consumption; the tree item you actually want to click on is underneath Namespaces / kube-system / coredns-748cdb7bf4-q9f9x. Give that a try, and consider reporting your bad experience to their issue tracker, since it looks like release 1.2.2 came out just two days ago and might not have been tested well.
The final solution was to attach the debugger the other way, through Workloads / Deployments.

Master Kubernetes nodes offline on GKE (multiple clusters and projects)

This morning we noticed that all Kubernetes clusters in all projects (2 projects, 2 clusters per project) showed unavailable / ERROR in the Google Cloud Console.
The dashboard shows no current issues: https://status.cloud.google.com/
It basically looks like the master nodes are down: the API does not respond and the clusters cannot be edited in the UI. Before the weekend everything was up, and since at least yesterday evening they have all shown red.
The deployed services fortunately respond, but we cannot manage the cluster in any way.
I reported it here too:
https://issuetracker.google.com/issues/172841082
Did anyone else encounter this and is there any way to restart or trigger the master node to restart? I cannot edit the cluster so an upgrade is not possible either.
I read elsewhere that only SRE folks from Google can (re)start them.
It's beyond me how this can happen.
By the way, auto-repair is set to on and I followed the troubleshooting page, basically with all paths leading to: master node down, nothing to be done.
Any help would be greatly appreciated, or simply an SRE doing a start node action ;).
Thank you #dany L, it was indeed a billing issue.
I'm surprised there is no message in the Cloud Console and that one has to go to Billing specifically to find out about this.
After billing was fixed, it took a few minutes before the clusters were available again, then everything was back to normal.

Cassandra Kubernetes Statefulset NoHostAvailableException

I have an application deployed in Kubernetes; it consists of Cassandra, a Go client, and a Java client (and other things, but they are not relevant to this discussion).
We have used helm to do our deployment.
We are using a stateful set and a headless service for cassandra.
We have configured the clients to use the headless service DNS name as the contact point for cluster creation.
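For illustration, the contact-point setup on the Go side looks roughly like this minimal gocql sketch (the service DNS name and keyspace are placeholders, not our real values):

package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

// newSession builds a session against the headless service DNS name. The
// headless service resolves to the statefulset pod IPs, so a single DNS
// name is enough as the initial contact point.
func newSession() (*gocql.Session, error) {
	cluster := gocql.NewCluster("cassandra.default.svc.cluster.local") // placeholder service name
	cluster.Keyspace = "my_keyspace"                                   // placeholder keyspace
	cluster.Consistency = gocql.LocalQuorum
	cluster.Timeout = 5 * time.Second
	return cluster.CreateSession()
}

func main() {
	session, err := newSession()
	if err != nil {
		log.Fatalf("could not connect to cassandra: %v", err)
	}
	defer session.Close()
}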
Everything works great.
Until all of the nodes go down, or some other nefarious combination of nodes goes down. I am simulating this by deleting all of the Cassandra pods in succession with kubectl delete.
When I do this the clients throw NoHostAvailableException
In Java it is:
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.200.23.151:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_QUORUM (1 required but only 0 alive)), /10.200.152.130:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency ONE (1 required but only 0 alive)))"
which eventually becomes
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)"
In Go it is:
"gocql: no hosts available in the pool"
I can query Cassandra using cqlsh, the nodes seem fine in nodetool status, and all of the new IPs are there.
The image I am using doesn't have netstat, so I have not yet confirmed it's listening on the expected port.
By executing bash on the two client pods I can see the DNS makes sense using nslookup, but...
netstat does not show any established connections to cassandra (they are present before I take the nodes down)
If I restart my clients everything works fine.
I have googled a lot (I mean a lot); most of what I have found relates to never having had a working connection, and the most relevant things seem very old (2014, 2016).
A node going down is very basic and I would expect everything to keep working: the Cassandra cluster manages itself, it discovers new nodes as they come online, it balances the load, and so on.
If I take all of my Cassandra nodes down slowly, one at a time, everything works fine (I have not confirmed that the load is distributed appropriately and to the correct nodes, but at least it works).
So, is there a point where this behaviour is expected? I.e., I have taken everything down, and nothing new was up and running before the last of the original nodes was taken down. Is this behaviour expected?
To me it seems like it should be an easy issue to resolve. I am not sure what is missing or incorrect, and I am surprised that both clients show the same symptoms, which makes me think something is not right with our statefulset and service.
I think the problem might lie in the headless DNS service. If all of the nodes go down completely and there are no nodes at all available via the service until pods are replaced, it could cause the driver to hang.
I've noted that you've used Helm for your deployments but you may be interested in this document for connecting to Cassandra clusters in Kubernetes from the authors of the cass-operator.
I'm going to contact some of the authors and get them to respond here. Cheers!
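In the meantime, since restarting the clients fixes it, one possible workaround (an assumption on my part, not something taken from the document above) is to detect the empty-pool error and rebuild the session, which forces the headless service DNS name to be resolved again against the replacement pod IPs. A rough gocql sketch, with a session constructor like the one sketched in the question passed in as newSession:

package cassandraclient

import (
	"errors"
	"log"
	"time"

	"github.com/gocql/gocql"
)

// connectWithRetry keeps calling the provided constructor until the headless
// service has ready endpoints again. The 5 second interval is arbitrary.
func connectWithRetry(newSession func() (*gocql.Session, error)) *gocql.Session {
	for {
		s, err := newSession()
		if err == nil {
			return s
		}
		log.Printf("cassandra not reachable yet: %v", err)
		time.Sleep(5 * time.Second)
	}
}

// execOrReconnect runs one statement and, if the connection pool has gone
// empty (the "no hosts available in the pool" symptom above), closes the
// dead session and builds a fresh one before retrying once.
func execOrReconnect(session *gocql.Session, newSession func() (*gocql.Session, error), stmt string) (*gocql.Session, error) {
	err := session.Query(stmt).Exec()
	if errors.Is(err, gocql.ErrNoConnections) {
		session.Close()
		session = connectWithRetry(newSession) // DNS now resolves to the new pod IPs
		err = session.Query(stmt).Exec()
	}
	return session, err
}

This only mirrors what restarting the client process does; if the driver's reconnection settings can be tuned to cover the all-nodes-down case, that would be the cleaner fix.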

Service Fabric - Cannot do Config upgrade to add or remove nodes

I've got an on-premises Service Fabric cluster consisting of 18 nodes (9 are seed nodes), secured via gMSA Windows security. Cluster code version is 6.4.622.9590.
Unfortunately I have to rebuild 6 of these nodes (3 Seed nodes). They all live in one data center (cluster spans 3 DCs). As such, I wish to remove these 6 nodes, rebuild them and then re-add them.
As per MSDOCs, adding/removing nodes is performed via config upgrades. Note: I've already used this process recently to add 12 nodes, so I understand the concept of SF config upgrades well.
Unfortunately, I'm unable to do ANY config upgrades on this cluster until I remove the nodes - this is due to ValidationExceptions reported by the Start-ServiceFabricClusterConfigurationUpgrade PowerShell command:
If I don't add the 6 nodes to the "NodesToBeRemoved" section, I get a validation error that not all removed nodes are in this field
If I do add the 6 nodes, I get the following validation error:
Start-ServiceFabricClusterConfigurationUpgrade :
System.Runtime.InteropServices.COMException (-2147017627)
ValidationException: Model validation error. Removing a non-seed node and changing reliability level in the same upgrade is not supported. Initiate an upgrade to remove node first and then change the reliability level.
At line:1 char:1
+ Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath "AL ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (Microsoft.Servi...usterConnection:ClusterConnection) [Start-ServiceFa...gurationUpgrade], FabricException
+ FullyQualifiedErrorId : StartClusterConfigurationUpgradeErrorId,Microsoft.ServiceFabric.Powershell.StartClusterConfigurationUpgrade
So, we're stuck! I've also already removed node states, thus leaving all 6 nodes in the "Invalid State". The Get-ServiceFabricClusterConfiguration does not return these 6 nodes, but they are still shown in SF Explorer and listed in the cluster manifest XML file.
As far as the reliability level is concerned, I'm pretty sure one can no longer change this in SF; i.e. older versions of SF allowed you to configure Bronze/Silver/Gold in the config file, but in recent versions (6.0+?) this is a calculated value managed internally by SF. In any case, because the seed nodes will decrease from 9 to 6, I suspect the internally calculated reliability level will drop (presumably from Gold to Silver).
I've also come across a hack that someone has used to remove nodes in a cluster... but in my scenario, nodes are still listed in manifest file... Nonetheless, the words hack and production should never meet!
So, how do I get our production cluster out of this situation? Rebuilding the cluster is not an option (that's the whole reason for clusters...high availability!).
I discovered that the above errors are primarily a symptom of a lack of clearly documented procedures, as well as bad/misleading error messages, when doing Service Fabric configuration upgrades.
I performed quite a bit of my own testing to make sure I can confidently add/remove several nodes from a cluster. I also removed enough nodes to drop the Seed nodes from 9 to 6.
So, to resolve the above issue, here's what I had to do to remove nodes:
1. Use SF Explorer to remove the node state - this changed the node state from Error to Invalid
2. Get the latest JSON config via Get-ServiceFabricClusterConfiguration
3. Remove the node from the Nodes section
4. Completely remove the NodesToBeRemoved JSON section (you'll get the "inconsistent information" error if you have an empty list of nodes to be removed, so just remove the containing JSON block)
5. Do a config upgrade
Note: initially I tried doing just steps 2-5 above, but it didn't work and the node remained in the Error state.
That said, from my experience, please also note the following when removing nodes (this info is not clear in MSDOCs):
You can remove multiple Seed nodes at once (I wanted to do this to try and replicate the above scenario)
You can add multiple nodes at once too - just be aware you may not see any activity/indication via the SF config upgrade status tooling that anything is happening... be prepared to wait at least 15+ minutes (depends on how many nodes you're adding... after all, SF is copying installation files to the nodes)
Sometimes, when removing one or more nodes, a node won't be successfully removed but will be left in an Error status. If this is the case, use SF Explorer (or PowerShell) to remove the node state. The status will change to Invalid. At this point, do another config upgrade, ensuring that:
The removed node(s) are not in the Nodes section
The removed node(s) are not in the NodesToBeRemoved list
As per above, if the value of NodesToBeRemoved is (or should be) empty, remove this whole JSON block, otherwise you'll get a misleading/vague warning that the NodesToBeRemoved parameter contains inconsistent information.
The latter part really is the confusing part that tripped me up last time. The thing to also remember is that, once you successfully remove nodes, the Get-ServiceFabricClusterConfiguration will STILL return the removed nodes in the NodesToBeRemoved parameter. This will likely confuse/trip you up with any subsequent attempts to do a config upgrade. As such, I recommend you do another final config upgrade with this section completely removed.
As a final note: If you re-add a node that has previously been removed, it may come back in a Deactivated status. Simply activate this node and all should be fine.

Why would running a container on GCE get stuck with "Metadata request unsuccessful" Forbidden (403) errors?

I'm trying to run a container in a custom VM on Google Compute Engine. This is to perform a heavy ETL process so I need a large machine but only for a couple of hours a month. I have two versions of my container with small startup changes. Both versions were built and pushed to the same google container registry by the same computer using the same Google login. The older one works fine but the newer one fails by getting stuck in an endless list of the following error:
E0927 09:10:13 7f5be3fff700 api_server.cc:184 Metadata request unsuccessful: Server responded with 'Forbidden' (403): Transport endpoint is not connected
Can anyone tell me exactly what's going on here? Can anyone please explain why one of my images doesn't have this problem (well, it gives a few of these messages but gets past them) and the other does (thousands of these messages, and it ran for over 24 hours before I killed it)?
If I SSH into a GCE instance, both versions of the container pull and run just fine. From the logs I suspect the INTEGRITY_RULE checking, but I know nothing about how that works.
MORE INFO: this is down to "restart policy: never". Even a simple Centos:7 container that says "hello world", deployed from the console, triggers this if the restart policy is never. At least in the short term I can fix this in the entrypoint script, as the instance will be destroyed when the monitor realises that the process has finished.
I suggest you try creating a 3rd container that's focused on the metadata service functionality to isolate the issue. It may be that there's a timing difference between the 2 containers that's not being overcome.
Make sure you can ‘curl’ the metadata service from the VM and that the request to the metadata service is using the VM's service account.
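A hedged way to run that same check from inside the VM or container in Go (a diagnostic sketch, not part of the official troubleshooting steps) is the cloud.google.com/go/compute/metadata package, which talks to the same metadata server:

package main

import (
	"fmt"
	"log"

	"cloud.google.com/go/compute/metadata"
)

func main() {
	// OnGCE probes the metadata server; false means it is unreachable
	// (or we are not running on GCE at all).
	if !metadata.OnGCE() {
		log.Fatal("metadata server not reachable")
	}

	// Which service account the VM presents for API calls; a failure here
	// would mirror the Forbidden (403) error in the question.
	email, err := metadata.Email("default")
	if err != nil {
		log.Fatalf("metadata request for service account failed: %v", err)
	}

	scopes, err := metadata.Scopes("default")
	if err != nil {
		log.Fatalf("metadata request for scopes failed: %v", err)
	}

	fmt.Println("service account:", email)
	fmt.Println("scopes:", scopes)
}

If these calls fail with Forbidden here as well, the issue is likely with the instance's service account or scopes rather than with the container image itself.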